Introduction
After completing my first Kaggle competition on Housing Prices, I decided to tackle the Spaceship Titanic dataset. The goal is to predict whether passengers were transported to another dimension during the voyage.
This competition has been a great opportunity to improve my workflow, learn about pipelines, and practice model evaluation. In this post, I’ll walk through my process — from a clean pipeline to my first submission, which achieved 0.797 accuracy on the leaderboard.
Step 1: Building a Clean Pipeline
Before jumping into fancy models, I wanted a reproducible and reliable workflow. I used ColumnTransformer to preprocess my features:
Numerical features: median imputation + scaling
Categorical features: most frequent imputation + one-hot encoding
This ensures the pipeline handles all preprocessing consistently, avoids data leakage, and keeps everything modular.
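Here's a minimal sketch of that setup. The column lists are my reconstruction of the Spaceship Titanic schema rather than a copy-paste from my notebook, so treat the exact names as placeholders:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column lists follow the Spaceship Titanic schema (adjust as needed).
num_cols = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
cat_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP"]

# Numerical features: median imputation + scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: most-frequent imputation + one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", categorical, cat_cols),
])
```

Because the ColumnTransformer drops columns it isn't told about, identifiers like PassengerId never leak into the model.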
Step 2: Testing Different Models
I experimented with three models to see which would perform best:
Logistic Regression – surprisingly strong, with a CV accuracy of ~78% and CV ROC-AUC of 0.86.
Random Forest – slightly worse than Logistic Regression (CV ROC-AUC 0.855), likely due to underfitting or shallow trees.
Gradient Boosting – my best-performing model, achieving CV ROC-AUC of 0.874.
This step reinforced an important lesson: more complex models do not always guarantee better results, especially when careful preprocessing and largely linear signals carry most of the predictive power.
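Below is a minimal sketch of the comparison loop, assuming 5-fold CV and a fixed seed (my own choices here; the exact settings in my notebook may differ). It reuses the preprocess ColumnTransformer from the earlier sketch:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Standard Kaggle file layout; "Transported" is the target column.
train = pd.read_csv("train.csv")
X, y = train.drop(columns=["Transported"]), train["Transported"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    # Every candidate gets the exact same preprocessing, so the
    # comparison is apples-to-apples.
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: CV ROC-AUC = {scores.mean():.3f}")
```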
Step 3: Submitting My First Baseline
With Gradient Boosting selected, I:
Trained the pipeline on the entire training dataset
Predicted the test set using the trained pipeline
Created a submission file with the PassengerId and Transported columns (see the sketch below)
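In code, those three steps look roughly like this, reusing X, y, and preprocess from the earlier sketches (file names follow the standard Kaggle layout; the boolean cast of Transported is my assumption about the expected format):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

# Fit the winning pipeline on all of the training data.
best = Pipeline([("prep", preprocess),
                 ("model", GradientBoostingClassifier(random_state=42))])
best.fit(X, y)

# Predict the test set and write the submission file.
test = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Transported": best.predict(test).astype(bool),
})
submission.to_csv("submission.csv", index=False)
```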
Result: my first submission scored 0.79705 accuracy on the Kaggle leaderboard.
This is a strong baseline before any feature engineering, and it gives me a reference point for future iterations.
Next Steps
For my next iteration, I plan to:
Explore feature engineering (a rough sketch follows this list), such as:
Extracting deck and side information from the Cabin column
Combining spend columns into totals or flags
Tune Gradient Boosting hyperparameters
Experiment with XGBoost for potential performance gains
Track CV ROC-AUC vs leaderboard score to evaluate improvements
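To make the first two ideas concrete, here is a rough sketch of how I might implement them. This is a plan, not code from this submission, and the exact column handling is my assumption:

```python
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Cabin is formatted "Deck/Num/Side" (e.g. "F/23/P"); split it apart.
    cabin = out["Cabin"].str.split("/", expand=True)
    out["Deck"] = cabin[0]
    out["Side"] = cabin[2]
    # Collapse the five spend columns into a total and a zero-spend flag.
    out["TotalSpend"] = out[SPEND_COLS].sum(axis=1)
    out["NoSpend"] = out["TotalSpend"].eq(0)
    return out
```

The new Deck and Side columns would then be added to the categorical list in the preprocessing step, and TotalSpend to the numerical one.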
Conclusion
This submission marks my second Kaggle competition and demonstrates the importance of a clean pipeline, CV evaluation, and model selection. Achieving ~0.80 accuracy without feature engineering shows that good workflow + baseline models already get you far — and the next improvements will come from thoughtful feature engineering and model tuning.