Introduction
After completing my first Kaggle competition on Housing Prices, I decided to tackle the Spaceship Titanic dataset. The goal is to predict whether passengers were transported to another dimension during the voyage.
This competition has been a great opportunity to improve my workflow, learn about pipelines, and practice model evaluation. In this post, I’ll walk through my process — from a clean pipeline to my first submission, which achieved 0.797 accuracy on the leaderboard.
Step 1: Building a Clean Pipeline
Before jumping into fancy models, I wanted a reproducible and reliable workflow. I used ColumnTransformer to preprocess my features:
Numerical features: median imputation + scaling
Categorical features: most frequent imputation + one-hot encoding
This ensures the pipeline handles all preprocessing consistently, avoids data leakage, and keeps everything modular.
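Here's a minimal sketch of that setup. The column lists are my reconstruction of the Spaceship Titanic schema rather than a copy-paste from my notebook, so treat the exact names as placeholders:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column lists follow the Spaceship Titanic schema (adjust as needed).
num_cols = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
cat_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP"]

# Numerical features: median imputation + scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical features: most-frequent imputation + one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, num_cols),
    ("cat", categorical, cat_cols),
])
```

Because the ColumnTransformer drops columns it isn't told about, identifiers like PassengerId never leak into the model.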
Step 2: Testing Different Models
I experimented with three models to see which would perform best:
Logistic Regression – surprisingly strong, with a CV accuracy of ~78% and CV ROC-AUC of 0.86.
Random Forest – slightly worse than Logistic Regression (CV ROC-AUC 0.855), likely due to underfitting or shallow trees.
Gradient Boosting – my best-performing model, achieving CV ROC-AUC of 0.874.
This step reinforced an important lesson: more complex models do not always guarantee better results, especially when careful preprocessing and largely linear signals carry most of the predictive power.
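Below is a minimal sketch of the comparison loop, assuming 5-fold CV and a fixed seed (my own choices here; the exact settings in my notebook may differ). It reuses the preprocess ColumnTransformer from the earlier sketch:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Standard Kaggle file layout; "Transported" is the target column.
train = pd.read_csv("train.csv")
X, y = train.drop(columns=["Transported"]), train["Transported"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    # Every candidate gets the exact same preprocessing, so the
    # comparison is apples-to-apples.
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: CV ROC-AUC = {scores.mean():.3f}")
```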
Step 3: Submitting My First Baseline
With Gradient Boosting selected, I:
Trained the pipeline on the entire training dataset
Predicted the test set using the trained pipeline
Created a submission file with the PassengerId and Transported columns (see the sketch below)
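In code, those three steps look roughly like this, reusing X, y, and preprocess from the earlier sketches (file names follow the standard Kaggle layout; the boolean cast of Transported is my assumption about the expected format):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

# Fit the winning pipeline on all of the training data.
best = Pipeline([("prep", preprocess),
                 ("model", GradientBoostingClassifier(random_state=42))])
best.fit(X, y)

# Predict the test set and write the submission file.
test = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Transported": best.predict(test).astype(bool),
})
submission.to_csv("submission.csv", index=False)
```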
Result: my first submission scored 0.79705 accuracy on the Kaggle leaderboard.
This is a strong baseline before any feature engineering, and it gives me a reference point for future iterations.
Next Steps
For my next iteration, I plan to:
Explore feature engineering (a rough sketch follows this list), such as:
Extracting deck and side information from the Cabin column
Combining spend columns into totals or flags
Tune Gradient Boosting hyperparameters
Experiment with XGBoost for potential performance gains
Track CV ROC-AUC vs leaderboard score to evaluate improvements
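To make the first two ideas concrete, here is a rough sketch of how I might implement them. This is a plan, not code from this submission, and the exact column handling is my assumption:

```python
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Cabin is formatted "Deck/Num/Side" (e.g. "F/23/P"); split it apart.
    cabin = out["Cabin"].str.split("/", expand=True)
    out["Deck"] = cabin[0]
    out["Side"] = cabin[2]
    # Collapse the five spend columns into a total and a zero-spend flag.
    out["TotalSpend"] = out[SPEND_COLS].sum(axis=1)
    out["NoSpend"] = out["TotalSpend"].eq(0)
    return out
```

The new Deck and Side columns would then be added to the categorical list in the preprocessing step, and TotalSpend to the numerical one.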
Conclusion
This submission marks my second Kaggle competition and demonstrates the importance of a clean pipeline, CV evaluation, and model selection. Achieving ~0.80 accuracy without feature engineering shows that good workflow + baseline models already get you far — and the next improvements will come from thoughtful feature engineering and model tuning.