From Baseline to Submission: My Gradient Boosting Pipeline on Spaceship Titanic

Introduction

After completing my first Kaggle competition on Housing Prices, I decided to tackle the Spaceship Titanic dataset. The goal is to predict whether passengers were transported to another dimension during the voyage.

This competition has been a great opportunity to improve my workflow, learn about pipelines, and practice model evaluation. In this post, I’ll walk through my process — from a clean pipeline to my first submission, which achieved 0.797 accuracy on the leaderboard.

Step 1: Building a Clean Pipeline

Before jumping into fancy models, I wanted a reproducible, reliable workflow. I used scikit-learn's ColumnTransformer to preprocess the two feature types separately:

  • Numerical features: median imputation + scaling

  • Categorical features: most frequent imputation + one-hot encoding

Because every preprocessing step lives inside the pipeline, it gets re-fit on each cross-validation training fold. That keeps preprocessing consistent, avoids data leakage, and keeps everything modular.
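
Here's a minimal sketch of that setup. The column lists are my own split of the standard Spaceship Titanic fields, so treat them as an assumption and adjust to whatever your DataFrame actually contains:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed feature split for the Spaceship Titanic columns -- adjust as needed.
num_cols = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
cat_cols = ["HomePlanet", "CryoSleep", "Destination", "VIP"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median imputation
    ("scale", StandardScaler()),                   # scaling
])

categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
    ("onehot", OneHotEncoder(handle_unknown="ignore")),   # one-hot encoding
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("cat", categorical_pipe, cat_cols),
])
```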

Step 2: Testing Different Models

I experimented with three models to see which would perform best (a comparison sketch follows the list):

  1. Logistic Regression – surprisingly strong, with a CV accuracy of ~78% and CV ROC-AUC of 0.86.

  2. Random Forest – slightly behind Logistic Regression (CV ROC-AUC 0.855), likely underfitting, perhaps from overly shallow trees.

  3. Gradient Boosting – my best-performing model, achieving CV ROC-AUC of 0.874.
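
To compare the three fairly, each one was dropped into the same pipeline and cross-validated. A minimal sketch, assuming X and y hold the training features and target and preprocess is the ColumnTransformer from Step 1 (the 5-fold setup is my assumption):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Score every candidate with the identical preprocessing, so differences
# reflect the model rather than the data handling.
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: CV ROC-AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```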

This step reinforced an important lesson: more complex models do not always guarantee better results, especially when preprocessing and linear signals dominate the dataset.

Step 3: Submitting My First Baseline

With Gradient Boosting selected, here's what I did (a code sketch follows the list):

  1. Trained the pipeline on the entire training dataset

  2. Predicted the test set using the trained pipeline

  3. Created a submission file with PassengerId and Transported
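
In code, that boils down to a few lines. A sketch, assuming X and y are the full training features/target and test_df is the raw test DataFrame (the ColumnTransformer picks out the columns it needs by name):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

final_pipe = Pipeline([
    ("prep", preprocess),
    ("model", GradientBoostingClassifier(random_state=42)),
])
final_pipe.fit(X, y)  # train on the entire training dataset

# Predict the test set; the submission expects True/False for Transported.
preds = final_pipe.predict(test_df).astype(bool)

submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Transported": preds,
})
submission.to_csv("submission.csv", index=False)
```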

Result: my first submission scored 0.79705 accuracy on the Kaggle leaderboard.

This is a strong baseline before any feature engineering, providing a reference for improvements in future iterations.

Next Steps

For my next iteration, I plan to:

  1. Explore feature engineering (sketched after this list), such as:

    • Extracting deck and side information from the Cabin column

    • Combining spend columns into totals or flags

  2. Tune Gradient Boosting hyperparameters

  3. Experiment with XGBoost for potential performance gains

  4. Track CV ROC-AUC vs leaderboard score to evaluate improvements
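
For the feature-engineering item, here's the kind of transform I have in mind. It's only a sketch: the Cabin field in this dataset is formatted "Deck/Num/Side", and the spend columns get rolled up into a total plus a no-spend flag:

```python
import pandas as pd

SPEND_COLS = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Cabin is "Deck/Num/Side"; pull out the deck and side parts.
    cabin = out["Cabin"].str.split("/", expand=True)
    out["Deck"] = cabin[0]
    out["Side"] = cabin[2]
    # Combine the spend columns into a total and a binary flag.
    out["TotalSpend"] = out[SPEND_COLS].sum(axis=1)  # NaNs count as 0 here
    out["NoSpend"] = (out["TotalSpend"] == 0).astype(int)
    return out
```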

Conclusion

This submission marks my second Kaggle competition and demonstrates the importance of a clean pipeline, CV evaluation, and model selection. Achieving ~0.80 accuracy without feature engineering shows that good workflow + baseline models already get you far — and the next improvements will come from thoughtful feature engineering and model tuning.
