My First Kaggle Submission: A Baseline and What I’m Improving Next

I recently submitted my first baseline model to the Kaggle Housing Prices competition.

This post is not about a perfect solution—it’s about documenting my learning process as I work through the problem step by step. I plan to make 3–4 submissions in total, each one improving on the previous version. This post covers the absolute baseline and what foundational work I’m focusing on next.

Why I’m Doing This Competition

I’m using the Housing Prices competition as a way to build practical skills:

  • Practice end-to-end machine learning workflows.
  • Get comfortable with feature handling and evaluation.
  • Learn by doing, not just watching tutorials.
  • Document my progress publicly so I can think more clearly and solidify my understanding.

This is part of my broader goal to get better at applied machine learning.

What the Housing Prices Competition Is About

The goal of the competition is to predict house prices using a rich dataset that contains:

  • Numerical features (e.g., size, year built).
  • Categorical features (e.g., neighborhood, house style).

The evaluation metric is Root Mean Squared Logarithmic Error (RMSLE). Because errors are measured on the log scale, the metric penalizes relative rather than absolute error: being $20,000 off on a $100,000 house hurts far more than the same miss on a $500,000 house. That is also why log-transforming the target is such a common step in this competition.
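To make that concrete, here is a small NumPy sketch of the metric (the prices are made up for illustration):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error: RMSE computed on log(1 + y)."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Two predictions that are each off by $20,000 in absolute terms:
cheap = rmsle(np.array([100_000.0]), np.array([120_000.0]))
pricey = rmsle(np.array([500_000.0]), np.array([520_000.0]))

# The same absolute miss costs far more on the cheaper house,
# because RMSLE measures relative (log-scale) error.
print(cheap > pricey)  # True
```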

My Baseline Approach (Minimalist Pipeline)

For my first submission, I deliberately kept things simple. My primary goal was to validate the end-to-end process: read data, train model, predict, and submit.

What I Did (The Simple Pipeline)

  • Used only numerical features.
  • Scaled the features (e.g., using StandardScaler).
  • Trained a simple Ridge Regression model.
  • Did minimal preprocessing (e.g., ignoring outliers).
  • No feature engineering.
  • No cross-validation (CV).

The goal was to:

  1. Understand the data format.
  2. Set up a clean train → predict → submit pipeline.
  3. Get a valid score on the leaderboard.
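The minimalist pipeline above can be sketched roughly as follows. This is a reconstruction, not my exact notebook, and the DataFrame here is a tiny synthetic stand-in for the competition's train.csv:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the competition's training data.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "Id": range(100),
    "GrLivArea": rng.integers(800, 3000, 100),
    "YearBuilt": rng.integers(1900, 2010, 100),
    "Neighborhood": rng.choice(["NAmes", "OldTown"], 100),  # categorical -> dropped below
})
train["SalePrice"] = 50 * train["GrLivArea"] + 300 * (train["YearBuilt"] - 1900)

# Baseline: numeric columns only, scaled, fed to Ridge.
X = train.drop(columns=["Id", "SalePrice"]).select_dtypes("number")
y = train["SalePrice"]

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
preds = model.predict(X)
```

Note how `select_dtypes("number")` silently throws away Neighborhood, which is exactly the information loss discussed below.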

Baseline Result & Key Takeaways

My first submission score on the Kaggle public leaderboard was: 0.3494.

I wasn’t aiming for a strong score at this stage. The important thing was getting a working baseline and confirming that my approach made sense. Seeing the submission go through was already a win!

What I Learned Immediately

A few things became obvious immediately after submitting:

  1. Ignoring categorical features leaves a massive amount of predictive information unused (features like Neighborhood are likely crucial).
  2. A simple linear model like Ridge Regression can only go so far with limited data preparation.
  3. The competition rewards thoughtful preprocessing and feature handling.

This baseline gave me a necessary reference point to improve from.

My Structured Improvement Plan

For the next submission, my focus is on doing things properly and building a robust foundation, rather than jumping straight to complex models.

Step 1: Robust Pipeline and Preprocessing (Next Post Focus)

  • Add categorical features: Incorporate techniques like One-Hot Encoding or Target Encoding.
  • Use a proper scikit-learn Pipeline: This will ensure cleaner code and prevent data leakage between steps.
  • Apply transformations: Use the necessary log transformation where appropriate, especially on the target variable, to better align the error distribution with the RMSLE metric.
  • Introduce Cross-Validation (CV): Get a reliable estimate of model performance before submitting to the leaderboard.
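A rough sketch of where Step 1 is heading, again on synthetic stand-in data (column names and the Ridge estimator are placeholders; the real dataset mixes dozens of numeric and categorical columns):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the competition data.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "GrLivArea": rng.integers(800, 3000, 200),
    "Neighborhood": rng.choice(["NAmes", "OldTown", "CollgCr"], 200),
})
y = pd.Series(50.0 * X["GrLivArea"] + rng.normal(0, 5000, 200) + 40_000)

# Separate handling for numeric and categorical columns, all inside one
# Pipeline so imputation/scaling are fit per CV fold (no leakage).
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["GrLivArea"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
])

# log1p/expm1 on the target aligns the squared-error objective with RMSLE.
model = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))]),
    func=np.log1p,
    inverse_func=np.expm1,
)

# 5-fold CV on squared log error approximates the leaderboard metric.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_log_error")
rmsle_cv = np.sqrt(-scores.mean())
print(f"CV RMSLE: {rmsle_cv:.4f}")
```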

Step 2: Advanced Modeling (Submissions 3 & 4)

Once the foundation is solid, I’ll move on to:

  • Exploring Gradient Boosting models (e.g., XGBoost, LightGBM), which are often top performers in structured data competitions.
  • Understanding where the performance gains actually come from through feature importance analysis.

How Many Submissions I Expect

  1. Baseline (This one)
  2. Improved preprocessing + pipeline
  3. Stronger models (e.g., gradient boosting)
  4. Optional refinement if needed

Each submission will build on the last.

What’s Next?

My next step is improving feature handling and setting up a cleaner, more robust pipeline using sklearn.compose and sklearn.pipeline.

I’ll write a follow-up post once I submit the next version and compare its performance to this baseline. This blog is mainly a place for me to document what I’m learning—and hopefully help others who are a few steps behind me.
