Improving My Baseline Model: From Simple Linear Regression to a Proper Pipeline

In my previous post, I built a simple baseline model for the House Prices Kaggle competition using only numerical features, scaling, and a linear model.

Since then, I’ve iterated on that baseline by introducing a proper preprocessing pipeline, adding categorical features through one-hot encoding, log-transforming the target, and tightening up my evaluation. The goal wasn’t to chase a leaderboard score, but to build a more realistic and disciplined machine learning workflow.

In this post, I walk through the key improvements I made to the baseline model and what I learned from the process, before moving on to more advanced models in future experiments.

Quick Recap of the Baseline

Model: Ridge Regression

Features: Numerical features only

Preprocessing: Manual scaling

Evaluation: Train/validation split

Result: Kaggle score around 0.34

This was intentionally simple, but clearly incomplete.
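For context, the baseline looked roughly like this — a minimal sketch using synthetic data in place of the actual Kaggle features (the dataset, column names, and hyperparameters here are stand-ins, not the exact setup from the previous post). Note how the scaler is handled by hand, separately from the model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical House Prices features
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Manual scaling: fit on train, transform both sets by hand
scaler = StandardScaler().fit(X_train)
model = Ridge(alpha=1.0).fit(scaler.transform(X_train), y_train)

preds = model.predict(scaler.transform(X_val))
rmse = np.sqrt(mean_squared_error(y_val, preds))
print(f"validation RMSE: {rmse:.3f}")
```

This works, but every manual `scaler.transform(...)` call is a place where it’s easy to slip and fit on the wrong split.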

Problems With the Baseline

Categorical features were completely ignored

Preprocessing was done manually (easy to leak data)

No unified pipeline

Target variable (price) was highly skewed

Evaluation setup was not robust

Improvements I Made

🔹 Using a Pipeline

Keeps preprocessing and model together

Prevents data leakage

Makes experiments reproducible


This immediately made my workflow cleaner and safer.
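The idea can be sketched in a few lines (again with synthetic data standing in for the real features — the step names and `alpha` value are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Scaling and the model live in one object: fit() only ever sees training data,
# so the scaler's statistics cannot leak in from the validation set
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=1.0)),
])
pipe.fit(X_train, y_train)
preds = pipe.predict(X_val)
```

One `fit`, one `predict`, and the preprocessing travels with the model — that is what makes experiments reproducible.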

🔹 Handling Categorical Features

Stopped dropping them

Encoded them with one-hot encoding

House price data depends heavily on categorical features like neighborhood and building type

Improved the model’s ability to learn housing-related patterns
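A minimal sketch of how this fits into the pipeline, using a `ColumnTransformer` to route numeric and categorical columns to different preprocessors (the tiny DataFrame and column names here are illustrative, not the real dataset):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the House Prices data
df = pd.DataFrame({
    "GrLivArea": [1500, 2000, 1200, 1800],
    "Neighborhood": ["A", "B", "A", "C"],
    "SalePrice": [200000, 300000, 150000, 250000],
})
X, y = df.drop(columns="SalePrice"), df["SalePrice"]

# Numeric columns get scaled; categorical columns get one-hot encoded.
# handle_unknown="ignore" keeps predict() from failing on unseen categories.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["GrLivArea"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Neighborhood"]),
])
pipe = Pipeline([("pre", pre), ("model", Ridge())]).fit(X, y)
preds = pipe.predict(X)
```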

🔹 Log Transforming the Target

House prices are right-skewed

Log transform stabilized the target

Evaluation became more meaningful
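scikit-learn makes this convenient with `TransformedTargetRegressor`, which fits the model on `log1p(y)` and automatically inverts predictions back to the original price scale. A sketch with a synthetic right-skewed target (the data generation here is made up purely to mimic price-like skew):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.expm1(X @ rng.normal(size=5) + 12)  # right-skewed, price-like target

# Fits Ridge on log1p(y); predict() applies expm1 to return real prices
model = TransformedTargetRegressor(regressor=Ridge(), func=np.log1p, inverse_func=np.expm1)
model.fit(X, y)
preds = model.predict(X)
```

Alternatively, you can transform the target manually with `np.log1p` before fitting — the wrapper just removes one more place to make a mistake.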

🔹 Better Evaluation Discipline

Used consistent RMSE calculation

Compared models properly

Avoided misleading improvements
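Since the competition metric is RMSE on log prices, one small helper I found useful is computing RMSE consistently on the log scale so validation numbers are comparable to the leaderboard (the function name is my own):

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE on the log1p scale — the same shape as the Kaggle House Prices metric."""
    return float(np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)))

# A perfect prediction scores 0.0
print(rmse_log(np.array([200000.0]), np.array([200000.0])))  # 0.0
```

Evaluating every model with the same function, on the same split, is what makes an "improvement" believable.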

Results

These changes reduced my validation RMSE significantly compared to the baseline and gave me more confidence in the model’s ability to generalize.

What I Learned

Pipelines are not optional, even for baselines

Data preprocessing matters as much as the model

Target transformations can have a big impact

Clean evaluation > chasing scores

What I’ll Try Next

Feature engineering

Cross-validation and tuning

Tree-based models (Random Forest, Gradient Boosting)

Comparing linear vs non-linear models
