Scaling to Success: Implementing and Optimizing Penalized Models
This post will demonstrate the usage of Lasso, Ridge, and ElasticNet models using the Ames housing dataset. These models are particularly valuable when dealing with data that may suffer from multicollinearity. We leverage these regression techniques to show how feature scaling and hyperparameter tuning can improve model performance. In this post, we’ll provide a step-by-step walkthrough on setting up preprocessing pipelines, implementing each model with scikit-learn, and fine-tuning them to achieve optimal results. This approach not only aids in better prediction accuracy but also deepens your understanding of how different regularization methods affect model training and outcomes.
Let’s get started.
Overview
This post is divided into three parts; they are:
- The Crucial Role of Feature Scaling in Penalized Regression Models
- Practical Implementation of Penalized Models with the Ames Dataset
- Optimizing Hyperparameters for Penalized Regression Models
The Crucial Role of Feature Scaling in Penalized Regression Models
Data preprocessing is a pivotal step that significantly impacts model performance. One essential preprocessing step, particularly crucial when dealing with penalized regression models such as Lasso, Ridge, and ElasticNet, is feature scaling. But what exactly is feature scaling, and why is it indispensable for these models?
What is Feature Scaling?
Feature scaling is a method used to standardize the range of independent variables or features within data. The most common technique, known as standardization, involves rescaling the features so that they each have a mean of zero and a standard deviation of one. This adjustment is achieved by subtracting the mean of each feature from every observation and then dividing the result by the standard deviation of that feature.
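To make this concrete, here is a minimal sketch on a single made-up feature (the values are illustrative only) showing that StandardScaler performs exactly this subtract-the-mean, divide-by-the-standard-deviation computation:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Four observations of a single hypothetical feature (e.g., living area in square feet)
feature = np.array([[1500.0], [2100.0], [1250.0], [1800.0]])

# Manual standardization: subtract the mean, divide by the standard deviation
manual = (feature - feature.mean()) / feature.std()

# StandardScaler performs the same computation, column by column
scaled = StandardScaler().fit_transform(feature)

print(np.allclose(manual, scaled))  # True: both versions have mean 0 and std 1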
Why is Scaling Essential Before Applying Penalized Models?
Penalized regression models add a penalty to the size of the coefficients, which helps reduce overfitting and improve the generalizability of the model. However, the effectiveness of these penalties heavily depends on the scale of the input features:
- Uniform Penalty Application: Without scaling, features with larger scales can disproportionately influence the model. This imbalance can lead to a model unfairly penalizing smaller-scale features, potentially ignoring their significant impacts (a short sketch after this list makes this concrete).
- Model Stability and Convergence: Features with varied scales can cause numerical instability during model training. This instability can make achieving convergence to an optimal solution difficult or result in a suboptimal model.
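To illustrate the first point, here is a small sketch on synthetic data (not the Ames dataset): the same signal expressed on a much larger scale can be captured with a tiny coefficient, so the L1 penalty barely touches it, while the small-scale copy is zeroed out.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
small = rng.normal(size=200)              # feature on a unit scale
large = small * 1000                      # the same signal on a much larger scale
y = 5 * small + rng.normal(size=200)      # target driven by that signal

# Without scaling, Lasso favors the large-scale copy because it needs only a
# tiny coefficient (and hence pays almost no L1 penalty) to explain the target.
model = Lasso(alpha=1.0).fit(np.column_stack([small, large]), y)
print(model.coef_)  # roughly [0, 0.005]: the small-scale feature is eliminated

After standardization, both copies sit on the same scale, so the penalty treats them identically and the choice between them is no longer driven by their units.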
In the following example, we will demonstrate how to use the StandardScaler class on numeric features to address these issues effectively. This approach ensures that our penalized models—Lasso, Ridge, and ElasticNet—perform optimally, providing reliable and robust predictions.
Practical Implementation of Penalized Models with the Ames Dataset
Having discussed the importance of feature scaling, let’s dive into a practical example using the Ames housing dataset. This example demonstrates how to preprocess data and apply penalized regression models in Python using scikit-learn. The process involves setting up pipelines for both numeric and categorical data, ensuring a robust and reproducible workflow.
# Import necessary libraries
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Load the dataset and remove columns with missing values
Ames = pd.read_csv('Ames.csv').dropna(axis=1)

# Identify numeric and categorical features, excluding 'PID' and 'SalePrice'
numeric_features = Ames.select_dtypes(include=['int64', 'float64']).drop(columns=['PID', 'SalePrice']).columns
categorical_features = Ames.select_dtypes(include=['object']).columns
X = Ames[numeric_features.tolist() + categorical_features.tolist()]

# Target variable
y = Ames['SalePrice']

# Pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combined preprocessor for both numeric and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the model pipelines with preprocessor and regressor
pipelines = {
    'Lasso': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', Lasso(max_iter=20000))]),
    'Ridge': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', Ridge())]),
    'ElasticNet': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', ElasticNet())])
}

# Perform cross-validation and store results in a dictionary
cv_results = {}
for name, pipeline in pipelines.items():
    scores = cross_val_score(pipeline, X, y)
    cv_results[name] = round(scores.mean(), 4)

# Output the mean cross-validation scores
print(cv_results)
First, we import the necessary libraries and load the Ames dataset, removing any columns with missing values to simplify our initial model. We identify and separate the numeric and categorical features, excluding “PID” (a unique identifier for each property) and “SalePrice” (our target variable).
We then construct two separate pipelines for preprocessing:
- Numeric Features: We use StandardScaler to standardize the numeric features, ensuring that they contribute equally to our model without being biased by their original scale.
- Categorical Features: OneHotEncoder is employed to convert categorical variables into a format that can be provided to the machine learning algorithms, handling any unknown categories that might appear in future data sets.
Both pipelines are combined into a ColumnTransformer. This setup simplifies the code and encapsulates all preprocessing steps into a single transformer object that can be seamlessly integrated with any model. With preprocessing defined, we set up three different pipelines, each corresponding to a different penalized regression model: Lasso, Ridge, and ElasticNet. Each pipeline integrates the ColumnTransformer with a regressor, allowing us to maintain clarity and modularity in our code. Upon applying cross-validation to our penalized regression models, we obtained the following scores:
{'Lasso': 0.8863, 'Ridge': 0.8885, 'ElasticNet': 0.8299}
These results suggest that while all three models perform reasonably well, Ridge seems to handle this dataset best among the three, at least under the current settings.
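If you want more than the mean score, a small extension of the loop above (reusing the pipelines, X, and y already defined) reports the spread of the fold scores, which gives a rough sense of how stable each model is:

# Inspect per-fold variability, not just the mean, for each pipeline
for name, pipeline in pipelines.items():
    scores = cross_val_score(pipeline, X, y)
    print(f"{name}: mean R² = {scores.mean():.4f}, std = {scores.std():.4f}")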
Optimizing Hyperparameters for Penalized Regression Models
After establishing the foundation of feature scaling and implementing our penalized models on the Ames housing dataset, we now focus on an essential aspect of model development—hyperparameter tuning. This process is vital to refining our models and achieving the best performance. In this section, we’ll explore how adjusting the hyperparameters, specifically the regularization strength (alpha) and the balance between L1 and L2 penalties (l1_ratio for ElasticNet), can impact the performance of our models.
In the case of the Lasso model, we focus on tuning the alpha parameter, which controls the strength of the L1 penalty. The L1 penalty encourages the model to reduce the number of non-zero coefficients, which could potentially lead to simpler, more interpretable models.
# Building on the block of code above
# Implement GridSearchCV on Lasso to obtain the optimal alpha
from sklearn.model_selection import GridSearchCV

# Define range of alpha values for Lasso
alpha = list(range(1, 21, 1))  # Ranges from 1 to 20 in increments of 1

# Set up Grid Search for Lasso
lasso_grid = GridSearchCV(estimator=pipelines['Lasso'],
                          param_grid={'regressor__alpha': alpha},
                          verbose=1)  # Prints out progress
lasso_grid.fit(X, y)

# Extract the best alpha and best score for Lasso
lasso_best_alpha = lasso_grid.best_params_['regressor__alpha']
lasso_best_score = lasso_grid.best_score_

print(f"Best alpha for Lasso: {lasso_best_alpha}")
print(f"Best cross-validation score: {round(lasso_best_score, 4)}")
Setting verbose=1 in the GridSearchCV setup prints progress messages about the number of fits performed, which gives a clearer picture of the computational workload involved. The output confirms that the grid search explored the 20 alpha values across 5 folds each, totaling 100 model fits:
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best alpha for Lasso: 17
Best cross-validation score: 0.8881
The alpha value of 17 is relatively high, suggesting that the model benefits from a stronger level of regularization. This could indicate some level of multicollinearity or other factors in the dataset that make model simplification (fewer variables or smaller coefficients) beneficial for prediction accuracy.
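As a quick check of that interpretation, you can inspect the winning Lasso model and count how many coefficients the L1 penalty drove to exactly zero (this assumes the lasso_grid object fitted above):

import numpy as np

# Retrieve the refit pipeline with the best alpha and inspect its coefficients
best_lasso = lasso_grid.best_estimator_.named_steps['regressor']
print(f"Coefficients after one-hot encoding: {best_lasso.coef_.size}")
print(f"Eliminated (exactly zero): {np.sum(best_lasso.coef_ == 0)}")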
For the Ridge model, we also tune the alpha parameter, but here it affects the L2 penalty. Unlike L1, the L2 penalty does not zero out coefficients; instead, it reduces their magnitude, which helps in dealing with multicollinearity and model overfitting:
# Building on the block of code above
# Implement GridSearchCV on Ridge to obtain the optimal alpha
from sklearn.model_selection import GridSearchCV

# Define range of alpha values for Ridge
alpha = list(range(1, 21, 1))  # Ranges from 1 to 20 in increments of 1

# Set up Grid Search for Ridge
ridge_grid = GridSearchCV(estimator=pipelines['Ridge'],
                          param_grid={'regressor__alpha': alpha},
                          verbose=1)  # Prints out progress
ridge_grid.fit(X, y)

# Extract the best alpha and best score for Ridge
ridge_best_alpha = ridge_grid.best_params_['regressor__alpha']
ridge_best_score = ridge_grid.best_score_

print(f"Best alpha for Ridge: {ridge_best_alpha}")
print(f"Best cross-validation score: {round(ridge_best_score, 4)}")
The results from the GridSearchCV for Ridge regression show a best alpha of 3 with a cross-validation score of 0.889. This score is slightly higher than what was observed with the Lasso model (0.8881 with alpha at 17):
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best alpha for Ridge: 3
Best cross-validation score: 0.889
The optimal alpha value for Ridge being significantly lower than for Lasso (3 versus 17) suggests that the dataset might benefit from the less aggressive regularization approach that Ridge offers. Ridge regularization (L2) doesn’t reduce coefficients to zero but rather shrinks them, which can be beneficial if many features have predictive power, albeit small. The fact that Ridge slightly outperformed Lasso in this case (0.889 vs. 0.8881) might indicate that feature elimination (which Lasso does through zeroing out coefficients) is not as beneficial for this dataset as feature shrinkage, which Ridge does. This could imply that most, if not all, predictors have some level of contribution to the target variable.
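You can see this difference directly by comparing the two tuned models: Ridge keeps essentially every coefficient non-zero while shrinking their magnitudes, whereas Lasso removes some outright (again assuming lasso_grid and ridge_grid fitted above):

import numpy as np

lasso_coefs = lasso_grid.best_estimator_.named_steps['regressor'].coef_
ridge_coefs = ridge_grid.best_estimator_.named_steps['regressor'].coef_

print(f"Exact zeros: Lasso = {np.sum(lasso_coefs == 0)}, Ridge = {np.sum(ridge_coefs == 0)}")
print(f"Mean |coefficient|: Lasso = {np.mean(np.abs(lasso_coefs)):.2f}, Ridge = {np.mean(np.abs(ridge_coefs)):.2f}")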
ElasticNet combines the penalties of Lasso and Ridge, controlled by alpha and l1_ratio. Tuning these parameters allows us to find a sweet spot between feature elimination and feature shrinkage, harnessing the strengths of both L1 and L2 regularization.
The l1_ratio parameter is specific to ElasticNet. ElasticNet is a hybrid model that combines penalties from both Lasso and Ridge. In this model:
- alpha still controls the overall strength of the penalty.
- l1_ratio specifies the balance between L1 and L2 regularization, where:
  - l1_ratio = 1 corresponds to Lasso,
  - l1_ratio = 0 corresponds to Ridge,
  - values in between adjust the mix of the two, as the short sketch after this list illustrates.
# Building on the block of code above
# Implement GridSearchCV on ElasticNet to obtain the optimal parameters
from sklearn.model_selection import GridSearchCV

# Define range of alpha values for ElasticNet
alpha = list(range(1, 21, 1))  # Ranges from 1 to 20 in increments of 1

# Define range of L1 ratio values for ElasticNet
l1_ratio = [0.05, 0.5, 0.95]

# Set up Grid Search for ElasticNet
elasticnet_grid = GridSearchCV(estimator=pipelines['ElasticNet'],
                               param_grid={'regressor__alpha': alpha,
                                           'regressor__l1_ratio': l1_ratio},
                               verbose=1)  # Prints out progress
elasticnet_grid.fit(X, y)

# Extract the best parameters and best score for ElasticNet
elasticnet_best_params = elasticnet_grid.best_params_
elasticnet_best_score = elasticnet_grid.best_score_

print(f"Best parameters for ElasticNet: {elasticnet_best_params}")
print(f"Best cross-validation score: {round(elasticnet_best_score, 4)}")
In the initial setup, before tuning, ElasticNet scored a cross-validation R² of 0.8299. This was notably lower than the scores achieved by Lasso and Ridge, indicating that the default parameters may not have been optimal for this model on the Ames housing dataset. After tuning, the best parameters for ElasticNet improved its score to 0.8762.
Fitting 5 folds for each of 60 candidates, totalling 300 fits
Best parameters for ElasticNet: {'regressor__alpha': 1, 'regressor__l1_ratio': 0.95}
Best cross-validation score: 0.8762
The lift from 0.8299 to 0.8762 demonstrates the substantial impact that fine-tuning the hyperparameters can have on model performance. This underscores the necessity of hyperparameter optimization, especially in models like ElasticNet that balance two types of regularization. The tuning effectively adjusted the balance between the L1 and L2 penalties, finding a configuration that better fits the dataset. While the model’s performance after tuning did not surpass the best Ridge model (which scored 0.889), it closed the gap considerably, demonstrating that with the right parameters, ElasticNet can compete closely with the simpler regularization models.
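To wrap up the tuning step, a short snippet (assuming all three grid searches above have been fitted) places the tuned scores next to the untuned baselines from earlier:

# Compare untuned baselines with the best cross-validation scores after tuning
tuned_results = {
    'Lasso': round(lasso_best_score, 4),
    'Ridge': round(ridge_best_score, 4),
    'ElasticNet': round(elasticnet_best_score, 4),
}
print('Before tuning:', cv_results)
print('After tuning: ', tuned_results)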
Further Reading
APIs
Tutorials
Resources
Summary
In this guide, we explored the application and optimization of penalized regression models—Lasso, Ridge, and ElasticNet—using the Ames housing dataset. We started by highlighting the importance of feature scaling to ensure equal contribution from all features. Through setting up scikit-learn pipelines, we demonstrated how different models perform with basic configurations, with Ridge slightly outperforming the others initially. We then focused on hyperparameter tuning, which not only significantly improved ElasticNet’s performance by adjusting alpha and l1_ratio but also deepened our understanding of the behavior of different models under various configurations. This insight is crucial, as it helps select the right model and settings for specific datasets and prediction goals, highlighting that hyperparameter tuning is not just about achieving higher accuracy but also about understanding model dynamics.
Specifically, you learned:
- The critical role of feature scaling in the context of penalized models.
- How to implement Lasso, Ridge, and ElasticNet models using scikit-learn pipelines.
- How to optimize model performance using GridSearchCV and hyperparameter tuning.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.