Scaling to Success: Implementing and Optimizing Penalized Models
This post will demonstrate the usage of Lasso, Ridge, and ElasticNet models using the Ames housing dataset. These models are particularly valuable when dealing with data that may suffer from multicollinearity. We leverage these regression techniques to show how feature scaling and hyperparameter tuning can improve model performance. In this post, we’ll provide a step-by-step walkthrough on setting up preprocessing pipelines, implementing each model with scikit-learn, and fine-tuning them to achieve optimal results. This approach not only aids in better prediction accuracy but also deepens your understanding of how different regularization methods affect model training and outcomes.
Let’s get started.
Overview
This post is divided into three parts; they are:
- The Crucial Role of Feature Scaling in Penalized Regression Models
- Practical Implementation of Penalized Models with the Ames Dataset
- Optimizing Hyperparameters for Penalized Regression Models
The Crucial Role of Feature Scaling in Penalized Regression Models
Data preprocessing is a pivotal step that significantly impacts model performance. One essential preprocessing step, particularly crucial when dealing with penalized regression models such as Lasso, Ridge, and ElasticNet, is feature scaling. But what exactly is feature scaling, and why is it indispensable for these models?
What is Feature Scaling?
Feature scaling is a method used to standardize the range of independent variables or features within data. The most common technique, known as standardization, involves rescaling the features so that they each have a mean of zero and a standard deviation of one. This adjustment is achieved by subtracting the mean of each feature from every observation and then dividing the result by the standard deviation of that feature.
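To make this concrete, here is a minimal sketch on a single made-up feature (the values are illustrative only) showing that StandardScaler performs exactly this subtract-the-mean, divide-by-the-standard-deviation computation:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Four observations of a single hypothetical feature (e.g., living area in square feet)
feature = np.array([[1500.0], [2100.0], [1250.0], [1800.0]])

# Manual standardization: subtract the mean, divide by the standard deviation
manual = (feature - feature.mean()) / feature.std()

# StandardScaler performs the same computation, column by column
scaled = StandardScaler().fit_transform(feature)

print(np.allclose(manual, scaled))  # True: both versions have mean 0 and std 1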
Why is Scaling Essential Before Applying Penalized Models?
Penalized regression models add a penalty to the size of the coefficients, which helps reduce overfitting and improve the generalizability of the model. However, the effectiveness of these penalties heavily depends on the scale of the input features:
- Uniform Penalty Application: Without scaling, features with larger scales can disproportionately influence the model. This imbalance can lead to a model unfairly penalizing smaller-scale features, potentially ignoring their significant impacts (a short sketch after this list makes this concrete).
- Model Stability and Convergence: Features with varied scales can cause numerical instability during model training. This instability can make achieving convergence to an optimal solution difficult or result in a suboptimal model.
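To illustrate the first point, here is a small sketch on synthetic data (not the Ames dataset): the same signal expressed on a much larger scale can be captured with a tiny coefficient, so the L1 penalty barely touches it, while the small-scale copy is zeroed out.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
small = rng.normal(size=200)              # feature on a unit scale
large = small * 1000                      # the same signal on a much larger scale
y = 5 * small + rng.normal(size=200)      # target driven by that signal

# Without scaling, Lasso favors the large-scale copy because it needs only a
# tiny coefficient (and hence pays almost no L1 penalty) to explain the target.
model = Lasso(alpha=1.0).fit(np.column_stack([small, large]), y)
print(model.coef_)  # roughly [0, 0.005]: the small-scale feature is eliminated

After standardization, both copies sit on the same scale, so the penalty treats them identically and the choice between them is no longer driven by their units.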
In the following example, we will demonstrate how to use the StandardScaler class on numeric features to address these issues effectively. This approach ensures that our penalized models—Lasso, Ridge, and ElasticNet—perform optimally, providing reliable and robust predictions.
Practical Implementation of Penalized Models with the Ames Dataset
Having discussed the importance of feature scaling, let’s dive into a practical example using the Ames housing dataset. This example demonstrates how to preprocess data and apply penalized regression models in Python using scikit-learn. The process involves setting up pipelines for both numeric and categorical data, ensuring a robust and reproducible workflow.
# Import necessary libraries
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Load the dataset and remove columns with missing values
Ames = pd.read_csv('Ames.csv').dropna(axis=1)

# Identify numeric and categorical features, excluding 'PID' and 'SalePrice'
numeric_features = Ames.select_dtypes(include=['int64', 'float64']).drop(columns=['PID', 'SalePrice']).columns
categorical_features = Ames.select_dtypes(include=['object']).columns
X = Ames[numeric_features.tolist() + categorical_features.tolist()]

# Target variable
y = Ames['SalePrice']

# Pipeline for numeric features
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combined preprocessor for both numeric and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the model pipelines with preprocessor and regressor
pipelines = {
    'Lasso': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', Lasso(max_iter=20000))]),
    'Ridge': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', Ridge())]),
    'ElasticNet': Pipeline(steps=[('preprocessor', preprocessor), ('regressor', ElasticNet())])
}

# Perform cross-validation and store results in a dictionary
cv_results = {}
for name, pipeline in pipelines.items():
    scores = cross_val_score(pipeline, X, y)
    cv_results[name] = round(scores.mean(), 4)

# Output the mean cross-validation scores
print(cv_results)
First, we import the necessary libraries and load the Ames dataset, removing any columns with missing values to simplify our initial model. We identify and separate the numeric and categorical features, excluding “PID” (a unique identifier for each property) and “SalePrice” (our target variable).
We then construct two separate pipelines for preprocessing:
- Numeric Features: We use StandardScaler to standardize the numeric features, ensuring that they contribute equally to our model without being biased by their original scale.
- Categorical Features: OneHotEncoder is employed to convert categorical variables into a format that can be provided to the machine learning algorithms, handling any unknown categories that might appear in future data sets.
Both pipelines are combined into a ColumnTransformer. This setup simplifies the code and encapsulates all preprocessing steps into a single transformer object that can be seamlessly integrated with any model. With preprocessing defined, we set up three different pipelines, each corresponding to a different penalized regression model: Lasso, Ridge, and ElasticNet. Each pipeline integrates the ColumnTransformer with a regressor, allowing us to maintain clarity and modularity in our code. Upon applying cross-validation to our penalized regression models, we obtained the following scores:
{'Lasso': 0.8863, 'Ridge': 0.8885, 'ElasticNet': 0.8299}
These results suggest that while all three models perform reasonably well, Ridge seems to handle this dataset best among the three, at least under the current settings.
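If you want more than the mean score, a small extension of the loop above (reusing the pipelines, X, and y already defined) reports the spread of the fold scores, which gives a rough sense of how stable each model is:

# Inspect per-fold variability, not just the mean, for each pipeline
for name, pipeline in pipelines.items():
    scores = cross_val_score(pipeline, X, y)
    print(f"{name}: mean R² = {scores.mean():.4f}, std = {scores.std():.4f}")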
Optimizing Hyperparameters for Penalized Regression Models
After establishing the foundation of feature scaling and implementing our penalized models on the Ames housing dataset, we now focus on an essential aspect of model development—hyperparameter tuning. This process is vital to refining our models and achieving the best performance. In this section, we’ll explore how adjusting the hyperparameters, specifically the regularization strength (alpha) and the balance between L1 and L2 penalties (l1_ratio for ElasticNet), can impact the performance of our models.
In the case of the Lasso model, we focus on tuning the alpha parameter, which controls the strength of the L1 penalty. The L1 penalty encourages the model to reduce the number of non-zero coefficients, which could potentially lead to simpler, more interpretable models.
# Building on the block of code above
# Implement GridSearchCV on Lasso to obtain the optimal alpha
from sklearn.model_selection import GridSearchCV

# Define range of alpha values for Lasso
alpha = list(range(1, 21, 1))  # Ranges from 1 to 20 in increments of 1

# Set up Grid Search for Lasso
lasso_grid = GridSearchCV(estimator=pipelines['Lasso'],
                          param_grid={'regressor__alpha': alpha},
                          verbose=1)  # Prints out progress
lasso_grid.fit(X, y)

# Extract the best alpha and best score for Lasso
lasso_best_alpha = lasso_grid.best_params_['regressor__alpha']
lasso_best_score = lasso_grid.best_score_

print(f"Best alpha for Lasso: {lasso_best_alpha}")
print(f"Best cross-validation score: {round(lasso_best_score, 4)}")
Setting verbose=1 in the GridSearchCV setup prints progress messages about the number of fits performed, which gives a clearer picture of the computational workload involved. The output confirms that the grid search explored the 20 alpha values across 5 folds each, totaling 100 model fits:
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best alpha for Lasso: 17
Best cross-validation score: 0.8881
The alpha value of 17 is relatively high, suggesting that the model benefits from a stronger level of regularization. This could indicate some level of multicollinearity or other factors in the dataset that make model simplification (fewer variables or smaller coefficients) beneficial for prediction accuracy.
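As a quick check of that interpretation, you can inspect the winning Lasso model and count how many coefficients the L1 penalty drove to exactly zero (this assumes the lasso_grid object fitted above):

import numpy as np

# Retrieve the refit pipeline with the best alpha and inspect its coefficients
best_lasso = lasso_grid.best_estimator_.named_steps['regressor']
print(f"Coefficients after one-hot encoding: {best_lasso.coef_.size}")
print(f"Eliminated (exactly zero): {np.sum(best_lasso.coef_ == 0)}")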
For the Ridge model, we also tune the alpha parameter, but here it affects the L2 penalty. Unlike L1, the L2 penalty does not zero out coefficients; instead, it reduces their magnitude, which helps in dealing with multicollinearity and model overfitting:
# Building on the block of code above
# Implement GridSearchCV on Ridge to obtain the optimal alpha
from sklearn.model_selection import GridSearchCV

# Define range of alpha values for Ridge
alpha = list(range(1, 21, 1))  # Ranges from 1 to 20 in increments of 1

# Set up Grid Search for Ridge
ridge_grid = GridSearchCV(estimator=pipelines['Ridge'],
                          param_grid={'regressor__alpha': alpha},
                          verbose=1)  # Prints out progress
ridge_grid.fit(X, y)

# Extract the best alpha and best score for Ridge
ridge_best_alpha = ridge_grid.best_params_['regressor__alpha']
ridge_best_score = ridge_grid.best_score_

print(f"Best alpha for Ridge: {ridge_best_alpha}")
print(f"Best cross-validation score: {round(ridge_best_score, 4)}")
The results from the GridSearchCV for Ridge regression show a best alpha of 3 with a cross-validation score of 0.889. This score is slightly higher than what was observed with the Lasso model (0.8881 with alpha at 17):
Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best alpha for Ridge: 3
Best cross-validation score: 0.889
The optimal alpha value for Ridge being significantly lower than for Lasso (3 versus 17) suggests that the dataset might benefit from the less aggressive regularization approach that Ridge offers. Ridge regularization (L2) doesn’t reduce coefficients to zero but rather shrinks them, which can be beneficial if many features have predictive power, albeit small. The fact that Ridge slightly outperformed Lasso in this case (0.889 vs. 0.8881) might indicate that feature elimination (which Lasso does through zeroing out coefficients) is not as beneficial for this dataset as feature shrinkage, which Ridge does. This could imply that most, if not all, predictors have some level of contribution to the target variable.
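You can see this difference directly by comparing the two tuned models: Ridge keeps essentially every coefficient non-zero while shrinking their magnitudes, whereas Lasso removes some outright (again assuming lasso_grid and ridge_grid fitted above):

import numpy as np

lasso_coefs = lasso_grid.best_estimator_.named_steps['regressor'].coef_
ridge_coefs = ridge_grid.best_estimator_.named_steps['regressor'].coef_

print(f"Exact zeros: Lasso = {np.sum(lasso_coefs == 0)}, Ridge = {np.sum(ridge_coefs == 0)}")
print(f"Mean |coefficient|: Lasso = {np.mean(np.abs(lasso_coefs)):.2f}, Ridge = {np.mean(np.abs(ridge_coefs)):.2f}")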
ElasticNet combines the penalties of Lasso and Ridge, controlled by alpha and l1_ratio. Tuning these parameters allows us to find a sweet spot between feature elimination and feature shrinkage, harnessing the strengths of both L1 and L2 regularization.
The l1_ratio parameter is specific to ElasticNet. ElasticNet is a hybrid model that combines penalties from both Lasso and Ridge. In this model:
- alpha still controls the overall strength of the penalty.
- l1_ratio specifies the balance between L1 and L2 regularization, where:
  - l1_ratio = 1 corresponds to Lasso,
  - l1_ratio = 0 corresponds to Ridge,
  - values in between adjust the mix of the two, as the short sketch after this list illustrates.
# Building on the block of code above
# Implement GridSearchCV on ElasticNet to obtain the optimal parameters
from sklearn.model_selection import GridSearchCV

# Define range of alpha values for ElasticNet
alpha = list(range(1, 21, 1))  # Ranges from 1 to 20 in increments of 1

# Define range of L1 ratio values for ElasticNet
l1_ratio = [0.05, 0.5, 0.95]

# Set up Grid Search for ElasticNet
elasticnet_grid = GridSearchCV(estimator=pipelines['ElasticNet'],
                               param_grid={'regressor__alpha': alpha,
                                           'regressor__l1_ratio': l1_ratio},
                               verbose=1)  # Prints out progress
elasticnet_grid.fit(X, y)

# Extract the best parameters and best score for ElasticNet
elasticnet_best_params = elasticnet_grid.best_params_
elasticnet_best_score = elasticnet_grid.best_score_

print(f"Best parameters for ElasticNet: {elasticnet_best_params}")
print(f"Best cross-validation score: {round(elasticnet_best_score, 4)}")
In the initial setup, before tuning, ElasticNet scored a cross-validation R² of 0.8299. This was notably lower than the scores achieved by Lasso and Ridge, indicating that the default parameters may not have been optimal for this model on the Ames housing dataset. After tuning, the best parameters for ElasticNet improved its score to 0.8762.
Fitting 5 folds for each of 60 candidates, totalling 300 fits
Best parameters for ElasticNet: {'regressor__alpha': 1, 'regressor__l1_ratio': 0.95}
Best cross-validation score: 0.8762
The lift from 0.8299 to 0.8762 demonstrates the substantial impact that fine-tuning the hyperparameters can have on model performance. This underscores the necessity of hyperparameter optimization, especially in models like ElasticNet that balance two types of regularization. The tuning effectively adjusted the balance between the L1 and L2 penalties, finding a configuration that better fits the dataset. While the model’s performance after tuning did not surpass the best Ridge model (which scored 0.889), it closed the gap considerably, demonstrating that with the right parameters, ElasticNet can compete closely with the simpler regularization models.
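To wrap up the tuning step, a short snippet (assuming all three grid searches above have been fitted) places the tuned scores next to the untuned baselines from earlier:

# Compare untuned baselines with the best cross-validation scores after tuning
tuned_results = {
    'Lasso': round(lasso_best_score, 4),
    'Ridge': round(ridge_best_score, 4),
    'ElasticNet': round(elasticnet_best_score, 4),
}
print('Before tuning:', cv_results)
print('After tuning: ', tuned_results)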
Further Reading
APIs
Tutorials
Resources
Summary
In this guide, we explored the application and optimization of penalized regression models—Lasso, Ridge, and ElasticNet—using the Ames housing dataset. We started by highlighting the importance of feature scaling to ensure equal contribution from all features. Through setting up scikit-learn pipelines, we demonstrated how different models perform with basic configurations, with Ridge slightly outperforming the others initially. We then focused on hyperparameter tuning, which not only significantly improved ElasticNet’s performance by adjusting alpha and l1_ratio but also deepened our understanding of the behavior of different models under various configurations. This insight is crucial, as it helps select the right model and settings for specific datasets and prediction goals, highlighting that hyperparameter tuning is not just about achieving higher accuracy but also about understanding model dynamics.
Specifically, you learned:
- The critical role of feature scaling in the context of penalized models.
- How to implement Lasso, Ridge, and ElasticNet models using scikit-learn pipelines.
- How to optimize model performance using GridSearchCV and hyperparameter tuning.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.