Boosting Over Bagging: Enhancing Predictive Accuracy with Gradient Boosting Regressors
Ensemble learning techniques primarily fall into two categories: bagging and boosting. Bagging improves stability and accuracy by aggregating independent predictions, whereas boosting sequentially corrects the errors of prior models, improving their performance with each iteration. This post begins our deep dive into boosting, starting with the Gradient Boosting Regressor. Through its application on the Ames Housing Dataset, we will demonstrate how boosting uniquely enhances models, setting the stage for exploring various boosting techniques in upcoming posts.
Let’s get started.
Overview
This post is divided into four parts; they are:
- What is Boosting?
- Comparing Model Performance: Decision Tree Baseline to Gradient Boosting Ensembles
- Optimizing Gradient Boosting with Learning Rate Adjustments
- Final Optimization: Tuning Learning Rate and Number of Trees
What is Boosting?
Boosting is an ensemble technique combining multiple models to create a strong learner. Unlike other ensemble methods that may build models in parallel, boosting adds models sequentially, with each new model focusing on improving the areas where previous models struggled. This methodically improves the ensemble’s accuracy with each iteration, making it particularly effective for complex datasets.
Key Features of Boosting:
- Sequential Learning: Boosting builds one model at a time. Each new model learns from the shortcomings of the previous ones, allowing for progressive improvement in capturing data complexities.
- Error Correction: New learners focus on previously mispredicted instances, continuously enhancing the ensemble’s capability to capture difficult patterns in the data.
- Model Complexity: The ensemble’s complexity grows as more models are added, enabling it to capture intricate data structures effectively.
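The three features above can be sketched in a few lines of plain Python. This is a minimal illustration for squared error, assuming one-split "stumps" as weak learners and a made-up toy dataset; the helper names (fit_stump, predict_stump, boost) are hypothetical and this is not how scikit-learn implements boosting internally.

```python
# Minimal gradient boosting for squared error, using one-split "stumps"
# as weak learners. Illustrative only; not scikit-learn's implementation.

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def boost(x, y, n_rounds=20, learning_rate=0.5):
    """Fit stumps sequentially to residuals; return per-round training MSE."""
    base = sum(y) / len(y)
    prediction = [base] * len(y)
    mse_history = []
    for _ in range(n_rounds):
        # Error correction: each new learner targets what is still unexplained
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(x, residuals)
        # Sequential learning: the ensemble grows by one scaled learner
        prediction = [pi + learning_rate * predict_stump(stump, xi)
                      for xi, pi in zip(x, prediction)]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, prediction)) / len(y)
        mse_history.append(mse)
    return mse_history


# Toy data (made up for illustration): a curved relationship
x = [i / 10 for i in range(30)]
y = [xi ** 2 for xi in x]
mse_history = boost(x, y)
print(f"Training MSE after round 1: {mse_history[0]:.4f}")
print(f"Training MSE after round 20: {mse_history[-1]:.4f}")
```

Each round fits a stump to the current residuals and adds a scaled copy of it to the running prediction, so the training error cannot increase from one round to the next.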
Boosting vs. Bagging
Bagging builds several models independently and combines their outputs, improving the ensemble's overall performance primarily by reducing the risk of overfitting noise in the training data. In contrast, boosting improves predictive accuracy by learning from errors sequentially, which allows it to adapt more closely to the data.
Boosting Regressors in scikit-learn:
Scikit-learn provides several implementations of boosting, tailored for different needs and data scenarios:
- AdaBoost Regressor: Employs a sequence of weak learners and adjusts their focus based on the errors of the previous model, improving where past models were lacking.
- Gradient Boosting Regressor: Builds models one at a time, with each new model trained to correct the residuals (errors) made by the previous ones, improving accuracy through careful adjustments.
- HistGradientBoostingRegressor: A faster form of Gradient Boosting designed for larger datasets, which speeds up training by binning feature values into histograms when searching for splits.
Each method utilizes the core principles of boosting to improve its components’ performance, showcasing the versatility and power of this approach in tackling predictive modeling challenges. In the following sections of this post, we will demonstrate a practical application of the Gradient Boosting Regressor using the Ames Housing Dataset.
Comparing Model Performance: Decision Tree Baseline to Gradient Boosting Ensembles
In transitioning from the theoretical aspects of boosting to its practical applications, this section will demonstrate the Gradient Boosting Regressor using the meticulously preprocessed Ames Housing Dataset. Our preprocessing steps, consistent across various tree-based models, ensure that the improvements observed can be attributed directly to the model’s capabilities, setting the stage for an effective comparison.
The code below sets up our comparison by first establishing a baseline with a single Decision Tree, which is not an ensemble method. This baseline lets us show clearly the incremental benefit of each ensemble method. We then configure two versions each of Bagging, Random Forest, and the Gradient Boosting Regressor, with 100 and 200 trees respectively, to explore the improvements these ensembles offer over the baseline.
# Import necessary libraries for preprocessing and modeling
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.ensemble import GradientBoostingRegressor, BaggingRegressor, RandomForestRegressor

# Load the dataset
Ames = pd.read_csv('Ames.csv')

# Adjust data types for categorical variables
for col in ['MSSubClass', 'YrSold', 'MoSold']:
    Ames[col] = Ames[col].astype('object')

# Exclude 'PID' and 'SalePrice' from features and specifically handle the 'Electrical' column
numeric_features = Ames.select_dtypes(include=['int64', 'float64']).drop(columns=['PID', 'SalePrice']).columns
categorical_features = Ames.select_dtypes(include=['object']).columns.difference(['Electrical'])
electrical_feature = ['Electrical']

# Manually specify the categories for ordinal encoding according to the data dictionary
ordinal_order = {
    'Electrical': ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],  # Electrical system
    'LotShape': ['IR3', 'IR2', 'IR1', 'Reg'],  # General shape of property
    'Utilities': ['ELO', 'NoSeWa', 'NoSewr', 'AllPub'],  # Type of utilities available
    'LandSlope': ['Sev', 'Mod', 'Gtl'],  # Slope of property
    'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Evaluates the quality of the material on the exterior
    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Evaluates the present condition of the material on the exterior
    'BsmtQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Height of the basement
    'BsmtCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # General condition of the basement
    'BsmtExposure': ['None', 'No', 'Mn', 'Av', 'Gd'],  # Walkout or garden level basement walls
    'BsmtFinType1': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # Quality of basement finished area
    'BsmtFinType2': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # Quality of second basement finished area
    'HeatingQC': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Heating quality and condition
    'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Kitchen quality
    'Functional': ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],  # Home functionality
    'FireplaceQu': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Fireplace quality
    'GarageFinish': ['None', 'Unf', 'RFn', 'Fin'],  # Interior finish of the garage
    'GarageQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Garage quality
    'GarageCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Garage condition
    'PavedDrive': ['N', 'P', 'Y'],  # Paved driveway
    'PoolQC': ['None', 'Fa', 'TA', 'Gd', 'Ex'],  # Pool quality
    'Fence': ['None', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']  # Fence quality
}

# Extract list of ALL ordinal features from dictionary
ordinal_features = list(ordinal_order.keys())

# List of ordinal features except Electrical
ordinal_except_electrical = [feature for feature in ordinal_features if feature != 'Electrical']

# Define transformations for various feature types
electrical_transformer = Pipeline(steps=[
    ('impute_electrical', SimpleImputer(strategy='most_frequent')),
    ('ordinal_electrical', OrdinalEncoder(categories=[ordinal_order['Electrical']]))
])

numeric_transformer = Pipeline(steps=[
    ('impute_mean', SimpleImputer(strategy='mean'))
])

# Updated categorical imputer using SimpleImputer
categorical_imputer = SimpleImputer(strategy='constant', fill_value='None')

ordinal_transformer = Pipeline([
    ('impute_ordinal', categorical_imputer),
    ('ordinal', OrdinalEncoder(categories=[ordinal_order[feature] for feature in ordinal_features if feature in ordinal_except_electrical]))
])

nominal_features = [feature for feature in categorical_features if feature not in ordinal_features]
categorical_transformer = Pipeline([
    ('impute_nominal', categorical_imputer),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combined preprocessor for numeric, ordinal, nominal, and specific electrical data
preprocessor = ColumnTransformer(
    transformers=[
        ('electrical', electrical_transformer, ['Electrical']),
        ('num', numeric_transformer, numeric_features),
        ('ordinal', ordinal_transformer, ordinal_except_electrical),
        ('nominal', categorical_transformer, nominal_features)
    ])

# Define model pipelines including Gradient Boosting Regressor
# Note: scikit-learn 1.2+ names the BaggingRegressor argument `estimator`;
# older releases call it `base_estimator`
models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42),
    'Bagging Regressor (100 Decision Trees)': BaggingRegressor(estimator=DecisionTreeRegressor(random_state=42), n_estimators=100, random_state=42),
    'Bagging Regressor (200 Decision Trees)': BaggingRegressor(estimator=DecisionTreeRegressor(random_state=42), n_estimators=200, random_state=42),
    'Random Forest (Default of 100 Trees)': RandomForestRegressor(random_state=42),
    'Random Forest (200 Trees)': RandomForestRegressor(n_estimators=200, random_state=42),
    'Gradient Boosting Regressor (Default of 100 Trees)': GradientBoostingRegressor(random_state=42),
    'Gradient Boosting Regressor (200 Trees)': GradientBoostingRegressor(n_estimators=200, random_state=42)
}

# Evaluate models using cross-validation and print results
results = {}
for name, model in models.items():
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'], cv=5)
    results[name] = round(scores.mean(), 4)
    print(f"{name}: Mean CV R² = {results[name]}")
Below are the cross-validation results, showcasing how each model performs in terms of mean R² values:
Decision Tree (1 Tree): Mean CV R² = 0.7663
Bagging Regressor (100 Decision Trees): Mean CV R² = 0.8957
Bagging Regressor (200 Decision Trees): Mean CV R² = 0.897
Random Forest (Default of 100 Trees): Mean CV R² = 0.8954
Random Forest (200 Trees): Mean CV R² = 0.8969
Gradient Boosting Regressor (Default of 100 Trees): Mean CV R² = 0.9027
Gradient Boosting Regressor (200 Trees): Mean CV R² = 0.9061
The results from our ensemble models underline several key insights into the behavior and performance of advanced regression techniques:
- Baseline and Enhancement: Starting with a basic Decision Tree Regressor, which serves as our baseline with an R² of 0.7663, we observe significant performance uplifts as we introduce more complex models. Both Bagging and Random Forest Regressors, using different numbers of trees, show improved scores, illustrating the power of ensemble methods in leveraging multiple learning models to reduce error.
- Gradient Boosting Regressor’s Edge: Particularly notable is the Gradient Boosting Regressor. With its default setting of 100 trees, it achieves an R² of 0.9027, and further increasing the number of trees to 200 nudges the score up to 0.9061. This indicates the effectiveness of GBR in this context and highlights its efficiency in sequential improvement from additional learners.
- Marginal Gains from More Trees: While increasing the number of trees generally results in better performance, the incremental gains diminish as we expand the ensemble size. This trend is evident across Bagging, Random Forest, and Gradient Boosting models, suggesting a point of diminishing returns where additional computational resources yield minimal performance improvements.
The results highlight the Gradient Boosting Regressor’s robust performance. It effectively leverages comprehensive preprocessing and the sequential improvement strategy characteristic of boosting. Next, we will explore how adjusting the learning rate can refine our model’s performance, enhancing its predictive accuracy.
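One way to see the diminishing returns from additional trees without refitting a model for every ensemble size is to score a single fitted model stage by stage. The sketch below is an illustration on synthetic data (an assumption, not the Ames dataset) using a simple train/test split rather than the cross-validation used in this post; staged_predict is the scikit-learn method that yields the ensemble's predictions after each added tree.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Ames dataset)
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

gbr = GradientBoostingRegressor(n_estimators=200, random_state=42)
gbr.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
stage_scores = [r2_score(y_test, y_pred) for y_pred in gbr.staged_predict(X_test)]

for n_trees in (1, 10, 50, 100, 200):
    print(f"{n_trees:>3} trees: test R² = {stage_scores[n_trees - 1]:.4f}")
```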
Optimizing Gradient Boosting with Learning Rate Adjustments
The learning_rate parameter is specific to boosting models like the Gradient Boosting Regressor; models such as Decision Trees and Random Forests have no direct equivalent. Adjusting the learning_rate allows us to delve deeper into the mechanics of boosting and enhance our model's predictive power by fine-tuning how aggressively it learns from each successive tree.
What is the Learning Rate?
In the context of Gradient Boosting Regressors and other gradient descent-based algorithms, the “learning rate” is a crucial hyperparameter that controls the speed at which the model learns. At its core, the learning rate influences the size of the steps the model takes toward the optimal solution during training. Here’s a breakdown:
- Size of Steps: The learning rate determines the magnitude of each update during training. In gradient boosting, it scales the contribution of each new tree to the ensemble's prediction. A higher learning rate makes larger updates, allowing the model to learn faster but at the risk of overshooting the optimal solution. Conversely, a lower learning rate makes smaller updates, which means the model learns more slowly but with potentially higher precision.
- Impact on Model Training:
  - Convergence: A learning rate that is too high may cause the training process to converge too quickly to a suboptimal solution, or it might not converge at all as it overshoots the minimum.
  - Accuracy and Overfitting: A learning rate that is too low makes the model learn slowly, so more trees (and more training time) are needed to reach similar accuracy, and the model may underfit if too few are used. Combining a high learning rate with many trees, on the other hand, can lead to overfitting if not monitored.
- Tuning: Choosing the right learning rate balances speed and accuracy. It is often selected through trial and error or more systematic approaches like GridSearchCV and RandomizedSearchCV, as adjusting the learning rate can significantly affect the model’s performance and training time.
By adjusting the learning rate, data scientists can control how quickly a boosting model adapts to the complexity of its errors. This makes the learning rate a powerful tool in fine-tuning model performance, especially in boosting algorithms where each new tree is built to correct the residuals (errors) left by the previous trees.
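The trade-off between learning rate and number of trees can be made concrete with a small idealized calculation. The sketch below assumes each new learner predicts the current residual exactly, so every round removes a learning_rate share of whatever error remains; real trees are imperfect, so this is only a rough picture, and rounds_to_shrink is a hypothetical helper written for this illustration.

```python
def rounds_to_shrink(learning_rate, target_fraction=0.01):
    """Count rounds until the remaining residual falls below target_fraction.

    Idealized assumption: each weak learner predicts the current residual
    exactly, so every round removes a `learning_rate` share of it.
    """
    remaining = 1.0
    rounds = 0
    while remaining >= target_fraction:
        remaining *= (1.0 - learning_rate)
        rounds += 1
    return rounds


for lr in (0.01, 0.1, 0.3):
    print(f"learning_rate={lr}: {rounds_to_shrink(lr)} rounds")
```

Under this assumption, shrinking the residual below 1% of its starting size takes 459 rounds at a learning rate of 0.01, 44 rounds at 0.1, and 13 rounds at 0.3, which is why the learning rate and the number of trees are usually tuned together.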
To optimize the learning_rate, we start with GridSearchCV, a systematic method that explores predefined values ([0.001, 0.01, 0.1, 0.2, 0.3]) to find the most effective setting for our model's accuracy.
# Experiment with GridSearchCV
from sklearn.model_selection import GridSearchCV

# Note: model_pipeline is the last pipeline built in the loop above,
# i.e. the Gradient Boosting Regressor with 200 trees

# Parameter grid for GridSearchCV
param_grid = {
    'regressor__learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3]
}

# Setup the GridSearchCV
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='r2', verbose=1)

# Fit the GridSearchCV to the data
grid_search.fit(Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Best parameters and best score from Grid Search
print("Best parameters (Grid Search):", grid_search.best_params_)
print("Best score (Grid Search):", round(grid_search.best_score_, 4))
Here are the results from our GridSearchCV, focused solely on optimizing the learning_rate parameter:
Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best parameters (Grid Search): {'regressor__learning_rate': 0.1}
Best score (Grid Search): 0.9061
Using GridSearchCV, we found that a learning_rate of 0.1 yielded the best result, matching the default setting. This suggests that, for our dataset and preprocessing setup, none of the other values in the grid improves on the default.
Following this, we use RandomizedSearchCV to expand our search. Unlike GridSearchCV, RandomizedSearchCV samples values at random from a continuous range, which lets it probe between the grid values and shows how subtle variations in learning_rate can affect performance.
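A detail worth checking when defining a continuous range with SciPy is that scipy.stats.uniform takes a start (loc) and a width (scale) rather than a lower and upper bound. The short sketch below illustrates this for a range from 0.001 to 0.3; the variable names are chosen for illustration only.

```python
from scipy.stats import uniform

# uniform(loc, scale) covers [loc, loc + scale], not [low, high]
learning_rate_dist = uniform(loc=0.001, scale=0.299)
samples = learning_rate_dist.rvs(size=1000, random_state=42)
print(f"sampled min={samples.min():.4f}, max={samples.max():.4f}")
print(f"upper bound: {learning_rate_dist.ppf(1.0):.3f}")

# Passing the intended upper bound as `scale` overshoots the range
too_wide = uniform(loc=0.001, scale=0.3)
print(f"upper bound with scale=0.3: {too_wide.ppf(1.0):.3f}")
```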
# Experiment with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Parameter distribution for RandomizedSearchCV
param_dist = {
    'regressor__learning_rate': uniform(0.001, 0.299)  # Uniform distribution between 0.001 and 0.3
}

# Setup the RandomizedSearchCV
random_search = RandomizedSearchCV(model_pipeline, param_distributions=param_dist,
                                   n_iter=50, cv=5, scoring='r2', verbose=1, random_state=42)

# Fit the RandomizedSearchCV to the data
random_search.fit(Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Best parameters and best score from Random Search
print("Best parameters (Random Search):", random_search.best_params_)
print("Best score (Random Search):", round(random_search.best_score_, 4))
In contrast with GridSearchCV, RandomizedSearchCV identified a slightly different optimal learning_rate of approximately 0.158, which enhanced our model's performance. This improvement underscores the value of a randomized search, particularly when fine-tuning models, as it can explore a more diverse set of possibilities and potentially yield better configurations.
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters (Random Search): {'regressor__learning_rate': 0.1579021730580391}
Best score (Random Search): 0.9134
The optimization through RandomizedSearchCV pinpointed a learning rate that lifts our model's R² score to 0.9134. These experiments with learning_rate adjustments through GridSearchCV and RandomizedSearchCV illustrate the delicate balance required in tuning gradient boosting models. They also highlight the benefits of exploring both systematic and randomized parameter search strategies to optimize a model fully.
Encouraged by the gains achieved through these optimization strategies, we will now extend our focus to fine-tuning both the learning_rate and n_estimators simultaneously. This next phase aims to uncover better settings by exploring the combined impact of these two parameters on our Gradient Boosting Regressor's performance.
Final Optimization: Tuning Learning Rate and Number of Trees
We begin with GridSearchCV to systematically explore combinations of learning_rate and n_estimators. This approach provides a structured way to assess the impact of varying both parameters on our model's accuracy.
# Build on previous blocks of code
# 'preprocessor' is already set up as your preprocessing pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(random_state=42))
])

# Parameter grid for GridSearchCV
param_grid = {
    'regressor__learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
    'regressor__n_estimators': [100, 200, 300, 400, 500]
}

# Setup the GridSearchCV
grid_search = GridSearchCV(model_pipeline, param_grid, cv=5, scoring='r2', verbose=1)

# Fit the GridSearchCV to the data
grid_search.fit(Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Best parameters and best score from Grid Search
print("Best parameters (Grid Search):", grid_search.best_params_)
print("Best score (Grid Search):", round(grid_search.best_score_, 4))
The GridSearchCV process evaluated 25 different combinations across 5 folds, totaling 125 fits:
Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best parameters (Grid Search): {'regressor__learning_rate': 0.1, 'regressor__n_estimators': 500}
Best score (Grid Search): 0.9089
It confirmed that a learning_rate of 0.1, the default setting, remains effective. However, it suggested that an increase to 500 trees could slightly improve our model's performance, raising the R² score to 0.9089. This is a modest gain over the R² of 0.9061 achieved earlier with 200 trees and a learning_rate of 0.1. Interestingly, our previous randomized search yielded an even better result of 0.9134 with only 200 trees and a learning_rate of approximately 0.158, which shows that values between the grid points can outperform every combination on the grid.
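Beyond the single best combination, it can be informative to see how all candidates rank. The sketch below shows one way to tabulate a search's cv_results_ with pandas; it runs a deliberately small grid on synthetic data (both are assumptions made to keep the example fast and self-contained), and the same pattern applies to the grid_search object fitted on the pipeline above, where the parameter columns carry the regressor__ prefix.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small grid (assumptions for illustration)
X, y = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=42)

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid={"learning_rate": [0.01, 0.1, 0.3], "n_estimators": [50, 100]},
    cv=3,
    scoring="r2",
)
search.fit(X, y)

# cv_results_ holds one row per parameter combination
cv_table = pd.DataFrame(search.cv_results_)
columns = ["param_learning_rate", "param_n_estimators",
           "mean_test_score", "std_test_score", "rank_test_score"]
print(cv_table[columns].sort_values("rank_test_score").to_string(index=False))
```

Comparing mean_test_score against std_test_score across rows gives a sense of whether the top few combinations differ by more than the fold-to-fold variation.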
To explore the parameter space more thoroughly and potentially uncover even better configurations, we now employ RandomizedSearchCV. This method is more exploratory: it samples parameter values from distributions rather than from a fixed grid.
# Build on previous blocks of code
from scipy.stats import uniform, randint

# Parameter distribution for RandomizedSearchCV
param_dist = {
    'regressor__learning_rate': uniform(0.001, 0.299),  # Uniform distribution between 0.001 and 0.3
    'regressor__n_estimators': randint(100, 501)  # Uniform distribution of integers from 100 to 500
}

# Setup the RandomizedSearchCV
random_search = RandomizedSearchCV(model_pipeline, param_distributions=param_dist,
                                   n_iter=50, cv=5, scoring='r2', verbose=1, random_state=42)

# Fit the RandomizedSearchCV to the data
random_search.fit(Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Best parameters and best score from Random Search
print("Best parameters (Random Search):", random_search.best_params_)
print("Best score (Random Search):", round(random_search.best_score_, 4))
The RandomizedSearchCV extended our search across a broader range of possibilities, testing 50 different configurations across 5 folds, totaling 250 fits:
Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters (Random Search): {'regressor__learning_rate': 0.12055843054286139, 'regressor__n_estimators': 287}
Best score (Random Search): 0.9158
It identified an even more effective setting with a learning_rate of approximately 0.121 and n_estimators at 287, achieving our best R² score yet at 0.9158. This underscores the potential of randomized parameter tuning to discover settings that more rigid methods might miss.
To validate the performance improvements achieved through our tuning efforts, we will now perform a final cross-validation using the Gradient Boosting Regressor configured with the best parameters identified: n_estimators set to 287 and a learning_rate of approximately 0.121.
# Build on previous blocks of code
# Cross check model performance of Gradient Boosting Regressor with tuned parameters

# 'preprocessor' is already set up as your preprocessing pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', GradientBoostingRegressor(n_estimators=287, learning_rate=0.12055843054286139, random_state=42))
])

# Using the full dataset X, y
X = Ames.drop(columns='SalePrice')
y = Ames['SalePrice']

# Perform 5-fold cross-validation
cv_scores = cross_val_score(model_pipeline, X, y, cv=5, scoring='r2')

# Output the mean cross-validated score of tuned model
print("Performance of Gradient Boosting Regressor with tuned parameters:", round(cv_scores.mean(), 4))
The final output confirms the performance of our tuned Gradient Boosting Regressor.
Performance of Gradient Boosting Regressor with tuned parameters: 0.9158
By optimizing both learning_rate and n_estimators, we have achieved an R² score of 0.9158. Because this final check uses the same 5-fold splits as the search, it reproduces the search's best score, confirming the tuned configuration and the Gradient Boosting Regressor's ability to perform consistently across the dataset.
Summary
This post explored the capabilities of the Gradient Boosting Regressor (GBR), from understanding the foundational concepts of boosting to advanced optimization techniques using the Ames Housing Dataset. It focused on key parameters of the GBR such as the number of trees and learning rate, essential for refining the model’s accuracy and efficiency. Through systematic and randomized approaches, it demonstrated how to fine-tune these parameters using GridSearchCV and RandomizedSearchCV, enhancing the model’s performance significantly.
Specifically, you learned:
- The fundamentals of boosting and how it differs from other ensemble techniques like bagging.
- How to achieve incremental improvements by experimenting with a range of models.
- Techniques for tuning learning rate and number of trees for the Gradient Boosting Regressor.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.