Exploring LightGBM: Leaf-Wise Growth with GBDT and GOSS
LightGBM is a highly efficient gradient boosting framework. It has gained traction for its speed and performance, particularly with large and complex datasets. Developed by Microsoft, it is known for handling large volumes of data far more efficiently than traditional gradient boosting methods.
In this post, we will experiment with the LightGBM framework on the Ames Housing dataset. In particular, we will shed some light on its versatile boosting strategies: Gradient Boosting Decision Tree (GBDT) and Gradient-based One-Side Sampling (GOSS). These strategies offer distinct advantages, and through this post we will compare their performance and characteristics.
We begin by setting up LightGBM and proceed to examine its application in both theoretical and practical contexts.
Let’s get started.
Overview
This post is divided into four parts; they are:
- Introduction to LightGBM and Initial Setup
- Testing LightGBM’s GBDT and GOSS on the Ames Dataset
- Fine-Tuning LightGBM’s Tree Growth: A Focus on Leaf-wise Strategy
- Comparing Feature Importance in LightGBM’s GBDT and GOSS Models
Introduction to LightGBM and Initial Setup
LightGBM (Light Gradient Boosting Machine) was developed by Microsoft. It is a machine learning framework that provides the necessary components and utilities to build, train, and deploy machine learning models. The models are based on decision tree algorithms and use gradient boosting at their core. The framework is open source and can be installed on your system using the following command:
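pip install lightgbm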
This command will download and install the LightGBM package along with its necessary dependencies.
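If you want to confirm that the installation succeeded, a quick check (assuming a standard Python environment) is to import the package and print its version:

import lightgbm
print(lightgbm.__version__)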
While LightGBM, XGBoost, and Gradient Boosting Regressor (GBR) are all based on the principle of gradient boosting, several key distinctions set LightGBM apart due to both its default behaviors and a range of optional parameters that enhance its functionality:
- Exclusive Feature Bundling (EFB): As a default feature, LightGBM employs EFB to reduce the number of features, which is particularly useful for high-dimensional sparse data. This process is automatic, helping to manage data dimensionality efficiently without extensive manual intervention.
- Gradient-Based One-Side Sampling (GOSS): GOSS is an optional strategy that can be enabled through a parameter. The gradient of an instance measures how much the loss function would change if the model’s prediction for that instance changed slightly, so a large gradient means the current prediction is far from the actual target value. Such instances are considered “under-trained”: they mark the areas where the model needs the most improvement, and GOSS always retains them in the training subset. Instances with small gradients, on the other hand, are considered “well-trained” because the model’s predictions for them are already close to the actual values, and GOSS only draws a random sample from this group.
- Leaf-wise Tree Growth: Whereas both GBR and XGBoost typically grow trees level-wise, LightGBM’s default tree-growth strategy is leaf-wise. Instead of splitting every node at a given depth before moving to the next level, LightGBM splits the leaf that yields the largest decrease in the loss function. This approach can lead to asymmetric, irregular trees of greater depth, which can be more expressive and efficient than balanced trees grown level-wise. A brief parameter sketch after this list shows how these behaviors surface as LGBMRegressor parameters.
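As a rough sketch (not an exhaustive configuration), the behaviors above are exposed through LightGBM’s scikit-learn wrapper. The parameter names below (top_rate, other_rate, enable_bundle, num_leaves, max_depth) come from LightGBM’s documented core parameters, but defaults and aliases vary between releases, so verify them against the version you have installed:

# Minimal sketch: how GOSS, EFB, and leaf-wise growth are exposed as parameters.
# Verify parameter names and defaults against your installed LightGBM version.
import lightgbm as lgb

model = lgb.LGBMRegressor(
    boosting_type="goss",  # enable Gradient-based One-Side Sampling
    top_rate=0.2,          # fraction of large-gradient instances always kept
    other_rate=0.1,        # fraction of small-gradient instances randomly sampled
    enable_bundle=True,    # Exclusive Feature Bundling (on by default)
    num_leaves=31,         # leaf-wise growth: maximum number of leaves per tree
    max_depth=-1,          # -1 leaves tree depth unconstrained
)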
These are a few characteristics of LightGBM that differentiate it from the traditional GBR and XGBoost. With these unique advantages in mind, we are prepared to delve into the empirical side of our exploration.
Testing LightGBM’s GBDT and GOSS on the Ames Dataset
Building on our understanding of LightGBM’s distinct features, this segment shifts from theory to practice. We will utilize the Ames Housing dataset to rigorously test two specific boosting strategies within the LightGBM framework: the standard Gradient Boosting Decision Tree (GBDT) and the innovative Gradient-based One-Side Sampling (GOSS). We aim to explore these techniques and provide a comparative analysis of their effectiveness.
Before we dive into the model building, it’s crucial to prepare the dataset properly. This involves loading the data and ensuring all categorical features are correctly processed, taking full advantage of LightGBM’s handling of categorical variables. Like XGBoost, LightGBM can natively handle missing values and categorical data, simplifying the preprocessing steps and leading to more robust models. This capability is crucial as it directly influences the accuracy and efficiency of the model training process.
# Import libraries to run LightGBM
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Load the Ames Housing Dataset
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']

# Convert categorical columns to 'category' dtype
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))

# Define the default GBDT model
gbdt_model = lgb.LGBMRegressor()
gbdt_scores = cross_val_score(gbdt_model, X, y, cv=5)
print(f"Average R² score for default Light GBM (with GBDT): {gbdt_scores.mean():.4f}")

# Define the GOSS model
goss_model = lgb.LGBMRegressor(boosting_type='goss')
goss_scores = cross_val_score(goss_model, X, y, cv=5)
print(f"Average R² score for Light GBM with GOSS: {goss_scores.mean():.4f}")
Results:
Average R² score for default Light GBM (with GBDT): 0.9145
Average R² score for Light GBM with GOSS: 0.9109
The initial results from our 5-fold cross-validation experiments provide intriguing insights into the performance of the two models. The default GBDT model achieved an average R² score of 0.9145, demonstrating robust predictive accuracy. On the other hand, the GOSS model, which specifically targets instances with large gradients, recorded a slightly lower average R² score of 0.9109.
The slight difference in performance might be attributed to the way GOSS prioritizes certain data points over others, which can be particularly beneficial in datasets where mispredictions are more concentrated. However, in a relatively homogeneous dataset like Ames, the advantages of GOSS may not be fully realized.
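To make the sampling intuition concrete, here is a small conceptual sketch written with NumPy. It is not LightGBM’s internal implementation; it only illustrates the idea of keeping every instance whose gradient magnitude falls in the top fraction (top_rate), randomly sampling a fraction (other_rate) of the rest, and upweighting those sampled instances by (1 - top_rate) / other_rate so the overall gradient statistics remain approximately unbiased:

# Conceptual GOSS-style sampling (illustrative only, not LightGBM's code)
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Rank instances by absolute gradient, largest first
    order = np.argsort(np.abs(gradients))[::-1]
    top_idx = order[:n_top]            # always keep large-gradient instances
    rest_idx = order[n_top:]
    sampled_idx = rng.choice(rest_idx, size=n_other, replace=False)

    # Upweight sampled small-gradient instances to compensate for subsampling
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, sampled_idx]), weights

# Example with 1,000 synthetic gradients: 200 kept outright, 100 sampled and upweighted
idx, w = goss_sample(np.random.default_rng(42).normal(size=1000))
print(len(idx), w.min(), w.max())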
Fine-Tuning LightGBM’s Tree Growth: A Focus on Leaf-wise Strategy
One of the distinguishing features of LightGBM is its ability to construct decision trees leaf-wise rather than level-wise. This leaf-wise approach allows trees to grow by optimizing loss reductions, potentially leading to better model performance but posing a risk of overfitting if not properly tuned. In this section, we explore the impact of varying the number of leaves in a tree.
We start by defining a series of experiments to systematically test how different settings for num_leaves affect the performance of two LightGBM variants: the traditional Gradient Boosting Decision Tree (GBDT) and Gradient-based One-Side Sampling (GOSS). These experiments are crucial for identifying the optimal complexity level of the models for our specific dataset, the Ames Housing dataset.
# Experiment with Leaf-wise Tree Growth
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# Load the Ames Housing Dataset
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']

# Convert categorical columns to 'category' dtype
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))

# Define a range of leaf sizes to test
leaf_sizes = [5, 10, 15, 31, 50, 100]

# Results storage
results = {}

# Experiment with different leaf sizes for GBDT
results['GBDT'] = {}
print("Testing different 'num_leaves' for GBDT:")
for leaf_size in leaf_sizes:
    model = lgb.LGBMRegressor(boosting_type='gbdt', num_leaves=leaf_size)
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results['GBDT'][leaf_size] = scores.mean()
    print(f"num_leaves = {leaf_size}: Average R² score = {scores.mean():.4f}")

# Experiment with different leaf sizes for GOSS
results['GOSS'] = {}
print("\nTesting different 'num_leaves' for GOSS:")
for leaf_size in leaf_sizes:
    model = lgb.LGBMRegressor(boosting_type='goss', num_leaves=leaf_size)
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    results['GOSS'][leaf_size] = scores.mean()
    print(f"num_leaves = {leaf_size}: Average R² score = {scores.mean():.4f}")
Results:
Testing different 'num_leaves' for GBDT:
num_leaves = 5: Average R² score = 0.9150
num_leaves = 10: Average R² score = 0.9193
num_leaves = 15: Average R² score = 0.9158
num_leaves = 31: Average R² score = 0.9145
num_leaves = 50: Average R² score = 0.9111
num_leaves = 100: Average R² score = 0.9101

Testing different 'num_leaves' for GOSS:
num_leaves = 5: Average R² score = 0.9151
num_leaves = 10: Average R² score = 0.9168
num_leaves = 15: Average R² score = 0.9130
num_leaves = 31: Average R² score = 0.9109
num_leaves = 50: Average R² score = 0.9117
num_leaves = 100: Average R² score = 0.9124
The results from our cross-validation experiments provide insightful data on how the num_leaves parameter influences the performance of GBDT and GOSS models. Both models perform optimally at a num_leaves setting of 10, achieving the highest R² scores. This indicates that a moderate level of complexity suffices to capture the underlying patterns in the Ames Housing dataset without overfitting. This finding is particularly interesting, given that the default setting for num_leaves in LightGBM is 31.
For GBDT, increasing the number of leaves beyond 10 leads to a decrease in performance, suggesting that too much complexity can detract from the model’s generalization capabilities. In contrast, GOSS shows a slightly more tolerant behavior towards higher leaf counts, although the improvements plateau, indicating no further gains from increased complexity.
This experiment underscores the importance of tuning num_leaves in LightGBM. By carefully selecting this parameter, we can effectively balance model accuracy and complexity, ensuring robust performance across different data scenarios. Further experimentation with other parameters in conjunction with num_leaves could potentially unlock even better performance and stability.
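One way to pursue that, sketched below under the assumption that 5-fold R² remains the evaluation metric, is a small grid search pairing num_leaves with learning_rate and min_child_samples. The grid values are illustrative rather than tuned recommendations:

# Illustrative sketch: tuning num_leaves together with other regularizing parameters
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))

param_grid = {
    'num_leaves': [5, 10, 15],
    'learning_rate': [0.05, 0.1],
    'min_child_samples': [10, 20, 40],
}
search = GridSearchCV(lgb.LGBMRegressor(boosting_type='gbdt'),
                      param_grid, cv=5, scoring='r2')
search.fit(X, y)
print(search.best_params_, f"Best R²: {search.best_score_:.4f}")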
Comparing Feature Importance in LightGBM’s GBDT and GOSS Models
After fine-tuning the num_leaves parameter and assessing the basic performance of the GBDT and GOSS models, we now shift our focus to understanding the influence of individual features within these models. In this section, we explore the most important features identified by each boosting strategy through visualization.
Here is the code that achieves this:
# Importing libraries to compare feature importance between GBDT and GOSS:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
import seaborn as sns

# Prepare data
data = pd.read_csv('Ames.csv')
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
categorical_cols = X.select_dtypes(include=['object']).columns
X[categorical_cols] = X[categorical_cols].apply(lambda x: x.astype('category'))

# Set up K-fold cross-validation
kf = KFold(n_splits=5)
gbdt_feature_importances = []
goss_feature_importances = []

# Iterate over each split
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Train GBDT model with optimal num_leaves
    gbdt_model = lgb.LGBMRegressor(boosting_type='gbdt', num_leaves=10)
    gbdt_model.fit(X_train, y_train)
    gbdt_feature_importances.append(gbdt_model.feature_importances_)

    # Train GOSS model with optimal num_leaves
    goss_model = lgb.LGBMRegressor(boosting_type='goss', num_leaves=10)
    goss_model.fit(X_train, y_train)
    goss_feature_importances.append(goss_model.feature_importances_)

# Average feature importance across all folds for each model
avg_gbdt_feature_importance = np.mean(gbdt_feature_importances, axis=0)
avg_goss_feature_importance = np.mean(goss_feature_importances, axis=0)

# Convert to DataFrame
feat_importances_gbdt = pd.DataFrame({'Feature': X.columns, 'Importance': avg_gbdt_feature_importance})
feat_importances_goss = pd.DataFrame({'Feature': X.columns, 'Importance': avg_goss_feature_importance})

# Sort and take the top 10 features
top_gbdt_features = feat_importances_gbdt.sort_values(by='Importance', ascending=False).head(10)
top_goss_features = feat_importances_goss.sort_values(by='Importance', ascending=False).head(10)

# Plotting
plt.figure(figsize=(16, 12))
plt.subplot(1, 2, 1)
sns.barplot(data=top_gbdt_features, y='Feature', x='Importance', orient='h', palette='viridis')
plt.title('Top 10 LightGBM GBDT Features', fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature', fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=14)

plt.subplot(1, 2, 2)
sns.barplot(data=top_goss_features, y='Feature', x='Importance', orient='h', palette='viridis')
plt.title('Top 10 LightGBM GOSS Features', fontsize=18)
plt.xlabel('Importance', fontsize=16)
plt.ylabel('Feature', fontsize=16)
plt.xticks(fontsize=13)
plt.yticks(fontsize=14)

plt.tight_layout()
plt.show()
Using the same Ames Housing dataset, we applied a k-fold cross-validation method to maintain consistency with our previous experiments. However, this time, we concentrated on extracting and analyzing the feature importance from the models. Feature importance, which indicates how useful each feature is in constructing the boosted decision trees, is crucial for interpreting the behavior of machine learning models. It helps in understanding which features contribute most to the predictive power of the model, providing insights into the underlying data and the model’s decision-making process.
Here’s how we performed the feature importance extraction:
- Model Training: Each model (GBDT and GOSS) was trained across different folds of the data with the optimal num_leaves parameter set to 10.
- Importance Extraction: After training, each model’s feature importance was extracted. This importance reflects the number of times a feature is used to make key split decisions in the trees (a gain-based alternative is sketched just after this list).
- Averaging Across Folds: The importance was averaged over all folds to ensure that our results were stable and representative of the model’s performance across different subsets of the data.
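Note that the feature_importances_ attribute used above counts splits by default. If you would rather rank features by the total gain their splits contribute, one option, shown in this short sketch, is to query the fitted model’s underlying booster; the relative ordering of features can differ between the two definitions:

# Sketch: gain-based vs. split-count importance from a fitted model
# (gbdt_model here is the model fitted on the last cross-validation fold)
gain_importance = gbdt_model.booster_.feature_importance(importance_type='gain')
split_importance = gbdt_model.booster_.feature_importance(importance_type='split')
importance_df = pd.DataFrame({'Feature': X.columns,
                              'Gain': gain_importance,
                              'Splits': split_importance})
print(importance_df.sort_values('Gain', ascending=False).head(10))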
The following visualizations succinctly present these differences in feature importance between the GBDT and GOSS models:
The analysis revealed interesting patterns in feature prioritization by each model. Both the GBDT and GOSS models exhibited a strong preference for “GrLivArea” and “LotArea,” highlighting the fundamental role of property size in determining house prices. Additionally, both models ranked “Neighborhood” highly, underscoring the importance of location in the housing market.
However, the models began to diverge in their prioritization from the fourth feature onwards. The GBDT model showed a preference for “BsmtFinSF1,” indicating the value of finished basements. On the other hand, the GOSS model, which prioritizes instances with larger gradients to correct mispredictions, emphasized “OverallQual” more strongly.
As we conclude this analysis, it’s evident that the differences in feature importance between the GBDT and GOSS models provide valuable insights into how each model perceives the relevance of various features in predicting housing prices.
Further Reading
Tutorials
Ames Housing Dataset & Data Dictionary
Summary
This blog post introduced you to LightGBM’s capabilities, highlighting its distinctive features and practical application on the Ames Housing dataset. From the initial setup and comparison of GBDT and GOSS boosting strategies to an in-depth analysis of feature importance, we’ve uncovered valuable insights that not only demonstrate LightGBM’s efficiency but also its adaptability to complex datasets.
Specifically, you learned:
- Exploration of model variants: Comparing the default GBDT with the GOSS model provided insights into how different boosting strategies can be leveraged depending on the data characteristics.
- How to experiment with the leaf-wise strategy: Adjusting the num_leaves parameter influences model performance, with an optimal setting providing a balance between complexity and accuracy.
- How to visualize feature importance: Understanding and visualizing which features are most influential in your models can significantly impact how you interpret the results and make decisions. This process not only clarifies the model’s internal workings but also aids in improving model transparency and trustworthiness by identifying which variables most strongly influence the outcome.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.