Boosting: Is It Always The Best Option?

Gradient boosting has become quite a popular technique in the area of machine learning. Given its reputation for achieving potentially higher accuracy than other modelling techniques, it has become particularly popular as a “go-to” model for Kaggle competitions.

However, use of gradient boosting raises two questions:

  1. Does this technique really outperform others consistently irrespective of the data being examined?
  2. Even if this is the case, are gradient boosting techniques always a wise choice?

To answer these questions, I decided to compare gradient boosting techniques with logistic regression on the task of classifying whether or not a patient has diabetes. The dataset (Pima Indians Diabetes) is available at the UCI Machine Learning Repository.

Essentially, the dataset provides us with several features that are used to predict the outcome variable (diabetes = 1, no diabetes = 0).

Firstly, feature extraction with an ExtraTreesClassifier was performed to identify the most important features for predicting the outcome variable.

Then, the following models were run:

  1. Logistic Regression
  2. Gradient Boosting Classifier
  3. LightGBM Classifier
  4. XGBoost Classifier
  5. AdaBoost Classifier

Feature Extraction

Feature extraction is used here to determine the most important features influencing the outcome variable, i.e. which features carry the most predictive weight for diabetes incidence.
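
The code below refers to x and y as the feature matrix and outcome variable respectively. A minimal sketch of how they might be loaded is shown here (the file name diabetes.csv and the Outcome column name are assumptions, matching the common CSV export of the Pima Indians Diabetes data):

import pandas as pd

# Hypothetical local copy of the Pima Indians Diabetes data
df = pd.read_csv("diabetes.csv")
x = df.drop("Outcome", axis=1).values   # eight predictor columns
y = df["Outcome"].values                # 1 = diabetes, 0 = no diabetes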

>>> from sklearn.ensemble import ExtraTreesClassifier
>>> model = ExtraTreesClassifier()
>>> model.fit(x, y)
ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
>>> print(model.feature_importances_)
[0.10696279 0.25816011 0.09378777 0.09258844 0.06920807 0.11396286
 0.12328806 0.1420419 ]
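
To make these scores easier to read, they can be paired with the dataset's column names; a minimal sketch is below (the order assumes the standard Pima Indians Diabetes column order):

import pandas as pd

feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
# Pair each importance score with its feature name and sort descending
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))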

From the feature extraction, features 0 (Pregnancies), 1 (Glucose), 5 (BMI), 6 (DiabetesPedigreeFunction), and 7 (Age) showed the highest feature importance scores, and these are the features included in the models to predict the outcome variable.

Feature                       Score
Pregnancies                   0.10696279
Glucose                       0.25816011
Blood Pressure                0.09378777
Skin Thickness                0.09258844
Insulin                       0.06920807
BMI                           0.11396286
Diabetes Pedigree Function    0.12328806
Age                           0.1420419

Therefore, these variables were combined into xnew with a NumPy column stack, and the data was partitioned into training and validation sets with train_test_split.

import numpy as np

# Select columns 0, 1, 5, 6 and 7 (the five most important features)
x0=x[:,0]
x1=x[:,1]
x5=x[:,5]
x6=x[:,6]
x7=x[:,7]
xnew=np.column_stack((x0,x1,x5,x6,x7))
xnew

from sklearn.model_selection import train_test_split
x_train,x_val,y_train,y_val=train_test_split(xnew,y,random_state=0)
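
As a side note, scikit-learn's SelectFromModel could perform the same selection step programmatically from the fitted ExtraTreesClassifier; a minimal sketch is below (the 0.10 threshold is an illustrative choice that happens to keep the same five features for the importances shown above):

from sklearn.feature_selection import SelectFromModel

# Keep features whose importance is at or above the illustrative 0.10 threshold
selector = SelectFromModel(model, threshold=0.10, prefit=True)
xnew_alt = selector.transform(x)   # same five columns as xnew above
print(selector.get_support())      # boolean mask over the eight original features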

Logistic Regression vs. Boosting Classifiers

Having selected the relevant features and partitioned the data, a logistic regression was run in conjunction with several boosting classifiers.

# Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression().fit(x_train,y_train)
logreg

# GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(x_train, y_train)

#  LightGBM Classifier
import lightgbm as lgb
lgb_model = lgb.LGBMClassifier(learning_rate = 0.001, 
                              num_leaves = 65,  
                              n_estimators = 100)                       
lgb_model.fit(x_train, y_train)
 
# XGBoost
import xgboost as xgb
xgb_model = xgb.XGBClassifier(learning_rate=0.001,
                            max_depth = 1, 
                            n_estimators = 100)
xgb_model.fit(x_train, y_train)

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=100,
    algorithm="SAMME.R", learning_rate=0.001)
ada_clf.fit(x_train, y_train)

As can be observed, n_estimators was set to 100 and the learning rate to 0.001 for each boosting model. Machine Learning Mastery offers more detail on how to implement gradient boosting techniques; in this case, the learning rate (or shrinkage parameter) is kept below 0.1 for better generalization error, while n_estimators (the number of trees) is set to 100, within the range of 100 to 500 recommended in the “Greedy Function Approximation: A Gradient Boosting Machine” paper.
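
These values are sensible starting points rather than values tuned to this particular dataset. A small grid search over the shrinkage and tree parameters would be a more principled way to choose them; a minimal sketch using scikit-learn's GridSearchCV is shown below (the parameter grid is illustrative, not a set of tuned values):

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative parameter grid, evaluated with 5-fold cross-validation on the training data
param_grid = {"learning_rate": [0.001, 0.01, 0.1],
              "n_estimators": [100, 300, 500],
              "max_depth": [1, 2, 3]}
grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)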

When these models were run, the following training and validation set scores were obtained:

>>> print("Accuracy on training set: {:.3f}".format(logreg.score(x_train,y_train)))
Accuracy on training set: 0.766
>>> print("Accuracy on validation set: {:.3f}".format(logreg.score(x_val,y_val)))
Accuracy on validation set: 0.797

>>> print("Accuracy on training set: {:.3f}".format(gbrt.score(x_train, y_train)))
Accuracy on training set: 0.896
>>> print("Accuracy on validation set: {:.3f}".format(gbrt.score(x_val, y_val)))
Accuracy on validation set: 0.792

>>> print("Accuracy on training set: {:.3f}".format(lgb_model.score(x_train, y_train)))
Accuracy on training set: 0.642
>>> print("Accuracy on validation set: {:.3f}".format(lgb_model.score(x_val, y_val)))
Accuracy on validation set: 0.677

>>> print("Accuracy on training set: {:.3f}".format(xgb_model.score(x_train, y_train)))
Accuracy on training set: 0.748
>>> print("Accuracy on validation set: {:.3f}".format(xgb_model.score(x_val, y_val)))
Accuracy on validation set: 0.750

>>> print("Accuracy on training set: {:.3f}".format(ada_clf.score(x_train, y_train)))
Accuracy on training set: 0.748
>>> print("Accuracy on validation set: {:.3f}".format(ada_clf.score(x_val, y_val)))
Accuracy on validation set: 0.750

Model                           Training Accuracy    Validation Accuracy
Logistic Regression             0.766                0.797
Gradient Boosting Classifier    0.896                0.792
LightGBM Classifier             0.642                0.677
XGBoost Classifier              0.748                0.750
AdaBoost Classifier             0.748                0.750

From looking at the above results, two things are evident:

  1. Only the GradientBoostingClassifier yields a validation accuracy similar to that of the logistic regression; all the other boosting models show slightly lower validation accuracy.
  2. Moreover, the logistic regression's training accuracy is slightly lower than its validation accuracy, suggesting that overfitting is less of an issue for the logistic regression than for the gradient boosting models (a cross-validated comparison is sketched below).
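
A single train/validation split can also be noisy. As a sanity check on both observations, a 5-fold cross-validation over the selected features gives a more robust comparison; a minimal sketch is below (cross_val_score refits clones of each model, so the exact numbers will differ from the single-split figures above):

from sklearn.model_selection import cross_val_score

# Mean cross-validated accuracy for each model configuration
for name, clf in [("Logistic Regression", logreg),
                  ("Gradient Boosting", gbrt),
                  ("LightGBM", lgb_model),
                  ("XGBoost", xgb_model),
                  ("AdaBoost", ada_clf)]:
    scores = cross_val_score(clf, xnew, y, cv=5)
    print("{}: {:.3f} (+/- {:.3f})".format(name, scores.mean(), scores.std()))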

Conclusion

Boosting models have become something of a “black box” and are increasingly relied upon for extra accuracy. However, they don’t necessarily give the best accuracy in all cases (as we have seen here), and the issue of overfitting must also be considered: the GradientBoostingClassifier's training accuracy was noticeably higher than its validation accuracy, which indicates overfitting.

Boosting works on the premise of combining several weak models (e.g. many shallow decision trees) in order to increase accuracy, which is why these are often referred to as ensemble models.
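
One way to see this additive behaviour with the GradientBoostingClassifier fitted above is to track validation accuracy as trees are added one at a time; staged_predict yields predictions after each boosting iteration (a minimal sketch, reusing gbrt, x_val and y_val from earlier):

from sklearn.metrics import accuracy_score

# Validation accuracy after 1, 2, ..., n_estimators trees
stagewise_acc = [accuracy_score(y_val, y_pred)
                 for y_pred in gbrt.staged_predict(x_val)]
print("After 1 tree: {:.3f}".format(stagewise_acc[0]))
print("After all {} trees: {:.3f}".format(len(stagewise_acc), stagewise_acc[-1]))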

While boosting can be advantageous depending on the data one is working with, these models do come with an overfitting risk and should not simply be relied upon by default without considering the data in question and whether other models could prove more suitable.
