Actuarial Data Science - Open Learning Resource
Gradient boosting machines (GBMs) are another key ensemble method that often delivers state-of-the-art predictive performance but can feel like a “black box” at first. In this lecture, we break the boosting idea down into small steps so that you can see how GBMs are built up from simple components and how to control their complexity in practice.
Understand the motivation for boosting and the boosting technique
Perform predictive modelling using GBMs
Compare GBMs with other modelling techniques
Leo Breiman (1928–2005) (Source: Wikipedia)
Different trees fitted on bootstrap samples.
CART models are simple and interpretable, but a single tree is often either a noisy (deep, bushy) or a weak (shallow) predictor.
In general, in terms of predictive performance: Boosting > Random Forests > Bagging > Single Tree
Schematic of AdaBoost (Source: The Elements of Statistical Learning (Hastie et al. 2009), Figure 10.1).
Initialise observation weights: w_i = 1/N, \quad i = 1, 2, \dots, N.
For m = 1, 2, \dots, M:
(a) Fit a classifier G_m(x) to the training data using the weights w_i.
(b) Compute the weighted error \text{err}_m = \sum_{i=1}^N w_i \, I\big(y_i \neq G_m(x_i)\big) \Big/ \sum_{i=1}^N w_i.
(c) Compute \alpha_m = \log\big((1 - \text{err}_m)/\text{err}_m\big).
(d) Update the weights: w_i \leftarrow w_i \cdot \exp\big(\alpha_m \, I(y_i \neq G_m(x_i))\big), \quad i = 1, 2, \dots, N.
Output the final classifier: G(x) = \text{sign}\left(\sum_{m=1}^M \alpha_m G_m(x)\right).
Adapted from The Elements of Statistical Learning (Hastie et al. 2009), Algorithm 10.1.
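To make the algorithm concrete, here is a minimal R sketch of AdaBoost.M1 with decision stumps as weak classifiers, using rpart for the stumps; the function names, the ±1 coding of the response and the choice of M = 50 rounds are illustrative assumptions, not part of the original lecture.

```r
# Minimal AdaBoost.M1 sketch following the algorithm above (illustrative, not optimised).
# x: data.frame of predictors; y: response coded -1/+1.
library(rpart)

adaboost_stumps <- function(x, y, M = 50) {
  N   <- length(y)
  w   <- rep(1 / N, N)                       # 1. initialise weights w_i = 1/N
  dat <- data.frame(x, .y = factor(y))
  stumps <- vector("list", M)
  alpha  <- numeric(M)

  for (m in seq_len(M)) {                    # 2. boosting rounds m = 1, ..., M
    # (a) fit a depth-one tree (stump) using the current observation weights
    stumps[[m]] <- rpart(.y ~ ., data = dat, weights = w,
                         control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- as.numeric(as.character(predict(stumps[[m]], dat, type = "class")))

    err      <- sum(w * (pred != y)) / sum(w)     # (b) weighted misclassification error
    alpha[m] <- log((1 - err) / err)              # (c) classifier weight
    w        <- w * exp(alpha[m] * (pred != y))   # (d) up-weight misclassified points
  }
  list(stumps = stumps, alpha = alpha)
}

# Final classifier: G(x) = sign( sum_m alpha_m * G_m(x) )
predict_adaboost <- function(fit, newdata) {
  votes <- sapply(seq_along(fit$stumps), function(m) {
    fit$alpha[m] *
      as.numeric(as.character(predict(fit$stumps[[m]], newdata, type = "class")))
  })
  sign(rowSums(votes))
}
```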
Test error versus boosting iterations for decision stumps (Source: The Elements of Statistical Learning (Hastie et al. 2009), Figure 10.2).
Training and test error as a function of the number of boosting iterations.
Training and test error as a function of the number of boosting iterations (noisy setting).
Boosting builds an additive model f(x) = \sum_{m=1}^M \beta_m \, b(x; \gamma_m), where b(x; \gamma_m) is a tree, and \gamma_m parametrises the splits.
We have seen this type of model in statistics before, for example:
Traditionally, the parameters (\beta_m, \gamma_m) of all components are fitted jointly (e.g. by least squares or maximum likelihood).
With boosting, the parameters (\beta_m, \gamma_m) are instead fitted in a stagewise fashion, one term at a time. This slows the fitting process down, so the model tends to overfit less quickly.
Algorithm:
Initialise f_0(x) = 0
For m = 1, \dots, M:
(a) Fit a base learner by solving (\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^N L\big(y_i, \, f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)
(b) Update the model: f_m(x) = f_{m-1}(x) + \beta_m \, b(x; \gamma_m)
Adapted from The Elements of Statistical Learning (Hastie et al. 2009), Algorithm 10.2.
Sometimes step (b) is replaced by the shrunken update f_m(x) = f_{m-1}(x) + \epsilon \, \beta_m \, b(x; \gamma_m), where 0 < \epsilon \le 1 is a shrinkage parameter (learning rate); a sketch of the stagewise procedure with shrinkage follows below.
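To illustrate the stagewise idea (and the shrunken update) in code, here is a minimal R sketch for the special case of squared-error loss, where step (a) reduces to fitting a small tree to the current residuals; the function name, tree depth and the value of \epsilon are illustrative assumptions.

```r
# Forward stagewise boosting sketch for squared-error loss (illustrative).
# With L(y, f) = (y - f)^2, step (a) amounts to fitting the base learner
# to the current residuals y_i - f_{m-1}(x_i); step (b) uses the shrunken update.
library(rpart)

boost_ls <- function(x, y, M = 100, epsilon = 0.1, depth = 2) {
  f     <- rep(0, length(y))                 # initialise f_0(x) = 0
  trees <- vector("list", M)
  dat   <- data.frame(x)

  for (m in seq_len(M)) {
    dat$.resid <- y - f                      # residuals as working response
    trees[[m]] <- rpart(.resid ~ ., data = dat,
                        control = rpart.control(maxdepth = depth, cp = 0))
    # shrunken update: f_m(x) = f_{m-1}(x) + epsilon * b(x; gamma_m)
    f <- f + epsilon * predict(trees[[m]], dat)
  }
  list(trees = trees, epsilon = epsilon)
}

predict_boost <- function(fit, newdata) {
  preds <- sapply(fit$trees, function(tr) predict(tr, newdata))
  fit$epsilon * rowSums(preds)               # sum of shrunken tree contributions
}
```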
In the gbm package in R, the default is \epsilon = 0.001. In practice, GBMs can be fitted directly with the gbm package, or via the caret package with method = "gbm".
Test error versus number of trees for bagging, random forests, and gradient boosting (Source: The Elements of Statistical Learning (Hastie et al. 2009), Figure 15.1).
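As a usage illustration of these two routes, the sketch below fits a GBM on R's built-in mtcars data, first with gbm directly and then through caret; the formula, number of trees and shrinkage value are placeholders, not tuning recommendations.

```r
library(gbm)
library(caret)

# Direct use of gbm: note the shrinkage (learning rate epsilon) and the number of trees M.
fit_gbm <- gbm(mpg ~ ., data = mtcars,
               distribution = "gaussian",
               n.trees = 2000,
               interaction.depth = 2,
               shrinkage = 0.01,            # small learning rate, cf. epsilon above
               cv.folds = 5)
best_iter <- gbm.perf(fit_gbm, method = "cv")   # CV-optimal number of boosting iterations

# The same model via caret, which tunes the GBM hyperparameters by resampling.
fit_caret <- train(mpg ~ ., data = mtcars,
                   method = "gbm",
                   trControl = trainControl(method = "cv", number = 5),
                   verbose = FALSE)
```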
Effect of the learning rate \gamma in gradient descent. Left: \gamma=0.3, where x_5 is already close to the solution; middle: \gamma=0.9, where the larger steps still leave x_5 far from the optimum; right: \gamma=1.1, where the iterates diverge.
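The behaviour shown in the figure can be reproduced with a few lines of R; the sketch below runs gradient descent on the simple quadratic f(x) = x^2 (an illustrative choice) for the three learning rates.

```r
# Gradient descent x_{k+1} = x_k - gamma * f'(x_k) on f(x) = x^2, so f'(x) = 2x.
grad_descent <- function(gamma, x0 = 1, steps = 5) {
  x <- numeric(steps + 1)
  x[1] <- x0
  for (k in seq_len(steps)) x[k + 1] <- x[k] - gamma * 2 * x[k]
  x
}

grad_descent(0.3)  # shrinks quickly towards 0: x_5 is already close to the optimum
grad_descent(0.9)  # overshoots and oscillates: x_5 is still far from the optimum
grad_descent(1.1)  # |x_k| grows each step: the iterates diverge
```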
Now let us discuss gradient boosting machines.
Let us consider a special case to better understand gradient boosting.
Ensemble learning consists of two steps:
(1) Develop a population of base learners from the training data.
(2) Combine these base learners to form a composite predictor.
Simple examples of ensembles
Variable importance (spam data, GBM) (Source: The Elements of Statistical Learning (Hastie et al. 2009), Figure 10.6).
Partial dependence plots (spam data, GBM) (Source: The Elements of Statistical Learning (Hastie et al. 2009), Figure 10.7).
Fit the model f(X) = \sum_{m=1}^M \alpha_m T_m(X) in step (2) using Lasso regularization: \alpha(\lambda) = \arg\min_{\alpha} \sum_{i=1}^N \mathcal{L}\big(y_i, \alpha_0 + \sum_{m=1}^M \alpha_m T_m(x_i)\big) + \lambda \sum_{m=1}^M |\alpha_m|.
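A hedged sketch of this post-processing idea in R: generate a population of trees (here the individual trees of a random forest, just one convenient way to obtain base learners T_m), form the matrix of their predictions, and let the Lasso select and weight them. The data set, object names and tuning choices are illustrative assumptions.

```r
library(randomForest)
library(glmnet)

# Illustrative data: R's built-in mtcars, with response mpg.
x <- mtcars[, setdiff(names(mtcars), "mpg")]
y <- mtcars$mpg

# Step (1): a population of base learners T_1, ..., T_M (the trees of a forest).
rf <- randomForest(x, y, ntree = 200)

# Column m holds the predictions T_m(x_i) of the m-th tree on the training data.
tree_preds <- predict(rf, x, predict.all = TRUE)$individual

# Step (2): fit alpha(lambda) by the Lasso; cross-validation selects lambda.
post <- cv.glmnet(tree_preds, y, alpha = 1)
sum(coef(post, s = "lambda.min") != 0)   # number of trees retained by the Lasso
```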
randomForest: implementation of Leo Breiman’s random forest algorithm.
gbm: implementation of Friedman’s gradient boosting algorithm.
glmnet: Lasso and elastic net (for post-processing ensembles).
caret: a package that provides a unified interface for building predictive models. It includes tools for data splitting, pre-processing, feature selection and model tuning via resampling.
pdp: partial dependence plots
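To reproduce plots in the spirit of Figures 10.6 and 10.7 for your own model, the packages above can be combined as follows; fit_gbm and best_iter are carried over from the earlier gbm sketch, and the predictor wt is an illustrative choice.

```r
library(gbm)
library(pdp)

# Relative variable importance of the boosted model (cf. Figure 10.6).
summary(fit_gbm, n.trees = best_iter)

# Partial dependence of the prediction on a single predictor (cf. Figure 10.7).
pd <- partial(fit_gbm, pred.var = "wt", n.trees = best_iter, train = mtcars)
plotPartial(pd)
```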