Actuarial Data Science - Open Learning Resource
In this lecture, we step back and examine the overall modelling landscape: what it means to build a model, how different statistical learning methods relate to one another, and why we sometimes deliberately penalise model complexity (shrinkage) to improve performance. The goal is to provide a conceptual map so that later techniques, such as GLMs, random forests, and boosting, fit into a coherent framework rather than appearing as isolated methods.
Data generation (Source: Bishop and Nasrabadi (2006), Figure 1.2)
The synthetic data behind this example are generated in R with runif(), sin(), and rnorm().
Geometric interpretation of the sum-of-squares error function (Source: Bishop and Nasrabadi (2006), Figure 1.3)
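Returning to the data-generation step (Figure 1.2): a minimal sketch in R of how such a data set could be produced, assuming the sin(2πx) target; the sample size and noise level are illustrative assumptions, not values taken from the source.

```r
# Illustrative data-generating process for the curve-fitting example:
# inputs x ~ Uniform(0, 1), targets y = sin(2*pi*x) plus Gaussian noise.
# N = 10 and sd = 0.3 are assumptions for illustration.
set.seed(1)
N <- 10
x <- runif(N, min = 0, max = 1)
y <- sin(2 * pi * x) + rnorm(N, mean = 0, sd = 0.3)
```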
\begin{aligned} f(x, \bm{w}) &= \sum_{m=0}^{M} w_m x^m \Big|_{M=0} \\ &= w_0 \end{aligned}
Fitted polynomial (M = 0) (Source: Bishop and Nasrabadi (2006), Figure 1.4)
\begin{aligned} f(x, \bm{w}) &= \sum_{m=0}^{M} w_m x^m \Big|_{M=1} \\ &= w_0 + w_1 x \end{aligned}
Fitted polynomial (M = 1) (Source: Bishop and Nasrabadi (2006), Figure 1.4)
\begin{aligned} f(x, \bm{w}) &= \sum_{m=0}^{M} w_m x^m \Big|_{M=3} \\ &= w_0 + w_1 x + w_2 x^2 + w_3 x^3 \end{aligned}
Fitted polynomial (M = 3) (Source: Bishop and Nasrabadi (2006), Figure 1.4)
\begin{aligned} f(x, \bm{w}) &= \sum_{m=0}^{M} w_m x^m \Big|_{M=9} \\ &= w_0 + w_1 x + \cdots + w_9 x^9 \end{aligned}
Fitted polynomial (M = 9) (Source: Bishop and Nasrabadi (2006), Figure 1.4)
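For readers following along in R, these polynomial fits can be reproduced with lm(); a minimal sketch assuming the x and y vectors from the data-generation step above (the grid size is an illustrative choice).

```r
# Least-squares fit of an order-M polynomial to the (x, y) pairs above.
# poly(..., raw = TRUE) builds the columns x, x^2, ..., x^M;
# lm() adds the intercept w_0 and estimates all coefficients.
M <- 3
fit <- lm(y ~ poly(x, degree = M, raw = TRUE))
coef(fit)                                   # estimated w_0, ..., w_M

# Evaluate the fitted curve on a fine grid, as in the figures above
x_grid <- seq(0, 1, length.out = 100)
y_hat  <- predict(fit, newdata = data.frame(x = x_grid))
```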
Training and test error vs model complexity (Source: Bishop and Nasrabadi (2006), Figure 1.5)
Fitted model parameters for different polynomial orders (Source: Bishop and Nasrabadi (2006), Table 1.1)
Effect of training set size on model fitting (N = 15, M = 9) (Source: Bishop and Nasrabadi (2006), Figure 1.6)
Effect of training set size on model fitting (N = 100, M = 9) (Source: Bishop and Nasrabadi (2006), Figure 1.6)
Review: An Introduction to Statistical Learning (James et al. 2013), Chapter 6.2
Data splitting (Source: Cochrane (2018))
Cross-validation (Source: Wikimedia Commons contributors (2016))
Review: An Introduction to Statistical Learning (James et al. 2013), Chapter 5.1
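A minimal base-R sketch of the two validation strategies pictured above; the 80/20 split and k = 5 are illustrative choices, not prescriptions from the sources.

```r
# Hold-out split: fit on train_idx, evaluate on test_idx
set.seed(2)
n <- length(y)
train_idx <- sample(n, size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# k-fold cross-validation: assign every observation to one of k folds;
# in round j, fold j is the validation set and the rest is the training set
k <- 5
folds <- sample(rep(seq_len(k), length.out = n))
```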
Regularisation term: \mathrm{Err}_W(\bm{w}) = \sum_{j=1}^{M} |w_j|^q = \lVert \bm{w} \rVert_q^q
The regularised error is \tilde{\mathrm{Err}}(\bm{w}) = \sum_{n=1}^{N} [f(x_n, \bm{w}) - y_n]^2 + \lambda \lVert \bm{w} \rVert_q^q
q=1: L1 regularisation (lasso)
q=2: L2 regularisation (ridge)
Geometric interpretation of L1 and L2 regularisation (Source: Bishop and Nasrabadi (2006), Figure 3.3)
Geometric comparison of ridge and lasso regularisation (Source: Bishop and Nasrabadi (2006), Figure 3.4)
Add the ridge regulariser to the error function to discourage the coefficients from reaching large values and to control model complexity: \tilde{\mathrm{Err}}(\bm{w}) = \sum_{n=1}^{N} [f(x_n, \bm{w}) - y_n]^2 + \lambda \lVert \bm{w} \rVert^2
Squared norm of the parameter vector \bm{w}: \lVert \bm{w} \rVert^2 = w_0^2 + w_1^2 + \cdots + w_M^2
\lambda governs the relative importance of the regularisation term compared with the sum-of-squares error term
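Because both the data term and the ridge penalty are quadratic in \bm{w}, the regularised objective still has a closed-form minimiser. Writing \bm{\Phi} for the N \times (M+1) design matrix whose n-th row is (1, x_n, \ldots, x_n^M) and \bm{y} for the vector of targets, setting the gradient to zero gives the standard result \hat{\bm{w}} = (\bm{\Phi}^T \bm{\Phi} + \lambda \bm{I})^{-1} \bm{\Phi}^T \bm{y}; with \lambda = 0 this reduces to the ordinary least-squares solution reviewed below.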
Fitted model (ln λ = -18, weak regularisation) (Source: Bishop and Nasrabadi (2006), Figure 1.7)
Fitted model (ln λ = 0, strong regularisation) (Source: Bishop and Nasrabadi (2006), Figure 1.7)
Training and test RMS error versus \ln \lambda (Source: Bishop and Nasrabadi (2006), Figure 1.8)
Let us begin by reviewing the simplest possible model, linear regression with a sum-of-squares error: f(\bm{x}, \bm{\beta}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = \bm{x}^T \bm{\beta}, \quad \mathrm{Err}(\bm{\beta}) = \sum_{i=1}^{n} \big[ f(\bm{x}_i, \bm{\beta}) - y_i \big]^2, where each input vector \bm{x}_i carries a leading 1 for the intercept.
Stack the input vectors row by row: \bm{X} = \begin{bmatrix} \bm{x}_1^T \\ \bm{x}_2^T \\ \vdots \\ \bm{x}_n^T \end{bmatrix} \in \mathbb{R}^{n \times (p+1)}, \quad \bm{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^{n}
Rewrite the error: \mathrm{Err}(\bm{\beta}) = (\bm{y} - \bm{X}\bm{\beta})^T (\bm{y} - \bm{X}\bm{\beta})
Closed-form solution (normal equations): \hat{\bm{\beta}} = (\bm{X}^T \bm{X})^{-1} \bm{X}^T \bm{y}
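A minimal R sketch of the normal equations above; the design matrix is built from a single predictor purely for illustration.

```r
# Build the design matrix with a leading column of ones for the intercept
X <- cbind(1, x)                          # n x (p + 1), here p = 1
# Closed-form least-squares solution: solve (X^T X) beta = X^T y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# lm() gives the same estimates and is numerically preferable in practice
coef(lm(y ~ x))
```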
To extend linear regression, a simple approach is to expand the features using basis functions, for example the polynomial basis \phi(x) = (1, x, x^2, \ldots, x^M)^T used in the curve-fitting example above.
When the expanded feature dimension is large relative to the number of observations, the model can overfit, which again motivates a regularisation term.
The regularised error is: \tilde{\mathrm{Err}}(\bm{\beta}) = \sum_{n=1}^{N} \big[ f(\phi(x_n), \bm{\beta}) - y_n \big]^2 + \lambda \lVert \bm{\beta} \rVert_q^q
q=1: lasso (L1 regularisation)
q=2: ridge (L2 regularisation)
Gaussian prior for ridge regression (Source: An Introduction to Statistical Learning (James et al. 2013), Figure 6.11)
Laplace prior for lasso regression (Source: An Introduction to Statistical Learning (James et al. 2013), Figure 6.11)
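These two figures give the Bayesian reading of shrinkage: with a Gaussian likelihood, the ridge estimate is the posterior mode (and mean) under independent Gaussian priors on the coefficients, while the lasso estimate is the posterior mode under independent Laplace (double-exponential) priors, whose sharp peak at zero favours coefficients that are exactly zero.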
Source: adapted from Zou and Hastie (2004)
\hat{\bm{\beta}} = \arg\min_{\bm{\beta}} \; \lVert \bm{y} - \bm{X}\bm{\beta} \rVert^2 + \lambda_2 \lVert \bm{\beta} \rVert^2 + \lambda_1 \lVert \bm{\beta} \rVert_1
Source: adapted from Zou and Hastie (2005)
Geometry of ridge, lasso, and elastic net penalties (Source: Zou and Hastie (2004))
The elastic net penalty is J(\bm{\beta}) = \alpha \lVert \bm{\beta} \rVert^2 + (1 - \alpha)\lVert \bm{\beta} \rVert_1, where \alpha = \frac{\lambda_2}{\lambda_2 + \lambda_1}
- Singularities at the vertices (necessary for sparsity)
- Strictly convex edges; the strength of convexity varies with \alpha (grouping effect)
R implementations:
- glmnet, cv.glmnet (package: glmnet)
- lm.ridge (package: MASS)
- lars, cv.lars (package: lars)
- penalized (package: penalized)

“Sometimes less is better.”
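A usage sketch with the first of these packages. Note that glmnet parameterises the penalty with its own alpha argument (alpha = 1 gives the lasso, alpha = 0 ridge, intermediate values an elastic net mix), while lambda is typically chosen by cross-validation with cv.glmnet; the simulated data below are purely illustrative.

```r
library(glmnet)
set.seed(3)

# Illustrative data: 100 observations, 20 features, only 3 with real signal
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 2 * X[, 2] + 0.5 * X[, 3] + rnorm(n)

# alpha = 1 -> lasso, alpha = 0 -> ridge, 0 < alpha < 1 -> elastic net
cv_fit <- cv.glmnet(X, y, alpha = 0.5)

plot(cv_fit)                      # cross-validated error versus log(lambda)
coef(cv_fit, s = "lambda.min")    # coefficients at the best lambda; many shrink to exactly zero
```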
Overview of feature selection methods (Source: Analytics Vidhya (2016), adapted from online sources)
Filter methods (Source: Analytics Vidhya (2016))
| Feature\Response | Continuous | Categorical |
|---|---|---|
| Continuous | Pearson’s Correlation | LDA |
| Categorical | ANOVA | Chi-square test |
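A base-R sketch of the three tests in the table that need no extra packages (the simulated data frame is purely illustrative; LDA would require, for example, MASS::lda).

```r
set.seed(4)
dat <- data.frame(
  x_num = rnorm(200),                                            # continuous feature
  x_cat = factor(sample(c("A", "B", "C"), 200, replace = TRUE))  # categorical feature
)
dat$y_num <- 0.8 * dat$x_num + rnorm(200)                        # continuous response
dat$y_cat <- factor(ifelse(dat$y_num > 0, "high", "low"))        # categorical response

cor(dat$x_num, dat$y_num, method = "pearson")   # continuous feature vs continuous response
summary(aov(y_num ~ x_cat, data = dat))         # categorical feature vs continuous response (ANOVA)
chisq.test(table(dat$x_cat, dat$y_cat))         # categorical feature vs categorical response (chi-square)
```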
Wrapper methods (Source: Analytics Vidhya (2016))
Review: An Introduction to Statistical Learning (James et al. 2013), Chapters 6.1, 6.5
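A minimal wrapper-style sketch using backward stepwise selection with base R's step(); the simulated data frame and the AIC criterion are illustrative choices.

```r
set.seed(5)
df <- data.frame(matrix(rnorm(200 * 5), 200, 5))
names(df) <- paste0("x", 1:5)
df$y <- 1.5 * df$x1 - df$x3 + rnorm(200)        # only x1 and x3 carry signal

full_fit <- lm(y ~ ., data = df)
step_fit <- step(full_fit, direction = "backward", trace = FALSE)
formula(step_fit)                               # predictors retained by the search
```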
Embedded methods (Source: Analytics Vidhya (2016))
Review: An Introduction to Statistical Learning (James et al. 2013), Sections 6.3, 10.2
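Embedded selection falls out of the penalised fits already shown. Reusing the cv_fit object from the glmnet sketch above, the selected features are simply those with non-zero coefficients; lambda.1se is one common, more conservative choice of penalty.

```r
# Coefficients at the largest lambda within one standard error of the CV minimum
b <- as.matrix(coef(cv_fit, s = "lambda.1se"))
rownames(b)[b[, 1] != 0]    # retained terms (the intercept is always listed)
```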
