Actuarial Data Science - Open Learning Resource
Up to this point, you have seen many modelling techniques. This lecture helps you answer a natural follow-up question: which model should I trust, and how do I know it will work on new data? We focus on practical tools for comparing models and understanding the trade-off between bias, variance, and complexity.
No free lunch.
There is no single best model that works optimally for all kinds of problems.

Occam's razor, the law of parsimony for problem-solving: “entities should not be multiplied without necessity.”
Assume the tuning parameter is \alpha, and write the predictions as \hat{f}_\alpha(x).
The tuning parameter controls the complexity of the model (for example, the penalty weight in ridge regression or the number of neighbours k in k-NN).
Objectives: model selection (estimating the performance of different candidate models in order to choose the best one) and model assessment (having chosen a final model, estimating its prediction error on new data).
How can we do this?

For ridge regression, let \boldsymbol{\beta}_\ast denote the parameters of the best-fitting linear approximation to f: \boldsymbol{\beta}_\ast = \arg\min_{\boldsymbol{\beta}} \, \mathbb{E}\big[ (f(\mathbf{X}) - \mathbf{X}^\top \boldsymbol{\beta})^2 \big]
Assume the input variable \mathbf{X} is random and the tuning parameter is \alpha.
For ridge regression, the average squared bias can be decomposed as: \begin{aligned} \mathbb{E}_{\mathbf{x}_0}\big[ f(\mathbf{x}_0) - \mathbb{E}\hat{f}_\alpha(\mathbf{x}_0) \big]^2 &= \mathbb{E}_{\mathbf{x}_0}\big[ f(\mathbf{x}_0) - \mathbf{x}_0^\top \boldsymbol{\beta}_\ast \big]^2 \\ &\quad + \mathbb{E}_{\mathbf{x}_0}\big[ \mathbf{x}_0^\top \boldsymbol{\beta}_\ast - \mathbb{E}(\mathbf{x}_0^\top \hat{\boldsymbol{\beta}}_\alpha) \big]^2 \\ &= \text{Ave}[\text{Model bias}]^2 + \text{Ave}[\text{Estimation bias}]^2 \end{aligned}
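To make this trade-off concrete, here is a minimal simulation sketch (an illustration, not material from the course): the data-generating process, the penalty grid, and the sample sizes are assumptions chosen for the example. Because the true f is linear here, the model bias is zero, so the reported squared bias is pure estimation bias; it grows with \alpha while the variance shrinks.

```python
# Illustrative only: estimate squared bias and variance of ridge predictions
# by repeatedly redrawing training data and refitting at several penalties.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p, n_rep = 50, 10, 200
beta_true = rng.normal(size=p)           # true f is linear, so model bias = 0
X_test = rng.normal(size=(1000, p))      # fixed test points x_0
f_test = X_test @ beta_true              # f(x_0)

for alpha in [0.01, 1.0, 100.0]:         # tuning parameter alpha
    preds = np.empty((n_rep, len(X_test)))
    for r in range(n_rep):
        X = rng.normal(size=(n, p))
        y = X @ beta_true + rng.normal(scale=1.0, size=n)
        preds[r] = Ridge(alpha=alpha).fit(X, y).predict(X_test)
    bias2 = np.mean((f_test - preds.mean(axis=0)) ** 2)   # Ave[estimation bias]^2
    var = np.mean(preds.var(axis=0))                      # Ave[variance]
    print(f"alpha={alpha:>6}: squared bias={bias2:.3f}, variance={var:.3f}")
```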
Training and test error as a function of model complexity (Source: The Elements of Statistical Learning (Hastie et al. 2009), Figure 7.1).
Schematic illustration of the bias–variance trade-off (Source: The Elements of Statistical Learning (Hastie et al. 2009), Figure 7.2).
Expected prediction error (orange), squared bias (green), and variance (blue) for k-NN (left) and linear models (right), for both regression (top) and classification (bottom) (Source: The Elements of Statistical Learning (Hastie et al. 2009), Figure 7.3).
\mathbb{E}_{\mathbf{y}}\big(\mathrm{Err}_{\text{in}}\big) = \mathbb{E}_{\mathbf{y}}\big(\overline{\mathrm{err}}\big) + \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)
Remarks:
The general form of the in-sample estimate is \widehat{\mathrm{Err}}_{\text{in}} = \overline{\mathrm{err}} + \hat{w} where \hat{w} is an estimate of the average optimism.
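For reference, a familiar instance of this estimate (not spelled out above) arises under squared-error loss with d fitted parameters, where \hat{w} = 2\,\frac{d}{N}\,\hat{\sigma}_\varepsilon^2 and the estimate becomes the C_p statistic:
C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat{\sigma}_\varepsilon^2
Here \hat{\sigma}_\varepsilon^2 is an estimate of the noise variance, typically obtained from a low-bias model.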
When a log-likelihood loss function is used:
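The criterion that presumably belongs here is the AIC; for reference, its standard form under log-likelihood loss is
\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}
where loglik is the maximised log-likelihood and d is the number of fitted parameters.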
Another popular approach for model selection:
Review: An Introduction to Statistical Learning (James et al. 2013), Section 6.1.3
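That section also covers the BIC, a likely candidate for the approach referred to above; its standard form is
\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\, d
so the BIC penalises model size more heavily than the AIC whenever \log N > 2, i.e. N > e^2 \approx 7.4.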
Performance Measures (see the code sketch after this list):
AIC/BIC for logistic regression
\text{Accuracy} = \dfrac{TP + TN}{TP + FP + TN + FN}
\text{Recall} = \dfrac{TP}{TP + FN}
\text{Precision} = \dfrac{TP}{TP + FP}
\text{F1-score} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
ROC curve / Area under the ROC curve (AUC)
Review: An Introduction to Statistical Learning (James et al. 2013), Section 4.4
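A minimal sketch of how these measures can be computed, assuming a scikit-learn workflow; the simulated data and the logistic-regression classifier are illustrative choices, not part of the lecture.

```python
# Illustrative only: simulated data, logistic regression, and the measures above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)                 # hard class labels (0/1)
p_hat = clf.predict_proba(X_te)[:, 1]     # predicted probabilities, used for the AUC

print("Accuracy :", accuracy_score(y_te, y_hat))
print("Recall   :", recall_score(y_te, y_hat))
print("Precision:", precision_score(y_te, y_hat))
print("F1-score :", f1_score(y_te, y_hat))
print("AUC      :", roc_auc_score(y_te, p_hat))
```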


Review: An Introduction to Statistical Learning (James et al. 2013), Chapter 5
Illustration of nested cross-validation (Source: Raschka (2018)).
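As a concrete illustration of the scheme in the figure, here is a minimal scikit-learn sketch (the ridge model, the penalty grid, and the fold counts are assumptions for the example): the inner loop selects the tuning parameter \alpha, and the outer loop assesses the prediction error of the whole selection procedure.

```python
# Illustrative nested cross-validation: the inner CV tunes the ridge penalty,
# the outer CV estimates the error of the entire model-selection procedure.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=20, noise=5.0, random_state=0)

inner = KFold(n_splits=5, shuffle=True, random_state=1)   # model selection
outer = KFold(n_splits=5, shuffle=True, random_state=2)   # model assessment

search = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)},
                      cv=inner, scoring="neg_mean_squared_error")
scores = cross_val_score(search, X, y, cv=outer, scoring="neg_mean_squared_error")
print("Estimated test MSE:", -scores.mean())
```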
Resampling with replacement
As with cross-validation, the bootstrap seeks to estimate the conditional error \mathrm{Err}_{\mathcal{T}}, but typically performs well only for estimating the expected prediction error \mathrm{Err}.
Quantifying uncertainty:
Implementation:
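A minimal sketch of the idea, assuming we only want a bootstrap standard error and a percentile interval for a simple statistic; the toy data and the choice of statistic are illustrative.

```python
# Illustrative bootstrap: resample the data with replacement and recompute the
# statistic many times to quantify its sampling uncertainty.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # toy sample (e.g. claim sizes)

B = 2000
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])

se = boot_means.std(ddof=1)                         # bootstrap standard error
lo, hi = np.quantile(boot_means, [0.025, 0.975])    # 95% percentile interval
print(f"mean={x.mean():.3f}, bootstrap SE={se:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```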
Bootstrap sampling illustration (Source: The Elements of Statistical Learning (Hastie et al. 2009), Figure 7.12).
