Actuarial Data Science - Open Learning Resource
GLMs are one of the core workhorses of actuarial modelling, especially in pricing and reserving. In this lecture, we revisit the GLM framework and connect it to the broader statistical learning ideas you have seen, so that you understand when and why a GLM is an appropriate choice, not just how to fit one in software.
Perform data analytics modelling using GLMs for response variables of the following types: binary outcomes, counts, and positive right-skewed continuous amounts.
Understand the specification of GLMs and their model assumptions.
Select and validate a GLM appropriately.
Compare GLMs with other modelling techniques.
A linear model assumes Y \mid X \sim \mathcal{N}(\mu(X), \sigma^2 I), with \mathbb{E}(Y \mid X) = \mu(X) = X^T \beta, where X = (1, x_1, x_2, \ldots, x_p)^T and \beta = (\beta_0, \beta_1, \ldots, \beta_p)^T.
The main components (that we are going to relax) are the normality of the random component and the identity relationship between the mean and the linear predictor.
In insurance settings, most response variables are not normally distributed. The models used are often multiplicative, and hence linear on the log scale.
A generalized linear model (GLM) generalises normal linear regression models in the following directions:
Random component: Y \sim \text{some exponential family distribution}
Link: between the random component and covariates: g(\mu(X)) = X^T \beta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p where g is the link function and \mu = \mathbb{E}(Y \mid X). The link function can be any monotonic function (allowing inversion): \mu(X) = g^{-1}(X^T \beta) = g^{-1}(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)
Model fitting and inferences: maximum likelihood estimation
Canonical exponential family for k = 1, y \in \mathbb{R}: f_\theta(y) = \exp\left(\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right) for some known functions b(\cdot) and c(\cdot, \cdot).
If \phi is known, this is a one-parameter exponential family with \theta being the canonical parameter.
If \phi is unknown, this may or may not be a two-parameter exponential family. \phi is called the dispersion parameter.
In this class, we always assume that \phi is known.
\begin{aligned} f_\theta(y) &= \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) \\ &= \exp\left\{ \frac{y\mu - \frac{1}{2}\mu^2}{\sigma^2} - \frac{1}{2} \left( \frac{y^2}{\sigma^2} + \log(2\pi\sigma^2) \right) \right\} \end{aligned}
The exponential family includes a rich class of distributions, such as normal, Poisson, binomial, Gamma, and inverse Gaussian distributions.
| | Normal | Poisson | Bernoulli |
|---|---|---|---|
| Notation | \mathcal{N}(\mu, \sigma^2) | \mathcal{P}(\mu) | \mathcal{B}(p) |
| Range of y | (-\infty, \infty) | \{0, 1, 2, \ldots\} | \{0, 1\} |
| \phi | \sigma^2 | 1 | 1 |
| b(\theta) | \dfrac{\theta^2}{2} | e^\theta | \log(1 + e^\theta) |
| c(y,\phi) | -\dfrac{1}{2}\left(\dfrac{y^2}{\phi} + \log(2\pi\phi)\right) | -\log(y!) | 0 |
We should choose the response distribution that is consistent with the characteristics of the response variable.
Examples
Claim occurrence (yes/no): Bernoulli.
Number of claims: Poisson or negative binomial.
Claim severity (positive, right-skewed): Gamma or inverse Gaussian.
The mean \mathbb{E}(Y) and the variance \mathrm{Var}(Y) can be derived as follows:
\mathbb{E}(Y) = \mu = b'(\theta).
\mathrm{Var}(Y) = V(\mu)\,\phi = b''(\theta)\,\phi. V(\mu) is called the variance function.
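These identities follow from differentiating the normalisation identity \int f_\theta(y)\,dy = 1 under the integral sign; a brief sketch using the canonical form above: \begin{aligned} 0 &= \frac{\partial}{\partial \theta} \int f_\theta(y)\,dy = \int \frac{y - b'(\theta)}{\phi}\, f_\theta(y)\,dy = \frac{\mathbb{E}(Y) - b'(\theta)}{\phi}, \end{aligned} which gives \mathbb{E}(Y) = b'(\theta). Differentiating a second time, \begin{aligned} 0 &= \int \left[ \left( \frac{y - b'(\theta)}{\phi} \right)^{2} - \frac{b''(\theta)}{\phi} \right] f_\theta(y)\,dy = \frac{\mathrm{Var}(Y)}{\phi^{2}} - \frac{b''(\theta)}{\phi}, \end{aligned} which gives \mathrm{Var}(Y) = b''(\theta)\,\phi.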
Examples
Normal N(\mu, \sigma^2): \theta = \mu, and hence \theta(\mu) = \mu.
Gamma(\alpha, \beta): \theta = -\beta/\alpha = -\frac{1}{\mu}, and hence \theta(\mu) = -\frac{1}{\mu}.
Inverse Gaussian(\alpha, \beta): \theta = -\frac{1}{2}\frac{\beta^2}{\alpha^2} = -\frac{1}{2\mu^2}, and hence \theta(\mu) = -\frac{1}{2\mu^2}.
Poisson(\mu): \theta = \log \mu, and hence \theta(\mu) = \log \mu.
Binomial(m, p): \theta = \log\!\left(\frac{p}{1-p}\right) = \log\!\left(\frac{\mu}{m-\mu}\right), and hence \theta(\mu) = \log\!\left(\frac{\mu}{m-\mu}\right).
Negative Binomial(r, p): \theta = \log(1-p) = \log\!\left(\frac{\mu}{\mu + r}\right), and hence \theta(\mu) = \log\!\left(\frac{\mu}{\mu + r}\right), using \mu = r(1-p)/p.
Consider a Poisson likelihood, f(y) = \frac{\mu^y}{y!} e^{-\mu} = \exp\!\left(y \log \mu - \mu - \log(y!)\right). Thus \theta = \log \mu, \quad b(\theta) = e^\theta, \quad c(y, \phi) = -\log(y!), \quad \phi = 1, \quad \mu = e^\theta.
Then b'(\theta) = e^\theta = \mu = \mathbb{E}(Y) and b''(\theta) = e^\theta = \mu, so \mathrm{Var}(Y) = b''(\theta)\,\phi = \mu, as expected for the Poisson distribution.
Examples
Normal N(\mu, \sigma^2) with V(\mu) = 1
Gamma(\alpha, \beta) with V(\mu) = \mu^2
Inverse Gaussian(\alpha, \beta) with V(\mu) = \mu^3
Poisson(\mu) with V(\mu) = \mu
Binomial(m, p) with V(\mu) = \mu\left(1 - \mu/m\right)
Negative Binomial(r, p) with V(\mu) = \mu\left(1 + \mu/r\right)
Figure: mean–variance relationships across the distributions above.
Examples of Link Functions
Figure: comparison of the logit and probit link functions; inverse link functions shown with logit in blue and probit in red.
| Distribution | b(\theta) | g(\mu) |
|---|---|---|
| Normal | \dfrac{\theta^2}{2} | \mu |
| Poisson | e^\theta | \log \mu |
| Bernoulli | \log(1 + e^\theta) | \log\!\left(\dfrac{\mu}{1-\mu}\right) |
| Gamma | -\log(-\theta) | -\dfrac{1}{\mu} |
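In R, the link and inverse link of each family are stored on the family object; a minimal base-R sketch (illustrative values only):

```r
# Link g and inverse link g^{-1} stored in R family objects
fam <- binomial()        # canonical logit link
fam$linkfun(0.5)         # g(mu)       = log(mu/(1 - mu)) -> 0
fam$linkinv(0)           # g^{-1}(eta) = exp(eta)/(1 + exp(eta)) -> 0.5

poisson()$linkfun(2)     # canonical log link: log(2)
Gamma()$linkinv(0.5)     # R's Gamma canonical link is 1/mu (sign dropped): mu = 2
```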
Given a link function g, note the following relationship between \beta and \theta: \begin{aligned} \theta_i &= (b')^{-1}(\mu_i) \\ &= (b')^{-1}\big(g^{-1}(X_i^T \beta)\big) \equiv h(X_i^T \beta) \end{aligned}
where h is defined as h = (b')^{-1} \circ g^{-1} = (g \circ b')^{-1}.
Remark: if g is the canonical link function, then h is the identity.
The link function is closely related to the interpretation of coefficients. For example, with a log link, a one-unit increase in X_j multiplies the fitted mean by e^{\beta_j}; with a logit link, e^{\beta_j} is an odds ratio.
For qualitative (categorical) predictors, the interpretation depends on the choice of the reference level.
Suppose a categorical variable has a reference category. Then each of the remaining levels receives its own coefficient, interpreted (on the link scale) as the effect of that level relative to the reference level.
See the GLM lab materials for more detailed examples and interpretations.
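A minimal sketch of how R handles reference levels for categorical predictors (by default the first factor level is the baseline; `relevel()` changes it; data below are simulated):

```r
# Hypothetical data: claim counts by vehicle body type
set.seed(1)
d <- data.frame(
  body = factor(sample(c("sedan", "suv", "ute"), 200, replace = TRUE)),
  n    = rpois(200, lambda = 2)
)

fit <- glm(n ~ body, family = poisson, data = d)
coef(fit)      # bodysuv, bodyute: log rate ratios vs the reference level "sedan"

d$body <- relevel(d$body, ref = "ute")   # change the reference level
coef(glm(n ~ body, family = poisson, data = d))
```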
The log-likelihood is given by \begin{aligned} \ell_n(\beta; \mathbf{Y}, \mathbf{X}) &= \sum_i \frac{Y_i \theta_i - b(\theta_i)}{\phi} \\ &= \sum_i \frac{Y_i\, h(X_i^T \beta) - b\!\left(h(X_i^T \beta)\right)}{\phi} \end{aligned} up to a constant term.
Note that when we use the canonical link function, we obtain the simpler expression \ell_n(\beta; \mathbf{Y}, \mathbf{X}) = \sum_i \frac{Y_i X_i^T \beta - b(X_i^T \beta)}{\phi}.
Deviance is a likelihood-based goodness-of-fit measure on the training set. It plays much the same role for GLMs as RSS does for ordinary linear models.
Null model: only the constant term is included.
Full model (saturated model): each fitted value is equal to the observation, so the saturated model fits perfectly. The full model is the case when \mu_i = y_i for all i = 1,2,\ldots,n, so that the log-likelihood in the full model gives \ell_{\text{full}} = \sum_{i=1}^{n} \left[ \frac{y_i \tilde{\theta}_i - b(\tilde{\theta}_i)}{\phi} + c(y_i, \phi) \right] where \tilde{\theta}_i are the canonical parameter values corresponding to \mu_i = y_i for all i = 1,2,\ldots,n.
We define the deviance of the chosen GLM (with log-likelihood \ell) as D \equiv 2(\ell_{\text{full}} - \ell).
Note: for a linear model with Gaussian errors, the deviance is the RSS on the training set scaled by \phi = \sigma^2 (so exactly the RSS when \sigma^2 = 1). The null deviance is 2(\ell_{\text{full}} - \ell_{\text{null}}).
Consider two models: Model 1, with q parameters and deviance D_1, nested within Model 2, with p > q parameters and deviance D_2.
Model 2 is a significant improvement over Model 1 (the more parsimonious model) if D_1 - D_2 exceeds the critical value from a \chi^2(p - q) distribution.
Since \Pr\!\left[ \chi^2(\nu) > 2\nu \right] \approx 5\%, the following rule of thumb can be used as an approximation: \text{Model 2 is preferred if } D_1 - D_2 > 2(p - q).
Let Y_{ij} denote the number of claims for the jth male driver in group i (i=1,2,3, j=1,2,\cdots). Three models are fitted, with deviances as shown below:
| Model | Specification (log link) | Deviance |
|---|---|---|
| Model 1 | \log(\mu_{ij}) = \alpha_i,\ i = 1,2,3 | 60.40 |
| Model 2 | \log(\mu_{ij}) = \begin{cases} \alpha, & \text{if } i = 1,2 \\ \beta, & \text{if } i = 3 \end{cases} | 61.64 |
| Model 3 | \log(\mu_{ij}) = \alpha | 72.53 |
Determine whether or not
(a) model 2 is a significant improvement over model 3;
(b) model 1 is a significant improvement over model 2;
(c) model 1 is a significant improvement over model 3.
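As a numerical check, the comparisons can be carried out in R; a sketch using the deviances and parameter counts above (Model 3 is nested in Model 2, which is nested in Model 1, with 1, 2 and 3 parameters respectively):

```r
D <- c(m1 = 60.40, m2 = 61.64, m3 = 72.53)   # deviances from the table
p <- c(m1 = 3,     m2 = 2,     m3 = 1)       # number of parameters

crit <- function(df) qchisq(0.95, df)        # 5% chi-squared critical value

(D["m3"] - D["m2"]) > crit(p["m2"] - p["m3"])  # (a): 10.89 vs 3.84
(D["m2"] - D["m1"]) > crit(p["m1"] - p["m2"])  # (b):  1.24 vs 3.84
(D["m3"] - D["m1"]) > crit(p["m1"] - p["m3"])  # (c): 12.13 vs 5.99
```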
We drop the subscript i = 1, 2, \ldots, n in the table below for brevity.
Deviances are:
| Distribution | Deviance D(y, \hat{\mu}) |
|---|---|
| Normal | \sum (y - \hat{\mu})^2 |
| Poisson | 2 \sum \left[ y \log\!\left(\frac{y}{\hat{\mu}}\right) - (y - \hat{\mu}) \right] |
| Binomial | 2 \sum \left[ y \log\!\left(\frac{y}{\hat{\mu}}\right) + (m - y)\log\!\left(\frac{m - y}{m - \hat{\mu}}\right) \right] |
| Gamma | 2 \sum \left[ -\log\!\left(\frac{y}{\hat{\mu}}\right) + \frac{y - \hat{\mu}}{\hat{\mu}} \right] |
| Inverse Gaussian | \sum \frac{(y - \hat{\mu})^2}{\hat{\mu}^2 y} |
Residuals are a primary tool for assessing how well a model fits the data.
They can also help detect the form of the variance function and diagnose problem observations.
Response residuals (y_i - \hat{\mu}_i) are not very helpful, as they are not normally distributed and do not have constant variance. Pearson residuals (response residuals scaled by \sqrt{V(\hat{\mu}_i)}) and deviance residuals (signed square roots of the unit deviances) are preferred for GLM diagnostics.
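In R, these residual types are available directly from a fitted `glm` object; a minimal sketch on simulated Poisson data:

```r
# Deviance and Pearson residuals for a fitted GLM
set.seed(42)
d   <- data.frame(x = runif(100))
d$y <- rpois(100, lambda = exp(1 + 2 * d$x))
fit <- glm(y ~ x, family = poisson, data = d)

r_dev  <- residuals(fit, type = "deviance")  # signed sqrt of unit deviances
r_pear <- residuals(fit, type = "pearson")   # (y - mu_hat) / sqrt(V(mu_hat))

plot(fitted(fit), r_dev, xlab = "Fitted mean", ylab = "Deviance residual")
```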
Similar to the RSS of a linear model, deviance always decreases as the number of predictors increases.
AIC and BIC are penalised likelihood measures, which balance the goodness of fit on the training set and model complexity.
When a log-likelihood loss function is used, we have \mathrm{AIC} = -2\hat{\ell} + 2d and \mathrm{BIC} = -2\hat{\ell} + d \log n, where \hat{\ell} is the maximised log-likelihood and d is the number of estimated parameters.
Model assumptions: a Poisson GLM forces \mathrm{Var}(Y) = \mathbb{E}(Y), whereas real claim-count data are often overdispersed, with \mathrm{Var}(Y) > \mathbb{E}(Y).
How to solve it? Use a quasi-Poisson model (which estimates a dispersion parameter \phi) or a negative binomial model, which arises as a Gamma mixture of Poisson distributions:
If Y \mid \lambda \sim \text{Poisson}(\lambda) and \lambda \sim \text{Gamma}\!\left(r, \frac{1-p}{p}\right), then Y \sim \text{NegBinom}(r, p).
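A sketch of fitting a negative binomial regression with `MASS::glm.nb()` on counts simulated from exactly this Poisson–Gamma mixture:

```r
library(MASS)

# Simulate overdispersed counts via a Poisson-Gamma mixture
set.seed(7)
x      <- runif(500)
lambda <- rgamma(500, shape = 2, rate = 2 / exp(0.5 + x))  # mean exp(0.5 + x)
y      <- rpois(500, lambda)

fit_nb <- glm.nb(y ~ x)   # estimates the dispersion (r, called theta) by ML
fit_nb$theta              # fitted r
```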
Review: An Introduction to Statistical Learning (James et al. 2013), Sections 4.3 and 4.6.2
A GLM for a binary target variable with the logit link is called a logistic regression model.
Performance Measures

ROC Curve and AUC
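A sketch of computing the ROC curve and AUC for a logistic regression on simulated data, assuming the `pROC` package is available:

```r
library(pROC)

# Simulated binary response
set.seed(3)
d   <- data.frame(x = rnorm(300))
d$y <- rbinom(300, 1, plogis(-0.5 + 1.5 * d$x))

fit  <- glm(y ~ x, family = binomial, data = d)
prob <- predict(fit, type = "response")   # predicted probabilities

roc_obj <- roc(d$y, prob)   # ROC curve object
auc(roc_obj)                # area under the curve
plot(roc_obj)
```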
Positive right-skewed continuous response: Gamma or inverse Gaussian distributions, typically with a log link.
Assume c is the number of claims on a policy and z_1, \ldots, z_c are the individual claim sizes. Then the total claim size is Y = \sum_{i=1}^{c} z_i = z_1 + \cdots + z_c.
The Tweedie distribution is a member of the exponential family and \mathrm{Var}(Y) = \phi \mu^p, where 1 < p < 2. Here, p is the power of the variance function.
Note: for 1 < p < 2, the Tweedie distribution is a compound Poisson–Gamma distribution. It has a point mass at zero (policies with no claims) together with a continuous density on (0, \infty), which makes it well suited to modelling aggregate claim amounts like Y above.
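A sketch of profiling p and fitting a Tweedie GLM on simulated data, assuming the `statmod` (for the `tweedie()` family) and `tweedie` (for `tweedie.profile()` and `rtweedie()`) packages:

```r
library(statmod)   # tweedie() family for glm()
library(tweedie)   # tweedie.profile(), rtweedie()

# Simulated aggregate claims: point mass at 0 plus right-skewed positives
set.seed(11)
d   <- data.frame(x = runif(400))
d$y <- rtweedie(400, mu = exp(1 + d$x), phi = 2, power = 1.5)

# Profile likelihood for the variance power p
prof <- tweedie.profile(y ~ x, data = d, p.vec = seq(1.2, 1.8, by = 0.1),
                        do.plot = FALSE)
prof$p.max   # ML estimate of p

# Fit the Tweedie GLM with log link (link.power = 0) at the chosen p
fit <- glm(y ~ x, data = d,
           family = tweedie(var.power = prof$p.max, link.power = 0))
```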
In many applications, the effect of one predictor may depend on another predictor.
We can include interaction terms in a GLM: g(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2
Interpretation: the effect of X_1 on g(\mu) is \beta_1 + \beta_3 X_2, so it depends on the level of X_2 (and symmetrically for X_2).
Types of interactions: continuous × continuous, continuous × categorical, and categorical × categorical; a sketch of the second type follows.
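A sketch of a continuous × categorical interaction in a Poisson GLM (simulated data; in R, `x1 * x2` expands to the main effects plus the interaction):

```r
# Interaction: the effect of age on the claim rate differs by gender
set.seed(5)
d <- data.frame(
  age    = runif(500, 20, 70),
  gender = factor(sample(c("F", "M"), 500, replace = TRUE))
)
eta <- -2 + 0.02 * d$age + 0.5 * (d$gender == "M") -
       0.01 * d$age * (d$gender == "M")
d$n <- rpois(500, exp(eta))

fit <- glm(n ~ age * gender, family = poisson, data = d)
coef(fit)   # age:genderM is the interaction coefficient (beta_3)
```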
Extend linear predictors using basis functions: g(\mu) = \beta_0 + \beta_1 b_1(X) + \cdots + \beta_K b_K(X)
Here: b_1(\cdot), \ldots, b_K(\cdot) are fixed, known functions of X (for example polynomials, step functions, or splines), and the model remains linear in the coefficients \beta_1, \ldots, \beta_K.
Basis functions can capture non-linear relationships and improve predictive performance
These ideas form the basis of more flexible models such as Generalised Additive Models (GAMs)
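A sketch using natural cubic splines as basis functions inside a GLM, via the `splines` package shipped with base R (simulated data):

```r
library(splines)

# Non-linear effect of age on claim frequency (simulated)
set.seed(9)
d   <- data.frame(age = runif(500, 20, 70))
d$n <- rpois(500, exp(-1 + sin(d$age / 10)))

# ns(age, df = 4) generates K = 4 basis functions b_1(age), ..., b_4(age)
fit <- glm(n ~ ns(age, df = 4), family = poisson, data = d)

plot(d$age, fitted(fit), xlab = "Age", ylab = "Fitted mean")
```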
Penalised GLMs add a penalty to the likelihood: \text{Loss} = -\ell(\beta) + \lambda \cdot \text{Penalty}(\beta)
Common choices of penalties: the ridge (L2) penalty \lambda \sum_j \beta_j^2, the lasso (L1) penalty \lambda \sum_j |\beta_j|, and the elastic net, which combines the two.
Refer to the topic of Modelling and Shrinkage Techniques discussed earlier; a small lasso sketch follows.
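A sketch of a lasso-penalised Poisson GLM on simulated data, assuming the `glmnet` package:

```r
library(glmnet)

# Simulated design with many predictors, few of them relevant
set.seed(13)
X <- matrix(rnorm(500 * 20), nrow = 500)
y <- rpois(500, exp(0.5 + 0.8 * X[, 1] - 0.5 * X[, 2]))

cv <- cv.glmnet(X, y, family = "poisson", alpha = 1)  # alpha = 1: lasso
coef(cv, s = "lambda.min")   # coefficients at the CV-optimal lambda
```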
glm() function
`glm(formula, family = familytype(link = linkfunction), data = )`

| Family | Default (canonical) link | Other / preferred link(s) |
|---|---|---|
| binomial | (link = "logit") | "probit", "cloglog" |
| gaussian | (link = "identity") | |
| Gamma | (link = "inverse") | "log" |
| inverse.gaussian | (link = "1/mu^2") | "log" |
| poisson | (link = "log") | |
| quasipoisson | (link = "log") | |
| quasibinomial | (link = "logit") | |
Other useful R functions:
- `library(statmod)` and `library(tweedie)`: support for Tweedie GLMs
- `glm.nb()` (from `MASS`): fit a negative binomial regression
- `tweedie.profile()`: maximum likelihood estimation of the Tweedie index parameter p
- `step()`: stepwise model selection
- `summary()`, `AIC()`, `BIC()`: model summaries and information criteria

You are working as an actuarial analyst for a general insurer. You are given a dataset with the following information: the number of claims observed for each policyholder (NumClaims), the policyholder's age (Age), gender (Gender) and risk category (RiskCat), and the length of the observation period (Time).
You are asked to examine the effects of various factors on the number of claims a policyholder incurs over the observation period. Please specify:
1. the distribution of the target variable
2. the predictors
3. the link function
4. the offset (if any)
5. the key R command to fit such a GLM
`modPoisson <- glm(NumClaims ~ Age + Gender + RiskCat, family = poisson, offset = log(Time))`

We analyse 4624 claims from Australian automobile insurance to understand the factors that affect claim severity and to predict it.
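For the severity part, a Gamma GLM with a log link is a natural starting point. The sketch below uses a simulated stand-in for the data; the variable names (ClaimAmount, Age, Gender, RiskCat) and the data frame `claims` are hypothetical, not the actual Australian dataset:

```r
# Illustrative only: simulated stand-in for the claim-severity data
set.seed(2024)
claims <- data.frame(
  Age     = sample(18:80, 300, replace = TRUE),
  Gender  = factor(sample(c("F", "M"), 300, replace = TRUE)),
  RiskCat = factor(sample(1:3, 300, replace = TRUE))
)
claims$ClaimAmount <- rgamma(300, shape = 2,
                             rate = 2 / exp(6 + 0.01 * claims$Age))

# Gamma GLM with log link for claim severity
modGamma <- glm(ClaimAmount ~ Age + Gender + RiskCat,
                family = Gamma(link = "log"), data = claims)
summary(modGamma)
```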
R output for fitted GLM
