Lecture: GLM

Actuarial Data Science - Open Learning Resource

Author

Fei Huang, UNSW Sydney

Learning Objectives

  • GLMs are one of the core workhorses of actuarial modelling, especially in pricing and reserving. In this lecture, we revisit the GLM framework and connect it to the broader statistical learning ideas you have seen, so that you understand when and why a GLM is an appropriate choice, not just how to fit one in software.

  • Perform data analytics modelling using GLMs for response variables of the following types:

    • count data
    • binary data
    • continuous data
    • a mixture of discrete and continuous data
  • Understand the specification of GLMs and their model assumptions.

  • Select and validate a GLM appropriately.

  • Compare GLMs with other modelling techniques.

Generalized Linear Models (Review from Prior Courses)

Linear Model

A linear model assumes Y \mid X \sim \mathcal{N}(\mu(X), \sigma^2 I), with \mathbb{E}(Y \mid X) = \mu(X) = X^T \beta, where X = (1, x_1, x_2, \ldots, x_p)^T and \beta = (\beta_0, \beta_1, \ldots, \beta_p)^T.

  • Normal errors
  • Constant variance
  • Additive linear relationship

Components of a Linear Model

The main components (that we are going to relax) are:

  1. Random component: the response variable Y \mid X is continuous and normally distributed with mean \mu = \mu(X) = \mathbb{E}(Y \mid X).
  2. Link: the relationship between the random component and the covariates X = (1, x_1, x_2, \ldots, x_p)^T: \mu(X) = X^T \beta
  3. Model fitting and inference: least squares, maximum likelihood estimation, and weighted least squares

Insurance Applications

In insurance settings, most variables are non-normal in nature. Models used are often multiplicative, hence linear on the log scale.

  • In general insurance, claim numbers (frequency) are typically Poisson, or Poisson with over-dispersion. These distributions are not symmetric, and their variance is proportional to the mean.
  • In general insurance, claim amounts (severity) are right-skewed, often following distributions such as the Gamma.
  • In mortality studies, the aim is to explain the number of deaths in terms of variables such as age, gender, and lifestyle.
  • In health insurance, we may wish to explain the number of claims made by different individuals or groups of individuals in terms of explanatory variables such as age, gender and occupation.

How to Solve the Problem?

  • Transformations of the response in a linear model (e.g., a log-transformed response)?
    • A log transform requires the response to be strictly positive, and the resulting model implicitly assumes the response is lognormally distributed.
    • Transformations do not solve all problems (e.g., counts, binary outcomes, or mixed discrete/continuous responses).
  • GLMs provide more flexibility and cover a wider range of distributions.

Generalisation

A generalized linear model (GLM) generalises normal linear regression models in the following directions:

  1. Random component: Y \sim \text{some exponential family distribution}

  2. Link: between the random component and covariates: g(\mu(X)) = X^T \beta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p where g is the link function and \mu = \mathbb{E}(Y \mid X). The link function can be any monotonic function (allowing inversion): \mu(X) = g^{-1}(X^T \beta) = g^{-1}(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)

  3. Model fitting and inferences: maximum likelihood estimation

    • Newton-Raphson Method
    • Fisher scoring Method

One-Parameter Canonical Exponential Family

  • Canonical exponential family for k = 1, y \in \mathbb{R}: f_\theta(y) = \exp\left(\frac{y\theta - b(\theta)}{\phi} + c(y, \phi)\right) for some known functions b(\cdot) and c(\cdot, \cdot).

  • If \phi is known, this is a one-parameter exponential family with \theta being the canonical parameter.

  • If \phi is unknown, this may or may not be a two-parameter exponential family. \phi is called the dispersion parameter.

  • In this class, we always assume that \phi is known.

Example: Normal Distribution

  • Consider the following normal density function with known variance \sigma^2:

\begin{aligned} f_\theta(y) &= \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) \\ &= \exp\left\{ \frac{y\mu - \frac{1}{2}\mu^2}{\sigma^2} - \frac{1}{2} \left( \frac{y^2}{\sigma^2} + \log(2\pi\sigma^2) \right) \right\} \end{aligned}

  • Therefore \theta = \mu, \phi = \sigma^2, b(\theta) = \frac{\theta^2}{2}, and c(y, \phi) = -\frac{1}{2} \left( \frac{y^2}{\phi} + \log(2\pi \phi) \right).
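The algebra above can be verified numerically. The sketch below (in Python, with made-up values of y, \mu, \sigma^2) evaluates the standard normal density and the exponential-family form with \theta = \mu, \phi = \sigma^2, b(\theta) = \theta^2/2, and checks that they agree:

```python
import math

# Hypothetical values for a numeric sanity check (not from the lecture).
y, mu, sigma2 = 1.3, 0.5, 2.0

# Standard normal density.
direct = math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Exponential-family form with theta = mu, phi = sigma^2,
# b(theta) = theta^2 / 2, c(y, phi) = -(y^2/phi + log(2*pi*phi)) / 2.
theta, phi = mu, sigma2
b = theta ** 2 / 2
c = -0.5 * (y ** 2 / phi + math.log(2 * math.pi * phi))
expfam = math.exp((y * theta - b) / phi + c)
```

The two expressions agree to floating-point precision, as the derivation on the slide predicts.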

Other distributions

The exponential family includes a rich class of distributions, such as normal, Poisson, binomial, Gamma, and inverse Gaussian distributions.

|                | Normal | Poisson | Bernoulli |
|----------------|--------|---------|-----------|
| Notation       | \mathcal{N}(\mu, \sigma^2) | \mathcal{P}(\mu) | \mathcal{B}(p) |
| Range of y     | (-\infty, \infty) | \{0, 1, 2, \ldots\} | \{0, 1\} |
| \phi           | \sigma^2 | 1 | 1 |
| b(\theta)      | \dfrac{\theta^2}{2} | e^\theta | \log(1 + e^\theta) |
| c(y,\phi)      | -\dfrac{1}{2}\left(\dfrac{y^2}{\phi} + \log(2\pi\phi)\right) | -\log(y!) | 0 |

How to Select the Response/Target Distribution?

We should choose the response distribution that is consistent with the characteristics of the response variable.

Examples
  • Count response variables, such as the number of claims (frequency) and the number of deaths in mortality studies.
    • Poisson distribution (mean = variance)
    • Quasi-Poisson and Negative Binomial (variance > mean)
  • Binary response variables, such as fraud detection, email spam detection, and whether a client is credible or not.
    • Bernoulli distribution (binomial distribution with n=1)
  • Positive continuous response variables, such as claim amounts (severity) and insurance coverage.
    • Gamma and inverse Gaussian distributions
  • A mixture of discrete and continuous response variables, such as aggregate loss in general insurance.
    • Tweedie distribution

Mean and Variance

The mean \mathbb{E}(Y) and the variance \mathrm{Var}(Y) can be derived as follows:

\mathbb{E}(Y) = \mu = b'(\theta).

\mathrm{Var}(Y) = V(\mu)\,\phi = b''(\theta)\,\phi. V(\mu) is called the variance function.

Examples of Mean

  • Note that \mu = b'(\theta), so the mean \mu depends on the canonical (location) parameter \theta, or equivalently \theta depends on \mu. Hence, we sometimes write \theta = \theta(\mu).
Examples
  • Normal N(\mu, \sigma^2): \theta = \mu, and hence \theta(\mu) = \mu.

  • Gamma(\alpha, \beta): \theta = -\beta/\alpha = -\frac{1}{\mu}, and hence \theta(\mu) = -\frac{1}{\mu}.

  • Inverse Gaussian(\alpha, \beta): \theta = -\frac{1}{2}\frac{\beta^2}{\alpha^2} = -\frac{1}{2\mu^2}, and hence \theta(\mu) = -\frac{1}{2\mu^2}.

  • Poisson(\mu): \theta = \log \mu, and hence \theta(\mu) = \log \mu.

  • Binomial(m, p): \theta = \log\!\left(\frac{p}{1-p}\right) = \log\!\left(\frac{\mu}{m-\mu}\right), and hence \theta(\mu) = \log\!\left(\frac{\mu}{m-\mu}\right).

  • Negative Binomial(r, p): \theta = \log(1-p), and hence (writing \mu in terms of r and p) \theta(\mu) = \log\!\left(\frac{\mu}{\mu + r}\right).

Example: Poisson Distribution

Consider a Poisson likelihood, f(y) = \frac{\mu^y}{y!} e^{-\mu} = \exp\!\left(y \log \mu - \mu - \log(y!)\right). Thus \theta = \log \mu, \quad \phi = 1, \quad b(\theta) = e^\theta = \mu, \quad c(y, \phi) = -\log(y!).

Differentiating gives b'(\theta) = e^\theta = \mu and b''(\theta) = e^\theta = \mu, so \mathbb{E}(Y) = \mu and \mathrm{Var}(Y) = \phi\, b''(\theta) = \mu.
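The identities b'(\theta) = b''(\theta) = \mu for the Poisson case can be checked by finite differences (a quick Python sketch; the value of \mu is made up):

```python
import math

mu = 2.5                       # hypothetical Poisson mean
theta = math.log(mu)           # canonical parameter theta = log(mu)
b = lambda t: math.exp(t)      # b(theta) = e^theta

# Numerical first and second derivatives of b at theta.
h = 1e-5
b1 = (b(theta + h) - b(theta - h)) / (2 * h)               # ~ b'(theta)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2   # ~ b''(theta)
```

Both derivatives come out (numerically) equal to \mu, matching mean = variance for the Poisson.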

Variance Function

  • The variance is sometimes expressed as \mathrm{Var}(Y) = \phi V(\mu), where V(\mu) = b''\!\left(\theta(\mu)\right) and is called the variance function.
Examples
  • Normal N(\mu, \sigma^2) with V(\mu) = 1

  • Gamma(\alpha, \beta) with V(\mu) = \mu^2

  • Inverse Gaussian(\alpha, \beta) with V(\mu) = \mu^3

  • Poisson(\mu) with V(\mu) = \mu

  • Binomial(m, p) with V(\mu) = \mu\left(1 - \mu/m\right)

  • Negative Binomial(r, p) with V(\mu) = \mu\left(1 + \mu/r\right)

Mean-Variance Relationship

[Figure: Mean-Variance Relationships Across Distributions]

Example: The Bernoulli Distribution

  • We can check that b(\theta) = \log(1 + e^\theta).
  • Hence we solve b'(\theta) = \frac{e^\theta}{1 + e^\theta} = \mu, \theta = \log\!\left(\frac{\mu}{1-\mu}\right).
  • The canonical link for the Bernoulli distribution is the logit link.
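Inverting b'(\theta) = \mu as on the slide can be confirmed numerically (a small Python check with a made-up value of \mu):

```python
import math

mu = 0.3   # hypothetical Bernoulli mean, 0 < mu < 1

# Solving b'(theta) = e^theta / (1 + e^theta) = mu gives the logit.
theta = math.log(mu / (1 - mu))

# Plugging back in should recover mu.
mean_from_theta = math.exp(theta) / (1 + math.exp(theta))
```

Applying b' to the logit of \mu returns \mu, confirming the logit is the canonical link.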

GLM Model Structure

  • Let (X_i, Y_i) \in \mathbb{R}^p \times \mathbb{R}, i = 1,2,\ldots,n, be independent random pairs such that the conditional distribution of Y_i given X_i has a density in the canonical exponential family: f_{\theta_i}(y_i) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{\phi} + c(y_i, \phi) \right\}.
  • \mathbf{Y} = (Y_1, \ldots, Y_n)^T, \mathbf{X} = (X_1^T, \ldots, X_n^T)^T
  • The mean \mu_i is related to the canonical parameter \theta_i via \mu_i = b'(\theta_i).
  • The mean \mu_i depends on the covariates through a link function g: g(\mu_i) = X_i^T \beta.

\beta and \theta

  • Given a link function g, note the following relationship between \beta and \theta: \begin{aligned} \theta_i &= (b')^{-1}(\mu_i) \\ &= (b')^{-1}\big(g^{-1}(X_i^T \beta)\big) \equiv h(X_i^T \beta) \end{aligned}

  • where h is defined as h = (b')^{-1} \circ g^{-1} = (g \circ b')^{-1}.

  • Remark: if g is the canonical link function, then h is the identity.

Model Interpretation: Quantitative Predictors

The link function is closely related to the interpretation of coefficients.

  • Log link: \ln(\mu) = X^T\beta = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, so that \mu = e^{X^T\beta} = e^{\beta_0} \times e^{\beta_1 X_1} \times \cdots \times e^{\beta_p X_p}
    • When there is a unit change in a predictor X_j (all others fixed), there is a multiplicative change (e^{\beta_j}) in the mean \mu, and the percentage change in \mu is (e^{\beta_j} - 1).
    • When \beta_j is positive, e^{\beta_j} > 1 and \mu increases; when \beta_j is negative, e^{\beta_j} < 1 and \mu decreases.
  • Logit link: \ln\!\left(\frac{\mu}{1-\mu}\right) = \ln(\text{odds}) = X^T\beta
    • Similar interpretations.
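The multiplicative interpretation under the log link can be seen in two lines of Python (coefficients here are hypothetical):

```python
import math

# Hypothetical log-link model: log(mu) = b0 + b1 * x1.
b0, b1 = 0.5, 0.2
mu_at = lambda x1: math.exp(b0 + b1 * x1)

# A one-unit change in x1 multiplies mu by exp(b1), regardless of the level of x1.
ratio = mu_at(4) / mu_at(3)
pct_change = ratio - 1          # percentage change in mu, exp(b1) - 1
```

The ratio equals e^{\beta_1} exactly, so with \beta_1 = 0.2 the mean increases by about 22% per unit of X_1.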

Model Interpretation: Qualitative Predictors

For qualitative (categorical) predictors, the interpretation depends on the choice of the reference level.

Suppose a categorical variable has a reference category. Then:

  • The intercept typically corresponds to the expected response for the reference category.
  • The coefficient associated with another category measures the difference relative to the reference category.
  • Under a log link, exponentiating the coefficient gives the multiplicative effect (i.e., multiplicative relativities) relative to the reference category.
  • Under a logit link, exponentiating the coefficient gives the odds ratio relative to the reference category.

See the GLM lab materials for more detailed examples and interpretations.

Log-likelihood

  • The log-likelihood is given by \begin{aligned} \ell_n(\beta; \mathbf{Y}, \mathbf{X}) &= \sum_i \frac{Y_i \theta_i - b(\theta_i)}{\phi} \\ &= \sum_i \frac{Y_i\, h(X_i^T \beta) - b\!\left(h(X_i^T \beta)\right)}{\phi} \end{aligned} up to a constant term.

  • Note that when we use the canonical link function, we obtain the simpler expression \ell_n(\beta, \phi; \mathbf{Y}, \mathbf{X}) = \sum_i \frac{Y_i X_i^T \beta - b(X_i^T \beta)}{\phi}.

Maximum Likelihood Estimation (MLE)

  • For linear models, the MLE and the ordinary least squares method are equivalent.
  • For GLMs, we use MLE for parameter estimation.
  • The log-likelihood \ell(\beta) is strictly concave when the canonical link function is used and \phi > 0. As a consequence, the maximum likelihood estimator is unique.
  • With a non-canonical link, the log-likelihood may not be concave, so there can be several local maxima. The following methods can be used for optimisation:
    • Newton–Raphson method
    • Fisher scoring method
  • The coefficient estimates and the fitted mean response are given by: \hat{\mu}_i = g^{-1}\!\left(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_p x_{ip}\right).
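For the canonical log link, the Newton-Raphson update takes the form \beta \leftarrow \beta + (X^T W X)^{-1} X^T(y - \mu) with W = \mathrm{diag}(\mu). A minimal pure-Python sketch of Poisson regression on a tiny made-up dataset (all numbers hypothetical, not from the lecture):

```python
import math

# Tiny hypothetical dataset: one covariate x, count response y.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1, 2, 2, 4, 7]

# Canonical log link: log(mu_i) = b0 + b1 * x_i; start from the null model.
b0, b1 = math.log(sum(y) / len(y)), 0.0
for _ in range(25):
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    # Score U = X^T (y - mu); information J = X^T W X with W = diag(mu).
    u0 = sum(yi - mi for yi, mi in zip(y, mu))
    u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
    j00 = sum(mu)
    j01 = sum(xi * mi for xi, mi in zip(x, mu))
    j11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
    det = j00 * j11 - j01 * j01
    # Newton-Raphson update: beta <- beta + J^{-1} U (2x2 solve by hand).
    b0 += (j11 * u0 - j01 * u1) / det
    b1 += (-j01 * u0 + j00 * u1) / det

mu_hat = [math.exp(b0 + b1 * xi) for xi in x]
```

At convergence the score equations X^T(y - \hat{\mu}) = 0 hold, so in particular the fitted means sum to the observed total (a property specific to canonical links).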

Goodness-of-Fit Measures: Deviance

  • Deviance is a likelihood-based goodness-of-fit measure on the training set. It plays much the same role for GLMs as RSS does for ordinary linear models.

  • Null model: only the constant term is included.

  • Full model (saturated model): each fitted value is equal to the observation, so the saturated model fits perfectly. The full model is the case when \mu_i = y_i for all i = 1,2,\ldots,n, so that the log-likelihood in the full model gives \ell_{\text{full}} = \sum_{i=1}^{n} \left[ \frac{y_i \tilde{\theta}_i - b(\tilde{\theta}_i)}{\phi} + c(y_i, \phi) \right] where \tilde{\theta}_i are the canonical parameter values corresponding to \mu_i = y_i for all i = 1,2,\ldots,n.

  • We define the deviance of the chosen GLM (with log-likelihood \ell) as D \equiv 2(\ell_{\text{full}} - \ell).

  • Note: for a linear model with Gaussian errors, the deviance is just the RSS on the training set. The null deviance is 2(\ell_{\text{full}} - \ell_{\text{null}}).

Model Comparison

  • Consider two nested models:

    • Model 1: q parameters, with deviance D_1;
    • Model 2: p parameters (p > q), with deviance D_2.
  • If Model 1 is adequate, then approximately D_1 - D_2 \sim \chi^2(p - q).

  • Model 2 is a significant improvement over Model 1 (the more parsimonious model) if D_1 - D_2 exceeds the critical value from a \chi^2(p - q) distribution.

  • Since \Pr\!\left[ \chi^2(\nu) > 2\nu \right] \approx 5\%, the following rule of thumb can be used as an approximation: \text{Model 2 is preferred if } D_1 - D_2 > 2(p - q).

Example

Let Y_{ij} denote the number of claims for the jth male driver in group i (i=1,2,3, j=1,2,\cdots). Three models are fitted, with deviances as shown below:

| Model   | Link function | Deviance |
|---------|---------------|----------|
| Model 1 | \log(\mu_{ij}) = \alpha_i,\ i = 1,2,3 | 60.40 |
| Model 2 | \log(\mu_{ij}) = \alpha \text{ if } i = 1,2; \quad \log(\mu_{ij}) = \beta \text{ if } i = 3 | 61.64 |
| Model 3 | \log(\mu_{ij}) = \alpha | 72.53 |

Determine whether or not
(a) model 2 is a significant improvement over model 3;
(b) model 1 is a significant improvement over model 2;
(c) model 1 is a significant improvement over model 3.
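As a quick check, the rule of thumb from the previous slide (prefer the larger model if the deviance drop exceeds twice the number of extra parameters) can be applied directly to the deviances in the table; Model 1 has 3 parameters, Model 2 has 2, and Model 3 has 1:

```python
# Deviances and parameter counts from the example table.
D1, D2, D3 = 60.40, 61.64, 72.53
p1, p2, p3 = 3, 2, 1

# Rule of thumb: larger model preferred if deviance drop > 2 * (extra parameters).
a = (D3 - D2) > 2 * (p2 - p3)   # (a) model 2 vs model 3: 10.89 vs 2
b = (D2 - D1) > 2 * (p1 - p2)   # (b) model 1 vs model 2: 1.24 vs 2
c = (D3 - D1) > 2 * (p1 - p3)   # (c) model 1 vs model 3: 12.13 vs 4
```

The exact \chi^2 critical values at the 5% level (3.84 for 1 degree of freedom, 5.99 for 2) lead to the same conclusions here.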

Exponential Distributions and Their Deviances

  • We drop the subscript i = 1,2,\ldots,n

  • Deviances are:

| Distribution | Deviance D(y, \hat{\mu}) |
|--------------|--------------------------|
| Normal | \sum (y - \hat{\mu})^2 |
| Poisson | 2 \sum \left[ y \log\!\left(\frac{y}{\hat{\mu}}\right) - (y - \hat{\mu}) \right] |
| Binomial | 2 \sum \left[ y \log\!\left(\frac{y}{\hat{\mu}}\right) + (m - y)\log\!\left(\frac{m - y}{m - \hat{\mu}}\right) \right] |
| Gamma | 2 \sum \left[ -\log\!\left(\frac{y}{\hat{\mu}}\right) + \frac{y - \hat{\mu}}{\hat{\mu}} \right] |
| Inverse Gaussian | \sum \frac{(y - \hat{\mu})^2}{\hat{\mu}^2 y} |

Deviance Residuals (1)

  • Residuals are a primary tool for assessing how well a model fits the data.

  • They can also help detect the form of the variance function and diagnose problem observations.

  • Response residuals (y_i - \hat{\mu}_i) are not very helpful, as they are not normally distributed and do not have constant variance.

Deviance Residuals (1) (continued)

  • We consider the deviance residuals instead: r_i^D = \text{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}
    • where d_i is the contribution of the ith observation to the deviance (drawing on the idea that deviance is akin to RSS).
    • \text{sign}(y_i - \hat{\mu}_i) = 1 if y_i > \hat{\mu}_i, and \text{sign}(y_i - \hat{\mu}_i) = -1 if y_i < \hat{\mu}_i.
    • For example, for a Poisson distribution, we have r_i^D = \text{sign}(y_i - \hat{\mu}_i)\sqrt{2\left[ y_i \log\!\left(\frac{y_i}{\hat{\mu}_i}\right) - (y_i - \hat{\mu}_i) \right]}.
  • \sum_{i=1}^{n}(r_{i}^{D})^2=\sum_{i=1}^{n}d_i=D.
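The identity \sum_i (r_i^D)^2 = D can be checked directly for the Poisson case (a Python sketch; the observed counts and fitted means below are made up):

```python
import math

# Hypothetical observed counts and fitted Poisson means.
y = [0, 2, 1, 4, 3]
mu = [0.8, 1.5, 1.2, 3.0, 3.5]

def d_i(yi, mi):
    # Contribution of one observation to the Poisson deviance;
    # the y*log(y/mu) term is taken as 0 when y = 0 (by continuity).
    term = yi * math.log(yi / mi) if yi > 0 else 0.0
    return 2 * (term - (yi - mi))

# Deviance residuals: sign(y - mu) * sqrt(d_i).
r_dev = [math.copysign(1, yi - mi) * math.sqrt(d_i(yi, mi))
         for yi, mi in zip(y, mu)]
deviance = sum(d_i(yi, mi) for yi, mi in zip(y, mu))
```

Each contribution d_i is non-negative, and the squared deviance residuals sum back to the model deviance, mirroring how squared residuals sum to the RSS in a linear model.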

Deviance Residuals (2)

  • Deviance residuals are approximately normally distributed for most response distributions.
    • We can use a QQ plot of the standardised deviance residuals as a diagnostic tool.
  • Deviance residuals should have no systematic patterns with respect to the predictors.
  • Standardised deviance residuals have approximately constant variance.

AIC and BIC

  • Similar to the RSS of a linear model, deviance always decreases as the number of predictors increases.

  • AIC and BIC are penalised likelihood measures, which balance the goodness of fit on the training set and model complexity.

  • When a log-likelihood loss function is used, we have

    • \text{AIC} = -2 \, \text{loglik} + 2d
    • \text{BIC} = -2 \, \text{loglik} + (\log n)d
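The two formulas differ only in the complexity penalty; for n > e^2 (about 7.4), BIC penalises extra parameters more heavily than AIC. A one-line Python illustration with hypothetical fit values:

```python
import math

# Hypothetical fitted model: log-likelihood, d parameters, n observations.
loglik, d, n = -120.5, 4, 200

aic = -2 * loglik + 2 * d              # AIC = -2*loglik + 2d
bic = -2 * loglik + math.log(n) * d    # BIC = -2*loglik + log(n)*d
```

With n = 200, \log n \approx 5.3 > 2, so BIC exceeds AIC and tends to select more parsimonious models.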

Procedure of Constructing a GLM

  • Choose a response distribution f(y_i) and hence choose b(\theta).
  • Choose a link g(\mu).
  • Choose explanatory variables X in terms of which g(\mu) is to be modelled.
  • Collect observations y_1,\cdots,y_n on the response y and corresponding values x_1,\cdots,x_n on the explanatory variables x.
  • Fit the model by estimating \beta (and, if unknown, \phi).
  • Given the estimate of \beta, generate predictions (or fitted values) of y for different settings of x, and examine how well the model fits. The estimated values of \beta can also be used to assess whether the explanatory variables are important in determining \mu.

Models for Count Responses

Count Response: Poisson Regression

  • Y_i \sim \text{Poisson}(\lambda_i), with probability mass function f(y_i) = \dfrac{\lambda_i^{y_i}}{y_i!} e^{-\lambda_i}
    • g(\mu_i) = X_i^T \beta \quad \Rightarrow \quad \log(\lambda_i) = X_i^T \beta (canonical link = log link; log-linear model)

Model assumptions:

  • Poisson Response: The response variable is a count per unit of time or space, described by a Poisson distribution.
  • Independence: The observations must be independent of one another.
  • Mean = Variance: By definition, the mean of a Poisson random variable is equal to its variance.
  • Linearity: The log of the mean rate, \log(\lambda_i), must be a linear function of x.

Offsets

  • If \mu is the mean of the count Y, then the occurrence rate \mu / n is of interest, and g\!\left(\frac{\mu}{n}\right) = X^T \beta.
  • When g is the log function, we have \log\!\left(\frac{\mu}{n}\right) = X^T \beta \quad \Rightarrow \quad \log \mu = \log n + X^T \beta.
  • The variable n is called the exposure, and \log n is called an offset.
  • With the offset, Y has expected value directly proportional to exposure: \mu = n e^{X^T \beta}.
  • Offsets are used to correct for group size or differing time periods of observation.
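The proportionality \mu = n\, e^{X^T\beta} implied by the offset can be illustrated in a short Python sketch (the coefficients and exposures are hypothetical):

```python
import math

# Hypothetical log-link coefficients and a single covariate value.
b0, b1 = -2.0, 0.3
x = 1.0

def expected_count(n):
    # log(mu) = log(n) + b0 + b1*x  =>  mu = n * exp(b0 + b1*x)
    return math.exp(math.log(n) + b0 + b1 * x)

mu1 = expected_count(1.0)
mu2 = expected_count(2.0)   # doubling the exposure doubles the expected count
```

Because the offset enters with a fixed coefficient of 1, doubling the exposure exactly doubles the expected count, with no extra parameter to estimate.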

Overdispersion

  • Overdispersion suggests that there is more variation in the response than the model implies.

How to solve it?

  1. Quasi-Poisson and quasi-likelihood: use an estimated dispersion factor to inflate standard errors: \sqrt{\hat{\phi}} \times \mathrm{SE}(\hat{\beta}). \hat{\phi} = \frac{\sum (\text{Pearson residuals})^2}{n - p}, where p is the number of model parameters. \hat{\phi} will be larger than 1 in the presence of overdispersion.
    • Pearson residuals: r_i^{P} = \dfrac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}}.
  2. Use a negative binomial regression model.
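The dispersion estimate above is easy to compute by hand. A Python sketch with made-up counts and fitted Poisson means (so V(\hat{\mu}) = \hat{\mu}):

```python
import math

# Hypothetical counts and fitted Poisson means; p = number of model parameters.
y = [0, 2, 1, 5, 9, 3]
mu = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
p = 2

# Pearson residuals: (y - mu) / sqrt(V(mu)), with V(mu) = mu for the Poisson.
pearson = [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, mu)]
phi_hat = sum(r * r for r in pearson) / (len(y) - p)
```

Here \hat{\phi} comes out well above 1, signalling overdispersion; standard errors would be inflated by \sqrt{\hat{\phi}}.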

Negative Binomial Regression

If Y \sim \text{Poisson}(\lambda) and \lambda \sim \text{Gamma} with shape r and rate \frac{1-p}{p}, then Y \sim \text{NegBinom}(r, p).

  • \mathbb{E}(Y) = \dfrac{pr}{1 - p} = \mu and \mathrm{Var}(Y) = \dfrac{pr}{(1 - p)^2} = \mu + \dfrac{\mu^2}{r}
  • The overdispersion in this case is given by \dfrac{\mu^2}{r} (so smaller values of r indicate greater overdispersion).
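The identity \mathrm{Var}(Y) = \mu + \mu^2/r follows algebraically from the formulas above; a quick numeric confirmation in Python (with hypothetical r and p):

```python
# Hypothetical negative binomial parameters, using the slide's convention:
# mean = p*r/(1-p), variance = p*r/(1-p)^2.
r, p = 2.0, 0.5
mu = p * r / (1 - p)
var = p * r / (1 - p) ** 2
# The variance should decompose as mu + mu^2/r (Poisson part + overdispersion).
```

With r = 2 and p = 0.5, the mean is 2 and the variance is 4 = 2 + 2^2/2, so the overdispersion term \mu^2/r contributes half the variance.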

Models for Binary Responses

Review: An Introduction to Statistical Learning (James et al. 2013), Sections 4.3 and 4.6.2

Binary Response: Logistic Regression

A binary target variable with the logit link is called a logistic regression model.

  • Y_i \sim \text{Binomial}(1, \pi_i)
    • g(\mu_i) = g(\pi_i) = \ln\!\left(\frac{\pi_i}{1-\pi_i}\right) = X_i^T \beta (canonical link = logit link)
    • \pi_i = \dfrac{e^{X_i^T \beta}}{1 + e^{X_i^T \beta}}
    • The logit link ensures predictions of \pi_i using X_i are in the interval (0,1) for all \beta and X_i.
  • Other links
    • probit link: \Phi^{-1}(\pi_i)
    • complementary log-log (c-log-log) link: \log\!\left(-\log(1 - \pi_i)\right)

Performance Measures

  • We need a cut-off to translate the predicted probabilities into classifications.
  • Confusion matrix corresponding to the cut-off:
  • \text{Sensitivity / Recall / True Positive Rate} = \dfrac{TP}{TP + FN}
    • The ratio of TP to the total number of positive events.
  • \text{Specificity} = \dfrac{TN}{TN + FP}
    • The ratio of TN to the total number of negative events.
  • \text{False Positive Rate} = \dfrac{FP}{TN + FP} = 1 - \text{Specificity}
  • The confusion matrix can be computed on both the training and testing sets.
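The definitions above amount to simple ratios of the confusion-matrix cells. A Python sketch with hypothetical counts:

```python
# Hypothetical confusion matrix counts at a given cut-off.
TP, FN, FP, TN = 40, 10, 15, 85

sensitivity = TP / (TP + FN)   # recall / true positive rate
specificity = TN / (TN + FP)
fpr = FP / (TN + FP)           # false positive rate = 1 - specificity
```

With these counts, sensitivity is 40/50 = 0.8 and specificity is 85/100 = 0.85, so the false positive rate is 0.15.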

[Figure: Confusion Matrix]

How to Choose the Correct Cut-off

  • Depends on the business objective.
  • For fraud detection, a low cut-off is preferred to avoid the high cost of undetected fraud.
  • In other cases, if the profits and costs associated with correct and incorrect classifications are known, an optimal cut-off can be chosen to maximise overall profit or minimise overall cost.

ROC and AUC

[Figure: ROC curve]
  • The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate against the False Positive Rate.
  • Each point on the ROC curve corresponds to a certain cut-off.
  • A good model will have its ROC curve bend towards the top-left corner.
  • AUC is the area under the ROC curve.
  • The relative value of the AUC can be used for quantitative assessment of a classifier: the higher, the better.
    • \text{AUC} = 1: perfect classifier.
    • \text{AUC} = 0.5: random classifier. The ROC curve lies on the 45^\circ diagonal line.
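The AUC can equivalently be computed as the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case (the Mann-Whitney formulation, with ties counted as one half). A Python sketch on hypothetical scores:

```python
# Hypothetical predicted probabilities for positive and negative cases.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]

# AUC = P(score of random positive > score of random negative), ties count 1/2.
wins = sum(1.0 if s > t else 0.5 if s == t else 0.0
           for s in pos for t in neg)
auc = wins / (len(pos) * len(neg))
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9, comfortably above the 0.5 of a random classifier.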

Models for Continuous Responses

Continuous Response: Gaussian and Gamma Regression

  • Gaussian response: linear regression

Positive right-skewed continuous response:

  • Gamma response: Y_i \sim \text{Gamma}(\mu_i, \nu)
    • g(\mu_i) = \log(\mu_i) = X_i^T \beta
    • The canonical link is the inverse function. Since parameters from a model with the inverse link are difficult to interpret, the log link is usually regarded as more useful.

Continuous Response: Gaussian and Gamma Regression (continued)

  • Inverse Gaussian response: Y_i \sim \text{IG}(\mu_i, \sigma^2)
    • g(\mu_i) = \log(\mu_i) = X_i^T \beta (log link)
    • The canonical link is g(\mu) = \mu^{-2}. However, the log link is usually preferred.

Models for a Mixture of Discrete and Continuous Responses

Tweedie Regression

Assume c is the number of claims on a policy and z_1, \ldots, z_c are the individual claim sizes. Then the total claim size is Y = \sum_{i=1}^{c} z_i = z_1 + \cdots + z_c.

  • If c is Poisson and z_j are independent Gamma random variables, then Y has a Tweedie distribution.
  • This distribution has a non-zero probability at Y = 0, equal to the Poisson probability of no claims. The rest of the distribution is similar to the Gamma distribution.
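Since the total claim size is zero exactly when there are no claims, the point mass at zero equals the Poisson probability of zero counts. A one-line Python check with a hypothetical claim-frequency mean:

```python
import math

# Hypothetical Poisson mean for the claim count.
lam = 0.3

# For a compound Poisson-gamma (Tweedie) total, Y = 0 iff there are no claims,
# so P(Y = 0) = P(N = 0) = exp(-lambda).
p_zero = math.exp(-lam)
```

With a claim frequency of 0.3, about 74% of policies have an aggregate loss of exactly zero, which is why a pure Gamma model would be inappropriate here.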

Tweedie Regression (continued)

  • The Tweedie distribution is a member of the exponential family, with \mathrm{Var}(Y) = \phi \mu^p, where p is the power of the variance function. The compound Poisson-gamma case corresponds to 1 < p < 2.

  • Note:

    • p = 0: Normal
    • p = 1: Poisson
    • p = 2: Gamma
    • p = 3: Inverse Gaussian

Other Considerations

Interaction terms

  • In many applications, the effect of one predictor may depend on another predictor.

  • We can include interaction terms in a GLM: g(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2

  • Interpretation:

    • The effect of X_1 on the response depends on the value/level of X_2
    • Interaction terms allow for non-additive effects
  • Types of interactions:

    • Continuous × Continuous (captures non-linear joint effects)
    • Categorical × Categorical (different group combinations)
    • Continuous × Categorical (slope differs across groups)
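The dependence of one predictor's effect on another can be made concrete on the linear-predictor scale. A Python sketch with hypothetical coefficients:

```python
# Hypothetical interaction model: g(mu) = b0 + b1*x1 + b2*x2 + b3*x1*x2.
b0, b1, b2, b3 = 1.0, 0.5, -0.2, 0.3

def eta(x1, x2):
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Effect of a one-unit change in x1 equals b1 + b3*x2, so it varies with x2.
slope_at_0 = eta(1, 0) - eta(0, 0)   # = b1
slope_at_2 = eta(1, 2) - eta(0, 2)   # = b1 + 2*b3
```

Without the interaction term (\beta_3 = 0) the two slopes would coincide; here the effect of X_1 more than doubles as X_2 moves from 0 to 2.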

Basis Extensions

  • Extend linear predictors using basis functions: g(\mu) = \beta_0 + \beta_1 b_1(X) + \cdots + \beta_K b_K(X)

  • Here:

    • b_k(X) are basis functions (transformations of X)
    • e.g. X, X^2, \text{splines}, \log X
  • Basis functions can capture non-linear relationships and improve predictive performance

  • These ideas form the basis of more flexible models such as Generalised Additive Models (GAMs)

Penalized GLM (Shrinkage Techniques)

  • Penalised GLMs add a penalty to the likelihood: \text{Loss} = -\ell(\beta) + \lambda \cdot \text{Penalty}(\beta)

  • Common choices of penalties:

    • Lasso
    • Ridge
    • Elastic Net
  • Refer to the topic of Modelling and Shrinkage Techniques discussed earlier.

Using R

Using R: glm()

  • glm() function
    • glm(formula, family = familytype(link = linkfunction), data = )
| Family | Default (canonical) link | Other / preferred link(s) |
|--------|--------------------------|---------------------------|
| binomial | (link = "logit") | "probit", "cloglog" |
| gaussian | (link = "identity") | |
| Gamma | (link = "inverse") | "log" |
| inverse.gaussian | (link = "1/mu^2") | "log" |
| poisson | (link = "log") | |
| quasipoisson | (link = "log") | |
| quasibinomial | (link = "logit") | |

Other Useful R Functions

  • library(statmod) and library(tweedie)
  • glm.nb() (from MASS): fit a Negative Binomial regression
  • tweedie.profile(): maximum likelihood estimation of the Tweedie index parameter p
  • step(): stepwise model selection
  • summary()
  • AIC()
  • BIC()
  • etc.

Examples

Example 1: Question

You are working as an actuarial analyst for a general insurer. You are given a dataset with the following information:

  • NumClaims: number of claims of a policyholder over a period of time
  • Time: the number of years a policyholder is observed
  • Age: age of a policyholder
  • Gender: gender of a policyholder
  • RiskCat: risk category to which a policyholder belongs

Example 1: Question (continued)

You are asked to examine the effects of various factors on the number of claims that a policyholder makes over a period of time. Please specify:

  1. the distribution of the target variable
  2. the predictors
  3. the link function
  4. the offset (if any)
  5. the key R command to fit such a GLM

Example 1: Solution

  1. Poisson distribution
  2. Age, Gender and RiskCat
  3. log link
  4. \log(\text{Time})
  5. modPoisson <- glm(NumClaims ~ Age + Gender + RiskCat, family = poisson, offset = log(Time))

Example 2: Question

We analyse 4624 claims from Australian automobile insurance to understand the factors that affect claim severity and to predict claim severity.

  • ClaimAmount: the sum of claim payments
  • VehValue: the vehicle value in thousands of Australian dollars
  • Gender: the gender of the policyholder (Female = 1, Male = 0)

Example 2: Question (continued)

R output for fitted GLM

Example 2: Question (continued)

  1. Specify the distribution and the link function used in this model.
  2. Explain why they are selected for this example.
  3. Write down the mathematical form of the link function g(\mu) = X^T \beta for this model.
  4. Interpret the GLM coefficients related to VehValue and Female.
  5. Discuss how to perform diagnostic checking for this model.

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Vol. 103. Springer.