Actuarial Data Science - Open Learning Resource
Some coding examples in this lecture are adapted from R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023).
A tibble is a data.frame adapted to the tidyverse environment.
The tibble package provides opinionated data frames that make working in the tidyverse easier.
See vignette("tibble") for more information.
This lecture focuses on data quality: how to store your data in a tidy, convenient format, and how to identify and fix common data issues before they silently undermine your analysis. The aim is that by the end, you feel more confident trusting the datasets you work with, or at least knowing when you should not trust them yet.
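as_tibble(): convert an existing data frame to a tibble
tibble(): create a tibble from individual vectors
tribble(): create tibbles in a transposed format
The pipe %>%
Use %>% to emphasise a sequence of actions rather than the object being acted on
Read %>% as “then” when reading code
%>% should always have a space before it and is usually followed by a new line.
Compared with data.frame(), tibble() does less.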
Use print() to display more rows (n) and columns (width)
Extract a single variable with $ (by name) or [[ ]] (by name or position)
To use these in a pipe with %>%, use the special placeholder .
[1] 0.06613743 0.64441979 0.87036981 0.99049084 0.76861869
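The tibble helpers and extraction operators above can be tried with a minimal sketch like the one below. The data are made up for illustration (the `policy` and `claims` columns are hypothetical), and the random values will not match the output shown above.

```r
library(tibble)
library(magrittr)  # provides the pipe %>%

set.seed(1)

# tibble(): build a tibble from individual vectors
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)

# tribble(): lay the data out row by row (hypothetical columns)
tr <- tribble(
  ~policy, ~claims,
  "A",     2,
  "B",     0
)

# as_tibble(): convert an existing data frame
iris_tbl <- as_tibble(iris)

# Show more rows (n) and all columns (width) when printing
print(iris_tbl, n = 15, width = Inf)

# Extract a single variable as a vector
by_name     <- df$x
by_name2    <- df[["x"]]
by_position <- df[[1]]

# Inside a pipe, refer to the incoming object with the placeholder `.`
piped  <- df %>% .$x
piped2 <- df %>% .[["x"]]
```

All five extraction forms should return the same numeric vector, which is why the choice between them is mostly a matter of readability.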
dplyr, ggplot2, and other tidyverse packages are designed to work with tidy data
table4a
pivot_longer() moves the column names of table4a into a year variable
and the cell values into a cases variable
pivot_wider() is the opposite of pivot_longer()
table2
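As a minimal sketch of the two pivots, the example datasets `table4a` and `table2` that ship with tidyr can be reshaped as follows. The object names `tidy4a` and `tidy2` are only illustrative.

```r
library(tidyr)
library(dplyr)

# table4a: the columns `1999` and `2000` are values of a year variable,
# and the cells are values of a cases variable
tidy4a <- table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )

# table2: each observation is spread across two rows (cases, population),
# so pivot_wider() turns the `type` values into columns
tidy2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)
```

Note that `year` is stored as character after `pivot_longer()`, because it is built from column names; convert it separately if a numeric year is needed.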
table3 has a different problem: one column (rate) contains two variables (cases and population).
separate() splits one column into multiple columns, by splitting wherever a separator character appears.
By default, separate() splits values at non-alphanumeric characters (i.e. characters that are not numbers or letters).
Use the sep argument to specify a custom separator.
unite() is the inverse of separate(): it combines multiple columns into a single column
By default, unite() places an underscore (_) between values from different columns
Use sep = "" to join values without a separator
Explicit missing values are flagged with NA – the presence of an absence
pivot_wider() can make implicit missing values explicit
complete() makes implicit missing values explicit by adding a row for every combination of the given columns
Use values_drop_na = TRUE in pivot_longer() to convert explicit missing values into implicit ones
fill() replaces missing values with the most recent non-missing value (also known as last observation carried forward).
Example: Outliers
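Before moving on, the tidying and missing-value helpers described above (separate(), unite(), complete(), fill()) can be tried with a short sketch. `table3` ships with tidyr; the `claims` data are hypothetical and exist only to show implicit versus explicit missing values.

```r
library(tidyr)
library(dplyr)

# table3: rate holds "cases/population" as text; split it on "/"
separated <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# unite() reverses the split; here we restore the "/" separator
reunited <- separated %>%
  unite(rate, cases, population, sep = "/")

# Hypothetical quarterly claims: 2023 Q3 and 2024 Q4 have no row
# (implicitly missing) and 2023 Q4 is recorded as NA (explicitly missing)
claims <- tibble(
  year = c(2023, 2023, 2023, 2024, 2024, 2024),
  qtr  = c(1, 2, 4, 1, 2, 3),
  paid = c(1.2, 0.8, NA, 1.5, 0.9, 1.1)
)

# complete(): add a row for every year/qtr combination, filled with NA
completed <- claims %>%
  complete(year, qtr)

# fill(): last observation carried forward, top to bottom
carried <- completed %>%
  fill(paid)
```

In this sketch fill() carries values across all rows in order; whether that is appropriate depends on the data, and grouping first (for example with group_by()) restricts the carry-forward to within each group.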
Examples:
General insurance (policy limits); life insurance (age groups of mortality data)
Some references on handling high-cardinality features:
Generate questions about your data
Search for answers by visualising, transforming, and modelling your data
Use what you learn to refine your questions and/or generate new ones
We combine dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions
Two types of questions are especially useful for making discoveries in your data:
Adapted from Wickham, Çetinkaya-Rundel, and Grolemund (2023), see Chapter 10 of R for Data Science
The diamonds dataset (see ?diamonds for details)
See also: Covariation
Subset the data (using filter()), and use a smaller binwidth of 0.1.
Compare the distributions across cut using geom_freqpoly()
See also: Covariation
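The two plots described above can be sketched as follows. The filter threshold `carat < 3` is an assumption borrowed from the R for Data Science example rather than something stated on this slide, and the object names are illustrative.

```r
library(dplyr)
library(ggplot2)

# Keep only the smaller diamonds (threshold is an assumption)
smaller <- diamonds %>%
  filter(carat < 3)

# Histogram of carat with a binwidth of 0.1
p_hist <- ggplot(smaller, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

# Overlaid frequency polygons, one line per level of cut
p_freq <- ggplot(smaller, aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

# The counts behind the histogram can also be computed directly
bin_counts <- smaller %>%
  count(bin = cut_width(carat, 0.1))
```

Printing `p_hist` or `p_freq` draws the plot; `bin_counts` gives the same information as a table, which is handy when exact counts matter.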
Question:
Look at the histogram below. What questions can you ask?
Why are there more diamonds at whole carats and common fractional values?
Why are there more diamonds just to the right of each peak than to the left?
Why are there no diamonds larger than 3 carats?
Use coord_cartesian() with its xlim or ylim arguments to zoom in on an axis range without dropping data. The separate xlim() and ylim() functions work differently: they discard observations outside the limits.
Inspect the unusual values with dplyr.
The y variable measures one of the three dimensions of a diamond (in mm)
Replace unusual values with missing values (NA) using mutate() with ifelse() or case_when()
ggplot2 does not include missing values in plots, but it will display a warning that they have been removed
geom_freqpoly() (e.g. see frequency polygon example)
Diagram illustrating how a boxplot is constructed (Source: R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023))
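The steps above for finding and handling unusual values can be sketched as follows. The binwidth, the zoom range and the cut-offs `y < 3` and `y > 20` are assumptions taken from the R for Data Science example; suitable values depend on the data at hand.

```r
library(dplyr)
library(ggplot2)

# Zoom in on small counts so rare values become visible;
# coord_cartesian() changes the view but keeps all observations
p_zoom <- ggplot(diamonds, aes(x = y)) +
  geom_histogram(binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

# Pull out the unusual observations for inspection (cut-offs are assumptions)
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)

# Replace unusual values with NA instead of dropping whole rows
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
```

Replacing with NA keeps the other measurements of each diamond available, at the cost of a warning from ggplot2 when the missing values are removed from a plot.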
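Compare the distributions across cut. What do you observe?
Consider the mpg dataset. We want to understand how highway mileage (hwy) varies across vehicle classes (class)
Reorder class by the median of hwy using reorder(..., FUN = median)
geom_boxplot() may be clearer when flipped using coord_flip()
Two categorical variables: count the combinations with geom_count()
Alternatively, compute the counts with dplyr
and visualise them using geom_tile() with the fill aesthetic
Two continuous variables: draw a scatterplot with geom_point()
Use the alpha aesthetic to add transparency
Use geom_bin2d() or geom_hex() to bin observations in two dimensions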
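The covariation plots listed above can be sketched as follows, using the `mpg` and `diamonds` datasets from ggplot2. Object names and the transparency level are illustrative choices; geom_hex() is left out of the sketch because it needs the additional hexbin package.

```r
library(dplyr)
library(ggplot2)

# Categorical vs continuous: boxplots of hwy by class,
# ordered by median hwy and flipped for readability
p_box <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  coord_flip()

# Two categorical variables: point size shows the count of each combination
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

# Same information, computed with dplyr and drawn as a heatmap
cut_color <- diamonds %>%
  count(color, cut)

p_tile <- ggplot(cut_color, aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))

# Two continuous variables: transparency reduces overplotting
p_point <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 1 / 100)

# Alternatively, bin the observations in two dimensions
p_bin <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d()
```

With more than fifty thousand diamonds, the binned plot is usually easier to read than the raw scatterplot, since colour encodes how many points fall in each cell.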