Lecture: Data Quality Check and Cleaning

Actuarial Data Science - Open Learning Resource

Author

Fei Huang, UNSW Sydney

Tibbles

We work with “tibbles” instead of R’s traditional data.frame in the tidyverse environment.
The tibble package provides opinionated data frames that make working in the tidyverse easier.
See vignette("tibble") for more information.

This lecture focuses on data quality: how to store your data in a tidy, convenient format, and how to identify and fix common data issues before they silently undermine your analysis. The aim is that by the end, you feel more confident trusting the datasets you work with, or at least knowing when you should not trust them yet.

#install.packages("tidyverse")
library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.4     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.4     v tibble    3.2.1
v lubridate 1.9.3     v tidyr     1.3.0
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Creating Tibbles

Convert a data frame to a tibble: as_tibble()
Create tibbles: tibble()
Use non-syntactic names with backticks
tribble(): create tibbles in a transposed format

Example: Converting to a Tibble

as_tibble(iris)

Example: Creating a Tibble

tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)

Example: Creating Tibbles with Special Names

tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)

Pipe Data using `%>%`

Use %>% to emphasise a sequence of actions rather than the object being acted on
Pronounce %>% as “then” when reading code
No need to create intermediate objects
%>% should always have a space before it and is usually followed by a new line.

iris %>%
  group_by(Species) %>%
  summarise(
    Sepal.Length = mean(Sepal.Length),
    Sepal.Width = mean(Sepal.Width),
    Species = n_distinct(Species)
  )

Tibble vs. Data Frame

tibble() does less:
- it never changes the type of the inputs (e.g. it does not convert strings to factors)
- it never changes variable names
- it never creates row names
Printing
- tibbles show only the first 10 rows
- each column reports its type
- use print() to display more rows (n) and columns (width)
Subsetting
- extract a single variable using $ (by name) or [[ ]] (by name or position)
- within a pipe %>%, use the special placeholder .

Examples: Extracting Variables from a Tibble

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
# Extract by name
df$x

[1] 0.3914372 0.1394484 0.3303444 0.1322754 0.1065543

df[["x"]]

[1] 0.3914372 0.1394484 0.3303444 0.1322754 0.1065543

# Extract by position
df[[1]]

[1] 0.3914372 0.1394484 0.3303444 0.1322754 0.1065543

df %>% .$x

[1] 0.3914372 0.1394484 0.3303444 0.1322754 0.1065543

df %>% .[["x"]]

[1] 0.3914372 0.1394484 0.3303444 0.1322754 0.1065543

Import Data

Read Introduction to Data Science, Chapter 5
Read R for Data Science, Chapter 7

Tidy Data

The same data can be organised in different ways.
Tidy data is easier to work with.
There are three interrelated rules that make a dataset tidy:
- Each variable has its own column
- Each observation has its own row
- Each value has its own cell
Practical guidelines:
- Store each dataset as a tibble
- Store each variable in its own column
dplyr, ggplot2, and other tidyverse packages are designed to work with tidy data

library(tidyverse)

Pivoting: Longer

A common problem occurs when column names represent values rather than variables.
Example: table4a
- Column names (1999, 2000) represent values of the year variable
- Values in these columns represent values of the cases variable
- Each row represents two observations, not one

#table4a

table4a %>% 
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

Pivoting: Wider

pivot_wider() is the opposite of pivot_longer()
Use it when a single observation is spread across multiple rows.
Example: table2
- Each observation is a country in a given year
- Data for each observation is spread across multiple rows

table2 %>%
    pivot_wider(names_from = type, values_from = count)

Separating

table3 has a different problem: one column (rate) contains two variables (cases and population).
separate() splits one column into multiple columns, by splitting wherever a separator character appears.
By default, separate() splits values at non-alphanumeric characters (i.e. characters that are not numbers or letters).
Use the sep argument to specify a custom separator.

 table3

Example: Using `separate()`

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")

#table3 %>% 
#  separate(rate, into = c("cases", "population"), sep = "/", convert=TRUE)

#table3 %>% 
#  separate(year, into = c("century", "year"), sep = 2)

Unite

unite() is the inverse of separate(): it combines multiple columns into a single column
By default, it places an underscore (_) between values from different columns
If no separator is desired, use sep = ""

table5 %>% 
  unite(new, century, year)

#table5 %>% 
#  unite(new, century, year, sep = "")

Missing Values

A value can be missing in one of two ways:
- Explicitly: represented by NA – the presence of an absence
- Implicitly: not recorded in the dataset – the absence of a presence

  stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

Making Implicit Missing Values Explicit

Use pivot_wider()
Use complete()

stocks %>% 
  pivot_wider(names_from = year, values_from = return)

#stocks %>% 
#  complete(year, qtr)

Making Explicit Missing Values Implicit

Set values_drop_na = TRUE in pivot_longer() to convert explicit missing values into implicit ones

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(
    cols = c(`2015`, `2016`), 
    names_to = "year", 
    values_to = "return", 
    values_drop_na = TRUE
  )

Filling Missing Values with `fill()`

fill() replaces missing values with the most recent non-missing value (also known as last observation carried forward).

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

treatment %>% 
  fill(person)

Assessing Data Quality

Common Data Issues

Missing data
Irregular data and outliers
Uninformative data
Censored and truncated data
High cardinality features
Imbalanced data

Diagnosing Missing Data

Descriptive statistics
Visualisation (e.g. plots)
See also: Missing values in tidy data; and Missing values in EDA

Missing Data

Understand why values are missing
Distinguish between structurally missing values and other types
- e.g. the number of children a man has given birth to
Informative missingness: the pattern of missing data is related to the outcome
- e.g. in a drug study, side effects are so severe that patients drop out

Handling Missing Values

Include a missing indicator (dummy variable)
- useful when the pattern of missingness is informative
Some models can handle missing data (e.g. tree-based methods)
Many models cannot handle missing values
- linear models, neural networks, SVMs
Remove observations or variables as a last resort
- may be feasible for large dataset
Impute missing values

Imputing Missing Values

Use information from other predictors in the training data to estimate missing values
- simple methods: mean, median or mode
- model-based methods: e.g. k-nearest neighbours
Imputation is extensively studied in the statistical literature for inference, but is often less critical in predictive modelling

Irregular Data and Outliers

Detection:
- Descriptive statistics
- Visualisation (e.g. boxplots, scatter plots)
- Outlier detection methods
Handling:
- Data validation: ensure there are no recording errors
- Remove or adjust values
- Outliers may belong to a different population
- Use models that are robust to outliers (e.g. tree-based methods, SVMs)
- Apply transformations to reduce the impact of outliers, such as the spatial sign transformation
  - each observation is scaled by its norm

Example: Outliers

Uninformative Data

Repetitive data
Duplicate observations
Irrelevant variables
Collinearity: two or more predictor variables are highly correlated
- Redundant predictors increase model complexity
- Can lead to unstable models, numerical issues, and degraded predictive performance
- For linear regression, measures such as the variance inflation factor (VIF) can be used to identify predictors that are highly correlated with others

Censored data

The value of an observation is only partially known
For interpretation or inference
- Typically handled using formal methods with assumptions about the censoring mechanism
For prediction
- Often treated as missing data or the censored value is used as observed

Examples:

General insurance (policy limits); life insurance (age groups of mortality data)

High-Cardinality Features

Categorical predictors with many unique levels
Examples: postcodes, medical condition codes, or similar variables

Some references on handling high-cardinality features:

Imbalanced Data

Imbalance between classes (e.g. control vs treatment) can cause modelling issues
Construct a balanced training set to improve model performance
- Undersampling: reduce the number of observations in the majority class
- Oversampling: generate additional observations in the minority class

Data Validation

Validate data against external sources and previous datasets
Consult domain experts or data providers to assess data quality

Exploratory Data Analysis (EDA)

Exploratory Data Analysis

Generate questions about your data
Search for answers by visualising, transforming, and modelling your data
Use what you learn to refine your questions and/or generate new ones
We combine dplyr and ggplot2 to interactively ask questions, answer them with data, and then ask new questions
Two types of questions are especially useful for making discoveries in your data:
- What type of variation occurs within variables?
- What type of covariation occurs between variables?

library(tidyverse)

Adapted from Wickham, Çetinkaya-Rundel, and Grolemund (2023), see Chapter 10 of R for Data Science

Variation

Variation is the tendency of the values of a variable to change from measurement to measurement.
Variables:
- Continuous variable: can take any value within a range of possible values
- Categorical variable: takes one of a fixed set of values

Visualising Distributios: Categorical Variables

To examine the distribution of a categorical variable, use a bar chart
Data: diamonds (see ?diamonds for details)

Visualising Distributions: Continuous Variables

To examine the distribution of a continuous variable, use a histogram

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

#diamonds %>% 
#  count(cut_width(carat, 0.5))

Exercise: Histogram Bin Widths

Plot a histogram of diamonds with size less than 3 carats (using filter()), and use a smaller binwidth of 0.1.

smaller <- diamonds %>% 
  filter(carat < 3)
  
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

Exercise: Frequency Polygons by Group

Overlay multiple histograms in the same plot by cut using geom_freqpoly()

ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

Typical Values

In both bar charts and histograms, tall bars show common values of a variable, while shorter bars show less common values. Gaps indicate values that do not appear in the data.
To turn this information into useful questions, look for anything unexpected:
- Which values are most common? Why?
- Which values are rare? Why? Does that match your expectations?
- Can you see any unusual patterns? What might explain them?

Example: Interpreting a Histogram

Question:

Look at the histogram below. What questions can you ask?

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

Example: Discussion

Why are there more diamonds at whole carats and common fractional values?
Why are there more diamonds just to the right of each peak than to the left?
Why are there no diamonds larger than 3 carats?

Unusual Values (Outliers)

Outliers are observations that are unusual; data points that do not seem to fit the overall pattern
Sometimes outliers are data entry errors; other times they may reveal important insights
When you have a large dataset, outliers can be difficult to see in a histogram

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

Visualising Outliers

To make unusual values easier to see, we can zoom in using coord_cartesian(), often together with ylim() or xlim(), to restrict the axis ranges.

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

Display Unusual Values

Extract them using dplyr.
What questions might you ask about these observations?

unusual <- diamonds %>% 
  filter(y < 3 | y > 20) %>% 
  select(price, x, y, z) %>%
  arrange(y)
unusual

Deal with Outliers: Example

In the ‘Diamond’ example, the y variable measures one of the three dimensions of a diamond (in mm)
Diamonds cannot have a width of 0 mm, so these values must be incorrect
We might also suspect that measurements of 32mm and 59mm are implausible: such diamonds would be over an inch long, but do not cost hundreds of thousands of dollars

Deal with Outliers

Repeat your analysis with and without the outliers.
If they have minimal impact on the results and the cause is unclear, it is reasonable to replace them with missing values and proceed
If they have a substantial impact on your results, do not remove them without justification
- Investigate the cause (e.g. data entry errors)
- Clearly document any decisions to remove them in your analysis

Missing Values

If you encounter unusual values in your dataset and want to proceed with your analysis, you have two options:
- Drop the entire row containing the unusual values (not recommended — why?)

diamonds2 <- diamonds %>% 
  filter(between(y, 3, 20))
#diamonds2

Replace the unusual values with missing values (NA) using mutate() with ifelse() or case_when()

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
#diamonds2

ggplot2 does not include missing values in plots, but it will display a warning that they have been removed

Covariation

If variation describes the behaviour within a variable, covariation describes behavior between variables.
Covariation is the tendency for two or more variables to vary together in a related way
The best way to identify covariation is to visualise relationships between variables.
We consider three common cases:
- A categorical and a continuous variable
- Two categorical variables
- Two continuous variables

A Categorical and a Continuous Variable

Explore the distribution of a continuous variable across levels of a categorical variable
Use geom_freqpoly() (e.g. see frequency polygon example)
It can be difficult to compare distributions when overall counts differ substantially (e.g. see distribution of cut example).
Instead of displaying counts, we can display density, where the area under each curve is standardised to one

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
i Please use `after_stat(density)` instead.

Boxplot

Another way to display the distribution of a continuous variable across categories is a boxplot.

Example: Diamond Prices by Cut

Examine the distribution of diamond prices by cut. What do you observe?

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

This suggests the counterintuitive finding that higher-quality diamonds are cheaper on average. Why might this be the case?

Example: Highway Mileage by Vehicle Class

Consider the mpg dataset. We want to understand how highway mileage (hwy) varies across vehicle classes (class)
To make patterns easier to see, reorder class by the median of hwy using reorder(..., FUN = median)
For long variable names, geom_boxplot() may be clearer when flipped using coord_flip()

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) #+

#  coord_flip()

Two Categorical Variables

To visualise covariation between categorical variables, count the number of observations for each combination.
One approach is to use geom_count()

ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

Example: Computing Counts with dplyr

Another approach is to compute counts using dplyr

diamonds %>% 
  count(color, cut)

Example: Visualising Counts with geom_tile()

Then visualise the counts using geom_tile() with the fill aesthetic

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

Two Continuous Variables

A common way to visualise the covariation between two continuous variables is a scatter plot using geom_point()
Covariation appears as patterns in the points
Example: visualise the relationship between carat size and price of diamonds

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

Other Ways to Visualise the Relationship

Use the alpha aesthetic to add transparency
Use geom_bin2d() or geom_hex() to bin observations in two dimensions
Bin one continuous variable so that it behaves like a categorical variable

Example: Add Transparency

ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)

From Data Patterns to Models

Patterns in your data provide clues about relationships
Models are tools for extracting and formalising these patterns

References

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. O’Reilly Media.

Recommended Reading

Tibbles

Tibbles

Creating Tibbles

Example: Converting to a Tibble

Example: Creating a Tibble

Example: Creating Tibbles with Special Names

Pipe Data using %>%

Tibble vs. Data Frame

Examples: Extracting Variables from a Tibble

Import Data

Import Data

Tidy Data

Tidy Data

Pivoting: Longer

Pivoting: Wider

Separating

Example: Using separate()

Unite

Missing Values

Making Implicit Missing Values Explicit

Making Explicit Missing Values Implicit

Filling Missing Values with fill()

Assessing Data Quality

Common Data Issues

Diagnosing Missing Data

Missing Data

Handling Missing Values

Imputing Missing Values

Irregular Data and Outliers

Uninformative Data

Censored data

High-Cardinality Features

Imbalanced Data

Data Validation

Exploratory Data Analysis (EDA)

Exploratory Data Analysis

Variation

Visualising Distributios: Categorical Variables

Visualising Distributions: Continuous Variables

Exercise: Histogram Bin Widths

Exercise: Frequency Polygons by Group

Typical Values

Example: Interpreting a Histogram

Example: Discussion

Unusual Values (Outliers)

Visualising Outliers

Display Unusual Values

Deal with Outliers: Example

Deal with Outliers

Missing Values

Covariation

A Categorical and a Continuous Variable

Boxplot

Example: Diamond Prices by Cut

Example: Highway Mileage by Vehicle Class

Two Categorical Variables

Example: Computing Counts with dplyr

Example: Visualising Counts with geom_tile()

Two Continuous Variables

Other Ways to Visualise the Relationship

Example: Add Transparency

From Data Patterns to Models

References

Pipe Data using `%>%`

Example: Using `separate()`

Filling Missing Values with `fill()`