Lecture: Data Visualisation

Actuarial Data Science - Open Learning Resource

Fei Huang, UNSW Sydney

Learning Objectives

  • The goal of this lecture is to help you become comfortable “talking to your data”: asking simple visual questions, checking for problems, and building intuition before fitting formal models
  • You should come away with a small toolkit of plots and ggplot2 patterns that you can reuse in many different actuarial and business contexts

Learning Objectives (continued)

  • Understand how to perform exploratory data analysis using the tidyverse package in R
  • Explore features of the data using visualisation
  • Identify common data issues, including:
    • missing values, unreliable/non-validated data, outliers, high-cardinality features
  • Apply appropriate methods to address common data problems
  • Assess data quality using a range of techniques
  • Use R for data manipulation (e.g. filter, merge, sort, group, summarise)
  • Import, export and tidy data in R
  • Use R Markdown for communication and reproducible analysis

Exploratory Data Analysis: An Introduction

Exploratory Data Analysis (EDA)

  • A set of procedures to produce descriptive and graphical summaries of data
  • Explore the data as they are, without making strong modelling assumptions
  • Examine the data to understand relationships among variables
  • Detect potential problems in the dataset (e.g. outliers, missingness, miscoding)
  • Check whether your question can realistically be answered using the available data
  • Develop a first, rough sketch of the answer to your question, which can then be refined with more formal models
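
As a minimal sketch of these checks, the code below summarises the mpg data frame that ships with ggplot2 (introduced later in this lecture), counts missing values per column, and counts distinct values per column. The object names na_counts and n_levels are illustrative only.

```r
library(ggplot2) # provides the mpg data frame

# Descriptive summary of every column, with no modelling assumptions
summary(mpg)

# Count missing values in each column to detect missingness
na_counts <- colSums(is.na(mpg))
na_counts

# Count distinct values per column; large counts flag high-cardinality features
n_levels <- sapply(mpg, function(col) length(unique(col)))
n_levels
```

The same three steps can be a reasonable first pass over most tabular datasets before any plotting.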

The Process of Exploratory Data Analysis

EDA is an iterative cycle. You:

  1. Formulate your question
  2. Search for answers by
    • Collecting and importing data
    • Checking data quality and cleaning data
    • Manipulating and transforming data
    • Visualising data
  3. Use what you learn to refine your questions and/or generate new questions

Data Visualisation is arguably the most important tool for EDA

Seeing Patterns in Data

  • What patterns can you see in this plot?

Seeing Patterns in Data (continued)

  • What patterns can you see in this plot?

Data Visualisation

“Visually attractive graphics also gather their power from content and interpretations beyond the immediate display of some numbers. The best graphics are about the useful and important, about life and death, about the universe. Beautiful graphics do not traffic with the trivial.”

— Edward Tufte, The Visual Display of Quantitative Information

Data Visualisation using ggplot2

Explore Features of Data using Visualisation

  • A statistical graphic maps variables from
    1. a dataset to
    2. aesthetic properties of
    3. geometric objects.
  • ggplot2 is part of the tidyverse
  • Use ggplot2 to visualise your data
  • See also: A ggplot2 grammar guide

Loading ggplot2

#install.packages("tidyverse")
library(tidyverse)
  • Reload the package every time you start a new session
  • Use package::function() to call functions explicitly, e.g. ggplot2::ggplot()
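
A short sketch of the package::function() form: the plot below is built without attaching ggplot2, so every function and the dataset are prefixed explicitly. The object name p is illustrative.

```r
# Call ggplot2 functions and data without library(ggplot2)
p <- ggplot2::ggplot(data = ggplot2::mpg) +
  ggplot2::aes(x = displ, y = hwy) +
  ggplot2::geom_point()
p
```

This form is more verbose, but it makes clear which package each function comes from when several packages are in use.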

A First Visualisation Question: The mpg Dataset

  • Question: Do cars with larger engines use more fuel than cars with small engines?
  • Data: the mpg data frame in ggplot2 (ggplot2::mpg)
    • mpg contains observations collected by the US Environmental Protection Agency on 38 popular models of cars from 1999 to 2008
    • The data frame has 234 rows and 11 variables

A First Visualisation Question: The mpg Dataset (continued)

  • Variables:
    • manufacturer: manufacturer name
    • model: model name
    • displ: engine displacement, in litres
    • year: year of manufacture
    • cyl: number of cylinders
    • trans: type of transmission
    • drv: type of drive train, where f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive
    • cty: city miles per gallon
    • hwy: highway miles per gallon
    • fl: fuel type
    • class: vehicle class

Inspecting the mpg Dataset

mpg
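
Printing mpg shows only the first rows. A few other ways to inspect it, sketched under the assumption that dplyr is installed (it is part of the tidyverse); n_models is an illustrative name:

```r
library(ggplot2) # provides mpg
library(dplyr)   # provides glimpse() and n_distinct()

dim(mpg)     # number of rows and columns
glimpse(mpg) # one line per column: name, type, first values

# Number of distinct car models in the data
n_models <- n_distinct(mpg$model)
n_models
```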

See also: More aesthetic mappings

Creating a Plot

ggplot(data = mpg) + # the dataset
  aes(x = displ) + # the x position
  aes(y = hwy) + # the y position
  geom_point() + # the point geometric shape
  # Adjust the axis titles' font size
  theme(axis.title = element_text(size = 14, face = "bold"))

See also: A complete ggplot “sentence”; Local mpg plot

Declaring Data

Method 1: Declare Data Inside ggplot()

ggplot(data = mpg)

Method 2: Use the Pipe Operator

  • Pipe data into ggplot() using the pipe operator %>%
mpg %>% # data piped into
  ggplot() # initiating plot
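
R 4.1.0 and later also provide a native pipe, |>, which behaves the same way for this simple case. A sketch, assuming a recent R version; p_magrittr and p_native are illustrative names:

```r
library(ggplot2)
library(dplyr) # makes %>% available

p_magrittr <- mpg %>% ggplot() # magrittr pipe
p_native <- mpg |> ggplot()    # native pipe (R >= 4.1.0)
```

Both objects are empty plots that carry the mpg data, ready for aesthetics and geoms to be added with +.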

Exercises: Declaring Data

  1. Run ggplot(data = mpg). What do you see?

  2. How many rows and columns are in mpg?

  3. What does the drv variable describe? Read the help for ?mpg to find out.

  4. Create a scatterplot of hwy vs cyl.

  5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

Aesthetic Mapping

What is an Aesthetic Mapping?

  • An aesthetic is a visual property of the objects in your plot.
  • Aesthetics include properties such as position, size, shape, and colour.
  • Mapping refers to linking variables in the data to visual properties (aesthetics) in the plot.

Common Aesthetic Mappings

A main pool of aesthetics (Source: Wilke (2019), Fundamentals of Data Visualization)

aes() means ‘Ask’

  • aes(): What variables are we asking the aesthetics (colour, position, shape, etc.) to represent?
  • aes(colour = gender): “Please represent the variable gender using different colours.”

Mapping Multiple Aesthetics

See also: The mpg dataset

mpg_plot <- ggplot(data = mpg) + # the dataset
  aes(x = displ) + # the x position
  aes(y = hwy) + # the y position
  geom_point() + # the point geometric shape
  # The aesthetics above are required; those below are optional
  #theme(axis.title=element_text(size=14,face="bold"))+
  aes(colour = class) + # colour for type of car
  #aes(shape = class) +
  # ggplot2 will only use six shapes at a time. By default,
  # additional groups will go unplotted when using 'shape'.
  aes(size = cty) + # size for city miles per gallon
  aes(alpha = year) # transparency for year of manufacture

See also: A complete ggplot “sentence”

Example: Multiple Aesthetic Mappings

print(mpg_plot)

Aesthetic Mappings for Different Geoms

Different geometric objects use different aesthetic properties. For example, column plots often use fill to distinguish groups.

mpg %>% # data piped into
  ggplot() + # initiating plot
  aes(x = class) +  #categorical variable 
  aes(y = hwy) + 
  geom_col() + #Use `geom_col` to create a column geometry
  aes(colour = class) +
  aes(fill = class) + # new aes 'fill'
  aes(linetype = class) #new aes 'linetype'
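
Note that geom_col() draws one bar segment per row, so with the raw mpg data each column above is the stacked total of hwy over all cars in that class. If the mean per class is the quantity of interest, one option is to summarise first. A sketch, where hwy_by_class, mean_hwy and n_cars are illustrative names:

```r
library(ggplot2)
library(dplyr)

# One row per class with the mean highway mileage and the number of cars
hwy_by_class <- mpg %>%
  group_by(class) %>%
  summarise(mean_hwy = mean(hwy), n_cars = n())

# Column heights now show the class mean rather than a stacked total
p <- ggplot(data = hwy_by_class) +
  aes(x = class, y = mean_hwy, fill = class) +
  geom_col()
p
```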

Unmapped Aesthetics

# Create scatter plot with base layer
mpg_plot <- ggplot(data = mpg) + 
  aes(x = displ, y = hwy) + 
  geom_point() + 
  # Add another point layer with fixed (unmapped) aesthetics
  # These aesthetics don't represent variables - they're constant values
  geom_point(
    colour = "plum4",  # Fixed colour for all points in this layer
    size = 8,         # Fixed size for all points
    shape = 21         # Fixed shape 21: an outlined circle (hollow here because no fill is set)
  ) 

Plot: Unmapped Aesthetics

print(mpg_plot)

Exercises: Aesthetic Mappings

  1. Look at the help for geom_text (?geom_text). What are the required aesthetics?
  2. Which variables in mpg are categorical, and which are continuous? (Hint: use ?mpg to read the dataset documentation). How can you identify this when you run mpg?
  3. Map a continuous variable to colour, size, and shape. How do these aesthetics behave differently for categorical versus continuous variables?

ggplot(data = mpg) + # the dataset
  aes(x = displ) + # the x position
  aes(y = hwy) + # the y position
  aes(colour = cty) + # try colour, size and shape in turn
  geom_point() # the point geometric shape

Facets

Using facet_wrap()

  • Facet your plot by a single variable
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  # ~ followed by a discrete variable
  facet_wrap(~ class, nrow = 2) 

Using facet_grid()

  • Facet your plot by the combination of two variables
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  # two variable names separated by a ~
  facet_grid(drv ~ cyl)
  #facet_grid(. ~ cyl) # no faceting in the rows
  #facet_grid(drv ~ .) # no faceting in the columns
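
Some panels in a facet_grid() can be empty because not every combination of the two variables occurs in the data. A sketch of how to check this before plotting, assuming dplyr is loaded; panel_counts is an illustrative name:

```r
library(ggplot2)
library(dplyr)

# Number of cars in each drv-by-cyl combination;
# combinations missing from this table give empty facet panels
panel_counts <- count(mpg, drv, cyl)
panel_counts

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)
p
```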

Geometric Objects

A Complete “Sentence” in ggplot

  • data + aes + geom
  • Plots:
  • Nouns: geometric objects
    • geom_point()
    • geom_col()
    • geom_line()
    • geom_text()
    • geom_segment()
    • geom_smooth()
    • geom_bar()
    • etc.
  • The conditional mood: geom-specific data and aesthetic mapping

Different Geoms

Compare with mpg plot

library(grid)
library(gridExtra)
# Create scatter plot with points
scatter_plot <- ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
# Create smooth line plot with different linetypes by drive type
smooth_plot <- ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv), se = FALSE)
# Arrange both plots side by side
grid.arrange(scatter_plot, smooth_plot, ncol = 2)

Example: Boxplots

# Boxplot with default orientation (vertical)
boxplot_vertical <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()
# Boxplot with flipped axes (horizontal)
boxplot_horizontal <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip() # switches the x and y axes
# Arrange both plots side by side for comparison
grid.arrange(boxplot_vertical, boxplot_horizontal, ncol = 2)
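
An alternative to coord_flip() is to map the categorical variable to y directly, which geom_boxplot() supports in ggplot2 3.3.0 and later (an assumption about the installed version). The sketch below also orders the classes by median hwy using reorder() from base R; the object names are illustrative.

```r
library(ggplot2)

# Map class to y directly instead of calling coord_flip()
boxplot_direct <- ggplot(data = mpg, mapping = aes(x = hwy, y = class)) +
  geom_boxplot()

# Order the classes by median hwy to make comparison easier
boxplot_ordered <- ggplot(
  data = mpg,
  mapping = aes(x = hwy, y = reorder(class, hwy, FUN = median))
) +
  geom_boxplot() +
  labs(y = "class") # replace the default label, which shows the reorder() call

boxplot_direct
boxplot_ordered
```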

Local Data and Aesthetic Mappings

# Create base scatter plot with global aesthetics
mpg_plot <- ggplot(data = mpg) + 
  aes(x = displ, y = hwy) +
  geom_point() +
  aes(colour = class) + 
  # xend and yend are required for geom_segment
  # Set endpoints to origin (0, 0) for all segments
  aes(xend = 0, yend = 0) +
  # Add segments layer with local data and aesthetics
  # geom_segment() draws a straight line between points (x, y) and (xend, yend)
  geom_segment(
    # Use local data: only cars with premium fuel (fl == "p")
    data = subset(mpg, fl == "p"), 
    # Local aesthetics: size mapped to cylinders, alpha to city mpg
    aes(size = cyl, alpha = cty), 
    colour = "orange"  # Fixed colour for all segments
  )

Result: Local Data and Aesthetic Mappings

print(mpg_plot)
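
In ggplot2 3.4.0 and later, the thickness of lines is controlled by the linewidth aesthetic, and using size for a segment produces a deprecation message. A sketch of the same idea under that version assumption, with all segment aesthetics kept local to the segment layer; premium and p are illustrative names:

```r
library(ggplot2)

premium <- subset(mpg, fl == "p") # cars that use premium fuel

p <- ggplot(data = mpg) +
  aes(x = displ, y = hwy) +
  geom_point(aes(colour = class)) + # colour is local to the points
  geom_segment(
    data = premium, # local data for this layer only
    aes(xend = 0, yend = 0, linewidth = cyl, alpha = cty),
    colour = "orange" # fixed colour for all segments
  )
p
```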

Example: Local Aesthetic Mappings

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + #global aes
  geom_point(mapping = aes(colour = class)) +  #local aes
  #local data and aes
  geom_smooth(data = filter(mpg, class == "subcompact"), 
              se = TRUE) # se = TRUE draws a confidence band around the smooth

Annotation: Highlighting a Point

# Create base scatter plot
mpg_plot <- ggplot(data = mpg) + 
  aes(x = displ, y = hwy) + 
  geom_point() + 
  # Add annotation: highlight a specific point
  annotate(geom = "point",
           x = 3.1,      # x-coordinate of highlighted point
           y = 27,       # y-coordinate of highlighted point
           colour = "red")

Result: Highlighting a Point

print(mpg_plot)
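
Instead of typing coordinates by hand, the points to highlight can be selected from the data and drawn as a second layer. A sketch, assuming dplyr is loaded; best_hwy is an illustrative name and the choice of "highest highway mileage" is only an example criterion.

```r
library(ggplot2)
library(dplyr)

# Rows to highlight: the car(s) with the highest highway mileage
best_hwy <- filter(mpg, hwy == max(hwy))

p <- ggplot(data = mpg) +
  aes(x = displ, y = hwy) +
  geom_point() +
  # Second point layer with local data and fixed (unmapped) aesthetics
  geom_point(data = best_hwy, colour = "red", size = 3)
p
```

Because the highlighted rows come from a filter, the plot stays correct if the underlying data change.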

Annotation: Adding Text

# Create base scatter plot
mpg_plot <- ggplot(data = mpg) + 
  aes(x = displ, y = hwy) + 
  geom_point() + 
  # Add highlighted point annotation
  annotate(geom = "point",
           x = 3.1, y = 27,
           colour = "red") +
  # Add text annotations at multiple locations
  annotate(geom = "text",
           x = c(2, 4, 6),  # Multiple x-coordinates
           y = 40,          # Single y-coordinate (text appears at same height)
           label = "Hello",
           colour = "blue")

Result: Adding Text

print(mpg_plot)

Annotation: Adding a Curved Arrow

# Create base scatter plot
mpg_plot <- ggplot(data = mpg) + 
  aes(x = displ, y = hwy) + 
  geom_point() + 
  # Highlight a specific point
  annotate(geom = "point",
           x = 3.1, y = 27,
           colour = "red") +
  # Add text labels
  annotate(geom = "text",
           x = c(2, 4, 6), y = 40,
           label = "Hello",
           colour = "blue") +
  # Add curved arrow connecting text to point
  annotate(geom = "curve",
           x = 2, y = 39,        # Starting point of arrow
           xend = 3, yend = 27.3, # Ending point of arrow
           colour = "green",
           arrow = arrow(angle = 20))  # Arrow head angle

Result: Adding a Curved Arrow

print(mpg_plot)

Annotation: Adding Reference Lines

  • Use geom_abline(), geom_hline(), and geom_vline() to add reference lines
# Create base scatter plot
mpg_plot <- ggplot(data = mpg) + 
  aes(x = displ, y = hwy) + 
  geom_point() + 
  # Add reference lines
  geom_abline(slope = 5, intercept = 3) +  # Diagonal line: y = 5x + 3
  geom_hline(yintercept = 30,                # Horizontal line at y = 30
             linetype = "dotted", 
             colour = "blue") +
  geom_vline(xintercept = c(4, 5),          # Vertical lines at x = 4 and x = 5
             linetype = "dashed", 
             colour = "red")

Result: Adding Reference Lines

print(mpg_plot)
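
Reference lines are often more informative when their positions are computed from the data. A sketch that marks the mean of hwy and the median of displ; mean_hwy and median_displ are illustrative names.

```r
library(ggplot2)

mean_hwy <- mean(mpg$hwy)         # average highway mileage
median_displ <- median(mpg$displ) # median engine displacement

p <- ggplot(data = mpg) +
  aes(x = displ, y = hwy) +
  geom_point() +
  geom_hline(yintercept = mean_hwy, linetype = "dotted", colour = "blue") +
  geom_vline(xintercept = median_displ, linetype = "dashed", colour = "red")
p
```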

Exercises: Aesthetics and Annotations

  1. What has gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, 
                           y = hwy, colour = "blue"))
  2. What happens if you map an aesthetic to something other than a variable name, e.g. aes(colour = displ < 5)? Note: you will also need to specify x and y.

Interactive Data Visualisation (Optional Extension)

  • R package: Shiny
  • Can host standalone apps on a webpage
  • Can embed apps in R Markdown documents or build dashboards
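
A minimal sketch of a standalone Shiny app built around the mpg scatterplot, assuming the shiny package is installed. The input and output IDs (colour_var, mpg_scatter) and the list of choices are hypothetical names chosen for illustration.

```r
library(shiny)
library(ggplot2)

# User interface: a dropdown to choose the colour variable, and a plot
ui <- fluidPage(
  selectInput("colour_var", "Colour points by:",
              choices = c("class", "drv", "fl")),
  plotOutput("mpg_scatter")
)

# Server: redraw the scatterplot whenever the selection changes
server <- function(input, output) {
  output$mpg_scatter <- renderPlot({
    ggplot(data = mpg) +
      aes(x = displ, y = hwy, colour = .data[[input$colour_var]]) +
      geom_point() +
      labs(colour = input$colour_var)
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch the app only when running interactively
if (interactive()) runApp(app)
```

The .data[[...]] pronoun is used because the column name arrives as a string from the dropdown rather than as a bare variable name.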

References

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. O’Reilly Media.
Wilke, Claus O. 2019. Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media.