Layered Grammar of Graphics

Matteo Ceccarello

Why a grammar?

  • Express the fundamental principles or rules of an art or science

  • Gain insight into complicated graphics

  • Allow more flexibility and expressiveness

  • Provide a consistent framework to think about graphics

  • Constrain by principled rules rather than API

The components

  • data

  • aesthetic mapping

  • geometric objects

  • scales

  • statistical transformations

  • position adjustments

  • facet specification

  • coordinate system

Data

  • This is the most fundamental part: all other components depend on it.

  • In the discussion we assume we are dealing with tidy data:

      • Variables

      • Observations

      • Values

Aesthetic mappings

An aesthetic is a visual property of the objects in your plot

Examples:

  • Position on the x, y plane

  • Colour

  • Shape

  • Size

Geometric objects

A geom is the geometrical object that a plot uses to represent data

Examples:

  • Points

  • Lines

  • Bars

  • Polygons

Note

The aesthetic mapping associates variables in the data with visual properties of geometric objects.

Scales

A scale controls the mapping from data values to aesthetic values

Data values

idx  category   price
1    shoes        100
2    shoes         70
3    computers   1000
4    trousers      80

Aesthetic values

Scale
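A minimal sketch of a scale at work, using the small table from this slide. The colour choices and the names sales and category_colours are hypothetical, picked only for illustration; layer_data() is used to look at the aesthetic values the scale produces.

```r
library(ggplot2)

# The table from this slide, typed in by hand
sales <- data.frame(
  idx      = 1:4,
  category = c("shoes", "shoes", "computers", "trousers"),
  price    = c(100, 70, 1000, 80)
)

# Hypothetical colour choices: the scale decides which colour each category gets
category_colours <- c(
  shoes     = "orange",
  computers = "steelblue",
  trousers  = "darkgreen"
)

p <- ggplot(sales, aes(x = idx, y = price, colour = category)) +
  geom_point(size = 3) +
  scale_colour_manual(values = category_colours)

# Data values on the way in, aesthetic values on the way out
mapped <- layer_data(p)
mapped[, c("x", "y", "colour")]
```

Both rows with category shoes end up with the same colour, because the scale maps data values (not rows) to aesthetic values.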

Scales

One can of course have multiple mappings at the same time

Statistical transformations

Transforms the data, typically by summarizing it

Examples:

  • Identity
  • Binning
  • Smoothing
  • Quantile computation
  • Conditional statistics
  • Density estimation
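As a sketch of the binning transformation listed above: the toy values and the break points below are hypothetical, chosen so that the bins can be checked by hand. layer_data() shows the data after the stat has run.

```r
library(ggplot2)

# Hypothetical toy values: six small ones and two large ones
values <- data.frame(x = c(1, 2, 2, 3, 3, 3, 8, 9))

p <- ggplot(values, aes(x = x)) +
  stat_bin(breaks = c(0, 5, 10))   # two bins, split at 5

# The stat replaces the 8 raw rows with one summary row per bin
binned <- layer_data(p)
binned[, c("xmin", "xmax", "count")]
```

The geom never sees the raw observations here, only the per-bin counts computed by the stat.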

Statistical transformations

Summarization

Statistical transformations

Binning

Position adjustments

Adjustment of the position of graphical objects to avoid overplotting

Examples:

  • random jittering
  • dodging
  • stacking

Layers

The combination of

  • data
  • aesthetic mapping
  • geometric object
  • statistical transformation
  • position adjustments

You can stack several layers on top of each other

Facet specification

Create multiple plots with the same layers, each on a different subset of data

var1  var2  x  y
a     Z     2  1.0
a     Z     3  1.2
b     Z     2  3.0
b     Z     2  1.0
a     W     4  1.0
a     W     3  2.0
b     W     2  3.0
b     W     3  1.0
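One possible way to facet the table above, sketched with facet_grid. The choice of var1 for rows and var2 for columns is an assumption; the data frame name toy is hypothetical.

```r
library(ggplot2)

# The table from this slide, typed in by hand
toy <- data.frame(
  var1 = c("a", "a", "b", "b", "a", "a", "b", "b"),
  var2 = c("Z", "Z", "Z", "Z", "W", "W", "W", "W"),
  x    = c(2, 3, 2, 2, 4, 3, 2, 3),
  y    = c(1.0, 1.2, 3.0, 1.0, 1.0, 2.0, 3.0, 1.0)
)

p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(rows = vars(var1), cols = vars(var2))

# Each combination of var1 and var2 gets its own panel with its own subset
n_panels <- length(unique(layer_data(p)$PANEL))
n_panels
```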

Coordinate system

Maps the position of objects onto the plane of the plot

ggplot2 implementation

There is a close correspondence between many of ggplot’s functions and elements of the grammar

  • aesthetic mappings: aes
  • geometric objects: geom_*
  • scales: scale_*
  • statistical transformations: stat_*
  • facet specification: facet_*
  • coordinate system: coord_*

Defining a layer

ggplot() +
  layer(
    data = gapminder,
    mapping = aes(x=gdpPercap, y=lifeExp),
    geom = 'point',
    stat = 'identity',
    position = 'identity'
  ) +
  scale_x_log10()

Why is the scale outside of the layer definition?

Multiple layers

ggplot() +
  layer(
    data = filter(gapminder, year == 1952),
    mapping = aes(x=gdpPercap, y=lifeExp, color=factor(year)),
    geom = 'point',
    stat = 'identity',
    position = 'identity'
  ) +
  layer(
    data = filter(gapminder, year == 2007),
    mapping = aes(x=gdpPercap, y=lifeExp, color=factor(year)),
    geom = 'point',
    stat = 'identity',
    position = 'identity'
  ) +
  scale_x_log10()

Using default values

Oftentimes data and aesthetic mapping are shared across all layers. In such cases, we can provide the default in the ggplot function.

Using specialized functions like geom_* or stat_*, we can use default values for all the other components of a layer

ggplot() +
  layer(
    data = gapminder,
    mapping = aes(x=gdpPercap, y=lifeExp),
    geom = 'point',
    stat = 'identity',
    position = 'identity'
  )

# The same plot, relying on the defaults
ggplot(data = gapminder,
       mapping = aes(x=gdpPercap, y=lifeExp)) +
  geom_point()

Using default values

Each geom has a default stat, each stat has a default geom

ggplot() +
  layer(
    data = gapminder,
    mapping = aes(x=gdpPercap, y=lifeExp),
    geom = 'line',
    stat = 'smooth',
    position = 'identity',
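    params = list(
      method = 'gam',
      color = 'blue',
      linewidth = 1  # line thickness; called `size` before ggplot2 3.4.0
    )
  )

# A similar plot, relying on the defaults
ggplot(data = gapminder,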
       mapping = aes(x=gdpPercap, y=lifeExp)) +
  stat_smooth()
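The defaults can be inspected directly. The sketch below assumes that the layer objects returned by geom_* and stat_* expose their components as the fields stat and geom, which is how ggplot2 stores them internally; treat it as a way to explore, not as a stable interface.

```r
library(ggplot2)

# A layer built by a helper stores the components the helper filled in
point_layer   <- geom_point()
bar_layer     <- geom_bar()
summary_layer <- stat_summary()

# geom_point() pairs with the identity stat, geom_bar() with the count stat
point_uses_identity <- inherits(point_layer$stat, "StatIdentity")
bar_uses_count      <- inherits(bar_layer$stat, "StatCount")

# stat_summary() pairs with the pointrange geom
summary_uses_pointrange <- inherits(summary_layer$geom, "GeomPointrange")

c(point_uses_identity, bar_uses_count, summary_uses_pointrange)
```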

A tour of geometric objects

p + geom_point(aes(
    x=gdpPercap, y=lifeExp, 
    color=continent, size=pop))

p + geom_line(aes(
      x=year, y=pop, 
      color=continent, linetype=continent))

p + geom_col(aes(
      x=continent, y=pop, 
      color=continent))

p + geom_col(aes(
      x=continent, y=pop, 
      fill=continent))

p + geom_ribbon(aes(
      x=year, 
      fill=continent,
      ymax=maxGDP, ymin=minGDP))

p + geom_linerange(aes(
      x=continent, 
      color=continent,
      ymax=maxGDP, ymin=minGDP))

p + geom_pointrange(aes(
      x=continent, 
      y=meanGDP, color=continent,
      ymax=maxGDP, ymin=minGDP))

Statistical transformations: summaries

Sometimes it is easier to express a layer in terms of the statistical transformation

gapminder |>
  filter(year==2007) |>
  ggplot(aes(x=continent, y=gdpPercap)) +
  stat_summary()

The default geometric object of stat_summary is pointrange

Statistical transformations

Usually a stat introduces some new variables that can be mapped to aesthetics.

To know which ones, look at the help pages.

For instance, stat_summary introduces

  • ymin
  • ymax
  • y (overwrites the original y)

which can then be used by geometric objects.
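A small sketch to see these computed variables, on hypothetical toy data (the names toy, g, and v are made up for the example). The summary functions are given explicitly so the result can be checked by hand.

```r
library(ggplot2)

# Hypothetical toy data: two groups of three observations each
toy <- data.frame(
  g = c("a", "a", "a", "b", "b", "b"),
  v = c(1, 2, 6, 10, 20, 60)
)

p <- ggplot(toy, aes(x = g, y = v)) +
  stat_summary(fun = median, fun.min = min, fun.max = max)

# One row per x value; y now holds the median instead of the raw values
summarised <- layer_data(p)
summarised[, c("x", "y", "ymin", "ymax")]
```

Any geom that understands y, ymin, and ymax (pointrange, ribbon, errorbar, ...) can be plugged in through the geom argument.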

Ribbon, revisited

gapminder |> 
  drop_na(gdpPercap) |> 
  filter(continent == "Africa") |>
  ggplot(aes(x=year, y=gdpPercap, fill=continent)) +
  stat_summary(geom='ribbon',
               alpha=0.7)

Ribbon, revisited

gapminder |> 
  drop_na(gdpPercap) |> 
  filter(continent == "Africa") |>
  ggplot(aes(x=year, y=gdpPercap, fill=continent)) +
  stat_summary(geom='ribbon',
               alpha=0.7) +
  stat_summary(geom='ribbon',
               fun.max = max,
               fun.min = min,
               alpha=0.2)

Ribbon, revisited

gapminder |> 
  drop_na(gdpPercap) |> 
  filter(continent == "Africa") |>
  ggplot(aes(x=year, y=gdpPercap, fill=continent)) +
  stat_summary(geom='ribbon',
               alpha=0.7) +
  stat_summary(geom='ribbon',
               fun.max = max,
               fun.min = min,
               alpha=0.2) +
  geom_line(stat='summary')

Exercise: flight delay

Install the nycflights13 package. In the console:

install.packages("nycflights13")

and then import it in your notebook

library(nycflights13)
flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Build a plot showing the median, minimum, and maximum delay for each month

  • Pay attention to missing values (hint: look at the na.rm parameter, or use drop_na)
  • Use stat_summary, along with the geometric object of your choosing

Solution: flight delay

library(nycflights13)
flights |>
  ggplot(aes(x=month, y=dep_delay)) +
  stat_summary(
    fun.max = max,
    fun = median,
    fun.min = min,
    na.rm = TRUE
  )

The group aesthetic

By default, statistical transformations are applied to groups defined by the interaction of all discrete variables in the plot.

For most applications you can simply specify the grouping with various aesthetics (colour, shape, fill, linetype) or with facets.

ggplot(gapminder,
       aes(x=year, y=gdpPercap,
           color=continent)) +
  stat_summary(geom='line')

The group aesthetic

There is also an invisible aesthetic that can be used just for grouping purposes, aptly called group

ggplot(gapminder,
       aes(x=year, y=gdpPercap,
           group=continent)) +
  stat_summary(geom='line')
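To check what the group aesthetic does to a stat, here is a sketch on hypothetical toy data (names and numbers are made up). The summary is computed once per group and per x value.

```r
library(ggplot2)

# Hypothetical toy data: two continents, two years, two countries each
toy <- data.frame(
  year      = rep(c(2000, 2010), times = 4),
  continent = rep(c("Asia", "Europe"), each = 4),
  gdp       = c(1, 2, 3, 4, 10, 20, 30, 40)
)

p <- ggplot(toy, aes(x = year, y = gdp, group = continent)) +
  stat_summary(geom = "line", fun = mean)

# One mean per (continent, year) pair, so two lines with two points each
grouped <- layer_data(p)
grouped[, c("x", "y", "group")]
```

Without group (or another discrete aesthetic), all eight rows would fall into a single group and produce a single line.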

Position adjustments

Dodge

p + geom_col(
  position='dodge'
)

Stack

p + geom_col(
  position='stack'
)

Fill

p + geom_col(
  position='fill'
)
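A self-contained sketch of the fill adjustment, since p is not defined on this slide. The data frame toy and its columns are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: two bars, each made of two parts
toy <- data.frame(
  x    = c("a", "a", "b", "b"),
  part = c("u", "v", "u", "v"),
  y    = c(1, 3, 2, 2)
)

p_fill <- ggplot(toy, aes(x = x, y = y, fill = part)) +
  geom_col(position = "fill")

# With position = "fill", the parts are stacked and rescaled to a total of 1
filled <- layer_data(p_fill)
bar_tops <- tapply(filled$ymax, as.numeric(filled$x), max)
bar_tops
```

Swapping "fill" for "stack" keeps the original totals (4 for each bar here), while "dodge" places the parts side by side.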

Position adjustments

Jitter

p + geom_point()

p + geom_point(
  position="jitter"
)

p + geom_point(
  position=position_jitter(
    width=0.1
  )
)
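A sketch of how position_jitter moves the points, on hypothetical toy data. Setting height = 0 and a seed is a choice made for this example: it keeps the y values untouched and makes the noise reproducible.

```r
library(ggplot2)

# Hypothetical toy data with heavy overplotting: only two distinct x values
toy <- data.frame(
  x = rep(c(1, 2), each = 5),
  y = rep(1:5, times = 2)
)

p_jitter <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(position = position_jitter(width = 0.1, height = 0, seed = 42))

# Compare the drawn positions with the original data
jittered <- layer_data(p_jitter)
x_shift <- jittered$x - toy$x
range(x_shift)             # horizontal noise stays within +/- width
all(jittered$y == toy$y)   # no vertical noise because height = 0
```

Jittering changes where the points are drawn, not the data, so it is best kept small on an axis whose exact values matter.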

Exercise: average flight delay (2)

Build a plot showing the median, min, and max monthly delay with context

  • Extend the plot you prepared before by adding another layer with the raw data points.
  • Consider reducing their size
  • and coloring them "gray".
  • Use position adjustments to reduce overplotting.

Solution: average flight delay (2)

library(nycflights13)
flights |>
  ggplot(aes(x=month, y=dep_delay)) +
  geom_point(size=0.4, color="gray", position=position_jitter(width=0.2), na.rm=TRUE) +
  stat_summary(
    fun.max = max,
    fun = median,
    fun.min = min,
    na.rm = TRUE
  )

Scales

Scales are functions that map from data values to aesthetic values

Scales are invertible functions

Examples:

  • Map data values to pixel positions, and back
  • Map data values to colors, and back
  • Map data values to shapes, and back…
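The first example can be sketched in base R as a pair of functions. The data range, the pixel range, and the function names are hypothetical; ggplot2 does this internally and does not expose functions with these names.

```r
# Hypothetical linear position scale: data range [0, 50] mapped onto 400 pixels
data_range  <- c(0, 50)
pixel_range <- c(0, 400)

# Forward direction: data value to pixel position
to_pixels <- function(v) {
  (v - data_range[1]) / diff(data_range) * diff(pixel_range) + pixel_range[1]
}

# Inverse direction: pixel position back to data value
to_data <- function(px) {
  (px - pixel_range[1]) / diff(pixel_range) * diff(data_range) + data_range[1]
}

to_pixels(25)             # the middle of the data range lands mid-axis
to_data(to_pixels(12.5))  # the inverse recovers the data value
```

The inverse direction is what allows axis ticks and legends to be labelled with data values.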

Scales

Scales are usually linear, but not necessarily

In some cases we can apply a non-linear transformation to improve readability

 [1] "10"             "100"            "1 000"          "10 000"        
 [5] "100 000"        "1 000 000"      "10 000 000"     "100 000 000"   
 [9] "1 000 000 000"  "10 000 000 000"

Linear scale (default):

Logarithmic scale:

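On a logarithmic scale, equal ratios are equally spaced: each step from 10 to 100 to 1 000 covers the same distance. This makes log scales useful for data that spans several orders of magnitude, where a linear scale would squeeze the small values together.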
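The spacing can be checked with a few lines of base R, here on four powers of ten chosen for illustration:

```r
values <- c(10, 100, 1000, 10000)

# Gaps between neighbours on a linear axis: each one is ten times the previous
linear_gaps <- diff(values)

# Gaps between neighbours after a log10 transformation: all the same
log_gaps <- diff(log10(values))

linear_gaps
log_gaps
```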

Scales

Scales can be controlled using functions named scale_<aesthetic>_<type>, such as scale_x_log10 or scale_color_brewer.

ggplot(gapminder,
       aes(x=gdpPercap, 
           y=lifeExp, 
           color=continent)) +
  geom_point() +
  scale_x_log10() +        # scale for x with default parameters for log transformation
  scale_y_continuous() +   # default scale for y (can be omitted)
  scale_color_brewer(      # custom color scale, can take parameters
    type    = 'qual', 
    palette = 'Set2'
  )

Faceting

ggplot(gapminder, aes(x=year, y=gdpPercap, fill=continent)) +
  stat_summary(geom='ribbon',
               alpha=0.6) +
  facet_wrap(vars(continent))

Faceting

gapminder |>
  mutate(decade = factor(floor(year / 10)*10)) |>
  ggplot(aes(x=gdpPercap, y=lifeExp, color=continent)) +
  geom_point() +
  scale_x_log10(labels=scales::dollar) +
  facet_grid(rows = vars(continent),
             cols = vars(decade))

Playing with coordinates

continent_population <- gapminder |>
  filter(year == 2002) |> 
  drop_na(pop) |>
  mutate(pop = as.numeric(pop)) |>
  group_by(continent) |> 
  summarise(pop = sum(pop))

ggplot(continent_population, 
       aes(x=continent, 
           y=pop, 
           fill=continent)) +
  geom_col() +
  coord_cartesian()

Playing with coordinates

ggplot(continent_population, 
       aes(x=continent, 
           y=pop, 
           fill=continent)) +
  geom_col() +
  coord_flip()

Playing with coordinates

ggplot(continent_population, 
       aes(x=continent, 
           y=pop, 
           fill=continent)) +
  geom_col(width=1) +
  coord_polar()

Playing with coordinates

ggplot(continent_population, 
       aes(x="", 
           y=pop,
           fill=continent)) +
  geom_col() +
  coord_polar(theta="y")

Exercise (1)

Imagine you are running several correlation tests of some variables with an effect of interest. You obtain the following data:

read_csv("experiment.csv")
# A tibble: 5 x 4
  var   cat    corr        pval
  <chr> <chr> <dbl>       <dbl>
1 A     F1     0.9  0.000000001
2 B     F2     0.7  0.0000001  
3 C     F1     0.6  0.0001     
4 D     F3     0.22 0.15       
5 E     F3     0.2  0.12       

How can you represent it?

Exercise (1) possible solution

read_csv("experiment.csv") |>
  mutate(
    var = fct_reorder(var, corr),
    significant = pval <= 0.05
  ) |>
  ggplot(aes(x = var, y = corr, fill = cat, alpha = significant, linetype = significant)) +
  geom_col(color = "black") +
  geom_text(aes(label = pval), hjust = 1, nudge_y = -0.01, alpha = 1) +
  scale_alpha_manual(values = c(0.3, 1)) +
  scale_linetype_manual(values = c("dotted", "blank")) +
  guides(alpha = guide_none(), linetype = guide_none()) +
  labs(
    title = "A, B, and C significantly correlate with the effect",
    caption = str_wrap("Correlations of variables with the effect, color coded by family and labelled by p-value. Transparent columns are not significant.")
  ) +
  coord_flip()

Exercise (2)

Consider now an experiment where you compared the distributions of different features from two different sets A and B. The column significant reports whether the difference between the two distributions is significant.

read_csv("comparing_distributions.csv")
# A tibble: 2,200 x 4
   type  feat  value significant
   <chr> <chr> <dbl> <lgl>      
 1 A     F1    0.882 FALSE      
 2 A     F1    0.784 FALSE      
 3 A     F1    0.784 FALSE      
 4 A     F1    0.382 FALSE      
 5 A     F1    0.931 FALSE      
 6 A     F1    0.757 FALSE      
 7 A     F1    0.794 FALSE      
 8 A     F1    0.905 FALSE      
 9 A     F1    0.298 FALSE      
10 A     F1    0.946 FALSE      
# i 2,190 more rows

How would you represent such data?

Exercise (2) solution

read_csv("comparing_distributions.csv") |>
  ggplot(aes(x = value, fill=type, alpha = significant)) +
  geom_density(
    aes(y = after_stat(scaled)),
    data = function(dat) {filter(dat, type == "A")},
    trim = TRUE
  ) +
  geom_density(
    aes(y = -after_stat(scaled)),
    data = function(dat) {filter(dat, type == "B")},
    trim = TRUE
  ) +
  scale_x_continuous(limits=c(0,1)) +
  scale_alpha_manual(values=c(0.1, 0.8)) +
  facet_wrap(vars(feat), ncol=1, strip.position = "left") +
  guides(alpha = guide_none()) +
  theme_bw() +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
  )