idx | category | price |
---|---|---|
1 | shoes | 100 |
2 | shoes | 70 |
3 | computers | 1000 |
4 | trousers | 80 |
Express the fundamental principles or rules of an art or science
Gain insight into complicated graphics
Allow more flexibility and expressiveness
Provide a consistent framework to think about graphics
Constrain by principled rules rather than API
data
aesthetic mapping
geometric objects
scales
statistical transformations
position adjustments
facet specification
coordinate system
This is the most fundamental part: all other components depend on it.
In the discussion we assume we are dealing with tidy data:
Variables: each variable is a column
Observations: each observation is a row
Values: each value is a cell
An aesthetic is a visual property of the objects in your plot
Examples:
Position on the x, y plane
Colour
Shape
Size
A geom is the geometrical object that a plot uses to represent data
Examples:
Points
Lines
Bars
Polygons
Note
The aesthetic mapping associates variables in the data with visual properties of geometric objects.
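
As a minimal sketch, assuming ggplot2 is installed, the small price table shown at the start can be plotted by mapping `category` to the x position and to colour, and `price` to the y position, with points as the geometric object. The object names `prices` and `p` are illustrative only.

```r
library(ggplot2)

# The example table used in these slides
prices <- data.frame(
  idx = 1:4,
  category = c("shoes", "shoes", "computers", "trousers"),
  price = c(100, 70, 1000, 80)
)

# aes() declares the aesthetic mapping; geom_point() chooses the geometric object
p <- ggplot(prices, aes(x = category, y = price, colour = category)) +
  geom_point(size = 3)
p
```

Here `size = 3` is set outside `aes()`, so it is a fixed property of the points rather than a mapping from the data.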
A scale controls the mapping from data values to aesthetic values
Data values
idx | category | price |
---|---|---|
1 | shoes | 100 |
2 | shoes | 70 |
3 | computers | 1000 |
4 | trousers | 80 |
Aesthetic values
Scale
One can of course have multiple mappings at the same time
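
A possible illustration of several mappings, each with its own scale, assuming ggplot2 and the example price table. The colour names and the size range are arbitrary choices made for this sketch.

```r
library(ggplot2)

prices <- data.frame(
  idx = 1:4,
  category = c("shoes", "shoes", "computers", "trousers"),
  price = c(100, 70, 1000, 80)
)

p <- ggplot(prices, aes(x = idx, y = price, colour = category, size = price)) +
  geom_point() +
  # Discrete scale: each category value is assigned one colour
  scale_colour_manual(
    values = c(shoes = "steelblue", computers = "firebrick", trousers = "darkgreen")
  ) +
  # Continuous scale: prices are mapped onto point sizes between 2 and 8
  scale_size_continuous(range = c(2, 8))
p
```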
Transforms the data, typically by summarizing it
Examples:
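
One way to see a statistical transformation at work, assuming ggplot2 and the example price table: `stat_count()` (the default stat of `geom_bar()`) reduces the rows to a count per category, while `stat_summary()` can compute the mean price per category. `layer_data()` shows the data after the transformation.

```r
library(ggplot2)

prices <- data.frame(
  idx = 1:4,
  category = c("shoes", "shoes", "computers", "trousers"),
  price = c(100, 70, 1000, 80)
)

# stat_count: one bar per category, height = number of rows
p_count <- ggplot(prices, aes(x = category)) +
  geom_bar()

# stat_summary: one point per category, y = mean price
p_mean <- ggplot(prices, aes(x = category, y = price)) +
  stat_summary(fun = mean, geom = "point")

layer_data(p_count)[, c("x", "count")]
layer_data(p_mean)[, c("x", "y")]
```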
Adjustment of the position of graphical objects to avoid overplotting
Examples:
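
A small sketch of a position adjustment, assuming ggplot2 and the example price table. With `position = "identity"` the two bars for shoes are drawn on top of each other; with `position = "dodge"` they are placed side by side.

```r
library(ggplot2)

prices <- data.frame(
  idx = 1:4,
  category = c("shoes", "shoes", "computers", "trousers"),
  price = c(100, 70, 1000, 80)
)

base <- ggplot(prices, aes(x = category, y = price, fill = factor(idx)))

# No adjustment: bars sharing an x value overlap
p_overlap <- base + geom_col(position = "identity", alpha = 0.5)

# Dodge: bars sharing an x value are shifted horizontally
p_dodge <- base + geom_col(position = "dodge")
```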
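A layer is the combination of data, aesthetic mapping, geometric object, statistical transformation, and position adjustment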
You can stack several layers on top of each other
Create multiple plots with the same layers, each on a different subset of data
var1 | var2 | x | y |
---|---|---|---|
a | Z | 2 | 1.0 |
a | Z | 3 | 1.2 |
b | Z | 2 | 3.0 |
b | Z | 2 | 1.0 |
a | W | 4 | 1.0 |
a | W | 3 | 2.0 |
b | W | 2 | 3.0 |
b | W | 3 | 1.0 |
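
The table above can be faceted as follows. This is a sketch assuming ggplot2; the choice of `facet_grid()` with `var1` on rows and `var2` on columns is just one option.

```r
library(ggplot2)

df <- data.frame(
  var1 = c("a", "a", "b", "b", "a", "a", "b", "b"),
  var2 = c("Z", "Z", "Z", "Z", "W", "W", "W", "W"),
  x = c(2, 3, 2, 2, 4, 3, 2, 3),
  y = c(1.0, 1.2, 3.0, 1.0, 1.0, 2.0, 3.0, 1.0)
)

# Same layer in every panel, each panel showing one (var1, var2) subset
p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(rows = vars(var1), cols = vars(var2))
p
```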
Maps the position of objects onto the plane of the plot
There is a close correspondence between many of ggplot's functions and elements of the grammar
aes
geom_*
scale_*
stat_*
facet_*
coord_*
Why is the scale outside of the layer definition?
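library(ggplot2)
library(dplyr)      # filter()
library(gapminder)  # gapminder dataset

ggplot() +
  # Layer 1: observations from 1952
  layer(
    data = filter(gapminder, year == 1952),
    mapping = aes(x = gdpPercap, y = lifeExp, color = factor(year)),
    geom = "point",
    stat = "identity",
    position = "identity"
  ) +
  # Layer 2: observations from 2007
  layer(
    data = filter(gapminder, year == 2007),
    mapping = aes(x = gdpPercap, y = lifeExp, color = factor(year)),
    geom = "point",
    stat = "identity",
    position = "identity"
  ) +
  # The scale is specified at the plot level, outside the layers
  scale_x_log10()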
Oftentimes data and aesthetic mapping are shared across all layers. In such cases, we can provide the default in the ggplot function.
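
As a sketch of this idea, the two-layer gapminder plot can be written with the shared aesthetic mapping given once in `ggplot()`, while each layer keeps its own data. It assumes the ggplot2, dplyr, and gapminder packages are installed; the name `p` is illustrative.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# The mapping is given once and inherited by both layers
p <- ggplot(mapping = aes(x = gdpPercap, y = lifeExp, color = factor(year))) +
  layer(
    data = filter(gapminder, year == 1952),
    geom = "point",
    stat = "identity",
    position = "identity"
  ) +
  layer(
    data = filter(gapminder, year == 2007),
    geom = "point",
    stat = "identity",
    position = "identity"
  ) +
  scale_x_log10()
p
```

If the data were also shared, it could be passed as the first argument of `ggplot()` in the same way.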
Using specialized functions like geom_* or stat_*, we can use default values for all the other components of a layer
Each geom has a default stat, each stat has a default geom
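
A short sketch of these defaults, assuming ggplot2 and its bundled `mpg` dataset (a dataset chosen for this example only): `geom_bar()` uses `stat_count()` by default, and `stat_count()` uses the bar geom by default, so both calls describe the same layer.

```r
library(ggplot2)

# Layer described through its geometric object (default stat: count)
p_geom <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# The same layer described through its statistical transformation (default geom: bar)
p_stat <- ggplot(mpg, aes(x = class)) +
  stat_count()
```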
Sometimes it is easier to express a layer in terms of the statistical transformation
The default geometric object of stat_summary is pointrange
Usually a stat introduces some new variables that can be mapped to aesthetics.
To know which ones, look at the help pages.
For instance, stat_summary introduces ymin, ymax, and y (which overwrites the original y); these can then be used by geometric objects.
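
To inspect the variables a stat computes, `layer_data()` can be used. The sketch below assumes ggplot2 and its bundled `mpg` dataset, which is used here purely for illustration.

```r
library(ggplot2)

# Median highway mileage per class, with min and max as the range;
# the default geom (pointrange) uses the computed y, ymin and ymax
p <- ggplot(mpg, aes(x = class, y = hwy)) +
  stat_summary(fun = median, fun.min = min, fun.max = max)

# One row per class, with the computed variables
layer_data(p)[, c("x", "y", "ymin", "ymax")]
```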
Install the nycflights13 package from the console, and then import it in your notebook.
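
The two steps might look like the following sketch. The `requireNamespace()` guard is an addition made here so that the package is only installed when it is missing.

```r
# In the console: install the package once
if (!requireNamespace("nycflights13", quietly = TRUE)) {
  install.packages("nycflights13")
}

# In the notebook: load it and look at the flights table
library(nycflights13)
flights
```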
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# i 336,766 more rows
# i 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Build a plot showing the median, min, and max monthly delay
Handle missing values (with the na.rm parameter, or use drop_na)
Use stat_summary, along with the geometric object of your choosing
The group aesthetic
By default, statistical transformations are applied to groups defined by the interaction of all discrete variables in the plot.
For most applications you can simply specify the grouping with various aesthetics (colour, shape, fill, linetype) or with facets.
There is also an invisible aesthetic that can be used just for grouping purposes, aptly called group.
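
A minimal sketch of the `group` aesthetic, assuming ggplot2 and R's built-in `Orange` dataset (growth of five orange trees, chosen only for this example). Without grouping, all observations are joined into a single line; with `group = Tree`, each tree gets its own line without changing colour or any other visible property.

```r
library(ggplot2)

# No discrete variable is mapped: a single group, one zigzag line
p_single <- ggplot(Orange, aes(x = age, y = circumference)) +
  geom_line()

# group = Tree: one line per tree
p_grouped <- ggplot(Orange, aes(x = age, y = circumference, group = Tree)) +
  geom_line()
```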
Build a plot showing the median, min, and max monthly delay with context
Adjust the size of the context elements
Color them "gray"
Scales are functions that map from data values to aesthetic values
Scales are invertible functions
Examples:
Scales are usually linear, but not necessarily
In some cases we can apply a non-linear transformation to improve readability
[1] "10" "100" "1 000" "10 000"
[5] "100 000" "1 000 000" "10 000 000" "100 000 000"
[9] "1 000 000 000" "10 000 000 000"
Linear scale (default):
Logarithmic scale:
On a logarithmic scale, equal multiples are equally spaced. This makes it suitable for displaying data that spans a very wide range.
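
The spacing can be checked numerically in base R. In this sketch, the vector `x` holds the powers of ten from 10 to 10 000 000 000: on the linear scale the gaps between consecutive values keep growing, while after a log10 transformation every gap is the same.

```r
# Powers of ten from 10^1 to 10^10
x <- 10^(1:10)

# Linear scale: distances between consecutive values grow by a factor of 10
linear_gaps <- diff(x)

# Logarithmic scale: every multiplication by 10 is one unit apart
log_gaps <- diff(log10(x))

linear_gaps
log_gaps
```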
Scales can be controlled using functions named scale_*, where * starts with the name of the relevant aesthetic.
x: with default parameters for log transformation
y: (can be omitted)
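
As an illustration, assuming ggplot2 and a small made-up data frame `df`, the x scale can be set to a log transformation with `scale_x_log10()`, while the y scale is left at the continuous default. Writing `scale_y_continuous()` explicitly, as below, is a choice made for this sketch and has the same effect as omitting it.

```r
library(ggplot2)

# Hypothetical data spanning several orders of magnitude on x
df <- data.frame(x = c(1, 10, 100, 1000), y = 1:4)

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  scale_x_log10() +      # log10 transformation with default parameters
  scale_y_continuous()   # the default for a continuous y, can be omitted
p
```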
Imagine you are running several correlation tests of some variables with an effect of interest. You obtain the following data:
# A tibble: 5 x 4
var cat corr pval
<chr> <chr> <dbl> <dbl>
1 A F1 0.9 0.000000001
2 B F2 0.7 0.0000001
3 C F1 0.6 0.0001
4 D F3 0.22 0.15
5 E F3 0.2 0.12
How can you represent it?
library(tidyverse)  # read_csv, mutate, fct_reorder, str_wrap, ggplot2

read_csv("experiment.csv") |>
  mutate(
    # Order the variables by correlation so the bars are sorted
    var = fct_reorder(var, corr),
    significant = pval <= 0.05
  ) |>
  ggplot(aes(x = var, y = corr, fill = cat, alpha = significant, linetype = significant)) +
  geom_col(color = "black") +
  geom_text(aes(label = pval), hjust = 1, nudge_y = -0.01, alpha = 1) +
  # FALSE (non-significant) comes first: faded fill, dotted outline
  scale_alpha_manual(values = c(0.3, 1)) +
  scale_linetype_manual(values = c("dotted", "blank")) +
  guides(alpha = guide_none(), linetype = guide_none()) +
  labs(
    title = "A, B, and C significantly correlate with the effect",
    caption = str_wrap("Correlations of variables with the effect, color coded by family and labelled by p-value. Transparent columns are non-significant.")
  ) +
  coord_flip()
Consider now an experiment where you compared the distributions of different features from two different sets A and B. The column significant reports if the difference between distributions is significant.
# A tibble: 2,200 x 4
type feat value significant
<chr> <chr> <dbl> <lgl>
1 A F1 0.882 FALSE
2 A F1 0.784 FALSE
3 A F1 0.784 FALSE
4 A F1 0.382 FALSE
5 A F1 0.931 FALSE
6 A F1 0.757 FALSE
7 A F1 0.794 FALSE
8 A F1 0.905 FALSE
9 A F1 0.298 FALSE
10 A F1 0.946 FALSE
# i 2,190 more rows
How would you represent such data?
library(tidyverse)  # read_csv, filter, ggplot2

read_csv("comparing_distributions.csv") |>
  ggplot(aes(x = value, fill = type, alpha = significant)) +
  # Set A: density drawn above the axis
  geom_density(
    aes(y = after_stat(scaled)),
    data = function(dat) {filter(dat, type == "A")},
    trim = TRUE
  ) +
  # Set B: density mirrored below the axis
  geom_density(
    aes(y = -after_stat(scaled)),
    data = function(dat) {filter(dat, type == "B")},
    trim = TRUE
  ) +
  scale_x_continuous(limits = c(0, 1)) +
  scale_alpha_manual(values = c(0.1, 0.8)) +
  facet_wrap(vars(feat), ncol = 1, strip.position = "left") +
  guides(alpha = guide_none()) +
  theme_bw() +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
  )