We can classify data types into two broad categories
Ordered
Categorical
All ordered data has an implicit ordering
Ordinal vs. quantitative
Sequential and diverging
Temporal
Ordinal: there is a meaningful ordering, but we cannot do full-fledged arithmetic
Shirt sizes (XS, S, M, L, XL)
Rankings (1st, 2nd, 3rd, 4th, 5th)
Quantitative: there is a meaningful ordering, and arithmetic operations are supported
Lengths, heights, etc…
Counts
Prices
….
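The difference between ordinal and quantitative data can be sketched in base R. This is a minimal illustration: the shirt sizes follow the example above, while the heights are made-up values.

```r
# Ordinal: shirt sizes as an ordered factor (levels listed from smallest to largest)
sizes <- factor(c("M", "XS", "XL", "S"),
                levels = c("XS", "S", "M", "L", "XL"),
                ordered = TRUE)

# Comparisons and min/max respect the declared order
smaller_than_large <- sizes < "L"
largest <- max(sizes)

# Arithmetic summaries are rejected: sum() on an ordered factor raises an error
sum_result <- tryCatch(sum(sizes), error = function(e) "not defined")

# Quantitative: heights in centimetres (hypothetical values) support full arithmetic
heights_cm <- c(172, 165, 181)
mean_height <- mean(heights_cm)
```

Here `sizes < "L"` is meaningful because the levels are ordered, whereas "M plus S" has no sensible answer.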
sequential: where there is a homogeneous range from a minimum to a maximum value
diverging: can be deconstructed into two sequences pointing in opposite directions that meet at a common zero point
Sequential: height of mountains, depth of seas
Diverging: the elevation of points on the Earth's surface; the pH scale (acidic to alkaline, with the neutral midpoint at 7)
This distinction will be useful mainly for defining meaningful scales
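As a rough sketch of how this distinction carries over to scales, ggplot2 offers both single-ramp and two-ramp color scales. The two data frames below are hypothetical toy data, and the colors are arbitrary choices.

```r
library(ggplot2)

# Hypothetical toy data: mountain heights (sequential) and elevations
# relative to sea level (diverging around zero)
mountains <- data.frame(x = 1:5, height = c(1200, 2500, 3300, 4100, 4800))
terrain <- data.frame(x = 1:6,
                      elevation = c(-4000, -1500, -200, 300, 1800, 4500))

# Sequential scale: a single ramp from the minimum to the maximum
p_sequential <- ggplot(mountains, aes(x = x, y = 1, fill = height)) +
  geom_tile() +
  scale_fill_viridis_c(name = "height (m)")

# Diverging scale: two ramps that meet at the common zero point
p_diverging <- ggplot(terrain, aes(x = x, y = 1, fill = elevation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "brown",
                       midpoint = 0, name = "elevation (m)")
```

Printing `p_sequential` or `p_diverging` draws the corresponding strip of tiles.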
Time is a complex subject, so temporal data requires special care
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST.
The lubridate package provides a series of facilities to deal with dates and times.
It takes care of all the gory details of time handling on computers for you, including time zones, leap years and leap seconds, daylight saving time, etc…
It provides three datatypes
date
time
datetime
Creating dates
Creating datetimes
From components
[1] "2019-10-25"
[1] "2019-10-25 08:10:25 UTC"
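The calls that produced these values are not preserved here, so the following is only a plausible sketch of how such dates and datetimes can be created with lubridate. The input strings are assumptions chosen to give the same date.

```r
library(lubridate)

# Parsing strings: the function name spells out the order of the components
d1 <- ymd("2019-10-25")
d2 <- mdy("10/25/2019")
d3 <- dmy("25.10.2019")

# A string that cannot be parsed yields NA (lubridate also emits a warning)
d4 <- suppressWarnings(ymd("not a date"))

# Datetimes add the time of day; UTC is the default time zone
dt1 <- ymd_hms("2019-10-25 08:10:25")

# From numeric components
d5 <- make_date(2019, 10, 25)
dt2 <- make_datetime(2019, 10, 25, 8, 10, 25)
```

All of these functions are vectorized, so they accept whole columns as well as single values.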
With vectorization
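library(tidyverse)
library(lubridate)
library(nycflights13)  # assumed source of the flights data set

# Average departure delay by day of the week, drawn on a polar axis
flights |>
  drop_na(dep_delay) |>
  mutate(
    date = make_date(year, month, day),
    weekday = wday(date, label = FALSE),    # day of the week as a number
    weekday_str = wday(date, label = TRUE)  # day of the week as a labelled factor
  ) |>
  group_by(weekday, weekday_str) |>
  summarise(dep_delay = mean(dep_delay), .groups = "drop") |>
  ggplot(aes(x = weekday_str, y = dep_delay)) +
  geom_col(width = 1) +
  coord_polar()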
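
# Calendar heat map: one tile per day, weeks on the y axis, one panel per airport
flights |>
  drop_na(dep_delay) |>
  mutate(date = make_date(year, month, day)) |>
  group_by(date, origin) |>
  summarise(dep_delay = mean(dep_delay), .groups = "drop") |>
  mutate(
    weekday = wday(date, label = TRUE),
    # ISO week number; the last days of December can fall in ISO week 1,
    # in which case their tiles overlap those of early January
    weeknum = isoweek(date)
  ) |>
  ggplot(aes(x = weekday, y = weeknum, fill = dep_delay)) +
  geom_tile() +
  scale_fill_continuous(type = "viridis") +
  facet_wrap(vars(origin)) +
  theme_bw() +
  theme(legend.position = "bottom")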

# Same heat map in polar coordinates: weekdays around the circle, weeks along the radius
flights |>
  drop_na(dep_delay) |>
  mutate(date = make_date(year, month, day)) |>
  group_by(date, origin) |>
  summarise(dep_delay = mean(dep_delay), .groups = "drop") |>
  mutate(
    weekday = wday(date, label = TRUE),
    weeknum = isoweek(date)
  ) |>
  ggplot(aes(x = weekday, y = weeknum,
             fill = dep_delay, color = dep_delay)) +
  geom_tile() +
  # One scale drives both fill and outline color, so tile borders match the tiles
  scale_fill_continuous(type = "viridis", aesthetics = c("fill", "color")) +
  facet_wrap(vars(origin)) +
  coord_polar() +
  theme_void() +
  theme(axis.text.x = element_text(),
        strip.text = element_text(size = 14),
        legend.position = "bottom")
Categorical types do not have an implicit ordering.
With categories you can only compare for equality
You can enforce an explicit ordering when needed
forcats is a tidyverse library that helps with working with categorical data, which in R is represented by factors
You can use the factor() function, which is vectorized, to create factors
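A minimal sketch of creating factors with base R's `factor()`. The vector of continent names is hypothetical example data, and the chosen level order is arbitrary.

```r
# A character vector with repeated categories (hypothetical example data)
continents <- c("Asia", "Europe", "Africa", "Asia", "Europe")

# factor() is vectorized: it converts the whole vector at once.
# By default the levels are the sorted unique values.
f <- factor(continents)
default_levels <- levels(f)

# Categories can be compared for equality, element by element
is_asia <- f == "Asia"

# An explicit ordering can be imposed through the levels argument
f_ordered <- factor(continents,
                    levels = c("Europe", "Asia", "Africa"),
                    ordered = TRUE)
asia_after_europe <- f_ordered[1] > f_ordered[2]
```

Without `ordered = TRUE`, the `levels` argument still controls the order used for sorting and plotting, but comparisons such as `>` are not defined.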
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# i 1,694 more rows
What is the difference between factors and strings?
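One way to explore this question is to compare how the same values behave as strings and as a factor. This is a small sketch using month abbreviations as assumed example data; `month.abb` is a constant built into R.

```r
months_chr <- c("Mar", "Jan", "Dec", "Feb")

# Strings sort alphabetically
sorted_chr <- sort(months_chr)

# Factors sort by the order of their levels (here: calendar order)
months_fct <- factor(months_chr, levels = month.abb)
sorted_fct <- sort(months_fct)

# A factor has a fixed set of allowed values: anything else becomes NA
typo <- factor("Jam", levels = month.abb)

# Internally, a factor stores integer codes that index into its levels
codes <- as.integer(months_fct)
```

A string can hold any text, while a factor is restricted to a known set of levels and carries their order along with it.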
Care must be used when selecting the mapping between attributes and aesthetics, as well as when choosing the geometric objects we use.
There are mainly two types of aesthetics
Magnitude aesthetics: ordered data
Identity aesthetics: categorical data
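As an illustration, the sketch below maps ordered attributes to magnitude aesthetics (position, size) and a categorical attribute to an identity aesthetic (hue). It uses the `mpg` data set bundled with ggplot2; the choice of columns is an arbitrary assumption made for the example.

```r
library(ggplot2)

# displ, hwy and cty are quantitative; drv is categorical
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(
    size = cty,    # magnitude aesthetic: encodes an ordered attribute
    colour = drv   # identity aesthetic: distinguishes categories by hue
  ), alpha = 0.6)
```

Swapping the two mappings (hue for `cty`, size for `drv`) would still draw a plot, but size would then suggest an ordering among the categories that the data does not have.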
Expressiveness: what can be encoded by what
Effectiveness: the importance of an attribute should match the salience of the aesthetic that encodes it, i.e. its noticeability
When choosing a mapping
Given a stimulus, how close to its measured value is the human perceptual judgement?
Stevens' psychophysical power law [1975]
Be careful when encoding linear information with an aesthetic that is not perceptually linear
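Stevens' power law is usually written as S = I^n, where I is the physical intensity of the stimulus and S is the perceived sensation. The sketch below shows what the law implies when a value doubles. The exponents are approximate figures commonly quoted for these channels and should be read as assumptions, not measurements.

```r
# Perceived ratio implied by Stevens' power law for a given physical ratio
perceived_ratio <- function(physical_ratio, exponent) {
  physical_ratio^exponent
}

# Approximate exponents (assumed, illustrative values)
exponents <- c(length = 1.0, area = 0.7, brightness = 0.5)

# How much larger does a stimulus look when its physical magnitude doubles?
doubling <- perceived_ratio(2, exponents)
round(doubling, 2)
```

Under these assumptions, only length is perceived linearly: a doubled area or brightness looks less than twice as large, which is why position and length are preferred for quantitative comparisons.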
Weber’s Law: our perceptual system is based on relative judgements, not absolute ones. In other words, our perception is contextual
We look for patterns in what we see:
Munzner: Visualization Analysis and Design, chapters 2 and 5 (in the library)
Ware: Visual Thinking for Design, chapter 2 (in the library)
Wickham: R for Data Science, chapters 15 and 16