RMarkdown is a plain text format (and accompanying tools) that allows you to intersperse text, code, and outputs
Quarto is an open-source scientific and technical publishing system that lets you render notebooks in a variety of formats
The RMarkdown pipeline
https://socviz.co/gettingstarted.html
Data types
R has five basic data types:
character
"hello"
numeric (a.k.a. real numbers, double)
0.82
integer
42L
complex
3.2+5.2i
logical (a.k.a. booleans)
TRUE # same as T
FALSE # same as F
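A quick way to confirm these types is to ask R directly with typeof, which appears again later for vectors. The sketch below reuses the literal values from the list above.

```r
# Each literal from the list above, checked with typeof()
typeof("hello")    # "character"
typeof(0.82)       # "double"
typeof(42L)        # "integer"
typeof(3.2+5.2i)   # "complex"
typeof(TRUE)       # "logical"

# Without the L suffix, a whole number is still stored as a double
typeof(42)         # "double"
```

Note that TRUE and FALSE are reserved words, while T and F are ordinary names that can be reassigned, so the long forms are the safer choice in scripts.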
Names
In R, everything can be given a name
x # this is a valid name
descriptive_name # descriptive names are preferable
# - note the underscore separating the words
# - spaces are not allowed
also.valid # This is also a valid name, using an older and maybe
# confusing naming scheme. If you come from Java/C++/Python/Javascript....
# the . in the middle of the name is *not* the member access operator
These names (and others) are not allowed
FALSE, TRUE, Inf, NA, NaN, NULL, for, if, else, break, function
Some names are best avoided, because they are library functions that you would overwrite
mean, range
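To see why such names are best avoided, here is a small sketch in which a user-defined function takes over the name range. The replacement function is purely illustrative.

```r
# Defining our own `range` hides the base function of the same name
range <- function(x) "not the range you expected"
masked_result <- range(c(1, 5, 3))   # "not the range you expected"

# Removing our definition makes the base function visible again
rm(range)
base_result <- range(c(1, 5, 3))     # 1 5
```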
Binding things to names
Using the “arrow” syntax you can assign names to things
x <- 5 # The `arrow` is the assignment operator
some_string <- "Hello I am a sequence of characters"
Later on, you can retrieve the values by simply referencing the name
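As a minimal sketch of binding and retrieving values (the name y is introduced here only for illustration):

```r
x <- 5
some_string <- "Hello I am a sequence of characters"

# Typing a name at the console prints the value bound to it
x
some_string

# A name can be used wherever the value itself could be used
y <- x + 1   # y is now 6
```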
# NOT
!
# short-circuited AND    short-circuited OR    <- for control flow
&&                       ||
# AND                    OR                    <- for logical operations
&                        |
Boolean values and comparisons
# Boolean and
TRUE & FALSE
[1] FALSE
# Boolean or
TRUE | FALSE
[1] TRUE
# Negation
!TRUE
[1] FALSE
5==3
[1] FALSE
5!=3
[1] TRUE
5>3
[1] TRUE
5<=3
[1] FALSE
!(5 < 3) & (TRUE & (2 * 2 == 4))
[1] TRUE
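The difference between the single and the doubled operators is easiest to see side by side. The following sketch uses c() to build small logical vectors, and the stop() calls are there only to show that they are never reached.

```r
# `&` and `|` work element by element on whole vectors
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)   # TRUE FALSE FALSE

# `&&` and `||` expect single values and stop evaluating as soon as
# the result is known, so the right-hand side below is never run
short_and <- FALSE && stop("this is never evaluated")
short_or  <- TRUE  || stop("this is never evaluated")
```

This is why && and || are the usual choice inside if conditions, while & and | are used to combine vectors of comparisons.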
Using R as a calculator
What does the following comparison return (sqrt gives the square root)?
sqrt(2)^2==2
\[
(\sqrt{2})^2 = 2
\]
[1] FALSE
Numeric data is insidious: floating-point results carry small rounding errors, so exact comparisons with == should be handled with care.
sqrt(2)^2-2
[1] 4.440892e-16
dplyr::near(sqrt(2)^2, 2)
[1] TRUE
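If dplyr is not available, a similar tolerant comparison can be written with base R alone. This is a sketch, and the tolerance value is an assumption to be adapted to the data at hand.

```r
# all.equal() compares with a small numeric tolerance; wrap it in isTRUE()
# because it returns a character description when the values differ
isTRUE(all.equal(sqrt(2)^2, 2))   # TRUE

# The same idea written by hand, with an explicitly chosen tolerance
tolerance <- 1e-8                 # assumed threshold
abs(sqrt(2)^2 - 2) < tolerance    # TRUE
```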
Missing values
The NA keyword represents a missing value.
NA>3
[1] NA
NA+10
[1] NA
NA==NA
[1] NA
To check if a value is NA you use the is.na function
a <- NA
is.na(a)
[1] TRUE
b <- "this variable has a value"
is.na(b)
[1] FALSE
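Because NA propagates through calculations, summaries of data with missing values need some care. A small sketch, using c() to combine values into a vector (the name values is illustrative):

```r
values <- c(1, NA, 3)

is.na(values)                # FALSE TRUE FALSE
sum(is.na(values))           # 1, the number of missing values

mean(values)                 # NA, the missing value propagates
mean(values, na.rm = TRUE)   # 2, missing values are dropped first
```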
Other special values
What is the result of this operation?
0/0
[1] NaN
The NaN value (Not a Number) marks a result that is mathematically undefined and therefore cannot be represented as a number.
What about this operation?
sqrt(-1)
[1] NaN
We get NaN (along with a warning), even though the square root of -1 is the definition of the imaginary unit i.
If you want the complex number, then you should declare it explicitly
sqrt( as.complex(-1) )
[1] 0+1i
NA vs NaN
Beware: in R the values NA and NaN refer to distinct concepts. This is in contrast with Python, where NaN is often also used to indicate missing values.
In particular, and confusingly
is.na(NaN)
[1] TRUE
but
is.nan(NA)
[1] FALSE
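Given this asymmetry, one way to tell the two apart is to combine both checks. A minimal sketch on an illustrative vector:

```r
values <- c(1, NA, NaN)

is.na(values)                     # FALSE TRUE  TRUE
is.nan(values)                    # FALSE FALSE TRUE

# Missing, but not NaN
is.na(values) & !is.nan(values)   # FALSE TRUE  FALSE
```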
Other special values
What about this operation?
1/0
[1] Inf
The Inf value is used to represent infinity, and propagates in calculations
Inf+10
[1] Inf
min(Inf, 10)
[1] 10
Inf-Inf
[1] NaN
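Base R also provides predicates to test for these special values, which can be handy before running a calculation. A short sketch:

```r
-1 / 0              # -Inf, infinity also has a sign

is.finite(10)       # TRUE
is.finite(Inf)      # FALSE
is.infinite(-Inf)   # TRUE

# NA and NaN are neither finite nor infinite
is.finite(NA)       # FALSE
is.finite(NaN)      # FALSE
is.infinite(NaN)    # FALSE
```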
Vectors
Atomic vectors are homogeneous indexed collections of values of the same basic data type
vec_numbers <- vector("numeric", 4)
vec_numbers
[1] 0 0 0 0
vec_letters <- vector("character", 6)
vec_letters
[1] "" "" "" "" "" ""
You can also define sequences of numbers
1:10
[1] 1 2 3 4 5 6 7 8 9 10
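Vectors with specific contents are usually built with c(), and more general sequences with seq(). The sketch below also shows what "same basic data type" means in practice: mixing types makes R convert everything to a common type.

```r
# Combine given values into a vector
c(1.5, 2, 10)                # 1.5  2.0 10.0

# Sequences with a step, or with a fixed number of points
seq(1, 10, by = 2)           # 1 3 5 7 9
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

# Mixed inputs are coerced to a single type, here character
mixed <- c(1, "a")
typeof(mixed)                # "character"
```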
You can ask for the type of a vector using typeof
typeof(vec_numbers)
[1] "double"
typeof(vec_letters)
[1] "character"
typeof(1:10)
[1] "integer"
You can ask for the length of a vector using length
length(vec_numbers)
[1] 4
length(vec_letters)
[1] 6
length(1:10)
[1] 10
What about scalars?
What does this return?
typeof(3)
[1] "double"
What about this?
length(3)
[1] 1
R has no scalar values, only vectors of length 1!
With the str function you can look at the structure of nested lists.
str(my_named_list)
List of 3
$ pi : num 3.14
$ name: chr "Listy List"
$ geo :List of 2
..$ city : chr "Bozen"
..$ country: chr "Italy"
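The definition of my_named_list is not shown here. As an assumption, a list like the following would produce the structure printed above (the exact value of the pi element is a guess based on the rounded output):

```r
# Hypothetical reconstruction of the list inspected above
my_named_list <- list(
  pi = 3.14,
  name = "Listy List",
  geo = list(city = "Bozen", country = "Italy")  # a list nested in a list
)

str(my_named_list)

# Elements are accessed by name with $, also through nested levels
my_named_list$geo$city   # "Bozen"
```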
Control flow: if
if (condition) {
  # Do something if condition holds
} else if (second condition) {
  # Otherwise, do something else if the second condition holds
} else {
  # If none of the previous holds, do this
}
For example, do different things depending on the type of a vector
my_vec <- c(1.0, 3.14, 5.42)
if (is.numeric(my_vec)) {
  mean(my_vec)
} else {
  # Signal an error and stop execution
  stop("We are expecting a numeric vector!")
}
[1] 3.186667
Control flow: for loops
for (iteration specification) {
  # Do something for each iteration
}
List of 4
$ a: num -0.383
$ b: num 0.417
$ c: num 0.855
$ d: num -20.9
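The code that produced the listing above is not shown. As a separate, minimal sketch of the two most common ways to specify the iteration (all names are illustrative):

```r
numbers <- c(2, 4, 6)

# Iterate directly over the values
total <- 0
for (n in numbers) {
  total <- total + n
}
total     # 12

# Iterate over the positions, storing results in a pre-allocated vector
squares <- vector("numeric", length(numbers))
for (i in seq_along(numbers)) {
  squares[i] <- numbers[i]^2
}
squares   # 4 16 36
```

Using seq_along(numbers) rather than 1:length(numbers) also behaves sensibly when the vector is empty.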
Functions
Whenever you find yourself copy-pasting code, create a function instead!
The name of the function serves to describe its purpose
Maintenance is easier: you only need to update code in one place
You don’t make silly copy-paste errors
Functions: anatomy
Function call
fn_name(<value1>, argument2 = <value2>)
Function definition
my_func <- function(arg1, arg2, named_arg3 = 42) {
  # Do things with arguments
  # The last statement is the return value
  # you can also use the explicit `return(value)` to do early returns
}
List of 4
$ a: num [1:5] 0.685 0.891 1 0.147 0
$ b: num [1:5] 0.928 1 0 0.967 0.837
$ c: num [1:5] 0.1217 1 0 0.8751 0.0858
$ d: num [1:5] 0 0.551 1 0.434 0.138
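The code behind the listing above is not shown. Its values all lie between 0 and 1, which is what rescaling each vector to its own range would give, so as an assumption a function along these lines illustrates the anatomy just described (the name rescale01 is hypothetical):

```r
# Hypothetical example: rescale a numeric vector to the interval [0, 1]
rescale01 <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  # The last evaluated expression is the return value
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(2, 4, 6))        # 0.0 0.5 1.0
rescale01(c(10, NA, 20))     # 0 NA 1
```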
Functions: variable number of arguments
You can write functions that accept a variable number of arguments using the ... syntax:
with_varargs <- function(...) {
  # The following line stores the additional arguments in a list,
  # for convenient access. Additional arguments can even be named
  args <- list(...)
  return(str(args))
}
with_varargs(
  "hello",    # This is a positional argument
  b = 42,     # This is an additional argument that will go in the args list
  a = "world" # And additional arguments can also be named
)
List of 3
$ : chr "hello"
$ b: num 42
$ a: chr "world"
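A common use of ... is to pass extra arguments on to another function without listing them one by one. A minimal sketch, where mean_rounded is a hypothetical helper:

```r
# Hypothetical helper: `...` is passed on unchanged to mean()
mean_rounded <- function(x, ...) {
  round(mean(x, ...), digits = 1)
}

mean_rounded(c(1, 2, 4))                  # 2.3
mean_rounded(c(1, NA, 4))                 # NA
mean_rounded(c(1, NA, 4), na.rm = TRUE)   # 2.5, na.rm reaches mean()
```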
Libraries
Functions are the basic unit of code reuse
Libraries (also called packages) group together functions with related functionality
https://cran.r-project.org
Installing libraries
Just use the command
install.packages("name_of_the_library")
The tidyverse
R packages for data science
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Install the complete tidyverse with:
install.packages("tidyverse")
Using libraries
Prepend the package name
readr::read_csv("file.csv")
Bring all the package’s functions into scope
library(readr)
read_csv("file.csv")
The second option is more convenient, but some names may mask names already in scope
library(dplyr)
Attaching package: `dplyr`
The following objects are masked
from `package:stats`:
filter, lag
The following objects are masked
from `package:base`:
intersect, setdiff, setequal, union
In this case the shadowed names are still accessible using their fully qualified names
stats::filter
base::intersect
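A fully qualified call always reaches the named package, whether or not another package has masked the short name. A small sketch using only packages that ship with R:

```r
# Always the base version, even if dplyr is attached
common <- base::intersect(c(1, 2, 3), c(2, 3, 4))
common                    # 2 3

# The same syntax works for any installed package
stats::sd(c(1, 2, 3))     # 1
```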
The Tidyverse libraries
Once installed, the core tidyverse packages are loaded all at once with
library(tidyverse)
ggplot2
The main library we will deal with.
Declarative graphics with a well-defined grammar.
The main reason we use R rather than Python.
tibble
The tabular data representation we will mostly use.
A modern iteration on the data frame concept.
dplyr
Data manipulation library.
Covers most of our preprocessing needs.
readr
Reads a variety of file formats in a convenient way.