R is a language and environment for statistical computing and graphics.
https://posit.cloud provides an online IDE
RMarkdown is a plain text format (and accompanying tools) that allows you to intersperse text, code, and outputs
Quarto is an open-source scientific and technical publishing system that allows you to render notebooks in a variety of formats
R has five basic data types:
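The list of types is not shown above. As a minimal sketch, assuming the five types meant here are logical, integer, double, character, and complex, typeof reports the type of an example value of each:

```r
# One example value for each assumed basic type
basic_types <- c(
  typeof(TRUE),     # "logical"
  typeof(1L),       # "integer" (the L suffix marks an integer literal)
  typeof(1.5),      # "double"
  typeof("hello"),  # "character"
  typeof(2i)        # "complex"
)
print(basic_types)
```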
In R, everything can be given a name
x # this is a valid name
descriptive_name # descriptive names are preferable
# - note the underscore separating the words
# - spaces are not allowed
also.valid # This is also a valid name, using an older and maybe
# confusing naming scheme. If you come from Java/C++/Python/Javascript....
# the . in the middle of the name is *not* the member access operator
These names (and others) are not allowed
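The original examples are not shown here. The sketch below uses a few illustrative candidates (chosen for this example, not taken from the slides): a name starting with a digit, a name containing a space, and two reserved words. Each assignment is wrapped in tryCatch so the script keeps running when R rejects it.

```r
# Each string holds an assignment to a name that R rejects
candidates <- c("2fast <- 1", "my name <- 1", "if <- 1", "TRUE <- 1")

# Try to parse and evaluate each one, recording whether it fails
is_rejected <- vapply(
  candidates,
  function(code) {
    tryCatch(
      {
        eval(parse(text = code))
        FALSE
      },
      error = function(e) TRUE
    )
  },
  logical(1)
)
print(is_rejected)
```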
Some names are best avoided, because they are library functions that you would mask
Using the “arrow” syntax you can assign names to things
Later on, you can retrieve the values by simply referencing the name
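A small sketch of assignment and retrieval; the names radius, greeting, and area are made up for illustration:

```r
# Assign values to names with the arrow operator
radius <- 2
greeting <- "hello"

# Referencing a name retrieves the value bound to it
area <- pi * radius^2
print(radius)
print(greeting)
print(area)
```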
Arithmetic
Comparisons
Logical
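The three operator families named above can be tried out as follows. The operands are arbitrary and only meant as an illustration:

```r
# Arithmetic: sum, difference, product, division,
# integer division, remainder, power
arithmetic <- c(7 + 2, 7 - 2, 7 * 2, 7 / 2, 7 %/% 2, 7 %% 2, 7^2)
print(arithmetic)   # values: 9, 5, 14, 3.5, 3, 1, 49

# Comparisons: each one yields TRUE or FALSE
comparisons <- c(3 < 5, 3 == 5, 3 != 5, 3 >= 3)
print(comparisons)  # values: TRUE, FALSE, TRUE, TRUE

# Logical: and, or, not, exclusive or
logical_ops <- c(TRUE & FALSE, TRUE | FALSE, !TRUE, xor(TRUE, FALSE))
print(logical_ops)  # values: FALSE, TRUE, FALSE, TRUE
```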
What does the following comparison return (sqrt gives the square root)?
\[ (\sqrt{2})^2 = 2 \]
[1] FALSE
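The comparison fails because sqrt(2) is stored as a floating-point approximation, so squaring it leaves a tiny rounding error. A minimal sketch of the problem and of one common workaround, comparing with a tolerance through all.equal:

```r
# Exact comparison of doubles: the rounding error makes this FALSE
exact <- sqrt(2)^2 == 2
print(exact)

# The difference is tiny but not zero
difference <- sqrt(2)^2 - 2
print(difference)

# all.equal compares with a small tolerance; wrap it in isTRUE
# because it returns a description (not FALSE) when values differ
tolerant <- isTRUE(all.equal(sqrt(2)^2, 2))
print(tolerant)
```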
The NA keyword represents a missing value.
[1] NA
[1] NA
[1] NA
To check if a value is NA you use the is.na function
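The expressions behind the outputs above are not shown, so the following is only an illustrative sketch with made-up values. It shows that NA propagates through operations, including comparison with itself, which is why is.na is needed:

```r
# Operations involving a missing value are themselves missing
na_sum <- NA + 1
na_cmp <- NA > 1
na_eq <- NA == NA   # also NA: we cannot know if two unknowns are equal

# is.na tests each element for missingness
ages <- c(31, NA, 45)   # hypothetical data
missing <- is.na(ages)
print(missing)

# Many functions can skip missing values on request
mean_known <- mean(ages, na.rm = TRUE)
print(mean_known)
```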
What is the result of this operation?
[1] NaN
The NaN value (Not a Number) means the result is undefined and cannot be represented as a number.
NA vs NaN
Beware: in R the values NA and NaN refer to distinct concepts. This is in contrast with Python, where NaN is often also used to indicate missing values.
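A short sketch of how the two values behave under the two test functions. Note the asymmetry: is.na treats NaN as missing, while is.nan does not treat NA as NaN.

```r
# 0 / 0 is an undefined operation and yields NaN
not_a_number <- 0 / 0

nan_is_nan <- is.nan(not_a_number)  # TRUE
nan_is_na <- is.na(not_a_number)    # TRUE: is.na also flags NaN
na_is_nan <- is.nan(NA)             # FALSE: a missing value is not NaN
print(c(nan_is_nan, nan_is_na, na_is_nan))
```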
What about this operation?
[1] Inf
The Inf value is used to represent infinity, and propagates in calculations
[1] NaN
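The operations that produced the outputs above are not shown. As an illustration only, these are some expressions that yield Inf, keep it, or turn it into NaN:

```r
pos_inf <- 1 / 0        # Inf
neg_inf <- -1 / 0       # -Inf
still_inf <- Inf + 1    # Inf propagates through arithmetic
undefined <- Inf - Inf  # NaN: the difference of two infinities is undefined

print(c(pos_inf, neg_inf, still_inf, undefined))

# is.finite is FALSE for Inf, -Inf, NaN and NA
finite_check <- is.finite(c(1, pos_inf, undefined))
print(finite_check)
```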
Atomic vectors are indexed collections of values that all share the same basic data type
You can ask for the type of a vector using typeof
You can ask for the length of a vector using length
What does this return?
[1] 1
There are no scalar values, but vectors of length 1!
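A minimal sketch of typeof and length on a few example vectors (the values are arbitrary), including a single number, which is just a vector of length 1:

```r
numbers <- c(1.5, 2.5, 3.5)
numbers_type <- typeof(numbers)   # "double"
numbers_length <- length(numbers) # 3

letters_type <- typeof(c("a", "b"))  # "character"

# A single value is a vector of length 1
single_length <- length(42)

print(c(numbers_type, letters_type))
print(c(numbers_length, single_length))
```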
The c function combines its arguments
[1] 1 3 5 7 2 4 6 8
Using c multiple times does not nest vectors
[1] "1" "hello" "0.45"
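The calls behind these outputs are not shown. The sketch below uses inputs assumed from the printed results: combining two vectors gives one flat vector, and mixing numbers with a string coerces every element to character.

```r
odd <- c(1, 3, 5, 7)
even <- c(2, 4, 6, 8)

# Combining vectors flattens them into a single vector of length 8
combined <- c(odd, even)
print(combined)

# Mixed types are coerced to the most general one, here character
mixed <- c(1, "hello", 0.45)
print(mixed)
print(typeof(mixed))
```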
[1] 2 3 4
R coerces the length of vectors, if needed.
Remember that 1 is a vector of length one. By coercion, in the operation above, it is replaced with c(1, 1, 1) by recycling its value.
[1] 2 5 4
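A sketch of recycling with illustrative inputs (not necessarily the ones used on the slide): the shorter vector is repeated until it matches the length of the longer one.

```r
# The length-1 vector is recycled to c(1, 1, 1)
plus_one <- c(1, 2, 3) + 1
print(plus_one)

# A length-2 vector is recycled to c(10, 20, 10, 20)
recycled <- c(1, 2, 3, 4) + c(10, 20)
print(recycled)
```

When the longer length is not a multiple of the shorter one, R still recycles but emits a warning, which usually signals a mistake.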
logical vectors
There are distinct operators for element-wise operations on logical vectors:
which is different from
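The code for this comparison is missing here. A plausible reading, stated as an assumption, is the contrast between the element-wise operators & and | and the single-value operators && and ||:

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

# Element-wise: one result per position
elementwise_and <- a & b
elementwise_or <- a | b
print(elementwise_and)
print(elementwise_or)

# && and || work on single values and return a single value,
# which is what conditions in if statements need
scalar_and <- a[1] && b[2]
scalar_or <- a[1] || b[2]
print(c(scalar_and, scalar_or))
```

From R 4.3.0 onwards, passing vectors longer than one to && or || is an error.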
logical vectors
How can you check if all the values are FALSE?
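Two equivalent ways to answer the question, sketched on a made-up vector named flags:

```r
flags <- c(FALSE, FALSE, FALSE)

# No value is TRUE
all_false <- !any(flags)

# Equivalently: every negated value is TRUE
all_false_alt <- all(!flags)

print(c(all_false, all_false_alt))
```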
Elements of vectors can be named, which will be useful for indexing into the vector
Notice that you need to enclose a name in quotes only if it contains spaces.
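A small sketch with hypothetical names and values. Only the name containing a space needs quotes when the vector is created:

```r
# "mary ann" contains a space, so it must be quoted
scores <- c(alice = 90, bob = 75, "mary ann" = 82)

score_names <- names(scores)
print(score_names)

# Names can then be used to index into the vector
bob_score <- scores[["bob"]]
mary_score <- scores[["mary ann"]]
print(c(bob_score, mary_score))
```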
You can index into vectors using integer indices.
Beware: indexing starts from 1!
So what about this?
[1] NA
What does the code below give?
[1] "values" "these"
[1] "these" "these" "some" "some" "these"
What about
[1] "these" "some" "values"
Negative indices remove values from a vector!
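The indexing expressions are not shown above. The sketch below assumes a four-element vector and index values reconstructed from the printed outputs, so treat both as guesses rather than the original code:

```r
words <- c("these", "are", "some", "values")  # assumed input

first <- words[1]                    # indexing starts from 1
picked <- words[c(4, 1)]             # several positions, in the given order
repeated <- words[c(1, 1, 3, 3, 1)]  # positions may be repeated
dropped <- words[-2]                 # negative index: everything but position 2
beyond <- words[5]                   # past the end: NA, not an error

print(picked)
print(repeated)
print(dropped)
print(beyond)
```

Negative indexing returns a new vector; words itself is left unchanged.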
You can use boolean vectors to retain only the entries corresponding to TRUE
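A sketch of logical indexing on made-up values, first with an explicit logical vector and then with one produced by a comparison:

```r
values <- c(3, 8, 1, 9)

# Keep the positions marked TRUE
keep <- c(TRUE, FALSE, TRUE, FALSE)
kept <- values[keep]
print(kept)

# A comparison yields a logical vector, so it can be used directly
large <- values[values > 2]
print(large)
```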
Is the following naming valid?
Is the following naming valid?
This is not valid, since it makes subsetting ambiguous.
A list lets you store elements of different types in the same collection, without coercion.
With the str function you can look at the structure of nested lists.
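A minimal sketch with a hypothetical nested list named person. Each element keeps its own type, and str prints the nesting:

```r
person <- list(
  name = "Ada",
  age = 36L,
  scores = c(9.5, 8),
  address = list(city = "London", zip = "NW1")  # a list inside a list
)

# Compact overview of the nested structure
str(person)

# Each element keeps its own type: no coercion takes place
element_types <- vapply(person, typeof, character(1))
print(element_types)
```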
if
for loops
We will use the following data as examples.
List of 4
$ a: num [1:10] -1.207 0.277 1.084 -2.346 0.429 ...
$ b: num [1:10] 0.317 0.303 0.159 0.04 0.219 ...
$ c: num [1:10] 0.877 0.0146 1.8351 0.5193 1.9963 ...
$ d: num [1:10] -159.354 -1.608 21.193 0.963 -0.907 ...
for loops
We want to compute the mean of each of a, b, c and d in loop_data. A straightforward approach would be
data_means <- list(
  a = mean(loop_data$a),
  b = mean(loop_data$b),
  c = mean(loop_data$c),
  d = mean(loop_data$d)
)
str(data_means)
List of 4
$ a: num -0.383
$ b: num 0.417
$ c: num 0.855
$ d: num -20.9
What are the issues with this approach?
for loops
We can do better with a for loop
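# Start from an empty list and append one mean per iteration
data_means <- list()
# seq_along is safer than 1:length(x), which yields c(1, 0) for empty input
for (i in seq_along(loop_data)) {
  data_means <- c(
    data_means,
    mean(loop_data[[i]])
  )
}
str(data_means)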
List of 4
$ : num -0.383
$ : num 0.417
$ : num 0.855
$ : num -20.9
Did we lose something?
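The names are gone because c appends unnamed values. One possible fix, sketched here on a small stand-in for loop_data with made-up values, is to iterate over the names and assign each result under its name:

```r
# Hypothetical stand-in for loop_data, with values chosen for easy checking
loop_data <- list(
  a = c(1, 2, 3),
  b = c(10, 20),
  c = c(0.5, 1.5),
  d = c(-4, 4)
)

data_means <- list()
for (name in names(loop_data)) {
  # Assigning by name keeps the association between input and result
  data_means[[name]] <- mean(loop_data[[name]])
}
str(data_means)
```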
for loops
Whenever you find yourself copy-pasting code, create a function instead!
Consider the following data
List of 4
$ a: num [1:5] 0.00986 0.67827 1.02956 -1.72953 -2.20435
$ b: num [1:5] -1.319 1.453 -37.231 0.164 -4.862
$ c: num [1:5] 0.1215 0.8928 0.0146 0.7831 0.09
$ d: num [1:5] 0.0384 1.2302 2.2003 0.9757 0.337
We want to rescale all the values so that they lie in the range 0 to 1.
Let’s first see how to do it on my_list$a:
Now, instead of copying and pasting the code for all the entries in my_list, we define a function rescale01 and then we can invoke it, maybe in a loop
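The body of rescale01 is not shown above. The following is one common way to write it, offered as a sketch rather than the original definition: subtract the minimum and divide by the width of the range. The data reuses the rounded values printed above for a and b.

```r
# Rounded values copied from the printed structure of my_list
my_list <- list(
  a = c(0.00986, 0.67827, 1.02956, -1.72953, -2.20435),
  b = c(-1.319, 1.453, -37.231, 0.164, -4.862)
)

# Map the smallest value to 0 and the largest to 1
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)  # c(min, max)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Apply the function to every entry, keeping the names
rescaled <- list()
for (name in names(my_list)) {
  rescaled[[name]] <- rescale01(my_list[[name]])
}
str(rescaled)
```

This sketch assumes each vector has at least two distinct values; a constant vector would lead to a division by zero.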
You can write functions that accept a variable number of arguments using the ... syntax:
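A minimal sketch with two hypothetical functions, sum_of_squares and count_args. Inside the function, c(...) or list(...) collects whatever was passed:

```r
# Accept any number of numeric arguments
sum_of_squares <- function(...) {
  values <- c(...)  # gather the arguments into one vector
  sum(values^2)
}

# list(...) keeps the arguments separate, so they can be counted
count_args <- function(...) {
  length(list(...))
}

result <- sum_of_squares(1, 2, 3)
n_args <- count_args("a", 2, TRUE, NULL)
print(result)
print(n_args)
```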
Functions are the basic unit of code reuse
Libraries (also called packages) group together functions with related functionality
https://cran.r-project.org
Just use the command
tidyverse
The second option is more convenient, but some names may mask names already in scope
Attaching package: `dplyr`
The following objects are masked
from `package:stats`:
filter, lag
The following objects are masked
from `package:base`:
intersect, setdiff, setequal, union
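The code for the two options is not shown above. Assuming they are calling a function through its package prefix and attaching the package with library, a sketch could look like this. It uses the tools package, which ships with R, so that it runs without installing anything; the file names are made up.

```r
# Installing from CRAN is a one-off step
# (commented out so that the sketch runs offline)
# install.packages("tidyverse")

# Option 1: qualify the function with its package name
ext_qualified <- tools::file_ext("measurements.csv")

# Option 2: attach the package, then call the function directly
library(tools)
ext_attached <- file_ext("report.pdf")

print(c(ext_qualified, ext_attached))
```

The prefix form never masks anything, at the cost of more typing. Attaching is shorter but can hide functions with the same name, as in the dplyr message above.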
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
The main library we will deal with.
Declarative graphics with a well-defined grammar.
The main reason we use R rather than Python.
The tabular data representation we will mostly use.
A modern iteration on the data frame concept.
Data manipulation library.
Covers most of our preprocessing needs.
Reads a variety of file formats in a convenient way.
Handles corner cases and encodings for you.