A R programming language

A.1 Basic characteristics

R is free software for statistical computing and graphics. It is widely used by statisticians, scientists, and other professionals for software development and data analysis. It is an interpreted language and therefore the programs do not need compilation.

A.2 Why R?

R is one of the two main languages used for statistics and machine learning (the other being Python).

Pros

Libraries. Comprehensive collection of statistical and machine learning packages.
Easy to code.
Open source. Anyone can access R and develop new methods. Additionally, it is relatively simple to get the source code of established methods.
Large community. The use of R has been rising for some time, in both industry and academia. As a result, a large collection of blogs and tutorials exists, along with people offering help on sites such as Stack Exchange and Cross Validated.
Integration with other languages and LaTeX.
New methods. Many researchers develop R packages based on their research, so new methods are often available soon after they are developed.

Cons

Slow. Programs run slower than in many other programming languages; however, this can be partly remedied by efficient coding or by integration with other languages.
Memory intensive. This can become a problem with large data sets, as they need to be stored in memory, along with all the information the models produce.
Some packages are not as good as they should be, or have poor documentation.
Object oriented programming in R can be very confusing and complex.

A.3 Setting up

R can be downloaded from https://www.r-project.org/.

A.3.1 RStudio

RStudio is the most widely used IDE for R. It is free and can be downloaded from https://rstudio.com/. While console R is sufficient for the requirements of this course, we recommend that students install RStudio for its better user interface.

A.3.2 Libraries for data science

Listed below are some of the more useful libraries (packages) for data science. Students are also encouraged to find other useful packages.

dplyr: Efficient data manipulation. Part of the wider package collection called tidyverse.
ggplot2: Plotting based on the grammar of graphics.
stats: Several statistical models.
rstan: Bayesian inference using Hamiltonian Monte Carlo. Very flexible model building.
MCMCpack: Bayesian inference.
rmarkdown, knitr, and bookdown: Dynamic reports (such as this one).
devtools: Package development.

A.4 R basics

A.4.1 Variables and types

Important information and tips:

no type declaration
define variables with <- instead of = (both work, but there is a slight difference, and most packages use the arrow)
for strings use ""
for comments use #
change types with as.type() functions
no special type for a single character, unlike in C++, for example

n            <- 20
x            <- 2.7
m            <- n # m gets value 20
my_flag      <- TRUE
student_name <- "Luka"
typeof(n)

## [1] "double"

typeof(student_name)

## [1] "character"

typeof(my_flag)

## [1] "logical"

typeof(as.integer(n))

## [1] "integer"

typeof(as.character(n))

## [1] "character"
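
The as.type() conversions listed above can be sketched with a few more base R calls. This is a minimal illustration, not an exhaustive list; the variable names (count_int, truncated, and so on) are made up for this example.

```r
count_int <- 20L                 # the L suffix creates an integer literal
typeof(count_int)                # "integer"

truncated <- as.integer(2.9)     # conversion drops the fractional part, giving 2
parsed    <- as.numeric("3.14")  # character to double
flag_num  <- as.numeric(TRUE)    # TRUE becomes 1 (FALSE would become 0)

# A string that does not represent a number converts to NA.
# R also emits a warning, which is suppressed here to keep the output clean.
not_a_number <- suppressWarnings(as.integer("abc"))
is.na(not_a_number)              # TRUE
```

Checking the result with is.na() after a conversion is a simple way to catch values that could not be converted.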

A.4.2 Basic operations

n + x

## [1] 22.7

n - x

## [1] 17.3

diff <- n - x # variable diff gets the difference between n and x
diff

## [1] 17.3

n * x

## [1] 54

n / x

## [1] 7.407407

x^2

## [1] 7.29

sqrt(x)

## [1] 1.643168

n > 2 * n

## [1] FALSE

n == n

## [1] TRUE

n == 2 * n

## [1] FALSE

n != n

## [1] FALSE

paste(student_name, "is", n, "years old")

## [1] "Luka is 20 years old"
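
A few further operators that the examples above do not cover are sketched here: integer division, remainder, and the logical operators. The last lines illustrate a common pitfall when comparing decimal numbers with ==. The variable names are illustrative only.

```r
n <- 20
x <- 2.7

quotient  <- n %/% 3            # integer division: 6
remainder <- n %% 3             # remainder after division: 2

both    <- (n > 10) & (x < 3)   # logical AND: TRUE
either  <- (n < 10) | (x < 3)   # logical OR: TRUE
negated <- !(n == 20)           # logical NOT: FALSE

# Decimal numbers are stored approximately, so exact comparison can fail.
exact_equal <- (0.1 + 0.2) == 0.3                  # FALSE
near_equal  <- isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE, compares with a tolerance
```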

A.4.3 Vectors

use c() to combine elements into vectors
can only contain one type of variable
if different types are provided, all elements are coerced to the most general type present (logical, then integer, then double, then character)
access elements by index or by a logical vector of the same length
a scalar value is regarded as a vector of length 1

1:4 # creates a vector of integers from 1 to 4

## [1] 1 2 3 4

student_ages  <- c(20, 23, 21)
student_names <- c("Luke", "Jen", "Mike")
passed        <- c(TRUE, TRUE, FALSE)
length(student_ages)

## [1] 3

# access by index
student_ages[2]

## [1] 23

student_ages[1:2]

## [1] 20 23

student_ages[2] <- 24 # change values

# access by logical vectors
student_ages[passed == TRUE] # same as student_ages[passed]

## [1] 20 24

student_ages[student_names %in% c("Luke", "Mike")]

## [1] 20 21

student_names[student_ages > 20]

## [1] "Jen"  "Mike"
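
The coercion rule for mixed types, together with two other common ways of indexing, can be sketched as follows. The vectors used here are made up for the illustration.

```r
# Mixing types in c() coerces every element to the most general type.
mixed_num <- c(1, TRUE, FALSE)  # logical values become numbers: 1 1 0
mixed_chr <- c(1, "a", TRUE)    # everything becomes character: "1" "a" "TRUE"
typeof(mixed_num)               # "double"
typeof(mixed_chr)               # "character"

# A negative index drops elements instead of selecting them.
ages <- c(20, 23, 21)
without_first <- ages[-1]       # 23 21

# which() turns a logical vector into the positions that are TRUE.
older_positions <- which(ages > 20)  # 2 3
```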

A.4.3.1 Operations with vectors

most operations are element-wise
if we operate on vectors of different lengths, the elements of the shorter vector are recycled (repeated from the start) until it matches the length of the longer one; R issues a warning if the longer length is not a multiple of the shorter one

a <- c(1, 3, 5)
b <- c(2, 2, 1)
d <- c(6, 7)
a + b

## [1] 3 5 6

a * b

## [1] 2 6 5

a + d

## Warning in a + d: longer object length is not a multiple of shorter object
## length

## [1]  7 10 11

a + 2 * b

## [1] 5 7 7

a > b

## [1] FALSE  TRUE  TRUE

b == a

## [1] FALSE FALSE FALSE

a %*% b # inner (dot) product of the two vectors, not element-wise

##      [,1]
## [1,]   13
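
To complement the example with the warning above, here is a small sketch of recycling when the longer length is a multiple of the shorter one, in which case R recycles silently. It also shows that the inner product equals the sum of the element-wise products. The names long_vec and short_vec are illustrative.

```r
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

# short_vec is recycled to 10 20 10 20; no warning, since 4 is a multiple of 2
recycled_sum <- long_vec + short_vec   # 11 22 13 24

# A single number is a vector of length 1, so it is recycled as well.
doubled <- long_vec * 2                # 2 4 6 8

a <- c(1, 3, 5)
b <- c(2, 2, 1)
inner_product <- drop(a %*% b)         # drop() turns the 1 x 1 matrix into a number: 13
same_value    <- sum(a * b)            # 13
```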

A.4.4 Factors

vectors whose values come from a finite, predetermined set of classes (levels)
suitable for categorical variables
ordinal (ordered) or nominal (unordered)

car_brand <- factor(c("Audi", "BMW", "Mercedes", "BMW"), ordered = FALSE)
car_brand

## [1] Audi     BMW      Mercedes BMW     
## Levels: Audi BMW Mercedes

freq      <- factor(x       = NA,
                    levels  = c("never","rarely","sometimes","often","always"),
                    ordered = TRUE)
freq[1:3] <- c("rarely", "sometimes", "rarely")
freq

## [1] rarely    sometimes rarely   
## Levels: never < rarely < sometimes < often < always

freq[4]   <- "quite_often" # non-existing level, returns NA

## Warning in `[<-.factor`(`*tmp*`, 4, value = "quite_often"): invalid factor
## level, NA generated

freq

## [1] rarely    sometimes rarely    <NA>     
## Levels: never < rarely < sometimes < often < always
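
A few base R functions that are often used with factors are sketched here. The factor is rebuilt inside the block so that the example is self-contained; below and counts are illustrative names.

```r
freq <- factor(c("rarely", "sometimes", "rarely"),
               levels  = c("never", "rarely", "sometimes", "often", "always"),
               ordered = TRUE)

levels(freq)                 # all possible classes, including unused ones
as.integer(freq)             # position of each value among the levels: 2 3 2

# Ordered factors can be compared; unordered factors cannot.
below <- freq < "sometimes"  # TRUE FALSE TRUE

# table() counts how often each level occurs (unused levels get 0).
counts <- table(freq)
counts
```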

A.4.5 Matrices

two-dimensional generalizations of vectors

my_matrix <- matrix(c(1, 2, 1,
                      5, 4, 2),
                    nrow  = 2,
                    byrow = TRUE)
my_matrix

##      [,1] [,2] [,3]
## [1,]    1    2    1
## [2,]    5    4    2

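# without byrow = TRUE the matrix is filled column by column,
# so (1, 3) is the first column and (2, 3) is the second
my_square_matrix <- matrix(c(1, 3,
                             2, 3),
                           nrow  = 2)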
my_square_matrix

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    3

my_matrix[1,2] # first row, second column

## [1] 2

my_matrix[2, ] # second row

## [1] 5 4 2

my_matrix[ ,3] # third column

## [1] 1 2
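
The difference between the default column-wise filling and byrow = TRUE can be seen side by side in the following sketch. It also shows the drop argument, which keeps a single selected row as a matrix. The matrix names are made up for this example.

```r
by_col <- matrix(1:6, nrow = 2)                # default: columns are filled first
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)  # rows are filled first

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

# Selecting one row returns a plain vector, which has no dimensions.
dim(by_row[1, ])                       # NULL

# drop = FALSE keeps the result as a 1 x 3 matrix.
one_row <- by_row[1, , drop = FALSE]
dim(one_row)                           # 1 3
```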

A.4.5.1 Matrix functions and operations

most operations are element-wise
mind the dimensions when using matrix multiplication %*%

nrow(my_matrix) # number of matrix rows

## [1] 2

ncol(my_matrix) # number of matrix columns

## [1] 3

dim(my_matrix) # matrix dimension

## [1] 2 3

t(my_matrix) # transpose

##      [,1] [,2]
## [1,]    1    5
## [2,]    2    4
## [3,]    1    2

diag(my_matrix) # the diagonal of the matrix as vector

## [1] 1 4

diag(1, nrow = 3) # creates a diagonal matrix

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

det(my_square_matrix) # matrix determinant

## [1] -3

my_matrix + 2 * my_matrix

##      [,1] [,2] [,3]
## [1,]    3    6    3
## [2,]   15   12    6

my_matrix * my_matrix # element-wise multiplication

##      [,1] [,2] [,3]
## [1,]    1    4    1
## [2,]   25   16    4

my_matrix %*% t(my_matrix) # matrix multiplication

##      [,1] [,2]
## [1,]    6   15
## [2,]   15   45

my_vec <- as.vector(my_matrix) # transform to vector
my_vec

## [1] 1 5 2 4 1 2
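
Two further matrix operations from base R are sketched below: solve() for the inverse and for linear systems, and a check of what happens when the dimensions do not match in %*%. The matrices are redefined so the block runs on its own; rhs and wide are illustrative names.

```r
square <- matrix(c(1, 3, 2, 3), nrow = 2)  # columns are (1, 3) and (2, 3)

inverse <- solve(square)                   # matrix inverse
# Multiplying a matrix by its inverse gives the identity (up to rounding).
all.equal(square %*% inverse, diag(2))     # TRUE

# solve(A, b) solves the linear system A x = b.
rhs      <- c(5, 9)
solution <- solve(square, rhs)             # x = 1, y = 2

# %*% requires the inner dimensions to agree; a 2 x 3 matrix cannot be
# multiplied by another 2 x 3 matrix, so R raises an error.
wide <- matrix(1:6, nrow = 2)
product_ok <- tryCatch({
  wide %*% wide
  TRUE
}, error = function(e) FALSE)
product_ok                                 # FALSE
```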

A.4.6 Arrays

multi-dimensional generalizations of matrices

my_array <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 2, 2))
my_array[1, 1, 1]

## [1] 1

my_array[2, 2, 1]

## [1] 4

my_array[1, , ]

##      [,1] [,2]
## [1,]    1    5
## [2,]    3    7

dim(my_array)

## [1] 2 2 2
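
As a small sketch of working with the third dimension, the following block extracts one slice of an array and applies a function over each slice. The name cube is illustrative.

```r
cube <- array(1:8, dim = c(2, 2, 2))

# Leaving the first two indexes empty selects a whole 2 x 2 slice.
second_slice <- cube[, , 2]            # contains the values 5 to 8

# apply() with margin 3 calls the function once per slice.
slice_sums <- apply(cube, 3, sum)      # 10 26
```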

A.4.7 Data frames

basic data structure for analysis
differ from matrices as columns can be of different types

student_data <- data.frame("Name" = student_names, 
                           "Age"  = student_ages, 
                           "Pass" = passed)
student_data

##   Name Age  Pass
## 1 Luke  20  TRUE
## 2  Jen  24  TRUE
## 3 Mike  21 FALSE

colnames(student_data) <- c("name", "age", "pass") # change column names
student_data[1, ]

##   name age pass
## 1 Luke  20 TRUE

student_data[ ,colnames(student_data) %in% c("name", "pass")]

##   name  pass
## 1 Luke  TRUE
## 2  Jen  TRUE
## 3 Mike FALSE

student_data$pass # access column by name

## [1]  TRUE  TRUE FALSE

student_data[student_data$pass == TRUE, ]

##   name age pass
## 1 Luke  20 TRUE
## 2  Jen  24 TRUE
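
Some other everyday data frame operations in base R are sketched here: adding a column, sorting rows, and summarizing a subset. The data frame is rebuilt inside the block so the example is self-contained, and the column age_next_year is made up for the illustration.

```r
student_data <- data.frame(name = c("Luke", "Jen", "Mike"),
                           age  = c(20, 24, 21),
                           pass = c(TRUE, TRUE, FALSE))

# Assigning to a new column name adds the column.
student_data$age_next_year <- student_data$age + 1

# order() returns the row positions that sort the data by age.
sorted <- student_data[order(student_data$age, decreasing = TRUE), ]
sorted$name                                     # "Jen" "Mike" "Luke"

# Combine logical indexing with a summary function.
mean_age_passed <- mean(student_data$age[student_data$pass])  # 22

nrow(student_data)                              # 3 rows
ncol(student_data)                              # 4 columns
```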

A.4.8 Lists

useful for storing different data structures
access elements with double square brackets
elements can be named

first_list  <- list(student_ages, my_matrix, student_data)
second_list <- list(student_ages, my_matrix, student_data, first_list)
first_list[[1]]

## [1] 20 24 21

second_list[[4]]

## [[1]]
## [1] 20 24 21
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    2    1
## [2,]    5    4    2
## 
## [[3]]
##   name age  pass
## 1 Luke  20  TRUE
## 2  Jen  24  TRUE
## 3 Mike  21 FALSE

second_list[[4]][[1]] # first element of the fourth element of second_list

## [1] 20 24 21

length(second_list)

## [1] 4

second_list[[length(second_list) + 1]] <- "add_me" # append an element
names(first_list) <- c("Age", "Matrix", "Data")
first_list$Age

## [1] 20 24 21
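
The difference between single and double square brackets is a frequent source of confusion, so here is a minimal sketch. The list records and its element names are illustrative.

```r
records <- list(ages = c(20, 24, 21), label = "students")

# Single brackets return a shorter list; double brackets return the element.
as_list    <- records["ages"]     # a list of length 1
as_element <- records[["ages"]]   # the numeric vector itself
is.list(as_list)                  # TRUE
is.numeric(as_element)            # TRUE

# Named elements can be accessed with $ or with [[ ]] and a string.
records$label                     # "students"

# Assigning NULL removes an element.
records[["label"]] <- NULL
length(records)                   # 1
names(records)                    # "ages"
```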

A.4.9 Loops

the for loop is used most often
for loop can iterate over an arbitrary vector

# iterate over consecutive natural numbers
my_sum <- 0
for (i in 1:10) {
  my_sum <- my_sum + i
}
my_sum

## [1] 55

# iterate over an arbitrary vector
my_sum       <- 0
some_numbers <- c(2, 3.5, 6, 100)
for (i in some_numbers) {
  my_sum <- my_sum + i
}
my_sum

## [1] 111.5
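
Two loop variants that the examples above do not show are sketched here: iterating over positions with seq_along(), and a while loop that runs until a condition fails. The variable names are made up for this example.

```r
# iterate over positions, which gives access to both index and value
some_numbers  <- c(2, 3.5, 6, 100)
running_total <- 0
for (i in seq_along(some_numbers)) {
  running_total <- running_total + some_numbers[i]
}
running_total   # 111.5

# while loop: add 1, 2, 3, ... until the total exceeds 20
k     <- 0
total <- 0
while (total <= 20) {
  k     <- k + 1
  total <- total + k
}
k       # 6
total   # 21
```

seq_along(v) is safer than 1:length(v), because it yields an empty sequence when v has no elements.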

A.5 Functions

for help on a function, use ?function_name

A.5.1 Writing functions

We can write our own functions with function(). In the parentheses we define the parameters the function takes, and in the curly braces we define what the function does. We use return() to return a value; if return() is omitted, the function returns the value of the last evaluated expression.

sum_first_n_elements <- function (n) {
  my_sum <- 0
  for (i in seq_len(n)) { # unlike 1:n, seq_len(n) is empty when n is 0
    my_sum <- my_sum + i
  }
  return (my_sum)
}
sum_first_n_elements(10)

## [1] 55
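
As a sketch of two further features, the following variant of the function has a default argument value and replaces the loop with the vectorized sum(). The name sum_first_n is made up for this example and is not part of the original code.

```r
sum_first_n <- function(n = 10) {
  # seq_len(n) gives 1, 2, ..., n and is empty when n is 0;
  # the value of the last expression is returned
  sum(seq_len(n))
}

default_result <- sum_first_n()         # uses the default n = 10, giving 55
named_result   <- sum_first_n(n = 100)  # arguments can be passed by name, giving 5050
empty_result   <- sum_first_n(0)        # 0
```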

A.6 Other tips

Use set.seed(arbitrary_number) at the beginning of a script to fix the random seed and make the results reproducible.
To dynamically set the working directory in RStudio to the parent folder of an R script, use setwd(dirname(rstudioapi::getSourceEditorContext()$path)).
To avoid slow explicit loops in R, use vectorized operations or the apply family of functions. See ?apply and ?lapply.
To make your data manipulation (and therefore your life) a whole lot easier, use the dplyr package.
Use getAnywhere(function_name) to view the source code of a function, including functions that a package does not export.
Use browser() for debugging. See ?browser.
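
Two of the tips above, set.seed() and the apply family, can be sketched with base R only. The variable names are illustrative, and the seed value 1 is arbitrary.

```r
# Setting the same seed before drawing random numbers gives the same numbers.
set.seed(1)
first_draw <- runif(3)
set.seed(1)
second_draw <- runif(3)
identical(first_draw, second_draw)   # TRUE

# sapply() calls a function on every element and collects the results.
squares <- sapply(1:5, function(i) i^2)   # 1 4 9 16 25

# apply() works over the rows (margin 1) or columns (margin 2) of a matrix.
small_matrix <- matrix(1:6, nrow = 2)
col_means    <- apply(small_matrix, 2, mean)   # 1.5 3.5 5.5
```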

A.7 Further reading and references

Getting started with RStudio: https://www.youtube.com/watch?v=lVKMsaWju8w
Official R manuals: https://cran.r-project.org/manuals.html
Cheatsheets: https://www.rstudio.com/resources/cheatsheets/
Workshop on R, dplyr, ggplot2, and R Markdown: https://github.com/bstatcomp/Rworkshop