A R programming language

A.1 Basic characteristics

R is free software for statistical computing and graphics. It is widely used by statisticians, scientists, and other professionals for software development and data analysis. It is an interpreted language, so programs run without a separate compilation step.

A.2 Why R?

R is one of the two main languages used for statistics and machine learning (the other being Python).

Pros

  • Libraries. Comprehensive collection of statistical and machine learning packages.
  • Easy to code.
  • Open source. Anyone can access R and develop new methods. Additionally, it is relatively simple to get source code of established methods.
  • Large community. The use of R has been rising for some time in both industry and academia. As a result, there is a large collection of blogs and tutorials, along with people offering help on Stack Exchange sites such as Stack Overflow and Cross Validated.
  • Integration with other languages and LaTeX.
  • New methods. Many researchers develop R packages based on their research, so new methods become available soon after development.

Cons

  • Slow. Programs run slower than in many other programming languages. This can be partly mitigated by efficient coding or by integration with other languages.
  • Memory intensive. This can become a problem with large data sets, as they need to be stored in the memory, along with all the information the models produce.
  • Some packages are not as good as they should be, or have poor documentation.
  • Object-oriented programming in R can be very confusing and complex.

A.3 Setting up

Download and install R from https://www.r-project.org/.

A.3.1 RStudio

RStudio is the most widely used IDE for R. It is free and can be downloaded from https://rstudio.com/. While console R is sufficient for the requirements of this course, we recommend that students install RStudio for its better user interface.

A.3.2 Libraries for data science

Listed below are some of the more useful libraries (packages) for data science. Students are also encouraged to find other useful packages.

  • dplyr Efficient data manipulation. Part of the wider package collection called tidyverse.
  • ggplot2 Plotting based on grammar of graphics.
  • stats Several statistical models.
  • rstan Bayesian inference using Hamiltonian Monte Carlo. Very flexible model building.
  • MCMCpack Bayesian inference.
  • rmarkdown, knitr, and bookdown Dynamic reports (such as this one).
  • devtools Package development.
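
As a minimal sketch of how such packages are usually obtained: install.packages() downloads a package from CRAN once, and library() attaches it in each session. The selection of packages below is only an illustration, and the installation line is commented out so that the snippet runs without internet access.

```r
# packages we would like to use (an illustrative selection)
wanted <- c("dplyr", "ggplot2")

# check which of them are available, without attaching them
installed   <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
not_present <- wanted[!installed]

# install the missing ones from CRAN (needs internet access, so commented out)
# install.packages(not_present)

# stats ships with R, so it can always be attached
library(stats)
```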

A.4 R basics

A.4.1 Variables and types

Important information and tips:

  • no type declaration
  • assign values with <- rather than = (both work, but there is a slight difference between them, and most packages use the arrow)
  • for strings use ""
  • for comments use #
  • convert between types with the as.*() functions, such as as.integer() and as.character()
  • there is no special type for a single character, unlike in C++ for example; a single character is simply a string
n            <- 20
x            <- 2.7
m            <- n # m gets value 20
my_flag      <- TRUE
student_name <- "Luka"
typeof(n)
## [1] "double"
typeof(student_name)
## [1] "character"
typeof(my_flag)
## [1] "logical"
typeof(as.integer(n))
## [1] "integer"
typeof(as.character(n))
## [1] "character"
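
The following sketch illustrates the conversion functions and the missing single-character type mentioned above. The variable initial is introduced here purely for illustration.

```r
n <- 20L                 # the L suffix creates an integer instead of a double
typeof(n)
## [1] "integer"
as.integer(2.7)          # conversion to integer drops the decimal part
## [1] 2
as.numeric("2.7") + 1    # a string holding a number can be converted
## [1] 3.7
as.logical("TRUE")
## [1] TRUE
is.character("Luka")     # the is.*() functions test for a type
## [1] TRUE
initial <- "L"           # a single character is still of type character
typeof(initial)
## [1] "character"
nchar(initial)           # number of characters in the string
## [1] 1
```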

A.4.2 Basic operations

n + x
## [1] 22.7
n - x
## [1] 17.3
diff <- n - x # variable diff gets the difference between n and x
diff
## [1] 17.3
n * x
## [1] 54
n / x
## [1] 7.407407
x^2
## [1] 7.29
sqrt(x)
## [1] 1.643168
n > 2 * n
## [1] FALSE
n == n
## [1] TRUE
n == 2 * n
## [1] FALSE
n != n
## [1] FALSE
paste(student_name, "is", n, "years old")
## [1] "Luka is 20 years old"
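
A few more basic operators, shown as a brief sketch: integer division, remainder, logical operators, and a safer way to compare decimal numbers. The values of n and x are the same as above.

```r
n <- 20
x <- 2.7
n %/% 3                  # integer division
## [1] 6
n %% 3                   # remainder (modulo)
## [1] 2
(n > 10) & (x < 3)       # logical AND
## [1] TRUE
(n > 10) | (x > 3)       # logical OR
## [1] TRUE
!(n > 10)                # logical NOT
## [1] FALSE
0.1 + 0.2 == 0.3         # floating point arithmetic is not exact
## [1] FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3)) # compare with a small tolerance instead
## [1] TRUE
paste0("n", "=", n)      # paste0() concatenates without spaces
## [1] "n=20"
```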

A.4.3 Vectors

  • use c() to combine elements into vectors
  • a vector can only contain elements of one type
  • if different types are provided, all elements are converted to the most general type present (logical < integer < double < character)
  • access elements by indexes or logical vectors of the same length
  • a scalar value is regarded as a vector of length 1
1:4 # creates a vector of integers from 1 to 4
## [1] 1 2 3 4
student_ages  <- c(20, 23, 21)
student_names <- c("Luke", "Jen", "Mike")
passed        <- c(TRUE, TRUE, FALSE)
length(student_ages)
## [1] 3
# access by index
student_ages[2] 
## [1] 23
student_ages[1:2]
## [1] 20 23
student_ages[2] <- 24 # change values

# access by logical vectors
student_ages[passed == TRUE] # same as student_ages[passed]
## [1] 20 24
student_ages[student_names %in% c("Luke", "Mike")]
## [1] 20 21
student_names[student_ages > 20]
## [1] "Jen"  "Mike"
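
The type conversion inside c() and a few further ways of indexing can be sketched as follows. The vector mixed is a made-up example.

```r
mixed <- c(1, "a", TRUE)   # a number, a string, and a logical
mixed
## [1] "1"    "a"    "TRUE"
typeof(mixed)
## [1] "character"
typeof(c(1L, 2.5))         # integer and double give double
## [1] "double"
c(TRUE, FALSE, 3)          # logicals become 1 and 0
## [1] 1 0 3
student_ages <- c(20, 23, 21)
student_ages[-1]           # a negative index drops that element
## [1] 23 21
student_ages[c(TRUE, FALSE, TRUE)]
## [1] 20 21
which(student_ages > 20)   # positions where the condition holds
## [1] 2 3
```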

A.4.3.1 Operations with vectors

  • most operations are element-wise
  • if we operate on vectors of different lengths, the shorter vector periodically repeats its elements until it reaches the length of the longer one
a <- c(1, 3, 5)
b <- c(2, 2, 1)
d <- c(6, 7)
a + b
## [1] 3 5 6
a * b
## [1] 2 6 5
a + d
## Warning in a + d: longer object length is not a multiple of shorter object
## length
## [1]  7 10 11
a + 2 * b
## [1] 5 7 7
a > b
## [1] FALSE  TRUE  TRUE
b == a
## [1] FALSE FALSE FALSE
a %*% b # inner (dot) product, not element-wise
##      [,1]
## [1,]   13
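
When the longer length is a multiple of the shorter one, the shorter vector is repeated silently, without a warning. A small sketch with freshly defined vectors:

```r
a <- c(1, 2, 3, 4)
d <- c(10, 20)
a + d                    # d is used as 10, 20, 10, 20; no warning
## [1] 11 22 13 24
a * 2                    # a scalar is a vector of length 1, repeated fully
## [1] 2 4 6 8
b <- c(2, 2, 1, 1)
sum(a * b)               # inner product via element-wise multiplication
## [1] 13
drop(a %*% b)            # the same value; drop() removes the matrix dimensions
## [1] 13
```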

A.4.4 Factors

  • vectors whose values come from a finite, predetermined set of classes (levels)
  • suitable for categorical variables
  • ordinal (ordered) or nominal (unordered)
car_brand <- factor(c("Audi", "BMW", "Mercedes", "BMW"), ordered = FALSE)
car_brand
## [1] Audi     BMW      Mercedes BMW     
## Levels: Audi BMW Mercedes
freq      <- factor(x       = NA,
                    levels  = c("never","rarely","sometimes","often","always"),
                    ordered = TRUE)
freq[1:3] <- c("rarely", "sometimes", "rarely")
freq
## [1] rarely    sometimes rarely   
## Levels: never < rarely < sometimes < often < always
freq[4]   <- "quite_often" # level does not exist, so NA is assigned with a warning
## Warning in `[<-.factor`(`*tmp*`, 4, value = "quite_often"): invalid factor
## level, NA generated
freq
## [1] rarely    sometimes rarely    <NA>     
## Levels: never < rarely < sometimes < often < always
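
Some common operations on an ordered factor, as a short sketch. The vector answers is a hypothetical set of survey responses that uses the same levels as freq.

```r
freq_levels <- c("never", "rarely", "sometimes", "often", "always")
answers     <- factor(c("rarely", "often", "never", "often"),
                      levels  = freq_levels,
                      ordered = TRUE)
nlevels(answers)         # number of levels
## [1] 5
as.integer(answers)      # position of each value among the levels
## [1] 2 4 1 4
answers[1] < answers[2]  # ordered factors can be compared
## [1] TRUE
answers[answers >= "often"]
## [1] often often
## Levels: never < rarely < sometimes < often < always
```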

A.4.5 Matrices

  • two-dimensional generalizations of vectors
my_matrix <- matrix(c(1, 2, 1,
                      5, 4, 2),
                    nrow  = 2,
                    byrow = TRUE)
my_matrix
##      [,1] [,2] [,3]
## [1,]    1    2    1
## [2,]    5    4    2
# without byrow = TRUE the values fill the matrix column by column
my_square_matrix <- matrix(c(1, 3,
                             2, 3),
                           nrow  = 2)
my_square_matrix
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    3
my_matrix[1,2] # first row, second column
## [1] 2
my_matrix[2, ] # second row
## [1] 5 4 2
my_matrix[ ,3] # third column
## [1] 1 2
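
The effect of the byrow argument is easiest to see when the same values are placed into a matrix both ways. The sketch below also shows drop = FALSE, which keeps a single row as a matrix instead of a plain vector.

```r
by_col <- matrix(1:6, nrow = 2)               # default: fill column by column
by_col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
by_row <- matrix(1:6, nrow = 2, byrow = TRUE) # fill row by row
by_row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
by_row[2, ]                 # a single row is returned as a plain vector
## [1] 4 5 6
by_row[2, , drop = FALSE]   # drop = FALSE keeps the result a 1 x 3 matrix
##      [,1] [,2] [,3]
## [1,]    4    5    6
by_row[, c(1, 3)]           # several columns at once
##      [,1] [,2]
## [1,]    1    3
## [2,]    4    6
```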

A.4.5.1 Matrix functions and operations

  • most operations are element-wise
  • mind the dimensions when using matrix multiplication %*%
nrow(my_matrix) # number of matrix rows
## [1] 2
ncol(my_matrix) # number of matrix columns
## [1] 3
dim(my_matrix) # matrix dimension
## [1] 2 3
t(my_matrix) # transpose
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    4
## [3,]    1    2
diag(my_matrix) # the diagonal of the matrix as vector
## [1] 1 4
diag(1, nrow = 3) # creates a diagonal matrix
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
det(my_square_matrix) # matrix determinant
## [1] -3
my_matrix + 2 * my_matrix
##      [,1] [,2] [,3]
## [1,]    3    6    3
## [2,]   15   12    6
my_matrix * my_matrix # element-wise multiplication
##      [,1] [,2] [,3]
## [1,]    1    4    1
## [2,]   25   16    4
my_matrix %*% t(my_matrix) # matrix multiplication
##      [,1] [,2]
## [1,]    6   15
## [2,]   15   45
my_vec <- as.vector(my_matrix) # transform to vector
my_vec
## [1] 1 5 2 4 1 2
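
To illustrate why the dimensions matter for %*%, the sketch below checks the dimensions of a few products and catches the error raised by an undefined one. It also shows solve(), which returns the inverse of a square matrix. The string returned by the error handler is our own placeholder, not R's error message.

```r
my_matrix <- matrix(c(1, 2, 1,
                      5, 4, 2),
                    nrow  = 2,
                    byrow = TRUE)
dim(my_matrix %*% t(my_matrix))    # (2 x 3) times (3 x 2) gives 2 x 2
## [1] 2 2
dim(t(my_matrix) %*% my_matrix)    # (3 x 2) times (2 x 3) gives 3 x 3
## [1] 3 3
# (2 x 3) times (2 x 3) is not defined, so R raises an error
result <- tryCatch(my_matrix %*% my_matrix,
                   error = function(e) "non-conformable")
result
## [1] "non-conformable"
my_square_matrix <- matrix(c(1, 3, 2, 3), nrow = 2)
inverse <- solve(my_square_matrix) # inverse (determinant must be non-zero)
round(inverse %*% my_square_matrix, 10)
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
```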

A.4.6 Arrays

  • multi-dimensional generalizations of matrices
my_array <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 2, 2))
my_array[1, 1, 1]
## [1] 1
my_array[2, 2, 1]
## [1] 4
my_array[1, , ]
##      [,1] [,2]
## [1,]    1    5
## [2,]    3    7
dim(my_array)
## [1] 2 2 2
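
Arrays are filled with the first index changing fastest, just like matrices are filled column by column. The sketch below shows one slice of the array and uses apply() to sum over chosen dimensions.

```r
my_array <- array(1:8, dim = c(2, 2, 2))
my_array[, , 2]               # the second 2 x 2 slice
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
apply(my_array, 3, sum)       # sum of each slice along the third dimension
## [1] 10 26
apply(my_array, c(1, 2), sum) # sum over the third dimension
##      [,1] [,2]
## [1,]    6   10
## [2,]    8   12
```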

A.4.7 Data frames

  • basic data structure for analysis
  • unlike in matrices, columns can be of different types
student_data <- data.frame("Name" = student_names, 
                           "Age"  = student_ages, 
                           "Pass" = passed)
student_data
##   Name Age  Pass
## 1 Luke  20  TRUE
## 2  Jen  24  TRUE
## 3 Mike  21 FALSE
colnames(student_data) <- c("name", "age", "pass") # change column names
student_data[1, ]
##   name age pass
## 1 Luke  20 TRUE
student_data[ ,colnames(student_data) %in% c("name", "pass")]
##   name  pass
## 1 Luke  TRUE
## 2  Jen  TRUE
## 3 Mike FALSE
student_data$pass # access column by name
## [1]  TRUE  TRUE FALSE
student_data[student_data$pass == TRUE, ]
##   name age pass
## 1 Luke  20 TRUE
## 2  Jen  24 TRUE
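
A few other everyday data frame operations, sketched on the same student data. The column age_next_year is invented for this example, and stringsAsFactors = FALSE is set explicitly so that the names stay strings in older versions of R as well.

```r
student_data <- data.frame(name = c("Luke", "Jen", "Mike"),
                           age  = c(20, 24, 21),
                           pass = c(TRUE, TRUE, FALSE),
                           stringsAsFactors = FALSE)
nrow(student_data)                                  # number of rows
## [1] 3
student_data$age_next_year <- student_data$age + 1  # add a new column
ncol(student_data)
## [1] 4
names_by_age <- student_data[order(student_data$age), "name"]
names_by_age[1]                                     # the youngest student
## [1] "Luke"
mean(student_data$age[student_data$pass])           # mean age of those who passed
## [1] 22
```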

A.4.8 Lists

  • useful for storing different data structures
  • access elements with double square brackets, or with $ if they are named
  • elements can be named
first_list  <- list(student_ages, my_matrix, student_data)
second_list <- list(student_ages, my_matrix, student_data, first_list)
first_list[[1]]
## [1] 20 24 21
second_list[[4]]
## [[1]]
## [1] 20 24 21
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    2    1
## [2,]    5    4    2
## 
## [[3]]
##   name age  pass
## 1 Luke  20  TRUE
## 2  Jen  24  TRUE
## 3 Mike  21 FALSE
second_list[[4]][[1]] # first element of the fourth element of second_list
## [1] 20 24 21
length(second_list)
## [1] 4
second_list[[length(second_list) + 1]] <- "add_me" # append an element
names(first_list) <- c("Age", "Matrix", "Data")
first_list$Age
## [1] 20 24 21
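
The difference between single and double square brackets is a common source of confusion, so here is a short sketch. The list my_list is a made-up example.

```r
my_list <- list(ages = c(20, 24, 21), name = "Luke")
class(my_list[1])        # single brackets return a (shorter) list
## [1] "list"
class(my_list[[1]])      # double brackets return the element itself
## [1] "numeric"
my_list[["ages"]][2]     # elements can also be accessed by name
## [1] 24
my_list$passed <- TRUE   # assigning to a new name appends an element
names(my_list)
## [1] "ages"   "name"   "passed"
my_list$name <- NULL     # assigning NULL removes an element
length(my_list)
## [1] 2
```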

A.4.9 Loops

  • the for loop is the most commonly used loop
  • a for loop can iterate over an arbitrary vector
# iterate over consecutive natural numbers
my_sum <- 0
for (i in 1:10) {
  my_sum <- my_sum + i
}
my_sum
## [1] 55
# iterate over an arbitrary vector
my_sum       <- 0
some_numbers <- c(2, 3.5, 6, 100)
for (i in some_numbers) {
  my_sum <- my_sum + i
}
my_sum
## [1] 111.5
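
Two further patterns, given as a sketch: iterating over the indexes of a vector with seq_along(), and a while loop, which R also provides. The variable k is introduced only for this example.

```r
student_names <- c("Luke", "Jen", "Mike")
# seq_along() gives the indexes 1, 2, 3 and is safe for empty vectors
for (i in seq_along(student_names)) {
  print(paste(i, student_names[i]))
}
## [1] "1 Luke"
## [1] "2 Jen"
## [1] "3 Mike"
# a while loop repeats as long as its condition is TRUE
k <- 1
while (k < 100) {
  k <- k * 2
}
k
## [1] 128
```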

A.5 Functions

  • for help use ?function_name

A.5.1 Writing functions

We can write our own functions with function(). In the parentheses, we define the parameters the function takes, and in the curly brackets we define what the function does. We use return() to return a value.

sum_first_n_elements <- function(n) {
  my_sum <- 0
  for (i in seq_len(n)) { # seq_len(n) is empty for n = 0, unlike 1:n
    my_sum <- my_sum + i
  }
  return(my_sum)
}
sum_first_n_elements(10)
## [1] 55
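
Function parameters can have default values, and many computations can be written without a loop. The function sum_of_powers below is a hypothetical variation on the example above, not a function from any package.

```r
# power has a default value, so it can be omitted in the call
sum_of_powers <- function(n, power = 1) {
  if (n < 0) {
    stop("n must be non-negative")
  }
  # seq_len(n) is 1, ..., n; ^ and sum() work on the whole vector at once
  sum(seq_len(n)^power)  # the last evaluated expression is returned
}
sum_of_powers(10)
## [1] 55
sum_of_powers(10, power = 2)
## [1] 385
sum_of_powers(0)
## [1] 0
```

Without an explicit return(), a function returns the value of the last expression it evaluates.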

A.6 Other tips

  • Use set.seed(arbitrary_number) at the beginning of a script to set the seed of the random number generator and ensure reproducibility.
  • To dynamically set the working directory in RStudio to the folder containing an R script, use setwd(dirname(rstudioapi::getSourceEditorContext()$path)).
  • To avoid slow R loops, use the apply family of functions. See ?apply and ?lapply.
  • To make your data manipulation (and therefore your life) a whole lot easier, use the dplyr package.
  • Use getAnywhere(function_name) to view the source code of a function written in R, including functions that a package does not export.
  • Use browser() for debugging. See ?browser.
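
The first and third tips can be sketched as follows. The variable names are chosen for illustration only.

```r
set.seed(1)
first_draw  <- runif(3)
set.seed(1)
second_draw <- runif(3)
identical(first_draw, second_draw) # same seed, same random numbers
## [1] TRUE

# squares of 1, ..., 5 with an explicit loop
squares_loop <- numeric(5)
for (i in 1:5) {
  squares_loop[i] <- i^2
}
# the same result with sapply(), which applies a function to each element
squares_apply <- sapply(1:5, function(i) i^2)
squares_apply
## [1]  1  4  9 16 25
identical(squares_loop, squares_apply)
## [1] TRUE
m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                   # apply sum() to each row (margin 1)
## [1]  9 12
```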

A.7 Further reading and references