A R programming language
A.1 Basic characteristics
R is free software for statistical computing and graphics. It is widely used by statisticians, scientists, and other professionals for software development and data analysis. It is an interpreted language and therefore the programs do not need compilation.
A.2 Why R?
R is one of the two main languages used for statistics and machine learning (the other being Python).
Pros
- Libraries. Comprehensive collection of statistical and machine learning packages.
- Easy to code.
- Open source. Anyone can access R and develop new methods. Additionally, it is relatively simple to get source code of established methods.
- Large community. The use of R has been rising for some time, in both industry and academia. As a result, a large collection of blogs and tutorials exists, along with people offering help on sites like Stack Overflow and Cross Validated.
- Integration with other languages and LaTeX.
- New methods. Many researchers develop R packages based on their research, therefore new methods are available soon after development.
Cons
- Slow. Programs run slower than in many other programming languages. This can be partly remedied by efficient coding or by integration with other languages.
- Memory intensive. This can become a problem with large data sets, as they need to be stored in the memory, along with all the information the models produce.
- Some packages are not as good as they should be, or have poor documentation.
- Object oriented programming in R can be very confusing and complex.
A.3 Setting up
A.3.1 RStudio
RStudio is the most widely used IDE for R. It is free and can be downloaded from https://rstudio.com/. While console R is sufficient for the requirements of this course, we recommend that students install RStudio for its better user interface.
A.3.2 Libraries for data science
Listed below are some of the more useful libraries (packages) for data science. Students are also encouraged to find other useful packages.
- dplyr Efficient data manipulation. Part of the wider package collection called tidyverse.
- ggplot2 Plotting based on grammar of graphics.
- stats Several statistical models.
- rstan Bayesian inference using Hamiltonian Monte Carlo. Very flexible model building.
- MCMCpack Bayesian inference.
- rmarkdown, knitr, and bookdown Dynamic reports (such as this one).
- devtools Package development.
A.4 R basics
A.4.1 Variables and types
Important information and tips:
- no type declaration
- define variables with <- instead of = (both work, but there is a slight difference, and most packages use the arrow)
- for strings use ""
- for comments use #
- change types with as.type() functions, for example as.integer() or as.character()
- no special type for a single character (unlike, for example, C++)
## [1] "double"
## [1] "character"
## [1] "logical"
## [1] "integer"
## [1] "character"
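The code that produced the output above is not shown here. A minimal sketch that would give the same five results, assuming hypothetical variables n, name, and is_student (these names are illustrative and not taken from the original):

```r
# Hypothetical variables for illustration; the original chunk is not shown.
n <- 20                  # numbers are stored as double by default
name <- "Luka"           # strings use double quotes
is_student <- TRUE       # logical value

typeof(n)                # "double"
typeof(name)             # "character"
typeof(is_student)       # "logical"
typeof(as.integer(n))    # "integer"
typeof(as.character(n))  # "character"
```

Note that a whole number such as 20 is still a double unless it is converted with as.integer() or written as 20L.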
A.4.2 Basic operations
## [1] 22.7
## [1] 17.3
## [1] 17.3
## [1] 54
## [1] 7.407407
## [1] 7.29
## [1] 1.643168
## [1] FALSE
## [1] TRUE
## [1] FALSE
## [1] FALSE
## [1] "Luka is 20 years old"
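The expressions behind these outputs are missing. The sketch below is one possible reconstruction, assuming n is 20, x is 2.7, and name is "Luka" (assumed values, chosen because they agree with the numbers printed above):

```r
# Assumed values; chosen to be consistent with the printed results.
n <- 20
x <- 2.7
name <- "Luka"

n + x        # addition
n - x        # subtraction
n * x        # multiplication
n / x        # division
x^2          # power
sqrt(x)      # square root

n > 2 * n    # greater than
n == n       # equal
n == 2 * n   # equal
n != n       # not equal

paste(name, "is", n, "years old")  # concatenate strings, separated by spaces
```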
A.4.3 Vectors
- use c() to combine elements into vectors
- can only contain one type of variable
- if different types are provided, all are coerced to the most general type in the vector (logical < integer < double < character)
- access elements by indexes or logical vectors of the same length
- a scalar value is regarded as a vector of length 1
## [1] 1 2 3 4
student_ages <- c(20, 23, 21)
student_names <- c("Luke", "Jen", "Mike")
passed <- c(TRUE, TRUE, FALSE)
length(student_ages)
## [1] 3
## [1] 23
## [1] 20 23
student_ages[2] <- 24 # change values
# access by logical vectors
student_ages[passed == TRUE] # same as student_ages[passed]
## [1] 20 24
## [1] 20 21
## [1] "Jen" "Mike"
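Several of the indexing expressions above are missing from the listing. The following self-contained sketch shows indexing by position and by logical vector, plus type coercion. The last two indexing expressions and the variable mixed are assumptions for illustration, not the original code:

```r
student_ages <- c(20, 23, 21)
student_names <- c("Luke", "Jen", "Mike")
passed <- c(TRUE, TRUE, FALSE)

1:4                               # the colon operator creates a sequence
student_ages[2]                   # access a single element by index
student_ages[1:2]                 # access several elements
student_ages[2] <- 24             # change values

student_ages[passed]              # elements where passed is TRUE: 20 24
student_ages[c(TRUE, FALSE, TRUE)]
student_names[student_ages > 20]  # a condition yields a logical vector

# Mixing types coerces all elements to the most general type
mixed <- c(1, "a", TRUE)
typeof(mixed)                     # "character"
```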
A.4.3.1 Operations with vectors
- most operations are element-wise
- if we operate on vectors of different lengths, the shorter vector periodically repeats its elements until it reaches the length of the longer one
## [1] 3 5 6
## [1] 2 6 5
## Warning in a + d: longer object length is not a multiple of shorter object
## length
## [1] 7 10 11
## [1] 5 7 7
## [1] FALSE TRUE TRUE
## [1] FALSE FALSE FALSE
## [,1]
## [1,] 13
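The vectors used for these results are not shown. One reconstruction that agrees with every printed result is sketched below. The names a and d appear in the warning message, while b and all the values are inferred assumptions:

```r
# Assumed values, inferred from the printed results.
a <- c(1, 3, 5)
b <- c(2, 2, 1)
d <- c(6, 7)

a + b      # element-wise sum: 3 5 6
a * b      # element-wise product: 2 6 5
a + d      # d is recycled to 6 7 6; R warns because 3 is not a multiple of 2
a + 2 * b  # the scalar 2 is recycled across b
a > b      # element-wise comparison
b == a
a %*% b    # inner product, returned as a 1 x 1 matrix
```

Recycling is silent when the longer length is a multiple of the shorter one, so it is worth checking lengths explicitly when vectors are expected to match.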
A.4.4 Factors
- vectors of finite predetermined classes
- suitable for categorical variables
- ordinal (ordered) or nominal (unordered)
## [1] Audi BMW Mercedes BMW
## Levels: Audi BMW Mercedes
freq <- factor(x = NA,
               levels = c("never", "rarely", "sometimes", "often", "always"),
               ordered = TRUE)
freq[1:3] <- c("rarely", "sometimes", "rarely")
freq
## [1] rarely sometimes rarely
## Levels: never < rarely < sometimes < often < always
## Warning in `[<-.factor`(`*tmp*`, 4, value = "quite_often"): invalid factor
## level, NA generated
## [1] rarely sometimes rarely <NA>
## Levels: never < rarely < sometimes < often < always
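The chunk that created the first (nominal) factor and the assignment that triggered the warning are missing. A minimal sketch, assuming the hypothetical name car_brand for the nominal factor:

```r
# Nominal (unordered) factor; car_brand is an assumed name.
car_brand <- factor(c("Audi", "BMW", "Mercedes", "BMW"))
levels(car_brand)      # levels are sorted alphabetically by default
as.integer(car_brand)  # underlying integer codes: 1 2 3 2

# Ordinal (ordered) factor: comparisons follow the order of the levels
freq <- factor(c("rarely", "sometimes", "rarely"),
               levels = c("never", "rarely", "sometimes", "often", "always"),
               ordered = TRUE)
freq[1] < freq[2]      # TRUE, since "rarely" comes before "sometimes"

# Assigning a value that is not a level gives a warning and produces NA
freq[4] <- "quite_often"
is.na(freq[4])         # TRUE
```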
A.4.5 Matrices
- two-dimensional generalizations of vectors
## [,1] [,2] [,3]
## [1,] 1 2 1
## [2,] 5 4 2
## [,1] [,2]
## [1,] 1 2
## [2,] 3 3
## [1] 2
## [1] 5 4 2
## [1] 1 2
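The code that created and indexed these matrices is not shown. The sketch below reproduces the two printed matrices. The name my_matrix is used later in this appendix, while my_square_matrix and the exact indexing expressions are assumptions:

```r
# byrow = TRUE fills the matrix row by row (the default is column by column)
my_matrix <- matrix(c(1, 2, 1,
                      5, 4, 2), nrow = 2, byrow = TRUE)
my_square_matrix <- matrix(c(1, 3, 2, 3), nrow = 2)  # assumed name

my_matrix
my_square_matrix

my_matrix[1, 2]  # first row, second column: 2
my_matrix[2, ]   # second row: 5 4 2
my_matrix[, 3]   # third column: 1 2
```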
A.4.5.1 Matrix functions and operations
- most operations are element-wise
- mind the dimensions when using matrix multiplication %*%
## [1] 2
## [1] 3
## [1] 2 3
## [,1] [,2]
## [1,] 1 5
## [2,] 2 4
## [3,] 1 2
## [1] 1 4
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
## [1] -3
## [,1] [,2] [,3]
## [1,] 3 6 3
## [2,] 15 12 6
## [,1] [,2] [,3]
## [1,] 1 4 1
## [2,] 25 16 4
## [,1] [,2]
## [1,] 6 15
## [2,] 15 45
## [1] 1 5 2 4 1 2
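The function calls behind these outputs are missing. The following sketch uses base R functions that would plausibly produce them, in the same order. The matrices are redefined so the block is self-contained, and my_square_matrix is an assumed name:

```r
my_matrix <- matrix(c(1, 2, 1,
                      5, 4, 2), nrow = 2, byrow = TRUE)
my_square_matrix <- matrix(c(1, 3, 2, 3), nrow = 2)  # assumed name

nrow(my_matrix)             # number of rows
ncol(my_matrix)             # number of columns
dim(my_matrix)              # both dimensions
t(my_matrix)                # transpose
diag(my_matrix)             # diagonal elements: 1 4
diag(3)                     # 3 x 3 identity matrix
det(my_square_matrix)       # determinant: 1 * 3 - 2 * 3
3 * my_matrix               # scalar multiplication, element-wise
my_matrix * my_matrix       # element-wise product
my_matrix %*% t(my_matrix)  # matrix product: (2 x 3) times (3 x 2) is 2 x 2
c(my_matrix)                # flatten to a vector, column by column
```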
A.4.6 Arrays
- multi-dimensional generalizations of matrices
## [1] 1
## [1] 4
## [,1] [,2]
## [1,] 1 5
## [2,] 3 7
## [1] 2 2 2
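The array definition is not shown. A 2 x 2 x 2 array filled with the integers 1 to 8 agrees with all four outputs above. The name my_array and the indexing expressions are assumptions:

```r
# Elements are filled with the first index varying fastest
my_array <- array(1:8, dim = c(2, 2, 2))  # assumed name and contents

my_array[1, 1, 1]  # 1
my_array[2, 2, 1]  # 4
my_array[1, , ]    # fixing the first index leaves a 2 x 2 matrix
dim(my_array)      # 2 2 2
```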
A.4.7 Data frames
- basic data structure for analysis
- differ from matrices as columns can be of different types
student_data <- data.frame("Name" = student_names,
                           "Age" = student_ages,
                           "Pass" = passed)
student_data
## Name Age Pass
## 1 Luke 20 TRUE
## 2 Jen 24 TRUE
## 3 Mike 21 FALSE
## name age pass
## 1 Luke 20 TRUE
## name pass
## 1 Luke TRUE
## 2 Jen TRUE
## 3 Mike FALSE
## [1] TRUE TRUE FALSE
## name age pass
## 1 Luke 20 TRUE
## 2 Jen 24 TRUE
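The outputs above show lower-case column names and several subsets, but the corresponding code is missing. A self-contained sketch of operations that would give these results (the exact original expressions are not known):

```r
student_data <- data.frame(Name = c("Luke", "Jen", "Mike"),
                           Age = c(20, 24, 21),
                           Pass = c(TRUE, TRUE, FALSE))

colnames(student_data) <- c("name", "age", "pass")  # rename columns

student_data[1, ]                     # first row
student_data[, c("name", "pass")]     # select columns by name
student_data$pass                     # a single column as a vector
student_data[student_data$pass, ]     # rows where pass is TRUE
```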
A.4.8 Lists
- useful for storing different data structures
- access elements with double square brackets
- elements can be named
first_list <- list(student_ages, my_matrix, student_data)
second_list <- list(student_ages, my_matrix, student_data, first_list)
first_list[[1]]
## [1] 20 24 21
## [[1]]
## [1] 20 24 21
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 2 1
## [2,] 5 4 2
##
## [[3]]
## name age pass
## 1 Luke 20 TRUE
## 2 Jen 24 TRUE
## 3 Mike 21 FALSE
## [1] 20 24 21
## [1] 4
second_list[[length(second_list) + 1]] <- "add_me" # append an element
names(first_list) <- c("Age", "Matrix", "Data")
first_list$Age
## [1] 20 24 21
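Some of the list-access expressions above are missing. The sketch below illustrates nested access and the difference between single and double square brackets, using a smaller hypothetical list so that it runs on its own:

```r
student_ages <- c(20, 24, 21)
first_list <- list(student_ages, "some text")  # simplified, hypothetical contents
second_list <- list(student_ages, first_list)  # a list can contain another list

second_list[[2]]        # the nested list
second_list[[2]][[1]]   # first element of the nested list
length(second_list)     # number of top-level elements

# Single brackets return a sub-list, double brackets return the element itself
class(first_list[1])    # "list"
class(first_list[[1]])  # "numeric"

# Names can also be given when the list is created
named_list <- list(Age = student_ages, Note = "some text")
named_list$Age
named_list[["Note"]]
```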
A.4.9 Loops
- the for loop is the most commonly used loop
- a for loop can iterate over an arbitrary vector
# iterate over consecutive natural numbers
my_sum <- 0
for (i in 1:10) {
  my_sum <- my_sum + i
}
my_sum
## [1] 55
# iterate over an arbitrary vector
my_sum <- 0
some_numbers <- c(2, 3.5, 6, 100)
for (i in some_numbers) {
  my_sum <- my_sum + i
}
my_sum
## [1] 111.5
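Two further patterns may be useful alongside the loops above: iterating over the indices of a vector with seq_along(), and the while loop. This is an illustrative sketch; the variable names running_total and k are hypothetical:

```r
some_numbers <- c(2, 3.5, 6, 100)

# Iterate over indices when the position of each element is needed
running_total <- numeric(length(some_numbers))
for (i in seq_along(some_numbers)) {
  previous <- if (i == 1) 0 else running_total[i - 1]
  running_total[i] <- previous + some_numbers[i]
}
running_total  # 2.0 5.5 11.5 111.5

# A while loop repeats as long as its condition is TRUE
k <- 0
while (2^k < 1000) {
  k <- k + 1
}
k  # smallest k with 2^k >= 1000, which is 10
```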
A.5 Functions
- for help use ?function_name
A.5.1 Writing functions
We can write our own functions with function(). In the brackets, we define the parameters the function gets, and in curly brackets we define what the function does. We use return() to return values.
sum_first_n_elements <- function(n) {
  my_sum <- 0
  # seq_len(n) is empty for n = 0, whereas 1:n would iterate over 1 and 0
  for (i in seq_len(n)) {
    my_sum <- my_sum + i
  }
  return(my_sum)
}
sum_first_n_elements(10)
## [1] 55
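Function parameters can also have default values, which are used when the caller omits the argument. A small sketch with a hypothetical function sum_first_n_powers (not part of the original text):

```r
# Hypothetical example: sum of the first n natural numbers raised to a power.
# The parameter power has a default value of 1.
sum_first_n_powers <- function(n, power = 1) {
  return(sum(seq_len(n)^power))
}

sum_first_n_powers(10)             # default power: 1 + 2 + ... + 10 = 55
sum_first_n_powers(3, power = 2)   # 1 + 4 + 9 = 14
```

Here the loop is replaced by the vectorized sum(), which is usually both shorter and faster in R.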
A.6 Other tips
- Use set.seed(arbitrary_number) at the beginning of a script to set the seed and ensure reproducibility.
- To dynamically set the working directory in RStudio to the parent folder of an R script, use setwd(dirname(rstudioapi::getSourceEditorContext()$path)).
- To avoid slow R loops, use the apply family of functions. See ?apply and ?lapply.
- To make your data manipulation (and therefore your life) a whole lot easier, use the dplyr package.
- Use getAnywhere(function_name) to get the source code of any function.
- Use browser for debugging. See ?browser.
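Two of the tips above can be illustrated briefly: setting the seed and replacing loops with the apply family. The seed value and the variable names below are arbitrary choices for illustration:

```r
# Setting the same seed makes random draws reproducible
set.seed(1)
first_draw <- rnorm(3)
set.seed(1)
second_draw <- rnorm(3)
identical(first_draw, second_draw)  # TRUE

# apply() calls a function on each row (1) or column (2) of a matrix
my_matrix <- matrix(c(1, 2, 1,
                      5, 4, 2), nrow = 2, byrow = TRUE)
apply(my_matrix, 1, sum)            # row sums: 4 11
apply(my_matrix, 2, max)            # column maxima: 5 4 2

# sapply() calls a function on each element and simplifies the result to a vector
sapply(1:5, function(i) i^2)        # 1 4 9 16 25
```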
A.7 Further reading and references
- Getting started with R Studio: https://www.youtube.com/watch?v=lVKMsaWju8w
- Official R manuals: https://cran.r-project.org/manuals.html
- Cheatsheets: https://www.rstudio.com/resources/cheatsheets/
- Workshop on R, dplyr, ggplot2, and R Markdown: https://github.com/bstatcomp/Rworkshop