"R is a free software environment for statistical computing and graphics."
R can be run from the command line; however, we are going to use RStudio, a free-to-use program that provides a user-friendly environment. The RStudio interface follows a layout similar to that of the development environments of other scripting languages, such as Spyder for Python or the MATLAB desktop.
The latest version of R can be downloaded from the CRAN website, whereas RStudio can be found on the RStudio website.
R must be installed before RStudio.
The default RStudio layout consists of three panels. The left panel shows the executed code together with any textual output. On the right, the top panel shows the variables and functions currently loaded in the environment, and the bottom panel gives access to the file system. When a script is opened, the left panel is split in two.
R software packages are distributed through three main channels: CRAN, Bioconductor and GitHub.
CRAN packages can be installed directly from the command line, by calling the function:
# R
install.packages
For example, to install the package ggplot2:
# R
install.packages('ggplot2')
or, equivalently,
# R
install.packages("ggplot2")
Notice that the name of the package is passed to the function as a string, delimited by either " or ' (the two are interchangeable in R).
Typically, the function also installs all the dependencies, but in some cases they must be installed manually. This is usually reported in the output generated by install.packages.
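A common pattern is to install a package only when it is not already available. The following is a minimal sketch of that idea; the helper name installIfMissing is hypothetical and is not part of R or of any package.

```r
# R
# Hypothetical helper (not part of base R): install a CRAN package only if
# it is not already available, then report whether it can be loaded.
installIfMissing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call does not trigger a download
installIfMissing("stats")
```

The same call with "ggplot2" would download the package from CRAN only on the first run.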
To install Bioconductor packages, first install the BiocManager package:
# R
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
then call the actual install command (e.g. to install the GenomicFeatures package):
# R
BiocManager::install("GenomicFeatures")
GitHub packages are distributed directly by their developers and may contain features still in development, so users should verify that the code works properly before applying it to real-case problems.
To install packages from GitHub, it is necessary to know the URL of the package. This can be found by searching for the name of the package on the GitHub website. For instance, the URL for the MALDIquant package is
https://github.com/sgibb/MALDIquant
The package devtools (from CRAN) is required to install packages from GitHub:
# R
install.packages("devtools")
Then install the package, calling the command:
# R
devtools::install_github("<DeveloperName>/<PackageName>")
which in the case of MALDIquant becomes
# R
# URL: https://github.com/sgibb/MALDIquant =
# https://github.com/<DeveloperName>/<PackageName>
devtools::install_github("sgibb/MALDIquant")
Functions from base R and from packages are distributed with help pages that also show examples of usage. The help for a function can be opened with the following command (to be executed in the command line):
# R
?<functionName>
For instance, to get the help associated with the function factor:
# R
?factor
Additional information is often provided together with packages on their website. Simple tutorials, called vignettes, may be distributed as well.
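Besides the ? operator, base R offers a few other ways to look things up. The sketch below uses only base functions; the calls that open help pages or viewers are left commented out because they are meant for interactive use.

```r
# R
# List the functions whose name starts with "nchar"
apropos("^nchar")

# Show the arguments of a function without opening its help page
args(nchar)

# The following calls open help pages or run examples interactively:
# help("factor")          # same as ?factor
# example("factor")       # run the examples from the help page
# help.search("factor")   # same as ??factor, searches the installed help
# vignette()              # list the vignettes of the installed packages
```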
Resources:
- ggplot2
- dplyr
- tidyr
- XCMS
Check out tidyverse, a set of packages specifically designed for data science.
The basic data types available in R are:
- scalar
- vector
- character
- string
- matrix
- array
- data frame
- list
- factor
A scalar is a single number. This can be an integer, a real number (stored in double precision) or a complex number. A scalar can be assigned to a generic variable with the following command (variable name = x):
# R
x <- 3.1415926535897932384626433832795028841971693993751
The symbol <- represents the assignment operator. Alternatively, the symbol = can be used, but <- is preferred.
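The numeric types mentioned above can be inspected with the base function typeof. This short sketch uses illustrative variable names (xInt, xReal, xCplx) that do not appear elsewhere in this tutorial.

```r
# R
xInt <- 3L        # the suffix L creates an integer
xReal <- 3.14     # numbers are stored in double precision by default
xCplx <- 2 + 3i   # complex number

typeof(xInt)
# > [1] "integer"
typeof(xReal)
# > [1] "double"
typeof(xCplx)
# > [1] "complex"

# A scalar is stored as a vector of length 1
length(xReal)
# > [1] 1
```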
A vector is an ordered collection of scalars.
For instance, a 4-element vector can be defined using the command c():
# R
X <- c(2, 3, 4, 5)
The elements of a vector can be selected by passing the index corresponding to the element of interest. For instance, to read the second element of a vector:
# R
myVector <- c(4, 5, 6, 7)
secElement <- myVector[2] # 2 is the index of the second element, so secElement is now equal to 5
The index can also be a scalar variable. The following gives the same result:
# R
myVector <- c(4, 5, 6, 7)
myIndex <- 2
secElement <- myVector[myIndex] # myIndex is 2, so secElement is again equal to 5
The length of a vector is given by the command length:
# R
myVector <- c(4, 5, 6, 7)
length(myVector)
# > [1] 4
A character variable can be defined using the symbols ' or ":
# R
myChar <- "A"
myChar <- 'A' # Identical results
A string is a sequence of characters:
# R
myString <- 'Hello, World!'
Unlike the elements of a vector, the characters of a string cannot be accessed by their index. Specific functions are available to work with strings.
Note that a string is considered a 1-element object, unlike a character vector of length n, which is a collection of n character variables.
The length of a string is given by the command nchar:
# R
myString <- 'Hello, World!'
nchar(myString)
# > [1] 13
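As an illustration of the string functions mentioned above, the following sketch uses a few base R functions (substr, strsplit, paste, toupper). The variable myChars is introduced only for this example.

```r
# R
myString <- 'Hello, World!'

# Extract the characters from position 1 to position 5
substr(myString, 1, 5)
# > [1] "Hello"

# Split the string into single characters (strsplit returns a list)
myChars <- strsplit(myString, "")[[1]]
myChars[2]
# > [1] "e"

# Join strings and convert to upper case
paste("Hello", "R", sep = ", ")
# > [1] "Hello, R"
toupper(myString)
# > [1] "HELLO, WORLD!"
```

Once the string has been split, myChars is an ordinary character vector, so its elements can be accessed by index.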
A factor is a special vector of labelled elements. Usually its elements are discrete and can be either strings or scalars:
# R
myFactor <- factor(c('This', 'is', 'my', 'factor'))
myFactor2 <- factor(c('Y', 'Y', 'Y', 'N', 'N', 'Y'))
myFactor3 <- factor(c(2, 2, 4, 4, 1, 2, 1, 3))
As shown above, a factor is generated by passing a vector to the function factor.
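The labels of a factor and their internal coding can be inspected with base functions. This sketch reuses myFactor2 from the example above; the variable counts is introduced only for illustration.

```r
# R
myFactor2 <- factor(c('Y', 'Y', 'Y', 'N', 'N', 'Y'))

# The distinct labels (levels), sorted alphabetically by default
levels(myFactor2)
# > [1] "N" "Y"

# Number of occurrences of each level: N appears 2 times, Y 4 times
counts <- table(myFactor2)

# Internally, each element is stored as the integer index of its level
as.integer(myFactor2)
# > [1] 2 2 2 1 1 2
```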
A numeric matrix can be defined by the command matrix. The first argument of this function is the full vector of values that will be used as matrix elements (filled column by column). The second and third arguments represent the number of rows and columns, respectively. The number of elements should be equal to the product of the matrix dimensions. For instance, a random permutation of the integers from 1 to 20 can be used to fill a 4x5 matrix:
# R
matElements <- sample(20)
Xmat <- matrix(matElements, 4, 5)
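To make the column-by-column filling order visible, this small sketch uses a fixed sequence instead of random values. The matrices A and B are illustrative names; byrow is a standard argument of matrix.

```r
# R
# Default: values fill the matrix column by column
A <- matrix(1:6, 2, 3)
# A is:
# 1 3 5
# 2 4 6
A[1, 2]
# > [1] 3

# With byrow = TRUE the values fill the matrix row by row
B <- matrix(1:6, 2, 3, byrow = TRUE)
# B is:
# 1 2 3
# 4 5 6
B[1, 2]
# > [1] 2
```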
The dimensions of a matrix are given by the following commands:
# R
# Define a matrix
matElements <- sample(20)
Xmat <- matrix(matElements, 4, 5)
# Number of rows
nrow(Xmat)
# > [1] 4
# Number of columns
ncol(Xmat)
# > [1] 5
# Both
dim(Xmat)
# > [1] 4 5
Matrix dimensions can be named using the commands dimnames, rownames, or colnames.
Names can also be assigned at definition time:
# R
Xmat <- matrix(sample(20), 4, 5)
# Assign the row names
rownames(Xmat) <- c(1:4)
# Assign the column names
colnames(Xmat) <- c(1:5)
# Read the row names and column names
rownames(Xmat)
# > [1] "1" "2" "3" "4"
colnames(Xmat)
# > [1] "1" "2" "3" "4" "5"
# Assign using dimnames
dimnames(Xmat) <- list(c(1:4), c(1:5)) # Notice that in this case we need a list
# Assign at the definition
Xmat <- matrix(sample(20), 4, 5, dimnames = list(c(1:4), c(1:5))) # Same as dimnames command
Xmat
# > Xmat
# 1 2 3 4 5
# 1 6 2 18 15 4
# 2 13 9 8 7 20
# 3 10 14 1 19 5
# 4 11 3 16 12 17
An array is the extension of a matrix to more than 2 dimensions. For instance, the following command will assign a 3-dimensional array of dimensions (5 x 6 x 10) to the variable myArray:
# R
myArray <- array(sample(300), c(5, 6, 10))
Elements of arrays can be accessed in the same way as those of vectors and matrices:
# R
myElement <- myArray[1, 3, 2] # myElement corresponds to the element (1, 3, 2) of myArray
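A deterministic sketch can make array indexing easier to follow. Here a small array is filled with the sequence 1 to 24, so the position of each value can be checked by hand; the names smallArray and mySlice are illustrative.

```r
# R
# A 2 x 3 x 4 array filled with 1, 2, ..., 24 (first index varies fastest)
smallArray <- array(1:24, c(2, 3, 4))

dim(smallArray)
# > [1] 2 3 4

# Element (1, 3, 2): position 1 + (3 - 1) * 2 + (2 - 1) * 6 = 11
smallArray[1, 3, 2]
# > [1] 11

# Leaving an index empty selects everything along that dimension:
# the second "layer" of the array is a 2 x 3 matrix
mySlice <- smallArray[, , 2]
dim(mySlice)
# > [1] 2 3
```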
A list is a more complex data structure. It can be seen as a vector, whose elements can be of different types or dimensions. For instance, a list containing a vector and a matrix can be defined as follows:
# R
# Element-by-element assignment
myList <- list() # Empty list
myList[[1]] <- sample(20) # vector
myList[[2]] <- matrix(sample(100, 20), 4, 5) # matrix
# Direct assignment: the first element will be named 'myVector',
# and the second element 'myMatrix'
myList <- list(myVector = sample(20),
myMatrix = matrix(sample(100, 20), 4, 5))
The elements of a list can be accessed by passing their index or the name, as defined in the list. Using the previous example:
# R
X <- myList[[1]] # X is now equal to myVector NOTE: [[ ]] instead of [ ]
X <- myList$myVector # Access by name through the operator $
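The difference between [ ] and [[ ]] on lists is a frequent source of confusion, so here is a short sketch. It uses a small list with fixed values; the names subList and element are introduced only for this example.

```r
# R
myList <- list(myVector = c(4, 5, 6), myLabel = "A")

# Names of the elements
names(myList)
# > [1] "myVector" "myLabel"

# Single brackets return a (shorter) list
subList <- myList[1]
class(subList)
# > [1] "list"

# Double brackets return the element itself
element <- myList[["myVector"]]
class(element)
# > [1] "numeric"
```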
Finally, a data frame is a matrix-like structure (all columns have the same length) whose columns can be vectors of different data types. For instance, a character vector and a numeric vector can be joined to form a data frame:
# R
myCharVector <- c('A', 'B', 'C', 'D')
myNumVector <- c(1, 2, 3, 4)
myDataFrame <- data.frame(Letters = myCharVector, Numbers = myNumVector)
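Columns and rows of a data frame can be read in several ways. The sketch below rebuilds the data frame from the example above; stringsAsFactors = FALSE is set explicitly so that the letters stay character strings on older R versions too, and bigRows is an illustrative name.

```r
# R
myDataFrame <- data.frame(Letters = c('A', 'B', 'C', 'D'),
                          Numbers = c(1, 2, 3, 4),
                          stringsAsFactors = FALSE)

# Number of rows and columns
dim(myDataFrame)
# > [1] 4 2

# A column can be read by name with the operator $
myDataFrame$Numbers
# > [1] 1 2 3 4

# A single cell can be read by row index and column name
myDataFrame[2, "Letters"]
# > [1] "B"

# Rows can be selected with a logical condition on a column
bigRows <- myDataFrame[myDataFrame$Numbers > 2, ]
nrow(bigRows)
# > [1] 2
```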
As seen in the previous section, elements of vectors, arrays, etc. can be accessed by their indices. A single element is selected by the value of its index (which can also be stored in an integer variable). Multiple elements can be accessed as well, using the following commands:
# R
# Define a matrix
myMatrix <- matrix(sample(30), 5, 6)
# Read the 4th row
myMatrix[4, ]
# Read the 2nd column
myMatrix[, 2]
# Read the first 3 elements of the 4th column
myMatrix[1:3, 4]
The expression a:b is equivalent to c(a, a+1, a+2, ..., b-1, b).
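The same ideas apply to vectors. This sketch shows a few common forms of multiple selection in base R (ranges, explicit index vectors, negative indices and logical conditions), reusing the vector from the earlier examples.

```r
# R
myVector <- c(4, 5, 6, 7)

# A range of indices
myVector[2:3]
# > [1] 5 6

# An arbitrary set of indices
myVector[c(1, 4)]
# > [1] 4 7

# A negative index drops the corresponding element
myVector[-1]
# > [1] 5 6 7

# A logical condition keeps the elements for which it is TRUE
myVector[myVector > 5]
# > [1] 6 7

# seq() generalises a:b to steps other than 1
seq(2, 10, by = 2)
# > [1]  2  4  6  8 10
```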
Repeated operations can be assembled into functions. Functions are often exported by packages, or can be defined by the user. User-defined functions follow the structure:
# R
myFunction <- function(argument1, argument2, ...) {
# Operations go here
...
return(returnValue)
}
The function can then be called by its name:
# R
myValue <- myFunction(x, y, ...)
As you can notice, the function ends with the command return. This defines the value returned by the function, which can be of any data type.
For instance, a function that calculates the factorial of an integer can be defined as follows:
# R
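myFactorial <- function(n) {
  # Check that the argument is a single non-negative whole number.
  # Note: is.integer(25) is FALSE because 25 is stored as a double,
  # so the value is compared with its rounded version instead
  stopifnot(is.numeric(n), length(n) == 1, n >= 0, n == round(n))
  # Calculate 1 * 2 * ... * (n-1) * n
  f <- 1
  for (i in seq_len(n)) # seq_len(0) is empty, so 0! correctly gives 1
    f <- f * i
  # Then return the value
  return(f)
}
Then, the factorial of an integer can be calculated by calling the function: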
# R
myFactorial(25) # Returns the value of 25!
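Since the returned value can be of any data type, a function can also return several results at once by wrapping them in a list. The following is a minimal sketch; describeVector and its default argument digits are hypothetical names chosen for illustration.

```r
# R
# Hypothetical example: return several summary values as a named list.
# The argument digits has a default value, so it can be omitted in the call.
describeVector <- function(x, digits = 2) {
  result <- list(n = length(x),
                 mean = round(mean(x), digits),
                 range = range(x))
  return(result)
}

res <- describeVector(c(4, 5, 6, 7))
res$n
# > [1] 4
res$mean
# > [1] 5.5
res$range
# > [1] 4 7
```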
In R, repeated operations (iterations) can be modelled in different ways. The canonical for loops can be run in this way:
# R
for (iterator in firstValue:lastValue)
{
# Perform some operations
doSomething(iterator)
}
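In this example, the third power of x is calculated using a for loop:
# R
# A very inefficient power calculation (use x^3 in real life)
x <- 2
y <- x # y holds the running product, starting from x^1
for (i in 1:2)
{
  y <- y * x # after the two iterations, y is equal to x^3
}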
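A for loop can also run over the positions of a vector. The sketch below accumulates the sum of the elements of a vector; seq_along is a base function that returns the indices 1, 2, ..., length(x), and the variable total is illustrative.

```r
# R
myVector <- c(4, 2, 10)

# Accumulate the sum of the elements, one at a time
total <- 0
for (i in seq_along(myVector))
{
  total <- total + myVector[i]
}
total
# > [1] 16

# In real life, the built-in function gives the same result
sum(myVector)
# > [1] 16
```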
R also allows iterations to be run with the commands apply, sapply, and lapply.
The function apply returns the values of a function calculated over a margin of a variable (e.g. the columns of a matrix). For instance, to calculate the sum of the elements of each column of a matrix:
# R
apply(myMatrix, 2, sum) # 2 defines the calculation over columns (1 for rows)
If we want to apply more complex operations, we can define a function on the elements
# R
# Calculate the sum of squares of columns elements
apply(myMatrix, 2, function(x) sum(x^2))
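The effect of the margin argument is easier to see on a small matrix with known values. In this sketch, M is an illustrative 2 x 3 matrix; rowSums and colSums are base functions that give the same totals.

```r
# R
M <- matrix(1:6, 2, 3)
# M is:
# 1 3 5
# 2 4 6

# Margin 1: one result per row
apply(M, 1, sum)
# > [1]  9 12

# Margin 2: one result per column
apply(M, 2, sum)
# > [1]  3  7 11

# For plain sums, the dedicated functions are shorter
rowSums(M)
colSums(M)
```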
The functions sapply and lapply behave similarly, but they operate element by element on vectors and lists: lapply always returns a list, whereas sapply simplifies the result to a vector when possible
# R
# Use sapply to avoid a for loop. Calculate the square of the elements of a vector
myVector <- c(4, 2, 10)
sapply(myVector, function(x) x^2) # Gives 16, 4, 100
# Example of lapply
myList <- list(x = 'my', y = 'list', z = 'is', w = 'cool')
# Calculate the number of characters of each element of myList
lapply(myList, nchar) # This will give a list with 4 numbers: 2, 4, 2, 4
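To compare the return values of the two functions, the sketch below counts the characters of each list element with nchar, using both lapply and sapply. The variables asList and asVector are illustrative; vapply is a base function that additionally checks the type of each result.

```r
# R
myList <- list(x = 'my', y = 'list', z = 'is', w = 'cool')

# lapply always returns a list
asList <- lapply(myList, nchar)
is.list(asList)
# > [1] TRUE

# sapply simplifies the result to a named vector when possible
asVector <- sapply(myList, nchar)
asVector[["y"]]
# > [1] 4

# vapply works like sapply, but the expected type of each result is declared
vapply(myList, nchar, integer(1))
```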
Like other languages, R provides conditional statements (if-else):
# R
if (conditionIsTrue)
{
doSomething
} else
{
doSomethingElse
}
Basic logical operators are:
# R
x == y # Returns TRUE if x is identical to y
x != y # Returns TRUE if x is not identical to y
x <- TRUE # x contains the logical value TRUE
x <- FALSE # x contains the logical value FALSE
!x # Only if x is a logical variable, returns its negation
x && y # Logical AND
x || y # Logical OR
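The conditional statement and the logical operators can be combined as in the following sketch. The function classifyNumber is a hypothetical example; & and ifelse are the element-wise counterparts of && and if-else in base R.

```r
# R
# Hypothetical example: label a single number according to its sign
classifyNumber <- function(x) {
  if (x > 0) {
    label <- "positive"
  } else if (x < 0) {
    label <- "negative"
  } else {
    label <- "zero"
  }
  return(label)
}

classifyNumber(-3)
# > [1] "negative"

# && and || expect single logical values;
# & and | work element by element on vectors
c(TRUE, FALSE) & c(TRUE, TRUE)
# > [1]  TRUE FALSE

# ifelse applies a condition to each element of a vector
ifelse(c(-1, 2) > 0, "positive", "not positive")
# > [1] "not positive" "positive"
```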
Besides the built-in functions, there are several packages designed to produce high-quality graphs. Probably the most famous among these is ggplot2.
Many nice examples of data graphs generated using ggplot2 are available online.
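As a starting point, the following is a minimal sketch of a ggplot2 graph. It assumes that ggplot2 has been installed from CRAN and skips the plot otherwise; the data frame plotData and its columns x and y are made up for illustration.

```r
# R
# Made-up data: the first ten integers and their squares
plotData <- data.frame(x = 1:10, y = (1:10)^2)

# Build the plot only if ggplot2 is installed (assumption: installed from CRAN)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(plotData, ggplot2::aes(x = x, y = y)) +
    ggplot2::geom_point() +   # one point per row of plotData
    ggplot2::geom_line() +    # connect the points
    ggplot2::labs(title = "Squares of the first ten integers",
                  x = "n", y = "n squared")
} else {
  p <- NULL
}

# Typing p in the console (or calling print(p)) draws the plot
```

A ggplot2 graph is built by adding layers with the + operator: the first call defines the data and the mapping of columns to axes, and each geom_ function adds a graphical layer.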