4 Programming in R
The popularity of R is in part due to its fantastic statistical packages and extremely flexible graphical capabilities. It is often easy to forget that it is a fully functional programming language in its own right. In this part of the practical we cover how to implement some standard programming practices, such as for
and while
loops, conditional if
statements and so on.
As previously stated, R is a high-level language, and as such does not require the same level of code complexity to work. For example, it is not necessary to differentiate between
integer
,float
anddouble
types for numerical arrays. It is also not necessary to declare objects in advance of using them—you can create objects ‘on-the-fly’, and even change an object type without any memory reallocation. In addition, R has many packages which provide functions to implement a huge number of different algorithms or analyses etc. The cost of this functionality is speed. For example, afor
loop written in native R code will generally be much slower than an equivalent loop written in C. For simple problems this difference is often practically negligible, and is often counter-balanced against the fact that R code is usually significantly faster to write. For complex, highly computationally demanding algorithms, this difference can be highly significant. Many R functions get around this by virtue of being wirtten in C (or Fortran), and simply providing a user-friendly interface to a fast C executable. We have already seen examples of this for functions such asmean()
andsum()
.
Often, native R code is very useful to know, and so the following sections inroduce some (hopefully) familiar concepts in the context of R.
4.1 Loops
R has implementations of both for
and while
loops. For example,
4.1.1 for
loops
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
This command creates an object i
, which takes the values in the vector 1:10
in turn, and prints each to the screen. Notice that we wrote this entire loop on a single line. The curly brackets {}
delineate the contents of the loop. A more complex example is
for(i in 1:10) {
## generate 10 random numbers
## uniformly between 0 and 1
x <- runif(10)
## calculate the mean
x <- mean(x)
## print the mean to the screen
print(x)
}
## [1] 0.5455448
## [1] 0.4690436
## [1] 0.5321327
## [1] 0.6748452
## [1] 0.5085469
## [1] 0.5547003
## [1] 0.4563259
## [1] 0.6409085
## [1] 0.462969
## [1] 0.4955541
Here I’ve included comments in the loop itself. Of course it is also possible to use nested functions to write this more concisely.
for
loop using nested functions. You should be able to fit the content of the loop on a single line. Why does your code print different values to before?
Note: for single line pieces of code, it is not always necessary to use curly brackets
{}
to delineate the loop. However, it is a good habit to get into, and saves lots of simple errors. Hence, in this workshop I will insist you always delineate loops and if/else statments using curly brackets. Please let me know if I’ve missed any!
Note: the use of indentation to delineate what is inside of the
for
loop. This is not necessary for R to work, but in my opinion greatly helps the comprehension of the code. Along with comments and spacing, this is a useful habit to get into, and I insist upon it here. Some people prefer to place the leading bracket on a different line to thefor
statement e.g. something like:for(i in 1:10) { x <- runif(10) print(x) }
I would support either approach above, but something like
for(i in 1:10) { x <- runif(10) print(x) }
is bad programming practice (in my opinion).
For those of you who are used to Python, please note that the indentations are not necessary in R, however for multi-line environments you must enclose all of the environment in curly brackets!
Loops can also be nested:
## [1] 1 1
## [1] 1 2
## [1] 1 3
## [1] 2 1
## [1] 2 2
## [1] 2 3
## [1] 3 1
## [1] 3 2
## [1] 3 3
for
loops, that transposes a 3 \(\times\) 3 matrix. Test it on the matrix x <- matrix(1:9, 3, 3)
Note:
for
loops in R can loop over any type of vector, not just sequential indices. For example,## [1] "a" "b" "c" "d" "e"
## [1] "a" ## [1] "b" ## [1] "c" ## [1] "d" ## [1] "e"
4.1.2 while
loops
R’s while
statement works in a similar way. The loop continues until some logical statement is met. For example,
## [1] 1
## [1] 2
## [1] 3
## [1] 4
Note: R doesn’t have a
++
operator like C. In addition notice that thei
variable had to be initialised in advance of the loop, so that thewhile
statement could be evaluated. Thei
variable is then incremented during the loop.
Warning: a common coding error is to create infinite loops. You must take care to ensure that the loop will eventually end. For example, the following code will produce an infinite loop. Why?
i <- 1 while(i < 5) { print(i) }
while
loop that generates a uniform random number between 0 and 1 (use the runif()
function), until a number of > 0.9 is obtained.
4.2 Conditional statements
R also supports conditional statements, such as if
and else
. These help the code to change its path depending on whether a condition is true or false. For example:
## [1] "y is < 0.5"
Note: that the
else
statement starts on the same line as the closing bracket}
from the precedingif
statement.
Conditional statements can also be nested.
x
and y
. Write a nested if
/else
statement that prints out whether x
is > 0.5, and then whether y
is > 0.5.
R also supports an ifelse
function, that can be applied element-wise to vectors and matrices. This is useful if you want to return an object the same size as the one you are testing. For example,
## [1] 0.5009349 0.8786042 0.9173392 0.5871057 0.5337919 0.2404480
## [1] TRUE TRUE TRUE TRUE TRUE FALSE
4.3 User defined functions
R also allows you define your own functions
, that take a set of arguments and return some output. For example,
## define a function that takes a number and returns whether
## the number is greater than, less than or equal to 0
sign_num <- function(x) {
if(x < 0) {
print("x is less than 0")
} else {
if(x == 0) {
print("x is equal to zero")
} else {
print("x is greater than 0")
}
}
}
This piece of code is broken down as follows:
- The
function(x)
part defines afunction
object that takes an argumentx
. Whatever variables are defined here are local variables, i.e. they only exist within the function, and are destroyed once the function exits. - The first set of curly brackets then denote which lines of code are part of the function.
- The function then takes the local variable
x
, and using a series of nestedif
/else
loops, prints out whether the number is greater than, less than or equal to zero. - When the code hits the final curly bracket, the function exits.
- The function is stored (using the assignment operator
<-
) into an object calledsign_num
.
We can then use this function by simply passing values to it:
## [1] "x is greater than 0"
## [1] "x is less than 0"
## [1] "x is equal to zero"
Note: By default, an R function returns the final line as its output. In the example above, no object was present on the final line, so R returned nothing. A useful habit to get into is to use explicit
return()
statements to denote what object is returned from a function.
The following code returns the absolute value of x
, as well as printing information on whether the original x
was greater than, less than or equal to zero:
## define a function that takes a number and returns whether
## the number is greater than, less than or equal to 0
sign_num <- function(x) {
if(x < 0) {
print("x is less than 0")
} else {
if(x == 0) {
print("x is equal to zero")
} else {
print("x is greater than 0")
}
}
return(ifelse(x < 0, -x, x))
}
sign_num(-1)
## [1] "x is less than 0"
## [1] 1
Aside: A really useful function in R is called
paste()
. This allows us to concatenate objects into strings, which is particularly useful for printing to the screen. For example:## [1] "The variable x is 10"
We can see that this has taken the value of
x
(which is 10), and created a character vector of length 1 containing the concatenated string"The variable x is 10"
. Notice by default thatpaste()
inserts a space between the arguments it’s concatenating, this can be suppressed using thesep
argument, or by using thepaste0()
function e.g. note the difference between the first function and the latter two (which are equivalent) below:## [1] "x = 1"
## [1] "x =1"
## [1] "x =1"
Try the following pieces of code. What do they do?
paste("Here is a number:", 1:10)
.paste(1:10, collapse = " + ")
.paste("Here are some numbers:", paste(1:10, collapse = ", "))
.x <- 1:10;
paste("The sum of the first", length(x),
"numbers is given by", paste(1:10, collapse = " + "),
"=", sum(x))
Outputs from R functions can be passed into other objects e.g.
## [1] "x is less than 0"
## [1] -1.3
## [1] 1.3
Here we pass a variable y
to our function, and because R creates a local variable inside the function, the original value of y
remains unchanged once the function has been run. The absolute value is calculated and then passed into a new variable absy
. R also allows variables to be overwritten, hence the following code overwrites y
with its absolute value.
## [1] "x is less than 0"
## [1] 1.3
If you wish to exit an R function earlier than the final line, perhaps due to some conditional statement, then the return()
function can be used to do this e.g.
## [1] 2
## [1] 17
whoknows
function defined in the above piece of code doing?
The stop()
function is useful as a way of stopping the code if some condition is breached. It takes a character argument that it prints to the screen if it is run.
whoknows <- function(x) {
if(length(x) > 1) {
stop("x argument needs to be of length 1")
}
for(i in 2:x) {
if(x %% i == 0) {
return(i)
}
}
return(NULL)
}
whoknows(10)
## [1] 2
## Error in whoknows(c(10, 17)): x argument needs to be of length 1
There is even a useful stopifnot()
function, which takes a conditional statement as an argument and stops the code if the statement is not true. This is particularly useful for checking inputs. Notice that our functions have been quite sloppy so far, and really we should check that the variables that we are entering into the function are of the correct form. A better piece of code is:
whoknows <- function(x) {
## check `x` variable is present
stopifnot(!missing(x))
## check `x` is numeric
stopifnot(is.numeric(x))
## check `x` is of length one
stopifnot(length(x) == 1)
## check `x` is a whole number
stopifnot(x %% 1 == 0)
## check `x` is larger than 1
stopifnot(x >= 2)
## print out the first number > 1 that
## divides x to leave a whole number
for(i in 2:x) {
if(x %% i == 0) {
return(i)
}
}
}
## Error in whoknows(): !missing(x) is not TRUE
## [1] 2
## Error in whoknows(2.5): x%%1 == 0 is not TRUE
## Error in whoknows(-1): x >= 2 is not TRUE
## Error in whoknows("bob"): is.numeric(x) is not TRUE
## Error in whoknows(c(10, 17)): length(x) == 1 is not TRUE
Write an R function called checkprime
, that takes a single argument x
, and returns whether x
is a prime number or not. The function should check whether x
is a single positive integer > 2 and return an error if not. [Hint: copy the whoknows
function into your R script file and use this as the basis of your new function.]
Use it to check whether the following numbers are primes:
- 3891
- 7103
- 7919
- 10535
Write an R function called allprimes
, that takes a single argument x
, and returns a vector containing all prime numbers from 1
to x
. The function should check whether x
is a single positive integer > 2 and return an error if not.
4.4 Using apply()
functions
There are several ubiquituous functions that you’ll see used often in R code, I’ll introduce four main ones (though there are others): apply()
, lapply()
, sapply()
and tapply()
.
Their sole raison d’être (as far as I can tell) is to make code more concise. Essentially they allow you to replace the use of a loop with a single line of code that aids legibility. They do not (as far as I’m aware) make the code faster (for that you can see various functions in packages like plyr
).
For ultimate speed, one always wants to try to use vectorised operations in R, but sometimes looping over various margins of an object can be useful, and these functions can help.
4.4.1 apply()
The array()
function applies a function to either rows or columns of a matrix (in fact it applies to any dimension of an array
, but we will restrict ourselves here to matrix
types only). For example, consider that we have a matrix x
as below:
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
Now we might want to calculate the mean say, of each row, in which case we could write a loop:
## set output vector
xmn <- numeric(nrow(x))
## calculate mean for each row
for(i in 1:nrow(x)) xmn[i] <- mean(x[i, ])
## print means
xmn
## [1] 5.5 6.5 7.5
Note that here we had to set up an output vector (xmn
) and then populate it with values. We used vectorised functions (mean()
) to calculate the means for each row, but we required a loop to extract each row as a vector
(the i
th row is x[i, ]
). We could do this in a single line using apply()
:
## [1] 5.5 6.5 7.5
Here the first argument to apply()
is the matrix, the second argument determines the margin (1
for rows and 2
for columns), and the third argument determines the function you want to apply to each row/column (mean()
), and must take a vector of values as its first argument. To calculate the means of each column of x
we can write:
## [1] 2 5 8 11
We can also use user-defined functions, for example to divide each row by its sum, we could use:
## [,1] [,2] [,3]
## [1,] 0.04545455 0.07692308 0.1
## [2,] 0.18181818 0.19230769 0.2
## [3,] 0.31818182 0.30769231 0.3
## [4,] 0.45454545 0.42307692 0.4
Here each row of x
is passed into our function as a local object called vec
, the function then divides each element of vec
by its sum and returns the resulting vector. Note that apply()
returns an output matrix with dimensions ncol(x)
\(\times\) nrow(x)
here, so we might want to transpose if we want to return in the original dimensions:
## [,1] [,2] [,3] [,4]
## [1,] 0.04545455 0.1818182 0.3181818 0.4545455
## [2,] 0.07692308 0.1923077 0.3076923 0.4230769
## [3,] 0.10000000 0.2000000 0.3000000 0.4000000
Other arguments can be passed to our user-defined function as extra arguments to apply, as long as the first argument to our user-defined function corresponds to each row/column e.g. to divide each line by a variable n
we could:
## [,1] [,2] [,3] [,4]
## [1,] 0.1 0.4 0.7 1.0
## [2,] 0.2 0.5 0.8 1.1
## [3,] 0.3 0.6 0.9 1.2
## [,1] [,2] [,3] [,4]
## [1,] 0.5 2.0 3.5 5.0
## [2,] 1.0 2.5 4.0 5.5
## [3,] 1.5 3.0 4.5 6.0
We can also save our user-defined function as an object and pass it into apply()
e.g.
## [,1] [,2] [,3] [,4]
## [1,] 0.1 0.4 0.7 1.0
## [2,] 0.2 0.5 0.8 1.1
## [3,] 0.3 0.6 0.9 1.2
## [,1] [,2] [,3] [,4]
## [1,] 0.5 2.0 3.5 5.0
## [2,] 1.0 2.5 4.0 5.5
## [3,] 1.5 3.0 4.5 6.0
apply()
to return the range of each column of the matrix x
defined above. Return as a 2 \(\times\) 4 matrix.
4.4.2 lapply()
and sapply()
These functions essentially do the same thing but on lists, so each element of a list is passed to a corresponding function e.g.
## [[1]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
To get the dimensions of each element of the list we could use something like:
## [[1]]
## [1] 2 2
##
## [[2]]
## [1] 3 3
We can see that lapply()
returns a list object with the dimensions. The function sapply()
is a user-friendly version of lapply()
that always returns a simpler object than a list if possible e.g.
## [,1] [,2]
## [1,] 2 3
## [2,] 2 3
Note: Again, be careful of the format returned by these functions. In this case
sapply()
returns a 2 \(\times\) 2 matrix where the dimensions are returned down the columns. If you are in doubt, check with a trivial example like this one.
4.4.3 tapply()
A very useful function is tapply()
. This takes three arguments: the first is a vector of values, the second is a factor
vector of the same length, and the third is a function. tapply()
groups the first vector according to the levels of the second vector (the factor
), and then applies the function to these subgroups. For example, R has a datasets
package that provides a series of data sets—here we will look at one called chickwts
.
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
## 6 168 horsebean
## weight feed
## Min. :108.0 casein :12
## 1st Qu.:204.5 horsebean:10
## Median :258.0 linseed :12
## Mean :261.3 meatmeal :11
## 3rd Qu.:323.5 soybean :14
## Max. :423.0 sunflower:12
Here we have loaded the chickwts
data set, that records chick weights in an experiment, where newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. Their weights in grams after six weeks are given along with feed types. So if we wanted to measure the mean weight at six weeks for chicks fed different diets, we could use:
## casein horsebean linseed meatmeal soybean sunflower
## 323.5833 160.2000 218.7500 276.9091 246.4286 328.9167
datasets
package provides data on from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions (stored as the PlantGrowth
data set). Summarise this data set and produce an estimate of the mean yield for each experimental group. Generate means and standard deviations for each experimental group.