4 Programming in R

The popularity of R is in part due to its fantastic statistical packages and extremely flexible graphical capabilities. It is often easy to forget that it is a fully functional programming language in its own right. In this part of the practical we cover how to implement some standard programming practices, such as for and while loops, conditional if statements and so on.

As previously stated, R is a high-level language, and as such does not require the same level of code complexity to work. For example, it is not necessary to differentiate between integer, float and double types for numerical arrays. It is also not necessary to declare objects in advance of using them—you can create objects ‘on-the-fly’, and even change an object type without any memory reallocation. In addition, R has many packages which provide functions to implement a huge number of different algorithms or analyses etc. The cost of this functionality is speed. For example, a for loop written in native R code will generally be much slower than an equivalent loop written in C. For simple problems this difference is often practically negligible, and is often counter-balanced against the fact that R code is usually significantly faster to write. For complex, highly computationally demanding algorithms, this difference can be highly significant. Many R functions get around this by virtue of being wirtten in C (or Fortran), and simply providing a user-friendly interface to a fast C executable. We have already seen examples of this for functions such as mean() and sum().

Often, native R code is very useful to know, and so the following sections inroduce some (hopefully) familiar concepts in the context of R.

4.1 Loops

R has implementations of both for and while loops. For example,

4.1.1 for loops

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

This command creates an object i, which takes the values in the vector 1:10 in turn, and prints each to the screen. Notice that we wrote this entire loop on a single line. The curly brackets {} delineate the contents of the loop. A more complex example is

## [1] 0.5455448
## [1] 0.4690436
## [1] 0.5321327
## [1] 0.6748452
## [1] 0.5085469
## [1] 0.5547003
## [1] 0.4563259
## [1] 0.6409085
## [1] 0.462969
## [1] 0.4955541

Here I’ve included comments in the loop itself. Of course it is also possible to use nested functions to write this more concisely.

Task
Write the above for loop using nested functions. You should be able to fit the content of the loop on a single line. Why does your code print different values to before?

Note: for single line pieces of code, it is not always necessary to use curly brackets {} to delineate the loop. However, it is a good habit to get into, and saves lots of simple errors. Hence, in this workshop I will insist you always delineate loops and if/else statments using curly brackets. Please let me know if I’ve missed any!

Note: the use of indentation to delineate what is inside of the for loop. This is not necessary for R to work, but in my opinion greatly helps the comprehension of the code. Along with comments and spacing, this is a useful habit to get into, and I insist upon it here. Some people prefer to place the leading bracket on a different line to the for statement e.g. something like:

for(i in 1:10) 
{
    x <- runif(10)
    print(x)
}

I would support either approach above, but something like

for(i in 1:10)
{
x <- runif(10)
print(x)
}

is bad programming practice (in my opinion).

For those of you who are used to Python, please note that the indentations are not necessary in R, however for multi-line environments you must enclose all of the environment in curly brackets!

Loops can also be nested:

## [1] 1 1
## [1] 1 2
## [1] 1 3
## [1] 2 1
## [1] 2 2
## [1] 2 3
## [1] 3 1
## [1] 3 2
## [1] 3 3
Task
Write a piece of R code, using nested for loops, that transposes a 3 \(\times\) 3 matrix. Test it on the matrix x <- matrix(1:9, 3, 3)

Note: for loops in R can loop over any type of vector, not just sequential indices. For example,

## [1] "a" "b" "c" "d" "e"
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"

4.1.2 while loops

R’s while statement works in a similar way. The loop continues until some logical statement is met. For example,

## [1] 1
## [1] 2
## [1] 3
## [1] 4

Note: R doesn’t have a ++ operator like C. In addition notice that the i variable had to be initialised in advance of the loop, so that the while statement could be evaluated. The i variable is then incremented during the loop.

Warning: a common coding error is to create infinite loops. You must take care to ensure that the loop will eventually end. For example, the following code will produce an infinite loop. Why?

i <- 1
while(i < 5) { 
   print(i) 
 }
Task
Write a while loop that generates a uniform random number between 0 and 1 (use the runif() function), until a number of > 0.9 is obtained.

4.2 Conditional statements

R also supports conditional statements, such as if and else. These help the code to change its path depending on whether a condition is true or false. For example:

## [1] "y is < 0.5"

Note: that the else statement starts on the same line as the closing bracket } from the preceding if statement.

Conditional statements can also be nested.

Task
Generate two uniform random numbers between 0 and 1, called x and y. Write a nested if/else statement that prints out whether x is > 0.5, and then whether y is > 0.5.

R also supports an ifelse function, that can be applied element-wise to vectors and matrices. This is useful if you want to return an object the same size as the one you are testing. For example,

## [1] 0.5009349 0.8786042 0.9173392 0.5871057 0.5337919 0.2404480
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

4.3 User defined functions

R also allows you define your own functions, that take a set of arguments and return some output. For example,

This piece of code is broken down as follows:

  • The function(x) part defines a function object that takes an argument x. Whatever variables are defined here are local variables, i.e. they only exist within the function, and are destroyed once the function exits.
  • The first set of curly brackets then denote which lines of code are part of the function.
  • The function then takes the local variable x, and using a series of nested if/else loops, prints out whether the number is greater than, less than or equal to zero.
  • When the code hits the final curly bracket, the function exits.
  • The function is stored (using the assignment operator <-) into an object called sign_num.

We can then use this function by simply passing values to it:

## [1] "x is greater than 0"
## [1] "x is less than 0"
## [1] "x is equal to zero"

Note: By default, an R function returns the final line as its output. In the example above, no object was present on the final line, so R returned nothing. A useful habit to get into is to use explicit return() statements to denote what object is returned from a function.

The following code returns the absolute value of x, as well as printing information on whether the original x was greater than, less than or equal to zero:

## [1] "x is less than 0"
## [1] 1

Aside: A really useful function in R is called paste(). This allows us to concatenate objects into strings, which is particularly useful for printing to the screen. For example:

## [1] "The variable x is 10"

We can see that this has taken the value of x (which is 10), and created a character vector of length 1 containing the concatenated string "The variable x is 10". Notice by default that paste() inserts a space between the arguments it’s concatenating, this can be suppressed using the sep argument, or by using the paste0() function e.g. note the difference between the first function and the latter two (which are equivalent) below:

## [1] "x = 1"
## [1] "x =1"
## [1] "x =1"
Task

Try the following pieces of code. What do they do?

  1. paste("Here is a number:", 1:10).
  2. paste(1:10, collapse = " + ").
  3. paste("Here are some numbers:", paste(1:10, collapse = ", ")).
  4. x <- 1:10; paste("The sum of the first", length(x), "numbers is given by", paste(1:10, collapse = " + "), "=", sum(x))

Outputs from R functions can be passed into other objects e.g.

## [1] "x is less than 0"
## [1] -1.3
## [1] 1.3

Here we pass a variable y to our function, and because R creates a local variable inside the function, the original value of y remains unchanged once the function has been run. The absolute value is calculated and then passed into a new variable absy. R also allows variables to be overwritten, hence the following code overwrites y with its absolute value.

## [1] "x is less than 0"
## [1] 1.3

If you wish to exit an R function earlier than the final line, perhaps due to some conditional statement, then the return() function can be used to do this e.g.

## [1] 2
## [1] 17
Task
What is the whoknows function defined in the above piece of code doing?

The stop() function is useful as a way of stopping the code if some condition is breached. It takes a character argument that it prints to the screen if it is run.

## [1] 2
## Error in whoknows(c(10, 17)): x argument needs to be of length 1

There is even a useful stopifnot() function, which takes a conditional statement as an argument and stops the code if the statement is not true. This is particularly useful for checking inputs. Notice that our functions have been quite sloppy so far, and really we should check that the variables that we are entering into the function are of the correct form. A better piece of code is:

## Error in whoknows(): !missing(x) is not TRUE
## [1] 2
## Error in whoknows(2.5): x%%1 == 0 is not TRUE
## Error in whoknows(-1): x >= 2 is not TRUE
## Error in whoknows("bob"): is.numeric(x) is not TRUE
## Error in whoknows(c(10, 17)): length(x) == 1 is not TRUE
Task

Write an R function called checkprime, that takes a single argument x, and returns whether x is a prime number or not. The function should check whether x is a single positive integer > 2 and return an error if not. [Hint: copy the whoknows function into your R script file and use this as the basis of your new function.]

Use it to check whether the following numbers are primes:

  1. 3891
  2. 7103
  3. 7919
  4. 10535
Task

Write an R function called allprimes, that takes a single argument x, and returns a vector containing all prime numbers from 1 to x. The function should check whether x is a single positive integer > 2 and return an error if not.

Use it to return a vector containing all the prime numbers up to 1,000.

4.4 Using apply() functions

There are several ubiquituous functions that you’ll see used often in R code, I’ll introduce four main ones (though there are others): apply(), lapply(), sapply() and tapply().

Their sole raison d’être (as far as I can tell) is to make code more concise. Essentially they allow you to replace the use of a loop with a single line of code that aids legibility. They do not (as far as I’m aware) make the code faster (for that you can see various functions in packages like plyr).

For ultimate speed, one always wants to try to use vectorised operations in R, but sometimes looping over various margins of an object can be useful, and these functions can help.

4.4.1 apply()

The array() function applies a function to either rows or columns of a matrix (in fact it applies to any dimension of an array, but we will restrict ourselves here to matrix types only). For example, consider that we have a matrix x as below:

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Now we might want to calculate the mean say, of each row, in which case we could write a loop:

## [1] 5.5 6.5 7.5

Note that here we had to set up an output vector (xmn) and then populate it with values. We used vectorised functions (mean()) to calculate the means for each row, but we required a loop to extract each row as a vector (the ith row is x[i, ]). We could do this in a single line using apply():

## [1] 5.5 6.5 7.5

Here the first argument to apply() is the matrix, the second argument determines the margin (1 for rows and 2 for columns), and the third argument determines the function you want to apply to each row/column (mean()), and must take a vector of values as its first argument. To calculate the means of each column of x we can write:

## [1]  2  5  8 11

We can also use user-defined functions, for example to divide each row by its sum, we could use:

##            [,1]       [,2] [,3]
## [1,] 0.04545455 0.07692308  0.1
## [2,] 0.18181818 0.19230769  0.2
## [3,] 0.31818182 0.30769231  0.3
## [4,] 0.45454545 0.42307692  0.4

Here each row of x is passed into our function as a local object called vec, the function then divides each element of vec by its sum and returns the resulting vector. Note that apply() returns an output matrix with dimensions ncol(x) \(\times\) nrow(x) here, so we might want to transpose if we want to return in the original dimensions:

##            [,1]      [,2]      [,3]      [,4]
## [1,] 0.04545455 0.1818182 0.3181818 0.4545455
## [2,] 0.07692308 0.1923077 0.3076923 0.4230769
## [3,] 0.10000000 0.2000000 0.3000000 0.4000000

Other arguments can be passed to our user-defined function as extra arguments to apply, as long as the first argument to our user-defined function corresponds to each row/column e.g. to divide each line by a variable n we could:

##      [,1] [,2] [,3] [,4]
## [1,]  0.1  0.4  0.7  1.0
## [2,]  0.2  0.5  0.8  1.1
## [3,]  0.3  0.6  0.9  1.2
##      [,1] [,2] [,3] [,4]
## [1,]  0.5  2.0  3.5  5.0
## [2,]  1.0  2.5  4.0  5.5
## [3,]  1.5  3.0  4.5  6.0

We can also save our user-defined function as an object and pass it into apply() e.g.

##      [,1] [,2] [,3] [,4]
## [1,]  0.1  0.4  0.7  1.0
## [2,]  0.2  0.5  0.8  1.1
## [3,]  0.3  0.6  0.9  1.2
##      [,1] [,2] [,3] [,4]
## [1,]  0.5  2.0  3.5  5.0
## [2,]  1.0  2.5  4.0  5.5
## [3,]  1.5  3.0  4.5  6.0
Task
Use apply() to return the range of each column of the matrix x defined above. Return as a 2 \(\times\) 4 matrix.

4.4.2 lapply() and sapply()

These functions essentially do the same thing but on lists, so each element of a list is passed to a corresponding function e.g.

## [[1]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

To get the dimensions of each element of the list we could use something like:

## [[1]]
## [1] 2 2
## 
## [[2]]
## [1] 3 3

We can see that lapply() returns a list object with the dimensions. The function sapply() is a user-friendly version of lapply() that always returns a simpler object than a list if possible e.g.

##      [,1] [,2]
## [1,]    2    3
## [2,]    2    3

Note: Again, be careful of the format returned by these functions. In this case sapply() returns a 2 \(\times\) 2 matrix where the dimensions are returned down the columns. If you are in doubt, check with a trivial example like this one.

4.4.3 tapply()

A very useful function is tapply(). This takes three arguments: the first is a vector of values, the second is a factor vector of the same length, and the third is a function. tapply() groups the first vector according to the levels of the second vector (the factor), and then applies the function to these subgroups. For example, R has a datasets package that provides a series of data sets—here we will look at one called chickwts.

##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
##      weight             feed   
##  Min.   :108.0   casein   :12  
##  1st Qu.:204.5   horsebean:10  
##  Median :258.0   linseed  :12  
##  Mean   :261.3   meatmeal :11  
##  3rd Qu.:323.5   soybean  :14  
##  Max.   :423.0   sunflower:12

Here we have loaded the chickwts data set, that records chick weights in an experiment, where newly hatched chicks were randomly allocated into six groups, and each group was given a different feed supplement. Their weights in grams after six weeks are given along with feed types. So if we wanted to measure the mean weight at six weeks for chicks fed different diets, we could use:

##    casein horsebean   linseed  meatmeal   soybean sunflower 
##  323.5833  160.2000  218.7500  276.9091  246.4286  328.9167
Task
The datasets package provides data on from an experiment to compare yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions (stored as the PlantGrowth data set). Summarise this data set and produce an estimate of the mean yield for each experimental group. Generate means and standard deviations for each experimental group.