Book Review: Social Media Mining With R

Social Media Mining with R by Nathan Danneman and Richard Heimann

[Full disclosure – I was given a free review copy of the book from the publisher. This review refers to the ebook version]

I worry when a book proposes that it will appeal to everyone. The trouble is that it could well end up appealing to no-one. The introduction to this new book on social media mining in R says that it will be suitable for skilled programmers with little social science background, people who lack any sort of programming background, people wanting an introduction to social data mining and advanced social science researchers. That doesn’t leave many people out!

It turns out that this book doesn’t really cover much social media mining. It does cover basic sentiment analysis and text-mining with some references to twitter data, but other social media platforms are barely mentioned and there is no discussion of any of the other social media APIs (e.g. Facebook, LinkedIn and GitHub). The authors also miss an opportunity to discuss other analysis techniques, such as social network analysis and network graphs, for which twitter (and other social media) data would be ideal.

In the first third of the book, much is made of big data and how researchers are going to have to prepare for it, but the actual content for dealing with big data (or at least data that would struggle to fit in the RAM of an average laptop) amounts to little more than “check that you have enough memory to read in your data and read the manual for the read.table function”. R has a reputation for not coping well with large datasets, but there are packages that really help with this (such as data.table and its fread function, sqldf and dplyr) that could easily have been discussed, along with any number of tips for speeding up I/O using base R functions.
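
For instance, a sketch of the kind of comparison that would have been worth a page or two (the file path and column types here are hypothetical, and timings will vary by machine):

library(data.table)

# Base R: read.table() is workable for large files if you help it along
# by declaring the column types and switching off comment parsing
df <- read.table("data/big_file.csv", header = TRUE, sep = ",",
                 colClasses = c("integer", "numeric", "character"),
                 comment.char = "")

# data.table::fread() works out the column types itself and is
# typically many times faster on files of this size
dt <- fread("data/big_file.csv")

# sqldf can filter a file down to a subset without loading all of it into R
# library(sqldf)
# sub <- read.csv.sql("data/big_file.csv",
#                     sql = "select * from file where x > 100")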

The introductory and background material is interesting and provides a good high-level introduction to sentiment analysis, though it still feels somewhat rushed (for example, why leave out semi-supervised approaches entirely? It would have been nice to have at least a page or two describing what they are and why they might be useful) and there is little detail on the assumptions of the different models. The introduction would make a good springboard for deeper research, however, and the further reading and bibliography sections are very good.

I like there to be code on nearly every page of a programming book, but this one is very light on R code examples. There is a chapter on setting up R and installing packages (this is not going to be anyone’s first R book, so the chapter feels like a waste, particularly in a book that only clocks in at 120 pages), another covering getting tweets via the twitteR package (but not the streaming API for collecting tweets over a prolonged period, nor any way of storing tweets in a database), and then nothing until two-thirds of the way through, where there are three short case studies of different methods of sentiment analysis. It also seems strange to separate the theory in the earlier chapters from the practical applications later on.

The case studies are mostly decent, covering lexicon-based, naive Bayes and the authors’ own unsupervised Item Response Theory sentiment analysis methods. It seems strange to me that the first example in a book on social media mining doesn’t even use social media data, but US government reports, which even the authors admit bear little resemblance to twitter data. Some of the code is poorly explained (or not explained at all), idiosyncratic (there may be reasons for using scan to read in a text file, but what are they?) and makes use of out-of-date packages (Hadley Wickham shipped reshape2 back in 2012, so why still use reshape?). The authors miss several opportunities to flesh out the book: one minute they say how important data cleaning is and how useful regular expressions are, and then they just point you in the direction of a website teaching regular expressions. The same is true for the plotting package ggplot2 (is Wilkinson’s The Grammar of Graphics really an appropriate introductory text for this package?) and the RWeka machine learning package. A lot of the information is good (in particular the description of the tm text-mining package functions) but it all feels a little lightweight, like a collection of blog posts. The case study analyses were interesting, though choosing such obviously polarised subjects as abortion and Obamacare produced much neater, cleaner results than would have been found with subjects where opinions are less black-and-white.
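
To give a flavour of the simplest of the three approaches, here is a minimal lexicon-based scoring sketch of my own (the word lists and example sentences are made up for illustration and are not the book’s code):

# Toy lexicon-based sentiment scoring: positive matches minus negative matches
pos.words <- c("good", "great", "love", "excellent")   # illustrative lexicon
neg.words <- c("bad", "awful", "hate", "terrible")

score.sentiment <- function(text) {
  words <- tolower(unlist(strsplit(text, "[^a-zA-Z']+")))
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

score.sentiment("I love this, it is great")   # returns 2
score.sentiment("An awful, terrible idea")    # returns -2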

Sentiment analysis seems to be everywhere at the moment. It reminds me of word clouds a few years ago: everyone starting out in data science was using them, but as they became ubiquitous they started to look a little embarrassing, and were even famously described as the “mullets of the internet”. Make of it what you will that word clouds are used several times as an analysis method in this book.

Users who have maybe done a Coursera data science course or two and want to try their hand at some simple analyses to boost their portfolio will probably find this book useful but more serious users are likely to be left disappointed.

Using Twitter to Keep Up to Date With the Medical Literature

The medical literature is huge and growing at an explosive rate. According to a 2010 study, there were 75 clinical trials and 11 systematic reviews being published every day. In a couple of quick Pubmed searches I found 17571 papers published with “clinical trials” and 7974 with “primary care” in the title or abstract in 2013 alone. With these kinds of numbers, the prospect of keeping on top of the new publications in even a relatively small subset of this literature seems daunting to say the least. Even two weeks of annual leave would mean you have potentially hundreds of papers to plough through! Wouldn’t it be good to have some kind of real-time feed of papers in your research area that you could check on a regular basis and quickly scan for anything that is directly relevant or interesting?

It turns out that twitter is an excellent tool for doing just that. Without any specialist knowledge or programming, you can quickly set up a ‘twitterbot’ that gives links to the results of any Pubmed search along with the article titles. You could get Pubmed to send digest emails with the same information, but these soon clutter up your inbox and are difficult to archive properly and efficiently. A twitter feed is easily scanned through and doesn’t feel intrusive or unwieldy.

I set up a twitterbot for articles in my field (electronic medical record research) in February. Since then, @EMR_research has already indexed over 230 papers. Going through all of these titles at once would have been a pretty tedious affair, but now I can just check the feed every day or so over a coffee and very quickly determine which ones are worth reading or bookmarking. Casey Bergman, a researcher in the Faculty of Life Sciences at the University of Manchester, who set up one of the earliest academic twitterbots (for Drosophila research), puts it like this:

It is relatively easy to keep up on a daily basis with what is being published, but it’s virtually impossible to catch up when you let the river of information flow for too long.

One of the further benefits of hosting academic search feeds on twitter is that they are available for others to see too. So one person can set up a feed and their whole research group has access to it. In addition, a feed can become a resource hub for researchers and other interested parties in that area. Since I set up @EMR_research in February, the page has had follows, retweets and favourites from epidemiologists, clinicians, statisticians, medication information services, patient interest groups, healthcare startups and research councils.

According to this list, there are already over 30 twitterbots indexing scientific literature across a range of disciplines, from neuroscience, cell biology and parasitology to evolutionary ecology. Having access to a small network of feeds across several closely related disciplines to cross-reference would be immensely helpful for keeping up with the literature and forming the basis of systematic reviews. Health researchers often find themselves hopping between projects in different research areas – one day diabetes, the next cardiovascular disease, the next mental health – and having an easily accessible feed of the most recent literature for each would be invaluable.

Setting up a twitter feed for publications in your research area is easy. You just need a twitter account and a free account with dlvr.it, a content sharing service. See these two posts for detailed instructions on how to set up your own feed.

It’s Not Just for Stats: Web Mining in R

These are the slides from a talk I gave to the Manchester R user group on the 6th of May, 2014.

I discussed what web mining (or web scraping) is and the advantages of using R to do it. I then provided a toolkit of R functions for web mining before giving two examples of putting it to practical use:

  1. Downloading multiple files from a website
  2. A mash-up of data from a number of sources to find out the best places to apply for a new job in academia!

Survival Analysis in R

I recently gave a talk at the Manchester University Faculty of Life Sciences R user group on survival analysis in R. Survival analysis is hugely important for modelling longitudinal data such as cohort studies. However, people still find it tricky to understand, and even a recent paper I reviewed used logistic regression when it should have used survival analysis. Beyond epidemiology and health research, survival analysis is a really useful addition to ecologists’ and biologists’ toolkits whenever they want to model the time taken for some event to occur. For example:

  • Germination timing
  • Arrival of a migrant or parasite
  • Seed or offspring dispersal
  • Response to some stimulus

In the presentation, I discussed censoring (dealing with missing data of various kinds) and the survival function before introducing the Cox proportional hazards model. This is the most commonly used survival model because it is semi-parametric: you can model the multiplicative effects of your covariates of interest (e.g. drug therapy on survival) without having to explicitly model the shape of the survival curve itself.

I then gave a run-through of a typical survival analysis in R before providing some links for further reading.
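
The slides have the full worked example, but a minimal sketch of that kind of analysis, using the survival package and its built-in lung dataset (not the data from the talk), looks something like this:

library(survival)

# Kaplan-Meier estimate of the survival function, stratified by sex
km <- survfit(Surv(time, status) ~ sex, data = lung)
plot(km, xlab = "Days", ylab = "Survival probability")

# Cox proportional hazards model: multiplicative effects of covariates
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(cox)    # hazard ratios are exp(coef)

# Check the proportional hazards assumption
cox.zph(cox)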


Care.data Could Help Avoid Another Pharma Scandal

This is a post I wrote for The Conversation on the problems with the new UK Care.data electronic medical record database. The post was later taken up by BBC news and I appeared on the BBC1 Breakfast show for a televised debate with the GP Gordon Gantz on the benefits of a national medical record database.


Book Review: Machine Learning With R

Machine Learning with R by Brett Lantz

[Full disclosure – I was given a free review copy of the book from the publisher. This review refers to the ebook version]

This is the most recent of a group of books that try to explore machine learning from a programming, rather than purely mathematical, perspective. The book is highly successful in this respect and deserves a place on the bookshelf of any data scientist, Kaggler or statistician.

The book takes a slightly different tack from previous ones in this field (see ‘Programming Collective Intelligence’ and ‘Machine Learning for Hackers’) in that it concentrates largely on the packages themselves and how to use them to solve real-world ML problems, rather than on coding up simple algorithms from scratch and running them on toy datasets. Perhaps this means the book doesn’t provide as much insight into algorithm design, but it does make it much more practically useful, particularly since it spends a good chunk of each chapter explaining each algorithm in simple, plain English.

The book is well laid out and written. Despite a slightly shaky start (do we really still think of ML in terms of Skynet, the Matrix and Hal?), the introduction is excellent and gives a pleasing summary of the philosophical and ethical issues surrounding machine learning and big data. Next, there is a thoughtful introduction to data management and exploratory data analysis that highlights important and often missed tips on things like getting data out of SQL databases. It introduces some basic R functions and concepts (some I had managed to miss up until now) without feeling like a tacked on ‘R for beginners’ chapter.

In the guts of the book, each chapter focuses on a group of related algorithms (kNN, naive Bayes, decision trees, regression, neural nets and SVMs, association rules, clustering) and has a good introduction to the algorithm in question, followed by sections on finding and cleaning data, implementing the algorithm on the data, and evaluating and improving model performance. There are clear, easy-to-understand tables and descriptions of the important distinctions between the algorithms and the reasons for choosing one over another. The datasets the author has chosen are large and interesting enough to illustrate the points being made well without being frustratingly unwieldy, and many of them are ‘classic’ machine learning datasets from places such as the UCI Machine Learning Repository.
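
As an illustration of that cycle (my own sketch, not code from the book), here is roughly the shortest version of it I can manage, using the e1071 package’s naive Bayes implementation and the built-in iris data:

library(e1071)

# Split the data into training and test sets
set.seed(42)
train.rows <- sample(nrow(iris), 100)
train <- iris[train.rows, ]
test  <- iris[-train.rows, ]

# Train a naive Bayes classifier and evaluate it on the held-out data
model <- naiveBayes(Species ~ ., data = train)
preds <- predict(model, test)
table(predicted = preds, actual = test$Species)   # confusion matrix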

Next, the book looks more deeply at evaluating and improving model performance and discusses important ensemble and meta-learning techniques like bagging, boosting and random forests. This section will be of particular interest to people wanting to enter Kaggle or other data science competitions, because it shows how to milk as much performance as possible from the basic algorithms described earlier in the book.

The final section discusses getting the algorithms to run on big datasets and improving the performance of R itself using tools like the data.table and ff packages and parallel processing. This is the only section of the book that feels slightly rushed, and many of these topics are discussed only briefly before linking to the relevant package documentation. This is only a small criticism though, since coding up these kinds of systems depends strongly on the data you have, and these are difficult subjects to cover whilst retaining generality.

Obviously, the book cannot cover everything. It is decidedly light on graphs and has almost nothing on visualisation techniques and packages like ggplot2 which have become almost mandatory for doing data science today. Also, if you are new to R, you really want to get one of the excellent introductory books first and if you are new to ML, you probably want to spend a while learning some basic stats as well. Finally, this book doesn’t pretend to be a deep text about the mathematics of the algorithms it covers. For that you will need to go for something like Bishop’s classic ‘Pattern Recognition and Machine Learning’ and be prepared to put in some serious effort!

In short, if you are looking for a practical guide to implementing ML algorithms on real data and if you are more comfortable thinking in R code than in mathematical equations, this is the book for you and is the best that I have seen on the subject so far.


Develop in RStudio, Run in RScript

I have been using RStudio Server for a few months now and am finding it a great tool for R development. The web interface is superb and behaves in almost exactly the same way as the desktop version. However, I do have one gripe which has forced me to change my working practices slightly: it is really difficult to crash out of a frozen process. Whereas in console R I could just hit Control-D to get out and back to Bash, in RStudio, while you can use the escape key to terminate an operation, if you have a big process going on everything just freezes and you can’t do anything. One way to deal with this is to kill the rstudio process in another terminal, but this kills the whole session, including the editor, and you will lose anything you haven’t saved in your scripts. The problem is exacerbated when I am trying to run parallel processes using the multicore package, because it takes an age to kill all of the extra forks first.

So now I use RStudio for development and testing, and run my final analysis scripts directly using Rscript. I have this snippet at the start of my scripts…

require(multicore)
# Take the number of cores from the first trailing command-line argument;
# default to 1 when none is supplied (e.g. when running inside RStudio)
args <- commandArgs(trailingOnly = TRUE)
cores <- if (length(args)) as.integer(args[1]) else 1L
cat(sprintf("Multicore functions running on maximum %d cores\n", cores))
## Multicore functions running on maximum 1 cores

… so when I am testing in RStudio, cores is set to 1, but when I run as an Rscript, I can specify how many cores I want to use. I then just add mc.cores = cores to all of my mclapply calls like this:

# Example processor hungry multicore operation
mats <- mclapply(1:500,
                 function(x) matrix(rnorm(x*x),
                                    ncol = x) %*% matrix(rnorm(x*x),
                                                         ncol = x),
                 mc.cores = cores)

The advantage of this is that, when mc.cores is set to 1, mclapply just calls lapply, which is easier to crash out of (try running the above code with cores set to more than 1 to see what I mean – it will hang for a lot longer) and produces more useful error messages:

# Error handling with typos:
# mclapply with 2 cores:
mats <- mclapply(1:500,
                 function(x) matrix(rnor(x*x),
                                    ncol = x) %*% matrix(rnorm(x*x),
                                                         ncol = x),
                 mc.cores = 2)
## Warning: all scheduled cores encountered errors in user code
# Falls back to lapply with 1 core:
mats <- mclapply(1:500,
                 function(x) matrix(rnor(x*x),
                                    ncol = x) %*% matrix(rnorm(x*x),
                                                         ncol = x),
                 mc.cores = 1)
## Error: could not find function "rnor"

Running final analysis scripts through Rscript in this way has these advantages:

  • You can much more easily crash out of them if there is a problem.
  • You can run several scripts in parallel without taking up console space
  • You can easily redirect the std error and std out from your program to a log file to keep an eye on its progress

Running multicore R scripts in the background with automatic logging

If you have a bunch of R scripts that might each take hours to run, you don’t want to clog up your RStudio console with them. This is a useful command to effectively run a big R analysis script in the background via Rscript. It should work equally well for Linux and Mac:

$ Rscript --vanilla R/myscript.R 12 &> logs/myscript.log &

Rscript runs the .R file as a standalone script, without going into the interactive R environment. The --vanilla flag means that the script runs without calling in your .Rprofile (which is typically set up for interactive use) and without prompting you to save a workspace image. I always run Rscript from the same directory as the one set as the project root in RStudio, to avoid problems with relative paths being set up wrongly. The number after the .R file is the number of cores you want to run the parallel stuff on; other arguments you want to pass to R would also go here. The &> operator redirects both stdout and stderr to the file logs/myscript.log (I always set up a logs directory in my R projects for this purpose). The & at the end runs the process in the background, so you get your Bash prompt back while it is running. Then, if you want to check the progress of your script, just type:

$ tail -f logs/myscript.log

And you can watch the output of your script in real time. Hit Control-C to get back to your command prompt. You can also use the top command to keep an eye on your processor usage.

If you want to kill your script, either find the PID number associated with your process in top and do kill PIDNUMBER or, if you are lazy/carefree, type killall R to kill any running R processes. This will not affect your RStudio instance.


Functional Programming in R

This post is based on a talk I gave at the Manchester R User Group on functional programming in R on May 2nd 2013. The original slides can be found here

This post is about functional programming: why it is at the heart of the R language and how it can hopefully help you to write cleaner, faster and more bug-free R programs. I will discuss what functional programming is at a very abstract level, as a means of representing some simplified model of reality on a computer. Then I’ll talk about the elements that functional programming is comprised of and highlight the ones that are most important when programming in R. I will then go through a quick demo of an FP-style generic bootstrap algorithm that resamples linear models and returns bootstrap confidence intervals. I’ll compare this with a non-FP alternative version so you will hopefully see clearly the advantages of using an FP style. To wrap up, I’ll make a few suggestions for places to go if you want to learn more about functional programming in R.

What is Functional programming?

… Well, what is programming? When you write a program you are building an abstract representation of some tiny subset of reality on your computer, whether it is an experiment you have conducted, a model of some financial system or a collection of features of members of a population. There are obviously different ways to represent reality, and the different methods of doing so programmatically can be thought of as the metaphysics of the different styles of programming.

Consider for a moment building a representation of a river on a computer, a model of a river system for example.

A river

In non-functional languages such as C++, Java and (to some extent) Python, the river is an object in itself, a thing that does things to other things and that may have various properties associated with it such as flow rate, depth and pollution levels. These properties may change over time but there is always this constant, the river, which somehow persists over time.

In FP we look at things differently…

Heraclitus - We never step into the same river twice

The presocratic philosopher Heraclitus said “We never step into the same river twice”, recognising that the thing we call a river is not really an object in itself, but something undergoing constant change through a variety of processes. In functional programming we are less concerned with the object of the river itself and more with the processes it undergoes through time. Our river at any point in time is just a collection of values (say, particles and their positions). These values then feed into a process to generate the series of values at the next time point. So we have data flowing through processes of functions, and that collection of data over time is what we call a river, which is really defined by the functions that the data flows through. This is a very different way of looking at things from that of imperative, object-oriented programming.

After this somewhat abstract and philosophical start, I’ll talk about the more practical elements of functional programming (FP). FP has been around for a very long time and originally stems from Lisp, which was first implemented in the 1950s. It is making something of a comeback of late for a variety of reasons, but mostly because it is so good at dealing with concurrent, multicore problems, potentially over many computers. There are several elements that FP is generally considered to be comprised of, and different languages highlight different elements, depending on how strictly functional they are.

Functions are first class citizens of the language

This means that functions can be treated just like any other data type – They can be passed around as arguments to other functions, returned by other functions and stored in lists. This is the really big deal about functional programming and allows for higher-order functions (such as lapply) and closures, which I’ll talk about later. This is the most fundamental functional concept and I’d argue that a language has to have this property in order to be called a functional language, even if it has some of the other elements listed below. For example, Python has anonymous functions and supports declarative programming with list comprehensions and generators, but functions are fundamentally different from data-types such as lists so Python cannot really be described as a functional language in the same way as Scheme or R can be.
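
A quick illustration (not from the slides) of functions being treated as ordinary values:

# Functions stored in a list, just like any other data
summaries <- list(mean = mean, sd = sd, median = median)

# A function that takes another function as its argument and applies it
apply.to.sepals <- function(f) f(iris$Sepal.Length)

sapply(summaries, apply.to.sepals)   # named vector: mean 5.84, sd 0.83, median 5.8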

Functional purity

This is more of an element of good functional program design. Pure functions take arguments, return values and otherwise have no side effects – no I/O or global variable changes. This means that if you call a pure function twice with the same arguments, you know it will return the same value. This means programs are easily tested because you can test different elements in isolation and once you know they work, you can treat them like a black box, knowing that they will not change some other part of your code somewhere else. Some very strictly functional languages, like Haskell, insist on functional purity to the extent that in order to output data or read or write files you are forced to wrap your ‘dirty’ functions in constructs called monads to preserve the purity of your code. R does not insist on functional purity, but it is often good practice to split your code into pure and impure functions. This means you can test your pure code easily and confine your I/O and random numbers etc to a small number of dirty functions.
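
A small illustration of the distinction (my own example):

# Pure: the result depends only on the argument and nothing else is touched
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Impure: same calculation, but with side effects mixed in
scale01.dirty <- function(x) {
  write.csv(x, "scaled_input.csv")              # side effect: file I/O
  assign("last.input", x, envir = .GlobalEnv)   # side effect: global variable
  (x - min(x)) / (max(x) - min(x))
}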

Vectorised functions

Vectorised functions operate equally well on all elements of a vector as they do on a single number. They are very important in R programming to the point that much of the criticism of R as a really slow language can be put down to failing to properly understand vectorisation. This also includes the declarative style of programming, where you tell the language what you want, rather than how you want to get it. This is common in languages like SQL and in Python generators. I’ll discuss this more later.

Anonymous functions

In FP, naming and applying a function are two separate operations; you don’t need to give your functions names in order to call them. So, calling this function:

powfun <- function(x, pow) {
    x^pow
}
powfun(2, 10)
## [1] 1024

to the interpreter is exactly the same as applying the arguments directly to the anonymous function:

(function(x, pow) {
    x^pow
})(2, 10)
## [1] 1024

This is particularly useful when you are building small, single use functions such as those used as arguments in higher order functions such as lapply.
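
For instance, a throwaway squaring function passed straight to sapply:

# A single-use anonymous function as the FUN argument
sapply(1:5, function(x) x^2)
## [1]  1  4  9 16 25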

Immutable data structures

Immutable data structures go hand in hand with pure functions. The idea is that once an object such as a vector or list is created, it should not be changed; you can’t alter your data structures via side effects from other functions. Going back to our river example, doing so would be like going back in time, rearranging some of the molecules and starting again. Having immutable objects means that you can reason more easily about what is going on in your program. Some languages, like Clojure, only have immutable data structures; it is impossible to change a list in place, so you instead pass the list to a function that returns a new list, which you then assign back to the original variable name. R does not insist on immutability, but in general data structures are only changed in this way and not through side effects. It is often best to follow this convention, for the same reasons as it is best to have pure functions.
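
R’s copy-on-modify behaviour makes this concrete (a small example of my own):

x <- 1:5

# The function works on a copy; the original vector is untouched
double.first <- function(v) {
  v[1] <- v[1] * 2
  v
}

double.first(x)
## [1] 2 2 3 4 5
x
## [1] 1 2 3 4 5

# Any 'change' is really a new value assigned back to the old name
x <- double.first(x)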

Recursion

Recursive functions are functions that call themselves. Historically these have been hugely important in FP, to the extent that some languages (for example Scheme) do not even have for loops and define all of their looping constructs via recursion. R does allow recursive functions and they can sometimes be useful, particularly for traversing tree-like data structures, but recursion is not very efficient in R (it is not tail-call optimised) and I will not discuss it further here, though it may well be the subject of a future post.
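
For completeness, the classic toy example (each call adds a frame to the call stack, which is part of why deep recursion is slow in R):

# A recursive factorial: the function calls itself on a smaller problem
fact <- function(n) {
  if (n <= 1) 1 else n * fact(n - 1)
}
fact(10)
## [1] 3628800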

Functional Programming in R

R has a reputation for being an ugly, hacked-together and slow language. I think this is slightly unfair, but in this ever-so-slightly biased account, I am going to blame the parents:

R genealogy

R is the offspring of the languages S and Scheme. S is a statistical language invented in the 1970s, itself based on non-functional, imperative languages like C and Fortran. It is useful in this domain and much of R’s statistical ability stems from it, but it is certainly less than pretty. Scheme is a concise, elegant, functional language in the Lisp family. The designers of R tried to build something with the statistical functionality of S and the elegance of Scheme. Unfortunately, they left in much of the inelegant stuff from S as well, and this mixed parentage means that it is now perfectly possible to write ugly, hacky, slow code in the style of S, just as it is possible to write elegant, efficient, functional code in the style of Scheme. The problem is that functional programming has been far less mainstream, so people tend to code in the way they learned first, resulting in rafts of ugly, hacky R code. Programming R in an elegant, functional way is not more difficult, but it is less immediately intuitive to people who were brought up reading and writing imperative code. I would always recommend that people learning R learn these functional concepts from the outset, because this way you are working with how the language was designed, rather than against it.

To show just how functional a language is at its core, it is first important to recognise that everything in R is a function call, even if it looks like it isn’t. So,

> 1 + 2
## [1] 3

… is exactly the same as…

> `+`(1, 2)
## [1] 3

The + operator is really just “syntactic sugar” for a + function with the arguments 1 and 2 applied to it. Similarly,

> 1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

… is the same as…

> `:`(1, 10)
##  [1]  1  2  3  4  5  6  7  8  9 10

here again, to give the range of numbers between 1 and 10, : is really just a function in disguise. If you were to break down more complex expressions in this way, the result would be code that looks very Scheme-like indeed.

I will now look in more depth at the functional concepts that are most important in R, Vectorised functions, higher order functions and closures.

Vectorised functions

Probably the best known FP concept in R is the vectorised function which ‘automagically’ operates on a data structure like a vector in just the same way as on a single number. Some of the most important of these are the vector subsetting operations. In these, you take a declarative approach to programming: you tell R what you want, not how to get it. Because of this property of operating across vectors, proper use of vectorised functions will drastically reduce the number of loops you have to hand code into your scripts. They are still performing loops under the hood, but these loops are implemented in C, so are many times faster than a standard for loop in R.

For example, when I was first using R for data management and analysis, I spent months writing code like this:

> # Get all even numbers up to 200000
> # using S-style vector allocation:
> x <- c()
> for(i in 1:200000){
>     if(i %% 2 == 0){
>         x <- c(x, i)
>     }
> }

This is about the worst possible way of achieving the given task (here, getting all even numbers up to 200000). You are running a for loop, which is slow in itself, and testing if i is even on each iteration and then growing the x vector every time you find that it is. The code is ugly, slow and verbose (On my machine it took around 10 seconds).

For me, writing vectorised code was a real revelation. To achieve the same goal as the code above in a vectorised style:

> # FP style vectorised operation
> a <- 1:200000
> x <- a[a %% 2 == 0]

You assign a vector with all the values from 1 to 200000, then you say “I want all of these that are divisible by two”. This ran three orders of magnitude faster than the non-FP code, is half the length and is clearer – you don’t have to mentally run through the loop to work out what it does. So you get the benefits of both concision (the number of bugs correlates well with the number of lines of code) and clarity (the code becomes almost self-documenting).

This is a slightly unfair comparison and there are ways to speed up your loops, for example by pre-allocating a vector of the correct length before you run the loop. However, even if you do this, the result will still be around 20 times slower and will be even more verbose. It is good practice whenever you write a for loop in R to check if there is not a better way to do so using vectorised functions. The majority of R’s built-in functions are vectorised and using these effectively is a prerequisite of using R effectively.
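
For reference, the pre-allocated version of that loop looks something like this – faster than growing the vector, but still slower and wordier than the vectorised one-liner:

# Pre-allocate the result vector so it is not copied and grown on every iteration
x <- integer(100000)
j <- 0
for(i in 1:200000){
    if(i %% 2 == 0){
        j <- j + 1
        x[j] <- i
    }
}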

Higher-order functions

Because functions in R are first class citizens of the language, it is trivial to write and use functions that take other functions as arguments. The most well used of these are the functions in the apply family (lapply, sapply, apply etc.). These cause a lot of headaches for new-ish R users, who have just got to grips with for loops, but they are really no more difficult to use. When you use one of these apply functions, you are just taking a collection of data (say a list or vector) and applying the input function to every element of the collection in turn, and collecting the results in another collection.

Apply functions

Because the mapping of each element to the function is independent of the elements around it, you can split the collections up and join them together at the end, which means that the functions can be better optimised than a for loop (especially in the case of lapply) and also easily run over multiple processor cores (see multicore::mclapply).

Conceptually, to use these functions you just need to think about what your input is, what the function you want to apply to each element is and what data structure you want as your output:

  • lapply : Any collection –> FUNCTION –> list
  • sapply : Any collection –> FUNCTION –> matrix/vector
  • apply : Matrix/dataframe + margin –> FUNCTION –> matrix/vector
  • Reduce : Any collection –> FUNCTION –> single element

so if you want your output in a list, use lapply. If you want a vector or matrix, use sapply. If you want to calculate summaries of the rows or columns of a dataframe, use apply. If you want to condense your dataset down into a single summary number, use Reduce. There are several other functions in the same family, which all follow a similar pattern.
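
Concretely, using the four numeric columns of the built-in iris data:

nums <- iris[, 1:4]   # the four measurement columns

lapply(nums, mean)                  # list of column means
sapply(nums, mean)                  # the same, simplified to a named vector
apply(nums, MARGIN = 2, FUN = max)  # column-wise (MARGIN = 2) maxima
Reduce(`+`, 1:10)                   # folds a collection down to a single value: 55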

Closures

Closures are at the heart of all functional programming languages. Essentially, a closure is a function bundled together with data that was supplied via the arguments of the function that created it. The function ‘closes over’ that data at the time it is created, and the data can be accessed again at a later time. Compare this with the idea of an object in languages like C++ and Java, which is data with functions attached to it.

closures

You can use closures to build wrappers around functions with new default values, to partially apply functions and even to mimic object-oriented-style objects, but perhaps most interestingly, you can build functions that return other functions. This is great if you want to call a function many times on the same dataset but with different parameters, such as in maximum-likelihood optimisation problems where you are seeking to minimise some cost function, and also for randomisation and bootstrapping algorithms.
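
A tiny example of a function that returns a function:

# make.power() returns a new function that has 'closed over' the value of pow
make.power <- function(pow) {
  function(x) x^pow
}

square <- make.power(2)
cube   <- make.power(3)

square(4)
## [1] 16
cube(4)
## [1] 64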

To demonstrate the usefulness of this, I am now going to build a generic bootstrapping algorithm in a functional style that can be applied to any linear model. It will demonstrate not only functions returning functions, but higher-order functions (in this case, sapply), anonymous functions (in the mapping function to sapply) and vectorised functions. I will then compare this against a non-FP version of the algorithm and hopefully some of the advantages of writing in an FP style in R will become clear. Here is the code. I am doing a bootstrap of a simple linear model on the classic iris dataset:

boot.lm <- function(formula, data, ...){
  function(){
    lm(formula=formula, 
       data=data[sample(nrow(data), replace=TRUE),], ...)
  }
}

iris.boot <- boot.lm(Sepal.Length ~ Petal.Length, iris)
bstrap <- sapply(X=1:1000, 
                 FUN=function(x) iris.boot()$coef)

That is the algorithm. The boot.lm function returns a closure: you pass it a linear model formula and a dataframe, and it returns a function with no arguments that itself returns a linear model object fitted to a bootstrap replicate (a sample with replacement) of the supplied data. So the iris.boot function takes the formula Sepal.Length ~ Petal.Length and the iris dataset, and every time you call it you get a new bootstrap replicate of that model on that data. You then just need to run it 1000 times and collect the coefficients, which can be done with a one-liner sapply call. We are using sapply because we want a matrix of coefficients with one column per replicate. The FUN argument to sapply is an anonymous function that returns the coefficients of the fitted model. You could equally well have written something like

get.coefs <- function(x){
    iris.boot()$coef
}

bstrap <- sapply(X=1:1000, 
                 FUN=get.coefs)

… but because the function is so short, it is no less clear to include it without a name.

Once the model has run, we can use the apply higher-order function to summarise the rows of bstrap by applying the quantile function to give the median and 95% confidence intervals:

apply(bstrap, MARGIN=1, FUN=quantile, 
      probs=c(0.025, 0.5, 0.975))
##       (Intercept) Petal.Length
## 2.5%        4.162       0.3701
## 50%         4.304       0.4098
## 97.5%       4.448       0.4472

This is an elegant way to solve a common analysis problem in R. If you are running a large model and want to speed things up (and you have a few cores free!), it takes just a couple of lines of code to replace the call to sapply with one to multicore::mclapply and run the model on as many processor cores as you can.
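
For example, with the multicore package used earlier on this blog (the number of cores here is just an illustration; mclapply returns a list rather than a matrix, so the coefficients need binding back together):

library(multicore)

# Same bootstrap as before, but the replicates are farmed out across cores
bstrap.par <- mclapply(1:1000,
                       function(x) iris.boot()$coef,
                       mc.cores = 4)   # 4 cores is just an example

# mclapply returns a list of coefficient vectors, so bind them back into the
# same coefficients-by-replicates matrix that sapply gave us
bstrap <- do.call(cbind, bstrap.par)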

In contrast, here is a roughly equivalent non-FP style bootstrapping algorithm:

boot_lm_nf <- function(d, form, iters, output, ...){
  for(i in 1:iters){
    x <- lm(formula=form, 
            data=d[sample(nrow(d),
                   replace = TRUE),], ...)[[output]]
    if(i == 1){
      bootstrap <- matrix(data=NA, nrow=iters, 
                    ncol=length(x), 
                    dimnames=list(NULL,names(x)))
      bootstrap[i,] <- x
    } else bootstrap[i,] <- x
  }
  bootstrap
}

This ugly beast is full of fors and ifs and braces and brackets and double brackets. It has a load of extra boilerplate code to define the variables and fill the matrices. Plus, it is less generic than the FP version, since you can only output the attributes of the model itself, whereas previously we could apply any function we like in place of the anonymous function in the sapply call. It is more than twice as verbose and impossible to run on multiple cores without a complete rewrite. On top of all that, getting the coefficients out in a non-FP way is a tedious task:

bstrap2 <- boot_lm_nf(d=iris, 
            form=Sepal.Length ~ Petal.Length, 
            iters=1000, output="coefficients")
CIs <- c(0.025, 0.5, 0.975)
cbind( "(Intercept)"=quantile(bstrap2[,1],probs = CIs),
      "Petal.Length"=quantile(bstrap2[,2],probs = CIs))
##       (Intercept) Petal.Length
## 2.5%        4.169       0.3699
## 50%         4.310       0.4081
## 97.5%       4.448       0.4414

The code duplication in the cbind is a pain, as is having to name the coefficients directly. Both of these reduce the generalisability of the algorithm.

Wrapping up

I hope I have demonstrated that writing more functional R code is

  • More concise (fewer lines of code)
  • Often faster (Particularly with effective vectorisation)
  • Clearer and less prone to bugs (because you are abstracting away a lot of the ‘how to’ code)
  • More elegant

R is a strongly functional language to its core and if you work with this in your code, your R hacking will be more intuitive, productive and enjoyable.

Further Reading

Here are some good and accessible resources available if you want to learn more about functional programming in general and FP in R in particular:

  • Structure and interpretation of computer programs by Abelson and Sussman is the bible of FP and is written by the creators of Scheme. This book has been used as the core of the MIT Computer Science course since the early 1990s and is still not dated.
  • Hadley Wickham’s in-progress ebook on GitHub is a fantastic resource on FP in R, amongst a host of other advanced R topics.
  • The R Inferno by Patrick Burns is a classic free online book on R and has a great chapter on vectorisation and when it is best to apply it.
  • If you are interested in the metaphysical stuff at the start of this post, Rich Hickey, the inventor of the Clojure language, gave this great talk on the importance of FP and the failings of the traditional OOP model. The talk is also summarised nicely in this blog post.

Two R Tutorials for Beginners

I am currently in the process of rescuing some of the pages from my now-defunct datajujitsu.co.uk Blogger blog and moving them here. I also gave a tutorial today at the University of Manchester on data cleaning and subsetting, so I am killing two birds with one stone by linking to the Rpubs pages for both this and a short tutorial I gave last year on vectorisation.

The tutorials are:

  1. Subsetting in R: Spring cleaning your data
  2. Speeding up your R code – vectorisation tricks for beginners

The R Markdown source files for both of these are available on my GitHub page. Rpubs is a great site from the people behind RStudio that lets you upload R Markdown documents compiled with knitr in no time at all.

Using R Markdown, Knitr, RStudio and Rpubs to produce and publish tutorials has proved a complete joy. It is simple, quick and painless to get pages online with embedded R code and output.

I have also produced a slide presentation for an internal seminar series in my department using R Markdown, knitr, Pandoc and Beamer. I was really pleased with the results (which are also on my GitHub page) and how easily I was able to achieve them, particularly with the huge reduction in LaTeX boilerplate I had to write. I will be doing all of my presentations with this method in future and will blog about the workflow for doing so in due course.
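
I’ll save the details for that post, but the core of it boils down to knitting the R Markdown file to plain Markdown and then handing the result to Pandoc to build the Beamer PDF. Roughly like this (file names are illustrative, and you need pandoc and a LaTeX installation on your system):

library(knitr)

# Run the embedded R code and convert the R Markdown source to plain Markdown
knit("seminar.Rmd")   # produces seminar.md

# Let Pandoc turn the Markdown into Beamer slides
system("pandoc -t beamer -o seminar.pdf seminar.md")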


Scraping Organism Metadata for Treebase Repositories From GOLD Using Python and R

I recently wanted to get hold of habitat/phenotype/sequencing metadata for the individual organisms of an archived Treebase project.

The GOLD database holds more than 18000 full genomes. For many of these it provides pretty good metadata (GOLDcards) which are indirectly linked to Treebase via NCBI taxa IDs.

Unfortunately, GOLD does not seem to have any kind of API for systematic downloads, so I hacked together a very quick-and-dirty scraper in Python that reads in taxa from a Treebase repo, follows the links to each species’ NCBI page and downloads the linked GOLDcard, if it exists.

Here is the code. You will need the external BeautifulSoup and lxml libraries for this to work – both are fantastic. (The Treebase repo here is from Wu et al. 2009**, just change the url string for a different repo):

Once you have downloaded all of the available files, it would be great to have the metadata in a nice flatfile with one line per taxon, right? I did this with a little R script using the rather wonderful readHTMLTable() function in the XML package (install.packages('XML')).
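
As a rough sketch of the idea (assuming the downloaded GOLDcards sit in a goldcards/ directory; the real pages need rather more cleaning than this):

library(XML)

# Each downloaded GOLDcard is an HTML page of metadata tables;
# readHTMLTable() pulls them all out as a list of data frames
files <- list.files("goldcards", pattern = "\\.html$", full.names = TRUE)
cards <- lapply(files, readHTMLTable)

# From here the per-card tables can be reshaped into one row per taxon
# and written out as the semicolon-separated flatfile described below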

The output is a semicolon separated file with taxa in the rows and the different categories of metadata in columns. The metadata is often fairly incomplete, and there are plenty of omissions, but hopefully it will become more useful as more deposits are made to GOLD.

** Wu D., Hugenholtz P., Mavromatis K., Pukall R., Dalin E., Ivanova N.N., Kunin V., Goodwin L., Wu M., Tindall B.J., Hooper S.D., Pati A., Lykidis A., Spring S., Anderson I.J., D’haeseleer P., Zemla A., Singer M., Lapidus A., Nolan M., Copeland A., Chen F., Cheng J., Lucas S., Kerfeld C., Lang E., Gronow S., Chain P., Bruce D., Rubin E.M., Kyrpides N.C., Klenk H., & Eisen J.A. 2009. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature, 462(7276): 1056-1060.
