What I Am Looking for in a New Data Science Hire

This is my first post for a very long time. At least part of this tardiness is because I have been getting to grips with my new role as senior data scientist at RealityMine, a company providing cross-media behavioural data products for marketing and media customers. I started in November 2015 and am now in a second round of recruitment. There are any number of articles online about ‘how to get your first job as a data scientist’ or ‘what skills a data scientist needs’, but for me, many of them feel more like a wish list for the perfect data science degree course than advice for getting a new position – they are too general and often seem to be written by people looking for work as a data scientist rather than by someone trying to build up a team. For me, data science is far more than being able to run a Spark job on a Hadoop cluster or to implement a machine learning algorithm: it requires critical thinking, the ability to apply the scientific method in unfamiliar settings and creativity in crafting stories and visualisations from the high-quality information refined from large, messy and complex data. Similarly, big data is not a panacea for modern business: in fact, it raises many questions that require a combination of statistical and computational skills, domain knowledge and common sense to address effectively.

In this post I hope to give a flavour of the kinds of skills, knowledge and values that I am specifically looking for when I read CVs and conduct interviews. Clearly different teams will have different requirements, and others may well disagree with me on which facets are the most important, but I wanted to give an insight into what someone on the other side of the recruitment fence values in potential new hires.

I am looking for:

Good conceptual knowledge of maths, statistics and machine learning

I am not necessarily looking for someone who does Gaussian elimination or eigenvalue decomposition by hand for fun, but I would need them to understand the relationships between rows and columns in matrices and be able to relate them to data and models, and to understand what a PCA is doing, what it might tell us and its impact on further analysis. I find it useful to ask around these subjects in an interview, partly to assess for a minimum conceptual understanding of statistics but also to assess the ability to communicate complex ideas to people who may not be experts in the domain. A complete algebraic understanding is far less important than knowing when to apply a logistic or a linear regression and how to interpret the output. I want people who know the difference between classification, regression, clustering, supervised and unsupervised learning, predictive modelling, dimension reduction etc. and who know fairly instinctively when to apply each, even if they are not sure exactly which model might be the best within each of those groups. I am not particularly interested in whether you have coded your own Gaussian process model from the bare metal up in C: this may well be impressive, but I am looking for practical data scientists who are able to apply their knowledge to a wide range of problems and who, if they don’t know the answer, at least have a pretty good idea of where to look for it.
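To give a flavour of the level I mean, fitting and interpreting a logistic regression in R takes only a few lines (a sketch using the built-in mtcars data; the choice of variables is purely illustrative):

```r
# Model the probability of a manual gearbox (am) from car weight (wt)
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)

# Exponentiated coefficients are odds ratios: the multiplicative change
# in the odds of a manual gearbox per 1000 lb increase in weight
exp(coef(fit))
```

Being able to say what the exponentiated coefficient means, and why you chose a binomial family at all, matters far more to me than deriving the likelihood by hand.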

Data processing and visualisation expertise

This is perhaps the only absolutely essential thing I am looking for, and it is why ‘pure’ statisticians with experience only in SPSS, Stata or similar are really not likely to make it past an interview. I need people who are data manipulation cyborgs, for whom filtering, subsetting, merging and transforming data is second nature and who can do it with flow using their preferred tool (be it dplyr, Pandas, SQL or whatever), without being hindered by it. Think of the difference between a new driver, who has to think consciously about every single decision (signal, gears, brakes, accelerator, mirrors etc.) and yet is still overloaded with information, and someone who has been driving for years, for whom many of these tasks are handled by muscle memory and the unconscious brain. Data manipulation is not the chore to get through before the proper data science – it is data science. The same is true for visualisation. Exploring the data always comes before understanding it, and visualising data is a hugely important part of this. Personally, I spent a lot of time getting familiar enough with ggplot2 to be able to crank out quality plots on demand, but if you have another favourite that you work efficiently with, that is equally fine. Related to this is report building: my workflow has been drastically improved by learning to generate automated reports using tools like R Markdown and Emacs Org mode. If new hires have already spent the time getting to grips with these tools, they can get to grips with the data that much more quickly.
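The sort of fluency I have in mind is being able to dash off a filter-group-summarise-join pipeline without breaking stride. A minimal sketch with dplyr and the built-in mtcars data (any equivalent tool would do just as well):

```r
library(dplyr)

# Summarise fuel efficiency by cylinder count, dropping the heaviest cars...
by_cyl <- mtcars %>%
  filter(wt < 5) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

# ...then merge the summary back onto the raw data
merged <- left_join(mtcars, by_cyl, by = "cyl")
head(merged)
```

The point is not this particular pipeline but that writing it should take seconds, not minutes of documentation-hunting.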

Scientific method

This is critical. Data scientists lack the specialist, deep software development knowledge of engineers, but we are constantly testing hypotheses, experimenting and proving ourselves wrong in our quest for meaning in our data. Hypothetico-deductive thinking is the best method humans have devised for discovery and reasoning where the information we have is incomplete, and I think this is the main reason why many say that science PhDs (rather than statistics, maths or CS graduates) make the best data scientists. This is drilled into us for at least three years until it becomes second nature. (Conversely, none of the best engineers I have met have done a PhD – make of that what you will!)

An interest in programming

I am looking for the polyglot data scientist: they may be a wizard at using R, but I need them to understand the circumstances under which it is better to use Python or SQL or Bash, and to be open to diving in to newer tools like Scala/Spark if they have not already. I need them not to be constrained to the tools they know best. A blog, Twitter feed, Stack Exchange presence or GitHub account is a good place to show that you have an interest in the broader issues in data science. An interest in programming and computer science more generally is another big plus – someone who has spent a little time playing with a few other languages is likely to have a better feel for how their main languages perform under certain conditions. Having a grasp of the differences between functional and object-oriented programming and their relative merits for data analysis and pipelines would be a definite benefit. Knowing their way around the Linux command line, Git and other command line tools is pretty much a prerequisite for hitting the ground running in the first few weeks.

Domain knowledge

I work in a fairly disruptive company in an area I had relatively little knowledge of before I joined. Domain knowledge is often most usefully learned on the job, at least initially, but there should be an interest in the industry you are looking to work in. There is little you can do to prepare yourself for the idiosyncrasies of a new company’s data warehouse; however, an understanding that you are using essentially the same models whether the domain is ecology, health informatics, market research or ecommerce automatically stands you in good stead for understanding the data science needs of a business. From then on the main requirement is curiosity and a thirst for knowledge. I’d rather have an ex-ecologist who had split the last few years between cranking out models in R and digging in the dirt for beetles than someone who had spent 10 years in my industry but who struggled with anything much more than a pivot table in Excel and point-and-click models in SPSS.

Miscellaneous traits

  • Need to be comfortable with meetings and explaining complex concepts to people with little or no statistical or technical knowledge
  • Willingness to learn and also to impart knowledge to other departments and areas in the company. Data science should never exist in a silo: There is much to learn from BI, engineering, architecture and infrastructure and hopefully much to teach them too!
  • I don’t want someone who just wants to put their headphones on and code 8 hours a day. There will be many opportunities to get stuck into some proper hacking but I also need the ability to collaborate, tell stories with the data and explain why they are using a particular technique to project managers, salespeople and customers alike

In summary

So it is conceptual knowledge, technical virtuosity, critical scientific thinking, a love of learning, a gift for communicating complex ideas and a collaborative spirit that I look for in a new data science hire. I don’t much like breaking data scientists down into groups like ‘modellers’, ‘visualisers’, ‘data wranglers’ etc. Of course different people will be relatively stronger in a given area at any given point in their careers, but data science is an ever-changing environment that selects for generalists, and to limit yourself to a certain facet of the discipline is to potentially limit your own opportunities.

I’d be interested to hear your opinions. And, by the way, we are hiring!

Book Review: Haskell Data Analysis Cookbook

Haskell Data Analysis Cookbook by Nishant Shukla

[Full disclosure – I was given a free review copy of the book from the publisher. This review refers to the ebook version]

Data scientists are spoiled for choice with languages for statistics and data analysis these days. There is the tried and battle-tested but sometimes plodding workhorse (R), the popular general purpose language with a bunch of fairly new tools (Python), the expensive, closed-source old guard (SAS, Matlab, Stata), C++ for lightning fast calculations but slooow development time, a modern day Lisp with great Java interaction (Clojure) and a new Lisp/Python/R hybrid that is looking to steal Matlab’s crown for numerical computation (Julia). This book sees the purely functional, strongly typed language Haskell throw its hat in the data science ring.

This is a great book for data scientists wanting to leverage the power of functional programming in data analysis applications. There are chapters on data cleaning, text scraping, hashing, tree traversal, social network analysis, basic stats and machine learning algorithms, MapReduce and visualisation. It is also good for more general purpose beginner- and intermediate-level Haskell hackers, since it covers a lot of areas that can be essential in day-to-day programming, such as reading and writing data from and to a variety of sources (including databases and the web), text processing, parallel programming and dealing with real-time data. Often in books on Haskell (as in many books on Lisp), much space is given to showing off the cool FP aspects of the language, but you are left struggling to do the practical IO tasks that are no-brainers in more traditional languages.

The book covers a wide range of subjects, but for most of them it provides only a primer to get you started. In most cases this is enough, but there are a couple of areas where I would have liked to see greater depth; for example, in the statistics section, a more comprehensive introduction to linear modelling and regression would have been more convincing for R and Python users thinking of switching to Haskell. I would also have liked to see a treatment of random number generation and simulation; the purely functional nature of Haskell seems to make it awkward to generate simple random sequences, because you need to pass the seed or generator state around explicitly in order to maintain referential transparency (i.e. a function in Haskell generally needs to return the same value for the same input every time).

The examples are well chosen, well written and well explained, and there is also a GitHub page with the source code from all the chapters if you don’t want to type each one out from scratch. Another nice touch is the list of data sources and APIs for doing your own analyses. I’m sure you could find them with a little Googling, but it is good to have them all in one place, and there were several I was not aware of.

It must be said that this is not an ‘introduction to Haskell’ book. There is plenty of assumed knowledge and the syntax is hardly discussed at all. However, I would highly recommend that anyone starting off with Haskell get this book alongside an introductory text such as the fantastic “Learn You a Haskell for Great Good!” to get to grips with the mind-bending complexity of the pure FP paradigm alongside its practical real-world applications.

The quality of the book aside, after reading it I am yet to be convinced that Haskell is a great language for data science. The static typing and functional purity would be good for building large-scale data-centric systems requiring rigorous validation, but in my experience most analytics coding involves lots of rapid prototyping, on-the-fly testing and hacking away at the REPL. It is this style of programming that Python, R and Lisp-based languages excel at – incrementally growing your codebase to deal with swiftly changing data and conditions. Haskell seems slightly too rigid to be a comfortable fit; having to think too much about types early on slows down early development (although it could well pay off later), and the strictness of its functional programming can feel pretty unforgiving. Also, the GHCi interactive environment just doesn’t feel as natural as a Lisp or R REPL. Finally, Haskell seems to be missing other important features, such as an equivalent to R’s data.frame (which is available in Python, Clojure and Julia) and good interoperability with other systems (such as Weka, R and Hadoop).

Book Review: Social Media Mining With R

Social Media Mining with R by Nathan Danneman and Richard Heimann

[Full disclosure – I was given a free review copy of the book from the publisher. This review refers to the ebook version]

I worry when a book proposes that it will appeal to everyone. The trouble is that it could well end up appealing to no-one. The introduction to this new book on social media mining in R says that it will be suitable for skilled programmers with little social science background, people who lack any sort of programming background, people wanting an introduction to social data mining and advanced social science researchers. That doesn’t leave many people out!

It turns out that this book doesn’t really cover much social media mining. It does cover basic sentiment analysis and text mining, with some references to Twitter data, but other social media platforms are barely mentioned and there is no discussion of any of the other social media APIs (e.g. Facebook, LinkedIn and GitHub). The authors also miss an opportunity to discuss other analysis techniques, such as social network analysis and network graphs, for which Twitter (and other social media) data would be ideal.

In the first third of the book, much mention is made of big data and how researchers are going to have to prepare for it, but the actual content for dealing with big data (or at least data that would struggle to fit in the RAM of an average laptop) amounts to little more than “check that you have enough memory to read in your data and read the manual for the read.table function”. R has a reputation for not coping well with large datasets, but there are some packages that really help with this (like data.table and its fread function, sqldf and dplyr) that could easily have been discussed, along with any number of tips for speeding up I/O using base R functions.
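For example, even a short comparison of base read.csv against data.table’s fread would have made the point (a sketch; the exact timings will vary with machine and file size):

```r
library(data.table)

# Write a moderately large CSV to compare the two readers
df <- data.frame(x = rnorm(1e6), y = sample(letters, 1e6, replace = TRUE))
write.csv(df, "big.csv", row.names = FALSE)

# Base R reader: simple but slow on files this size
system.time(read.csv("big.csv"))

# data.table::fread: typically many times faster on the same file
system.time(fread("big.csv"))
```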

The introductory and background material is interesting and provides a good high-level introduction to sentiment analysis, though it still feels somewhat rushed (for example, why completely leave out semi-supervised approaches? It would have been nice to have at least a page or two describing what they are and why they might be useful), and there is little detail on the assumptions of the different models. The introduction would make a good springboard for deeper research, however, and the further reading and bibliography sections are very good.

I like there to be code on nearly every page of a programming book, but this one is very light on R code examples. There is a chapter on setting up R and installing packages (this is not going to be anyone’s first R book, so the chapter feels like a waste, particularly in a book that clocks in at only 120 pages), another covering getting tweets via the twitteR package (but not the streaming API for collecting tweets over a prolonged period, nor any way of storing tweets in a database), and then nothing until two thirds of the way through, where there are three short case studies of different methods of sentiment analysis. It also seems strange to separate the theory in the earlier chapters from the practical applications later on.

The case studies are mostly decent, covering lexicon-based, naive Bayes and the authors’ own unsupervised item response theory sentiment analysis methods. It seems strange to me, in a book on social media mining, that the first example doesn’t even use social media data but US government reports, which even the authors admit bear little resemblance to Twitter data. Some of the code is poorly explained (or not explained at all), idiosyncratic (there may be reasons for using scan to read in a text file, but what are they?) and makes use of out-of-date packages (Hadley Wickham shipped reshape2 back in 2012, so why still use reshape?). The authors miss several opportunities to flesh out the book: one minute they say how important data cleaning is and how useful regular expressions are, and the next they just point you in the direction of a website teaching regular expressions. The same is true for the plotting package ggplot2 (is Wilkinson’s The Grammar of Graphics really an appropriate introductory text for this package?) and the RWeka machine learning package. A lot of the information is good (in particular the description of the tm text-mining package functions), but it all feels a little lightweight, like a collection of blog posts. The case study analyses were interesting, though choosing such obviously polarised subjects as abortion and Obamacare resulted in much neater, cleaner results than would have been found with subjects where opinions are less black-and-white.

Sentiment analysis seems to be everywhere at the moment. It reminds me of word clouds a few years ago: everyone starting out in data science was using them, but as they became more ubiquitous they soon became a little embarrassing, and they were even famously described as the “mullets of the internet”. Make of it what you will that word clouds are used several times as an analysis method in this book.

Users who have maybe done a Coursera data science course or two and want to try their hand at some simple analyses to boost their portfolio will probably find this book useful but more serious users are likely to be left disappointed.

Using Twitter to Keep Up to Date With the Medical Literature

The medical literature is huge and growing at an explosive rate. According to a 2010 study, there were 75 clinical trials and 11 systematic reviews being published every day. In a couple of quick PubMed searches, I found 17,571 papers published with “clinical trials” and 7,974 with “primary care” in the title or abstract in 2013 alone. With these kinds of numbers, the prospect of keeping on top of the new publications in even a relatively small subset of this literature seems daunting to say the least. Even two weeks of annual leave could mean you have hundreds of papers to plough through! Wouldn’t it be good to have some kind of real-time feed of papers in your research area that you could check on a regular basis and quickly scan for anything that is directly relevant or interesting?

It turns out that Twitter is an excellent tool to do just that. Without any specialist knowledge or programming, you can quickly set up a ‘twitterbot’ that gives links to the results of any PubMed search along with the article titles. You could get PubMed to send digest emails with the same information, but these can soon clutter up your inbox and are difficult to archive properly and efficiently. A Twitter feed is easily scanned through and doesn’t feel intrusive or unwieldy.

I set up a twitterbot for articles in my field (electronic medical record research) in February. Since then, @EMR_research has already indexed over 230 papers. Going through all of these titles at once would have been a pretty tedious affair, but now I can just check the feed every day or so over a coffee and very quickly determine which papers are worth reading or bookmarking. Casey Bergman, a researcher in the Faculty of Life Sciences at the University of Manchester, who set up one of the earliest academic twitterbots (for Drosophila research), puts it like this:

It is relatively easy to keep up on a daily basis with what is being published, but it’s virtually impossible to catch up when you let the river of information flow for too long.

One of the further benefits of hosting academic search feeds on Twitter is that they are available for others to see too. So one person can set up a feed and their whole research group has access to it. In addition, a feed can become a resource hub for researchers and other interested parties in that area. Since I set up @EMR_research in February, the page has had follows, retweets and favourites from epidemiologists, clinicians, statisticians, medication information services, patient interest groups, healthcare startups and research councils.

According to this list, there are already over 30 twitterbots indexing the scientific literature across a range of disciplines, from neuroscience and cell biology to parasitology and evolutionary ecology. Having access to a small network of feeds over several closely related disciplines to cross-reference would be immensely helpful for keeping up with the literature and forming the basis of systematic reviews. Health researchers may frequently find themselves hopping between projects in different research areas – one day diabetes, the next cardiovascular disease, the next mental health – and having an easily accessible feed of the most recent literature for each would be invaluable.

Setting up a Twitter feed for publications in your research area is easy. You just need a Twitter account and a free account with dlvr.it, a content sharing service. See these two posts for detailed instructions on how to set up your own feed.

It’s Not Just for Stats: Web Mining in R

These are the slides from a talk I gave to the Manchester R user group on the 6th of May, 2014.

I discussed what web mining (or web scraping) is and the advantages of using R to do it. Then I provided a toolkit of R functions for web mining before giving two practical examples:

  1. Downloading multiple files from a website
  2. A mash-up of data from a number of sources to find out the best places to apply for a new job in academia!

Survival Analysis in R

I recently gave a talk at the Manchester University Faculty of Life Sciences R user group on survival analysis in R. Survival analysis is hugely important for modelling longitudinal data such as cohort studies. However, people still find it tricky: the authors of a recent paper I reviewed used a logistic regression when they should have used survival analysis. Beyond epidemiology and health research, survival analysis is a really useful addition to ecologists’ and biologists’ toolkits whenever they want to model the time taken for some event to occur. For example:

  • Germination timing
  • Arrival of a migrant or parasite
  • Seed or offspring dispersal
  • Response to some stimulus

In the presentation, I discussed censoring (dealing with missing data of various kinds) and the survival function before introducing the Cox proportional hazards model. This is the most commonly used survival model because it is a semi-parametric model in which you can model the multiplicative effects of your covariates of interest (e.g. drug therapy on survival) without having to explicitly model the shape of the survival curve itself.

I then gave a run-through of a typical survival analysis in R before providing some links for further reading.
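The core of such an analysis takes only a few lines with the survival package. A minimal sketch using the package’s built-in lung cancer dataset (not the examples from the talk itself):

```r
library(survival)

# Surv() combines follow-up time with the censoring indicator
# (in the lung data, status == 2 means the event, death, was observed)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)

# Exponentiated coefficients are hazard ratios for each covariate
exp(coef(fit))

# Kaplan-Meier curves by sex for a quick visual check
plot(survfit(Surv(time, status) ~ sex, data = lung))
```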


Care.data Could Help Avoid Another Pharma Scandal

This is a post I wrote for The Conversation on the problems with the new UK Care.data electronic medical record database. The post was later taken up by BBC news and I appeared on the BBC1 Breakfast show for a televised debate with the GP Gordon Gantz on the benefits of a national medical record database.


Book Review: Machine Learning With R

Machine Learning with R by Brett Lantz

[Full disclosure – I was given a free review copy of the book from the publisher. This review refers to the ebook version]

This is the most recent of a group of books that try to explore machine learning from a programming, rather than purely mathematical, perspective. The book is highly successful in this respect and deserves a place on the bookshelf of any data scientist, Kaggler or statistician.

The book takes a slightly different tack from previous ones in this field (see ‘Programming Collective Intelligence’ and ‘Machine Learning for Hackers’) in that it concentrates largely on the packages themselves and how to use them to solve real-world ML problems, rather than on coding up simple algorithms from scratch and running them on toy datasets. Perhaps this way the book doesn’t provide as much insight into algorithm design, but it does make it much more practically useful, particularly since it spends a good chunk of each chapter explaining each algorithm in simple, plain English.

The book is well laid out and well written. Despite a slightly shaky start (do we really still think of ML in terms of Skynet, the Matrix and HAL?), the introduction is excellent and gives a pleasing summary of the philosophical and ethical issues surrounding machine learning and big data. Next, there is a thoughtful introduction to data management and exploratory data analysis that highlights important and often-missed tips on things like getting data out of SQL databases. It introduces some basic R functions and concepts (some I had managed to miss up until now) without feeling like a tacked-on ‘R for beginners’ chapter.

In the guts of the book, each chapter focuses on a group of related algorithms (k-NN, naive Bayes, decision trees, regression, neural nets and SVMs, association rules, clustering) and has a good introduction to the algorithm in question, followed by sections on finding and cleaning data, implementing the algorithm on the data and evaluating and improving model performance. There are clear and easy-to-understand tables and descriptions of the important distinctions between the algorithms and the reasons for choosing one over another. The datasets the author has chosen are large and interesting enough to illustrate the points being made well without being frustratingly unwieldy, and many of them are ‘classic’ machine learning datasets from places such as the UCI Machine Learning Repository.

Next, the book looks more deeply at evaluating and improving model performance and discusses important ensemble and meta-learning techniques like bagging, boosting and random forests. This section will be of particular interest to people wanting to enter Kaggle or other data science competitions, because it shows how to milk as much performance as possible from the basic algorithms described earlier in the book.

The final section discusses getting the algorithms to run on big datasets and improving the performance of R itself, using tools like the data.table and ff packages and parallel processing. This is the only section of the book that feels slightly rushed, and many of these topics are discussed only briefly before linking to the relevant package documentation. This is only a small criticism, though, since coding up these kinds of systems depends strongly on the data you have, and these are difficult subjects to cover whilst retaining generality.

Obviously, the book cannot cover everything. It is decidedly light on graphs and has almost nothing on visualisation techniques and packages like ggplot2, which have become almost mandatory for doing data science today. Also, if you are new to R, you really want to get one of the excellent introductory books first, and if you are new to ML, you probably want to spend a while learning some basic stats as well. Finally, this book doesn’t pretend to be a deep text on the mathematics of the algorithms it covers. For that you will need to go for something like Bishop’s classic ‘Pattern Recognition and Machine Learning’ and be prepared to put in some serious effort!

In short, if you are looking for a practical guide to implementing ML algorithms on real data and if you are more comfortable thinking in R code than in mathematical equations, this is the book for you and is the best that I have seen on the subject so far.


Develop in RStudio, Run in RScript

I have been using RStudio Server for a few months now and am finding it a great tool for R development. The web interface is superb and behaves in almost exactly the same way as the desktop version. However, I do have one gripe, which has forced me to change my working practices slightly – it is really difficult to crash out of a frozen process. Whereas in console R I could just hit Control-D to get out and back to Bash, in RStudio, while you can use the escape key to terminate an operation, if you have a big process going on, everything just freezes and you can’t do anything. One way to deal with this is to kill the rstudio process from another terminal, but this kills the whole session, including the editor, and you will lose anything you haven’t saved in your scripts. This problem is exacerbated when I am trying to run parallel processes using the multicore package, because it takes an age to kill all of the extra forks first.

So, now I use RStudio for development and testing and run my final analysis scripts directly using Rscript. I have these lines of code at the start of my scripts…

args <- commandArgs(trailingOnly = TRUE)
cores <- if (length(args) >= 1) as.integer(args[1]) else 1
cat(sprintf("Multicore functions running on maximum %d cores\n", cores))
## Multicore functions running on maximum 1 cores

… so when I am testing in RStudio, cores is set to 1, but when I run the script via Rscript, I can specify how many cores I want to use. I then just add mc.cores = cores to all of my mclapply calls, like this:

# Example processor hungry multicore operation
mats <- mclapply(1:500,
                 function(x) matrix(rnorm(x*x),
                                    ncol = x) %*% matrix(rnorm(x*x),
                                                         ncol = x),
                 mc.cores = cores)

The advantage of this is that, when mc.cores is set to 1, mclapply just calls lapply, which is easier to crash out of (try running the above code with cores set to more than 1 to see what I mean: it will hang for a lot longer) and produces more useful error messages:

# Error handling with typos:
# mclapply with 2 cores:
mats <- mclapply(1:500,
                 function(x) matrix(rnor(x*x),
                                    ncol = x) %*% matrix(rnorm(x*x),
                                                         ncol = x),
                 mc.cores = 2)
## Warning: all scheduled cores encountered errors in user code
# Falls back to lapply with 1 core:
mats <- mclapply(1:500,
                 function(x) matrix(rnor(x*x),
                                    ncol = x) %*% matrix(rnorm(x*x),
                                                         ncol = x),
                 mc.cores = 1)
## Error: could not find function "rnor"

Running final analysis scripts has these advantages:

  • You can much more easily crash out of them if there is a problem.
  • You can run several scripts in parallel without taking up console space.
  • You can easily redirect the stdout and stderr from your program to a log file to keep an eye on its progress.

Running multicore R scripts in the background with automatic logging

If you have a bunch of R scripts that might each take hours to run, you don’t want to clog up your RStudio console with them. This is a useful command to effectively run a big R analysis script in the background via Rscript. It should work equally well for Linux and Mac:

$ Rscript --vanilla R/myscript.R 12 &> logs/myscript.log &

Rscript runs the .R file as a standalone script, without going into the R environment. The --vanilla flag means that you run the script without calling in your .Rprofile (which is typically set up for interactive use) and without prompting you to save a workspace image. I always run Rscript from the same level as that which is set as the project root for RStudio, to avoid any problems from relative paths being set up wrongly. The number after the .R file to be run is the number of cores you want to run the parallel stuff on. Other arguments you may want to pass to R would also go here. The &> operator redirects both stdout and stderr to the file logs/myscript.log (I always set up a logs directory in my R projects for this purpose). The & at the end runs the process in the background, so you get your Bash prompt back while it is running. Then if you want to check the progress of your script, just type:

$ tail -f logs/myscript.log

And you can watch the output of your script in real time. Hit Control-C to get back to your command prompt. You can also use the top command to keep an eye on your processor usage.

If you want to kill your script, either find the PID number associated with your process in top and do kill PIDNUMBER or, if you are lazy/carefree, type killall R to kill any running R processes. This will not affect your RStudio instance.


Functional Programming in R

This post is based on a talk I gave at the Manchester R User Group on functional programming in R on May 2nd 2013. The original slides can be found here

This post is about functional programming: why it is at the heart of the R language and how it can hopefully help you to write cleaner, faster and more bug-free R programs. I will discuss what functional programming is at a very abstract level, as a means of representing some simplified model of reality on a computer. Then I’ll talk about the elements that make up functional programming and highlight the most important of them for programming in R. I will then go through a quick demo of an FP-style generic bootstrap algorithm that samples linear models and returns bootstrap confidence intervals. I’ll compare this with a non-FP version, so you will hopefully see clearly the advantages of using an FP style. To wrap up, I’ll make a few suggestions for places to go if you want to learn more about functional programming in R.

What is Functional programming?

… Well, what is programming? When you write a program you are building an abstract representation of some tiny subset of reality on your computer, whether it is an experiment you have conducted, a model of some financial system or a collection of features of members of a population. There are obviously different ways to represent reality, and the different methods of doing so programmatically can be thought of as the metaphysics of different styles of programming.

Consider for a moment building a representation of a river on a computer, a model of a river system for example.

[Image: a river]

In non-functional languages such as C++, Java and (to some extent) Python, the river is an object in itself, a thing that does things to other things and that may have various properties associated with it such as flow rate, depth and pollution levels. These properties may change over time but there is always this constant, the river, which somehow persists over time.

In FP we look at things differently…

Heraclitus – “We never step into the same river twice”

The pre-Socratic philosopher Heraclitus said “We never step into the same river twice”, recognising that the thing we call a river is not really an object in itself, but something undergoing constant change through a variety of processes. In functional programming we are less concerned with the object of the river itself and more with the processes that it undergoes through time. Our river at any point in time is just a collection of values (say, particles and their positions). These values then feed into a process to generate the series of values at the next time point. So we have data flowing through processes of functions, and that collection of data over time is what we call a river, which is really defined by the functions that the data flows through. This is a very different way of looking at things from that of imperative, object-oriented programming.

After this somewhat abstract and philosophical start, I’ll talk about the more practical elements of functional programming (FP). FP has been around for a very long time and originally stems from Lisp, which was first implemented in the 1950s. It is making something of a comeback of late for a variety of reasons, but mostly because it is so good at dealing with concurrent, multicore problems, potentially spread over many computers. There are several elements that FP is generally considered to comprise. Different languages highlight different elements, depending on how strictly functional they are.

Functions are first class citizens of the language

This means that functions can be treated just like any other data type – they can be passed around as arguments to other functions, returned by other functions and stored in lists. This is the really big deal about functional programming and allows for higher-order functions (such as lapply) and closures, which I’ll talk about later. This is the most fundamental functional concept and I’d argue that a language has to have this property in order to be called a functional language, even if it has some of the other elements listed below. For example, Python has anonymous functions and supports declarative programming with list comprehensions and generators, but functions are fundamentally different from data types such as lists, so Python cannot really be described as a functional language in the same way as Scheme or R can be.
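As a tiny illustration of this (the variable names here are invented for the example), functions can be stored in a list and passed around like any other value:

```r
# Functions are values: store them in a list and apply each one in turn
summaries <- list(mean = mean, sd = sd, max = max)
results <- sapply(summaries, function(f) f(c(2, 4, 6, 8)))
results[["max"]]
## [1] 8
```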

Functional purity

This is more of an element of good functional program design. Pure functions take arguments, return values and otherwise have no side effects – no I/O or global variable changes. This means that if you call a pure function twice with the same arguments, you know it will return the same value. This means programs are easily tested because you can test different elements in isolation and once you know they work, you can treat them like a black box, knowing that they will not change some other part of your code somewhere else. Some very strictly functional languages, like Haskell, insist on functional purity to the extent that in order to output data or read or write files you are forced to wrap your ‘dirty’ functions in constructs called monads to preserve the purity of your code. R does not insist on functional purity, but it is often good practice to split your code into pure and impure functions. This means you can test your pure code easily and confine your I/O and random numbers etc to a small number of dirty functions.
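For instance, a pure rescaling function can be tested in isolation, while an impure one touches state outside itself (the function and variable names below are invented for illustration):

```r
# Pure: the result depends only on the input vector, no side effects
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scale01(c(0, 5, 10))
## [1] 0.0 0.5 1.0

# Impure: reads from disk and writes to the global environment
load_data <- function(path) dat <<- read.csv(path)  # side effects!
```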

Vectorised functions

Vectorised functions operate equally well on all elements of a vector as they do on a single number. They are very important in R programming to the point that much of the criticism of R as a really slow language can be put down to failing to properly understand vectorisation. This also includes the declarative style of programming, where you tell the language what you want, rather than how you want to get it. This is common in languages like SQL and in Python generators. I’ll discuss this more later.

Anonymous functions

In FP, naming and applying a function are two separate operations: you don’t need to give your functions names in order to call them. So, submitting this function:

powfun <- function(x, pow) {
    x^pow
}
powfun(2, 10)
## [1] 1024

to the interpreter is exactly the same as applying the arguments directly to the anonymous function:

(function(x, pow) {
    x^pow
})(2, 10)
## [1] 1024

This is particularly useful when you are building small, single use functions such as those used as arguments in higher order functions such as lapply.
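For example, a one-off anonymous function passed straight into sapply:

```r
# Square each element with a throwaway anonymous function
sapply(1:5, function(x) x^2)
## [1]  1  4  9 16 25
```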

Immutable data structures

Immutable data structures are associated with pure functions. The idea is that once an object such as a vector or list is created, it should not be changed. You can’t affect your data structures via side effects from other functions. Going back to our river example, doing so would be like going back in time and rearranging some of the molecules and starting again. Having immutable objects means that you can reason more easily about what is going on in your program. Some languages, like Clojure, only have immutable data structures and it is impossible to change a list in place, you would have to have a list as an argument to a function which returns another list that you then assign back to the variable name for the original list. R does not insist on immutability, but in general, data structures are only changed in this way and not through side effects. It is often best to follow this, for the same reasons as it is best to have pure functions.
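R’s copy-on-modify semantics make this easy to see: a function that “changes” its argument really changes a local copy, and the original is untouched:

```r
x <- 1:5
f <- function(v) { v[1] <- 99; v }  # modifies a local copy only
y <- f(x)
x
## [1] 1 2 3 4 5
y
## [1] 99  2  3  4  5
```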


Recursive functions

Recursive functions are functions that call themselves. Historically, these have been hugely important in FP, to the extent that some languages (for example Scheme) do not even have for loops and they define all of their looping constructs via recursion. R does allow for recursive functions and they can sometimes be useful, particularly in traversal of tree-like data structures, but recursion is not very efficient in R (it is not tail-call optimised) and I will not discuss it further here, though it may well be the subject of a future post.
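As a small example, here is a recursive factorial – fine for small inputs, bearing in mind the efficiency caveat:

```r
# A recursive factorial: the function calls itself until the base case
fact <- function(n) if (n <= 1) 1 else n * fact(n - 1)
fact(5)
## [1] 120
```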

Functional Programming in R

R has a reputation for being an ugly, hacked-together and slow language. I think this is slightly unfair, but in this ever-so-slightly biased account, I am going to blame the parents:

R genealogy

R is the offspring of the languages S and Scheme. S is a statistical language invented in the 1970s which is itself based on non-functional, imperative languages like C and Fortran. It is useful in this domain and much of R’s statistical abilities stem from this, but it is certainly less than pretty. Scheme is a concise, elegant, functional language in the Lisp family. The designers of R tried to build something with the statistical functionality of S and the elegance of Scheme. Unfortunately, they left in much of the inelegant stuff from S as well, and this mixed parentage means that it is now perfectly possible to write ugly, hacky, slow code in the style of S, just as it is also possible to write elegant, efficient functional code in the style of Scheme. The problem is that functional programming has been far less mainstream, so people tend to write code in the style they learned first, resulting in rafts of ugly, hacky R code. Programming R in an elegant, functional way is not more difficult, but it is immediately less intuitive to people who were brought up reading and writing imperative code. I would always recommend people learning R to learn these functional concepts from the outset, because this way you are working with how the language was designed, rather than against it.

To show just how functional a language is at its core, it is first important to recognise that everything in R is a function call, even if it looks like it isn’t. So,

> 1 + 2
## [1] 3

… is exactly the same as…

> `+`(1, 2)
## [1] 3

The + operator is really just “syntactic sugar” for a + function with the arguments 1 and 2 applied to it. Similarly,

> 1:10
##  [1]  1  2  3  4  5  6  7  8  9 10

… is the same as…

> `:`(1, 10)
##  [1]  1  2  3  4  5  6  7  8  9 10

Here again, to give the range of numbers between 1 and 10, : is really just a function in disguise. If you were to break down more complex expressions in this way, the result would be code that looks very Scheme-like indeed.
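A nice consequence is that operators, being ordinary functions, can be passed to higher-order functions like any other value:

```r
# Pass the `+` and `*` functions around like any other value
Reduce(`+`, 1:5)
## [1] 15
sapply(1:3, `*`, 10)
## [1] 10 20 30
```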

I will now look in more depth at the functional concepts that are most important in R, Vectorised functions, higher order functions and closures.

Vectorised functions

Probably the best known FP concept in R is the vectorised function which ‘automagically’ operates on a data structure like a vector in just the same way as on a single number. Some of the most important of these are the vector subsetting operations. In these, you take a declarative approach to programming: you tell R what you want, not how to get it. Because of this property of operating across vectors, proper use of vectorised functions will drastically reduce the number of loops you have to hand code into your scripts. They are still performing loops under the hood, but these loops are implemented in C, so are many times faster than a standard for loop in R.

For example, when I was first using R for data management and analysis, I spent months writing code like this:

> # Get all even numbers up to 200000
> # using S-style vector allocation:
> x <- c()
> for(i in 1:200000){
>     if(i %% 2 == 0){
>         x <- c(x, i)
>     }
> }

This is about the worst possible way of achieving the given task (here, getting all even numbers up to 200000). You are running a for loop, which is slow in itself, and testing if i is even on each iteration and then growing the x vector every time you find that it is. The code is ugly, slow and verbose (on my machine it took around 10 seconds).

For me, writing vectorised code was a real revelation. To achieve the same goal as the code above in a vectorised style:

> # FP style vectorised operation
> a <- 1:200000
> x <- a[a %% 2 == 0]

You assign a vector with all the values from 1 to 200000, then you say “I want all of these that are divisible by two”. This ran three orders of magnitude faster than the non-FP code, is half the length and clearer – you don’t have to mentally run through the loop to work out what it does. So you get the benefits of both concision (the number of bugs correlates well with lines of code) and clarity (the code becomes almost self-documenting).

This is a slightly unfair comparison and there are ways to speed up your loops, for example by pre-allocating a vector of the correct length before you run the loop. However, even if you do this, the result will still be around 20 times slower and even more verbose. It is good practice whenever you write a for loop in R to check whether there is a better way to do it using vectorised functions. The majority of R’s built-in functions are vectorised, and using these effectively is a prerequisite of using R effectively.
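For completeness, here is a sketch of the pre-allocated version of the loop – faster than growing the vector on each iteration, but still slower and wordier than the vectorised one-liner:

```r
n <- 200000
x <- integer(n / 2)      # pre-allocate the result vector up front
j <- 0
for (i in 1:n) {
  if (i %% 2 == 0) {
    j <- j + 1
    x[j] <- i            # fill in place rather than copying and growing
  }
}
all(x == seq(2, n, by = 2))
## [1] TRUE
```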

Higher-order functions

Because functions in R are first class citizens of the language, it is trivial to write and use functions that take other functions as arguments. The most well used of these are the functions in the apply family (lapply, sapply, apply etc.). These cause a lot of headaches for new-ish R users, who have just got to grips with for loops, but they are really no more difficult to use. When you use one of these apply functions, you are just taking a collection of data (say a list or vector) and applying the input function to every element of the collection in turn, and collecting the results in another collection.

Apply functions

Because the mapping of each element to the function is independent of the elements around it, you can split the collections up and join them together at the end, which means that the functions can be better optimised than a for loop (especially in the case of lapply) and also easily run over multiple processor cores (see multicore::mclapply).

Conceptually, to use these functions you just need to think about what your input is, what the function you want to apply to each element is and what data structure you want as your output:

  • lapply : Any collection –> FUNCTION –> list
  • sapply : Any collection –> FUNCTION –> matrix/vector
  • apply : Matrix/dataframe + margin –> FUNCTION –> matrix/vector
  • Reduce : Any collection –> FUNCTION –> single element

So if you want your output in a list, use lapply. If you want a vector or matrix, use sapply. If you want to calculate summaries of the rows or columns of a dataframe, use apply. If you want to condense your dataset down into a single summary number, use Reduce. There are several other functions in the same family, which all follow a similar pattern.
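Concretely:

```r
m <- matrix(1:6, nrow = 2)

lapply(1:3, function(x) x^2)     # a list of three elements
sapply(1:3, function(x) x^2)     # the same results as a numeric vector
apply(m, MARGIN = 1, FUN = sum)  # row sums of the matrix
## [1]  9 12
Reduce(`+`, 1:5)                 # collapse a vector to a single number
## [1] 15
```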


Closures

Closures are at the heart of all functional programming languages. Essentially, a closure is a function to which data has been attached via its arguments: the function ‘closes over’ the data at the time it is created, and that data can be accessed again whenever the function is called later. Compare this with an object in languages like C++ and Java, which is data with functions attached to it.


You can use closures to build wrappers around functions with new default values and partially apply functions and even mimic Object-oriented style objects, but possibly most interestingly, you can build functions that return other functions. This is great if you want to call a function many times on the same dataset but with different parameters, such as in maximum-likelihood optimisation problems when you are seeking to minimise some cost function and also for randomisation and bootstrapping algorithms.
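A minimal example of a function returning a function (the names here are invented for the example):

```r
# make_pow returns a function that remembers the value of `pow`
make_pow <- function(pow) function(x) x^pow
square <- make_pow(2)
cube   <- make_pow(3)
square(4)
## [1] 16
cube(2)
## [1] 8
```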

To demonstrate the usefulness of this, I am now going to build a generic bootstrapping algorithm in a functional style that can be applied to any linear model. It will demonstrate not only functions returning functions, but higher-order functions (in this case, sapply), anonymous functions (in the mapping function to sapply) and vectorised functions. I will then compare this against a non-FP version of the algorithm and hopefully some of the advantages of writing in an FP style in R will become clear. Here is the code. I am doing a bootstrap of a simple linear model on the classic iris dataset:

boot.lm <- function(formula, data, ...){
    function()
        lm(formula=formula,
           data=data[sample(nrow(data), replace=TRUE),], ...)
}

iris.boot <- boot.lm(Sepal.Length ~ Petal.Length, iris)
bstrap <- sapply(X=1:1000, 
                 FUN=function(x) iris.boot()$coef)

That is the algorithm. The boot.lm function returns a closure. You pass it a linear model formula and a dataframe, and it returns a function with no arguments that itself returns a linear model object fitted to a bootstrapped replicate (a sample with replacement) of the supplied data. So iris.boot takes the formula Sepal.Length ~ Petal.Length and the iris dataset, and every time you call it, it gives a new bootstrap replicate of that model on that data. You then just need to run this 1000 times and collect the coefficients, which can be done with a one-liner sapply call. We are using sapply because we want a matrix of coefficients with one column per replicate. The FUN argument to sapply is an anonymous function that returns the coefficients of the model. You could equally well have written something like

get.coefs <- function(x){
    iris.boot()$coef
}

bstrap <- sapply(X=1:1000, 
                 FUN=get.coefs)

… but because the function is so short, it is no less clear to include it without a name.

Once the model has run, we can use the apply higher-order function to summarise the rows of bstrap by applying the quantile function to give the median and 95% confidence intervals:

apply(bstrap, MARGIN=1, FUN=quantile, 
      probs=c(0.025, 0.5, 0.975))
##       (Intercept) Petal.Length
## 2.5%        4.162       0.3701
## 50%         4.304       0.4098
## 97.5%       4.448       0.4472

This is an elegant way to solve a common analysis problem in R. If you are running a large model and you want to speed things up (and you have a few cores free!), it is a simple task and a couple of lines of code to replace the call to sapply with one to multicore::mclapply and run the model on as many processor cores as you can.
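A sketch of that change (restating boot.lm from above so the snippet is self-contained; this assumes a Unix-alike machine, and note that in recent versions of R mclapply lives in the parallel package rather than multicore):

```r
library(parallel)  # mclapply is in parallel in modern R

boot.lm <- function(formula, data, ...){
    function()
        lm(formula = formula,
           data = data[sample(nrow(data), replace = TRUE), ], ...)
}
iris.boot <- boot.lm(Sepal.Length ~ Petal.Length, iris)

# 1000 bootstrap replicates spread over 2 cores
bstrap <- simplify2array(mclapply(1:1000,
                                  function(x) iris.boot()$coef,
                                  mc.cores = 2))
dim(bstrap)
## [1]    2 1000
```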

In contrast, here is a roughly equivalent non-FP style bootstrapping algorithm:

boot_lm_nf <- function(d, form, iters, output, ...){
  for(i in 1:iters){
    x <- lm(formula=form, 
            data=d[sample(nrow(d), replace=TRUE),], ...)[[output]]
    if(i == 1){
      bootstrap <- matrix(data=NA, nrow=iters, ncol=length(x))
      bootstrap[i,] <- x
    } else bootstrap[i,] <- x
  }
  bootstrap
}

This ugly beast is full of fors and ifs and braces and brackets and double brackets. It has a load of extra boilerplate code to define the variables and fill the matrices. Plus, it is less generic than the FP version, since you can only output the attributes of the model itself, whereas previously we could apply any function we liked in place of the anonymous function in the sapply call. It is more than twice as verbose and impossible to run on multiple cores without a complete rewrite. On top of all that, getting the coefficients out in a non-FP way is a tedious task:

bstrap2 <- boot_lm_nf(d=iris, 
            form=Sepal.Length ~ Petal.Length, 
            iters=1000, output="coefficients")
CIs <- c(0.025, 0.5, 0.975)
cbind( "(Intercept)"=quantile(bstrap2[,1],probs = CIs),
      "Petal.Length"=quantile(bstrap2[,2],probs = CIs))
##       (Intercept) Petal.Length
## 2.5%        4.169       0.3699
## 50%         4.310       0.4081
## 97.5%       4.448       0.4414

The code duplication in the cbind is a pain, as is having to name the coefficients directly. Both of these reduce the generalisability of the algorithm.

Wrapping up

I hope I have demonstrated that writing more functional R code is

  • More concise (fewer lines of code)
  • Often faster (particularly with effective vectorisation)
  • Clearer and less prone to bugs (because you are abstracting away a lot of the ‘how to’ code)
  • More elegant

R is a strongly functional language to its core and if you work with this in your code, your R hacking will be more intuitive, productive and enjoyable.

Further Reading

Here are some good and accessible resources available if you want to learn more about functional programming in general and FP in R in particular:

  • Structure and Interpretation of Computer Programs by Abelson and Sussman is the bible of FP and is written by the creators of Scheme. This book has been used as the core of the MIT computer science course since the early 1990s and is still not dated.
  • Hadley Wickham’s in progress ebook on Github is a fantastic resource on FP in R amongst a host of other advanced R topics.
  • The R Inferno by Patrick Burns is a classic free online book on R and has a great chapter on vectorisation and when it is best to apply it.
  • If you are interested in the metaphysical stuff at the start of this post, Rich Hickey, the inventor of the Clojure language, gave this great talk on the importance of FP and the failings of the traditional OOP model. The talk was also summarised nicely in this blog post.