Book Review: Machine Learning With R

Machine Learning with R by Brett Lantz

[Full disclosure – I was given a free review copy of the book from the publisher. This review refers to the ebook version]

This is the most recent of a group of books that try to explore machine learning from a programming, rather than purely mathematical, perspective. The book is highly successful in this respect and deserves a place on the bookshelf of any data scientist, Kaggler or statistician.

The book takes a slightly different tack from previous ones in this field (See ‘Programming Collective Intelligence’ and ‘Machine learning for Hackers’) in that it concentrates largely on the packages themselves and how to use them to solve real world ML problems, rather than focusing on coding up simple algorithms from scratch and running these on toy datasets. Perhaps this way the book doesn’t provide as much insight into how the algorithm design, but it does make the book much more practically useful, particularly since it spends a good chunk of each chapter explaining the algorithm in simple, plain English.

The book is well laid out and written. Despite a slightly shaky start (do we really still think of ML in terms of Skynet, the Matrix and Hal?), the introduction is excellent and gives a pleasing summary of the philosophical and ethical issues surrounding machine learning and big data. Next, there is a thoughtful introduction to data management and exploratory data analysis that highlights important and often missed tips on things like getting data out of SQL databases. It introduces some basic R functions and concepts (some I had managed to miss up until now) without feeling like a tacked on ‘R for beginners’ chapter.

In the guts of the book, each chapter focuses on a group of related algorithms (KNN, Naive Bayes, Decision trees, Regression, Neural nets and SVMs, association rules, clustering) and has a good introduction to the algorithm in question, followed by sections on finding and cleaning data, implementing the algorithm on the data and evaluating and improving model performance. There are clear and easy to understand tables and descriptions of the important distinctions between the algorithms and the reasons for choosing one over another. The datasets the author has chosen are large and interesting enough to well illustrate the points being made without being frustratingly unwieldy and many of them are ‘classic’ machine learning datasets from places such as the UCI Machine Learning Data Repository.

Next, the book looks more deeply at evaluating and improving model performance and discusses important ensemble and meta-learning techniques like bagging, boosting and RandomForests. This section will be of particular interest to people wanting to enter Kaggle or other data science competitions because they show how to milk as much performance as possible from the basic algorithms described earlier in the book.

The final section discusses getting the algorithms to run on big datasets and improving the performance of R itself using tricks like the data.table and ff packages and parallel processing. This is the only section of the book that feels slightly rushed and many of these topics are discussed only briefly before linking to the relevant package documentation. This is only small criticism though, since coding up these kinds of systems will depend strongly on the data you have and these are difficult subjects to cover whilst retaining generality.

Obviously, the book cannot cover everything. It is decidedly light on graphs and has almost nothing on visualisation techniques and packages like ggplot2 which have become almost mandatory for doing data science today. Also, if you are new to R, you really want to get one of the excellent introductory books first and if you are new to ML, you probably want to spend a while learning some basic stats as well. Finally, this book doesn’t pretend to be a deep text about the mathematics of the algorithms it covers. For that you will need to go for something like Bishop’s classic ‘Pattern Recognition and Machine Learning’ and be prepared to put in some serious effort!

In short, if you are looking for a practical guide to implementing ML algorithms on real data and if you are more comfortable thinking in R code than in mathematical equations, this is the book for you and is the best that I have seen on the subject so far.