Two R Tutorials for Beginners

I am currently rescuing some of the pages from my now-defunct datajujitsu.co.uk Blogger blog and moving them here. I also gave a tutorial at the University of Manchester today on data cleaning and subsetting, so I am killing two birds with one stone by linking to the Rpubs pages for both that tutorial and a short one I gave last year on vectorisation.

The tutorials are:

  1. Subsetting in R: Spring cleaning your data
  2. Speeding up your R code – vectorisation tricks for beginners

The R Markdown source files for both of these are available on my GitHub page. Rpubs is a great site from the people behind RStudio that lets you upload R Markdown documents compiled with Knitr in no time at all.

Using R Markdown, Knitr, RStudio and Rpubs to produce and publish tutorials has proved a complete joy. It is simple, quick and painless to get pages online with embedded R code and output.

I have also produced a slide presentation for an internal seminar series in my department using R Markdown, Knitr, Pandoc and Beamer. I was really pleased with the results (which are also on my GitHub page) and with how easily I was able to achieve them, particularly given the huge reduction in the LaTeX boilerplate I had to write. I will be doing all of my presentations this way in future and will blog about the workflow in due course.


Scraping Organism Metadata for Treebase Repositories From GOLD Using Python and R

I recently wanted to get hold of habitat/phenotype/sequencing metadata for the individual organisms of an archived Treebase project.

The GOLD database holds more than 18000 full genomes. For many of these it provides pretty good metadata (GOLDcards) which are indirectly linked to Treebase via NCBI taxa IDs.

Unfortunately, GOLD does not seem to have any kind of API for systematic downloads, so I hacked together a very quick-and-dirty scraper in Python that reads in the taxa from a Treebase repo, follows the links to each species' NCBI page and downloads the linked GOLDcard, if it exists.

Here is the code. You will need the external BeautifulSoup and lxml libraries for this to work; both are fantastic. (The Treebase repo here is from Wu et al. 2009**; just change the URL string for a different repo):
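As a rough illustration of the link-following step only, here is a minimal sketch that uses nothing but the Python standard library rather than BeautifulSoup and lxml. It is not the scraper itself: the page fragment, the example.org URLs and the `goldcard` substring used as a filter are all assumptions made up for the example, and real Treebase, NCBI and GOLD pages will be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def find_links(html, base_url, must_contain):
    """Return absolute links found in `html` whose URL contains `must_contain`."""
    parser = LinkCollector()
    parser.feed(html)
    # Resolve relative links against the page they came from.
    absolute = [urljoin(base_url, href) for href in parser.links]
    return [url for url in absolute if must_contain in url]


# Hypothetical page fragment standing in for one taxon page.
sample_page = """
<html><body>
  <a href="/about.html">About</a>
  <a href="http://example.org/goldcard?id=Gc00001">GOLD card</a>
  <a href="taxon/1234">Taxon record</a>
</body></html>
"""

gold_links = find_links(sample_page, "http://example.org/ncbi/", "goldcard")
print(gold_links)
```

In a real run, each page would be fetched over the network (for example with `urllib.request.urlopen`) and every link that passes the filter would be saved to disk for the next step.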

Once you have downloaded all of the available files, it would be great to have your metadata in a nice flatfile with one line per taxon, right? I did this with a little R script using the rather wonderful readHTMLTable() function in the XML package (install.packages("XML")).

The output is a semicolon-separated file with taxa in the rows and the different categories of metadata in the columns. The metadata is often fairly incomplete, with plenty of omissions, but hopefully it will become more useful as more deposits are made to GOLD.
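For readers who would rather stay in Python, the flattening step can be sketched roughly as follows. This is not the R script: it assumes each downloaded card boils down to a two-column table of field name and value, and the taxon names and the `Habitat` and `Oxygen` fields are invented for the example.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def card_to_record(html):
    """Turn a two-column (field, value) table into a dict."""
    parser = TableRows()
    parser.feed(html)
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}


def write_flatfile(records, out):
    """Write one line per taxon; fields missing for a taxon are left empty."""
    fields = sorted({field for rec in records.values() for field in rec})
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(["taxon"] + fields)
    for taxon in sorted(records):
        writer.writerow([taxon] + [records[taxon].get(f, "") for f in fields])


# Hypothetical cards; real ones would be read from the downloaded files.
cards = {
    "Taxon A": "<table><tr><td>Habitat</td><td>Marine</td></tr>"
               "<tr><td>Oxygen</td><td>Aerobe</td></tr></table>",
    "Taxon B": "<table><tr><td>Habitat</td><td>Soil</td></tr></table>",
}
records = {taxon: card_to_record(html) for taxon, html in cards.items()}
buffer = io.StringIO()
write_flatfile(records, buffer)
flat = buffer.getvalue()
print(flat)
```

Taking the union of all field names as the column set is what lets incomplete cards sit in the same file: a taxon simply gets an empty cell for any category it lacks.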

** Wu D., Hugenholtz P., Mavromatis K., Pukall R., Dalin E., Ivanova N.N., Kunin V., Goodwin L., Wu M., Tindall B.J., Hooper S.D., Pati A., Lykidis A., Spring S., Anderson I.J., D’haeseleer P., Zemla A., Singer M., Lapidus A., Nolan M., Copeland A., Chen F., Cheng J., Lucas S., Kerfeld C., Lang E., Gronow S., Chain P., Bruce D., Rubin E.M., Kyrpides N.C., Klenk H., & Eisen J.A. 2009. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature, 462(7276): 1056-1060.


Mapping Academic Collaborations in Evolutionary Biology

This post is a republication of a visualisation I did in 2011 for my (now defunct) datajujitsu.co.uk blog. It was a naive first attempt at web-scraping from an academic publisher's website. It was done before I was aware of the problems surrounding access to, and text-mining of, online academic content hosted by publishers such as Wiley and Elsevier. Producing such a piece now (in 2013) would certainly be regarded as a political act. The text and visualisations are unchanged from the original.

Like many people, I was immensely impressed with Paul Butler's global map of Facebook friend connections, a spectacular way of visualising, and humanising, a large amount of raw data. I was further impressed to find out that he did it solely using R. I recently found FlowingData's tutorial on creating the same effect using flight information and got to thinking about what other datasets I could apply it to.

My original plan was to build a scraper to get all of the abstracts for a particular subject from PubMed and visualise the academic collaborations between institutions across all of them. Unfortunately, PubMed only stores the institutional address of the corresponding author, so I decided to stick with my own subject, evolutionary biology, and get all the abstracts from the journals Evolution (ISSN 1558-5646) and the Journal of Evolutionary Biology (JEB; ISSN 1420-9101) since 2009. I extracted the author addresses with a hacked-together Python script, which then fed them into the Yahoo PlaceFinder API to get a data set of coordinates for each cross-institution collaboration in every paper published in the two journals over the last two and a half years. I then fed this data into R, generated great circles for each of the collaborations using the geosphere package and processed it à la FlowingData to get the following global map of academic collaboration in evolutionary biology since 2009:

Evolution social network 2009-2011

You can clearly see the main hubs of collaboration in Europe and on the East and West coasts of the USA, with smaller hubs in Japan and south-eastern Australia. There are further actively collaborating institutions in South America and Africa, but almost all of their collaborations are with North American and European universities. Looking into the data itself, the median longitude for JEB institutions is firmly in Europe, while the median longitude for Evolution is in the USA (this makes sense, since Evolution is based in the States while JEB is a European journal, though there is no geographic imperative to publish in either).

Technical info: I scraped the data from the Wiley website for the two journals using Python and BeautifulSoup. For the R analysis I used the packages maps, geosphere, reshape and gdata.
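To give a feel for what the great-circle step computes, here is a minimal Python sketch that interpolates points along the shortest path between two coordinates on a sphere. It is a stand-in for illustration, not the geosphere code used for the map: the function names are my own, the Earth is treated as a perfect sphere, and the coordinates are only illustrative.

```python
import math


def to_unit_vector(lat, lon):
    """Convert latitude/longitude in degrees to a 3D unit vector."""
    phi, lam = math.radians(lat), math.radians(lon)
    return (math.cos(phi) * math.cos(lam),
            math.cos(phi) * math.sin(lam),
            math.sin(phi))


def great_circle_points(start, end, n=50):
    """Return n + 1 (lat, lon) points from start to end along the great circle.

    Uses spherical linear interpolation between the two unit vectors.
    Antipodal points have no unique shortest path, so they are rejected.
    """
    a = to_unit_vector(*start)
    b = to_unit_vector(*end)
    dot = max(-1.0, min(1.0, sum(p * q for p, q in zip(a, b))))
    omega = math.acos(dot)  # angle between the two points
    if math.isclose(omega, 0.0, abs_tol=1e-12):
        return [start] * (n + 1)
    if math.isclose(omega, math.pi, abs_tol=1e-12):
        raise ValueError("antipodal points: great circle is not unique")
    points = []
    for i in range(n + 1):
        t = i / n
        wa = math.sin((1 - t) * omega) / math.sin(omega)
        wb = math.sin(t * omega) / math.sin(omega)
        x, y, z = (wa * p + wb * q for p, q in zip(a, b))
        lat = math.degrees(math.atan2(z, math.hypot(x, y)))
        lon = math.degrees(math.atan2(y, x))
        points.append((lat, lon))
    return points


# Illustrative coordinates for two collaborating institutions (lat, lon).
arc = great_circle_points((53.5, -2.2), (37.9, -122.3), n=50)
print(len(arc), arc[0], arc[-1])
```

Each collaboration then becomes one such list of points, drawn as a line on top of a world map; plotting many of them with some transparency is what makes the hubs stand out.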
