What are we writing about? Using CETIS Publications RSS in R

I have been poking around Adam Cooper's text mining weak signals in R code and, being too lazy to collect data in CSV format, wondered if I could come up with something similar that used RSS feeds. I discovered it was really easy to read and start to mine RSS feeds in R, but there didn’t seem to be much help available on the web, so I thought I’d share my findings.

My test case was the new CETIS publications site. Phil has blogged about how the underlying technology behind the site is WordPress, which means it has an easy-to-find feed. I wrote a very small script to test things out that looks something like this:

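      library(XML)
      # feed_url is a placeholder: set it to the address of the site's RSS feed
      doc <- xmlParse(feed_url)
      src <- xpathApply(xmlRoot(doc), "//category")
      tags <- NULL

      # append the text of each <category> element as a new row
      for (i in seq_along(src)) {
             tags <- rbind(tags, data.frame(tag = xmlValue(src[[i]])))
      }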

This simply grabs the feed and puts all the category tags into a data frame. I then removed the tags that referred to the type of publication and plotted the rest as a pie chart. I’m pretty sure this isn’t the prettiest way to do this, but it was very quick and it worked!
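For anyone who wants to try the tag-grabbing step without a live feed, here is a minimal sketch that parses a tiny made-up RSS fragment from a string. The feed content is invented for illustration, and `xpathSApply` is used in place of the loop, so treat it as one possible variation rather than the script used for this post.

```r
library(XML)

# Made-up miniature feed, standing in for the real publications feed
feed <- '<rss version="2.0"><channel>
  <item><title>A</title><category>Briefing Paper</category><category>Analytics</category></item>
  <item><title>B</title><category>White Paper</category><category>Analytics</category><category>Standards</category></item>
</channel></rss>'

doc <- xmlParse(feed, asText = TRUE)

# xpathSApply returns one value per matching node, so no explicit loop is needed
tags <- data.frame(tag = xpathSApply(doc, "//category", xmlValue),
                   stringsAsFactors = FALSE)
print(tags)
```

Building the data frame in one call also avoids growing it row by row with `rbind`, which gets slow on larger feeds.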

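         cats <- subset(tags, tag != "Briefing Paper" & tag != "White Paper" & tag != "Other Publication" & tag != "Journal Paper"  & tag != "Report")
         # rebuild the factor so the removed publication types are dropped as levels
         cats$tag <- factor(cats$tag)
         pie(table(cats$tag))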

This gave me a visual breakdown of all the categories used on our publications site and how much each is used.
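The filtering and counting step can also be sketched with `%in%` and `table`, which keeps the list of publication types in one place. The tag values below are hypothetical stand-ins for whatever the feed returns; only the five publication types come from the filter above.

```r
# Hypothetical tag values, standing in for the data frame built from the feed
tags <- data.frame(tag = c("Briefing Paper", "Analytics", "White Paper",
                           "Analytics", "Standards", "Report", "Analytics"),
                   stringsAsFactors = FALSE)

# Publication types to leave out, taken from the filter above
pub_types <- c("Briefing Paper", "White Paper", "Other Publication",
               "Journal Paper", "Report")

cats <- tags[!(tags$tag %in% pub_types), , drop = FALSE]

# Count how often each remaining tag is used, most frequent first
counts <- sort(table(cats$tag), decreasing = TRUE)
print(counts)

# pie(counts) would draw the chart; barplot(counts) is an alternative
```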

I was surprised at how much of a 5 minute job it was. It struck me that because the feed has the publication date, it would be easy to do a Google chart in the style of Hans Rosling with it. My next step would be to grab multiple feeds and use some of Adam's techniques on the descriptions/content of the feed.
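As a rough idea of what the data for a chart over time might look like, the sketch below pairs each item's publication year with its tags and counts tags per year. The feed fragment is made up, and the year is pulled out with a regular expression that assumes the usual RSS `pubDate` layout (for example `Mon, 05 Nov 2012 10:00:00 +0000`), so adjust it if a feed formats dates differently.

```r
library(XML)

# Made-up feed fragment; the item/pubDate/category layout follows RSS 2.0
feed <- '<rss version="2.0"><channel>
  <item><pubDate>Mon, 05 Nov 2012 10:00:00 +0000</pubDate><category>Analytics</category></item>
  <item><pubDate>Tue, 12 Mar 2013 09:30:00 +0000</pubDate><category>Analytics</category><category>Standards</category></item>
  <item><pubDate>Wed, 10 Jul 2013 14:00:00 +0000</pubDate><category>Standards</category></item>
</channel></rss>'

doc   <- xmlParse(feed, asText = TRUE)
items <- getNodeSet(doc, "//item")

# One row per (year, tag) pair; the year is recycled across an item's tags
rows <- lapply(items, function(item) {
  pub  <- xpathSApply(item, "./pubDate", xmlValue)
  year <- as.integer(sub("^[A-Za-z]{3}, [0-9]{1,2} [A-Za-z]{3} ([0-9]{4}) .*$", "\\1", pub))
  data.frame(year = year,
             tag  = xpathSApply(item, "./category", xmlValue),
             stringsAsFactors = FALSE)
})
by_year <- do.call(rbind, rows)

# Cross-tabulate: rows are years, columns are tags
counts <- table(by_year$year, by_year$tag)
print(counts)
```

Extracting only the year sidesteps locale issues with day and month names; a full date parse would be needed for anything finer grained.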


I had been interested in how to grab RSS and pump it into R, and ‘interesting things we can do with the CETIS publications RSS feed’ had been a bit of an afterthought. Martin brought up the idea of using the feed to drive a Wordle (see comments). I stole the code from the comment and changed my code slightly so that I was grabbing the publication descriptions rather than the tags used. This is what it came up with.
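The code from the comments is not reproduced here, but the general idea of switching from tags to descriptions can be sketched as follows. This is an illustration under assumptions, not the code that produced the image: the feed text is invented, and dropping words of three letters or fewer is a crude stand-in for a proper stop-word list.

```r
library(XML)

# Made-up feed fragment; WordPress feeds typically wrap descriptions in CDATA
feed <- '<rss version="2.0"><channel>
  <item><description><![CDATA[<p>A briefing on learning analytics.</p>]]></description></item>
  <item><description><![CDATA[Analytics and open standards in education.]]></description></item>
</channel></rss>'

doc  <- xmlParse(feed, asText = TRUE)
desc <- xpathSApply(doc, "//item/description", xmlValue)

# Strip any HTML tags, lower-case, and split on anything that is not a letter
text  <- tolower(gsub("<[^>]+>", " ", desc))
words <- unlist(strsplit(text, "[^a-z]+"))
words <- words[nchar(words) > 3]   # crude filter: drop empty strings and short words

# Word frequencies, most frequent first
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 10))
```

The cleaned-up text can be pasted into a word cloud tool, or the frequency table handed to a word cloud package if one is installed.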