Building a custom database of country time-series data using Quandl

Wed 08 May 2013

Filed under odds and ends

Tags development evaluation R reproducibleResearch

Encouraged by this post I had another look at quandl for collecting datasets from different agencies. Right now I need to get data for four countries on a couple of dozen indicators.

Rplot

This graphic is just a quick example with only two indicators of what I am aiming to be able to do.

The process on Quandl at the moment is a bit fiddly:

  • there is no search function in the API
  • the country codes used are different from agency to agency

So my workflow is this. It isn’t as complicated as it sounds. I have used spreadsheets to store country codes and queries to make it all as re-useable as possible. You can download the spreadsheets here and here.

  • edit the csv spreadsheet of the 2-and 3-digit ISO country codes, plus the actual names. Also, WHO for some reasons uses some other codes which I had to paste in by hand. If you find your sources are also using yet other codes, you can add them to the spreadsheet. Put an x in the “enabled” column to mark the countries you want to use.
  • search manually at quandl for interesting queries and add them to the other csv spreadsheet, replacing the country code with %s, again putting an x in the “enabled” column for the queries you want, adding a human-readable title in the “title” column if you want and putting “alpha2” or “alpha3” etc in the country_sign column to mark which kind of country code is being used.
  • run the script below.

 

authcode=“yourAuthCodeFromQuandl”

 

library(Quandl)

 

cou=list()

queries=read.csv(“queries.csv”)

queries=queries[queries$enabled!=“”,]

codes=read.csv(“countryCodes.csv”)

codesE=codes[codes$enabled!=“”,]

for(qq in 1:nrow(queries)){

q=queries$query[qq]

for(cc in 1:nrow(codesE)){

 

co=codesE[cc,queries[qq,“country_sign”]]

tex=paste(q,co,sep=“.”)

cou[[tex]]=try(Quandl(sprintf(q,co),authcode=authcode),T)

if(attributes(cou[[tex]])$class!=“try-error”)cou[[tex]]$Indicator=ifelse(!is.na(queries$title[qq]),queries$title[qq],q)

if(attributes(cou[[tex]])$class!=“try-error”)cou[[tex]]$Country=codesE[cc,“name”]

}}

 

rr=rbind.fill(cou[sapply(cou,function(x)length(x)>1)])

rr$Date=as.character(rr$Date)rr$Year=as.character(rr$Year)

rr$Year=as.Date(ifelse(!is.na(rr$Year),rr$Year,rr$Date))

rr$Value=ifelse(!is.na(rr$Value),rr$Value,rr$Percent) #you might have to do something like this if your queries are returning data in columns with some other label than Value

#then try a graphic for demonstration purposes

ggplot(data=rr,aes(x=Year,y=Value,group=Country,colour=Country))+geom_point(size=3)+geom_line()+facet_grid(Indicator~.,scales=“free”)+ theme(strip.text.y = theme_text(size = 13, hjust=0,angle = 0))+theme(axis.text.x=element_text(angle=90))

And voila.

I wanted to put the spreadsheets as a google spreadsheet but it seems RGoogleDocs is not working for R 3.0.


Comments


socialdatablog © Steve Powell Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More