R in the Cloud

R is a great language for statistics, and RStudio is a great environment in which to run it.  Together they make a powerful set of statistical tools, but there is one limitation: memory.  R holds the data it is working on in memory and isn’t especially frugal with it, so on a desktop you can run into problems with reasonably large data sets like internet data.  However, there is a neat way around it – moving R to the cloud.  Amazon is best known as a bookseller with some associated ecommerce, but it also happens to have a gigantic business on the side called Amazon Web Services.  It lets you rent server capacity on demand, which can give you practically unlimited computing power whenever you need it.
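To get a feel for the memory limit mentioned above, here is a rough sketch of how you might estimate a data set’s footprint before deciding whether it fits on your laptop.  The table dimensions are made-up numbers for illustration, and the estimate assumes every column is numeric (R stores each numeric value in 8 bytes); real data with text columns will behave differently.

```r
# Rough estimate of the RAM needed to hold a table in R.
# Assumption: a hypothetical table of 10 million rows and 20 numeric columns.
n_rows <- 1e7
n_cols <- 20

# Each numeric (double) value takes 8 bytes.
estimated_bytes <- n_rows * n_cols * 8
estimated_gb <- estimated_bytes / 1024^3
cat(sprintf("Estimated size: %.2f GB\n", estimated_gb))

# For an object that is already loaded, object.size() reports
# the memory it actually occupies.
x <- numeric(1e6)
print(object.size(x), units = "MB")
```

If the estimate comes close to the RAM on your machine, that is probably a sign the job belongs on a bigger server.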

Yhat (an analytics company) provides an excellent guide to setting up R on Amazon Web Services.  It seems pretty complex when you first look at it, but once you get over your fear of entering things via the command line it’s quite easy. Don’t be intimidated!  It took me about twenty minutes to set up an instance, with great results.  RStudio runs in the browser and looks just like it does on the desktop, and once you figure out how to upload files it works exactly the same.  Even better, in my experience it’s a lot faster than on the desktop, apparently because it uses memory more efficiently, and while it’s running it doesn’t totally destroy your computer’s performance.  Even the introductory “free” server, which has lower specifications than my laptop, is faster than my laptop.  And upgrading to one of the “high-end” servers, which are obscenely powerful, costs a few cents an hour, and you only get billed while the instance is actually running.  Here’s what it looks like in action, just a simple browser window with R inside:

[Screenshot: RStudio running in a browser window]

In short, moving R to AWS is a pretty easy step that can really upgrade your data analysis game.  It gives you as much computing power as you need, so you can take on projects that were too much to handle before.  Even the free tier is great, because it lets you offload long-running jobs to the server while you keep using your own computer.  I’d highly recommend it.

