How Google and Facebook are using R

by mike | February 19th, 2009


(March 26th Update: Video now available)
Last night, I moderated our Bay Area R Users Group kick-off event with a panel discussion entitled “The R and Science of Predictive Analytics”, co-located with the Predictive Analytics World conference here in SF.

The panel comprised four recognized R users from industry:

  • Bo Cowgill, Google
  • Itamar Rosenn, Facebook
  • David Smith, Revolution Computing
  • Jim Porzak, The Generations Network (and Co-Chair of our R Users Group)

The panelists were asked to explain how they use R for predictive analytics within their firms, to describe its strengths and weaknesses as a tool, and to provide a case study. What follows is my summary with comments.

Panel Introduction

I began by describing R as a programming language with strengths in three areas: (i) data manipulation, (ii) statistics, and (iii) data visualization.

What sets it apart from other data analysis tools? It was developed by statisticians, it’s free software, and it is extensible via user-developed packages: nearly 2,000 of them are available as of today on the Comprehensive R Archive Network (CRAN).

Many of these packages can be used for predictive analytics. Jim highlighted Max Kuhn’s caret package, which provides a wrapper for accessing dozens of classification and regression models, from neural networks to naive Bayes.

Bo Cowgill, Google

R is the most popular statistical package at Google, according to Bo Cowgill, and indeed Google is a donor to the R Foundation. He remarked that “The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.” Nonetheless, he’s encouraged that as the R developer community has expanded, R’s documentation and performance have both improved.

One theme that Bo first brought up, but which was echoed by others, was that while Google uses R for data exploration and model prototyping, it is not typically used in production: in Bo’s group, R is typically run in a desktop environment.

The typical workflow that Bo thus described for using R was: (i) pulling data with some external tool, (ii) loading it into R, (iii) performing analysis and modeling within R, (iv) implementing a resulting model in Python or C++ for a production environment.
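Step (iv) of that workflow can be sketched as follows. This is only an illustration, assuming the model prototyped in R was a logistic regression whose coefficients were copied out by hand; the feature names and coefficient values are made up and do not come from the panel.

```python
import math

# Hypothetical coefficients, as might be copied from coef(fit) after fitting
# a logistic regression in R. The names and values are invented for this sketch.
COEFFICIENTS = {
    "(Intercept)": -1.5,
    "feature_a": 0.8,
    "feature_b": 1.2,
}

def predict_probability(features):
    """Return P(y = 1) for one record under the logistic model."""
    z = COEFFICIENTS["(Intercept)"]
    for name, value in features.items():
        z += COEFFICIENTS[name] * value
    # The logistic (inverse logit) function maps the linear score to (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

if __name__ == "__main__":
    print(predict_probability({"feature_a": 2, "feature_b": 1}))
```

The point of the design is that the production code needs no statistics library at all: the fitting stays in R, and only the fitted numbers travel.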

Itamar Rosenn, Facebook

Itamar conveyed how Facebook’s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and (ii) if they stay, which data points predict how active they’ll be after three months?

For the first question, Itamar’s team used recursive partitioning (via the rpart package) to infer that just two data points are significantly predictive of whether a user remains on Facebook: (i) having more than one session as a new user, and (ii) entering basic profile information.
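To give a feel for what recursive partitioning does, here is a minimal sketch of a single split step: try each threshold on one variable and keep the one that leaves the purest groups, measured by Gini impurity. It is not the rpart implementation (which also handles many variables, pruning, and more), and the toy data below is invented, not Facebook's.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, impurity) for the best 'value <= threshold' split."""
    best = (None, float("inf"))
    n = len(values)
    # The largest value is skipped: splitting there would leave one side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weighted average impurity of the two child groups.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up toy data: session counts for eight new users and whether each stayed.
sessions = [1, 1, 1, 2, 3, 1, 4, 2]
stayed = [0, 0, 0, 1, 1, 0, 1, 1]

if __name__ == "__main__":
    print(best_split(sessions, stayed))
```

A full tree is grown by applying the same step again to each of the two resulting groups until the groups are pure or too small to split.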

For the second question, they fit the data to a logistic model using a least angle regression approach (via the lars package), and found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site.

David Smith, Revolution Computing

David’s firm, Revolution Computing, does not just use R; R is its core business. David said that “we are to R what Red Hat is to Linux”. The firm addresses some of the pain points of using R, such as (i) supporting older versions of the software and (ii) providing parallel computing in R through its ParallelR suite.

David showcased how one of their life sciences clients used R to classify genomic data through use of the randomForest package, and how the analysis of classification trees could be easily parallelized using their ‘foreach’ package.
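The reason tree ensembles parallelize so easily is that each tree is trained independently and the results are only combined at the end. A rough sketch of that pattern follows, assuming a deliberately simplified stand-in for a tree (a one-split "stump" whose threshold is the midpoint between class means) and a thread pool in place of foreach; none of the names below come from the randomForest or foreach packages.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_stump(seed, rows):
    """Fit a one-split stand-in for a tree on a bootstrap sample of (x, label) rows."""
    rng = random.Random(seed)
    sample = [rng.choice(rows) for _ in rows]  # sample with replacement
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        # Degenerate sample containing one class only: always predict that class.
        return ("constant", 1 if xs1 else 0)
    mean0 = sum(xs0) / len(xs0)
    mean1 = sum(xs1) / len(xs1)
    return ("split", (mean0 + mean1) / 2.0, 1 if mean1 > mean0 else 0)

def stump_predict(stump, x):
    if stump[0] == "constant":
        return stump[1]
    _, threshold, label_above = stump
    return label_above if x > threshold else 1 - label_above

def grow_forest(rows, n_trees=25, workers=4):
    """Train n_trees stumps as independent tasks and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda seed: train_stump(seed, rows), range(n_trees)))

def forest_predict(forest, x):
    """Combine the independent models by majority vote."""
    votes = Counter(stump_predict(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

# Made-up, well-separated toy data: class 0 near zero, class 1 near twelve.
rows = [(x, 0) for x in range(5)] + [(x, 1) for x in range(10, 15)]
forest = grow_forest(rows)
```

Threads are used here only to keep the sketch short; for CPU-bound training in Python, separate processes would be the more likely choice for an actual speedup.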

He also mentioned that several firms they have worked with do use R in production environments: a particular script is exposed on a server, and a client calls it with some data to get a result back. (Several ways exist to set up R in a client-server manner, such as Rserve, rApache, and Biocep.)

David evangelizes and educates about R at the Revolutions blog.

Jim Porzak, The Generations Network

Jim, who also co-chairs the R Users Group, gave a brief overview of his PAW talk on using R for marketing analytics. In particular, Jim has used the flexclust package to cluster customer survey data for Sun Microsystems and applied the resulting profiles to identify high-value sales leads.

During the Q & A session, the panelists were asked several questions.

How do you work around R’s memory limitations? (R workspaces are stored in RAM, and thus their size is limited)

Three responses were given (including one from the audience):

(i) use R’s database connectivity (e.g. RMySQL) and pull in only slices of your data, (ii) downsample your data (do you really need a billion data points to test your model?), or (iii) run your scripts on a RAM-obsessed colleague’s machine or fire up a virtual server on Amazon’s compute cloud, which offers up to 15 GB.
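The first two suggestions are not specific to R, and can be sketched in a few lines. This is an illustration only: an in-memory SQLite table stands in for a real database, the table and column names are hypothetical, and reservoir sampling is one of several reasonable ways to downsample.

```python
import random
import sqlite3

def chunked_mean(conn, chunk_size=1000):
    """Average a column while holding only chunk_size rows in memory at a time."""
    cursor = conn.execute("SELECT value FROM measurements")
    total, count = 0.0, 0
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        total += sum(value for (value,) in rows)
        count += len(rows)
    return total / count if count else float("nan")

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # uniform over all items seen so far
            if j < k:
                sample[j] = item     # replace with probability k / (i + 1)
    return sample

# Demo table standing in for a real database (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?)",
    [(float(i),) for i in range(10000)],
)

if __name__ == "__main__":
    print(chunked_mean(conn))
    print(len(reservoir_sample(range(100000), 100)))
```

Both functions make a single pass over the data, so neither needs the full data set to fit in RAM.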

What’s the general ramp-up process for groups wanting to use R?

Itamar and Bo both indicated that within their groups, almost everyone arrived having learned R in their university studies. Jim Porzak led an R tutorial within his last firm using an internal slide deck.

How easy is it for developers who are not statisticians to learn R?

The consensus seemed to be that R is a difficult language to achieve competency in, vis-a-vis Python, Perl, or other high-level scripting languages. Jim emphasized, however, that he is not a statistician, nor were any of our panelists. (As a non-statistician R user myself, I will say this: a consequence of learning R is an improved grasp of statistics. Knowing statistics is a necessary prerequisite for understanding R’s features, from its data types to its modeling syntax.)

How well does R interface with other tools and languages?

There are several packages on CRAN for importing and exporting data to and from Matlab (RMatlab), Splus, SAS, Excel and other tools. In addition, there are interfaces for running R within Python (RPy) and Java (RJava).

The panelists mentioned that they typically run R within a GUI, either RCommander or Rattle. (Aside: I run R exclusively in emacs using ESS; incidentally, one of its authors was panelist David Smith.)

A video of the event is now available courtesy of Ron Fredericks and LectureMaker.

  1. John Furrier says:

    I just discovered this blog. Just added it to my subscribe list…this is an area that many people in the mainstream market are overlooking…

    thanks for the coverage

  2. I use Rpy and matlab at times for most of my social network analysis work.

    Looking forward to the video by the way.

  3. [...] in the wild By statsinthewild Here is an interesting blog post about how Google and Facebook are using the free statistical package R [...]

  4. Adrian Dragulescu says:

    It’s great to see that Google is using R a lot. I would like to see them getting more involved with the R community. It would be great if they can provide some good R interfaces to their excellent products. The API’s are there, why not make the full R integration. They have so many talented software engineers, maybe they can show their love by contributing software to R.

  5. capnza says:

    Awesome summary. Interesting to see applications of R outside of the academic sphere!

  6. [...] Interesting summary of a panel involving Google, Facebook, and R. – Link [...]

  7. [...] How Google and Facebook are using R : Data Evolution (tags: r statistics analytics) [...]

  8. Dmitri Don says:

    Thanks for the great article about R. FYI, Your Revolutions blog link is not working, please fix.

  9. [...] How Google and Facebook are using R : Data Evolution R is the most popular statistical package at Google, according to Bo Cowgill, and indeed Google is a donor to the R Foundation. He remarked that “The best thing about R is that it was developed by statisticians. The worst thing about R is that… it was developed by statisticians.” Nonetheless, he’s optimistic to see that as the R developer community has expanded, R’s documentation has improved, and its performance has gained. (tags: statistics analytics R) [...]

  10. Ralph Winters says:

    Thanks for the synopsis. With both Google and Facebook onboard with using R, it will definitely add credibility to its being a viable datamining tool. I would like to see more coverage of optimizing memory use, since it is still one sticking point which puts it at a disadvantage as compared to a product like SAS, which can better handle huge amounts of data.

  11. Yves Remord says:

    Mass manipulation, consequently, appears as aiming to have populations all go towards the same direction, and the nature of this destination is secondary. “Go where you want to go, we don’t care, but please: do all go in the same direction”. Indeed, having people move together does help in predictive analytics.

    Finally, what better way to improve predictions than to give them a helping hand?

  12. Great meetup and lecture, thanks Mike.

  13. Great post! The all-in-memory problem will continue to hold back R’s utility, but there are some great efforts afoot to fix this, everything from parallelization to new ways to store the data in a memory-mapped-to-disk approach (ala S, SPSS, SAS, etc.) For example, see http://www.r-project.org/conferences/useR-2007/program/posters/adler.pdf as a promising approach.

    I don’t believe sampling is always the right answer; unless you understand the underlying distribution, your sampling could cause you to miss some subtle effects. I look forward to these new approaches which can handle all the data at once…

    For those interested in more hints and tips around using R that I’ve collected in my journey from novice to dangerous novice, please see http://www.nettakeaway.com/tp/?s=R where I review some GUIs, collected links, tips about coming from an SPSS environment, etc.

  14. [...] crowd.  I know dozens of people under 30 doing statistical stuff and only one knows SAS.  At that R meetup last week, Jim Porzak asked the audience if there are any recent grad students who learned R [...]

  15. [...] I was reading over at Data Evolution about a presentation on how Google and Facebook use R. The following was a summary of what Bo Cowgill of Google said about his workflow: The typical [...]

  16. [...] then I read that R is the most popular statistical package at Google. Clearly not so bijou, after [...]

  17. [...] How Google and Facebook are using R Michael E. Driscoll, Data Evolution [...]

  18. [...] How Google and Facebook are using R by Michael E. Driscoll | February 19, 2009 http://dataspora.com/blog/predictive-analytics-using-r/ [...]

  19. Jim emphasized, however, that he is not a statistician, nor were any of our panelists. (As a non-statistician R user myself, I will say this: a consequence of learning R is an improved grasp of statistics. Knowing statistics is a necessary prerequisite for understanding R’s features, from its data types to its modeling syntax.)

    R is like any other statistical package; it can help you gain inference if you know what you’re doing. Otherwise, it will produce output much like any poorly written program that doesn’t actually accomplish the task at hand.

    As a 30 year statistician from a top-10 graduate program I am increasingly distressed by the dominance computer scientists are gaining in the “analytics field” merely for their increased access to the platforms involved and higher-than-average keyboard skills.

    flexclust in R is like fastclus in SAS and minitab’s old cluster. SPSS, Matlab, Pstat, and Mathematica all have analogs. The truly disturbing aspect of this dynamic is that good statisticians are quite likely to give comp sci types their due, but the comp sci types are trying to corner the analytics market via their sheer advantage vis-a-vis platform expertise/access!

  20. Big Data: SSD’s, R, and Linked Data Streams…

    The Solid State Storage Revolution: If you haven’t seen it, I recommend you watch Andy Bechtolsheim’s keynote at the recent Mysqlconf. We covered SSD’s in our just published report on Big Data management technologies. Since then, we’ve gotten addit…

  21. [...] Linked Data Streams: I had a chance to visit with Dataspora founder and blogger Mike Driscoll, an enthusiastic advocate for the use of the open source statistical computing language, R. After founding and leading online [...]

  22. [...] Once you’ve read the first chapter, download R. R is an open-source statistics package/language that’s quite popular. Never heard of it? Don’t believe me? Check out this post (How Google and Facebook are using R). [...]

  23. [...] Once you’ve read the first chapter, download R. R is an open-source statistics package/language that’s quite popular. Never heard of it? Check out this post (How Google and Facebook are using R). [...]

  24. [...] How Google and Facebook are using R : Dataspora Blog Quote: "[Facebook] … found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site." (categories: statistics google analytics facebook datamining analysis data ) [...]

  25. Siah says:

    I just watched that video, great work. I am wondering if Jim is willing to present his work on one of the upcoming R meetups. I have talked to a couple of people and everyone wants to see more use cases. I think Jim would be a perfect person to give a talk. Everybody thinks that local people should give more talks.

  26. [...] guter Gesellschaft: Google und Facebook nutzen R für PA In diesem Artikel beschreibt der Dataspora Blog, wie die beiden “mittelgroßen” Webstartups Google und [...]

  27. [...] Web companies have also started using these techniques. For instance, Facebook has used R in predictive analytics to answer questions like “Which data points predict whether a user will stay? And if they [...]

  30. R Resources…

    The page is dedicated to R documentation, articles, and links. NY Times Article 1/7/2009 NYTimes R Article.pdf Dataspora Blog 2/19/2009 How Google and Facebook are using R…

  31. Laura says:

    I’ve read the article four times, and I’m still at a loss to understand what R is or does or is supposed to do. For those of us who are less tech-savvy … ??