The Seven Secrets of Successful Data Scientists

by mike | August 27th, 2010

At O’Reilly’s “Making Data Work” seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data.

What follows is a blog-ified and amended version of that talk, originally entitled “Secrets of Successful Data Scientists.”

1. Choose The Right-Sized Tool

Or, as I like to say, you don’t need a chainsaw to cut butter.

If you’ve got 600 lines of CSV data that you need to work with on a one-time basis, paste it into Excel or Emacs and just do it (yes, curse the Flying Spaghetti Monster, I’ve just endorsed that dull knife called Excel).

In fact, Excel’s and Emacs’ program-by-example keyboard macros can be fantastic tools for quick and dirty data clean-up.

Alternatively, if you’ve got 600 million lines of data and you need something simple, piping together several Unix tools (cut, uniq, sort) with a dash of Perl one-liner foo may get you there.
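For readers who would rather reach for a script than a shell, here is a minimal Python sketch of what a cut | sort | uniq -c style pipeline computes: a streaming count of the values in one column. The function name count_column and the tiny inline sample are made up for illustration; they are not from the talk.

```python
import csv
import io
from collections import Counter

def count_column(lines, column, delimiter=","):
    """Count how often each value appears in one column, reading row by row."""
    counts = Counter()
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) > column:  # skip short or blank rows
            counts[row[column]] += 1
    return counts

# Hypothetical three-line sample standing in for a much larger file.
sample = io.StringIO("alice,CA\nbob,AK\ncarol,CA\n")
for value, n in count_column(sample, column=1).most_common():
    print(n, value)
```

Because it reads one row at a time, the same function works on an open file handle of any size; only the distinct values have to fit in memory.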

But don’t confuse this kind of data exploration, where the goal is to size up the data, with building proper data plumbing, where you want robustness and maintainability. Perl and bash scripts are nice for the former, but can be a nightmare for building data pipelines.

When your data gets very large, so big it can’t fit reasonably on your laptop (in 2010, that’s north of a terabyte), then you’re in Hadoop, parallelized database, or overpriced Big Iron territory.

So, when it comes to choosing tools: scale them up as you need, and focus on getting results first.

2. Compress Everything

We live in an IO-bound world, where the dominant bottlenecks to data flow are disk read-speed and network bandwidth.

As I was writing this, I was downloading a CSV file via a web API. Uncompressed, it was 257MB; ZIP-compressed, 9MB.

Compression gives you a 6-8x bump out of the gate. When moving or crunching data of a certain heft, compress everything, always: it will save you time and money.
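As a rough illustration, the sketch below writes some rows to a gzip file with Python’s standard gzip module, then streams them back without ever unpacking to disk. The rows are synthetic and highly repetitive (an assumption made for the example), so the ratio you see here says nothing about your own data.

```python
import gzip
import os
import tempfile

# Synthetic rows, made up for illustration only.
rows = "".join(f"{i},2010-08-27,CA,visit\n" for i in range(10_000))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "visits.csv.gz")

    # Write compressed text directly; no uncompressed copy touches the disk.
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write(rows)

    raw_bytes = len(rows.encode("utf-8"))
    gz_bytes = os.path.getsize(path)

    # Read it back line by line, decompressing on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as src:
        line_count = sum(1 for _ in src)

print(f"raw={raw_bytes} bytes, gzipped={gz_bytes} bytes, lines={line_count}")
```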

That said, because compression can render data difficult to introspect, I don’t recommend compressing TBs of data into a single tarball, but rather splitting it up, as I discuss next.

3. Split Up Your Data

“Monolithic” is a bad word in software development.

It’s also, in my experience, a bad word when it comes to data.

The real world is partitioned – whether as zip codes, states, hours, or top-level web domains – and your data should be too. Respect the grain of your data, because eventually you’ll need to use it to shard your database or distribute it across your file system.

Even more, it’s this splitting up of data that enables the parallel execution in Hadoop and commercial data platforms (such as Greenplum, Aster, and Netezza).

Splitting is part of a larger design pattern succinctly identified in a paper by Hadley Wickham as: split, apply, combine.

This is, in my mind, a more lucid formulation of “map, reduce” to include key selection (“split”) as a distinct step before any map/apply.
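A minimal sketch of the pattern in plain Python, assuming records small enough to hold in memory. The helper name split_apply_combine and the visit counts are hypothetical; this is not Wickham’s implementation, just the three steps spelled out.

```python
from collections import defaultdict

def split_apply_combine(records, key, apply):
    """Group records by key (split), run apply on each group, return a dict (combine)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)              # split
    return {k: apply(group) for k, group in groups.items()}  # apply, then combine

# Hypothetical (state, visits) records.
visits = [("CA", 12), ("AK", 3), ("CA", 8), ("AK", 5)]

totals = split_apply_combine(
    visits,
    key=lambda record: record[0],
    apply=lambda group: sum(count for _, count in group),
)
print(totals)
```

Choosing the key is the explicit first step here, which is the part that “map, reduce” tends to leave implicit.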

4. Sample Your Data

Let’s say hypothetically you’ve got 200 GBs of data from your portmanteau of a start-up, FaceLink. Someone wants to know if more people visit on Mondays or Fridays. What do you do?

Before you wonder “if only I had 64 GB of RAM on my MacBook Pro”, or fire up a Hadoop streaming job, try this: look at a 10k sample of data.

It’s easy to visually inspect, or pull into R and plot.

Sampling allows you to quickly iterate your approach, and work around edge cases (say, pesky unescaped line terminators), before running a many-hour job on the full monty.
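One way to pull a fixed-size uniform sample from a file too big to load is reservoir sampling, which needs only a single pass and memory for the sample itself. The sketch below is one possible approach, not necessarily what I used at the time; the function name and the seed parameter are illustrative.

```python
import random

def reservoir_sample(lines, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = line      # replace with probability k / (i + 1)
    return sample

# Stand-in for iterating over an open file handle.
fake_log = (f"row {n}" for n in range(1_000_000))
sample = reservoir_sample(fake_log, k=10_000, seed=42)
print(len(sample))
```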

That said, sampling can bite you if you’re not careful: when data is skewed, which it always is, it can be hard to estimate joint-distributions – comparing the means of California vs Alaska, for example, if your sample is dominated by Californians (an issue that statistics, that sexy skill, can address).
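Stratified sampling is one of the statistical remedies for this: sample a fixed number of records per group, so the small groups are not drowned out. The sketch below assumes that is the remedy you want; the function name, the per_group parameter, and the record layout are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, per_group, seed=None):
    """Keep up to per_group uniformly chosen records for each distinct key, in one pass."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    seen = defaultdict(int)
    for record in records:
        group = key(record)
        seen[group] += 1
        if len(samples[group]) < per_group:
            samples[group].append(record)
        else:
            # Reservoir step within this group only.
            j = rng.randrange(seen[group])
            if j < per_group:
                samples[group][j] = record
    return dict(samples)

# Skewed, made-up data: 1,000 Californians for every 10 Alaskans.
people = [("CA", i) for i in range(1000)] + [("AK", i) for i in range(10)]
by_state = stratified_sample(people, key=lambda p: p[0], per_group=5, seed=7)
print({state: len(rows) for state, rows in by_state.items()})
```

Any estimate for the whole population then has to be reweighted by the true group sizes, since the sample no longer mirrors them.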

5. Smart Borrows, But Genius Uses Open Source

Before you create something new out of whole cloth, pause and consider that someone else may have already seen it, solved it, and open-sourced it.

A Google Code Search may turn up a regular expression for that obscure data format.

The open source community allows you, if not to stand on the shoulders of giants, to at least rely on the gruntwork of fellow geeks.

6. Keep Your Head in the Cloud

This past week, an engineer friend was just thinking about buying a dream desktop: a high RAM, multi-core box to run machine learning code over TBs of data.

I told him it was a terrible idea.

Why? Because the data he wants to work on isn’t local, it’s on an Amazon EC2 cluster. It’d take hours to download those TBs over a cable connection.
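The back-of-the-envelope arithmetic is worth doing before you buy hardware. The sketch below assumes a sustained 20 Mbit/s download rate and decimal terabytes; both figures are assumptions for illustration, not measurements of my friend’s connection.

```python
def transfer_hours(terabytes, megabits_per_second):
    """Hours needed to move the given volume at a sustained line rate."""
    bits = terabytes * 1e12 * 8                     # decimal TB to bits
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 3600

# Assumed numbers: 1 TB over a 20 Mbit/s link.
hours = transfer_hours(terabytes=1, megabits_per_second=20)
print(f"{hours:.0f} hours")
```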

If you want to compute locally, pull down a sample. But if your data is in the cloud, that’s where your tools and code should be.

7. Don’t Be Clever

I once heard Brewster Kahle discuss managing the Internet Archive’s many-petabyte data platform: “Every time one of our engineers comes to me with a new, ingenious and clever idea for managing our data, I have a response: ‘You’re fired.’”

Hyperbole aside, his point is well-taken: cleverness doesn’t scale.

When dealing with big data, embrace standards and use commonly available tools. Most of all, keep it simple, because simplicity scales.

I know of a firm that, several years ago, decided to fork one part of Hadoop because they had a more clever approach. Today, they are several versions behind the latest release, and devoting time & energy to back-porting changes.

Cleverness rarely pays off. Focus your precious programmer-hours on the problems that are unsolved, not simply unoptimized.

  1. amolpatil2k says:

    I had never heard of Hadoop, Greenplum, Aster or Netezza, so I need to get out of here already.

    … and work around edge cases (say, pesky unescaped line terminators), before running a many-hour job on the full monty … Lovely

    The open source community allows you, if not to stand on the shoulders of giants, to at least rely on the gruntwork of fellow geeks … Lovely too.

    Didn’t quite agree with Seven. Kahle seems to be a closed system guy. if IA had an open API policy, many people would find many ways to use the same data which in turn would prompt the admins to reorganize it. Similarly making Hadoop clever would cut into Greenplum’s profits so that firm was sacrificed to set an example of what not to do.

  2. Nuno says:

    While some sharding scales better than no sharding, your advice to respect the grain and shard by key is known not to be optimal for most applications. Using techniques like consistent hashing to insert the data is not perfect but is at least an improvement over what you suggested. I can explain further if you like, but it’s too hard to type on the phone :P

    The real world is partitioned –whether as zip codes, states, hours, or top-level web domains –and your data should be too. Respect the grain of your data, because eventually you’ll need to use it to shard your database or distribute it across your file system.

  3. John Warden says:

    Great post Michael. And I love your final advice: “Focus your precious programmer-hours on the problems that are unsolved, not simply unoptimized.”

  4. Tomithy says:

    Thanks for this post on working with data. As I am new to the field of data analytics, it gave me some good advice on how to work with it.

    4 – Sample Your Data: As data scales, solutions do not necessarily scale as fast, and it is especially annoying to re-run huge queries and processing just to test small changes. With a query sampling wrapper, I could work far more efficiently with smaller query sets to refine my search query iteratively.

    Keep posting~!

  5. Eric Gaumer says:

    Great post. I’ve been involved in enterprise search for nearly a decade and I spend a majority of my time scrubbing data, getting it ready for indexing and/or analysis.

    We recently released an open source data flow framework that leverages flow-based programming techniques. It provides an execution environment to manage and coordinate a series of custom processes connected (at runtime) to form pipelines and directed acyclic graphs. Its REST interface makes it scalable and cloud friendly and the architecture is event driven (i.e., push as opposed to pull).

    We’re using Stackless Python and YUI to provide an experience similar to Yahoo! Pipes but completely customizable and much more scalable (server side processing).

    Check it out, you might find it useful for a variety of your data processing needs.

    http://www.pypes.org

  6. @Nuno – great point on the dangers of sharding by key and the value of consistent hashing. The world is dominated by power-law distributions, not uniform, so few natural keys can be trusted. In that light, I’ll spin “respect the grain of the data” as meaning: understand how your data is distributed.

    @Eric – thanks for the link to Pypes, I’ll check it out!

  7. [...] The Seven Secrets of Successful Data Scientists (From Dataspora Blog) At least in terms of building the infrastructure, this seems like solid advice. You can’t do all the fascinating analysis attributed to true data science if you have a faulty foundation. [...]

  8. Dennis Karr says:

    I am doing research for my company on the Big Data Problem and found this Blog. We need to find consultants that specialize in Big Data, especially ones that have worked with EBAY, Google, Yahoo, or Amazon. Any recommendations? Also what universities specialize in teaching Big Data solutions?

  9. Thanks for the great post! I concur with the author that we do in fact live in an IO bound world and that compression is the answer. Parallel analytic databases such as ParAccel enable the best IO throughputs and compression ratios via a columnar architecture.

    More importantly, ParAccel has allowed data scientists to analyze and iterate over data at a scale and grain that makes most sense to them without being forced to aggregate or sample because of limitations in their existing analytic tools. The result is, of course, new and useful insights at a rate that wasn’t possible before.

  10. [...] Nicely put. The seven secrets of successful data scientists: Dataspora blog [...]

  11. Ryan says:

    I linked back to this post in a comment at FlowingData; for symmetry, here’s the link. http://flowingdata.com/2010/09/28/poll-results-what-do-you-use-to-analyze-andor-visualize-data/#comment-52398

  12. [...] yourself for being a data scientist? Are there any data scientists secrets? Michael E. Driscoll ☞ lists on Dataspora blog seven secrets for successful data [...]