SQL is Dead. Long Live SQL!
by mike | November 25th, 2009
“The adoption of a relational model of data, as described above, permits the development of a universal data sub-language.”– E.F. Codd, 1969
“Database research has produced a number of good results, but the relational database is not one of them.” – Henry Baker, 1991
Outside of programming language flame wars, few questions raise the hackles of hackers more than: “how should I store my data?”
I will argue here, like many such debates , the answer is: it depends on what you’re doing.
While the rise of non-relational data stores serves a much-needed niche, the death of SQL and relational databases has been much exaggerated. E.F. Codd may be dead, but SQL is alive and well as a simple yet powerful data query language.
3NF Crusaders vs NoSQL Rebels
While the current critique relational databases shares features of earlier debates (such as in the 1990s, when object-oriented databases were heralded as the next big thing), it has some new twists. Thus to review the players and their positions:
On our right are the relational curmudgeons, the kind of folks who pen manifestos and crusade against NULL values. They have converted nearly all of big business to their ministry, and have billions of dollars in their coffers to show for it. They insist that data should be stored in terms of its relations, to protect its integrity and facilitate its analysis. Ideally that means third-normal form, but more liberal branches of the church exist.
On our left are the folks from the misnomered NoSQL movement, shaggy kids from the likes of Facebook and Twitter . They’ve rebelled against the shackles of relational tables (and bear the scars of MySQL scaling struggles). They believe that data should be persisted as it’s programmed: in objects. And they’ve spawned a constellation of colorfully named open-source projects – Cassandra, Voldemort, CouchDB, MongoDB, and Dynamo – to consummate their cause.
A Three-Pronged Attack on SQL: Syntax, Schemas, and Performance
At the heart of the NoSQL movement are three distinct critiques:
- A dislike for SQL’s syntax, which is ill-fitted to programming patterns. It’s painful to write select statements to grab the data spread out across many tables, when all you want is a record. Within web frameworks, the interface problem has been solved to a large degree by object-relational-mappers, such as Ruby’s ActiveRecord.
- A rejection of the strong typing of relational schemas, which make it painfully difficult to alter one’s data model. It also makes writing to the data store a complex process.
- A critique of performance, which in turn relates to how concurrency and partitioning of computation is handled. Most relational databases maintain a shared state, which strives for perfect concurrency, but complicates distributed computation over many nodes. NoSQL architectures are built on languages and tools, like Erlang and Hadoop, that favor distributed processes which (to use two favorite catch phrases) “share nothing” but are “eventually consistent.” The NoSQL philosophy also weighs heavily against joins.
These critical threads are mirrored in the movement and their associated projects. One the one hand you have developers who prefer the programmatic ease of interacting with NoSQL data stores, such as Cassandra and CouchDB. They also don’t suffer the performance penalties of scale: unlike with relational tables, the performance of look-ups does not degrade as the stored number of objects rises.
On the other, you have Big Data analysts (like myself), who love Hadoop because it allows easy distributed computation over massive, loosely typed data sets.
Analytics: MapReduce for Munging, SQL for Set Operations
With regard to analytics, the Hadoop ecosystem makes it easy to dump several billion records of varying formats into a data store and process them – without having to conform them to a common data model. Thus NoSQL framework is great for massive data munging.
But if I had to access an already structured massive data set, I prefer SQL’s declarative syntax to MapReduce constructs.
I recently sat down at an SQL terminal with several hundred billion call records behind it. With a simple SQL query, I determined how many distinct people the average American telephones more than once in a given month (answer: five). In a few hundred seconds, I’d generated a report on the global state of the customer calling network.
Contrary to what the NoSQL may inveigh, it’s not that relational databases can’t scale – in fact, they can scale to petabytes, as those who know Fortune 500 enterprise computing can attest . The problem is that relational databases require lots of ETL cruft to munge fluid blobs of data into strongly typed tables.
I can’t imagine the programmer pain and suffering that went into building one, unified, global database. But once it’s there, I’d much prefer to access it with SQL statements than MapReduce code .
And I’m not alone in feeling this way: Jeff Hammerbacher of Cloudera recently told me that, for an enterprise deployment, usage jumped 10x when an SQL interface – HIVE (which I mention below) – was placed on the cluster.
NoSQL is a Misnomer: SQL is Innocent!
Which brings me to my defense of SQL. I agree with two of three above critiques that embody the NoSQL philosophy, namely the need for schema-less storage and distributed architectures. But when they go after SQL, and name the movement in opposition to it, they’ve named the wrong villain. (Your honor,) SQL is just an innocent query language!
As evidence of innocence, look no further than Google’s GQL and Hadoop’s HIVE, two SQL-style query languages for NoSQL data stores.
Why SQL in a NoSQL data store? For one, it’s a language that both business analysts and developers already know; so the zero-th order adoption step is shorter.
But SQL lives on for a deeper reason: it is a simple yet powerful language for set operations. SQL captures the essential patterns of data manipulation, such as:
- intersections (JOINs)
- filters (WHEREs)
- reductions or aggregations (GROUP BYs)
I suspect that many developers who profess a disdain for SQL have been deceived by its simplicity. One of my favorite packages in R is sqldf, which allows SQL queries on R data frames. SQL’s declarative expressions are frequently more readable and compact than their R programmatic equivalents.
MapReduce is Possible in SQL
Until very recently one of the more difficult operations to perform in SQL was a top-K query, for example, finding the five highest priced items in for every store in a retail database. But so-called window functions, which make such queries easy to express, have become part of the SQL standard and are now natively supported in Postgres.
Window functions are powerful because they provide a “split-apply” functionality, otherwise known as a map function. Combine these with SQL’s GROUP BY operations, which is a reduce function, and you have achieved – voila! – map-reduce in SQL. And as with all map functions, window operations are massively parallelizable (something that has not gone unnoticed by some commercial vendors.)
Verdict: Don’t Use a Chainsaw to Cut Butter (Use the Right Tool)
Both NoSQL and SQL have their place in an analytics ecosystem. In the Big Data workflow that I’ve advocated in the past, I view SQL as a pipe feeding data into more sophisticated modeling and visualization tools, such as R. But it is an easy-to-use pipe, and it allows analysts to quickly pull out a subset of data — and start asking questions of that data.
The verdict in the great NoSQL debate is: know your tools and know your goals. In the Big Data space today, there can be an undue focus on formats or mechanics, but these are just a means to one end: products. Remember, Paul Graham and his team wrote Viaweb in Lisp, and it just worked.


Mike,
Nice article but sometimes SQL isn’t great for analysis either. Imagine trying to construct a social graph from call records at a particular point in time. Or, imagine trying to determine how many callers hit redial. That would take some hairy sql, big iron, and a lot of compute time. We at 1010data designed our own analytical language, and engine for big data that can express queries like this in a very simple language and process them on big data (billions of rows) in seconds. Our web service is used by big wall st firms like the NYSE, Goldman Sachs, and JP Morgan (to name a few, we also have customers in other industries as well) to do time series analysis on equity and mortgage data.
Anyway, we’re a big fan of the blog and thought you should know about 1010 (www.1010data.com). Hit me up via email if you want to talk more.
-Dan H
1010data
I’ve got to agree with Dan H above. SQL is a pretty crummy language for analysis too. What if I want to determine which reps have had the highest growth in call volume? Even calculating simple metrics like % change over time is insanely difficult with simple sql queries.
This is why the NoSQL movement started in analytics. Tools like Big Table and Hadoop allow people to actually make use of large amounts of data, rather than just cleaning it and storing it. Putting SQL as a convenience layer on top of those tools is fine, but it’s just syntactic sugar. Use it when it’s easy, but have the power to drop into something that gets your question answered.
In fact, most analysts I know use SAS, not SQL for working with large amounts of data, because of the inflexibility of the language. I’m guess that you yourself find you move into R or Excel (or even PL/SQL) for a large number of your analyses, because you can’t get to the end result with just SQL
Further, I’m not sure what your experience is with Fortune 500 companies using extremely large SQL databases, but my experience is that a ton of time is wasted pulling records from denormalized applications and putting them into 3NF data warehouses and then pulling them back out into denormalized data marts so that the data can actually be used.
Creating these data marts and data warehouses is expensive and slow. For analytics to be effective, you need answers fast, and those answers don’t fall out along strict dimensional lines (or you would see OLAP being used by more than just IT departments providing a solution to a problem that few people have).
I think SQL was a great language for 30 years ago, when disk space was expensive (so you needed to put things in 3NF to reduce redundancy), computers operated faster in serial than in parallel, IO wasn’t necessarily the blocking factor, and when ways that data needed to be understood were fairly minimal.
Today, constraints have changed. IO is the primary bottleneck, and massive parallelization is the only way to work with data of the size that we can easily collect. SQL is a useful compatibility layer for people who want to ease into the brave new world of data analysis, but it’s time we realized it’s the past, not the future.
Dan & Gerad –
Agreed that SQL by itself is not the end-all and be-all for analytics, but it’s more powerful than I think you’re giving it credit for — particularly with newer architectures that provide parallelization (avoiding the IO bottlenecks you mention), user-defined functions, and map-reduce functionality.
To the oft-cited example that SQL is not efficient at querying network data, I would disagree. In my work with a N. American telecom, I built a first-degree calling network for all customers in a parallelized single pass over the data, executed in seconds over several TB of data. More complex network graph queries — including identifying transitive closures — are also express-able (http://willets.org/sqlgraphs.html).
For higher-order statistical analysis, I would agree that SQL has its limitations. My preferred workflow is to pull data from SQL into R, and derive statistics there.
Yes, creating data warehouses is expensive & slow, but data integrity matters at scale. Having worked with massive event data, I have learned some painful lessons from having a 32-node map-reduce job fail because of bad data lurking somewhere in the terabytes. This does not happen in data warehouses.
Nice to see a defense of SQL. The earlier commenters are right that it has many limitations. But it’s *WAY* better than writing your own python or java code every time you want to calculate certain types of summary statistics.
I think part of the issue is that programmers are heard much more loudly in open-source internet world and SQL doesn’t fit their needs as well as custom data stores. Also, the people whose needs are best satisfied by SQL-style data warehouses are big corporations and governments, which are decidedly not cool in startup/hacker world.
I’m very much interested in trying out NoSQL.
SQL certainly has it’s place and it can be the right tool for the job. I think we need to remember what the “Q” in SQL stands for. While analytics are certainly possible, to an extent, SQL is best at retrieving large amounts of data, but not necessarily at analyzing it. R, or maybe even Matlab, are better suited for most types of analysis.
@Michael, that site you give has some very useful hierarchical routines, however they make use of looping structures, which I find awkward in SQL, and almost cheating a bit. Views, joins and sub-queries are much more natural.
While transitive closure is very useful and easy to do using Oracle’s CONNECT BY clause, last I checked, isn’t supported by mySQL. This isn’t to say TC isn’t possible, but it is quite non-trivial. Graph generation is no picnic, and I sure couldn’t find a pure SQL way to implement any DNA processing algorithms.
Over the years, I’ve tried so hard to turn SQL into a magic answer box, but sometimes it just isn’t possible. The author has it right when he says “..I view SQL as a pipe feeding data into more sophisticated modeling and visualization tools…”
Sigh. This post may be too rational for the Internets.
Fancy meeting you here Gerad! I think you’ve been spending too much time with MySQL. Sure, SQL is not R. But your examples should be pretty easy in a mature modern SQL DBMS using PARTITION BY.
In my experience, SQL syntax actually shines on graphs compared to MapReduce, while MapReduce is much nicer for text-whacking than SQL. Most commercial DBMSs — along with PGSQL 8.4 — support recursive queries that do well on big graphs. But if your DBMS doesn’t support recursive queries (or you find them confusing, which they are!), you can always go for the typical MapReduce hack of writing a loop in your favorite scripting language around a simple SQL query. On that front, MapReduce has no particular bragging rights.
For some recent examples of doing scalable statistics with SQL, see http://databeta.wordpress.com/2009/03/20/mad-skills/ and http://databeta.wordpress.com/2009/04/01/declarative-ie/ and the tech reports linked from there.
Joe
link back: http://www.jonathanlevin.co.uk/2009/11/reply-to-sql-is-dead-long-live-sql.html
SQL is good, but needs to be made more powerful and simpler to use. I mean features like you can find in C# linq query language, partucularily defining reusable calculations inside queries. (Maybe some vendors has it, I don’t know)