The Data Singularity is Here
by mike | March 8th, 2010
In this blog post I’ll attempt to sketch the forces behind what I’m calling, somewhat sensationally, the Data Singularity, and then (in a following post) discuss what I see as its consequences.
In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we aren’t even at the terminal node of action. International cargo shipments, high-frequency stock trades, and genetic diagnoses are all made without us.
Absent humans, these data and decision loops have far less friction; they become constrained only by the costs of bandwidth, computation, and storage, all of which are dropping exponentially.
The result is an explosion of data thrown off from these machine-mediated pipelines, along with data about those flows (and data about that data, and so on). The machines all around us — our smart phones, smart cars, and fee-happy bank accounts — are talking, and increasingly we’re being left out of the conversation.
So whether or not the Singularity is Near, the Data Singularity is here, and its consequences are being felt.
But before I discuss these consequences, I’d like to expand on the premise. The world wasn’t always drowning in this data deluge, so how did we get here?
I. Data at the Speed of Speech
For most of human history, information traveled no faster than the sound of the human voice. The origin of human language was the original singularity: it marked the birth of a non-biological information channel, distinct from our DNA.
But despite this achievement, the production of information (whether farmers’ almanacs or merchants’ ledgers) was still constrained by the costs of ink and parchment and the write-speed of the human hand.
All 70,000 volumes of the Library of Alexandria, the collected body of human knowledge in antiquity, could fit on two thumb drives today.
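A rough back-of-the-envelope check of that claim, sketched in Python. Only the 70,000 figure comes from the text above; the size per volume (about 1 MB of plain text) and the drive capacity (64 GB) are assumptions chosen for illustration, and different assumptions would give a different count.

```python
import math

VOLUMES = 70_000               # figure quoted in the post
BYTES_PER_VOLUME = 1_000_000   # assumption: roughly 1 MB of plain text per volume
DRIVE_BYTES = 64 * 10**9       # assumption: one 64 GB thumb drive


def drives_needed(volumes, bytes_per_volume, drive_bytes):
    """Return how many drives are required to hold the whole collection."""
    total_bytes = volumes * bytes_per_volume
    return math.ceil(total_bytes / drive_bytes)


total_gb = VOLUMES * BYTES_PER_VOLUME / 10**9
drive_count = drives_needed(VOLUMES, BYTES_PER_VOLUME, DRIVE_BYTES)
print(f"{total_gb:.0f} GB in total, {drive_count} drive(s) needed")
```

Under these assumptions the library comes to about 70 GB, which just overflows one drive and fits comfortably on two.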
Thus the transmission and production of data, when it was done at all, was painstaking in form, small in scale, and occurred between people.
People --> People
II. Data at the Speed of Light
With the telegraph, for the first time, data flowed at the speed of light.
In the late 18th century, the first substantive telegraph line connected Paris to Lille, a city roughly 210 kilometers to its north, using optical semaphores rather than electrical currents to communicate. Yet while data hopped between stations at light speed, it had to be routed by human operators at each station.
Centuries earlier, the printing press had dramatically reduced the production costs of information. Still, human authors transmitted their hand-drafted manuscripts to typesetters, who set type with fonts optimally designed for human eyes.
III. Programmable Looms and Reading Machines
Punch cards represented the movement of data away from human-readable, anthropocentric substrates, onto a medium designed principally for consumption by machines.
Punch cards were developed in France in the early 18th century to control industrial looms.
Now, machines were the final terminus of data transmission. This act of communicating with our machines, programming them, was at the heart of Charles Babbage’s Analytical Engine, which came more than a century later.
People --> Machines
IV. Phonographs and Recording Machines
Developing on the other side of the communication spectrum were machines that excelled at writing and storing data.
The modern rotating disk drive feels inspired less by punch cards than by Thomas Edison’s cylinder machines, better known as phonographs.
The human voice was a natural data format, and if early pioneers had a vision for the modern human-machine interface, I imagine it would have been to program machines by voice. It’s a vision that still eludes us.
By the middle of the 20th century, a slew of semiconductor technologies emerged to close the loop of data generation: we had machines that produced digital data, and machines that continuously consumed it, without human intervention.
Machines --> Machines
These technologies also sparked the beginning of a less-celebrated, but equally important exponential curve: the falling cost of data storage.
V. Listening to the Pulse of the Planet
The exponential drop in data storage costs has meant that logging historical data about a process, or billions of processes, is economically feasible.
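To make the shape of that curve concrete, here is a minimal sketch of an exponentially falling price. The starting price, the halving period, and the function name are all hypothetical placeholders, not figures from this post or from any real price series.

```python
def cost_per_gb(initial_cost, halving_years, years_elapsed):
    """Price after `years_elapsed`, if the price halves every `halving_years`."""
    return initial_cost * 0.5 ** (years_elapsed / halving_years)


# Hypothetical inputs: $1000 per GB at the start, halving every 2 years.
start_price = 1000.0
price_after_20_years = cost_per_gb(start_price, halving_years=2, years_elapsed=20)
print(f"${price_after_20_years:.2f} per GB after 20 years")
```

Ten halvings shrink the price by a factor of 1,024, which is why data that was once too expensive to keep can later be logged by default.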
I conjecture that the largest share of data on the planet sits in log files; these are the EKGs of the server farms that manage our cell phones, our e-mail accounts, and every other facet of our online existence, and which consume 3% of the US energy budget.
Ubiquitous networking and cheap bandwidth have meant these pools of storage are no longer isolated on individual sensors, phones, or servers, but form the tributaries feeding an ocean of data in the Cloud.
And yet, funneling these massive volumes of data creates enormous technological pressures, against which companies struggle. So why keep the data?
Because inside these log files, amidst the myriad conversations recorded between machines, lies the pulse of their customers.
Collectively, these logs reveal the pulse of the planet — flight delays, package shipments, job losses, and human sentiments.
And as I’ll discuss in my next post, those who can extract a meaningful signal from this thunderous cacophony — the analysts, statisticians, and data scientists — are uniquely positioned to change the world.



Great post – looking forward to the rest of the installments. Reminded me of this old post: One word: Data
Nice post. I look forward to the next one – can’t stand the suspense
“The machines all around us — our smart phones, smart cars, and fee-happy bank accounts — are talking, and increasingly we’re being left out of the conversation.”
As this happens, I am curious about how the role of human judgment and interpretation will evolve. For operational/execution systems it may not matter, but for decision-support systems, bad things can happen without a human in the loop.
Great beginning, can’t wait to read the other two parts.
The storage price curve is striking. Even more so when you think that there are already technologies in the pipeline for the next couple orders of magnitude.
There are dizzying opportunities for anybody who can come up with good ways to sift and analyze this data.
Sometimes I worry that so much of the most valuable, granular data, therefore so much of this opportunity, resides in enormous proprietary databases that may be very hard for startups to dig into. Google, Apple, Microsoft, telcos, the ISPs, they all know how valuable and sensitive this stuff is.
When Britain opened several of its big databases to the public, wasn’t that in part because the government was hoping on the cheap to attract researchers and developers to its datasets (and the problems they document), rather than watching them work exclusively for the profit-driven guardians of infobiz data?
Great post; like the others, I look forward to the next installment.
This talk of data makes me think of the long-term implications of keeping all this data.
Re: storage getting cheaper, the economics aren’t always as clear-cut as you might think in terms of keeping vs. discarding data. Just because storage is cheap and getting cheaper doesn’t mean you want to pay to keep all of it. Think in terms of paper records: just because your storage of the paper documents in rural location X is extremely cheap compared to, say, the NYC area doesn’t mean you want to keep paying to store the material if you don’t actually need it or use it.
The Science and Technology Council of the Academy of Motion Picture Arts and Sciences (yes, the Oscar awarders) released a report in 2007 entitled, “The Digital Dilemma”. They examined the costs of storing and migrating “digital film” vs CMYK. After losing many early (reel) films b/c they were not thinking of long-term use of the films, archivists figured out how to store CMYK for 100 years. The policy with CMYK is to “save everything” related to the film and the making of it, b/c you don’t know what you will need in the future.
The authors of the report stated that you cannot have a “save everything” policy with regards to digital movies, and that even with culling, it will cost 1100 (yes, one thousand one hundred) times more to store and preserve digital movies than it will a CMYK film. In fact, some producers and directors have made digital movies, and then preserved them on CMYK, because that has been figured out in terms of long-term preservation and is less expensive.
The point is…data is valuable, but some of it will be valuable only for so long. Storage is getting cheaper, but is it really cost effective to keep yottabytes of data sets just because you can?
Then you have to get into the bits about appraisal and selection….
A very good example of your conjecture is the Wikileaks 9/11 text message capture. The vast majority of the data is machines talking to other machines. Looking forward to part 2.
[...] The Data Singularity is Here (Dataspora Blog) [...]
Michael,
Excellent article and well written.
I agree with you that we are living in interesting times and that the current explosion of data and information is driving change in ways that will probably surprise most of us. I believe that in the future this period will be referred to as the “Information Revolution”, comparable in impact to the Industrial Revolution of the 19th Century.
However, I do not fully agree with your premise that human involvement will soon be unnecessary. I had already started to draft a blog post on how inaccurate computers can be (and your post will spur me on to complete this!) and that human judgement and intuition will remain an essential ingredient in many, but not all, areas.
This may sound radical, however, if we take two factors into account, then the data singularity concept is called into question. Firstly, although many organisations aspire to perfect data quality, the cost and achievability of this means that decision making will always be based on data which is, to a degree, imperfect.
Secondly, many of the more complex business analysis and modelling scenarios involve fiendishly complex decision logic. The complexity of such logic itself can be prone to errors, particularly when changes to that logic are introduced. A few years ago I was involved in the development and operation of a rules based artificial intelligence system which ably demonstrated this problem.
When imperfect data is utilised by complex (and also imperfect) decision logic, then the outputs will not be as intended. Therefore human assessment and inference will still be required.
Looking forward to the next instalment.
Julian
[...] the speed at which information travels between two nodes in a network. It was about a so called Data Singularity and the basic premise was that nowadays information flows are so horribly fast that only computers [...]
Now got my blog post mentioned above completed. See http://bit.ly/95dEHl
Hi Michael,
Excellent post – well done. Like the other people who commented, I eagerly await the next installment.
I agree there is huge opportunity for those who can “mine” the data. I also believe there will be an increasing need to “govern” the data.
Ken
As always, thought provoking stuff.
I can’t stop thinking about your “log files are tributaries feeding into the Cloud” metaphor. So cool! However, I would extend it by altering their destination not to a single Cloud… but to a large collection of land-locked ponds, lakes, and seas. For better or worse, Google has beaten everyone by creating their own singular Ocean in which we all swim, sail, and surf.
Jeff
Looking forward to the next installment, thank you
Still, computers only do what we tell them to do. Currently there are people who know how to tell computers what to do (programmers), and those who don’t.
People traditionally excused from having programming knowledge now are required to have at least some, due to the pervasiveness of digital technology. So many non-programming jobs require “Excel experience”, which effectively means: can you automate at least some of your work.
Yet programming is a black art to most non-programmers. We can have all the data in the world, but what we are really going to run short of are people that know what to tell computers to do with it. Even the label “programmer” encompasses a whole range of skill, of which only a top few have the skills required to build really decent programs.
We need to come up with better ways of making computer programming accessible to everybody. We need to make it easy to build good software. Otherwise this great edifice of complexity will become increasingly unwieldy.
Great post, Michael. Really appreciate your analysis of the evolution of data flow.
-b
[...] years as technology enables robust, easy and cost-effective trading and settlement mechanisms and data (which is the raw material of any exchange or risk management toolkit) continues to grow in siz… Indeed the greatest impediment to the development of such markets is cultural: there is still an [...]
Michael –
I’m a relatively new visitor, first time commenter.
A bit off-topic, but do you have any recommendations on what are the top schools for a Master’s (or PhD) in Statistics? Specifically looking for a school that would offer good preparation to jump back into industry (as opposed to academics) for practical application of what has been learned.
Or could there be a potential blog post in the future for people who aspire to be where you are and have your level of understanding and practical ability with regard to data/architecture/etc.? Maybe covering:
- Education track
- Recommended hard skills (i.e. Python, R, SQL, etc.)
- Your favorite online resources / papers / textbooks / etc.
- Etc.
Ken – Not off-topic at all – since you asked, I’d highly recommend reading up on this excellent (and recent) post by my colleague and friend Bradford Cross.
http://measuringmeasures.com/blog/2010/3/12/learning-about-machine-learning-2nd-ed.html
Thanks, this is great.
So you would recommend self-study over a formal degree? How would you turn self-learning into something that adds credibility to a resume (like a formalized masters would)?
[...] data singularity is here. Quote: “The machines all around us — our smart phones, smart cars, and fee-happy bank [...]
Computers are new, but data has always been around; it just was never looked at in the same manner as it is now.
There is data when you put some fries in your mouth – your senses are feeding you data all the time. In fact, in such cases the data may not even be stored, but its summary may be… or not…
So enough of this hype about big data – data is not interesting; its meaning is.
As a formal degree to get into machine learning I would suggest a major in statistics with a minor in computer science, because imho this covers most of the subjects listed in the recommendations of Bradford Cross; at least that’s my conclusion after comparing the list with the courses offered at my university.
Thanks Michael.
Terrific post, and couldn’t agree more. Particularly like your call out of the log files... they are a treasure, and growing by the second.
Best,
Jon
[...] Thirty years ago, the internet didn’t exist. Computers were not in common use – either commonly, or by the common person. The world has changed. Not only are nearly two billion people accessing the internet, but we’re facing a data singularity. [...]
[...] of data mining, Michael Driscoll of Dataspora has an interesting pair of posts extolling the virtues of Big [...]
Michael –
Inspiring post.
As an avid technologist but also an amateur archaeologist, I’d like to suggest a small correction to the historical flow you presented.
The “big bang” moment of the data singularity was not language but the invention of writing ca. 3000 BC. Until then data did not accumulate (i.e. get stored). I believe that the revolution occurred with the means of storage rather than with the means of communication. It just so happens that for most of human history storage was limited in capacity (papyrus, clay tablets and parchment) and bandwidth (human handwriting) and therefore exceedingly expensive and only for the elites/rulers. The invention of the alphabet, the printing press and hard cover books are all milestones in the slow process of decreasing cost of bandwidth and/or storage capacity. You should have shown the graph depicting the decrease in cost per gigabyte of storage starting in 3000 BC rather than 1980….
Obviously machine language and digital data storage have accelerated this historical trend exponentially in the last 50 years.
I look forward to reading more of your posts.
Yaron
[...] The Data Singularity is Here : Dataspora Blog Those who can extract a meaningful signal from this thunderous cacophony of data r uniquely positioned 2 change the world http://is.gd/gmgHn (tags: via:packrati.us) [...]