The Rise of the Data Web

by mike | August 20th, 2009

The future of the web is data, not documents. The web has evolved from Tim Berners-Lee’s original vision of “some big, virtual documentation system in the sky” into a vibrant ecosystem of data where documents, and human actors, will play an ever smaller role.

As others have noted, we’ve reached a tipping point in history: more data is being manufactured by machines — servers, cell phones, GPS-enabled cars — than by people. The early, document-centric web was populated by hand-coded hypertext files; today, a hand-coded web page is as rare as hand-woven clothing.

Through web frameworks, wikis, and blogs, we have industrialized the creation of hypertext. Similarly, we’ve also industrialized the collection of data, and spliced out the human steps in many data flows, such that data entry clerks may soon be as rare as typesetters.

The web we experience will continue to be dominated by documents: e-mail, blogs, and news. And while many sites are data-centric (Google Maps, Weather.com, and Yahoo Finance), it’s the web that we can’t see that is surging with data. It’s not about us; it’s about servers in the cloud mediating entire pipelines of data, only occasionally surfacing in a browser.

But the web’s data architecture is fractious and in flux: many competing standards exist for serializing, parsing, and describing data. As we build out the data web, we ought to embrace standards that mirror data’s form in its natural habitats — as programmatic data structures, relational tables, or key-value pairs — while taking advantage of data’s stream-like nature. Mark-up languages like HTML and XML are ideal for documents, but they are poor containers for data, especially Big Data.

Sacred “Words & Enthusiasm” vs Meaningless Utterances

Documents and data are different.  The table below reflects my thin grasp of the fissure lines, as a step towards arguing why we ought to design around them.

Documents are made of “words and enthusiasm”: sonnets, cake recipes, blog posts, Supreme Court rulings, and dictionary definitions. Their core stuffing is text. Their structure is unpredictable and irregular — even fractal.

Data are not created but collected (something given, not something made): city temperatures, stock prices, web visitors, and home runs. They are observations in time and space, with periodic and predictable structure. Data are reorderable and divisible: you can relay city temperatures in any order, but you can’t rearrange a Shakespearian sonnet without muddling its meaning. Some documents are so meaningful as to be considered sacred.
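
To make “reorderable and divisible” concrete, here is a minimal Python sketch. The city readings are made-up illustrative values, and the helper names (`total_and_count`, `merge`, `mean`) are hypothetical, chosen only for this example. It shows that an aggregate over the readings comes out the same whether the records are shuffled or summarized in separate pieces and merged afterwards.

```python
import random

# Hypothetical readings (city, temperature in degrees C); illustrative values only.
readings = [
    ("Oslo", 12.0),
    ("Cairo", 34.0),
    ("Lima", 19.0),
    ("Tokyo", 27.0),
]

def total_and_count(records):
    """Summarize records as a (sum, count) pair that can be merged later."""
    temps = [temp for _city, temp in records]
    return sum(temps), len(temps)

def merge(a, b):
    """Combine two partial (sum, count) summaries into one."""
    return a[0] + b[0], a[1] + b[1]

def mean(summary):
    """Turn a (sum, count) summary into an average."""
    total, count = summary
    return total / count

# Reorderable: shuffling the records does not change the answer.
shuffled = readings[:]
random.shuffle(shuffled)

# Divisible: summarize two halves independently, then merge the summaries.
left, right = readings[:2], readings[2:]
merged = merge(total_and_count(left), total_and_count(right))

whole_mean = mean(total_and_count(readings))
shuffled_mean = mean(total_and_count(shuffled))
split_mean = mean(merged)
```

No comparable operation exists for the sonnet: there is no summary of its first seven lines that can be merged with a summary of the last seven to recover the poem.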

Data are, in this regard, meaningless on their own; they do not signify, they simply are. These data are the utterances of the spimes that surround us.

Documents as Trees, Data as Streams

The argument for shifting away from markup languages as data formats is not just practical, it’s philosophical: it’s about pivoting our conception away from the dominant metaphor of documents — trees — towards one far more suitable for data — streams.

Trees are rooted and finite: you can’t chop up a tree and easily put it back together again (while XML has made concessions to document fragments, it is not a natural fit).

Streams can be split, sampled, and filtered. The divisibility of data streams lends itself to parallelism in a way that document trees do not. The stream paradigm conceives of data as extending infinitely forward in time. The Twitter data stream has no end: it ought to have no end tag.
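
As a rough sketch of what splitting, sampling, and filtering can look like in practice, the Python below treats a stream as a lazy iterator over one-record-per-line JSON. Everything here is an assumption for illustration: `event_stream` is a hypothetical stand-in for an endless source, the record fields are invented, and the helper names are not from any particular library.

```python
import itertools
import json

def event_stream():
    """Hypothetical endless source: yields one JSON-encoded record per line."""
    for i in itertools.count():
        yield json.dumps({"id": i, "user": f"user{i % 3}", "length": 10 + i % 5})

def parse(lines):
    """Decode each line on its own; no enclosing root element is needed."""
    for line in lines:
        yield json.loads(line)

def keep(records, predicate):
    """Filter: pass through only the records that satisfy the predicate."""
    return (r for r in records if predicate(r))

def sample(records, every):
    """Sample: keep every n-th record (deterministic, for illustration)."""
    return itertools.islice(records, 0, None, every)

def split(records, n):
    """Split: route records into n independent streams by id modulo n."""
    copies = itertools.tee(records, n)
    return [keep(copy, lambda r, k=k: r["id"] % n == k) for k, copy in enumerate(copies)]

# The source never ends, so we only ever take a finite prefix of each result.
filtered = keep(parse(event_stream()), lambda r: r["user"] == "user0")
first_filtered = list(itertools.islice(filtered, 3))

sampled = sample(parse(event_stream()), every=10)
first_sampled = [r["id"] for r in itertools.islice(sampled, 3)]

even, odd = split(parse(event_stream()), 2)
first_even = [r["id"] for r in itertools.islice(even, 3)]
first_odd = [r["id"] for r in itertools.islice(odd, 3)]
```

Each stage only ever sees one record at a time, which is why none of them needs to know where the stream ends. Note that `itertools.tee` buffers records that one branch has consumed and another has not, so a real splitter would more likely hand records to separate workers or queues.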

Conceiving of data as streams moves us out of the realm of static objects and into the realm of signal processing. This is the domain of the living: where the web is not an archive but an organism, reacting in real-time.

XML Considered Harmful for Data

XML is a poor language for data because it solves the wrong problems, those of documents, while leaving many of data’s unique issues unaddressed. But many promising alternatives exist, among them lightweight formats like JSON, Thrift, and even SQLite’s file format, as I will detail in my next post.
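
One way to see the mismatch is to cut a feed off partway through, as happens whenever a stream is read while it is still being written. The Python sketch below is an illustration under stated assumptions: the records and field names are invented, and the one-JSON-object-per-line layout is a convention I am assuming here, not something the JSON format itself requires. It uses only the standard library’s `json` and `xml.etree.ElementTree` modules.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical records; the field names are made up for this example.
records = [{"city": "Oslo", "temp": 12}, {"city": "Cairo", "temp": 34}]

# XML: records live inside a root element that must eventually be closed.
xml_doc = "<readings>" + "".join(
    f'<reading><city>{r["city"]}</city><temp>{r["temp"]}</temp></reading>'
    for r in records
) + "</readings>"

# Line-delimited JSON: one self-contained record per line, no root element.
json_lines = "\n".join(json.dumps(r) for r in records) + "\n"

def parse_xml(text):
    """Return parsed records, or None if the document is not well-formed."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return None
    return [{"city": e.findtext("city"), "temp": int(e.findtext("temp"))} for e in root]

def parse_json_lines(text):
    """Parse every complete line; a trailing partial line is left for later."""
    complete, _, _partial = text.rpartition("\n")
    return [json.loads(line) for line in complete.splitlines() if line]

# Simulate a feed that is cut off partway through the second record.
cut_xml = xml_doc[: len(xml_doc) - 25]
cut_json = json_lines[: len(json_lines) - 10]

full_xml = parse_xml(xml_doc)          # the complete document parses fine
partial_xml = parse_xml(cut_xml)       # the truncated tree yields nothing
partial_json = parse_json_lines(cut_json)  # the first record still comes through
```

A whole-document parser like the one used here rejects the truncated XML outright, while the line-oriented reader still returns the record that arrived intact. Incremental XML parsers exist and soften this, but they work around the root element rather than doing away with it.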

  1. as far as new, potentially binary, structured data formats go –

    I’m finding Avro interesting for big long lists of identical structured objects. It basically competes with Protocol Buffers and Thrift. It has a good spec, which is nice. http://hadoop.apache.org/avro/

    BSON (inside MongoDB) is also good to know about. http://www.mongodb.org/display/DOCS/BSON

  2. [...] these days.  The latest source of inspiration is a blog post by Mike Driscoll at Dataspora.  In the rise of the data web Mike writes that through our various frameworks we have industrialized the creation of hypertext [...]

  3. Ryan Cox says:

    Protocol Buffers is worth coverage in your future post as well.

    http://code.google.com/p/protobuf/

    It’s becoming more and more popular as a data interchange format (independent of its RPC beginnings).

    -ryan

  4. [...] Driscoll at Dataspora follows up on this theme providing a more high-level discussion of how the rise of data (vs. documents) conflicts with the architecture that underlies the web today.  Current mark-up languages are geared towards, and ideal for, documents (e.g. HTML and XML), not [...]

  5. Tantek says:

    Excellent table of fissure lines between documents and data. I’m going to be referring to that in future discussions and analyses.

    Regarding XML and documents, unfortunately, XML has pretty much failed for documents as well.

    From the incredible size of the XML stack (necessary X*** W3C specs) to the under-specification of any sort of document processing/validation model for multi-namespace documents, the resultant complexity, incomplete at that, has largely failed on the Web (despite being specifically created as a *future* for the Web!).

    Even W3C has canceled XHTML2, the most “pragmatic” attempt to develop an XML-based evolutionary version of HTML, in deference to HTML5, in many ways admitting defeat in the Great Web Semantics Schism that occurred just over five years ago at the W3C Workshop on Web Applications and Compound Documents (San Jose, CA, 2004) – the very meeting that added inspiration and critical support to the WHATWG/HTML5 effort, and the microformats effort (conceived just four months earlier) whose microformats.org community launched just a year later.

    XML tried to span documents and data, and perhaps as a result is worse at both.

    For documents: HTML (plus microformats) has won favor with web publishers. It’s easier, more familiar, more incremental, and more fault tolerant than XML (“draconian” error handling was a design failure of epic proportions for anything as human-centric as documents).

    For data: JSON has won favor with web developers. It’s again easier, smaller, faster and more efficient than XML.

    Where does that leave XML?

    Tantek Çelik
    former browser web standards developer (actually implemented deep enough standards support to try implementing multi-namespace XML support in a production market-leading browser (IE5/Mac at the time) and discovered numerous fatal show-stopper specification holes) and W3C representative (spent years improving CSS, contributing to XHTML2, and trying, unsuccessfully, to get the aforementioned holes filled, before deciding that building smaller, simpler, limited additions to HTML in the form of microformats would yield far more immediate benefits to web authors/developers and the web as a whole).

  6. Thiago Silva says:

    2 cents:

    The signal processing metaphor might be interesting, and I think we have had a lot of software experience with it over the decades (the mainstream example being the “UNIX philosophy”, perhaps). But I ask myself whether we are using these ideas while ignoring what we have discovered about their limitations. Most of the time, that seems to be the case, especially when it’s about the web (from its inception, maybe), where technology and approach don’t seem to have been chosen with much care, as history tells.

    Cheers,
    Thiago Silva

  7. [...] a tweet by Tim O’Reilly, I came across an excellent post entitled The Rise of the Data Web, on Dataspora Blog (which I quickly subscribed to). The author, Michael E. Driscoll, summed up [...]

  8. [...] me, if you combine the plethora of data being generated by Web 2.0 technologies with the inherent social and behavioral aspects of these technologies, it [...]

  9. [...] the job — it’s ugly, and it’s bulky. So when I read Michael E. Driscoll’s comparison of documents (including XML) to trees and data to streams, it struck a chord with [...]

  10. [...] publish specialized datasets, providing both authoring tools and opportunities to participate in a web of data, not just of [...]