I’ve run across quite a few stories lately discussing 1) the revolution in data production we are living through and 2) the challenges we face in sifting through and viewing that data in a meaningful way on the web.
The first comes from GigaOM, where Jennifer Martinez looks at the emerging problem of trying to keep up with the constant flow of data via status updates. As our networks grow, and our use of various social networks increases, we are inundated with updates, which often means missing the particular updates we are most interested in. Additionally, she notes that besides missing out on information you care about, this stream overload can lead to “disjointed conversations that lack context, making it hard to piece together and decipher what it all means”. I can relate to this problem, and my ‘immersion’ in social networks is average to above average. I haven’t figured out an optimal way to keep up. I try to use a few helpful tools (e.g. Seesmic), but between social networks and Google Reader I find myself constantly playing catch-up.
Michael Driscoll at Dataspora follows up on this theme, providing a more high-level discussion of how the rise of data (vs. documents) conflicts with the architecture that underlies the web today. Current mark-up languages (e.g. HTML and XML) are geared towards, and ideal for, documents, not the kind of streaming data that will come to dominate content. To explain this point he compares two metaphors, where documents=trees and data=streams:
Trees are rooted and finite: you can’t chop up a tree and easily put it back together again (while XML has made concessions to document fragments, it is not a natural fit).
Streams can be split, sampled, and filtered. The divisibility of data streams lends itself to parallelism in a way that document trees do not. The stream paradigm conceives of data as extending infinitely forward in time. The Twitter data stream has no end: it ought have no end tag.
Conceiving of data as streams moves us out of the realm of static objects and into the realm of signal processing. This is the domain of the living: where the web is not an archive but an organism.
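To make the "split, sampled, and filtered" idea a bit more concrete, here is a rough sketch in Python using generators. It is my own illustration, not anything from Driscoll's post: `status_updates` is a made-up stand-in for a real feed, and the helper names are hypothetical.

```python
import itertools


def status_updates():
    """Hypothetical endless stream of status updates (stand-in for a real feed)."""
    for i in itertools.count():
        yield {"id": i, "user": f"user{i % 3}", "text": f"update {i}"}


def filter_stream(stream, predicate):
    """Lazily keep only the items that satisfy the predicate."""
    return (item for item in stream if predicate(item))


def sample_stream(stream, every_nth):
    """Lazily keep every n-th item, starting with the first."""
    return itertools.islice(stream, 0, None, every_nth)


def split_stream(stream, n):
    """Split one stream into n independent iterators over the same items."""
    return itertools.tee(stream, n)


if __name__ == "__main__":
    # The stream never ends, so we only ever look at a finite window of it.
    from_user0 = filter_stream(status_updates(), lambda u: u["user"] == "user0")
    for update in itertools.islice(from_user0, 3):
        print(update)
```

Nothing here ever holds the whole stream in memory, and there is no "end tag" to wait for: each consumer just pulls as many items as it needs. One caveat of this sketch is that `itertools.tee` buffers items that one branch has consumed and another has not, so the branches should be read at roughly the same pace.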
Finally, Ben Lorica at O’Reilly Radar discusses the challenges of trying to analyze large amounts of data in near real-time. While there are a number of potential solutions for the structured data that we are generating, it is less obvious how to deal with the immense amount of unstructured data. He notes recent work by a team at UC Berkeley that was able to take unstructured data and, leveraging entity extraction, turn it into structured data for a SQL database.
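As a toy illustration of the general idea (unstructured text in, queryable rows out), here is a minimal sketch in Python. This is not the Berkeley team's method: the "entity extraction" is just two regular expressions for @mentions and #hashtags, and the table layout and function names are my own assumptions.

```python
import re
import sqlite3

# Assumption: "entities" here are only @mentions and #hashtags.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")


def extract_entities(text):
    """Return (kind, value) pairs found in a piece of free text."""
    rows = [("mention", m) for m in MENTION.findall(text)]
    rows += [("hashtag", h) for h in HASHTAG.findall(text)]
    return rows


def load_updates(conn, updates):
    """Extract entities from each update and store them as rows in SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entities "
        "(update_id INTEGER, kind TEXT, value TEXT)"
    )
    for update_id, text in enumerate(updates):
        conn.executemany(
            "INSERT INTO entities VALUES (?, ?, ?)",
            [(update_id, kind, value) for kind, value in extract_entities(text)],
        )
    conn.commit()


def top_entities(conn, kind):
    """Count how often each entity of the given kind appears, most frequent first."""
    return conn.execute(
        "SELECT value, COUNT(*) AS n FROM entities "
        "WHERE kind = ? GROUP BY value ORDER BY n DESC, value",
        (kind,),
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_updates(conn, ["Reading @gigaom on #data streams", "More #data from @dataspora"])
    print(top_entities(conn, "hashtag"))
```

Once the entities are in a table, ordinary SQL takes over: counting, grouping, and joining work the same way they would on data that was structured from the start.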