The Library of Congress set up a deal a few weeks ago to acquire Twitter's complete archive of public messages. It's not a particularly impressive number of bytes by itself, but it's a goldmine for computational analysis. And that academic potential is behind the government wanting to obtain what might seem like a vast cacophony of meaningless chatter.
In the WNYC Radiolab podcast released today, "Vanishing Words", Jad and Robert look at linguistic computation: specifically, the idea that you can identify and predict dementia by analyzing the words in someone's personal writing, say a collection of letters or diary entries. Or, if you're Agatha Christie, crime novels. If you've got a minute, let Jad Abumrad & Robert Krulwich tell you about this:
Picking up on Jad's mention of "the age of Twitter": online services like Twitter, Facebook, and Google are quite earnestly treating words as scientific data; it's a core part of staying competitive in their business. Computational language analysis is a fascinating field, and luckily it also seems to have a powerful economic incentive behind it.
Word data is probably still the easiest way to get highly personalized information directly from a person (e.g. a status update, a tweet). Facebook's data scientists, for example, work primarily to teach computer models to translate the words used in Facebook status updates into meaningful demographic data. The computers gather information and the scientists pick out interesting patterns so that better, more personalized advertising can be served. Better targeted ads translate to actual interest in ads, which translates to business.
Computational research and analysis (like the studies mentioned in this Radiolab podcast) are exploding commercially and academically, like a virtual internet gold rush. Supply is growing exponentially as hundreds of millions of people use online services to communicate publicly. Demand is blowing up too, because we're realizing, like these scientists discovering something deeply personal about Agatha Christie, just how much we can learn from a simple collection of words.
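To make "a simple collection of words" a little more concrete, here's a minimal sketch in Python of one very crude signal: vocabulary richness, the share of distinct words in a fixed-size sample of text. This is my own illustration and not the method from the studies in the podcast; the function name, the window size, and both sample sentences are made up for the example.

```python
import re

def vocabulary_richness(text, window=1000):
    """Share of distinct words among the first `window` words of a text.

    A fixed window is used because this ratio naturally drops as a text
    gets longer, so samples of different lengths aren't comparable.
    """
    words = re.findall(r"[a-z']+", text.lower())[:window]
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# Hypothetical samples standing in for an early and a late piece of writing.
early = "The detective studied the peculiar stain, weighed each alibi, and named the culprit."
late = "The thing was there, and the thing was odd, and then the thing was the thing."

print(round(vocabulary_richness(early), 2))
print(round(vocabulary_richness(late), 2))
```

The second sample scores lower because it keeps reusing the same few words. A real analysis would need far more text, more careful measures, and a lot of caution before reading anything into one number.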
It's exciting to consider how much we may be able to learn about ourselves using non-contextual information. Words unrelated to each other in everyday usage still form patterns unseen on a larger scale. Everything you do leaves a mark on the world, and soon we may be able to better understand our markings and appreciate our histories holistically.
I imagine the future like learning the answers to questions we never thought to ask.
Edit 5/11/10: Agatha Christie also wrote dozens of diary entries and notes about books that may have shown signs of dementia. (via @JadAbumrad "Agatha Christie's deranged notebooks (interesting to read after the latest @wnycradiolab podcast) - http://bit.ly/ar2smX")
Edit 5/14/10: For an interesting exemplar of Facebook linguistic data-mining, see their Gross National Happiness trend index. The study describing the methodology used is cited below the chart.