Context, Not Content, Is King


Back at OSCON, as I mentioned previously, Adrian gave a talk that I thought was one of the highlights of the show. It discussed in detail how programming and development could, almost by itself, be a form of journalism. The notion might seem a bit suspect, until you consider the examples.

Chicagocrime.org, as has been well documented, takes crime statistics published by the Chicago PD, adds structure and context and out pours information that’s usable in brand new ways. Ditto for Mixed Messages, Faces of the Fallen, and the Votes Database that he’s been working on over at the Post. Take some existing data, relate it to other data in a new way, and you (may) have something brand new. Folks with backgrounds in Business Intelligence / Reporting are probably nodding along at this point.

Whether this whole-is-greater-than-the-sum-of-its-parts development technically qualifies as journalism is a question I’ll leave to those with a serious interest in the philosophies behind the Fourth Estate. I’ll just say that it is for me, and move on.

What’s of greater interest to me is how important this trend is becoming, and how much work there is yet to do. Everyone has probably seen by now the quotes floating around from the Berkeley study documenting the information/content/data explosion; what has not been talked about adequately, IMO, is just how little of that information is captured in any machine readable or manipulable form.

Adrian discussed just this problem in his talk. His example was the reporting of a crime. The reporter might go to the police for the crime blotter information, and get some basic facts. Type of crime, location, time of day, etc. The reporter then may take that information, and impart it to a human audience via an article in the paper. But while that’s fine for humans, it’s less ideal for machines. Computers are not so adept at parsing prose to extract details, because there is no consistent structure from story to story, and thus no consistent way to attack the problem programmatically. Certainly there are approaches that can be used, but by definition they will be imperfect.
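To make the contrast concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `CrimeReport` fields, the sample record, the sample sentence and the keyword list are all made up, not taken from Adrian's talk or from any real blotter. It asks the same question ("what type of crime was this?") of a structured record and of a line of prose.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrimeReport:
    """Hypothetical structured form of one blotter entry."""
    crime_type: str
    location: str
    hour: int  # hour of day, 0-23


# The same (invented) incident, once as data and once as prose.
record = CrimeReport(crime_type="burglary", location="1200 N State St", hour=23)
story = ("Police say a home on the 1200 block of North State "
         "was broken into late Tuesday night.")


def crime_type_from_prose(text: str) -> Optional[str]:
    """Naive keyword match; returns None when no known keyword appears."""
    match = re.search(r"\b(burglary|robbery|assault)\b", text.lower())
    return match.group(1) if match else None


# The structured record answers the question directly.
structured_answer = record.crime_type

# The prose version depends on the reporter's word choice. This story says
# "broken into", so the keyword match finds nothing.
prose_answer = crime_type_from_prose(story)
```

The keyword list could of course be extended to catch "broken into", but the next story will phrase it some other way, which is the imperfection described above.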

On a more local note, consider the phenomenon of blogs. They excel at publishing continually updated information, but do not provide much in the way of structure. Even basic structural elements such as categories or the tags just introduced in MT 3.3 are approximations at best; halfway solutions to the problem of unstructured content. Search, in many cases, is used as a substitute for structure. And even search requires human intervention and interpretation, as the headlines on Google News occasionally remind us.

The implications of this lack of structure are predictably diverse. We’ve had numerous conversations with clients, for example, on the topic of separating usable content (wheat) from the less helpful (chaff) on a variety of support forums. One of those clients has even designed a product around this problem, using pattern matching and other techniques to try and capture value in the volumes of error messages that are and have been generated in unstructured formats.

But it also has regular, everyday implications, as the aforementioned Chicagocrime.org and housingmaps.com have reminded us. I, for example, am hugely frustrated by the fact that a wealth of data on fish location, size and so on is locked either in difficult to navigate threads on message boards – or worse – each individual’s brain. Why can’t I look at a map of my local area and see where/when people caught fish?
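As a rough sketch of what that could look like once the data is captured, assume each catch were recorded as a small structured record. The `Catch` fields, the spot names and the sample entries below are all hypothetical, invented for illustration only. With that in hand, the where/when question becomes a few lines of Python:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Catch:
    """Hypothetical record of a single reported catch."""
    species: str
    spot: str          # a named location; a real map would want coordinates
    caught_on: date
    length_in: float   # length in inches


# Invented sample data, standing in for what is locked in forum threads today.
catches = [
    Catch("striped bass", "River Mouth", date(2006, 6, 3), 28.0),
    Catch("striped bass", "River Mouth", date(2006, 6, 17), 31.5),
    Catch("bluefish", "North Cove", date(2006, 7, 8), 24.0),
]


def catches_by_spot_and_month(records):
    """Count reported catches per (spot, month) pair."""
    counts = Counter()
    for c in records:
        counts[(c.spot, c.caught_on.month)] += 1
    return counts


counts = catches_by_spot_and_month(catches)
```

Those per-spot, per-month counts are the sort of thing a map layer could plot. The hard part is not this code; it is getting the records entered in the first place.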

The problem is not, I don’t believe, what you might suspect: participation. You might think that people simply are either unwilling or too lazy to share their experiences with their fellow fishermen, but that doesn’t seem to be the case. Not everyone will, of course, and as Ian Frazier recounts in the excellent Fish’s Eye, there will always be those who derive pleasure from their ability to withhold or mislead people about where the fish are. But if the benefits are clear – better fishing and a sense of community – it’s fairly clear that people will share, and will participate. Just look at the forums here.

Instead, I believe that the problem – just as it was at the Washington Post, apparently – is the lack of an easy way to capture that information in structured form. But what if there were a means of doing that? I’d be interested to try and find out, and am going to try to devote some attention to this problem in the days ahead. But in the meantime, I think we all need to be more aware of the problem that the lack of structure poses, because I think therein lie some interesting opportunities – both for infrastructure, and for sites built on top of it.


  1. Hi Stephen,

    I’m the co-developer of the Post’s votes db with Adrian (I work for the paper itself in research), and I think you’re right on with this piece. For news-gathering organizations, I think the act of placing the information we gather in structured form has to take place as seamlessly as it can – meaning the reporter or editor should barely realize that it’s “extra work” for them to do so.

    Automation – a great thing for Adrian and me when it comes to the votes database – is the first choice, but some things can’t be automated. When that occurs, we ought to provide a way for information to be placed in a structured context by, for example, having the reporter/editor add such information as part of writing/editing a story.

    In addition, we have to start treating our archives as the sources of valuable data that they are. Many patterns we’ll miss on a daily basis, but if we build in a way to account for them over time, they’ll practically reveal themselves to us.

    Derek Willis

  2. Around about now Cote is jumping up and down yelling “microformats” – maybe an extension of the hReview microformat could be used to add a parseable structure to (e.g.) a blog entry?

  3. Derek: thanks for dropping by, and excellent work over at the post. you guys are doing some excellent work.

    anyhow, i agree on the automate-what-you-can / deal-with-what-you-can’t approach, although i think in some organizations it might present some problems. i say that because i used to implement content management solutions, and the input part was always, shall we say, interesting.

    also couldn’t agree more re: archives. as the lady says, much that should not have been forgotten was lost.

    Ric: great point, and i think Cote and i are indeed coming at the same problem from different ends. but while i think microformats are part of the eventual solution to this problem, they are not – IMO – the solution themselves. we need to consider input and the user experience / interface, which for microformats now i’d characterize as poor to quite poor.
