Back at OSCON, as I mentioned previously, Adrian gave a talk that I thought was one of the highlights of the show. It discussed in detail how programming and development could, almost by itself, be a form of journalism. The notion might seem a bit suspect, until you consider the examples.
Chicagocrime.org, as has been well documented, takes crime statistics published by the Chicago PD, adds structure and context and out pours information that’s usable in brand new ways. Ditto for Mixed Messages, Faces of the Fallen, and the Votes Database that he’s been working on over at the Post. Take some existing data, relate it to other data in a new way, and you (may) have something brand new. Folks with backgrounds in Business Intelligence / Reporting are probably nodding along at this point.
Whether this whole-is-greater-than-the-sum-of-its-parts development technically qualifies as journalism is a question I’ll leave to those with a serious interest in the philosophies behind the Fourth Estate. I’ll just say that it is for me, and move on.
What’s of greater interest to me is how important this trend is becoming, and how much work there is yet to do. Everyone has probably seen by now the quotes floating around from the Berkeley study documenting the information/content/data explosion; what has not been talked about adequately, IMO, is just how little of that information is captured in any machine-readable or otherwise manipulable form.
Adrian discussed just this problem in his talk. His example was the reporting of a crime. The reporter might go to the police for the crime blotter information, and get some basic facts. Type of crime, location, time of day, etc. The reporter then may take that information, and impart it to a human audience via an article in the paper. But while that’s fine for humans, it’s less ideal for machines. Computers are not so adept at parsing prose to extract details, because there is no consistent structure from story to story, and thus no consistent way to attack the problem programmatically. Certainly there are approaches that can be used, but by definition they will be imperfect.
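To make the contrast concrete, here is a minimal sketch of the same three incidents held once as structured records and once as prose. Everything in it is my own invention for illustration: the `CrimeReport` fields, the addresses and the story text are hypothetical, and this is not how Chicagocrime.org or the Post actually store their data.

```python
import re
from dataclasses import dataclass


@dataclass
class CrimeReport:
    """Hypothetical structured blotter entry (field names are assumptions)."""
    crime_type: str
    location: str
    hour: int  # hour of day, 0-23


# Made-up blotter entries, already structured.
reports = [
    CrimeReport("burglary", "1200 N State St", 23),
    CrimeReport("theft", "300 W Madison St", 14),
    CrimeReport("burglary", "800 S Halsted St", 2),
]

# The same three incidents as a reporter might write them up.
stories = [
    "A burglary was reported late Tuesday at 1200 N State St.",
    "Police say thieves struck a shop on Madison in the afternoon.",
    "Officers responded to 800 S Halsted St after a break-in overnight.",
]


def by_type(items, crime_type):
    """Exact lookup on a structured field."""
    return [r for r in items if r.crime_type == crime_type]


def prose_mentions(texts, word):
    """Naive keyword search over prose; misses synonyms such as 'break-in'."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]


structured_hits = by_type(reports, "burglary")
prose_hits = prose_mentions(stories, "burglary")

print(len(structured_hits), "burglaries found in the structured records")
print(len(prose_hits), "burglaries found by searching the prose")
```

The structured lookup finds both burglaries. The keyword search finds only the story that happens to use the word, which is the sort of imperfection I mean.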
On a more local note, consider the phenomenon of blogs. They excel at publishing continually updated information, but do not provide much in the way of structure. Even basic structural elements such as categories or the tags just introduced in MT 3.3 are approximations at best; halfway solutions to the problem of unstructured content. Search, in many cases, is used as a substitute for structure. And even search requires human intervention and interpretation, as the headlines on Google News occasionally remind us.
The implications of this lack of structure are predictably diverse. We’ve had numerous conversations with clients, for example, on the topic of separating usable content (wheat) from the less helpful (chaff) on a variety of support forums. One of those clients has even designed a product around this problem, using pattern matching and other techniques to try and capture value in the volumes of error messages that are and have been generated in unstructured formats.
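I don’t know how that client’s product works internally, so take the following only as a rough sketch of what pattern matching over unstructured error messages can look like. The sample messages, the error codes and the `ERROR_PATTERN` expression are all hypothetical.

```python
import re

# Made-up forum posts; only some contain a recognizable error.
messages = [
    "help!! getting ERROR 17: disk quota exceeded again",
    "Error 42 - connection refused after upgrade",
    "it just crashes, no idea why",
]

# Assumed convention: the word "error", a numeric code, then ':' or '-'
# followed by free text. Real messages will be far less regular.
ERROR_PATTERN = re.compile(
    r"error\s+(?P<code>\d+)\s*[:\-]\s*(?P<text>.+)", re.IGNORECASE
)


def extract_errors(lines):
    """Pull (code, text) pairs out of free-form lines; skip non-matches."""
    found = []
    for line in lines:
        match = ERROR_PATTERN.search(line)
        if match:
            found.append(
                {"code": int(match.group("code")),
                 "text": match.group("text").strip()}
            )
    return found


errors = extract_errors(messages)
for err in errors:
    print(err["code"], err["text"])
```

Two of the three posts yield a usable record. The third has nothing to match, and a human still has to read it.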
But it also has regular, everyday implications, as the aforementioned Chicagocrime.org and housingmaps.com have reminded us. I, for example, am hugely frustrated by the fact that a wealth of data on fish location, size and so on is locked either in difficult-to-navigate threads on message boards – or worse – in each individual’s brain. Why can’t I look at a map of my local area and see where and when people caught fish?
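As a thought experiment, here is roughly what a catch report might look like if it were captured as data rather than as a forum post. The `Catch` record, its fields, the coordinates and the bounding-box filter are all assumptions of mine, not an existing site or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Catch:
    """Hypothetical structured catch report (field names are assumptions)."""
    species: str
    length_cm: float
    lat: float
    lon: float
    caught_on: date


# Made-up reports with made-up coordinates.
catches = [
    Catch("striped bass", 71.0, 43.80, -69.90, date(2006, 6, 10)),
    Catch("bluefish", 55.0, 43.82, -69.85, date(2006, 8, 2)),
    Catch("striped bass", 64.0, 44.50, -68.00, date(2006, 6, 21)),
]


def catches_in_area(items, south, west, north, east, month=None):
    """Return catches inside a lat/lon bounding box, optionally for one month."""
    return [
        c for c in items
        if south <= c.lat <= north
        and west <= c.lon <= east
        and (month is None or c.caught_on.month == month)
    ]


# "What was caught near me in June?"
nearby_june = catches_in_area(catches, 43.7, -70.0, 43.9, -69.8, month=6)
for c in nearby_june:
    print(c.species, c.length_cm, c.caught_on.isoformat())
```

Once the where and the when are fields rather than sentences, putting the results on a map is the easy part.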
The problem is not, I don’t believe, what you might suspect: participation. You might think that people simply are either unwilling or too lazy to share their experiences with their fellow fishermen, but that doesn’t seem to be the case. Not everyone will, of course, and as Ian Frazier recounts in the excellent Fish’s Eye, there will always be those who derive pleasure from their ability to withhold information or mislead people about where the fish are. But if the benefits are clear – better fishing and a sense of community – it’s fairly clear that people will share, and will participate. Just look at the forums here.
Instead, I believe that the problem – just as it was at the Washington Post, apparently – is the lack of an easy way to capture that information in structured form. But what if there were a means of doing that? I’d be interested to find out, and am going to try to devote some attention to this problem in the days ahead. But in the meantime, I think we all need to be more aware of the problem that the lack of structure poses, because I think therein lie some interesting opportunities – both for infrastructure and for the sites built on top of it.