When we’re talking about data, we typically start with open data. What we mean, in general, is access and availability. The ability to discover and procure a given dataset with a minimum of friction. Think Data.gov or Data.gov.uk.
The next logical direction is commerce. And while this idea isn’t high profile at the moment, at least outside of data geek circles, it will be. Startups like Data Marketplace, Factual, and Infochimps perceive the same opportunity that SAP does with its Information On Demand or Microsoft, with Project Dallas.
What follows commerce? What’s the next logical step for data? Github, I think. Or something very like it.
In the open source world, forking used to be an option of last resort, a sort of “Break Glass in Case of Emergency” button for open source projects. What developers would do if all else failed. Github, however, and platforms with decentralized version control infrastructures such as Launchpad or, yes, Gitorius, actively encourage forking (coverage). They do so primarily by minimizing the logistical implications of creating and maintaining separate, differentiated codebases. The advantages of multiple codebases are similar to the advantages of mutation: they can dramatically accelerate the evolutionary process by parallelizing the development path.
The question to me is: why should data be treated any different than code? Apart from the fact that the source code management tools at work here weren’t built for data, I mean. The answer is that it shouldn’t.
Consider the dataset above from the US Census Department, hosted by Infochimps. Here’s the abstract, in case you can’t read it:
The Statistical Abstract files are distributed by the US Census Department as Microsoft Excel files. These files have data mixed with notes and references, multiple tables per sheet, and, worst of all, the table headers are not easily matched to their rows and columns.
A few files had extraneous characters in the title. These were corrected to be consistent. A few files have a sheet of crufty gibberish in the first slot. The sheet order was shuffled but no data were changed.
Translation: it’s useful data, but you’ll have to clean it up before you go to work on it.
What if, however, we had a Github-like “fork” button? Push button replication of dataset, that would presumably have a transform step applied with the resulting, modified dataset made as freely available. Rather than have to transform the dataset, then, I merely picked up one that someone else had already taken the time to clean.
What are the problems with this approach?
- Data Quality/Provenance:
Discussing this idea with a few data people, they expressed skepticism because the transform might negatively affect the dataset. False or inaccurate data could be created inadvertently, or worse maliciously, to create a damaged dataset from which the wrong conclusions could be drawn. But again, is this any different than code? Why should bad data be any easier to insert into a dataset than a virus into a codebase?
But for the cautious, each fork could presumably include a record of the specific transforms such that a user could replicate the steps and check the result against the fork for reassurance.
- Data Ownership:
Some data owners would undoubtedly be opposed to free public transformation of their datasets, particularly because it could pose pricing issues in commercial scenarios. Who gets paid if you’re not buying the original, commercially priced dataset but a cleansed version of same? Still, the solution to this is simple if less than ideal: disallow forking on select datasets.
With a great many datasets of manageable size, duplication via forking would be less of an issue than is commonly supposed. There is little question, however, that for large datasets (1 GB+), rampant forking could pose substantial infrastructure costs.
I’m sure there are others. Still, the future to me in this area seems clear: we’re going to see transformation of datasets incorporated into the marketplaces. As the demand for public data increases, the market will demand higher quality, easier to work with data. With that demand will come supply, one way or another. There’s little sense in having each individual consumer of the data replicate the same steps to make it usable. The question will be which one of the marketplaces learns from Github and its brethren first.
Expect collaborative development to beget collaborative analysis, in other words. Soon.