The Future of Open Data Looks Like…Github?

Census Data, From Infochimps

When we’re talking about data, we typically start with open data. What we mean, in general, is access and availability. The ability to discover and procure a given dataset with a minimum of friction. Think or

The next logical direction is commerce. And while this idea isn’t high profile at the moment, at least outside of data geek circles, it will be. Startups like Data Marketplace, Factual, and Infochimps perceive the same opportunity that SAP does with its Information On Demand, or that Microsoft does with Project Dallas.

What follows commerce? What’s the next logical step for data? Github, I think. Or something very like it.

In the open source world, forking used to be an option of last resort, a sort of “Break Glass in Case of Emergency” button for open source projects. What developers would do if all else failed. Github, however, and platforms with decentralized version control infrastructures such as Launchpad or, yes, Gitorious, actively encourage forking. They do so primarily by minimizing the logistical implications of creating and maintaining separate, differentiated codebases. The advantages of multiple codebases are similar to the advantages of mutation: they can dramatically accelerate the evolutionary process by parallelizing the development path.

The question to me is: why should data be treated any differently than code? Apart from the fact that the source code management tools at work here weren’t built for data, I mean. The answer is that it shouldn’t.

Consider the dataset above from the US Census Bureau, hosted by Infochimps. Here’s the abstract, in case you can’t read it:

The Statistical Abstract files are distributed by the US Census Department as Microsoft Excel files. These files have data mixed with notes and references, multiple tables per sheet, and, worst of all, the table headers are not easily matched to their rows and columns.

A few files had extraneous characters in the title. These were corrected to be consistent. A few files have a sheet of crufty gibberish in the first slot. The sheet order was shuffled but no data were changed.

Translation: it’s useful data, but you’ll have to clean it up before you go to work on it.

What if, however, we had a Github-like “fork” button? Push button replication of a dataset, presumably with a transform step applied and the resulting, modified dataset made just as freely available. Rather than having to transform the dataset myself, then, I could merely pick up one that someone else had already taken the time to clean.

What are the problems with this approach?

  • Data Quality/Provenance:
    When I discussed this idea with a few data people, they expressed skepticism because the transform might negatively affect the dataset. False or inaccurate data could be created inadvertently or, worse, maliciously, producing a damaged dataset from which the wrong conclusions could be drawn. But again, is this any different than code? Why should bad data be any easier to insert into a dataset than a virus into a codebase?

    But for the cautious, each fork could presumably include a record of the specific transforms such that a user could replicate the steps and check the result against the fork for reassurance.

  • Data Ownership:
    Some data owners would undoubtedly be opposed to free public transformation of their datasets, particularly because it could pose pricing issues in commercial scenarios. Who gets paid if you’re not buying the original, commercially priced dataset but a cleansed version of same? Still, the solution to this is simple if less than ideal: disallow forking on select datasets.

  • Storage:
    With a great many datasets of manageable size, duplication via forking would be less of an issue than is commonly supposed. There is little question, however, that for large datasets (1 GB+), rampant forking could pose substantial infrastructure costs.
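
To make the provenance point above a little more concrete, here is a minimal sketch of what a fork that carries its own transform record might look like. Everything in it is an assumption for illustration: the transform names, the record fields, and the toy rows are made up, and no marketplace is known to work this way. The idea is only that if the transforms are deterministic and recorded by name, anyone can replay them against the original and compare digests with the fork.

```python
import hashlib
import json


# Hypothetical, deterministic cleaning steps (illustrative only).
def strip_whitespace(rows):
    """Trim leading and trailing whitespace from every cell."""
    return [[cell.strip() for cell in row] for row in rows]


def drop_empty_rows(rows):
    """Remove rows in which every cell is an empty string."""
    return [row for row in rows if any(cell for cell in row)]


# Hypothetical registry mapping step names to functions.
TRANSFORMS = {
    "strip_whitespace": strip_whitespace,
    "drop_empty_rows": drop_empty_rows,
}


def fingerprint(rows):
    """Return a SHA-256 digest of the dataset's compact JSON form."""
    canonical = json.dumps(rows, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fork(rows, steps):
    """Apply the named steps in order; return the new rows and a provenance record."""
    result = rows
    for name in steps:
        result = TRANSFORMS[name](result)
    record = {
        "parent": fingerprint(rows),    # digest of the dataset that was forked
        "steps": list(steps),           # ordered names of the transforms applied
        "result": fingerprint(result),  # digest of the forked dataset
    }
    return result, record


def verify(original_rows, record):
    """Replay the recorded steps on the original and compare digests."""
    if fingerprint(original_rows) != record["parent"]:
        return False
    replayed, _ = fork(original_rows, record["steps"])
    return fingerprint(replayed) == record["result"]


# Toy rows, invented for this example.
raw = [[" Region ", " Count "], ["", ""], ["North ", " 10"]]
cleaned, record = fork(raw, ["strip_whitespace", "drop_empty_rows"])
print(cleaned)
print(verify(raw, record))
```

A check like this only shows that the fork is what the recorded steps produce. It says nothing about whether the steps themselves were sensible, which is still a judgment call for whoever picks up the fork.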

I’m sure there are others. Still, the future to me in this area seems clear: we’re going to see transformation of datasets incorporated into the marketplaces. As the demand for public data increases, the market will demand higher quality, easier to work with data. With that demand will come supply, one way or another. There’s little sense in having each individual consumer of the data replicate the same steps to make it usable. The question will be which one of the marketplaces learns from Github and its brethren first.

Expect collaborative development to beget collaborative analysis, in other words. Soon.

Categories: Data, Marketplaces.

  • Pingback: Links 5/5/2010: Collabora Joins GNOME Foundation, Red Hat Enterprise Linux 6 Tested | Techrights

  • Pingback: Reiterating the need for a data commons

  • Ho-Sheng Hsiao

    Github might be on your brain, but it is on mine too. There are tons of people talking about it outside of the open source community, and I talked a bit about some of them when writing about the book, “Intangible Assets, Hidden Liabilities”.

    I had been working with Chef, which lets you describe infrastructure as code. And I had exactly the experience that comes from forking code, though in that case it was infrastructure rather than data.

    It’s the same kind of trend with data. I think maybe Tim Berners-Lee’s vision of the semantic web will finally be paying off, especially with tools that let us process multiple versions of the same dataset and, as you said, a marketplace for them.

    Great article!

  • tyler

    Gridworks comes to mind, too.

  • Stephen Pascoe

    My gut feeling is that there are both strong similarities and differences between data and code. The similarities are drawing us into the idea that we can use all the marvelous shiny tools we use for software engineering, whereas the trick in making it happen will be to understand what the true differences are. For instance, from the article:

    “Why should bad data be any easier to insert into a dataset than a virus into a codebase?”

    Well, I would say it blatantly is. A virus has to do something to be a virus; dodgy data can be just noise. Similarly, there is an intuitive benchmark that makes code “correct”: does it run without crashing? Most datasets, by contrast, have no such quantifiable metrics. This suggests datasets will need machine-verifiable quality control measures to make forking feasible. A sort of test suite for data.

  • Michelle Greer

    Just like code, data is going to take all forms. Some of it will be free. Some of it is, and always will be, harder to collect and to certify as accurate, and will therefore always cost money.

    I can see all three models (freemium, open, and paid) applying to selling data and still working.

  • Pingback: Black Duck Blog » Blog Archive » Disruption, open source, and sustainable business models

  • Pingback: Bookmarks for May 14th through June 2nd


    Two years on, it’s great to see how far InfoChimps and Factual have come along. Along with services like MortarData for analyzing this data, we are definitely starting to see the future this article predicted.