Data science, Gangnam style

How can we enable viral data science? It’s all about taking advantage of network effects [writeup] as so many other disciplines and companies have done — in this case, the key is to catalyze collaboration [writeup] a la GitHub.

Two and a half years ago, my colleague Steve wrote “The Future of Open Data Looks Like…Github?” Today, the next step is finally real enough to see where the future lies — and it’s not just data that will look like GitHub, it’s the entire workflow behind data science. Witness the open-sourcing of EMC Chorus as OpenChorus at Strata NY, and the accompanying partnerships, including one with Kaggle’s platform for data-science crowdsourcing. We’re seeing the beginnings of bringing the collaboration models that have been vastly successful in open-source communities to data science.

Data science today looks something like this:

It’s a straightforward pipeline that runs from the initial data through the analysis (which you should treat just as rigorously as any other code [writeup]) to the output results and their visualization. This works great in a world without collaboration or sharing, where a data scientist works alone in a room with no need to involve, or present results to, anyone else. Unfortunately, the real world is nothing like this. We need to collaborate, we need to present results and their business implications, and we need to spend 80% of our time cleaning up the data, often working with data providers to do so.
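To make that pipeline concrete, here’s a toy sketch of the solo workflow in shell: raw data in, a cleaning step, an analysis step, a result out. All file names and the “analysis” itself are hypothetical stand-ins, not anyone’s real process.

```shell
# Toy solo data-science pipeline: data -> cleaning -> analysis -> result.
# Layout (data/, results/) and contents are purely illustrative.
mkdir -p data results
printf 'city,temp\nNYC,75\nSF,62\n' > data/raw.csv

# "Cleaning": drop the header and keep only well-formed rows. In practice
# this step is where the proverbial 80% of the time goes.
tail -n +2 data/raw.csv | grep -E '^[A-Za-z]+,[0-9]+$' > data/clean.csv

# "Analysis": compute the mean temperature. Treat this script with the
# same rigor as any other code -- version it, review it, test it.
awk -F, '{ s += $2; n++ } END { printf "mean_temp=%.1f\n", s/n }' \
    data/clean.csv > results/summary.txt
cat results/summary.txt
```

The point of the sketch is the shape, not the awk: each stage reads the previous stage’s output, so the whole thing is a linear, single-player process.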

The future looks like this: The entire workflow from data to analysis to result to visualization will be social and collaborative. Just as the future of ALM looks a lot like GitHub [writeup], data science will gain its own version of ALM as it simultaneously gains discipline from parallel universes like DevOps [writeup]. Data scientists will be able to fork the full workflow to experiment, then file pull requests to merge any changes back into the original analysis, whether it’s to the input data, the analytical code itself, or the formatting or annotation of the results and data visualizations. Every step of the way will enable discussion and commentary on global as well as specific details via both Facebook-style comments and targeted annotations (e.g. on the cell level for data, or pointing to specific data points in a visualization).
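The fork-and-merge mechanics described above already exist for code, so a local Git session can stand in for the future workflow. This is a sketch only: the repo layout is hypothetical, branching stands in for a platform-level fork, and a local merge stands in for a pull request. (The `-b main` flag assumes Git 2.28+.)

```shell
# Sketch: the fork/pull-request model applied to a repo that versions
# data and analysis code together. Everything here is illustrative.
git init -q -b main analysis-repo && cd analysis-repo
git config user.name 'Demo' && git config user.email 'demo@example.com'

mkdir -p data
printf 'x,y\n1,2\n' > data/input.csv
echo 'summary(read.csv("data/input.csv"))' > analyze.R
git add -A && git commit -qm 'Initial analysis'

# A collaborator "forks" (locally: branches) to experiment with the input
# data -- the same move works for the code or the visualization config.
git checkout -qb fix-input-data
printf '3,4\n' >> data/input.csv
git commit -qam 'Fix missing row in input data'

# ...and the change is merged back, the local analogue of a pull request.
git checkout -q main && git merge -q fix-input-data
git log --oneline
```

Because the data lives in the same history as the code, a reviewer sees exactly which rows changed alongside which analysis changed, in one merge.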

Below, you can see a high-level overview of collaborative data science. In bold is what’s being added or fixed in each step, while italics indicate a Git-style branch of changes in the analytical workflow.

If you’re familiar with how Git works, either standalone or via GitHub, this will not look strange to you. It’s the same kind of branch-based history you see when visualizing the paths of development in any distributed version-control system. Along the same lines as GitHub’s code review, you would be able to review not just the code but also the data and the results/visualizations.
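Reviewing data the way GitHub reviews code falls out of the same diff machinery, at least for line-oriented formats like CSV. A quick, self-contained sketch (throwaway repo, illustrative file names; `sed -i` as written assumes GNU sed):

```shell
# Sketch: a correction to one "cell" of a dataset shows up as an
# ordinary reviewable diff, just like a code change.
git init -q -b main data-review && cd data-review
git config user.name 'Demo' && git config user.email 'demo@example.com'

printf 'city,temp\nNYC,75\nSF,62\n' > observations.csv
git add observations.csv && git commit -qm 'Add observations'

# Fix a bad reading; the change is one line of the CSV.
sed -i 's/SF,62/SF,58/' observations.csv
git commit -qam 'Correct SF temperature'

# The review view: old value removed, new value added.
git diff HEAD~1 -- observations.csv
```

Cell-level annotation, as described above, would need tooling beyond plain Git, but the underlying history it would annotate looks just like this.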

Imagine integrating this with social tools like Yammer to create a truly collaborative environment for data science, where you can easily discuss, review, promote, and contribute to any of the data science happening in your company. When you see an analysis that seems promising but isn’t quite right, just click “Fork” to fix the problem, then file a pull request to get your changes merged back in.

This will enable a whole new level of capability as organizations migrate toward data-driven decision making.

For vendors, the opportunity to become the central hub of data science is there, just awaiting someone willing to seize the day.

Disclosure: Microsoft, which owns Yammer, is a client. GitHub has been a client. EMC, Kaggle, and Facebook are not clients.


Categories: big-data, community, data-science, distributed-development, open-source, social.

  • Chris Fisher

    Why so much reinventing of the wheel?

    • dberkholz

      I’m just saying what needs to happen and what I expect to happen, not necessarily how it will come to be. It’s not a reinvention so much as a customization of the wheel for a slightly different use case … think wagon wheels and tractor wheels.


    Great article – really enjoyed this write up as well as the links provided to the previous story by your colleague Steve.

    Definitely agree that this is where we are heading, given the lack of clean data. Rather than have in-house teams clean up the data, it makes sense to have outside teams clean it up and put it into a usable format for all to see.

    Using a tool like Git would allow historical oversight, much like Wikipedia does now.

    And just like Git, one wouldn’t have to worry about “damaged data” because you could see which data repo had the most followers. Social pressures would allow for self-monitoring.

    Not only could this company do this for open data, it could also do it for closed data inside of companies (like Yammer, etc.).

    Further, the beauty of this is that by outsourcing the data cleanup, the company gains a better idea of how to package the data in the future.

  • Pingback: Data Scientist or DevOps or Programmer « The World's Oldest Intern

  • Pingback: Big Data Quotes of the Week | What's The Big Data?

  • Pingback: This Month in Data Science