Data science, Gangnam style

How can we enable viral data science? It’s all about taking advantage of network effects [writeup] as so many other disciplines and companies have done — in this case, the key is to catalyze collaboration [writeup] a la GitHub.

Two and a half years ago, my colleague Steve wrote “The Future of Open Data Looks Like…Github?” Today, the next step is finally real enough to see where the future lies — and it’s not just data that will look like GitHub, it’s the entire workflow behind data science. Witness the open-sourcing of EMC Chorus as OpenChorus at Strata NY, and the accompanying partnerships, including one with Kaggle’s platform for data-science crowdsourcing. We’re seeing the beginnings of bringing the collaboration models that have been vastly successful in open-source communities to data science.

Data science today looks something like this:

It’s a straightforward process that runs from the initial data through the analysis (which you should treat just as rigorously as any other code [writeup]) to the output results and how they’re visualized. This works great in a world without collaboration or sharing, where a data scientist works alone in a room with no need to work with, or present results to, anyone at any point. Unfortunately, the real world is nothing like this. We need to collaborate, we need to present results and their business implications, and we need to spend 80% of our time cleaning up the data, often working with data providers to do so.

The future looks like this: The entire workflow from data to analysis to result to visualization will be social and collaborative. Just as the future of ALM looks a lot like GitHub [writeup], data science will gain its own version of ALM as it simultaneously gains discipline from parallel universes like DevOps [writeup]. Data scientists will be able to fork the full workflow to experiment, then file pull requests to merge any changes back into the original analysis, whether it’s to the input data, the analytical code itself, or the formatting or annotation of the results and data visualizations. Every step of the way will enable discussion and commentary on global as well as specific details via both Facebook-style comments and targeted annotations (e.g. on the cell level for data, or pointing to specific data points in a visualization).

Below, you can see a high-level overview of collaborative data science. In bold is what’s being added or fixed in each step, while italics indicate a Git-style branch of changes in the analytical workflow.

If you’re familiar with how Git works, either standalone or via GitHub, this will not look strange to you. It’s the same kind of branch-based history you see when visualizing the paths of development in any distributed version control system. Along the same lines as GitHub’s code review, you would be able to review not just the code but also the data and the results/visualizations.
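To make the fork-and-merge idea a little more concrete, here is a minimal Python sketch of what forking an analysis, fixing its input data, and merging the fix back might look like. Everything in it is hypothetical: `Workflow`, `PullRequest`, `fork`, `diff`, and `merge` are illustrative names invented for this example, not the API of GitHub, OpenChorus, Kaggle, or any other product, and the sales figures are made up.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Workflow:
    """Hypothetical unit of collaboration: data + analysis code + discussion."""
    name: str
    data: List[float]
    analysis: Callable[[List[float]], float]
    comments: List[str] = field(default_factory=list)
    parent: Optional["Workflow"] = None

    def fork(self, name: str) -> "Workflow":
        # A fork gets its own copy of the data so experiments stay isolated.
        return Workflow(name=name, data=copy.deepcopy(self.data),
                        analysis=self.analysis, parent=self)

    def run(self) -> float:
        return self.analysis(self.data)


@dataclass
class PullRequest:
    """Hypothetical request to merge a fork back into its parent workflow."""
    source: Workflow
    target: Workflow
    description: str

    def diff(self) -> Dict[str, bool]:
        # Report which parts of the workflow the fork changed.
        return {
            "data": self.source.data != self.target.data,
            "analysis": self.source.analysis is not self.target.analysis,
        }

    def merge(self) -> None:
        if self.source.parent is not self.target:
            raise ValueError("can only merge a fork back into its parent")
        self.target.data = copy.deepcopy(self.source.data)
        self.target.analysis = self.source.analysis
        self.target.comments.append(
            f"Merged '{self.source.name}': {self.description}")


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Made-up data: -999.0 stands in for a placeholder value that slipped through.
original = Workflow("quarterly-sales", [10.0, 12.0, -999.0, 11.0], mean)
before = original.run()

fix = original.fork("drop-placeholder-rows")
fix.data = [v for v in fix.data if v != -999.0]

pr = PullRequest(fix, original, "Drop the -999 placeholder rows")
changes = pr.diff()   # the fork touched the data but not the analysis
pr.merge()
after = original.run()

print(before, after)
```

In this toy version a merge simply overwrites the parent's data and analysis. A real system would need proper diffing, conflict handling, and review of the results and visualizations as well, which is exactly the gap this post is about.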

Imagine integrating this with social tools like Yammer to create a truly collaborative environment for data science where you can easily discuss, review, promote, and contribute to any of the data science happening in your company. When you see an analysis that seems promising but isn’t quite right, just click “Fork” to fix the problem, then file a pull request to get your changes merged back in.

This will enable a whole new level of capability in organizational migrations toward data-driven decision making.

For vendors, the opportunity to become the central hub of data science is there, just awaiting someone willing to seize the day.

Disclosure: Microsoft, which owns Yammer, is a client. GitHub has been a client. EMC, Kaggle, and Facebook are not clients.

