{"id":1105,"date":"2012-11-20T13:34:27","date_gmt":"2012-11-20T19:34:27","guid":{"rendered":"http:\/\/redmonk.com\/dberkholz\/?p=1105"},"modified":"2013-04-19T21:27:14","modified_gmt":"2013-04-20T02:27:14","slug":"data-science-gangnam-style","status":"publish","type":"post","link":"https:\/\/redmonk.com\/dberkholz\/2012\/11\/20\/data-science-gangnam-style\/","title":{"rendered":"Data science, Gangnam style"},"content":{"rendered":"<p><iframe loading=\"lazy\" src=\"http:\/\/www.youtube.com\/embed\/9bZkp7q19f0\" height=\"315\" width=\"560\" frameborder=\"0\"><\/iframe><\/p>\n<p>How can we enable viral data science? It&#8217;s all about taking advantage of network effects [<a href=\"http:\/\/redmonk.com\/cote\/2007\/09\/12\/web-20-is-people-its-people\/\">writeup<\/a>] as so many other disciplines and companies have done &#8212; in this case, the key is to catalyze collaboration\u00a0<a href=\"http:\/\/redmonk.com\/dberkholz\/2012\/07\/20\/catalyze-developer-adoption-by-lowering-your-activation-energy\/\">[writeup]<\/a>\u00a0a la GitHub.<\/p>\n<p>Two and a half years ago, my colleague Steve wrote &#8220;<a title=\"Permanent link to The Future of Open Data Looks Like\u2026Github?\" href=\"http:\/\/redmonk.com\/sogrady\/2010\/05\/04\/open-data-github\/\" rel=\"bookmark\" rev=\"post-3681\">The Future of Open Data Looks Like\u2026Github?<\/a>&#8221; Today, the next step is finally real enough to see where the future lies &#8212; and <strong>it&#8217;s not just data<\/strong> that will look like GitHub, <strong>it&#8217;s the entire workflow behind data science<\/strong>. 
Witness the open-sourcing of EMC Chorus as <a href=\"http:\/\/www.greenplum.com\/communities\/developer\/openchorus\">OpenChorus<\/a>\u00a0at <a href=\"http:\/\/strataconf.com\/stratany2012\">Strata NY<\/a>, and the accompanying <a href=\"http:\/\/www.greenplum.com\/blog\/topics\/data-science\/more-hands-than-our-own-greenplums-logan-lee-on-opening-chorus\">partnerships<\/a>,\u00a0including one\u00a0with Kaggle&#8217;s platform for data-science crowdsourcing. We&#8217;re seeing the beginnings of bringing the collaboration models that have been vastly successful in open-source communities to data science.<\/p>\n<p>Data science today looks something like this:<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/dberkholz-media.redmonk.com\/dberkholz\/files\/2012\/11\/data_science_workflow.png\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"1244\" data-permalink=\"https:\/\/redmonk.com\/dberkholz\/2012\/11\/20\/data-science-gangnam-style\/data_science_workflow\/\" data-orig-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2012\/11\/data_science_workflow.png\" data-orig-size=\"488,799\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"data_science_workflow\" data-image-description=\"\" data-medium-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2012\/11\/data_science_workflow-183x300.png\" data-large-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2012\/11\/data_science_workflow.png\" class=\"wp-image-1244 aligncenter\" title=\"data_science_workflow\" alt=\"\" src=\"http:\/\/dberkholz-media.redmonk.com\/dberkholz\/files\/2012\/11\/data_science_workflow.png\" width=\"205\" 
height=\"335\" \/><\/a><\/p>\n<p>It&#8217;s a straightforward process that carries from the initial data through the analysis (which you should treat just as rigorously as any other code [<a href=\"http:\/\/redmonk.com\/dberkholz\/2012\/11\/06\/what-can-data-scientists-learn-from-devops\/\">writeup<\/a>]) to the output results and how they&#8217;re visualized. This works great in a world without collaboration or sharing, where a data scientist works in a room alone with no need to work with, or present results to, others at any point. Unfortunately the real world is nothing like this. We need to <strong>collaborate<\/strong>, we need to <strong>present results<\/strong> and their business implications, and we need to spend <a href=\"http:\/\/www.information-management.com\/news\/data-scientists-driving-business-advantage-10023533-1.html\">80% of our time cleaning up the data<\/a> &#8212; often <strong>working with data providers<\/strong> to do so.<\/p>\n<p><strong>The future looks like this<\/strong>: The entire workflow from data to analysis to result to visualization will be <strong>social<\/strong> and <strong>collaborative<\/strong>. Just as\u00a0the future of ALM looks a lot like GitHub [<a href=\"http:\/\/redmonk.com\/dberkholz\/2012\/09\/04\/github-grows-closer-to-a-full-alm-toolchain\/\">writeup<\/a>], data science will gain its own version of ALM as it simultaneously gains discipline from parallel universes like DevOps [<a href=\"http:\/\/redmonk.com\/dberkholz\/2012\/11\/06\/what-can-data-scientists-learn-from-devops\/\">writeup<\/a>]. Data scientists will be able to <strong>fork<\/strong> the full workflow to experiment, then file pull requests to <strong>merge<\/strong> any changes back into the original analysis, whether it&#8217;s to the input data, the analytical code itself, or the formatting or annotation of the results and data visualizations. 
Every step of the way will enable <strong>discussion<\/strong> and commentary on global as well as specific details via both Facebook-style comments and <strong>targeted annotations<\/strong> (e.g. on the cell level for data, or pointing to specific data points in a visualization).<\/p>\n<p>Below, you can see a high-level overview of collaborative data science. In bold is what&#8217;s being added or fixed in each step, while italics indicate a Git-style branch of changes in the analytical workflow.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/dberkholz-media.redmonk.com\/dberkholz\/files\/2012\/11\/collaborative_data_science_workflow.png\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"1247\" data-permalink=\"https:\/\/redmonk.com\/dberkholz\/2012\/11\/20\/data-science-gangnam-style\/collaborative_data_science_workflow\/\" data-orig-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2012\/11\/collaborative_data_science_workflow.png\" data-orig-size=\"1087,1552\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"collaborative_data_science_workflow\" data-image-description=\"\" data-medium-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2012\/11\/collaborative_data_science_workflow-210x300.png\" data-large-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2012\/11\/collaborative_data_science_workflow-717x1024.png\" class=\"wp-image-1247 aligncenter\" title=\"collaborative_data_science_workflow\" alt=\"\" src=\"http:\/\/dberkholz-media.redmonk.com\/dberkholz\/files\/2012\/11\/collaborative_data_science_workflow-717x1024.png\" width=\"502\" height=\"717\" \/><\/a><\/p>\n<p>If 
you&#8217;re familiar with how Git works, either standalone or via GitHub, this will not look strange to you. It&#8217;s the same kind of branch-based history you see when visualizing the paths of development in any distributed version control system. Along the same lines as GitHub&#8217;s <a href=\"https:\/\/github.com\/features\/projects\/codereview\">code review<\/a>, you would be able to review not just the code but also the data and the results\/visualizations.<\/p>\n<p><strong>Imagine integrating this with social tools<\/strong> like Yammer to create a truly collaborative environment for data science where you can easily <em>discuss<\/em>, <em>review<\/em>,\u00a0<em>promote<\/em>, and <em>contribute to<\/em> any of the data science happening in your company. When you see an analysis that seems promising but isn&#8217;t quite right, <strong>just click &#8220;Fork&#8221;<\/strong> to fix the problem, then <strong>file a pull request<\/strong> to get your changes merged back in.<\/p>\n<p>This will enable a whole new level of capability in organizational migrations toward <strong>data-driven decision making<\/strong>.<\/p>\n<p>For vendors, the opportunity to become the central hub of data science is there, just awaiting someone willing to seize the day.<\/p>\n<p><span style=\"color: #999999;\"><em><strong>Disclosure<\/strong>: Microsoft, which owns Yammer, is a client. GitHub has been a client. 
EMC, Kaggle, and Facebook are not clients.<\/em><\/span><\/p>\n<div class=\"acc_license\"><a href=\"http:\/\/creativecommons.org\/licenses\/by-sa\/3.0\/\"><img decoding=\"async\" src=\"http:\/\/i.creativecommons.org\/l\/by-sa\/3.0\/88x31.png\" alt=\"by-sa\" \/><\/a><\/div><!--<rdf:RDF xmlns=\"http:\/\/creativecommons.org\/ns#\" xmlns:dc=\"http:\/\/purl.org\/dc\/elements\/1.1\/\" xmlns:rdf=\"http:\/\/www.w3.org\/1999\/02\/22-rdf-syntax-ns#\"><Work rdf:about=\"\"><license rdf:resource=\"http:\/\/creativecommons.org\/licenses\/by-sa\/3.0\/\" \/><\/Work><License rdf:about=\"http:\/\/creativecommons.org\/licenses\/by-sa\/3.0\/\"><requires rdf:resource=\"http:\/\/creativecommons.org\/ns#Attribution\" \/><permits rdf:resource=\"http:\/\/creativecommons.org\/ns#Reproduction\" \/><permits rdf:resource=\"http:\/\/creativecommons.org\/ns#Distribution\" \/><permits rdf:resource=\"http:\/\/creativecommons.org\/ns#DerivativeWorks\" \/><requires rdf:resource=\"http:\/\/creativecommons.org\/ns#ShareAlike\" \/><requires rdf:resource=\"http:\/\/creativecommons.org\/ns#Notice\" \/><\/License><\/rdf:RDF>-->","protected":false},"excerpt":{"rendered":"<p>How can we enable viral data science? It&#8217;s all about taking advantage of network effects [writeup] as so many other disciplines and companies have done &#8212; in this case, the key is to catalyze collaboration\u00a0[writeup]\u00a0a la GitHub. 
Two and a half years ago, my colleague Steve wrote &#8220;The Future of Open Data Looks Like\u2026Github?&#8221; Today,<\/p>\n","protected":false},"author":6,"featured_media":1247,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false},"categories":[5,18,7,20,13,22],"tags":[],"class_list":["post-1105","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-big-data","category-community","category-data-science","category-distributed-development","category-open-source","category-social"],"jetpack_featured_media_url":"https:\/\/redmonk.com\/dberkholz\/files\/2012\/11\/collaborative_data_science_workflow.png","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p23Tsn-hP","_links":{"self":[{"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/posts\/1105","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/comments?post=1105"}],"version-history":[{"count":0,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/posts\/1105\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/media\/1247"}],"wp:attachment":[{"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/media?parent=1105"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/categories?post=1105"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/tags?post=1105"}],"curies":[{"na
me":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}