I was at Strata NY the week before last, and fortunately I got out just in time to beat Sandy. I’ve been thinking at Strata and since about how the relatively new discipline of data science could learn from the gradually maturing concept of DevOps, which seems to be about 3-5 years ahead of data science. In my experience, many data scientists resemble the ops side of the DevOps equation. They devote a great deal of effort to the statistical analysis without backing it up with solid software-engineering techniques, in the same way as many ops need to be led to the joys of maintainable, reproducible, collaborative approaches to infrastructure. So how could we create a culture around what I’ll call Devalytics, for lack of a better term?
Build a culture of “Analysis as code”
In the same way the DevOps mantra is “Infrastructure as code,” today’s data scientists need to think of all their scripts as actual software that will require ongoing maintenance, enhancement, and support. To paint with a broad brush, there is no such thing as a one-off script. As soon as anyone else has access to it, or if it even sticks around on your local filesystem, it will almost inevitably be reused and applied to different situations in the future.
Rather than continuing to pretend analysis is a one-time, ad hoc action, automate it. In DevOps, the goal is to avoid logging into command prompts on individual servers, because doing so greatly increases effort and decreases maintainability and reproducibility. As soon as something is automated, you save a huge amount of time by no longer repeating identical steps over and over. Granted, you then need to maintain the automation machinery, but a cost-benefit analysis will show that the effort rapidly pays off, particularly for complex actions such as analysis that are nontrivial to get right.
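As a concrete sketch of the difference between typing commands at a prompt and capturing them as code, here's a minimal parameterized analysis script in Python. The CSV layout (a single `value` column), the function names, and the demo data are all hypothetical stand-ins for a real analysis:

```python
#!/usr/bin/env python
"""Sketch of "analysis as code": the same computation every run,
captured in a script instead of retyped at a prompt."""
import argparse
import csv
import statistics


def load_measurements(path):
    """Read the numeric 'value' column from a CSV file."""
    with open(path, newline="") as f:
        return [float(row["value"]) for row in csv.DictReader(f)]


def summarize(values):
    """The analysis itself: summary statistics computed identically every run."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def main(argv=None):
    parser = argparse.ArgumentParser(description="Reproducible summary analysis")
    parser.add_argument("input_csv", nargs="?",
                        help="CSV file with a 'value' column (demo data if omitted)")
    args = parser.parse_args(argv)
    values = (load_measurements(args.input_csv) if args.input_csv
              else [1.0, 2.0, 3.0])  # built-in demo data
    for key, val in sorted(summarize(values).items()):
        print(f"{key}\t{val}")


if __name__ == "__main__":
    main()
```

Because the steps live in functions rather than shell history, the next person (or the next you) can rerun, extend, or test the analysis rather than reconstructing it.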
Teach software engineering to data scientists
Software engineering is not just writing code. Efficient automation requires that you apply modern software-development methods to your analysis. Many of today's data scientists come from backgrounds in statistics or other hard sciences (physics and biology are surprisingly common). They may have learned how to code, but they never learned modern development techniques such as continuous integration, collaborative development tools such as real-time chat and mailing lists, or even modern (i.e., fast, distributed) version control such as Git, popularized by hosts like GitHub.
- Test your code and your data. Using continuous integration and unit testing enables you to constantly know whether your code meets the standards needed for a successful analysis. Data scientists often simply spot-check a few results, or possibly verify the final data by hand or with a script, but rarely automate tests for either parts of the code or the output data itself. The value of appropriate control data sets and analyses is vast, yet this is relatively uncommon even in good software engineering. Data scientists have the opportunity to bring the best of both worlds, the science and the programming, together — but far too often, it’s instead the worst of each.
- Use version control. The benefits of maintaining code in version control are numerous, from the ability to look at what changed over time, to easier discovery of bugs, to enabling others to work on the code simultaneously. And yet, many people's idea of version control is .bak files, with dates appended if you're lucky. Even if you're creating pipelines with visual programming, dragging building blocks around in a GUI, nothing prevents version-controlling the pipeline definitions on the backend, and being able to see changes over time is critical.
- Catalyze collaboration. When working with others, the toolset in use has a major impact upon the success and pace of progress. The best practice in leading-edge companies like GitHub today is to work asynchronously (Monktoberfest video), interrupting others only when they are willing to be interrupted, rather than when you want to interrupt them. This requires tools for real-time chat (whether it looks more like IRC, Salesforce Chatter, or IBM Connections); long-format discussion and decision-making, where the best entrants are options like Google Groups and StackExchange, while the old standby is mailing lists; and issue tracking, such as GitHub Issues or Atlassian JIRA. In essence, the goal is to bring the types of collaboration tools that have been popularized by open-source software into other styles of social business.
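To make the testing point above concrete, here's a minimal sketch in the style of a pytest module. The `normalize` function and its invariants are illustrative stand-ins for a real analysis step; the point is that both the code path (against a control data set) and the output data itself get automated checks:

```python
"""Sketch: unit-testing both the analysis code and the data it produces."""


def normalize(values):
    """Analysis step under test: scale values to sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize an all-zero series")
    return [v / total for v in values]


def test_code_on_control_input():
    # A control data set with a known, hand-computable answer.
    assert normalize([2, 2]) == [0.5, 0.5]


def test_output_data_invariants():
    # Checks on the *output data* itself, not just the code path.
    result = normalize([1, 2, 3, 4])
    assert abs(sum(result) - 1.0) < 1e-9  # proportions sum to 1
    assert all(0.0 <= v <= 1.0 for v in result)  # no out-of-range values


if __name__ == "__main__":
    test_code_on_control_input()
    test_output_data_invariants()
    print("all checks passed")
```

Run under continuous integration, tests like these tell you on every commit whether the analysis still meets its standards, instead of relying on an occasional manual spot-check.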
Apply agile development and continuous delivery
I’ve experienced, time and time again, the significant benefits you accrue by developing iteratively. Leading-edge version control such as Git (in combination with GitHub) encourages agile development by making it incredibly fast and easy to perform many small commits instead of huge, monolithic ones. This greatly eases testing and debugging, even making much of it automatable using tools like git-bisect.
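A sketch of what that automation can look like: a small check script that `git bisect run` can use to binary-search your small commits for the one that broke a result. The `analysis` function, the control value, and the filename `check_analysis.py` are all hypothetical placeholders for your real pipeline:

```python
#!/usr/bin/env python
"""Sketch of a check script for `git bisect run`: exit 0 when the analysis
still matches a known-good control value, exit 1 when it doesn't."""
import sys


def analysis(values):
    """Stand-in for the real analysis step being bisected."""
    return sum(values) / len(values)


def check():
    # Known-good control input and expected result, recorded before the break.
    expected = 2.0
    actual = analysis([1.0, 2.0, 3.0])
    return abs(actual - expected) < 1e-9


# `git bisect run` reads the script's exit status as the verdict:
# 0 marks the commit good, 1-124 mark it bad.
if __name__ == "__main__":
    sys.exit(0 if check() else 1)
```

With small commits, a sequence like `git bisect start`, `git bisect bad HEAD`, `git bisect good <last-known-good>`, then `git bisect run python check_analysis.py` pinpoints the offending change automatically in a logarithmic number of steps.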
Continuous delivery is a method of bringing these small commits all the way to production services in a frequent, iterative fashion, rather than combining a series of small commits into a daily or weekly push to production. Etsy, for example, deploys to production 30 times a day (Monktoberfest video). The equivalent of continuous delivery in Devalytics is always ensuring a production-ready analysis is available on a rapid, regular basis — however incomplete it may be in terms of features. You can add features over time in an agile fashion, which prevents waterfall-style failures where nothing is ever ready for production.
Keep scaling in mind, but don’t optimize prematurely
When you need Big Data solutions, take advantage of the above methods, including lower-level DevOps techniques like configuration management for the underlying machines (virtual or bare metal). This makes it much easier to scale both the data and the algorithms. Scalability is one of the big selling points for Revolution Analytics, which parallelizes many of the core algorithms in R.
Although you should architect code with the potential for scaling later on, it often doesn’t make sense to actually incorporate scalability if you don’t anticipate any need for it. As Donald Knuth has said, “Premature optimization is the root of all evil.” And yet, the key word in that statement is premature — some percentage of the time, you actually will need to optimize. Just don’t do it when you have no need for it; that’s wasted time and effort.
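One way to keep optimization honest is to measure before you touch anything. Here's a minimal sketch using Python's built-in profiler, with `slow_step` as a deliberately naive stand-in for a real bottleneck; the profile, not intuition, tells you whether it's worth optimizing or parallelizing:

```python
"""Sketch: profile the analysis first, then optimize only the measured hot spots."""
import cProfile
import io
import pstats


def slow_step(n):
    # Deliberately naive quadratic loop standing in for a real hot spot.
    return sum(abs(i - j) for i in range(n) for j in range(n))


def run_analysis():
    return slow_step(200)


profiler = cProfile.Profile()
profiler.enable()
result = run_analysis()
profiler.disable()

# Summarize the top functions by cumulative time.
stats = io.StringIO()
pstats.Stats(profiler, stream=stats).sort_stats("cumulative").print_stats(5)
```

If `slow_step` dominates the profile, it's a candidate for optimization; if it doesn't, rewriting it for scale is exactly the premature effort Knuth warned about.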
Monitor the output
As a data scientist, you should be familiar with the concept that the unusual results — the anomalies — often comprise the most interesting data. They tend to drive many of the follow-up questions that result in truly unexpected discoveries rather than simply confirming a prediction.
Besides anomalies, the other value that monitoring provides is a view of the trends over time, which you can integrate into a live dashboard with the results of various algorithms for predictive analytics. Of course these analytics are maintained in version control and brought to production using continuous delivery, just like everything else you’re now doing correctly.
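As a sketch of the anomaly side of monitoring, here's a simple z-score rule in Python. The metric values, the threshold, and the rule itself are illustrative, not a recommendation for production alerting:

```python
"""Sketch: flag anomalies in a monitored metric with a simple z-score rule."""
import statistics


def anomalies(series, threshold=2.5):
    """Return (index, value) pairs more than `threshold` standard
    deviations from the mean of the series."""
    mean = statistics.mean(series)
    stdev = statistics.stdev(series)
    if stdev == 0:
        return []
    return [(i, v) for i, v in enumerate(series)
            if abs(v - mean) / stdev > threshold]


# A mostly steady metric with one spike worth following up on.
metric = [10, 11, 10, 12, 11, 10, 95, 11, 10, 12]
print(anomalies(metric))  # → [(6, 95)]
```

In a real setup, a rule like this would feed an alerting system or dashboard, and the flagged points become the follow-up questions that lead somewhere unexpected.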
People like Jason Dixon see the future of monitoring as composable open-source components. This provides an opportunity for data science to integrate into monitoring as another building block for advanced machine learning and predictive analytics.
The basic tenets of “Devalytics” are making your analytical work easy to replicate, build upon, and scale, while saving significant amounts of time in the process. Applying the above lessons will transform your one-off statistics or machine-learning runs into live scientific metrics that can provide significant and ongoing value.
Disclosure: Salesforce.com, IBM, and Atlassian are clients. GitHub has been a client. Revolution Analytics and Twitter are not clients (although they should be), and neither is Etsy.