TL;DR: Cloud native approaches are transforming data science. They are the logical next step for making machine learning and AI available at massive scale.
The dirty, and not so little, secret of data science is the amount of time practitioners spend on general data hygiene. Be it initial clean-up, reformatting, ongoing maintenance or another area, we frequently see estimates stating that 40 to 70% of a data scientist's time is spent on data hygiene. The lower bound of 40% seems far more reasonable, but whichever way you interpret the figures it is a significant amount of time.
The second secret, less known to outsiders but all too well understood by practitioners, is the amount of time spent maintaining custom tools, infrastructure and so forth. From custom-developed in-house tools to software from various vendors, people spend an inordinate amount of time on this area. As we have noted previously, big data infrastructure is little more than dial tone. It is essential for doing business, but there is almost zero differentiation possible with it. How efficiently your organisation, be it data scientists or other team members, can use the infrastructure is what really matters.
Both factors make for a lot of wasted time for practitioners, and are contributing, in no small way, to the growing sense of frustration many organisations have with their return on big data investments. At the same time, we are seeing a growing set of demands and expectations for rapidly iterating products using real time data and analytics.
When we talk about real time analytics, we need to look at the world through two prisms – that of the traditional business user and that of consumers. In the world of the traditional business user there is still a grudging acceptance that getting actionable data and insights can take time – although this acceptance is quickly disappearing. Consumers, however, will not wait for your report to run. They expect rapidly evolving and customisable experiences matching their current needs – and if you don’t plan on providing that for them, someone else will. The customer experience matters, and the expectations of business users have shifted: they now expect the same of their enterprise apps as they do of their consumer ones.
Packaging, Packaging, Packaging
All of this brings us to packaging. From initial experimentation to running production-grade data products, the ease of both initial setup and day two administration matters immensely. Cloud native approaches are addressing both needs.
The key to data science and, subsequently, machine learning is experimentation, and the initial experimentation phase is something people like to complete in a quick and iterative fashion.
There are many ways of going through this initial experimentation phase, but one where we have noted a massive uptick over the last six months is the use of AWS Athena. Now many will argue that a technology such as Athena is little more than Presto and S3 combined and dressed up – that is exactly the point. S3 has become the de facto storage service for many companies. It is simple to set up, simple to use, and there is no administrative headache. Providing an easy mechanism to experiment with data on top of S3 makes sense.
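To make the "query the data where it already sits" idea concrete, here is a minimal sketch of submitting a SQL query to Athena from Python. It assumes the boto3 SDK and valid AWS credentials for the actual submission; the database, table, and bucket names are hypothetical placeholders, not anything taken from a real deployment.

```python
from typing import Any, Dict


def build_athena_request(sql: str, database: str, output_s3_uri: str) -> Dict[str, Any]:
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    if not output_s3_uri.startswith("s3://"):
        raise ValueError("Athena writes results to S3, so the output must be an s3:// URI")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3_uri},
    }


def run_query(request: Dict[str, Any]) -> str:
    """Submit the query and return its execution id (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without AWS access

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names: the database, table, and bucket are placeholders.
request = build_athena_request(
    sql="SELECT event_type, COUNT(*) AS n FROM clickstream GROUP BY event_type",
    database="experiments",
    output_s3_uri="s3://example-bucket/athena-results/",
)
```

Calling `run_query(request)` would hand the query to Athena, which writes its results back to the S3 location given. There is no cluster to stand up or tear down, which is much of the appeal for quick experiments.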
We see similar types of experimentation occurring with Google Cloud Dataproc, where people begin with a small migration of an existing Hadoop workload and iterate very rapidly on top of it, scaling the experiments as needed.
As we move on to more in-depth experiments, the software requirements become somewhat more complex, but the goals of rapid experimentation remain the same. At the Cloud Native Computing Foundation summit in Berlin earlier this year, Vicki Cheung of the OpenAI institute summed this up perfectly:
“research ideas come and go, and we do not want to invest a lot of time into engineering something that might not make the cut” – Vicki Cheung
Essentially these cases come down to treating data infrastructure as a product, which can be scaled up and down as needed, and which gives data scientists access to the toolkit they wish to use without having to worry about what sits underneath. The OpenAI example is particularly telling – in their case they are leveraging Kubernetes as a base building block: models are bundled up as containers, then scaled out and executed.
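As a rough illustration of what "bundle the model as a container, then scale it out" can look like on Kubernetes, the sketch below builds a `batch/v1` Job manifest in Python. This is not OpenAI's actual setup, which the talk is only summarised here; the job name and image registry are hypothetical, and a plain Job with a `parallelism` setting is just one of several ways to fan out work.

```python
import json
from typing import Any, Dict


def model_job_manifest(name: str, image: str, workers: int) -> Dict[str, Any]:
    """Describe a Kubernetes batch Job that runs `workers` copies of a model container."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "parallelism": workers,  # pods allowed to run at the same time
            "completions": workers,  # successful pods needed for the Job to finish
            "template": {
                "spec": {
                    "containers": [{"name": "model", "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Hypothetical image and job names, used purely for illustration.
IMAGE = "registry.example.com/sentiment:0.1"
small = model_job_manifest("sentiment-experiment", IMAGE, workers=2)
large = model_job_manifest("sentiment-experiment", IMAGE, workers=20)
manifest_json = json.dumps(large, indent=2)
```

The only difference between the small and the large run is a single number. The JSON in `manifest_json` could be saved to a file and submitted with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.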
If you begin to combine the approach that organisations like the OpenAI institute are talking about with emerging projects such as bioboxes, which focuses on creating interchangeable bioinformatics software containers, you can see the attraction for data scientists.
End of Days for Hadoop?
No, far from it – there are many companies that have made significant investments in Hadoop expertise, and are seeing material gains.
But there are serious difficulties. Standing up and maintaining Hadoop infrastructure is not for the faint-hearted. As my colleague, James Governor, recently noted, “that giant sucking sound is the sound of Hadoop being sucked into the cloud”. To paraphrase the “infrastructure as a product” mantra that we heard from the team at OpenAI, your big data infrastructure needs to be self-service and easily extendable. We have noted in the past that your big data infrastructure is just dial tone. The ability of your team to use it efficiently is far more important.
Data scientists are already bypassing stagnant infrastructure for their experimentation. If frustration grows, they will bypass the same infrastructure for their production needs as well.
Disclaimer: Amazon and Google are current RedMonk clients.