It’s a truism today that we all need to become data-driven businesses – data is the new natural resource – and so on. But nobody seems to know what that means. Tim O’Reilly was talking about the trend, Data is the new Intel Inside, as long ago as 2005. RedMonk attached itself fairly early – Welcome To The Age of Data – trying to understand the challenges facing a world where we have an incredible richness of tools to make more effective use of data than ever before, but a lack of good skills and processes to use them. Some of the same batch of trends that drove developer-led technology adoption create entirely new possibilities in data. You want super cheap hosted storage and compute, at web scale, payable by the hour? Just provide your credit card details. The cost of storage keeps cratering, and because everything is now networked, telemetry becomes super easy. When everything calls home you always know what time it is. And the tooling though – it’s a golden age for data tooling driven by open source, with non-traditional market entrants creating tools they’re happy to open source as a by-product of their operations. See Hadoop, TensorFlow, Kafka etc. Applied science is pushing us forward, and the revolution is being open sourced.
What is data worth? In my 2010 predictions I expected to see “datasets increasingly recognized as a serious, balance sheet-worthy asset”. I was a bit early there. Data is clearly still not a well understood or significant investment category – brand “goodwill” is better accounted for – but there is no doubt that markets give companies perceived to be data rich higher valuations than other companies. Data is a moat. IBM acquired the Weather Company for around $2bn according to the Financial Times, and promptly put CEO David Kenny in charge of a swathe of its Watson and Cloud units. Uber now has a business selling data to companies including Starwood, and is leveraging data to make deals with public sector organisations such as the city of Boston.
Data is Hard
But taking advantage of data is hard – requiring entirely new skill sets. Valuing it is hard. Cleaning it is hard. Querying it is hard. Managing and maintaining it is hard. Most companies are not ready for data science; they actually just want technologies like Hadoop to look like the last 30 years of IT. How many companies are ready for a world where SPSS and SAS are being to some extent superseded by the likes of R? Python is very hot right now.
I would argue the current malaise in the Hadoop ecosystem, which led Cloudera to IPO at a price significantly down on its last round led by Intel, is largely because most organisations are just not properly set up to take advantage of new approaches and tools. HSBC spent tens of millions of dollars building its own Hadoop cluster for anti-money laundering, before deprecating the approach in favour of using Google Cloud Platform tools to the same end. But was Hadoop actually the problem? Cloud platforms continue to be the strongest competition for commercial open source – AWS Cloud is absolutely crushing it in the data space – that much is clear.
A couple of weeks ago the DataWorks Summit in Munich, run by Hortonworks, crystallised some of these ideas for me, largely driven by a great presentation by Pierre Etienne Bardin, of the French bank Société Générale. The insight that really struck me was:
“Data transformation is like digital transformation, it impacts everyone.”
But what does that look like in practice? Unlike in software development, where we long ago codified, for example, agile and pair programming, and we have frameworks such as 12 Factor to inspire us in building distributed apps, there are no widely agreed equivalents in data. In software development Pivotal has done a phenomenal job of owning the very idea of Digital Transformation, as the Pivotal Way has become the model for enterprises that want to develop software more like startups and Cloud natives do. As I wrote last year:
“Pivotal Labs was originally a consulting shop, acquired by VMware, which had built a reputation improving already cutting edge engineering operations at companies including Google and Twitter, with a focus on pair programming and a set of practices called the Pivotal Way. Enterprises are buying into this approach. It is the spearhead for most customer engagement. Today the companies adopting Pivotal Cloud Foundry generally start with services engagement. Frankly there aren’t many services firms in the world that truly understand how to teach agile, devops, and CI/CD.”
There is however no “Pivotal of Data”, which means there is a market opportunity for someone to codify the transformation in terms of methods, processes, and tooling selection (heterogeneity of awesome tooling – life comes at you fast – is a blessing and a curse for enterprises). There are organisations and leaders worth learning from, though, such as Société Générale.
Data Science and Culture Lessons
Where did the Société Générale journey begin? As with so many great projects today, with a logo – see BigDataHub above.
Organisations need to understand they’re not going to get everything right in one big bang.
“How did we get here? We failed. But we failed fast. If you fail you have learned a lot about your data. Gather every outcome of a failure to accelerate the next use case you work on.”
That’s the scientific method applied to data, right there. The BigDataHub began with a very specific project to make an immediate impact on the business – focusing on International Financial Reporting Standard (IFRS) 9.2 regulatory compliance.
“We had 3 months to prove we could analyse millions of records, and execute all controls in less than 3 hours”
The complexities of the coding involved meant the bank brought in some outside expertise: according to Bardin, one of Hortonworks’ best Scala/Spark developers. The IFRS project was successful enough that the BigDataHub used it to codify an approach for communicating effectively with the business.
- Inspire – a full day session, a forum, so the business can come and meet and discuss problems with the data engineers, identifying ideas worth taking forward.
- Experiment – run a dedicated workshop around the specific idea and begin to put some big data within their business process
- Industrialise – put the experiment into operation, with all production data, and analytics with a user friendly interface
The last bit is super important. Traditional “industrialisation” doesn’t always concern itself with the UI, but
“You need a more user friendly interface because that’s how we interact with the business.”
The user interface issue also plays strongly into how the BigDataHub puts its teams together.
- data engineers – build the pipeline, so the data is industrialised
- data scientists – work with the data, creating insights
- data visualisation specialists – to help the business tell the stories in the data
These are 3 different roles, and Bardin advised against mixing them up. For industrialisation the team also has architects for security, and compliance people to sign off on processes. Moving from proof of concept to industrialisation is a real challenge, particularly around packaging the approach so it can be deployed in fully trusted mode by banking IT. BigDataHub borrows heavily from modern cloud software development disciplines, which Bardin describes thus:
“We try to work full agile. The product owner works for the business. The goal is to have one colocated team. Bring all the skills together, in agile mode, with 2 week sprints.”
From a sourcing perspective the code is 99% open source, with a heavy reliance on the Hortonworks Hadoop stack, including Spark and Apache Kafka, along with open source data science libraries. One issue is keeping up with the pace of tooling innovation, as described above.
“The open source community is so active it creates new tech on a very frequent basis where traditionally we had tech that would last 5 years. We want to leverage as much as we can what the community is making available. The key word for us is adaptation. The team needs to be as adaptive as the world is changing around us.”
Bardin said Société Générale also needed fundamentally new approaches to training and recruitment. The business needs to understand that on recruitment you have to go to where the experts are.
“You need to go to meetups. You need to contribute. You can’t wait for them to come to you. They won’t come unless you’re Google or Facebook.”
The bank invests heavily in training its own people, given it has 150,000 employees, but the challenge is retention:
“After 3 years people chase them, so you have to work. The more they become experts the more they are chased by outside.”
Bardin said the key to hiring and retention was keeping the challenges interesting, so the practitioners had a stake in success and a pipeline of interesting challenges to take on that matched their skillsets. To some readers many of these ideas probably sound like business as usual, but they come across as very refreshing from a major French bank, and let’s face it, most of us look more like that than a true Cloud Native. Bardin is basically leading a best practices lab for data science. If you get a chance to see him talk I highly recommend it. Unfortunately I didn’t get a chance to talk to Bardin after his talk – I was keen to talk about test-driven development for data science, an area RedMonk is currently chewing on – what does devops for data-driven apps look like? How can we bring container-style velocity to databases, with forking and cloning and de-identification for compliance? That’s the kind of thing a new client, Datical, is working on, partnering with, e.g., Delphix.
As with so many areas, 95% of success in using technology is hygiene factors. You have to eat your broccoli before you get ice-cream, and you have to brush and floss afterwards. Data science and data culture are definitely not exceptions. Automation and abstraction solve some of the problems. Software may be eating the world, but data is where the value lies. Realising that value is what the industry needs to do next. Data Transformation is the new Digital Transformation.
Leaving the last word to Bardin though:
“Even if you think big you have to start small.”
Hortonworks paid T&E. IBM, Pivotal and Datical are clients.