James Governor's Monkchips

That giant sucking sound? Hadoop moving into the cloud


Hadoop was born in the cloud, as a big data project designed to take advantage of Yahoo’s infrastructure, based on a paper from Google about the MapReduce programming model.

It was never originally designed for on prem enterprise deployment. I recently wrote that some of the issues with Hadoop adoption are cultural, but management overheads are another problem – setting up clusters, capacity planning, maintaining and updating versions, and keeping clusters running: the basic care and feeding that any distributed system requires. Another management issue is the sheer pace of innovation in open source big data tooling. Hadoop is great for counting and sorting in batch mode and the Hadoop Distributed File System (HDFS) is a powerful data reservoir, but then what? The query infrastructure is still immature, and targeted at highly skilled and expensive data scientists with language skills rather than common or garden SQL tooling. As data workloads increasingly became stream-based, Spark took off. Then we needed a message bus, and Kafka emerged as the platform of choice. But how is all of this stuff supposed to fit together? Enterprises, in general, don’t want to be systems integrators (except of course, the ones that do) and prefer to outsource the packaging of technology to third parties.

Cloudera, Hortonworks and MapR were founded as Hadoop distribution providers, and have responded to the rate of change issue by broadening their story and embracing Spark and Kafka, positioning themselves as broad next generation data platforms.

But many early deployments were on prem, which meant management overheads remained, especially in a world where new software versions of all the pieces of the stack are emerging at a furious pace. Updating on prem software sucks.

Even in the cloud, for all its promise of elastic scalability, capacity planning is an issue – what happens when your Hadoop cluster outgrows the sizing you have set up on AWS? Hadoop and associated tooling carry a fairly significant management overhead. While the distribution players can mitigate these issues to some extent, the alternative is managed services from AWS, Azure and Google Cloud Platform (GCP).

If lighthouse customers are anything to go by, Google may have found a sweet spot in picking up customers that are fed up with running Hadoop, and are looking for an integrated set of offerings that don’t carry a management overhead. Google has its ducks in a row from a packaging perspective.

One of the first major GCP wins for its Big Data services was Spotify, which said loud and clear it was willing to trade openness for convenience and extra capability.

HSBC spent tens of millions of pounds standing up its own Hadoop infrastructure for anti-money laundering (AML) but was disappointed with the results. It has now migrated to GCP for AML, and is beginning to migrate other workloads there, such as finance liquidity reporting. The strategy is Cloud First and GCP is really well positioned there.

Ocado is an online grocery delivery platform, which also offers third party digital and fulfilment services. It’s a classic platform play and has gone all in on Google Cloud for data. Interestingly it would be even more aggressive about adopting Google infrastructure were it not for corporate restrictions on adopting beta versions.

Qubit is a startup offering personalisation services to retail customers. It was running a massive Hadoop and HDFS cluster on AWS: 120k events per second, and over 1bn personalisations per day from Hadoop and dedicated HBase clusters. At the end of 2016 it ported all of its data to GCP because of manageability, particularly around capacity planning. Emre Baran, founder, said:

“With Bigtable and BigQuery we got rid of all the problems we used to have with Amazon, where we had to freeze the world, move the data to a new instance, then start the world again.”

Qubit’s migration went as follows:

Storm to Dataflow

Kafka to Pub/Sub

Hive to BigQuery

HBase to Bigtable

Mesos to GCE
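
For readers who think in code, the migration list above can be written down as a simple lookup table. This is only an illustrative sketch: the `MIGRATION` dictionary and the `managed_equivalent` helper are hypothetical names of my own, not anything Qubit or Google ships, and the service names use Google's current product spellings.

```python
# Hypothetical lookup table built from Qubit's migration list.
# Keys are the self-managed components, values are the GCP services
# that replaced them.
MIGRATION = {
    "Storm": "Dataflow",
    "Kafka": "Pub/Sub",
    "Hive": "BigQuery",
    "HBase": "Bigtable",
    "Mesos": "GCE",
}


def managed_equivalent(component: str) -> str:
    """Return the GCP service a component was moved to, per the list."""
    try:
        return MIGRATION[component]
    except KeyError:
        # Anything outside the published list is deliberately not guessed at.
        raise KeyError(f"{component!r} is not in the migration list") from None


if __name__ == "__main__":
    for source, target in MIGRATION.items():
        print(f"{source:>6} -> {target}")
```

The point of the table is how little is left to run yourself: every entry on the left was something Qubit had to size and patch, and all but the last entry on the right is a managed service.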

At Google NEXT 17 yesterday analysts were also given another couple of names that are currently confidential, but both are frankly very impressive – one in retail data management and one in automotive for managing warranties. Do lighthouse customers completely change a market? Obviously not, but they do point very clearly to a market sweet – or should that be suite – spot. Positioning is such that Google’s sales teams can go into accounts and ask simply – why would you do anything else? The retail vertical in particular is a good target for GCP because retailers are afraid of relying too much on AWS, given that Amazon is such a fierce competitor in pretty much every retail category.

This play is more about managed services, removing current management headaches, but of course Google has plenty more to offer – it has done an excellent job of creating integrated data pipelines, and there is upside in areas such as machine learning with TensorFlow. While a massive amount of data is flowing into AWS, with GCP Google is now in the game, with a differentiated set of managed data services based on open source code.

Amazon is set up to handle Hadoop cloud migrations through Elastic MapReduce (EMR), but much of its data management focus has been on its own proprietary tooling – notably Redshift for data warehousing and Kinesis for streaming and real time analytics. That is changing somewhat with support for tools such as Elasticsearch.

For the incumbent distro players (even with their platform-specific services) the cloud presents significant challenges. Cloudera to some extent acknowledged this with the lower pricing of its recent IPO. Hortonworks had IPO’d earlier. MapR remains privately held. All of these companies have partnerships with the major cloud players. Microsoft resells Hortonworks as HDInsight, and may end up acquiring the company.

There are obviously risks associated with partnering with cloud companies. Once the data is in the cloud, AWS, Azure and GCP will be sorely tempted to go after the workloads themselves rather than leave margins to partners.

Data gravity is often touted as a potential problem in the cloud – once data is there, it’s likely staying there. But another way of looking at it is that gravity pulls all of your data together, to help you use it more effectively. The high speed networking, storage options and deployment flexibility make entirely new data approaches possible that just would not have made sense on prem. AWS has done some excellent work in that regard with QuickSight, a BI tool which you can point at pretty much any AWS data service, and which will check the metadata and make recommendations about join and query possibilities.

But for customers the benefits of moving to the cloud are obvious to the point of being a no-brainer. Data governance issues are being dealt with, and the upsides of getting data in the cloud from a manageability perspective are too significant to be ignored. Big data generally, and Hadoop specifically, is the latest enterprise workload that just makes more sense off prem.

 

AWS, Cloudera, Google and Microsoft are clients.

2 comments

  1. This article is wrong on so many levels. First Hadoop was not “born in the cloud”. Yahoo used their onprem resources. They didn’t use a third party provider. The author gives no real cost comparison between running Hadoop onprem versus paying others. I can tell you that the clients I work with have done cost comparisons using AWS and every time onprem is less expensive….by a significant amount.

    There is no doubt that companies will choose between running Hadoop onprem versus outsourcing. This is not a Hadoop specific comparison. Eventually people will understand that the successful companies are the ones that understand managing their own infrastructure is a competitive advantage. Why do you think Amazon, Google and Microsoft began building their own infrastructure in the first place? Offering “cloud services” was something they figured out only after they built their own successful infrastructure.

    Thankfully the author has the decency to admit that Amazon, Google and Microsoft are clients.

    1. Thanks for your comment, Alex. Regarding Yahoo, I don’t believe I mentioned third party resources. I could perhaps have said “born at Web scale”. But effectively yes, Yahoo was running a cloud. I’d be happy to talk about your research into managing on prem vs off. Our views on managing on prem vs managed services are also based on discussions with many organisations. I believe your point about Amazon, Google and Microsoft rather proves the point – these companies are operating at a scale, with management resources, that very few organisations can or ever will match.
