In the beginning – October 2003, to be precise – there was the Google File System. And it was good. MapReduce, which followed in December 2004, was even better. Together, they served as the blueprint for Doug Cutting’s work at Yahoo, work that by 2005 had produced the project now known as Hadoop.
After being pressed into service by Yahoo and other large web properties, Hadoop’s inevitable standalone commercialization arrived in the form of Cloudera in 2009. Founded by Amr Awadallah (Yahoo), Christophe Bisciglia (Google), Jeff Hammerbacher (Facebook) and Mike Olson (Oracle/Sleepycat) – Cutting was to join later – Cloudera, oddly, had the Hadoop market more or less to itself for a few years.
Eventually the likes of MapR, Hortonworks, IBM and others arrived. And today, any vendor with data processing ambitions is either in the Hadoop space directly or partnering with an entity that is – because there is no other option. Even vendors with no major data processing businesses, for that matter, are jumping in to drive other areas of their business – Intel being perhaps the most obvious example.
The question today is not, as it was in those early days, what Hadoop is for. In the early days of the project, many conversations with users about the power of Hadoop would stall when they heard words like “batch” or compared MapReduce to SQL (see Slide 22). Even already on-board employers like Facebook, meanwhile, faced with a market shortage of MapReduce-trained candidates, were forced to write alternative query mechanisms like Hive themselves. All of which meant that conversations about Hadoop were, without exception, conversations about what Hadoop was good for.
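The accessibility gap that drove Facebook to build Hive can be sketched in miniature: even a trivial aggregation requires the programmer to think in explicit map and reduce phases, where the SQL equivalent is a single line. The sketch below is purely illustrative – the data, names, and simplified single-machine “phases” are assumptions for demonstration, not drawn from any actual Hadoop job.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical log records (page, visits) an analyst might want to aggregate.
records = [("home", 2), ("search", 1), ("home", 3), ("profile", 5), ("search", 4)]

def map_phase(records):
    # Map step: emit (key, value) pairs for every input record.
    return [(page, visits) for page, visits in records]

def reduce_phase(pairs):
    # Reduce step: sort by key (stands in for the shuffle), group, and sum.
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

totals = reduce_phase(map_phase(records))
# The same question in a Hive-style SQL dialect is one line:
#   SELECT page, SUM(visits) FROM logs GROUP BY page;
```

The point is not the Python itself but the ratio: two hand-written phases versus one declarative statement. Multiply that by every ad hoc question an analyst asks in a day and the economics of a SQL layer over MapReduce become obvious.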
Today, the reverse is true: it’s more difficult to pinpoint what Hadoop isn’t being used for than what it is. There are multiple SQL-like access mechanisms, some like Impala driving towards lower and lower latency queries, and Pivotal has even gone so far as to graft a fully SQL-compliant relational database engine onto the platform. Elsewhere, projects like HBase have layered federated database-like capabilities onto the core HDFS Hadoop foundation. The net of which is that Hadoop is gradually transitioning away from being a strictly batch-oriented system aimed at specialized large dataset workloads and into a more mainstream, general purpose data platform.
The large opportunity that lies in a more versatile, less specialized Hadoop helps explain the behavior of participating vendors. It’s easier to understand, for example, why EMC is aggressively integrating relational database technology into the platform if you understand where Hadoop is going versus where it has been. Likewise, Cloudera’s “Enterprise Data Hub” messaging is clearly intended to achieve separation from the perception that Hadoop is “for batch jobs.” And the size of the opportunity is the context behind IBM’s comments that it “doesn’t need Cloudera.” If the opportunity, and attendant risk, were smaller, IBM would likely be content to partner. But it is not.
Nor is innovation in the space limited to those who would sell software directly; quite the contrary, in fact. Facebook’s Presto is a distributed SQL engine built directly on top of HDFS, and clones of Google Spanner and its kin are as inevitable as Hadoop was once upon a time. Amazon’s Redshift, for its part, is gathering momentum amongst customers who don’t wish to build and own their own data infrastructure.
Of course, Hadoop could very well be years behind Google from a technology perspective. But even if the Hadoop ecosystem is the past to Google, it’s the present for the market. And questions about that market abound. How does the market landscape shake out? Are smaller players shortly to be acquired by larger vendors desperate not to be locked out of a growth market? Will the value be in the distributions, or higher level abstractions? How do broadening platform strategies and ambitions affect relationships with would-be partners like a MongoDB? How do the players continue to balance the increasing trend towards open source against the need to optimize revenue in an aggressively competitive market? Will open source continue to be the default, baseline expectation, or will we see a tilt back towards closed source? Will other platforms emerge to sap some of Hadoop’s momentum? Will anyone seriously close the gap between MapReduce/SQL analyst and Excel user from an accessibility standpoint?
And so on. These are the questions we’re spending a great deal of time exploring in the wake of the first Strata/HadoopWorld in which Hadoop deliberately and repeatedly asserted itself as a general purpose technology. From here on out, the stakes are higher by the day, and the margin for error lower. To she who gets more of the answers to the above questions correct go the spoils.