Of the non-relational datastore technologies created in the past several years, none has been more successful or seen greater acceptance than Hadoop. Popular with startups, enterprise vendors and customers alike, Hadoop substantially improved the processing time associated with certain workloads. One early example, which came from a user of the technology rather than a vendor selling it, asserted that a customer profiling query which had taken weeks to process in a traditional data warehouse executed in around thirteen minutes on a (sizable) Hadoop cluster. That kind of performance guarantees relevance, even within the most conservative organizations.
And so from the early days of the project at Yahoo, a commercial ecosystem inevitably evolved. Initially, the commercial players were few: Cloudera, primarily. But with the later additions of Hortonworks, MapR and most recently EMC/Greenplum/Pivotal and Intel, the field is increasingly competitive.
Not surprisingly, we’re beginning to see a fragmentation of offerings. This is typical within areas of commercial competition: to differentiate their solutions from one another, not to mention from the open source base, vendors are incented to change or add to the project in an attempt to deliver greater value and thus secure an outsized portion of the market’s available revenue. Nor does the license inhibit this; the permissive Apache license, in fact, allows vendors to leverage the underlying codebase as they see fit, even to the extent of converting the open source code into a proprietary project.
But apart from the commercial incentives to fragment, much of the accelerating differentiation in the Hadoop space has been a product of shortcomings or limitations of the Hadoop project itself. Hive and Pig, for example, were originally created by Facebook and Yahoo respectively to provide an alternate, and theoretically more accessible, interface to Hadoop. By mimicking more traditional query languages such as SQL, Hive and Pig widened the potential user base, allowing would-be Hadoop customers to deploy SQL resources to Hadoop tasks rather than having to pay competitive market rates for MapReduce-trained candidates. MapR, meanwhile, was originally built to attack both performance issues in Hadoop and the single point of failure risk posed by the NameNode architecture.
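To give a sense of the gap Hive and Pig were bridging, below is a minimal sketch of the MapReduce programming model they abstract away: a hypothetical job that counts records per customer ID from tab-separated input. The class names and input layout are illustrative assumptions, not drawn from any particular deployment.

```java
// A sketch of the MapReduce programming model that Hive and Pig abstract away.
// Hypothetical job: count events per customer ID from tab-separated log lines.
// The HiveQL equivalent is a single statement:
//   SELECT customer_id, COUNT(*) FROM events GROUP BY customer_id;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventCount {

    // Emit (customer_id, 1) for every input line.
    public static class EventMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text customer = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            customer.set(fields[0]); // assumes customer_id is the first column
            context.write(customer, ONE);
        }
    }

    // Sum the counts emitted for each customer_id.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "event count");
        job.setJarByClass(EventCount.class);
        job.setMapperClass(EventMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The dozens of lines above accomplish what the single HiveQL GROUP BY in the comment does, which is precisely why SQL-like interfaces broadened Hadoop’s appeal beyond MapReduce specialists.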
From a macro-perspective, however, the most interesting innovation around Hadoop has been that which makes it more like a traditional database. For all of its advantages in leveraging MapReduce to distribute data processing tasks across clusters, with linear or near-linear scaling as nodes are added, Hadoop was crippled in many respects compared to traditional databases. It was, as discussed, more difficult to access without specialized expertise. It was actually slower, often much slower, for smaller amounts of data or for queries where latency is a factor; interactive queries were virtually impossible. It lacked the ability to natively handle structure, columnar or otherwise. And so on.
Hadoop was like a savant: prodigiously gifted in certain areas, but severely limited in others. Which is why projects like Druid, HBase, Impala and Storm were written: either to address those limitations within the context of Hadoop, or to work around them externally.
Big picture, then, Hadoop has been becoming more database-like for some time, either because those in the Hadoop ecosystem have been making it so, or because competitive pressures have forced it in that direction.
Which is why EMC/Greenplum’s announcement of Pivotal HD was in many respects not a surprise. From the descriptions (we have not been briefed as yet), Pivotal HD is essentially a hybrid datastore, one that marries Hadoop to a more traditional Massively Parallel Processing (MPP) relational architecture.
The Hadoop components, the HDFS filesystem and the MapReduce core, are well understood. Relatively new is the SQL-compliant HAWQ database engine. Derived from Greenplum’s data warehousing engine (and thus, ultimately, from PostgreSQL), HAWQ attempts to offer the performance of Hadoop through a traditional database front-end.
On paper at least, the combination is interesting: while Hadoop was already becoming more database-like, EMC/Greenplum went further and actually embedded an entire database within the project. The target for this distribution is fairly clear, as EMC/Greenplum’s documentation [PDF] benchmarks against Hive (supported by Cloudera, Hortonworks and MapR) and Impala (Cloudera). As is typical, it claims substantial performance gains over both alternatives. Granting that vendor-supplied benchmarks are always to be taken with a grain of salt, it’s worth examining the appeal of Pivotal HD and its potential impact on the Hadoop marketplace.
If we assume, for the sake of argument, that a) the performance claims are either true or a reasonable approximation and that b) the functional difference between a full SQL implementation and alternatives like Hive is massive, the offering is undoubtedly compelling. Marrying Hadoop performance to a true SQL-capable front end with database-like latency makes for a unique distribution. The question facing EMC/Greenplum is whether its technical differentiation is enough to mitigate the advantages Cloudera, Hortonworks et al have as a result of being very closely based on open source projects. Because while much of the initial coverage doesn’t mention this, Pivotal HD is not an open source project.
For commenters such as Forbes contributor (and CTO and editor of CITO Research) Dan Woods, open source will not be an obstacle for EMC/Greenplum. He argues, in part, that Pivotal HD’s full SQL compliance and higher performance will prove irresistible: essentially that the technically superior product (assuming that’s Pivotal HD) will inevitably win. He also implies, at least indirectly, that open source is fundamentally unable to offer a competitive product. Let’s consider those contentions in order.
The difficulty with arguing that superior products win is that, historically, it has frequently not been true.
- Linux was, for many years, technically inferior to both Microsoft Windows and competing Unix implementations such as HP-UX or Solaris – and some would argue that it still is. Nevertheless, Linux has emerged as the primary competitor to Microsoft Windows in the server market, and is powering workloads from smartphones to mainframes.
- MySQL was functionally more limited – intentionally so, in many respects – than other relational database competitors both proprietary (DB2, Oracle, SQL Server) and open source (PostgreSQL). It is today the most popular relational database on the planet.
- AWS is at a competitive disadvantage to physical hardware in many respects, from reliability to performance to long term cost. In spite of this, it is beginning to have a major impact on traditional hardware suppliers (as is the similarly primitive hardware supplied by ODMs), and it is commonly regarded as the leading vendor in an exploding market.
- Further afield, Android was, for the majority of its early life, a technically inferior alternative to Apple’s iOS in almost every way. It is today the volume mobile platform leader.
As my analyst colleague Matthew Aslett implied, from development tooling (Eclipse) to web (Apache) and app (JBoss/Tomcat) servers to browsers (Firefox) to programming languages (JavaScript, PHP), the list of technologies considered inferior that have supplanted, or at a minimum complemented, their betters is long. The real danger in this industry, in fact, is overvaluing technology at the expense of convenience. This is particularly true when evaluation is done from a buyer’s perspective rather than a user’s. With so much procurement today being driven by practitioners, the unrestricted availability of open source assets gives those assets a significant competitive advantage.
And this is to say nothing of price-sensitive customers who may yet have true big data problems. One analysis, for example, estimated that the cost of building YouTube from open source components and commodity hardware would be approximately $104M; the cost to build it out on the proprietary Oracle Exadata platform, meanwhile, was roughly $589M, a nearly 6X multiple.
Certainly there will be customers whose needs will dictate the adoption of a unique solution like Pivotal HD, but how many will that be relative to the segment whose adoption cycle begins with the download of one of the free Hadoop distributions?
More problematic, however, is Woods’ argument that open source is somehow unfit or unable to address difficult engineering problems. Here’s one characteristic portion of Woods’ piece:
Why do NetApp and other storage vendors make so much money? Why hasn’t the open source community risen up and created a storage platform that is as rock solid and scalable as what NetApp offers? Why are top companies like Google and Facebook huge NetApp customers? The reason is that some engineering problems are so hard and require so many years of effort and layers of interacting systems that nobody is going to solve them and give them away.
Apart from the fact that a great deal of the Pivotal HD distribution itself is based upon open source technologies, the simple truth is that solutions to difficult engineering problems are given away regularly. See Accumulo (NSA), Asgard (Netflix), Cassandra (Facebook), Ceph (DOE/NNSA), Druid (Metamarkets), DTrace (Sun), Mesos (Twitter), OpenStack (NASA & Rackspace), ZFS (Sun) and, of course, Hadoop (Yahoo). All were built to solve difficult problems and then given away. The mindset that releasing a project as open source can only represent a loss of organizational value is regrettably characteristic of buy-oriented IT executives. In many cases, particularly for organizations that make money with software rather than from software, the benefits of releasing software as an open source project substantially outweigh the costs. It is also worth noting that many of these organizations operate at a scale the majority of enterprises will never experience, which implies that the problems they are solving are hard.
In reality, the interesting question here is the reverse of Woods’. Instead of asking why NetApp and other storage vendors make so much money (though it’s interesting to note, as an aside, that their gross margin is lower than Red Hat’s), the more appropriate question might be: given the impact open source has had on the markets for operating systems, web/application servers, configuration/management software, development tooling, databases, and virtualization and cloud infrastructure, why should storage expect its forward revenues to be uniquely immune? More specifically, is it more likely that the storage problem is uniquely intractable relative to the challenge of building things like the Google File System, the details of which were part of the inspiration for Hadoop? Or would you argue that realizable storage revenues will be increasingly impacted by a combination of commodity hardware and open source software?
Whatever your answer, it is necessary to give credit where credit is due: EMC/Greenplum have in Pivotal HD clearly anticipated the market direction towards relational-Hadoop hybrids, and delivered a product that on paper will be of great interest. But to dismiss the impact open source might have on its rate of adoption and revenue potential seems to overweight technology while undervaluing history. Technology matters, but availability does too.
Disclosure: Cloudera, MapR, Intel and VMware (Pivotal parent) are RedMonk customers. EMC, Greenplum and Hortonworks are not.