tecosystems

The Dream of Hadoop is Alive in AI


Nineteen years ago come April, Yahoo allowed two developers to release a project called Hadoop as open source software. Based on the Google File System and MapReduce papers from Google, it was designed to enable querying operations on large scale datasets using commodity hardware. Importantly, in contrast to the standard relational databases of the time, it could handle structured, semi-structured and unstructured data. The dream of Hadoop for many enterprises was opening their vast stores of accumulated data, which varied widely in structure and normalization, to both routine and ad hoc querying. The ability to easily ask questions of data independent of its scale represented a nirvana for organizations always seeking to operate with better and more real time intelligence.

There were several obstacles to achieving this, however, and today, while Hadoop is still around and in use within many enterprises, it has largely been leapfrogged by a variety of other on premise and cloud based alternatives.

One of the first barriers many organizations encountered was the querying itself. Writing a query in Hadoop required an engineer to understand both Java the language and the principles of MapReduce. Many outside of Google, in fact, were surprised when the company – which tended to be secretive and protective of its technology at the time – chose to release the MapReduce paper publicly at all. As it turned out, part of the justification was to simplify the on ramp for external hires; with the paper public, Google could hire talent already familiar with its principles rather than having to spend internal time and money familiarizing new hires with the concept.

So complex, in fact, was the task of writing MapReduce jobs that multiple organizations wrote their own alternative query interfaces; two of the most popular were Hive, created by Facebook, and Pig, a product of Yahoo. Both embraced a SQL-like interface, because it was simpler to hire engineers with SQL experience than with Java and MapReduce skills. IBM, for its part, tried to graft a spreadsheet-like interface called BigSheets onto Hadoop to enable even non-programmers to leverage Hadoop’s underlying scale to query very large scale datasets – what used to be called Big Data.
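To make that gap concrete, here is a minimal sketch of the same aggregation written two ways. It is an illustration only, not actual Hadoop or Hive code: real Hadoop jobs of the era were typically written in Java against the MapReduce API, while this is single-process Python that merely mimics the map, shuffle and reduce steps. The sample data, the function names and the use of SQLite as a stand-in for a SQL-like interface are all assumptions made for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, sale_amount) records.
SALES = [("east", 100), ("west", 250), ("east", 50), ("west", 25), ("north", 75)]


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def totals_via_mapreduce(records):
    """Run the three phases in sequence, in one process."""
    return reduce_phase(shuffle_phase(map_phase(records)))


def totals_via_sql(records):
    """The same aggregation as a single declarative SQL statement.

    SQLite is only a stand-in here for a SQL-like query layer.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
        rows = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        )
        return dict(rows.fetchall())
    finally:
        conn.close()
```

Both functions return the same per-region totals. The difference is in what the author has to think about: the first version requires decomposing the question into phases, while the second only requires stating the question.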

None of these alternative interfaces took off, however, and for that and a variety of other reasons including its lack of suitability for streaming workloads, number of moving parts and the ready availability of alternative managed services like AWS’ EMR/Redshift, Google BigQuery, Microsoft’s HDInsight / Synapse Analytics or – eventually – Databricks and Snowflake, Hadoop’s traction slipped.

The dream it offered, however, has never been closer.

The problem in recent years has not been the scale of data to be queried. While certain classes of data workloads remain expensive and difficult to operate on, over the last two decades advances in both hardware and software have made operating on large scale data both easier and, relatively speaking at least, more cost effective.

Instead, the primary challenge has been the query interface itself. Whatever language and framework was selected, SQL-like or otherwise, it narrowed the funnel of potential users down to employees with the requisite set of technical skills. But as even modest users of today’s LLM systems are aware, querying datasets is now trivial if not a totally solved problem.

Anyone who’s taken the time to upload a set of data – be it public corporate earnings, a climate science dataset or even personal utility consumption data – into a consumer grade LLM can test this out. Gone is the need to write complex queries or carefully refine charts and dashboards. Instead, the interface is simple, natural language questions:

  1. What does this balance sheet suggest about the overall health of the business?
  2. What are the year on year trends with respect to temperature, humidity and windspeed within this dataset?
  3. What are the seasonal fluctuations in my electricity consumption and how have they varied over the past three years?

There are caveats, of course, most notably the models’ propensity to make basic errors and the delta between an individual’s dataset and an enterprise’s. But the near total lack of friction from question to answer is transformative. While most of the industry’s attention at present has been on AI for code assistance, query assistance is likely to be at least as useful for the average enterprise employee. The benefit to the enterprise from query assistants, in fact, may be substantially greater than code assistants if some of the counterintuitive findings from the DORA report prove accurate.

Very few enterprises, of course, will be willing to feed the corporate data they once crawled with Hadoop to public models such as ChatGPT, Claude or Gemini. Regardless of promises made on the part of the public models, there is at least for the present a major gap in trust surrounding the potential for – and potential risks of – data exfiltration.

Which explains several things. First, why Snowflake is currently valued at over $55B and Databricks closed a round one month ago valuing the company at $62B. Second, it explains why the two companies have competed fiercely around their respective in house models Arctic and DBRX. And lastly, it helps explain the massive importance of and standardization on Apache Iceberg, which one of my colleagues will be covering in a soon to be released piece.

It’s about the dream of Hadoop, after all. It is well understood that AI advantages incumbents; all other points being equal, most enterprises would prefer to operate models on their data in place rather than have to trust new platforms and third parties, let alone migrate data. Databricks and Snowflake – along with the hyperscalers, obviously – are incumbents already trusted with large scale data from a large number of enterprises; that provides opportunity. Opportunity that they need to unlock with native, existing LLM interfaces – hence their respective investments in models. Iceberg, for its part, is fast becoming the Kubernetes of tables, which is to say the standard substrate on which everything is built across all of the above.

Enterprises have been migrating away from specialized datastores and towards multi-modal, general purpose datastores for years now, to be sure. AI is just the latest workload they’re expected to handle natively. AI models, in fact, may offer the cleanest path forward towards vertically integrating application-like functionality into the database. It’s more straightforward than acquiring and integrating an independent application platform, certainly. Data vendors may or may not have the market permission to absorb one of the various PaaS-like players, but they are already trusted to run AI workloads – AI workloads that overlap, sometimes significantly, with traditional application workloads. There’s a reason vendors in the space refer to themselves as data platforms: that’s exactly what they are, and are becoming.

The dream of Hadoop isn’t here today, to be clear. Even if the technology were fully ready, questions about security, compliance, access control and more remain. And as always, there are concerns about model hallucinations. But thanks to AI, the financial markets clearly believe it to be closer than it’s ever been. And after using models to query a variety of datasets of varying size and scope, it’s hard to argue the point.

Disclosure: AWS, Google, IBM and Microsoft are RedMonk customers. Databricks, OpenAI and Snowflake are not currently RedMonk customers.