tecosystems

The View from HadoopWorld



Rod Smith argued this morning that we don’t have good terminology in the Hadoop space, and he’s right. I don’t even know what to properly call it, actually. Big data got thrown around a lot today, but I personally prefer Joe’s Megadata description. Whatever we call it, the opportunity Hadoop and its close cousins Hive and Pig are aimed at is expanding. How quickly? Well, Facebook has reportedly gone from generating 200 GB of new data per day a little more than a year ago to 24-25 terabytes per day now: a better than hundredfold increase. And we wonder why queries that used to take a few seconds suddenly take a few days.

Worse, the rate of content growth seems at least consistently linear, if not geometric. Meaning that today’s problems are tomorrow’s nightmares. So even as more and more enlightened enterprises wake up to the benefits of data-driven decision making, their ability to generate actionable intelligence decreases.

Enter Hadoop. As we heard today, the decision to open source Hadoop in the first place was as much resignation as altruism: Yahoo recognized that the business problem it faced – making sense of ever-expanding datastores – was far from unique, meaning that an open source project targeting the problem was inevitable. Bowing to this accurate observation, Hadoop was born and became the epicenter of a movement – with both community and commercial aspects – aimed at making sense of incomprehensibly large datasets.

Anyway, the on-the-ground logistics: HadoopWorld was, if this is the first you’ve heard of it, the inaugural event centered on the Apache project named after a stuffed elephant. Held at the Roosevelt Hotel in Manhattan, the smallish venue was absolutely packed. The keynotes saw folks sitting against both walls, and every session I was able to hit was at or near capacity: don’t mention that to the fire marshals.

True, the Roosevelt is not exactly the Mandalay Bay, capacity-wise, but I would think that Cloudera – the primary organizer of the show – had to be thrilled with the attendance. Maybe not so much from a sponsor perspective: the show floor, such as it was, consisted of a couple of church-style folding tables equipped with tablecloths and plastic signs. But the practitioners of big data – or, as I prefer, megadata – were out in force. In this economy, in the most expensive city to visit on this continent. Not bad at all.

Sadly, I was only able to stay through early afternoon – the inevitable consequence of a Friday event – but that was enough to give me a pretty good sense of what authors and users alike would like to see from Hadoop. Here’s what I would look for in the not-too-distant future:

Data Stores

Facebook, likely, couldn’t care less about a data store, in the commercial sense of that word. They more than have their hands full with the Dear Diary content all of their users, myself included, generate. Ditto Yahoo: 10% of their Hadoop cluster is apparently four thousand machines. But there is no question in my mind that a big part of Hadoop’s future is going to involve the consumption of assets – structured and unstructured – from external sources. Maybe that’s weather data, maybe that’s patent data, maybe that’s financial markets data, and maybe – if you’re a baseball fan with an interest bordering on obsession – it’s data like PitchF/X and HitF/X.

Which is why I think that data stores – places where datasets are exchanged and/or sold – are inevitable. Maybe they’ll eventually be baked into Hadoop user interfaces, and maybe they won’t. Either way, they’ll get used. Because while Flowing Data consistently does an absolutely outstanding job of telling you where to go for data, the volume demands an aggregate interface. They seem to agree, having handed out invites to Infochimps, a service that was kind enough to grant me access to its beta.

This, we will see more of. One, because we need simplified access to raw data; two, because that data has value and can be monetized.

Database Strategy vs Data Strategy

Some of you probably know that I’ve been arguing for years now that developers would eventually throw off the shackles of relational databases. Not entirely, or even mostly: relational databases work very well as general purpose persistence stores. But it seemed clear to me that the 1:1 relationship between persistence mechanism and relational database was not indefinitely sustainable.

So you can imagine that I was happy to hear Rod – someone who spends a lot of time with IBM customers – say today that it was his view that enterprises were moving from a “database strategy” to a “data strategy.” “Different tools for different jobs” has long been a RedMonk mantra, and nothing illustrates that more clearly than Hadoop. It’s appropriate for some workloads, and less appropriate – as we’ll see next – for others.

Is Hadoop going to replace Oracle, or further afield, Teradata? Probably not. But can it do a few things those tools can’t? You bet.

Developer Experience

This one’s near and dear to my heart, as well as – judging from the grumblings around me – to a bunch of other would-be users. Hadoop can be used with a variety of languages, from Perl to Python to Ruby, but as Doug Cutting admitted today, they’re all second-class citizens relative to Java. The plan, however, is for that to change. Which can’t happen soon enough, in my view.

It’s not that there’s anything intrinsically wrong with Java, or its audience. The point, rather, is that there are lots and lots of dynamic language developers out there who would be far more productive working in their native tongue than translating into Java.
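For the moment, Hadoop Streaming is the usual escape hatch: it pipes records through any executable over stdin/stdout, which lets those developers stay in their native tongue at the cost of extra serialization overhead. Here’s a minimal word count sketch in Python – the input/output paths and the streaming jar location are assumptions that will vary by install:

```python
#!/usr/bin/env python
"""Word count via Hadoop Streaming, sketched in one file.

Test it locally without a cluster:
    cat input.txt | python wordcount.py map | sort | python wordcount.py reduce

On a cluster (jar path and HDFS paths are assumptions, not gospel):
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /data/in -output /data/out \
        -mapper "python wordcount.py map" \
        -reducer "python wordcount.py reduce" \
        -file wordcount.py
"""
import sys

def mapper(stdin):
    # Emit one tab-separated (word, 1) pair per word seen.
    for line in stdin:
        for word in line.split():
            print("%s\t1" % word)

def reducer(stdin):
    # Streaming hands the reducer its keys in sorted order, so counts
    # can be accumulated in a single pass with no dictionary.
    current, count = None, 0
    for line in stdin:
        word, n = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)
```

The same pattern works for Perl or Ruby, but every record crosses a process boundary as text, which is part of why streaming jobs remain second-class citizens relative to native Java MapReduce.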

Dynamic vs Batch

In the Q&A following the Facebook keynote this morning, a member of the audience asked how Cassandra worked relative to Hadoop for their usage. Boiling it down, the answer was Cassandra for dynamic operations – like searching Facebook messages – and Hadoop for anything and everything batch-related. Which is why I wasn’t surprised to hear a few of the sessions focus on increasing Hadoop’s relevance for dynamic queries and operations. This is a challenge that’s part infrastructure – which is being worked on – and part user experience, which I’ll get to next.
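To make that division of labor concrete, here’s a toy Python sketch. Everything in it is invented for illustration: the dict stands in for a Cassandra-style key/value store and batch_report() for a Hadoop job; neither resembles the real APIs.

```python
from collections import Counter

# Invented stand-in for a Cassandra-style store: data is keyed so one
# user's messages come back with a single low-latency lookup.
inboxes = {
    "alice": ["re: hadoopworld", "lunch friday?"],
    "bob": ["patch review", "re: hadoopworld"],
}

def search_inbox(user, term):
    """Dynamic path: answer one user's question while they wait."""
    return [s for s in inboxes.get(user, []) if term in s]

def batch_report(stores):
    """Batch path: scan everything offline, the Hadoop-shaped job.
    On real data this takes minutes or hours, not milliseconds."""
    totals = Counter()
    for messages in stores.values():
        for subject in messages:
            totals.update(subject.split())
    return totals.most_common(3)

print(search_inbox("alice", "hadoopworld"))  # fast: touches one key
print(batch_report(inboxes))                 # slow: touches every row
```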

User Experience

As Cloudera’s Christophe Bisciglia noted, the Hadoop project is known for many things, but user interfaces are not one of them. Which is generally fine, because most early adopters will prefer – and place more trust in – the command line anyhow. But to expand the addressable market beyond engineers, Hadoop will need to follow in the footsteps of virtualization with better tooling. While it’s been possible to virtualize operating systems for years, the introduction of dead-simple GUIs like VMware Player dramatically expanded the audience that might employ virtualization. Ditto for Hadoop. Accessible to engineers is good; accessible to business analysts is even better.

Thus Cloudera announced its desktop offering (an interesting choice in the open source MooTools/JavaScript window manager backend), Karmasphere its Studio product (an even more interesting choice in NetBeans as a foundation), IBM demoed the soon-to-be-renamed M2 (a web-only front-end), and Doug talked about a spreadsheet-like interface. All of the above are intended, in one way or another, to simplify the task of inputting, parsing, analyzing and using new datasets via Hadoop.

This is excellent news for people like me, with many questions to ask but not the time to learn another system.

Wrapup

All in all? An excellent show, one well worth my time. My only parting suggestion – besides not doing it on a Friday – would be to arrange power strips for the show. It’s kind of tough to write it up on a dying battery. Otherwise, congrats to the organizers and the speakers: very well done.

Disclosure: Cloudera and IBM are RedMonk customers, while Apache, Facebook and Yahoo are not.