RedMonk

What Factors Justify the Use of Apache Hadoop?

The question posed at this week’s San Francisco Hadoop User Group is a common one: “what factors justify the use of an Apache Hadoop cluster vs. traditional approaches?” The answer you receive depends on who you ask.

Relational database authors and advocates have two criticisms of Hadoop. First, that most users have little need for Big Data. Second, that MapReduce is more complex than traditional SQL queries.

Both of these criticisms are valid.

In a post entitled “Terabytes is not big data, petabytes is,” Henrik Ingo argued that the gigabytes and terabytes I referenced as Big Data did not justify that term. He is correct. Further, it is true that the number of enterprises worldwide with petabyte scale data management challenges is limited.

MapReduce, for its part, is in fact challenging. Challenging enough that there are two separate projects, Hive and Pig, that layer higher-level query languages (Hive’s SQL-like HiveQL, Pig’s Pig Latin) on top of the core Hadoop MapReduce functionality. SQL is not only more accessible; SQL skills are also an order of magnitude more common in the available talent pool.
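The gap between the two models is easy to see with the canonical word count. Below is a minimal local sketch of the MapReduce programming model in Python; the function names are illustrative rather than Hadoop’s actual Java API, and the equivalent Hive query appears in a comment for contrast.

```python
# A minimal local sketch of the MapReduce programming model, using a
# word count as illustration. These names are illustrative, not
# Hadoop's actual Java API.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Sum the counts emitted for a given word.
    return (word, sum(counts))

def map_reduce(lines):
    # Shuffle/sort phase: collect and group intermediate pairs by key.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(word, (count for _, count in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

# The same result in Hive is a single declarative query, roughly:
#   SELECT word, COUNT(*) FROM words GROUP BY word;
```

Even in this toy form, the imperative map/shuffle/reduce plumbing is several times the size of the declarative query, which is the complexity critics are pointing at.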

Hadoop supporters, meanwhile, counter both of those concerns.

It was Hadoop sponsor Cloudera, in fact, that originally coined the term “Medium Data” as an acknowledgement that data complexity was not purely a function of volume. As Bradford Cross put it:

Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times…The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data. The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem.

Big Data, like NoSQL, has become a liability in most contexts. Setting aside the lack of a consistent definition, the term is of little utility because it is single-dimensional. Larger dataset sizes present unique computational challenges. But the structure, workload, accessibility and even location of the data may prove equally challenging.

We use Hadoop at RedMonk, for example, to attack unstructured and semi-structured datasets without the overhead of an ETL step to insert them into a traditional relational database. From CSV to XML, we can load in a single step and begin querying.
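As a local illustration of that “schema on read” workflow (load the raw file and query it in one step, with no prior transformation), consider a sketch like the following; the file contents and column names are hypothetical, not our actual datasets.

```python
# Ad-hoc query over a raw CSV source with no prior ETL or schema
# migration: parse and aggregate in a single pass. The data and the
# columns ("project", "mentions") are hypothetical.
import csv
import io
from collections import Counter

raw = io.StringIO(
    "project,mentions\n"
    "hadoop,12\n"
    "hive,7\n"
    "hadoop,5\n"
)

totals = Counter()
for row in csv.DictReader(raw):
    # Structure is applied at read time ("schema on read") rather than
    # by loading the data into a normalized database beforehand.
    totals[row["project"]] += int(row["mentions"])
```

The point is the workflow, not the code: the source stays in its original format, and the “schema” exists only for the duration of the query.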

There are a variety of options for data mining at the scale we practice it. From basic grep to the Perl CPAN modules Henrik points to, many tools would provide us with similar capabilities. Why Hadoop? Because the ecosystem is growing, the documentation is generally excellent, our datasets are unstructured and, yes, because it can attack Big Data. Because while our datasets – at least individually – do not constitute Big Data, they are growing rapidly.

Nor have we had to learn MapReduce. The Hadoop ecosystem at present is rich enough already that we have a variety of front end options available, from visual spreadsheet metaphors (Big Sheets) to SQL-style queries (Hive) with a web UI (Beeswax). No Java necessary.

Brian Aker’s comparison of MapReduce to an SUV in Henrik’s piece is apt, curiously enough, whether you’re a supporter of Hadoop or not. Brian is obviously correct that a majority of users will use a minority of its capabilities. Much like SUVs and their owners.

While the overkill of an SUV carries obvious costs in fuel and size, however, the downside to Hadoop usage is less apparent. Its single node performance is merely adequate, and its front ends are immature relative to the tooling available in the relational database world, but the build out around the core is improving by the day.

When is Hadoop justified? For petabyte workloads, certainly. But the versatility of the tool makes it appropriate for a variety of workloads beyond quote unquote big data. It’s not going to replace your database, but your database isn’t likely to replace Hadoop either.

Different tools for different jobs, as ever.

Disclosure: Cloudera is a RedMonk customer.

Categories: Analytics, Big Data.

  • http://openlife.cc Henrik Ingo

    Excellent summary of what I tried to say in a couple of iterations too.

    Assuming that Hadoop is relatively easy to install and get running, the point of there existing various front ends is a good one. It makes using Hadoop less of a Computer Science grade exercise. The point of querying data in its raw form without transformation is an excellent one I hadn’t thought of. If you use different data sources and the mining you do is “ad hoc” / one-off (and increasingly we do), then this is an advantage for sure. There is no benefit in transforming data into a standardized star schema or whatever, if the reporting out of it isn’t going to be continuous.

    And even if you set up continuous reporting, it might not be a big advantage. One thing I firmly remember learning from those data mining courses was that mining the data in its original format was always preferable and a data warehouse was considered an unfortunate reality to deal with.

    If I was getting back to these kinds of tasks now, I wouldn’t look at the Perl tools anymore, I too would dive into the Hadoop world. Or use grep when feasible. (For my Tiny Data sets, it often is.) If I wanted to jest, I’d use bash-reduce :-)

  • http://redmonk.com/sogrady sogrady

    “Assuming that Hadoop is relatively easy to install and get running, the point of there existing various front ends is a good one.”

    The Cloudera distributions make it very easy. The packaged CDH for Ubuntu is essentially non-differentiated from the installation of any other packaged application. Hue – their browser based UI – exposes the file system, the job running system and Hive out of the box. It’s not comparable to the more mature tooling available for relational database systems, but it’s more than adequate for inexperienced users.

    “One thing I firmly remember learning from those data mining courses was that mining the data in its original format was always preferable and a data warehouse was considered an unfortunate reality to deal with.”

    This is one of the things we hear repeatedly: via the process of transformation/normalization, data is inevitably lost. Ergo the desire to work with data as originally produced.

    “If I was getting back to these kinds of tasks now, I wouldn’t look at the Perl tools anymore, I too would dive into the Hadoop world. Or use grep when feasible. (For my Tiny Data sets, it often is.) If I wanted to jest, I’d use bash-reduce.”

    The funny thing is that the canonical Hadoop “Hello World” example is actually a word count, which is little more than grep at scale. Indeed, our analysis of the Hacker News dataset, among others, is basically just that.

    While not sophisticated, it does have value ;)

  • http://www.learncomputer.com/ Michael Dorf

    I absolutely agree with you. Hadoop, buzzword notwithstanding, can be nothing but overhead in plenty of cases. I wrote a similar post on the same topic recently:

    http://www.learncomputer.com/hadoop-where-why/
