tecosystems

What Factors Justify the Use of Apache Hadoop?

By Stephen O'Grady | @sogrady | January 13, 2011

The question posed at this week’s San Francisco Hadoop User Group is a common one: “what factors justify the use of an Apache Hadoop cluster vs. traditional approaches?” The answer you receive depends on who you ask.

Relational database authors and advocates have two criticisms of Hadoop. First, that most users have little need for Big Data. Second, that MapReduce is more complex than traditional SQL queries.

Both of these criticisms are valid.

In a post entitled “Terabytes is not big data, petabytes is,” Henrik Ingo argued that the gigabytes and terabytes I referenced as Big Data did not justify that term. He is correct. Further, it is true that the number of enterprises worldwide with petabyte-scale data management challenges is limited.

MapReduce, for its part, is in fact challenging. Challenging enough that there are two separate projects, Hive and Pig, that layer higher-level query languages (SQL-like, in Hive's case) on top of the core Hadoop MapReduce functionality. SQL is not only more accessible; people with SQL skills are an order of magnitude easier to find.

Hadoop supporters, meanwhile, counter both of those concerns.

It was Hadoop sponsor Cloudera, in fact, that originally coined the term “Medium Data” as an acknowledgement that data complexity was not purely a function of volume. As Bradford Cross put it:

Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times…The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data. The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem.

The term Big Data, like NoSQL, has become a liability in most contexts. Setting aside the lack of a consistent definition, the term is of little utility because it is single-dimensional. Larger dataset sizes present unique computational challenges. But the structure, workload, accessibility and even location of the data may prove equally challenging.

We use Hadoop at RedMonk, for example, to attack unstructured and semi-structured datasets without the overhead of an ETL step to insert them into a traditional relational database. From CSV to XML, we can load in a single step and begin querying.
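As a rough illustration of querying data in its raw form, here is a minimal single-machine sketch in plain Python. It is not Hadoop or Hive; the sample data, the field names and the helpers `count_by` and `sum_by` are all hypothetical, chosen only to show a file being read and aggregated as-is, with no schema or load step defined up front.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; field names and values are illustrative only.
RAW_CSV = """language,project,commits
java,proj_a,120
python,proj_b,15
java,proj_c,80
"""

def count_by(raw_text, field):
    """Count rows per distinct value of `field`, reading the CSV text as-is."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return Counter(row[field] for row in reader)

def sum_by(raw_text, key_field, value_field):
    """Sum an integer column grouped by `key_field`."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(raw_text)):
        totals[row[key_field]] += int(row[value_field])
    return totals
```

The column names come from the file's own header row, which is the point: the structure is interpreted at query time rather than fixed in advance by a transformation step.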

There are a variety of options for data mining at the scale we practice it. From the basic grep to the Perl CPAN modules Henrik points to, there are many tools that would provide us with similar capabilities. Why Hadoop? Because the ecosystem is growing, the documentation is generally excellent, our datasets are unstructured and, yes, because it can attack Big Data. While our datasets do not constitute Big Data, at least individually, they are growing rapidly.

Nor have we had to learn MapReduce. The Hadoop ecosystem is already rich enough that we have a variety of front-end options available, from visual spreadsheet metaphors (BigSheets) to SQL-style queries (Hive) with a web UI (Beeswax). No Java necessary.

Brian Aker’s comparison of MapReduce to an SUV in Henrik’s piece is apt, curiously, whether you’re a supporter of Hadoop or not. Brian’s obviously correct that a majority of users will use a minority of its capabilities. Much like SUVs and their owners.

While the overkill of an SUV comes at the cost of higher fuel consumption and size, however, the downside to Hadoop usage is less apparent. Its single-node performance is merely adequate and the front ends are immature relative to the tooling available in the relational database world, but the buildout around the core is improving by the day.

When is Hadoop justified? For petabyte workloads, certainly. But the versatility of the tool makes it appropriate for a variety of workloads beyond quote-unquote big data. It’s not going to replace your database, but your database isn’t likely to replace Hadoop either.

Different tools for different jobs, as ever.

Disclosure: Cloudera is a RedMonk customer.

Comments

Henrik Ingo says:

January 14, 2011 at 3:50 am

Excellent summary of what I tried to say in a couple of iterations too.

Assuming that Hadoop is relatively easy to install and get running, the point of there existing various front ends is a good one. It makes using Hadoop less of a Computer Science grade exercise. The point of querying data in its raw form without transformation is an excellent one I hadn’t thought of. If you use different data sources and the mining you do is “ad hoc” / one-off mining into it (and increasingly we do), then this is an advantage for sure. There is no benefit in transforming data into a standardized star schema or whatever, if the reporting out of it isn’t going to be continuous.

And even if you set up continuous reporting, it might not be a big advantage. One thing I firmly remember learning from those data mining courses was that mining the data in its original format was always preferable and a data warehouse was considered an unfortunate reality to deal with.

If I was getting back to these kinds of tasks now, I wouldn’t look at the Perl tools anymore, I too would dive into the Hadoop world. Or use grep when feasible. (For my Tiny Data sets, it often is.) If I wanted to jest, I’d use bash-reduce 🙂

sogrady says:

January 17, 2011 at 12:34 pm

“Assuming that Hadoop is relatively easy to install and get running, the point of there existing various front ends is a good one.”

The Cloudera distributions make it very easy. Installing the packaged CDH for Ubuntu is essentially indistinguishable from installing any other packaged application. Hue, their browser-based UI, exposes the file system, job running system and Hive out of the box. It’s not comparable to the more mature tooling available for relational database systems, but it’s more than adequate for inexperienced users.

“One thing I firmly remember learning from those data mining courses was that mining the data in its original format was always preferable and a data warehouse was considered an unfortunate reality to deal with.”

This is one of the things we hear repeatedly: via the process of transformation/normalization, data is inevitably lost. Ergo the desire to work with data as originally produced.

“If I was getting back to these kinds of tasks now, I wouldn’t look at the Perl tools anymore, I too would dive into the Hadoop world. Or use grep when feasible. (For my Tiny Data sets, it often is.) If I wanted to jest, I’d use bash-reduce.”

The funny thing is that the canonical Hadoop “Hello World” example is actually a word count, which is little more than grep at scale. Indeed, our analysis of the Hacker News dataset, among others, is basically just that.

While not sophisticated, it does have value 😉
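
For readers who have not seen it, the word count mentioned above can be sketched in a few lines. This is a single-process Python illustration of the map, shuffle and reduce steps, not the Hadoop API; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the tokenizing regex are assumptions made for the example.

```python
import re
from collections import defaultdict
from itertools import chain

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    return [(word, 1) for word in re.findall(r"[a-z0-9']+", line.lower())]

def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one word."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = chain.from_iterable(map_words(line) for line in lines)
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count(["Hadoop is big", "big data is big"])` counts "big" three times and "is" twice. On a real cluster the map and reduce steps would run in parallel across nodes, which is where the "at scale" part comes from.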

Michael Dorf says:

January 26, 2012 at 2:15 am

I absolutely agree with you. Hadoop, buzzword notwithstanding, can be nothing but overhead in plenty of cases. I wrote a similar post on the same topic recently:

http://www.learncomputer.com/hadoop-where-why/
