tecosystems

What Factors Justify the Use of Apache Hadoop?

The question posed at this week’s San Francisco Hadoop User Group is a common one: “what factors justify the use of an Apache Hadoop cluster vs. traditional approaches?” The answer you receive depends on who you ask.

Relational database authors and advocates have two criticisms of Hadoop. First, that most users have little need for Big Data. Second, that MapReduce is more complex than traditional SQL queries.

Both of these criticisms are valid.

In a post entitled “Terabytes is not big data, petabytes is,” Henrik Ingo argued that the gigabytes and terabytes I referenced as Big Data did not justify that term. He is correct. Further, it is true that the number of enterprises worldwide with petabyte-scale data management challenges is limited.

MapReduce, for its part, is in fact challenging. Challenging enough that two separate projects exist to layer more approachable interfaces over the core Hadoop MapReduce functionality: the SQL-like Hive and the dataflow-oriented Pig. Besides being more accessible, SQL skills are an order of magnitude more common in the available talent pool.
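
To make the gap concrete: a grouped count that Hive expresses in a single SQL-like statement requires hand-written mapper and reducer classes, plus a driver, in raw MapReduce. A minimal Java sketch, with illustrative class names and a tab-delimited input assumed:

    // In Hive: SELECT page, COUNT(*) FROM logs GROUP BY page;
    // A raw MapReduce equivalent, against the 0.20-era Java API:
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class GroupCount {
      // Emit (page, 1) for every input line; assumes the page is the
      // first tab-delimited column.
      public static class GroupMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          page.set(value.toString().split("\t")[0]);
          context.write(page, ONE);
        }
      }

      // Sum the emitted ones for each page.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum));
        }
      }
    }

And that is the trivial case; joins and multi-stage aggregations compound the Java accordingly.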

Hadoop supporters, meanwhile, counter both of those concerns.

It was Hadoop sponsor Cloudera, in fact, that originally coined the term “Medium Data” as an acknowledgement that data complexity was not purely a function of volume. As Bradford Cross put it:

Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times…The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data. The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem.

Big Data, like NoSQL, has become a liability in most contexts. Setting aside the lack of a consistent definition, the term is of little utility because it is single-dimensional. Larger dataset sizes present unique computational challenges. But the structure, workload, accessibility and even location of the data may prove equally challenging.

We use Hadoop at RedMonk, for example, to attack unstructured and semi-structured datasets without the overhead of an ETL step to insert them into a traditional relational database. From CSV to XML, we can load in a single step and begin querying.
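
In practice we run such queries through Hive rather than hand-written Java, but mechanically the point is simply that a job reads the files exactly as they were produced. A minimal driver sketch, with illustrative paths, reusing the mapper and reducer from the sketch above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RawFileJob {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "query raw files");
        job.setJarByClass(RawFileJob.class);
        job.setMapperClass(GroupCount.GroupMapper.class);
        job.setReducerClass(GroupCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // No load or transform step: point the job at the raw files
        // as they landed in HDFS and start querying.
        FileInputFormat.addInputPath(job, new Path("/data/raw"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }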

There are a variety of options for data mining at the scale we practice it. From basic grep to the Perl CPAN modules Henrik points to, many tools would provide us with similar capabilities. Why Hadoop? Because the ecosystem is growing, because the documentation is generally excellent, because our datasets are largely unstructured and, yes, because of its ability to attack Big Data. Our datasets – at least individually – may not constitute Big Data, but they are growing rapidly.

Nor have we had to learn MapReduce. The Hadoop ecosystem is already rich enough that we have a variety of front end options available, from visual spreadsheet metaphors (BigSheets) to SQL-style queries (Hive) with a web UI (Beeswax). No Java necessary.

Brian Aker’s comparison of MapReduce to an SUV in Henrik’s piece is apt, curiously, whether you’re a supporter of Hadoop or not. Brian is obviously correct that a majority of users will exercise a minority of its capabilities. Much like SUVs and their owners.

But where an SUV’s overkill shows up visibly in fuel costs and size, the downside to Hadoop usage is less apparent. Its single-node performance is merely adequate and its front ends are immature relative to the tooling available in the relational database world, but the build-out around the core is improving by the day.

When is Hadoop justified? For petabyte workloads, certainly. But the versatility of the tool makes it appropriate for a variety of workloads beyond “Big Data.” It’s not going to replace your database, but your database isn’t likely to replace Hadoop either.

Different tools for different jobs, as ever.

Disclosure: Cloudera is a RedMonk customer.

8 comments

  1. Excellent summary of what I tried to say in a couple of iterations too.

    Assuming that Hadoop is relatively easy to install and get running, the point of there existing various front ends is a good one. It makes using Hadoop less of a Computer Science grade exercise. The point of querying data in its raw form without transformation is an excellent one I hadn’t thought of. If you use different data sources and the mining you do is “ad hoc” / one-off (and increasingly we do), then this is an advantage for sure. There is no benefit in transforming data into a standardized star schema or whatever if the reporting out of it isn’t going to be continuous.

    And even if you set up continuous reporting, it might not be a big advantage. One thing I firmly remember learning from those data mining courses was that mining the data in its original format was always preferable and a data warehouse was considered an unfortunate reality to deal with.

    If I was getting back to these kinds of tasks now, I wouldn’t look at the Perl tools anymore, I too would dive into the Hadoop world. Or use grep when feasible. (For my Tiny Data sets, it often is.) If I wanted to jest, I’d use bash-reduce 🙂

  2. […] What Factors Justify the Use of Apache Hadoop? (From tecosystems) This is a great defense of Hadoop, not that it needs one. Yes, it’s overkill in certain situations, but it’s also practical for data sets less than 1TB, and is getting better by the day. […]

  3. […] read another post that was claiming that MapReduce (in Java) was a fairly complex paradigm and therefore hacking […]

  4. […] analyst Stephen O’Grady tackles the question “What Factors Justify the Use of Apache Hadoop?” O’Grady cites two of the most common criticisms of Hadoop: 1) Most users don’t […]

  5. "Assuming that Hadoop is relatively easy to install and get running, the point of there existing various front ends is a good one."

    The Cloudera distributions make it very easy. The packaged CDH for Ubuntu installs, essentially, like any other packaged application. Hue – their browser-based UI – exposes the file system, the job running system and Hive out of the box. It’s not comparable to the more mature tooling available for relational database systems, but it’s more than adequate for inexperienced users.

    "One thing I firmly remember learning from those data mining courses was that mining the data in its original format was always preferable and a data warehouse was considered an unfortunate reality to deal with."

    This is one of the things we hear repeatedly: via the process of transformation/normalization, data is inevitably lost. Ergo the desire to work with data as originally produced.

    "If I was getting back to these kinds of tasks now, I wouldn’t look at the Perl tools anymore, I too would dive into the Hadoop world. Or use grep when feasible. (For my Tiny Data sets, it often is.) If I wanted to jest, I’d use bash-reduce."

    The funny thing is that the canonical Hadoop “Hello World” example is actually a word count, which is little more than grep at scale. Indeed, our analysis of the Hacker News dataset, among others, is basically just that.
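
    For the curious, the mapper half of that “Hello World” is only a few lines; a minimal sketch, reusing the imports from the sketches above, with a summing reducer completing it:

        // Tokenize each line and emit (word, 1); a summing reducer
        // then yields per-word counts. Grep at scale, with arithmetic.
        public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
              if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
              }
            }
          }
        }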

    While not sophisticated, it does have value 😉

  6. […] What Factors Justify the Use of Apache Hadoop? The question posed at this week’s San Francisco Hadoop User Group is a common one: “what factors justify the use of an Apache Hadoop cluster vs. traditional approaches?” The answer you receive depends on who you ask. […]

  7. […] addition to IBM PR and AR reaching out to me, the Apache Software Foundation sent me info on the Hadoop and UIMA software being used by Watson: The Watson system uses UIMA as its principal infrastructure […]

  8. I absolutely agree with you. Hadoop, buzzword notwithstanding, can be nothing but overhead in plenty of cases. I wrote a similar post on the same topic recently:

    http://www.learncomputer.com/hadoop-where-why/
