The question posed at this week’s San Francisco Hadoop User Group is a common one: “what factors justify the use of an Apache Hadoop cluster vs. traditional approaches?” The answer you receive depends on who you ask.
Relational database authors and advocates have two criticisms of Hadoop. First, that most users have little need for Big Data. Second, that MapReduce is more complex than traditional SQL queries.
Both of these criticisms are valid.
In a post entitled “Terabytes is not big data, petabytes is,” Henrik Ingo argued that the gigabytes and terabytes I referenced as Big Data did not justify that term. He is correct. Further, it is true that the number of enterprises worldwide with petabyte scale data management challenges is limited.
MapReduce, for its part, is in fact challenging. Challenging enough that there are two separate projects (Hive and Pig) that add SQL-like interfaces as a complement to the core Hadoop MapReduce functionality. SQL is not only more accessible; people with SQL skills are also an order of magnitude easier to find.
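To make the complexity gap concrete, here is a minimal sketch of the same per-key count written twice: once as explicit map, shuffle and reduce steps, and once as a single SQL statement. It is plain Python with an in-memory SQLite database, not Hadoop or Hive code; the sample rows, the `projects` table and every function name are hypothetical, chosen only for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (language, project) pairs.
ROWS = [
    ("java", "hadoop"),
    ("java", "hive"),
    ("perl", "cpan-module"),
    ("java", "pig"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per input record."""
    for language, _project in rows:
        yield language, 1


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_with_mapreduce(rows):
    """Count records per language using the three explicit phases."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


def count_with_sql(rows):
    """Count records per language with one GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE projects (language TEXT, project TEXT)")
    conn.executemany("INSERT INTO projects VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT language, COUNT(*) FROM projects GROUP BY language"
    )
    result = dict(cursor)  # each row is a (language, count) pair
    conn.close()
    return result


if __name__ == "__main__":
    print(count_with_mapreduce(ROWS))
    print(count_with_sql(ROWS))
```

Both functions return the same counts. The difference is how much of the mechanics the author has to spell out, which is roughly the gap that SQL-style front ends aim to close.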
Hadoop supporters, meanwhile, counter both of those concerns.
It was Hadoop sponsor Cloudera, in fact, that originally coined the term “Medium Data” as an acknowledgement that data complexity was not purely a function of volume. As Bradford Cross put it:
Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times…The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data. The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem.
The term Big Data, like NoSQL, has become a liability in most contexts. Setting aside the lack of a consistent definition, the term is of little utility because it is single-dimensional. Larger dataset sizes present unique computational challenges. But the structure, workload, accessibility and even location of the data may prove equally challenging.
We use Hadoop at RedMonk, for example, to attack unstructured and semi-structured datasets without the overhead of an ETL step to insert them into a traditional relational database. From CSV to XML, we can load in a single step and begin querying.
There are a variety of options for data mining at the scale we practice it. From the basic grep to the Perl CPAN modules Henrik points to, there are many tools that would provide us with similar capabilities. Why Hadoop? Because the ecosystem is growing, the documentation is generally excellent, it suits the unstructured nature of our datasets and, yes, it can attack Big Data. Because while our datasets, at least individually, do not constitute Big Data, they are growing rapidly.
Nor have we had to learn MapReduce. The Hadoop ecosystem at present is rich enough already that we have a variety of front end options available, from visual spreadsheet metaphors (Big Sheets) to SQL-style queries (Hive) with a web UI (Beeswax). No Java necessary.
Brian Aker’s comparison of MapReduce to an SUV in Henrik’s piece is apt, curiously, whether you’re a supporter of Hadoop or not. Brian’s obviously correct that a majority of users will use a minority of its capabilities, much like SUVs and their owners.
While the overkill of an SUV is offset by its higher fuel costs and size, however, the downside to Hadoop usage is less apparent. Its single node performance is merely adequate and the front ends are immature relative to the tooling available in the relational database world, but the build out around the core is improving by the day.
When is Hadoop justified? For petabyte workloads, certainly. But the versatility of the tool makes it appropriate for a variety of workloads beyond so-called big data. It’s not going to replace your database, but your database isn’t likely to replace Hadoop either.
Different tools for different jobs, as ever.
Disclosure: Cloudera is a RedMonk customer.