tecosystems

The Problem with Big Data


“The first time I heard the “Medium Data” idea was from Christophe Bisciglia and Todd Lipcon at Cloudera. I think the concept is great. Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times… The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data.”
– Bradford Cross, Big Data Is Less About Size, And More About Freedom

He who has the most data wins. Right?

Google certainly believes this. But how many businesses, really, are like Google?

Not too many. Or fewer than would constitute a healthy and profitable market segment, according to what I’m hearing from more and more of the Big Data practitioners.

Part of the problem, clearly, is definitional. What is Big to you might not be Big to me, and vice versa. Worse, the metrics used within the analytics space vary widely. Technologists are used to measuring data in storage: Facebook generates 24-25 terabytes of new data per day, etc. Business Intelligence practitioners, on the other hand, are more likely to talk in terms of the number of rows. Or, if they’re very geeky, the number of variables, time series and such.

Big doesn’t always mean big, in other words. But big is increasingly bad, from what we hear. At least from a marketing perspective.

You and I may know, as Bradford says, that “companies do not have to be at Google scale to have data issues.” Would-be users often do not, however. Some enterprises hear the literally incomprehensible numbers thrown around in Big Data, Hadoop or NoSQL discussions, and conclude that their workloads are too small for quote-unquote Big Data tools. Which may of course be true. But it’s more likely that the conclusion is premature.

Hadoop and the NoSQL tools are often designed with scale in mind, true. But scalability is not the beginning and end of such tools’ utility. Consider the task of analyzing several hundred gigabytes of log files. The challenge is not scalability, but approach. Log files often map poorly to a relational database, while Hadoop is quite adept at processing them.
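To make the log file point a little more concrete, here is a rough sketch of the map and reduce style of processing that Hadoop popularized, written in plain Python rather than against any actual Hadoop API. The Apache-style log layout, the sample lines, and the names LOG_PATTERN, map_line, reduce_counts and count_statuses are all assumptions made up for illustration.

```python
import re
from collections import Counter
from itertools import chain

# Assumed log layout: Apache-style access log lines (illustrative only).
LOG_PATTERN = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def map_line(line):
    """Map step: emit (status, 1) for each parseable line; skip the rest."""
    match = LOG_PATTERN.search(line)
    if match:
        yield (match.group("status"), 1)


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def count_statuses(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    return reduce_counts(chain.from_iterable(map_line(line) for line in lines))


# Hypothetical sample data, including one line that does not parse.
sample = [
    '127.0.0.1 - - [10/Oct/2010:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2010:13:55:40 -0700] "GET /missing HTTP/1.0" 404 512',
    'garbage line that does not parse',
    '127.0.0.1 - - [10/Oct/2010:13:56:01 -0700] "POST /login HTTP/1.0" 200 128',
]

print(count_statuses(sample))
```

Nothing here requires a schema up front, and lines that do not parse are simply skipped. That is the difference in approach: the structure is imposed at read time, by the job, rather than at load time, by the database.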

This is, as we often say around here, a matter of matching the right tool to the right job. The question is whether the term Big Data is doing more harm than good in that effort. And if the answer to that is yes, what might replace it? Bradford is obviously correct when he states that Big Data is about more than size. But if that’s true, are we really well served by replacing Big with another size modifier like Medium?

The nomenclature in this entire space could use a refresh, and whether it gets one may have surprising impacts on adoption and deployment.