The Problem with Big Data


The first time I heard the “Medium Data” idea was from Christophe Bisciglia and Todd Lipcon at Cloudera. I think the concept is great. Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times… The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data.
– Bradford Cross, Big Data Is Less About Size, And More About Freedom

He who has the most data wins. Right?

Google certainly believes this. But how many businesses, really, are like Google?

Not too many. Or fewer than would constitute a healthy and profitable market segment, according to what I’m hearing from more and more of the Big Data practitioners.

Part of the problem, clearly, is definitional. What is Big to you might not be Big to me, and vice versa. Worse, the metrics used within the analytics space vary widely. Technologists are used to measuring data in storage: Facebook generates 24-25 terabytes of new data per day, etc. Business Intelligence practitioners, on the other hand, are more likely to talk in the number of rows. Or if they’re very geeky, the number of variables, time series and such.

Big doesn’t always mean big, in other words. But big is increasingly bad, from what we hear. At least from a marketing perspective.

You and I may know, as Bradford says, that “companies do not have to be at Google scale to have data issues.” Would-be users often do not, however. Some enterprises hear the literally incomprehensible numbers thrown around in Big Data, Hadoop or NoSQL discussions, and conclude that their workloads are too small for quote-unquote Big Data tools. Which may of course be true. But it’s more likely that the conclusion is premature.

Hadoop and the NoSQL tools are often designed with scale in mind, true. But scalability is not the beginning and end of such tools’ utility. Consider the task of analyzing several hundred gigabytes of log files. The challenge is not scalability, but approach. Log files often map poorly to a relational database, while Hadoop is quite adept at processing them.
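To make the log file point concrete, here is a minimal sketch of the map-and-reduce style of processing that Hadoop popularized, written as plain Python rather than against any Hadoop API. The log format, the `LOG_PATTERN` regex, and the function names are all assumptions for illustration. The thing to notice is that lines that don't parse are simply skipped, where a rigid relational schema would have to reject or special-case them.

```python
import re
from itertools import groupby

# Assumed Apache-style access log format; real logs vary widely.
LOG_PATTERN = re.compile(r'"\w+ (?P<path>\S+) [^"]*" (?P<status>\d{3})')

def map_line(line):
    """Map step: emit (status, 1) for each parseable line, skip the rest."""
    match = LOG_PATTERN.search(line)
    if match:
        yield match.group("status"), 1

def reduce_counts(pairs):
    """Reduce step: sort pairs by key, then sum the counts for each key."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

def count_statuses(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(pairs))

# Hypothetical sample data, including one line that matches no schema.
sample = [
    '127.0.0.1 - - [10/Oct/2010:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Oct/2010:13:55:40 -0700] "GET /missing HTTP/1.1" 404 512',
    'garbage line that a rigid schema would reject',
    '127.0.0.1 - - [10/Oct/2010:13:56:01 -0700] "POST /form HTTP/1.1" 200 128',
]

print(count_statuses(sample))  # {'200': 2, '404': 1}
```

On a single machine this is just a loop with extra steps. The appeal of the shape is that the map and reduce functions stay the same whether the input is four lines or several hundred gigabytes spread across a cluster.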

This is, as we often say around here, a matter of matching the right tool to the right job. The question is whether the term Big Data is doing more harm than good in that effort. And if the answer to that is yes, what might replace it? Bradford is obviously correct when he states that Big Data is about more than size. But if that’s true, are we really well served by replacing Big with another size modifier like Medium?

The nomenclature in this entire space could use a refresh, and whether it gets one may have surprising impacts on adoption and deployment.


  1. “Big Data Is Less About Size, And More About Freedom”

    There are two types of Data Organizations –
    1) Traditional (most industries) –
    Information flow: Warehouse (typically RDBMS) —> SAS —> Modeler (algorithm guy) —> Programmer (development guy)

    2) The Big Data ones – Internet or Capital Markets
    Information flow: Any Data (Warehouse files) —> Hadoop —> Modeler and Programmer (same person)

    The Traditional ones are your typical Enterprise customers – they too solve scale problems, with tools from Oracle, IBM, etc.

    Stephen – When people talk about Big Data, I register it as the category 2 guys. Which is really a style of your BI department.

  2. […] me and Stephen have been wary of reductionist approaches to defining NoSQL – we feel Hadoop style Big Data for example should be thought of as a related […]

  3. Another thought:

    Although Medium Data is what your typical Enterprise Application needs, you very soon move to a “Big data” problem when you talk about Cloud Based PaaS.

    I.e., would AppEngine ever have been able to provide a fabric cloud model without having Big Table, a data store that looks at any data as entities and attributes? AppEngine does not care if your app is small, medium or big. For it, all data is one big distributed store, and hence big.

    Amazon’s RDBMS service is still ultimately in the instance cloud world. Its Dynamo is in the Fabric cloud world.

    Take this problem to the Enterprise in the private cloud context. The Enterprise Architecture in the private cloud world will end up defining a programming model that says: here, write your JDO class and we will take care of persistence automatically in some form. The datastore behind it could be anything, but has to support.

    When you commoditize anything, the ability to scale comes with it.

  4. […] my blog “The Story of Big Data,” but then I remembered Stephen O’Grady’s words of wisdom on the topic and decided to change that… That said, I do think that “big data” is as good a […]

  5. […] cumbersome to discuss individually. If anything, the space we clumsily refer to today as Big Data [coverage] has more moving pieces than did web […]
