“The most significant findings of our preliminary review are: The U.S. Government had sufficient information prior to the attempted December 25 attack to have potentially disrupted the AQAP plot.
…
Though all of that information was available to all-source analysts at the CIA and the NCTC prior to the attempted attack, the dots were never connected, and as a result, the problem appears to be more about a component failure to ‘connect the dots,’ rather than a lack of information sharing. The information that was available to analysts, as is usually the case, was fragmentary and embedded in a large volume of other data.”
– Summary of the White House Review of the December 25, 2009 Attempted Terrorist Attack, from Whitehouse.gov [PDF]
Assume for a moment that you had full, unconditional access to Google’s dataset: what would you ask of it? Google uses it to predict the flu. What can you find? If you’re like most of us, you will lock up almost immediately. The paradox of choice overwhelms the rational mind with its virtually infinite possibilities.
The question of what to ask is important because the big data space, at present, is focused on data collection, storage and processing. Which is why things like Cloudera’s Flume and its S3 sink are (rightfully) the subject of intense interest. Not every problem in large-scale data processing is solved. Far from it, in fact.
But as Twitter’s Kevin Weil eloquently put it during his OSCON talk, “asking the right question is hard.” Which is the best explanation of why people like Kevin are so important.
At the recent Hadoop Summit, the reported consensus was that the stuff that used to be hard – collecting, storing and working on large volumes of data – is getting, if not easy, easier. Even for individuals, thanks in equal parts to cloud computing and open source software. A conclusion we subscribe to. The challenges that remain, however, may prove to be even more formidable. Because as intelligence agency failures like the December 25 attack prove quite adequately, asking the right question is hard, no matter how much we spend on tools and infrastructure.
The success of Google and the other web firms has led to the central belief that more data is always better. And statistically speaking, that tends to be true, particularly when you’re making predictions, as models for inference perform better with higher volumes of data. As Google’s Chief Scientist Peter Norvig put it, “We don’t have better algorithms than anyone else. We just have more data.”
Higher data volumes are far from a universal positive, however. First, there’s the fact that data may offer diminishing returns. As Google’s Chief Economist Hal Varian once observed:
There’s a kind of natural diminishing returns to scale just because of statistics: you have to have four times as big a sample to get twice as good an estimate.
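Varian’s arithmetic follows from the fact that the standard error of an estimate shrinks with the square root of the sample size. Here is a minimal sketch of that effect, assuming a simple random sample of a normally distributed metric (the distribution, sample sizes and trial count are illustrative choices of mine, not anything from Varian):

```python
import random
import statistics

# Varian's point in miniature: the standard error of a sample mean
# shrinks with the square root of the sample size, so quadrupling
# the data only roughly halves the uncertainty of the estimate.

random.seed(42)

def estimate_error(sample_size, trials=2000):
    """Empirical spread of the sample mean across many repeated samples."""
    means = []
    for _ in range(trials):
        sample = [random.gauss(0, 1) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return statistics.stdev(means)

for n in (100, 400, 1600):
    print(f"n = {n:5d}  estimated error ~ {estimate_error(n):.4f}")

# Each 4x increase in n cuts the error roughly in half:
# ~0.10 at n=100, ~0.05 at n=400, ~0.025 at n=1600.
```

In other words, the hundredth gigabyte buys you far less precision than the first one did.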
The bigger problem is the volume of data itself. Even if you can process it analytically, it’s difficult to know what to look for. What to ask. How to ask it. And how to keep asking different variations of the question to get not the answer you expect, but the correct answer.
Consider a piece from this week’s USA Today, entitled “Methods for detecting test bias flawed, research suggests.” Wherein lies the flaw? The question, not the data.
“A major new research project – led by a scholar who favors standardized testing – has just concluded that the methods used by the College Board (and just about every other testing entity for either admissions or employment testing) are seriously flawed…
In the common approach, individual questions are analyzed. What the new paper suggests is another way to look for bias. The scholars created a database with literally trillions of questions and scores on a range of tests, including all the major standardized tests used in college admissions. And this database featured trillions of questions that had been determined to have bias. But when samples were pulled out for analysis of a given question on a given test, the results came back negative for bias.
The conclusion, Aguinis said, is that question-by-question analysis doesn’t detect bias.
“Given our research, the conclusion that tests are unbiased should be revisited,” he said. “We need a much bigger question.”
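To see why item-by-item checks can miss systematic bias, consider a small simulation of my own construction (the item counts, group sizes and per-item bias below are hypothetical figures, not numbers from the Aguinis study): each question carries a bias too small to register on its own, but the bias accumulates across the full test.

```python
import random
import statistics

# Toy illustration, hypothetical numbers: 200 items each slightly biased
# against group B. Item-level checks mostly come back clean; total scores don't.

random.seed(7)

N_ITEMS, N_PER_GROUP, ITEM_BIAS = 200, 500, 0.02  # bias in probability of a correct answer

def simulate_group(p_correct):
    """Each row is one test-taker's item scores (1 = correct, 0 = incorrect)."""
    return [[1 if random.random() < p_correct else 0 for _ in range(N_ITEMS)]
            for _ in range(N_PER_GROUP)]

group_a = simulate_group(0.60)               # reference group
group_b = simulate_group(0.60 - ITEM_BIAS)   # slightly disadvantaged group

# Item-by-item check: difference in proportion correct vs. its standard error.
flagged = 0
for item in range(N_ITEMS):
    pa = statistics.mean(row[item] for row in group_a)
    pb = statistics.mean(row[item] for row in group_b)
    se = ((pa * (1 - pa) + pb * (1 - pb)) / N_PER_GROUP) ** 0.5
    if abs(pa - pb) > 1.96 * se:             # nominal 5% significance threshold
        flagged += 1

total_a = statistics.mean(sum(row) for row in group_a)
total_b = statistics.mean(sum(row) for row in group_b)

print(f"items flagged as biased: {flagged} of {N_ITEMS}")
print(f"average total score: group A {total_a:.1f}, group B {total_b:.1f}")

# The large majority of individual items show no significant difference,
# yet group B trails by roughly 4 points overall -- the bias is in the whole,
# not in any one part you happen to sample.
```

The flaw, in other words, isn’t in the data; it’s in asking the question one item at a time.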
So it’s hard to attack data with the right questions. Got it. What can be done? If Facebook and Twitter are any indication, the answer is to resource for it. Here’s Jeff Hammerbacher, ex-Facebook, on the origins of the data team there:
Around 4Q05 they decided to establish a ‘reporting and analytics’ function that was more like traditional DW/BI, and to hire a ‘research scientist’ to do things like identify and evaluate algorithms for news feed ranking. They hired one person into each role. Unfortunately, the person hired into the latter role passed away due to a tragic biking accident. I was hired in 1Q06 with the same title (‘research scientist’), but my role quickly evolved into supporting the functions of the reporting and analytics group. Some time in 3Q06, Adam D’Angelo returned and we discussed changing the two groups to be focused on ‘Data Infrastructure’ and ‘Data Analytics’, and in 4Q06 (I think) we merged them into the ‘Data Team’…
Dustin Moskovitz can talk more about the motivations in 2005. From asking him directly while at Facebook, the goal seemed to revolve around 1) building a historical repository which could be queried offline without impacting the live site and 2) figuring out if changes made to the site impacted user behavior in a positive or negative fashion. For 2, the controversial change near the end of 05 was adding high school networks.
Of course, not everybody has Facebook’s resources, but there are a variety of resources that can help you identify the right questions to ask. Including, yes, your friendly industry analysts. This problem of what question to ask is one of the reasons I do not personally subscribe to the idea that our profession is made up of surplus middlemen, though you should obviously consider the source. Even if you have perfect data, you almost certainly do not have perfect questions.
The fact is that even when the boundaries of a dataset are narrowly defined – as with, say, the Netflix data visualized by the New York Times – it’s easy to get lost in it. The trick is no longer merely being able to aggregate and operate on data; it’s knowing what to do with it.
Find the people who can do that, whether they’re FTEs or consultants, and you’ll have your competitive advantage. To answer the right questions, you need the right people.