By The Numbers #15

Alt + E S V


Commentary on some of the interesting numbers in the news

Data aggregation edition


Six hundred million: the number of customer satisfaction responses logged by the folks at HappyOrNot in their first 8 years. The company uses kiosks with a simple smiley face interface to collect quick customer feedback, primarily in retail settings. (Fans of 37Signals may recognize similarities with Smiley, a similar concept that functions as an app rather than a physical terminal.)

While surveys or comment cards may be information-rich, most people are unlikely to take the time to complete them. Rather than rely on a low volume of complex customer responses, the kiosk instead operates under the assumption that aggregating simple data can yield similarly rich insights.

A single HappyOrNot terminal can register thousands of impressions in a day, from people who buy and people who don’t. The terminals are self-explanatory, and customers can use them without breaking stride. In the jargon of tech, giving feedback through HappyOrNot is “frictionless.” And, although the responses are anonymous, they are time-stamped. One client discovered that customer satisfaction in a particular store plummeted at ten o’clock every morning. Video from a closed-circuit security camera revealed that the drop was caused by an employee who began work at that hour and took a long time to get going. She was retrained, and the frowns went away.
Customer Satisfaction at the Push of a Button, The New Yorker

When aggregated, anonymous individual data points have the power to create actionable insights. HappyOrNot tells the positive story of data accumulation, allowing companies to pinpoint successes and problem areas that might otherwise have gone undiscovered. However, as other headlines this week have demonstrated, there are considerable risks to data aggregation in certain contexts.


Beginning in 2015, fitness app Strava published a global heat map of users’ fitness activity. The most recent iteration comprises 3 trillion GPS data points showing worldwide activity by its users through September 2017. This data was (theoretically) anonymized and aggregated to show global activity patterns, but in the process the map inadvertently ended up displaying the locations and activity patterns of military bases around the world. Military personnel using fitness trackers revealed not only the locations of bases but also patterns of movement and supply/logistics routes.

This incident has captured national (and international) attention because the sensitive nature of the data revealed by the map has national security implications. However, advisors flagged this type of risk as far back as 2014, when the U.S. President’s Council of Advisors on Science and Technology discussed the substantial risks of “anonymous” personally identifiable information.

Anonymization of a data record might seem easy to implement. Unfortunately, it is increasingly easy to defeat anonymization by the very techniques that are being developed for many legitimate applications of big data. In general, as the size and diversity of available data grows, the likelihood of being able to re-identify individuals (that is, re-associate their records with their names) grows substantially.
Big Data and Privacy: A Technological Perspective, PCAST

According to Princeton’s Arvind Narayanan in A Precautionary Approach to Big Data Privacy, re-identification is particularly problematic when the underlying data is of a personally sensitive nature, and it is easier to accomplish with high-dimensional datasets “which contain many data points for each individual’s record.” Unfortunately, the Strava dataset epitomizes both of these characteristics, and the end result is not only the identification of base locations and activities at those bases, but also of the individuals themselves.
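The re-identification technique Narayanan describes can be sketched in a toy example. The data, names, and coordinates below are entirely hypothetical (no relation to the actual Strava dataset): the point is simply that once an attacker holds a small amount of auxiliary information, joining on a few quasi-identifiers (location, time of day) is often enough to re-associate “anonymous” records with a person.

```python
# Toy illustration of re-identification via quasi-identifiers.
# All records and names here are invented for demonstration purposes.

# "Anonymized" fitness records: names stripped, but GPS and timestamps kept.
anonymized_runs = [
    {"lat": 2.036, "lon": 45.341, "hour": 6, "pace_min_per_km": 5.2},
    {"lat": 2.036, "lon": 45.341, "hour": 6, "pace_min_per_km": 5.1},
    {"lat": 51.501, "lon": -0.142, "hour": 18, "pace_min_per_km": 6.0},
]

# Auxiliary data an attacker might already hold about one individual,
# e.g. a known schedule or a public social media post.
known_schedules = {
    "person_a": {"lat": 2.036, "lon": 45.341, "hour": 6},
}

def reidentify(runs, schedules):
    """Join 'anonymous' records back to names on (rounded location, hour)."""
    matches = {}
    for name, sched in schedules.items():
        key = (round(sched["lat"], 2), round(sched["lon"], 2), sched["hour"])
        matches[name] = [
            r for r in runs
            if (round(r["lat"], 2), round(r["lon"], 2), r["hour"]) == key
        ]
    return matches

print(reidentify(anonymized_runs, known_schedules))
```

The more dimensions each record carries (pace, route shape, device model, and so on), the fewer auxiliary facts the attacker needs before the join becomes unique, which is why high-dimensional datasets are especially vulnerable.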

This incident has shown that data collection has externalities. Zeynep Tufekci argued both on Twitter and in longer form in the New York Times that in an age where re-identification of data is increasingly possible as data sets grow and machine learning improves, neither individuals nor companies are able to fully grasp the implications of data privacy.

Part of the problem with the ideal of individualized informed consent is that it assumes companies have the ability to inform us about the risks we are consenting to. They don’t. Strava surely did not intend to reveal the GPS coordinates of a possible Central Intelligence Agency annex in Mogadishu, Somalia — but it may have done just that. Even if all technology companies meant well and acted in good faith, they would not be in a position to let you know what exactly you were signing up for.
The Latest Data Privacy Debacle, The New York Times

I enjoyed the contrast between these two stories: both companies employ data aggregation, but to very different ends. We at RedMonk talk frequently about the value of data. Data is an excellent driver of business decisions and business value, even if that value is not currently reflected on the balance sheet. In particular, we frequently coach that sharing data can be an especially attractive way to reach developer audiences. Aggregated data is a powerful business tool, but as these examples illustrate, data aggregation is a double-edged sword. Just as there are large upsides, there are also substantial negatives to consider. Especially when dealing with potentially sensitive information, care and caution must be taken when gathering, aggregating, and especially publicizing user data.

(Featured image photo credit: Flickr/morebyless under CC-BY 2.0)
