As my colleague Stephen said last week, everyone has interesting data. When I was at O’Reilly’s Strata conference earlier this month, I was fortunate enough to meet Karl Van den Bergh of Jaspersoft, a business intelligence company that says it makes the world’s most popular BI software. But it turns out that Jaspersoft has some very interesting data that’s relevant far beyond its own arena, because its software can connect to a wide range of data sources, including a number of NoSQL databases. The company has taken one approach to this data by creating a “Big Data Index” that includes some time-series data. Karl was kind enough to share this data with us, so we’ll be taking a few different approaches to looking at it over the next few weeks. For this first pass, we’ll just look at a summary of the total downloads between January 2011 and March 2012:
What’s striking about this is how broadly consistent it is with data from other sources, as well as with my intuitive expectation that Hadoop, Mongo, and Cassandra would show up at the top of the list. Some interesting points about relative popularity:
- Hadoop and Mongo are quite similar;
- Cassandra was surprisingly low, to me, although it’s quite competitive with CouchDB;
- Redis shows a fairly strong placement, supporting its status as an up-and-comer.
I split the Hadoop downloads into a stacked histogram showing Hive, HBase and Avro separately. Over this time span, the more SQL-like Hive beat out HBase with about 56% more downloads: 3,682 to 2,360. This could be a reflection of the growing popularity of Big Data applications for people new to the Hadoop ecosystem, who are looking for a familiar toolset to lower the barrier to entry. Avro, which you may not have heard of, is a serialization format for Hadoop that’s designed for data-intensive applications, so it’s no surprise that a niche use case shows less popularity than the more broadly applicable HBase and Hive methods for accessing Hadoop.
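For anyone who wants to check the Hive-versus-HBase comparison, here is a minimal sketch of the arithmetic in Python. The two counts are the ones quoted above; the variable names and the `percent_more` helper are just illustrative, not anything from Jaspersoft’s data set. With these figures, Hive’s lead works out to roughly 56%, a bit more than half again as many downloads.

```python
# Download counts quoted in the post (January 2011 to March 2012).
hive_downloads = 3682
hbase_downloads = 2360


def percent_more(a, b):
    """Return how many percent larger a is than b (hypothetical helper)."""
    return (a - b) / b * 100


hive_lead = percent_more(hive_downloads, hbase_downloads)
print(f"Hive had {hive_lead:.0f}% more downloads than HBase")
```

The same helper could be pointed at any other pair of totals from the index to compare them on the same footing.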
The thing that’s particularly powerful about this data is that everyone has something like it. Whether it’s download statistics or web traffic, it can all provide useful insights, especially when combined with other data, as we can do with RedMonk Analytics.
Disclosure: Jaspersoft is not a client. Hadoop distributor Cloudera as well as MongoDB-based 10gen are both clients, but Cassandra support company DataStax is not.