
RedMonk

links for 2010-03-04

by-nc-sa

Flightcaster and the Future of Asymmetric Intelligence as a Product

When John Paulson bet against the real estate markets, he knew something that other people didn’t. By applying his models against purchased real estate databases, he perceived an opportunity where others saw folly. Fifteen billion in profit later, people are understandably a bit curious as to how he pulled it off. Gladwell’s explanation (subscription required) focuses on the man. Personally, I’m more interested in the technology, because I think it’s probable that we’re going to see a lot more Paulsons in the days ahead.

Similarly outsized profits will, presumably, still be rare, what with the number of people entrusted to bet hundreds of millions of dollars of other people’s money not likely to ever be large. But between the accelerating democratization of the tools of large scale data processing, the dependent trend towards ever more frictionless access to data, and dramatically lower compute costs, we’re going to see a lot more profit from data derived intelligence. In other words, I’d bet long on enterprises whose primary product is analytics driven insight.

We often speak of data as if it is the ends rather than the means. But the raw data often isn’t much help. We need intelligence, by which I mean insights derived from the data source. Or sources. To make data usable, we need to make sense of it. Which means sorting it, visualizing it, comparing it to predictive or historical models, and – increasingly – recombining it with other data.

The Boston Red Sox subscribe to a private weather forecasting service, Meteorlogix. Why? When there is so much free weather data available, why would the Red Sox pay a private service? Presumably because they feel that the private firm offers something the public services do not, and as a weather sensitive business, the economic impact of even marginally better forecasting could be material.

Businesses that monetize data aren’t new, of course. What’s different today are the capital costs. Large scale data processing software can be obtained for free. The same is often true for operational data. Storage and compute costs, meanwhile, are now both pay-as-you-go and accessible even to the smallest businesses, thanks in part to the cloud.

Consider the case of Flightcaster. After initially dismissing them as little more than a yet-to-be-acquired feature of a TripIt, I’m beginning to wonder whether or not I’ve got it the wrong way round. TripIt’s proven that there’s money in optimizing the travel schedule of individual consumers. But isn’t it possible that Flightcaster will eventually be able to extract significantly more revenue on higher margins from airlines for helping optimize their operations?

Overall, 87% of flights had time added to their schedules between 1996 and 2009, while only 80% experienced longer actual elapsed times. Meanwhile, 10% had time subtracted from their schedules, but 16% of flights were faster in actuality. So airlines were certainly over-compensating in 2009.

Motives? Like Scott says, they are manifold: better operations overall, better on-time performance, better ability to plan.

It’s a game airlines play to balance their operational needs and customer service. Sometimes they win, sometimes they lose. But predictability of delays is the biggest lever to help them play this game. Over time, we hope to use FlightCaster data to help with these kinds of decisions as we gather more data and analyze it in different ways.

Emphasis mine. How did Flightcaster, a one time Y Combinator startup, put itself in a position to know more about the state of airline operations than the airlines themselves? By building themselves a highly differentiated dataset amalgamated from sources like the Bureau of Transportation Statistics, FAA Air Traffic Control System Command Center, FlightStats and the National Weather Service. From an interview with their head of research, Bradford Cross:

The public data set that we use is the “on-time database” published by the FAA. The data set is tricky to get all in one place since the FAA does not provide any decent API to it. The biggest issue is that we make real time predictions, so we needed a historical set of captured real time data, which we had to create ourselves.

Having a more amalgamated real time dataset going back historically for a decade would be a big help. Having more modernized ways of accessing the data would be helpful.

Until then, if anyone wants to buy it, we will sell it to them for a very high price.

Is Flightcaster’s iPhone app the important product for the firm, then? I doubt it. It’s useful as a marketing tool, I’m sure, but ultimately the value of the firm lies in their data. By combining public datasets, Flightcaster can answer the easy questions – who is the most delayed airline? the most delayed airports? – as well as more complicated analyses such as “how our political system is causing flight delays,” “whether or not winter weather is causing delays” and so on. Like Google, their real value is underappreciated, because it’s a product that is indirectly monetized.
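The "easy questions" mentioned above amount to a group-and-average over delay records. Here is a minimal sketch, assuming a flat list of (carrier, airport, delay) rows; the rows, field layout, and function names are invented for illustration and are not Flightcaster's data or method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical on-time records: (carrier, origin airport, arrival delay in minutes).
# These rows are invented for illustration; they are not real BTS or FAA data.
FLIGHTS = [
    ("AA", "BOS", 12),
    ("AA", "ORD", 45),
    ("UA", "ORD", 30),
    ("UA", "SFO", 5),
    ("DL", "ATL", 0),
    ("DL", "BOS", 10),
]

def mean_delay_by(records, key_index):
    """Average arrival delay grouped by the column at key_index."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key_index]].append(row[2])
    return {key: mean(delays) for key, delays in groups.items()}

def most_delayed(records, key_index):
    """Return the (key, mean delay) pair with the highest average delay."""
    averages = mean_delay_by(records, key_index)
    return max(averages.items(), key=lambda item: item[1])

worst_carrier = most_delayed(FLIGHTS, 0)   # group by carrier
worst_airport = most_delayed(FLIGHTS, 1)   # group by airport
print(worst_carrier, worst_airport)
```

The harder analyses (weather, politics) would presumably need the same records joined against other sources, which is where the amalgamated dataset earns its value.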

How many businesses like Flightcaster are poised to emerge over the next few years, with data easier to get, the tools to work on it cheaper, and the financial incentives better understood? Tough to say. But it’s safe to assume that there are thousands of similar asymmetries between publicly available data and the intelligence it contains yet to be discovered.

Which is why it doesn’t take much of a model to predict more of them.


Who’s Winning the Cloud Marketing Battle?

One of the cloud related questions we get with some frequency is: who’s winning the marketing battle? For all that the cloud has (justifiably) become a much maligned term for technologists, thanks to the liberal abuse of the term by marketers everywhere, the fact remains that the cloud represents the next major battlefield for vendors small and large. As such, getting insight into who is being talked about, and how much – relative to their competitors – is a useful bit of intelligence.

To help answer this question, I decided to take a quick look at some numbers from ITDatabase.com – who is a client, for the sake of disclosure. The following visualization plots the actual press hits ITDatabase has recorded over the past three and six months, respectively, for a query of “cloud computing.” This is certainly not an authoritative answer to the question of who’s winning the cloud marketing battle, but I found the results interesting enough to share.

The surprises for me were Apple and Microsoft. Also interesting was the fact that Amazon and IBM were roughly on par, and that large providers like Rackspace and suppliers such as Red Hat didn’t make the list. The fact that most of the rest of the players were on par was more predictable.

Again, this is a single metric, and a proxy for visibility at that. But it is certainly something we’ll be discussing with our clients moving forward, and probably worth vendor consideration as well.


Data vs Dual Licensing: Which Will Make More Money?

By 2012, at least 70% of the revenue from commercial OSS will come from vendor-centric projects with dual-license business models.

80% probability. This may be true today, but the lack of revenue among broader market OSS products compared to Linux isn’t large enough yet to make this one a done deal. What is clear is that the overwhelming majority of ‘commercial oss’ efforts are based on a dual license model – vendors prefer the ‘open core’ moniker because it sounds more OSS friendly but it’s essentially the same thing.
- Mark Driver, Gartner, Open Source Predictions for 2010

This prediction from Gartner’s Mark Driver confused me, I’ll admit, when I first read it. Baffled me, actually. Looking at the market, it seemed clear to me that the practice of dual licensing was, if anything, in decline. I couldn’t see how we could look at the same market and come to such different conclusions. My view was similar to that of Brian Aker, most recently of MySQL/Sun (as is that of Cloudera’s Mike Olson, notably):

When MySQL pushed dual licensing, investors looked for this hook in every business model. I remember standing outside of a conference room in SF a couple of years ago and talking to one of the Mozilla Foundation people. Their question to me was “Is the nonsense over dual licensing being the future over yet?”. The fact is, there are few, and growing fewer, opportunities to make money on dual licensing. Dual licensing is one of the areas where open source can often commoditize other open source right out of the market. The dearth of companies following in MySQL’s dual licensing footsteps to riches, belabors the point of how niche this solution was.

Even for MySQL, long the standard bearer for the approach, the logistics of dual licensing were and are becoming increasingly problematic over time:

For smaller firms, the primary limitation [of dual licensing] is the development. Unlike non-dual licensed projects which need only concern themselves with the quality and provenance of code contributions from external parties, dual-license vendors need also consider the question of copyright ownership. Because dual licensing depends on ownership of copyright for the entirety of the asset in question, third parties must either assign or be willing to jointly hold the copyright for any potential contributions. Early in a project’s lifecycle, this is a minor concern because the project owner likely employs most of those qualified to improve it. As a project matures and becomes more popular, however, this is a more pressing issue. First, because it acts to inhibit community participation (see slide 18 of this deck produced by Monty), but second – and more problematically – it means that third parties can, in practical terms, offer a more complete product.

Jeremy Zawodny made reference to the practical implications of the dual license in a post from December of last year entitled “The New MySQL Landscape.” In it, he made the assertion that “You can get a ‘better’ MySQL than the one Sun/MySQL gives you today. For free.” This is the cost of the dual licensing model: in return for the right to exclusively relicense the code, you forfeit a.) the right to amortize your development costs across a wide body of contributors, and b.) the right to uniformly integrate the best patches/fixes/etc that are made available under the original license because you cannot always acquire the copyright.

This doesn’t mean that dual licensing is a uniformly bad strategy, but it does imply that it has costs, and that those costs escalate over time. This situation is the inevitable result of the dual license model over time as applied to a successful project. For those looking for perspective from a MySQL and Drizzle developer, I’d recommend reading Brian Aker’s piece here.

Even setting aside the disincentives to pursuing a dual licensing strategy, the basic math of the 70% argument didn’t work for me. Even at MySQL, remember, only a fraction of the revenue is derived from the issuance of dual licenses. And even if we assumed, for the sake of argument, that the entire revenue stream was the product of dual licensing, that still wouldn’t be enough to meet the 70% projection. Not nearly so.

Driver acknowledges as much when he says “the lack of revenue among broader market OSS products compared to Linux isn’t large enough yet.” Linux, it seems clear, is the largest single open source commercial ecosystem, and due to the lack of centralized copyright ownership, it cannot be dual licensed by anyone. What Driver is saying, in other words, is that the open source commercial ecosystem has to be big enough that Linux doesn’t comprise more than 30% of it.

Consider the following back of the envelope calculations. Red Hat’s revenues in the year ending 2009 were $652 million and change. We know that, for copyright and licensing reasons, none of that money can derive from dual licensing. If we assumed, counterfactually, that Red Hat represented all of the non-dual license revenue of the market – the leftover 30% – my math says that the total revenue picture would be around $2.17B. Meaning that we need a little more than two Red Hats’ worth of additional revenue to emerge from dual licensees like MySQL.
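For anyone who wants to check the arithmetic, here is a minimal sketch of the same back of the envelope calculation. The only inputs are the two figures from the paragraph above, with $652M rounded down from "$652 million and change"; the variable and function names are mine.

```python
# Back-of-the-envelope check of the 70/30 split discussed above.
RED_HAT_REVENUE_M = 652.0   # Red Hat revenue in $M, rounded from the text
NON_DUAL_SHARE = 0.30       # share left over if 70% comes from dual licensing

def implied_market(non_dual_revenue_m: float, non_dual_share: float) -> float:
    """Total market size implied if non_dual_revenue_m is non_dual_share of it."""
    return non_dual_revenue_m / non_dual_share

total_m = implied_market(RED_HAT_REVENUE_M, NON_DUAL_SHARE)
dual_m = total_m - RED_HAT_REVENUE_M           # revenue dual licensing must supply
red_hat_multiples = dual_m / RED_HAT_REVENUE_M

print(f"Implied total market: ${total_m / 1000:.2f}B")
print(f"Required dual-license revenue: ${dual_m / 1000:.2f}B")
print(f"That is {red_hat_multiples:.2f}x Red Hat's revenue")
```

Because Red Hat stands in for the entire non-dual-license 30% here, the real bar would be higher still once other non-dual-license vendors are counted.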

Personally, I’m skeptical that that would happen, even with the hybrid source trend.

Part of the problem is, I believe, semantics. Driver seems to be conflating what is sometimes referred to as “open core” licensing with dual licensing. Personally, I believe they are distinct. The former tends to refer to varying combinations of open source and proprietary codebases, while the latter is more generally used in conjunction with copyright mechanisms as they apply to a single open source codebase. This view is supported by my analyst colleagues over at the 451 Group.

Were we to grant Driver the more expansive definition of dual licensing, however, I still think that figure is wrong. Based on the conversations we’re having with vendors in the space, it seems more likely that revenue growth and expansion will come not from quote unquote dual licensing, but from intelligence derived from gathered data and telemetry.

Judging by the almost universally poor conversion metrics – that is, the number of users of a given open source tool that are converted to paying customers – it seems reasonable to assert that there are ongoing and systemic issues in the commercialization of open source software. Hence the proliferation of alternative revenue models such as dual licensing, open core and even SaaS. It is far from clear, however, that these models satisfactorily align customer and vendor interests such that conversion percentages will rise to levels where they are competitive with proprietary software.

At the end of the day, open source customers are generally paying for one or more of a.) break/fix/integration/support/etc services that they hope not to need, b.) withheld features that they need to pay to gain access to, or c.) the right to not observe the terms and conditions of the original license. The relative distribution of revenue within this set is skewed by the size and scope of the Linux community towards A, with B being the raison d’etre for open core and C the same for dual licensing.

But what if open source vendors could leverage their primary strength – distribution – more effectively as a direct revenue stream? I’ve been predicting for three years or so that they would do just that, via data aggregation and analytics. The alignment of customer and vendor goals is better in this model than in virtually any other. The simplest example of this model outside of open source is Google, who provides users with search at no cost, receiving in return massive volumes of data which they monetize both directly (contextual ad placement) and indirectly (algorithmic improvement, machine learning, intelligence for product planning strategy, etc). Why couldn’t software vendors employ a similar model, trading free software for user generated telemetry data? The answer is, they can. SpiceWorks, for one, is doing just that now, quite successfully, albeit not with open source software.

The strength of open source is in its ubiquity, and the volume it commands ensures that the telemetry returned would have substantial – potentially immense, depending on the project – value. Importantly, however, the value lies in the aggregation. A single user’s telemetry is likely to be relatively uninteresting. A hundred users’ telemetry, more interesting. A thousand users’, that much more so, and so on. Users, therefore, wouldn’t be surrendering anything of material value to a would-be vendor in the transaction. Better, analysis of the aggregate could have enormous value to customers. How is my infrastructure performing relative to similar environments? What are the types of conditions that indicate a potential problem? What differentiates my architecture from the Top 10 best performing? These are answerable questions…if you have a big enough dataset. Most customers would not have that; an open source software provider aggregating and analyzing their combined telemetry would.
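To make the first of those questions concrete, here is a minimal sketch of how a vendor holding pooled telemetry might tell one customer where they stand. The metric, the numbers, and the function name are all hypothetical; no particular vendor's service is being described.

```python
from bisect import bisect_left

def percentile_rank(value, population):
    """Percentage of the population with a strictly lower value."""
    ordered = sorted(population)
    return 100.0 * bisect_left(ordered, value) / len(ordered)

# Hypothetical, anonymized metric pooled from many installations
# (say, median request latency in milliseconds). Invented numbers.
POOLED_LATENCY_MS = [80, 95, 110, 120, 150, 170, 200, 240, 300, 450]

my_latency_ms = 170
faster_share = percentile_rank(my_latency_ms, POOLED_LATENCY_MS)
print(f"{faster_share:.0f}% of comparable environments report lower latency")
```

A single installation cannot compute this for itself, because it only ever sees its own number; the comparison exists only in the aggregate.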

Privacy and trust will certainly be concerns, but if the right data is offered as an incentive and the appropriate anonymization assured, those can be addressed for most customers. Those that remain concerned should have the ability to opt out, understanding that they will in turn have no access to the resulting analytics, and might therefore be at a disadvantage relative to competitors who are using the intelligence.

This direction seems nothing less than inevitable to me, and so it is no surprise that we’re beginning to see (and help) a variety of open source vendors move in this direction. Free and open source data has a bright future regardless of the revenue model, but as we see successful projects better leverage their traction via analytics, the result should be a win for ecosystems and customers alike.

Whether or not you believe, as I do, that the money is ultimately going to come from data more than code, it seems clear to me that it is not going to come from what is commonly considered to be dual licensing. Because while it is not true that I am an enemy of that particular approach, I do believe it’s in decline. Not least because it’s poorly aligned with customers’ needs.

Unlike data.


They Say The Pacific Has No Memory: Well Neither Do Facebook, Twitter or Your iPhone

“You know what the Mexicans say about the Pacific? They say it has no memory.”

It’s easy to understand why Andy Dufresne, wrongfully imprisoned protagonist of The Shawshank Redemption, would value a place with no memory, no history. What’s less obvious, at least to me, is why we all feel that way.

Because clearly we must. Facebook, after all, has in recent days passed first the 400 million user mark and then Yahoo. And while the tool is exquisitely well designed to help us share the present, as far as the world’s most popular social network is concerned, the past may as well have never happened. What were you doing this time last year? Good luck figuring that one out; the UI certainly isn’t going to help you.

Which is not to single out Facebook: Twitter is no better. I can’t find an answer to the question of how far back Twitter lets you browse in their API documentation, but I’ve seen the number 3200 claimed a few times. If that’s true, about 66% of my Twitter history is non-visible to me, the author. And while my iPhone dutifully backs up my SMS history, it does not – at least as far as I can tell – expose it to me. The closest thing is knowing where they’re stored on the file system.

Yes, there are workarounds for all of the above: piping feeds into backup services like BackupMyTweets or YouArchiveIt, using one off tools to extract and store your data, and so on. But how many users do you think will be able to find and successfully use those? More to the point: what are they going to do with raw backups? How will they search it, look it up by date, or drop it in a calendar?

They won’t, in all likelihood. It will be just like it is today: as if the past had never happened.

For some, that might be a good thing. For others, it won’t matter, because the content is of marginal value or less. But for a subset of users and a subset of content, the unavailability is not a boon. Wouldn’t it be nice, for example, to look back on all the congratulations you received when you had a child? Started your new job? Got married? Or just had a birthday? More interestingly, might not that content have latent, unrealized value? Isn’t it possible, for example, that you could do sentiment analysis on your Twitter stream and have a more realistic and objective look at fluctuations in your mood and outlook, day to day, month to month, or year to year? Might you be able to mine it for undiscovered patterns of behavior? Imagine being able to browse your Facebook, Twitter and SMS history via just a simple calendar, the way you can with FourSquare. Are there privacy issues? You bet. But the alternative – privacy through simple loss of data – is no more attractive.
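None of this is technically hard once the data is in hand. As a rough sketch, assuming your history has already been exported into simple (timestamp, source, text) rows, the calendar view and a crude mood score take only a few lines. The sample rows and the tiny word lists are invented, and real sentiment analysis would need a far better model than counting words.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical archive rows: (ISO 8601 timestamp, source, text). Invented sample data.
ARCHIVE = [
    ("2009-03-04T09:15:00", "twitter", "Great news, congratulations on the new job!"),
    ("2009-03-04T18:40:00", "sms", "Awful commute, terrible traffic"),
    ("2009-03-05T08:05:00", "facebook", "Happy birthday!"),
]

# Toy word lists; a real sentiment model would be far richer.
POSITIVE = {"great", "congratulations", "happy"}
NEGATIVE = {"awful", "terrible"}

def score(text):
    """Count positive words minus negative words in the text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def by_day(rows):
    """Group messages under the calendar date on which they were written."""
    calendar = defaultdict(list)
    for stamp, source, text in rows:
        day = datetime.fromisoformat(stamp).date()
        calendar[day].append((source, text))
    return calendar

def daily_mood(rows):
    """Sum the toy sentiment score for each calendar day."""
    return {day: sum(score(text) for _, text in items)
            for day, items in by_day(rows).items()}

mood = daily_mood(ARCHIVE)
for day in sorted(mood):
    print(day, mood[day])
```

The hard part, in other words, is not the analysis. It is getting the rows out of the services in the first place.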

There’s a reason that some very smart people are interested in “logging” more and more aspects of their lives: the more data you have, the more meaningful the conclusions you can extract. Unfortunately, Facebook, Twitter et al are just trying to keep their heads above water at this point – four months ago, Facebook was adding 24-25 terabytes per day – so returning our data to us period, let alone making it useful and meaningful, just isn’t a priority for them right now. In fact, we can probably recreate the Civil War more accurately from correspondence than we can the recent events of our lives.

But it should be more of a priority for us, and for those building applications for us. Because living strictly in the present at the expense of the past rarely does anyone any good. Just ask George Santayana.
