tecosystems

Beyond Cassandra: Facebook, Twitter and the Future of Development

Whenever an internet company elects to build its own database, web server or language framework (coverage), the inevitable result is a discussion of the relative merits of the new technologies versus those not chosen. The canonical example here is the endless comparison of NoSQL infrastructure to the more traditional relational database approach (coverage). However interesting such conversations might be, they’re obscuring the longer term implications of a fundamental shift in the way that software is produced, and why.

Historically, businesses – web and otherwise – have used a variety of mechanisms to protect their assets, to preserve their competitive advantage. In the software world, we’ve seen copyright, licensing, non-disclosure agreements, patents, trademarks, and a host of other legal tools employed. If you’ve been in the industry for any length of time, chances are you’ve been on one end or the other of one or all of the above at some point.

In the past ten or fifteen years, however, we’ve seen software firms increasingly ask a simple but profound question: what are the assets I must protect? Twenty or more years ago, the answer was simple: protect everything. Ten years ago, as open source was on the way to becoming a mainstream software development practice and companies built upon the resulting projects grew exponentially in size, the reply was more nuanced. A lot of software needed to be protected, but there were substantial chunks that could be shared. Today, firms appear to be asking a different question: is my value in data, or source code? And if the answer is data, what should my software development practices look like?

Facebook and Twitter, as high profile properties that grew up without the legacy protectionist mindset, might be illustrative here.

If we examine developers.facebook.com/opensource/, for example, there are several very obvious trends.

  • Facebook has a strong preference for permissive licensing. Wherever possible, Facebook avoids copyleft licensing in favor of more liberal alternatives. The default licensing choice, in fact, appears to be Apache 2.0 (e.g. Cassandra, Scribe, Tornado and Thrift), with other licenses employed tactically for compliance or compatibility reasons (e.g. GPL and Flashcache or PHP and Hip-Hop).
  • Between their contributions to previously existing projects (e.g. Hadoop, Cfengine, memcached, MySQL, and PHP) and releases of software they built (e.g. Cassandra, Hip-Hop, Hive, Scribe, Tornado, Thrift) the core of Facebook’s infrastructure is built on non-differentiating, publicly available code (Update: just for reference, we’re told via email that Facebook, “no longer contributes to nor uses Cassandra.” Update 2: we are now being told – and Facebook has confirmed – that Cassandra is actually still employed by the company for, among other things, Inbox Search.)
  • Language usage at Facebook is fairly heterogeneous, with both dynamic languages (e.g. Javascript, PHP, Python) and traditional alternatives (e.g. C, C++, Java) represented. Perhaps because of Facebook’s emphasis on performance, however, the latter are significantly more common than the former.
  • Facebook hosts very few of their own assets; Tornado appears to be the notable exception (possibly because it came from FriendFeed). Some assets are hosted with Github (coverage); those that are not are typically housed at Apache.

As for Twitter’s twitter.com/about/opensource:

  • Twitter, like Facebook, has an affinity for permissive licenses in general and the Apache license specifically. Twurl, a Twitter-specific flavor of Curl, is MIT licensed, but FlockDB, Gizzard, Murder and even its GC trace script jvm-gc-stats are Apache 2.0 licensed.
  • By all accounts, Twitter runs on a similarly undifferentiated infrastructure. Its primary data storage, for example, has been MySQL based with a parallel implementation of Cassandra (which Twitter contributes to). Their social features are likewise enabled via a graph database, FlockDB, whose source is available.
  • Languages at Twitter are similarly heterogeneous, though Twitter appears to rely more heavily on dynamic languages than does Facebook (Murder is 97% Python / 3% Ruby, for example), resorting to Scala when performance is at a premium (FlockDB is 83% Scala, Gizzard 100%).
  • Effectively zero of Twitter’s released open source projects are self-hosted; Twitter has instead outsourced this task to Github. There does not appear to be any predisposition toward existing open source foundations, Apache or otherwise.

Though Facebook and Twitter clearly have some differentiation in their operational priorities and philosophies, then, the similarities far outweigh the differences. Following on the heels of Amazon, Google, Yahoo and the other early web firms, Facebook, Twitter et al are pushing the envelope even further: Google publishes their algorithms (e.g. MapReduce), Facebook their software (e.g. Cassandra).

If they are at all representative of the direction of application development in web native firms, then, we might reasonably expect the following:

  • Default to Open Source:
    Rather than ask whether a given asset should be open source, firms are likely to increasingly try to identify which pieces should not be. We don’t see many businesses running off of an entirely open source foundation, but the differentiation points are typically further up the stack. In practical terms, then, this means that it will be difficult to differentiate, competitively, on infrastructure software. And if there is no competitive advantage in your infrastructure, the benefits to using or releasing open source software – whether those are better resource availability or the ability to amortize development costs – are likely to outweigh the marginal benefit of developing it strictly in house.
  • Language Heterogeneity:
    Traditional development best practices – which typically anoint a language or set of languages as the permitted options – will likely become workload specific. Performance or scale sensitive applications, for example, would be restricted to a set of predetermined language options (e.g. C/C++ at Facebook, Scala at Twitter, etc). Glue languages, however, are likely to be far less homogeneous, and reflective of different influencers (e.g. developer preferences, available bindings/libraries, etc).
  • No Core Competencies in Project Hosting:
    Few if any web firms are attempting to specialize in project hosting. This task is increasingly being left to specialized hosts (e.g. Github) or governance oriented foundations (e.g. Apache). This is preferable from a developer standpoint, because centralized project hosting simplifies discovery/cross-pollination and enables network effects such as social, collaborative development. Source code control is increasingly likely to be distributed by default, as well.
  • Permissive Licensing Standard:
    Much has been made in some quarters over the decline of the GPL. While the “decline” is unquestionably overstated (coverage) considering that the license is more popular than the next ten licenses combined, my expectation entering 2010 was that permissive licensing would continue to grow at the expense of reciprocal licensing (coverage). The behavior of web firms generally validates this assertion.
  • Precise Identification of Value:
    It would be absurd to argue that the value of a Twitter was in no part related to the software that powers it. But it would be equally foolish to suggest, when we have open source Twitter clones such as StatusNet freely available, that the value was all, or even mostly, in the software. The value of a Facebook or a Twitter is ultimately in the data they generate, not the code. In a very real sense, its users are its asset. When application development is considered, then, it will be considered with this in mind. If code isn’t ultimately your differentiating asset, then the dynamics of development are irrevocably altered.

Many of you are doubtless curious as to how relevant the application development experiences of unique, web native businesses such as Facebook and Twitter are to traditional enterprise customers. The answer depends largely on timeframe. In the short term, the impact will be minimal both because enterprises move slowly and because their attention to web firms is fairly minimal. In the longer term, however, the web firms have the ability to substantially influence developer best practices, product direction and so on. Witness the mainstream popularity within enterprises today of dynamic languages, once popularized by web firms, or the accelerating adoption of projects such as Hadoop.

We will not see within the foreseeable future a world in which all software is open source, nor one in which there is no differentiation to be found in development. It is likely, however, that as our understanding and appreciation of what, precisely, is differentiating improves, our software development practices will evolve along with it. Where better to look, then, to understand where things are going than to firms that have grown up without preconceived notions of what must be protected at all costs?