
Microsoft goes after the barrier to entry for data science with Azure ML

A month ago, I got a pre-briefing on Microsoft’s Azure Machine Learning with Roger Barga (group program manager, machine learning) and Joseph Sirosh (CVP, Machine Learning). Yesterday, Microsoft made it available to customers and partners, so now seems like the right time to talk about how it fits into the broader market.

The TL;DR is that I’m quite impressed by the story and demo Microsoft showed around machine learning. They’ve paid attention to the need for simplicity while enabling the flexibility that any serious developer or data scientist will want.

Here’s an example of a slide from their briefing, which obviously resonates with us here at RedMonk:

Machine Learning Briefing June 2014_p3

For example, we constantly hear about toolsets like Apache Mahout (for Hadoop) being more prototype than anything you can actually put into production. You need a deep knowledge of machine learning to get things up and running, whereas Microsoft is making the effort to curate solid algorithms. This makes for a nice overlap between Microsoft product and research, the latter of which has some outstanding examples of machine learning (such as Rick Rashid's real-time English-to-Chinese translation demo in late 2012).

In action, Azure ML looks a lot like Yahoo Pipes for data science. You plug in sources and sinks without thinking too much about how that all happens. The main expertise needed seems to fall into two areas:

  1. (Largely glossed over) Cleaning the data before working with it
  2. Choosing an algorithm that makes sense given your data and assumptions

Both of these require expertise in machine learning, and I'm not yet sure how Microsoft plans to get around that. Their target market, as described to me, is "emerging data scientists" coming out of universities and bootcamps: somewhere between the experts and the data analysts who spend all day doing SQL queries and data modeling. One approach would be to compare the data against various distributions to check the best fit, and whether that suits the chosen algorithm; another would be a preference for nonparametric algorithms.
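As a minimal sketch of that first approach (the sample data and the pair of candidate distributions are my own illustration, not anything Microsoft described), you can fit each candidate distribution and compare log-likelihoods:

```python
from math import log
from statistics import NormalDist, mean

# Hypothetical sample of a feature column.
sample = [0.2, 0.5, 0.1, 0.9, 0.3, 0.4, 0.2, 0.6]

def loglik_normal(xs):
    # Fit a normal distribution by maximum likelihood, then score the data.
    d = NormalDist.from_samples(xs)
    return sum(log(d.pdf(x)) for x in xs)

def loglik_exponential(xs):
    # MLE for an exponential distribution: rate = 1 / mean.
    lam = 1 / mean(xs)
    return sum(log(lam) - lam * x for x in xs)

fits = {"normal": loglik_normal(sample),
        "exponential": loglik_exponential(sample)}
best = max(fits, key=fits.get)
print(best)  # the better-fitting candidate for this sample
```

A real tool would compare more candidates and penalize parameter count (e.g. AIC), but the shape of the check is the same.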

Here’s a screenshot of a pipeline:

Machine Learning Briefing June 2014_p4screen

From my point of view, a critical feature for any pipeline like this is flexibility. Microsoft's never going to provide every algorithm of interest. The best they can hope for is to cover the 80% of common use cases; however, there's no guarantee that it's the same 80% across every customer and use case. That's why flexibility is vital to tools like this, even when they're trying to democratize a complex problem domain.

That’s why I was thrilled to hear them describe the flexibility in the platform:

  • You can create custom data ingress/egress modules
  • You can apply arbitrary R operations for data transformation
  • You can upload custom R packages
  • You can eventually productionize models through the machine-learning API

All of this, except for the one-off R operations, will rely on the machine-learning SDK:

Machine Learning Briefing June 2014_p10

Much like higher-level AWS services such as Elastic Beanstalk, you don’t pay for the stack, you pay for the underlying resources consumed. In other words, you don’t pay to set up the job, you pay when you click run.

Microsoft’s got a solid product offering here. They need to figure out how to tell the right stories to the right audiences about ease of use and flexibility, build broader appeal to both forward-leaning and enterprise audiences, and continue to focus on constructing a larger data-science offering on Azure and on Windows (including partners like Hortonworks). They also need to continue reaching toward openness, as they’ve shown with things like Linux IaaS support and Node.js support. One example would be Python, an increasingly popular language for data science.

Disclosure: Microsoft and AWS have been clients. Hortonworks is not.


Categories: adoption, big-data, data-science, microsoft.

Widespread correlations across programming-language rankings

IEEE Spectrum recently came out with a very interesting interactive tool for ranking programming languages. What makes it interesting is that it incorporates 12 different sources including data from code, jobs, conversation, and searches — and you can customize the weights assigned to each source.


But the first thing that occurred to me was, this is a fantastic opportunity to look at commonalities and communities across all of these sources. That could tell us about which places could provide unique insight into what technologies developers care about and use, and which provide mainly reinforcement of others.

Before I did anything, however, I wanted to test the veracity of the rankings. So I compared RedMonk's January rankings against an equal weighting of GitHub active repositories and StackOverflow questions. While not perfectly correlated, since IEEE used only 2013 data and RedMonk uses all-time data, the Pearson correlation coefficient for the top 20 languages is 0.97 (where 1 would be perfectly correlated).

Now confident in their data (which also reinforces RedMonk's rankings), I moved on to calculate correlations, using the full 49 languages supplied by IEEE, across every data source they provided:

  • CareerBuilder
  • Dice
  • GitHub active projects
  • GitHub created projects
  • Google search (# of results)
  • Google trends (search volume)
  • Hacker News
  • IEEE Xplore (IEEE articles mentioning a language)
  • Reddit
  • StackOverflow questions
  • StackOverflow views
  • Topsy (Twitter search results)

Here’s a spreadsheet showing the numbers, where higher correlations are in red and very weak correlations are in blue:

The strongest correlation on the chart, interestingly, is the 0.92 found between Twitter conversation and Google trends. Apparently, people talking about programming languages in real time also tend to search for what they're talking about.

The other very strong correlations (above 0.85) are:

  • Google: trends and search. Nothing surprising here.
  • Job sites: Dice and CareerBuilder. Nothing surprising.
  • Reddit and Google trends. Discussion about current topics seems to correlate with interest in finding more information about those topics.
  • Twitter and Google search. The 0.88 here is slightly below the 0.92 between Twitter and Google trends. Most interesting about this pair is that it shows a connection between conversation and amount of content (# of results), rather than just people searching for what could be a small amount of material.
  • Reddit and Twitter. Similar communities seem to participate across a wide variety of online discussion forums.
  • GitHub created and StackOverflow questions. Because it’s a correlation of open-source usage and broader conversation among forward-leaning communities, this is the one we rely upon for the RedMonk language rankings.

Midrange correlations: Hacker News and IEEE Xplore

In the middle (correlations between 0.3–0.7), I was surprised that Hacker News correlated rather weakly with all of the other sources. This implies a degree of independence for this community relative to the behavior of all global developers, and even the subset who participate on StackOverflow. It’s certainly some interesting data to support the saying that HN is for Bay Area developers (and their bleeding-edge “cousins” across the world).

IEEE Xplore, which is oriented around academic research, had similarly weak correlations with everything else (HN included). This supports a general disconnect between academia and both general trends (most other sources) as well as forward-leaning communities like HN.

Both of these seem to make sense based on my prior expectations, since both of these groups are rather unlike the rest.

StackOverflow viewers are the outliers

The weakest correlations were between StackOverflow views and almost everything else. It’s shocking how different the visitors to StackOverflow seem from every other data source. If we actually take a look at the top 20 languages based on StackOverflow views, it bears out the unusual nature that the poor correlations suggested:

  1. Arduino
  2. VHDL
  3. Visual Basic
  4. ASP.NET
  5. Verilog
  6. Shell
  7. HTML
  8. Delphi
  9. Objective-C
  10. SQL
  11. Cobol
  12. Apex Code
  13. ABAP
  14. CoffeeScript
  15. Go
  16. MATLAB
  17. Assembly
  18. C++
  19. C
  20. Scala

Three of the top 5 are hardware-oriented (Arduino, VHDL, Verilog), supporting a strong audience of embedded developers. Outside of StackOverflow views, these languages are nonexistent in the top 10, with only two exceptions: Arduino is #7 on Reddit and VHDL is #8 in IEEE Xplore. That paints a very clear contrast between this group and everyone else, and perhaps a unique source of data about trends in embedded development.

Enterprise stalwarts are also commonplace, such as Visual Basic, Cobol, Apex (Salesforce's language), and ABAP (SAP's language). Other than this:

  • Visual Basic is only in the top 10 in Google
  • Cobol and Apex are only in the top 20 on career sites (in the high teens)
  • ABAP is only in the top 20 on career sites and Google search (in the high teens)

Again, StackOverflow views may be a unique source of information on an otherwise hard-to-find community.

Viewing correlations as a network graph reveals communities

However, a table only lets us easily look at two-way correlations. If we want to see communities, it's easier to examine this as a graph, with the connecting edges being the correlations between pairs of data sources. Here's a visualization of that, showing only strong correlations (above 0.7), with highly connected nodes in red and poorly connected nodes increasingly blue.


Graph layout weighted by correlation across data sources, using a force-directed layout in Gephi. I used a 0.7 minimum threshold for the Pearson correlation coefficient.
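The pruning step behind that graph is simple to sketch: drop every edge at or below the threshold, and any node left without edges vanishes entirely. The correlation values below are illustrative, not the full matrix:

```python
# Keep only edges with r > 0.7, then count surviving connections per source.
THRESHOLD = 0.7
correlations = {
    ("Twitter", "Google trends"): 0.92,
    ("Twitter", "Google search"): 0.88,
    ("Google trends", "Google search"): 0.89,
    ("Reddit", "Twitter"): 0.86,
    ("Hacker News", "Reddit"): 0.55,     # below threshold: edge dropped
    ("IEEE Xplore", "Hacker News"): 0.4,  # below threshold: node vanishes
}

edges = {pair: r for pair, r in correlations.items() if r > THRESHOLD}

degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

# "IEEE Xplore" has no surviving edges, so it never appears in `degree`.
print(sorted(degree.items(), key=lambda kv: -kv[1]))
```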

It’s instantly apparent that some data sources serve as centerpieces that can broadly represent a swathe of communities while others are weakly connected and could provide more unique insight. In particular, note that IEEE Xplore and SO views are missing altogether because they had no correlations above 0.7 to anything else.

The most central and strongly connected node, perhaps surprisingly, is Twitter. Google is close by, however, which supports the validity of the oft-maligned TIOBE rankings as a representation of many communities. Based on the strength and number of connections shown above, though, it could be a better choice on their part to use Google trends over search results.

On the opposite side are the two sources that didn't appear at all (StackOverflow views and IEEE Xplore), which go nearly unrepresented unless you explicitly add them in. Largely disconnected sources would also be well worth considering for additional diversity; on this graph, they're weakly connected (more blue) and less strongly correlated with their connections (thinner edges), such as GitHub active projects and Hacker News.


Based on that, I thought I'd recalculate a new set of rankings that accounted for these connections. I decided to include Topsy (weight 100), StackOverflow views (weight 100), Hacker News (weight 50), and IEEE Xplore (weight 50) to represent the diversity across these communities. These communities are vastly different sizes, so this truly reflects source diversity rather than population-level interest. But it's interesting to see interest scaled by community rather than by pure population:

  1. C
  2. C++
  3. Python
  4. Java
  5. SQL
  6. Arduino
  7. C#
  8. Go
  9. Visual Basic
  10. Ruby
  11. Assembly
  12. R
  13. Shell
  14. HTML
  15. MATLAB
  16. Objective-C
  17. PHP
  18. Scala
  19. Perl
  20. JavaScript

In comparison to the RedMonk top 20, the changes are about what you'd expect based on the earlier results. Languages more popular in niche communities tend to move up (e.g. Arduino, Go) because of how I weighted the outlier sources, while languages that aren't popular across all those audience types (e.g. JavaScript, PHP) shifted downwards.
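The reweighting itself can be sketched as a weighted sum of normalized per-source scores. The scores below are invented; only the weights come from the scheme described above:

```python
# Weights from the diversity-oriented scheme described in the text.
weights = {"Topsy": 100, "StackOverflow views": 100,
           "Hacker News": 50, "IEEE Xplore": 50}

# Hypothetical normalized scores (0-1) for three languages.
scores = {
    "C":       {"Topsy": 0.9, "StackOverflow views": 0.8,
                "Hacker News": 0.7, "IEEE Xplore": 0.9},
    "Arduino": {"Topsy": 0.2, "StackOverflow views": 1.0,
                "Hacker News": 0.1, "IEEE Xplore": 0.1},
    "PHP":     {"Topsy": 0.6, "StackOverflow views": 0.1,
                "Hacker News": 0.3, "IEEE Xplore": 0.2},
}

def weighted(lang):
    # Combined score: sum of weight * normalized score across sources.
    return sum(weights[s] * scores[lang][s] for s in weights)

ranking = sorted(scores, key=weighted, reverse=True)
print(ranking)  # → ['C', 'Arduino', 'PHP']
```

Note how Arduino's strength in a single outlier source (StackOverflow views) is enough to lift it over PHP under this weighting.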

This work revealed a widespread network of communities spread across a wide variety of forums, including code, discussion, jobs, and searches. Some of the most interesting results were the exceptions from the norm — in particular, StackOverflow views could provide a unique window into embedded and enterprise audiences, while Hacker News and IEEE Xplore are other sources with quite disparate data relative to the majority of the group. Finally, the connection between real-time conversation on Twitter and existing content on Google was a newly interesting correlation between discussion and resources that actually exist, rather than purely discussion and interest.

Disclosure: SAP and are clients. Microsoft has been a client.


Categories: adoption, community, programming-languages.

Microservices and the migrating Unix philosophy

A core Unix tenet pioneered by Ken Thompson was its philosophy of one tool, one job. As described by Wikipedia:

The Unix philosophy emphasizes building short, simple, clear, modular, and extendable code that can be easily maintained and repurposed by developers other than its creators. The philosophy is based on composable (as opposed to contextual) design.

This philosophy was most clearly visible through the existence of a substantial set of small tools designed to accept input and output such that they could be chained together in a series using pipes (|) a la `cat file | sed | tail`. Other instantiations include the “everything is a file” mentality and the near-universal use of plain text as a communication format. Both of these encouraged the sharing of a common toolset for accessing and processing data of all types, regardless of its source.
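That pipeline style is easy to mimic in any language. Here's a rough Python analogue of `cat file | sed | tail`, with each "tool" doing one job and passing lines along (the names and behavior are simplified for illustration):

```python
# One-job "tools" chained like a Unix pipeline.
def cat(lines):            # emit lines unchanged
    yield from lines

def sed(lines, old, new):  # transform each line
    for line in lines:
        yield line.replace(old, new)

def tail(lines, n):        # keep only the last n lines
    return list(lines)[-n:]

text = ["alpha", "beta", "gamma"]
print(tail(sed(cat(text), "a", "A"), 2))  # → ['betA', 'gAmmA']
```

The composability comes from the shared interface (an iterable of lines, standing in for plain text on stdin/stdout), exactly as the Unix philosophy prescribes.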

Following up on Steve’s writeup on microservices last week, I figured I’d better get this post out the door. I’ve had the ideas on the back burner for a year or so, but the burgeoning interest in microservices means now is the right time to tell this story.

The “new” composable world, same as the old one

Composability has made a resurgence in the past couple of years, inspired in part by the now-infamous 2011 post by Steve Yegge. It described Amazon’s move to a service-oriented organization where all data between teams must be transferred via API rather than emailing around Excel spreadsheets.

We’ve seen this pervade through to the design of AWS itself, and the ability of Amazon to keep up an astonishing pace of feature releases in AWS. More recently, the PaaS community, incited by the cries of Warner’s Jonathan Murray for a composable enterprise, has begun talking specifically about the virtues of composability (although it’s enabled it implicitly for much longer).

Another area where composability’s had a huge impact is IT monitoring. The ELK stack of Elasticsearch, Logstash, and Kibana as well as the #monitoringsucks/#monitoringlove movements (see Jason Dixon’s Monitorama conferences) serve to define the new composable monitoring infrastructure. It exists as a refutation of the old-style monolithic approach best embodied by the Big Four of HP, BMC, IBM, and CA. This movement further refutes the last revolution led by the still-dominant open-source alternative, Nagios, and the kingmakers-style bottom-up approach that enabled Splunk’s success.

Composability embodies the Unix philosophy that I began this piece by describing, and we’re now seeing it move up the software stack from its advent in Unix 40+ years ago at Bell Labs.

Granularity collapses when unneeded

The key point I want to make in this piece, however, is that composability does not and cannot exist everywhere simultaneously. It just won't scale. Although the flexibility that a composable infrastructure provides is vital during times of rapid innovation, so that pieces can be mixed and matched as desired, it also sticks users with a heavy burden when it's unneeded.

As developer momentum and interest continue to concentrate up the stack toward cloud, containers like Docker, and PaaS, and away from concerns about the underlying system, that lower-level system tends to re-congeal into a monolith rather than remaining composable.

We’ve seen this happen in a number of instances. One prime example is in the Linux base system today, where systemd is gradually taking over an increasing level of responsibility across jobs formerly owned by init systems, device managers, cron daemons, and loggers. Only the first of those has seen significant reinvention in the last decade or so, with alternatives to the old-school SysV init system cropping up including Upstart, OpenRC, and systemd. But with systemd’s gradual integration both horizontally and vertically into the kernel and the GNOME desktop environment, it’s quickly becoming mandatory if you want one option that works everywhere.

Even beyond that, the advent of container technologies and distributions like CoreOS mean that users care increasingly less about the underlying system and just want it served to them as a working blob they can ignore. This is a similar driver to what Red Hat’s doing with CentOS, by attempting to provide a stable underlying firmament that you treat essentially as a large blob to build applications upon.

Another example is in X.Org, the primary Unix window system. Ten years ago, it was undergoing a period of rapid innovation driven in part by its recent fork from XFree86. The entire monolithic codebase was modularized into hundreds of separate applications, libraries, drivers, and the server. But now that community has realized it’s difficult to maintain so many stable APIs and the cost is no longer worth the benefit, so it’s considering bringing drivers and server back together into a mini-monolith of sorts.

Think of it as an accordion. Parts of it expand when there's rapid innovation underway, often driven by external forces like the advent of the cloud, and then contract again when consensus is generally reached on the right solution and the market's settled down. Then another part of the accordion expands, and so on, ad infinitum.

Disclosure: Amazon, IBM, Pivotal, Splunk, and Red Hat are clients. Microsoft, HP, and CA have been clients. Docker, Elasticsearch, CoreOS, Nagios, BMC, and Warner Music Group are not clients.


Categories: api, services.

Oracle v Google could drive a new era of open-source APIs

The question of whether APIs are copyrightable has a huge bearing on implementers and users of those APIs. Today, the US Court of Appeals for the Federal Circuit released its decision in the Oracle v Google case around (1) whether a set of 37 Java API packages were copyrightable and (2) whether Google infringed those copyrights in Android and its Dalvik VM. It said that APIs are definitely copyrightable, but it called for a new trial on whether Google’s actions qualified as fair use. Also on the bright side, copyright (unlike patent) isn’t solely owned by the Federal Circuit, so this isn’t a nationally binding decision. That allows for other circuit courts to disagree, leaving more room for a potential appeal to the Supreme Court.

One specific point the court made was (p. 39):

It is undisputed that—other than perhaps as to the three core packages—Google did not need to copy the structure, sequence, and organization of the Java API packages to write programs in the Java language.

And it elsewhere went on to discuss Google's use of those APIs as a clear technique to appeal to developers already familiar with them, rather than something that was necessary to write programs in the Java language.

But most importantly, the appeals court sent the question of whether Google’s actions constituted fair use (i.e., they infringed but it wasn’t enough to count) back to the lower court for a new trial. The court said (p. 61):

We find this [function and interoperability as part of a fair use defense] particularly true with respect to those core packages which it seems may be necessary for anyone to copy if they are to write programs in the Java language. And, it may be that others of the packages were similarly essential components of any Java language-based program.

What are the implications?

APIs may be copyrightable as a consequence of this decision, but that leaves a lot of gray area in terms of how they can be used, while opening up a whole new arena of how their licensing affects API re-implementers as well as consumers.

Before you make any decisions based on this news and what you think it means to you, I’d advise asking a lawyer, because I’m not one. I will, however, run through a series of questions raised by this decision that need answers.

Sam Ramji also pointed to the EFF’s filing for this case as a useful summary of some of the impacts of copyrightable APIs in areas such as Unix, network protocols, and the cloud.

For API consumption

It's less concerning for those who are only consuming an API, because in many cases these users are already subject to terms of service that may satisfy the API provider's desires. One implication of this ruling, however, is that if APIs are copyrightable, then the license applying to the rest of the software also applies to the API as an integral component of that software.

Since I’m not a lawyer, I can’t speak specifically to any legal precedent for code that uses an API, as far as whether it’s considered a derivative work (subject to copyright) [LWN writeup], a transformational work (fair use exception to copyright), a separate work with no copyright relationship at all, etc. However, lawyers such as Larry Rosen seem to lean toward the idea that implementing programs against a library API is not creating a derivative work.

If all software using APIs were derivative works of the providers, it would create very interesting situations of asymmetric power for Microsoft Windows and other OS providers as well as for web APIs. The provider would be able to exert license control over every consumer of the API, and could go so far as to individually curate (i.e., pick and choose) users of its APIs. This would dramatically shift the balance of power toward the API provider and away from individual developers, particularly in the non-web case, since web APIs already commonly impose ToS, API keys, and additional restrictions.

For API compatibility

Even if the new trial holds up fair use as a defense, the biggest problem for the rest of us is that fair use applies on an individual level. In other words, every single time an API is copied, it’s sufficiently different from every other time that the owners of the original API could sue for infringement and you’d have to hire lawyers for a fair-use defense.

This judgment greatly changes the risk assessment for whether to support any given API. It may not end up going to trial, but even the legal effort and expense alone will create significant difficulty for many startups and open-source projects. This will have a serious chilling effect on the creation and growth of new implementations because of greatly increased switching costs for whatever API users are currently doing. Low barriers to entry may encourage more users to move through the funnel, but the flip side of the coin is that increased barriers to exit will similarly discourage developers and other customers who see it as a form of lock-in. People are much more willing to invest time and money when they perceive that their effort is portable than when it’s tied to a single proprietary platform.

This is going to create a lot of difficulty for everyone reimplementing S3 APIs, for example, such as OpenStack and others, with the exception of Eucalyptus because of its Amazon partnership.

A new opportunity to go beyond open APIs to open-source APIs

The “open API” movement has thus far been rather amorphous in terms of structure and definition. But generally, open APIs are perceived to use open protocols and formats including REST, JSON, and (less popular these days) XML. In other words, it’s open in terms of the standards used, and not necessarily open in terms of anyone’s ability to register for it.

This ruling creates an opportunity for API providers to put their APIs under open-source licenses, be they permissive or copyleft. This can apply, particularly with permissive licenses, even if the non-API components of the source code are proprietary. That would provide users with the comfort of knowing they have the freedom to leave, which lowers resistance to initially committing to a technology, precisely because the API is open source and could therefore be reimplemented elsewhere. Unfortunately, it also means that the idea of "open APIs" will become even cloudier, because it adds open source into an already confusing mixture of usage.

If nothing changes legally, we may end up with the equivalent of FRAND patent terms in copyright licenses as well, along with additional usage of standards bodies to support them. Unfortunately FRAND does not always mean open-source friendly, so we may see some of the same battles all over again with copyrights as we saw with patents.

Disclosures: I am not a lawyer. Google and Oracle are not clients, but Sun was. Microsoft and Eucalyptus have been clients. The OpenStack Foundation is not, but a variety of OpenStack vendors are.


Categories: api, open-source, operating-systems, windows.

The interface from Dev to Ops isn’t going away; it’s rotating

When I talk about DevOps today, it’s about three main things:

  • Bringing agile development through to production (and back into the business)
  • Infrastructure as code
  • The revolution in IT monitoring (#monitoringlove)

However, what doesn't seem to be appreciated is how this shift, particularly the first and most important aspect of it, changes the roles of developers and operations teams. And that's regardless of whether the latter group is called sysadmins, SREs, or DevOps engineers. After three recent discussions, and inspired partially by Jeff Sussna's writing on empathy, I figured it was worth sharing this with the rest of the world.

DevOps isn't just about developers taking on more work and ops learning Puppet, Chef, etc. That may sound obvious, but it's the way many organizations seem to see it. It's not just about making developers responsible for code in production so they're the ones getting paged at 3am. Despite the DevOps community being driven largely by the ops side of the house rather than by developers coming into ops, it can't limit itself to one-way empathy and one-way additional effort. It has to go the other way too. Developers do need to be responsible for their code in production, but ops also needs to own up to the importance of maintaining stable infrastructure in dev and test environments. I've seen ridicule for Facebook, coming from the CMO for Marketing Cloud, over its new motto of "Move fast with stable infra" replacing "Move fast and break things." But that motto is perfectly applicable to this transition, despite being terrible marketing.

Developers today often own both their application code and their environment in dev (and maybe test), while ops owns applications and infrastructure in production, as the top of this image illustrates:



On the bottom of this image, conversely, the separation between dev and ops has rotated. This is key. It also underlies shifts like Red Hat’s incorporation of CentOS and its move to position both CentOS and RHEL as stable environments to innovate on top of in the context of cloud.

And yet, in every case, what’s missing is the understanding and communication that developers must own application-layer code wherever it lives, while ops must own the infrastructure wherever that is. The external system environment is irrelevant — dev/test/production must all have environmental parity from the application’s perspective. Also irrelevant to this point is whether your ops are traditional or are essentially developers building infrastructure tools. In many cases this rotation may involve using tools like Vagrant or Docker to provision identically from the ground up, but the key transition here is a cultural one of bidirectional empathy and bidirectional contribution.

Disclosure: Chef, Red Hat, and are clients. Puppet Labs has been. DigitalOcean, Facebook, Hashicorp and Docker are not.


Categories: devops, marketing.

GitHub language trends and the fragmenting landscape

A while ago, I wanted to get a little quick feedback on some data I was playing with, but the day was almost over and I wasn’t done working on it yet. I decided to tweet my rough draft of a graph of GitHub language trends anyway, followed later by a slight improvement.


Trends over time, smoothed to make it a little easier to follow

Much to my surprise, that graph was retweeted more than 2,000 times and reached well over 1 million people. My colleagues have both examined this data since I posted the graph — James took a stab at pulling out a few key points, particularly GitHub’s start around Rails and its growth into the mainstream, and Steve’s also taken a look at visualizing this data differently.

That reach was fantastic, but the best part was the questions I got; all the conversations gave me an opportunity to decide which points would be most interesting to readers of this post. The initial plot was a spaghetti graph, so I fixed it up and decided to do a more in-depth analysis.


Before we can get into useful results and interpretation, there are a few artifacts and potential pitfalls to be aware of:

  • GitHub is a specific community that’s grown very quickly since it launched [writeup]. It was not initially reflective of open source as a whole but rather centered around the Ruby on Rails community;
  • In 2009, the GitPAN project imported all of CPAN (Perl’s module ecosystem) into GitHub, which explains the one-time peak;
  • Language detection is based on lines of code, so a repository with a large amount of JavaScript template libraries (e.g. jQuery) copied into it will be detected as JavaScript rather than the language where most of the work is being done; and
  • I’m showing percentages, not absolute values. A downward slope does not mean fewer repositories are being created. It does mean, however, that other languages are gaining repositories faster.

The big reveal

The first set of graphs shows new, non-fork repositories created on GitHub by primary language and year. This dataset includes all languages that were in the top 10 during any of the years 2008–2013, but languages used for text-editor configuration were ignored (VimL and Emacs Lisp). I’m showing them as a grid of equally scaled graphs to make comparisons easier across any set of languages, and I’m using percentages to indicate relative share of GitHub.

Data comes from date- and language-restricted searches using the GitHub search API.

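Those searches go against the public GitHub search API. The helper below is my own sketch of the kind of date- and language-restricted query involved; the exact qualifiers used for the post are an assumption:

```python
# Build a GitHub repository-search query restricted by language and year.
# Fetching the resulting URL returns JSON whose `total_count` field is the
# repository count for that language-year.
def repo_search_query(language, year):
    return (f"language:{language} "
            f"created:{year}-01-01..{year}-12-31 fork:false")

url = ("https://api.github.com/search/repositories?q="
       + repo_search_query("ruby", 2008).replace(" ", "+"))
print(url)
```

One such request per language per year, divided by the year's total, yields the percentage shares plotted above.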

  • GitHub hits the mainstream: James quickly nailed the key point: GitHub has gone mainstream over the past 5 years. This is best shown by the decline of Ruby as it reached beyond the Rails community and the simultaneous growth of a broad set of both old and newer languages including Java, PHP, and Python as GitHub reached a broader developer base. The apparent rise and drop of languages like PHP, Python, and C could indicate that these communities migrated toward GitHub earlier than others. This would result in an initially larger share that lowered as more developers from e.g. Java, C++, C#, Obj-C, and Shell joined.
  • The rise of JavaScript: Another trend that instantly stands out is the growth of JavaScript. Although it’s tempting to attribute that to the rise of Node.js [2010 writeup], reality is far more ambiguous. Node certainly accounts for a portion of the increase, but equally important to remember is (1) the popularity of frameworks that generate large quantities of JavaScript code for new projects and (2) the JavaScript development philosophy that encourages bundling of dependencies in the same repo as the primary codebase. Both of these encourage large amounts of essentially unmodified JavaScript to be added to webapp repositories, which increases the likelihood that repositories, especially those involving small projects in other languages, get misclassified as JavaScript.
  • Windows and iOS development nearly invisible: Both C# and Objective-C are, unsurprisingly, almost invisible, because both belong to ecosystems that either don’t encourage or actively discourage open-source code. These are the two languages in this chart most likely to be unreflective both of current usage outside GitHub and of future usage, again due to the open-source imbalance in those communities.

What about pushes rather than creation?

What’s really interesting is that if you do the same query by when the last push of code to the repo occurred rather than its creation, the graphs look nearly identical (not shown). The average number of pushes to repositories is independent of both time and language but is correlated with when repositories were created. In only two cases do the percentages of created and pushed repos differ by more than 2 points: Perl in 2009 (+4.1% pushed) and Ruby in 2008 (–3.5% pushed), both of which are likely artifacts due to the caveats described earlier.
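
That 2-point comparison could be sketched like this (the function name is mine, and the shares passed in below are just the two examples from the text):

```python
def share_outliers(created: dict, pushed: dict, threshold: float = 2.0) -> dict:
    """Return {language: pushed_share - created_share} for every language
    whose created-vs-pushed share differs by more than `threshold` points."""
    return {lang: pushed[lang] - created[lang]
            for lang in created
            if abs(pushed[lang] - created[lang]) > threshold}
```

Given per-language shares for a year, only the languages beyond the 2-point threshold (such as Perl in 2009) would be returned.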

This result is particularly striking because there’s no difference over time despite a broader audience joining GitHub, and there’s also no difference across all of these language communities. The vast majority of repositories (>98%) are modified only in the year they are created, and they’re never touched again. This is consistent with my previous research exploring the size of open-source projects, where we saw that 87% of repositories have ≤5 contributors.

Are GitHub issues a better measure of interest?

One potential problem with looking at repositories is that it’s not a reflection of usage, and only a fairly indirect measurement of interest in a given codebase. It instead measures developers creating new code. To get a closer look at usage, some possibilities are forks, stars, or issues. GitHub’s search API makes it most convenient to focus on issues, so that’s what I measured for this post. My expectation going in was that issues would be much more biased by extremely popular projects with large numbers of users, but let’s take a look:
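
The corresponding issue search can be sketched the same way as the repository search (the function name is mine; the query would go to the `/search/issues` endpoint, again recording `total_count`):

```python
def build_issue_query(language: str, year: int) -> str:
    """Query string for issues (excluding pull requests) filed in `year`
    against repos whose primary language is `language`."""
    return (f"language:{language} type:issue "
            f"created:{year}-01-01..{year}-12-31")
```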

Issues filed within repositories with that primary language.

This gave me a fairly similar set of graphs to the new-repository data. It’s critical to note that although these are new issues, they’re filed against both new and preexisting repos so the trends are not directly comparable in that sense. Rather, they’re comparable in terms of thinking about different measurements of developer interest in a given language during the same timeframe. The peaks in Ruby, Python, and C++ early on are all due to particularly popular projects that dominated GitHub in its earlier days, when it was a far smaller collection of projects. Other than that, let’s take a look through the real trends.

  • Nearly all of these trends are consistent with new repos. With the clear exception of Ruby and less obvious example of JavaScript, the trends above are largely consistent with those in the previous set of graphs. I’ll focus mainly on the exceptions in my other points.
  • JavaScript’s increase appears asymptotic rather than linear. In other words, it continues to increase but it’s decelerating, and it appears to be moving toward a static share around 25% of new issues. This may be the case with new repos as well, but it’s less obvious there than here.
  • Ruby’s seen a steep decline since 2009. It peaked early on with Rails-related projects, but as GitHub grew mainstream, Ruby’s share of issues dropped back down. But again, this trend seems to be gradually flattening out around 10% of total issues.
  • Java and PHP have both grown and stabilized. In both cases, they’ve reached around 10% of issue share and remained largely steady since then, although Java may continue to see slow growth here.
  • Python’s issue count has consistently shrunk since 2009. Since dropping to 15% after an initial spike in 2008, it’s slowly come down to just above 10%. Given the past trend, which may be flattening out, it’s unclear whether it will continue to shrink.

The developer-centric (rather than code-centric) perspective

What if we take a different tack and focus on the primary language of new users joining GitHub? This creates a wildly different set of trends that’s reflective of individual users, rather than being weighted toward activist users who create lots of repositories and issues.
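
The “majority of their repositories” rule might look something like this (a sketch; the function name and the strict-majority interpretation are my assumptions):

```python
from collections import Counter
from typing import Optional

def primary_language(repo_languages: list) -> Optional[str]:
    """Return the language used by a strict majority of a user's repos,
    or None if no language has a majority (or the user has no repos)."""
    if not repo_languages:
        return None
    lang, count = Counter(repo_languages).most_common(1)[0]
    return lang if count * 2 > len(repo_languages) else None
```

Users with no majority language (or no repos at all) would simply be left unclassified, which matters for the fragmentation discussion later in this post.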

Users joining in a certain year with a majority of their repositories in that language.

The points I find most interesting about these graphs are:

  • There are no clearly artifactual spikes. All of the trends here are fairly smooth, very much unlike both the repos and issues. This is encouraging because it suggests the results here are more likely to be reliable than spurious.
  • Language rank remains quite similar to the other two datasets. Every dataset is ordered by the number of new repos created in each language in 2013, to make comparisons simpler across datasets. If you look at activity in 2013 for issues and users, you can see that their values are generally ranked in the correct order with a few minor exceptions. One is that Java and Ruby should clearly be reversed, but that’s about all that’s obviously out of order.
  • Almost every language shows a long-term downhill trend. With the exception of Java and (recently) CSS, all of these languages have been decreasing. This was a bit of a puzzler and made me wonder more about the fragmentation of languages over time, which I’ll explore later in this post as well as in future posts. My initial guess is that users of languages below the top 12 are growing in share to counterbalance the decreases here. It’s also possible that GitHub may leave some users unclassified, which would tend to lower everything else’s proportion over time.
  • I’m therefore not going to focus on linear decreases. I will, however, examine nonlinear decreases, or anything that’s otherwise an exception such as increases.
  • Ruby’s downward slide shows an interesting sort of exponential decay. This is actually “slower” than a linear decrease as it curves upwards, so it indicates that relative to everything else moving linearly downward, Ruby held onto its share better.
  • Java was the only top language that showed long-term increases during this time. Violating all expectations and trends, new Java users on GitHub even grew as a percentage of overall new users, while everything else went downhill. This further supports the assertion that GitHub is reaching the enterprise.

A consensus approach accounts for outliers

When I aggregated all three datasets to look at how trends correlated across them, the picture became much clearer:

New repositories, users, and issues in a given language according to the GitHub search API.

Artifacts become obvious as spikes in only one of the three datasets, as happens for a number of languages in the 2009–2010 time frame. It’s increasingly obvious that only 5 languages have historically mattered on GitHub on the basis of overall share: JavaScript, Ruby, Java, PHP, and Python. New contender CSS is on the way up, while C and C++ hold honorable mentions. Everything else is, on a volume basis, irrelevant today, even if it’s showing fantastic growth like Go and will likely be relevant in these rankings within the next year or two.
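
Flagging such single-dataset spikes could be sketched as follows (the function name, the median baseline, and the 2x threshold are all my assumptions, not the method behind the graphs):

```python
import statistics

def artifact_years(repos: dict, issues: dict, users: dict, factor: float = 2.0) -> list:
    """Flag (year, dataset) pairs where exactly one of the three datasets'
    shares exceeds `factor` times the median of the three -- i.e. a spike
    that appears in only one dataset."""
    flagged = []
    for year in repos:
        values = {"repos": repos[year], "issues": issues[year], "users": users[year]}
        med = statistics.median(values.values())
        spikes = [name for name, v in values.items() if med > 0 and v > factor * med]
        if len(spikes) == 1:
            flagged.append((year, spikes[0]))
    return flagged
```

A spike present in all three datasets would not be flagged, which matches the consensus logic: agreement across datasets suggests a real trend rather than an artifact.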

The fragmenting landscape

In looking at the decline in the past couple of years among many of the top languages, I started wondering whether it was nearly all going to JavaScript and Java or whether there might be more hidden in there. After all, there’s a whole lot more than 12 languages on GitHub. So I next looked at total repository creation and subtracted only the languages shown above, to look at the long tail.
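
That subtraction is straightforward; a sketch (the function name and data shapes are mine):

```python
def long_tail(total_by_year: dict, top_language_counts: dict) -> dict:
    """Repositories per year remaining after subtracting the top languages.

    `total_by_year` maps year -> total new repos; `top_language_counts`
    maps language -> {year: count} for each of the top languages."""
    return {year: total - sum(per_lang.get(year, 0)
                              for per_lang in top_language_counts.values())
            for year, total in total_by_year.items()}
```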


Totals after subtracting the top 12 languages.

Although you can see an initial rush by the small but diverse community of early adopters creating lots of repositories in less-popular languages, it dropped off dramatically as GitHub exploded in popularity. Then the trend begins a more gradual increase as a wide variety of smaller language communities migrate onto GitHub. New issues show a similar but slower increase starting in 2009, when GitHub added issues. While new users increase the fastest, that likely reflects a combination of users in less-popular languages and “lurker” users with no repositories at all, and therefore no primary language.

The programming landscape today continues to fragment, and this GitHub data supports that trend over time as well as an increasing overlap with the mainstream, not only early adopters.

Update (2014/05/05): Here’s raw data from yesterday in Google Docs. 

Update (2014/05/08): Simplify graphs as per advice from Jennifer Bryan.

Disclosure: GitHub has been a client.


Categories: community, distributed-development, open-source, programming-languages.

Upcoming speaking engagements, come find me!

I’ve got a number of speaking engagements in the next couple of months, so if you want to hear what I’ve got to say or just grab a beer, find me here:

  • The Minneapolis DevOps meetup tomorrow, on the cloud vs DevOps deathmatch
  • DevNation in San Francisco next week, alongside the Red Hat Summit: “Linux distros failed us — now what? Coping with cloud, DevOps, and more”
  • The Open Business Conference in San Fran on May 5: “Open DevOps”
  • WANdisco’s Subversion & Git Live in NYC on May 6: “Open source in the enterprise”
  • GlueCon in mid-May in Denver: “The parallel universes of DevOps and cloud developers”
  • Grand opening of CenturyLink’s newest datacenter in early June in Shakopee MN (near Minneapolis) on modern IT infrastructure

I’m also going to spend some time in Atlanta at the OpenStack Summit, in Portland at Monitorama, and in San Antonio and Austin. Let me know if you’ll be in any of those places and want to meet up!

Update (4/29/14): Added specifics about CenturyLink’s DC opening.

Disclosure: Red Hat, WANdisco, and CenturyLink are clients.


Categories: analyst, cloud, devops, linux, red-hat.

IBM’s piecemeal march toward open source

I attended IBM’s Pulse conference earlier this month, which has traditionally been focused on the former Tivoli division. It’s a strange mix of software for sysadmins and software that can do things like manage city water supplies (the GreenMonk, Tom, is all about the latter). Last April, IBM renamed it Cloud and Smarter Infrastructure; even the short form, C&SI, doesn’t exactly flow off the tongue.

While I don’t write about every conference as a rule, this one was interesting for a variety of reasons. One is the increasing focus on cloud, to the point that they called Pulse “The Premier Cloud Conference” this year, including a gigantic banner across the keynote stage. Somehow I doubt that went over terribly well with today’s customer and user base of traditional IT and asset-management people, as my colleague James also noted while at the show. IBM has a very interesting and solid developer story here, but much like VMware did with its developer story in the pre-Pivotal days at a VMworld full of IT admins, IBM’s presenting to the wrong audience. I found the story they told, including live coding on stage at an IBM business conference, really compelling for developers. The problem is that it’s an aspirational message about what IBM wants its audience and conference to become, rather than one that fits the conference as it stands today.

The second reason Pulse was interesting is what IBM’s actually doing with the cloud now. Since the days of SmartCloud, which was widely viewed as a subpar cloud offering, IBM has gained a lot more credibility through its involvement with OpenStack, its purchase of SoftLayer, and significant investment in the Cloud Foundry PaaS as well.

IBM’s always been about selling business value, not technologies, so its move up the stack from IaaS to PaaS is a natural extension of that focus. This shift is consistent with other moves IBM has recently made, such as progressively selling off more and more of its commodity hardware businesses — first ThinkPad and more recently xSeries. IBM sees much greater success when it sells business results, not commodities, and PaaS is a better fit than IaaS for that reason.

Cloud Foundry has gained a great deal of traction in the past couple of years, so IBM’s choice of it as a platform makes sense. As that happened, we’ve seen Cloud Foundry gain broader governance than its previous leadership solely by Pivotal and, earlier, VMware, again in line with what I’d expect from IBM. Choosing an open-source PaaS is in line with IBM’s selection of OpenStack and, more than a decade ago, Linux, so I see this as a well-considered move that’s consistent with IBM’s philosophy and behavior.

The bigger picture

I think we’ll continue to see IBM’s software divisions reorient around the “IBM as a Service” concept that it pushed at Pulse. In the context of BlueMix, that will mean an ongoing addition of buildpacks based on its existing software portfolio as it’s able to repackage them as services.

I further see this as a revival of IBM’s commitment to key open-source projects as far back as Linux and Eclipse nearly 15 years ago. The extent to which IBM has bought in to OpenStack and Cloud Foundry, and to a lesser extent Chef, is significant: this is no fly-by-night patch series but a major effort headlined as critical to IBM’s future in front of its own customer base. There’s an excellent opportunity for IBM to continue extending this commitment to open source, as we’ve seen to an extent with Hadoop, Worklight (via PhoneGap), and across other areas like social, WebSphere, and the rest of information management. If it cares about developer traction, and it appears to, open source is a clear olive branch and a route to success.

Disclosure: IBM, VMware, Pivotal, the Eclipse Foundation, and Chef are clients, as are numerous OpenStack and Hadoop vendors. The OpenStack Foundation is not.


Categories: big-data, cloud, devops, ibm, open-source, Uncategorized.

Go: the emerging language of cloud infrastructure

Over the past year in particular, an increasing number of open-source projects in Go have emerged or gained significant adoption. Some of them:

  • Docker
  • Packer
  • Serf
  • InfluxDB
  • Cloud Foundry’s gorouter and CLI
  • CoreOS’s etcd and fleet
  • Vitess, YouTube’s tooling for MySQL scaling
  • Canonical’s Juju (rewritten in Go)
  • Mozilla’s Heka
  • A Go interface to OpenStack Swift
  • Heroku’s hk CLI
  • Apcera’s NATS and gnatsd

Although this seemingly shows the importance of Go, my background as a scientist makes me wary of being swayed by random anecdotes. It raised the question of whether this was a real trend or just observation bias. To answer it, I went to Ohloh’s huge dataset of more than 600,000 free and open-source software (FOSS) projects. Below, I plotted a number of different ways to look at Go adoption over time:
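
Computing a language’s share from Ohloh-style monthly counts might look like this (a sketch; the function name and data shape are mine, and Ohloh’s actual export format will differ):

```python
def monthly_share(language_counts: dict, total_counts: dict) -> dict:
    """Per-month percentage share for one language, given that language's
    monthly counts (commits, projects, or contributors) and overall totals."""
    return {month: 100.0 * n / total_counts[month]
            for month, n in language_counts.items()
            if total_counts.get(month, 0) > 0}
```

The same function applies to each of the three measures plotted below (commits, projects, and contributors); only the input counts change.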


Data from Ohloh tracks new code added during each month. This plot is from a dataset as of 20131216.

As you can see, Go’s rapidly closing in on 1% of total commits and half a percent of projects and contributors. While the trend is obviously interesting, at first glance numbers well under one percent look inconsequential relative to overall adoption. For context, however, each of the most popular languages on Ohloh (C, C++, Java, JavaScript) constitutes only ~10% of commits and ~5% of projects and contributors. That means Go, a seemingly very minor player, is already used nearly one tenth as much in FOSS as the most popular languages in existence.

One of the aspects I found most interesting about the marquee projects mentioned earlier is how many of them are cloud-centric or otherwise built for dealing with distributed systems or transient environments. According to one of its designers, Rob Pike (the same one who coauthored the famous “The Unix Programming Environment”), Go’s big selling point is concurrency. That makes it particularly gratifying that people writing projects in Go seem to see it the same way.

Cloud infrastructure is famously complex, and building a truly reliable, scale-out architecture takes a great deal of effort because everything needs redundancy and coordination at the software level rather than the hardware level. Thus tools like Netflix’s Simian Army, one component of their increasingly full-featured platform that’s still waiting to be packaged up, have emerged to provide acid tests of cloud software. On the other side, an underappreciated aspect of PaaS (Platform as a Service) is that it improves not just developer productivity but also operator complexity, providing at the platform level benefits similar to those Go offers at the code level. There’s a lot of value in a packaged solution that handles the complexity of concurrency, coordination, and reliability in a gray-box fashion, enabling transparency without requiring manual composition of an entire infrastructure.

Tooling that can ease the complexity for both new entrants and existing users of the cloud will continue to gain prominence at all levels of the stack, whether it’s languages like Go, middle ground like the Simian Army, or higher-level options like PaaS.

Update (2014/03/19): Added Heroku, Apcera

Disclosures: Pivotal, Black Duck, Heroku, and a number of OpenStack vendors are clients. Canonical has been a client. Docker, Hashicorp, InfluxDB, CoreOS, Google, Mozilla, Apcera, and the OpenStack Foundation are not.


Categories: adoption, cloud, devops, go, open-source, packaging, programming-languages.

At Strata, “hardcore” data science is pretty fluffy

Last week, I attended O’Reilly’s data-centric conference, Strata. It’s my fourth Strata and my third on the west coast, so I’m starting to get a pretty good feel for the show’s evolution over time and some of the contrast across coasts as well.

I keep going into the show with high expectations for the “Hardcore Data Science” track (“Deep Data” in 2012 [writeup]), which is framed essentially as continuing education for professional data scientists. Unfortunately, both years I’ve attended, it fell tragically short of that goal of educating data scientists. In 2012, I sat through the whole day and heard 2–3 talks where I learned something new, but this time I was so disappointed that I left around 11am in favor of the data-driven business track. In talking to other attendees, general reception was that the Google talk on deep learning was great, but the rest of the day was a disappointment in terms of technical content and learning practical and usable techniques.

I must admit I’m deeply surprised that O’Reilly didn’t get negative feedback last time around that it could have applied to this year’s “hardcore” program, as I consider the company among the top few professional conference organizers, across the widest set of topics.

One of the challenges with Strata is catering to a diverse set of audiences, and O’Reilly’s done an excellent job with the “I want to learn Hadoop / new Big Data tech X” crowd. More recently, they’ve also done very well reaching out to the business-level audience trying to learn about the value of data. However, it seems like the technical core of the conference is gradually being left in the lurch in terms of their opportunity to learn, although there’s always the hallway track and whatever marketing value they get out of their talks and tutorials.

I would suggest that the intensive data-science track at future Strata conferences be made much more technical, that the talks become sufficiently practical that they’re handing out links to GitHub or other real-world, low-level implementations at the end of the talk, and that this shift in topic and intended audience be very clearly communicated to the speakers. Other than that slip-up, good show — I heard good things about the tutorials, the business day, and talks for the mid-level audience throughout the rest of the conference.


Categories: big-data, data-science.