After clearing a series of obstacles – some mundane and irrelevant, others much less so – it’s time to publish our bi-annual RedMonk Programming Language Rankings. As many are aware, these rankings are a continuation of the work Drew Conway and John Myles White did when they first looked at the question late in 2010. From a macro perspective, the process remains the same: we extract language rankings from GitHub and Stack Overflow, and combine them for a ranking that attempts to reflect both code (GitHub) and discussion (Stack Overflow) traction. The idea is not to offer a statistically valid representation of current usage, but rather to correlate language discussion (Stack Overflow) and usage (GitHub) in an effort to extract insights into potential future adoption trends.
In January 2014, we were forced to make a change to the way that GitHub’s rankings were collected because GitHub stopped providing them. This quarter’s run features the first major change in how these rankings are conducted since then. To help understand how this change was made and why it was necessary, here’s a brief explanation of our GitHub ranking process.
The Process to Date
In our early language ranking runs we pulled the data directly from GitHub’s Explore page. GitHub ceased publishing the rankings there, however, and in 2014 we found a new data source using the GitHub Archive public dataset on Google BigQuery.
Our query counted repository languages (excluding forked repos) by aggregating total created events. Though now defunct, our previous query was similar to the one in this Stack Overflow answer.
This query worked from 2014 through our last run in June 2016. However, we again needed to adjust our query due to changes in the GitHub Archive table structure as well as changes in GitHub’s API that impacted GitHub Archive’s language data. These changes provided the opportunity to evaluate our data source.
Our Updated Process
In June 2016, GitHub and Google announced a second public dataset for publicly licensed repos. We initially explored the languages table on this dataset as our new potential source. This data had the benefit of providing multiple languages for a repository based on the number of bytes used per language, which in theory could give a more accurate representation of languages than using a repo’s primary language alone.
However, we found that the results from this data were suboptimal because:
- This data only includes licensed repositories, a much smaller subset than the public repositories of GitHub Archive.
- Furthermore, the process of recognizing licenses is occasionally brittle, even further limiting the available data.
- The definition of languages expanded beyond what we have historically represented and included things like config files and typesetting systems.
- In what was our ultimate deciding factor, the results of this query were significantly less correlated with previous GitHub language data as well as Stack Overflow data.
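To make the correlation criterion in the last point concrete, here is a minimal sketch of one way to compare a candidate ranking against an earlier one. It uses Spearman's rank correlation, which is an assumption on our part for illustration; the post does not say which correlation measure was used. The function name and the `lang_a` through `lang_e` placeholders are hypothetical, and the numbers are toy values rather than real ranking data.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman rank correlation for two rankings of the same items.

    Assumes each ranking maps a language name to a distinct rank
    1..n (no ties). This is an illustrative measure, not necessarily
    the one used for the rankings in this post.
    """
    if set(ranks_x) != set(ranks_y):
        raise ValueError("both rankings must cover the same languages")
    n = len(ranks_x)
    if n < 2:
        raise ValueError("need at least two languages")
    # Sum of squared rank differences across all languages.
    d_squared = sum((ranks_x[k] - ranks_y[k]) ** 2 for k in ranks_x)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy data: placeholder languages and made-up ranks.
previous = {"lang_a": 1, "lang_b": 2, "lang_c": 3, "lang_d": 4, "lang_e": 5}
candidate_1 = {"lang_a": 1, "lang_b": 3, "lang_c": 2, "lang_d": 4, "lang_e": 5}
candidate_2 = {"lang_a": 4, "lang_b": 5, "lang_c": 1, "lang_d": 2, "lang_e": 3}

for name, candidate in [("candidate_1", candidate_1), ("candidate_2", candidate_2)]:
    print(name, round(spearman_rho(previous, candidate), 3))
```

In this toy case the first candidate source tracks the earlier ranking closely while the second does not, which is the kind of signal that would favor the first as a replacement data source.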
We also briefly explored the GH Torrent project. While this was an interesting data set that could be a great resource for curious individuals, its licensing prohibited our use in this instance.
This ultimately led us back to GitHub Archive. Though we could not access the same language data that we had previously, we were able to query language by pull request. Our query resembles the one GitHub used to assemble the 2016 State of the Octoverse.
We endeavored to make the new query as comparable as possible to the previous process.
- Language is based on the base repository language. While this continues to have the caveats outlined below, it does have the benefit of cohesion with our previous methodology.
- We exclude forked repos.
- We use the aggregated history to determine ranking (though, because of the table structure changes, this can no longer be accomplished via a single query).
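The three rules above can be sketched in a few lines of Python. This is not the BigQuery query used for the rankings; it is a rough illustration that assumes pull request events have already been exported as dictionaries. The field names (`base_repo`, `language`, `fork`), the helper names, and the `LangA`/`LangB` placeholders are all hypothetical.

```python
from collections import Counter


def count_languages(pull_request_events):
    """Count pull requests per base-repository language, skipping forks."""
    counts = Counter()
    for event in pull_request_events:
        repo = event["base_repo"]  # hypothetical field name
        if repo["fork"] or repo["language"] is None:
            continue  # exclude forked repos and repos with no detected language
        counts[repo["language"]] += 1
    return counts


def rank_from_batches(batches):
    """Merge per-batch counts (e.g. one batch per query) into one ranking."""
    total = Counter()
    for batch in batches:
        total += count_languages(batch)
    # Highest count first; break ties by name so the order is deterministic.
    ordered = sorted(total.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, lang, n) for rank, (lang, n) in enumerate(ordered, start=1)]


def make_event(language, fork):
    """Build a toy event in the assumed shape."""
    return {"base_repo": {"language": language, "fork": fork}}


# Toy data only; these are not real GitHub Archive records.
batch_1 = [make_event("LangA", False), make_event("LangA", False),
           make_event("LangB", False), make_event("LangB", True)]
batch_2 = [make_event("LangB", False), make_event("LangB", False),
           make_event(None, False), make_event("LangA", True)]

ranking = rank_from_batches([batch_1, batch_2])
print(ranking)
```

Summing per-batch counters mirrors the point that the aggregated history now has to be assembled from more than one query before a ranking is derived.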
The obvious question in the wake of this procedural change concerns impact. How do this quarter’s rankings compare with our last run? There are two answers to that: first, the change within the GitHub portion of our rankings; second, the change in our rankings overall with the unaffected Stack Overflow results weighted in. In both cases, it depends on where in the Top 20 a language is ranked. Within our Top 10 languages, for example, the average ranking change for the GitHub-only results was a significant but not enormous 1.2 spots. In the back half of the Top 20, however, the average change in a language’s position was 5.7.
When we weight in the Stack Overflow results, predictably, these differentials are somewhat more modest. Within the Top 10, languages moved on average only half a spot. And even in the much more volatile back half, the end change in the overall rankings was a mere three spots.
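For readers who want to reproduce this kind of comparison on their own data, the averages above can be computed along the following lines. This is a sketch under stated assumptions: we take "average change" to mean the mean absolute difference in position, and we assign languages to a band by their position in the earlier run; the post does not spell out either choice. The function name and the toy ranks are hypothetical.

```python
def mean_abs_shift(previous, current, lo, hi):
    """Mean absolute rank change for languages ranked lo..hi in `previous`.

    Languages absent from `current` are skipped. Banding by the earlier
    ranking is an assumption made for this illustration.
    """
    shifts = [
        abs(current[lang] - rank)
        for lang, rank in previous.items()
        if lo <= rank <= hi and lang in current
    ]
    if not shifts:
        raise ValueError("no languages in the requested band")
    return sum(shifts) / len(shifts)


# Toy data with six placeholder languages, split into two bands of three.
previous = {"lang_a": 1, "lang_b": 2, "lang_c": 3,
            "lang_d": 4, "lang_e": 5, "lang_f": 6}
current = {"lang_a": 2, "lang_b": 1, "lang_c": 3,
           "lang_d": 6, "lang_e": 4, "lang_f": 5}

print("front half:", round(mean_abs_shift(previous, current, 1, 3), 2))
print("back half:", round(mean_abs_shift(previous, current, 4, 6), 2))
```

With real data the bands would be 1 to 10 and 11 to 20, and the same function could be run once on the GitHub-only ranks and once on the combined ranks.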
This is, to be sure, the most significant change since we started performing this analysis. But as mentioned, after testing various approaches, this is the one most tightly correlated with our previous results, and thus the one offering the greatest continuity with our previous rankings.
With that major update out of the way, please keep in mind the other usual caveats.
- To be included in this analysis, a language must be observable within both GitHub and Stack Overflow.
- No claims are made here that these rankings are representative of general usage more broadly. They are nothing more or less than an examination of the correlation between two populations we believe to be predictive of future use, hence their value.
- There are many potential communities that could be surveyed for this analysis. GitHub and Stack Overflow are used here first because of their size and second because of their public exposure of the data necessary for the analysis. We encourage, however, interested parties to perform their own analyses using other sources.
- All numerical rankings should be taken with a grain of salt. We rank by numbers here strictly for the sake of interest. In general, the numerical ranking is substantially less relevant than the language’s tier or grouping. In many cases, one spot on the list is not distinguishable from the next. The separation between language tiers on the plot, however, is generally representative of substantial differences in relative popularity.
- In addition, the further down the rankings one goes, the less data there is available to rank languages by. Beyond the top tiers of languages, depending on the snapshot, the amount of data to assess is minute, and the actual placement of languages becomes less reliable the further down the list one proceeds.
With that, here is the first quarter plot for 2017.
Besides the above plot, which can be difficult to parse even at full size, we offer the following numerical rankings. As will be observed, this run produced several ties which are reflected below (they are listed out here alphabetically rather than consolidated as ties because the latter approach led to misunderstandings). Note that this is actually a list of the Top 23 languages, not Top 20, because of said ties.
Lower down in the order, however, things get more interesting. A few comments on languages with notable movement, in no particular order.
- R: The preferred language for a growing number of statisticians, data scientists and other analytical types had been enjoying an incremental rise, moving from 15 to a steady 13 and finally jumping to 12 in our last run. This time around, however, the language falls back two spots to number 14. This is principally attributable to a softening in its GitHub ranking in the new process. Unlike its competitor in the analytical space, Python, which rose three spots along that axis, R fell five spots in our GitHub rankings even as its Stack Overflow ranking rose one place. This minor movement, however, says little about R’s current or future performance; like PHP, the language remains popular in spite of a step back.
- Swift: At the opposite end of the spectrum from R, Swift was a major beneficiary of the new GitHub process, jumping eight spots from 24 to 16 on our GitHub rankings. While the language appears to be entering something of a trough of disillusionment from a market perception standpoint, with major hype giving way to skepticism in many quarters, its statistical performance according to the observable metrics we track remains strong. Swift has reached a Top 15 ranking faster than any other language we have tracked since we’ve been performing these rankings. Its strong performance from a GitHub perspective suggests that the wider, multi-platform approach taken by the language is paying benefits. As we’ve said since it first entered our rankings, Swift remains a language to watch.
- Go: While Go also benefitted from the new ranking model, jumping four spots in the GitHub portion of our ranking system, that wasn’t enough to keep up with Swift, which leapfrogged it. To some extent, this isn’t a surprise, as Go neither has the built-in draw of iOS mobile app development nor is generally positioned as a front and back end language, as Swift increasingly is. More to the point, while it might have held static, a ranking of 15 is impressive for an infrastructure runtime.
- PowerShell: As mentioned above, no top tier language outperformed TypeScript on the GitHub portion of our rankings, but one language equaled it. PowerShell moved from 36 within the GitHub rankings to 19 to match TypeScript’s 17 point jump, and that was enough to nudge it into the Top 20 overall from its prior ranking of 25. While we can’t prove causation, it is interesting to note that this dramatic improvement from PowerShell comes one quarter after it was released as open source software. Between PowerShell and TypeScript, not to mention C#’s sustained performance, Microsoft has reason to be pleased about its programming language investments.
- Rust: One of the biggest overall gainers of any of the measured languages, Rust leaped from 47 on our board to 26 – one spot behind Visual Basic. This comes two quarters after the language not only stalled, but actually gave up ground in our last rankings. What a difference a few months can make. By our metrics, Rust went from the 46th most popular language on GitHub to the 18th. Some of that is potentially a result of the new process, of course, but no other language grew faster. Granted, it’s easier for Rust to achieve that kind of growth than for a language already in the top tier, but nevertheless Rust’s performance is impressive. It’s possible that Rust is finally turning the corner and becoming the mainstream language that many expected it could be. We’ll be watching its movement over the next few quarters to assess Rust’s potential for moving into the Top 20.
Credit: My colleague Rachel Stephens evaluated the available options for extracting rankings from GitHub data, and wrote and executed the queries that are responsible for the GitHub axis in these rankings.