tecosystems

The RedMonk Programming Language Rankings: January 2017


After clearing a series of obstacles – some mundane and irrelevant, others much less so – it’s time to publish our bi-annual RedMonk Programming Language Rankings. As many are aware, these rankings are a continuation of the work originally done by Drew Conway and John Myles White, who first looked at the question late in 2010. From a macro perspective, the process remains the same: we extract language rankings from GitHub and Stack Overflow, and combine them for a ranking that attempts to reflect both code (GitHub) and discussion (Stack Overflow) traction. The idea is not to offer a statistically valid representation of current usage, but rather to correlate language discussion (Stack Overflow) and usage (GitHub) in an effort to extract insights into potential future adoption trends.

In January 2014, we were forced to make a change to the way that GitHub’s rankings were collected because GitHub stopped providing them. This quarter’s run features the first major change in how these rankings are conducted since then. To help understand how this change was made and why it was necessary, here’s a brief explanation of our GitHub ranking process.

The Process to Date

In our early language ranking runs we pulled the data directly from GitHub’s Explore page. GitHub ceased publishing the rankings there, however, and in 2014 we found a new data source using the GitHub Archive public dataset on Google BigQuery.

Our query counted repository languages (excluding forked repos) by aggregating total created events. Though now defunct, our previous query was similar to the one in this Stack Overflow answer.

This query worked from 2014 through our last run in June 2016. However, we again needed to adjust our query due to changes in the GitHub Archive table structure as well as changes in GitHub’s API that impacted GitHub Archive’s language data. These changes provided the opportunity to evaluate our data source.

Our Updated Process

In June 2016, GitHub and Google announced a second public data set for publicly licensed repos. We initially explored the languages table on this dataset as our new potential source. This data had the benefit of providing multiple languages for a repository based on the number of bytes used per language, which in theory could give a more accurate representation of languages than using a repo’s primary language.
However, we found that the results from this data were suboptimal because:

  • This data only includes licensed repositories, a much smaller subset than the public repositories of GitHub Archive.
  • Furthermore, the process of recognizing licenses is occasionally brittle, even further limiting the available data.
  • The definition of languages expanded beyond what we have historically represented and included things like config files and typesetting systems.
  • In what was our ultimate deciding factor, the results of this query were significantly less correlated with previous GitHub language data as well as Stack Overflow data.
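
As a rough illustration of the kind of comparison behind that last point, the sketch below computes a Spearman rank correlation between two sets of language ranks. The function name and the rank values are made up for illustration, this is not the actual analysis used for these rankings, and the simple formula assumes there are no ties.

```python
def spearman_no_ties(ranks_a, ranks_b):
    """Spearman rank correlation over the languages present in both rankings.

    ranks_a, ranks_b: dicts mapping language name -> rank (1 = most popular).
    Assumes no ties within either ranking (an assumption of this sketch).
    """
    common = sorted(set(ranks_a) & set(ranks_b))
    n = len(common)
    if n < 2:
        raise ValueError("need at least two shared languages")

    def rerank(ranks):
        # Re-rank within the shared set so that ranks run 1..n in both lists.
        ordered = sorted(common, key=lambda lang: ranks[lang])
        return {lang: i + 1 for i, lang in enumerate(ordered)}

    a, b = rerank(ranks_a), rerank(ranks_b)
    d_squared = sum((a[lang] - b[lang]) ** 2 for lang in common)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical ranks from an old and a candidate new data source.
old_source = {"A": 1, "B": 2, "C": 3, "D": 4}
new_source = {"A": 2, "B": 1, "C": 3, "D": 4}
print(spearman_no_ties(old_source, new_source))
```

A value close to 1 means the two sources order the languages almost identically; a candidate source that scores noticeably lower against both the historical GitHub data and the Stack Overflow data would offer less continuity.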

We also briefly explored the GH Torrent project. While this was an interesting data set that could be a great resource for curious individuals, its licensing prohibited our use in this instance.

This ultimately led us back to GitHub Archive. Though we could not access the same language data that we had previously, we were able to query language by pull request. Our query resembles the one GitHub used to assemble the 2016 State of the Octoverse.

We endeavored to make the new query as comparable as possible to the previous process.

  • Language is based on the base repository language. While this continues to have the caveats outlined below, it does have the benefit of cohesion with our previous methodology.
  • We exclude forked repos.
  • We use the aggregated history to determine ranking, though because of the table structure changes this can no longer be accomplished with a single query.

The primary change is that the GitHub portion of the language ranking is now based on pull requests rather than repos. While this means we couldn’t replicate the rankings as they were before, the results were generally correlated with our past runs and were the best method available. On the positive side, it also eliminates the most common complaint regarding the rankings historically: that measurements by repo might overestimate a given language’s importance – JavaScript, most frequently.
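
To make the counting rules above concrete, here is a minimal Python sketch, not the actual BigQuery query. It assumes each event is a dictionary with the hypothetical fields `type`, `base_repo_language` and `base_repo_is_fork`; the real GitHub Archive tables are structured differently.

```python
from collections import Counter


def count_pull_requests_by_language(events):
    """Count pull request events per base repository language, skipping forks.

    The field names used here are hypothetical stand-ins for whatever the
    underlying dataset actually exposes.
    """
    counts = Counter()
    for event in events:
        if event.get("type") != "PullRequestEvent":
            continue  # only pull requests feed the GitHub axis in this sketch
        if event.get("base_repo_is_fork"):
            continue  # forked repos are excluded
        language = event.get("base_repo_language")
        if language:
            counts[language] += 1
    return counts


def combine_periods(per_period_counts):
    """Sum counts from several periods, mimicking an aggregated history."""
    total = Counter()
    for counts in per_period_counts:
        total.update(counts)
    return total


def rank_languages(counts):
    """Order languages by descending count, breaking ties by name."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample events for illustration only.
events = [
    {"type": "PullRequestEvent", "base_repo_language": "Python", "base_repo_is_fork": False},
    {"type": "PullRequestEvent", "base_repo_language": "Python", "base_repo_is_fork": False},
    {"type": "PullRequestEvent", "base_repo_language": "Go", "base_repo_is_fork": False},
    {"type": "PullRequestEvent", "base_repo_language": "Go", "base_repo_is_fork": True},
    {"type": "PushEvent", "base_repo_language": "Ruby", "base_repo_is_fork": False},
]
print(rank_languages(count_pull_requests_by_language(events)))
```

Because the history can no longer be read with a single query, `combine_periods` stands in for adding up the results of several separate runs before ranking.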

The Net

The obvious question in the wake of this procedural change concerns impact. How do this quarter’s rankings compare with our last run? There are two answers to that: first, the change within the GitHub portion of our rankings; second, the change in our rankings overall with the unaffected Stack Overflow results weighted in. In both cases, it depends on where in the Top 20 a language is ranked. Within our Top 10 languages, for example, the average ranking change for the GitHub only results was a significant but not enormous 1.2 spots. In the back half of the Top 20, however, the average change in a language’s position was 5.7.

When we weight in the Stack Overflow results, predictably, these differentials are somewhat more modest. Within the Top 10, languages moved on average only half a spot. And even in the much more volatile back half, the end change in the overall rankings was a mere three spots.
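
For readers who want to reproduce this kind of figure on their own data, the sketch below shows one way to compute an average ranking change. It assumes the average is the mean of the absolute differences in rank, which is our reading of the figures above rather than a stated definition, and the ranks in the example are invented.

```python
def mean_rank_change(previous, current, languages):
    """Mean absolute change in rank across the given languages.

    previous, current: dicts mapping language name -> rank.
    languages: the subset to average over, for example the Top 10.
    """
    moves = [abs(previous[lang] - current[lang]) for lang in languages]
    return sum(moves) / len(moves)


# Invented ranks for three languages across two runs.
previous_run = {"A": 1, "B": 2, "C": 5}
current_run = {"A": 2, "B": 1, "C": 9}
print(mean_rank_change(previous_run, current_run, ["A", "B", "C"]))  # prints 2.0
```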

This is, to be sure, the most significant change since we started performing this analysis. But as mentioned, after testing various approaches, this is the one most tightly correlated with our past results and thus the one offering the greatest continuity with our previous rankings.

With that major update out of the way, please keep in mind the other usual caveats.

  • To be included in this analysis, a language must be observable within both GitHub and Stack Overflow.
  • No claims are made here that these rankings are representative of general usage more broadly. They are nothing more or less than an examination of the correlation between two populations we believe to be predictive of future use, hence their value.
  • There are many potential communities that could be surveyed for this analysis. GitHub and Stack Overflow are used here first because of their size and second because of their public exposure of the data necessary for the analysis. We encourage, however, interested parties to perform their own analyses using other sources.
  • All numerical rankings should be taken with a grain of salt. We rank by numbers here strictly for the sake of interest. In general, the numerical ranking is substantially less relevant than the language’s tier or grouping. In many cases, one spot on the list is not distinguishable from the next. The separation between language tiers on the plot, however, is generally representative of substantial differences in relative popularity.
  • In addition, the further down the rankings one goes, the less data is available to rank languages by. Beyond the top tiers of languages, depending on the snapshot, the amount of data to assess is minute, and the actual placement of languages becomes less reliable the further down the list one proceeds.

With that, here is the first quarter plot for 2017.


Besides the above plot, which can be difficult to parse even at full size, we offer the following numerical rankings. As will be observed, this run produced several ties which are reflected below (they are listed out here alphabetically rather than consolidated as ties because the latter approach led to misunderstandings). Note that this is actually a list of the Top 23 languages, not Top 20, because of said ties.

1 JavaScript
2 Java
3 Python
4 PHP
5 C#
5 C++
7 CSS
7 Ruby
9 C
10 Objective-C
11 Scala
11 Shell
11 Swift
14 R
15 Go
15 Perl
17 TypeScript
18 PowerShell
19 Haskell
20 Clojure
20 CoffeeScript
20 Lua
20 Matlab
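
The numbering in the list above follows what is often called competition ranking: tied languages share a number, the next number is skipped, and ties are listed alphabetically. A minimal sketch of that scheme follows. The combined scores are hypothetical (lower is better here), since the exact way the two axes are combined is not spelled out in this post.

```python
def competition_ranks(scores):
    """Assign shared ranks to tied scores, skipping the ranks a tie uses up.

    scores: dict mapping language name -> combined score, lower is better.
    Returns a list of (rank, language) pairs, ties ordered by name.
    """
    ordered = sorted(scores.items(), key=lambda item: (item[1], item[0]))
    ranked = []
    previous_score = None
    previous_rank = 0
    for position, (language, score) in enumerate(ordered, start=1):
        # A language tied with the one before it reuses that rank;
        # otherwise its rank is its position in the sorted list.
        rank = previous_rank if score == previous_score else position
        ranked.append((rank, language))
        previous_score, previous_rank = score, rank
    return ranked


# Hypothetical combined scores for illustration only.
example_scores = {"Python": 3, "PHP": 4, "C++": 5, "C#": 5, "Ruby": 7, "CSS": 7, "C": 9}
for rank, language in competition_ranks(example_scores):
    print(rank, language)
```

Note that Python's default string ordering is case-sensitive, so a real implementation might want to sort tied names case-insensitively.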

Updated process or no, JavaScript and Java retain their respective positions atop our rankings. The lack of movement in JavaScript is particularly notable given that some argued that measuring by repo overweighted JavaScript’s actual significance versus a metric like pull requests, the basis for the new query. PHP has dropped a spot for the first time in the history of our rankings, but remains enormously popular even at the number four spot. Out of all of the languages in the top ten, on the other hand, Python benefitted the most from the change in our GitHub ranking process: where the average movement was one spot, Python jumped three spots, hence its leapfrogging of PHP. Outside of that, the only really notable movement in the top ten was Ruby dropping from five to seven.

Lower down in the order, however, things get more interesting. A few comments on languages with notable movement, in no particular order.

  • R: The preferred language for a growing number of statisticians, data scientists and other analytical types had been enjoying an incremental rise, moving from 15 to a steady 13 and finally jumping to 12 in our last run. This time around, however, the language falls back two spots to number 14. This is principally attributable to a softening in its GitHub ranking in the new process. Unlike its competitor in the analytical space, Python, which rose three spots along that axis, R fell five spots in our GitHub rankings even as its Stack Overflow ranking rose one place. This minor movement, however, says little about R’s current or future performance; like PHP, the language remains popular in spite of a step back.

  • Swift: On the opposite end of the spectrum from R, Swift was a major beneficiary of the new GitHub process, jumping eight spots from 24 to 16 on our GitHub rankings. While the language appears to be entering something of a trough of disillusionment from a market perception standpoint, with major hype giving way to skepticism in many quarters, its statistical performance according to the observable metrics we track remains strong. Swift has reached a Top 15 ranking faster than any other language we have tracked since we’ve been performing these rankings. Its strong performance from a GitHub perspective suggests that the wider, multi-platform approach taken by the language is paying benefits. As we’ve said since it first entered our rankings, Swift remains a language to watch.

  • Go: While Go also benefitted from the new ranking model, jumping four spots in the GitHub portion of our ranking system, that wasn’t enough to keep up with Swift, which leapfrogged it. To some extent, this isn’t a surprise, as Go has neither the built-in draw of iOS mobile app development nor the general positioning as a front and back end language that Swift increasingly has. More to the point, while it might have held static, a ranking of 15 is impressive for an infrastructure runtime.

  • TypeScript: Last quarter, this was what we believed was the question facing TypeScript: “The question facing the language isn’t whether it can grow, but whether it has the momentum to crack the Top 20 in the next two to three quarters, leapfrogging the likes of CoffeeScript and Lua in the process.” Well, consider that question answered. Of all of the top tier languages, none jumped more than TypeScript on our GitHub rankings, as the JavaScript superset moved up 17 points. While it also saw improvement in its Stack Overflow numbers, it was the GitHub improvement that vaulted it nine spots up and into the Top 20. We didn’t have time to explore the basis for this movement, but it seems reasonable to suspect that Angular is playing a role.

  • PowerShell: As mentioned above, no top tier language outperformed TypeScript on the GitHub portion of our rankings, but one language equaled it. PowerShell moved from 36 within the GitHub rankings to 19 to match TypeScript’s 17 point jump, and that was enough to nudge it into the Top 20 overall from its prior ranking of 25. While we can’t prove causation, it is interesting to note that this dramatic improvement from PowerShell comes one quarter after it was released as open source software. Between PowerShell and TypeScript, not to mention C#’s sustained performance, Microsoft has reason to be pleased about its programming language investments.

  • Rust: One of the biggest overall gainers of any of the measured languages, Rust leaped from 47 on our board to 26 – one spot behind Visual Basic. This comes two quarters after the language not only stalled, but actually gave up ground in our last rankings. What a difference a few months can make. By our metrics, Rust went from the 46th most popular language on GitHub to the 18th. Some of that is potentially a result of the new process, of course, but no other language grew faster. Granted, it’s easier for Rust to achieve that kind of growth than for a language already in the top tier, but nevertheless Rust’s performance is impressive. It’s possible that Rust is finally turning the corner and becoming the mainstream language that many expected it could be. We’ll be watching its movement over the next few quarters to assess Rust’s potential for moving into the Top 20.

Credit: My colleague Rachel Stephens evaluated the available options for extracting rankings from GitHub data, and wrote and executed the queries that are responsible for the GitHub axis in these rankings.

11 comments

  1. […] March 2017. RSIPL was created after reading several similar articles and lists, most notably, the RedMonk Programming Languages Rankings. The ranking methodology RedMonk uses is based upon popularity of the languages on Github […]

  2. […] according to the amount of code posted on Github and the number of questions on Stack Overflow, Rust leaped from number 47 to 26 in the list of languages between June 2016 and January 2017. This stellar rise […]

  3. XSLT has more Qs on StackOverflow than Lua, CoffeeScript, or Clojure, and 10 times as many as XQuery, so why isn’t it on the chart? (I’ve no idea where it would rank on github – probably not very high – because it tends to be a language for writing custom rather than generic code).

  4. Some languages like TeX have their own dedicated StackExchange websites (e.g., http://tex.stackexchange.com/); their y-axis values should likely be much higher.

  5. Tex is going to show up less on SO because it has its own dedicated SE site: tex.stackexchange.com

  6. The data looks reliable, especially since it mostly overlaps with other similar stats like this one:
    #   Language   | Popularity | Avg. salary (global)
    ---------------+------------+---------------------
    1.  JavaScript | 30.13%     | $60,186
    2.  Java       | 27.5%      | $61,741
    3.  Python     | 15.58%     | $66,353
    4.  C#         | 12.8%      | $66,470
    5.  SQL        | 10.97%     | $54,139
    6.  C++        | 9.95%      | $69,092
    7.  PHP        | 8.74%      | $53,420
    8.  Node.js    | 8.45%      | $64,818
    9.  C          | 6.28%      | $67,720
    10. Ruby       | 5.08%      | $68,478
    More stats and details: https://jobsquery.it/stats/language/group

  7. Any way to put delta vectors directly on the graph?

    Having the text commentary is nice, but “more please, Sir”

  8. […] most widely used languages chime with a separate analysis of Stack Overflow and GitHub rankings by RedMonk, which also placed JavaScript, Java, Python and […]

  9. GitHub linguist considers Rmd (R markdown) to be HTML. I’m curious how many repos containing R notebooks were not properly identified.

    How are you connecting languages to stack overflow? Are you taking packages or tools using a particular language into account? That is, are you making the connection by language name alone? What Stack Overflow data are you using to make the connection? Question tags? Number of responses? Or did you get access to their traffic data?

  10. […] a powerful platform for web development. For these reasons and others, Ruby is listed as #7 in the RedMonk Programming Language Rankings for Q1 2017 which uses both GitHub and Stack Overflow data sources to determine overall […]

  11. Doesn’t ranking a language’s “popularity” based on StackOverflow tags seem a bit of a strange metric? Surely this is mostly ranking how difficult it is to use or how steep the learning curve, rather than how much people enjoy using it? And to a lesser extent, but in the same vein, one could argue that GitHub pulls is more a metric of how hard it is to achieve the required result first time 🙂 – ok the last point may be a bit of a stretch but the first certainly isn’t.
