A while ago, I wanted to get a little quick feedback on some data I was playing with, but the day was almost over and I wasn’t done working on it yet. I decided to tweet my rough draft of a graph of GitHub language trends anyway, followed later by a slight improvement.
Much to my surprise, that graph was retweeted more than 2,000 times and reached well over 1 million people. My colleagues have both examined this data since I posted the graph — James took a stab at pulling out a few key points, particularly GitHub’s start around Rails and its growth into the mainstream, and Steve’s also taken a look at visualizing this data differently.
Despite that being fantastic news, the best part was the questions I got, and all the conversations gave me an opportunity to decide what points would be most interesting to people who read this post. The initial plot was a spaghetti graph, so I fixed it up and decided to do a more in-depth analysis.
Before we can get into useful results and interpretation, there are a few artifacts and potential pitfalls to be aware of:
- GitHub is a specific community that’s grown very quickly since it launched [writeup]. It was not initially reflective of open source as a whole but rather centered around the Ruby on Rails community;
- In 2009, the GitPAN project imported all of CPAN (Perl’s module ecosystem) into GitHub, which explains the one-time peak;
- I’m showing percentages, not absolute values. A downward slope does not mean fewer repositories are being created. It does mean, however, that other languages are gaining repositories faster.
The big reveal
The first set of graphs shows new, non-fork repositories created on GitHub by primary language and year. This dataset includes all languages that were in the top 10 during any of the years 2008–2013, but languages used for text-editor configuration were ignored (VimL and Emacs Lisp). I’m showing them as a grid of equally scaled graphs to make comparisons easier across any set of languages, and I’m using percentages to indicate relative share of GitHub.
- GitHub hits the mainstream: James quickly nailed the key point: GitHub has gone mainstream over the past 5 years. This is best shown by the decline of Ruby as it reached beyond the Rails community and the simultaneous growth of a broad set of both old and newer languages including Java, PHP, and Python as GitHub reached a broader developer base. The apparent rise and drop of languages like PHP, Python, and C could indicate that these communities migrated toward GitHub earlier than others. This would result in an initially larger share that lowered as more developers from e.g. Java, C++, C#, Obj-C, and Shell joined.
- Windows and iOS development nearly invisible: Both C# and Objective-C are unsurprisingly almost invisible, because they’re both ecosystems that either don’t encourage or actively discourage open-source code. These are the two languages in this chart most likely to be unreflective both of current usage outside GitHub but also of predictive usage, again due to open-source imbalance in those communities.
What about pushes rather than creation?
What’s really interesting is that if you do the same query by when the last push of code to the repo occurred rather than its creation, the graphs look nearly identical (not shown). The average number of pushes to repositories is independent of both time and language but is correlated with when repositories were created. In only two cases do the percentages of created and pushed repos differ by more than 2 points: Perl in 2009 (+4.1% pushed) and Ruby in 2008 (–3.5% pushed), both of which are likely artifacts due to the caveats described earlier.
This result is particularly striking because there’s no difference over time despite a broader audience joining GitHub, and there’s also no difference across all of these language communities. The vast majority of repositories (>98%) are modified only in the year they are created, and they’re never touched again. This is consistent with my previous research exploring the size of open-source projects, where we saw that 87% of repositories have ≤5 contributors.
Are GitHub issues a better measure of interest?
One potential problem with looking at repositories is that it’s not a reflection of usage or and a fairly indirect measurement of interest for a given codebase. It instead measures developers creating new code — to get a closer look at usage, some possibilities are forks, stars, or issues. GitHub’s search API makes it more convenient to focus on issues so that’s what I measured for this post. My expectation going into this was that issues would be much more biased by extremely popular projects with large numbers of users, but let’s take a look:
This gave me a fairly similar set of graphs to the new-repository data. It’s critical to note that although these are new issues, they’re filed against both new and preexisting repos so the trends are not directly comparable in that sense. Rather, they’re comparable in terms of thinking about different measurements of developer interest in a given language during the same timeframe. The peaks in Ruby, Python, and C++ early on are all due to particularly popular projects that dominated GitHub in its earlier days, when it was a far smaller collection of projects. Other than that, let’s take a look through the real trends.
- Ruby’s seen a steep decline since 2009. It peaked early on with Rails-related projects, but as GitHub grew mainstream, Ruby’s share of issues dropped back down. But again, this trend seems to be gradually flattening out around 10% of total issues.
- Java and PHP have both grown and stabilized. In both cases, they’ve reached around 10% of issue share and remained largely steady since then, although Java may continue to see slow growth here.
- Python’s issue count has consistently shrunk since 2009. Since dropping to 15% after an initial spike in 2008, it’s slowly come down to just above 10%. Given the past trend, which may be flattening out, it’s unclear whether it will continue to shrink.
The developer-centric (rather than code-centric) perspective
What if we take a different tack and focus on the primary language of new users joining GitHub? This creates a wildly different set of trends that’s reflective of individual users, rather than being weighted toward activist users who create lots of repositories and issues.
The points I find most interesting about these graphs are:
- There are no clearly artifactual spikes. All of the trends here are fairly smooth, very much unlike both the repos and issues. This is very encouraging because it suggests any results here may be more reliable rather than spurious.
- Language rank remains quite similar to the other two datasets. Every dataset is ordered by the number of new repos created in each language in 2013, to make comparisons simpler across datasets. If you look at activity in 2013 for issues and users, you can see that their values are generally ranked in the correct order with few minor exceptions. One in this case is that Java and Ruby should clearly be reversed, but that’s about all that’s obviously out of order.
- Almost every language shows a long-term downhill trend. With the exception of Java and (recently) CSS, all of these languages have been decreasing. This was a bit of a puzzler and made me wonder more about the fragmentation of languages over time, which I’ll explore later in this post as well as future posts. My initial guess is that users of languages below the top 12 are growing in share to counterbalance the decreases here. It’s also possible that GitHub may leave some users unclassified, which would tend to lower everything else’s proportion over time.
- I’m therefore not going to focus on linear decreases. I will, however, examine nonlinear decreases, or anything that’s otherwise an exception such as increases.
- Ruby’s downward slide shows an interesting sort of exponential decay. This is actually “slower” than a linear decrease as it curves upwards, so it indicates that relative to everything else moving linearly downward, Ruby held onto its share better.
- Java was the only top language that showed long-term increases during this time. Violating all expectations and trends, new Java users on GitHub even grew as a percentage of overall new users, while everything else went downhill. This further supports the assertion that GitHub is reaching the enterprise.
A consensus approach accounts for outliers
When I aggregated all three datasets together to look at how trends correlated across them, everything got quite clear:
The fragmenting landscape
Although you can see an initial rush by the small but diverse community of early adopters creating lots of repositories in less-popular languages, it dropped off dramatically as GitHub exploded in popularity. Then the trend begins a more gradual increase as a wide variety of smaller language communities migrate onto GitHub. New issues show a similar but slower increase starting in 2009, when GitHub added issues. While new users increase the fastest, that likely reflects a combination of users in less-popular languages and “lurker” users with no repositories at all, and therefore no primary language.
The programming landscape today continues to fragment, and this GitHub data supports that trend over time as well as an increasing overlap with the mainstream, not only early adopters.
Update (2014/05/05): Here’s raw data from yesterday in Google Docs.
Update (2014/05/08): Simplify graphs as per advice from Jennifer Bryan.
Disclosure: GitHub has been a client.