RedMonk

GitHub language trends and the fragmenting landscape

A while ago, I wanted to get a little quick feedback on some data I was playing with, but the day was almost over and I wasn’t done working on it yet. I decided to tweet my rough draft of a graph of GitHub language trends anyway, followed later by a slight improvement.

Trends over time, smoothed to make it a little easier to follow

Much to my surprise, that graph was retweeted more than 2,000 times and reached well over 1 million people. My colleagues have both examined this data since I posted the graph — James took a stab at pulling out a few key points, particularly GitHub’s start around Rails and its growth into the mainstream, and Steve’s also taken a look at visualizing this data differently.

As fantastic as that was, the best part was the questions I got, and the conversations helped me decide which points would be most interesting to readers of this post. The initial plot was a spaghetti graph, so I fixed it up and decided to do a more in-depth analysis.

Caveats

Before we can get into useful results and interpretation, there are a few artifacts and potential pitfalls to be aware of:

  • GitHub is a specific community that’s grown very quickly since it launched [writeup]. It was not initially reflective of open source as a whole but rather centered around the Ruby on Rails community;
  • In 2009, the GitPAN project imported all of CPAN (Perl’s module ecosystem) into GitHub, which explains the one-time peak;
  • Language detection is based on lines of code, so a repository with a large amount of JavaScript template libraries (e.g. jQuery) copied into it will be detected as JavaScript rather than the language where most of the work is being done; and
  • I’m showing percentages, not absolute values. A downward slope does not mean fewer repositories are being created. It does mean, however, that other languages are gaining repositories faster.
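To make the last caveat concrete, here is a minimal sketch of how a language's share can fall while its absolute count rises. The counts are hypothetical, invented purely for illustration, and are not taken from the data behind this post.

```python
# Hypothetical counts, invented for illustration only; not data from this post.
new_repos = {
    2012: {"Ruby": 100_000, "Other": 400_000},
    2013: {"Ruby": 150_000, "Other": 850_000},
}


def share(counts, language):
    """Return the language's percentage of all repositories in `counts`."""
    total = sum(counts.values())
    return 100.0 * counts[language] / total


for year, counts in sorted(new_repos.items()):
    # The absolute count grows from 100,000 to 150,000 ...
    # ... while the share falls from 20.0% to 15.0%.
    print(year, counts["Ruby"], f"{share(counts, 'Ruby'):.1f}%")
```

In this made-up case the language gains 50,000 repositories in a year, yet its line on a percentage graph slopes downward because everything else grew faster.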

The big reveal

The first set of graphs shows new, non-fork repositories created on GitHub by primary language and year. This dataset includes all languages that were in the top 10 during any of the years 2008–2013, but languages used for text-editor configuration were ignored (VimL and Emacs Lisp). I’m showing them as a grid of equally scaled graphs to make comparisons easier across any set of languages, and I’m using percentages to indicate relative share of GitHub.

Data comes from date- and language-restricted searches using the GitHub search API.
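The exact queries behind these graphs are not shown in the post, so the following is only a sketch of what a date- and language-restricted search could look like. It builds a URL for GitHub's repository search endpoint using the `language:`, `created:` and `pushed:` search qualifiers; the helper name `repo_count_url` and the choice of whole-year date ranges are assumptions. Repository search leaves out forks unless `fork:true` is added, which fits the non-fork counts described above.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def repo_count_url(language, year, date_field="created"):
    """Build a search URL for repositories in `language` dated within `year`.

    `date_field` selects the date qualifier: "created" or "pushed".
    (Hypothetical helper; not the author's actual tooling.)
    """
    if date_field not in ("created", "pushed"):
        raise ValueError("date_field must be 'created' or 'pushed'")
    query = f"language:{language} {date_field}:{year}-01-01..{year}-12-31"
    # per_page=1 keeps the response small; only the `total_count`
    # field of the JSON response would be needed for a yearly count.
    return f"{SEARCH_URL}?{urlencode({'q': query, 'per_page': 1})}"


print(repo_count_url("ruby", 2013))
```

Fetching that URL and reading `total_count` for each language and year would give the raw numbers, which can then be divided by the yearly total to get shares.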

  • GitHub hits the mainstream: James quickly nailed the key point: GitHub has gone mainstream over the past 5 years. This is best shown by the decline of Ruby as it reached beyond the Rails community and the simultaneous growth of a broad set of both old and newer languages including Java, PHP, and Python as GitHub reached a broader developer base. The apparent rise and drop of languages like PHP, Python, and C could indicate that these communities migrated toward GitHub earlier than others. This would result in an initially larger share that lowered as more developers from e.g. Java, C++, C#, Obj-C, and Shell joined.
  • The rise of JavaScript: Another trend that instantly stands out is the growth of JavaScript. Although it’s tempting to attribute that to the rise of Node.js [2010 writeup], reality is far more ambiguous. Node certainly accounts for a portion of the increase, but equally important to remember is (1) the popularity of frameworks that generate large quantities of JavaScript code for new projects and (2) the JavaScript development philosophy that encourages bundling of dependencies in the same repo as the primary codebase. Both of these encourage large amounts of essentially unmodified JavaScript to be added to webapp repositories, which increases the likelihood that repositories, especially those involving small projects in other languages, get misclassified as JavaScript.
  • Windows and iOS development nearly invisible: Both C# and Objective-C are unsurprisingly almost invisible, because they’re both ecosystems that either don’t encourage or actively discourage open-source code. These are the two languages in this chart least likely to reflect current usage outside GitHub or to predict future usage, again due to the open-source imbalance in those communities.
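The misclassification effect described in the JavaScript point can be illustrated with a toy model. This is a simplified sketch that assumes detection simply picks the language with the most lines, as the caveats describe; it is not GitHub's actual classifier, and the line counts are invented.

```python
def primary_language(line_counts):
    """Return the language with the most lines (toy size-based detection)."""
    return max(line_counts, key=line_counts.get)


# Hypothetical small web app; the numbers are invented for illustration.
authored = {"Python": 1000, "JavaScript": 100}
vendored = {"JavaScript": 9000}  # copied-in, essentially unmodified libraries

# Merge the vendored lines into the authored totals.
combined = dict(authored)
for language, lines in vendored.items():
    combined[language] = combined.get(language, 0) + lines

print(primary_language(authored))  # Python
print(primary_language(combined))  # JavaScript
```

The smaller the authored codebase, the less bundled JavaScript it takes to flip the result, which is why small projects in other languages are the most exposed.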

What about pushes rather than creation?

What’s really interesting is that if you do the same query by when the last push of code to the repo occurred rather than its creation, the graphs look nearly identical (not shown). Push activity does not appear to vary by year or language; instead, the year of the last push closely tracks the year in which the repository was created. In only two cases do the percentages of created and pushed repos differ by more than 2 points: Perl in 2009 (+4.1% pushed) and Ruby in 2008 (–3.5% pushed), both of which are likely artifacts due to the caveats described earlier.

This result is particularly striking because there’s no difference over time despite a broader audience joining GitHub, and there’s also no difference across all of these language communities. The vast majority of repositories (>98%) are modified only in the year they are created, and they’re never touched again. This is consistent with my previous research exploring the size of open-source projects, where we saw that 87% of repositories have ≤5 contributors.

Are GitHub issues a better measure of interest?

One potential problem with looking at repositories is that they don’t reflect usage and are only a fairly indirect measurement of interest in a given codebase. Repository counts instead measure developers creating new code. To get a closer look at usage, some possibilities are forks, stars, or issues. GitHub’s search API makes it more convenient to focus on issues, so that’s what I measured for this post. My expectation going into this was that issues would be much more biased by extremely popular projects with large numbers of users, but let’s take a look:

Issues filed within repositories with that primary language.

This gave me a fairly similar set of graphs to the new-repository data. It’s critical to note that although these are new issues, they’re filed against both new and preexisting repos so the trends are not directly comparable in that sense. Rather, they’re comparable in terms of thinking about different measurements of developer interest in a given language during the same timeframe. The peaks in Ruby, Python, and C++ early on are all due to particularly popular projects that dominated GitHub in its earlier days, when it was a far smaller collection of projects. Other than that, let’s take a look through the real trends.

  • Nearly all of these trends are consistent with new repos. With the clear exception of Ruby and less obvious example of JavaScript, the trends above are largely consistent with those in the previous set of graphs. I’ll focus mainly on the exceptions in my other points.
  • JavaScript’s increase appears asymptotic rather than linear. In other words, it continues to increase but it’s decelerating, and it appears to be moving toward a static share around 25% of new issues. This may be the case with new repos as well, but it’s less obvious there than here.
  • Ruby’s seen a steep decline since 2009. It peaked early on with Rails-related projects, but as GitHub grew mainstream, Ruby’s share of issues dropped back down. But again, this trend seems to be gradually flattening out around 10% of total issues.
  • Java and PHP have both grown and stabilized. In both cases, they’ve reached around 10% of issue share and remained largely steady since then, although Java may continue to see slow growth here.
  • Python’s issue count has consistently shrunk since 2009. Since dropping to 15% after an initial spike in 2008, it’s slowly come down to just above 10%. Given the past trend, which may be flattening out, it’s unclear whether it will continue to shrink.

The developer-centric (rather than code-centric) perspective

What if we take a different tack and focus on the primary language of new users joining GitHub? This creates a wildly different set of trends that’s reflective of individual users, rather than being weighted toward activist users who create lots of repositories and issues.

Users joining in a certain year with a majority of their repositories in that language.

The points I find most interesting about these graphs are:

  • There are no clearly artifactual spikes. All of the trends here are fairly smooth, very much unlike both the repos and issues. This is very encouraging because it suggests any results here may be more reliable rather than spurious.
  • Language rank remains quite similar to the other two datasets. Every dataset is ordered by the number of new repos created in each language in 2013, to make comparisons simpler across datasets. If you look at activity in 2013 for issues and users, you can see that their values are generally ranked in the correct order, with a few minor exceptions. One in this case is that Java and Ruby should clearly be reversed, but that’s about all that’s obviously out of order.
  • Almost every language shows a long-term downhill trend. With the exception of Java and (recently) CSS, all of these languages have been decreasing. This was a bit of a puzzler and made me wonder more about the fragmentation of languages over time, which I’ll explore later in this post as well as future posts. My initial guess is that users of languages below the top 12 are growing in share to counterbalance the decreases here. It’s also possible that GitHub may leave some users unclassified, which would tend to lower everything else’s proportion over time.
  • I’m therefore not going to focus on linear decreases. I will, however, examine nonlinear decreases, or anything that’s otherwise an exception such as increases.
  • Ruby’s downward slide shows an interesting sort of exponential decay. This is actually “slower” than a linear decrease as it curves upwards, so it indicates that relative to everything else moving linearly downward, Ruby held onto its share better.
  • Java was the only top language that showed long-term increases during this time. Violating all expectations and trends, new Java users on GitHub even grew as a percentage of overall new users, while everything else went downhill. This further supports the assertion that GitHub is reaching the enterprise.
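The difference between a linear decrease and an exponential-decay-style decrease is easiest to see in the year-over-year losses. The sketch below uses two made-up series (not the GitHub data): one that loses a fixed number of points each year and one that loses a fixed fraction of its remaining share.

```python
def yearly_drops(series):
    """Return the loss between consecutive years, in percentage points."""
    return [round(a - b, 2) for a, b in zip(series, series[1:])]


# Hypothetical shares over five years, invented for illustration.
linear = [20.0 - 3.0 * t for t in range(5)]          # loses 3 points per year
exponential = [20.0 * 0.75 ** t for t in range(5)]   # loses 25% of what is left

print(yearly_drops(linear))       # constant losses
print(yearly_drops(exponential))  # losses shrink every year
```

A linear decline keeps shedding the same amount, while the decay curve sheds less each year, which is the sense in which the curve "flattens out" rather than heading to zero.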

A consensus approach accounts for outliers

When I aggregated all three datasets together to look at how trends correlated across them, everything got quite clear:

New repositories, users, and issues in a given language according to the GitHub search API.

Artifacts become obvious as spikes in only one of the three datasets, as happens for a number of languages in the 2009–2010 time frame. It’s increasingly obvious that only 5 languages have historically mattered on GitHub on the basis of overall share: JavaScript, Ruby, Java, PHP, and Python. New contender CSS is on the way up, while C and C++ hold honorable mentions. Everything else is, on a volume basis, irrelevant today, even if it’s showing fantastic growth like Go and will likely be relevant in these rankings within the next year or two.
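The idea of spotting artifacts as spikes that appear in only one of the three datasets can be sketched as a simple check against the per-year median. This is an illustrative heuristic, not the method used to produce the graphs; the function name, the 5-point threshold and the sample values are all assumptions.

```python
from statistics import median


def single_dataset_spikes(repos, issues, users, threshold=5.0):
    """Return (year, dataset) pairs where one dataset departs from the others.

    Each argument maps year -> percentage share for a single language.
    A value is flagged when it exceeds the median of the three datasets
    for that year by more than `threshold` percentage points. With three
    values, only the largest can exceed the median, so at most one
    dataset is flagged per year.
    """
    datasets = {"repos": repos, "issues": issues, "users": users}
    flagged = []
    for year in sorted(set(repos) & set(issues) & set(users)):
        values = {name: data[year] for name, data in datasets.items()}
        mid = median(values.values())
        for name, value in values.items():
            if value - mid > threshold:
                flagged.append((year, name))
    return flagged


# Hypothetical shares for one language, invented for illustration.
repos = {2008: 4.0, 2009: 12.0, 2010: 3.5}
issues = {2008: 3.5, 2009: 3.0, 2010: 3.0}
users = {2008: 4.5, 2009: 4.0, 2010: 3.5}

print(single_dataset_spikes(repos, issues, users))  # [(2009, 'repos')]
```

A one-off bulk import would show up like the 2009 value here: a jump in new repositories with no matching jump in issues or users.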

The fragmenting landscape

In looking at the decline in the past couple of years among many of the top languages, I started wondering whether it was nearly all going to JavaScript and Java or whether there might be more hidden in there. After all, there’s a whole lot more than 12 languages on GitHub. So I next looked at total repository creation and subtracted only the languages shown above, to look at the long tail.

Totals after subtracting the top 12 languages.

Although you can see an initial rush by the small but diverse community of early adopters creating lots of repositories in less-popular languages, it dropped off dramatically as GitHub exploded in popularity. Then the trend begins a more gradual increase as a wide variety of smaller language communities migrate onto GitHub. New issues show a similar but slower increase starting in 2009, when GitHub added issues. While new users increase the fastest, that likely reflects a combination of users in less-popular languages and “lurker” users with no repositories at all, and therefore no primary language.
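The long-tail figure is a remainder: the total minus whatever is attributed to the top languages. A minimal sketch with invented numbers shows why users without any repositories end up in that remainder alongside users of less-popular languages.

```python
def long_tail(total, by_language, top_languages):
    """Return the count not attributed to any language in `top_languages`."""
    return total - sum(by_language.get(lang, 0) for lang in top_languages)


# Hypothetical new-user counts, invented for illustration.
total_new_users = 1000
users_by_language = {"JavaScript": 300, "Ruby": 200, "Java": 150, "Haskell": 50}
top = ["JavaScript", "Ruby", "Java"]

# The remainder holds the 50 Haskell users plus 300 users with no
# primary language at all (for example, accounts with no repositories).
print(long_tail(total_new_users, users_by_language, top))  # 350
```

Because the subtraction cannot tell those two groups apart, the user remainder can grow faster than the repository or issue remainders even if language diversity is unchanged.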

The programming landscape today continues to fragment, and this GitHub data supports that trend over time as well as an increasing overlap with the mainstream, not only early adopters.

Update (2014/05/05): Here’s raw data from yesterday in Google Docs. 

Update (2014/05/08): Simplify graphs as per advice from Jennifer Bryan.

Disclosure: GitHub has been a client.

by-sa

Categories: community, distributed-development, open-source, programming-languages.

  • TheSteve0

    Hey Donny – really interesting post. Couple of questions
    1) I would like to see the overall trend of user signups. My guess is that github has probably peaked in terms of percentage new user recruitment and so we see an overall decrease in the rate of new user acquisition

    2) Is there any way you can get # of commits or lines of change over time for the projects and then break that by language. That would avoid the Jquery problem.

    Either way – fun

  • Pingback: Programming Languages in 2013 | Kynosarges Weblog

  • knardi

    These graphs would be incredibly more useful as absolute numbers instead of percentages of the whole. Do you have those?

    • dkural

      Why would absolute numbers be more useful?

      • knardi

        Because one can easily see approximate percentages from absolute numbers, but not vice versa.

        In particular, these graphs seem to suffer from lots of movement that occurred because of changes in the popularity of GitHub itself, rather than of the individual languages, but it’s impossible to actually see that without the raw numbers.

        I would most like to see graphs of the yearly rates of growth of repositories, users, and issues for each language, in absolute values.

  • http://www.fermigier.com/ sfermigier

    Bullshit analysis, based on bogus data.

    For instance, 1/2 of my repositories, which are really Python projects, are misclassified by GitHub as Javascript projects.

    • http://nvartolomei.com/ nvartolomei

      Did you submit bug reports to GitHub?

    • http://dague.net/ Sean Dague

      Actually a very interesting point. I just looked at my repo list – https://github.com/sdague?tab=repositories – and it misclassifies 3 ruby repositories as javascript. Would be curious how many other repositories are misclassified.

      • http://www.computersnyou.com/ Alok Yadav

        Agree! Most Ruby applications (especially Rack- and Rails-based ones) are misclassified by GitHub as JavaScript projects.

      • LaTrinius Washington

        COBOL is where it’s at, son. The Fortune 100, especially the big financials, can’t find enough mainframe guys. They are paying geezers $300K/yr as consultants to come out of retirement to keep they shiit running.

        • Andrew Pennebaker

          Such high paying jobs are rare, and will only become rarer, as organizations realize it’s much cheaper to write it from scratch.

        • Jeff Ulrich

          The COBOL jobs are being outsourced to India. There aren’t nearly as many COBOL jobs stateside as there once were. There is no shortage of COBOL programmers in India.

          • LaTrinius Washington

            India only has Java and C# programmers. There are no COBOL programmers there.

          • Jeff Ulrich

            http://www.shine.com/job-search/simple/cobol-400/ These jobs were being taken by folks in India back in the early 2000s when I worked at the Federal Reserve. This is not new. Nobody here wants to learn / code in COBOL.

        • http://www.versioneye.com/ Robert Reiz

          $300K/yr is not enough! The COBOL way is a one-way street. Only the finance and insurance industry is employing COBOL devs. The number of jobs are limited and as soon they decide to re-write the old COBOL shit in Java you are unemployed.

    • joshcody

      I’ve found similar issues with reporting on my repos. The problem is usually because I’ve included jquery, bootstrap, angular, or some other library that contains the min.js as well as the un-minified js and a bunch of other extra files. Javascript tends to be verbose and adding libraries locally will increase the total number of javascript code lines which is what Github is reporting on.

    • http://redmonk.com/dberkholz Donnie Berkholz

      The data’s perfectly fine, data is data is data. You just have to realize how it’s created so you understand the interpretations you can draw from it.

      I’ll copy and paste from the caveats I mentioned at the top of the post, since you’ve effectively repeated them again: “Language detection is based on lines of code, so a repository with a large amount of JavaScript template libraries (e.g. jQuery) copied into it will be detected as JavaScript rather than the language where most of the work is being done”

      That said, there’s two further points worth noting. First, JavaScript is pretty much universal across webapps so any artifacts due to JS usage shouldn’t make a large impact when comparing across any languages besides JS where webapps are created. Second, if you consider the *usage* aspect of this, you are using more JavaScript than Python in those projects, regardless of what your commits look like.

      • http://www.fermigier.com/ sfermigier

        Donnie, I know your line of defense, but it is as bogus as your initial analysis.

        1. If the data you use as the input of your analysis is bogus, and if you are aware of it, you should work on getting better data, not keep on working on bogus data.

        2. As a programmer, I know a bit what I’m doing. I know that for a given project, at this point for 1000 lines of original Python I write, I probably write about 10 to 100 lines of Javascript. All the rest is jQuery and Bootstrap and Angular stuff that’s just copied in my projects. For smaller projects, the amount of included JS code just confuses the detection algorithm. I know that’s the same for a great number of other web projects.

        • EECOLOR

          Maybe this is an incentive to not keep those files in your repository. You could use a build tool and dependency manager to pull in all dependencies. Another option is to make more use of CDN’s.

          • Kelsoh

            When I program, I focus on making my app work–bringing in what I need to get what I need done. Even if I agree delegating base files to CDN’s is a good idea, it’s not my job to structure my code so that bloody github can read global stats properly.

            They ought to have heuristics to quickly identify if something is a rails app, etc. etc.

      • http://dague.net/ Sean Dague

        Realistically this isn’t just web apps. https://github.com/sdague/temperature.rb is a good instance of a ruby project that has 0% javascript in it, but is classified as javascript.

        If it was just the jquery question, that would be one thing. However, at least on my set of repos there are pure ruby repositories classified as javascript. Would be interesting to have some random spot checking to figure out how accurate the classifier seems to be.

      • reinier_post

        How hard would it be to do string similarity analysis on the code base to weed out contributed code? Basically: don’t count lines of code, but the amount of difference between code and any other code. That would be a fun project and I think it’s doable.

        • http://redmonk.com/dberkholz Donnie Berkholz

          The complexity is more in the scale than anything else. Cloning 10 million git repositories takes a lot of time and space, and doing similarity comparisons across that much code is going to be very computationally expensive. I’d love to see it happen but I don’t see myself doing it.

          • reinier_post

            With the right simplifications, it may be doable.

          • http://www.fermigier.com/ sfermigier

            Science is hard. Let’s go shopping.

    • Jeremy List

      How do you see what it classifies the project as? For my project I can only see what language it’s in by looking at file extensions.

    • http://www.versioneye.com/ Robert Reiz

      Totally agree! Most of my Ruby repos are classified as JS repos. Just pull in a couple of mainstream JS frameworks and you very quickly have much more JS code than Ruby code. It would be cool if GitHub would take the Rails directory structure into account and simply ignore everything under app/assets/javascript. Or simply ignore the most popular JS libs like jQuery and co.

    • John Stargell Corser

      This actually is mentioned in his analysis…

      The rise of JavaScript: Another trend that instantly stands out is the growth of JavaScript. Although it’s tempting to attribute that to the rise of Node.js [2010 writeup], reality is far more ambiguous. Node certainly accounts for a portion of the increase, but equally important to remember is (1) the popularity of frameworks that generate large quantities of JavaScript code for new projects and (2) the JavaScript development philosophy that encourages bundling of dependencies in the same repo as the primary codebase. Both of these encourage large amounts of essentially unmodified JavaScript to be added to webapp repositories, which increases the likelihood that repositories, especially those involving small projects in other languages, get misclassified as JavaScript.

  • http://jimmythompson.co.uk/ Jimmy Thompson

    When discussing the primary language for new users you mention the rise of Java as evidence of your assertion that “Github is reaching the enterprise”. I know, at least in the UK, Java is the language for choice for university teaching and many students/graduates would join Github with Java as their most confident language. Students are being encouraged to post their university work on Github as part of this trend of having a “Github resumé”; I would say this is an equally likely cause for the rise of users (and repositories) whose primary language is Java.

    In addition, the rise of CSS could be, on the most part, attributed to the rise of repositories which are for Github pages. All website repositories I’ve seen are classed as CSS by Github.

    • http://redmonk.com/dberkholz Donnie Berkholz

      Universities today have shifted toward teaching what hiring companies want students to know, so they’re pretty similar on that front.

      Great point re Pages. Hadn’t thought about that one.

  • Skynet

    C# is alive and well even if it’s not a primary language in open source ;)

    • https://about.me/jeff_dickey Jeff Dickey

      I don’t think anyone’s denying that (ditto for Objective-C, my current second-fave language after Ruby). The problem being discussed is that the trends drawn in the OP are, if not completely bogus, then at best highly questionable, because they use Github language data when there are known, demonstrable problems with the Github language classifier. It’s not like they’re not aware of this, either; the Linguist repo description reads

      Language Savant. If your repository’s language is being reported incorrectly, send us a pull request!

      Linguist’s job is part of one of the two hard problems in computer science.

  • stubbornella

    Github didn’t include CSS as a language until recently, and even still has lots of repos misclassified. For example: https://github.com/stubbornella/oocss

  • http://www.facebook.com/borjavss Borja V. Sorlí Sanz

    Hello everybody! Why does R not appear among the GitHub repositories?

  • mrobviouslee

    Trying too hard to create useful info from useless data.

  • http://codesorcery.net incanus

    > Both C# and Objective-C are unsurprisingly almost invisible, because they’re both ecosystems that either don’t encourage or actively discourage open-source code.

    I don’t think this is correct; I think you are confusing GPL conflicts in the App Store with general open source (MIT, BSD, etc.) use. There is a vibrant open source community on iOS; just have a look at CocoaPods, most of which are open source, as an example:

    http://cocoapods.org

  • Derek Gaston

    If you did this analysis based on Linguist (the language classifier GitHub uses) then this is completely bogus data. Look at the discussion here: https://github.com/github/linguist/pull/936 at one point there were nearly 100 open pull requests! The project has basically been dead and has been misclassifying for a long time. Only over the last couple of weeks have some GitHub devs stepped in and started to clean things up.

    Redo this analysis after Linguist gets back on its feet…

  • Eric des Courtis

    Add Erlang, Haskell and Scala please

    • Jeremy List

      My only project in github is in C, but most of my private projects are in Haskell.

  • Xmetaskull

    Well, I see the graphs; however, they don’t mirror the reality in my life. Although I have roughly equal experience in C# and Java, for whatever reason (perhaps my geography), I still get about 3 recruiter calls for C# positions vs one for java. Same kinds of jobs, roughly equal pay, similar projects, duties, responsibilities, and level of skill required. Now, I’m not a mobile app developer. I think maybe because Android developers use java for mobile apps, that could explain why java isn’t flat; however, for writing server api and enterprise web apps, I think c# and java fight neck and neck. Actually, .NET usually wins the day in my part of the country (probably not true everywhere).

    Another thing I find weird about this study. Graph starts in 2008, right? The iphone came out, when… 2007, ipad not long after? Seven years later now, apps number in the millions (not to mention macbooks fly off the shelf), and objective C growth has been flat? I don’t quite get that.

    Also, CSS? This is not a software programming language. I mean, maybe, kinda, sorta, but no. I mean, it’s not like I go… hmmm, should I write this thing in java or CSS? It’s not even like I go… hmmm, should I write this in java or javascript? OK, Perhaps this study wasn’t meant as a comparison of similar tools, but things like CSS, javascript, HTML, these things are in projects, whether they be .NET or java or Ruby or Python, but I don’t hear many developers talking about a web application they wrote in CSS, or even javascript (although some do). They use them, for sure, of course, but, in the end, they say… I wrote it in java, I wrote it in python, I wrote it in c#….

  • Peter Drinnan

    Java is getting a lot of use in cloud app development, and specifically in API server development. This is a HUGE market. Tools such as Jenkins and Elastic Beanstalk make deploying Java apps to cloud servers a lot easier. I expect the trend to continue.

  • Richard Cook

    This is actually a pretty good exercise in big data and its discontents. The comments reveal the many difficulties with analyzing even a relatively well-behaved, strictly technical dataset when the data was not collected specifically for analysis. The discussion reveals that this foray into big data creates questions rather than answering them. Real “big data” sources (e.g. U.S. Census) are carefully and exhaustingly planned collections that strive for representativeness. They are extensively calibrated and tested and the resulting datasets are qualified in statistically rigorous ways. This is what gives the later analysis of data subsets some explanatory power.

    GitHub is not the world of code nor are GitHub repositories representative of the world of running code. Viewing the world through GitHub is great fun and an enjoyable exercise. Claims about the predominance or salience of one or another language are less meaningful assessments than troll-food for geeks.

  • Pingback: The RedMonk Programming Language Rankings: June 2014 – tecosystems
