Donnie Berkholz's Story of Data

Widespread correlations across programming-language rankings


IEEE Spectrum recently came out with a very interesting interactive tool for ranking programming languages. What makes it interesting is that it incorporates 12 different sources including data from code, jobs, conversation, and searches — and you can customize the weights assigned to each source.


But the first thing that occurred to me was, this is a fantastic opportunity to look at commonalities and communities across all of these sources. That could tell us about which places could provide unique insight into what technologies developers care about and use, and which provide mainly reinforcement of others.

Before I did anything else, however, I wanted to sanity-check the rankings. So I compared RedMonk’s January rankings against an equal weighting of IEEE's GitHub active repositories and StackOverflow questions. The two shouldn't match perfectly, since IEEE used only 2013 data while RedMonk uses all-time data, but the Pearson correlation coefficient for the top 20 languages is 0.97 (where 1 would be perfectly correlated).
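
As a rough illustration of that check, here is a minimal sketch of a Pearson correlation between an equally weighted blend of two sources and a reference score. The numbers and the variable names (`github_active`, `so_questions`, `reference_scores`) are made up for illustration; they are not the IEEE or RedMonk data, and this is not necessarily how the original comparison was computed.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)

# Made-up scores for five languages, in the same language order in every list.
github_active = [100.0, 90.0, 85.0, 60.0, 40.0]
so_questions = [95.0, 100.0, 70.0, 65.0, 30.0]
reference_scores = [98.0, 96.0, 80.0, 61.0, 33.0]

# Equal weighting of the two sources: a plain average per language.
blended = [(g + s) / 2 for g, s in zip(github_active, so_questions)]
r = pearson(blended, reference_scores)
print(f"Pearson r = {r:.2f}")
```

A value near 1 means the blended scores and the reference scores rise and fall together across languages.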

With that confidence in IEEE’s data (and some reinforcement of RedMonk’s rankings), I moved on to calculate correlations across every data source they provided, using the full 49 languages supplied by IEEE:

  • CareerBuilder
  • Dice
  • GitHub active projects
  • GitHub created projects
  • Google search (# of results)
  • Google trends (search volume)
  • Hacker News
  • IEEE Xplore (IEEE articles mentioning a language)
  • Reddit
  • StackOverflow questions
  • StackOverflow views
  • Topsy (Twitter search results)

Here’s a spreadsheet showing the numbers, where higher correlations are in red and very weak correlations are in blue:

The strongest correlation on the chart, interestingly, is the 0.92 found between Twitter conversation and Google trends. Apparently, people talking about programming languages in real-time chat tend to also search for what they’re talking about.

The other very strong correlations (above 0.85) are:

  • Google: trends and search. Nothing surprising here.
  • Job sites: Dice and CareerBuilder. Nothing surprising.
  • Reddit and Google trends. Discussion about current topics seems to correlate with interest in finding more information about those topics.
  • Twitter and Google search. The 0.88 here is slightly below the 0.92 between Twitter and Google trends. Most interesting about this pair is that it shows a connection between conversation and amount of content (# of results), rather than just people searching for what could be a small amount of material.
  • Reddit and Twitter. Similar communities seem to participate across a wide variety of online discussion forums.
  • GitHub created and StackOverflow questions. Because it’s a correlation of open-source usage and broader conversation among forward-leaning communities, this is the one we rely upon for the RedMonk language rankings.
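
The cutoffs used throughout this post (above 0.85, above 0.7, and 0.3–0.7) can be turned into a small labeling helper. This is only one reading of those cutoffs: the `band` function and its labels are hypothetical, and only the 0.92 and 0.88 values below are quoted from the post.

```python
def band(r):
    """Label a correlation using the cutoffs mentioned in the post.

    This sketch does not treat negative correlations specially;
    they simply fall into the "weak" band.
    """
    if r > 0.85:
        return "very strong"
    if r > 0.7:
        return "strong"
    if r >= 0.3:
        return "midrange"
    return "weak"

# 0.92 and 0.88 are quoted in the post; 0.50 and 0.10 are placeholders.
examples = {
    ("Topsy", "Google trends"): 0.92,
    ("Topsy", "Google search"): 0.88,
    ("source_x", "source_y"): 0.50,
    ("source_x", "source_z"): 0.10,
}

for pair, r in examples.items():
    print(pair, r, band(r))

very_strong = [pair for pair, r in examples.items() if band(r) == "very strong"]
```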

Midrange correlations: Hacker News and IEEE Xplore

In the middle (correlations between 0.3–0.7), I was surprised that Hacker News correlated rather weakly with all of the other sources. This implies a degree of independence for this community relative to the behavior of all global developers, and even the subset who participate on StackOverflow. It’s certainly some interesting data to support the saying that HN is for Bay Area developers (and their bleeding-edge “cousins” across the world).

IEEE Xplore, which is oriented around academic research, had similarly weak correlations with everything else (HN included). This supports a general disconnect between academia and both general trends (most other sources) as well as forward-leaning communities like HN.

Both of these seem to make sense based on my prior expectations, since both of these groups are rather unlike the rest.

StackOverflow viewers are the outliers

The weakest correlations were between StackOverflow views and almost everything else. It’s shocking how different the visitors to StackOverflow seem from every other data source. If we actually take a look at the top 20 languages based on StackOverflow views, it bears out the unusual nature that the poor correlations suggested:

  1. Arduino
  2. VHDL
  3. Visual Basic
  4. ASP.NET
  5. Verilog
  6. Shell
  7. HTML
  8. Delphi
  9. Objective-C
  10. SQL
  11. Cobol
  12. Apex Code
  13. ABAP
  14. CoffeeScript
  15. Go
  16. MATLAB
  17. Assembly
  18. C++
  19. C
  20. Scala

Three of the top 5 are hardware-oriented (Arduino, VHDL, Verilog), which suggests a strong audience of embedded developers. Outside of StackOverflow views, these languages are absent from the top 10 with only two exceptions: Arduino is #7 on Reddit and VHDL is #8 in IEEE Xplore. That paints a very clear contrast between this group and everyone else, and perhaps a unique source of data about trends in embedded development.

Enterprise stalwarts are also commonplace, such as Visual Basic, Cobol, Apex (Salesforce.com’s language), and ABAP (SAP’s language). Outside of StackOverflow views:

  • Visual Basic is only in the top 10 in Google
  • Cobol and Apex are only in the top 20 on career sites (in the high teens)
  • ABAP is only in the top 20 on career sites and Google search (in the high teens)

Again, StackOverflow views may be a unique source of information on an otherwise hard-to-find community.

Viewing correlations as a network graph reveals communities

However, this only lets us easily look at two-way correlations. If we want to see communities, it could be easier to examine this with a graph, with the connecting edges being the correlations between pairs of data sources. Here’s a visualization of that, only showing strong correlations (above 0.7), and with highly connected nodes shown in red while poorly connected nodes are increasingly blue.

Graph layout weighted by correlation across data sources, using a force-directed layout in Gephi. I used a 0.7 minimum threshold for the Pearson correlation coefficient.
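
The thresholding step behind such a graph can be sketched without any graph library: keep only the edges above the cutoff, then count each node's connections. The `build_graph` helper is hypothetical and this is not the Gephi workflow itself; only the 0.92 and 0.88 values are quoted from the post, and the 0.10 is a placeholder.

```python
def build_graph(correlations, sources, threshold=0.7):
    """Adjacency map keeping only edges whose correlation exceeds threshold."""
    graph = {source: {} for source in sources}
    for (a, b), r in correlations.items():
        if r > threshold:
            # Edges are undirected, so record them in both directions.
            graph[a][b] = r
            graph[b][a] = r
    return graph

sources = ["Topsy", "Google trends", "Google search", "StackOverflow views"]
correlations = {
    ("Topsy", "Google trends"): 0.92,       # quoted in the post
    ("Topsy", "Google search"): 0.88,       # quoted in the post
    ("StackOverflow views", "Topsy"): 0.10,  # placeholder value
}

graph = build_graph(correlations, sources)

# Degree: how many strong connections each source has.
degree = {source: len(neighbors) for source, neighbors in graph.items()}

# Sources with no edge above the threshold drop out of the picture entirely.
isolated = sorted(source for source, d in degree.items() if d == 0)
print(degree)
print(isolated)
```

In this toy input, the source with no correlation above 0.7 ends up with degree zero, which is the situation described for the sources missing from the graph.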

It’s instantly apparent that some data sources serve as centerpieces that can broadly represent a swathe of communities while others are weakly connected and could provide more unique insight. In particular, note that IEEE Xplore and SO views are missing altogether because they had no correlations above 0.7 to anything else.

The most central and strongly connected node, perhaps surprisingly, is Twitter. Google is close by, however, which supports the validity of the oft-maligned TIOBE rankings as a representation of many communities. However, it could be a better choice on their part to use Google trends rather than search results, based on the strength and number of connections shown above.

At the opposite extreme are the two sources that didn’t appear at all (StackOverflow views and IEEE Xplore); they would be nearly unrepresented unless explicitly added. In addition, largely disconnected sources such as GitHub active projects and Hacker News would be well worth considering to provide additional diversity. On this graph, they’re weakly connected (more blue) and less strongly correlated with their connections (thinner edges).

Conclusions

Based on that, I thought I’d calculate a new set of rankings that accounted for these connections. I decided to include Topsy (weight 100), StackOverflow views (weight 100), Hacker News (weight 50), and IEEE Xplore (weight 50) to represent the diversity across these communities. These communities are vastly different sizes, so this truly reflects source diversity rather than population-level interest. But it’s interesting to see interest scaled by community rather than by pure population:

  1. C
  2. C++
  3. Python
  4. Java
  5. SQL
  6. Arduino
  7. C#
  8. Go
  9. Visual Basic
  10. Ruby
  11. Assembly
  12. R
  13. Shell
  14. HTML
  15. MATLAB
  16. Objective-C
  17. PHP
  18. Scala
  19. Perl
  20. JavaScript
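
One way to picture that weighting is as a weighted mean of per-source scores. The sketch below assumes each source provides a score on a 0–100 scale and that sources combine as a weighted average; the IEEE tool's actual formula may differ, and the language names and scores are placeholders. Only the four weights come from the post.

```python
# Weights from the post; everything else here is illustrative.
WEIGHTS = {
    "Topsy": 100,
    "StackOverflow views": 100,
    "Hacker News": 50,
    "IEEE Xplore": 50,
}

def weighted_ranking(scores_by_language, weights):
    """Rank languages by the weighted mean of their per-source scores."""
    total_weight = sum(weights.values())
    combined = {
        language: sum(
            weights[source] * per_source.get(source, 0.0) for source in weights
        ) / total_weight
        for language, per_source in scores_by_language.items()
    }
    # Highest combined score first.
    return sorted(combined, key=combined.get, reverse=True)

# Placeholder scores on an assumed 0-100 scale.
scores_by_language = {
    "lang_x": {"Topsy": 90, "StackOverflow views": 20, "Hacker News": 80, "IEEE Xplore": 70},
    "lang_y": {"Topsy": 60, "StackOverflow views": 95, "Hacker News": 40, "IEEE Xplore": 30},
    "lang_z": {"Topsy": 30, "StackOverflow views": 30, "Hacker News": 95, "IEEE Xplore": 95},
}

ranking = weighted_ranking(scores_by_language, WEIGHTS)
print(ranking)
```

Here `lang_y` comes out ahead of `lang_x` because its strength is in a source with weight 100, while `lang_z` leads only in the two sources with weight 50.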

In comparison to the RedMonk top 20, the changes are about what you’d expect based on the earlier results. Languages more popular in niche communities tend to move up (e.g. Arduino, Go) because of how I weighted the outlier sources, while languages that aren’t popular across all those audience types (e.g. JavaScript, PHP) shifted downwards.

This work revealed a widespread network of communities spread across a wide variety of forums, including code, discussion, jobs, and searches. Some of the most interesting results were the exceptions from the norm — in particular, StackOverflow views could provide a unique window into embedded and enterprise audiences, while Hacker News and IEEE Xplore are other sources with quite disparate data relative to the majority of the group. Finally, the connection between real-time conversation on Twitter and existing content on Google was a newly interesting correlation between discussion and resources that actually exist, rather than purely discussion and interest.

Disclosure: SAP and Salesforce.com are clients. Microsoft has been a client.

by-sa

3 comments

  1. […] You can read the rest of the analysis on Donnie Berkholz’ “Story of Data” blog  here » […]

  2. Great post. The one thing I would say is that Hacker News is more “read only” than Twitter. I’d like to know the exact percentage of active vs. non-active users, but based on behavior of the developers I know, it seems like there are lots of people who read and not a lot of people who post or take part in conversations, which would throw off this data.

    You could argue that all of these sites run on user-generated data, and Hacker News is no different, but because Hacker News is so high profile lots of people don’t want to share their opinions. Also because everything is more or less anonymous, people are far less likely to be respectful, which prevents a large percentage of readers from participating in conversation.
