{"id":1898,"date":"2014-07-15T10:32:25","date_gmt":"2014-07-15T15:32:25","guid":{"rendered":"http:\/\/redmonk.com\/dberkholz\/?p=1898"},"modified":"2014-07-16T10:47:45","modified_gmt":"2014-07-16T15:47:45","slug":"widespread-correlations-across-programming-language-rankings","status":"publish","type":"post","link":"https:\/\/redmonk.com\/dberkholz\/2014\/07\/15\/widespread-correlations-across-programming-language-rankings\/","title":{"rendered":"Widespread correlations across programming-language rankings"},"content":{"rendered":"<p>IEEE Spectrum recently came out with a very interesting <a href=\"http:\/\/spectrum.ieee.org\/static\/interactive-the-top-programming-languages\/\">interactive tool<\/a> for ranking programming languages. What makes it interesting is that it incorporates 12 different sources including data\u00a0from\u00a0code, jobs, conversation, and searches \u2014 and you can customize the weights assigned to each source.<\/p>\n<p><a href=\"http:\/\/dberkholz-media.redmonk.com\/dberkholz\/files\/2014\/07\/ieee_spectrum.png\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"1900\" data-permalink=\"https:\/\/redmonk.com\/dberkholz\/2014\/07\/15\/widespread-correlations-across-programming-language-rankings\/ieee_spectrum\/\" data-orig-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2014\/07\/ieee_spectrum.png\" data-orig-size=\"634,513\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"ieee_spectrum\" data-image-description=\"\" data-medium-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2014\/07\/ieee_spectrum-300x242.png\" 
data-large-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2014\/07\/ieee_spectrum.png\" class=\"aligncenter size-medium wp-image-1900\" src=\"http:\/\/dberkholz-media.redmonk.com\/dberkholz\/files\/2014\/07\/ieee_spectrum-300x242.png\" alt=\"ieee_spectrum\" width=\"300\" height=\"242\" \/><\/a><\/p>\n<p>But the first thing that occurred to me was, <strong>this is a fantastic opportunity to look at commonalities and communities across\u00a0all of these sources.<\/strong> That could tell us about which places could provide unique insight into what technologies developers care about and use, and which provide mainly reinforcement of others.<\/p>\n<p>Before I did anything, however, <strong>I wanted to test the veracity of the rankings.<\/strong> So I compared <a href=\"http:\/\/redmonk.com\/sogrady\/2014\/01\/22\/language-rankings-1-14\/\">RedMonk&#8217;s January rankings<\/a> against an equal\u00a0weighting of GitHub active repositories and StackOverflow questions. While not perfectly correlated, since IEEE used only 2013 and RedMonk\u00a0uses all-time, <strong>the Pearson correlation coefficient for\u00a0the top 20 languages is 0.97<\/strong> (where 1 would be entirely\u00a0correlated).<\/p>\n<p>Having confidence in their data and reinforcing RedMonk&#8217;s rankings, I moved on to calculate, using the full 49 languages supplied by IEEE, correlations across every data source they provided:<\/p>\n<ul>\n<li>CareerBuilder<\/li>\n<li>Dice<\/li>\n<li>GitHub active projects<\/li>\n<li>GitHub created projects<\/li>\n<li>Google search (# of results)<\/li>\n<li>Google trends (search volume)<\/li>\n<li>Hacker News<\/li>\n<li>IEEE Xplore (IEEE articles mentioning a language)<\/li>\n<li>Reddit<\/li>\n<li>StackOverflow questions<\/li>\n<li>StackOverflow views<\/li>\n<li>Topsy (Twitter search results)<\/li>\n<\/ul>\n<p>Here&#8217;s a spreadsheet showing the numbers, where higher correlations are in red and very weak correlations are in blue:<\/p>\n<p><iframe loading=\"lazy\" 
src=\"https:\/\/docs.google.com\/spreadsheets\/d\/1hJZ9MwzumLt9tsWiXvxo0c3oBnnxEfuysjrpkpbQndI\/pubhtml?widget=true&amp;headers=false\" width=\"550\" height=\"350\"><\/iframe><\/p>\n<p><strong>The strongest correlation on the chart, interestingly, is the 0.92 found between Twitter conversation and Google trends.<\/strong>\u00a0Apparently,\u00a0people talking about programming languages in real-time chat tend to also search for what they&#8217;re talking about.<\/p>\n<p>The other very strong correlations (above 0.85) are:<\/p>\n<ul>\n<li><strong>Google: trends and search.<\/strong> Nothing surprising here.<\/li>\n<li><strong>Job sites: Dice and CareerBuilder.<\/strong> Nothing surprising.<\/li>\n<li><strong>Reddit and Google trends.<\/strong>\u00a0Discussion about current topics seems to correlate with interest in finding more information about those topics.<\/li>\n<li><strong>Twitter and Google search.<\/strong>\u00a0The 0.88 here is slightly below the 0.92 between Twitter and Google trends. 
<strong>Most interesting about this pair is that it shows a connection between conversation and amount of content<\/strong> (# of results), rather than just people searching for what could be a small amount of material.<\/li>\n<li><strong>Reddit and Twitter.<\/strong>\u00a0Similar communities seem to participate across a wide variety of online discussion forums.<\/li>\n<li><strong>GitHub created and StackOverflow questions.<\/strong>\u00a0Because it&#8217;s a correlation of open-source usage and broader conversation among forward-leaning communities, this is the one we rely upon for the RedMonk language rankings.<\/li>\n<\/ul>\n<h2>Midrange correlations: Hacker News and IEEE Xplore<\/h2>\n<p>In the middle (correlations between 0.3 and 0.7), I was surprised that <strong>Hacker News correlated rather weakly with all of the\u00a0other sources.<\/strong> This implies a degree of independence for this community relative to the behavior of all global developers, and even the subset\u00a0who participate on StackOverflow. It&#8217;s certainly some interesting data to support the saying that HN is for Bay Area developers (and their bleeding-edge &#8220;cousins&#8221; across the world).<\/p>\n<p>IEEE Xplore, which is oriented around academic research, had similarly weak correlations with everything else (HN included). 
<strong>This supports a general disconnect between academia and both general trends (most other sources) and forward-leaning communities like HN.<\/strong><\/p>\n<p>Both of these seem to make sense based on my prior expectations, since both of these groups are rather unlike the rest.<\/p>\n<h2>StackOverflow viewers are the outliers<\/h2>\n<p><strong>The weakest correlations were between StackOverflow views and <span style=\"text-decoration: underline;\">almost everything else<\/span>.<\/strong> It&#8217;s\u00a0shocking how different the visitors to StackOverflow seem from every other data source.\u00a0If we actually take a look at the top 20 languages based on\u00a0StackOverflow views, it bears out the unusual nature that the poor correlations suggested:<\/p>\n<ol>\n<li>Arduino<\/li>\n<li>VHDL<\/li>\n<li>Visual Basic<\/li>\n<li>ASP.NET<\/li>\n<li>Verilog<\/li>\n<li>Shell<\/li>\n<li>HTML<\/li>\n<li>Delphi<\/li>\n<li>Objective-C<\/li>\n<li>SQL<\/li>\n<li>Cobol<\/li>\n<li>Apex Code<\/li>\n<li>ABAP<\/li>\n<li>CoffeeScript<\/li>\n<li>Go<\/li>\n<li>MATLAB<\/li>\n<li>Assembly<\/li>\n<li>C++<\/li>\n<li>C<\/li>\n<li>Scala<\/li>\n<\/ol>\n<p><strong>Three of the top 5 are hardware-oriented\u00a0(Arduino, VHDL, Verilog), suggesting a strong audience of embedded developers.\u00a0<\/strong>Outside of StackOverflow\u00a0views, these languages are absent\u00a0from the top 10, with only two exceptions: Arduino is\u00a0#7 on Reddit and VHDL is #8 in IEEE Xplore. 
That paints a very clear contrast between this group and everyone else, and perhaps points to a unique source of data about trends in embedded development.<\/p>\n<p><strong>Enterprise stalwarts are also commonplace, such as Visual Basic, Cobol, Apex (Salesforce.com&#8217;s language), and ABAP (SAP&#8217;s language).<\/strong> Outside of StackOverflow views:<\/p>\n<ul>\n<li>Visual Basic is only\u00a0in the top 10 in Google<\/li>\n<li>Cobol and Apex are\u00a0only in the top 20 on career sites (in the high teens)<\/li>\n<li>ABAP\u00a0is only in the top 20 on career sites and Google search (in the high teens)<\/li>\n<\/ul>\n<p>Again, StackOverflow views may be a unique source of information on an otherwise hard-to-find community.<\/p>\n<h2>Viewing correlations as a network graph reveals communities<\/h2>\n<p>However, the spreadsheet only lets us easily look at two-way correlations. If we want to see communities, it could be easier to examine\u00a0the data as\u00a0a graph, with the connecting edges being the correlations between pairs of data sources. 
Here&#8217;s a visualization of that, only showing strong correlations (above 0.7), and with highly\u00a0connected nodes shown in red while poorly connected nodes are increasingly\u00a0blue.<\/p>\n<figure id=\"attachment_1899\" aria-describedby=\"caption-attachment-1899\" style=\"width: 397px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/dberkholz-media.redmonk.com\/dberkholz\/files\/2014\/07\/cc_black.png\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"1899\" data-permalink=\"https:\/\/redmonk.com\/dberkholz\/2014\/07\/15\/widespread-correlations-across-programming-language-rankings\/cc_black\/\" data-orig-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2014\/07\/cc_black.png\" data-orig-size=\"1140,1127\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"cc_black\" data-image-description=\"\" data-medium-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2014\/07\/cc_black-300x296.png\" data-large-file=\"https:\/\/redmonk.com\/dberkholz\/files\/2014\/07\/cc_black-1024x1012.png\" class=\"wp-image-1899\" src=\"http:\/\/dberkholz-media.redmonk.com\/dberkholz\/files\/2014\/07\/cc_black-1024x1012.png\" alt=\"cc_black\" width=\"397\" height=\"392\" \/><\/a><figcaption id=\"caption-attachment-1899\" class=\"wp-caption-text\">Graph layout weighted by correlation across data sources, using a force-directed layout in Gephi. 
I used a 0.7 minimum threshold for the\u00a0Pearson correlation coefficient.<\/figcaption><\/figure>\n<p><strong>It&#8217;s instantly apparent that some data sources serve as centerpieces that can broadly represent a swathe of communities while others are weakly connected and could provide more unique insight.<\/strong> In particular, note that\u00a0IEEE Xplore and SO views are missing altogether because they had no correlations above 0.7 to anything else.<\/p>\n<p><strong>The most central and strongly connected node, perhaps surprisingly, is Twitter.<\/strong> Google is close by, however, which supports the validity of the oft-maligned <a href=\"http:\/\/www.tiobe.com\/index.php\/content\/paperinfo\/tpci\/index.html\">TIOBE rankings<\/a>\u00a0as a representation of many\u00a0communities. However, it could be a better choice on their part to use Google trends over search results, based on the strength and number of connections shown above.<\/p>\n<p>On the opposite side, being nearly unrepresented without explicitly adding them in, are the two that didn&#8217;t appear (StackOverflow views and IEEE Xplore). In\u00a0addition, largely disconnected sources would be well worth considering to provide additional diversity. On this graph, sources like GitHub active projects and Hacker News are weakly connected (more blue) and less strongly correlated with their connections (thinner edges).<\/p>\n<h2>Conclusions<\/h2>\n<p>Based on that, I thought I&#8217;d calculate a new set of rankings that accounted for these connections. I decided to include Topsy (weight 100), StackOverflow views (weight 100), Hacker News (weight 50), and IEEE Xplore (weight 50) to represent the diversity across these communities. 
<strong>These\u00a0communities are of vastly different sizes, so this truly reflects source diversity rather than population-level interest.<\/strong> \u00a0But it&#8217;s interesting to see interest scaled by community rather than by pure population:<\/p>\n<ol>\n<li>C<\/li>\n<li>C++<\/li>\n<li>Python<\/li>\n<li>Java<\/li>\n<li>SQL<\/li>\n<li>Arduino<\/li>\n<li>C#<\/li>\n<li>Go<\/li>\n<li>Visual Basic<\/li>\n<li>Ruby<\/li>\n<li>Assembly<\/li>\n<li>R<\/li>\n<li>Shell<\/li>\n<li>HTML<\/li>\n<li>MATLAB<\/li>\n<li>Objective-C<\/li>\n<li>PHP<\/li>\n<li>Scala<\/li>\n<li>Perl<\/li>\n<li>JavaScript<\/li>\n<\/ol>\n<p>In comparison to the\u00a0<a href=\"http:\/\/redmonk.com\/sogrady\/2014\/01\/22\/language-rankings-1-14\/\">RedMonk top 20<\/a>, the changes are about what you&#8217;d expect based on the earlier results. Languages more popular in niche communities tend to move up (e.g. Arduino, Go) because of how I\u00a0weighted the outlier sources, while languages that aren&#8217;t popular across all those audience types (e.g. JavaScript, PHP) shifted downwards.<\/p>\n<p><strong>This work revealed a widespread network of communities spread across a wide variety of forums, including code, discussion, jobs, and searches.<\/strong> Some of the most interesting results were the exceptions to the norm. In particular, StackOverflow views could provide a unique window into embedded and enterprise audiences, while Hacker News and IEEE Xplore are other sources\u00a0with quite disparate data relative to the majority of the group. Finally, the connection between real-time\u00a0conversation on Twitter and existing content on Google was a newly interesting correlation between discussion and resources that actually exist, rather than purely discussion and interest.<\/p>\n<p><span style=\"color: #999999;\"><em><strong>Disclosure<\/strong>: SAP and Salesforce.com\u00a0are clients. 
Microsoft has been a client.<\/em><\/span><\/p>\n<div class=\"acc_license\"><a href=\"http:\/\/creativecommons.org\/licenses\/by-sa\/3.0\/\"><img decoding=\"async\" src=\"http:\/\/i.creativecommons.org\/l\/by-sa\/3.0\/88x31.png\" alt=\"by-sa\" \/><\/a><\/div><!--<rdf:RDF xmlns=\"http:\/\/creativecommons.org\/ns#\" xmlns:dc=\"http:\/\/purl.org\/dc\/elements\/1.1\/\" xmlns:rdf=\"http:\/\/www.w3.org\/1999\/02\/22-rdf-syntax-ns#\"><Work rdf:about=\"\"><license rdf:resource=\"http:\/\/creativecommons.org\/licenses\/by-sa\/3.0\/\" \/><\/Work><License rdf:about=\"http:\/\/creativecommons.org\/licenses\/by-sa\/3.0\/\"><requires rdf:resource=\"http:\/\/creativecommons.org\/ns#Attribution\" \/><permits rdf:resource=\"http:\/\/creativecommons.org\/ns#Reproduction\" \/><permits rdf:resource=\"http:\/\/creativecommons.org\/ns#Distribution\" \/><permits rdf:resource=\"http:\/\/creativecommons.org\/ns#DerivativeWorks\" \/><requires rdf:resource=\"http:\/\/creativecommons.org\/ns#ShareAlike\" \/><requires rdf:resource=\"http:\/\/creativecommons.org\/ns#Notice\" \/><\/License><\/rdf:RDF>-->","protected":false},"excerpt":{"rendered":"<p>IEEE Spectrum recently came out with a very interesting interactive tool for ranking programming languages. What makes it interesting is that it incorporates 12 different sources including data\u00a0from\u00a0code, jobs, conversation, and searches \u2014 and you can customize the weights assigned to each source. 
But the first thing that occurred to me was, this is a<\/p>\n","protected":false},"author":6,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","footnotes":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false},"categories":[3,18,30],"tags":[],"class_list":["post-1898","post","type-post","status-publish","format-standard","hentry","category-adoption","category-community","category-programming-languages"],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p23Tsn-uC","_links":{"self":[{"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/posts\/1898","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/comments?post=1898"}],"version-history":[{"count":0,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/posts\/1898\/revisions"}],"wp:attachment":[{"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/media?parent=1898"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/categories?post=1898"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/redmonk.com\/dberkholz\/wp-json\/wp\/v2\/tags?post=1898"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}