
GitHub language trends and the fragmenting landscape

A while ago, I wanted to get a little quick feedback on some data I was playing with, but the day was almost over and I wasn’t done working on it yet. I decided to tweet my rough draft of a graph of GitHub language trends anyway, followed later by a slight improvement.


Trends over time, smoothed to make it a little easier to follow
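The post doesn't say how the smoothing was done; as a minimal sketch (my own helper, not the actual method behind the graph), a trailing moving average gets the same effect:

```python
def moving_average(values, window=3):
    """Smooth a series with a trailing moving average.

    Early points use however many values are available, so the output
    has the same length as the input.
    """
    out = []
    for i, _ in enumerate(values):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```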

Much to my surprise, that graph was retweeted more than 2,000 times and reached well over 1 million people. My colleagues have both examined this data since I posted the graph — James took a stab at pulling out a few key points, particularly GitHub’s start around Rails and its growth into the mainstream, and Steve’s also taken a look at visualizing this data differently.

As exciting as that reach was, the best part was the questions I got; all the conversations gave me a chance to decide which points would be most interesting to readers of this post. The initial plot was a spaghetti graph, so I cleaned it up and decided to do a more in-depth analysis.


Before we can get into useful results and interpretation, there are a few artifacts and potential pitfalls to be aware of:

  • GitHub is a specific community that’s grown very quickly since it launched [writeup]. It was not initially reflective of open source as a whole but rather centered around the Ruby on Rails community;
  • In 2009, the GitPAN project imported all of CPAN (Perl’s module ecosystem) into GitHub, which explains the one-time peak;
  • Language detection is based on lines of code, so a repository with a large amount of JavaScript template libraries (e.g. jQuery) copied into it will be detected as JavaScript rather than the language where most of the work is being done; and
  • I’m showing percentages, not absolute values. A downward slope does not mean fewer repositories are being created. It does mean, however, that other languages are gaining repositories faster.

The big reveal

The first set of graphs shows new, non-fork repositories created on GitHub by primary language and year. This dataset includes all languages that were in the top 10 during any of the years 2008–2013, but languages used for text-editor configuration were ignored (VimL and Emacs Lisp). I’m showing them as a grid of equally scaled graphs to make comparisons easier across any set of languages, and I’m using percentages to indicate relative share of GitHub.

Data comes from date- and language-restricted searches using the GitHub search API.

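As a sketch of that methodology (the helper names are mine, not from the post), one can build the date- and language-restricted search qualifier and convert the returned counts into percentage shares:

```python
def build_query(language, year):
    """GitHub search qualifier for new, non-fork repos in one language and year."""
    return (f"language:{language} "
            f"created:{year}-01-01..{year}-12-31 fork:false")

def shares(counts):
    """Convert {language: repo_count} into {language: percent of total}."""
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.items()}

# The count itself comes from the total_count field of, e.g.,
#   GET https://api.github.com/search/repositories?q=<query>&per_page=1
```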

  • GitHub hits the mainstream: James quickly nailed the key point: GitHub has gone mainstream over the past 5 years. This is best shown by the decline of Ruby as GitHub reached beyond the Rails community and the simultaneous growth of a broad set of older and newer languages including Java, PHP, and Python as GitHub reached a broader developer base. The apparent rise and fall of languages like PHP, Python, and C could indicate that these communities migrated to GitHub earlier than others, which would result in an initially larger share that shrank as more developers of e.g. Java, C++, C#, Obj-C, and Shell joined.
  • The rise of JavaScript: Another trend that instantly stands out is the growth of JavaScript. Although it’s tempting to attribute that to the rise of Node.js [2010 writeup], the reality is more ambiguous. Node certainly accounts for a portion of the increase, but equally important are (1) the popularity of frameworks that generate large quantities of JavaScript code for new projects and (2) the JavaScript development philosophy that encourages bundling dependencies in the same repo as the primary codebase. Both of these add large amounts of essentially unmodified JavaScript to webapp repositories, which increases the likelihood that repositories, especially those for small projects in other languages, get misclassified as JavaScript.
  • Windows and iOS development nearly invisible: Both C# and Objective-C are, unsurprisingly, almost invisible, because both ecosystems either don’t encourage or actively discourage open-source code. Of the languages in this chart, these two are the least likely to reflect either current usage outside GitHub or future usage, again due to the open-source imbalance in those communities.

What about pushes rather than creation?

What’s really interesting is that if you do the same query by when the last push of code to the repo occurred rather than its creation, the graphs look nearly identical (not shown). The average number of pushes to repositories is independent of both time and language but is correlated with when repositories were created. In only two cases do the percentages of created and pushed repos differ by more than 2 points: Perl in 2009 (+4.1% pushed) and Ruby in 2008 (–3.5% pushed), both of which are likely artifacts due to the caveats described earlier.

This result is particularly striking because there’s no difference over time despite a broader audience joining GitHub, and there’s also no difference across all of these language communities. The vast majority of repositories (>98%) are modified only in the year they are created, and they’re never touched again. This is consistent with my previous research exploring the size of open-source projects, where we saw that 87% of repositories have ≤5 contributors.

Are GitHub issues a better measure of interest?

One potential problem with looking at repositories is that it’s not a reflection of usage and only a fairly indirect measurement of interest in a given codebase. It instead measures developers creating new code — to get a closer look at usage, some possibilities are forks, stars, or issues. GitHub’s search API makes it most convenient to focus on issues, so that’s what I measured for this post. My expectation going in was that issues would be much more biased by extremely popular projects with large numbers of users, but let’s take a look:

Issues filed within repositories with that primary language.


This gave me a fairly similar set of graphs to the new-repository data. It’s critical to note that although these are new issues, they’re filed against both new and preexisting repos, so the trends are not directly comparable in that sense. Rather, they’re comparable as different measurements of developer interest in a given language during the same timeframe. The early peaks in Ruby, Python, and C++ are all due to particularly popular projects that dominated GitHub in its earlier days, when it was a far smaller collection of projects. With that in mind, let’s look through the real trends.

  • Nearly all of these trends are consistent with new repos. With the clear exception of Ruby and less obvious example of JavaScript, the trends above are largely consistent with those in the previous set of graphs. I’ll focus mainly on the exceptions in my other points.
  • JavaScript’s increase appears asymptotic rather than linear. In other words, it continues to increase but it’s decelerating, and it appears to be moving toward a static share around 25% of new issues. This may be the case with new repos as well, but it’s less obvious there than here.
  • Ruby’s seen a steep decline since 2009. It peaked early on with Rails-related projects, but as GitHub grew mainstream, Ruby’s share of issues dropped back down. But again, this trend seems to be gradually flattening out around 10% of total issues.
  • Java and PHP have both grown and stabilized. In both cases, they’ve reached around 10% of issue share and remained largely steady since then, although Java may continue to see slow growth here.
  • Python’s issue count has consistently shrunk since 2009. Since dropping to 15% after an initial spike in 2008, it’s slowly come down to just above 10%. Given the past trend, which may be flattening out, it’s unclear whether it will continue to shrink.

The developer-centric (rather than code-centric) perspective

What if we take a different tack and focus on the primary language of new users joining GitHub? This creates a wildly different set of trends that reflects individual users, rather than being weighted toward the most active users who create lots of repositories and issues.

Users joining in a certain year with a majority of their repositories in that language.

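The exact classification rule isn’t documented in the post; a plausible sketch (my assumption, not GitHub’s actual logic) is a strict-majority rule over a user’s repositories:

```python
from collections import Counter

def primary_language(repo_languages):
    """Primary language if a strict majority of a user's repos share one
    language; None leaves the user unclassified (e.g. repo-less lurkers)."""
    if not repo_languages:
        return None
    lang, count = Counter(repo_languages).most_common(1)[0]
    return lang if count * 2 > len(repo_languages) else None
```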

The points I find most interesting about these graphs are:

  • There are no clearly artifactual spikes. All of the trends here are fairly smooth, very much unlike both the repos and issues. This is encouraging because it suggests the results here are more likely to be reliable than spurious.
  • Language rank remains quite similar to the other two datasets. Every dataset is ordered by the number of new repos created in each language in 2013, to make comparisons simpler across datasets. If you look at activity in 2013 for issues and users, you can see that their values are generally ranked in the correct order, with a few minor exceptions. One is that Java and Ruby should clearly be reversed, but that’s about all that’s obviously out of order.
  • Almost every language shows a long-term downhill trend. With the exception of Java and (recently) CSS, all of these languages have been decreasing. This was a bit of a puzzler and made me wonder more about the fragmentation of languages over time, which I’ll explore later in this post as well as future posts. My initial guess is that users of languages below the top 12 are growing in share to counterbalance the decreases here. It’s also possible that GitHub may leave some users unclassified, which would tend to lower everything else’s proportion over time.
  • I’m therefore not going to focus on linear decreases. I will, however, examine nonlinear decreases, or anything that’s otherwise an exception such as increases.
  • Ruby’s downward slide shows an interesting sort of exponential decay. This is actually “slower” than a linear decrease as it curves upwards, so it indicates that relative to everything else moving linearly downward, Ruby held onto its share better.
  • Java was the only top language that showed long-term increases during this time. Violating all expectations and trends, new Java users on GitHub even grew as a percentage of overall new users, while everything else went downhill. This further supports the assertion that GitHub is reaching the enterprise.

A consensus approach accounts for outliers

When I aggregated all three datasets together to look at how trends correlated across them, everything became much clearer:

New repositories, users, and issues in a given language according to the GitHub search API.


Artifacts become obvious as spikes in only one of the three datasets, as happens for a number of languages in the 2009–2010 time frame. It’s increasingly obvious that only 5 languages have historically mattered on GitHub on the basis of overall share: JavaScript, Ruby, Java, PHP, and Python. New contender CSS is on the way up, while C and C++ hold honorable mentions. Everything else is, on a volume basis, irrelevant today, even if it’s showing fantastic growth like Go and will likely be relevant in these rankings within the next year or two.

The fragmenting landscape

In looking at the decline in the past couple of years among many of the top languages, I started wondering whether it was nearly all going to JavaScript and Java or whether there might be more hidden in there. After all, there’s a whole lot more than 12 languages on GitHub. So I next looked at total repository creation and subtracted only the languages shown above, to look at the long tail.


Totals after subtracting the top 12 languages.

Although you can see an initial rush by the small but diverse community of early adopters creating lots of repositories in less-popular languages, it dropped off dramatically as GitHub exploded in popularity. Then the trend begins a more gradual increase as a wide variety of smaller language communities migrate onto GitHub. New issues show a similar but slower increase starting in 2009, when GitHub added issues. While new users increase the fastest, that likely reflects a combination of users in less-popular languages and “lurker” users with no repositories at all, and therefore no primary language.
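The long-tail calculation above is just total activity minus the top-12 languages; as a minimal sketch (hypothetical names, illustrative numbers):

```python
def long_tail_percent(total_repos, top_language_counts):
    """Share of new repositories whose primary language is outside the top list."""
    top = sum(top_language_counts.values())
    return 100.0 * (total_repos - top) / total_repos

# e.g. if 1,000 repos were created and the top languages account for 500,
# the long tail is 50% of new repositories.
```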

The programming landscape today continues to fragment, and this GitHub data supports that trend over time as well as an increasing overlap with the mainstream, not only early adopters.

Update (2014/05/05): Here’s raw data from yesterday in Google Docs. 

Update (2014/05/08): Simplify graphs as per advice from Jennifer Bryan.

Disclosure: GitHub has been a client.


Categories: community, distributed-development, open-source, programming-languages.

Upcoming speaking engagements, come find me!

I’ve got a number of speaking engagements in the next couple of months, so if you want to hear what I’ve got to say or just grab a beer, find me here:

  • The Minneapolis DevOps meetup tomorrow, on the cloud vs DevOps deathmatch
  • DevNation in San Francisco next week, alongside the Red Hat Summit: “Linux distros failed us — now what? Coping with cloud, DevOps, and more”
  • The Open Business Conference in San Fran on May 5: “Open DevOps”
  • WANdisco’s Subversion & Git Live in NYC on May 6: “Open source in the enterprise”
  • GlueCon in mid-May in Denver: “The parallel universes of DevOps and cloud developers”
  • Grand opening of CenturyLink’s newest datacenter in early June in Shakopee MN (near Minneapolis) on modern IT infrastructure

I’m also going to spend some time in Atlanta at the OpenStack Summit, in Portland at Monitorama, and in San Antonio and Austin. Let me know if you’ll be in any of those places and want to meet up!

Update (4/29/14): Added specifics about CenturyLink’s DC opening.

Disclosure: Red Hat, WANdisco, and CenturyLink are clients.


Categories: analyst, cloud, devops, linux, red-hat.

IBM’s piecemeal march toward open source

I attended IBM’s Pulse conference earlier this month, which has traditionally been focused around the former Tivoli division. It’s a strange mix of software for sysadmins and software that can do things like manage city water supplies (the GreenMonk, Tom, is all about the latter). Last April, IBM renamed it Cloud and Smarter Infrastructure — even the short form, C&SI, doesn’t exactly flow off the tongue.

While I don’t write about every conference as a rule, this one was interesting for a variety of reasons. One is the increasing focus on cloud, to the point that they called Pulse “The Premier Cloud Conference” this year, including a gigantic banner across the keynote stage. Somehow I doubt that went over terribly well with today’s customer and user base of traditional IT and asset-management people, as my colleague James also noted while at the show. IBM has a very interesting and solid developer story here, but much like VMware did with its developer story in the pre-Pivotal days at a VMworld full of IT admins, IBM’s presenting to the wrong audience. I found the story they told, including live coding on a stage at an IBM business conference, really compelling for developers. The problem is that, at this point, it’s an aspirational message about what their audience and conference could become rather than one that fits the conference as it stands today.

The second reason Pulse was interesting is what IBM’s actually doing with the cloud now. Since the days of SmartCloud, which was widely viewed as a subpar cloud offering, IBM’s gained a lot more credibility through involvement with OpenStack, its purchase of SoftLayer, and significant investment in the Cloud Foundry PaaS as well.

IBM’s always been about selling business value, not technologies, so its move up the stack from IaaS to PaaS is a natural extension of that focus. This shift is consistent with other moves IBM has recently made, such as progressively selling off more and more of its commodity hardware businesses — first ThinkPad and more recently xSeries. IBM sees much greater success when it sells business results, not commodities, and PaaS is a better fit than IaaS for that reason.

Cloud Foundry has gained a great deal of traction in the past couple of years, so IBM’s choice of it as a platform makes a great deal of sense. As that happened, we’ve seen Cloud Foundry gain broader governance than its previous leadership solely by Pivotal and earlier by VMware, again in line with what I’d expect from IBM. Choosing an open-source PaaS is in-line with IBM’s selection of OpenStack and, more than a decade ago, Linux, so I see this move as a good one that’s clearly been thought through and that is consistent with IBM’s philosophy and behavior.

The bigger picture

I think we’ll continue to see IBM’s software divisions reorient around the “IBM as a Service” concept that it pushed at Pulse. In the context of BlueMix, that will mean an ongoing addition of buildpacks based on its existing software portfolio as it’s able to repackage them as services.

I further see this as a revival of IBM’s commitment to key open-source projects as far back as Linux and Eclipse nearly 15 years ago. The extent to which IBM has bought in to OpenStack and Cloud Foundry, and to a lesser extent Chef, is significant — it’s not a matter of a fly-by-night patch series; it’s a major effort headlined as critical to IBM’s future to its own customer base. There’s an excellent opportunity for IBM to continue extending this commitment to open source, as we’ve seen to an extent with Hadoop, Worklight (via PhoneGap), and across other areas like social, WebSphere, and the rest of information management. If it cares about developer traction, and it appears to, open source is a clear route to success in the form of an olive branch.

Disclosure: IBM, VMware, Pivotal, the Eclipse Foundation, and Chef are clients, as are numerous OpenStack and Hadoop vendors. The OpenStack Foundation is not.


Categories: big-data, cloud, devops, ibm, open-source, Uncategorized.

Go: the emerging language of cloud infrastructure

Over the past year in particular, an increasing number of open-source projects in Go have emerged or gained significant adoption. Some of them:

  • Docker
  • Packer
  • Serf
  • InfluxDB
  • Cloud Foundry’s gorouter and CLI
  • CoreOS’s etcd and fleet
  • Vitess, YouTube’s tooling for MySQL scaling
  • Canonical’s Juju (rewritten in Go)
  • Mozilla’s Heka
  • A Go interface to OpenStack Swift
  • Heroku’s and hk CLIs
  • Apcera’s NATS and gnatsd

Although this seemingly shows the importance of Go, my background as a scientist makes me hate being influenced by random anecdotes. That raised the question of whether this was a real trend or just observation bias. To answer it, I went to Ohloh’s huge dataset of more than 600,000 free and open-source software (FOSS) projects. Below, I plotted a number of different ways to look at Go adoption over time:


Data from Ohloh tracks new code added during each month. This plot is from a dataset as of 20131216.

As you can see, Go’s rapidly closing in on 1% of total commits and half a percent of projects and contributors. While the trend is obviously interesting, at first glance numbers well under one percent look inconsequential relative to overall adoption. To provide some context, however, each of the most popular languages on Ohloh (C, C++, Java, JavaScript) only constitutes ~10% of commits and ~5% of projects and contributors. That means Go, a seemingly very minor player, is already used nearly one tenth as much in FOSS as the most popular languages in existence.
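To make the “one tenth as much” comparison concrete, here’s the share arithmetic as a sketch (the commit counts are illustrative, not actual Ohloh figures):

```python
def percent_share(part, whole):
    """Percentage of a total, as used for commits, projects, or contributors."""
    return 100.0 * part / whole

# Illustrative numbers in the spirit of the Ohloh data:
go_share = percent_share(9_000, 1_000_000)        # ~0.9% of all commits
top_share = percent_share(100_000, 1_000_000)     # ~10% for a top language
relative_adoption = go_share / top_share          # ~0.09, i.e. roughly 1/10
```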

One of the aspects I found most interesting about the marquee projects I mentioned earlier is how many of them are cloud-centric or otherwise made for dealing with distributed systems or transient environments. Go’s big selling point is concurrency, according to one of its designers, Rob Pike (the same one who coauthored the famous “The Unix Programming Environment”). That makes it particularly gratifying that people writing projects in Go seem to see it the same way.

Cloud infrastructure is famously complex and requires a great deal of effort to build a truly reliable, scale-out architecture because everything needs redundancy and coordination at the software level rather than the hardware level. Thus tools like Netflix’s Simian Army, one component of their increasingly full-featured platform that’s still waiting to be packaged up, have emerged to provide acid tests of cloud software. On the other side, an underappreciated aspect of PaaS (Platform as a Service) is its improvements not just to developer productivity but also to operator complexity by providing similar benefits at a higher level as Go does at the code level. There’s a lot of value to a packaged solution that handles the complexity of concurrency, coordination, and reliability in a gray-box fashion that enables transparency without requiring manual composition of an entire infrastructure.

Tooling that can ease the complexity for both new entrants and existing users of the cloud will continue to gain prominence at all levels of the stack, whether it’s languages like Go, middle ground like the Simian Army, or higher-level options like PaaS.

Update (2014/03/19): Added Heroku, Apcera

Disclosures: Pivotal, Black Duck, Heroku, and a number of OpenStack vendors are clients. Canonical has been a client. Docker, Hashicorp, InfluxDB, CoreOS, Google, Mozilla, Apcera, and the OpenStack Foundation are not.


Categories: adoption, cloud, devops, go, open-source, packaging, programming-languages.

At Strata, “hardcore” data science is pretty fluffy

Last week, I attended O’Reilly’s data-centric conference, Strata. It’s my fourth Strata and my third on the west coast, so I’m starting to get a pretty good feel for the show’s evolution over time and some of the contrast across coasts as well.

I keep going into the show with high expectations for the “Hardcore Data Science” track (“Deep Data” in 2012 [writeup]), which is framed essentially as continuing education for professional data scientists. Unfortunately, both years I’ve attended, it fell tragically short of that goal of educating data scientists. In 2012, I sat through the whole day and heard 2–3 talks where I learned something new, but this time I was so disappointed that I left around 11am in favor of the data-driven business track. In talking to other attendees, general reception was that the Google talk on deep learning was great, but the rest of the day was a disappointment in terms of technical content and learning practical and usable techniques.

I must admit I’m deeply surprised that O’Reilly didn’t get negative feedback last time around that it should’ve applied to this year’s “hardcore” program, as I consider the company among the top couple of professional conference organizers around, across the widest set of topics.

One of the challenges with Strata is catering to a diverse set of audiences, and O’Reilly’s done an excellent job with the “I want to learn Hadoop / new Big Data tech X” crowd. More recently, they’ve also done very well reaching out to the business-level audience trying to learn about the value of data. However, it seems like the technical core of the conference is gradually being left in the lurch in terms of their opportunity to learn, although there’s always the hallway track and whatever marketing value they get out of their talks and tutorials.

I would suggest that the intensive data-science track at future Strata conferences be made much more technical, that the talks become sufficiently practical that they’re handing out links to GitHub or other real-world, low-level implementations at the end of the talk, and that this shift in topic and intended audience be very clearly communicated to the speakers. Other than that slip-up, good show — I heard good things about the tutorials, the business day, and talks for the mid-level audience throughout the rest of the conference.


Categories: big-data, data-science.

Red Hat’s CentOS “acquisition” good for both sides, but ‘ware the Jabberwock

Red Hat and CentOS announced earlier this week (in the respective links) they are “joining forces” — whatever that means. Let’s explore the announcements and implications to get a better understanding of what’s happening, why, and what it means for the future of RHEL, Fedora, and CentOS.

LWN made some excellent points in its writeup (emphasis and links mine):

The ownership of the CentOS trademarks, along with the requirement that the board have a majority of Red Hat employees makes it clear that, for all the talk of partnership and joining forces, this is really an acquisition by Red Hat. The CentOS project will live on, but as a subsidiary of Red Hat—much as Fedora is today. Some will disagree, but most would agree that Red Hat’s stewardship of Fedora has been quite good over the years; one expects its treatment of CentOS will be similar. Like with Fedora, though, some (perhaps large) part of the development of the distribution will be directed by Red Hat, possibly in directions others in the CentOS community are not particularly interested in.

Plenty of benefits to go around

Whether it’s the rather resource-strapped CentOS gaining more access to people and infrastructure, not to mention those pesky legal threats, or Red Hat bringing home a community that strayed since it split Red Hat Linux and created RHEL/Fedora in 2002–3, the benefits are clear to both sides.

I’m not convinced it had to go nearly as far as it did to realize those benefits, though — formalizing a partnership would have sufficed. However, giving three of the existing lead developers the opportunity to dedicate full-time effort to CentOS will be a huge win, as will the other resources Red Hat is providing around infrastructure, legal, and so on. But the handover of the trademark and the governance structure are a bit unusual for the benefits as explained, although entirely unsurprising for an acquisition and company ownership of an open-source project.

What about Fedora?

It’s worth reading what Robyn Bergeron, the Fedora Project Leader, said on the topic.

Red Hat still needs a breeding ground for innovation of the Linux OS, so I don’t see anything significant changing here. What I would hope to see over time is a stronger integration of developers between Fedora and CentOS such that it’s easy to maintain packages in both places if you desire.

Perhaps the largest concern for Fedora is a lessening of Red Hat employees contributing to it on paid time, in the longer term. As the company pivots more toward cloud infrastructure (see its recent appointment of Tim Yeaton and Craig Muzilla to lead groups that own cloud software at Red Hat) with a clear hope of increasing its cloud revenue share, Red Hat’s need to differentiate at the OS level may shrink and thus its need to contribute as many resources to Fedora. However, Robyn duly points out that Fedora’s role as upstream for RHEL isn’t going anywhere, so neither is the project.

The hidden BDFL

Red Hat’s Karsten Wade seems to have become the closest thing there is to a CentOS BDFL (or at least an avatar of Red Hat as BDFL) by virtue of being the “Liaison” on the newly created governing board. The other named board role is Chair, a coordinator and “lead voice” who cannot make decisions on behalf of the board as the Liaison can. In case you didn’t see the fine print, here’s the reason I say that:

The Liaison may, in exceptional circumstances, make a decision on behalf of the Board if a consensus has not been reached on an issue that is deemed time or business critical by Red Hat if: (1) a board quorum (i.e., a majority) is present or a quorum of Board members has cast their votes; or (2) after 3 working days if a Board quorum is not present at a meeting or a quorum has not cast their votes (list votes); provided that the Chair may (or at the request of the Liaison, will) call a meeting and demand that a quorum be present.

Unless the Liaison specifically indicates on a specific issue that he/she is acting in his/her official capacity as Liaison, either prior to a vote or later (e.g., after an issue has been deemed time or business critical), the Liaison’s voice and vote is treated the same as any other member of the Board. Decisions indicated as Liaison decisions made on behalf of the Board by the Liaison may not be overturned.

Translation? If the board (the majority of which is Red Hat employees) can’t come to a consensus or can’t meet/vote within 3 days, the Red-Hat–appointed liaison can make an irrevocable, unilateral decision on behalf of Red Hat. Also worth noting is that Karsten will be the direct manager of the three CentOS employees joining Red Hat, giving him further influence in both formal and informal forms. Although whoever’s in the liaison role theoretically steps down in power when not acting as liaison, this is much like temporarily removing “operator” status on IRC. Everyone knows you’ve got it and could put it back on at any point in time, so every word you say carries much more weight. It is therefore of great interest to understand Karsten more deeply.

He’s got a long history in community management with Red Hat and I’ve had excellent experiences working with him in the Google Summer of Code and many other venues, so I’m confident in his abilities and intentions in this regard. But it’s definitely worthwhile to read his take on the news and understand where he’s coming from. Here’s an excerpt:

 In that time, Red Hat has moved our product and project focus farther up the stack from the Linux base into middleware, cloud, virtualization, storage, etc., etc. … Code in projects such as OpenStack is evolving without the benefit of spending a lot of cycles in Fedora, so our projects aren’t getting the community interaction and testing that the Linux base platform gets. Quite simply, using CentOS is a way for projects to have a stable-enough base they can stand on, so they can focus on the interesting things they are doing and not on chasing a fast-moving Linux.

In other words, they were putting code directly into RHEL that hadn’t had a chance to bake in Fedora first, which is less than ideal for an enterprise distro. Thus the need for a place to test higher-level software on stable platforms (CentOS).

That post makes it perfectly clear where Karsten’s interests lie, so it, along with his background in community management is what drives my initial expectations of Red Hat’s influence upon CentOS. It remains to be seen how often Karsten will need to step up to liaison mode, and to what extent his actions in that role will be handed down from higher up in Red Hat vs independent, so I’m looking forward to seeing how these changes play out.

Disclosure: Red Hat is a client.


Categories: community, linux, operating-systems, red-hat.

IBM’s billion-dollar bets mean less today, but you still can’t ignore them

Yesterday’s news about IBM creating a new Watson Group and investing $1 billion in it was surprising to me because the company just announced a different billion-dollar bet on Linux on its Power architecture back in September, another billion on flash memory in April, along with another major investment in DevOps over the past couple of years. Not to mention its $2 billion acquisition of SoftLayer to develop a stronger cloud story. [Sidenote: Watson is IBM’s Big Data software aimed to do what IBM calls “cognitive computing.”]

IBM initially made a big bang with its announcement of a billion-dollar investment in Linux back in late 2000. Significantly, it was $1B to be spent in a single year, not some indeterminate future (best of luck verifying that). Given the apparent acceleration in extremely large commitments by IBM, I thought a couple of quick calculations were in order to put the recent ones in context.

Inflation since 2000 puts $1B today at $739M in 2000 dollars (when IBM announced the billion-dollar bet on Linux). Furthermore, IBM’s net income (profit) doubled to $16.6B in 2012 from $8.1B in 2000. Inflation means a $1B bet goes only ~75% as far as it did in 2000, while its significance to IBM’s financials is roughly half of what it was back then. In other words, a bet that was once an 800-pound gorilla weighs in at only 400–600 pounds these days, but that’s certainly enough to crush most humans, as you might imagine some of IBM’s competitors to be. IBM’s increasingly long series of billion-dollar bets continues to draw headlines, but you can’t ignore the reality that an investment of that size is going to make a significant impact.
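The arithmetic is simple enough to check in a few lines. The CPI ratio below (~1.35 for 2000 to 2013) is my approximation; the income figures come from the paragraph above:

```python
# Back-of-the-envelope check of the billion-dollar-bet comparison.
# The CPI ratio is an assumption (~35% cumulative US inflation, 2000 -> 2013);
# income figures are IBM's reported net income from the post.

cpi_ratio = 1.35
bet_in_2000_dollars = 1e9 / cpi_ratio
print(f"$1B today is about ${bet_in_2000_dollars / 1e6:.0f}M in 2000 dollars")

income_2000 = 8.1e9   # IBM net income, 2000
income_2012 = 16.6e9  # IBM net income, 2012
# How much a $1B bet weighs against net income now vs. then:
relative_weight = (1e9 / income_2012) / (1e9 / income_2000)
print(f"Relative weight vs. 2000: {relative_weight:.0%}")  # roughly half
```

With this ratio the adjusted figure lands within a couple of million of the $739M cited above; the exact number depends on which CPI series you use.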

Disclosure: IBM is a client.


Categories: big-data, devops, ibm, linux, marketing, open-source, operating-systems.

The parallel universes of DevOps and cloud developers

The City & the City, by China Miéville

When I think about people who live in that foggy world between development and operations, I can’t help being reminded of a China Miéville novel called The City & the City. It’s about two cities that literally overlap in geography, with the residents of each completely ignoring the other; any violations, or breaches, of that separation are swiftly dealt with by a shadowy organization known as the Breach.

Much like people who start from development or from operations (or, for you San Franciscans, the Mission’s weird juxtaposition of its pre-tech and tech populations), The City & the City is a story of parallel universes coexisting in the same space. When I look at the DevOps “community” today, what I generally see is a near-total lack of overlap between people who started on the dev side and those who started on the ops side.

At conferences like Velocity or DevOpsDays, you largely find ops people who have learned development rather than devs who have learned to be good-enough sysadmins. Talks are almost all ops-focused rather than truly in the middle ground, let alone leaning toward development, with rare exceptions like Adobe’s Brian LeRoux (of PhoneGap fame) at Velocity NY last fall.

On the other hand, those developers show up not at DevOps conferences but at cloud conferences. They often don’t care about, or perhaps even know, the term “DevOps”; they’re just running instances on AWS, or maybe another IaaS, or possibly a PaaS, most likely Heroku, GAE, or Azure.

The closest thing to common ground may be events for configuration-management software like PuppetConf or ChefConf, or possibly re:Invent. But even when I was at PuppetConf, the majority of attendees seemed to come from an ops background. Is it because ops care deeply about systems while devs consider them a tool or implementation detail?

The answer to that question is unclear, but the middle ground is clearly divided.

Disclosure: Amazon (AWS), Salesforce (Heroku), and Adobe are clients. Puppet Labs and Microsoft have been clients. Chef and Google are not clients (although they should be).


Categories: cloud, community, devops, Uncategorized.

What were developers reading on my blog and tweetstream in 2013?

As a strong believer in transparency, I wanted to share the actual data from hits on my blog over the past year instead of just a popularity ranking. Using a combination of WordPress stats, Google Analytics, and RedMonk Analytics, I compiled a set of data that reflects what my readers cared about over the past year.

Blog overview: Nearly 90,000 unique visitors

This roughly corresponded to my second year at RedMonk (I started Dec. 1, 2011), so I wanted to take a look at how things changed since the year prior in addition to the raw numbers.

  • 135,114 page views (+393% year over year)
  • 105,133 visits (+383% YOY)
    • 15,770 phone (+163% YOY)
    • 6,015 tablet (+7% YOY)
  • 89,348 unique visitors (+469% YOY)

Beyond being quite pleased at how well I’ve personally done, the disparity between a large increase in phone visitors and a near-constant rate from tablet users is noteworthy. It makes me wonder whether tablet ownership and usage among our generally predictive community is becoming saturated, while the same audience already owned smartphones and is just using them more.

What are people reading?

As is typical, the post traffic is highly asymmetric, with the top hits dwarfing the remainder.

As always, developers love reading about rankings, data, and tooling, and the top posts reflect that. The surprises, to me, are some of the more conversational pieces — one on the Bay Area bubble and the other on SAP. Both of them got fairly strong traction within niche communities on Twitter, which may explain where the traffic came from.

How are people getting here?

Here’s a graph of the top-ranked sources for inbound visits:

The top sources of traffic to my blog in 2013

Among single sources of inbound traffic, search (namely Google, which dominates my search referrals at 99.3%) drew the most readers. As a category, however, social media topped search, with Twitter alone garnering roughly two-thirds as many visits as search did.

Where are they coming from?

Below is a map from Google Fusion Tables that I’ve colored by continent. Deeper greens indicate more visitors; counts are absolute rather than normalized by population.

The raw numbers:

  1. 49,429: North America (88% US)
  2. 33,193: Europe
  3. 13,415: Asia
  4. 3,861: South America
  5. 3,062: Australia
  6. 1,259: Africa
  7. (0: Antarctica)

There’s very strong representation among Western countries with 85% of visitors coming from the Americas, Europe, and Australia. This comes as no great surprise since they share the same Latin alphabet and the majority are likely to speak English well.

In fact, I’m quite pleased to have as many people from Asia in particular as I do, but also South America and Africa, because it provides some additional insight about what those developers are doing similarly and differently from their compatriots.

Twitter: 1 million readers in the top week, 4 million in the year

I recently signed up for a service called SumAll to more effectively track how many people are seeing what I’m talking about. Here’s a weekly graph from that service over the course of 2013:

Screenshot from 2014-01-02 22:13:02

Graph courtesy SumAll. I signed up for their service in September so “mention reach” is missing before then.

Retweet reach (how many total followers see my tweets by following the RT chain) in a typical week is around 75,000. Mention reach (whenever I’m credited, even when I didn’t originate it) has been near or above 500,000 three times since I signed up for SumAll in September. It typically hovers around 3x–5x my RT reach, indicating a combination of independent discovery of my content and Twitter clients that quote me or use the letters “RT” rather than a Twitter API retweet.

I’ve had reasonable success in making data graphics go viral on Twitter — each of the 3 highest peaks over the 4 months where I tracked mention reach was the result of me tweeting a graph based on my original research.

Across the entire year, my retweet reach was 4.02 million users. SumAll didn’t calculate mention reach before I signed up in September, but based on the 4.79 million over the final four months of the year and the typical multiple mentioned above, I’d estimate that somewhere around 10–25 million users encountered my name this year, if that 3x–5x ratio holds.
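The bounds of that estimate come from applying the 3x–5x mention-to-retweet multiple to the yearly retweet reach; this sketch just restates the arithmetic:

```python
# Rough annualized mention-reach estimate from the figures in the post.
rt_reach_year = 4.02e6             # total retweet reach for 2013
multiple_low, multiple_high = 3, 5 # observed mention-reach / RT-reach ratio

low = rt_reach_year * multiple_low
high = rt_reach_year * multiple_high
print(f"Estimated mention reach: {low / 1e6:.1f}M to {high / 1e6:.1f}M")
```

That gives roughly 12M–20M, which I round out to 10–25 million given how noisy the underlying ratio is.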

Year in review wrap-up

2013 was a great year for my RedMonk research, with gratifying growth in readership over 2012. On average I published 3 posts per month, which I hope to improve in 2014 with a more focused approach to how I balance research time between collection and production.

Disclosure: SAP and GitHub have been clients. Automattic, Google, Twitter, and SumAll are not clients.


Categories: analyst, social.

BAM! GitHub prediction nailed: 4M users in August, 5M in December

In January, I used data on GitHub’s past growth to predict what would happen over the next year in a post titled “GitHub will hit 5 million users within a year” and said:

In the near term, I’d estimate, based on my Bass model, that GitHub will hit 4 million users near August and 5 million near December.


Prediction from my January 2013 post. Take the red pill.
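For the curious, the Bass diffusion model behind that forecast is easy to sketch. The parameters below (market potential m, innovation coefficient p, imitation coefficient q) are purely illustrative, not the values I actually fitted in January:

```python
import math

def bass_cumulative(t, m, p, q):
    """Cumulative adopters at time t under the Bass diffusion model.

    m: market potential (eventual total adopters)
    p: coefficient of innovation (external influence)
    q: coefficient of imitation (word-of-mouth)
    """
    e = math.exp(-(p + q) * t)
    return m * (1 - e) / (1 + (q / p) * e)

# Illustrative parameters only: a 20M-user market with strong imitation.
m, p, q = 20e6, 0.002, 0.9
for year in range(1, 7):
    users = bass_cumulative(year, m, p, q)
    print(f"year {year}: {users / 1e6:.2f}M users")
```

Fitting m, p, and q against GitHub’s observed user counts (e.g. with a least-squares routine) is what lets you extrapolate the dates at which milestones like 4M and 5M should fall.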

On August 7, GitHub reached 4 million users and today it topped 5 million (according to its own user-search API), exactly as I predicted. Given these almost uncannily good results, I couldn’t help but be reminded of a classic XKCD comic:

Science. It works.

Caveat: Users appearing in search may also include GitHub organizations.

Disclosure: GitHub has been a client.


Categories: Uncategorized.