Blogs

RedMonk


Ranking Linux distributions, and the decline of the traditional distros

A recent poll on Hacker News asking about Linux distributions of choice got me thinking: what can we learn from a bigger picture of the distro landscape than a single HN poll? I went looking around and dug up a couple of other sources of information: Linux Journal’s annual Readers’ Choice Awards, and data from Google Trends.

What makes these three particular choices interesting is that they span a broad swathe of user types, from the hacker (Hacker News) to the enthusiast (Linux Journal) to the “average” Linux user (Google). That means we can learn from the trends across these three user types, considering which communities may be more predictive or more technical and which represent broader adoption today.

The results are shown below, ranked by Hacker News popularity so we can see distros popular with highly technical audiences most easily.

distro_share

Distros are shown in order of their Hacker News ranks. Hacker News data comes from here, Linux Journal data from here, and Google Trends data is from a US-based search.

We can see that broadly, the trends are consistent across audiences. Smaller differences exist, as well as a couple of clear outliers that are much more popular with a general audience (Linux Mint and CentOS). It’s no surprise to me that Debian- and Ubuntu-based distros are at the top, although the popularity of Arch beyond the enthusiast community surprised me. Let’s dive into the data, one distro at a time:

  • Ubuntu: The clear winner among all communities, with nearly half of users in the hacker (HN) and general (Google) communities as well as around 1/3 of enthusiasts (LJ). The relative lack of Ubuntu users among enthusiasts suggests that they may prefer a deeper level of flexibility so they can play around with the distro itself rather than just use it (Google) or build on top of it (HN).
  • Arch: A strong showing, led by hackers, with equal proportions in the enthusiast and user communities. This suggests significant general appeal as well as a fulfillment of developer needs, or potentially predictive adoption by the HN community that will later be reflected more widely.
  • Debian: Enthusiast-led adoption, with general users trailing.
  • Mint: Extremely strong traction among general users, with hackers and enthusiasts far behind although still not negligible.
  • Fedora: Broad appeal across many types of users, but trailing well behind the leaders. It’s worth noting that Fedora has perhaps the most even appeal of all distributions — does this mean that being good at everything means being great at nothing?
  • Gentoo: No surprise, a source-based distro is most appealing to developers who need the flexibility and enthusiasts who like it.
  • CentOS: Very clearly biased toward general users, its RHEL-based nature means it will be quite well-tested while lagging behind in updates. This is a fairly good fit for users who just want their OS to work, but it seems to create problems for enthusiasts and developers, who want or need the latest software.
  • Slackware: Enthusiast-led usage.
  • openSUSE: A similar trend to Slackware, with even weaker developer adoption. It’s much less popular than any of the Debian-based distros or even Fedora, its closest counterpart on the list.

Developers bias toward flexibility and community

To more closely examine which distros the Hacker News community biases towards, above and beyond their larger popularity, I created a graph based on the ratios between HN usage and Linux Journal or Google Trends usage, respectively. Higher numbers mean a stronger HN bias, while lower ones mean a negative HN bias (HN users are less likely to use it).
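As a rough sketch of the ratio described above, the snippet below divides each distro's Hacker News share by its share in a comparison audience. The share values are invented placeholders rather than the figures behind the graph, and the function and distro names are hypothetical.

```python
def hn_bias(hn_share, other_share):
    """Return HN share divided by the comparison share for each distro.

    A ratio above 1 means the distro is over-represented on Hacker News;
    a ratio below 1 means it is under-represented.
    """
    return {
        distro: hn_share[distro] / other_share[distro]
        for distro in hn_share
        if other_share.get(distro)  # skip missing or zero shares
    }


# Placeholder shares (fractions of each audience), NOT the real poll data.
hn = {"distro_a": 0.40, "distro_b": 0.10}
general = {"distro_a": 0.20, "distro_b": 0.20}

bias = hn_bias(hn, general)
for distro, ratio in sorted(bias.items()):
    print(f"{distro}: {ratio:.2f}")
```

With these placeholder inputs, distro_a comes out HN-biased (ratio above 1) and distro_b comes out biased against (ratio below 1).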

distro_bias

Methods

Compared to enthusiasts (LJ), hackers (HN) lean toward flexibility with Arch and community with Ubuntu, without the sacrifice in ease of use required by Gentoo, which is often compared to Arch plus compilation.

Compared to general users (Google), hackers show a very strong leaning toward flexibility with Arch and Gentoo, as well as a weaker bias toward Debian. Interestingly, hackers show a significant bias against CentOS and Linux Mint — we discussed some reasons for CentOS above but the reason for a lack of Linux Mint users is unclear to me.

The decline of the traditional distributions

If we look at trends over time, things really start to get interesting. Here are sparklines indicating changes since 2004 for general users, according to Google Trends, in order of their popularity today:

sparklines_custom

Data from Google Trends US-based users, starting in January 2004. Distributions shown in order of popularity. Searches were in the form of the distribution name plus “linux.”

I’ve shown the distros that are holding stable or growing in red, while the rest are in gray, and a pattern quickly becomes clear: The older Linux distributions appear to be bleeding users. They’re all shrinking after a peak around 2004–2005 (Fedora, Debian, Slackware, Gentoo) or later peaks around 2007–2009 (Ubuntu, openSUSE). The sole exception is CentOS, which is merely holding steady rather than shrinking, perhaps because it has an entirely different set of users than your typical distro.

Where are the users going? Some of them are definitely going to be moving to the growing distros, but many of them are also shifting to another OS entirely — OS X. Next time you’re at a conference, take a look around and count how many Macs and how many PCs you see. Apple’s appeal to developers is undeniable, and some of the more pragmatic Linux users decided that a different Unix-based OS was the better choice for them.

You may have heard, like déjà vu, that this year is the year of the Linux desktop? Looks like it happened back in 2005, and we missed it.

Conclusions

Based on these datasets, the Linux userbase today is defined by Debian-based distros. Ubuntu in particular has made itself interesting to general users and developers alike, a difficult feat. Its popularity in the cloud may well have something to do with its developer appeal. Linux Mint may have hit upon a winning formula for a general userbase, worth looking into for creators of the other distros, while Arch seems to have nailed the hacker community.

Disclosure: Red Hat is a client, and Canonical has been. SUSE and Apple are not. 

by-nc-sa

Categories: adoption, community, linux, open-source, operating-systems.

DVCS doesn’t disenfranchise enterprise IT — it empowers it

I was talking to Atlassian recently about its new release of Stash (a tool for internal corporate git forges), which just added forking and personal repos to its capabilities, and something counterintuitive occurred to me. Despite the common belief that distributed version control (DVCS) steals power away from corporate IT, I assert that the reality is in fact opposite: DVCS returns visibility and control to central IT.

For developers using centralized version-control tools like CVS and Subversion, their workstations are often a complete mess of source code and checkouts scattered all over the place like pigeon droppings. They’ve got multiple checkouts of different upstream branches, a bunch of separate checkouts for projects based on the same branch, and even multiple copies of files within each checkout (with nicely dated suffixes or .bak.1, .bak.2-style naming). Frankly, it’s a mess. And it’s got two huge problems:

  1. This workflow sucks for developers. It’s easy to lose work and hard to track progress.
  2. Central IT and project management have no idea what’s going on. They can’t track or control anything and they can’t promote better practices.

Although some leading-edge developers will be using a tool like git that interfaces with CVS and SVN so they can work in a distributed, offline fashion and commit locally, most won’t. And even those who do work like that still suffer from a number of downsides, since the whole team and the main repo aren’t doing the same.

The problem is that management is simply scared by the word “forking.” They envision fragmentation, disappearing commits, and invisible work, when in fact all of those things are already happening. Counterintuitively, using a tool like Stash or GitHub Enterprise internally will increase transparency into, and control over, an organization’s development practices, because people will push on a regular basis and use trackable forks within the context of the tool. In addition, there are the clear benefits of smaller commits (because they’re fast and easy), feature branching (because it’s fast and easy), and a history that enables better investigation of bugs (smaller commits make bisection easier).
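To make the bisection point concrete, here is a minimal sketch of the binary search behind tools like `git bisect`. It is not git's implementation; the commit names, the `is_bad` check, and the function name are all hypothetical.

```python
def first_bad_commit(commits, is_bad):
    """Binary-search an ordered commit list for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    steps = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        steps += 1
        if is_bad(commits[mid]):
            hi = mid  # the bug was introduced at or before mid
        else:
            lo = mid  # the bug was introduced after mid
    return commits[hi], steps


# Hypothetical history of 16 commits where the bug first appears in "c11".
history = [f"c{i}" for i in range(16)]
culprit, tests_run = first_bad_commit(history, lambda c: int(c[1:]) >= 11)
print(culprit, tests_run)
```

The search only narrows things down to a single commit, so the smaller that commit is, the less code is left to inspect by hand once it has been found.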

In other words, using DVCS internally for corporate development improves the experience for both enterprise IT and individual software developers. The problem is that nobody seems to be telling that story to corporate IT.

Disclosure: Atlassian is a client, and GitHub has been one.

by-nc-sa

Categories: distributed-development, social.

Gonzo video with PhoneGap’s Andre Charland and Brian LeRoux

Last week I was at Adobe’s MAX conference in Los Angeles, where I grabbed some time with two of the key people behind PhoneGap, the incredibly popular framework for developing hybrid HTML5/native mobile apps. Over beers at Yard House (which has an outstandingly enthusiastic manager who loves craft and good beer), Andre Charland, Brian LeRoux, and I discuss PhoneGap, Adobe, craft and beer, the connection between designers and developers, and more.

I only had my phone on me at the bar, so this whole thing was recorded on a Nexus 4.

Some language may be NSFW, so watch at your peril.

Disclosure: Adobe is a client and paid for my travel and hotel at Adobe Max.

by-nc-sa

Categories: adoption, community, mobile.

DevOps and cloud: A view from outside the Bay Area bubble

devops_cloud_venn_diagram

I saw two starkly different worlds of IT almost side-by-side last week, thanks to the absurdities of airline pricing, and it illustrated very clearly the contrast between how we perceive the world in our Bay-Area–centric bubble and how the world really is.

First, I spent some time at Amazon’s AWS Summit in San Francisco, where Amazon was pushing best practices at the bleeding edge of tech to one of the most technically sophisticated communities on the planet. Following that, I spent a day at DevOpsDays in Austin, Texas, en route to my home in Minnesota. (For some reason this was hundreds of dollars cheaper than a direct flight.)

In the Bay Area, I saw the same thing that’s endemic to the area. There’s a clear best way to do things, pretty much everyone is aware of it, and that’s what everyone does. Thanks to the heavy startup presence, there’s much less inertia in terms of existing cultures or infrastructure, so changes are easier. When you’ve got a next-door neighbor doing something amazing, it’s very hard to resist the peer pressure and the local culture, so everyone’s doing The Right Thing™. Very similar things hold true in the open-source world, where neighbors may be virtual but they’re still highly visible.

In Austin, it was an entirely different story. I saw yet another example of how the rest of the IT world, at least in this country, lives. I’ve seen it in places like Minnesota, Maine, and Oregon. It’s a world where trendy software vendors and startups don’t represent any meaningful part of the tech community, where businesses mostly don’t yet realize that software is eating the world. It’s a world where inertia rules the day, where business is king and sysadmins have little to no say in major changes. And it’s a world where even experimentation is difficult and must be done on the smallest of scales.

What happens in places like this? Let’s call it Everytown, USA. In Everytown, IT departments can’t afford to build a new infrastructure from scratch using Puppet or Chef in the cloud. They don’t have the freedom to do it externally or the resources to implement a private cloud internally.

Even at a conference like DevOpsDays Austin, if you ask people what they’re actually doing today, most of the time it has little to no resemblance to how a new Bay Area startup would set up its infrastructure. Don’t ask them about their plans; those are often so ambitious as to be unusable. Maybe they’re maintaining cloud instances by hand in AWS, or maybe they’re slowly migrating a large datacenter full of pets to configuration management, which they’ve been working on for the past five years. If they’re open-source fans, chances are they’re running Nagios and have a huge collection of Nagios-related infrastructure that would need serious, dedicated effort to shift to anything different.

More modern shops could have migrated most or all of their servers to tools like Puppet or Chef, so everything’s at least under configuration management and thus documented and reproducible. But in many cases, this is for datacenter use only, either true on-prem or in a colo. Gaining the capacity, budget, and permission to even migrate to private cloud is impossible for many companies, and it could be that way for a while.

You can see the same thing at conferences for larger enterprise vendors like IBM. When I talked to attendees at IBM’s Pulse conference this spring, most of them were in exactly the situation I’ve described. IBM’s jump into both cloud and DevOps will make a significant difference to their adoptability in many places; it’s like a stamp of approval that these things are really ready for the enterprise.

“Shadow IT” developers outside the purview of IT-controlled infrastructure, on the other hand, often don’t have or don’t want to develop the expertise to learn DevOps philosophies and approaches. Developers may well be working in the cloud, but chances are they aren’t running tools like Puppet or Chef, and they don’t have any monitoring set up. They may hack things around by hand and hope everything doesn’t break too often, or they may outsource the infrastructure to somewhere external and run in a PaaS.

IT shops like this may be aware that better ways exist, and they may have ambitions of going there, someday. The Bay Area view of the right infrastructure is always going to be years away for the rest of us. We even put William Gibson’s quote regarding this on our website:

The future is already here, it’s just unevenly distributed.

 

Update (5/5/13): Of course this is a generalization of reality, which is always more complex than a single answer at either end of the spectrum. I’ve just simplified it to communicate the overall points, which remain true regardless of the details. Reality looks like a distribution on both ends — but the distribution is shifted. I’m just talking about the most common cases within those distributions. There are clearly going to be some Bay Area companies with plenty of inertia, and some Everytown companies overflowing with cloud- and DevOps-based approaches. Even within a single company, there’s a distribution of approaches, with some areas more modern and others more legacy (heard of systems of engagement and systems of record?).

Disclosure: Amazon (AWS) and IBM are clients. Puppet Labs has been a client. Opscode and Nagios are not.

by-nc-sa

Categories: cloud, devops, open-source.

Musical chairs with open-source business models: Opscode and Tokutek

While everyone else is talking about API-related acquisitions (Mashery by Intel, Layer 7 by CA, now ProgrammableWeb by MuleSoft), I’m going to avoid the pack in this post and focus on some other underrated but interesting news that you should know about.

A couple of changes in direction regarding open source were announced in the past few days, and they’ve gotten little coverage thus far, despite their fairly significant implications.

Gambling on traction with open source

NewSQL database provider Tokutek just went open source with its TokuDB v7 release yesterday. TokuDB is a MySQL/MariaDB storage engine based around an algorithm called fractal trees. What makes this move interesting?

For one, open-source NewSQL options are hard to come by. This is one market where open source isn’t yet table stakes, unlike NoSQL, so it does make companies stand out. VoltDB is one of very few OSS options, falling under the strongly copyleft AGPLv3. Tokutek went with GPLv2 for its engine (the same as MySQL), a slightly more permissive license in that you don’t need to provide source if it’s only available within a hosted service. Usefully, they also provided a patent license since that isn’t GPLv2’s strong point. This makes TokuDB newly interesting to service providers who want to incorporate an open-source NewSQL option into their products.

Secondly, it’s always interesting to look at the particular approach companies take to an OSS-centric model. In this case, it’s a combination of the classic models of support and proprietary add-ons (in this case tools for backup and recovery), according to SiliconAngle. As going open source with your core product isn’t a transition that’s easy to step away from, it can be useful to take a piecemeal approach, as you determine where your customers find the real value.

Maximizing the innovation window

Opscode, on the other hand, is moving in a more proprietary direction. As Adam Jacob, Chief Customer Officer, wrote in a post on the past five years:

One shift here is in the order of operations: before we wrote Chef, there was no Chef. We shipped the primitive first, then we built value (Hosted Chef and Private Chef) on top. As we move forward, we’ll shift to open sourcing new primitives after we build something cool on top of them that shows their power. [emphasis mine]

This shift to an “open-source the infrastructure” approach after you’ve already built a beautiful facade on top is a significant change to a model that’s entirely about differentiating on top (a la GitHub, Facebook, Twitter, LinkedIn) rather than being what I would call a true open-source company. It gives Opscode a new monopoly on the time window between when they create a new piece of infrastructure and when they release the proprietary frosting on top. It also has a detrimental effect on a leading subset of users who prefer a more composable infrastructure, as we’re seeing now in the #monitoringsucks/#monitoringlove movement, and who will now be forced to wait for the core components until Opscode finishes building something on top of them. That said, much like the value of example code in SDKs, Adam is entirely right that building useful products on top of a core component will very clearly illustrate its value and some of its use cases.

So, two transitions: one shifting toward open, another shifting more closed. I’m looking forward to seeing what comes of both.

Disclosure: VoltDB is a client. GitHub has been a client. Opscode, Tokutek, Oracle (MySQL), MariaDB, Twitter, Facebook, and LinkedIn are not clients.

by-nc-sa

Categories: adoption, devops, open-source.

The size of open-source communities and its impact upon activity, licensing, and hosting

Common (mis?)conception states that development practices, standards, and cultures vary broadly depending on the size of an open-source community. In general, we expect that many solo projects may lack the same level of QA and rigor as those with multiple developers due to constraints on time, varying experience levels, etc., and they may not even be intended for consumption by others. As communities grow slightly larger, projects that successfully recruited multiple contributors would likely tend to be higher-quality, on average, than those that failed to do so. In the largest open-source projects with tens or hundreds of contributors, we generally expect a fairly high level of quality, attention to detail, documentation, and so on.

Here, I’m going to dig into data from Ohloh, which tracks a vast set of open-source software projects, to investigate some of the effects related to community size. I’ll look at a number of potentially connected variables centered around development activity, licensing choices, and hosting providers (GitHub, etc.).

As always, the caveats:

  • This is only useful for active projects with active communities, because it contains only projects with commits during a 1-year period and members of the community must opt in to subscribe the project to Ohloh. This equates to 50,000+ projects, so it’s still a good-sized set.
  • It is subject to any imperfections in Ohloh’s measurements, which is particularly relevant for license detection where it simply looks for strings in source files. It will miss any indirect references to licenses by name or URL. It also seems to miss some more obvious ones, which will set a lower bound on license discovery (but it should be independent of community size).
  • In most cases, I’m ignoring the largest ~100 projects on Ohloh, which have ~150 or more committers per year, so these conclusions may not be generalizable to them. These tend to include well-known names like GNOME, Chrome, KDE, the Linux kernel, Mozilla, etc. There simply aren’t enough samples of things at similar size to aggregate data for general, non-project-specific conclusions.

To make these posts more easily readable, I’m going to try something new. All the methodology is now in the figure captions, so skip captions if you just want to read the what and ignore the how.

The size distribution of open-source communities

Before looking at the impact of size, I first wanted to gain an understanding of how big free/open-source software (FOSS) communities were, and how many project communities there were at each size. Plotting the community size against the number of FOSS projects at each size produced the plot shown below:

committer_histogram

Ohloh data for projects active in the past year as of July 2012. Monthly data from the 30 days immediately prior to the Ohloh dump. The LOWESS fit shows a locally smoothed line in the noisier regions, using 1/8 of observations for smoothing each point. Not shown are the 92 projects with >150 committers per year.

Global features

What I find interesting about the shape of the above graph is two things: the helpfully linear behavior on this type of plot, and the gap between monthly and annual contributors. When this type of plot appears linear, it indicates behavior supporting a set of statistical distributions including the power law (the famous 80-20 principle stating that 80% of the effect comes from 20% of the causes). In the below section on specific effects, I’ll show some numbers indicating the kind of behavior we see as a result of this.
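To spell out why a straight line on this kind of plot points to power-law behavior: assuming both axes are logarithmic, and writing N(s) for the number of projects with s committers, C for a constant, and α for the exponent (notation introduced here, not taken from the Ohloh data), a power law becomes a line with slope −α:

```latex
N(s) = C\, s^{-\alpha}
\quad\Longrightarrow\quad
\log_{10} N(s) = \log_{10} C - \alpha \log_{10} s
```

A steeper line therefore corresponds to a larger exponent, meaning the number of projects falls off faster as community size grows.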

One interesting question I wanted to answer here was the relationship between monthly committers and “expected” monthly committers based on the year-long figures — we can get this by dividing annual committers by 12. However, appearances can be deceiving. This graph actually can’t tell us anything about that because there’s zero connection between projects with 80 annual contributors and 80 monthly contributors. Instead, what we can do to get at this information is directly correlate monthly and annual contributors at the level of individual projects, which I’ll show later.

What can that gap tell us, then? As it turns out, it’s not equally sized throughout. There’s a very real and linear increase in that difference as community size increases from 1 to 35 committers (it’s too noisy above that), with the logarithmic difference increasing from 0.57 to 1.29 (each unit of 1.0 indicates a 10x increase). The higher slope for the monthly committers indicates a set of values with a tighter overall distribution that are (unsurprisingly) biased away from high numbers of committers, which are much more easily attained for a project in a year than a month. Put simply, it’s an expected result — you get more unique committers in a year than a month.

Specific effects

The vast majority of projects are tiny, having just 1 or a few contributors. This is even more dramatic than it first appears if you look at the Y-axis, which is logarithmic rather than linear. From the full spreadsheet (embedded below), we can draw some more quantitative conclusions. On an annual level, just over half of active projects (51%) have only 1 contributor, while 19% have 2, 9% have 3, 5% have 4, and 3% have 5 (see the PDF column below). Overall, 87% of projects have 5 or fewer committers per year (see the CDF column). Looking from the opposite perspective, merely 1% of projects have 50 or more committers per year, and a scant 0.1% have 200 or more (see the Rev. CDF column).
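The relationship between the PDF and CDF columns can be checked with a few lines of code. This sketch uses only the five percentages quoted above. The reverse CDF here is defined as "more than k committers", which is an assumption and may differ from how the spreadsheet column is defined.

```python
from itertools import accumulate

# Share of active projects by annual committer count (1 through 5),
# in percent, as quoted in the text above.
pdf = [51, 19, 9, 5, 3]

# Running total: share of projects with at most k committers.
cdf = list(accumulate(pdf))

# Complement: share of projects with more than k committers.
rev_cdf = [100 - c for c in cdf]

for k, (p, c, r) in enumerate(zip(pdf, cdf, rev_cdf), start=1):
    print(f"{k} committers: PDF {p}%  CDF {c}%  Rev. CDF {r}%")
```

The last CDF entry reproduces the 87% figure for projects with 5 or fewer committers per year.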

 

Contribution regularity is independent of community size

To directly compare monthly and annual committers, we need to pull the numbers at the level of individual projects and create a plot based on them, rather than looking at two independent histograms on the same graph as we did above. If we do that and aggregate it into 25-project bins to ease visualization, then fit lines to them, we can produce a plot much like the below:

monthly_vs_annual_contributors_custom

Data points were created for each 25 projects, and percentiles were calculated for each data point. The lines indicate an observation-weighted cubic-spline fit with a smoothing factor of 1/1000.

This is a variation on a box plot, showing the median in thick black in addition to a number of percentiles to indicate the size of the distribution cores (25%–75%) and more extreme, non-outlier values (10%–90%). To interpret it, consider that medians represent the central values, while the thicker colored lines represent the “middle half” for each number of committers, and the thin colored lines represent nearly everything (the central 80%).
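For readers who want to build a similar plot, here is a minimal sketch of the binning step as I read it from the caption: sort projects by monthly committers, group them into bins of 25, and compute percentiles of annual committers within each bin. The input data is randomly generated for illustration, the function name is hypothetical, and the spline fitting is not shown.

```python
import random
import statistics


def bin_percentiles(projects, bin_size=25):
    """Summarize annual committers for each bin of projects.

    `projects` is a list of (monthly, annual) committer counts. Projects are
    sorted by monthly committers and grouped into bins of `bin_size`;
    a trailing partial bin is dropped.
    """
    ordered = sorted(projects)
    summaries = []
    for start in range(0, len(ordered) - bin_size + 1, bin_size):
        chunk = ordered[start:start + bin_size]
        annual = [a for _, a in chunk]
        # n=20 yields 19 cut points at 5% steps: index 1 is the 10th
        # percentile, 4 the 25th, 14 the 75th, and 17 the 90th.
        cuts = statistics.quantiles(annual, n=20, method="inclusive")
        summaries.append({
            "monthly": statistics.median([m for m, _ in chunk]),
            "p10": cuts[1],
            "p25": cuts[4],
            "median": statistics.median(annual),
            "p75": cuts[14],
            "p90": cuts[17],
        })
    return summaries


# Synthetic stand-in data: 100 made-up projects, NOT Ohloh figures.
rng = random.Random(0)
projects = []
for _ in range(100):
    monthly = rng.randint(1, 20)
    projects.append((monthly, monthly + rng.randint(0, 30)))

bins = bin_percentiles(projects)
```

Each entry in `bins` corresponds to one data point on the plot: the thick black line follows `median`, the thicker colored lines follow `p25` and `p75`, and the thin ones follow `p10` and `p90`.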

This plot shows a very clear typical range of annual committers, given a monthly number. Conversely, it could also be read the other way to suggest likely numbers of unique monthly committers, given an annual value.

I next wanted to look more specifically at the relationship between a simplistic prediction of monthly committers and the actual monthly values. It’s based purely on dividing the annual committers by 12 months, which means that a ratio of 1 would equate to each contributor making commits during only 1 month each year.

expected_monthly_contributors_custom

The ratio of expected monthly committers was generated from dividing 1/12 of annual committers by the monthly committers. Each semitransparent circle represents the median committer values of 25 data points, and darker colors indicate multiple overlapping circles.

Interestingly, while the data points distribute much more widely at lower committer counts (potentially due simply to larger populations), the trend remains near-linear and horizontal, going from a ratio of ~0.20 to ~0.25 as a function of community size. Values below 1 mean committers are making contributions during more than 1 month each year. In particular, if you multiply the ratio by 12 months, you get the average periodicity of someone’s contributions in months. So 0.25 * 12 = committing every 3 months, for a total of 4 months of contributions each year from each committer in large projects, on average, and 5 months in small projects using the same math. While many developers will contribute more, enough will also contribute less to make the final numbers come out around 4–5 months in a remarkably consistent fashion.
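The arithmetic above can be written out as a short sketch. The function names and the example committer counts are hypothetical; only the ~0.20 and ~0.25 ratios come from the plot.

```python
def expected_ratio(annual_committers, monthly_committers):
    """Ratio of the naive monthly estimate (annual / 12) to the observed value."""
    return (annual_committers / 12) / monthly_committers


def months_between_contributions(ratio):
    """Average gap between active months: the ratio multiplied by 12."""
    return ratio * 12


def active_months_per_year(ratio):
    """Months per year in which an average committer makes commits."""
    return 12 / months_between_contributions(ratio)


# Hypothetical project: 48 committers over the year, 16 in a typical month.
ratio = expected_ratio(48, 16)

for label, r in [("large projects", 0.25), ("small projects", 0.20)]:
    gap = months_between_contributions(r)
    active = active_months_per_year(r)
    print(f"{label}: every {gap:.1f} months, {active:.1f} active months/year")
```

The two steps collapse to 1 / ratio, which is why a ratio of exactly 1 corresponds to a single active month per committer per year.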

An important take-home from this result is that smaller projects are proportionately nearly as likely as larger ones to receive drive-by commits or have relatively inactive developers on a monthly basis.

Larger communities tend to get more engagement

To look for size effects at a finer-grained level than committers alone, I turned to the commits themselves: I took the ratio of monthly commits to monthly committers and plotted it as a function of community size in the graph below. As the size of a project increases from 1 to ~10 developers, the median gradually doubles from ~5 to ~10 commits per committer, where it then holds steady as community size grows (beyond 20, the data become too noisy due to too few projects of that size).

commits_per_committer

Data points were created for each 25 projects, and percentiles were calculated for each data point. The lines indicate an observation-weighted cubic-spline fit with a smoothing factor of 1/1000.

A number of factors could explain this trend — for example:

  • Larger projects receive or accept proportionately more drive-by patches that are credited to a committer rather than the patch contributor;
  • Everyone is more active due to an effect of community interactions or peer pressure;
  • They have a higher proportion of active committers, such as professional contributors who make more frequent commits;
  • Most smaller projects will never gain the traction to grow larger, but the larger a project is, the more likely it is to have gained or be in the process of gaining developer traction.

However, the generally horizontal line at the 90th percentile (the peak around 7-8 appears to be an outlier due to some large projects with abnormally low committer levels that month) indicates that a subset of small communities do behave similarly to the larger ones. This suggests that it may be the last of these explanations.

“Post-OSS” licensing practices are a big issue in smaller communities

My eminent colleague James posted this succinct and bluntly honest tweet last fall:

younger devs today are about POSS – Post open source software. fuck the license and governance, just commit to github.

Luis Villa, open-source lawyer and Friend of RedMonk, wrote an excellent post following up on the topic, postulating that POSS behavior was an explicit rejection of permission-based culture. It’s easy to simply accept that this is happening, but as a scientist by training, I prefer to see whether there’s data to support or deny the assertion that licenses as a whole are growing less popular.

Ohloh is quite useful for licensing data because it goes beyond simply looking at COPYING, LICENSE or README files to directly examine the contents of each source file for strings found in licenses. While it’s undoubtedly imperfect, because it looks directly for license strings and so may miss poorly worded or obscure references to licenses, that will simply set a baseline for detection. Any changes relative to that baseline will still be valuable.

If we look at the percentage of active projects (at least 1 commit in the past year) without licenses detected by Ohloh, it baselines around 20% for large projects, which one would hope embody best practices in open source. This is likely a combination of two factors: Ohloh’s detection ability and actual missing licenses (likely dominated by the former).

But once we start looking at the trends, that’s when things get interesting. Take a look at the graphs below:

Coloration indicates the number of projects for a given data point. Based on 56,090 Ohloh projects with available data on origination date, committer counts, and license. The observation-weighted cubic spline fit is used as implemented in gnuplot with a scale factor of 1/1000.

Data points were created for each 25 projects, and percentiles were calculated for each data point. The lines indicate an observation-weighted cubic-spline fit with a smoothing factor of 1/1000.

I’ve classified project licensing into one of four categories: None, Copyleft, Permissive, or Limited (a.k.a. weak copyleft). Let’s walk through them in order.

First, unlicensed projects would qualify for the POSS designation and are shown in the top left. This is the largest trend among all of the license types in terms of absolute license share, indicating the importance of thinking about it. When looking at monthly contributors, this trend flattens out around 15 committers at 20% of all active projects and stays flat well beyond the right edge of this graph, to at least 70 committers per month (after that point it’s too noisy). Regardless of whether this trend is due to a true rejection of permission-based culture, as Luis Villa suggests, or whether it’s a function of lack of licensing education, the shift from 50% unlicensed single-developer projects to below 25% unlicensed projects with 15 or more contributors cannot be ignored. My interpretation, treating the ~20% floor as mostly a detection artifact, is that essentially no projects with ≥10 monthly contributors have licensing problems, while ~1/3 of one-developer projects do. The transition occurs in the middle. In other words, as projects grow, they tend to sort out any licensing issues, likely because they get corporate users, professional developers, etc.

Second, let’s look at copyleft licensing, the next-most-popular type. As a counterpart to the POSS trend, the use of copyleft licenses increases from ~20% to 35–40% around 15–20 monthly committers before the data get too noisy to draw any further conclusions. However, four of the five largest data points (25-project aggregates) hover around 45–50% copyleft, suggesting a potential upper limit that’s driven in part by the Linux kernel and Linux distributions, some of the largest collaborative projects around.

The lower two plots, permissive and limited (weak copyleft) licensing, show mild upward trends on an absolute scale. Permissive shows a small increase from ~20% to ~25%. Limited licenses, on the other hand, show a small increase from ~7–8% to ~11–12%. While small on an absolute scale, this modest-seeming trend indicates that limited licenses are roughly 50% more popular in larger communities than in small ones.
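The "roughly 50%" figure is a relative change, which can be checked with a few lines. The midpoints used below (7.5% and 11.5%) are an assumption about how to read the approximate ranges in the text.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after`."""
    return (after - before) / before

limited_small = (7 + 8) / 2    # midpoint of ~7-8%
limited_large = (11 + 12) / 2  # midpoint of ~11-12%

print(f"limited:    {relative_change(limited_small, limited_large):.0%}")  # 53%
print(f"permissive: {relative_change(20, 25):.0%}")                        # 25%
```

So although both curves move by about 4 to 5 percentage points, the limited category grows about twice as much in relative terms.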

Hosting providers generally do not support large communities well

The other interesting data point I have is which code forge each active project (at least 1 commit in the past year) is hosted at, so let’s examine the connection between code forges and community sizes. My expectations going in were that:

  • Small communities would bias heavily toward GitHub, because it’s basically the center of open development today; and
  • Larger communities would likely tend to host independently, because they have more complex needs in terms of service heterogeneity and scale.
forge_by_community_size_composite

Dots indicate the integer medians of each 25 data points in order of committer size, semitransparent so darker dots indicate multiple overlapping points. Otherwise, data source and spline fit as described previously.

Small projects

On an overall level, we can see a strong bias for small projects toward GitHub (~50%), while just under 20% each opt for SourceForge and Google Code. The remaining ~10% largely choose to self-host, with the last few percent going to Launchpad and Bitbucket.

Two points worthy of note are that GitHub and Launchpad both show global peaks at a committer count higher than 1, indicating a break with the global trend in the first graph that the most common situation is a single-developer project. This could support the importance of a low barrier to entry for collaboration. Getting those first few developers beyond the founder tends to be incredibly difficult, and anything that makes that easier is a huge deal.

Large projects

The downhill trends are clear for SourceForge and Google Code, while Launchpad and Bitbucket appear to remain roughly flat. GitHub seems to have a slight downhill trend. Interestingly, scaling to the needs of larger projects turns out to be a major issue for the older forges (SourceForge and Google Code), but GitHub seems to have largely defeated it.

While it’s clear that the vast majority of the increase in self-hosting comes at the cost of share for SourceForge and Google Code, it’s hard to attribute precise causes to it. Some of the likeliest possibilities are a lack of desired communication methods, a difficulty with the usability of the platform or collaboration on it, and a failure of the forge to scale effectively.

Conclusions

Once a project reaches 15–20 monthly contributors, it seems to behave much differently, on average, than smaller projects in a number of ways. In larger projects, committers tend to be more active as a whole, licensing tends to be better-determined, and the projects themselves are much more likely to be self-hosted. Very small communities make up the vast majority of the open-source world, however, so we need to pay close attention to what’s happening even on solo projects.

Disclosure: Black Duck Software (which runs Ohloh) and Atlassian (which runs Bitbucket) are clients. GitHub and Canonical (which runs Launchpad) have been clients. Dice (which runs SourceForge) and Google are not clients.

by-nc-sa

Categories: adoption, community, data-science, licensing, open-source.

Quantifying the shift toward permissive licensing

The team at Ohloh worked with me to organize a data hackfest at OSCON 2012, and we pulled together a great dataset that included licensing data for all open-source projects in Ohloh that had any commits in the past year. After working with Ohloh data for my recent post on language expressiveness, I wanted to explore it in some different ways to see what else might emerge, and licensing seemed like one worth examining more deeply.

My colleague Steve has posted about permissive vs copyleft licensing a number of times, but we’ve never done quantitative research into licensing choice to prove the extent to which any shifts are happening, the time frames involved, and the potential variations within different programming-language communities.

Approach: Classification, history, and languages

Using the Ohloh data for 57,930 active projects as of July 2012, I classified the top 30 open-source licenses into one of three categories: permissive (e.g. BSD, Apache), limited (e.g. LGPL, MPL, EPL), or copyleft (e.g. GPL, AGPL). This three-category classification accounts for 90+% of all projects with specified licenses, which means it should be representative. The total number of classified projects was 17,549, because a vast number of projects either have no license or have one that Ohloh was unable to detect. Limited licensing is quite rare, hovering around 2%–3% of projects with licenses, so for the purposes of this post, we will focus on permissive and copyleft licensing.

To attempt to identify historical shifts, I separated projects into buckets based on the date of their first commit. Since license changes between permissive and copyleft are quite rare, this should be a reasonable approach to examining trends over time.

Since I hypothesized that programming language might also play a role, I further split each year’s bucket by language. Here, I’m going to focus on the 11 most popular languages according to our rankings, as well as the total across all languages regardless of popularity. Any data points with 5 or fewer projects between permissive and copyleft are not shown, to remove noise.

Results: A clear trend toward permissiveness

I’m showing the data as a ratio between permissive and copyleft licensing to account for changes in absolute numbers of projects over time. Any number above 1 indicates a bias toward permissive licensing, while any number below 1 indicates a bias toward copyleft.
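The classification, bucketing, and ratio steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the chart: the license table lists only the examples named in the text (the full top-30 mapping is not given), the `(year, language, license)` input layout and the function name are hypothetical, and skipping buckets with zero copyleft projects is a choice made here to avoid dividing by zero.

```python
from collections import defaultdict

# Only the example licenses named in the post; the full top-30 mapping is not given.
LICENSE_CLASS = {
    "BSD": "permissive", "Apache": "permissive",
    "LGPL": "limited", "MPL": "limited", "EPL": "limited",
    "GPL": "copyleft", "AGPL": "copyleft",
}

def permissive_copyleft_ratio(projects, min_projects=6):
    """Return {(year, language): permissive/copyleft ratio}.

    `projects` is an iterable of (first_commit_year, language, license_name).
    Buckets with 5 or fewer permissive+copyleft projects are dropped as noise.
    """
    counts = defaultdict(lambda: {"permissive": 0, "copyleft": 0})
    for year, language, license_name in projects:
        cls = LICENSE_CLASS.get(license_name)
        if cls in ("permissive", "copyleft"):  # limited and unknown are ignored
            counts[(year, language)][cls] += 1

    ratios = {}
    for key, c in counts.items():
        total = c["permissive"] + c["copyleft"]
        if total < min_projects or c["copyleft"] == 0:
            continue
        ratios[key] = c["permissive"] / c["copyleft"]
    return ratios

# Made-up sample: one bucket large enough to keep, one too small.
sample = ([(2012, "Ruby", "BSD")] * 6 + [(2012, "Ruby", "GPL")] * 2
          + [(2005, "Perl", "GPL")] * 3 + [(2012, "Ruby", "LGPL")])
print(permissive_copyleft_ratio(sample))  # {(2012, 'Ruby'): 3.0}
```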

sort_license_class_by_year

Remarkably, every single language shows an upward trend, starting either in favor of copyleft or near equilibrium and shifting upward in a more permissive direction. The overall total, shown as a thick black line, further supports and clarifies this trend since the individual languages can be rather noisy.

Two languages of particular note are the two extremes: Ruby on the permissive side and Perl on the copyleft side. While most languages cluster relatively tightly, Ruby rises far above them with a very clear and strengthening shift toward permissive licensing: 2x in favor of permissive in 2010, 6x in 2011, and 11x in 2012. At the other extreme, Perl shows a roughly 2x–3x bias in favor of copyleft, which is distinctly below the nearest neighbor, C++, but not nearly as large a divergence from the primary cluster as Ruby shows.
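To make these ratios easier to picture, a permissive-to-copyleft ratio r corresponds to a permissive share of r / (1 + r) among projects in those two categories. A minimal sketch (the function name is illustrative, and limited or unlicensed projects are excluded from the denominator):

```python
def permissive_share(ratio):
    """Permissive fraction of permissive+copyleft projects for a given ratio."""
    return ratio / (1.0 + ratio)

for ratio in (2, 6, 11):
    print(f"{ratio}x -> {permissive_share(ratio):.0%} permissive")
# 2x -> 67% permissive
# 6x -> 86% permissive
# 11x -> 92% permissive
```

By the same arithmetic, a 2x–3x bias toward copyleft (a ratio of 1/2 to 1/3) corresponds to roughly 67%–75% copyleft.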

Other than that, at the level of individual languages, it’s difficult to draw any strong conclusions based on their relative positions because they are much less distinct. More recent web-development languages (Ruby, JavaScript, Python) may bias toward permissiveness, as do languages that tend to be used on closed platforms (Obj-C, C#). The difference between Java, C, and C++ is likely cultural as well, with C and C++ being common in the copyleft community while Java is less so due to inertia from its OSS-unfriendly past.

Conclusions

The shift toward permissive open-source licensing is dramatic over the past decade. Since 2010, this trend has reached a point where permissive is more likely than copyleft for a new open-source project. Although there are language-specific effects, especially in the case of Ruby, the overall movement is clear. Outside the extremes, new projects in even the most copyleft-biased language (C++) in 2012 were given copyleft licenses less than 60% of the time.

Disclosure: Black Duck Software (which owns Ohloh) is a client.

by-nc-sa

Categories: adoption, data-science, licensing, open-source.

Coastal Africa: an up-and-coming force in software

As I was digging through Google Trends to check on some geographic trends related to my post ranking expressive languages, I came across intriguing data about Africa. It turns out that the eastern and western African coasts appear extremely interested in development, according to Google Trends. This is particularly true for Nigeria and Kenya in 2011–2012, as shown below for 2012.

If you look at a longer-term view of the full history of Google Trends from 2004 to present, other nearby countries show up as well, although lower-ranked: Uganda, Ethiopia, Zimbabwe, and Ghana. It’s likely no surprise to anyone that, outside of East and West Africa, South Africa (the country) also makes a strong showing, and Egypt appears to a lesser extent, visible on some of the maps but not in the top 10 of the lists. Here are the results for the search terms I used, “software engineering,” “programming languages,” “computer programming,” and “software development”:

africa

This interest is further supported by Web-traffic data from Alexa showing programming popularity in parts of Africa, with GitHub being a popular site in both South Africa and Nigeria (the same goes for Stack Overflow).

As another proxy for interest in software development and its future directions, we can look at website traffic to RedMonk.com. It shows the top 10 highest-traffic countries in Africa since 2009 as:

  1. South Africa
  2. Egypt
  3. Kenya
  4. Nigeria
  5. Morocco
  6. Tunisia
  7. Ghana
  8. Algeria
  9. Mauritius
  10. Uganda

South Africa and Egypt alone account for more than half of the traffic, however, so the rest appear behind in this respect. Comparing 2012 with 2010, we’ve seen a 13% increase in the proportion of our traffic coming from Africa, and none of that increase comes from Northern Africa (i.e. Egypt, Tunisia, Algeria); it’s spread across the remaining regions. However, African traffic remains a quite small proportion of our overall traffic, hovering around 1% compared with our top 3 continents since 2009 at 50%, 32%, and 13%, so Africa won’t be taking over anytime soon. It’s roughly equivalent to the traffic from the #10-ranked US state.

If we look at another metric, that of LinkedIn members in any of the above African countries who match the job titles “software developer” or “software engineer,” we see very similar results (showing all countries with ≥50 results):

  1. Egypt: 4080
  2. South Africa: 2714
  3. Kenya: 688
  4. Nigeria: 597
  5. Tunisia: 370
  6. Mauritius: 231
  7. Morocco: 167
  8. Ghana: 176
  9. Uganda: 153
  10. Ethiopia: 144
  11. Tanzania: 81
  12. Zimbabwe: 80
  13. Sudan: 70

While those numbers will be smaller than the true developer population, they don’t appear completely unreasonable: an estimate in IEEE Spectrum suggested roughly 200 full-time programmers in Ghana in 2005, compared with 176 on LinkedIn today. The correlation with hits to RedMonk.com suggests that these numbers, while perhaps not correct on an absolute scale, do reflect relative differences across Africa.
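As a rough check on that correlation, the two lists can be compared with a Spearman rank correlation. This is a sketch, not part of the original analysis: it uses only the nine countries that appear in both lists (Algeria is absent from the LinkedIn list), re-ranks the RedMonk traffic order over those nine, and relies on the simple no-ties formula. The helper names are illustrative.

```python
def ranks(values, descending=True):
    """Rank values from 1..n (no tie handling; the data below has no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(rank_a, rank_b):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid without ties."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Countries present in both lists, in RedMonk.com traffic order.
redmonk_order = ["South Africa", "Egypt", "Kenya", "Nigeria", "Morocco",
                 "Tunisia", "Ghana", "Mauritius", "Uganda"]
linkedin = {"Egypt": 4080, "South Africa": 2714, "Kenya": 688, "Nigeria": 597,
            "Tunisia": 370, "Mauritius": 231, "Morocco": 167, "Ghana": 176,
            "Uganda": 153}

traffic_rank = list(range(1, len(redmonk_order) + 1))
linkedin_rank = ranks([linkedin[c] for c in redmonk_order])
rho = spearman(traffic_rank, linkedin_rank)
print(round(rho, 2))  # 0.87
```

With only nine countries this is suggestive rather than conclusive, but it is consistent with the two lists ordering countries similarly.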

From these two lists, we can see that African software development extends somewhat more broadly than merely eastern and western Africa to include a broader group of generally stable, coastal African countries, be it north, south, east, or west.

What are they writing?

Language-specific searches of all RedMonk’s tier 1 and tier 2 languages on Google Trends with the pattern “$LANGUAGE programming” showed that Java and C/C++ were the primary languages in use. In fact, they were the only ones to show any meaningful population on searches. C/C++ shows up in Kenya and South Africa, while Java shows up strongly in Kenya and Nigeria, more weakly in South Africa, and finally weakest in Egypt.

The country-level popularity above shows an interesting correlation with entries to a World Bank software contest on global development, a domain in which many Africans have a keen interest because it’s directly relevant to their lives (unlike many apps popular in San Francisco). The top submissions, in order, came from Uganda, Nigeria, Kenya, Ghana, South Africa, Niger, and Rwanda. Most interestingly, Africa had more submissions than any other continent.

I would also expect that as the living standards and cost of living in places like China and India continue to increase, we may see more outsourcing move to Africa.

Conclusion

Minnesota, Colorado, and Virginia are peers to Africa on the basis of RedMonk.com traffic, and most software companies don’t ignore them. If you aren’t thinking about Africa, it’s time to start. It’s already as significant as a top-10 US state, and it’s just going to get bigger from here.

Update (4/1/13): Joel Martin pointed out that this data also shows a reasonable correlation with Internet users in Africa.

Disclosure: World Bank is not a client.

by-nc-sa

Categories: data-science, employment.

Some external validation on expressive languages

I just got pointed to a really interesting and relevant data source by Ben Racine and wanted to post a short update to note the correlation of my post with a new piece of external information.

The information? The input from ~2,500 developers over on Hammer Principle on the statement, “This language is expressive.” I mapped the top 10 and bottom 10 languages to my own median-based ranking, showing only the top two popularity tiers for simplicity, and got this:

expressiveness_weighted_top_tiers

Interestingly, it’s a very clear correlation — all expressive at one end, all poorly expressive at the other end, and a mix in the middle (indicating a bit of noise).

What do you think?

by-nc-sa

Categories: adoption, data-science, employment.

What does “expressiveness” via LOC per commit measure in practice?

Yesterday’s post ranking the “expressiveness” of programming languages was quite popular. It got more than 30,000 readers in the first 24 hours; it’s at 31,302 as I write this. For this blog, that qualifies as a great audience. After a day’s worth of feedback, thought, and discussion on Twitter, Hacker News, and the post’s comments, I wanted to sum up some of my thoughts, others’ contributions, and things I left out of the initial post.

What are we really measuring here?

As I mentioned as a major caveat in the initial post, lines of code (LOC) per commit is an imperfect metric as a window into expressiveness. It’s measuring something, but what does it mean? My take on these results is that it’s a useful metric when painting with broad strokes, and the results seem to generally bear that out. It’s more helpful in comparing large-scale trends than arguing over whether Ruby should be #27 or #22, which is likely below the noise level. I think the reason some placements seem so weird is that it’s measuring expressiveness in practice rather than in theory. That brings in factors like:

  • The standard library and library ecosystem. Is there a weak standard library? Is there a small or nonexistent community of add-on library developers? In both cases, constructing a commit-worthy chunk of code could require additional lines.
  • The development culture and its norms. Is copy-and-pasting common for this language? Are imported libraries often committed to the project repository (JavaScript is a prime candidate here)? Are autogenerated files committed (e.g., minified JavaScript, autotools configure scripts)?
  • The developer population using it. Especially for third-tier languages, the number of developers is small enough that these results could reflect those developers more than the properties of the language itself. Some of the least-popular third-tier languages have fewer than 10 developers committing during a given month. I would generally disregard anything but the largest differences between third-tier languages, and treat even those with skepticism. Some languages are also more popular for beginning programmers, which could influence the results if the beginners make up a significant chunk of the language’s total userbase.
  • The time frame of its initial popularity. This can result in time-based influences upon tools and methodologies in use. For example, newer languages popularized in the agile and GitHub eras may tend to bias toward smaller, more frequent commits. Languages that grew up alongside waterfall development and slower, centralized version control may be biased more toward larger, monolithic commits. It even carries as far as things like line length — today, wide-screen monitors are common, and many developers no longer restrict their column width to 80 or less. This could have a language-specific impact, where older languages with a great deal of inertia change more slowly to a new “standard” of development. For example, perhaps fixed-format Fortran wasn’t typically maintained in version control at all, and full files were just committed wholesale? That could explain its similarity to JavaScript.
  • Differences in project types by language. If a language is more likely to be used in larger, enterprise projects, this could influence the types of commits it receives. For example, it could get more small bugfixes than new features because it’s a long-lived codebase and requires additional stability. It could also see a different level of refactoring.

So … what should you get out of the results, then?

Frankly, given all the possible variables involved, the biggest surprise here is that the results look as reasonable as they do, at the level of broad, multi-language or cross-tier trends. Here’s what I would tend to believe, and what I would be skeptical about.

  • Believe: multi-language trends
  • Believe: cross-tier trends
  • Believe: large differences between individual languages, but investigate why
  • Believe: highly-ranked languages
  • Be skeptical: anything involving third-tier languages
  • Be skeptical: small differences between individual languages
  • Be skeptical: individual languages that don’t fit into a group of similar ones
  • Be skeptical: low-ranked languages, until investigated

Why do I suggest believing high ranks but not low ones? It’s the Anna Karenina principle, as Tolstoy wrote:

Happy families are all alike; every unhappy family is unhappy in its own way.

While there are a large number of ways to have a high median or high IQR, it seems to me that low values of both would indicate a number of good development practices in addition to a good language.
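For readers who want to try the two summary statistics on their own repositories, here is a minimal sketch. The input (a plain list of lines changed per commit) and the function name are assumptions for illustration; the quartile method is the default used by Python’s `statistics.quantiles`, which may differ slightly from whatever was used for the original rankings.

```python
from statistics import median, quantiles

def loc_summary(loc_per_commit):
    """Return (median, interquartile range) of lines of code per commit."""
    q1, _, q3 = quantiles(loc_per_commit, n=4)
    return median(loc_per_commit), q3 - q1

# Made-up commit sizes; the last commit mimics a vendored library dump.
commits = [10, 12, 15, 20, 25, 30, 5000]
print(loc_summary(commits))  # (20, 18.0)
```

Note that the single 5,000-line commit barely moves either number, which is why the median and IQR are reasonable choices for data with occasional wholesale commits. A language only scores poorly when such commits are common enough to shift the middle of the distribution.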

To wrap things up, I think this is measuring, with a fair amount of noise, a form of expressiveness in practice rather than in theory — a form that includes all the ways code is incorporated into a repository. That makes it an interesting window into a number of potential problems with how specific languages as well as language classes are typically used.

by-nc-sa

Categories: adoption, data-science, employment.
