GitHub will hit 5 million users within a year

Donnie Berkholz's Story of Data

With GitHub’s 3 millionth user just announced, the time was right to more deeply examine GitHub’s growth since its start back in 2008. Thanks to Francis Irving’s work (graph) at ScraperWiki, I found a way to query monthly growth rather than just relying on periodic announcements. (For those playing along at home, note the search syntax has since changed.)

My goal was to come up with a model for GitHub’s growth, both to understand what rules that growth followed and to better predict the future.

GitHub users as a population

My first assumption was that I could handle this as a population using standard approaches like the exponential or logistic equations, but I started by plotting a couple of things: users over time, and the log of users over time. Getting a feel for the shape of the data should be the starting point for any analysis.

If it’s a population experiencing exponential growth, it should be log-linear (plotting the log of users on the Y-axis rather than raw users should give a straight line). It’s not, so the growth cannot be treated as exponential. Since GitHub users increase faster than exponential growth allows (tighter curvature on the graph of users over time, higher slope on the log-linear graph), we need a superexponential model to fit the growth.
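The log-linearity check described above can be sketched in a few lines. This is only an illustration on synthetic series, not GitHub's data; the helper names `log_slopes` and `is_log_linear` and the 5% tolerance are assumptions made for the example.

```python
import math

def log_slopes(times, users):
    """Slope of log(users) between each pair of consecutive samples."""
    points = list(zip(times, users))
    return [
        (math.log(u1) - math.log(u0)) / (t1 - t0)
        for (t0, u0), (t1, u1) in zip(points, points[1:])
    ]

def is_log_linear(times, users, rel_tol=0.05):
    """True if every log-slope lies within rel_tol of the mean log-slope."""
    slopes = log_slopes(times, users)
    mean = sum(slopes) / len(slopes)
    return all(abs(s - mean) <= rel_tol * abs(mean) for s in slopes)

# Synthetic illustration only; these are not GitHub's numbers.
years = [1, 2, 3, 4, 5]
exponential = [1000 * math.exp(0.8 * t) for t in years]      # constant log-slope
faster = [1000 * math.exp(0.2 * t * t) for t in years]       # log-slope keeps rising

print(is_log_linear(years, exponential))
print(is_log_linear(years, faster))
```

A series that passes this check is a candidate for a plain exponential fit; one that fails with steadily rising slopes calls for something else.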

After digging into this for a while, I finally discovered a model of populations that might fit: coalition-based growth, described by von Foerster and colleagues in 1960 in a publication in Science. Its essence is in game theory. It treats the entire community as a single group in a two-entity game against its environment, because members’ high level of communication enables them to form tightly linked coalitions rather than act as independent individuals trying to survive. To me, the parallel with collaborative software development seemed quite strong.

The best fit to the data I’ve found so far is described by superexponential growth following the coalition-based equation

$$P(t) = P_0 \, t \, e^{kt}$$
$$P_0 = 49{,}100 \pm 1{,}750, \quad k = 0.54 \pm 0.009$$

where P is the population at time t, P0 is the initial population, and k is a growth constant (i.e., the frequency of growth by a factor e). This equation differs from the more typical exponential growth, $P(t) = P_0 e^{kt}$, by an additional factor of t, which is inserted to account for network effects and indicates that the rate of growth increases with time. The results generally make sense, which is always a good check to make. For example, the initial population is much closer to zero than to 3 million.
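As a quick sanity check, the fitted equation can be evaluated directly. The sketch below uses the rounded parameters quoted above and assumes t is measured in years since GitHub's start (the post does not state the unit explicitly, but this matches the year-10 and year-15 figures discussed later). The function name `coalition_users` is made up for illustration.

```python
import math

P0 = 49_100  # fitted P0 from the post (rounded)
K = 0.54     # fitted k from the post (rounded)

def coalition_users(t_years, p0=P0, k=K):
    """Coalition-based model: P(t) = P0 * t * exp(k * t).

    Assumes t is in years since launch; that unit is an assumption.
    """
    return p0 * t_years * math.exp(k * t_years)

for t in (5, 10, 15):
    print(f"year {t}: {coalition_users(t):,.0f} users")
```

Because the parameters are rounded and carry error bars, the output should be read as an order-of-magnitude check rather than a precise forecast.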

GitHub adoption as diffusion of a new innovation

In the meantime, I’d thrown out a request on Twitter for suggestions on how to model this, and Adrian Cockcroft suggested treating it as diffusion atop a pre-existing social network. This seemed reasonable too, so I started looking into it. It turns out that the logistic function is also used to describe diffusion of innovations, but it’s again log-linear, which doesn’t fit the GitHub adoption data. Then I combined this with some of my previous thoughts that there must be alternate ways to model GitHub based on social-network analysis rather than population dynamics.

When I looked more deeply into the theory of diffusion of innovations, I discovered that it’s often treated using the Bass model. This is really just a combination of exponential and logistic equations with two coefficients, p and q, to model diffusion via broadcast advertising and social interactions, respectively. The Bass model does account for social networks, but its main shortcoming is that it treats them as fully connected and homogeneous (everyone knows everyone, and all people are identical), when in reality they’re often small world / scale-free. That said, I figured it would make sense to start with the simplest possible approximation and see how it did, and here are the results:

Intriguingly, the Bass model produced a nearly identical fit to the coalition-based model using the following equation:

$$P(t) = m \, \frac{1 - e^{-(p+q)t}}{1 + \frac{q}{p} \, e^{-(p+q)t}}$$
$$p = 0.003 \pm 0.0015, \quad q = 0.83 \pm 0.042, \quad m = 21{,}000{,}000 \pm 12{,}000{,}000$$

where P again is the population at time t, p and q are coefficients for advertising and social-network effects, respectively, and m is the total size of the market.
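The Bass equation is just as easy to evaluate. This is a minimal sketch using the rounded coefficients quoted above, again assuming t is in years since launch; the function name `bass_users` is hypothetical.

```python
import math

P_ADVERTISING = 0.003   # fitted p from the post (rounded)
Q_SOCIAL = 0.83         # fitted q from the post (rounded)
MARKET_SIZE = 21_000_000  # fitted m from the post (rounded)

def bass_users(t_years, p=P_ADVERTISING, q=Q_SOCIAL, m=MARKET_SIZE):
    """Bass model: P(t) = m * (1 - e^{-(p+q)t}) / (1 + (q/p) * e^{-(p+q)t}).

    Assumes t is in years since launch; that unit is an assumption.
    """
    decay = math.exp(-(p + q) * t_years)
    return m * (1 - decay) / (1 + (q / p) * decay)

for t in (5, 10, 15):
    print(f"year {t}: {bass_users(t):,.0f} users")
```

The curve starts at zero, rises along an S shape, and levels off at the market size m, which is what distinguishes it from the ever-accelerating coalition model.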

Under that model, you would attribute nearly all of GitHub’s popularity to social effects (word-of-mouth and friends) and nearly none to broadcast advertising. Again, it’s good to see the results generally make intuitive sense.

Regarding the market size m, it’s critical to note that it is commonly underestimated, particularly with the paucity of data here (only a partial curve and no inflection point).
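To make the "no inflection point" remark concrete, here is the standard closed form for where a Bass curve inflects, evaluated with the rounded coefficients above. It assumes t is in years, and given the error bars on p, q, and m it is only a rough guide.

```latex
t^{*} = \frac{\ln(q/p)}{p + q}
      = \frac{\ln(0.83 / 0.003)}{0.003 + 0.83}
      \approx \frac{5.62}{0.833}
      \approx 6.75 \text{ years}

P(t^{*}) = m \, \frac{q - p}{2q}
         = 21{,}000{,}000 \times \frac{0.827}{1.66}
         \approx 10.5 \text{ million users}
```

Under these values the inflection would sit at roughly half the fitted market size, far above the 3 million users observed so far, which is consistent with the data covering only the early part of the curve.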

In summary, modeling GitHub adoption as diffusion of an innovation seems to work pretty well, too, despite the obvious simplifications regarding the social network and static market sizing, advertising, pricing, etc.

What about the future?

Understanding the past is useful, but what we really want to do is predict something. So, do these models enable us to do that? Sure — let’s plot the models out to year 10 and see what things look like:

Neither fit is perfect, with some clear systematic errors, but I suspect that won’t be fixable without a more complex model (e.g. heterogeneous social networks) or more data. The coalition model says things increase faster and faster forever (which seems just a tad unrealistic), predicting 100 million users after 10 years. Although I don’t necessarily discard 100 million developers out of hand, I’m definitely skeptical about 2.5 billion at year 15, which makes the model as a whole a little weak. The Bass model, on the other hand, is a more typical S-shaped curve that’s clearly moving toward a maximum, predicting 20 million users at year 10 and 21 million at year 15.

Now, don’t take those numbers as hard figures, because there are huge amounts of uncertainty associated with them. The purpose of this exercise was more about understanding GitHub’s growth model and possibly making some near-term predictions.

In the near term, I’d estimate, based on my Bass model, that GitHub will hit 4 million users near August and 5 million near December.
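For readers who want to try this kind of milestone estimate themselves, the Bass equation can be inverted in closed form to give the time at which a given user count is reached. The sketch below is an assumption-laden illustration: it uses the rounded coefficients from this post, the helper name `bass_time_to_reach` is made up, and because the time origin and unrounded fit are not given here, its output need not reproduce the month-level dates above.

```python
import math

def bass_users(t, p, q, m):
    """Bass model: cumulative adopters at time t."""
    decay = math.exp(-(p + q) * t)
    return m * (1 - decay) / (1 + (q / p) * decay)

def bass_time_to_reach(users, p, q, m):
    """Invert the Bass model: time at which cumulative adopters equal `users`.

    With F = users / m, solving F = (1 - u) / (1 + (q/p) u) for
    u = e^{-(p+q)t} gives u = (1 - F) / (1 + (q/p) F).
    """
    if not 0 <= users < m:
        raise ValueError("users must be in [0, m)")
    fraction = users / m
    decay = (1 - fraction) / (1 + (q / p) * fraction)
    return -math.log(decay) / (p + q)

# Rounded coefficients from the post; t is assumed to be in years.
p, q, m = 0.003, 0.83, 21_000_000
for target in (4_000_000, 5_000_000):
    t = bass_time_to_reach(target, p, q, m)
    print(f"{target:,} users at t = {t:.2f} (model time units)")
```

Converting the model time into a calendar date requires knowing which date corresponds to t = 0 in the fit.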

Update: GitHub hit 4 million users in early August and 5 million in December, exactly as predicted.

Disclosure: GitHub has been a client.

by-sa

10 comments

  1. I’d love to see what happens if you just fit to the first 3 years and chart out the modelled prediction vs reality for years 4 and 5.

    1. It may be off by a bit considering there’s been a lot more (user based) traction within years 3-5 though?, i.e. the linear plot for 3-5 shows a much steeper (exponential) growth than 0-3.

      1. So far it’s nailed my prediction of 4 million users to the month, and GitHub’s on track to reach 5 million in late December, also as predicted.

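  2. […] The company’s recent growth has been quite rapid. Having passed the milestone of 1 million users in 2011, it reached 3 million users in January of this year. The 4 million users it reached in August matches exactly the growth curve drawn by RedMonk’s Donnie Berkholz. According to that curve, it will reach 5 million within the year. […]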

  3. […] past growth to predict what would happen over the next year in a post titled “GitHub will hit 5 million users within a year” and […]

  4. […] is a specific community that’s grown very quickly since it launched [writeup]. It was not initially reflective of open source as a whole but rather centered around the Ruby on […]

  5. […] issues. He also analyses new users and new repos. He also analyzed topic of GitHub popularity: GitHub will hit 5 million users within a year, BAM! GitHub prediction nailed: 4M users in August, 5M in […]

  6. […] 2013, I successfully predicted GitHub’s growth from 3 million to 4 and 5 million users respectively, with sub-month […]
