With GitHub’s 3 millionth user just announced, the time was right to more deeply examine GitHub’s growth since its start back in 2008. Thanks to Francis Irving’s work (graph) at ScraperWiki, I found a way to query monthly growth rather than just relying on periodic announcements. (For those playing along at home, note the search syntax has since changed.)
My goal was to come up with a model for GitHub’s growth, both to understand what kind of rules its growth followed and to better predict the future.
GitHub users as a population
My first assumption was that I could handle this as a population using standard approaches like the exponential or logistic equations, but I started by plotting a couple of things: users over time, and the log of users over time. Getting a feel for the shape of the data should be the starting point for any analysis.
If it’s a population experiencing exponential growth, it should be log-linear (plotting the log of users on the Y-axis rather than raw users should give a straight line), but it’s not, so the growth cannot be treated as exponential. Since GitHub users increase faster than exponential growth allows (tighter curvature on the graph of users over time, higher slope on the log-linear graph), we need a superexponential model to fit the growth.
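As a minimal sketch of that log-linearity check: if growth is exponential, the slope of log(users) between consecutive observations stays constant. The series below are synthetic and purely illustrative (they are not GitHub data), and the function names and the 5% tolerance are assumptions made for this example.

```python
import math

def log_slopes(times, users):
    """Slope of log(users) between each pair of consecutive observations."""
    return [
        (math.log(users[i + 1]) - math.log(users[i])) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]

def is_log_linear(times, users, rel_tol=0.05):
    """True if every log-slope lies within rel_tol of the mean log-slope."""
    slopes = log_slopes(times, users)
    mean = sum(slopes) / len(slopes)
    return all(abs(s - mean) <= rel_tol * abs(mean) for s in slopes)

# Synthetic series for illustration only, NOT GitHub data.
times = [1, 2, 3, 4, 5]
exponential = [1000 * math.exp(0.8 * t) for t in times]          # straight line in log space
not_exponential = [1000 * math.exp(0.2 * t ** 2) for t in times]  # log-slope keeps rising

print(is_log_linear(times, exponential))
print(is_log_linear(times, not_exponential))
```

A real analysis would run the same check on the monthly user counts and would probably look at the plot as well, since a single tolerance hides how the slope drifts.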
After digging into this for a while, I finally discovered a model of populations that might fit — coalition-based growth, described by von Foerster and colleagues in 1960 in a publication in Science. Its essence is in game theory, considering the entire community as a single group in a two-entity game against its environment, due to members’ high level of communications enabling them to form tightly linked coalitions, rather than independent individuals trying to survive. To me, the parallel with collaborative software development seemed quite strong.
The best fit to the data I’ve found so far is described by superexponential growth following the coalition-based equation
$$P(t) = P_0 \, t \, e^{kt}$$
P0 = 49,100 ± 1750; k = 0.54 ± 0.009
where P is the population at time t, P0 is the initial population, and k is a growth constant (i.e., the frequency of growth by a factor e). This equation differs from the more typical exponential growth, $P(t) = P_0 e^{kt}$, by an additional factor of t, which is inserted to account for network effects and indicates that the rate of growth actually increases with time. The results generally make sense, which is always a good check to make: for example, the initial population is much closer to zero than 3 million.
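To make the equation concrete, here is a small sketch that evaluates the coalition-based model with the rounded best-fit values quoted above. It assumes t is measured in years since GitHub's start, which is consistent with the year-10 and year-15 figures discussed later in the post; the function names are made up for this example.

```python
import math

# Rounded best-fit values from the post; t is ASSUMED to be in years.
P0 = 49_100
K = 0.54

def coalition_users(t, p0=P0, k=K):
    """Coalition-based model: P(t) = P0 * t * e^(k*t)."""
    return p0 * t * math.exp(k * t)

def exponential_users(t, p0=P0, k=K):
    """Plain exponential model for comparison: P(t) = P0 * e^(k*t)."""
    return p0 * math.exp(k * t)

for year in (5, 10, 15):
    print(year, round(coalition_users(year)), round(exponential_users(year)))
```

With the same P0 and k, the coalition curve is simply t times the exponential one. Note that this form gives exactly zero users at t = 0, so P0 behaves as a scale factor in the sketch.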
GitHub adoption as diffusion of a new innovation
In the meantime, I’d thrown out a request on Twitter for suggestions on how to model this, and Adrian Cockcroft suggested treating it as diffusion atop a pre-existing social network. This seemed reasonable too, so I started looking into it. It turns out that the logistic function is also used to describe diffusion of innovations, but it’s again log-linear, which doesn’t fit the GitHub adoption data. Then I combined this with some of my previous thoughts that there must be alternate ways to model GitHub based on social-network analysis rather than population dynamics.
When I looked more deeply into the theory of diffusion of innovations, I discovered that it’s often treated using the Bass model. This is really just a combination of exponential and logistic equations with two coefficients, p and q, to model diffusion via broadcast advertising and social interactions, respectively. The Bass model does account for social networks, but its main shortcoming is that it treats them as fully connected and homogeneous (everyone knows everyone, and all people are identical), when in reality they’re often small world / scale-free. That said, I figured it would make sense to start with the simplest possible approximation and see how it did, and here are the results:
Intriguingly, the Bass model produced a nearly identical fit to the coalition-based model using the following equation:
$$P(t) = m \, \frac{1 - e^{-(p+q)t}}{1 + \frac{q}{p} \, e^{-(p+q)t}}$$
p = 0.003 ± 0.0015, q = 0.83 ± 0.042, m = 21,000,000 ± 12,000,000
where P again is the population at time t, p and q are coefficients for advertising and social-network effects, respectively, and m is the total size of the market.
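The Bass equation is easy to evaluate directly. The sketch below plugs in the rounded fit values quoted above and again assumes t is in years; the function name is an assumption for illustration, and because the parameters are rounded, the outputs are only approximate.

```python
import math

# Rounded best-fit values from the post; t is ASSUMED to be in years.
P = 0.003       # advertising coefficient
Q = 0.83        # social-network coefficient
M = 21_000_000  # total market size

def bass_users(t, p=P, q=Q, m=M):
    """Bass model: P(t) = m * (1 - e^-(p+q)t) / (1 + (q/p) * e^-(p+q)t)."""
    decay = math.exp(-(p + q) * t)
    return m * (1 - decay) / (1 + (q / p) * decay)

for year in (10, 15):
    print(year, round(bass_users(year)))
```

Because the exponential term shrinks toward zero as t grows, the curve flattens out just below m, which is what produces the S shape.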
Under that model, you would attribute nearly all of GitHub’s popularity to social effects (word-of-mouth and friends) and nearly none to broadcast advertising. Again, it’s good to see the results generally make intuitive sense.
Regarding the market size m, it’s critical to note that it is commonly underestimated, particularly with the paucity of data here (only a partial curve and no inflection point).
In summary, modeling GitHub adoption as diffusion of an innovation seems to work pretty well, too, despite the obvious simplifications regarding the social network and static market sizing, advertising, pricing, etc.
What about the future?
Understanding the past is useful, but what we really want to do is predict something. So, do these models enable us to do that? Sure — let’s plot the models out to year 10 and see what things look like:
Neither fit is perfect, with some clear systematic errors, but I suspect that won’t be fixable without a more complex model (e.g. heterogeneous social networks) or more data. The coalition model says things increase faster and faster forever (which seems just a tad unrealistic), predicting 100 million users after 10 years. Although I don’t necessarily discard 100 million developers out of hand, I’m definitely skeptical about 2.5 billion at year 15, which makes the model as a whole a little weak. The Bass model, on the other hand, is a more typical S-shaped curve that’s clearly moving toward a maximum, predicting 20 million users at year 10 and 21 million at year 15.
Now, don’t take those numbers as hard figures, because there are huge amounts of uncertainty associated with them. The purpose of this exercise was more about understanding GitHub’s growth model and possibly some near-term prediction.
In the near term, I’d estimate, based on my Bass model, that GitHub will hit 4 million users near August and 5 million near December.
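One way to turn the Bass model into a date estimate is to invert it algebraically and solve for the time at which a given user count is reached. The sketch below does that with the rounded parameters from the fit (p = 0.003, q = 0.83, m = 21,000,000), assuming t is in years. The function name is hypothetical, and since the published parameters are rounded, the absolute times are rough; the spacing between milestones is the more meaningful output.

```python
import math

# Rounded best-fit values from the post; t is ASSUMED to be in years.
P, Q, M = 0.003, 0.83, 21_000_000

def years_to_reach(users, p=P, q=Q, m=M):
    """Invert the Bass model: time t at which P(t) equals `users`.

    Setting f = users / m and E = e^-(p+q)t, the Bass equation
    f = (1 - E) / (1 + (q/p) * E) rearranges to
    E = (1 - f) / (1 + f * q / p).
    """
    if not 0 <= users < m:
        raise ValueError("users must be in [0, m)")
    f = users / m
    decay = (1 - f) / (1 + f * q / p)
    return -math.log(decay) / (p + q)

gap_months = 12 * (years_to_reach(5_000_000) - years_to_reach(4_000_000))
print(round(gap_months, 1))
```

The gap between the two milestones can then be compared with the spacing of the August and December estimates.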
Update: GitHub hit 4 million users in early August and 5 million in December, exactly as predicted.
Disclosure: GitHub has been a client.
Steven H. Noble says:
January 28, 2013 at 5:07 pm
I’d love to see what happens if you just fit to the first 3 years and chart out the modelled prediction vs reality for years 4 and 5.
November 3, 2013 at 1:38 am
It may be off by a bit considering there’s been a lot more (user based) traction within years 3-5 though?, i.e. the linear plot for 3-5 shows a much steeper (exponential) growth than 0-3.
Donnie Berkholz says:
November 11, 2013 at 12:51 pm
So far it’s nailed my prediction of 4 million users to the month, and GitHub’s on track to reach 5 million in late December, also as predicted.
Roundup: Predictions and Analysis on Software Developers and Development – James Governor's Monkchips says:
June 20, 2013 at 9:43 am
[…] GitHub Will Hit 5 Million Users Within a Year, Jan 2013 […]
GitHub, now at 4 million users, aims for future growth by becoming a “department store of collaboration services” | TechCrunch Japan says:
September 11, 2013 at 8:41 pm
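[…] The company’s recent growth has been quite rapid. After passing the 1 million user milestone in 2011, it reached 3 million users in January of this year. The 4 million users it reached in August match exactly the growth curve drawn by RedMonk’s Donnie Berkholz, whose curve has it reaching 5 million within the year. […]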
BAM! GitHub prediction nailed: 4M users in August, 5M in December – Donnie Berkholz's Story of Data says:
December 19, 2013 at 9:30 pm
[…] past growth to predict what would happen over the next year in a post titled “GitHub will hit 5 million users within a year” and […]
What were developers reading on my blog and tweetstream in 2013? – Donnie Berkholz's Story of Data says:
January 6, 2014 at 9:28 am
[…] 2019: GitHub will hit 5 million users within a year […]
GitHub language trends and the fragmenting landscape – Donnie Berkholz's Story of Data says:
May 2, 2014 at 3:48 pm
[…] is a specific community that’s grown very quickly since it launched [writeup]. It was not initially reflective of open source as a whole but rather centered around the Ruby on […]
Playing with GitHub data and researching OSS – current state of art | oskarj.wordpress.com - social informatics geek says:
May 7, 2014 at 6:59 pm
[…] issues. He also analyses new users and new repos. He also analyzed topic of GitHub popularity: GitHub will hit 5 million users within a year, BAM! GitHub prediction nailed: 4M users in August, 5M in […]
GitHub’s vanishing acceleration – Donnie Berkholz's Story of Data says:
September 26, 2014 at 2:49 pm
[…] 2013, I successfully predicted GitHub’s growth from 3 million to 4 and 5 million users respectively, with sub-month […]