
The emergence of Spark

In the continuing Big Data evolution of reinventing everything that happened in HPC a couple of decades ago (with slight modifications), one newer ecosystem that comes up more and more is the Berkeley Data Analytics Stack. Some of the better-known components of this stack are Spark, Mesos, GraphX, and MLlib.

Spark in particular has gained interest due in part to very fast computation in-memory or on-disk, generally pulling from Hadoop or Cassandra (courtesy of a connector). Its programming model uses Python, Scala, or Java, which (especially in the case of Python) is very friendly to data scientists. Coincidentally, Spark 1.3 was released today, and it supports the DataFrame abstraction used both in the popular Python pandas library and in R (for which it has an upcoming API).

This investigation began while I was sitting in a packed Spark talk at O’Reilly’s Strata conference, wondering about overall traction and interest in Spark. On a qualitative level, nearly every talk about Spark at the conference was reportedly packed. This came despite the lack of commercial interest highlighted below, which I wrote more about earlier.


As you can see, the level of commercial interest was quite low. In concert with the much busier talk schedule and talk attendance, this became quite suggestive of a broader effect. It maps well to the adoption curve followed by many new open-source technologies, where early adopters and contributors dominate the ecosystem initially with talks about the state of the technology and about DIY implementations. This is later followed by vendors coming up to speed in terms of commercial offerings and integrations, which are quite low at present.

To investigate whether this was a wider pattern, I took the approach of pulling in a number of data sources across the development community to compare relative interest in Spark and some other technologies in the Hadoop ecosystem for extracting and operating on data.

The first and most surprising data was from Stack Overflow:


In the past year and a half or less, interest in Spark has skyrocketed from minimal to far above every other technology on the chart. This roughly coincides with, and slightly lags, two major events:

  • The project’s move to the Apache foundation; and
  • The founding of Databricks, the vendor behind a significant chunk of Spark development.

Although it’s difficult to deconvolute the effects of these two things, it seems likely that they combined to catalyze the growth of the Spark community.

As another data source, let’s examine Hacker News. In general this tends to be a more bleeding-edge crowd, but this data may slightly temper your enthusiasm:


Unlike Stack Overflow, there’s no enormous spike in the last year. Also, given the limitations of HN search (words vs tags), some noise like discussion about Spark Devices slips into these queries. While less dramatic than the SO data, there is an equally clear emergence over time from the middle of the pack to the dominant technology shown.

It could be that the bleeding-edge crowd here picked up Spark over a longer period of time since mid-2010, while Stack Overflow’s somewhat more conservative audience compressed that same adoption into the past year and a half.

In an attempt to resolve this discrepancy, I looked at a third data source, Google Trends. This is generally indicative of a broad population that, out of all these, best reflects mass adoption. Queries were coupled with “big data” to limit results to a more accurate subset.


It’s intriguing to see Spark’s emergence echoed again here, with a dramatic-appearing spike just in the past few months. We’ll have to follow it over a longer period of time to determine whether that looks like the Stack Overflow data, but it very clearly stands out beyond the peaks of any of these other technologies.

The next question is how Spark is being used. While that’s difficult to infer, the kind folks at Databricks shared some data with us about the users of the Databricks Cloud:

No surprise to see the dominance of SQL. 100% of their customer base uses SQL, often coupled with another language like Python or Scala. Much as my colleague Steve wrote back in 2011, one of the first things added to most NoSQL databases was something that looked a whole lot like SQL. The large usage of Python also supports Spark’s accessibility to data scientists.

Unfortunately the ‘spark’ tag on Stack Overflow is a mess containing both Apache Spark and Flex Spark (part of the old Adobe Flex), so I was unable to take a deeper look at that as another comparison point.

Regardless, it’s clear that Spark is a technology you can’t afford to ignore if you’re looking into modern processing of big datasets.

Disclosure: Databricks, Datastax, and Mesosphere are not clients. A number of Hadoop vendors are clients.


Categories: big-data, data-science.

Strata 2015: Reaching for the business user

This week I’m headed to O’Reilly’s Strata conference in San Jose, which is all about Big Data and more broadly data in general. To get a feel for what’s going to happen there and what the big news is, I repeated my analysis from two years ago and dug through all my pre-announcements to look at the overall themes.

As you might expect, this tends to focus on launches and funded startups vs all companies present or the talks. But it does give a reasonable level of clue as to what the take-homes will be for this year’s attendees, and where they might want to dig into the new hotness in more depth.

Without further ado, here are the themes underlying what 48 companies are announcing this year. Note that the numbers add up to more than 48 because I tagged some announcements that fit into multiple areas.


The top 5 areas of interest are:

  1. Hadoop itself
  2. Analytics and BI
  3. NoSQL databases outside the Hadoop ecosystem
  4. Data integration
  5. Big Data packaging

I highlighted six areas worth noting in red, for one or more of the following reasons:

  • They contrast with two years ago;
  • They’re a major problem for data users; or
  • They’re new, emerging technologies, like Spark.

Two years ago, I received 41 notices rather than 48 so there’s been a slight increase in launches at the show. The primary focuses back then were analytics, databases, and packaging. What’s changed?

The rise of BI (business intelligence)

This year I split analytics into two sections (analytics and the new one, BI), aimed at advanced technical users and business users, respectively. Products and companies that appealed to both were tagged with both rather than artificially segmenting them into one or the other. Together, Analytics/BI was easily the dominant sector with 31% of the overall volume targeting it.

This says a lot about the maturity of the Big Data ecosystem. As it matures, you expect increasingly higher-level applications rather than delivery of raw, low-level building blocks. Analytics tools are about as low-level as shipped apps get, with BI being one level higher because it tends to require more intelligence in the app than in the end user. Farther down the road, look for applications that merely incorporate Big Data rather than being all about analyzing a dataset. Most of them today are heavily customized, but this will change.

To draw an analogy to houses, Hadoop is a bag of ready-mix concrete and some trees. Analytics is cinder blocks, boards, and hand tools; BI is power tools. Horizontal business apps and libraries that are composed into business apps are the contractors building your house. Vertical-specific apps are what the general contractor builds for you, and at scale are built on a common template. As you move up the stack, you lose a little flexibility but you’re able to build upon more and more existing work and expertise.

Packaging is no longer the key blocker

In 2013, the major unappreciated theme was packaging Big Data so it was consumable by end users. That no longer seems to be the case, with packaging dropping down from 2nd to 6th place in the list. This implies that Hadoop has become much easier to get up and running than it was in 2013, when that difficulty was a key blocker to adoption.

Data cleaning remains underappreciated

Only two companies are pushing products that are primarily about data cleaning, which is generally understood to consume 80%–90% of a data scientist’s time. This to me suggests that either it’s a solved problem (unlikely, given the time expenditure), a problem that’s incredibly difficult to solve, or a problem for which the solution is inexplicably difficult to sell.

What happened to NewSQL?

Companies with new, much faster approaches to traditional RDBMS were all the rage a couple of years ago, but this time around they’ve nearly vanished from the public eye. I’ll be looking to see what their presence is like at the conference, but it seems they don’t have much new to announce at this point.

Emerging tech still emerging (Spark, streaming, in-memory)

Much to my surprise, Spark only showed up 3 times. I would’ve expected at least double that presence in the announcements I received. Along with streaming as a whole and in-memory databases, this group formed what I’d call the “emerging tech” category, although I say that with a grain of salt: the technologies themselves have been around for years if not decades, and even a newer streaming option like Storm is now 3.5 years old.

I expect every piece of this area to take off over the next couple of years commercially, as interest within the RedMonk community in these technologies has grown dramatically over the past couple of years. Particularly with the advent of the Internet of Things, streaming technology becomes vital to coping with the data in a timely manner.

Interestingly, in-memory tech has held nearly static, with the exception of Spark. Perhaps that’ll be where the revolution comes from.

(Tangentially, we’re running an IoT developer conf in a few weeks called ThingMonk, in Denver — our first time in the US. Check it out if you want to dig into this!)


To sum up, I expect the growing appeal to the business user via BI and analytics to be among the key takeaways of this Strata conference. Over the next year, especially in the more technologically progressive CA edition, I’ll be looking for increasing uptake of the Berkeley data analytics stack (Spark & friends), streaming tech, and in-memory data processing.

Update [2015/02/17]: Added house analogy.


Categories: big-data, data-science, nosql, packaging.

Cloud outages, transparency, and trust


The ongoing blips and bloops of public-cloud outages, whether planned or unplanned, continue to draw headlines and outrage. And rightly so, since downtime for those who use a single availability zone or even a single region can cost millions in lost business and reputation for companies whose own websites and online stores disappear.

The latest is a much-maligned 40-hour outage on Verizon’s new cloud:

As this tweet shows, the most important part of every outage, planned or unplanned, isn’t the outage itself. It’s everything surrounding it.

It’s the comms, stupid

Much like Bill Clinton’s 1992 rallying cry “It’s the economy, stupid,” cloud providers need to focus on what customers really care about.

Take a look at the CloudHarmony cloud-uptime listings. While AWS is among the top performers, Azure is far from it. Google has a few hours of downtime, and up-and-comer DigitalOcean is more comparable to Azure than AWS.

This suggests to me that outage frequency, within a certain range, isn’t a blocker on adoption of an otherwise compelling cloud provider. The question isn’t which provider is best — but what is the upper limit of what customers find acceptable.

One factor that does very clearly make a difference, however, is communications about the outage. The best-of-breed providers have status sites and Twitter accounts where they post periodic updates, whether an outage was planned or unplanned. Heroku and GitHub are good examples of this. While both sites have their share of downtime, they use strong transparency to maintain the trust of their users.

On the other side of the spectrum is Microsoft, which used to post nice postmortems but has since largely given it up. If you match up their public postmortems with articles pointing out Azure outages, you’ll note a significant disparity, particularly in the last year or two.

I got this bland, unattributed statement courtesy of Microsoft analyst relations:

Reliability is critical to our customers and therefore, extremely important to us. While we aim to deliver high uptime of all services, unfortunately sometimes machines break, software has bugs and people make mistakes, and these are realities that occur across all cloud vendors. When these unusual instances occur, our main focus is fixing the problem, getting the service working and then investigating the failure. Once we identify the cause of the failure we share those learnings with our customers so they can see what went wrong. We also take steps to mitigate that being a problem in the future, so that customers feel confident in us and the service.

We all understand that sometimes things break, because clouds are incredibly complex systems. We’re only really looking for two things out of it: (1) don’t have the same problem twice, and (2) keep us informed. Unfortunately, they aren’t living up to the second half of that. And they’re far from the only ones — see the Verizon example at the beginning of this piece.

As I argued a year ago:

For those wondering what a great postmortem looks like, Mark Imbriaco (in the past at Heroku, GitHub, and DigitalOcean) gives a masterclass here:

Monitorama 2013 – Mark Imbriaco from Monitorama on Vimeo.

And there’s a plethora of examples posted at sites including the following:

If you don’t have trust; if you think old-school opacity is still the right approach; you don’t have loyal customers and they’ll leave you at their first opportunity. Now you’ve seen the examples and the counterexamples — go forth and communicate!

Disclosure: Amazon Web Services, Microsoft, and (Heroku) are clients. GitHub has been. Google, Verizon, CloudHarmony, and DigitalOcean are not.


Categories: cloud, devops, social.

Time for sysadmins to learn data science

At PuppetConf 2012, I had an epiphany when watching a talk by Google’s Jamie Wilkinson where he was live-hacking monitoring data in R. I can’t recommend his talk highly enough — as an analytics guy, this blew my mind:

Since then, one thing has become clear to me: As we scale applications and start thinking of servers as cattle rather than pets, coping with the vast amounts of data they generate will require increasingly advanced approaches. That means over time, monitoring will require the integration of statistics and machine learning in a way that’s incredibly rare today, on both the tools and people sides of the equation.

It’s clear that the analysis paralysis induced by the wall of dashboards doesn’t work. We’ve moved to an approach defined largely by alerting on-demand with tools like Nagios, Sensu, and PagerDuty. Most of the data is never viewed unless there’s a problem, in which case you investigate much more deeply than you ever see in any overview or dashboard.

However, most alerting remains broken. It’s based on dumb thresholds rather than anything even the slightest bit smarter. You’re lucky if you can get something as advanced as alerting based on percentiles, let alone standard deviations or their robust alternatives (black magic!). With log analysis, it’s considered great if you can even manage basic pattern-matching to group together repetitive entries. Granted, this is a big step forward from manual analysis, but we’re still a long way from the moon.
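To make the gap between standard deviations and their “robust alternatives” concrete, here’s a minimal sketch in Python. It compares a classic z-score (mean and standard deviation) with a robust score built on the median and the median absolute deviation (MAD). The latency numbers and both cutoffs are invented for illustration, and this isn’t how any of the tools named here work internally.

```python
import statistics

def classic_zscores(values):
    """Distance from the mean, in units of the sample standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]

def robust_zscores(values):
    """Distance from the median, in units of the scaled MAD."""
    median = statistics.median(values)
    # MAD: the median of each point's absolute distance from the median.
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return [0.0 for _ in values]
    # 1.4826 rescales the MAD so it is comparable to a standard
    # deviation when the data are roughly normally distributed.
    scale = 1.4826 * mad
    return [(v - median) / scale for v in values]

def flag(scores, cutoff):
    """Return the indexes whose absolute score exceeds the cutoff."""
    return [i for i, z in enumerate(scores) if abs(z) > cutoff]

# Hypothetical response times in milliseconds, with one obvious spike.
latencies_ms = [102, 98, 101, 97, 103, 99, 100, 480, 101, 98]

# Illustrative cutoffs: 3 for the classic score, 3.5 for the robust one.
classic_flagged = flag(classic_zscores(latencies_ms), cutoff=3.0)
robust_flagged = flag(robust_zscores(latencies_ms), cutoff=3.5)

print("classic:", classic_flagged)
print("robust: ", robust_flagged)
```

In this toy data the spike inflates the mean and the standard deviation so much that it hides itself from the classic score, while the median and MAD barely move and the robust score picks it out.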

This needs to change. As scale and complexity increase with companies moving to the cloud, to microservice architectures, and to transient containers, monitoring needs to go back to school for its Ph.D. to cope with this new generation of IT.

Exceptions are few and far between, often as add-ons that many users haven’t realized exist — for example Prelert (first for Splunk, now available as a standalone API engine too), or Bischeck for Nagios. Etsy open-sourced the Kale stack, which does some of this, but it wasn’t widely adopted. More recently Numenta announced Grok, its own foray into anomaly detection, which looks quite impressive. And today, Twitter announced another R-based tool in its anomaly-detection suite. Many of you may be surprised to hear that, completely on the other end of the tech spectrum, IBM’s monitoring tools can do some of this too.

On the system-state side, we’re seeing more entrants helping deal with related problems like configuration drift, including Metafor, ScriptRock, and Opsmatic. They take a variety of approaches at present. But it’s clear that in the long term, a great deal of intelligence will be required behind the scenes because it’s incredibly difficult to effectively visualize web-scale systems.

The tooling of the future applies techniques like adaptive thresholds that vary by day, time, and more; predictive analytics; and anomaly detection to do things like:

  • Avoid false-positive alerts that wake you up at 3am for no reason;
  • Prevent eye strain from staring at hundreds of graphs looking for a blip;
  • Pinpoint problems before they would hit a static threshold, like an instance gradually running out of RAM; and
  • Group together alerts from a variety of applications and systems into a single logical error.
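As a rough illustration of the third item in that list, the sketch below fits a straight line to recent memory readings and estimates how long until the trend would cross a static limit. It’s plain least squares under the assumption that usage grows roughly linearly; the readings, the 90% limit, and the function names are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def minutes_until_limit(limit_pct, minutes, used_pct):
    """Estimate minutes from the last reading until the trend hits the limit.

    Returns None when usage is flat or falling, since the fitted
    line never reaches the limit in that case.
    """
    slope, intercept = fit_line(minutes, used_pct)
    if slope <= 0:
        return None
    crossing = (limit_pct - intercept) / slope
    return max(0.0, crossing - minutes[-1])

# Hypothetical readings: RAM usage sampled every 10 minutes.
minutes = [0, 10, 20, 30, 40, 50]
ram_used_pct = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0]

eta = minutes_until_limit(90.0, minutes, ram_used_pct)
print(f"Projected to reach 90% in about {eta:.0f} minutes")
```

A static 90% threshold would stay silent for hours here, while the trend already says when the trouble arrives. Real usage is rarely this linear, so treat the estimate as an early warning rather than a deadline.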

DevOps or not, I’m running into more people and bleeding-edge vendors who are bringing a “data science” approach to IT. This is epitomized by attendees of Jason Dixon’s Monitorama conference. Before long, it will be unavoidable in modern infrastructure.

Want to get started? You could do a lot worse than Coursera’s data-science specialization.

Disclosure: Prelert, Splunk, IBM, and ScriptRock are clients. Puppet Labs has been. Etsy, Metafor, Nagios Inc, Numenta, Opsmatic, Twitter, and PagerDuty are not.


Categories: cloud, data-science, devops, docker, ibm.

Enterprise tech: the still-hot old thing

Every time you turn around, you’re hearing about data science, DevOps, mobile-first, growth hackers, etc. But that doesn’t mean the existing footprint has disappeared — far from it, in fact. Recruiters continue to search in huge numbers to hire enterprise talent, not just for the latest generation of tech unicorns.

This week, LinkedIn released its annual report on the top skills recruiters search for. Prominent on the top 25 were enterprise stalwarts like:

  • Middleware and Integration Software
  • Storage Systems and Management
  • Business Intelligence
  • Java Development
  • SAP ERP Systems

Since recruiter interest links directly to the hiring market, it’s clear that companies continue to search for talent that you could expect to be ubiquitous at this point. This supports the more general assertion that developers as a whole are in short supply.

Consequently, I would argue that the tech industry needs to focus on training of existing tech-savvy folks who aren’t yet developers. I’ve run across quite a bit of anecdata about Salesforce admins who start as administrative assistants and become developers. More recently at Splunk .conf, I came across a Splunk admin who followed the same path by transitioning into a manager of a dev team.

Democratizing development is one thing, but equally important is remembering that it’s a funnel that enables you to bring some of those proto-developers farther down the road. You can’t wait around for the “pipeline” to fix itself starting in grade school.

Disclosure: Oracle (which owns the Java trademark),, SAP, and Splunk are clients. LinkedIn is not.


Categories: big-data, data-science, employment, marketing, salesforce.

Docker, Rocket, and bulls in a china shop

Quick backstory: Docker’s an incredibly popular container technology, and CoreOS built a cloud-native Linux distro around it.

CoreOS just announced a competing alternative to Docker called Rocket. Docker’s official response to the Rocket announcement was very telling, and surprising. It came less than 2 hours after the announcement went up, and it was packed with typos, defensiveness, and aggression.

The basic structure and meaning of the response, in my own words, is:

  • Docker has an enormous community — we own all the mindshare, implying that we’re clearly right.
  • We’re moving up the stack. Since we own the mindshare, this is the right thing to do by virtue of us doing it.
  • We love open source, we swear, although we’re definitely in the right because the majority of people are with us.
  • There’s some minuscule group of people (all vendors, apparently) who disagree with our moves. They must be wrong because we’re taking efforts to point out that they’re vendors and not users. (ad hominem, anyone?)
  • We’re going to imply that the reason Rocket exists isn’t technical or philosophical, by presenting that option as the final corner case (“of course”). Aim being to convince developers that Rocket is just some NIH thing that exists for no reason devs should care about.
  • In bold, at the very end, so as to be the take-home point of the whole post, is a line about “questionable rhetoric and timing”, followed by another implication that Docker Inc knows what’s best since it has this huge ecosystem.

It’s particularly easy to see when you compare the initial post to the current, updated version:

Docker Rocket response

What are the key differences?

  • A host of typos disappear. Their presence indicates this was rushed out the door very quickly. Why might that happen?
  • The update emphasizes their commitment to the ecosystem, rather than solely the ecosystem’s commitment to them;
  • It clearly notes that Rocket’s raison d’être appears to be true technical or philosophical differences; and
  • It removes the bolding on the final paragraph, although the wording remains.

I’d interpret that as Docker’s leadership initially having a panicked knee-jerk reaction. Couple their post with Docker cofounder and CTO Solomon Hykes’ behavior on Twitter and on the Hacker News thread on the Rocket announcement (1, 2, 3, 4, 5), and you’ve got yourself a recipe for disaster.

My experiences with abusive behavior in Gentoo have led me to speak for years on the data and social-sciences research behind negative community interactions. One universally critical point is that you separate technical criticisms from emotional attacks, and Docker has failed to do so in this case. The Rocket announcement has some harsh words, no doubt about it. But taking them personally and then replying emotionally is exactly the wrong thing to do.

Responses from the community have largely been negative to Docker’s behavior throughout this process, with some exceptions:

This comes off as overly defensive and entitled, like “we brought you containers and you stab us in the back!?”

I don’t see why they need to view this as an opportunity to fight back and criticize another app container system, rather than enthusiasm about the continued spread of containers and expressing a desire to cooperate on building open, interoperable standards.

— themgt, December 1, 2014

In longer-form writeups, Daniel Compton had particularly insightful thoughts on the competitive landscape and moves among Docker Inc, CoreOS, Amazon, and Google that nicely complement my colleague Steve’s recent writeup on scale and integration. Matt Asay also wrote up a useful critique of Docker’s actions.

While Solomon would prefer to focus solely on the technology, unfortunately “Field of Dreams” approaches don’t work out so well in real life. Things like marketing, community management, and the barrier to entry really do matter. I’d strongly recommend to Solomon that in the future, he should stay out of any controversies like this, get himself some media training, and stick solely to technical arguments in public as long as he’s representing Docker Inc.

But he’s not alone — the formal statement from Docker was similarly out of touch with reality, in that it was very much focused on inside-out emotional reactions rather than the consequences they would have upon their existing and potential community.

Disclosure: CoreOS and Amazon Web Services are clients; Docker and Google are not.


Categories: cloud, community, devops, docker, open-source.

The reality of IoT today, not hype about 2019

There’s been a lot of hype around the Internet of Things in the past few years, with lots of people talking about wearables and so on. All kinds of fun stuff like smart watches (I own a Pebble myself) and Google Glass. But it’s all seemed very much in the early-adopter stages.

Here’s a graph of Google searches for the “Internet of Things,” and you can see the huge increase in queries over the past year in particular.

The biggest problem with IoT is understanding the difference between the hype and the reality. Everybody’s interested but it’s all lots of handwaving about how magical the future will be. However, I keep running across crazy stuff like windmills and cranes and airplanes that are all connected today, and saving millions of dollars for companies. That’s pretty serious, and by no means is it hype.

So we decided to run an event called IoT at Scale supported by SAP, which is one of those real companies doing real stuff, to help bring together the worlds of the trendy and the industrial. It’s not about SAP tech; they just happen to be really interested in the topic.

We’ll be digging into what’s actually happening today with IoT. It’s easy to hype it and talk about whatever trillion-dollar markets, which can make you lose sight of the fact that there’s real and interesting tech doing real and important things today. It’s just in the business world rather than the consumer world, so we don’t normally think about it.

If you want to try out some real tech, and learn about things that have really been done in IoT at a deeply technical level, check out IoT at Scale. We’ll have a hackday and a day of talks, coming up soon on Oct 16-17 in Palo Alto. Next month, we’re gonna do a version of this in Berlin too.

Disclosure: SAP is a client.


Categories: internet-of-things.

IT must become a service provider, or die

The traditional role of IT departments is shifting, metamorphosing, even vanishing in some cases. It used to be that IT was the “department of no”. But a couple of decades ago, open source became a thing. Suddenly anyone could obtain world-class software without any license cost. Then a decade later, along came the cloud, in the form of SaaS companies like Salesforce as well as IaaS like AWS. With SaaS, anyone could sign up for a subscription-based purchase for a few bucks a month. Most people never did the math to understand what that looked like in the long term, but at least it fit within their purchase limits. With IaaS, anyone could now obtain the hardware as well, for a cost that fits within the typical developer’s expense budget for a single server.

Thus began shadow IT — people buying things that would’ve typically fallen under IT purview, but outside of its budget and control. Most ironically, in some cases shadow IT happened from within IT itself, as a rebellion against its own processes, budgets, and bureaucratic overhead. Before long SaaS and IaaS became dominant methods of procurement for new applications, and even grew existing share in so-called “brownfield” use as well.

However, most IT shops haven’t seriously considered the long-term implications. Departmental budgets coming from marketing and from lines of business are leaving IT, and over the course of a few years, this will transition to subtractions directly from IT’s budget. In other words, departmental budget dedication to IT becomes a voluntary contribution — they’ll put the money wherever it seems most useful, much like college tuition.

So IT must change. Here are two examples, from Mike Kail (Yahoo CIO formerly of Netflix) and from Facebook. In both cases, they’ve transformed the role of IT into a true service organization rather than a gatekeeper. In particular, note that they’ve moved toward approaches reminiscent of self-service (vending machines) and of Apple’s Genius Bar.

Yahoo IT, under new CIO Mike Kail (formerly Netflix CIO). Credits: Mike Kail

Facebook IT helpdesk, circa 2011. Credits: Facebook

Even in “enterprise” level purchases, the role of IT is shifting. Consider the case of Solidfire. As they told me at their analyst day, their solid-state flash arrays start around $200K, and yet they’re adding REST APIs, and their customer base is shifting increasingly toward Fortune 500 IT shops rather than purely service providers.

That’s because IT is becoming an internal service provider in its own right, with the same competitive landscape that its external competitors face. The difference is that its mission must be to provide a lower barrier to entry. From the shadow IT buyer’s point of view, internal IT has the competitive advantage of avoiding much of the purchasing, infrastructure, and billing overhead that external vendors and outsourcers have. IT can transparently monitor to get what it needs while helping users avoid the burdens of registration and payment that they’re accustomed to with public cloud. This is an opportunity, so IT must seize the day.

The next step? That’s true integration with the business, and a focus on business value. But becoming a service provider is a vital step along the way.

Disclosure: and Solidfire are clients. AWS has been a client. Apple, Facebook, and Yahoo are not clients.


Categories: Uncategorized.

GitHub’s vanishing acceleration

In 2013, I successfully predicted GitHub’s growth from 3 million to 4 and 5 million users respectively, with sub-month accuracy.

This time around, my news is less cheerleading and much more concerning. As I began work to follow up on my growth predictions this year, the numbers stopped matching up. Using the old equation, I kept overestimating where GitHub’s user numbers would end up.

Finally I started looking into growth numbers on a monthly basis, and things got a little clearer. It looked like relative growth over previous months might have been slowing down, but the numbers jumped around so much it was hard to tell for sure. So I plotted it and used a fancy smoother called LOWESS, which is particularly good for nonparametric data (i.e. you don’t know what’s in it but want results anyway). Then it got crystal clear:

Methods: Data were acquired from GitHub search API, then LOWESS smoothed with a fraction of 0.5 and 3 iterations.

Although individual monthly data points are very noisy, there’s a clear downward trend over the longer term. Even varying some of the inputs for the LOWESS smoother didn’t change things in a meaningful way. Since GitHub started, it’s been growing a little bit slower (on a percentage basis) every month, even though its userbase is nearing 7.5 million. More explicitly: every month in 2008 got around 10% more new users than the previous month. By late 2014, on the other hand, every month has roughly the same average number of new users.
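As a rough illustration of the analysis described above, the sketch below computes month-over-month growth in new signups and smooths it with a simplified, textbook-style LOWESS (local linear fits with tricube weights, plus robustness reweighting). The signup counts are synthetic, generated with a decaying growth rate plus alternating noise; they are not GitHub’s numbers, and this pure-Python smoother is a stand-in rather than the code behind the chart.

```python
import math
import statistics

def tricube(u):
    """Tricube kernel: weight falls smoothly from 1 at u=0 to 0 at u>=1."""
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def lowess(xs, ys, frac=0.5, iterations=3):
    """Simplified LOWESS: one initial fit plus `iterations` robust refits."""
    n = len(xs)
    k = max(2, math.ceil(frac * n))      # points in each local window
    robust = [1.0] * n                   # robustness weights, all 1 at first
    fitted = [0.0] * n
    for _ in range(iterations + 1):
        for i, x0 in enumerate(xs):
            dists = [abs(x - x0) for x in xs]
            h = sorted(dists)[k - 1]     # distance to the k-th nearest point
            w = [tricube(d / h) * r for d, r in zip(dists, robust)]
            sw = sum(w)
            if sw == 0:
                continue                 # keep the previous fitted value
            # Weighted least-squares line through the local window.
            mx = sum(wi * x for wi, x in zip(w, xs)) / sw
            my = sum(wi * y for wi, y in zip(w, ys)) / sw
            sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
            sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
            slope = sxy / sxx if sxx > 0 else 0.0
            fitted[i] = my + slope * (x0 - mx)
        # Down-weight points with large residuals (bisquare weights).
        residuals = [y - f for y, f in zip(ys, fitted)]
        s = statistics.median(abs(r) for r in residuals)
        if s == 0:
            break
        robust = [(1 - (r / (6 * s)) ** 2) ** 2 if abs(r) < 6 * s else 0.0
                  for r in residuals]
    return fitted

def monthly_growth(new_users):
    """Relative change in new signups versus the previous month."""
    return [cur / prev - 1 for prev, cur in zip(new_users, new_users[1:])]

# Synthetic signups: growth starts near 10% a month and decays toward
# zero, with alternating +/-3 point noise. Entirely hypothetical.
new_users = [1000]
for month in range(1, 24):
    noise = 0.03 if month % 2 else -0.03
    rate = 0.10 * (1 - month / 24) + noise
    new_users.append(round(new_users[-1] * (1 + rate)))

growth = monthly_growth(new_users)
months = list(range(1, len(growth) + 1))
smoothed = lowess(months, growth, frac=0.5, iterations=3)

for m, g, s in zip(months, growth, smoothed):
    print(f"month {m:2d}  raw {g:+.3f}  smoothed {s:+.3f}")
```

The raw column bounces around from month to month, while the smoothed column drifts steadily downward, which is the kind of pattern the chart makes visible. For real work, a maintained implementation such as the `lowess` function in statsmodels (which takes `frac` and `it` parameters) is a better choice than this sketch.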

GitHub has reached an inflection point

Yesterday on Twitter I was talking about inflection points because they’re surprisingly misunderstood, and Ed Saipetch pointed to this excellent visualization of what they are. One example is when your growth begins to plateau, which is indicated by a slower velocity every month. GitHub’s still growing; don’t get confused about that. But it’s not growing as fast as it used to, and if this continues, it will cause its growth to trail off well before I’d predicted.

The other type of inflection point is the one that GitHub needs to target next: shifting back from neutral or deceleration into acceleration mode again.
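To make the two kinds of inflection point concrete, here is a rough sketch that treats monthly user totals as a discrete curve: the first difference is the velocity (new users per month), the second difference is the acceleration, and an inflection point is wherever the acceleration flips sign. The function names are my own and the numbers are invented for illustration. On real, noisy data you would want to smooth first, since raw second differences jump around a lot.

```python
def differences(values):
    """Change between consecutive entries."""
    return [b - a for a, b in zip(values, values[1:])]


def inflection_points(totals):
    """Indices into `totals` where acceleration flips sign.

    A flip from positive to negative marks the start of a plateau;
    a flip from negative to positive marks a return to acceleration.
    """
    acceleration = differences(differences(totals))
    flips = []
    last_sign = 0
    for i, a in enumerate(acceleration):
        sign = (a > 0) - (a < 0)
        if sign == 0:
            continue  # flat stretch: wait for the next nonzero value
        if last_sign and sign != last_sign:
            flips.append(i + 1)  # acceleration[i] is centred on totals[i + 1]
        last_sign = sign
    return flips


# Invented cumulative totals: signups rise month over month, then shrink.
totals = [0, 1, 3, 6, 10, 14, 17, 19, 20]
plateau_starts = inflection_points(totals)
```

Note that the totals never decrease in this example. The curve keeps climbing the whole time; only the pace changes, which is exactly the situation described above.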

Moving beyond the plateau, or dodging it entirely

To regain its acceleration, GitHub has many options. Although I’m not going to be exhaustive, let’s dig into a few of them.

It can provide better offerings to existing audiences, who have stopped signing up in the same “exponential growth”-style numbers that it’s become accustomed to. Increased investment in GitHub Enterprise is one way to go about this, for example through partnerships with current giants who don’t have competitive offerings, or whose customers are requesting GitHub anyway. Embedding GitHub into tooling, whether it’s developer-facing or a backend for an office suite, whether for internal or external use at a company, is another way to advance its position.

The GitHub team could also choose to focus on competitive barriers, trying to make it increasingly easy to migrate code in and increasingly difficult to migrate code out. It could take a page from Michael Porter’s five forces and move up and down the supplier stack, while simultaneously targeting competitors (largely proprietary, as well as entrenched open-source options like CVS and Subversion) and substitutes, like ignoring version control altogether.

Another turnaround strategy is outreach to entirely new audiences, e.g. turning GitHub into a platform play rather than a version-control system for developers. Take, for example, the movements around pulling lawyers and legal code, or data journalists, onto GitHub. GitBook for authoring is another.

What’s missing, in many cases, is applications. Platforms are rarely successful without them, and it takes enough of them to paint a picture of the platform’s potential. GitHub needs to invest in creating more applications for non-coders to make this type of platform play a success. Perhaps GitHub’s Atom editor or Team collaboration app could prove a useful core.

As noted by Ian Bull, geography is another approach to untapped audiences: GitHub search shows around 25K users who report living in China and 23K in India. Those numbers are likely underreported, especially in China, but even so they leave a huge amount of room for growth.

Regardless of the method GitHub chooses, hitting the plateau is inevitable without significant changes in direction.

Disclosure: GitHub has been a client.


Categories: adoption, github, packaging.

Reference architectures belong in code, not pointless PDFs

Every couple of weeks, I get emails about a new reference architecture for something or other, from any one of an endless list of vendors. I inevitably click through to see what they’re talking about, and it’s almost always something hidden behind a registration wall. Every once in a while I’m sufficiently curious to fill out the form, and almost universally I end up getting force-fed a PDF whitepaper.

This is completely the wrong model. We’ve been talking about the importance of lowering the barrier to entry for many years, and a PDF writeup and illustration of a reference architecture is a perfect example of a barrier that is far too high.

The problem is that the distribution model hasn’t changed with the times. As I sit here at VMworld, I’m hearing about how we now have ubiquitous virtual machines and containers, and we have infrastructure as code a la Puppet and Chef. Yet this stuff is still shipped in the same way it’s been shipped for decades, in a form meant to be laboriously translated from illustration into infrastructure, replicated across every single consumer of the architecture.

Why aren’t we shipping reference architectures as code samples? Even dead-tree programming books have been doing this for years. We now have the technology to ship even multi-server descriptions of IT infrastructure, so let’s do it.
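As a rough illustration of what "architecture as code" could look like at its simplest, here is a hypothetical sketch in Python. Everything in it is an assumption for the sake of example: the `Tier` fields, the image name, and the three-tier layout are made up, and a real vendor would more likely ship Puppet manifests, Chef cookbooks, or container definitions. The point is only that a machine-readable description can be validated and handed to tooling, which a diagram in a PDF cannot.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Tier:
    name: str    # hypothetical role name, e.g. "web"
    count: int   # how many identical servers to create
    image: str   # base VM or container image (placeholder value below)
    packages: list = field(default_factory=list)
    depends_on: list = field(default_factory=list)


def validate(tiers):
    """Reject layouts with empty tiers or dependencies on undefined tiers."""
    names = {t.name for t in tiers}
    for t in tiers:
        if t.count < 1:
            raise ValueError(f"tier {t.name!r} needs at least one server")
        missing = [d for d in t.depends_on if d not in names]
        if missing:
            raise ValueError(f"tier {t.name!r} depends on unknown {missing}")


def render(tiers):
    """Serialize the layout to JSON for whatever provisioning tool consumes it."""
    validate(tiers)
    return json.dumps({"tiers": [asdict(t) for t in tiers]}, indent=2)


# A made-up three-tier layout, not any vendor's actual reference architecture.
ARCHITECTURE = [
    Tier("db", 1, "example-base-image", ["postgresql"]),
    Tier("app", 2, "example-base-image", ["python3"], ["db"]),
    Tier("web", 2, "example-base-image", ["nginx"], ["app"]),
]
```

Calling `render(ARCHITECTURE)` produces a document that can live in version control, be diffed between releases, and be tested, none of which applies to an illustration.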

Disclosure: Chef is a client. Puppet Labs has been a client. Docker and Google are not clients.


Categories: devops, packaging, virtualization.