Once upon a time, the larger the workload, the larger the machine you would use to service it. Companies from IBM to Sun supplied enormous hardware packages to customers with similarly outsized workloads. IBM, in fact, still generates substantial revenue from its mainframe hardware business. One of the under-appreciated aspects of Sun’s demise, on the other hand, was that it had nothing to do with a failure of its open source strategy; the company’s fate was sealed instead by the collapse in sales of its E10K line, due in part to the financial crisis. For vendors and customers alike, mainframe-class hardware was the epitome of computational power.
With the rise of the internet, however, this model proved less than scalable. Companies founded in the late 1990s like Google, whose mission was to index the entire internet, looked at the numbers and correctly concluded that the economics of that mission on a scale-up model were untenable. With scale-up an effective dead end, the remaining option was to scale out. Instead of big machines, scale-out players would build software that turned lots of small machines into bigger machines, e pluribus unum writ in hardware. By harnessing the collective power of large numbers of low cost, comparatively low power commodity boxes, the scale-out pioneers could scale to workloads of previously unimagined size.
This model was so successful, in fact, that over time it came to displace scale-up as the default. Today, the overwhelming majority of companies scaling their compute requirements are following in Amazon, Facebook and Google’s footsteps and choosing to scale out. Whether they’re assembling their own low cost commodity infrastructure or outsourcing that task to public cloud suppliers, infrastructure today is distributed by default.
For all of the benefits of this approach, however, the power afforded by scale-out did not come without a cost. The power of distributed systems mandates fundamental changes in the way that infrastructure is designed, built and leveraged.
Sharing the Collective Burden of Software
The most basic illustration of the cost of scale-out is the software designed to run on it. As Joe Gregorio articulated seven years ago:
The problem with current data storage systems, with rare exception, is that they are all “one box native” applications, i.e. from a world where N = 1. From Berkeley DB to MySQL, they were all designed initially to sit on one box. Even after several years of dealing with MegaData you still see painful stories like what the YouTube guys went through as they scaled up. All of this stems from an N = 1 mentality.
Anything designed prior to the distributed system default, then, had to be retrofitted – if possible – to not just run across multiple machines instead of a single node, but to run well and take advantage of their collective resources. In many cases, it proved simpler to start from scratch. The Google Filesystem and MapReduce papers that resulted in Hadoop are one example of this; at its core, the first iterations of the project were designed to deconstruct a given task into multiple component tasks to be more easily executed by an array of machines.
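As a rough illustration of what deconstructing a task into component tasks looks like, here is a minimal single-machine sketch in Python. It is not Hadoop's implementation or API: the function names are hypothetical, the word-count workload is chosen only because it is easy to follow, and a thread pool stands in for the array of machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(items, n_chunks):
    """Divide one large task (a list of documents) into roughly equal component tasks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_chunk(chunk):
    """Component task: count words in one chunk, independently of every other chunk."""
    counts = Counter()
    for doc in chunk:
        counts.update(doc.lower().split())
    return counts


def word_count(documents, n_workers=4):
    """Run the component tasks in parallel, then merge the partial results."""
    chunks = split_into_chunks(documents, n_workers)
    # Threads are a stand-in (assumption) for separate machines in a real cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Reduce step: combine the per-chunk counts into a single answer.
    return reduce(lambda a, b: a + b, partials, Counter())


if __name__ == "__main__":
    docs = ["the cat", "the dog", "a cat sat", "The end", "dog"]
    print(word_count(docs, n_workers=2))
```

The point of the shape is that no component task needs to know about any other, which is what lets the work spread across many small machines. Everything a real system adds on top (moving data, handling failed nodes, scheduling) is omitted here.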
From the macro-perspective, besides the inherent computer science challenges of (re)writing software for distributed, scale-out systems – which is exceptionally difficult – the economics were problematic. With so many businesses moving to this model in a relatively short span of time, a great deal of software needed to get written quickly.
Because no single player could bear the entire financial burden, it became necessary to amortize the costs across an industry. Most of the infrastructure we take for granted today, then, was developed as open source. Linux became an increasingly popular operating system choice as both host and guest; the project, according to Ohloh, is the product of over 5500 person-years in development. To put that number into context, if you could somehow find and hire 1,000 high quality kernel engineers, and they worked 40 hours a week with two weeks vacation, it would take you more than five years to match that effort. Even Hadoop, a project that hasn’t had its 10 year anniversary yet, has seen 430 person-years committed. The even younger OpenStack, a very precocious four years old, has seen an industry conglomerate collectively contribute 594 person-years of effort to get the project to where it is today.
Any one of these projects could be singularly created by a given entity; indeed, this is common. Just in the database space, whether it’s Amazon with DynamoDB, Facebook with Cassandra or Google with BigQuery, each scale-out player has the ability to generate its own software. But this is only possible because they are able to build upon the available and growing foundation of open source projects, where the collective burden of software is shared. Without these pooled investments and resources, each player would have to either build or purchase at a premium everything from the bare metal up.
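The conversion from person-years to calendar time is simple division, sketched below in Python. The person-year figures are the Ohloh numbers quoted above. Treating one person-year as 50 weeks of 40 hours is an assumption taken from the hiring scenario in the text, not from Ohloh's own methodology, and the function name is hypothetical.

```python
HOURS_PER_WEEK = 40
WORKING_WEEKS_PER_YEAR = 52 - 2  # two weeks of vacation
HOURS_PER_PERSON_YEAR = HOURS_PER_WEEK * WORKING_WEEKS_PER_YEAR

# Person-years of effort as quoted in the text (attributed to Ohloh).
EFFORT_PERSON_YEARS = {"Linux": 5500, "Hadoop": 430, "OpenStack": 594}


def calendar_years_to_match(person_years, team_size):
    """Calendar years a full-time team of `team_size` needs to deliver `person_years`."""
    return person_years / team_size


if __name__ == "__main__":
    for project, effort in EFFORT_PERSON_YEARS.items():
        years = calendar_years_to_match(effort, team_size=1000)
        hours = effort * HOURS_PER_PERSON_YEAR
        print(f"{project}: {years:.2f} calendar years for 1,000 engineers "
              f"({hours:,} engineer-hours)")
```

For Linux, 5500 person-years spread across 1,000 engineers works out to 5.5 calendar years, or 11 million engineer-hours under the schedule assumed here.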
Scale-out, in other words, requires open source to survive.
Relentless Economies of Scale
In stark contrast to the difficulty of writing software for distributed systems, microeconomic principles love them. The economies of scale that larger players can bring to bear on the markets they target are, quite frankly, daunting. Their variable costs decrease due to their ability to purchase in larger quantities; their fixed costs are amortized over a higher volume customer base; their relative efficiency can increase as scale drives automation and improved processes; their ability to attract and retain talent increases in proportion to the difficulty of the technical challenges imposed; and so on.
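Two of these effects, fixed costs amortized over more customers and variable costs falling with purchase volume, can be put into a toy model. The Python sketch below is purely illustrative: the function, the volume-discount rule and every number in it are hypothetical and are not drawn from any provider's actual costs.

```python
import math


def average_cost_per_customer(customers, fixed_cost, unit_cost,
                              discount_per_doubling=0.0):
    """Average cost of serving one customer at a given scale.

    fixed_cost is spread evenly over all customers. unit_cost is the
    per-customer variable cost at a scale of one customer, and it shrinks
    by `discount_per_doubling` each time the customer base doubles
    (a hypothetical stand-in for volume purchasing).
    """
    if customers < 1:
        raise ValueError("customers must be at least 1")
    doublings = math.log2(customers)
    variable = unit_cost * (1 - discount_per_doubling) ** doublings
    return fixed_cost / customers + variable


if __name__ == "__main__":
    # Hypothetical inputs: 1,000,000 in fixed costs, 100 per customer, 5% discount per doubling.
    for n in (1_000, 10_000, 100_000, 1_000_000):
        cost = average_cost_per_customer(n, 1_000_000, 100, 0.05)
        print(f"{n:>9,} customers -> {cost:,.2f} per customer")
```

Whatever values are plugged in, the average cost falls as the customer count rises, and that gap is what a smaller competitor has to make up some other way.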
It’s difficult to quantify these advantages in precise terms, but we can at least attempt to measure the scale at which various parties are investing. Specifically, we can examine their reported plant, property and equipment investments.
If one accepts the hypothesis that economies of scale will play a significant role in determining who is competitive and who is not, this chart suggests that the number of competitive players in the cloud market will not be large. Consider that Facebook, for all of its heft and resources, is a distant fourth in terms of its infrastructure investments. This remains true, importantly, even if its spend were adjusted upwards to offset the reported savings from its Open Compute program.
Much as in the consumer electronics world, then, where Apple and Samsung are able to leverage substantial economies of scale in their mobile device production – an enormous factor in Apple’s ability to extract outsized and unmatched margins – so too is the market for scale-out likely to be dominated by the players that can realize the benefits of their scale most efficiently.
The Return of Vertical Integration
Pre-internet, the economics of designing your own hardware were less than compelling. In the absence of a worldwide network, not to mention less connected populations, even the largest companies were content to outsource the majority of their technology business, and particularly hardware, to specialized suppliers. Scale, however, challenged those economics on a fundamental level, and forced those at the bleeding edge to rethink traditional infrastructure design, questioning all prior assumptions.
It’s long been known, for example, that Google eschewed purchasing hardware from traditional suppliers like Dell, HP or IBM in favor of its own designs manufactured by original design manufacturers (ODMs); Stephen Shankland had an in-depth look at one of their internal designs in 2009. Even then, the implications of scale were apparent; it seems odd, for example, to embed batteries in the server design, but at scale, the design is “much cheaper than huge centralized UPS,” according to Ben Jai. But servers were only the beginning.
As it turns out, networking at scale is an even greater challenge than compute. On November 14th, Facebook provided details on its next generation data center network. According to the company:
The amount of traffic from Facebook to Internet – we call it “machine to user” traffic – is large and ever increasing, as more people connect and as we create new products and services. However, this type of traffic is only the tip of the iceberg. What happens inside the Facebook data centers – “machine to machine” traffic – is several orders of magnitude larger than what goes out to the Internet…
We are constantly optimizing internal application efficiency, but nonetheless the rate of our machine-to-machine traffic growth remains exponential, and the volume has been doubling at an interval of less than a year.
As of October 2013, Facebook was reporting 1.19B active monthly users. Since that time, then, machine to machine east/west networking traffic has more than doubled. Which makes it easy to understand how the company might feel compelled to reconsider traditional networking approaches, even if it means starting effectively from scratch.
Earlier that week at its re:Invent conference, meanwhile, Amazon went even further, offering an unprecedented peek behind the curtain. According to James Hamilton, Amazon’s Chief Architect, there are very few remaining aspects to AWS which are not designed internally. The company has obviously dramatically grown the software capabilities of its platform over time: on top of basic storage and compute, Amazon has integrated an enormous variety of previously distinct services: relational databases, a Map Reduce engine, data warehousing and analytical capabilities, DNS and routing, CDN, a key value store, a streaming platform – and most recently ALM tooling, a container service and a real-time service platform.
But the tendency of software platforms to absorb popular features is not atypical. What is much less common is the depth to which Amazon has embraced hardware design.
- Amazon now builds its own networking gear running its own protocol. The company claims its gear is lower cost and faster, and that the cycle time for bugs is reduced from months to weeks.
- Amazon’s server and storage designs are custom to the company; the storage servers, for example, are optimized for density and pack in 864 disks at a weight of almost 2400 pounds.
- Intel is now working directly with Amazon to produce custom chip designs, capable of bursting to much higher clock speeds temporarily.
- To ensure adequate power for its datacenters, Amazon has progressed beyond simple negotiated agreements with power suppliers to building out custom substations, driven by custom switchgear the company itself designed.
Compute, networking, storage, power: where does this internal innovation path end? In Hamilton’s words, there is no category of hardware that is off-limits for the company. But the relentless in-sourcing is not driven by religious objections – such considerations are strictly functions of cost.
In economic terms, of course, this is an approximation of backward vertical integration. Amazon may not own the manufacturers themselves as in traditional vertical integration, but manufacturing is an afterthought next to the original design. By creating their own infrastructure from scratch, they avoid paying an innovation tax to third party manufacturers, can build strictly to their specifications and need only account for their own needs – not the requirements of every other potential vendor customer. The result is hardware that is, in theory at least, more performant, better suited to AWS requirements and lower cost.
While Amazon and Facebook have provided us with the most specifics, then, it’s safe to assume that vertical integration is a pattern that is already widespread amongst larger players and will only become more so.
For those without hardware or platform ambitions, the current technical direction is promising. With economies of scale growing ever larger and gradual reduction of third party suppliers continuing, cloud platform providers would appear to have margin yet to trim. And at least to date, competition on cloud platforms (IaaS, at least) has been sufficient to keep vendors from pocketing the difference, with industry pricing still on a downward trajectory. Cloud’s pricing advantage historically was the ability to pay less upfront and more over the longer term, but with base prices down close to 100% over a two year period, the longer term premium attached to cloud may gradually decline to the point of irrelevance.
On the software front, an enormous portfolio of high quality, highly valuable software that would have been financially out of the reach of small and even mid-sized firms even a few years ago is available today at no cost. Virtually any category of infrastructure software today – from the virtualization layer to the OS to the runtime to the database to the cloud middleware equivalents – has high quality, open source options available. And for those willing to pay a premium to outsource the operational responsibilities of building, deploying and maintaining this open source infrastructure, any number of third party platform providers would be more than happy to take those dollars.
For startups and other non-platform players, then, the combination of hardware costs amortized by scale and software costs distributed across a multitude of third parties means that effort can be directed towards business problems rather than basic, operational infrastructure.
The cloud platform players, meanwhile, symbiotically benefit from these transactions, in that each startup, government or business that chooses their platform means both additional revenue and a gain in scale that directly, if incrementally, drives down their costs (economies of scale) and indirectly increases their incentive and ability to reduce their own costs via vertical integration. The virtuous cycle of more customers leading to more scale leading to lower costs leading to lower prices leading to more customers is difficult to disrupt. This is in part why companies like Amazon or Salesforce are more than willing to trade profits for growth; scale may not be a zero sum game, but growth today will be easier to purchase than growth tomorrow – yet another reason to fear Amazon.
The most troubling implications of scale, meanwhile, are for traditional hardware suppliers (compute/storage/networking) and would-be cloud platform service providers. The former, obviously, are substantially challenged by the ongoing insourcing of hardware design. Compute may have been first, with Dell being forced to go private, HP struggling with its x86 business and IBM being forced to exit the commodity server business entirely. But it certainly won’t be the last. Networking and storage players alike are or should be preparing for the same disruption server manufacturers have experienced. The problem is not that cloud providers will absorb all or even the majority of the networking and storage addressable markets; the problem is that they will absorb enough to negatively impact the scale traditional suppliers can operate at.
Those that would compete with Amazon, Google, Microsoft et al, meanwhile, or even HP or IBM’s offerings in the space, will find themselves faced with increasingly higher costs relative to larger competition, whether it’s from premiums paid to various hardware suppliers, lower relative purchasing power or both. Which implies several things. First, that such businesses must differentiate themselves quickly and clearly, offering something larger, more cost-competitive players are either unable or unwilling to. Second, that their addressable market as a result of this specialization will be a fraction of the overall opportunity. And third, that the pool of competitors for base level cloud platform services will be relatively small.
What the long term future holds should these predictions hold up and the market come to be dominated by a few larger players is less clear, because as ever in this industry, their disruptors are probably already making plans in a garage somewhere.
Disclosure: Amazon, Dell, HP, IBM and Microsoft are RedMonk clients. Facebook and Google are not.