Daniel Gruss, one of the researchers who discovered the vulnerability, called Meltdown “probably one of the worst CPU bugs ever found.” If that isn’t alarming enough, Adrian Colyer says “Spectre is a step-up from the already very bad Meltdown.”
Details about the bugs continue to emerge, but with findings this significant we wanted to begin exploring the repercussions. This piece will summarize the nature of the vulnerabilities, explore the scope of the problems to date, and present the potential costs associated with the fallout.
There are plenty of technical explanations of the vulnerabilities, so the goal of this section is to give a high-level overview of what happened, with extensive links to the articles, papers, and repositories for those who want to explore further.
These exploits take advantage of hardware vulnerabilities that allow attackers to overcome memory isolation. Meltdown and Spectre are distinct attack vectors born of the same class of problem: speculative execution. The net result is that the CPU leaks memory across boundaries we previously thought were safe.
Here’s a quick explanation of the problem, in three acts.
Modern processors use branch prediction and speculative execution to maximize performance. For example, if the destination of a branch depends on a memory value that is in the process of being read, CPUs will try to guess the destination and attempt to execute ahead.
– Spectre Attacks: Exploiting Speculative Execution, Kocher et al.
If the guess was wrong, the results can just be rolled back and it’s as if they never happened. Or at least that’s the plan! But as we all now know, unfortunately some traces do get left behind…
– Meltdown, Colyer
All of the attacks cause information disclosure from higher-privileged or isolated same-privilege contexts, leaked via an architectural side channel, typically the CPU data cache.
– marcan/speculation-bugs, GitHub
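The three quotes above can be illustrated with a toy model. This is only a sketch of the idea, not how any real CPU is implemented and not a working attack: the `ToyCPU` class, its fields, and the `probe` helper are hypothetical names invented for illustration. It shows a read that runs ahead of its permission check, a rollback that restores the register, and a cache entry that survives the rollback.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCPU:
    """Deliberately simplified, hypothetical model of speculation plus a cache."""
    memory: dict                                # address -> byte value
    cache: set = field(default_factory=set)     # which probe lines are cached
    register: int = 0                           # architectural (visible) state

    def speculative_read(self, address: int, allowed: bool) -> None:
        """Execute the read before the permission check resolves."""
        saved_register = self.register
        value = self.memory[address]    # runs ahead of the check
        self.register = value           # transient result
        self.cache.add(value)           # side effect: line `value` is now cached
        if not allowed:
            # The rollback restores the register...
            self.register = saved_register
            # ...but nothing here evicts the cache line added above.


def probe(cpu: ToyCPU) -> list:
    """An observer cannot read the register, only which lines are 'fast'."""
    return sorted(cpu.cache)


cpu = ToyCPU(memory={0x10: 42})
cpu.speculative_read(0x10, allowed=False)
print(cpu.register)  # 0: the architectural state was rolled back
print(probe(cpu))    # [42]: the cache still reveals the value
```

In real hardware the "probe" step is a timing measurement rather than a direct lookup, which is why the leak is described as a side channel.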
Meltdown is so named because “the vulnerability basically melts security boundaries which are normally enforced by the hardware.” It has one known attack variant, and the problem can be addressed with Kernel Page-Table Isolation (KPTI) operating system patches. These patches can degrade CPU performance.
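For readers who want to see whether a Linux machine reports a Meltdown mitigation, here is a minimal sketch. It assumes a kernel (4.15 or later, or a distribution backport) that exposes status files under `/sys/devices/system/cpu/vulnerabilities`; the function name `read_mitigation_status` is hypothetical, and on other systems the directory will simply be absent.

```python
from pathlib import Path

# Assumption: recent Linux kernels publish one status file per vulnerability here.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(directory: Path = VULN_DIR) -> dict:
    """Return {vulnerability name: status line}, or {} if the directory is absent."""
    if not directory.is_dir():
        return {}
    status = {}
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


if __name__ == "__main__":
    for name, state in read_mitigation_status().items():
        print(f"{name}: {state}")
```

On a patched kernel the `meltdown` entry typically reads something like `Mitigation: PTI`, though the exact wording depends on the kernel version.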
The Scope of the Problems
These are vulnerabilities in computer hardware, not software. They affect virtually all high-end microprocessors produced in the last 20 years. Patching them requires large-scale coordination across the industry, and in some cases drastically affects the performance of the computers. And sometimes patching isn’t possible; the vulnerability will remain until the computer is discarded.
– The New Way Your Computer Can Be Attacked, The Atlantic
Intel has received the brunt of the criticism, as Meltdown “applies almost exclusively to chips made by Intel.” As Intel “makes about 90 percent of the world’s computer processors and 99 percent of the server chips in the data centers that effectively run the internet”, Meltdown was indeed a significant exploit.
Meltdown is the easier of the two bugs to exploit, and it is also easier to fix because it can be addressed with an OS patch. (That said, some of these patches had to be rolled back when it was discovered that they caused computers to reboot.)
On the other hand, the scope of Spectre is more alarming because “vulnerable speculative execution capabilities are found in microprocessors from Intel, AMD, and ARM that are used in billions of devices.” Because ARM-based processors in particular are widely used in mobile devices, this means that the Spectre attack “works on mobile phones, tablets, and so on,” meaning the number of impacted devices is nearly impossible to calculate with any precision.
According to Colyer, “Spectre is not really so much an individual attack, as a whole new class of attacks.” The revelation of Spectre has shifted our understanding of modern chip security as it affects almost all chips produced in the last two decades. A list of impacted CPUs is being tracked by Hector Martin and contributors on GitHub.
These vulnerabilities are some of the biggest revelations in computing in the past twenty years. These are material findings, but do they also represent material costs?
It’s not feasible to assign a specific industry-wide dollar value to the Meltdown and Spectre vulnerabilities. The issues are wide-reaching, the size of the impact to a given team will vary based on a myriad of factors, and the timeline for addressing the problems is unknown and closer to “months and years” than “days and weeks.” On top of all this, our understanding of the problem is evolving and we’re learning more every week.
While there are attempts to quantitatively explore costs below, please note that the numbers are not intended to be prescriptive. The goal here is to examine the holistic costs of the problems and how operators and decision makers should think about the impact on their businesses. This is not a “YMMV” caveat; your mileage will vary.
1. Performance Reduction
Intel’s initial guidance about the impact of the patches states that “based on our most recent PC benchmarking, we continue to expect that the performance impact should not be significant for average computer users.” (This statement, however, was issued before the reboot problem came to light.) Intel’s sentiment has been echoed by the major cloud providers. AWS, for instance, stated “we don’t expect meaningful performance impact for most customer workloads.”
As more results of “real world” benchmarking tests have emerged, we are beginning to get a feel for what these ‘insignificant impacts’ may look like. Though early reporting suggested performance hits of up to 30%, those use cases appear to be extreme. As more case studies roll in, impacts appear to be much more manageable. Datadog, for example, found the average impact to the cores they monitor was less than 1%. Links to some of the benchmarking tests we followed are included below.
At this point we can assume that those paying per CPU cycle (i.e. those using cloud service providers) are going to be feeling the cost most directly as most desktops have adequate excess CPU and private data centers have capitalized their compute cost.
For those affected, the performance impact will vary based on:
- the patch: much of the focus thus far has been on the KPTI patches for Meltdown; solutions implemented for Spectre are more complex and thus their impacts are less well-documented.
- the processor: performance impacts vary by chip
- the workload: “standard desktop workloads” that don’t frequently call the kernel will see smaller impacts, while workloads that frequently use the disk or network will see more performance degradation. (Ars Technica)
- the environment: the operating system, SSD vs. HDD storage, platform configurations, etc. can all impact performance.
Hopefully it’s abundantly clear there’s no universal predictor of how the Meltdown patch will impact any particular organization. That said, let’s go ahead and try calculating an example anyway for illustrative purposes…
This PostgreSQL benchmark by Phoronix showed an average performance hit of just over 11% after applying the PTI patch, and aligns with other benchmarks run by the Postgres community. If we take these results as an example, what can this kind of performance impact mean in terms of real dollars?
Based on our 2017 base IaaS pricing analysis, we can make a rough assumption that one hour of basic compute from a cloud provider costs around $0.28. If we assume that someone is running a database with four nodes for a month, then an 11% performance degradation equates to just under $100 in additional monthly expense.
While an extra thousand or so dollars a year may be a rounding error for many companies’ annual budgets, this example is just a small database. Think about all the services and applications that comprise an architecture, and it’s easy to see how rapidly this additional expense could grow, especially for large-scale organizations.
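The arithmetic behind the example above can be sketched as follows. The hourly rate, node count, and 11% figure come from the text; the 730 hours per month (8,760 hours in a year divided by 12) and the function name `extra_monthly_cost` are assumptions made for illustration.

```python
HOURLY_RATE_USD = 0.28   # rough base compute price cited above
NODES = 4                # size of the example database cluster
HOURS_PER_MONTH = 730    # assumption: 8,760 hours per year / 12 months
PERFORMANCE_HIT = 0.11   # Phoronix PostgreSQL result cited above


def extra_monthly_cost(rate: float, nodes: int, hours: float, hit: float) -> float:
    """Treat the slowdown as a proportional increase in compute spend."""
    baseline = rate * nodes * hours
    return baseline * hit


monthly = extra_monthly_cost(HOURLY_RATE_USD, NODES, HOURS_PER_MONTH, PERFORMANCE_HIT)
annual = monthly * 12

print(f"extra per month: ${monthly:.2f}")  # extra per month: $89.94
print(f"extra per year:  ${annual:.2f}")   # extra per year:  $1079.23
```

This follows the simplification used in the text, where an 11% slowdown is treated as 11% more spend. Strictly restoring the original throughput would take 1 / (1 - 0.11), or roughly 12.4%, more capacity, so the figures here are best read as a floor.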
2. Opportunity Costs
One opportunity cost is lost developer time. That can take several forms.
- Sometimes degraded performance requires additional scale out, but sometimes it means slower processing times. Processes that take longer to run can equate to lost developer productivity.
- Things that take longer to run can expose brittleness in test suites, when tests that were formerly green suddenly fail because of the additional time needed to run them. It costs time to identify and correct the issue.
- Fire drills, chem spills… call them what you will, but unexpected urgency can derail roadmaps. Engineers who may have previously been building new features or working on other elements of a product are now focused on mitigating security problems, patching systems, migrating workloads, etc.
Workload migration could be done to optimize speed based on new processing realities, as Lyft has done. However, it may also be part of new discussions around OpSec threat analysis.
In many cases it’s likely that users of the public cloud are better-equipped to address security threats than those operating their own data center, as cloud providers have the expertise and scale to apply patches quickly.
Nonetheless, this is a new class of vulnerability; even perfectly written software is not immune to attack when there are hardware vulnerabilities. The Spectre revelations in particular have probably given some companies pause as they consider what workloads are running in public cloud environments alongside neighboring workloads of unknown origin. While it seems unlikely that anyone would move away from cloud entirely (particularly in the short term), it’s possible that some organizations are evaluating whether they need to transfer their most sensitive workloads in-house.
3. Legal Costs
Three class-action lawsuits have already been filed against Intel. It’s also reasonable to speculate that individual customers (particularly cloud providers) may seek compensation from hardware providers to cover adverse impacts of this incident. It’s too soon to tell how things will shake out and who will bear fiscal responsibility, but regardless of the outcome it’s clear that there will be significant legal costs associated with sorting out liability associated with Meltdown/Spectre.
4. Accelerated Hardware Refresh Rate?
Because the root problem lies within the chip itself, it’s possible that some organizations will reevaluate their hardware’s lifespan. Once new chips are available to correct the underlying issues, some organizations may choose to deprecate their machines sooner than they otherwise would have in order to improve their security.
The Path Forward
Meltdown and Spectre broke our assumptions. They broke our assumptions about the CPU’s memory isolation. They broke our assumptions about device security. They broke our assumptions about what it means to write secure software.
Meltdown also shows that even error-free software, which is explicitly written to thwart side-channel attacks, is not secure if the design of the underlying hardware is not taken into account.
– Meltdown, Lipp et al.
We’ve spent two decades living on borrowed performance, and now we must reevaluate the tradeoffs we’ve unknowingly made between speed and security. The fallout from Meltdown and Spectre will continue for years to come.
Here are some questions I expect we and others in the industry should and will be discussing in the coming months:
- what immediate patching do we need to perform and how does that impact our existing plans and business?
- what future patching or changes do we or our service providers need to perform?
- how efficiently are we able to patch the infrastructure we maintain in the event of similar future issues?
- what are our service providers communicating to us and, perhaps more importantly, how? and how are we communicating with our customers?
- how is this architecture’s real-world performance impacted?
- what are the long-term fixes for hardware going to be?
- how has our understanding of CPU architecture and security changed?
- should we change our expectations about how to craft software and what’s required to create a secure application?
- what new tradeoffs between performance and security should we now consider?
Each of these questions has a “dollars and cents” answer behind it. They may be impossible to calculate on an industry-wide basis because there are simply too many unknown variables, but organizationally both buyers and sellers of technology will be forced to reevaluate underlying assumptions about nearly all their infrastructure, whether it’s self-run or managed by a third party.
The one question that doesn’t have a dollars and cents answer attached to it, however, may be the most important, the one that should keep everyone up at night. If assumptions as seemingly certain as hardware execution models can be violated at will, what other bedrock foundations of computing infrastructure are subject to compromise that we just haven’t discovered yet?
Background reading on Meltdown / Spectre:
- I quite liked CloudFlare’s An Explanation of the Meltdown/Spectre Bugs for a Non-Technical Audience as a beginning basis for understanding the situation. This is an excellent starting point if you’re not at all familiar with the vulnerabilities.
- Matt Klein’s Meltdown and Spectre, explained is a great entry point for diving into specifics for those that want a more in-depth read. He offers useful background on kernel/user memory and CPU cache topologies before explaining the vulnerabilities themselves; the post does an excellent job walking through basics of how the bugs work.
- Adrian Colyer’s The Morning Paper analyses for Meltdown and Spectre attacks: exploiting speculative execution are excellent resources for summarizing the technical literature, and of course the original papers themselves are the key source data. You can find them at Meltdown and Spectre Attacks: Exploiting Speculative Execution.
Here are some of the benchmarking articles/case studies that helped inform this article.
- Microsoft Windows 10
- Epic Games
Disclosure: AWS, Google, Microsoft, Varnish, and MongoDB are all RedMonk clients.