James Governor's Monkchips

Reframing and Retooling for Observability


In 2020 Observability is making the transition from being a niche concern to becoming a new frontier for user experience, systems and service management in web companies and enterprises alike. Technology providers in the Application Performance Monitoring (APM), Log Management, and Distributed Tracing categories are all positioning themselves as Observability tools providers this year, which is partly about marketing, but also product management.

There is still quite a bit of disagreement about what Observability even is, but a key driver behind the trend is that, simply, there is a lot more uncertainty in systems and applications than there used to be. Distributed systems are by definition non-deterministic. What is more, change, rather than stability, is now the goal. Where a team used to ship a new version perhaps once a year, and had plenty of time to model its failure modes, the expectation is now for weekly or even daily feature deployments. Modern applications by design have multiple, changing failure modes ("unknown unknowns", if you like), which calls for new approaches to management. We need a feel for overall system health by observing the system, and then sophisticated tools to analyse and query it when problems arise, without bringing the system down in order to do so. Of course, if the system is down, we want tools that quickly help identify why and where the problem is, and get it back up and running, stat.

Observability as a trend ties into the world of DevOps – the app or service isn’t something built by one team and managed by another. Rather, one team builds and maintains the app, and the folks on that team want excellent tooling to understand both overall system health and the health of the components making up that system, for troubleshooting.

Observability allows us to experiment, to test in production, to take a Progressive Delivery approach where we watch a system in production before deploying it more widely.
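Progressive Delivery in its simplest form is a percentage-based rollout: a stable hash of some request attribute decides which users see the new code path, and the percentage is ratcheted up as production telemetry confirms the system is healthy. A minimal sketch, assuming a hash-bucketing scheme (the function name and approach are illustrative, not any particular vendor’s API):

```python
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically bucket a user into 0-99 and compare
    against the current rollout percentage."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# At 0% nobody gets the new path; at 100% everybody does, and each
# user's answer is stable across requests, so their experience is
# consistent while operators watch the new path in production.
users = [f"user-{i}" for i in range(1000)]
exposed = sum(in_rollout(u, 10) for u in users)
```

Because the bucketing is deterministic, widening the rollout from 10% to 50% only ever adds users; nobody flaps between old and new behaviour while you observe.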

I think this post by Cindy Sridharan is a great place to start, building a picture to help the reader understand the difference between Observability and Monitoring.

“Monitoring” is best suited to report the overall health of systems. Aiming to “monitor everything” can prove to be an anti-pattern. Monitoring, as such, is best limited to key business and systems metrics derived from time-series based instrumentation, known failure modes as well as blackbox tests.

“Observability”, on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes. Since it’s still not possible to predict every single failure mode a system could potentially run into or predict every possible way in which a system could misbehave, it becomes important that we build systems that can be debugged armed with evidence and not conjecture.

Honeycomb, whose founder and CTO Charity Majors has arguably done more than anyone else in tech to popularize the idea of Observability as a different approach, argues that we need to move beyond traditional metrics in order to provide value in modern software delivery and ops. Observability is the ability to ask arbitrary questions of your infrastructure, understanding the internal state of the system by interrogating its outputs. For Majors, Observability is about real time query and troubleshooting, with extreme cardinality, and a sampling based approach, underpinning the move to testing in production.

“For those who don’t spend their days immersed in this shit, cardinality is the # of unique values in a dimension. So for example if you have 10 million users, your highest possible cardinality is something like unique UUIDs. Last names will be lower-cardinality than unique identifiers. Gender will be a low-cardinality dimension, while species will have the lowest-cardinality of all: {species = human}.

When you think about useful fields you might want to break down or group by…surprise, surprise: all of the most useful fields are usually high-cardinality fields, because they do the best job of uniquely identifying your requests. Consider: uuid, app name, group name, shopping cart id, unique request id, build id. All incredibly, unbelievably useful. All very high-cardinality.

And yet you can’t group by them in typical time series databases or metrics stores.”
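To make the cardinality point concrete, here is a small stdlib-only sketch (the events and field names are synthetic, invented for illustration): count the distinct values in each dimension, then group by a higher-cardinality field to surface a slow cohort that a global average would hide:

```python
from collections import Counter

# Synthetic request events, shaped like the "wide events" an
# observability tool would ingest.
events = [
    {"request_id": f"req-{i}", "species": "human",
     "build_id": "b42" if i % 2 else "b43",
     "duration_ms": 900 if i % 2 else 30}
    for i in range(6)
]

# Cardinality = the number of unique values in a dimension.
cardinality = {
    field: len({e[field] for e in events})
    for field in ("request_id", "build_id", "species")
}
# request_id is the highest-cardinality field; species the lowest.

# Grouping by build_id is what isolates the slow cohort: half the
# requests (one build) are 30x slower than the other half.
totals, counts = Counter(), Counter()
for e in events:
    totals[e["build_id"]] += e["duration_ms"]
    counts[e["build_id"]] += 1
avg_by_build = {b: totals[b] / counts[b] for b in totals}
```

The metrics-store limitation Majors describes is that pre-aggregated time series would have to allocate a series per distinct value, which is why `request_id`-style fields are exactly the ones you can’t group by there.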

Needless to say, Honeycomb’s product focus maps squarely to Majors’ definitions and rhetorical work. Honeycomb is a really powerful query tool, with a sampling-based approach for rich, deep insights into system performance.

But with new entrants come responses from incumbents. The tech industry is nothing if not reactive. In September last year New Relic put down a marker that it plans to compete in the market for Observability tools, announcing the New Relic One platform, with an aggressive throw down against newer market entrants. It also ran with a definition of Observability, variants of which had been bubbling around since Twitter wrote a post about its “Observability platform” in 2013.

“Engineers at Twitter need to determine the performance characteristics of their services, the impact on upstream and downstream services, and get notified when services are not operating as expected. It is the Observability team’s mission to analyze such problems with our unified platform for collecting, storing, and presenting metrics”.

In this view of the world, Observability is primarily a consolidation question – if you can aggregate logs, performance metrics, and distributed traces (sometimes called the “pillars” of Observability) with a query engine, you’re in a far better position for managing and debugging microservices-based apps. This definition is in many respects a more reductive view of Observability than those put forward by Majors or Sridharan, but it’s also easier to understand.
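In practice that consolidation hinges on a shared correlation key: if every log line and every trace span carries the same trace id, a query engine can join the pillars into one view of a request. A toy illustration (the data structures are invented for the sketch, not any vendor’s schema):

```python
# Toy "pillars": log lines and trace spans sharing a trace_id.
logs = [
    {"trace_id": "t1", "msg": "payment declined"},
    {"trace_id": "t2", "msg": "ok"},
]
spans = [
    {"trace_id": "t1", "service": "checkout", "duration_ms": 1200},
    {"trace_id": "t2", "service": "checkout", "duration_ms": 40},
]

def correlate(trace_id: str) -> dict:
    """Join every log line and span for one request into one view,
    the way a consolidated query engine would."""
    return {
        "logs": [l for l in logs if l["trace_id"] == trace_id],
        "spans": [s for s in spans if s["trace_id"] == trace_id],
    }

view = correlate("t1")
```

The value of the consolidated platform is precisely that this join happens in one place, rather than an engineer copy-pasting ids between a log tool and a tracing tool mid-incident.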

A refresh was important for New Relic – systems and service monitoring and management is a market that doesn’t stand still. Every few years a new stack emerges, and a new set of application performance management (APM) vendors with deep expertise in that stack emerges to serve the new market. Implement, reimplement, rinse, repeat. New Relic CEO Lew Cirne knows this pattern as well as anyone, having sold his company Wily Technology to Computer Associates in 2006.

In November 2019 New Relic acquired the assets and team members of a startup called IOpipe, to give it a stronger serverless story. Any truly end-to-end Observability tool will need to include serverless functions. New Relic is moving into adjacent markets, integrating telemetry across different functions including metrics, events, logs, and traces. It is positioning itself as a platform rather than just an ops tool in order to appeal to developers. To that end, it is rebuilding its UI technology from the ground up and, more importantly in terms of developer interest, standardizing on GraphQL for API access.

Hitherto distinct toolsets are converging. As RedMonk likes to say, categories are “smooshing”. Some vendors are consolidating the market through acquisition – see Splunk’s acquisition of SignalFx in 2019, bringing together log management and APM metrics, or SolarWinds, an APM vendor, acquiring Loggly in 2018. Dynatrace, another New Relic peer, argues that it has always been an Observability platform, because of the comprehensive nature of the data it collects and makes available for analysis.

In terms of this type of market convergence, it’s a bit like Netflix moving into original content creation: “The Goal Is to Become HBO Faster Than HBO Can Become Us.”

New Relic is taking a more organic approach to broadening the systems it can “observe”, building an integrated toolset that maps to changing industry circumstances. Observability is both a new framing concept for product management (and so a chance to make a higher-value proposition to customers) and an opportunity to create a defensive moat against new entrants.

Datadog is well positioned for “cloud native” workloads and had its IPO last year, coincidentally on the same day that New Relic announced its Observability strategy. Weaveworks is building observability tooling to close the loop with its GitOps declarative automation story, as a means to manage and automate the rollout of services and applications to Kubernetes infrastructures. LogDNA, founded in 2015, also makes a virtue of being “built on Kubernetes”. K8s adoption is often seen as a proxy for a move to microservices.

Sumo Logic, a born-in-the-cloud log management provider, is increasingly articulating a more metrics-like story. Humio is on a similar arc – framing logs as the natural basis for real-time service and application troubleshooting, and claiming its all-you-can-eat pricing makes it possible to store and query all system events.

Of course we’re not going to see the cloud hyperscale providers standing still either – they see Observability as rightfully their province. Observability is after all a cloud platform discipline, which is now informing enterprise tech decision-making. It promises to lessen distinctions between a number of product categories which are currently worth billions in their own right. It also involves a new way of working. Developers need to take more responsibility for building observable applications, just as they’ve taken on more responsibility for testing and continuous integration. It will be interesting to see how Observability providers can educate the market in smarter, more effective ways of working.

Some of Observability’s possibilities are a response to changes in software delivery and architecture (ie microservices), while others are a function of new opportunities opened up by lower costs of network, compute and storage driven by hyperscale clouds. Where log storage used to be expensive, for example, it is now becoming cost-effective to collect a much higher volume of events, along with the resources needed to store and process all this telemetry. Some new technical underpinnings are also crucial. Distributed tracing is maturing, and coalescing around standards, notably OpenTelemetry, a merger of the OpenCensus and OpenTracing projects.
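One concrete piece of that standardization is the W3C Trace Context `traceparent` header, which OpenTelemetry uses to propagate trace context between services: a four-part hex string of `version-trace_id-parent_id-flags`. A stdlib-only sketch of the parsing the SDKs do under the hood (the helper function is mine, not an OpenTelemetry API):

```python
import re

# Field widths per the W3C Trace Context format:
# 2-hex version, 32-hex trace id, 16-hex parent span id, 2-hex flags.
TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    """Split a traceparent header into its fields, or None if malformed."""
    m = TRACEPARENT.match(header)
    return m.groupdict() if m else None

# The example value from the W3C Trace Context specification.
ctx = parse_traceparent(
    "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
)
```

It is this shared wire format that lets spans emitted by different vendors’ agents stitch together into a single distributed trace.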

One massive opportunity we’re beginning to see realised is the correlation of system events with GitHub events, so that we can swiftly move from trouble-shooting to source code and back again. The loop is closed between development teams and the code they test and deploy, with an audit trail of changes (GitOps again). Probably most importantly then, Observability is about a mindset shift that affects how development teams think about the apps and services they build. Dmitry Melanchenko of Salesforce captures this well in this post.

“For me, the best definition of observability is that it’s the love and care that creators of a product give to those who operate it in production, even if they operate it by themselves”.

We need to build systems with a view to better Observability. That’s going to have a significant influence on process and infrastructure decisions in 2020.


disclosure: New Relic, Sumo Logic and Salesforce are all RedMonk clients.
