The AWS Well-Architected Framework is a set of best practices to help folks build and deploy applications reliably and securely in the cloud. While not mentioned explicitly in the highest level docs, instrumentation is essential for all of the key pillars – Operational Excellence, Security, Reliability, Performance Efficiency and Cost Optimization. In order to be Well Architected, AWS must be Well Instrumented.
AWS has made some fairly profound moves with respect to instrumentation recently, which I felt were worth outlining, inasmuch as they help move the state of the art forward in Observability. For the great majority of infrastructure companies today AWS is not the competition; it’s the environment in which you compete. AWS has a clear focus on making it easier for third parties to build Observability tools that can help manage apps and services on its platform. The most successful platforms have always enabled third party tools, rather than simply competing against them, and AWS is no exception. Does AWS offer its own tools in most categories? Sure. But it also enables third parties to take advantage of its primitives to build and improve their own services. In many cases these third parties provide a superior user experience. So… about ecosystems for Observability.
Two of the primary platform architectures vying for new application workloads today are “Cloud Native”—aka Kubernetes workloads—and serverless functions, as exemplified by AWS Lambda. For the purposes of this post I use “serverless” to refer to Lambda functions as a compute primitive rather than serverless as a broader umbrella of managed services. Generally, though, we define serverless thus – “a managed service that scales to zero” – so in a future post we will talk more about what observability means in a world of managed services.
In the Cloud Native world open source is an expectation. Instrumentation, telemetry, and core runtimes are all open source. The Cloud Native Computing Foundation provides an imprimatur, as a vendor-neutral home, so adopters feel reasonably comfortable using CNCF technologies. This is the world of Kubernetes, Envoy, Prometheus and OpenTelemetry. AWS is today happily embracing CNCF technologies – after all, anything that drives workloads to AWS is good for AWS. Which may help explain why AWS has 17 different ways to deploy containers.
On the Lambda side of the house things are somewhat different. Open source is not an expectation. The managed service is everything. Adopters are more interested in capability than internals. Yet… one of the criticisms of Lambda is that it’s a black box. Compared to AWS infrastructure services it’s potentially harder to observe the system, and to troubleshoot and react to problems. Third parties are still expected to be able to manage applications that include serverless components, though.
It is therefore worth paying attention to some significant moves AWS is making with both its container services and Lambda architectures to make them easier for third parties to manage.
A note on personnel. In October 2020 Jaana Dogan joined AWS. At the time I wrote:
It will be very interesting then to see what kinds of impacts and choices AWS starts to make in terms of Observability, potentially working with the Cloud Native Computing Foundation to drive further standardisation.
In a tweet about that post, Dogan said:
One more thing. AWS is positioned uniquely to contribute to this space. Observability is uniquely critical for a cloud provider not for the sake of Observability but as a platform to explain itself to its customers.
One open source instrumentation commitment is the AWS Distro for OpenTelemetry, now in preview. AWS has its own supported distribution of OTel, so users can instrument their apps for metrics and traces without juggling a bunch of different SDKs, for consumption by its own tools, Amazon CloudWatch and AWS X-Ray, or by third-party tools. AWS has also committed to an approach where all enhancements are contributed upstream to the community. Amazon also now offers Prometheus as a managed service, furthering its embrace of CNCF technologies.
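To make that concrete, here is a minimal sketch of what instrumenting an app with OTel looks like, assuming the OpenTelemetry Python SDK and an ADOT collector listening on the default local OTLP gRPC endpoint; the service and span names are illustrative only.

```python
# Minimal tracing sketch using the OpenTelemetry Python SDK.
# Assumes an AWS Distro for OpenTelemetry (ADOT) collector is listening on the
# default local OTLP gRPC endpoint (localhost:4317). Service and span names
# are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Wire up a tracer provider that batches spans and ships them to the collector.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order"):
    # The collector configuration decides where spans go next: AWS X-Ray,
    # CloudWatch, or a third-party backend.
    pass
```

The point is that the instrumentation itself stays the same whether the spans end up in X-Ray, CloudWatch or a third-party backend – the collector configuration decides.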
So – on the Cloud Native side AWS is demonstrating a growing commitment to open source. It’s contributing code and recommending that its users join the CNCF community.
You can also use the OTel collector for tracing Lambda functions.
The key new underpinning for managing serverless apps, though, is AWS Lambda Extensions.
You can use Lambda extensions for use cases such as capturing diagnostic information before, during, and after function invocation; automatically instrumenting your code without needing code changes; fetching configuration settings or secrets before the function invocation; detecting and alerting on function activity through security agents; and sending telemetry to custom destinations.
Developers can extend the Lambda execution environment, which opens the door for far more extensive third party instrumentation and so better system Observability. These extensions can hook into any stage of the Lambda function lifecycle, which makes it a powerful model. There are some architectural similarities to the sidecar pattern used for streaming telemetry in Kubernetes environments.
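For a sense of the mechanics, here is a minimal sketch of an external extension’s main loop, based on the documented Extensions API – register with the runtime, then block waiting for the next lifecycle event until shutdown. The extension name is illustrative and the actual telemetry shipping is omitted.

```python
# Sketch of an external Lambda extension's main loop, using the documented
# Extensions API. Error handling and real telemetry shipping are omitted;
# "example-extension" is an illustrative name.
import json
import os
import urllib.request

RUNTIME_API = os.environ["AWS_LAMBDA_RUNTIME_API"]
BASE_URL = f"http://{RUNTIME_API}/2020-01-01/extension"

def register():
    # Register the extension and subscribe to INVOKE and SHUTDOWN events.
    req = urllib.request.Request(
        f"{BASE_URL}/register",
        data=json.dumps({"events": ["INVOKE", "SHUTDOWN"]}).encode(),
        headers={"Lambda-Extension-Name": "example-extension"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers["Lambda-Extension-Identifier"]

def event_loop(extension_id):
    while True:
        # Block until the next lifecycle event for this execution environment.
        req = urllib.request.Request(
            f"{BASE_URL}/event/next",
            headers={"Lambda-Extension-Identifier": extension_id},
        )
        with urllib.request.urlopen(req) as resp:
            event = json.load(resp)
        if event["eventType"] == "SHUTDOWN":
            break  # flush any buffered telemetry here before exiting
        # On INVOKE, collect diagnostics or forward telemetry to a backend.

if __name__ == "__main__":
    event_loop(register())
```

Because the loop runs alongside the function in the same execution environment, a vendor can buffer telemetry during invocations and flush it on shutdown without touching the function’s own code.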
A host of third parties are already building Lambda extensions, including AppDynamics, Check Point, Coralogix, Datadog, Dynatrace, Epsagon, HashiCorp Vault (quite a cool security use case, securing and managing access to secrets before a Lambda function is invoked), Honeycomb, Imperva, Instana, Lumigo, New Relic, Sentry, Splunk, SumoLogic and Thundra.
There is a great thread by Dhruv Sood explaining Lambda Extensions here if you’d like to learn more (shout out to Randall Hunt for surfacing it).
Extensions enable you to capture diagnostic info, auto-instrument your code, fetch config/secrets, send telemetry to custom destinations, & detect/alert on unexpected function activity.
So with both Cloud Native/container architectures and serverless, AWS is investing in making applications on its platform more manageable for and by third parties in terms of traces, logs and metrics. If we’re going to have a successful ecosystem of third party Observability vendors this is critical work, and it can only help folks building and deploying apps on AWS.
Ecosystems matter. When AWS announced its container platform ECS Anywhere, which can be deployed on premises or on edge devices, Datadog was a launch partner.
Another solid enabler for better Observability was the launch of CloudWatch Metric Streams in March – which does pretty much what it says on the tin, streaming near real-time CloudWatch metrics about AWS services to Amazon Redshift or third-party collectors using Amazon Kinesis Data Firehose.
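Setting one up is essentially a single API call. Here is a rough sketch using boto3’s put_metric_stream – the stream name, the Firehose delivery stream and the IAM role ARNs are placeholders you would create separately, and the included namespaces are just an example.

```python
# Sketch of creating a CloudWatch Metric Stream with boto3. The stream name,
# Firehose delivery stream ARN and IAM role ARN below are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_stream(
    Name="metrics-to-partner",
    IncludeFilters=[
        {"Namespace": "AWS/EC2"},
        {"Namespace": "AWS/Lambda"},
    ],
    FirehoseArn="arn:aws:firehose:us-east-1:123456789012:deliverystream/metrics-stream",
    RoleArn="arn:aws:iam::123456789012:role/MetricStreamsToFirehose",
    OutputFormat="opentelemetry0.7",
)
```

Once the stream exists, CloudWatch continuously pushes metric updates to the Firehose delivery stream, which can then forward them on to S3, Redshift or an HTTP endpoint exposed by a third-party vendor.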
Observe, Inc immediately benefited. Observe is a startup building an Observability platform on the cloud-based Snowflake data warehouse, which in turn is built on AWS. One interesting feature of the platform is the cost model – data is stored in S3, which makes the cost of storing the huge volumes of logs and events that instrumentation generates less of a concern than with other platforms. Having planned to build its own agent for data collection, Observe was able to shortcut the process by feeding CloudWatch Metrics directly into the platform, which saved not only development time but also management and future support costs. Here is a post about it.
The release of this new feature allows AWS customers to export CloudWatch metrics data with less configuration, less management overhead, and reduced costs. Traditionally, customers would need to deploy a third-party agent or a Lambda function to poll metrics via the GetMetricData API call. In addition to deploying proprietary agents or collectors, customers would then get hit with an AWS bill of $0.01 per 1000 metrics. If a customer is generating large amounts of metrics over multiple accounts this can quickly become a blocker in terms of exporting metrics to third-party systems.
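To put that pricing in perspective, a quick back-of-the-envelope calculation – the metric count and polling interval below are hypothetical:

```python
# Back-of-the-envelope illustration of GetMetricData polling costs at the
# $0.01 per 1,000 metrics rate quoted above. Metric count and polling
# interval are hypothetical.
metrics = 100_000          # distinct metrics polled each cycle
polls_per_day = 24 * 60    # polling once a minute
cost_per_1000 = 0.01       # USD

daily_cost = metrics * polls_per_day / 1000 * cost_per_1000
print(f"${daily_cost:,.2f} per day")   # $1,440.00 per day, roughly $43k per month
```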
This post isn’t intended to capture every new aspect of AWS instrumentation – that would be literally impossible – but rather to point out that in some key areas AWS is increasingly opening up, making data about infrastructure events more accessible, and becoming more Well Instrumented in the process.
If you’d like to learn more about Observability I strongly recommend you attend O11ycon, June 9th and 10th. I will be on a panel with Doctor Nicole Forsgren, VP of Research and Strategy at GitHub, and Bryan Liles, Senior Staff Engineer at VMware Tanzu, discussing the business benefits of Observability.
Related posts
An Observability Platform, Serverless, and the Smoking Goat
Reframing and Retooling for Observability
This post is not sponsored or commissioned but is an independent piece of research. However some vendors mentioned are clients – including AWS, Dynatrace, HashiCorp, Honeycomb, New Relic, Observe Inc, Splunk, SumoLogic.