In this RedMonk conversation, Jacob Leverich, Co-Founder at Observe, Inc., discusses the evolution of the observability market with RedMonk's James Governor, focusing on the founding of Observe and the architectural decisions that shaped its development. They explore the transition from on-premises to cloud-based solutions, the challenges of data collection and interpretation, and the importance of user context in troubleshooting. The discussion also covers the impact of OpenTelemetry on the industry, the ongoing challenges of cost management in observability solutions, and emerging trends in data management, including the commoditization of data storage and the impact of AI on observability practices.
This RedMonk conversation is sponsored by Observe.
Transcript
James Governor (00:12)
Hello and welcome to this episode of the MonkCast. My name is James Governor, co-founder of RedMonk, and with me today is Jacob Leverich, Chief Product Officer at Observe, Inc. He's an alumnus of Splunk, Kuro Labs, and IBM. And today we're going to be talking about the observability market in general and a little bit about the story of Observe, Inc. So welcome.
Jacob Leverich (00:34)
Yeah, thank you. Good to be here. Good to see you again.
James Governor (00:37)
So yeah, I think one of the things that's really clear in observability, and I guess in technology in general, is: when were you founded, and what underlying infrastructure were you able to take advantage of at that time? In observability, pretty clearly, a product built pre-cloud is going to look very different from a product built after AWS came along and provided all of this infrastructure. And from my perspective, you made some interesting architectural bets when you founded Observe Inc. I'd love to know a little bit about the bets you made and what's happened in the intervening time, because you've had a good run now of engineering and product management. So tell me a bit about Observe, those bets, and how you've seen the market evolve in those intervening years.
Jacob Leverich (01:36)
Right on, right on. Yeah, that's a great question. I guess to start with the context: we were founded in 2017, so seven, eight years ago. Prior to that I had done a run at Splunk, so I got to see that beast from the inside out, and for what it's worth, it was a great product and a great company at the time. It was a really, really fun experience. I learned a lot, and I saw where the technology really worked and where it really solved problems for people, but also, I guess, where a little bit of the legacy was starting to show its face. In particular, it was built way back when as on-prem software, predominantly using local storage, and the transition to the cloud had been a bit of a challenge. That's one piece of the puzzle. But let me start with another thing that dovetails nicely with it, which is that in 2017 there was an emerging desire in the industry for open source, commodity data collection. In particular, when you think about what the APM market looked like in the tens and the teens, you had vendors like New Relic and AppDynamics and Dynatrace that all cut their teeth on having an awesome proprietary agent that could auto-instrument all of your applications and give you tons of insight right out of the box. That's cool, but it was all vendor proprietary. You couldn't really take that data and move it from one place to another. Then one of the other emerging problems was this issue of data silos: you have different teams with different tools, each the best-of-breed tool for their particular application or their particular use case.
And so when you look at the digital operations of a large business, when something goes wrong, it could be any of 10 or a hundred different things, so you need 10 or a hundred different tools to figure out where the problem is. And you get this war room call from hell, trying to get everyone together: hey, can you look at your tool and see what you see, and I'll look at my tool and see what I see, and we'll compare notes. But there's really no standardization or centralization or consistency across these things. So there was this emergent sense that having all these best-in-breed vendor tools was really starting to get in the way and starting to show its cracks, particularly in larger organizations. And so the commoditization of data collection was pretty obvious with the open source ELK stack and things like that: Logstash becoming very, very common for collecting log data, Fluentd also being surprisingly popular for large-scale data collection. Then OpenTracing comes along, this idea of distributed tracing becoming a concept that was already present within the hyperscalers like Google and Facebook in the form of Dapper and Canopy. And so now there's this idea that maybe this is something we could turn into an industry standard, and that's where OpenTracing started to come about. There were lots of cool ideas behind the scenes there. One is that all the signals being discussed with OpenTracing were very much the same signals you needed to do application performance monitoring in terms of understanding, for this particular API request,
how long did it take and what were all the downstream dependencies, so I can troubleshoot the root cause. But it also dovetailed nicely with, I would say, the growing trend towards structured, event-style logging. With OpenTracing there was this idea that you have attributes, semi-structured key-value pairs, that you can attach to all your telemetry. And that was also what was happening on the metrics side, with StatsD adding tags and Prometheus adding labels: this growing trend towards semi-structured key-value data attached to your telemetry. It was happening in the logging space as well, where first Splunk automatically extracts key=value pairs from your unstructured logs, and then people start logging in JSON, so you're getting lots of semi-structured log data too. So there's this move towards open source, open standard, semi-structured telemetry data of all sorts of different shapes and sizes. This is all happening in the teens. So we were thinking to ourselves, well, what's the opportunity here? We thought it was to lean into this trend towards
James Governor (06:24)
There’s a lot going on basically.
Jacob Leverich (06:35)
commodity, semi-structured telemetry for digital applications, and to figure out: where is that data going to go? What are you going to do with it? How are you going to get the same value out of it that you would with the previous generation of best-in-breed tools? One of the things we saw was that, one, the volume of data was going to be profound, because it already was profound. You already had people collecting terabytes or tens of terabytes or even hundreds of terabytes of log data or APM data. And if you want a bunch of teams collaborating, working from the same data, well, you've got to have someplace to put it so they can all log in and see the same data. So you have to deal with the scale problem. But then also, all the data is going to be semi-structured in nature,
which presents an opportunity. Maybe you don't need to do what we did in the old days, which was to build a keyword index on top of everything and use that to find the data. If the data is already semi-structured, and if you have a data store that's good at handling semi-structured data, able to retrieve specific columns and to filter and search based on those columns without having to scan everything, then you might actually have an opportunity to make this computationally more efficient than indexing everything. Then the last piece was that in order to handle that scale, you probably don't want to be using local disk anymore, because it's awfully expensive, especially if you're having to replicate it three times to get the durability you demand of your observability data. So the obvious destination was cloud object storage.
You know, it’s, infinitely scalable. It’s dirt cheap compared to local storage. was kind of the obvious next place for all this data to go, but you needed something that could actually like do an adequate job of the observability use case while storing all the data on top of that thing. Now what was happening in 2017 was very, very interesting, which was that there’s a new generation of commodity databases sort of built for this use case coming onto the market.
And so, you know, like I think Redshift was a good early example of this. think Snowflake was the best example of this in that timeframe, 2017. And they kind of, you know, have like great support for semi-structured data, great support for storing the data at rest in the object storage. And also, you know, it turns out they’re actually very, very good for doing the types of analytics on this data that you need for an observability use case. Now, it wasn’t like a new observation.
that you could use this type of database for this use case. And in fact, something that I had a nice brief experience at Google in 2010. And I got a glimpse of the future.
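To make that columnar, schema-on-read idea concrete, here is a minimal sketch using DuckDB over Parquet, purely as a stand-in for the class of engine being described; the file path, columns, and JSON attribute are all invented for illustration. The point is that the query reads only the columns it touches, rather than relying on a keyword index over everything.

```python
import duckdb

# Semi-structured telemetry landed as Parquet (in object storage, or locally for this sketch).
con = duckdb.connect()

con.sql("""
    SELECT
        json_extract_string(attributes, '$.service') AS service,  -- one key from a JSON column
        count(*)                                     AS errors,
        approx_quantile(duration_ms, 0.95)           AS p95_ms
    FROM read_parquet('telemetry/events-*.parquet')                -- hypothetical path
    WHERE status_code >= 500
      AND ts BETWEEN TIMESTAMP '2025-01-01' AND TIMESTAMP '2025-01-02'
    GROUP BY 1
    ORDER BY errors DESC
""").show()
```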
James Governor (09:37)
Tell me about your MapReduce internship, Jacob.
Jacob Leverich (09:40)
Yeah, so I was on the MapReduce team. I was just a punk in grad school, and I managed to fall into the MapReduce team doing basically performance engineering: trying to figure out, for all these MapReduce jobs running at Google, how we could make them faster and more efficient. I thought I was going to be sitting there writing MapReduce jobs all day long, but in reality I was using a tool called Dremel. Dremel was this internal columnar database for semi-structured data that allowed you to take all the logs coming out of all these jobs and do basic SQL analysis on them, basic long-term analysis like: jobs that look like this perform this fast, which means we should change this default, or whatever. You could do all the basic performance engineering stuff. And that tool was built along the same lines, where you have a separation of storage and compute. All the data is stored in object storage, and the compute is elastically scalable, so you can throw as much hardware at it as you want. It's predominantly optimized for semi-structured data, and it turns out you can organize the data in file formats such that you can run these queries very, very efficiently despite having no pre-known schema or anything like that. So I knew this technology existed to do this at hyperscaler scale, but there was nothing commercial on the market at that time that quite resembled it. So when Snowflake started to take off, and I could see the parallels between that technology and what we had internally at the hyperscalers, it was like, man, the future might be now. This might be the right time to bring this type of technology to bear in the observability context.
And so to recap where we are so far: data collection and instrumentation is commoditizing, you're going to have vast amounts of semi-structured data, and there's a new wave of databases that can handle that scale and that shape of data very cost-effectively and very performantly. So there's new technology to bring to bear to help deal with this coming wave of commodity data. But at the same time, you still want to solve this problem of: okay, you have a big organization, you have ten different tools, and whenever there's an outage everyone has their hair on fire on a war call at 2am. You want to solve that problem. So it's not enough to just dump this data directly into the database and hope everyone can make sense of it. At the end of the day, this semi-structured data is still pretty hostile; it's pretty hostile to just look at all this JSON and try to make sense of it. Ideally you want it to be reasonably interpretable: you want to know I'm looking at web logs versus I'm looking at some database log, to have some way to interpret all this data. But you also want it to be navigable. You want to be able to say, hey, this request generated this web log as well as this database query. It's contextualized; I can actually follow the breadcrumb from one place to another. So in our minds back then, this struck us as a couple of things. One is that it's essentially a relational data modeling problem. You have all this semi-structured data coming in, and some of these attributes are actually identifiers that string it all together. They're foreign keys to some other table or some other kind of record.
James Governor (13:27)
really can provide that context when we need to troubleshoot or
Jacob Leverich (13:31)
That’s right. that’s right. So like, you hey, I saw this error. Well, like who did it impact? Well, there’s a user ID attached to that error log. And so I can go look at that and ask the question. OK, has this user had other errors? You know, and just like following the red curve from one to another is sort of like the obvious way to like help people speed up this troubleshooting process and impact analysis process. It’s just typical for all this this this sort of firefighting and.
And then the other was that, you know, often, you know, all these identifiers that they’re the present in this data, but, but you kind of never really have a user experience that, that really puts it front and center and sort of says like, you know, Hey, I’m sure I have all these logs and all these logs are referencing servers or the referencing users that referencing like microservices or whatever. But like, where is the list of users and where is that list of microservices like
More often than not, I have a question about that thing. Like I have a question about my favorite user, Frank, and like, like he filed a support ticket. He apparently had a bad day. Well, what was his day like? Let me go find Frank and then find all the information about him. And now I can figure out like what he was so angry about, you know, as, as opposed to thinking about it just in terms of the telemetry data, you know, it’s so again, in terms of like, how do we, how do we like make this, this
this troubleshooting process when you have like multiple teams and multiple tools and multiple different views of the data, like how can we actually make that more efficient? And so that was, again, one of the things that we saw, it was like sort of holding, I’d say the industry or the observability practice as a whole back was kind of obsessing about the telemetry alone and not thinking as much about the workflow and sort of like the high level like business purpose.
solve Frank’s problem. That’s all I care about. don’t care about all the guys.
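The Frank example is, at bottom, a join: the error events carry a user_id that acts as a foreign key into a derived users entity table. A toy sketch of that follows, again using DuckDB only as a stand-in engine, with all table and column names invented.

```python
import duckdb

con = duckdb.connect()

# Hypothetical raw error events and a "user" entity table derived from the same telemetry.
con.sql("""
    CREATE TABLE events AS SELECT * FROM (VALUES
        (TIMESTAMP '2025-01-01 10:00:00', 'checkout', 'frank', 500, 'card declined'),
        (TIMESTAMP '2025-01-01 10:02:00', 'checkout', 'frank', 500, 'card declined'),
        (TIMESTAMP '2025-01-01 10:05:00', 'search',   'alice', 200, 'ok')
    ) AS t(ts, service, user_id, status, message)
""")
con.sql("""
    CREATE TABLE users AS SELECT * FROM (VALUES
        ('frank', 'premium', 'frank@example.com'),
        ('alice', 'free',    'alice@example.com')
    ) AS t(user_id, plan, email)
""")

# "Frank filed a ticket -- what was his day like?": follow the foreign key from the
# error events to the user entity instead of grepping raw logs.
con.sql("""
    SELECT u.user_id, u.plan, e.service, count(*) AS errors, min(e.ts) AS first_seen
    FROM events e
    JOIN users u USING (user_id)
    WHERE e.status >= 500 AND u.user_id = 'frank'
    GROUP BY 1, 2, 3
""").show()
```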
James Governor (15:30)
It always has to be about the job to be done. What does it look like for the end user or the end users? As you've described, you're on a call with a bunch of people, each one in three different systems, all trying to work out what's going on. There's a lack of context, a lack of consolidation, and that just makes things a lot harder. And I think one of the other contexts that we haven't really described here is that, for all the difference in underlying infrastructure, other people are of course taking advantage of that in building applications too. If you were building a tool for performance management previously, or observability now, we're in a very different world from shipping a few changes every six months, because you've got organizations doing tens of deploys a day, if not more. So there are a lot more changes, and a lot more context is needed. Now, I described a bet, so tell me a bit about that. I tend to say that the best packager in any tech wave tends to win and win big. Arguably Snowflake has been one of those packagers, but then you're being a packager of the observability workflows and jobs to be done on top of Snowflake. That was very much, I think, your founding story.
Jacob Leverich (17:05)
Yeah, yeah, I can talk a little bit about that. Even when we started, people had already begun to see the writing on the wall as far as the technology goes, in terms of handling the scale and variety of this data, and they were trying to load this type of data in
James Governor (17:27)
I mean, turns out the lakehouse is great for this kind of stuff.
Jacob Leverich (17:31)
Right, right. So people were trying to load this data into Snowflake, they were loading it into BigQuery, and people were trying to build their own homegrown, bespoke solutions for doing this stuff. I think the main challenge people run into is that the front end for those tools tends to be SQL, and not everyone is an absolute expert at SQL, or an absolute expert at building a data pipeline. Starting with just the raw data and ripping out a SQL query is actually very, very challenging. It's hard to author the query; a lot of people don't have the time when they're firefighting at 2 AM to think about, do I want a left join or an inner join here? What am I doing? And also, if you just dump the raw data into a single table and try to rip on it, you're probably going to have a bad time. You have a much better time if you can segregate the data into distinct sub-units, like production versus non-production, or microservice A versus microservice B, so you can take better advantage of these scan-based databases. So the database itself is not batteries-included, and our task was to figure out how to fill in those gaps and make sure you could actually do a world-class observability use case on top of this technology. Which was not trivial. And it was kind of cool, because we could see people trying to do it already, so we could take the best ideas from people doing it homegrown and adapt them to our use case. And I guess, maybe jumping forward: well, we made some of these bets, so how did some of those bets evolve over time?
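On the segregate-the-data-into-distinct-sub-units point, one common way to get that effect is a hive-partitioned layout in object storage, so that a query scoped to one environment and service only touches those files. A minimal sketch with pyarrow; the paths, columns, and values are invented.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A small batch of semi-structured events (in reality these arrive continuously, at volume).
events = pa.table({
    "ts":      ["2025-01-01T10:00:00Z", "2025-01-01T10:00:01Z"],
    "env":     ["prod", "staging"],
    "service": ["checkout", "checkout"],
    "status":  [500, 200],
    "body":    ['{"msg": "card declined"}', '{"msg": "ok"}'],
})

# Hive-style partitioning: files land under env=.../service=.../ so a scan-based engine
# can skip everything outside the slice a query asks about (say, prod checkout errors).
pq.write_to_dataset(events, root_path="telemetry", partition_cols=["env", "service"])
```

A query for production checkout errors then only reads telemetry/env=prod/service=checkout/, which is a large part of what makes the scan-based approach tolerable without a keyword index.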
James Governor (19:20)
What’s the payoff?
Jacob Leverich (19:23)
Yeah. I mean, for us, I think we found that we did a good thing. We actually built a solution that works really, really well, particularly at scale. It solves a lot of these consolidation problems, a lot of the scale problems, and a lot of the performance problems people had with the older solutions. And we've started to see other people follow a little bit in our footsteps; we're starting to see people building on top of ClickHouse, following a similar architecture to ours. So I think we made the right bet there. That part was good. As far as OpenTracing was concerned, we were obviously way too early for that, and OpenTracing kind of went out the window because it was too difficult for people to adopt, and too difficult to go in and do that manual instrumentation for distributed tracing. In retrospect, that was obvious. It took us about 10 customer conversations to figure out, okay, OpenTracing is not happening anytime soon. But since then, OpenTelemetry has seen a really, really interesting…
James Governor (20:27)
Honestly, OpenTelemetry is just unbelievable, the momentum behind that project. I think initially, and probably even now, it's still somewhat vendor driven; they know it's something that customers want. But the speed and the sheer ubiquity of OpenTelemetry, I think, is super interesting. I mean, I think it's now the second biggest project at the CNCF, and literally everyone is implementing it. It's still not easy for the end users, but knowing that it's supported, and that you therefore have a level of portability, I think is super important. It's a really interesting standardization story for the industry as a whole. So yeah, it absolutely makes sense. But tell us a bit more about that. I'm particularly interested: are customers ready? If we think about some of the pretty gnarly SDK issues and so on, OTel is not exactly simple. So how do you help customers adopt it, and how do you see that evolving?
Jacob Leverich (21:40)
Yeah. So I agree, the momentum and excitement behind it has been kind of amazing; it's caught us by surprise over the past couple of years. One of the things I'm thankful for is that it still matches a lot of the philosophy of OpenTracing that we had originally designed for, where structured logging is a much more convenient way to deal with this data than unstructured data. What are spans? They're basically just structured logs. And so that's nice: you end up with all this semi-structured data that's a nice fit for modern data management systems. But it was also the same trend, almost a social trend, towards commodity, vendor-neutral instrumentation. There has clearly been a groundswell of demand for getting away from anything that's going to lock me into my observability vendor and lock me into overages for the rest of my life, and towards something that allows me to actually move my data around and not make it unbearable to do that transition if I need to in the future. But to your point, it's not perfect. It's not as mature as a lot of the best-in-breed automatic instrumentation systems. I think the thing that changed palpably,
maybe a year and a half or two years ago, was that the auto-instrumentation for major web application frameworks became mature enough that you basically had single-line installers, or single-line includes, for Django and for Ruby on Rails and for Java and for .NET, and you started to see good enough auto-instrumentation that the dog could start to hunt. Now, that doesn't mean it's perfect for everything. We've definitely seen that for other runtimes and languages, certainly the statically compiled ones like C++ and Go, it's a little more of a pain to get started. Although thankfully in Go there's enough of a standard operating practice for adding context to web requests that there are good design patterns for including distributed tracing and OpenTelemetry. You also see oddball languages like Julia that are just not ready. So a lot of our guidance to customers is really along the lines of: choose boring technology. When you get off the beaten path and you're using things that aren't a top five or top ten language or web application framework…
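For a sense of what single-line auto-instrumentation looks like in practice, here is a minimal Python sketch using the OpenTelemetry Flask instrumentation; it assumes the opentelemetry-sdk and opentelemetry-instrumentation-flask packages are installed, and exports spans to the console just so they are visible.

```python
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

# Minimal SDK setup: print spans to the console so you can see what gets generated.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = Flask(__name__)

# The "single line": every inbound request now produces a span with the route,
# method, status code, and timing, without touching any handler code.
FlaskInstrumentor().instrument_app(app)

@app.route("/checkout")
def checkout():
    return "ok"

if __name__ == "__main__":
    app.run(port=8080)
```

The zero-code variant of the same idea is running an unmodified app under the opentelemetry-instrument command-line wrapper, which is closer to what the single-line installer framing refers to.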
James Governor (24:17)
You’re always going to have a lot to do.
Jacob Leverich (24:20)
Yeah. so, so it’s kind of the first advice is always just like choose boring technology and try not to really get too creative with how you do the OpenTelemetry instrumentation. The other thing is to, is to also recognize that, the distributed tracing is one thing and it’s really, really nice if you’re going to do full on APM, but, but logs and metrics are still very important signals.
And so the other shoe that dropped in the OpenTelemetry space, you know, not long ago was I would say the maturation of OpenTelemetry Collector. The basically an open source agent that I can just run somewhere as a replacement for Fluent Bit or Logstash or whatever. And it does a pretty good job at tailing logs and sending them to whatever destination you want to send them to. It does a pretty good job of Prometheus scraping.
And so you can keep your Prometheus metrics instrumentation in place and just scrape with it and send it wherever you want to. It also, you know, obviously serves as a, as like a great destination for your OpenTelemetry spans and OpenTelemetry metrics, if you started to adopt that. And so OpenTelemetry Collector is sort of turned into this gateway to the future where it’s like, okay, you can continue using your logs and metrics. basically use them as the same way you always have. Now you can start to just layer on.
auto instrumented spans as convenient to flesh out the story, particularly if you want to do like downstream database monitoring or something like that. so we’ve seen it start to graduate in its maturation like that first as like a great conduit for logs and metrics. And then we’re starting to see.
quite adequate or quite viable or even good APM solutions based on OpenTelemetry auto instrumentation. so this thing’s really starting to pick up steam is from where I sit.
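The Collector-as-gateway shape, roughly: log tailing, Prometheus scrape targets, and the downstream destinations all live in the Collector's own configuration, while application SDKs just point OTLP at it. A minimal Python producer, assuming a Collector is listening on localhost:4317 and the OTLP gRPC exporter package is installed:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify this service; the Collector and whatever backends it exports to key off this.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))

# Ship spans to the Collector over OTLP/gRPC. Routing, filtering, and the choice of
# backend(s) then live in the Collector's pipeline config, not in application code.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge-card") as span:
    span.set_attribute("payment.provider", "hypothetical")
```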
James Governor (26:12)
Yeah, that’s pretty much what we’re seeing.
As is always case, there’s plenty of work to be done. There are edge cases, as you say, try and standardize on things that are, as you say, perhaps boring or more normal stuff. Unless you’re an organization that particularly enjoys that sort of, good, I’m going to deal with the gnarly edge cases. So that standardization is happening. Let me just change gears slightly, because we’ve got all of these events that we can store.
I think for me, one of the big questions about observability, and this gets to that initial bet you made, and whether we could separate compute from storage, what we could do with the data lake house approaches. And really that’s around cost. That one of the big questions in observability is how the hell am I going to store all of this stuff at a cost that is not going to break the bank?
So, you know, from an observed perspective, when you’re talking to customers, and also how do you see this evolving? mean, there’s been the question of, you know, indices, what should be our strategy there? Polling, what should be our strategy there? Like, can we really store everything at a cost that’s reasonable enough that we can… Yeah, I think that’s the basic question. Do we store it? We need to find a way of not doing that.
Do we need to be hierarchical about the storage management? What does it look like from the customer point of view? Because frankly, I think every generation of observability technology or metrics or initially it looks cheaper. And then you realize once you’re starting to feed the beast that actually the costs are rising. So talk a bit about that and how you manage that on behalf of customers. And then maybe a little bit about the sort of scale, you you mentioned those.
high-scale customers, what that looks like in practice.
Jacob Leverich (28:13)
Yeah, so I think it’s great question. So I guess the interesting thing here is that thinking about the cost of these solutions, now that we have solutions that adequately make use of object storage, do the separation of storage and compute trick, you actually get the opportunity to think about the cost of storage and the cost of processing the data independently.
Just not, which is something that you couldn’t really do before, because when you had all the data stored on like a local disc, you know, if I wanted like more retention, well, I had to add more disks. And if I was going to add more disks, I might need to add more servers. And so I kind of like these two things were like tied together and it kind of made like sizing and sort of capacity planning, just like all kind of a mess. It was kind of challenging to do it well. And so when all the data stored in object storage, you know, you know, it’s just, it’s easy to forget that like S3 is dirt cheap compared to EBS.
And also the fact that it does erasure coding instead of replication means that the storage efficiency is much, higher even than doing just naive replication on local disk. then, you’re not… One of the of, guess, gnarly secrets of doing keyword indexing for everything is that the index is often about the same size as the raw data.
And so you end up doubling the amount of storage you have to have, you just to like build this index. And so it just kind of all these like things that are sort of negatives with like kind of the traditional way that people have done this. And so storing all the data in a, uh, know, columnar form in object storage, um, the, the, the storage cost itself ends up being, uh, very, very low. And so just like the economics of like, you know, Hey, like, um,
people make these awful trade-offs. It's like, ah, I can't afford to store it for 30 days, I'm going to go down to 14 days; I can't afford to store it for 14 days, I'm going down to seven, or three. And they kneecap themselves on the ability to look at historical log data and do any post-mortem analysis or anything like that. What we find is that with object storage it's just dirt cheap: you can store this data for six months or 13 months, and it's actually very cost-effective to store it for a long period of time. A lot of people will archive their log data in case they need to look at it, say archive it to S3. Our thought on that is: to hell with that, just keep the hot data in S3. You can do that these days. So you don't need to worry so much about hot versus cold; it's all on object storage, and these databases are good enough to do the whole use case on top of it. So I think there's less of a need to think so much about tiering. But at the same time, transiting the data isn't free. You have to move it across the internet, and you have to encode it so you can store it in these columnar forms in object storage. There comes a certain scale where the physics of data movement does catch up with you. So there's obviously a point at which, yeah, I don't necessarily want to keep all of the data; it just doesn't make sense. Even if I can store it very cheaply, it's going to cost me a decent amount of money just to move it into object storage. And so obviously there's been the trend towards pipeline tools to deal with this, to just filter…
James Governor (31:57)
It’s been a big, big, big trend, no doubt.
Jacob Leverich (32:00)
But I think the other thing... yeah, sorry. One thought I wanted to add is that, to some extent, this was the original motivation for the rise of metrics as a telemetry signal. Back in the day, all we had was tons and tons of web logs, and we did our analytics on top of those. That was all well and good, but when you're dealing with a million requests per second, that's a lot of logs. I don't necessarily need every single one of them; I might need a sample of them, and I want an aggregate metric so I can look at the distribution of latency or the distribution of errors across different endpoints. So we've actually always had really good strategies for dealing with this physics-of-data-movement problem, and that's why it's almost emergent that you're going to have these different classes of data in any observability practice: the low-level data that tells you exactly what happened with this request, and then, in many cases, more aggregated metrics data when you're looking at things holistically. But there's something interesting going on, trend-wise, in the industry right now that paints a slightly different destination state for all of this, which I think is really interesting. You have these pipeline tools that give you the opportunity to send data to different destinations or filter data out. So you can send the high-value data to your premium solution that you've used forever and are comfortable with, and you can send all the other data to S3 and dump it in there in case you need it. It's a great value prop; I understand exactly why you would want to do that. It's spend a little money to save a ton of money, and you don't have to throw out all the data: in an emergency, you can still go through, download the files, and look at them. Yep.
James Governor (33:56)
Classic offload.
Let’s just put it over there and keep it cheap.
Jacob Leverich (33:59)
Yeah. So our original observation was: well, we can put all this data in the cheap place and still get great value out of it. That was the original bet we were making with our data-lake-based solution: hey, we can actually do observability and log search and metrics dashboards and all this sort of stuff, even though the data is stored in object storage. So we were trying to get one step ahead of that. But I think the new, very interesting emergent trend is towards the commoditization of that data storage in object storage itself. To paint the picture: with OpenTelemetry, you have commodity, vendor-neutral, open source instrumentation and data collection. Then you're going to send that data ideally to somewhere like S3, somewhere very cheap where you can store a ton of data for a very long period of time. The naive thing to do is to just store it in gzipped JSON files, but that kind of sucks. One of the key technologies in a database like Snowflake is these columnar, semi-structured table formats that allow you to query this data very, very efficiently, and that's essentially why we chose Snowflake in the first place. What's now come about is this thing called Iceberg, which is the commoditization and open standardization of that table format for putting semi-structured data into object storage: you can still store it for a long period of time, store it very efficiently and compressed, and still query it and do high-performance, high-efficiency analytics on this semi-structured data.
James Governor (35:49)
And with the possibility of portability because again that’s the standard that we’ve seen literally everybody adopting.
Jacob Leverich (35:55)
Yeah, you nailed it. Right. So rather than taking the pipeline tool and sending the data to three different places depending on what I want out of each of them, wouldn't it be amazing if you could just shunt all of the data to this data lake in an open format, and all of the observability tools come to the data? As far as solving this physics-of-data-movement problem, and addressing the concern so many people have, it's: I don't want to make five different copies of this data. It's very expensive to make all those copies, to move the data around, and to keep control of it. Yeah.
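A sketch of what the-tools-come-to-the-data could look like once the telemetry sits in an Iceberg table: any engine that speaks the format queries the single copy in place. This assumes DuckDB's iceberg extension is available, plus httpfs and credentials for S3 access; the table location and columns are hypothetical.

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL iceberg")   # assumes the extension can be installed and loaded
con.sql("LOAD iceberg")
con.sql("INSTALL httpfs")    # needed for reading from S3, along with credentials
con.sql("LOAD httpfs")

# One copy of the spans, written once in an open table format; this engine (or Snowflake,
# Spark, Trino, an observability product, ...) reads it where it lives.
con.sql("""
    SELECT service_name, count(*) AS spans, avg(duration_ms) AS avg_ms
    FROM iceberg_scan('s3://telemetry-lake/otel/spans')    -- hypothetical table location
    WHERE start_ts >= now() - INTERVAL 1 HOUR
    GROUP BY 1
    ORDER BY spans DESC
""").show()
```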
James Governor (36:35)
That's the direction of travel. You made a bet on Snowflake, and here we are in 2025, with customers telling their observability vendors: look, we're keeping our data on Snowflake. Now, admittedly, with Iceberg you do have some other options, but to your point about the physics of data, people don't want to be rehosting all of the time; we don't want to send everything over there. And I think that's one of the changes we're seeing.
Jacob Leverich (37:04)
Yeah, 100%. So that’s kind of been, I would say the very interesting sort of next evolution in this market is to start to have a viable solution to this like one copy of the data problem.
James Governor (37:21)
This is interesting. We may be stumbling upon a transition here, and this might be a stretch, but I think arbitrage always sounds good in theory: I can put some of the data over there on that cloud, I could put some there, I could put some in one system. It sounds great. But actually, the fact is the arbitrage adds a layer of complexity. I think if we can standardize and put something in one place, there's inertia there, and it's an inertia that has customer value. Now, where I'm going with this is that it's time to make the 2025 transition: we haven't talked about AI yet.
Jacob Leverich (38:14)
Oh boy.
James Governor (38:15)
You’ve told me all about the things that you have built into Observe, the standards that you’re taking advantage of, what that can bring you from a cost perspective, what that can bring you from a context perspective. One of the things I think is very interesting, everyone at the moment, people are really talking about like multi-model, like, everyone, everything has to be multi-model. And I think that’s sort of a…
attractive in theory. I’m going to use the best model for the job and I might use a little bit of Claude over here and I might be using some GPT over here and I’m not really sure that’s going to happen, that everyone’s going to be using bedrock in order that I can use. I think there will be some experimental experimentation. guess where I’m going with this is I think we’ll see some of the same inertia.
That’s not the topic of conversation. However, the topic of conversations are going to use this discussion about inertia to say, okay, let’s talk about AI because people choose their platforms. They want to take advantage of them. AI is too, I think, well, there’s two questions to ask you Jacob about AI. One of course is use of AI within Observe itself. I think one of the things that, that
that my business partner, Steven O’Grady, talks a lot about is we get very excited about code assist, but what about query assist? Because querying can be hard. There’s always, you mentioned Prometheus, there’s PromQL, there’s always different query languages in observability solutions. So yeah, two questions then for you. The first one is about how are you using AI?
to make your platform easy to use. And then the other question, and you can answer them in either order. The other question is, what impact do you think generated code are gonna have on observability and how are you gonna step up to that? Because pretty clearly, one thing is certain, we’re gonna be, developers are gonna be writing a lot more code, which means there’s a lot more that needs to be observed. So yeah, how does AI affect your…
you as a product owner, and then we’ll talk a bit about like, in terms of capability, then we’ll talk about how that changes the customer problem.
Jacob Leverich (40:46)
Yeah, it’s great question. I’ll start with sort of, I guess, what we’ve been doing with AI and a couple of interesting things we’ve seen. So I think just like everyone else, it’s been very exciting technology. We’ve been trying to figure out how to improve user experience with it. And I think a lot of our initial investments were very, very targeted. We discovered things like LLMs are fantastic at generating regular expressions to parse logs.
One of the first features we built was a thing that would do automatic field extraction based on an LLM. You give it a couple of examples. A log line, it’s worked out a perfect regex, it extracts all the fields. Now you can do structure analytics. That was actually a really cool thing. It was something that a lot of customers struggle with, so we saw a lot of great adoption of that. But then the second thing we started working on was, as you mentioned, the assisted query generation. We built very quickly a co-pilot experience where you’re trying to do a query,
you can just sort of tell it in vague English terms what you’re trying to do. And then it’ll give you a bunch of suggestions for like, hey, here’s the query that might do what you’re doing. And you select it you run it and you can go back and correct it. So it has like a nice user flow for like not having to like really become an expert in any query language, but still more or less get the job done. Now, one of the things that was really interesting though is that the expressive nature of the queries doing this naively,
You could do like a few things. Like you could do basically like, like SQL style queries were easy to do if you just had a table in front of you you wanted to like, you know, count, you know, a number of elements in a, in a column or whatever. But, but, but asking more nuanced questions, like, Hey, is the West coast deployment still up? Like, like that was, that, was kind of beyond the, ability of any of these automatic code generation tools. what we, what we, what we realized is that.
well, hell, we actually have an amazing solution to this problem. Early on, when we were thinking about our solution and contextualizing all this telemetry data, we leaned hard on building an entity model. I mentioned before: hey, I don't care about the logs, I care about Frank. Frank said he had an error; what happened to him? So we had built this data model and this whole system for representing those entities, deriving knowledge about them from the telemetry data, and organizing them in a way that could be quickly queried. We call them resources. When we started thinking more holistically about AI, what we realized is that the thing we had done for contextualizing telemetry data is what Google did for search with the knowledge graph, where you search for the 49ers and it knows you're talking about the 49ers, so it brings up the latest score and, by the way, here's where you can find more information. That's what we had been building with our contextualization of telemetry data, and it turns out to be exactly the thing you want to give as context to an LLM. When I say, hey, is the front-end service up, it needs to go back and resolve all those entities: what does it mean to talk about the front-end service? What is that? And what does it mean for it to be up? There's probably a KPI associated with that thing. Now it just needs to connect the dots, and it can propose a query that might answer that question. That turned out to be one of the key conceptual unlocks for us: you actually need a very, very rich knowledge graph in order to use LLMs effectively for query generation. So that's where we've been spending a lot of our time. Tying this back a little to the multi-model question, one of the most exciting ways we've started to package this is as an MCP server. The MCP server is this way you can connect to another LLM and basically just tell it…
James Governor (44:56)
I should ask you: I have this view that we'll just settle on good enough and choose whatever gives us the best cost of tokens. Maybe you have a different view. So tell me about the MCP server, but I'd also love to know your view, Jacob, about…
Jacob Leverich (45:09)
Yeah, 100%. These LLMs are moving so fast. Our core competency at Observe is observability and making sure we meet the needs of our end users; we're not going to go deploy $200 million building some custom foundation model. That doesn't really make sense for us at this stage. What makes more sense for us is to figure out how best to integrate all of this evolution in AI technology into this workflow. One of those examples was the MCP server, where basically you take something like Claude and you tell it: hey, by the way, I have this collection of capabilities. You can ask questions about services, you can ask questions about users; here are the types of information you can get about these things, and here's the spec for the API you can use to go ask those questions. On our end, we just have a library of tools that these things can go off and use. You hand that to one of these reasoning models, and bear in mind, these reasoning models have the knowledge of the internet within them. They've read every Stack Overflow post ever written about how to troubleshoot a three-tier Java application, or how to troubleshoot Kubernetes infrastructure, or your Lambda serverless infrastructure, or whatever. You plug that knowledge base into a collection of tools that can now interrogate not just the logs (it's not going to sit there grepping logs); it's actually going to ask questions about the entities, and I've given it tools to follow those breadcrumbs wherever they go. It leads to a really, really interesting capability, even using these off-the-shelf language models. And from what we've seen, it doesn't matter which one you pick; almost all of them are capable enough to make good use of that collection of tools.
Right. And so, so that’s a really, really interesting way to, package it. And, for what it’s worth, I mean, we’re kind of, you know, we’re, we’re, we’re internally developing a lot of this stuff. So it’s exciting stuff to come down the road. But, but I think where I see this going is I think for the industry, it’s going to be a little bit, it’s going to be kind of interesting. in that, traditionally, I, I, when I have an observability practice and I’ve been going for like five to 10 years.
You know, I’ve collected a massive constellation of like, uh, of instrumentation, like customer instrumentation. I’ve collected a massive collection of dashboards and monitors. I built a ton of like run books and I kind of have like, just like an investment sort of in my observability tool and.
And I have my way of doing things that I’m used to that we’ve done forever, sort of standard operating procedure. And one of the things that makes it very difficult to get out from under like one of the, the, the, guess, best of breed vendors and, and especially, you know, once you get your first overage bill, you’re like, God damn it. Like I wish I could change, but I can’t cause I have all this stuff stuck there. Well, when you think about where all this stuff is going, like the actual workflow for, for like troubleshooting, might change a little.
Like if all this like, reasoning models and, you know, kind of MCP, just like calling tools as it sees fit, actually like transpires, then it’s no longer going to be like me as a human, like trying to remember which five dashboards I need to go look at to figure out how to troubleshoot this particular class or problem. Actually the language model.
a reasoning model, generally knows how to do this. It's going to follow a particular path, and if it gets stuck, if it's unable to troubleshoot something, that's a cue to you that you might not be following conventional practice, you might not be following best practice in the industry. Wherever it got stuck, that's the hole you should go plug. It also almost gives you an opportunity to step back from all the stuff we had created before and, from first principles, start to rebuild the practice: start to troubleshoot things by writing simple runbooks that this reasoning model can follow, let it go interrogate the environment on its own, and have it tell us where it got stuck or where it needs help to know where to go next. You can use that as a waypoint for: well, where do I need to add instrumentation, or where do I need to build a dashboard that this thing can consume? It turns this from a technical exercise into a bit more of a procedural, operational exercise: hey, as part of this incident, this thing couldn't find this particular information, so we should go fill that in. By the way, that's what a good observability practice does anyway. You do postmortems, you have regular reviews of your instrumentation, and you go back and prioritize where you need to fill in the gaps to be successful. I think you're going to want that habit anyway.
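A rough sketch of the MCP-server idea described above, using the FastMCP helper from the MCP Python SDK. The tool names, the entity lookup, and the KPI logic are invented stand-ins for whatever the observability backend actually exposes; this is not Observe's real API.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("observability-sketch")

# Invented stand-in for an entity / knowledge-graph lookup in an observability backend.
FAKE_SERVICES = {
    "frontend": {"kpi": "availability", "current": 99.2, "objective": 99.9},
    "checkout": {"kpi": "availability", "current": 99.95, "objective": 99.9},
}

@mcp.tool()
def list_services() -> list[str]:
    """List the service entities the telemetry knows about."""
    return sorted(FAKE_SERVICES)

@mcp.tool()
def is_service_up(name: str) -> str:
    """Resolve a service entity and compare its KPI against its objective."""
    svc = FAKE_SERVICES.get(name)
    if svc is None:
        return f"No service entity named {name!r}; try list_services first."
    verdict = "meeting" if svc["current"] >= svc["objective"] else "missing"
    return f"{name}: {svc['kpi']} {svc['current']}% ({verdict} the {svc['objective']}% objective)"

if __name__ == "__main__":
    # A reasoning model connected to this server can chain these calls itself:
    # "is the front-end service up?" -> list_services() -> is_service_up("frontend").
    mcp.run()
```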
James Governor (50:31)
So run books, but with a view of, as Matt Biilmann calls it, agent experience.
Jacob Leverich (50:38)
Yeah, there’s very possibly a realignment of the sort of common case user experience for these tools. Rather than me going in and manually remembering which dashboard I need to look at to troubleshoot this particular issue. Instead, all of that stuff is now context.
That the, that the LLM has it’s like supplied as context by the tool. It sort of tells the LLM by the way, here’s all the dashboards you have. Here’s all the questions you can answer. Here’s all the like services in the environment. And, and, like when you get an alert or when you have like an open ended question, you can just ask it and it’ll start that journey for you. Um, I don’t think it like kind of completes it. I don’t think that any of these, these models are smart enough or like actually have the, the, the autonomy to like, like fix anything.
But they’re actually pretty good at sort of following like the just like simple rational hypotheses based on which data you have. You know, if they get stuck, they can ask you, Oh, do you want to do this? Or do you want to do that? You want me to look at this, you know, A or B and, you’re basically what is the augmented sort of where like there’s still someone in the driver’s seat, but they have this thing that’s actually doing the hard work of like remembering.
Where the hell this dashboard is or remembering what the hell service like what was it last time that was the root cause and that incident two weeks ago.
James Governor (52:04)
Given how bad LLMs currently are at memory, we’ll see how that one goes. That’s definitely a problem that needs solving.
Jacob Leverich (52:13)
Fair enough, fair enough.
James Governor (52:14)
So we got a bit philosophical there. I guess we're about at time to close it off, so to sum up: Jacob, if there were a couple of top things an engineering leader should be thinking about in 2025, if they want to improve what they're doing from an observability standpoint, what would they be? I'll just throw that out there as a final question before we sign off. We've talked about the evolution of the industry; what are the decisions engineering leaders should be making? Because it's the classic: the best time to plant a tree was 10 years ago, the second best time is now. If we're planting a tree for observability, what should we be doing?
Jacob Leverich (53:04)
There are maybe two things that come to mind as most important here. One is the death grip the previous generation of vendors has had over us: the crack in it has finally fully formed, in the form of commodity instrumentation. I would say, if I were going to build a new application today, I would build it forward-thinking, with the idea that I may change observability vendors in the future. The easiest way to facilitate that is to make sure my instrumentation isn't tightly coupled to any one of them. This was always easy with logs, because you just write logs to disk and then you can send them wherever; it was never really a challenge with logs, but it has always been a challenge with metrics and APM. I think the industry has gotten to the point where the open source solutions for these are good enough that you can begin to make that bet. So start with the commodity instrumentation to begin with, or steer towards it, steer onto a trajectory for it. But the second thing is actually not a technology thing at all.
It’s really just like a, just like a people process thing. you know, having like good culture around post mortems and having good culture around like how you, follow up on those and make sure that like the recurring issues like find their way into sprints that you sort of like keep track of like what those recurring issues are. And you have good discussions about like, well, here’s the biggest bang for the buck thing we can do that would prevent the incident that we’ve had.
You know, like the last three months or, making sure that you’re paying attention to how many interrupts or how many pages people get, especially at night and just having like a running record of that and following through on it. There’s, there’s a people process there that I think pays immense dividends. Cause this is one of these just like habits that if you follow up on it and you’re kind of continually grinding it down, like you’re going to see
the issues that really impact the quality of not only your software, but also the quality of your engineers lives. You’re going to find those really, really quickly. and if you make that investment, they’re going to thank you for it. And you’re going to have a much more robust service. And so I’d say those, those two things like plan for the, commoditization of telemetry and make sure that you have a good, know, post-mortem or reliability engineering as culture.
James Governor (55:35)
So, OTel and better post mortems, well documented. I love that. That’s such a good summary. Okay, well, that is another…
Jacob Leverich (55:45)
If I may interject one funny thing: you take those post-mortems and you throw them into an LLM, and you ask it questions like, hey, what was the last post-mortem that resembles the incident I'm having right now? You'd be amazed at what that knowledge base can bring back to you. So there are also some AI benefits to doing all this, right?
James Governor (56:03)
100%, 100%. AIs will be there to parse that stuff for you. All right, that is another MonkCast. Don't forget to subscribe, share it with your friends, say you enjoyed the show, click on those buttons; we're definitely looking to grow that audience. Jacob, thanks so much for joining us. It was a really interesting conversation, thank you for all your insights. And thanks, everyone, for joining us today.
Jacob Leverich (56:30)
My pleasure. Thank you.