A RedMonk Conversation: Why Engineering Orgs Should be Thinking About Mobile SLOs (with Andrew Tunall)



In this RedMonk conversation, Andrew Tunall, President and Chief Product Officer at Embrace, chats with Kate Holterhoff, senior analyst at RedMonk, about the importance of mobile service level objectives (SLOs) in enhancing user experience and performance in mobile applications. They explore the challenges of measuring mobile data, the need for user-centered approaches, and the future of SLOs, particularly with the integration of AI and predictive analytics. The discussion emphasizes the shift from traditional observability metrics to a more user-focused perspective, highlighting the critical role of data in understanding and improving user interactions with mobile applications.

This RedMonk conversation is sponsored by Embrace.


Transcript

Kate Holterhoff
Welcome to this RedMonk conversation. My name is Kate Holterhoff, senior analyst at RedMonk. And today I am joined by Andrew Tunall, president and chief product officer at Embrace. Andrew, thanks so much for joining me.

Andrew Tunall
Yeah, thanks for having me, Kate.

Kate Holterhoff
Yeah, this is going to be exciting. So, Andrew, let’s set the table. Give me the elevator pitch for Embrace. What problem do you solve?

Andrew Tunall
Yeah. So, I mean, this won’t be the 10-second elevator pitch, because we don’t sell vacuums at Embrace, but let me set up our larger hypothesis. The larger observability world, I’ll say the SaaS-based observability and monitoring world, has now been around for over 20 years. I think Dynatrace was founded in 2005. New Relic, where I was before I joined Embrace, was founded in 2008. So these are not new companies, and these are not new concepts. But by and large, the center of gravity for those products has historically been the maturation of infrastructure, and increasingly application, observability on the data center side.

That’s reflected not only in the tooling they have for operators, for Kubernetes clusters and everything else, but also in the CNCF projects: the maturity of service meshes and the Kubernetes project, even OpenTelemetry and where it’s centered around app and infrastructure observability, is really a reflection of the progress those teams have made in becoming more mature at operating modern digital applications. And that same maturity has broadly not come for front-end developers. I think if you look at the biggest conferences for even web devs, there’s Vercel’s Next.js conference, which I think is actually in SF today.

And even in mobile, you see a lot of conferences around SwiftUI or Jetpack Compose and Android, et cetera. But you very rarely see even meetups where front-end developers are talking about the implications of really measuring whether the app is performing the way they expect it to perform for their users to be successful. And that’s what we’re all about.

We have discovered that the most forward-thinking digital companies in the world increasingly care not just about whether the service that talks to the mobile app or website is performing well, but about whether the actual user interacting with the app is having a good digital experience. We’ve put out research before, and you and I have talked pretty extensively about the negative implications of user-perceived bad performance in terms of user retention, in terms of users’ ability to stay engaged with your experience. But broadly speaking, the world just has not evolved to have that level of maturity in measuring and improving. So that’s what we do.

Kate Holterhoff
I like it. And anytime we talk about front-end engineering, it’s near and dear to my heart. So recently, you published a report entitled “Defining and Measuring Mobile SLOs.” I am familiar with service level objectives as a general concept in the observability space. However, mobile SLOs seem to be doing something slightly different. So what is a mobile SLO and why do organizations need to worry about them?

Andrew Tunall
Yeah. I mean, we talk about mobile SLOs because obviously we’re an observability platform for mobile engineering teams and the operations teams that support them. But broadly speaking, front-end SLOs or mobile SLOs are really about the actual experience a user is having when performing a particular activity, as opposed to the individual service components that support it. A great example of that is using an e-commerce app where you’re searching for some item in a product catalog.

You as a user don’t really care about the individual services or render activities inside the app supporting that. You just care that when you type, you know, lawnmower into the e-commerce app and hit the little magnifying glass or enter on your keyboard, all of the individual components work to return search results, render them, and make the results interactable on the device. And so a mobile SLO is indicative of user-facing behavior that matters to you as a business, because people can’t add items to their cart if they can’t search effectively.
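To make that concrete, here is a minimal sketch of measuring the whole user-facing activity, rather than its individual services, using the OpenTelemetry tracing API that comes up elsewhere in this conversation. The tracer name, span name, and wrapper function are illustrative assumptions, not Embrace’s actual SDK.

```kotlin
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.trace.StatusCode

// One span for the whole user-facing "search" activity: from the tap on the
// magnifying glass through network, parsing, and render, until the results
// are interactable. All names here are hypothetical, not Embrace's API.
fun <T> measureSearchActivity(query: String, runActivity: () -> T): T {
    val tracer = GlobalOpenTelemetry.getTracer("app.user-activities")
    val span = tracer.spanBuilder("search-to-interactable")
        .setAttribute("search.query_length", query.length.toLong())
        .startSpan()
    return try {
        runActivity() // everything the user is actually waiting on
    } catch (e: Exception) {
        span.setStatus(StatusCode.ERROR, "search activity failed")
        throw e
    } finally {
        span.end() // the span's duration is the number a mobile SLO is set on
    }
}
```

The design point is that the SLO attaches to the duration of the one activity the user waits on, while the component services remain available for breaking the problem down once a threshold is violated.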

Kate Holterhoff
And if having these types of SLOs is so important, why isn’t everyone doing it? Or is everyone doing it and just nobody talks about it?

Andrew Tunall
Yeah, going back to the history I was setting up at the start of the conversation: the reason they don’t measure it is that they largely don’t have the data. And that’s in part because the tooling ecosystem has been set up to be broadly just reactive.

It feels silly to say that measuring is important, but when you’re designing a user experience, you have standards for the interaction times and the behaviors users will tolerate in order to complete the activity you want them to complete. If you can’t measure those, you have no idea whether you’re actually violating your user standards. And most tooling that has existed in the market for the history of mobile has been very focused around error tracking, around crash reporting, around, for lack of a better term, vanity metrics like usage or error rates. Very rarely has it been focused around the threshold at which, when the app takes too long to be interactable, not just start and boot, but be interactable, customers will lose interest and just force close it before they ever engage with the content in the app.

And I think because customers have lacked that data, and because there hasn’t been a lot of attention in engineering organizations around obtaining that data, people naturally just measure the things that they have. Especially since most of the SLO movement has happened with SRE teams and DevOps teams that have primarily been focused on backend services, I think the industry broadly just hasn’t reached that maturity yet.
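As a toy illustration of the standard Andrew is describing, the arithmetic of such a check is small: given observed time-to-interactable values, what fraction of sessions crossed the threshold, and is that within budget? The two-second threshold and 5% budget below are made-up numbers, not recommendations.

```kotlin
// Toy SLO check over time-to-interactable samples; the threshold and budget
// values are illustrative assumptions only.
data class SloResult(val violationRate: Double, val withinBudget: Boolean)

fun timeToInteractableSlo(
    samplesMs: List<Long>,
    thresholdMs: Long = 2_000, // "takes too long to be interactable"
    budget: Double = 0.05,     // tolerated fraction of slow sessions
): SloResult {
    val slow = samplesMs.count { it > thresholdMs }
    val rate = if (samplesMs.isEmpty()) 0.0 else slow.toDouble() / samplesMs.size
    return SloResult(violationRate = rate, withinBudget = rate <= budget)
}

fun main() {
    val launches = listOf(850L, 1_200L, 2_400L, 950L, 3_100L, 780L)
    println(timeToInteractableSlo(launches)) // 2 of 6 slow: budget blown
}
```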

Kate Holterhoff
So as with most things, it sounds like a data problem.

Andrew Tunall
Yeah, I mean, obtaining the data can be tricky, and we’ll talk a little bit about this in a bit. There are obviously some dimensions of a mobile experience that are vastly different from what you would measure in backend services. But even more than that, unlike a lot of the domains we’ve dealt with in the monitoring and observability world, mobile is installed software. An SDK that measures data is installed software that gets tightly hitched to installed software across a distributed compute ecosystem. Mobile is not a distributed system, it’s a distributed computing ecosystem: you have monoliths running on millions of unique devices, not distributed systems running on a small number of devices.

And the result is that, depending on how you set up your project or anything else, just obtaining the data can be difficult. And without that kind of, I’ll say, organizational desire to obtain the data, when facing barriers, I think most people give up and say, we’ll use whatever service metrics we have as a proxy.

Kate Holterhoff
All right. So let’s plumb this issue of complexity here, which I think a lot of what you’re saying is pointing at. What is it that DevOps teams who traditionally are working on these SLOs don’t necessarily understand about mobile data?

Andrew Tunall
Yeah. I think if you really thought about it in detail, most of these would be pretty intuitive. But given that most people don’t sit there and spend hours poring over all of the components involved, they maybe don’t think of them as readily as I guess I do, because I spend all day thinking about this stuff. You think about variability of devices. That’s maybe less impactful in the iOS ecosystem, because Apple has a relatively small number of devices and a relatively small number of chipsets, so their firmware updates and OS updates are pretty consistent. But the Android ecosystem is a little bit of the Wild West. You have devices that are quite modern and fast, and you have devices that are quite slow and old, all running various versions of the OS. And the obvious impact of that, when you’re asking them to do things like rendering activities, is that you’re going to get a lot of variability in what happens to users.

Internet connectivity, right? You have people on LTE in places like the US, and you have people on 3G connections in central Africa, and depending on your app and their connectivity, you get wildly different behavior. We also deal a lot with delayed data, which most observability ecosystems are generally poor at dealing with. That’s because you have people in the subway who lose connectivity and then terminate their app, which is usually when marshalling would happen. You have people using AllTrails to go hiking where they don’t have internet connectivity and the app crashes. You still want to get that data, because if the app is always crashing when there’s no connectivity, you’re just missing it. And third-party services, whether it’s payment gateways, auth, ad SDKs, et cetera, are super common in mobile, as with all front ends.

And then you have all of the user behaviors, some of which are properties of the ecosystem itself. Does a user change app state? Do they force close during particular activities, therefore ending a span, and it is a span, something you’re measuring, where you might get data telling you that something never lasts very long, when really it’s lasting so long that all users are terminating the app, therefore ending the data? And then you have human behavior that just drives app state. Mobile apps are stateful systems: as I build up the things I’m doing within the app, I make choices, and I’m actually holding context locally that ends up impacting a whole bunch of other things.

And most of these just are not well handled by classic observability ecosystems. They’re separate signals that are uncorrelated with the human behavior around them.
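One difficulty Andrew names, delayed data, has a shape worth sketching: sessions that end offline or in a crash can only be reported on a later launch. Below is a minimal sketch of durable buffering under that constraint, with a hypothetical send callback; it illustrates the general technique, not Embrace’s implementation.

```kotlin
import java.io.File

// "Delayed data" handling: append every event to disk as it happens, so a
// session that ends offline or in a crash can still be reported on the next
// launch. Hypothetical design for illustration, not Embrace's SDK.
class DelayedTelemetryBuffer(private val store: File) {

    // Record durably right away; anything held only in memory is lost
    // when the user force closes or the app crashes.
    fun record(eventJson: String) {
        store.appendText(eventJson + "\n")
    }

    // On the next launch, or when connectivity returns, drain the backlog.
    // Events that fail to send stay in the file for the next attempt.
    fun flush(send: (String) -> Boolean) {
        if (!store.exists()) return
        val unsent = store.readLines().filter { it.isNotBlank() }.dropWhile { send(it) }
        store.writeText(unsent.joinToString("") { it + "\n" })
    }
}
```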

Kate Holterhoff
Yeah, you’re reminding me of conversations that I’ve had with folks who are grappling with the legacy of single page applications, or SPAs, where they’re dragging down all this JavaScript, and for folks with older devices, it’s untenable. So yeah, you’re speaking my language here. Here at RedMonk, we do think a lot about the developer experience, and I think about front-end engineers maybe a little bit too much. But in general, it’s interesting to me that with this user aspect, it is really crucial to be thinking about these mobile SLOs. User-centeredness seems to be integral to the functioning of mobile SLOs, if I’m hearing the story correctly. Can we talk about that? Would you expand on that idea?

Andrew Tunall
Yeah, I mean, similar to what I was talking about with the search functionality, if you break that into its component parts: let’s take a step back and say that, classically, without any data captured on the client, you might measure your search service’s response time. And you might set a budget for how many requests can exceed a certain response time threshold, because you intuit that a long response time will result in a degraded customer experience. We could come up with some numbers here, but let’s say you set that at a thousand milliseconds, one second. You say, if more than 5% or 10% of our requests cross a one-second threshold, then we’re in violation of the SLO. We’ve gone yellow.

And your P50 response time, let’s call it 250 milliseconds. Let’s say your mobile app then has an SDK that handles all the requests to whatever service, an auth SDK or a search SDK. And the SDK designer, being totally rational, looks at the distribution of requests and says, these always respond in 250 milliseconds, but if one takes twice that long, 500 milliseconds, we’re going to go ahead and retry.

So then you have a degradation of latency on the service that doesn’t yet trigger your backend SLO, but pushes that P50 closer to 400 or 500 milliseconds. And suddenly you have a massive increase in requests, because you have a retry storm coming from the devices. You’re not actually violating your SLO, because everything’s taking 500 milliseconds, you haven’t crossed that threshold, but it’s resulting in massive numbers of retries that eventually succeed on the user device. Users now perceive that activity, the login activity or the search activity, as taking five times as long, because it takes five retries before one succeeds within 500 milliseconds, the retry threshold you’ve set in the SDK. Today, people on the backend using a response-time SLO would have no idea that the retry storm, the increase in requests they’re seeing, is the result of a degradation in latency, because they have no information coming in about the activity that is driving it. And maybe it’s intuitable, but you definitely don’t have a common language to use with those app developers.

So when we talk about user-centered SLOs, we say: stop talking about the technical componentry. That is important, you have to be able to break it down once something violates a threshold. But if your contract isn’t with the internet, but is instead with Kate, how do I measure Kate’s activity, and the point at which she is no longer tolerant of the behavior? How do I measure all of the contingent parts until that activity has ended? And when it’s critical, say it crosses a certain latency threshold or error threshold and I know Kate’s going to be upset, now I can actually proactively inform the team that we’re out of our budget of upset Kates, and I can start addressing those. And ideally you’re giving those engineers common telemetry signals to have the conversation with backend engineers so that they’re speaking the same language.

Because if what I say is, auth is taking too long, and the backend team says, looks fine to me, what are your latency metrics? And I’m like, I don’t have latency metrics, I just have zero errors and user complaints. Both of them end up saying, not my fault. And then it’s your users who end up suffering, because you’re stuck in a kind of inability-to-resolve loop.
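Andrew’s retry-storm numbers can be worked through in a small simulation: the client SDK abandons and retries any attempt slower than 500 ms, while the backend SLO only trips above one second, so as the service’s true P50 creeps upward, attempts multiply and user-perceived latency grows while the backend stays green. The uniform latency model and every figure here are illustrative assumptions.

```kotlin
import kotlin.random.Random

// Each attempt draws a latency around the service's P50; the client abandons
// any attempt slower than the 500 ms retry timeout and tries again. Returns
// (attempts, user-perceived latency). Crude model, for illustration only.
fun simulateActivity(p50Ms: Double, retryTimeoutMs: Double, rng: Random): Pair<Int, Double> {
    var attempts = 0
    var elapsed = 0.0
    while (true) {
        attempts++
        val latency = p50Ms * (0.5 + rng.nextDouble()) // uniform in [0.5, 1.5) x P50
        if (latency <= retryTimeoutMs) return attempts to (elapsed + latency)
        elapsed += retryTimeoutMs // abandoned at the timeout; retry
    }
}

fun main() {
    val rng = Random(42)
    for (p50 in listOf(250.0, 400.0, 500.0)) {
        val runs = List(10_000) { simulateActivity(p50, 500.0, rng) }
        val attempts = runs.map { it.first }.average()
        val perceived = runs.map { it.second }.average()
        // At a 250 ms P50 nothing retries; as the P50 creeps toward the retry
        // timeout, request volume and perceived latency climb even though no
        // attempt ever crosses the 1,000 ms backend threshold.
        println("p50=${p50}ms -> avg attempts=%.1f, perceived=%.0fms".format(attempts, perceived))
    }
}
```

In this toy model, at a P50 of 500 ms half of all attempts get abandoned and retried, roughly doubling request volume and more than tripling what the user waits compared with the healthy baseline: exactly the divergence between a green backend dashboard and an unhappy Kate.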

Kate Holterhoff
I love this metric of upset Kates. I feel like I need to start using that in my own life more often. All right, so let’s think a little bit about the future of this, because I talk to a lot of folks about the future of SLOs, and there seems to be a lot of excitement in this space. I’m really excited about where these things are going. Just to name a few areas where I feel like SLOs are going to be making a real impact in the future: of course, we can’t have any conversation today without starting with AI. So we’re looking at predictive SLOs, where instead of being reactive, we’re trying to head off potential failures through proactive measures, and machine learning is a big part of that advancement.

But then we’ve also got dynamic SLOs, which I think are really picking up steam. By that I mean SLOs that adapt to real-time shifts in traffic, so user behavior and environmental factors all come into play in this new way of approaching SLOs, which it sounds like you’re positioned to meet here. These are going to be especially useful for ensuring that services perform optimally during busy periods, we always point to something like Black Friday, while in quieter times resources can adjust in turn.

All of these seem to be part of the future here. And then, of course, sustainability. This, I think, really ties in with AI when we’re thinking about starting power plants back up, because AI is just so resource-intensive. So performance metrics are tracking not only velocity, but also reliability and the energy these workloads consume. Sustainability is factoring into SLOs in new and exciting ways. When we think about the goal of green computing, how are SLOs playing a role in that? It sounds like in the future that’s going to be more and more apparent. So I guess I’m curious: from your perspective, what are you seeing in terms of the future of SLOs? What are some of the more practical use cases that you’ve noticed?

Andrew Tunall
Yeah, I mean, I think the big shift is simply that the customers, the people we talk to who are most interested in this right now, are people who have spent a lot of time defining their error budgets and their SLAs and transforming them into SLOs in their data centers. The fundamental question has shifted from, during an outage, what percentage of our requests took X amount of time, to how many users were impacted.

A couple of years ago, when I started at Embrace, I was talking to a director at one of our largest customers, and he said, you know, I see your product as the future of how you operate a modern mobile user experience. It took me a little while to fully grok what he meant. But I think increasingly what we’re seeing people mean is, I don’t want to tell the story just through the lens of the data we’ve been capturing for 15 years. Most people have unlimited data plans now. We’ve made strides in terms of the network connectivity infrastructure we can use to transit data out of an app. You’re asking apps to do more, and devices have more compute power. So it’s relatively simple for us to capture meaningful data about your experience. They want to actually know, are users having a good time or not? How can we design a system, including the user-facing property, that works to their expectations?

And SLOs are simply a tool, right? I think the question then becomes, and this is true in any organization, how do you create the accountability tools for the organizational culture to deliver the outcomes you want? If you view SLOs as how the team measures whether they’re building software the way your organization values, then they start to think through, how do we not just test it locally in the emulator and verify that the boxes work, but verify that when it’s in our actual customers’ hands, it continues to deliver the experience we expect, because that correlates to business results.

Obviously, I’m really interested in some of the things you were talking about, because as we think about predictive SLOs, look at that example I gave about an auth service or a search service that starts to degrade in latency. If you can start to correlate user behavior with some series of things that happen, you can start telling teams that creep in latency or error rates or whatever it is will eventually result in nonlinear effects when it reaches clients. And instead of teams having a binary measure that they arbitrarily set, okay, if 10% of our requests are over a second, wake us up in the middle of the night, otherwise deal with it, they can say: people’s experience is actually getting worse, and we predict this will eventually result in more churn, revenue reduction, bad reviews, et cetera. This is something you should prioritize as part of your everyday work instead of responding to in the middle of the night. I think that builds better software and better experiences for virtually everybody, because everybody’s doing the internet on their phone. I mean, it was something like 75% of digital transactions last year that happened on a mobile device.

Kate Holterhoff
Wow.

Andrew Tunall
So I mean, I’m really interested in how this new set of data starts getting correlated with what mature teams have already done to just build stuff better.

Kate Holterhoff
Yeah, definitely an exciting time to be following SLOs and to be sort of thinking through the future of that space. So we’re about out of time. How can folks hear more from you, Andrew?

Andrew Tunall
Yeah, the best way, I mean, if you care about what I’m talking about in my business life, is probably LinkedIn, both my LinkedIn profile as well as Embrace’s LinkedIn profile. We’ll share my Twitter handle and everything too, though I’m not sure how long I have left on that platform, I just take it day by day, and generally that’s more personal opinion. So yeah, if you want to hear more about what we’re doing, the Embrace LinkedIn account as well as my LinkedIn account are great places to follow.

Kate Holterhoff
Perfect. So it’s been an absolute pleasure speaking with you about Embrace and especially how you all are thinking about mobile SLOs. Again, I am Kate Holterhoff, senior analyst at RedMonk. If you enjoyed this conversation, please like, subscribe and review all of RedMonk’s video offerings on your podcast platform of choice. If you are watching us on YouTube, please like, subscribe and engage with us in the comments.
