I dove into the topic of technical debt in the past, exploring the tradeoffs involved and the ways the metaphor does and does not serve us in our communication. The recent industry news cycles (namely waves of layoffs across the industry and the organizational chaos at Twitter) have me again pondering what tech debt means.
A Tech Debt Framing
The webinar is worth watching and has some great discussion throughout**, but the component I’m going to highlight here is what the speakers might consider one of the more mundane bits of their conversation. What stood out to me was their definitional framing.
During their initial level setting around what technical debt is, I appreciated that Amundsen and Rapson chose to focus not on how technical debt comes to be, but rather on the characteristics of tech debt once it is incurred. According to Amundsen, we see evidence of technical debt when we have code and systems that can’t be easily observed or understood; according to Rapson, we see evidence of technical debt when we have slow development and release velocity.
I really liked this framing around the importance of understanding our systems. We can’t understand where we’re going if we can’t articulate where we are. Without knowing what’s happening in our system, the value of our system is inherently opaque.
Some Social Media Inputs
On top of that webinar, a few tweets helped drive my thought process and focus some of these news stories through the lens of technical debt.
- These threads from Dan Luu and Chris Petrillic outline some of Twitter’s engineering decisions and the constraining factors that were part of the rationale behind them. Both threads are enlightening, showing just a handful of the considerations that went into Twitter’s infrastructure over the years.
- This thread from Julia Ferraioli discusses how an important part of resiliency in an open source project is its methods of communication. The line “don’t mistake availability of data for completeness of relevant information” was especially notable.
- This video from Forrest Brazeal highlights a sample app from Google Cloud wherein the team does a notably good job of using GitHub to document the technical decisions that drove the application.
Total Aside: Let’s Talk About Incident Review
Complex systems are bound to fail; when they do, root causes (insofar as they can be said to exist at all) are also complex. The concept of blameless retrospectives encourages organizations to think holistically about what caused an issue rather than placing blame on specific individuals.
Or in other words, one of the recommended practices of our industry is to try to take the human element out of an incident review. When trying to narrate lessons from an incident, we have learned that we can make our systems more resilient if we look for the systemic points of weakness that contributed to the problem rather than assigning blame to people and teams.
“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”
– The Retrospective Prime Directive
I feel like we as an industry treat technical debt the same way.
We often talk about technical debt in terms of, well, the technical aspects of things. This platform is overly rigid. This database can’t scale anymore. This language isn’t performant enough. This framework is outdated. This project is no longer maintained. I don’t like how these RPCs are batched…
We’ve taken the humans out of it. And on the one hand, that’s good. We want resilient systems that aren’t tied to specific people or their actions. And in the spirit of blameless culture, we don’t want to attribute the accumulation of technical debt to specific people or teams.
But what does this removal of humans and their decisions mean in terms of having institutional knowledge about a system?
Because that platform didn’t come to be by itself. That database/language/framework/project wasn’t chosen in a vacuum. Someone batched those RPCs.
If you want to understand a system, sometimes understanding the why is just as important as understanding the what.
Technical Debt As People
So how does all this come together?
If an element of tech debt is having a system that is hard to understand/observe (says Amundsen), and if a system is formed via the accumulation of years and years of decisions and tradeoffs (see threads from Luu and Petrillic) made within very specific circumstances, then it stands to reason that having insight into the context of how and why those decisions were made is an important element of an organization’s technical debt.
Now as Brazeal points out, tools like Architectural Decision Records (ADRs) and internal wikis have existed within our industry for a long time. We can and should document the decisions that formed our systems without relying solely on oral tradition handed down between generations of team members. Furthermore, as any user of an outdated internal wiki can attest, it’s not enough to write it down once; documentation also has to be actively maintained in order to remain useful. And as Ferraioli points out, meaningful communication about resilient systems is so much more than looking through Git. The availability of the data is not enough, especially if the written data is out of date and fragmented across internal systems.
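To make the ADR idea concrete, here is a minimal sketch of what such a record can look like (following the common Context/Decision/Status/Consequences shape). The project, numbers, and details below are entirely hypothetical:

```markdown
# ADR-0007: Batch profile RPCs in the timeline service

## Status
Accepted (2022-03-01)

## Context
Fetching user profiles one RPC at a time caused high fan-out and
p99 latency spikes during peak traffic.

## Decision
Batch profile lookups into groups of up to 100 per RPC, using a
short collection window on the caller side.

## Consequences
Lower fan-out and better tail latency, at the cost of slightly
higher median latency and more complex retry logic. Revisit if the
profile service gains native multi-get support.
```

Note that the Context and Consequences sections are where most of the value lives: they capture the “why” and the known tradeoffs, which is exactly the institutional knowledge that otherwise walks out the door with the people who made the decision.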
The people, their stories, and how they communicate them are essential elements of a tech stack.
Humans are part of our systems, which means that humans are also part of the technical debt considerations. How knowledge is communicated, shared, and lost is part of system resiliency. We forget that platforms are sociotechnical systems at our peril.
Organizations should be resilient enough that they can survive losing a single person at any given time. Layoffs are harder in that their impact is more broadly felt, but they can still be done thoughtfully and with consideration for explicit knowledge transfer. A massive hemorrhaging of institutional knowledge by yeeting your staff into the sun, however, is its own form of technical debt.
Image credit: Licensed from Adobe Stock
** The discussion of microservices, and how the major contributing factors to calculating technical debt are the “entanglement of the graph of dependencies and the length of the dependency chains,” would be of particular interest to anyone watching this banter about monoliths.
Disclaimer: VFunction is a RedMonk client (though I actually found the webinar via Kevin Swiber: thanks Kevin!) as is Google Cloud.