As we round the bend into 2019, it’s worth thinking about where our industry is headed. There are many exciting and challenging developments ahead: blockchain scalability, functions as a service, databases as a service—the list goes on. We’re also moving more and more into an increasingly complex, distributed world. This means distributed tracing will become especially important.
A Quick Definition
Before I explain why distributed tracing is so important, let me explain what it is.
Distributed tracing is the stitching of multiple requests across multiple systems. The stitching is often done by one or more correlation IDs, and the tracing is often a set of recorded, structured log events across all the systems, stored in a central place.
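To make the definition concrete, here is a minimal sketch of one correlation ID following a request through two services, with each service recording a structured event in a shared place. Everything in it is an assumption made for illustration: the `X-Correlation-ID` header name, the `order_service` and `warehouse_service` functions, and the in-memory `CENTRAL_STORE` list standing in for a real central log store.

```python
import json
import time
import uuid

# Hypothetical in-memory stand-in for a central trace store.
CENTRAL_STORE = []


def record_event(correlation_id, service, message):
    """Append one structured log event to the central store."""
    event = {
        "correlation_id": correlation_id,
        "service": service,
        "message": message,
        "timestamp": time.time(),
    }
    CENTRAL_STORE.append(json.dumps(event))
    return event


def warehouse_service(headers, order_id):
    """Downstream service: reads the ID it was handed and logs with it."""
    correlation_id = headers["X-Correlation-ID"]
    record_event(correlation_id, "warehouse", f"reserved stock for order {order_id}")


def order_service(headers, order_id):
    """Entry-point service: reuses the caller's ID or starts a new trace."""
    correlation_id = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    record_event(correlation_id, "orders", f"received order {order_id}")
    # Pass the same ID along on the downstream call so the events can be stitched.
    warehouse_service({"X-Correlation-ID": correlation_id}, order_id)
    return correlation_id


trace_id = order_service({}, order_id="A-100")
events = [json.loads(line) for line in CENTRAL_STORE]
```

In a real system the two services would sit on opposite sides of a network call and the store would be a logging or tracing backend, but the shape is the same: every event carries the ID, and the ID travels with the request.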
But what makes all this so critical? Distributed tracing will help address the biggest challenges we’ll confront next year.
From Modularized Monoliths to Microservices
The first challenge we’ll face in 2019 is the fact that software systems are much more distributed than they used to be. In order to deepen our capabilities within organizations, we’ve had to subdivide software features not just by modules, but by entire deployment pipelines. We’ve gone all in on microservices, pushing systems that once communicated in-process to now communicate across networks. This brings all the fallacies of distributed computing along for the ride. Network issues are often intermittent and exist outside any one software team, which means we now need something that can work across multiple teams.
The Climb of Complexity
The second challenge for the new year will be the increase in complexity. In many organizations, it’s reached a tipping point where we can’t simply apply the same tools we’ve used for the past 20 years. We’re automating more and more parts of our businesses. And, using tools like machine learning, we’re deepening the complexity of systems we already automated.
For many organizations, software complexity has gone so far that it’s overwhelmed the capabilities of human minds. Even for one value stream, the number of people and systems we need to touch is too high for us to anticipate every edge case and every risk. Add to that the fallacies of distributed computing and we must be ready to deal with systems with a lot of unknown risks.
Too Many Problems to Anticipate
We must face the fact that increased distribution has created many risks that exist outside any one software team. And, because complexity has overwhelmed our minds, the chance of unknown unknowns has gone up.
Let’s back up for a second and talk about knowns and unknowns. For risk, these come in four different categories:
- Known knowns: These are things we’re aware of and understand. For example, I know that if we don’t validate that a product code has five characters or fewer, the order submission process will throw a 500 error. We can easily test-drive these things out.
- Known unknowns: Things we’re aware of but don’t understand. Example: We keep running out of database connections but aren’t sure which processes are hogging them all. We can prevent many of these with good architectural design and quality gates.
- Unknown knowns: Things we’re not aware of but understand. Example: We know that unvalidated data will corrupt our Order Submission service. However, we don’t know which endpoints have no validation. Again, good architectural design and quality gates prevent these sorts of things by making quality part of the design of each feature.
- Unknown unknowns: Things we’re not aware of and don’t understand. Example: Unbeknownst to us, at midnight the router changes policies and causes a race condition in the Order Submission service, which in turn triggers orders to be fulfilled twice in the warehouse.
Unknown unknowns are problems that we could not have possibly predicted. Any attempt to prevent these types of issues ahead of time is a fool’s errand. We’d incur exorbitant costs trying to create crystal balls and would wind up not making any changes at all. After all, no changes mean no risk.
Medicine That Works After Software Gets Sick
Since we can’t predict these unknown unknowns, we need to apply tools that fix issues quickly after they happen.
Let me put it another way: We can heed all the nutritional and physical health advice in the world, but eventually we all get sick. The same is true for software systems: They all inevitably get sick. When a person is sick, we don’t say, “Well, just exercise some more.” That advice may be helpful to prevent future issues, but it doesn’t help the person at the moment. Instead, the doctor will say, “Get some rest. Take these antibiotics and call me if your symptoms persist.”
With sick software we can’t say, “Let’s create automated tests for every potential disease that might afflict our system.” Instead, we need ways to deal with problems after they’ve happened.
This is why distributed tracing will be so important in 2019. It doesn’t attempt to predict the future. It simply tells a story of what’s already happened in our systems. Tracing is also built to tell that story across the entire network. You can even involve your routers, firewalls, and proxies in a distributed trace.
Stitching a Story
As I said, distributed tracing tells a story of what’s happening in a software request. It’s simply recording what it tracks. As humans, we’re good at following stories. This makes distributed tracing a great fit for us to quickly deal with unpredictable problems.
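As a rough illustration of what "telling the story" can look like, the sketch below takes events pulled from a central store, groups them by correlation ID, and orders each group by time. The event fields, the `stitch_stories` and `tell_story` helpers, and the sample data are all hypothetical and chosen only for this example.

```python
from collections import defaultdict


def stitch_stories(events):
    """Group events by correlation ID and order each group by timestamp."""
    stories = defaultdict(list)
    for event in events:
        stories[event["correlation_id"]].append(event)
    for trace in stories.values():
        trace.sort(key=lambda e: e["timestamp"])
    return dict(stories)


def tell_story(trace):
    """Render one trace as human-readable lines, in the order things happened."""
    return [f'{e["service"]}: {e["message"]}' for e in trace]


# Hypothetical events as they might arrive from several systems, out of order.
sample_events = [
    {"correlation_id": "req-1", "service": "warehouse", "timestamp": 3, "message": "fulfilled order"},
    {"correlation_id": "req-2", "service": "orders", "timestamp": 2, "message": "received order"},
    {"correlation_id": "req-1", "service": "orders", "timestamp": 1, "message": "received order"},
    {"correlation_id": "req-1", "service": "router", "timestamp": 2, "message": "forwarded request"},
]

stories = stitch_stories(sample_events)
story = tell_story(stories["req-1"])
```

Sorting by timestamp is a simplification. Clocks on different machines are rarely perfectly in sync, so tracing tools typically also record which step called which, rather than relying on time alone.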
However, this benefit doesn’t come for free. A key to telling a great story is to include the right level of detail. This means we need to design observability into our features and user stories up front. We have to think about ourselves as first-class support personas. There are specifications that make this easier to do, but we still have to think about this as we design our systems. At least we don’t have to predict what problems might appear.
Distributed Tracing: Your Storybook
Let’s review why distributed tracing will be so important in the coming year. We know that software systems are climbing higher in complexity. We also know that we’re slicing these systems into smaller and smaller deployable pieces, even down to the function. These systems are now too complex for our human minds to anticipate all the problems that might arise from their use. We need to put tooling in place that can quickly detect and fix these risks we can’t anticipate.
Distributed tracing is one of the best ways to track down these unexpected problems. It doesn’t assume what problems will crop up in the system. Instead, tracing tells a story across a network, and we’re good at following stories. If you’re responsible for any of these systems, distributed tracing is an absolute must for 2019.
This post was written by Mark Henke. Mark has spent over 10 years architecting systems that talk to other systems, doing DevOps before it was cool, and matching software to its business function. Every developer is a leader of something on their team, and he wants to help them see that.