In the DevOps and SRE world, observability has become an important term, especially when you’re talking about running production systems. No matter how much effort you put in creating good quality software, there will always be things you miss, like users increasing exponentially, user data that’s longer than expected, or cache keys that never expire. I’ve seen it, and I’ve produced it too. For this reason and more, systems need to be observable.
Today’s post is about answering the why and how of observability. In a world where too many pieces in the system interconnect, new problems arise every day. And the good news is that your systems lend themselves to observation when you instrument them appropriately. More on this later.
But first, let me clear the path by explaining what observability means.
What Does Observability Mean?
To better explain this concept, let me use the definition of observability from control theory in Wikipedia: “observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” If we talk about running production systems, can you infer its internal state by observing the outputs the system produces? What outputs should you be looking at that describe the inner workings of the system? You might not know it, but I’d say that CPU and memory metrics are not sufficient.
Observability helps you understand the internals of your production system by asking questions from the outside.
Historically, we monitor metric values like CPU and memory from known past issues. And every time a new issue arises, we add a new monitor. But the problem is that we usually end up with noisy monitors, which people tend to ignore. Or you end up with monitors whose purpose no one understands.
For known problems, you have everything under control—in theory. You have an incredible runbook that you just need to follow, and customers shouldn’t notice that anything happened. But this isn’t how things often work, in reality. Customers still complain about issues in your system even if your monitors look good.
Therefore, monitors for metrics alone are not enough. You need context, and you can get it from your logs.
Why Is Observability Needed?
When working with simple applications that don’t have many parts to fulfill a need, like a WordPress site, it’s easy to control its stability to some extent. You can place monitors for CPU, memory, networking, and databases. Many of these monitors will help you to understand where to apply known solutions to known problems. In other words, these type of applications are mature enough to know their own problems better.
But nowadays, the story is different. Almost everyone is working with distributed systems. There are microservices, containers, cloud, serverless, and a lot of combinations of these technologies. All of these increase the number of failures that systems will have because there are too many parts interacting. And because of the distributed system’s diversity, it’s complex to understand present problems and predict future ones.
You should recognize what the SRE model emphasizes: systems will continue failing. As your system grows in usability and complexity, new and different problems will continue emerging. Known problems from the past might not occur again, and you’ll have to deal with the unknown problems regularly.
Observability is what will help you to be better when troubleshooting in production. You’ll need to zoom in and zoom out over and over again. Take a different path, deep-dive into logs, read stack trace errors, and do everything you can to answer new questions to find out what’s causing problems.
Correlating Data Is Fundamental
At this point, you can see that for observability to work, you need to have context. Metrics alone don’t give you that context. From what I’ve seen, when memory usage increases, it’s just a sign that something terrible is about to happen. Usually, it’s a memory leak somewhere in the code or a customer that uploads a file bigger than expected. These are the type of problems that you can’t spot by just using a metric.
As the Twitter engineering team said a few years ago on their blog, the pillars of observability are
It doesn’t mean that these are going to be the only sources of information you use to observe the system from the outside, but they’re the primary source of information.
Correlating these different sources of data is challenging but not impossible. For example, in microservices, there’s heavy leverage on HTTP headers to pass information between calls. Something as simple as marking a user’s request with a unique ID can make the difference when debugging in production. By using the request ID in a centralized storage location, you can get all the context from a user’s call at a specific point in time…like the time when the user complained but your monitors said things were all good.
How Can We Make Systems Observable?
Everything that happens inside and outside of the system can emit valuable information.
An easy way to start making your systems observable is by collecting all sorts of metrics surrounding the application. I’m talking about CPU, memory, network, or disk metrics from the infrastructure that’s hosting the application. Also include logs from the applications or services of which you don’t control the source code. I’m talking about services like NGINX, Redis, or any other logs from a cloud provider like AWS CloudWatch metrics.
Another essential component of making systems observable is the logs from your applications. And for that, the instrumentation is vital. Luckily for us, nowadays there are pretty good tools out there, like OpenCensus, that you can use to emit logs, traces, and metrics from the inside of your applications.
Once in production, you might notice that the application isn’t emitting enough information, and that’s OK. You can continue adding more information as needed. But make sure you also get rid of the information you no longer need or use.
One thing I must say—all this information must be stored in a central location and in a raw format. It will be counterproductive to have to query different tools at different places to get the information you need when asking new questions. And you need to have the ability to aggregate and disaggregate information as required.
Observability Helps You With Unknown Problems
Observability, in a nutshell, is the ability to ask questions from the outside to understand the inside of a system.
Making your production systems observable doesn’t mean that you’ll be able to solve problems. You must continue examining all the information you have and think about whether it’s still useful. Avoid falling into the trap of collecting everything just because you might need it in the future.
It’s essential that you also spend time understanding the system and its architecture and components, either internal or external. Plus, it’s healthy to know where reliability is lacking in your system. Observability will help your team, especially developers, to own the system and to better understand how it behaves in a live environment with real users.
One last piece of advice. Don’t rely on the past problems you’ve had. All your monitors might say the system is OK, and yet customers still complain. Observability is about having the right data when you need to debug production systems. It will help you with the unknown problems you might have and do have.
This post was written by Christian Meléndez. Christian is a technologist that started as a software developer and has more recently become a cloud architect focused on implementing continuous delivery pipelines with applications in several flavors, including .NET, Node.js, and Java, often using Docker containers.