As technology advances, many new roles are emerging. One of them, the site reliability engineer, has been around for about 15 years. The discipline behind it, site reliability engineering (SRE), is a term coined by Google to describe how it runs production systems, and it has recently gained popularity.
Many companies are now advertising for site reliability engineer positions or trying to implement SRE. But with movements like DevOps also becoming more prominent, you may be wondering if it’s really necessary to hire a site reliability engineer at your company.
In fact, given SRE’s close resemblance to DevOps, there is an ongoing debate over what SRE is and why site reliability engineers play an important role.
In this article, I’ll explain what SRE is. To give you a better understanding of SRE, I’ll also discuss how it relates to DevOps and why you should consider adding a site reliability engineer to your team.
The Origins of SRE
In 2003, Benjamin Treynor, the originator of the term SRE, was put in charge of running a production team consisting of seven engineers. The purpose of this production team was to make sure that Google websites were available, reliable, and as serviceable as possible.
Since Benjamin was a software engineer, he designed and managed the team in the way he would have if he worked as a site reliability engineer himself. He did this by having the team spend half their time on operations tasks so they could have a better understanding of software in production. That team eventually became Google’s present-day SRE team.
As Benjamin puts it, one of the contributing factors for the idea behind SRE was the division between the product development and operations team.
Each of these teams has differing goals. On the one hand, the development team aims to launch new features and see how users adopt them. On the other hand, the operations team makes sure that the service doesn’t break. When each team has their own way of doing things, it becomes difficult to achieve business goals.
As it turned out, SRE became the paradigm to help manage Google’s large-scale systems as well as facilitate the continuous introduction of new features.
So What Is SRE?
SRE essentially involves creating a bridge between development and operations. SRE’s approach to this is to apply a software engineering mindset to system administration topics.
Since SRE is a relatively new concept, there is no consensus on what the site reliability engineer role entails or what exactly it is. A quick survey of job expectations and requirements from different job listings makes this evident.
To explain more on the site reliability engineer role, some Google engineers have written a book about SRE that you can read online for free. The book explains how Google handles SRE in their organization.
Note: although you will learn a lot from the book about how to implement SRE, this doesn’t necessarily mean that your company should copy Google’s exact methods. The main consideration should be your organization’s needs. For instance, a large organization’s implementation of SRE is not the same as a startup’s, especially when it comes to affording a team for this role.
Important Aspects of SRE
Still not convinced your organization should adopt SRE? Let’s have a look at some aspects that set the site reliability engineer role apart from other roles.
- Site reliability engineers collaborate with other engineers, product owners, and customers to agree on targets and measures, expressed as service level indicators (SLIs) and service level objectives (SLOs). This helps ensure system availability. Once everyone has agreed upon a system’s uptime and availability targets, it is easy to know when action should be taken.
- SRE introduces error budgets that help you measure risk and consequently balance availability and feature development. Having an error budget means that failure is accepted as normal and that requiring 100 percent availability is not necessary. With no unrealistic reliability targets set, a team has the flexibility to deliver updates and improvements to a system.
- SRE believes in reducing toil. Therefore, it aims at automating tasks that require a human operator to manually work on a system. For instance, Google caps operational work, the feeding and daily care of existing applications, at 50 percent of each site reliability engineer’s time. The remaining time is reserved for engineering work such as coding and automation.
- A site reliability engineer should have a holistic understanding of the systems as well as the connections between the systems.
- Site reliability engineers have the task of ensuring the early discovery of problems to reduce the cost of failure.
- Since the goal of SRE is to solve problems between teams, the expectation is that both the SRE teams and the development teams have a holistic view of libraries, front end, back end, storage, and other components. Shared ownership also means that no single team can jealously own individual components.
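To make the SLO and error budget points above more concrete, here is a minimal sketch of the arithmetic in Python. It assumes a hypothetical 99.9 percent availability SLO measured over a 30-day window. The function names and numbers are illustrative assumptions, not taken from Google’s tooling.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO.

    `slo` is a fraction such as 0.999 (an assumed example target).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100 percent SLO leaves no error budget")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Hypothetical example: 99.9% SLO and 21.6 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 21.6), 2))
```

Under these assumptions, a 99.9 percent SLO allows roughly 43 minutes of downtime in 30 days. A team that has spent about half of that still has room to ship changes, while a team that has overspent it might choose to slow down releases and focus on reliability.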
Is There a Relationship Between SRE and DevOps?
You may have noticed that there are a lot of similarities between SRE and DevOps. It can be especially confusing because both SRE and DevOps aim to bridge the gap between operations and development. We also see the practices behind these concepts playing an important role in scaling and automating processes.
But what sets SRE apart from DevOps? DevOps bridges the gap between operations and development by aligning key goals and initiatives, while SRE uses team-lead engineers who have an operations background and mindset to remove departmental communication problems.
Another major difference between the two is the focus on coding. DevOps focuses on creation and testing, which involves moving the code through the pipeline effectively and efficiently. On the other hand, SRE focuses on creating a balance between site reliability and the need for new features.
Although DevOps helps reduce the problematic gap between operations and development, it doesn’t define clearly how to accomplish these goals. SRE embodies DevOps philosophies and goes even further to include ways of achieving reliability through engineering and operations work.
In other words, and as Google puts it, “SRE implements DevOps.”
Should You Implement SRE?
The buzz on the value of site reliability engineers has many IT managers wondering if they should add one to their team. In most cases, the addition of site reliability engineers to a team happens during the design and development of large-scale systems.
Although SRE was created at Google, other recognizable brands such as Netflix, GitHub, and Reddit already have these teams. So far, it is mainly cloud-native and SaaS companies that have adopted SRE. Still, other companies are gradually adopting this role for their software development teams.
Some Reasons Why You Should Consider SRE
To sum up, here are a few key reasons why a site reliability engineer role is worthwhile:
- SRE automates processes for reliability, which will save time for your in-house team. Through automation, a team eliminates manual reprogramming, which is tedious and laborious. Thus, SRE will help in recognizing and addressing operational flaws with no human interference.
- A site reliability engineer combines the role of a system administrator and developer and this prevents conflicts that might arise. How? When you have a system admin and a developer, each embraces different ideas and methodologies at the time of development and troubleshooting. But a site reliability engineer utilizes the strengths of a system developer and those of a system admin to form an operational system.
- The collaboration skill of site reliability engineers is critical for high-quality systems. It also comes in handy when there are problems during development or when a system fails. This is because site reliability engineers focus on finding a solution rather than dwelling on divisive matters of how things should be done.
- A site reliability engineer uses an innovative approach to problem-solving and this boosts the likelihood that your team will come up with a disruptive product.
To successfully implement SRE, your organization should have people with the right experience. Such people will be able to lead development teams operationally.
Should a Startup Adopt SRE?
Well, yes, but remember that every organization has to deal with all sorts of different problems.
A startup might not have the budget to hire a team dedicated to systems reliability, nor the time to focus on many of the practices SRE fosters. But it’s important to remember that the SRE model comes from the book Google published a few years ago, and some companies realized they were doing SRE as well. What I’m trying to say is that you don’t have to adopt all the practices and principles from SRE to get some benefits.
That said, one essential practice that I’d recommend you start implementing right away is SLOs. A single target number will foster a lot of discussion within teams. Start measuring to understand how your numbers look at the present moment. Then you might find out that you’re spending more time than you need creating the perfect CI/CD pipeline or trying to do several deployments per day.
You might be having more significant problems, but you need to start measuring. Getting feedback after every deployment or release is critical, so focus on that.
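As a starting point for measuring, here is a minimal sketch of an availability SLI computed from request counts and compared against an SLO target. The `RequestCounts` type, the definition of a failed request, and the sample numbers are all hypothetical. In practice the counts would come from whatever monitoring your team already has.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    total: int   # all requests served in the measurement window
    failed: int  # requests counted as errors (assumed definition)


def availability_sli(counts: RequestCounts) -> float:
    """Fraction of requests in the window that succeeded."""
    if counts.total <= 0:
        raise ValueError("need at least one request to compute an SLI")
    return (counts.total - counts.failed) / counts.total


def meets_slo(counts: RequestCounts, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return availability_sli(counts) >= slo_target


if __name__ == "__main__":
    # Hypothetical numbers for the window following a deployment.
    after_deploy = RequestCounts(total=10_000, failed=5)
    print(availability_sli(after_deploy))
    print(meets_slo(after_deploy, 0.999))
```

Running a check like this after every deployment or release gives the team a concrete number to discuss, even before any other SRE practice is in place.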
Understanding Your Organization’s Needs Will Help You Decide
Despite the debate surrounding SRE and DevOps, the most important thing is that they both aim to help applications and systems run more efficiently. With more advances in technology, we are bound to see new practices and roles coming up. Of utmost importance is being able to select or adopt new practices or roles that drive operational efficiency.
If you are trying to decide whether you should adopt SRE, think of the outcome. If you have large projects that require continuous improvement, SRE is right for you.
This post was written by Alice Njenga. Alice’s areas of expertise include technology, artificial intelligence, IoT, cloud computing, security, and telecommunication. She especially enjoys converting dense technical material to articles that are easy for the layman to understand.