The differences between DevOps and SRE

Published March 05, 2020

Can you relate to this? Like many other organizations today, you are thinking about adopting DevOps and its methodologies, or you are already on your way. You're hammering away at your continuous integration pipeline, setting up a clear strategy for release management, and you notice your company culture shifting towards one that values learning. It's going great.

Then you start hearing about a new idea called SRE, or Site Reliability Engineering. It seems almost like DevOps, but not quite, and you find it difficult to pinpoint exactly what the differences are.

If so, this blog post is just for you. In it, we'll talk about the two concepts and try to highlight the origins of DevOps and SRE and the differences between them.

DevOps and SRE

At first glance, DevOps and SRE seem strikingly similar. In many ways, they very much are - some call them two sides of the same coin. Both methodologies try to close the gap between development and operations, and both are aimed at improving the release cycle. There's also a large overlap in responsibilities, requirements and actual tasks. So what makes them different? Let's start by taking a closer look at Site Reliability Engineering, also known as SRE.

SRE origins

Around 2003, Google made a big decision. They had noticed that R&D was going great - new features and products were pushed to production at a rapid pace, and customers seemed to like it. However, the Operations teams - whose goal was to keep the ship afloat - were not huge fans. Although both departments undoubtedly consisted of very skilled engineers and software developers, Google realized that they were pulling in different directions.

One key operations leader, Ben Treynor, came up with an interesting solution. While operations teams were normally made up of system administrators, Ben realized that they could work better and more smoothly if they also had experience in development. And so, Google changed how it thought about production management, and SRE was born.

What is SRE?

SRE is a way for companies to join software engineering and operations more closely. People working with SRE are experienced in both development and operations tasks, and an SRE engineer would likely spend time doing both. They use software as a direct solution to many problems that have traditionally required a hands-on approach. In addition, an SRE engineer integrates seamlessly with development teams, finding common ground in high code quality and automated tests. In short, this type of role creates opportunities for better collaboration between the two departments.

Indeed, both DevOps and SRE are about breaking down silos and helping each team see the other side of the process. Google's own analogy describes DevOps as an interface in a programming language and SRE as a class that implements that interface. In other words, while DevOps defines the overall behavior of production management, the specifics are left unsaid. In contrast, SRE provides a clearer roadmap for how to implement, measure and ultimately achieve these DevOps goals.
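To make the analogy a little more concrete, here is a minimal sketch of how it could look in Python. The method names and the returned descriptions are our own illustration, loosely based on the three concepts discussed later in this post, and not anything Google has published as code.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The 'interface': says what should be achieved, not how."""

    @abstractmethod
    def reduce_silos(self) -> str: ...

    @abstractmethod
    def accept_failure(self) -> str: ...

    @abstractmethod
    def automate(self) -> str: ...


class SRE(DevOps):
    """One possible implementation of the DevOps 'interface'."""

    def reduce_silos(self) -> str:
        return "share common tools and technologies across teams"

    def accept_failure(self) -> str:
        return "define SLIs and SLOs for acceptable levels of failure"

    def automate(self) -> str:
        return "treat automation as a software problem"


# The abstract class cannot be instantiated; a concrete implementation can.
practices = SRE()
print(practices.accept_failure())
```

The point of the sketch is only that `DevOps` leaves every method unimplemented, while `SRE` has to fill each one in with something specific.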

So what are the differences?

According to Google themselves, the main difference is just that - DevOps specifies what we need to do and SRE tells us how to do it. SRE is really about taking the theoretical goals set up by DevOps and creating an actual workflow of methods to follow, tools to migrate to, processes to adapt, and so on, all while keeping the DevOps vision in mind. Or as Google puts it: "SRE embodies the philosophies of DevOps with a greater focus on measuring and achieving reliability through engineering and operations work".

If you still have questions, we understand. Here are some examples of classic core DevOps concepts and how they relate to SRE.

1. Breaking down silos

As a company grows, its organizational structure usually becomes more and more complex. This, in turn, leads to the bane of everything DevOps - silos. People in each department work towards their own goals, not minding anything else and often failing to see the bigger picture. As the classic saying goes: "It works on my machine". The result? Often a high degree of frustration, longer lead times and higher costs, to name a few.

While DevOps advocates often talk about reducing silos and getting everyone working towards a shared vision, SRE people don't really talk about the number of silos in a company. Instead, SREs take a pragmatic approach and try to find common tools and technologies across the company. This way, responsibilities are more easily shared and communication flows better.

2. Failures

A large part of DevOps is how we as an organization view failures. With increased speed, it's more or less unavoidable that things sometimes don't quite go as planned. With this in mind, DevOps advocates treat failures as part of the process. They shift focus towards lower time to recovery and seeing failures as learning opportunities. In DevOps, we also teach teams to shift left to identify failures earlier in the process, thus reducing the number of failures in production (but that's a topic for another blog post).

In SRE, this mindset becomes tangible through measurable objectives. Learning opportunities are great; consistent failures are not. Realizing the need for a balance between failures and successes, SRE defines acceptable levels of failure using what are called Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

SLIs are indicators such as request latency, failures per request and throughput in requests per second. The SLOs then define the levels of these indicators that we're ready to accept, often in the form of a percentage or a distinct number. In short, SLOs are the targets the organization commits to meeting on its SLIs over time.
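As a rough illustration of how SLIs and SLOs relate, the sketch below computes two indicators from a batch of requests and checks them against targets. The `Request` type, the function names and all threshold values are made up for this example; real targets depend entirely on your service and your users.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    succeeded: bool


def failure_ratio(requests: list[Request]) -> float:
    """SLI: fraction of requests that failed."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.succeeded)
    return failed / len(requests)


def slow_ratio(requests: list[Request], threshold_ms: float) -> float:
    """SLI: fraction of requests slower than the latency threshold."""
    if not requests:
        return 0.0
    slow = sum(1 for r in requests if r.latency_ms > threshold_ms)
    return slow / len(requests)


# Hypothetical SLO targets, chosen only for illustration.
LATENCY_THRESHOLD_MS = 300.0
MAX_FAILURE_RATIO = 0.01  # at most 1% of requests may fail
MAX_SLOW_RATIO = 0.05     # at most 5% of requests may exceed the threshold


def meets_slo(requests: list[Request]) -> bool:
    """True if both SLIs are within their SLO targets for this batch."""
    return (
        failure_ratio(requests) <= MAX_FAILURE_RATIO
        and slow_ratio(requests, LATENCY_THRESHOLD_MS) <= MAX_SLOW_RATIO
    )


# 995 fast successful requests and 5 failed ones.
sample = [Request(120.0, True)] * 995 + [Request(80.0, False)] * 5
print(failure_ratio(sample), meets_slo(sample))
```

With 5 failures in 1000 requests the failure ratio is 0.5%, which is inside the 1% target, so this batch meets the SLO. In practice the indicators would be measured over a time window rather than a single batch.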

3. Automation

Automation is a core concept for both DevOps and SRE. Removing the need for manual interventions, cutting lead times and more - automation can solve a lot of problems for both DevOps engineers and SREs. So far, so good.

The main difference in relation to automation is this: SRE simply views this automation as a pure software problem. DevOps, on the other hand, often automates using different tool configurations and integrations.

What is reliability?

At their core, what DevOps and SRE are trying to achieve is reliability. With everything we've mentioned above, including accepting failures, monitoring and sharing responsibilities, we also need to find ways to measure the actual reliability of our system.

SREs use SLIs and SLOs, while DevOps teams find other ways to measure key indicators, such as failure rates and mean time to recovery, and in essence try to focus on code quality as a standard rather than on infrastructural metrics. Perhaps neither of them fully grasps the core of reliability - that it's not just about the infrastructure. Reliability needs to be kept in mind throughout the entire lifecycle, from built-in quality to system performance and, of course, system security.
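For reference, here is a small sketch of how two of these indicators could be calculated. The incident timestamps and deployment counts are invented, and we assume here that "failure rate" means the share of deployments that led to a failure; your team may define it differently.

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (failure detected, service recovered) pairs.
incidents = [
    (datetime(2020, 3, 1, 10, 0), datetime(2020, 3, 1, 10, 30)),
    (datetime(2020, 3, 2, 14, 0), datetime(2020, 3, 2, 15, 30)),
]


def mean_time_to_recovery(incidents) -> timedelta:
    """Average time between a failure and its recovery."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Share of deployments that led to a failure (assumed definition)."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    return failed_deployments / total_deployments


print(mean_time_to_recovery(incidents))  # 30 min and 90 min average to 1:00:00
print(failure_rate(3, 40))               # 3 of 40 deployments failed
```

Numbers like these only become useful when they are tracked over time, which is where the continuous improvement described below comes in.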

So how do we find common ground in this? Well, we know that failures and problems do happen. In true agile spirit, we want to be better than we were yesterday. We want to improve continuously. To do that, we need to be able to rely on our data and understand why a failure happened. Pair this with automation and constant system monitoring, and over time we can become more and more reliable, predicting problems before they happen and finding solutions to commonly occurring issues.

Summary

So, is there a difference between DevOps and SRE? As we see it, SRE has a tighter definition and clearer expectations. DevOps is more of a concept and mindset needed to achieve the different goals, which perhaps makes it applicable to a wider range of situations and organizations.

However, DevOps and SRE teams still have a lot in common: bridging the gap between development and operations, sharing responsibilities and utilizing automation everywhere possible.

In the end, we need to understand all our problems and improve continuously. As mentioned above, reliability is not only relevant to infrastructure - it must be present across the entire system.