Key points
- Site Reliability Engineering (SRE) is a radically new approach to IT operations
- The ‘everything as code’ approach brings a software engineering mindset to operations
- SRE teams are empowered to say ‘no’ to software with low operability
In this two-part mini series, we’ll look at the emerging discipline of Site Reliability Engineering (SRE) and how this relates to Managed Services for IT Operations. In Part 1, we’ll cover the basics of SRE, and in Part 2, we will look at what SRE means for the IT organisation and Managed Services.
SRE – a new set of rules
Site Reliability Engineering (SRE) is a new way of running large-scale software systems. Devised and popularised by Google, SRE is a specific set of disciplines and dynamics that work together with modern software engineering practices to help produce reliable software at scale. The SRE discipline combines deep awareness of technical infrastructure, operating systems and computer networking with attention to higher-level service level objectives (SLOs) to maintain a focus on business-relevant activities.
Authors from Google recently published a book – Site Reliability Engineering – that Google has made available for free at https://landing.google.com/sre/book.html. This is a great way to learn about SRE and explore how it works.
‘Everything as Code’
Ben Treynor is VP of Engineering at Google and the person who founded the SRE capability at Google and grew it from 7 people to over 1200 staff by 2014. According to Treynor, SRE is “what happens when you ask a software engineer to design an operations function”. One of the main implications of having software engineers design an IT operations function is that everything is code: servers, infrastructure, updates, rollbacks, and scaling are all defined and executed as code rather than through interactive human operations. This ‘everything as code’ approach has several important implications:
- All changes start in version control (such as a Git repository)
- All changes are tracked with software tooling
- All changes are testable using test-first development techniques and testing frameworks such as Cucumber and RSpec
- All changes are designed with instrumentation and observability in mind so that problems can be detected quickly
This means that software written to be run by SRE teams is almost certainly going to work better in production than software not written with these criteria in mind. SRE acts as a ‘high bar’ that helps to make sure that we deal with typical operational problems early and often.
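To make the ‘everything as code’ idea more concrete, here is a minimal sketch in Python of a service definition that lives in version control alongside a test that checks it for basic operability. Everything in it is an assumption made for illustration: the `ServiceConfig` fields, the `validate` rules, and the `checkout` service are hypothetical, and the test uses plain assertions rather than Cucumber or RSpec.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical desired state for one service, kept in version control."""
    name: str
    replicas: int
    health_check_path: str
    metrics_enabled: bool


def validate(config: ServiceConfig) -> List[str]:
    """Return a list of operability problems; an empty list means none found."""
    problems = []
    if config.replicas < 2:
        problems.append("at least 2 replicas are needed to survive a single failure")
    if not config.health_check_path.startswith("/"):
        problems.append("health_check_path must start with '/'")
    if not config.metrics_enabled:
        problems.append("metrics must be enabled so problems can be detected quickly")
    return problems


def test_checkout_config_is_operable() -> None:
    # Written before the change is merged: the change only ships if this passes.
    config = ServiceConfig(
        name="checkout",
        replicas=3,
        health_check_path="/healthz",
        metrics_enabled=True,
    )
    assert validate(config) == []


if __name__ == "__main__":
    test_checkout_config_is_operable()
    print("checkout config passed its operability checks")
```

The point of the sketch is the workflow rather than the specific rules: the change is a reviewable diff, and the checks run automatically instead of relying on someone remembering them.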
Say ‘no’ to software with low operability
A key aspect of how SRE works is that the SRE function is empowered to push back on low-quality software. Specifically, if software development teams ask an SRE team to take on the running of their software, the SRE team is empowered to demand evidence of operability in the form of automated test results and instrumentation. If the code isn’t good enough from an operations perspective, the SRE team can (rightly) reject the code as not fit for production.
Google manages this focus on proven operability through an ‘error budget’. Each service or application run by an SRE team has an associated availability target (say 99.8% uptime), which comes with some allowed downtime (for 99.8% availability, just over 87 minutes in an average month) – this is the ‘error budget’.
“The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter.” (Mark Roth, Google, quoted in the book Site Reliability Engineering, Chapter 3)
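The arithmetic behind the 87-minute figure is easy to check. The short Python sketch below converts an availability target into allowed downtime; the function name and the choice of an ‘average’ month of 365.25 / 12 days are our own assumptions, not something prescribed by Google.

```python
# Assumption: an "average" month is 365.25 / 12 days (about 30.44 days).
AVERAGE_MONTH_DAYS = 365.25 / 12


def allowed_downtime_minutes(availability_pct: float, period_days: float) -> float:
    """Return the downtime budget in minutes for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    # The budget is the share of the period the service may be unavailable.
    return period_minutes * (100 - availability_pct) / 100


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.8, AVERAGE_MONTH_DAYS)
    print(f"{budget:.1f} minutes per month")  # prints "87.7 minutes per month"
```

With a 30-day month the same target gives 86.4 minutes, so the exact budget depends on how the period is defined.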
The SRE team tracks the downtime on the service, and if the service is currently within its availability range, the Dev team can deploy new changes. Those changes may of course cause an outage. If the changes cause an outage that makes the service unavailable for longer than the allowed downtime for that period (say, longer than 87 minutes for our 99.8% service), then the Dev team is forbidden from deploying new changes and must demonstrate significant improvements in reliability before being allowed to deploy again.
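The deployment rule described above can be sketched as a small piece of Python. This is an illustration of the policy only, not Google's tooling: the `ErrorBudget` class, its field names, and the outage durations are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Tracks downtime against the allowance for the current period."""
    allowed_minutes: float      # downtime permitted in this period
    used_minutes: float = 0.0   # downtime recorded so far

    def record_outage(self, minutes: float) -> None:
        if minutes < 0:
            raise ValueError("outage duration cannot be negative")
        self.used_minutes += minutes

    @property
    def remaining_minutes(self) -> float:
        return self.allowed_minutes - self.used_minutes

    def deploys_allowed(self) -> bool:
        # Deploys continue only while downtime stays within the allowance.
        return self.used_minutes <= self.allowed_minutes


if __name__ == "__main__":
    budget = ErrorBudget(allowed_minutes=87.0)  # our 99.8% service
    budget.record_outage(30.0)
    print(budget.deploys_allowed())  # True: 57 minutes of budget remain
    budget.record_outage(60.0)
    print(budget.deploys_allowed())  # False: 90 minutes used exceeds 87
```

Because the check is a simple comparison against an agreed number, the decision to pause deployments does not depend on negotiation between the Dev and SRE teams.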
In practice, of course, Dev teams and SRE teams work together to get the software ready for production operation, collaborating on instrumentation, performance, resilience, error codes, and so on. By the time the Dev team wants to hand over the software to the SRE team at the Production Readiness Review (PRR), the code has been proven to work well. Still, the ability of the SRE team to insist on good operability is a crucial reason for the success of the SRE approach.
In this first part, we covered what SRE is and how SRE teams work: the ‘everything as code’ model, error budgets, and software engineering approaches. In Part 2 of this mini series, we will see what implications SRE has for IT operations and Managed Services.
Russ McKendrick
Practice Manager (SRE & DevOps)
Russ heads up the SRE & DevOps team here at N4Stack.
He's spent almost 25 years working in IT and related industries and currently works exclusively with Linux.
When he's not out buying way too many records, Russ loves to write and has now published six books.