WHAT IS SRE?

What is SRE? Site Reliability Engineering (SRE) is an operational model created by Google in 2003 which uses software engineering principles to efficiently manage complex infrastructure.

The role of a Site Reliability Engineer is becoming increasingly popular with adoption of the model. Broadly speaking an engineering team will have a blend of skills including sysadmin, security, automation and importantly the ability to code. The target of the role is to spend half of the time on projects to improve reliability with a heavy use of automation and the other providing operational support. Closely aligned with a DevOps culture the SRE team are empowered to drive proactive improvement, embracing automation and new architecture models. This 50% of time focused on improvement provides the efficiencies to reduce the operational support burden, maintaining an efficient balance.

The following areas are core principles of a Site Reliability Engineering practice:

Agreeing an error budget (outage % SLA) with the development team
Having stringent acceptance testing before having to supporting an application
Obsessive use of automation
Site Reliability Engineers who can code
Spend 50% time of tickets & 50% on (improvement) project work

Many well known companies use a Site Reliability Engineering operational approach including Uber, Netflix, Atlassian, Amazon, Facebook & Twitter.

SRE is “what happens when a software engineer is tasked with what used to be called operations.”

Ben Treynor

Founder of Google’s SRE Team

Summary

Article Name

What is SRE? | Site Reliability Engineering | Glossary

Description

WHAT IS SRE? Site Reliability Engineering (SRE) is an operational model created by Google in 2003 which uses software engineering principles to efficiently manage complex infrastructure.

Author

Andrew Slater

Publisher Name

N4Stack

Publisher Logo