WHAT IS SRE?
What is SRE? Site Reliability Engineering (SRE) is an operational model created by Google in 2003 which uses software engineering principles to efficiently manage complex infrastructure.
The role of a Site Reliability Engineer is becoming increasingly popular with adoption of the model. Broadly speaking an engineering team will have a blend of skills including sysadmin, security, automation and importantly the ability to code. The target of the role is to spend half of the time on projects to improve reliability with a heavy use of automation and the other providing operational support. Closely aligned with a DevOps culture the SRE team are empowered to drive proactive improvement, embracing automation and new architecture models. This 50% of time focused on improvement provides the efficiencies to reduce the operational support burden, maintaining an efficient balance.
The following areas are core principles of a Site Reliability Engineering practice:
- Agreeing an error budget (outage % SLA) with the development team
- Having stringent acceptance testing before having to supporting an application
- Obsessive use of automation
- Site Reliability Engineers who can code
- Spend 50% time of tickets & 50% on (improvement) project work
Many well known companies use a Site Reliability Engineering operational approach including Uber, Netflix, Atlassian, Amazon, Facebook & Twitter.