{"id":54048,"date":"2018-09-25T09:00:12","date_gmt":"2018-09-25T08:00:12","guid":{"rendered":"http:\/\/content.n4stack.io\/?p=54048"},"modified":"2018-10-04T11:23:47","modified_gmt":"2018-10-04T10:23:47","slug":"sre-as-a-service","status":"publish","type":"post","link":"http:\/\/content.n4stack.io\/2018\/09\/25\/sre-as-a-service\/","title":{"rendered":"Site Reliability Engineering (SRE) as a Managed Service"},"content":{"rendered":"
[et_pb_section bb_built=”1″ _builder_version=”3.12.2″ background_color=”#dddddd” custom_padding=”0|0px|3px|0px|false|false” next_background_color=”#ffffff”][et_pb_row _builder_version=”3.0.48″ background_size=”initial” background_position=”top_left” background_repeat=”repeat” custom_padding=”0|0px|27px|0px|false|false”][et_pb_column type=”4_4″][et_pb_text ul_item_indent=”5px” _builder_version=”3.12.2″ text_font_size=”17.5px”]<\/p>\n
In this two-part mini-series, we’ll look at the emerging discipline of Site Reliability Engineering (SRE) and how this relates to Managed Services for IT Operations. In Part 1, we’ll cover the basics of SRE, and in Part 2,<\/a> we will look at what SRE means for the IT organisation and Managed Services.<\/p>\n [\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section bb_built=”1″ _builder_version=”3.0.47″ custom_padding=”17px|0px|14px|0px|false|false” prev_background_color=”#dddddd” next_background_color=”#dddddd”][et_pb_row _builder_version=”3.0.48″ background_size=”initial” background_position=”top_left” background_repeat=”repeat”][et_pb_column type=”4_4″][et_pb_text _builder_version=”3.12.2″ text_font_size=”17px” header_text_color=”#e05206″ header_2_text_color=”#ffffff”]<\/p>\n <\/p>\n Site Reliability Engineering (SRE)<\/a> is a new way of running large-scale software systems. Devised and popularised by Google, SRE is a specific set of disciplines and dynamics that work together with modern software engineering practices to help produce reliable software at scale. The SRE discipline combines deep awareness of technical infrastructure, operating systems and computer networking with attention to higher-level service level objectives (SLOs) to maintain a focus on business-relevant activities.<\/p>\n Authors from Google recently published a book – Site Reliability Engineering<\/em> – that Google has made available for free at https:\/\/landing.google.com\/sre\/book.html.<\/a>\u00a0This is a great way to learn about SRE and explore how it works.<\/p>\n <\/p>\n <\/p>\n Ben Treynor is VP of Engineering at Google and the person who founded the SRE capability there, growing it from seven people to over 1,200 staff by 2014.\u00a0According to Treynor, SRE is \u201cwhat happens when you ask a software engineer to design an operations function\u201d<\/a>. 
One of the main implications of having software engineers design an IT operations function is that everything is code: servers, infrastructure, updates, rollbacks, scaling are all defined and executed as code rather than interactive human operations. This \u2018everything as code\u2019 approach has several important implications:<\/p>\n This means that software written to be run by SRE teams is almost certainly going to work better in production than software not written with these criteria in mind. SRE acts as a ‘high bar’ that helps to make sure that we deal with typical operational problems early and often.<\/p>\n <\/p>\n <\/p>\n A key aspect of how SRE works is that the SRE function is empowered to push back on low-quality software. Specifically, if software development teams ask an SRE team to take on the running of their software, the team is empowered to demand evidence of operability in the form of automated test results and instrumentation. If the code isn’t good enough from an operations perspective, the SRE team (rightly) can reject the code as not fit for production.<\/p>\n The way that Google manages this focus on proven operability with their SRE teams is with an ‘error budget’. Each service or application run by an SRE team has an associated availability target (say 99.8% uptime) which comes with some allowed downtime (for 99.8% availability, the allowed downtime is just over 87 minutes per month) – this is the ‘error budget’.<\/p>\n The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. — Mark Roth, Google (quoted in the book <\/em>Site Reliability Engineering, Chapter 3<\/em><\/a>)<\/em><\/p><\/blockquote>\n The SRE team tracks the downtime on the service, and if the service is currently within its availability range, the Dev team can deploy new changes. Those changes may of course cause an outage. 
If the changes cause an outage that makes the service unavailable for longer than the allowed downtime for that period (say, longer than 87 minutes for our 99.8% service), then the Dev team is forbidden from deploying new changes and must demonstrate significant improvements in reliability before being allowed to deploy again.<\/p>\n In practice of course, Dev teams and SRE teams work together to get the software ready for production operation, collaborating on instrumentation, performance, resilience, error codes, and so on, so that by the time the Dev team wants to hand over the software to the SRE team (at the Production Readiness Review (PRR)), the code has been proven to work well. Still, the ability of the SRE team to insist on good operability is a crucial reason for the success of the SRE approach.<\/p>\n [\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section bb_built=”1″ _builder_version=”3.12.2″ background_color=”#dddddd” custom_padding=”0|0px|0|0px|false|false” prev_background_color=”#ffffff” next_background_color=”#ffffff”][et_pb_row _builder_version=”3.0.48″ background_size=”initial” background_position=”top_left” background_repeat=”repeat”][et_pb_column type=”4_4″][et_pb_text _builder_version=”3.12.2″ text_font_size=”17.5px”]<\/p>\n In this first part, we covered what SRE is and how SRE teams work: the \u2018everything as code\u2019 model, error budgets, and software engineering approaches. 
In Part 2<\/a> of this mini series, we will see what implications SRE has for IT operations and Managed Services.<\/em><\/p>\n [\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section bb_built=”1″ _builder_version=”3.0.47″ custom_padding=”17px|0px|54px|0px|false|false” prev_background_color=”#dddddd”][et_pb_row _builder_version=”3.0.48″ background_size=”initial” background_position=”top_left” background_repeat=”repeat”][et_pb_column type=”4_4″][et_pb_team_member name=”Russ McKendrick” position=”Practice Manager (SRE & DevOps)” image_url=”http:\/\/content.n4stack.io\/wp-content\/uploads\/2018\/07\/Russ-McKendrick.png” _builder_version=”3.12.2″ global_module=”53894″ saved_tabs=”all”]<\/p>\n Russ heads up the SRE & DevOps team here at N4Stack.<\/p>\n He’s spent almost 25 years working in IT and related industries and currently works exclusively with Linux.<\/p>\n When he’s not out buying way too many records, Russ loves to write and has now published six books.<\/p>\nSRE – a new set of rules<\/h1>\n
<\/h2>\n
‘Everything as Code’<\/h1>\n
\n
<\/h2>\n
Say ‘no’ to software with low operability<\/h1>\n