{"id":54087,"date":"2018-10-04T09:00:12","date_gmt":"2018-10-04T08:00:12","guid":{"rendered":"http:\/\/content.n4stack.io\/?p=54087"},"modified":"2020-08-19T12:22:49","modified_gmt":"2020-08-19T11:22:49","slug":"sre-managed-service","status":"publish","type":"post","link":"http:\/\/content.n4stack.io\/2018\/10\/04\/sre-managed-service\/","title":{"rendered":"Building a Site Reliability Engineering (SRE) Capability"},"content":{"rendered":"<p>[et_pb_section fb_built=&#8221;1&#8243; _builder_version=&#8221;3.22&#8243; background_color=&#8221;#dddddd&#8221; custom_padding=&#8221;0|0px|3px|0px|false|false&#8221;][et_pb_row _builder_version=&#8221;3.25&#8243; background_size=&#8221;initial&#8221; background_position=&#8221;top_left&#8221; background_repeat=&#8221;repeat&#8221; custom_padding=&#8221;0|0px|27px|0px|false|false&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;3.25&#8243; custom_padding=&#8221;|||&#8221; custom_padding__hover=&#8221;|||&#8221;][et_pb_text ul_item_indent=&#8221;5px&#8221; _builder_version=&#8221;3.27.4&#8243; text_font_size=&#8221;17.5px&#8221;]<\/p>\n<h1 style=\"text-align: left;\">Key points<\/h1>\n<ul>\n<li>Building a Site Reliability Engineering (SRE) capability requires changes in hiring, training, and organisation behaviour<\/li>\n<li>Adopting SRE needs enhanced trust across the IT organisation<\/li>\n<li>An SRE Managed Service can help organisations get the benefits of SRE quickly and efficiently<\/li>\n<\/ul>\n<p>In this two-part mini series, we&#8217;ll look at the emerging discipline of Site Reliability Engineering (SRE) and how this relates to Managed Services for IT Operations. 
In <a href=\"\/2018\/09\/25\/sre-as-a-service\/\">Part 1<\/a>, we covered the basics of SRE, and in Part 2, we&#8217;ll look at what SRE means for the IT organisation and Managed Services.<br \/>\n[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; _builder_version=&#8221;3.22&#8243; custom_padding=&#8221;17px|0px|14px|0px|false|false&#8221;][et_pb_row _builder_version=&#8221;3.25&#8243; background_size=&#8221;initial&#8221; background_position=&#8221;top_left&#8221; background_repeat=&#8221;repeat&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;3.25&#8243; custom_padding=&#8221;|||&#8221; custom_padding__hover=&#8221;|||&#8221;][et_pb_text _builder_version=&#8221;4.5.7&#8243; text_font_size=&#8221;17px&#8221; header_text_color=&#8221;#e05206&#8243; header_2_text_color=&#8221;#ffffff&#8221; hover_enabled=&#8221;0&#8243;]<\/p>\n<h1>Organisational behaviours and hiring for SRE<\/h1>\n<p>&nbsp;<\/p>\n<p>The SRE model clearly works well at Google because all the core Google services and applications have had some SRE input to help make them as reliable as they are. 
But it\u2019s no use just hiring a load of people as \u201cSRE\u201d staff and expecting to get the same kind of reliability as Google has; in fact, just hiring SRE people might be very counterproductive.<\/p>\n<p>To make SRE teams work you will need to hire people with an unusual range of skills:<\/p>\n<ul>\n<li>Deep knowledge and experience of operating systems, container fabrics, computer networking, alerting, and monitoring<\/li>\n<li>A drive to collaborate with Development teams on improving the operability of the software applications in Production<\/li>\n<li>An ability to focus on the business-relevant Service Level Objectives (SLOs), which are typically different for each application or service and set by the Product Manager<\/li>\n<\/ul>\n<p>These skills are quite far removed from the traditional IT Operations skillset, so you cannot just rename the Ops team the SRE team and expect good results!<\/p>\n<p>As we saw in <a href=\"\/2018\/09\/25\/sre-as-a-service\/\">Part 1<\/a>, SRE teams have the ability to say \u201cno\u201d to poorly written software changes. The organisation needs a good deal of maturity for this to happen. In many organisations, if the Development team (or Product Manager \/ Project Manager) wants to get something deployed, they will pester or push the Production-focused team to deploy the changes even if the software has not been tested properly for Production reliability (some people call this \u201cJFDI deployments\u201d!).<\/p>\n<p>To make SRE work in your organisation, you will need buy-in from senior leadership that the SRE group is empowered to refuse deployments for applications and services that have <a href=\"https:\/\/landing.google.com\/sre\/book\/chapters\/embracing-risk.html\" target=\"_blank\" rel=\"noopener noreferrer\">exceeded their error budget<\/a>. 
Also, according to the Google SRE known as JBD, you need to \u201c<a href=\"https:\/\/medium.com\/@rakyll\/the-sre-model-6e19376ef986\" target=\"_blank\" rel=\"noopener noreferrer\">let your development team own the SRE work if the scale doesn\u2019t require SRE support<\/a>\u201d. This will feel very different to the traditional approach to IT operations in many organisations, so be sure that you don\u2019t fall short on this.<\/p>\n<h2>\u00a0<\/h2>\n<p>&nbsp;<\/p>\n<h1>Enhancing trust for SRE<\/h1>\n<p>&nbsp;<\/p>\n<p>With an error budget in place, SRE teams can have straightforward discussions with Development teams on how much risk to take on for a particular service or application. In the words of Mark Roth from Google, \u201c<a href=\"https:\/\/landing.google.com\/sre\/book\/chapters\/embracing-risk.html\" target=\"_blank\" rel=\"noopener noreferrer\">[the error budget] metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow<\/a>\u201d. This means that getting an accurate measurement of service availability is essential for building trust between SRE and Dev teams; in turn, this means you need to invest in high-quality tools and training for SREs to be able to measure and report on service availability in the first place.<\/p>\n<p>Another key practice for building trust is the so-called <a href=\"https:\/\/codeascraft.com\/2012\/05\/22\/blameless-postmortems\/\" target=\"_blank\" rel=\"noopener noreferrer\">\u201cblameless post-mortem\u201d as defined by John Allspaw<\/a>, ex-CTO of operations pioneers Etsy. When something goes wrong in Production (such as an application becoming unavailable), teams work to restore service quickly. After service is restored (and after people have slept if needed) a post-mortem (team analysis) of the problem is carried out. 
The trick here is to make sure this analysis is \u201cblameless\u201d, so individuals feel like they can be open with all the details needed and \u201cthat they can give this detailed account without fear of punishment or retribution\u201d, as John Allspaw puts it.<\/p>\n<h2>\u00a0<\/h2>\n<p>&nbsp;<\/p>\n<h1>Getting the benefits of SRE quickly with SRE Managed Services<\/h1>\n<p>&nbsp;<\/p>\n<p>According to the well-known <a href=\"https:\/\/web.devopstopologies.com\/#type-seven\" target=\"_blank\" rel=\"noopener noreferrer\">DevOps Team Topologies website<\/a>, the SRE pattern (known as Type 7) is \u201csuitable only for organisations with a high degree of engineering and organisational maturity\u201d. This is because without the trust and engineering maturity to make SRE work well, there is a danger of a \u201creturn to Anti-Type A if the SRE\/Ops team is told to &#8220;JFDI&#8221; deploy\u201d. However, if your company doesn&#8217;t yet have an SRE function or doesn&#8217;t want to build an SRE capability in-house, you can get the benefits of SRE quickly by using an SRE Managed Service.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter wp-image-54088 size-medium\" src=\"http:\/\/content.n4stack.io\/wp-content\/uploads\/2018\/09\/SRE-Managed-Services-300x174.png\" alt=\"SRE Managed Services\" width=\"300\" height=\"174\" srcset=\"http:\/\/content.n4stack.io\/wp-content\/uploads\/2018\/09\/SRE-Managed-Services-300x174.png 300w, http:\/\/content.n4stack.io\/wp-content\/uploads\/2018\/09\/SRE-Managed-Services-768x445.png 768w, http:\/\/content.n4stack.io\/wp-content\/uploads\/2018\/09\/SRE-Managed-Services.png 1000w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/><\/p>\n<p style=\"text-align: center;\"><em>Type 7: SRE Team. 
Source: DevOps Team Topologies \/ devopstopologies.com &#8211; CC BY-SA<\/em><\/p>\n<p>&nbsp;<\/p>\n<p>By using an <a href=\"https:\/\/n4stack.io\/azure-service-partner\/managed-devops-sre\/\" target=\"_blank\" rel=\"noopener noreferrer\">external provider<\/a> for your SRE capability\u00a0you get the advantage of a ready-formed SRE capability with all the skills and experience needed to improve the reliability of your key systems together with a clearly defined service contract that sets out SLOs and uptime expectations. The managed SRE provider has skilled SRE staff who already know how to assess and instrument modern software for typical performance and reliability tracking, saving you time and money in discovering these things in your business. Instead, with a managed SRE service, you get to focus on defining Key Performance Indicators (KPIs) and other reliability metrics for your business services, driving value for business stakeholders.<\/p>\n<p>&nbsp;<\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; _builder_version=&#8221;3.22&#8243; background_color=&#8221;#dddddd&#8221; custom_padding=&#8221;0|0px|0|0px|false|false&#8221;][et_pb_row _builder_version=&#8221;3.25&#8243; background_size=&#8221;initial&#8221; background_position=&#8221;top_left&#8221; background_repeat=&#8221;repeat&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;3.25&#8243; custom_padding=&#8221;|||&#8221; custom_padding__hover=&#8221;|||&#8221;][et_pb_text _builder_version=&#8221;3.27.4&#8243; text_font_size=&#8221;17.5px&#8221;]<em>In Part 2 of our SRE Managed Services mini series we covered how organisations need to change to be able to adopt SRE as a practice for operations, and how an SRE Managed Service might be more appropriate for some companies compared to building an internal SRE capability. 
<\/em><\/p>\n<p><em>Take a look at <a href=\"\/2018\/09\/25\/sre-as-a-service\/\">Part 1<\/a> of our series for an intro to Site Reliability Engineering and how it relates to Managed Services for IT Operations.<\/em><br \/>\n[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; _builder_version=&#8221;3.22&#8243; custom_padding=&#8221;17px|0px|54px|0px|false|false&#8221;][et_pb_row _builder_version=&#8221;3.25&#8243; background_size=&#8221;initial&#8221; background_position=&#8221;top_left&#8221; background_repeat=&#8221;repeat&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;3.25&#8243; custom_padding=&#8221;|||&#8221; custom_padding__hover=&#8221;|||&#8221;][et_pb_team_member name=&#8221;Russ McKendrick&#8221; position=&#8221;Practice Manager (SRE &#038; DevOps)&#8221; image_url=&#8221;http:\/\/content.n4stack.io\/wp-content\/uploads\/2018\/07\/Russ-McKendrick.png&#8221; _builder_version=&#8221;4.3.4&#8243; global_module=&#8221;53894&#8243; saved_tabs=&#8221;all&#8221;]<\/p>\n<p>Russ heads up the SRE &amp; DevOps team here at N4Stack.<\/p>\n<p>He&#8217;s spent almost 25 years working in IT and related industries and currently works exclusively with Linux.<\/p>\n<p>When he&#8217;s not out buying way too many records, Russ loves to write and has now published six books.<\/p>\n<p>To find out more about Russ click <a href=\"\/russ-mckendrick\/\">here<\/a>!<\/p>\n<p>[\/et_pb_team_member][\/et_pb_column][\/et_pb_row][\/et_pb_section]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Key points Building a Site Reliability Engineering (SRE) capability requires changes in hiring, training, and organisation behaviour Adopting SRE needs enhanced trust across the IT organisation An SRE Managed Service can help organisations get the benefits of SRE quickly and efficiently In this two-part mini series, we&#8217;ll look at the emerging discipline of Site Reliability 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":54180,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_et_pb_use_builder":"on","_et_pb_old_content":"","_et_gb_content_width":""},"categories":[3106,3886],"tags":[3885,3883,3882,3935],"yst_prominent_words":[3831,3827,3826,3833,3829,51,152,3838,1784,3736,1783,1782,3734,3830,3832,3839,3828,3836,3738,3834],"_links":{"self":[{"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/posts\/54087"}],"collection":[{"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/comments?post=54087"}],"version-history":[{"count":8,"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/posts\/54087\/revisions"}],"predecessor-version":[{"id":59771,"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/posts\/54087\/revisions\/59771"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/media\/54180"}],"wp:attachment":[{"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/media?parent=54087"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/categories?post=54087"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/tags?post=54087"},{"taxonomy":"yst_prominent_words","embeddable":true,"href":"http:\/\/content.n4stack.io\/wp-json\/wp\/v2\/yst_prominent_words?post=54087"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}