{"id":54087,"date":"2018-10-04T09:00:12","date_gmt":"2018-10-04T08:00:12","guid":{"rendered":"http:\/\/content.n4stack.io\/?p=54087"},"modified":"2020-08-19T12:22:49","modified_gmt":"2020-08-19T11:22:49","slug":"sre-managed-service","status":"publish","type":"post","link":"http:\/\/content.n4stack.io\/2018\/10\/04\/sre-managed-service\/","title":{"rendered":"Building a Site Reliability Engineering (SRE) Capability"},"content":{"rendered":"
[et_pb_section fb_built=”1″ _builder_version=”3.22″ background_color=”#dddddd” custom_padding=”0|0px|3px|0px|false|false”][et_pb_row _builder_version=”3.25″ background_size=”initial” background_position=”top_left” background_repeat=”repeat” custom_padding=”0|0px|27px|0px|false|false”][et_pb_column type=”4_4″ _builder_version=”3.25″ custom_padding=”|||” custom_padding__hover=”|||”][et_pb_text ul_item_indent=”5px” _builder_version=”3.27.4″ text_font_size=”17.5px”]<\/p>\n
In this two-part mini series, we’ll look at the emerging discipline of Site Reliability Engineering (SRE) and how this relates to Managed Services for IT Operations. In Part 1<\/a>, we covered the basics of SRE, and in Part 2, we’ll look at what SRE means for the IT organisation and Managed Services.<\/p>\n The SRE model clearly works well at Google because all the core Google services and applications have had some SRE input to help make them as reliable as they are. But it\u2019s no use just hiring a load of people as \u201cSRE\u201d staff and expecting to get the same kind of reliability as Google has; in fact, just hiring SRE people might be very counterproductive.<\/p>\n To make SRE teams work, you will need to hire people with an unusual range of skills:<\/p>\n These skills are quite far removed from the traditional IT Operations skillset, so you cannot just rename the Ops team the SRE team and expect good results!<\/p>\n As we saw in Part 1<\/a>, SRE teams have the ability to say \u201cno\u201d to poorly written software changes. Making this possible takes a good deal of maturity in the organisation. In many organisations, if the Development team (or Product Manager \/ Project Manager) wants to get something deployed, they will pester or push the Production-focused team to deploy the changes even if the software has not been tested properly for Production reliability (some people call these \u201cJFDI deployments\u201d!).<\/p>\n To make SRE work in your organisation, you will need buy-in from senior leadership that the SRE group is empowered to refuse deployments for applications and services that have exceeded their error budget<\/a>. Also, according to JBD, a well-known Google SRE, you need to \u201clet your development team own the SRE work if the scale doesn\u2019t require SRE support<\/a>\u201d. 
This will feel very different to the traditional approach to IT operations in many organisations, so be sure that you don\u2019t fall short on this.<\/p>\n <\/p>\n <\/p>\n With an error budget in place, SRE teams can have straightforward discussions with Development teams on how much risk to take on for a particular service or application. In the words of Mark Roth from Google, \u201c[the error budget] metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow<\/a>\u201d. This means that getting an accurate measurement of service availability is essential for building trust between SRE and Dev teams; in turn, this means you need to invest in high-quality tools and training for SREs to be able to measure and report on service availability in the first place.<\/p>\n Another key practice for building trust is the so-called \u201cblameless post-mortem\u201d as defined by John Allspaw<\/a>, ex-CTO of operations pioneers Etsy. When something goes wrong in Production (such as an application becoming unavailable), teams work to restore service quickly. After service is restored (and after people have slept if needed), a post-mortem (team analysis) of the problem is carried out. The trick here is to make sure this analysis is \u201cblameless\u201d, so that individuals feel they can be open with all the details needed and \u201cthat they can give this detailed account without fear of punishment or retribution\u201d, as John Allspaw puts it.<\/p>\n <\/p>\n <\/p>\n According to the well-known DevOps Team Topologies website<\/a>, the SRE pattern (known as Type 7) is \u201csuitable only for organisations with a high degree of engineering and organisational maturity\u201d. This is because without the trust and engineering maturity to make SRE work well, there is a danger of a \u201creturn to Anti-Type A if the SRE\/Ops team is told to “JFDI” deploy\u201d. 
However, if your company doesn’t yet have an SRE function or doesn’t want to build an SRE capability in-house, you can get the benefits of SRE quickly by using an SRE Managed Service.<\/p>\n <\/p>\n <\/p>\n Type 7: SRE Team. Source: DevOps Team Topologies \/ devopstopologies.com – CC BY-SA<\/em><\/p>\n <\/p>\n By using an external provider<\/a> for your SRE capability, you get the advantage of a ready-formed SRE capability with all the skills and experience needed to improve the reliability of your key systems, together with a clearly defined service contract that sets out SLOs and uptime expectations. The managed SRE provider has skilled SRE staff who already know how to assess and instrument modern software for typical performance and reliability tracking, saving you the time and money of working these things out in your own business. Instead, with a managed SRE service, you get to focus on defining Key Performance Indicators (KPIs) and other reliability metrics for your business services, driving value for business stakeholders.<\/p>\n <\/p>\n [\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=”1″ _builder_version=”3.22″ background_color=”#dddddd” custom_padding=”0|0px|0|0px|false|false”][et_pb_row _builder_version=”3.25″ background_size=”initial” background_position=”top_left” background_repeat=”repeat”][et_pb_column type=”4_4″ _builder_version=”3.25″ custom_padding=”|||” custom_padding__hover=”|||”][et_pb_text _builder_version=”3.27.4″ text_font_size=”17.5px”]In Part 2 of our SRE Managed Services mini series, we covered how organisations need to change to be able to adopt SRE as a practice for operations, and how an SRE Managed Service might be more appropriate for some companies than building an internal SRE capability.<\/em><\/p>\n
\n[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=”1″ _builder_version=”3.22″ custom_padding=”17px|0px|14px|0px|false|false”][et_pb_row _builder_version=”3.25″ background_size=”initial” background_position=”top_left” background_repeat=”repeat”][et_pb_column type=”4_4″ _builder_version=”3.25″ custom_padding=”|||” custom_padding__hover=”|||”][et_pb_text _builder_version=”4.5.7″ text_font_size=”17px” header_text_color=”#e05206″ header_2_text_color=”#ffffff” hover_enabled=”0″]<\/p>\nOrganisational behaviours and hiring for SRE<\/h1>\n
\n
Enhancing trust for SRE<\/h1>\n
Getting the benefits of SRE quickly with SRE Managed Services<\/h1>\n