Guest Contributor: Isaac Sacolick, President | CIO, StarCIO
Google wrote the book on Site Reliability Engineering (SRE), and many organizations have SREs working with their agile development teams responding to incidents and improving application reliability. SREs address DevOps fundamentals by reducing the cost of failure, replacing toil with automation, leading blameless postmortems, increasing observability, and collaborating on operational priorities.
It sounds great on paper, but instituting SREs and enabling their success isn’t trivial for many understaffed IT organizations overwhelmed with technical debt and facing mounting pressure to improve system reliability.
So I was thrilled to moderate a recent panel on SRE and Service Level Objectives where we discussed five ways to deliver exceptional service. Kit Merker, COO of Nobl9, and Thad West, CEO of Isos Technology, were the panelists, and they shared several surprising insights on how SREs can be successful and impact their businesses. Watch the video to learn more about how SREs help IT deliver exceptional service, and read some of my insights below.
1. The SRE’s Primary Role is Incident Defense
While developers are designing solutions, coding features, addressing defects, or reducing technical debt, SREs fill critical application performance management roles. Organizationally, they report into IT Operations, but the best practice is to assign them to agile development teams. When application incidents occur, they are the first team members to respond, ideally resolving issues without disrupting the development team’s sprint and release commitments.
But Kit states that the real SRE objective is “Moving into incident defense instead of incident response.”
That’s a key insight, and Thad shared several proactive ways SREs implement incident defense:
- Developing infrastructure as code, so environmental changes are part of change management
- Investing in larger suites of automated tests
- Implementing Jira as a source of truth for all development, build, and deploy activities
Incident defense is one of several new mindsets that bring together agile developers, QA automation engineers, ITSM operators, and stakeholders to improve application reliability and performance.
2. The Objective is NOT 100 Percent Reliability
Here’s a shocker. Stop chasing the nines and overpromising on SLAs. It’s counterintuitive because many of us who lead IT Operations have the battle scars from chasing five nines of reliability, that is, 99.999% uptime.
Thad was very specific on this issue and stated, “Creating a culture of one hundred percent reliability - it’s not possible. It’s not financially responsible to even try to achieve it. And it’s not the best use of company resources.”
Kit suggests that SREs must play a role in defining service level objectives (SLOs). “If I say, I want 100 percent reliability, 200 percent security, and 300 percent more features than anybody else, then my engineering team will flip the bid on me because they can’t engineer to that requirement, and they’re going to make up their definition of what a reliable system means.”
Kit is pointing out the tradeoffs product owners must make when setting priorities. Should the agile team work on new features, address technical debt, automate more testing, add monitoring capabilities, or enhance a service’s observability? SREs can play a critical prioritization role by championing opportunities that improve incident defense and by disclosing implementation tradeoffs to stakeholders.
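To make the cost of each additional nine concrete, here is a minimal sketch that converts an uptime target into the downtime it permits per year. The function name `allowed_downtime_minutes` is a hypothetical helper, and the calculation assumes a 365-day year and treats all downtime as equal, which real SLOs rarely do.

```python
# Rough arithmetic only: assumes a 365-day year and ignores maintenance windows.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year that an uptime target permits."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Five nines leaves a little over five minutes of downtime in a year, which helps explain why each extra nine tends to cost far more than the one before it.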
3. Back into SLOs by Discussing an Unreliability Goal
Kit suggests that one way to back into a service level objective is to ask the question, “How much unreliability can we get away with before customers leave our service?” He goes on to say, “I think of the difference between excellence and perfection. The question is, when will customers break away? When will they become disappointed with us and switch to a competitor?”
Kit and Thad suggested two ways of thinking about service level objectives (SLOs) and unreliability goals. Kit recommends that teams seek the “edge of excellence,” while Thad calls it a “minimally viable excellence” or MVE.
Pick a term and stick to it, but from a practical perspective, defining SLOs requires business and customer context, such as:
- Peak shopping hours in retail or trading hours in financial services
- Seasonal periods such as Black Friday in retail or the NCAA basketball tournament for sports media networks
- Types of activity, for example, users browsing the website for information versus buyers who are in the middle of a transaction
Defining customer and business-centric SLOs requires using these contexts during prioritization discussions with agile product owners and stakeholders. Kit suggests asking, “Where can we smartly reduce our reliability goals so that we can refocus our energy on the use cases that matter most, whether that’s swiping a credit card or making a purchase?”
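One way to picture context-aware SLOs is as a small table of targets keyed by activity and time period, so that checkout during peak hours is held to a stricter goal than off-peak browsing. The sketch below is illustrative only: the activity names, period names, target percentages, and the `meets_slo` helper are all assumptions, not figures from the panel.

```python
# Hypothetical targets: stricter where a failure costs a sale, looser elsewhere.
SLO_TARGETS = {
    ("checkout", "peak"): 99.95,
    ("checkout", "off_peak"): 99.9,
    ("browse", "peak"): 99.5,
    ("browse", "off_peak"): 99.0,
}


def meets_slo(activity: str, period: str, good_requests: int, total_requests: int) -> bool:
    """Check whether the observed success rate meets the target for this context."""
    if total_requests == 0:
        return True  # no traffic in the window, so nothing violated the target
    success_percent = 100 * good_requests / total_requests
    return success_percent >= SLO_TARGETS[(activity, period)]


# The same 99.1% success rate is acceptable for off-peak browsing
# but falls short for checkout during peak hours.
print(meets_slo("browse", "off_peak", 9_910, 10_000))
print(meets_slo("checkout", "peak", 9_910, 10_000))
```

Keeping the targets in one place gives product owners and SREs a concrete artifact to debate during prioritization.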
4. Killing IT’s Heroics and Burnout From Chasing Incidents
Here are some questions StarCIO reviews during our digital assessments:
- How many major incidents does your ITSM team respond to monthly?
- Are IT employees burning out from all the bridge calls and war rooms, and are the escalations to dev teams slowing down innovation?
- Are employees numb to all the performance issues because dev teams rarely have the opportunity to identify and address problem root causes?
- Does it require a major crisis for executives to pivot their mindsets and recognize the need to invest in reliability, performance, or security?
There has to be a better way to improve reliability without chasing every alert and burning people out.
The first time I heard about SRE error budgets, my knee-jerk reaction was that they could only work in a “fail fast” culture where architectures enable skilled engineering teams to make reliability improvements easily. In other words, a tool for Google, Netflix, and Facebook, but likely impractical for your average bank, retailer, or insurance business.
But the discussion with Thad and Kit convinced me otherwise. In agile, we use abstract metrics such as business value and user story points to help teams prioritize smarter and more efficiently. Can error budgets defuse some of the alert-chasing heroics? When error budgets are exceeded, will product owners respond to the SRE’s incident defense recommendations?
Give the webinar a listen and think about how error budgets might work in your organization.
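As a rough illustration, an error budget is commonly expressed as the amount of failure an SLO permits over a window: a 99.9% target on one million requests allows roughly 1,000 failed requests. The sketch below computes how much of that budget has been consumed. The function name `error_budget_report`, the request counts, and the "prioritize reliability once the budget is spent" policy are assumptions for illustration; each team would set its own rules.

```python
def error_budget_report(slo_percent: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget usage for one measurement window."""
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be between 0 and 100 percent, exclusive")
    # The budget is the share of requests the SLO allows to fail.
    allowed_failures = total_requests * (100 - slo_percent) / 100
    consumed = failed_requests / allowed_failures if allowed_failures > 0 else 0.0
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,
        # Hypothetical policy: once the budget is spent, reliability work comes first.
        "prioritize_reliability": consumed >= 1.0,
    }


report = error_budget_report(slo_percent=99.9, total_requests=1_000_000, failed_requests=400)
print(f"Budget consumed: {report['budget_consumed']:.0%}")
if report["prioritize_reliability"]:
    print("Budget spent: prioritize reliability work")
else:
    print("Budget remains: feature work can continue")
```

Framing the conversation this way turns "how many alerts did we chase?" into "how much budget is left?", which is a number product owners can weigh against feature work.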
5. Transitioning Prioritization From a Negotiation to Collaboration
At the end of the webinar, we discussed the intersection of three objectives impacting IT:
- Businesses investing in digital transformation, creating higher demand for developing and enhancing applications, data processing, analytics, and integrations
- Customers, employees, and business leaders expecting significantly higher system reliability and performance as apps have become more mission-critical to businesses and our lives
- IT teams managing increasing technical complexity as applications are multicloud, connect with dozens of microservices, process orders of magnitude more data, and are subject to greater regulatory compliance
IT’s operating model must balance its efforts against these objectives, and there can be chaotic negotiations on priorities when there are ill-defined metrics and decision authorities.
SREs can help turn the negotiation into collaboration through two efforts. First, SREs and IT Ops teams can use service level objectives and SRE error budgets to translate operational needs into customer and business priorities. Second, SREs can help integrate collaboration and workflows across tools like Jira Align used by product managers, Jira Software to track the agile development process, Jira Service Management for ITSM, and Atlassian Confluence for information sharing.
When teammates have a shared understanding of customer needs, agree on operating objectives, and integrate collaboration tools, people make smarter and faster decisions, reliability improves, and agile teams focus on execution.