The Site Reliability Engineering (SRE) ethos originated at Google and is often described as a specific way of practicing DevOps. The primary objectives for SRE teams are to improve the uptime and performance of applications to support user expectations in an always-on world. To achieve these goals, engineers use a variety of tools to alert on and track incidents, analyze existing incidents, identify recurrent issues, and organize procedures around incident data.
SRE teams sit between development and operations and work to manage an error budget of acceptable downtime measured per quarter or per year. The current rate of consumption of the error budget informs decisions about whether the company should deploy new features or focus on improving reliability.
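To make the error budget idea concrete, here is a minimal Python sketch. It assumes an availability SLO expressed as a fraction (for example 0.999) and a simple rule of thumb that is our own illustration, not a prescribed SRE policy: keep shipping features while the budget is being consumed no faster than the measurement window is elapsing. The function names and the decision rule are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed_fraction(slo: float, window_days: float,
                             downtime_minutes: float) -> float:
    """Fraction of the error budget already used (can exceed 1.0)."""
    return downtime_minutes / error_budget_minutes(slo, window_days)


def should_ship_features(slo: float, window_days: float,
                         downtime_minutes: float, elapsed_days: float) -> bool:
    """Assumed rule: ship while budget consumption does not outpace elapsed time."""
    consumed = budget_consumed_fraction(slo, window_days, downtime_minutes)
    elapsed = elapsed_days / window_days
    return consumed <= elapsed
```

For example, with a 99.9% SLO over a 90-day quarter, `should_ship_features(0.999, 90, 100, 45)` reports that 100 minutes of downtime by mid-quarter is burning the budget too fast under this rule, so the team would shift its focus to reliability.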
SLAs, SLOs and SLIs
We make agreements with our users (service level agreements or SLAs) and fulfill these agreements by achieving certain objectives (service level objectives or SLOs), which are defined by measurements of our users' experience (service level indicators or SLIs). Taken together, these definitions specify the metrics that matter most to your users, the expectation for each metric, and how to respond when a metric crosses the threshold of your definition of healthy.
Term | Meaning |
---|---|
SLA | Service Level Agreements that you make with your users |
SLO | Service Level Objectives your team must meet to deliver on the SLAs |
SLI | Service Level Indicators to track your SLOs and SLAs |
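
As an illustration of how an SLI feeds an SLO, the sketch below assumes a request-based availability indicator: the share of requests that succeeded. The function names, the choice of metric, and the treatment of zero traffic are assumptions for the example; your own SLIs may measure latency, freshness, or something else entirely.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests == 0:
        # Assumption: with no traffic there were no failed requests.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured indicator is at or above the objective."""
    return sli >= slo
```

With 9,995 good requests out of 10,000, `meets_slo(availability_sli(9995, 10000), 0.999)` is true, while 9,980 good requests would miss a 99.9% objective.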
The 9s of Reliability and SLAs
A service level agreement (SLA) is a contractual arrangement defining the expectations of service for end users. SLAs are often expressed as service availability: the percentage of time per year that the service is up. Achieving more availability as a percentage, also known as achieving 9s, means achieving less downtime. To provide some perspective:
9s | Availability % | Downtime per Year |
---|---|---|
2 | 99% | 3.65 Days |
3 | 99.9% | 8.77 Hours |
4 | 99.99% | 52.60 Mins |
5 | 99.999% | 5.26 Mins |
6 | 99.9999% | 31.56 Secs |
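
The downtime figures above follow directly from the availability percentage. A short Python sketch of the arithmetic, assuming a year of 365.25 days (an assumption chosen because it reproduces the rounded values in the table):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumed year length


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for a given number of 9s of availability."""
    unavailability = 10.0 ** -nines  # e.g. 3 nines -> 0.001 (0.1% downtime)
    return unavailability * SECONDS_PER_YEAR


for n in range(2, 7):
    seconds = downtime_seconds_per_year(n)
    print(f"{n} nines: {seconds / 60:.2f} minutes per year")
```

Each additional 9 divides the allowed downtime by ten, which is why the step from four 9s to five 9s is so much harder to sustain than the step from two to three.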
Acceptable Unreliability
When defining an availability goal, it is important to consider that too much availability can actually have negative results. For example, achieving five 9s of availability is 99.999% uptime with only 5.26 minutes of downtime per year. Is this a reasonable goal for all your services? How would achieving this goal impact your team's ability to deliver new functionality?
If your error budget is too small, it will be consumed quickly by ordinary operational incidents, leaving no room for developers to roll out new features and respond to the errors that arise. An acceptable level of unreliability leaves room to innovate while still satisfying as many users as possible.
SLOs Are About Prioritization
Defining SLOs involves identifying the relationships between dependent systems. There is already an implicit prioritization in your organization today that surfaces naturally during an incident. If System A and System B go down at the same time, at some point someone in the organization is going to say, "Let's get System A back online, and then we can work on System B." Defining SLOs is really about making this prioritization explicit and open to discussion.
SRE practices include the concept of a Service Owner, a dedicated individual in your organization who is responsible for, ideally, one service or a small set of services. Their mission is to act as the incident commander and reliability champion to deliver on the SLOs that have been defined for that service. Instead of a reactive incident response mindset, your Service Owners can focus more on incident defense by protecting the error budget and balancing stability and new feature deployments. With a defensive mindset, your approach to stack management and infrastructure will be to automate as much as possible and build redundant systems.
SLOs Are About Cost Prioritization
SLOs are also about cost prioritization, as a more reliable service costs more to operate. Your objective should be the lowest level of reliability (aka Acceptable Unreliability) that your users will tolerate. In some instances, too much reliability creates an excessive dependence on your service and an expectation of more reliability than your team can actually deliver from a cost basis.
Google found that implementing periodic planned outages increased productivity and improved overall reliability as measured against customer satisfaction. Planned outages can help to identify teams in your organization that may be depending on your service excessively. By intentionally causing outages, you use up the full error budget and create a reasonable expectation among your users for service availability. You can identify teams and power users that are demanding more of a service than is reasonable to achieve and work directly with them to balance everyone's needs.
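One way to reason about how long a planned outage could be is to look at what is left of the error budget. The Python sketch below is a hypothetical calculation, not a description of Google's practice: it assumes an availability SLO as a fraction, a measurement window in days, and a safety margin (20% here, an arbitrary choice) held back for incidents that may still occur.

```python
def planned_outage_minutes(slo: float, window_days: float,
                           unplanned_downtime_minutes: float,
                           safety_margin: float = 0.2) -> float:
    """Minutes of planned outage that still fit inside the error budget."""
    budget = (1.0 - slo) * window_days * 24 * 60
    reserve = budget * safety_margin  # assumed buffer for future incidents
    return max(0.0, budget - reserve - unplanned_downtime_minutes)
```

For a 99% SLO over 90 days with 100 minutes of unplanned downtime so far, this leaves roughly 937 minutes that could be spent on planned outages. If unplanned downtime has already exhausted the budget, the function returns zero.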