The ABCs of Site Reliability Engineering: SLAs, SLOs, and SLIs


Site Reliability Engineering (SRE) originated at Google as a new approach to operations, closely related to DevOps. The primary objectives of SRE teams are to improve the uptime and performance of applications so that they meet user expectations in an always-on world. To achieve these goals, engineers use a variety of tools to alert on and track incidents, analyze past incidents, identify recurring issues, and organize procedures around incident data.

SRE teams sit between development and operations and work to manage an error budget of acceptable downtime measured per quarter or per year. The current rate of consumption of the error budget informs decisions about whether the company should deploy new features or focus on improving reliability.
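To make the idea of a consumption rate concrete, here is a minimal sketch of an error budget calculation. The class name, the 90-day quarter, and the rule "pause feature work when the burn rate exceeds 1" are illustrative assumptions, not a prescribed policy.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: error budget for one availability SLO."""
    slo_target: float      # e.g. 0.999 for 99.9% availability
    period_minutes: float  # length of the budget window in minutes

    @property
    def total_minutes(self) -> float:
        # Downtime allowed over the whole window.
        return (1.0 - self.slo_target) * self.period_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Budget left after the downtime observed so far.
        return self.total_minutes - downtime_minutes

    def burn_rate(self, downtime_minutes: float, elapsed_minutes: float) -> float:
        # Share of budget consumed divided by share of the window elapsed.
        # A value above 1.0 means the budget runs out before the window ends.
        consumed = downtime_minutes / self.total_minutes
        elapsed = elapsed_minutes / self.period_minutes
        return consumed / elapsed


def should_pause_feature_work(budget: ErrorBudget,
                              downtime_minutes: float,
                              elapsed_minutes: float) -> bool:
    # Assumed policy: favor reliability work while the budget burns too fast.
    return budget.burn_rate(downtime_minutes, elapsed_minutes) > 1.0


if __name__ == "__main__":
    quarter = ErrorBudget(slo_target=0.999, period_minutes=90 * 24 * 60)
    thirty_days = 30 * 24 * 60
    print(f"Total budget: {quarter.total_minutes:.1f} minutes")
    print(f"Burn rate: {quarter.burn_rate(64.8, thirty_days):.2f}")
    print(f"Pause feature work: {should_pause_feature_work(quarter, 64.8, thirty_days)}")
```

In this example, half of the quarterly budget is gone after one third of the quarter, so the sketch recommends focusing on reliability. A real policy would likely use more than one threshold and time window.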

SLAs, SLOs and SLIs

We make agreements with our users (service level agreements or SLAs) and fulfill these agreements by achieving certain objectives (service level objectives or SLOs), which are defined by measurements of our users' experience (service level indicators or SLIs). These measurements include the metrics that matter most to your users, what the expectation is for a given metric, and how to respond if a metric is breaking the threshold of your definition of healthy.

SLA	Service Level Agreements that you make with your users
SLO	Service Level Objectives your team must meet to deliver on the SLAs
SLI	Service Level Indicators to track your SLOs and SLAs
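The relationship between the three terms can be sketched in a few lines of code. In this illustration the SLI is the fraction of successful requests, and the SLO is set stricter than the SLA so the team gets a warning before a contractual breach. The function names, the request counts, and both thresholds are hypothetical.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def evaluate(sli: float, slo_target: float, sla_target: float) -> str:
    """Compare a measured SLI against the internal SLO and the external SLA."""
    if sli < sla_target:
        return "SLA breached"
    if sli < slo_target:
        return "SLO missed"
    return "healthy"


if __name__ == "__main__":
    # Hypothetical targets: 99.9% internal objective, 99.5% contractual agreement.
    SLO_TARGET = 0.999
    SLA_TARGET = 0.995
    sli = availability_sli(good_requests=99_950, total_requests=100_000)
    print(f"SLI = {sli:.4%}, status = {evaluate(sli, SLO_TARGET, SLA_TARGET)}")
```

Keeping the SLO tighter than the SLA is a common convention rather than a requirement; the gap between them is the margin your team has to react.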

The 9s of Reliability and SLAs

A service level agreement (SLA) is a contractual arrangement defining the level of service that end users can expect. SLAs are often expressed as service availability: the percentage of time per year that the service is up. Achieving a higher availability percentage, also known as achieving more 9s, means achieving less downtime. To provide some perspective:

9s	Availability %	Downtime per year
2	99%	3.65 Days
3	99.9%	8.77 Hours
4	99.99%	52.60 Mins
5	99.999%	5.26 Mins
6	99.9999%	31.56 Secs
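The downtime figures in the table follow directly from the availability percentage. The sketch below reproduces them, assuming a year of 365.25 days, which is the year length that matches the table's values; the function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumed average year length


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for a given number of 9s of availability."""
    unavailability = 10.0 ** -nines  # 2 nines -> 0.01, 3 nines -> 0.001, ...
    return unavailability * SECONDS_PER_YEAR


if __name__ == "__main__":
    for nines in range(2, 7):
        seconds = downtime_seconds_per_year(nines)
        availability = 100.0 * (1.0 - 10.0 ** -nines)
        print(f"{nines} nines ({availability:.4f}%): "
              f"{seconds / 3600:.2f} hours = {seconds / 60:.2f} minutes "
              f"= {seconds:.2f} seconds")
```

Each additional 9 divides the allowed downtime by ten, which is why every step up the table is so much harder to achieve than the previous one.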

Acceptable Unreliability

When defining an availability goal, it is important to consider that too much availability can actually have negative results. For example, achieving five 9s of availability is 99.999% uptime with only 5.26 minutes of downtime per year. Is this a reasonable goal for all your services? How would achieving this goal impact your team's ability to deliver new functionality?

If your error budget is too small, it will be consumed quickly by routine operations work, leaving no room for developers to roll out new features and respond to the errors that arise. An acceptable level of unreliability leaves room to innovate while still satisfying as many users as possible.

SLOs Are About Prioritization

Defining SLOs involves identifying the relationships between dependent systems. Your organization already has an implicit prioritization that surfaces naturally during an incident. If System A and System B go down at the same time, at some point someone in the organization is going to say, "Let's get System A back online, and then we can work on System B." Defining SLOs is really about making this prioritization explicit and transparent.

SRE practices include the concept of a Service Owner who is a dedicated individual in your organization responsible for, ideally, one service or small set of services. Their mission is to act as the incident commander and reliability champion to deliver on the SLOs that have been defined for that service. Instead of a reactive incident response mindset, your Service Owners can focus more on incident defense by protecting the error budget and balancing stability and new feature deployments. With a defensive mindset, your approach to stack management and infrastructure will be to automate as much as possible and produce redundant systems.

SLOs Are About Cost Prioritization

SLOs are also about cost prioritization, because a more reliable service costs more to operate. Your objective should be the lowest level of reliability (aka Acceptable Unreliability) that your users will tolerate. In some instances, too much reliability creates an excessive dependence on your service and an expectation of more reliability than your team can afford to deliver.

Google found that implementing periodic planned outages increased productivity and improved overall reliability as measured by customer satisfaction. Planned outages can help to identify actors in your organization that may be using your service excessively. By intentionally causing outages, you use up the full error budget and set a reasonable expectation among your users for service availability. You can identify teams and power users that are demanding more of a service than is reasonable to achieve and work directly with them to balance everyone's needs.
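One way to size a planned outage is to spend only what is left of the error budget after unplanned downtime. The following is a minimal sketch of that arithmetic; the function name and the example numbers are hypothetical, and this is not a description of how Google schedules its outages.

```python
def planned_outage_allowance(slo_target: float,
                             period_minutes: float,
                             unplanned_downtime_minutes: float) -> float:
    """Minutes of planned outage that still fit inside the error budget."""
    total_budget = (1.0 - slo_target) * period_minutes
    # Never return a negative allowance: an overspent budget leaves nothing to plan with.
    return max(0.0, total_budget - unplanned_downtime_minutes)


if __name__ == "__main__":
    quarter_minutes = 90 * 24 * 60  # assumed 90-day quarter
    allowance = planned_outage_allowance(0.999, quarter_minutes, 100.0)
    print(f"Planned outage allowance this quarter: {allowance:.1f} minutes")
```

If unplanned incidents have already exhausted the budget, the allowance is zero and a planned outage would push the service past its SLO.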
