Whether you're a mobile app provider, a SaaS infrastructure team, or an internal IT operations team, user satisfaction is critical to the success of your organization as a whole. Frequent outages, extended downtime, and recurring errors in software or website functionality all contribute to user attrition, while efficient incident response and management can contribute greatly to customer satisfaction and to users' perception of a product or service.
For many companies, monitoring and responding to the business impacts of outages is reactive, not proactive. Often, first responders are notified, and then the incident is handled on a case-by-case basis. Postmortems and root cause analyses may provide answers that keep errors from repeating, or at least provide context on how to resolve issues should they recur.
But this type of incident management often addresses only the symptoms, and does not actively track the current state of a service's reliability in relation to customer satisfaction. Without a way to measure customer happiness, or to predict when errors are approaching a threshold that typically results in customer complaints, you will inevitably end up managing your services on a hamster wheel of incidents and employee burnout.
But don't worry, there is a better way to manage service reliability and user satisfaction.
Service Level Objectives, or SLOs for short, are reliability targets that an organization sets for a service as a measurable stand-in for customer happiness. These goals can be set for every service and shared across an organization using a standardized format. By standardizing the indicators (Service Level Indicators, or SLIs) that are used to measure SLOs, every team can get a quick peek at the status of other services and make informed business decisions about their own projects.
When creating SLOs, deciding what to monitor can be a challenging and iterative process, as finding the right metric, such as latency or freshness, is only half the equation. You must also decide where best to measure that value, as there are typically many points along the path of a request where errors can occur. In addition, correlating a metric with customer happiness, and tracking how quickly it is approaching the point where you usually see customer dissatisfaction, takes testing and refinement.
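To make the idea of an indicator concrete, here is a minimal sketch of a latency SLI computed as the share of "good" requests out of all requests. The `Request` type, the `latency_sli` function, the 300 ms threshold, and the sample data are all assumptions made up for illustration; your own indicator will depend on what you measure and where you measure it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float  # latency as seen at whichever point you chose to monitor
    succeeded: bool    # whether the request returned a successful response


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Return the fraction of requests that succeeded within threshold_ms."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for r in requests if r.succeeded and r.latency_ms <= threshold_ms)
    return good / len(requests)


# Invented sample data: two of the four requests are both successful and fast enough.
sample = [
    Request(latency_ms=120.0, succeeded=True),
    Request(latency_ms=340.0, succeeded=True),   # too slow
    Request(latency_ms=95.0, succeeded=True),
    Request(latency_ms=80.0, succeeded=False),   # fast, but failed
]
sli = latency_sli(sample, threshold_ms=300.0)
print(f"SLI: {sli:.2%}")  # SLI: 50.00%
```

An SLO is then simply a target for this number over a period of time, for example "99.9% of requests succeed within 300 ms over 30 days."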
Now that you have a little background on SLOs, SLIs, and how they can be useful, we can turn to the concept of an error budget. An error budget is not monetary, which can be confusing when you first encounter the term. Instead, it is the amount of unreliability your SLO permits: a 99.9% target, for example, leaves a budget of 0.1%. Error budgets are a way to track how quickly a service is burning through that allowance over a specific period of time, usually a month or a quarter.
The important thing to understand about an error budget is that you are measuring not just the overall count of errors in a given period of time, but also the rate of change in that count and how much of the overall acceptable error amount has been consumed so far. This means you are allowing a certain number of errors to occur, and that the goal is not 100% reliability but an achievable percentage that correlates with customer happiness. We all know and accept that errors and incidents will inevitably occur, and error budgets give us a measured approach to maintaining a balance between errors and customer happiness.
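The arithmetic behind this is simple enough to sketch. The example below assumes a request-based SLO, where the budget is the share of requests allowed to fail over the period; the function name `error_budget_status` and all of the numbers are hypothetical.

```python
def error_budget_status(
    slo_target: float, expected_total_events: int, bad_events: int
) -> dict[str, float]:
    """Summarize error budget consumption for one SLO period.

    slo_target            -- target success ratio, e.g. 0.999 for "99.9%"
    expected_total_events -- events expected over the whole period (an assumption
                             you supply, e.g. from last month's traffic)
    bad_events            -- failed events observed so far in the period
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if expected_total_events <= 0:
        raise ValueError("expected_total_events must be positive")

    # The budget is whatever the SLO does not promise: 1 - target.
    allowed_bad_events = (1.0 - slo_target) * expected_total_events
    consumed = bad_events / allowed_bad_events
    return {
        "allowed_bad_events": allowed_bad_events,
        "consumed_fraction": consumed,
        "remaining_fraction": 1.0 - consumed,
    }


# Invented numbers: a 99.9% target over one million requests allows about 1,000 failures.
status = error_budget_status(slo_target=0.999, expected_total_events=1_000_000, bad_events=250)
print(f"Allowed failures: {status['allowed_bad_events']:.0f}")   # Allowed failures: 1000
print(f"Budget consumed:  {status['consumed_fraction']:.1%}")    # Budget consumed:  25.0%
```

Note that `remaining_fraction` goes negative once the budget is overspent, which is a useful signal in its own right.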
When a service consumes its error budget at an increasing rate, tracking that trend helps an organization build a plan of action that removes some of the chaos of incidents. This requires a change in mindset, from trying to prevent incidents altogether to expecting them to occur. The error budget can also inform business decisions: whether to focus your team's energy on improving service reliability or, if a large share of the budget remains, to take more risks by releasing features whose deployment may consume some of it.
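One common way to express how fast a budget is being consumed is a burn rate: the observed error rate divided by the error rate the SLO allows. A burn rate of 1 means the budget would last exactly until the end of the period, and anything higher means it runs out early. The sketch below assumes steady traffic and uses hypothetical function names and invented numbers.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return observed_error_rate / (1.0 - slo_target)


def days_until_exhausted(remaining_fraction: float, rate: float, window_days: float) -> float:
    """Project when the remaining budget runs out if the current burn rate holds.

    Assumes traffic is roughly constant, which is a simplification.
    """
    if rate <= 0.0:
        return float("inf")  # no errors right now, so no projected exhaustion
    return remaining_fraction * window_days / rate


# Invented numbers: a 99.9% SLO, with 0.4% of recent requests failing.
rate = burn_rate(observed_error_rate=0.004, slo_target=0.999)
days_left = days_until_exhausted(remaining_fraction=0.6, rate=rate, window_days=30)
print(f"Burn rate: {rate:.1f}x")                 # Burn rate: 4.0x
print(f"Budget exhausted in ~{days_left:.1f} days")  # Budget exhausted in ~4.5 days
```

A projection like this is what lets a team decide, before customers complain, whether to keep shipping features or to shift attention to reliability.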
An error budget policy is a set of thresholds and escalation processes that you define to establish a procedure for how to respond to errors in a service. The policy should include basic instructions on how various teams and team members should respond, depending on the current utilization of an error budget. For example, you may have the following policy:
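As a purely hypothetical illustration, a policy of this kind could be written down as a short table of thresholds and responses. The tiers, percentages, and actions below are invented for this sketch and are not a recommendation; your own policy should reflect your teams and your escalation paths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyTier:
    min_consumed: float  # tier applies once this fraction of the budget is consumed
    action: str          # plain-language response that anyone can read


# Hypothetical tiers, ordered from most to least severe.
POLICY = [
    PolicyTier(1.00, "Freeze feature releases; all engineering effort goes to reliability"),
    PolicyTier(0.75, "Notify engineering leadership; prioritize reliability work next sprint"),
    PolicyTier(0.50, "Notify the service owner; review recent incidents"),
    PolicyTier(0.00, "No action; continue normal feature development"),
]


def action_for(consumed_fraction: float) -> str:
    """Return the response for the current error budget consumption."""
    if consumed_fraction < 0.0:
        raise ValueError("consumed_fraction cannot be negative")
    for tier in POLICY:  # the first matching tier is the most severe one that applies
        if consumed_fraction >= tier.min_consumed:
            return tier.action
    return POLICY[-1].action  # not reached while the last tier starts at 0.0


print(action_for(0.80))  # Notify engineering leadership; prioritize reliability work next sprint
```

Keeping the actions as plain sentences, rather than alert rules or code paths, is deliberate: the point is that anyone in the organization can look up what should happen at the current level of consumption.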
By defining these thresholds using a simple format that non-technical team members can understand, any specific outage or incident becomes less of an emergency. Instead, you can allow the error budget to inform business decisions and provide insight into the current risk that exists in your systems as a whole.
Error budgets are an extremely useful concept that can shift the culture of an organization from living in a state of reactivity to building proactive, measured approaches to service reliability. By standardizing the approach across an organization, all teams can stay aware of their current system status and make informed business decisions about whether to focus on development and innovation or on reliability and regaining control.
If you're interested in learning more about SLOs, SLIs, and error budgets and policies, contact Isos Technology to speak with one of our service reliability experts. We can help your organization create sustainable solutions for customer happiness.