Guest Contributor: Damian Rosochacki
As a Jira administrator, one of the health metrics I consider for any system I manage is uptime. Uptime is defined as the time during which a machine, especially a computer, is in operation. In the context of Jira, this extends beyond the running time of your Cloud or Data Center instance and involves a few different factors:
- The availability of the system
- The system displaying the correct information
- The ability of users to perform desired actions within the system
So while your instance may be up and running, your firewall settings may have changed and locked out some users. Or perhaps your Data Center nodes are out of sync and showing outdated information to some users, leading to confusion between team members. Or maybe your single sign-on integration is having a rough afternoon and deciding not to work, so while your Jira is up and running like a well-oiled machine, no one can access it in the first place.
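To make the difference between "the server is running" and "users can actually get in" concrete, here is a minimal Python sketch of a health probe. It assumes your instance exposes a `/status` endpoint that returns JSON with a `state` field (Jira Data Center documents one, but verify it for your version and deployment). The function names `classify` and `probe` and the labels they return are made up for illustration.

```python
import json
import urllib.error
import urllib.request


def classify(status_code, body):
    """Map an HTTP response to a coarse health label (labels are illustrative)."""
    if status_code is None:
        return "unreachable"        # no response at all: network, DNS, or firewall
    if status_code in (401, 403):
        return "access-blocked"     # the server answered, but the caller is locked out
    if status_code != 200:
        return "degraded"
    try:
        # Assumption: the body looks like {"state": "RUNNING"}
        state = json.loads(body).get("state")
    except (ValueError, AttributeError):
        return "degraded"           # answered, but not with the expected JSON object
    return "ok" if state == "RUNNING" else "degraded"


def probe(base_url, timeout=5):
    """Fetch <base_url>/status and classify the result."""
    url = base_url.rstrip("/") + "/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        return classify(err.code, "")
    except (urllib.error.URLError, OSError):
        return classify(None, "")
```

A check like this only covers the first factor. It says nothing about stale data on an out-of-sync node or a broken login flow, so treat it as one signal among several and run it from where your users actually sit on the network.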
The truth is that no matter how well prepared you are, at some point you will face some kind of outage of your precious Jira instance. Your users will frantically ping you to fix it while lobbying their boss for a half day, since, "Without Jira, no one knows what to work on." In today's always-online world, people expect high availability for any service they interact with, especially in their workplace. The shift to remote work has only made this sentiment that much stronger.
It's now Friday afternoon and you're fighting off the food coma after lunch. Already dreaming about the weekend ahead, you're startled by the sound of a Slack notification. You open the message and see the dreaded question: "IS JIRA DOWN?"
Depending on your level of preparation, this can be a very jarring moment, or it can feel like a walk in the park. How can you make sure it's the latter and not the former? By having a robust business continuity and disaster recovery plan in place. This plan should touch on some basics of outage management, such as:
- How do you communicate with users about the outage? How (and how frequently) do you keep them updated?
- What is the business impact of Jira going down? How many teams are impacted? What's their alternative for getting work done?
- What are the steps to diagnose and recover the services? In what order should you troubleshoot the components of your service? What's the expected turnaround time for getting something fixed?
- What do you do if you can't fix it on your own?
As you can see, there are a lot of moving parts involved in bringing your services back online and making your customers feel confident in your work behind the scenes. Without a plan in place to deal with threats to your uptime stats, you may soon find yourself (and your users) quite frustrated.
So what about your plan? How fast can you restore a corrupt database or fix identity management issues? Who else on your team can assist you? Once you get it fixed, who should you notify, and through what channels? And if you need to restore Jira from a backup taken earlier in the day (by the way, you do know how to do this and have actually practiced it, right?), how much data will be lost and at what cost?
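The "how much data will be lost" question comes down to simple arithmetic: everything written between your last good backup and the moment things broke is at risk. Here is a rough Python sketch of that calculation. The function names, the example timestamps, and the four-hour target are all hypothetical; the target stands in for what disaster recovery plans usually call a recovery point objective (RPO).

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, outage_start):
    """Return how much work is at risk: the time since the last good backup."""
    if outage_start < last_backup:
        raise ValueError("outage_start must not be earlier than last_backup")
    return outage_start - last_backup


def meets_target(last_backup, outage_start, max_loss):
    """True if the at-risk window fits within the agreed maximum (the RPO)."""
    return data_loss_window(last_backup, outage_start) <= max_loss


# Hypothetical Friday: backup taken at 08:00, outage noticed at 13:30.
backup = datetime(2024, 1, 5, 8, 0)
outage = datetime(2024, 1, 5, 13, 30)

print(data_loss_window(backup, outage))                    # 5:30:00
print(meets_target(backup, outage, timedelta(hours=4)))    # False
```

In this made-up example, five and a half hours of issue updates would be gone, which misses a four-hour target. Numbers like these are worth working out before the Slack message arrives, not after.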
If you're not sure how to answer these questions, consider working with Isos Technology consultants to develop your own customized Business Continuity and Disaster Recovery plan for your Atlassian services. Your boss, customers, and team members will be grateful, sooner or later.