It's something many organizations don't consider on a daily basis... but name a region in the United States that isn't susceptible to some sort of natural disaster. Or if it's hosted under someone's desk... a computer that isn't susceptible to spilled coffee.
As you begin asking these questions, you may find that your Jira (and other Atlassian applications) are hosted in Tornado Alley. Maybe you had flashbacks to 2012, when a derecho took down AWS us-east-1 and, with it, Instagram, Pinterest, and Netflix. Or what if some malware makes its way onto your network and locks you out of your data unless you pay the attackers thousands, or even millions, of dollars? These things and more should keep you awake at night.
But how many nights of sleep should you lose over this?
This raises the question: What is the financial impact of your Atlassian applications being down for an extended period of time? How long until your teams grind to a halt because the tools they use to organize their work are not accessible? In addition to the loss of productive hours, an outage could delay a project's release because employees can't see what to work on next in Jira Software, don't have access to documentation in Confluence, or can't commit code to Bitbucket to be built. It could also keep customers and end users from submitting support tickets through Jira Service Desk. These outages could easily have both a financial impact and a customer satisfaction impact.
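To put a rough number on the "loss of productive hours" part, a back-of-the-envelope calculation can help start the conversation. The sketch below is only an illustration: the function name, the `productivity_loss` factor, and every figure in the example are placeholders invented for this post, not benchmarks. It also ignores delayed releases and customer satisfaction, which are harder to price.

```python
def productivity_cost(
    affected_users: int,
    outage_hours: float,
    hourly_cost: float,
    productivity_loss: float = 1.0,
) -> float:
    """Estimate the cost of lost productive hours during an outage.

    productivity_loss is the fraction of each hour that is actually lost
    (1.0 = nobody can work at all, 0.5 = people are half as productive).
    """
    if not 0.0 <= productivity_loss <= 1.0:
        raise ValueError("productivity_loss must be between 0 and 1")
    if affected_users < 0 or outage_hours < 0 or hourly_cost < 0:
        raise ValueError("inputs must not be negative")
    return affected_users * outage_hours * hourly_cost * productivity_loss


# Placeholder figures only: 200 users, an 8-hour outage, $75 per hour,
# and teams working at half speed without their tools.
estimate = productivity_cost(200, 8.0, 75.0, productivity_loss=0.5)
print(estimate)  # 60000.0
```

Swap in your own headcount and loaded hourly cost; the point is to have a figure to weigh against the cost of a DR environment.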
Hopefully this scares you! But what can you do about it? All of the risky situations above, and more, can be mitigated by building disaster recovery (DR) planning into the architecture of your Atlassian applications. Before we can discuss specific considerations, the business side needs to identify the target Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is how long you and your end users have to wait until your applications are back up and running. RPO is how much data you will lose when you switch from your primary Atlassian instances to your disaster recovery instances. For example, if you do a nightly backup at 2 am each day and your applications go down at 11 am, then you will lose about nine hours of data when you stand up that backup in your DR instances.
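The nightly backup example is simple arithmetic, but writing it down makes it easy to test other schedules. Here is a minimal sketch using Python's standard `datetime` module. The function names and the calendar date are made up for illustration, and the four-hour target in the comparison is an arbitrary example, not a recommendation.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, outage_start: datetime) -> timedelta:
    """Return how much work is lost if the most recent backup is restored."""
    if outage_start < last_backup:
        raise ValueError("outage_start must not be earlier than last_backup")
    return outage_start - last_backup


def meets_rpo(lost: timedelta, rpo_target: timedelta) -> bool:
    """True if the data lost stays within the agreed RPO target."""
    return lost <= rpo_target


# Nightly backup at 2 am, outage at 11 am the same day.
# The calendar date is an arbitrary placeholder.
last_backup = datetime(2024, 1, 15, 2, 0)
outage_start = datetime(2024, 1, 15, 11, 0)

lost = data_loss_window(last_backup, outage_start)
print(lost)  # 9:00:00

# Compare against a hypothetical four-hour RPO target.
print(meets_rpo(lost, timedelta(hours=4)))  # False
```

With a nightly backup, the worst case is an outage just before the next backup runs, which costs you almost a full day of data.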
If a natural disaster impacted your data center, would the Atlassian applications be IT's first priority? Or would they need to stand up other systems before the Atlassian applications? To begin the conversation, you should ask what your current RTO and RPO for your Atlassian applications are. Do they line up with what you consider an acceptable loss of time and data?
Once you have settled on acceptable numbers, the two critical architecture considerations for meeting them are the frequency of replication to your DR region and the frequency of snapshots. These two things solve very different problems. In a well-architected installation of Jira, you could replicate your database, attachments, and index files to your DR region. When a disaster occurs, recovery would be as simple as repointing the URL at the DR region and starting up Jira in that data center. When replication is well planned and implemented, it can reduce your outage from hours or days down to an hour. So replication fixes everything, right? Not quite.
What about that malware that found its way into your network, with somebody now asking for thousands or millions of dollars to unlock your data? Or what happens if your database becomes corrupted beyond repair? Replication will not save you here, because the damage is replicated to your DR region along with everything else. This is where snapshots are key. These are point-in-time captures of the databases backing your Atlassian applications. Once these snapshots are taken, we recommend storing them in another, geographically separate data center in storage configured for write once, read many (WORM). This means that once a snapshot is written to storage, it cannot be deleted by the average user. One bonus of a good snapshot schedule is that if an admin or user were to delete an important project or issue accidentally, you can restore it from the most recent snapshot!
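One practical detail when recovering from malware or corruption: the snapshot you want is the newest one taken before the problem started, which is not always the newest snapshot you have. A minimal sketch of that selection follows. The function name and the timestamps are hypothetical, and it assumes you know (or can estimate) when the incident began.

```python
from datetime import datetime
from typing import Optional, Sequence


def latest_clean_snapshot(
    snapshot_times: Sequence[datetime], incident_time: datetime
) -> Optional[datetime]:
    """Return the newest snapshot taken strictly before the incident.

    Returns None when every snapshot was taken at or after the incident,
    meaning no known-good restore point exists.
    """
    candidates = [t for t in snapshot_times if t < incident_time]
    return max(candidates, default=None)


# Hypothetical nightly snapshots at 2 am on three consecutive days.
snapshots = [
    datetime(2024, 1, 13, 2, 0),
    datetime(2024, 1, 14, 2, 0),
    datetime(2024, 1, 15, 2, 0),
]

# The malware is believed to have landed on the afternoon of the 14th,
# so the snapshot from the 15th may already contain the damage.
incident = datetime(2024, 1, 14, 16, 30)

print(latest_clean_snapshot(snapshots, incident))  # 2024-01-14 02:00:00
```

This is also why retention matters: if you only keep the last snapshot or two, a problem that goes unnoticed for a few days can leave you with no clean restore point at all.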
There is a lot to consider, and without a good disaster recovery plan you may well find yourself thinking about it as you try to fall asleep each night. If it's more than you and your team can handle, we're here to help! We can come in and help you identify the right disaster recovery plan for your Atlassian applications to minimize both cost and risk.
Check back for the next part of this blog, where I will begin to break down some disaster recovery architecture options!