In previous blog posts, we discussed the metrics to be mindful of during a catastrophic outage of an Atlassian application, and how to get started with snapshots of your applications to protect the data within your Atlassian tools. Now let's talk about how to minimize both the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). To do this, you will need to implement both database replication and file data replication for each of the applications. How you accomplish this depends on the tool. We'll cover Jira in this blog entry.
To achieve the most robust Disaster Recovery Plan, you'll want to first implement the snapshots we discussed in my last blog post, to protect against database corruption, malware/ransomware, and malicious actors. Your replication site will always be a near-exact replica of production, so anything that corrupts production is replicated along with everything else, and the replica is vulnerable to those same catastrophic events. When implementing a robust DR plan with true replication, it is best to run the Atlassian Data Center applications. With Data Center, the database and shared home directory are already stored outside of the application instance, which makes both much easier to replicate.
First, let's talk about database replication. All the database types that Atlassian supports offer some form of replication, and unless you are running your database in a public cloud, we recommend using the database's native tooling for this piece. The native tools replicate at the database level, which makes them the most robust option, and they can be run across geographic regions. If you deploy in a public cloud such as Amazon Web Services (AWS), using the Relational Database Service (RDS) with either PostgreSQL or Aurora PostgreSQL gives you low-maintenance, AWS-managed replication across both availability zones and regions. As a reminder, availability zones sit in the same general region but are some miles apart from each other. In a large-scale disaster, an entire region, and every availability zone in it, could be impacted. In line with the AWS Well-Architected Framework, we recommend replicating across both availability zones and regions. This will minimize the RPO and increase your data protection.
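As a rough illustration of the AWS route, here is a minimal sketch of requesting a cross-region read replica for an RDS PostgreSQL instance (Aurora is managed at the cluster level and is not covered here). It assumes the boto3 SDK and its `create_db_instance_read_replica` call; the ARN, replica name, region, and instance class are placeholders rather than recommendations.

```python
def build_replica_request(source_arn: str, replica_id: str, instance_class: str) -> dict:
    """Build the keyword arguments for RDS create_db_instance_read_replica."""
    return {
        # Name of the new replica in the DR region (hypothetical value).
        "DBInstanceIdentifier": replica_id,
        # A cross-region replica must reference its source by full ARN.
        "SourceDBInstanceIdentifier": source_arn,
        "DBInstanceClass": instance_class,
    }


def create_cross_region_replica(rds_client, source_arn: str, replica_id: str,
                                instance_class: str = "db.m5.large"):
    """Ask RDS to create the replica.

    rds_client is assumed to be a boto3 RDS client created in the DR region,
    for example boto3.client("rds", region_name="us-west-2").
    """
    request = build_replica_request(source_arn, replica_id, instance_class)
    return rds_client.create_db_instance_read_replica(**request)
```

The client is passed in rather than created inside the function so that the same code can be exercised against a stub before it ever touches a real account.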
Now that we've solved for database replication, we'll move on to how to replicate the file data. In a non-public cloud implementation, Jira can replicate both the attachments and the search indexes over to your disaster recovery region. To use this feature, you will need to be able to mount your DR file system on your primary instances. Jira only keeps the two sides consistent from the time you enable the feature, so you will need to copy any existing data over yourself when you establish this replication.
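The one-time copy of existing data can be done with whatever file sync tool you already use. As a minimal sketch of the idea, the function below copies every file that is missing or older on the DR side. The directory arguments are hypothetical stand-ins for your primary shared home and the mounted DR file system, and this is not Jira's own replication mechanism.

```python
import shutil
from pathlib import Path


def seed_dr_copy(primary_dir: Path, dr_dir: Path) -> list[Path]:
    """Copy files that are missing or older in dr_dir.

    Returns the relative paths that were copied, in sorted order.
    """
    copied = []
    for src in sorted(primary_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(primary_dir)
        dst = dr_dir / rel
        # Skip files the DR side already has at the same or a newer timestamp.
        if dst.exists() and dst.stat().st_mtime >= src.stat().st_mtime:
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        # copy2 preserves the modification time, so a second run skips this file.
        shutil.copy2(src, dst)
        copied.append(rel)
    return copied
```

Because the copy is idempotent, it can be run once before enabling replication and again afterwards to pick up anything that changed in between.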
If you've followed along closely, you may be thinking that all of this minimizes the RPO, but what about the RTO? Great question! Regardless of where you are running, you have two options for the application itself. You could keep instances running with the application provisioned, so that you can just start Jira up against your DR database and shared home directory when necessary. Or, if you are running in a public cloud or on-premises with the ability to dynamically spin up VMs, you could run zero instances and spin up the application servers on demand. In AWS, this is deployed as an Auto Scaling Group with the desired instance count set to zero. When necessary, you would increase the instance count, and either an Amazon Machine Image (AMI) with the application pre-installed would spin up, or you would provision on demand with a configuration management tool. There are about 100 ways to skin that cat, so the right one will depend on your environment. The final step in all of this is to repoint the URL for your application with a DNS change so that your end users can start using the DR application.
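To make the zero-instance option concrete, here is a minimal sketch of the scale-up step. It assumes the boto3 SDK and its Auto Scaling `set_desired_capacity` call; the group name and instance count in the usage comment are hypothetical, and the rest of the failover (DNS, health checks) is left out.

```python
def scale_out_dr(autoscaling_client, group_name: str, instance_count: int) -> None:
    """Raise the desired capacity of the DR Auto Scaling Group from zero.

    autoscaling_client is assumed to be a boto3 Auto Scaling client for the
    DR region, for example boto3.client("autoscaling", region_name="us-west-2").
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1 to bring DR online")
    # The requested capacity must not exceed the group's configured maximum size.
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=instance_count,
        # Do not wait for a cooldown period during a disaster.
        HonorCooldown=False,
    )


# Example (hypothetical names): scale_out_dr(client, "jira-dr-asg", 2)
```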
A Robust Disaster Recovery Plan
A robust DR plan should incorporate snapshots, replication, and a plan both to fail over to your DR environment and to switch back to your primary instances once they have been restored. This plan could be as simple as a runbook that walks your system engineering teams through the steps to take, or as sophisticated as a ChatOps bot that executes a fully automated disaster recovery switchover. And as with anything else, this should be tested regularly!
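Somewhere between the paper runbook and the ChatOps bot sits a small script that runs the steps in order and stops at the first failure. The sketch below shows that shape only; the step names in the usage comment are hypothetical placeholders, and each action would wrap whatever your environment actually needs.

```python
from typing import Callable


def run_runbook(steps: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Run each (name, action) step in order and return the completed names.

    Stops at the first failing step so that later steps, such as a DNS
    change, never run against a half-started DR environment.
    """
    completed: list[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RuntimeError(
                f"step {name!r} failed after completing {completed}"
            ) from exc
        completed.append(name)
    return completed


# Example (hypothetical step functions):
# run_runbook([
#     ("start application instances", start_instances),
#     ("repoint DNS", repoint_dns),
# ])
```

Keeping the steps as plain named functions also makes regular testing easier, since the same list can be run against a staging environment.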
If these three blog posts have your head spinning, don't worry. We can help to develop a solution that fits you and your team!