IT incidents happen, and while they are always unplanned, your response to them doesn’t have to be. A well-handled incident with clear and transparent communication can actually build customer trust. And a well-thought-out incident action plan will help minimize the noise so your team can focus on containing and resolving the issue at hand. To help you build an effective incident action plan, we’ve outlined what you need to consider and how Atlassian tools can help.
1. Create your incident action plan.
Every incident action plan should include the people who will be involved in resolving the incident, the processes they should follow, and the tools and technology they should use for tracking, managing, and communicating about the incident.
People
Identify the people who will need to take action when an incident occurs. Who will lead the charge? You’ll want to assign an incident manager. Who will handle the technical aspects of assessing, containing, and resolving the issue? You’ll need a technical lead to oversee things and manage the technical team. And who will be on point to handle customer-facing and other externally-facing communications? The communications manager, that’s who. It’s good to have a rotating list of people who are on call to step in so no one gets burned out.
Processes
Every incident is different, and so are the actions that need to be taken to resolve it. That said, a consistent, overarching structure for everything outside of the actual work of containing and resolving the incident (essentially who does what, and when and how they do it) goes a long way toward minimizing the impact. The rest of this post should serve as an outline for developing your processes.
Project Management and Communication Tools
It takes a team to resolve an incident, but when so many people get involved, things can get confusing, fast. Atlassian tools like Jira Service Desk, Jira, Confluence, Opsgenie, and Statuspage, as well as third-party tools like Zoom and Slack (some with deep Atlassian integrations), can streamline processes and keep information flowing.
2. Alert the team and assign responsibilities.
Whether you become aware of an incident through your organization’s own monitoring efforts or because an impacted user opens an incident ticket, the first thing you’ll need to do is alert the on-call incident manager and incident response team. Opsgenie is a great tool for managing on-call schedules and escalations, both in everyday circumstances and in the event of an incident. You can set rules that route the alert to the right people based on the priority of the incident and automatically escalate the issue if someone doesn’t respond to an alert. Once the team has responded and is engaged, the incident manager, technical lead, and communications manager can assign and delegate responsibilities.
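To make the idea of priority-based routing and automatic escalation concrete, here is a minimal sketch in plain Python. It does not use the Opsgenie API; the policy table, responder names, priorities, and wait times are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EscalationStep:
    responder: str      # who gets paged at this step (hypothetical names)
    wait_minutes: int   # how long to wait for an acknowledgement before escalating


# Hypothetical escalation policies keyed by incident priority.
POLICIES = {
    "P1": [
        EscalationStep("incident-manager-on-call", 5),
        EscalationStep("backup-incident-manager", 5),
        EscalationStep("engineering-director", 10),
    ],
    "P3": [
        EscalationStep("service-team-on-call", 30),
        EscalationStep("incident-manager-on-call", 30),
    ],
}


def responder_to_page(priority: str, minutes_unacknowledged: int) -> Optional[str]:
    """Return who should be paged right now, or None if every step has timed out."""
    elapsed = 0
    for step in POLICIES[priority]:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.responder
    return None
```

With this table, a P1 alert that has gone six minutes without acknowledgement has already moved from the on-call incident manager to the backup, while a P3 alert at the same point is still with the service team.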
3. Set up incident tracking and communication tools.
It may seem counterintuitive to take the time to do this when there’s a fire to put out, but setting aside a few minutes to set up your tracking tools and discussion spaces properly will save a lot of time and trouble in the long run.
Here are some great ways to track, manage, and communicate your issue:
- Log and track the issue in Jira
- Link incoming Jira Service Desk tickets to the Jira issue so you know who has been affected and with whom to follow up
- Designate a place or places for group discussions:
  - Set up a chatroom in Slack for sharing information and screenshots
  - Set up a video chat interface in Zoom or Skype for group conversation
- Document the issue in Confluence. Be sure to note any actions taken or changes made
- Share internal and external status updates in Statuspage
4. Assess the situation and develop a plan.
Next, you’ll need to figure out, at least at a high level, what the problem is, how serious it is, and who is affected. Questions that will need to be answered include:
- What is the general nature of the issue?
- When did it start or when was it first noticed?
- What are customers seeing and/or experiencing?
- How many customers are affected? Are they opening support tickets?
- What is the extent and severity of the impact?
- Is service quality reduced or is service completely disrupted?
- Is there a security issue or impact?
- Is there data loss?
From there, the incident manager can work with the incident response team to determine the severity level of the incident and develop a plan of action.
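One way to keep severity decisions consistent is to write the rules down ahead of time. The sketch below maps answers to the assessment questions above onto a severity level. The three-level scale, the field names, and the threshold of 100 customers are assumptions made for illustration, not an Atlassian standard; your own plan should define its own levels.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    service_down: bool        # completely disrupted, as opposed to reduced quality
    security_impact: bool     # is there a security issue or impact?
    data_loss: bool           # is there data loss?
    customers_affected: int   # best current estimate


def severity(a: Assessment, widespread_threshold: int = 100) -> int:
    """Return 1 (most severe) to 3 (least severe) using illustrative rules."""
    widespread = a.customers_affected >= widespread_threshold
    if a.security_impact or a.data_loss:
        return 1
    if a.service_down and widespread:
        return 1
    if a.service_down or widespread:
        return 2
    return 3
```

For example, a degraded service affecting a handful of customers comes out as severity 3, while any incident involving data loss is treated as severity 1 regardless of how many customers are affected.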
At this point, it’s a good idea to update both internal and external teams about the incident. Statuspage is a great tool for doing this—you can even build communications templates in advance. We recommend two different pages—one for customers and one for internal staff.
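As a rough illustration of what a communications template prepared in advance might look like, the sketch below uses Python’s standard `string.Template`. It is not Statuspage’s template feature; the stage names, wording, and field names are hypothetical.

```python
from string import Template

# Hypothetical templates, one per stage of the incident.
TEMPLATES = {
    "investigating": Template(
        "We are investigating reports of $symptom affecting $service. "
        "Next update by $next_update."
    ),
    "identified": Template(
        "We have identified the cause of $symptom affecting $service "
        "and are working on a fix. Next update by $next_update."
    ),
    "resolved": Template(
        "The issue causing $symptom on $service has been resolved."
    ),
}


def render_update(stage: str, **fields: str) -> str:
    """Fill in a template; raises KeyError if a required field is missing."""
    return TEMPLATES[stage].substitute(**fields)
```

Because `substitute` raises an error when a field is missing, a half-filled update cannot be published by accident, which matters when people are writing under pressure.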
5. Contain, resolve, document & communicate.
There’s no short answer for what is undoubtedly the most challenging part of any incident: containing and resolving it. There may even be a number of different work streams, which only adds to the complexity. Throughout this highly iterative process, there are a few best practices to keep in mind.
Contain
In some incidents, it may be necessary to limit the extent of the impact by containing the issue before resolving it. This can sometimes be the case with a security breach or data loss, for instance. If so, tackle containment first, or at least in parallel with the work of correcting the issue.
Resolve
Once the incident is contained, the team can set about correcting it. It may take many cycles of Plan > Do > Check > Act before the incident can be resolved. It is important to be thoughtful and intentional about every action taken and to fully understand its impact before moving on to the next action.
Document
By documenting every action taken and the observed results in real time in Confluence, you can harness the collective power of the broader team following along, better understand the ripple effects of each action, prevent duplication of effort, and ensure a comprehensive and detailed postmortem.
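The structure of such a real-time record can be very simple: who did what, when, and what happened as a result. The sketch below shows one possible shape for that timeline in plain Python. The class and field names are hypothetical, and it does not write to Confluence; the text it produces would still need to be pasted or posted to the incident page.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime
    who: str
    action: str
    result: str = "pending"   # observed result, filled in once known


@dataclass
class IncidentLog:
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, who: str, action: str, result: str = "pending",
               at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry, timestamped now (UTC) unless a time is given."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), who, action, result)
        self.entries.append(entry)
        return entry

    def to_text(self) -> str:
        """Render the timeline in chronological order, one line per action."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M} UTC | {e.who} | {e.action} | {e.result}" for e in ordered
        )
```

Keeping the observed result next to each action is what makes the record useful later: it shows not only what was tried, but which steps actually changed anything.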
Communicate
Continue to communicate internally and externally using the separate Statuspages you have set up for internal and external stakeholders as the incident is being worked through. People will want to know how serious the issue is, if there is any action to be taken on their end to help contain the issue, and when you expect the issue to be resolved and/or service to be restored to normal.
6. Close the incident and notify stakeholders.
Once the issue and its ongoing business impact have been resolved, the incident can be closed in the system, and the communications team can update people through Statuspage. This doesn’t mean the incident is entirely over. There may still be some tasks to wrap up, and if the incident was either unusual or significant, a postmortem would likely be of value.
7. Hold a postmortem and publish the takeaways.
Holding a postmortem is one of the most effective ways to turn a really difficult situation into something positive. This is when the learning takes place, and when all that documentation you did in Step 5 really pays off. All the information needed to review the incident, its cause, and how well the team worked together to resolve it is right there in Confluence, with links to the Jira issue and even the Jira Service Desk tickets opened by customers.
The incident response team can now review the incident, both together and independently, to identify what went well and what could have been handled better. Identify any action items (things that need to be done to make sure the same incident doesn’t happen again, or to make the incident management process go even more smoothly) and assign each one to an owner.