13 reasons your disaster recovery plan failed

IT teams that implement a technology disaster recovery plan hope they never have to use it. However, a DR plan that has never been run through a crisis is an untested strategy, and an untested strategy is at risk of failing when it is needed. No organization wants to face a disruption, but IT teams must not be caught by surprise if one does occur.

The unpredictable nature of threats like ransomware and natural disasters means they can strike at any moment, even if an organization hasn’t dealt with them before. If a major disruption to IT infrastructure resources never occurs, the organization might not know for certain that its plan will work.

To adequately understand the importance of a tested disaster recovery plan, IT teams must know the causes of DR plan failure. In addition to guidance on constructing a DR plan, below you’ll find 13 common reasons why a DR plan might fail and how to avoid them.

Importance of DR planning

While most IT organizations accept that a DR plan can help in an emergency, they can never be totally certain that it will work as needed, or that the systems and people will perform as intended.

DR planning aims to ensure IT infrastructure elements — including hardware, software, network services, environmental systems, physical security, cybersecurity, utility services and people — are safe from a disruptive event. If properly protected by a DR strategy, these critical elements can subsequently return to previous operations.

In a data center, DR typically addresses multiple elements, including the following:

Backup, recovery, replacement and restoration of hardware devices.
Backup, recovery and restoration of network services.
Backup, recovery, retrieval and reinstatement of systems and data.
Recovery and restoration of physical facilities used by the data center.
Recovery and restoration of utility services, such as power and water.
Recovery of IT personnel and their return to their previous roles.

In practice, the above issues might be addressed by a single DR plan. IT teams can also develop individual plans for specific mission-critical resources. The former option describes how to restore IT operations at a high level, while individual plans go into the details of recovering, restarting, testing and validating resources before they return to production status.

In short, the high-level plan describes procedures to recover and restore IT operations, and assumes that the practical details will be addressed by subject matter experts within the IT department. In theory, this approach should work, unless the incident occurs outside the scenarios presented as part of the high-level DR plan.

What if the DR plan fails?

When building DR plans, it is important to take an all-hazards approach while considering potential disruptive events. This increases the likelihood that procedures described in plans will perform as needed — or at least will help mitigate the severity of the incident.

But what if the above DR planning and recovery initiatives do not work as anticipated?

First, when developing plans, IT teams must consider the issue of DR plan failure. For example, suppose the strategy for protecting servers is to have an inventory of devices ready to replace damaged units. When was the last time the reserve servers were tested? If the backup servers do not work, for whatever reason, then recovery is jeopardized. The same goes for major business systems. If the backup app is not available, or cannot be obtained in a timely fashion, the organization’s business — and reputation — might be adversely affected.
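The question of when the reserve servers were last tested can be turned into a routine check. The following is a minimal sketch, assuming the team keeps a simple inventory with a last-tested date for each spare. The inventory contents, the `stale_spares` function and the 90-day threshold are all illustrative assumptions, not values from any standard.

```python
from datetime import date, timedelta

# Hypothetical inventory of reserve servers; names and dates are made up.
SPARES = [
    {"name": "spare-01", "last_tested": date(2024, 1, 10)},
    {"name": "spare-02", "last_tested": date(2024, 5, 2)},
    {"name": "spare-03", "last_tested": None},  # never tested
]


def stale_spares(spares, today, max_age_days=90):
    """Return names of spares never tested, or last tested before the cutoff.

    max_age_days is an assumed policy value; pick one that suits the organization.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [
        s["name"]
        for s in spares
        if s["last_tested"] is None or s["last_tested"] < cutoff
    ]


if __name__ == "__main__":
    # A fixed date keeps the example reproducible; in practice use date.today().
    print(stale_spares(SPARES, today=date(2024, 6, 1)))
```

Any spare the check flags is one that might not work when called on, so it is a candidate for the next round of testing.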

What can cause a DR plan failure?

Ideally, IT teams identify the risks and threats to important resources, as well as the impact to the business if those resources are disabled, in the plan development phase. Activities in this phase, such as risk assessments and business impact analyses, can provide essential data for plan development and help avoid potential failures. These analyses and assessments can also identify the priorities for resource recovery and restoration, enabling a smooth and orderly recovery.
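As a rough illustration of how assessment data can drive recovery priorities, the sketch below ranks resources by the scores a team might assign during a business impact analysis and a risk assessment. The `Resource` type, the 1-to-5 scales and the sample data are hypothetical. Real analyses usually weigh more factors, such as dependencies between systems.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    impact: int      # hypothetical 1-5 score from a business impact analysis
    likelihood: int  # hypothetical 1-5 score from a risk assessment


def recovery_order(resources):
    """Return resource names, most urgent first.

    Sorts by business impact, then by likelihood of disruption, then by name
    so the order is deterministic when scores tie.
    """
    ranked = sorted(resources, key=lambda r: (-r.impact, -r.likelihood, r.name))
    return [r.name for r in ranked]


# Made-up example data.
RESOURCES = [
    Resource("email", impact=3, likelihood=4),
    Resource("order-db", impact=5, likelihood=3),
    Resource("intranet-wiki", impact=1, likelihood=2),
    Resource("payments-api", impact=5, likelihood=4),
]

if __name__ == "__main__":
    for position, name in enumerate(recovery_order(RESOURCES), start=1):
        print(position, name)
```

Putting impact ahead of likelihood is a design choice in this sketch: once a disruption has happened, the cost of each resource staying down matters more than how probable the disruption was.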

Recognizing the above realities of DR plan development and execution, the following are 13 common reasons that a DR plan might fail. Each is an element of the overall DR planning process.

Lack of senior management support and funding. Securing senior management support and funding is often the most important activity in the process, because without it the development of DR plans can be limited. The result can be an organization that implements no plan at all or only an incomplete one.
Not involving the right people in the planning process. The DR team typically includes technology staff and should also include employees charged with overall responsibility for the DR process. Third-party experts might also be part of the team.
Tech issues. Technology problems, such as software issues or insufficient backups, are a common reason why DR plans fail. IT teams must conduct sufficient research and analysis to determine the most cost-effective fixes to technology recovery issues. They should also know when these elements require an update or replacement.
Failure to regularly test plans. Testing is a critical activity because it validates that the procedures defined in the plan will work as intended. It also identifies potential failure points before they can affect a real recovery.
Failure to conduct a post-test review and update the plan based on the test. Once a test is complete, the next step is to review what worked and what did not work. IT teams must update plans to reflect the lessons learned and, if possible, perform follow-up tests to validate the changes.
Not communicating the plan throughout the organization. Employees must be aware that programs exist to ensure the uninterrupted operation of the IT resources they use and know what they should do when an incident occurs.
Insufficient DR team training. Knowledge of how to recover and restore disrupted resources — whether internally or externally implemented — must be communicated and regularly reinforced through training to ensure DR teams are prepared to respond in an emergency.
Lack of employee training for a DR event. In addition to making employees aware of DR activities, periodic training is recommended so that employees will know what to do if a technology disruption occurs.
System changes that are not reflected in a revised DR plan. Whenever changes to mission-critical systems and resources occur, they must be reflected in DR plans, especially if procedures for recovery and restoration change.
Lack of regular patching of mission-critical systems. Failure to keep up on patching can result in unintended system disruptions. For example, not installing cybersecurity system patches can result in undetected malware attacks.
Failure to include DR activities in IT staff meetings. If DR is not a regular activity, it can be easily forgotten. A DR agenda item in IT staff meetings is advisable.
Failure to review and assess the plan and its associated activities. In addition to live system testing, it is good practice to periodically review and assess DR plans — of all types — to ensure they are up to date and actionable.
Failure to determine what constitutes a “failed” plan. It is important to define what failure means for a DR plan so that the key elements are properly addressed and regular testing has a clear standard to check against.

Addressing the issues above can help reduce the likelihood that DR plans will fail when an emergency occurs.
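The final item in the list, deciding what counts as a failed plan, can be made concrete by writing the criteria down as measurable limits. The sketch below assumes an organization has agreed on a maximum recovery time and a maximum data loss window, targets often referred to as RTO and RPO. These terms, the class names and the figures used are assumptions for illustration and do not come from the article.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objectives:
    max_recovery_time: timedelta  # assumed target, often called the RTO
    max_data_loss: timedelta      # assumed target, often called the RPO


@dataclass
class DrillResult:
    recovery_time: timedelta  # how long the test recovery actually took
    data_loss: timedelta      # window of data that could not be restored


def failure_reasons(result, objectives):
    """Return the reasons a DR test counts as failed; an empty list means it passed."""
    reasons = []
    if result.recovery_time > objectives.max_recovery_time:
        reasons.append("recovery took longer than the agreed maximum")
    if result.data_loss > objectives.max_data_loss:
        reasons.append("more data was lost than the agreed maximum")
    return reasons


# Hypothetical targets: restore within 4 hours, lose at most 15 minutes of data.
OBJECTIVES = Objectives(timedelta(hours=4), timedelta(minutes=15))

if __name__ == "__main__":
    drill = DrillResult(recovery_time=timedelta(hours=6), data_loss=timedelta(minutes=10))
    print(failure_reasons(drill, OBJECTIVES) or "pass")
```

With limits like these recorded in the plan, a post-test review can state which criterion was missed instead of debating whether the test went well.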

Paul Kirvan is an independent consultant, IT auditor, technical writer, editor and educator. He has more than 25 years of experience in business continuity, disaster recovery, security, enterprise risk management, telecom and IT auditing.
