Lessons Learned From a Disaster Recovery Story

Anyone working with mainframes knows a disaster event is a possibility. But no one really thinks it will happen to them. So, what happens when disaster strikes? Chris Sultis, a mainframe and server disaster recovery expert at Blue Cross Blue Shield of Alabama (BCBS), knows exactly what to do. In a SHARE Academy session, he shared the timeline of one of his own disaster recovery events to illustrate some important lessons.

The incident began when BCBS experienced a double disk failure in a RAID 5 array – a configuration that can survive the loss of only a single disk. Disaster was officially declared just after 8 p.m. on a Thursday night, and over the course of the following six days and nights, Sultis and his team worked around the clock to resolve the problem and successfully recover the lost data.

What can you do to prepare for a potential disaster? Here are some of the key lessons Sultis learned, which should be useful no matter the nature of your organization – or your disaster.

Execute Your DR Plan

The most important thing you can possibly do once a disaster is declared is to follow your disaster recovery or business continuity plan. There may be temptation to stray from the plan, but try to ignore that urge and stick to what you know. Knowing what you’re doing will enable you to approach the process with a high level of confidence that everyone else can feed off of. For example, Sultis and his team were tempted to IPL (initial program load) the asynchronous copy and try to restore things in real time, when the DR plan was to work from the point-in-time copy. Fortunately, they ended up executing the plan as written, with a high degree of success.

IT Plan Checklist

Communication is key during a disaster recovery event, both within your DR team and when reporting back to superiors and anxious clients. When you’re working in such a time-sensitive, high-pressure environment, it can be a time sink to have to answer constant questions from so many different stakeholders. Of course, everyone should be kept updated, but dealing with endless “How are we doing?” and “Are we going to make it?” questions is irritating, distracting, and unproductive.

Sultis suggests creating a checklist summary of your DR plan, which you can distribute to your team, management, and clients. Taking the time to consolidate all the steps into an easily digestible list now will save you time and frustration later, since you can send out an update – to everyone – that you’re now on step 14 and proceeding according to plan.

Work in Shifts

Instead of requiring everyone to be present at all times, create shifts. You might think that because you have a fantastic team, they’ll be able to fix the disaster quickly. The reality is, you really don’t know how long the process is going to take. Sultis learned this lesson the hard way, as a few members of his team experienced burnout.

This idea may be a harder sell with upper management, because there is enormous pressure to have everyone on deck until the problem is solved. Try rotating shifts according to which operation is running at the time – such as the primary versus secondary data center – so that some employees can go home and take a break. Above all, be aware that while team members may love their job, it’s not their whole life. You don’t always know what employees are going through, and everyone needs time away from work to tend to families or problems of their own.

Keeping these three suggestions in mind – executing your DR plan as written, creating a checklist, and working in shifts – can help you recover from a disaster event confidently, calmly, and successfully.
