Mastering disasters: A Q&A with our DR Testing Specialist

An Expert Perspective on Disaster Recovery Backup Testing

Vince Wood is a senior systems administrator for CyberFortress, specializing in disaster recovery (DR). We sat down with Vince for some expert advice from an IT pro who has built a career around disaster recovery services.

Essential Advice for Disaster Recovery Professionals

Whether you’re in IT at an enterprise or a systems admin for a cloud services provider (CSP), if you’re involved in DR, the devil is in the details — so document, document, document. Stuff breaks, backups and hardware can fail, and processes fall apart. The one thing an IT pro can do to make life easier for all is keep detailed notes.

I deal with DR testing routinely, and being a human with a fallible brain, I take notes in every meeting with counterparts, customers, executives, and people from other teams: anyone touching DR. All details and action items are captured. If someone says we need to figure out how much capacity we have for a network attached storage (NAS) device on our DR site, that task is assigned on the spot. Then I send a PDF of the notes to everyone.

It’s not about being able to point fingers when there’s an issue. In fact, shared notes speed up resolution when there’s a problem, and recovering from a disaster quickly is what this is all about. Giving everyone something to refer back to provides us with checkpoints, and if there’s a disruption or divergence of thought, we can quickly retrace our steps. It keeps everyone on the same page and moving ahead.

How does that apply to the 3-2-1 backup rule?

When it comes to 3-2-1, the best possible thing you can do is document your environment, even if you do it just one chunk at a time. Look at your primary storage, then list which virtual machines (VMs) are running on it and which business units rely on them, and for what. Prioritize critical apps and data, too. Then move through the rest of the environment and do the same. When you implement 3-2-1, you’ll be able to quickly assess whether you have sufficient backups to protect all your assets, and whether they’re stored in the right places.
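As a rough illustration of how a documented inventory makes that assessment quick, here is a minimal Python sketch. It assumes the common reading of 3-2-1 (three copies of the data, on two different media types, with one copy off-site) and counts the production copy as one of the three. The asset names, business units, and media labels are hypothetical, and the data structure is just one possible way to record an inventory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BackupCopy:
    media: str     # e.g. "disk", "tape", "object-storage" (labels are illustrative)
    offsite: bool  # True if the copy lives away from the primary site


@dataclass
class Asset:
    name: str
    business_unit: str
    critical: bool
    primary_media: str
    backups: List[BackupCopy] = field(default_factory=list)


def gaps_321(asset: Asset) -> List[str]:
    """Return the 3-2-1 conditions this asset does not meet."""
    # Assumption: the production copy counts as copy number one.
    copies = 1 + len(asset.backups)
    media = {asset.primary_media} | {b.media for b in asset.backups}
    problems = []
    if copies < 3:
        problems.append(f"only {copies} copies (need 3)")
    if len(media) < 2:
        problems.append("only 1 media type (need 2)")
    if not any(b.offsite for b in asset.backups):
        problems.append("no off-site copy")
    return problems


def report(inventory: List[Asset]) -> Dict[str, List[str]]:
    """Map asset name to its gaps, critical assets first, compliant ones omitted."""
    ordered = sorted(inventory, key=lambda a: not a.critical)
    return {a.name: gaps_321(a) for a in ordered if gaps_321(a)}


# Hypothetical inventory, documented "one chunk at a time".
inventory = [
    Asset("file-share", "Marketing", False, "disk", [BackupCopy("disk", False)]),
    Asset("erp-db", "Finance", True, "disk",
          [BackupCopy("disk", False), BackupCopy("object-storage", True)]),
    Asset("payroll-vm", "HR", True, "disk", []),
]

for name, problems in report(inventory).items():
    print(name, "->", "; ".join(problems))
```

Sorting critical assets to the top mirrors the advice to prioritize critical apps and data: the gaps that matter most are the first ones you see.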

What are some common issues you find when approaching testing?

I think the biggest issues revolve around failing to think of a DR scenario holistically.

When I prepare customers for a DR test, I’ll ask what they want to test. They’ll usually provide a list of servers and VMs, which is a great start. However, some want to test apps A/B/C and tasks D/E/F, which requires understanding everything that needs restoring for those apps and services to function correctly. You must not only identify what to test, but also keep in mind the “blast radius”: the technical interdependencies and their impact.
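One way to make those interdependencies concrete is to record them as a dependency map and walk it. The Python sketch below is only an illustration: the component names and the map itself are hypothetical, and a real environment would need the map built from its own documentation. It answers two questions: what must be restored for a given app to work, and what is affected if a given component is lost.

```python
from collections import deque

# Hypothetical map: each component -> the components it needs in order to function.
DEPENDS_ON = {
    "app-a": ["web-01", "db-01"],
    "web-01": ["dns", "active-directory"],
    "db-01": ["san-volume-3", "active-directory"],
    "app-b": ["db-01", "license-server"],
}


def restore_scope(targets, depends_on):
    """Return every component that must be restored for the targets to work."""
    scope = set()
    queue = deque(targets)
    while queue:
        item = queue.popleft()
        if item in scope:
            continue  # already visited; also protects against circular dependencies
        scope.add(item)
        queue.extend(depends_on.get(item, []))
    return scope


def blast_radius(failed, depends_on):
    """Return every component that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a component to the things that need it.
    dependents = {}
    for component, needs in depends_on.items():
        for needed in needs:
            dependents.setdefault(needed, []).append(component)
    return restore_scope([failed], dependents) - {failed}


print(sorted(restore_scope(["app-a"], DEPENDS_ON)))
print(sorted(blast_radius("active-directory", DEPENDS_ON)))
```

In this made-up example, a request to "test app A" expands to six components, which is the kind of gap between the list a customer provides and what actually has to be restored.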

Further, organizations need to determine exactly what a disaster looks like for them and to think locally. Are you in a place like Houston, where hurricanes are common and power can be lost across a region? We had a customer that wanted to test printing from their application in our data center back to a printer outside the data center. That’s a good test. The problem was that the printer they wanted to use was in their own office, in their local region. If the disaster takes out that local region, it won’t matter if the application can send data out, because the endpoint won’t be there.

You need to make sure your testing is relevant and appropriate. Geography and weather are only some of the criteria to consider. The industry you’re in can also influence testing, particularly if it’s highly regulated or has strict compliance requirements.

What is the biggest DR mistake companies make?

Hands down, it’s a failure to test. As an example, what happens if a backup of a specific VM is taken in the middle of a Windows update? Without a sandbox to test in, like Veeam’s Virtual Labs, a company could discover in the middle of a crisis that its restore points are unusable. At that point, it’s too late. You need to test for recoverability so you can verify that the restore points you choose will restore to a working state.

At a minimum, you should fully test your DR capabilities and backups twice a year. If you’re doing spot checks, put one application in a sandbox every quarter and give it a thorough review. Of course, if you’re in a highly regulated industry, you’ll need to test more often. In general, the more complex your environment, the more you’ll need to test, and if you have 15 VMs to test across 10 different virtual networks, that’s a lot of work.
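To see how the quarterly spot-check cadence plays out, here is a small Python sketch that assigns one application per quarter and cycles through the list. The application names and the start year are hypothetical, and the rotation order is simply the order of the list, so putting the most critical applications first is an assumption you would make yourself.

```python
from itertools import cycle, islice


def spot_check_plan(apps, start_year, quarters):
    """Assign one application to each quarter, cycling through the list."""
    plan = []
    for i, app in enumerate(islice(cycle(apps), quarters)):
        year = start_year + i // 4
        quarter = i % 4 + 1
        plan.append((f"{year}-Q{quarter}", app))
    return plan


# Hypothetical application list, most critical first.
apps = ["erp", "crm", "mail"]

for slot, app in spot_check_plan(apps, 2025, 5):
    print(slot, app)
```

At one application per quarter, a list of N applications takes N quarters to cover once, which is worth keeping in mind when deciding whether spot checks alone are enough between the two full tests each year.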

If you don’t have the IT staff or bandwidth to keep up, going with a CSP will be a lot more cost-effective and safer. Plus, you’ll have the expertise of people who haven’t just handled one or two disasters in their careers; it’s something they’ve done many times over.