How modern businesses minimize data loss when disaster strikes

One of the questions I’m often asked is how to protect and recover public-facing applications, that is, applications accessible over the Internet. These commonly include e-commerce sites, SaaS platforms, supplier and purchasing solutions, and more, and they are often major revenue drivers for the business. With increasing B2B and B2C demands, rising competition, and tightening budgets, it is now critical to minimize data loss and ensure a rapid return to operations in the event of an IT disaster.

My experience with an e-commerce startup in the not-too-distant past showed me first-hand the challenges internal stakeholders and IT teams face during major system outages. Frankly, the internal pressures can pale in comparison to those applied by consumers or business partners banging down the door when an outage occurs. Because so many outages today play out in public view, they are often referred to as “resume-generating events”.

With planning, design, and continuous testing, we can address these concerns, reducing the revenue impact of an outage and shortening the time to recovery.

Challenges you’ll meet when trying to minimize data loss

  • Multiple moving parts – databases, middleware, web servers, load balancers, firewalls, and DNS to name a few
  • Data consistency demands – recovering the database from the early-morning logs while the middleware is still churning can put orders out of sync, disrupt operations, and force a good deal of manual cleanup by the application and IT teams
  • Application performance needs – these rule out most traditional backup and snapshot methods, and even VSS-integrated storage replication, as all of them can stun a database, drop users and sessions, or worse
  • Public presence – customers know very quickly when these systems are down, which erodes trust in the product and the company, not to mention driving more calls to you, your sales team, and your customer and partner service teams during an outage

How did we minimize data loss with legacy approaches?

The number of servers and systems involved meant either a team of people monitoring for uptime 24/7 or a heavy reliance on complex monitoring systems. Rapid data growth meant constant challenges in adding and maintaining database features such as log shipping, mirroring, and clustering. These features often drive storage growth and sprawl, increase licensing costs, raise personnel skill requirements, and can even require additional headcount. More recently, many organizations have tried to reduce risk through storage-based or in-guest replication technologies, but both have their pros and cons, and worse still, both have failed to address the rise of virtualization, the growing complexity of recovery, or the need to do more with less.

Today, public outcry over downtime events brings management, and even shareholder or board, pressure to bear across the IT organization. What is needed is a data protection and recovery solution that protects the full application stack, preserves data integrity through replication and recovery, does not impact the running application, and recovers rapidly to minimize user downtime. Coupling such a platform with a managed pilot light service and DNS failover services, we can finally mitigate, or virtually eliminate, every risk factor defined above.

Step 1 – Protect the Whole Application, Not Just the Parts

By leveraging a modern data protection solution, we group the complete application into a single wrapper, which lets us (see the sketch after this list):

  • Assign a service-level agreement
  • Define a recovery plan across the application’s server stack
  • Protect at the write-level with no performance degradation
  • Track write-consistency to ensure data integrity across the stack
  • Replicate at the speed of the application with recovery point objectives (RPOs) measured in seconds
  • Allow for rapid VM recovery back to any point in time to mitigate losses from corruption or virus-based disasters
  • Allow for push-button or even automated disaster recovery testing
  • Ensure automated failback configuration post-recovery
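
To make Step 1 concrete, here is a minimal sketch, in Python, of how an application stack might be modeled as a single protected unit with an SLA and an ordered recovery plan. Every name here (ProtectionGroup, Vm, and their fields) is hypothetical and for illustration only; real replication platforms expose equivalent concepts through their own consoles and APIs.

    # Minimal sketch: modeling a protection group for a three-tier app.
    # All class and field names are hypothetical stand-ins.
    from dataclasses import dataclass, field

    @dataclass
    class Vm:
        name: str
        boot_order: int              # lower numbers boot first on recovery

    @dataclass
    class ProtectionGroup:
        name: str
        rpo_seconds: int             # target recovery point objective
        journal_days: int            # how far back point-in-time recovery reaches
        vms: list = field(default_factory=list)

        def recovery_plan(self):
            """Return VM names in the order they should be powered on."""
            return [vm.name for vm in sorted(self.vms, key=lambda v: v.boot_order)]

    store = ProtectionGroup(
        name="ecommerce-prod",
        rpo_seconds=10,              # write-level replication: seconds of exposure
        journal_days=7,              # roll back past corruption or virus events
        vms=[
            Vm("db01", boot_order=1),    # database first
            Vm("mw01", boot_order=2),    # middleware next
            Vm("web01", boot_order=3),   # web tier last
        ],
    )
    print(store.recovery_plan())         # ['db01', 'mw01', 'web01']

The point of the wrapper is that the SLA, journal depth, and boot order travel together, so recovery always brings the stack back in a consistent, ordered fashion.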

Step 2 – Prepare the Infrastructure for Disaster

To ensure rapid recovery for users and customers, we need to pre-configure the infrastructure components that sit outside the core application with the features needed to support a disaster declaration (a small automation sketch follows the list). This involves:

  • Configuring firewall and proxy assets such as NAT rules or port-forwarding for web applications or private extranet access
  • Adding post-recovery IP addresses and fully-qualified domain names to load balancers in a passive or listening mode
  • Leveraging DNS availability solutions from providers such as Dyn Managed DNS or DNS Made Easy to ensure public access cuts over immediately upon application recovery
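
As an example of the DNS piece, the sketch below shows what a scripted cutover might look like against a generic REST API. The endpoint, zone, record, and token are placeholders, not the actual Dyn or DNS Made Easy APIs; each provider documents its own calls, but the shape of the step is the same.

    # Sketch: repoint the public record at the recovery site once the
    # application is confirmed up. Endpoint, zone, and token are placeholders.
    import requests

    DNS_API = "https://api.example-dns.invalid/v1"   # placeholder endpoint
    TOKEN = "REPLACE_ME"                             # provider API token

    def cut_over(zone, record, recovery_ip, ttl=60):
        """Repoint the public A record at the recovered site."""
        resp = requests.put(
            f"{DNS_API}/zones/{zone}/records/{record}",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={"type": "A", "value": recovery_ip, "ttl": ttl},
            timeout=10,
        )
        resp.raise_for_status()

    # Keep TTLs short ahead of time so the change propagates in seconds,
    # not hours, once a disaster is declared.
    cut_over("shop.example.com", "www", "203.0.113.50")

Pre-staging this script alongside the firewall and load-balancer changes above means the disaster declaration itself becomes a single rehearsed action rather than a scramble.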

Step 3 – Continuously Validate Recovery Plans

To ensure a rapid return to operations in the event of a real disaster, it is critical to test and validate your achievable recovery time objective (RTO) frequently, with some organizations testing on a weekly basis. This testing must also be simple enough that anyone on the Emergency Response Team can perform it in as few steps as possible. Such testing often includes (a test-harness sketch follows the list):

  • The use of isolated test environments to enable testing at any time with no impact to Production
  • Using “pilot light” systems such as directory services with test users and name resolution services to simulate recovery
  • Deploying “jump box” capabilities so test users can access the isolated environment
  • Capturing all actions taken during a test to have evidence-based validation of recovery testing
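
A sketch of what such a test might look like as a script is below. The recover and healthy callables are stand-ins for whatever your replication platform and monitoring actually expose, such as a test-failover API call and an HTTP probe through the jump box; everything else simply measures the achieved RTO and writes it out as evidence.

    # Sketch: a repeatable DR test that measures achievable RTO and logs
    # evidence. The injected callables are placeholders for your platform's
    # real test-failover and health-check mechanisms.
    import json
    import time
    from datetime import datetime, timezone

    def run_dr_test(group, recover, healthy, timeout_s=1800):
        """Fail the group over to an isolated sandbox, wait for health,
        and record the measured RTO as audit evidence."""
        started = time.monotonic()
        recover(group)                        # platform's test-failover call
        while not healthy(group):             # e.g. HTTP probe via the jump box
            if time.monotonic() - started > timeout_s:
                raise TimeoutError(f"{group} did not recover within {timeout_s}s")
            time.sleep(15)
        evidence = {
            "group": group,
            "tested_at": datetime.now(timezone.utc).isoformat(),
            "measured_rto_seconds": round(time.monotonic() - started, 1),
        }
        with open(f"dr-test-{group}.json", "w") as fh:  # keep for auditors
            json.dump(evidence, fh, indent=2)
        return evidence

    # Wire in your platform's real calls; these lambdas are placeholders.
    run_dr_test("ecommerce-prod",
                recover=lambda g: None,
                healthy=lambda g: True)

Because the whole test is one function call, anyone on the Emergency Response Team can run it, and the JSON evidence files accumulate into an audit trail of measured RTOs over time.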

Putting It All Together

This has been a high-level overview of technology risk management planning for public-facing applications. By protecting the whole application, preparing secondary systems for disaster, and continually testing preparedness, any organization can minimize data loss and achieve push-button disaster recovery. The result is a new level of application availability: a rapid return to operations and revenue, with reduced cost impact from data loss or downtime in the event of a disaster.
