One of the questions I’m often asked is how to protect and recover public-facing applications, that is, applications that are accessible over the Internet. These commonly include e-commerce sites, SaaS platforms, supplier or purchasing solutions, and more. Such systems are often major revenue drivers for the business. With increasing B2B and B2C demands, rising competition, and tightening budgets, it grows ever more critical to ensure minimal data loss and a rapid return to operations in the event of a disaster.
My experience with an e-commerce startup in the not-too-distant past showed me first-hand the challenges internal stakeholders and IT teams face when dealing with major system outages. Frankly, the internal pressures can pale in comparison to those applied by consumers or business partners banging down your door when an outage occurs. Because many outages today are so publicly visible, they are often referred to as “resume-generating events”.
With planning, design, and continuous testing, we can address these concerns, reducing the revenue impact of an outage and shortening the time to recovery.
Some of the risks facing these platforms include:
- Multiple moving parts – databases, middleware, web servers, load balancers, firewalls, and DNS to name a few
- Data consistency demands – recovering the database from the early morning logs while the middleware is still churning can put orders out of sync, impact operations, and often require a good deal of manual cleanup by the app and IT teams
- Application performance needs – this rules out most traditional backup and snapshot methods, and even VSS-integrated storage replication, as all of them can stun a database, drop users, drop sessions, or worse
- Public presence – customers know very quickly when these systems are down, which erodes trust in the product and the company, not to mention generating more calls to you, your sales team, and your customer/partner service team during outages
Before we look at one way to address these problems today, let’s review how we tackled them with legacy approaches.
The number of servers and systems involved meant either a team of people monitoring uptime 24/7 or heavy reliance on complex monitoring systems. Rapid data growth meant constant challenges in adding and maintaining database features such as log shipping, mirroring, and clustering. These features often bring storage growth or sprawl, increased licensing costs, additional skill requirements, and even added headcount. More recently, many organizations have tried to reduce risk through storage-based or in-guest replication technologies. Both have their pros and cons but, worse still, both have failed to address the rise of virtualization, the growing complexity of recovery, and the need to do more with less.
Today, public outcry over downtime events means pressure on the IT organization from management, and even from shareholders or the board. What is needed is a data protection and recovery solution that can protect the full application stack, preserve data integrity in replication and recovery, avoid impacting the running application, and recover rapidly to minimize user downtime. By coupling such a platform with managed pilot-light and DNS services, we can finally mitigate or virtually eliminate the risk factors defined above.
Step 1 – Protect the Core Application
By leveraging a modern data protection solution, we can group the complete application into a single wrapper where we can:
- Assign a service-level agreement
- Define a recovery plan across the application’s server stack
- Protect at the write-level with no performance degradation
- Track write-consistency to ensure data integrity across the stack
- Replicate at the speed of the application with recovery point objectives (RPOs) measured in seconds
- Allow for rapid VM recovery back to any point in time to mitigate losses from corruption or virus-based disasters
- Allow for push-button or even automated disaster recovery testing
- Ensure automated failback configuration post-recovery
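The "single wrapper" idea above can be sketched as a small data structure. This is only an illustration under stated assumptions, not any vendor's API: the names `ProtectionGroup`, `ProtectedVM`, `rpo_target`, and `checkpoints` are hypothetical, and the sample VMs and timestamps are made up. It assumes the protection platform records group-wide checkpoints, each marking a moment at which writes across every VM in the group are mutually consistent.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ProtectedVM:
    name: str
    boot_order: int  # lower numbers power on first during recovery


@dataclass
class ProtectionGroup:
    """Hypothetical wrapper for every VM that makes up one application."""
    name: str
    rpo_target: timedelta  # the SLA: maximum tolerable data loss
    vms: List[ProtectedVM] = field(default_factory=list)
    # Assumed to be group-wide, write-consistent points in time.
    checkpoints: List[datetime] = field(default_factory=list)

    def recovery_plan(self) -> List[str]:
        """VM names in the order they should be powered on."""
        ordered = sorted(self.vms, key=lambda vm: vm.boot_order)
        return [vm.name for vm in ordered]

    def checkpoint_before(self, moment: datetime) -> Optional[datetime]:
        """Latest checkpoint at or before `moment`, e.g. just before corruption."""
        candidates = [c for c in self.checkpoints if c <= moment]
        return max(candidates) if candidates else None

    def meets_rpo(self, now: datetime) -> bool:
        """True if the newest checkpoint is no older than the RPO target."""
        if not self.checkpoints:
            return False
        return now - max(self.checkpoints) <= self.rpo_target


# Sample data: a three-tier storefront with a checkpoint every 10 seconds.
base = datetime(2024, 1, 1, 9, 0, 0)
group = ProtectionGroup(
    name="storefront",
    rpo_target=timedelta(seconds=30),
    vms=[ProtectedVM("web01", 3), ProtectedVM("db01", 1), ProtectedVM("app01", 2)],
    checkpoints=[base + timedelta(seconds=10 * i) for i in range(6)],
)
print(group.recovery_plan())
print(group.checkpoint_before(base + timedelta(seconds=27)))
print(group.meets_rpo(base + timedelta(seconds=60)))
```

Because the whole stack shares one set of checkpoints, recovering to a point in time rolls the database, middleware, and web tier back together, which is what avoids the out-of-sync orders described earlier.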
Step 2 – Prepare the Infrastructure for Disaster
To ensure rapid recovery for users and customers, we need to pre-configure those infrastructure components that exist outside of the core application with the features needed to support a disaster declaration. This involves:
- Configuring firewall and proxy assets such as NAT rules or port-forwarding for web applications or private extranet access
- Adding post-recovery IP addresses and fully-qualified domain names to load balancers in a passive or listening mode
- Leveraging DNS availability solutions from providers such as Dyn Managed DNS or DNS Made Easy to ensure public access cuts over immediately on application recovery
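To make the DNS cutover concrete, here is a minimal sketch of the decision a managed DNS failover service makes on your behalf. It does not call any real provider's API. `FailoverRecord`, `select_target`, and `worst_case_cutover_seconds` are hypothetical names, the addresses come from the ranges reserved for documentation, and the three-failure threshold is an assumed value.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FailoverRecord:
    fqdn: str
    primary_ip: str
    recovery_ip: str  # pre-staged address at the recovery site
    ttl_seconds: int


def select_target(
    record: FailoverRecord,
    recent_checks: Sequence[bool],
    failure_threshold: int = 3,
) -> str:
    """Return the IP the public record should resolve to.

    `recent_checks` holds health-check results for the primary site,
    oldest first (True = healthy). Cutover happens only after
    `failure_threshold` consecutive failures, so a single dropped
    probe does not trigger a failover.
    """
    if failure_threshold < 1:
        raise ValueError("failure_threshold must be at least 1")
    if len(recent_checks) < failure_threshold:
        return record.primary_ip
    latest = recent_checks[-failure_threshold:]
    return record.primary_ip if any(latest) else record.recovery_ip


def worst_case_cutover_seconds(
    record: FailoverRecord,
    probe_interval_seconds: int,
    failure_threshold: int = 3,
) -> int:
    """Rough estimate: time to detect the outage plus one full TTL of caching."""
    return probe_interval_seconds * failure_threshold + record.ttl_seconds


record = FailoverRecord("shop.example.com", "203.0.113.10", "198.51.100.10", 30)
print(select_target(record, [True, False, False, False]))
print(worst_case_cutover_seconds(record, probe_interval_seconds=10))
```

The estimate shows why a short TTL matters: with a long TTL, cached answers keep sending customers to the failed site well after the record changes. Note that this sketch is stateless, so it would point back at the primary as soon as a probe succeeds; in practice failback is usually a deliberate step.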
Step 3 – Continuously Validate Recovery Plans
To ensure rapid return to operations in the event of a real disaster, it is critical to test and validate your achievable recovery time objective (RTO) frequently, with some organizations testing on a weekly basis. This testing must also be simple enough that anyone on the Emergency Response Team can perform it in as few steps as possible. Such testing often includes:
- Creation of isolated test environments with logically separated access so as not to impact production during testing
- Creation of isolated “pilot light” systems such as directory services with test credentials, name resolution services, and “Jump Box” capability so test users can access the isolated environment
- Reporting capabilities which capture all operations during the test for evidence-based validation of recovery testing
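As a rough illustration of evidence-based validation, the sketch below records when each recovery step finished during a test and compares the achieved RTO with the target. `RecoveryTestReport` and its fields are hypothetical names, and the step names and timings are sample data; a real platform would capture these events itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class RecoveryTestReport:
    plan_name: str
    rto_target: timedelta
    started_at: datetime  # when the test failover was initiated
    steps: List[Tuple[str, datetime]] = field(default_factory=list)

    def record(self, step: str, finished_at: datetime) -> None:
        """Log a completed recovery step with its completion time."""
        self.steps.append((step, finished_at))

    def achieved_rto(self) -> timedelta:
        """Elapsed time from test start to the last completed step."""
        if not self.steps:
            raise ValueError("no recovery steps recorded")
        return max(finished for _, finished in self.steps) - self.started_at

    def passed(self) -> bool:
        return self.achieved_rto() <= self.rto_target

    def summary(self) -> str:
        verdict = "PASS" if self.passed() else "FAIL"
        return (
            f"{self.plan_name}: achieved RTO {self.achieved_rto()} "
            f"against target {self.rto_target} -> {verdict}"
        )


# Sample weekly test with a five-minute RTO target.
start = datetime(2024, 1, 1, 2, 0, 0)
report = RecoveryTestReport("storefront", timedelta(minutes=5), start)
report.record("db01 powered on", start + timedelta(seconds=90))
report.record("app01 powered on", start + timedelta(seconds=150))
report.record("storefront health check passed", start + timedelta(seconds=240))
print(report.summary())
```

Keeping one such report per test run gives a history of achievable RTO over time, so a regression shows up in a scheduled test rather than during a real disaster.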
Putting It All Together
This is just a high-level overview of technology risk management planning for public-facing applications. By protecting the core application, preparing secondary systems for disaster, and continually testing for preparedness, any organization can ultimately achieve true push-button disaster recovery in minutes. This brings a new level of application availability, enabling a rapid return to operations and revenue while reducing the cost of data loss or downtime in the event of a disaster.