One of the questions I’m often asked is how to protect and recover public-facing applications, that is, applications accessible through the Internet. These commonly include e-commerce sites, SaaS platforms, supplier or purchasing solutions, and more. Such systems are often major revenue drivers for the business. With increasing B2B and B2C demands, rising competition, and tightening budgets, it’s now critical to minimize data loss and ensure a rapid return to operations in the event of an IT disaster.
My experience with an e-commerce startup in the not-too-distant past showed me first-hand the challenges internal stakeholders and IT teams face during major system outages. Frankly, the internal pressures can pale in comparison to those from public consumers or business partners banging down your door when an outage occurs. Given the public visibility of many outages today, these are often referred to as “resume-generating events”.
With planning, design, and continuous testing, we can address these concerns, reducing the revenue impact of an outage and shortening the time to recovery.
Challenges you’ll meet when trying to minimize data loss
- Multiple moving parts – databases, middleware, web servers, load balancers, firewalls, and DNS to name a few
- Data consistency demands – recovering the database from early-morning logs while the middleware is still churning can put orders out of sync, disrupt operations, and force the application and IT teams into a good deal of manual cleanup
- Application performance needs – this rules out most traditional backup and snapshot methods, and even VSS-integrated storage replication, since all of them can stun a database, drop user sessions, or worse
- Public presence – customers notice very quickly when these systems are down, eroding trust in the product and the company, not to mention driving more calls to you, your sales team, and your customer/partner service team during outages
How did we minimize data loss with legacy approaches?
The number of servers and systems involved meant a team of people monitoring for uptime 24/7, or a heavy reliance on complex monitoring systems. Rapid data growth meant constant challenges in adding and maintaining database features such as log shipping, mirroring, and clustering. These features often drive storage growth and sprawl, higher licensing costs, new personnel skill requirements, and even additional headcount. More recently, many organizations have tried to reduce risk through storage-based or in-guest replication technologies, but both have their pros and cons, and worse still, both have failed to address the rise of virtualization, the growing complexity of recovery, or the need to do more with less.
Today, public outcry over downtime events translates into management and even shareholder or board pressure across the IT organization. What is needed are data protection and recovery solutions that protect the full application stack, preserve data integrity in replication and recovery, avoid impacting the running application, and enable rapid recovery to minimize user downtime. By coupling such a platform with a managed pilot-light service and DNS services, we can finally mitigate or virtually eliminate all of our defined risk factors.
Step 1 – Protect the whole application, not just the parts
By leveraging a modern data protection solution, we group the complete application into a single wrapper, which lets us:
- Assign a service-level agreement
- Define a recovery plan across the application’s server stack
- Protect at the write level with no performance degradation
- Track write-consistency to ensure data integrity across the stack
- Replicate at the speed of the application with recovery point objectives (RPOs) measured in seconds
- Allow for rapid VM recovery back to any point in time to mitigate losses from corruption or virus-based disasters
- Allow for push-button or even automated disaster recovery testing
- Ensure automated failback configuration post-recovery
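To make the grouping idea concrete, here is a minimal Python sketch of an application-level protection group with an RPO-based SLA. All names (`ProtectionGroup`, the VM identifiers, the timestamp dictionary) are hypothetical illustrations, not the API of any particular product; the key point it demonstrates is that the stack’s achievable RPO is set by its least-recently replicated member, because write consistency must hold across every tier.

```python
from dataclasses import dataclass

@dataclass
class ProtectionGroup:
    """Groups every VM in an application stack under one SLA (hypothetical sketch)."""
    name: str
    vms: list                 # ordered stack: e.g. database, middleware, web tier
    target_rpo_seconds: int   # SLA: maximum tolerable data loss, in seconds

    def achieved_rpo(self, last_replicated_write: dict, now: float) -> float:
        # The whole stack is only consistent up to the oldest replicated
        # write, so the laggiest VM defines the group's RPO.
        oldest = min(last_replicated_write[vm] for vm in self.vms)
        return now - oldest

    def sla_breached(self, last_replicated_write: dict, now: float) -> bool:
        return self.achieved_rpo(last_replicated_write, now) > self.target_rpo_seconds

# Example: a three-tier storefront with a 10-second RPO target.
now = 1_000_000.0
group = ProtectionGroup("storefront", ["db01", "mw01", "web01"], 10)
lag = {"db01": now - 3, "mw01": now - 7, "web01": now - 2}
print(group.achieved_rpo(lag, now))   # middleware is the laggard
print(group.sla_breached(lag, now))
```

A real replication engine tracks this at the individual write level rather than per-VM timestamps, but the group-wide worst-case calculation is the same design choice.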
Step 2 – Prepare the Infrastructure for Disaster
To ensure rapid recovery to users and customers, we need to pre-configure those infrastructure components that exist outside of the core application with the features needed to support a disaster declaration. This involves:
- Configuring firewall and proxy assets such as NAT rules or port-forwarding for web applications or private extranet access
- Adding post-recovery IP addresses and fully-qualified domain names to load balancers in a passive or listening mode
- Leveraging DNS availability solutions from providers such as Dyn Managed DNS or DNS Made Easy to ensure public access cuts over immediately on application recovery
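The DNS cutover decision in the last bullet can be sketched as a small health-check policy. This is an illustrative stand-in for what a managed DNS provider’s failover feature does internally, not their actual API: the public A record keeps resolving to the primary site until a run of consecutive failed probes, so a single dropped packet doesn’t flip public DNS.

```python
def choose_active_endpoint(probe_history, primary_ip, dr_ip, fail_threshold=3):
    """Return the IP the public A record should resolve to.

    probe_history: list of booleans, oldest first; True means the
    primary site answered its health check. Fail over only after
    `fail_threshold` consecutive failures (hypothetical policy sketch).
    """
    recent = probe_history[-fail_threshold:]
    if len(recent) == fail_threshold and not any(recent):
        return dr_ip
    return primary_ip

# Example with documentation IPs: two failures is not yet a disaster...
print(choose_active_endpoint([True, False, False], "203.0.113.10", "198.51.100.10"))
# ...three consecutive failures cuts public traffic over to the DR site.
print(choose_active_endpoint([False, False, False], "203.0.113.10", "198.51.100.10"))
```

Pairing this with a short TTL on the record is what makes the cutover feel immediate to customers once the application itself has been recovered.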
Step 3 – Continuously Validate Recovery Plans
To ensure rapid return to operations in the event of a real disaster, it is critical to test and validate your achievable recovery time objective (RTO) frequently, with some organizations testing on a weekly basis. This testing must also be simple enough that anyone on the Emergency Response Team can perform it in as few steps as possible. Such testing often includes:
- The use of isolated test environments to enable testing at any time with no impact to Production
- Using “pilot light” systems such as directory services with test users and name resolution services to simulate recovery
- Deploying “jump box” capabilities so test users can access the isolated environment
- Capturing all actions taken during a test to have evidence-based validation of recovery testing
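The testing steps above can be sketched as a simple runner that walks an ordered recovery plan, records a timestamped evidence log, and reports the achieved RTO. The step names and the `execute` callback are hypothetical; the pattern shows why stopping at the first failed step matters: it makes the gap in the runbook obvious instead of burying it under later noise.

```python
import time

def run_recovery_test(plan_steps, execute):
    """Run a DR test in an isolated environment and capture evidence.

    plan_steps: ordered list of recovery-plan step names.
    execute:    callable performing one step, returning True on success
                (in a real test this would boot VMs, check services, etc.).
    Returns the evidence log and the achieved RTO in seconds.
    """
    evidence = []
    start = time.monotonic()
    for step in plan_steps:
        ok = execute(step)
        evidence.append({
            "step": step,
            "ok": ok,
            "elapsed_s": round(time.monotonic() - start, 3),
        })
        if not ok:
            break  # surface the broken step; don't mask it with later ones
    return evidence, time.monotonic() - start

# Example plan mirroring the bullets above (step names are illustrative).
steps = ["boot database tier", "boot middleware", "boot web tier",
         "verify jump-box access"]
log, rto = run_recovery_test(steps, lambda s: True)
print(len(log), "steps passed; achieved RTO:", round(rto, 3), "s")
```

Because the evidence log is plain data, it can be archived after every weekly test, giving the Emergency Response Team the evidence-based validation the last bullet calls for.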
Putting it all Together
This was a high-level overview of technology risk management planning for public-facing applications. By protecting the application, preparing secondary systems for disaster, and continually testing for preparedness, any organization can minimize data loss and achieve push-button disaster recovery. This brings a new level of application availability, enabling a rapid return to operations and revenue while reducing the cost of data loss or downtime in the event of a disaster.