The #1 Thing Everyone Screws Up In Enterprise IT

Too many of us in Enterprise IT ignore this one critical thing about infrastructure management:

We update all our cool technology without ever stopping to think about whether it will all stay compatible.

Why is that?

Users: we have huge egos. “My infrastructure is mine. I’ll do whatever I want with it. If your tech can’t keep up, too bad and screw you. I’ll buy something from someone else who ‘gets it’. I’ll post all over social media about how much you suck.”

Vendors: we’re slow and sniveling. “I can’t ever make demands of my customers. I can’t call them out when they mess up. They’re the smart ones. I’m just slinging tech. I’ll drop everything and get right on this.”

Bullshit.

Admins: Do not update your infrastructure without first making sure the update won’t break compatibility or interrupt continued operations.

Unless you’re new to the field, you know better. The IT environment is not your toy. Your title? Who cares. Are you the Founder? It’s still not yours; it belongs to the business. Period.

Vendors: Monitor your tech stack constantly. Monitor your tech ecosystem constantly. Update your documentation constantly. To top it all off, notify your customers proactively.

Unless no one on your team has ever done this before, you know better. Your tech doesn’t exist in a vacuum. You’re a user, too, and you know you ignore the warnings yourself. Congratulations, you’re human! But the reality is you’re a vendor now, and you need to know better and act better.

One more time:

  1. Admins: check the release notes, check the docs, and make sure updating one thing won’t break anything (or everything) else (a rough sketch of that check follows this list).
  2. Vendors: write clear release notes and tell your customers about any known compatibility issues in advance.
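
The details vary by shop, but the admin-side gate can be as simple as a script run before any change is approved. Here is a minimal sketch assuming a team-maintained compatibility matrix; the component names, versions, and matrix layout are illustrative assumptions, not taken from any vendor.

```python
# Hypothetical pre-update gate. The compatibility matrix, component names,
# and versions below are illustrative assumptions, not vendor data.
import sys

# Which versions of the other components each hypervisor release is known to support.
COMPAT_MATRIX = {
    "hypervisor": {
        "7.0": {"backup_agent": ["5.1", "5.2"], "storage_plugin": ["2.4"]},
    },
}

def check_update(component, new_version, installed):
    """Return a list of conflicts the proposed update would introduce."""
    supported = COMPAT_MATRIX.get(component, {}).get(new_version, {})
    conflicts = []
    for other, version in installed.items():
        allowed = supported.get(other)
        if allowed is not None and version not in allowed:
            conflicts.append(
                f"{other} {version} is not listed as compatible with "
                f"{component} {new_version} (allowed: {allowed})"
            )
    return conflicts

if __name__ == "__main__":
    installed = {"backup_agent": "5.0", "storage_plugin": "2.4"}
    problems = check_update("hypervisor", "7.0", installed)
    if problems:
        print("Do NOT update yet:")
        for p in problems:
            print(" -", p)
        sys.exit(1)
    print("No known conflicts; proceed through change control.")
```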

Yes, this all takes a little more time up front. It also avoids countless hours, days, or – even worse – lost business after the fact. Put another way: do it right or don’t do it at all!


Changing the data protection conversation

Two years ago I wrote about the top 5 questions on data protection I was hearing as a Zerto Systems Engineer (SE). Back then the conversations were typically with mid-market enterprises and covered topics of interest to System Administrators and IT Managers. These front-line daily warriors usually had specific questions about interoperability with our one supported platform at the time, VMware vSphere. Zerto was just a couple of years old, and in hindsight it’s unsurprising that the conversations were of a “does it work?” nature.

Since then I’ve worked with over 20% of the Fortune 100 and I’ve had the privilege of being pulled into hundreds of meetings across the F500 space. The conversations have moved from the front lines to the VP level and the C-suite. The Enterprise Architects, the VPs of IT and Operations, the CIO, the CTO, and of course the CFO are all putting increased focus and resources toward IT resilience.

We’ve moved away from talking bits and bytes. Today these conversations are about increasing agility and reducing risk. It’s about maintaining a competitive edge. It’s about maximizing customer satisfaction. And it’s often about minimizing the negative press that an outage generates on social platforms like Twitter and LinkedIn.

Let’s talk about the three most common data protection discussion topics that came up in my meetings through 2016.

1. Backups aren’t DR

Go ahead and Google it. You’ll find about 6,000,000 results to wade through. There are only about 700,000 for “backups are DR” (and if you skim the first several pages of those results, you’ll find more arguments against backups as DR than for). Businesses have awakened to the realization – often through great pain such as lost customers, lost data, and employee terminations all the way up to the C-suite – that customer data from last night, or even from an hour ago, is not good enough.

The older the data, the more orders are lost, the more customer relations work there is to do, the more order re-entry the sales and purchasing departments have to do, and so on. Outages today rarely touch one person or system; they touch the entire business. You need something that can return you to operations as quickly as possible with as little data loss as possible.

2. You’re not as “always on” as you think you are.

Notice how this is a direct contrast to the first issue – organizations with high-end, real-time data synchronization platforms versus hours-old backups. This discussion point is most often raised by VPs or CTOs with a storage background who are proud of their very expensive solutions. They are naturally protective of the senior staff and the budget requirements involved. What they built, and the skills they armed their engineers with, are to be respected, but times have changed. Today the discussion naturally turns to what happened the last time there was a virus or a data corruption issue. And the answer, each and every time, is that the issue spread throughout the business and, surprise, “we had to resort to backups”. And now we’re back to the first conversation, and to a similar awakening moment.

People in the room talk about the “resume-generating event” from a year or two ago, and then we start designing an updated solution. One of my colleagues at Zerto wrote about this very situation years ago. Layering a continuous data protection solution such as Zerto on top of stretched storage ultimately becomes a requirement. Many Zerto customers are running stretched systems from EMC, NetApp, HPE, and more recently on VMware vSAN. The number of customers who have swapped those “stretched” systems out for more-or-faster primary storage on their next refresh cycle? High. Others simply realize cost shifts or outright capital savings.

3. You need strategic platforms, not point solutions.

You’ve got a running operational plan, a 3-5 year plan, a vision for your IT/IS office, and of course a vision for the business. This is a strategy. Meanwhile, the majority of data protection solutions are purpose-built with tunnel vision. They do the single job of getting you a backup or a copy of your data which can hopefully be recovered to the same data center or a facsimile of it. Vanishingly few enable application protection and mobility across private, hybrid, and public clouds; span multiple use cases such as maintaining protection during data center migrations; or integrate with partners like IBM to let you spin up entire data centers with integrated application protection in a matter of minutes. I could go on through the dozen or so use cases this one platform fulfills.

Add that Zerto works across VMware vSphere, Microsoft Hyper-V, Amazon Web Services, and Microsoft Azure, and that list is growing as production-class market demand shifts in new directions like cloud-native applications, containers, and more. All of this in the same product. All with simple pricing instead of Dante’s nine circles of Hell. If you laughed at that, you know exactly where I’m coming from. On that note…

A little more for the CFOs in the room.

As someone who used to report to a CFO, I had to build a case for every product and solution over the 5-figure limit on my corporate card. I worked with resellers and vendors to weigh CAPEX against OPEX models, shift refresh cycles, and adjust a variety of terms to the front or back of the deal depending on cash flow projections – any legal and ethical means to fit the solution I wanted into the budget I had available.

My colleague Darren Swift, with help from the larger Zerto team, recently built a free online tool – the Zerto Business Case Builder – that any of our partners can use to help you make the business case with as little pain as possible. Just last week I had the honor of sitting on a panel at Zerto’s annual Sales Kickoff to extol the virtues of that tool to our global sales team. I can’t stress enough the value it brings in time saved and faster turnaround to purchase. What used to take me hours in spreadsheets now takes 15 minutes on a web page. While it isn’t available directly to customers today, a call to your trusted reseller is the only real barrier to entry. Get with your partner and take the 15 minutes; you’ll be glad you did.

How modern businesses minimize data loss when disaster strikes

One of the questions I’m asked most often is how to protect and recover public-facing applications, that is, applications accessible through the Internet. These commonly include e-commerce sites, SaaS platforms, supplier or purchasing solutions, and more. Such systems are often major revenue drivers for the business. With increasing B2B and B2C demands, rising competition, and tightening budgets, it’s now critical to minimize data loss and ensure a rapid return to operations in the event of an IT disaster.

My experience with an e-commerce startup in the not-too-distant past showed me first-hand the challenges internal stakeholders and IT teams face when dealing with major system outages. Frankly, the internal pressures can pale in comparison to those put on us by public consumers or business partners banging down the door when an outage occurs. Because many outages today have such public visibility, they are often referred to as “resume-generating events”.

With planning, design, and continuous testing we can address these concerns, reducing the revenue impact of an outage and shortening the time to recovery.

Challenges you’ll meet when trying to minimize data loss

  • Multiple moving parts – databases, middleware, web servers, load balancers, firewalls, and DNS, to name a few
  • Data consistency demands – recovering the database from the early-morning logs while the middleware is still churning can put orders out of sync, impact operations, and often require a good deal of manual cleanup by the application and IT teams
  • Application performance needs – this rules out most traditional backup and snapshot methods, and even VSS-integrated storage replication, as all of them can stun a database, drop users, drop sessions, and worse
  • Public presence – customers know very quickly when these systems are down, impacting trust in the product and the company, not to mention driving more calls to you, your sales team, and your customer/partner service team during outages

How did we minimize data loss with legacy approaches?

The number of servers and systems involved meant a team of people monitoring for uptime 24/7, or a heavy reliance on complex monitoring systems. Rapid data growth meant constant challenges in adding and maintaining database features such as log shipping, mirroring, and clustering. These features often bring storage growth or sprawl, increased licensing costs, new personnel skill requirements, and even a need to add headcount. More recently, many organizations have tried to reduce risk through storage-based or in-guest replication technologies, but both have their pros and cons, and worse still, both have failed to address the rise of virtualization, the growing complexity of recovery, or the need to do more with less.

Today, public outcry over downtime events means pressure from management, and even from shareholders or the board, across the IT organization. What is needed are data protection and recovery solutions that can protect the full application stack, provide for data integrity in replication and recovery, avoid impacting the running application, and provide for rapid recovery to minimize user downtime. Coupling such a platform with a managed pilot light service and DNS services, we can finally mitigate or virtually eliminate all of the risk factors defined above.

Step 1 – Protect the whole application, not just the parts

By leveraging a modern data protection solution, we group the complete application into a single wrapper, which lets us (see the sketch after this list):

  • Assign a service-level agreement
  • Define a recovery plan across the application’s server stack
  • Protect at the write-level with no performance degradation
  • Track write-consistency to ensure data integrity across the stack
  • Replicate at the speed of the application with recovery point objectives (RPOs) measured in seconds
  • Allow for rapid VM recovery back to any point in time to mitigate losses from corruption or virus-based disasters
  • Allow for push-button or even automated disaster recovery testing
  • Ensure automated failback configuration post-recovery
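
To make the idea of an application-level wrapper concrete, here is a minimal sketch of how such a protection group might be modeled. This is illustrative only; the class and field names are assumptions for this post, not Zerto’s actual API or object model.

```python
# A minimal sketch of an application-level protection group.
# Illustrative only; names and fields are assumptions, not a product API.
from dataclasses import dataclass, field
from datetime import timedelta
from typing import List

@dataclass
class ProtectedVM:
    name: str
    boot_order: int      # recovery plan: lower numbers boot first (DB before web tier)
    recovery_ip: str     # post-failover address, pre-mapped at the target site

@dataclass
class ProtectionGroup:
    app_name: str
    rpo_target: timedelta          # SLA expressed in seconds, not hours
    journal_retention: timedelta   # how far back point-in-time recovery can go
    vms: List[ProtectedVM] = field(default_factory=list)

    def recovery_plan(self):
        """Return VMs in the order they should be recovered, so the whole
        stack is treated as one write-consistent unit."""
        return sorted(self.vms, key=lambda vm: vm.boot_order)

# Example: the whole e-commerce stack travels together as a single unit.
shop = ProtectionGroup(
    app_name="ecommerce",
    rpo_target=timedelta(seconds=10),
    journal_retention=timedelta(days=3),
    vms=[
        ProtectedVM("shop-db01", boot_order=1, recovery_ip="10.20.0.11"),
        ProtectedVM("shop-app01", boot_order=2, recovery_ip="10.20.0.21"),
        ProtectedVM("shop-web01", boot_order=3, recovery_ip="10.20.0.31"),
    ],
)

for vm in shop.recovery_plan():
    print(f"recover {vm.name} -> {vm.recovery_ip}")
```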

Step 2 – Prepare the Infrastructure for Disaster

To ensure rapid recovery for users and customers, we need to pre-configure the infrastructure components that sit outside the core application with the features needed to support a disaster declaration. This involves:

  • Configuring firewall and proxy assets such as NAT rules or port forwarding for web applications or private extranet access
  • Adding post-recovery IP addresses and fully-qualified domain names to load balancers in a passive or listening mode
  • Leveraging DNS availability solutions from providers such as Dyn Managed DNS or DNS Made Easy to ensure public access cuts over immediately on application recovery (a sketch of such a cutover follows this list)
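
For the DNS piece, the cutover itself can be reduced to one well-rehearsed call once a disaster is declared. Below is a minimal sketch assuming a generic REST-style DNS provider; the endpoint, token, record identifier, and addresses are placeholders, and Dyn Managed DNS and DNS Made Easy each expose their own, different APIs.

```python
# A minimal sketch of cutting public DNS over to the recovery site once a
# disaster is declared. Endpoint, token, record ID, and IPs are placeholders.
import requests

DNS_API = "https://dns.example-provider.com/v1"   # placeholder provider endpoint
API_TOKEN = "REPLACE_ME"
RECORD_ID = "www-example-com"                     # placeholder record identifier

PRIMARY_IP = "203.0.113.10"      # production site VIP
RECOVERY_IP = "198.51.100.10"    # recovery site VIP (pre-staged on the load balancer)

def declare_disaster():
    """Point the public record at the recovery site with a short TTL so
    clients cut over as soon as their caches expire."""
    resp = requests.put(
        f"{DNS_API}/records/{RECORD_ID}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"type": "A", "value": RECOVERY_IP, "ttl": 60},
        timeout=10,
    )
    resp.raise_for_status()
    print(f"DNS now answers {RECOVERY_IP}; failback later reverses this to {PRIMARY_IP}.")

if __name__ == "__main__":
    declare_disaster()
```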

Step 3 – Continuously Validate Recovery Plans

To ensure rapid return to operations in the event of a real disaster, it is critical to test and validate your achievable recovery time objective (RTO) frequently, with some organizations testing on a weekly basis. This testing must also be simple enough that anyone on the Emergency Response Team can perform it in as few steps as possible. Such testing often includes:

  • The use of isolated test environments to enable testing at any time with no impact to Production
  • Using “pilot light” systems such as directory services with test users and name resolution services to simulate recovery
  • Deploying “jump box” capabilities so test users can access the isolated environment
  • Capturing all actions taken during a test to provide evidence-based validation of recovery testing (a rough sketch of such a test-and-capture run follows this list)
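
As an illustration of the last two points, here is a minimal sketch of an automated, evidence-producing test run. The pilot-light addresses and health-check targets are placeholders, and it assumes your tooling has already brought up the isolated test failover before this script runs.

```python
# A minimal sketch of an automated, evidence-producing DR test run.
# Addresses and health-check targets are placeholder assumptions.
import json
import socket
from datetime import datetime, timezone
from urllib.request import urlopen

CHECKS = {
    "pilot-light DNS": ("10.99.0.53", 53),           # name resolution inside the isolated bubble
    "pilot-light AD":  ("10.99.0.10", 389),          # directory services for test users
    "app via jump box": "http://10.99.0.31/health",  # reached from the isolated jump box
}

def check(target):
    """Return True if a TCP port or HTTP health endpoint answers."""
    try:
        if isinstance(target, tuple):
            with socket.create_connection(target, timeout=5):
                return True
        return urlopen(target, timeout=5).status == 200
    except OSError:
        return False

def run_test():
    started = datetime.now(timezone.utc).isoformat()
    results = {name: check(target) for name, target in CHECKS.items()}
    report = {
        "started": started,
        "finished": datetime.now(timezone.utc).isoformat(),
        "results": results,
        "rto_met": all(results.values()),
    }
    # Evidence-based validation: keep a timestamped artifact of every test.
    with open(f"dr-test-{started[:10]}.json", "w") as f:
        json.dump(report, f, indent=2)
    return report

if __name__ == "__main__":
    print(json.dumps(run_test(), indent=2))
```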

Putting it all Together

This was a high-level overview of technology risk management planning for public-facing applications. By protecting the application, preparing secondary systems for disaster, and continually testing for preparedness, any organization can minimize data loss and achieve push-button disaster recovery. This brings a new level of application availability, enabling a rapid return to operations and revenue while reducing the cost of data loss or downtime in the event of a disaster.