Top 6 Factors Impacting Your Data Replication Performance

This is an article about Zerto, the award-winning IT Resilience platform I was first exposed to in 2011 and loved so much that I joined the team in 2012. For help with any Zerto-specific terminology, see Zerto’s official product documentation. This post refers to Zerto Virtual Replication versions 6.x and earlier.

Whenever I’m speaking with a Zerto customer or partner about IT Resilience and things get technical, we’ll inevitably come to the part where someone asks how fast their data will go from point A to point B. While this may not seem immediately meaningful to the business, the underlying concerns are minimizing risk (typically measured as data loss) and protecting revenue (typically measured as the time it takes to return to production).

Each time, a decision maker on the technical team will go right to the volume of data they need to move from point A to point B and their ISP’s listed internet connection speed. “This application is backed by a 46 TB database”, “this application has 8 virtual machines (VMs) with 1 TB of volumes each”, and so on, coupled with, “we have an X Gbps connection so we expect this to go at X Gbps.”
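
That expectation usually comes from the naive math, which goes something like this back-of-the-envelope sketch (purely illustrative numbers, not from any particular engagement):

```python
# Back-of-the-envelope: how long would 46 TB take at pure line rate?
# Illustrative numbers only -- real throughput is shaped by everything discussed below.

data_tb = 46                            # size of the dataset to move
link_gbps = 1                           # advertised ISP connection speed

data_bits = data_tb * 10**12 * 8        # decimal terabytes -> bits
seconds = data_bits / (link_gbps * 10**9)

print(f"Ideal transfer time: {seconds / 3600:.1f} hours")   # roughly 102 hours
```

Even that best case is more than four days of saturating the link with nothing else running on it – and it’s a number nobody ever actually sees in practice.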

It’s at this time we get to have some straight-talk.

Here’s the reality: getting data out of one spot in a data center, and into another spot in another data center, has maybe 50% to do with your provider’s internet connection speed. There’s a pile of equipment, software, and configurations sitting between where your data lives and the point that I call “just getting your data out the door.” There’s another list of stuff involved in getting your data across the expanse of time and space leading to the next building. Then, that data has to get inside the next building and ultimately it’s got to land on your target storage.

Each of the factors below can interact with the others in ways that are often unknown to the customer’s IT staff and even to the ISP’s own engineers. The good news is that these factors are always in play. I say “good news” because that means you can always identify them, measure them, and plan around them; the rough sketch after the list shows how they stack up.

  • Network devices: your routers, switches, gateways, load balancers, and then all of the ISP’s gear…
  • Network services: your MPLS, VPN, QoS, bandwidth shaping, and then all of your ISP’s services configurations…
  • Network overheads and limits: hypervisor offload, frame rate limits, packet-per-second limits, max-concurrent-connection limits, latency, end-to-end MTU configurations, other application workloads…
  • Security devices: firewalls, any host security hardware such as data encryption modules or encrypted NICs
  • Security services: Intrusion Detection, Intrusion Prevention, other network filtering services; any host security software such as host anti-virus
  • Storage devices: read/write split, RAID type, caching, backplane performance, compression, deduplication, latency, other application workloads
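
Here’s the rough sketch I promised: a toy model of how these factors combine. The stages and the Mbps figures are illustrative assumptions, not measurements from any environment – the point is simply that the end-to-end rate is set by the slowest stage, not by the advertised WAN bandwidth.

```python
# Hypothetical throughput budget: the effective replication rate is capped
# by the slowest stage in the chain. All figures are illustrative assumptions.

stages_mbps = {
    "source storage reads":      900,
    "hypervisor/NIC overhead":   800,
    "WAN link (advertised)":    1000,
    "VPN/firewall inspection":   400,
    "target storage writes":     250,
}

bottleneck = min(stages_mbps, key=stages_mbps.get)
print(f"Effective ceiling: ~{stages_mbps[bottleneck]} Mbps, set by {bottleneck}")
```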

I recall an issue where a customer had a single application in Production that was running at over 20,000 IOPS and they were getting upset with our Support team because they weren’t getting the replication performance they expected.

…what they neglected to say (and what we almost immediately saw in our log files) was their target storage array’s write performance wasn’t anywhere near capable of keeping up with that speed. I’m not an “I told you so” person, but I did make sure to go over the above list with them for a second time.
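
For anyone wondering why that was never going to work, the arithmetic is short. If we assume those 20,000 IOPS are mostly writes at an average size of 16 KB (an assumption for illustration – the real mix varies by application), the sustained write rate the target array must absorb looks like this:

```python
# Rough check: can the target array absorb the source's write rate?
# I/O size and the target array's capability are illustrative assumptions.

source_write_iops = 20_000
avg_io_size_kb = 16                      # assumed average write size

required_mbps = source_write_iops * avg_io_size_kb / 1024   # MB/s needed
target_array_mbps = 150                                      # assumed target write ceiling

print(f"Replication needs ~{required_mbps:.0f} MB/s of sustained writes")
print("Target array " + ("keeps up" if target_array_mbps >= required_mbps else "falls behind")
      + f" at ~{target_array_mbps} MB/s")
```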

I’ve seen replication performance limited by storage far more frequently than by any of the other factors.

In my experience, most businesses put all their money into “production” and peanuts into recovery, including the recovery site hardware. As older storage arrays continue to be replaced by modern equipment, and as IT resilience continues moving up the list of business priorities, I believe we’ll see this change for the better.

As an aside, I’ve been in so many meetings and conference calls where fingers are pointed left, right, and sideways because nobody wants to believe their piece of the puzzle is the problem. One of the best parts about IT, though, is that it’s easy to cut through the noise and get straight to the facts.

Delivering Certainty in an Uncertain (IT) World

[Photo: Retired General Michael Hayden delivering a keynote at ZertoCON 2017]
Special thanks to eGroup CEO Mike Carter whose post on this topic prompted me to write my own.

Retired General Michael Hayden, the former Director of the CIA and NSA, attended the second-annual ZertoCON conference as a keynote speaker. His topic of choice, risk and risk management, came as no surprise given his decades of experience in the field. One surprise did come later, though, during my conversations with several customers and partners on Gen. Hayden’s thoughts on the classic risk equation:

Risk = Threat x Vulnerability x Consequence

Specific to the information technology field, the surprise for me was the high number of people and organizations focused on blanket approaches to threats and vulnerabilities as opposed to a more targeted strategy. The problem I have with the generalized posture is that threat, as General Hayden likes to say, “is asymmetrical and all advantage goes to the attacker”. You can detect-and-deny the threats you know. You can patch the gaps you know. I say this having invested in exactly the same way back when I was the one with the budget! But there are always new tools, new attack vectors, new bugs, and more bad actors. Not to mention that the hottest trend in technology right now – the joining together of automation, artificial intelligence, and machine learning systems – means we will soon face an exponential increase in the types and volumes of threats to people and information.

Will, not may.

Suddenly that blanket approach doesn’t seem like it could ever be big enough, yet we keep throwing money at it.

I’m not advocating we ignore threat mitigation as an organizational strategy. After a week at ZertoCON where we had several sessions on Ransomware and unpatched vulnerabilities, I won’t say we should abandon vulnerability mitigation either. But threats increase, always. And vulnerabilities? The next zero-day exploit is lurking right around the corner.

Can we direct our limited resources instead of taking a blanket approach?

It turns out we can. Consequence, unlike our other two variables, can be qualified and quantified rather accurately. What happens if this system is breached? What happens if we lose access to that data? Is it a $1,000 loss? A $10,000,000 loss? A brand impact? Just a grouchy business unit that is otherwise unaffected? Once you identify the real outcomes a loss might lead to, map out where you are spending your resources and ask yourself whether those mappings still make sense – or whether you should be shifting resources around and investing in ways that take a targeted approach to your risk profile.
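
One way to make that exercise tangible is a simple consequence-weighted ranking. The application names and dollar figures below are entirely hypothetical – the shape of the exercise is the point:

```python
# Toy example: rank applications by the estimated consequence of losing them,
# then sanity-check where the protection budget actually goes.
# All names and figures are hypothetical.

apps = {
    #  app name            (est. loss per outage, current annual protection spend)
    "order processing":    (10_000_000,  50_000),
    "customer data lake":  ( 2_500_000,  15_000),
    "internal wiki":       (     1_000,  20_000),
}

for name, (consequence, spend) in sorted(apps.items(),
                                         key=lambda kv: kv[1][0], reverse=True):
    print(f"{name:20s} consequence ~ ${consequence:>12,}   spend ~ ${spend:,}")
```

Laid out like this, any mismatch between consequence and spend is hard to ignore.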

Spending time and money attempting to address threats and vulnerabilities without first understanding consequence is a losing game. Assess the consequences of application and information loss first and you’ve got the beginnings of a sound strategy. Only when you know the consequences – the impact of losing an application or losing control of or access to your data – can you hope to properly direct your resources as you work to deliver certainty in an uncertain world.

Changing the data protection conversation

Two years ago I wrote about the top 5 questions on data protection I was hearing as a Zerto Systems Engineer (SE). Back then the conversations were typically with mid-market enterprises and covered topics of interest to System Administrators and IT Managers. These front-line daily warriors usually had specific questions around interoperability with our one supported platform at the time, VMware vSphere. Zerto was just a couple of years old, and in hindsight it’s unsurprising that the conversations were of a “does it work?” nature.

Since then I’ve worked with over 20% of the Fortune 100 and I’ve had the privilege of being pulled in to hundreds of meetings across the F500 space. The conversations have moved from the front lines to the VP level and the C-suite. The Enterprise Architects, VPs of IT and Operations, the CIO, CTO, and of course the CFO are all putting increased focus and resources toward IT resilience.

We’ve moved away from talking bits and bytes. Today these conversations are about increasing agility and reducing risk. It’s about maintaining a competitive edge. It’s about maximizing customer satisfaction. And it’s often about minimizing that negative press that comes with an outage through social platforms like Twitter and LinkedIn.

Let’s talk about the three most common data protection discussion topics that came up in my meetings through 2016.

1. Backups aren’t DR

Go ahead and Google it. You’ll find about 6,000,000 results to wade through. There are only about 700,000 for “backups are DR” (skim through the first several pages of results and you’ll find more arguments against backups as DR than for). Businesses have awakened to the realization – often through great pains such as lost customers, lost data, and employee terminations all the way up to the C-suite – that the customer data from last night or even from an hour ago is not good enough.

The older the data, the more orders are lost, the more customer relations work there is to do, the more order re-entry the sales and purchasing departments have to do, and so on. Outages today rarely touch one person or system; they touch the entire business. You need something that can return you to operations as quickly as possible with as little data loss as possible.
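
To put rough numbers on “older data costs more”, here’s a small sketch of how lost revenue scales with the age of your last good recovery point. Order volume and order value are assumptions chosen only for illustration:

```python
# Sketch: the cost of lost data grows with the age of the last good copy.
# Order volume and value are illustrative assumptions.

orders_per_hour = 200
avg_order_value = 85          # dollars

for recovery_point_age_hours in (0.01, 1, 8, 24):   # seconds-old CDP vs. a nightly backup
    lost_revenue = orders_per_hour * avg_order_value * recovery_point_age_hours
    print(f"Recovery point {recovery_point_age_hours:>5} h old -> "
          f"~${lost_revenue:,.0f} in lost orders alone")
```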

2. You’re not as “always on” as you think you are.

Notice how this is a direct contrast to the first issue – organizations with high-end, real-time data synchronization platforms versus hours-old backups. This discussion point is most often raised by those VPs or CTOs with a storage background who are proud of their very expensive solutions. They are naturally protective of the senior staff and the budget those solutions require. What they built, and the skills they armed their engineers with, are to be respected, but times have changed. Today the discussion naturally turns to what happened the last time there was a virus or a data corruption issue. And the answer, each and every time, is that the issue spread throughout the business and, surprise, “we had to resort to backups”. And now we’re back to the first conversation – and to an awakening moment similar to the last one.

People in the room talk about the “resume-generating event” a year or two ago, and then we start designing an updated solution. One of my colleagues at Zerto wrote about this very situation years ago. Layering a continuous data protection solution such as Zerto on top of stretched storage ultimately becomes a requirement. Many Zerto customers are running stretched systems from EMC, NetApp, HPE, and more recently on VMware vSAN. The number of customers who have swapped those “stretched” systems out for more-or-faster primary storage on their next refresh cycle? High. Others simply realize cost shifts or outright capital savings.

3. You need strategic platforms, not point solutions.

You’ve got a running operational plan, a 3-5 year plan, a vision for your IT/IS office, and of course a vision for the business. This is a strategy. Meanwhile, the majority of data protection solutions are purpose-built with tunnel vision: they do the single job of getting you a backup or a copy of your data which can hopefully be recovered to the same data center or a facsimile of it. Vanishingly few enable application protection and mobility across private, hybrid, and public clouds; span multiple use cases such as maintaining protection during data center migrations; or integrate with partners like IBM to let you spin up entire data centers with integrated application protection in a matter of minutes. I could go on through the dozen or so use cases this one platform fulfills.

Add that Zerto works across VMware vSphere, Microsoft Hyper-V, Amazon Web Services, and Microsoft Azure – and that list keeps growing as production-class market demand shifts in new directions like cloud-native applications, containers, and more. All of this in the same product. All with simple pricing instead of Dante’s nine circles of Hell. If you laughed at that, you know exactly where I’m coming from. On that note…

A little more for the CFOs in the room.

As someone who used to report to a CFO, I had to build a case for every product and solution over the 5-figure limit on my corporate card. I worked with resellers and vendors to weigh CAPEX against OPEX models, shift refresh cycles, and adjust a variety of terms to the front or back of the deal depending on cash flow projections – any legal and ethical means to fit the solution I wanted into the budget I had available.

My colleague Darren Swift, with help from the larger Zerto team, recently built a free online tool – the Zerto Business Case Builder – that any of our partners can use to help you make the business case with as little pain as possible. Just last week I had the honor of sitting on a panel at Zerto’s annual Sales Kickoff to extol the virtues of that tool to our global sales team. I can’t stress enough the value it brings in time saved and a faster path to purchase. What used to take me hours in spreadsheets now takes 15 minutes on a web page. While it isn’t available directly to customers today, a call to your trusted reseller is the only real barrier to entry. Get with your partner, take the 15 minutes, and you’ll be glad you did.