Top 6 Factors Impacting Your Data Replication Performance

This is an article about Zerto, the award-winning IT Resilience platform I was first exposed to in 2011 and loved so much that I joined the team in 2012. For help with any Zerto-specific terminology, see Zerto’s official product documentation. This post refers to Zerto Virtual Replication versions 6.x and earlier.

Whenever I’m speaking with a Zerto customer or partner about IT Resilience and things get technical, we’ll inevitably come to the part where someone asks how fast their data will go from point A to point B. While this may not seem immediately meaningful to the business, the underlying concerns are minimizing risk, typically measured as data loss, and protecting revenue, typically measured as how quickly you can return to production.

Each time, a decision maker on the technical team will go right to the volume of data they need to move from point A to point B and their ISP’s listed internet connection speed. “This application is backed by a 46 TB database”, “this application has 8 virtual machines (VMs) with 1 TB of volumes each”, and so on, coupled with “we have an X Gbps connection, so we expect this to go at X Gbps.”

It’s at this time we get to have some straight-talk.

Here’s the reality: getting data out of one spot in a data center, and into another spot in another data center, has maybe 50% to do with your provider’s internet connection speed. There’s a pile of equipment, software, and configurations sitting between where your data lives and the point that I call “just getting your data out the door.” There’s another list of stuff involved in getting your data across the expanse of time and space leading to the next building. Then, that data has to get inside the next building and ultimately it’s got to land on your target storage.

Each of the factors below can interact with the others in ways that are often unknown to the customer’s IT staff and even the ISP’s own engineers. The good news is that these factors are always in play. I say “good news” because that means you can always identify them, measure them, and keep track of them over time.

  • Network devices: your routers, switches, gateways, load balancers, and then all of the ISP’s gear…
  • Network services: your MPLS, VPN, QoS, bandwidth shaping, and then all of your ISP’s services configurations…
  • Network overheads and limits: hypervisor offload, frame rate limits, packet-per-second limits, max-concurrent-connection limits, latency, end-to-end MTU configurations, other application workloads…
  • Security devices: firewalls, any host security hardware such as data encryption modules or encrypted NICs
  • Security services: Intrusion Detection, Intrusion Prevention, other network filtering services; any host security software such as host anti-virus
  • Storage devices: read/write split, RAID type, caching, backplane performance, compression, deduplication, latency, other application workloads
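To make the “your link speed is not your replication speed” point concrete, here’s a minimal back-of-the-envelope sketch. It is not Zerto-specific, and every number in it is an invented example, but it shows the shape of the problem: effective throughput is the minimum of all the bottlenecks in the chain, including the classic single-stream TCP ceiling (throughput ≤ window / RTT) that WAN latency imposes.

```python
# Illustrative sketch: effective replication throughput is bounded by the
# slowest element in the chain, not by the ISP's advertised line rate.
# All figures below are made-up example values, not measurements.

def tcp_throughput_limit(window_bytes: float, rtt_seconds: float) -> float:
    """Classic single-stream TCP ceiling: throughput <= window / RTT."""
    return window_bytes / rtt_seconds

link_gbps = 10                           # ISP's advertised rate
link_bytes_per_s = link_gbps * 1e9 / 8   # convert to bytes per second

# Hypothetical bottlenecks along the path, in bytes per second:
bottlenecks = {
    "ISP link": link_bytes_per_s,
    "TCP window (4 MB) over 40 ms WAN": tcp_throughput_limit(4 * 1024 * 1024, 0.040),
    "firewall/IPS inspection": 400e6,
    "target array sustained writes": 250e6,
}

effective = min(bottlenecks.values())
worst = min(bottlenecks, key=bottlenecks.get)
print(f"Effective throughput: {effective / 1e6:.1f} MB/s (limited by: {worst})")

# Time to move a 46 TB data set at that effective rate:
dataset_bytes = 46e12
hours = dataset_bytes / effective / 3600
print(f"46 TB initial sync: roughly {hours:.0f} hours")
```

In this made-up scenario the 10 Gbps link isn’t the limit at all; a 4 MB TCP window over a 40 ms round trip caps a single stream at about 105 MB/s. Swap in your own window sizes, RTT, and device throughput figures and the bottleneck can move anywhere in that dictionary.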

I recall an issue where a customer had a single Production application running at over 20,000 IOPS, and they were getting upset with our Support team because they weren’t getting the replication performance they expected.

…what they neglected to say (and what we almost immediately saw in our log files) was their target storage array’s write performance wasn’t anywhere near capable of keeping up with that speed. I’m not an “I told you so” person, but I did make sure to go over the above list with them for a second time.
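As a back-of-the-envelope illustration of that situation (the 20,000 IOPS figure is from the anecdote above; the write size and the target array’s numbers are invented for the example), when the source sustains more write throughput than the target array can commit, the replication backlog grows without bound:

```python
# Illustrative sketch: when the source writes faster than the target array
# can absorb, the replication backlog grows. Example numbers only.

io_size_bytes = 8 * 1024          # assume an 8 KB average write size

source_iops = 20_000              # production workload, as in the anecdote
target_write_iops = 6_000         # hypothetical slower recovery-site array

source_mb_s = source_iops * io_size_bytes / 1e6
target_mb_s = target_write_iops * io_size_bytes / 1e6
backlog_mb_per_min = (source_mb_s - target_mb_s) * 60

print(f"Source change rate: {source_mb_s:.0f} MB/s")
print(f"Target drain rate:  {target_mb_s:.0f} MB/s")
print(f"Backlog growth:     {backlog_mb_per_min:.0f} MB per minute")
```

With those assumed numbers the recovery-site array falls behind by several gigabytes every minute, no matter how fast the WAN is, which is exactly what the log files showed.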

I’ve seen replication performance bottlenecked by storage far more often than by any other factor.

In my experience, most businesses put all their money into “production” and peanuts into recovery, including the recovery site hardware. As older storage arrays continue to be replaced by modern equipment, and as IT resilience continues moving up the list of business priorities, I believe we’ll see this change for the better.

As an aside, I’ve been in so many meetings and conference calls where fingers are pointed left, right, and sideways because nobody wants to believe their piece of the puzzle is the problem. One of the best parts about IT, though, is that it’s easy to cut through the noise and get straight to the facts.