Most replication products talk a lot about how great high-speed data replication is. Naturally, I agree :) It's great. However, if something ever corrupts your source data, you'll likely see an ugly side of speed - your target gets all the source's corrupt data with sub-second latency. What a pain. You have no time to prevent the mirrored corruption. That might leave both source and target unusable. And that's especially frustrating if your target is supposed to be a standby for disaster recovery.
So, how do you avoid this problem? One way is by using a time-delayed target. This simply means that you always delay replication into the target by a predefined value. For example, you could delay replication by 15 minutes, an hour, or even 48 hours. The delay you pick is based on how you want to balance two things - (1) the currency of data at the target and (2) your ability to prevent the replication of corrupt data. If your users can't accept a delay at the target, you can still take advantage of this approach by simply having two targets. One would be delayed for disaster recovery and the other would use low-latency replication for day-to-day use.
You may have seen time-delayed targets discussed recently in an IBM developerWorks article about using Q Replication with DB2 pureScale. I'm going to talk about the general case and use Q Replication as an example.
First, understand the ways a delay could be implemented. A replication technology could choose to delay sending data to the target system. Or, it could stage data on the target system before applying it to the target. If it delays sending data to the target, you have the following challenges:
Other challenges may exist, but the ones listed here are big enough that many people prefer to see the delay and staging of data implemented on the target system.
If data is staged on the target, it's got to be held somewhere. That could be files, tables, or something else. That means you need to think about disk space. You'll need at least enough to hold the maximum volume of data that you expect to change during your time delay. In reality, you'll also want a buffer for those times when the target database is taken offline for maintenance. As an example, Q Replication implements time-delayed targets by staging data in WebSphere MQ on the target system. The following picture highlights the main points of how it works:
Yes, there generally is a queue manager on the source system. It's there for ensured, efficient delivery of data, not for staging. I left it out of the picture so I can focus only on those points specific to a time-delayed target. That leaves us with three points of interest. They are identified by numbered boxes. Here's what's happening at each:
In other words, it's very simple. Question is, how well does it work? Well, to stress Q Apply's delay processing, a sample workload of approximately 200k rows per second over several hours was run through Q Replication. Here's a graph of the data volume coming through two queues:
(Of course, this graph comes with the standard disclaimers - this is not a benchmark and your achievable throughput can vary significantly based on hardware, network, table characteristics, workload characteristics, how you configure replication, and so on)
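To tie that workload rate back to the disk space point from earlier, here's a rough back-of-the-envelope sizing sketch. It is not a Q Replication or MQ formula. The function name, the 200-byte average row size, the one-hour maintenance window, and the 25% headroom are all assumptions I made up for illustration; only the 200k rows per second and a 15-minute delay come from this post.

```python
def staging_bytes(peak_rows_per_sec, avg_row_bytes, delay_sec,
                  outage_sec=0, headroom=1.2):
    """Rough staging capacity (bytes) for a time-delayed target.

    Hypothetical helper: holds everything that changes during the delay,
    plus whatever piles up while the target is offline, plus headroom.
    """
    held_seconds = delay_sec + outage_sec
    return int(peak_rows_per_sec * avg_row_bytes * held_seconds * headroom)


# 200k rows/sec, assumed 200 bytes per row, 15-minute delay,
# assumed 1-hour maintenance window, assumed 25% headroom.
estimate = staging_bytes(200_000, 200, 900, outage_sec=3600, headroom=1.25)
print(f"{estimate / 1e9:.1f} GB")
```

Notice that the maintenance window, not the delay itself, dominates the number here. Plug in your own peak rate and row size before trusting any figure like this.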
Q Apply was set to delay changes by 15 minutes (900 seconds). Here's a graph of latency at the target during the test:
Pretty slick :) Q Apply easily maintained the target database with a constant delay of 15 minutes. If a source problem had been caught within 15 minutes of being introduced, Q Apply could have been stopped before the problem data was applied to the target. This means a time-delayed target can be a reasonable approach to helping you deal with significant corruption of source data in a data replication environment.
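If it helps to see the idea in miniature, here's a toy model of a delayed apply. To be clear, this is not how Q Apply is implemented. The `Change` class, the `apply_due` function, and the timestamps are hypothetical, and a plain in-memory deque stands in for the staging queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Change:
    commit_time: float  # source commit time, in seconds (made-up values)
    payload: str


def apply_due(staged, now, delay_sec):
    """Pop and return changes whose source commit is at least delay_sec old.

    Anything newer stays in the staging queue, untouched.
    """
    applied = []
    while staged and staged[0].commit_time <= now - delay_sec:
        applied.append(staged.popleft())
    return applied


# Staged in commit order; the last change is the corrupt one.
staged = deque([Change(0, "good-1"), Change(300, "good-2"), Change(600, "corrupt")])

# At t=1000 with a 900-second delay, only changes committed at or before t=100 are due.
applied = apply_due(staged, now=1000, delay_sec=900)
print([c.payload for c in applied])  # changes that reached the target
print([c.payload for c in staged])   # changes still held back
```

In this toy run the corruption is 400 seconds old when the apply step runs, so it is still sitting in the staging queue. Stop applying at that point and the target never sees it.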
One last point here with the Q Replication approach... if you've stopped Q Apply but know you still have good data staged in MQ, you can have Q Apply move those changes into the target, provided you can determine a specific time that is just prior to the corruption (that time does not have to line up with the 15-minute delay). For more information, see the ApplyUpTo parameter.
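Conceptually, that recovery step is just a cutoff on source commit time. The sketch below only models the idea and is not the ApplyUpTo implementation or its syntax. The `apply_up_to` function, the dates, and the choice to treat the cutoff as inclusive are my own assumptions, so check the parameter documentation for the real semantics.

```python
from datetime import datetime


def apply_up_to(staged, cutoff):
    """Split staged (commit_time, payload) pairs at a cutoff timestamp.

    Returns (applied, held): changes committed at or before the cutoff,
    and changes committed after it, which stay staged.
    """
    applied = [c for c in staged if c[0] <= cutoff]
    held = [c for c in staged if c[0] > cutoff]
    return applied, held


# Made-up timestamps; corruption is known to have started at 10:09.
staged = [
    (datetime(2012, 1, 1, 10, 0), "good-1"),
    (datetime(2012, 1, 1, 10, 5), "good-2"),
    (datetime(2012, 1, 1, 10, 9), "corrupt"),
]

# Pick a cutoff just before the corruption, regardless of the normal delay.
applied, held = apply_up_to(staged, cutoff=datetime(2012, 1, 1, 10, 8))
print([payload for _, payload in applied])
print([payload for _, payload in held])
```

The point of the cutoff is that you salvage every good change you can prove is good, instead of throwing away everything that was still staged when you hit stop.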
* If you're using Q Replication with DB2 for z/OS, you'll need a Replication Server APAR to take advantage of Q Replication's applydelay parameter. See PM38951 for Replication Server v10 and PM40181 for Replication Server v9.