
Building Time-Delayed Targets with Data Replication

 

Most replication products talk a lot about how great high-speed data replication is.  Naturally, I agree :)  It's great. However, if something ever corrupts your source data, you'll likely see an ugly side of speed - your target gets all the source's corrupt data with sub-second latency.  What a pain.  You have no time to prevent the mirrored corruption. That might leave both source and target unusable. And that's especially frustrating if your target is supposed to be a standby for disaster recovery.

 

So, how do you avoid this problem?  One way is by using a time-delayed target.  This simply means that you always delay replication into the target by a predefined interval.  For example, you could delay replication by 15 minutes, an hour, or even 48 hours.  The delay you pick is based on how you want to balance two things - (1) the currency of data at the target and (2) your ability to prevent the replication of corrupt data. If your users can't accept a delay at the target, you can still take advantage of this approach by simply having two targets.  One would be delayed for disaster recovery and the other would use low-latency replication for day-to-day use.
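
To make the dual-target idea concrete, here's a purely illustrative sketch in Python - the names and layout below are my own assumptions, not Q Replication configuration syntax:

    # One source feeding two targets with different apply delays (illustrative only).
    targets = {
        "dr_standby": {"apply_delay_secs": 48 * 3600},  # delayed copy for disaster recovery
        "reporting":  {"apply_delay_secs": 0},          # low-latency copy for day-to-day use
    }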

 

You may have seen time-delayed targets discussed recently in an IBM developerWorks article about using Q Replication with DB2 pureScale. I'm going to talk about the general case and use Q Replication as an example.

 

First, understand the ways a delay could be implemented. A replication technology could choose to delay sending data to the target system. Or, it could stage data on the target system before applying it.  If the technology delays sending data to the target, you face the following challenges:

 

  • Data must either remain in DB2 log files or be staged in something such as files or tables on the source.
  • If data is left in DB2 log files, it may affect your approach to log archiving.
  • An outage of the source system would trap delayed data on the source, which would be lost if the source were nonrecoverable.

 

Other challenges may exist, but the ones listed here are big enough that many people prefer to see the delay and staging of data implemented on the target system.

 

If data is staged on the target, it has to be held somewhere - in files, tables, or something else. That means you need to think about disk space.  You'll need at least enough to hold the maximum volume of data that you expect to change during your time delay. In reality, you'll also want a buffer for those times when the target database is taken offline for maintenance.
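
To get a feel for the numbers, here's a rough back-of-the-envelope estimate in Python. Every input below is an illustrative assumption of mine, not a measurement:

    rows_per_sec  = 200_000     # peak rate of replicated row changes (assumed)
    avg_row_bytes = 500         # average staged size per row change (assumed)
    delay_secs    = 15 * 60     # the configured time delay
    maint_secs    = 4 * 3600    # buffer for a target maintenance window (assumed)
    headroom      = 1.5         # safety factor for bursts (assumed)

    bytes_needed = rows_per_sec * avg_row_bytes * (delay_secs + maint_secs) * headroom
    print(f"~{bytes_needed / 2**30:,.0f} GiB of staging space")  # ~2,137 GiB with these inputs

As an example, Q Replication implements time-delayed targets by staging data in WebSphere MQ on the target system.  The following picture highlights the main points of how it works: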

 

[Figure: the Q Replication time-delayed target setup, with numbered boxes marking three points of interest]

Yes, there generally is a queue manager on the source system.  It's there for assured, efficient delivery of data, not for staging.  I left it out of the picture so I could focus only on the points specific to a time-delayed target.  That leaves us with three points of interest, identified by numbered boxes.  Here's what's happening at each:

 

  1. Q Capture is pulling changed data out of the DB2 log.
    • Each transaction is sent to the target as soon as a commit is seen in the log.
  2. MQ is receiving data at the target and immediately making it available to Q Apply.
    • MQ has enough disk space to hold data in the queue if Q Apply delays applying it.
  3. Q Apply* is monitoring the queue and applying a given change only when the delay period has been met for that change (a minimal sketch of this check follows the list).
    • For example, if the delay is 15 minutes and a change happened at 1 PM on the source, then the change is applied to the target at 1:15 PM source time.
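
To make step 3 concrete, here's a minimal sketch of the delay check in Python. It is purely illustrative - not Q Apply's actual implementation - and the queue layout, names, and timestamps-as-epoch-seconds are my own assumptions:

    import time
    from collections import deque

    APPLY_DELAY = 900  # seconds; stands in for a 15-minute apply delay setting

    def delayed_apply(staged, apply_change):
        # staged: deque of (source_commit_time, change) tuples in commit order,
        # with commit times as epoch seconds.
        while staged:
            commit_time, change = staged[0]
            wait = (commit_time + APPLY_DELAY) - time.time()
            if wait > 0:
                time.sleep(wait)      # change stays staged in the queue until old enough
            staged.popleft()
            apply_change(change)      # apply in source-commit order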

 

In other words, it's very simple.  The question is, how well does it work?  To stress Q Apply's delay processing, a sample workload of approximately 200k rows per second was run through Q Replication for several hours.  Here's a graph of the data volume coming through two queues:

 

[Graph: replicated data volume flowing through the two queues during the test]

(Of course, this graph comes with the standard disclaimers - this is not a benchmark, and your achievable throughput can vary significantly based on hardware, network, table characteristics, workload characteristics, how you configure replication, and so on.)

 

Q Apply was set to delay changes by 15 minutes (900 seconds).  Here's a graph of latency at the target during the test:

[Graph: end-to-end latency at the target during the test]

Pretty slick :)  Q Apply easily maintained the target database with a constant delay of 15 minutes.  If a source problem had been caught within 15 minutes of being introduced, Q Apply could have been stopped before the problem data was applied to the target.  This means a time-delayed target can be a reasonable approach to helping you deal with significant corruption of source data in a data replication environment.

 

One last point about the Q Replication approach... if you've stopped Q Apply but know you still have good data staged in MQ, you can have Q Apply move those changes into the target, as long as you can determine a specific time that is just prior to the corruption (it does not have to be exactly 15 minutes back).  For more information, see the ApplyUpTo parameter.
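
In spirit, that recovery step looks like the following Python sketch - again purely illustrative, not Q Apply's implementation, and the staged-data layout is my own assumption:

    from datetime import datetime

    def apply_up_to(staged, apply_change, cutoff: datetime):
        # staged: iterable of (source_commit_time, change) in commit order,
        # with commit times as datetimes.
        for commit_time, change in staged:
            if commit_time > cutoff:
                break                 # at or past the suspected corruption: stop
            apply_change(change)      # still-good data: move it into the target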


--

* If you're using Q Replication with DB2 for z/OS, you'll need a Replication Server APAR to take advantage of Q Replication's applydelay parameter.  See PM38951 for Replication Server v10 and PM40181 for Replication Server v9.

Tags: Q Replication, data replication, disaster recovery, high availability
