Once you get past the simple case of copying changed data from one system to another, the next most common replication solution is data distribution. In this scenario, you distribute data from a central database to two or more databases, maybe even 100's, that are likely remote. An example is sending changes from a corporate headquarters to branch offices or retail locations. You may only need the data to flow one way - from a central database to others. Or, you may also need some data to flow back to the central site.
Whatever the case, I find you can learn general considerations by taking a look at a specific implementation. It may also help you think of some not-so-obvious issues with other implementations. For this post, I'm going to look at the IBM change data capture technology called SQL Replication. It's built into DB2 LUW (and any product that contains DB2 LUW such as InfoSphere Warehouse). It also has a simple scripting language that's great for examples. I've got a small one at the bottom of this post. It shows a fast way to set up an SQL Replication distribution.
Before We Begin
Let's take a quick, high-level look at how SQL Replication works...
Like any good change data capture technology, SQL Replication has a Capture program that gets changed data from the database transaction log. Those changes are saved in a "changed data" (CD) table in the source database. A separate Apply program retrieves the changes and uses SQL inserts, updates, and deletes to modify a target table. The Capture program can delete ('prune') CD table data that has been successfully replicated to the target.
Now, some people choke at this point :) They don't like the fact that their source system is going to experience the overhead of inserts and deletes on the CD table. While I agree that this impact can be too much for some scenarios, the CD table approach can have some distinct and useful advantages for data distribution. What's more, even if you have a source where CDs are a challenge, there's a distribution topology that includes SQL Replication and reduces the impact of data distribution on source systems. I'll discuss it in a future post.
Scaling a Distribution is Easy
SQL Replication takes advantage of three things databases do well - client connectivity, queries and multiuser support. The Apply program connects to the source as an application and uses SQL queries to retrieve data from CD tables. This means each new target you add is little more than a new query user. That's something most databases can handle many of easily.
Connectivity Set Up Is Simple
In a typical distribution, an Apply program runs on each target system. This means all you do for connectivity is make the target system a client of the source database. There's no need to establish connectivity between Capture and each Apply. Naturally, you can also take advantage of features provided by the database's client-server support such as client reroute and encryption between client and server.
You Can Filter, Subset, and Transform Data as It's Replicated
Because changed data is in tables, you have the full power of SQL to manipulate the data as it's replicated. I'm only going to discuss two ways here - subsetting rows, also known as row filtering, and column-level transformations.
A common requirement is to send a target only the subset of data it uses. For example, a sales team in Europe may not need data for customers in Asia and the Americas. This is handled by providing an SQL predicate to be used when Apply queries the CD table. The query results mean that Apply only replicates the rows that satisfy the predicate.
Another common requirement is to transform column data - substring, concatenate, cast to a different data type, whatever - to meet the needs of a target. SQL Replication can use SQL column expressions and scalar functions to do this. This is nontrivial. For example, by my count DB2 has well over 100 scalar functions. Another alternative is to subscribe to a view over the source table. Many people are unaware of this possibility since so few change data capture technologies offer view support. I'll give some examples in a future post.
You Only Need to Read the Log One Time
Once data is committed to CD tables, the Capture program never has to reread log records. For example, if target database is down for scheduled maintenance, changes are saved in CD tables until (1) the target is available or (2) a retention period is up (7 days by default). There's never a need to retrieve archived log files when the target returns. All you need to do is make sure the CD has space for changes captured during the maintenance window.
One Bad Apple Doesn't Spoil the Whole Bunch
In any distribution scenario, an extended outage of a single target is always possible. For example, one target is down several days due to a hardware problem. SQL Replication lets you bring the downed target back on line without impacting the other targets. They are not delayed as the downed target catches up. And, as already mentioned, there's never a need to retrieve older or archived log files so that the downed target can get caught up.
You Have Plenty of Statistics Available
I'm amazed how often people don't think to ask about how they can validate the long-term health and performance of a data replication product. By default, the SQL Replication programs maintain 7 days of historical information for the processing they've done. This is enormously helpful for determining peaks and valleys, the health of your replication configuration, and whether your meeting SLAs. This data is easily accessible via SQL since the data is all stored in tables. If you need help understanding any of it, try using a utility called the Analyzer.
You Can Easily Combine with HADR on the Source
Your source can be covered by HADR and your Apply programs will be switched between primary and backup if the appropriate client reroute information is defined. For more about using HADR and SQL Replication together, see the video on ChannelDB2.
The rest of this post provides a sample script and instructions to set up an SQL Replication distribution with one source database and two targets.
Before you start, you need to understand two SQL Replication terms:
Follow these instructions to set up the sample distribution:
Your configuration is now running. If you want to verify data replication, do the following:
That's it, except for the script. Here it is: