ChannelDB2

Transforming Data as You Replicate

When replicating data, a very common requirement is to modify or 'transform' changed data before it reaches its final destination. Two simple examples:

-- Combine two columns of source data into a single value for a column at the target.
-- Change a numeric code at the source into a descriptive string for the target.

Many other transformations are possible, including those that are more sophisticated. The question is - how do you do transformations while replicating?

Change data capture technologies provide a number of options. The most common are:

In this post, I'm going to look at the first method - using SQL. I'll also provide a script that lets you try it yourself with the DB2 Sample database. In a future post, I'll look at the third - feeding changed data into DataStage. As far as the second goes, I find a cdc technology's built-in functions far less interesting because most are just a variation of what's already provided by database servers and transformation engines.


SQL-Based Transformations

These can be divided into the following categories:

In my experience, the most commonly used are column-level expressions. However, as I'll explain later, views are very popular with users of one technology.


Column-Level Expressions


Typically, a replication technology asks you to specify column expressions as you map source tables to target tables. You use the same SQL expressions that you can specify in, for example, the select list of a query. For replication, expressions almost always contain one or more column names combined with valid SQL language elements such as scalar functions and case expressions. The result of a given expression is a single column value to be fed to the target. Popular expressions for replication include:

Many others exist. These can be used alone or combined. The replication technology passes expressions to a database server for evaluation. This is either the source or the target server depending on how the replication technology is implemented.


Views

Did I say you could use a view as a source? :) Yes, using IBM's SQL Replication. Views are popular because they provide an easy way to define, test, and refine transformations.

SQL Replication makes changed data available in tables on the source. If you have a view on the original source table, SQL Replication can create a matching view over the changed data (CD) table. Both of these are then used when pulling data to the target. The source table view is used for initial extract and load of the target. The changed data view is used for day-to-day replication. The target data reflects the transformations done through expressions and joins defined in these two views.


I like the view approach over column expressions for four reasons. First, I can define my transformations and check the results before I do anything with replication. Compare that with column-level expressions where you may need to replicate data to the target before you can verify the final results of transformation. Second, replication definitions are simpler. See the sample replication script for a comparison. Third, you can redefine the view later - without redefining replication - as long as the resulting data still works with the target. Fourth, you can add filtering of source data by simply adding predicates (a Where clause) to the view.

However, be aware that SQL Replication places a few requirements on views. For example, schema names must be included with table names (e.g., db2admin.employee) and correlation names must be used (shown in the sample).

Moreover, some views are not eligible to be replication sources. For example, outer join syntax is not currently supported. Other limitations are documented in IBM DB2 Information Centers. If you get serious about using source views, I also recommend you bookmark the SQL Replication message that lists reasons why a view is not eligible.


Stored Procedures

These are basically user exits. You can use them to do anything you want with changed data. For example, you could develop a transformation that you don't feel is easy to accomplish through SQL alone. Or, you could use the stored procedure as a complex validation routine.

I'm not going to show a stored procedure in my sample. I think the sample is big enough and makes the point about SQL transformations without the addition of stored procedures. However, if someone really wants to see this, I'll consider it in a future post :)


Do It Yourself

We're going to use SQL Replication to map a single source table, the DB2 Sample database's Employee table, to two target tables. The target tables will be created in the Sample database. That way we only need to create one database. We'll use two column-level expressions to transform data for one target table. We'll use a source view to do the same for the other target table.

One transformation concatenates two source columns and the other uses a case expression to derive a new value from a single source column:


The view has one additional transformation. It highlights how a join can be used to include data from another table. Specifically, it adds the department name from the department table.

All of this can be done through two scripts - one DB2 SQL script and one replication script.


The Sample Scripts

Follow these instructions to use the scripts:

  1. Create the DB2 SAMPLE database if you haven't already.
  2. If you haven't enabled DB2 log archiving for the SAMPLE database, do so now from a DB2 command window:
    db2 update database configuration for sample using logarchmeth1 logretain
    db2 backup database sample
  3. Copy and paste the SQL script from this post into a file named transformation.sql. It creates our view. The script is in the first highlighted box at the end of this post.
  4. Edit this SQL script and do the following:
    - Change all occurrences of the 'db2admin' to the schema of the Employee table in your Sample database.
    - Save the file.
    - Process the file using the command: db2 -vtf transformation.sql
  5. Copy and paste the replication script from this post into a file named transformation.asnclp. It maps source to targets and contains the column-level expressions. The script is in the second highlighted box at the end of
    this post.
  6. Edit the replication script and do the following:
    - Change all occurances of the 'db2admin' to the schema used to create the Sample database's Employee and Department tables.
    - If you are working from a UNIX system, change the table space directory from 'c:\' to a UNIX directory.
    - Save the file.
  7. Process the script by running the DB2 asnclp command in a DB2 command window.
    asnclp -f transformation.asnclp
    This command produces a file with .log as its extension. It contains a history of messages and what was generated.
  8. Open a DB2 command window and start the Capture program:
    asncap capture_server=sample
  9. Open another DB2 command window, and start the first Apply program:
    asnapply apply_qual=applytf control_server=sample


Your configuration is now running. If you want to verify data replication, do the following:


  1. Select from the target tables. They should now contain a copy of transformed data from the source table.
  2. Update a row in the Employee table.
    Any number of rows will do. So will an insert or delete.
  3. Query the target tables to see your change.
    You may not see the change for up to a minute.
    By default, the Apply program sleeps one minute when it finds no data to replicate.


That's it, except for the scripts. Here they are.


First the SQL script to create the view:


connect to sample;
--
-- SQL Replication requires correlation names be used
-- for tables.
--
create view employee_transform as
select
aa.empno,
concat(aa.lastname,concat(', ',substr(aa.firstnme,1,1))) as fullname,
case
when aa.edlevel > 12 then 'Eligible'
else 'Ineligible'
end as tuition_assistance,
aa.workdept,

bb.deptname
from db2admin.employee aa, db2admin.department bb
where bb.deptno=aa.workdept;



Next, the replication script:

# -----------------
# Initialize script
# -----------------
set run script now stop on sql error on;

# ------------------------------------------
# Initialize source database for replication
# ------------------------------------------
set server capture to db sample;
create control tables for capture server;

# ------------------------------------------------------
# Identify source table(s) and corresponding CD table(s)
# ------------------------------------------------------
create registration (db2admin.employee) differential refresh stage cdemployee;

# ------------------------------------------------------------------
# Identify source view(s)
# The underlying table must be identified in the previous section.
# ------------------------------------------------------------------
create registration (db2admin.employee_transform) differential refresh;

# ------------------------------------------
# Initialize target database for replication
# (Target and Control are usually the same)
# ------------------------------------------
set server control to db sample;
set server target to db sample;
create control tables for apply control server;

# ----------------------------------------------------
# Source and target table pairs are group into a 'set'
# ----------------------------------------------------
create subscription set setname tform applyqual applytf
activate yes
timing interval 1 start date "2008-01-01" time "01:00:00.000000";

# ---------------------------------------------------------------
# Identify table space defaults used when creating target tables.
# ---------------------------------------------------------------
set profile transformsts for object target tablespace options uw using
file "c:\tform.file" size 700 pages;

# ----------------------------------------------------
# Use a table as a source and SQL expressions
# to trasnform the data while replicating.
# Create a target table to match the transformed data.
# ----------------------------------------------------
create member in setname tform applyqual applytf
activate yes
source db2admin.employee
target name db2admin.employee_tuition
definition in emptuit create using profile transformsts
cols include
(expression empno,
expression
"concat(lastname,concat(', ',substr(firstnme,1,1)))"
target fullname,
expression
"case when edlevel > 12 then 'Eligible' else 'Ineligible' end"
target tuition_assistance
);

# ------------------------------------------------------
# Use a view as a source. The view contains expressions
# that transform data during replication. No SQL
# expressions are required in the member defintion.
# Create a target table to match the view.
# ------------------------------------------------------
create member in setname tform applyqual applytf
activate yes
source db2admin.employee_transform
target name db2admin.employee_tuition2
definition in emptuit2 create using profile transformsts
keys(empno +);

quit;





Views: 713

Tags: CDC, DB2, DataStage, InfoSphere Replication Server, InfoSphere Warehouse, SQL, change data capture, goldengate, replication, transformation

Comment

You need to be a member of ChannelDB2 to add comments!

Join ChannelDB2

Try BLU Acceleration on Cloud

© 2014   Created by channeldb2.

Badges  |  Report an Issue  |  Terms of Service