Years ago, IBM had a product called Information Integrator (II). It was IBM's first attempt at a data integration platform. While II is no longer around, one thing it got right was the approach of solving customer problems through a combination of data replication and data virtualization (federation). It was very successful at this.
That approach hasn't gone away. IBM still provides excellent replication and federation technologies. And IBM still understands the value that the combination provides. You can see this in DB2 Advanced ESE's bundling of replication and federation. However, the one thing that's missing is the conversation: some of us replication and federation people don't talk about these solutions as much as we used to.
So, since I said 'we' :) I'm going to make up for lost time by highlighting three common solutions I see people implement through a combo of replication and federation:
Extending the Reach of Data Replication
IBM delivered the world's first federated database server (data virtualization server) back in the early 1990s. The first technology to take advantage of it was IBM's SQL Replication. Replication was the use case that helped prove the value of three key data federation concepts - (1) transparent access to any database through a single SQL dialect, (2) masking of data type differences between databases, and (3) all without requiring changes to your apps.
At the time, SQL Replication only provided DB2-to-DB2 solutions. However, federation let SQL Replication customers extend their solutions by allowing almost any database - Oracle, Informix, whatever - to be target databases for replication. They didn't have to wait for IBM to change SQL Replication to access those databases directly.
Is there still a need for this? Many data replication technologies such as IBM's CDC (Change Data Capture) provide direct access (i.e., no federation required) to many databases. Interestingly though, there always seem to be more database products than replication technologies can connect to directly. Therefore, as I've discussed on developerWorks, even technologies that have extensive direct access such as CDC can benefit from federation in some situations. The federation option will always be viable as long as new database products are being developed.
The Hybrid Warehouse
Data replication experts have a tendency to try to convince people that they need to replicate everything. Obviously, that's just not practical in many situations. There always seems to be more data than can fit in a single database or warehouse.
On the other side of the fence are data virtualization experts who want you to virtualize or federate the world. In fact, you might have seen experts Tweet a few months ago about whether virtualization is the "new warehouse" or the "old warehouse." The reality is that it's neither. It's impractical for almost anyone to federate everything. It really always has been.
That's where the hybrid warehouse comes in. The term was derived from an IBM white paper, but the concept is simple and applies to any database, not just warehouses. Basically, it says you replicate the data you access most frequently and federate the rest (typically the data you access infrequently or only in small amounts). Like I said, simple.
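To make the rule concrete, here is a minimal Python sketch of the replicate-or-federate decision. It is only an illustration: the table names, the access counts, and the `min_queries_per_day` threshold are all made up for this example, and nothing here is an IBM API or tool.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    queries_per_day: int  # hypothetical access frequency for this table


def plan_hybrid(tables, min_queries_per_day=1000):
    """Split tables into a 'replicate' set and a 'federate' set.

    Assumption for this sketch: "accessed most frequently" simply means
    at or above an arbitrary queries-per-day threshold.
    """
    plan = {"replicate": [], "federate": []}
    for table in tables:
        if table.queries_per_day >= min_queries_per_day:
            plan["replicate"].append(table.name)  # hot data: keep a local copy
        else:
            plan["federate"].append(table.name)   # cold data: query it in place
    return plan


# Hypothetical workload
tables = [
    TableStats("SALES", 50000),
    TableStats("ARCHIVE_2009", 12),
    TableStats("SUPPLIER_NOTES", 300),
]
plan = plan_hybrid(tables)
print(plan)
```

In practice the cutoff is a judgment call, and you would likely weigh how much data each query touches as well as how often it runs.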
However, some people are concerned that this means they have to have a separate federated database server as a layer of software in front of their big databases. That's not the case with IBM's DB2 and InfoSphere Warehouse. On UNIX and Windows, both of these products have a portion of IBM's federation technology built into them. In fact, any DB2 server, including DB2 Express-C, can be turned into a federated database server for a limited set of data sources. What's more, they can be extended to even more data sources by installing IBM's InfoSphere Federation Server. Federation Server extends DB2's inherent function. It doesn't have to sit in front of DB2 as an added layer of data virtualization software.
Distributed Caching (via Cache Tables)
This is best discussed with an example. Let's say you want to offload some query work from a primary production database to another system but are required to centralize any changes to the data. For example, assume you want to replicate the inventory for a parts database to a remote site where a part can be ordered, but the order must be reflected in the primary database as soon as it happens.
Most people immediately think of bidirectional (two-way) replication. That definitely works, but updates on the remote site are not immediately reflected in the primary because all asynchronous replication solutions have at least a small amount of latency. More challenging is that conflicting changes can occur when different users change the same data at different sites. Naturally, replication reports these so they can be managed. But what if you want to avoid managing conflicts? Distributed caching through cache tables is an option.
Cache tables combine unidirectional (one-way) replication with federation. Data is replicated to a secondary that is a federated database server such as InfoSphere Federation Server. Apps can then issue SQL against this data. Queries are satisfied by the federation server using the local copy of data (the distributed cache). Inserts, updates, and deletes are transparently rerouted back to the primary so that only the original source is ever updated.
The one thing people wonder about is latency of the cache. It's an odd situation conceptually - replication must occur before the secondary's app can see its own changes. The good news is you should be able to avoid issues if you set up low-latency replication.
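The read/write split and the latency quirk can be seen in a toy Python model. This is a sketch under loud assumptions: the `Primary` and `CacheTableSite` classes are hypothetical stand-ins invented for this post, the model has a single secondary, and it uses plain dictionaries rather than any real replication or federation product.

```python
class Primary:
    """Stand-in for the primary database: the only place data is changed."""

    def __init__(self):
        self.rows = {}
        self.pending = []  # changes captured but not yet replicated

    def apply(self, part, qty):
        self.rows[part] = qty
        self.pending.append((part, qty))


class CacheTableSite:
    """Stand-in for the secondary (federated) site with a cache table."""

    def __init__(self, primary):
        self.primary = primary
        self.cache = {}  # local replicated copy (the distributed cache)

    def read(self, part):
        # Queries are answered from the local copy only.
        return self.cache.get(part)

    def write(self, part, qty):
        # Changes are rerouted to the primary; the cache is never updated directly.
        self.primary.apply(part, qty)

    def replicate(self):
        # One-way replication: drain the primary's pending changes into the cache.
        for part, qty in self.primary.pending:
            self.cache[part] = qty
        self.primary.pending.clear()


primary = Primary()
site = CacheTableSite(primary)

primary.apply("bolt", 100)
site.replicate()               # initial copy reaches the secondary

site.write("bolt", 99)         # an order placed at the remote site
before = site.read("bolt")     # stale until replication catches up
site.replicate()
after = site.read("bolt")      # now reflects the site's own change
print(before, after, primary.rows["bolt"])
```

The gap between `before` and `after` is the cache latency. The shorter the replication delay, the less likely an app is to notice it.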
For more detail about how cache tables work, see a previous post about them here on ChannelDB2. That example shows how to do it with SQL Replication. However, cache tables would be easy to set up with either of IBM's newer replication technologies - Q Replication or CDC.
I need a conclusion? :) Well, okay... data replication and data virtualization work well together. =P