Informatica CDC for Real Time Data Capture
Thanks to my colleague Sobhan Surapaneni for helping me out with some of the details of CDC. Sobhan was the guy who made it all happen on the project and really knows this stuff cold.
Traditional Data Warehousing and BI systems rely on batch data extraction and load routines (ETL or ELT) to acquire data from their sources. Depending on data timeliness needs, this may work for intervals down to perhaps one hour for small data sets. However, as timeliness requirements fall below what a batch process can reasonably deliver, a complete shift in technology is required. There are tools out there that acquire data in a “real-time” manner directly from the database log files and send individual transactions out over the wire to a target database.
This post is about Informatica’s CDC product, but the lessons and the manner in which it works are similar for another popular product, Oracle GoldenGate. Note that the name Change Data Capture is not the best fit; this is really a real-time replication solution, since any data warehouse has to implement change data capture in some form as part of its data acquisition approach.
CDC works by installing what is essentially a log-reader application, called the PowerExchange Listener, on each source database server. As each database/OS combination is different (Oracle, DB2 on Unix, DB2 on AS/400, SQL Server, etc.), the particulars of each software installation are slightly different.
The PowerExchange Listener software continually scans the database log files and pulls data records out in sequential order. The PowerExchange Listener then links directly to the PowerExchange Client, which is part of the main PowerCenter ETL hub.
The log files that it reads capture not the SQL statement, but rather the resulting records. This is an important distinction: an update statement that operates on 1 million records will generate 1 million transactions over the CDC wire. It’s also important to note that CDC keeps track of where it is in the sequential, time-based log files; this position is ultimately used for startup/shutdown and re-load operations. Sequential also means that if a parent record is created before a child record, you are guaranteed to receive and process the parent before the child record, helping you mirror whatever referential integrity the source may or may not have.
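Conceptually, the Listener behaves like the sketch below (the record layout, table names, and checkpoint mechanics are illustrative assumptions, not Informatica’s actual API): it walks log entries strictly in commit-sequence order, so a parent row always arrives before its children, and the saved position enables restarts.

```python
# Illustrative sketch only; Informatica's real Listener is proprietary.
# It shows why sequential log reading preserves parent/child ordering
# and how a checkpoint supports restart after a shutdown.

log_entries = [
    {"seq": 101, "table": "ORDERS", "op": "INSERT", "row": {"order_id": 1}},
    {"seq": 102, "table": "ORDER_LINES", "op": "INSERT", "row": {"order_id": 1, "line": 1}},
    {"seq": 103, "table": "ORDER_LINES", "op": "INSERT", "row": {"order_id": 1, "line": 2}},
]

def consume(entries, checkpoint=0):
    """Process entries strictly in log-sequence order, resuming past a checkpoint."""
    processed = []
    for entry in sorted(entries, key=lambda e: e["seq"]):
        if entry["seq"] <= checkpoint:
            continue  # already applied before a restart; skip it
        processed.append(entry)
        checkpoint = entry["seq"]  # remember position for startup/shutdown/re-load
    return processed, checkpoint

rows, ckpt = consume(log_entries)
```

Because the parent `ORDERS` entry has a lower log sequence than its `ORDER_LINES` children, it is always emitted first, which is exactly the ordering guarantee described above.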
Configuring CDC involves registering the tables you wish to “listen” to in the PowerExchange Navigator client tool and publishing those registrations to the PowerExchange Listener on the source database. This is a simple operation.
After data records flow through the PowerExchange Listener on the source database, they are sent across the network into PowerCenter. From there, data is sent to its target much as in regular Informatica ETL, and this is where traditional Informatica development picks up: you develop workflows and mappings using the traditional INFA PowerCenter client tools (i.e. Designer, Workflow Manager). When workflows are run, they are expected to run continually – remember, this is not batch.
Note that using an Oracle database as a source is technically much more difficult than SQL Server or DB2 – extra pieces of software are needed to make it work. For SQL Server, CDC reads from the Replication server rather than the actual source; for DB2, CDC reads journals, with a concurrent process for each journal. Note there is a configuration option to send a record only after a certain time period or number of UOWs (units of work, i.e. commits). This can be useful when a system makes many small updates to a single record: the coalesce option will send that record, say, once per minute instead of the 25 individual transactions that touched it during that period.
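The coalesce behaviour can be sketched as follows (a minimal illustration, assuming a flush window and a primary-key field; none of these names come from the PowerExchange configuration itself): within one window, many changes to the same key collapse into the latest image of the record.

```python
# Hypothetical sketch of coalescing: within a flush window, only the most
# recent change per primary key is forwarded, instead of every transaction.

def coalesce(changes):
    """Keep only the latest change per primary key; changes arrive in commit order."""
    latest = {}
    for change in changes:
        latest[change["pk"]] = change  # later changes overwrite earlier ones
    return list(latest.values())

# 25 small updates to the same record within one window...
window = [{"pk": 42, "qty": n} for n in range(25)]
flushed = coalesce(window)  # ...collapse to a single outbound record
```

The trade-off is latency for volume: the target sees one consolidated row per interval rather than every intermediate state.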
One of the main benefits of using CDC over a competing technology is the familiarity Informatica developers have with it. To a developer, the development tools are identical; it’s the infrastructure side where the differences come into play. Mappings are built in the regular PowerCenter Designer; tables are imported from a PowerExchange data source instead of a regular database source.
For a given source, you’ll typically have one mapping and one workflow. As each source has multiple tables that you’ll be listening to, you’ll need parallel “pipelines” in the mapping: if you need 20 tables, you have 20 pipelines in the same mapping, each with a different source and its associated target. Informatica ensures each record is processed in its associated pipeline, and remember, the records will come in the same order as they occurred on the source.
As you are using regular PowerCenter Designer, the same kinds of functionality are available to you for your transformations. There are three main differences in development approach:
1. You are always operating on 1 record at a time
2. Lookup caching doesn’t make sense anymore (You can’t cache real-time data)
3. There is no debugging facility
CDC will tag each record coming in with some metadata about the transaction:
- Log record time (i.e. time when the log entry was created)
- Operation: Insert/Update/Delete
With this information you can determine what to do with each transaction type. A large portion of your design will focus on what to do for an Update or a Delete operation. For example, we elected to soft-delete our records by updating a DELETE_FLG column to ‘Y’ for each Delete operation.
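The per-record apply logic can be sketched like this (a hedged illustration: the `target` dict stands in for the warehouse table, and only the DELETE_FLG column name comes from the example above):

```python
# Sketch of per-record apply logic keyed off the CDC operation type.
# Deletes are handled as soft deletes: the row is kept, flagged 'Y'.

def apply_change(target, op, key, row):
    if op == "INSERT":
        target[key] = dict(row, DELETE_FLG="N")
    elif op == "UPDATE":
        target.setdefault(key, {}).update(row)
        target[key]["DELETE_FLG"] = "N"
    elif op == "DELETE":
        if key in target:
            target[key]["DELETE_FLG"] = "Y"  # soft delete: row is preserved

target = {}
apply_change(target, "INSERT", 1, {"name": "widget"})
apply_change(target, "DELETE", 1, {})
```

Note the UPDATE branch also handles the case where the row somehow isn’t present yet, which foreshadows the initial-load question discussed below.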
One thing to note on the initial deployment: if you wish to bring in historical data, CDC will not be able to help you, due to the limited lifetime of log entries in the database. You will have to develop traditional bulk/batch ETL jobs for the historical/initial load. This is important to note from a development and QA perspective. In a section below I’ll discuss how to use the historical load and CDC together.
Infrastructure, Deployments and Source Upgrades
Being a completely different technology from traditional Informatica, there is a fair amount of additional infrastructure to setup and tune. There are numerous options which you will need to play around with to ensure optimal configuration. Additionally, you will have to prepare for various infrastructure events that you should expect to see. Some of the items which you’ll need to consider:
- TimeOut Window
- Commit points
- Records per transmit
- Automatic Restarts
- Various ports
- Stops and Restarts
- Source Database failures and failovers
- Source database upgrades
- Source database batch operations
It is very important to understand that once you have this table replication set-up, you are now tightly coupled to anything the source system does. This is particularly true during source upgrades. Some things to consider:
- Table changes require updated CDC Registrations and probably code changes (some changes may be ignored in code)
- Primary key changes are very important and will require target changes and code changes
- Data change volume – for tables with a very large number of changed records, you will have to rely on your bulk load instead of CDC
- Are source changes even captured in the logs/journals? If the source team pulls a table out of a logged area, renames the table, or does a Create-Table-As operation, CDC will not see any of these changes. When this happens, you will need to rely on your historical load to reload the table.
I highly recommend the approach we took when setting up CDC. Early in the development cycle we brought in an Informatica CDC consultant for a week to help us with setup and scenario solutions. Aside from setting the software up with the sysadmins and DBAs, he was able to work on the basic configuration. However, the bulk of his time was spent helping us determine how we would support our various scenarios on both DB2 and SQL Server.
An example of what we had to figure out was the Initial Load Scenario. We needed to know how we would actually turn on CDC in conjunction with the historical load. How do we ensure we don’t miss any transactions? How do we ensure we don’t overwrite a new record with an old record? What happens when CDC gets an update, but you haven’t inserted the base historical record yet?
Ultimately, we turned CDC on first then ran the historical load on top of it. With both of these running at the same time, this is the logic we employed to ensure an accurate version of the table:
| Code Module | Incoming TXN | Existing Target Record? | Action |
| --- | --- | --- | --- |
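The core reconciliation rule can be sketched as follows (a hedged illustration, not our actual PowerCenter mapping logic; the timestamp field and function names are assumptions): an incoming row is applied only if it is not older than what the target already holds, so a stale historical row can never overwrite a newer CDC row.

```python
# Illustrative reconciliation when the bulk historical load and CDC run
# concurrently. Each candidate row carries a source change timestamp;
# the newer version always wins.

def upsert_if_newer(target, key, row, change_ts):
    """Apply the row unless the target already holds a newer version."""
    existing = target.get(key)
    if existing is None or change_ts >= existing["_ts"]:
        target[key] = dict(row, _ts=change_ts)
        return True
    return False  # stale row (e.g. from the historical load); CDC version wins

target = {}
upsert_if_newer(target, 7, {"status": "open"}, change_ts=100)            # CDC arrives first
applied = upsert_if_newer(target, 7, {"status": "stale"}, change_ts=50)  # older historical row
```

This also covers the “update before the base record exists” case: an UPDATE for a missing key simply inserts the row, and the historical load later skips it as stale.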
As the project progressed, we elected to deploy our CDC solution to Production a few months before the Data Warehouse and OBI layers. This allowed us to use the Prod environment in a non-critical state to shake things out and tweak performance. When we did this, we brought back the Informatica consultant to help us tweak settings and get around any deployment issues we might have had. All in all, this approach worked out very well for us.
Performance for us involved several measurements. Of primary concern was the impact on the source systems, some of which were already under heavy load. Our findings showed that for DB2 the impact peaked at 10% during the initial startup phase, while CDC was playing catch-up through the log file, but settled down to a 3–5% impact in steady state.
Another attribute was lag time – the delay from when a record changes on a source to when it’s committed in our database. This is typically only a few seconds. However, at night the source systems run several batch updates and deletes, all of which are consumed by CDC. Our largest table has nearly 500 million rows and typically sees a million or so deletes each night as part of a batch job on the source application. We are able to completely consume those 1 million deletes in only half an hour while simultaneously consuming changes from all other systems.
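For a sense of scale, that nightly delete figure works out to a sustained apply rate on the order of hundreds of rows per second:

```python
# Back-of-envelope throughput from the figures above:
# ~1,000,000 deletes consumed in ~30 minutes.
deletes = 1_000_000
seconds = 30 * 60
rate = deletes / seconds  # rows applied per second, roughly 556
```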
I believe the lessons learned from using Informatica CDC should generally apply to Oracle GoldenGate as well. In the end they both do the same thing, and they operate using similar approaches. Of course, your mileage may vary.
1. You are now tightly tied to source system updates & upgrades. You will need to know exactly what they are going to do and how they are going to do it. Every table potentially is handled differently, and knowing the attributes of that table’s change allows you to best handle it on your target system.
2. Bring a CDC expert in from Informatica to help set up development and get you going, but also help with the final Production deployment and tuning activities
3. You will still need traditional bulk ETL to handle the initial load scenario. This code is also reused from time to time depending on the techniques used by various source systems as they upgrade their system.
4. Development in CDC is quite easy if you keep the mapping logic simple.
5. CDC is robust enough to handle various infrastructure scenarios with some manual intervention. For example, cutting an application over from the primary data center to the backup data center is certainly possible, but does require some planning and manual steps.
6. Practice scenarios and performance in a Performance environment. Losing data in production is an expensive proposition so be sure to run through things appropriately.
7. Support is relatively light; we source 90 tables from 5 systems covering DB2 and SQL Server, and only part (25%–75%, depending on activity) of one support INFA developer’s time is devoted to it. CDC runs reliably, and steady-state intervention is needed only when something unusual happens on the source systems (usually special operations requiring greater load, or shutting things down). Most support effort, however, goes to supporting source system upgrades.
8. Although we did not use much transformation logic in CDC, it can be used to build out a real-time ODS or reporting system that goes beyond just table replication.