Informatica CDC for Real Time Data Capture

Thanks to my colleague Sobhan Surapaneni for helping me out with some of the details of CDC.  Sobhan was the guy who made it all happen on the project and really knows this stuff cold.

Introduction

Traditional Data Warehousing and BI systems rely on batch data extraction and load routines (ETL or ELT) to acquire their data from sources.  Depending on data timeliness needs, this may work for intervals down to perhaps one hour for small data sets.  However, as data timeliness needs drop below what is reasonable with a batch process, a complete shift in technology is required.  There are tools out there that acquire data in a “real-time” manner directly from the database log files and send individual transactions out over the wire to a target database.

This post is about Informatica’s CDC product, but the lessons and the manner in which it works are similar for another popular product, Oracle GoldenGate.  Note that the name Change Data Capture is not the best fit; this is really more of a real-time solution, since any DW has to implement change data capture as part of its data acquisition approach anyway.

Architecture

CDC works by installing what is essentially a log reader application, called the PowerExchange Listener, on each source database server.  As each database/OS combination is different (Oracle, DB2 on Unix, DB2 on AS/400, SQL Server, etc.), the particulars of each software installation are slightly different.

The PowerExchange Listener software continually scans the database log files and pulls data records out in a sequential manner.  The PowerExchange Listener then links directly to the PowerExchange Client that is part of the main PowerCenter ETL hub.

The log files that it reads capture not the SQL statement, but rather the resulting records.  This is an important distinction; an update statement that operates on 1 million records will generate 1 million transactions over the CDC wire.  It’s also important to note that CDC keeps track of where it is in the sequential, time-based log files; this is ultimately used for startup/shutdown and re-load operations.  Sequential also means that if a parent record is created before a child record, you are guaranteed to receive and process the parent before the child, helping you mirror whatever referential integrity (RI) the source may or may not have.
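
Conceptually, that position tracking is just a cursor into the log that survives restarts.  Here is a toy sketch of the idea in Python – this is not the actual PowerExchange restart token mechanism, and the file name and class are made up for illustration:

    class RestartToken:
        """Toy illustration of tracking a position in a sequential log file.
        (PowerExchange maintains real restart tokens internally; everything
        here, including the file name, is hypothetical.)"""
        def __init__(self, path="restart.token"):
            self.path = path

        def load(self):
            # Resume from the last committed position, or start from the beginning.
            try:
                with open(self.path) as f:
                    return int(f.read().strip())
            except FileNotFoundError:
                return 0

        def save(self, position):
            # Persist the position only after the target commit succeeds, so a
            # restart replays from the last committed point and never skips records.
            with open(self.path, "w") as f:
                f.write(str(position))

On a restart (or a deliberate re-load), processing simply resumes from the saved position rather than from the tail of the log.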

Configuration of CDC involves registering the tables you wish to “listen” to in the PowerExchange Navigator client tool and publishing them to the PowerExchange Listener on the source database.  This is a simple operation.

After data records flow through the PowerExchange Listener on the source database, they are sent across the network into PowerCenter.  From there, data is sent to its target in much the same manner as regular Informatica ETL.  This is where traditional Informatica development picks up: you develop workflows and mappings using the traditional INFA PowerCenter client tools (i.e. Designer, Workflow Manager).  When workflows are run, they are expected to run continually – remember, this is not batch.

Note that using an Oracle database as a source is technically much more difficult than using SQL Server or DB2 – there are extra pieces of software needed to make it work.  For SQL Server, CDC reads from the replication server and not the actual source.  For DB2, CDC uses journals, with a concurrent process for each journal.  Note that there is a configuration option to send a record only after a certain time period or number of UOWs (i.e. commits).  This can be useful, for example, when a system makes many small updates to a single record: the coalesce option will send that record, say, once a minute instead of the 25 individual transactions that hit it during that period.
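
To illustrate the idea behind that coalescing behavior, here is a minimal conceptual sketch in Python – this is not PowerExchange code or configuration, and the function and parameter names are invented:

    from collections import OrderedDict

    def coalesce_changes(change_stream, flush_every=25):
        """Collapse multiple change records for the same key so that only the
        latest image is forwarded, roughly the effect of the coalesce option.
        `change_stream` yields (key, record) pairs in commit order."""
        pending = OrderedDict()            # key -> latest record image seen so far
        for count, (key, record) in enumerate(change_stream, start=1):
            pending[key] = record          # a later image simply overwrites the earlier one
            if count % flush_every == 0:   # stand-in for "every N UOWs" or "every minute"
                yield from pending.items()
                pending.clear()
        yield from pending.items()         # flush anything left at the end

    # 25 small updates to the same record produce a single forwarded image.
    stream = [("cust-1", {"balance": n}) for n in range(25)]
    print(list(coalesce_changes(stream)))  # [('cust-1', {'balance': 24})]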

Development

One of the main benefits of using CDC over a competing technology is the familiarity Informatica developers have with it.  To a developer, the development tools are identical; it’s the infrastructure side where the differences come into play.  Mappings are built in the regular PowerCenter Designer; tables are imported from a PowerExchange data source instead of a regular database source.

For a given source, you’ll typically have one mapping and one workflow.  As each source has multiple tables you’ll be listening to, you’ll need parallel “pipelines” in the mapping.  If you need 20 tables, you have 20 pipelines in the same mapping, each with a different source and its associated target.  Informatica will process each record in its associated pipeline, and remember that records arrive in the same order in which they occurred on the source.

As you are using regular PowerCenter Designer, the same kinds of functionality are available to you for your transformations.   There are three main differences in development approach:

1. You are always operating on one record at a time

2. Lookup caching doesn’t make sense anymore (you can’t cache real-time data)

3. There is no debugging facility

CDC will tag each record coming in with some metadata about the transaction:

  • Log record time (i.e. time when the log entry was created)
  • Journal/LogFile
  • Operation: Insert/Update/Delete

With this information you can determine what to do with each transaction type.  A large portion of your design will focus on what to do for an Update or a Delete operation.  For example, we elected to soft-delete our records by updating a DELETE_FLG to ‘Y’ for each Delete operation.
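
As a rough illustration of that per-operation handling, here is a Python sketch only – not an actual PowerCenter mapping – and the record structure and column names are hypothetical:

    def apply_cdc_record(target, record):
        """Apply one CDC record to a keyed target table.
        `record` carries the CDC metadata described above (operation type)
        plus the resulting row image pulled from the log."""
        op = record["operation"]        # 'INSERT', 'UPDATE' or 'DELETE'
        key = record["pk"]
        row = dict(record["columns"])   # the resulting record, not the SQL statement

        if op in ("INSERT", "UPDATE"):
            row["DELETE_FLG"] = "N"
            target[key] = row           # upsert the latest image
        elif op == "DELETE":
            row["DELETE_FLG"] = "Y"     # soft delete: keep the row, just flag it
            target[key] = row

    # Example: an insert followed by a delete leaves a flagged row behind.
    target_table = {}
    apply_cdc_record(target_table, {"operation": "INSERT", "pk": 1,
                                    "columns": {"name": "Widget"}})
    apply_cdc_record(target_table, {"operation": "DELETE", "pk": 1,
                                    "columns": {"name": "Widget"}})
    print(target_table[1]["DELETE_FLG"])   # 'Y'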

One thing to note for the initial deployment: if you wish to bring in historical data, CDC will not be able to help you, due to the limited lifetime of log entries in the database.  You will have to develop traditional bulk/batch ETL jobs for the historical/initial load.  This is important to note from a development and QA perspective.  In a section below I’ll discuss how we used the historical load and CDC together.

Infrastructure, Deployments and Source Upgrades

Being a completely different technology from traditional Informatica, there is a fair amount of additional infrastructure to set up and tune.  There are numerous options you will need to experiment with to ensure an optimal configuration.  Additionally, you will have to prepare for various infrastructure events that you should expect to see.  Some of the items you’ll need to consider:

  • TimeOut Window
  • Commit points
  • Records per transmit
  • Automatic Restarts
  • Various ports
  • Stops and Restarts
  • Recover
  • Source Database failures and failovers
  • Source database upgrades
  • Source database batch operations

It is very important to understand that once you have this table replication set up, you are now tightly coupled to anything the source system does.  This is particularly true during source upgrades.  Some things to consider:

  • Table changes require updated CDC registrations and probably code changes (some changes may be ignored in code)
  • Primary key changes are very important and will require target changes and code changes
  • Data change volume – for tables with a large number of records changed, you will have to rely on your bulk load instead of CDC
  • Are source changes even captured in the logs/journals? If the source team pulls a table out of a logged area, renames the table, or does a Create-Table-As operation, CDC will not know about any of these changes.  When this happens, you will need to rely on your historical load to reload the table.

Our Implementation

I highly recommend the approach we took when setting up CDC.  Early in the development cycle we brought in an Informatica CDC consultant for a week to help us with setup and scenario solutions.  Aside from setting the software up with the sysadmins and DBAs, he was able to work on the basic configuration.  However, the bulk of his time was spent helping us determine how we would support our various scenarios on both DB2 and SQL Server.

An example of what we had to figure out was the initial load scenario.  We needed to know how we would actually turn on CDC in conjunction with the historical load.  How do we ensure we don’t miss any transactions?  How do we ensure we don’t overwrite a new record with an old record?  What happens when CDC gets an update, but we haven’t inserted the base historical record yet?

Ultimately, we turned CDC on first, then ran the historical load on top of it.  With both of these running at the same time, this is the logic we employed to ensure an accurate version of each table:

Code Module   Incoming TXN   Existing Target Record?   Action
CDC           INSERT         NO                        Insert
CDC           INSERT         YES                       Ignore
CDC           UPDATE         NO                        Insert
CDC           UPDATE         YES                       Update
CDC           DELETE         NO                        Insert, Delete=Y
CDC           DELETE         YES                       Update, Delete=Y
Historical    INSERT         NO                        Insert
Historical    INSERT         YES                       Ignore
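
A minimal sketch of that precedence logic is below – Python pseudocode assuming a simple keyed target; the real implementation was PowerCenter mappings doing uncached lookups against the target table:

    def apply_txn(target, code_module, op, key, row):
        """Reconcile CDC and historical-load transactions per the table above.
        `target` is keyed by primary key; `code_module` is 'CDC' or 'HISTORICAL'."""
        exists = key in target

        if code_module == "CDC":
            if op == "INSERT":
                if not exists:
                    target[key] = {**row, "DELETE_FLG": "N"}
                # else ignore: the target already holds an image of this record
            elif op == "UPDATE":
                target[key] = {**row, "DELETE_FLG": "N"}    # insert or update
            elif op == "DELETE":
                target[key] = {**row, "DELETE_FLG": "Y"}    # insert or update, flagged deleted
        elif code_module == "HISTORICAL":
            if op == "INSERT" and not exists:
                target[key] = {**row, "DELETE_FLG": "N"}
            # else ignore: never let old historical data overwrite a newer CDC image

The key property is that the historical load can never overwrite anything CDC has already written, so running the two together still converges on an accurate copy of the source table.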

As the project progressed, we elected to deploy our CDC solution to Production a few months before the Data Warehouse and OBI layers.  This allowed us to use the Prod environment in a non-critical state to shake things out and tweak performance.  When we did this, we brought back the Informatica consultant to help us tweak settings and get around any deployment issues we might have had.  All in all, this approach worked out very well for us.

Performance

Performance for us involved several measurements.  Of primary concern was the impact on the source systems, which in some cases were already under a very heavy load.  Our findings showed that for DB2 the impact peaked at 10% during the initial startup state, when it was playing catch-up while going through the log file, but settled down to a 3-5% impact in a steady state.

Another attribute was lag time – the delay we could expect from a record changing on a source to when it’s committed in our database.  This is typically only a few seconds.  However, at night the source systems run several batch updates and deletes, all of which are consumed by CDC.  Our largest table consists of nearly 500 million rows and typically has a million or so deletes each night as part of a batch job on the source application.  We are able to completely consume 1 million deletes in only half an hour while consuming all other systems simultaneously.

Lessons Learned

I believe the lessons learned from using Informatica CDC should generally apply to GoldenGate as well.  In the end they both do the same thing, and they operate using similar approaches.  Of course, your mileage may vary.

1. You are now tightly tied to source system updates and upgrades.  You will need to know exactly what they are going to do and how they are going to do it.  Every table is potentially handled differently, and knowing the attributes of that table’s changes allows you to best handle it on your target system.

2. Bring a CDC expert in from Informatica to help set up development and get you going, but also to help with the final Production deployment and tuning activities.

3. You will still need traditional bulk ETL to handle the initial load scenario.  This code also gets reused from time to time, depending on the techniques various source systems use as they upgrade.

4. Development in CDC is quite easy if you keep the mapping logic simple.

5. CDC is robust enough to handle various infrastructure scenarios with some manual intervention.  For example, cutting an application over from the primary data center to the backup data center is certainly possible, but it does require some planning and manual steps.

6. Practice scenarios and performance in a Performance environment.  Losing data in production is an expensive proposition, so be sure to run through things appropriately.

7. Support is relatively light; we are sourcing 90 tables from 5 systems covering DB2 and SQL Server, and only part (25%–75%, depending) of our support INFA developer’s time is devoted to it.  CDC runs reliably, and steady-state intervention is needed only when something unusual happens on the source systems (usually special operations requiring greater load or shutting things down).  Most support effort, however, goes toward source system upgrades.

8. Although we did not use much transformation logic in CDC, it can be used to build out a real-time ODS or reporting system that goes beyond just table replication.

Posted on February 11, 2011, in Data Warehousing, Other Technologies & Tools.

Comments

  1. Hi Jeff,
    Thanks for the excellent article. The timing could not have been better as we are in the middle of examining near real time reporting options for a customer.
    Since you mentioned Oracle GoldenGate, I think it’s worth also mentioning IBM InfoSphere CDC, which is a leader in the Gartner Magic Quadrant for log-based CDC products.

    Thanks,
    David

  2. Hi Jeff,
    Another few questions.
    Is your project using BI Apps (meaning DAC on top of Informatica)?
    If yes, were modifications necessary to the DAC environment?
    Also, does the Informatica CDC software require any software component changes to the traditional Informatica PowerCenter installation? In other words, if we are using another CDC product, will our traditional DAC/Informatica setup work (besides the necessary configuration changes)? Although you mentioned that GoldenGate should work as well, I was just a little unclear on whether we need to change or upgrade our current Informatica installation.

    Thanks,
    David

    • No, this was not using the BI Apps. They are pretty much incompatible with CDC, as they pull directly from the source system. To make a hot copy of a source and then use the BI Apps doesn’t really make much sense – it’s an either/or situation.

      What we did do however is use traditional INFA ETL to build our own custom DW using this staging/real time layer. We essentially did Oracle’s official Reference Architecture. I’ll be writing a post on that sometime soon.

      Jeff M.

  3. Thanks. In our case, we are talking about Contact Center Telephony Analytics, which uses the universal adaptor, not pulling from the source but rather from csv files that need to be populated beforehand. Would that scenario also be incompatible with a CDC continuous trickle feed?

    David

    • I guess you could have the trickle feed populate the flat files, but you’re still doing bulk loads from there. Although that is certainly possible technically, it’s kind of ‘weird’ to use real-time ultimately just to populate batch. But then again, what we did with CDC was actually kind of the same, except we populated tables, not files.

      Also note that the flat files aren’t simple table-for-table dumps. They have denormalizations from many tables in them. This is how linkages across entities are created – by putting them in the flat files. This may ultimately be the reason you can’t do this; each transaction would have to look up from many other tables to populate the flat files.
      Jeff M.

  4. Kiran Kadadasu

    Hi Jeff, the story was really good… I went on reading it like a fairy tale 🙂 I like the way you described things. I am also planning to go for Informatica CDC, and 95% of our sources are Oracle. Do you have any technical comparison sheet for evaluating product (Informatica CDC) against product (Oracle GoldenGate), or anything you did before selecting the product?

    Thanks in advance.
    -Kiran

    • Jeff McQuigg

      Unfortunately I do not. The customer had already selected it.

      However, I am learning about GG on another project, and so far it seems like CDC has more transformational capabilities than GG, but don’t quote me on that.

      • Manohar Mishra

        Hi Jeff, I hope by now you have a more definite answer to Kiran’s question about the comparison between Informatica CDC and Oracle GoldenGate. I would appreciate it if you could provide some definite information on this comparison.

        -Manohar.

  5. Hi Jeff,
    Really great article. The company I work for is planning on using the PowerExchange IMS CDC option for propagating data to Oracle. Did you use PowerCenter for getting the data from IMS to DB2 (IMS->PowerExchange->PowerCenter->DB2), or were you able to propagate the data without needing it?

  6. Does Informatica CDC work with Sybase as a host database as well?

    We were thinking of using Sybase RTDS to publish change data on a bus in real time and then use Informatica to listen to the bus. However, if CDC is able to hook into a Sybase source, it may be beneficial for us to do that.

    • Without looking at the supported platforms list, I would expect it to work just fine.

      Jeff M.

      • Hi,
        It’s a very good post. We were told that Infa CDC would not support Sybase as a source. Please check with the product team to get confirmation on that.

  7. Hello,
    A bit late reading this article, but very interesting!
    I understand the out-of-the-box BI Apps mappings will not work with CDC?
    Some CRM/EBS/PSFT mappings require joining multiple source tables to get the result, which is currently done by performing a SQL override.
    But what if only the base table of the mapping changes and the attribute table does not? (E.g. if we take the Product and Product Category tables: a change to the Product description does not mean there is a change in Product Category, yet during the ETL we denormalize these two to create a Product dimension.)
    In this case, if Product Category has not changed but is required in the join to get the description, how do we handle it?
    I mean, what happens in such joins if the base table has changed and the attribute table has not?
    Whatever the CDC tool might be – Streams, ODI, Informatica – all of them in the end provide a similar interface for the changed tables.
    Not sure if I understood CDC correctly!

    • What you’re asking about – finding changed data in the BI Apps – has nothing to do with CDC. If a parent table changes and you need it, then you have to look at its update date. CDC just moves the data around and adds an update date if the source doesn’t have one.

  8. Great article

  9. The article was really good, especially the approach of turning on CDC in conjunction with the historical load. A neat solution for the initial load, rather than having to deal with the complication of finding the restart token value, updating the restart token file, etc.

    I think I understood what Vmnet was asking about. We faced a similar situation… where we need to join multiple source tables and load the target. Capturing the changes becomes difficult in such cases when the change happens in Product and not in the Product Category table.
    We might have to go for an intermediate stage table which will be in sync with the source table.

    You might have to write a SQL override similar to this structure:
    Select * from Product_CDC join Product_Category
    Union
    Select * from Product_Category_CDC join Product

    Here Product and Product_Category are stage tables that are replicas of the data in the source, and Product_CDC and Product_Category_CDC will have only the CDC records. Hence we will be able to capture change records for both tables and still be able to do the join between them.

    Hope you understood what I meant 🙂

  10. Thanks, I’ll try CDC for Oracle database

  11. Hi Jeff.
    The article is great. It gave me a very good idea of what to expect from this tool.
    In my company, we are analyzing some integration tools, including Informatica. We are also considering HVR. Have you heard about it? Do you know of any comments on it?
    I haven’t found much apart from their own documentation, so I’d like to have other points of view.
    Thank you!
    Regards.
    Sandra

    • Sandra – no, sorry, I haven’t heard of it. The only other product I’ve seen in action is Oracle’s GoldenGate. CDC, I believe, has a lot more capabilities than GG.

      Jeff M.

  12. Hi Jeff,
    To check whether the record exists or not in the table, did you use a cached lookup or an uncached one? Did you use a commit interval of 1-1? What strategy did you use?

    • That is the biggest change when doing real-time loads – forget about caching, as you can’t use it! We had to do a lookup against the table.

      Commit intervals are handled differently with CDC – and yes, we had to play around with them for each table based on size, volume, etc.

      • About “We had to do a lookup to the table”: did you implement a dynamic lookup or an uncached lookup? Inside the CDC mapping, did you then just have a CDC Source Qualifier, a lookup, the target and an update strategy? Only this?

      • Sorry, I don’t know the details of how the code was constructed – besides, it was a few years ago! I do believe the mapping was done almost like any normal mapping, except for the no-caching aspect. My INFA developers had very little difficulty building the mappings.

  13. Hi,

    Great Article!!

    I have a requirement to set up Informatica PowerExchange. The source and target are in SQL Server and there are multiple source systems (all SQL Server). I have a few questions:
    1. Do I need to install the PowerExchange Listener on each and every source system (on each host machine)?
    2. Do I have to enable CDC for each source table, or is installing PowerExchange enough to take care of it?
    3. You said “CDC uses its Replication server and not the actual real source”. Can you explain that a little further? Do I need to set up a replication server, or are you talking about the tables which get created after enabling CDC in SQL Server?

    Can you suggest a reference site I can use to get step-by-step information?

    Thanks,
    Anupam

  14. Hi Jeff,

    Very informative and helpful.

    You said that the log files capture the resulting record. I wonder how DELETEs are handled (in your case you are doing a soft delete). What approach would you recommend if we need to do a hard delete in our target tables?

    Thanks again and appreciate your help.

  15. Hi Jeff,

    Quick question – we are planning to use PowerExchange to build an ODS and I had a question about constraints.

    Do we create the same constraints (as in the source database) in the target database? Will creating the constraints have an impact on the CDC process flow?

    My thinking is that we are better off not creating the constraints, and I am sure the log files on the source system maintain the integrity when committing records.

    Your thoughts are appreciated.

    Thank you.

  16. Hi Jeff,
    Really good article. So many good points addressed especially the main differences in development approach.

    One question I have is about the use of a staging database. Have you used the approach of having the real-time CDC mapping write to a staging database, and then having a continuously running mapping read the staging database, performing ETL and writing to the target system?

    The reasons why this was suggested are:
    – To minimize the performance impact of the CDC mappings if you have many consumers, i.e. read once and consume many times.
    – To loosely decouple target systems from the source system.

    Thanks,
    Peter
