Oracle’s Data Warehouse Reference Architecture
Last year Oracle published a white paper called “Enabling Pervasive BI through a Practical Data Warehouse Reference Architecture”. I took a read through this paper and have some comments on it (so maybe this is a book report). For your convenience, I’ve uploaded a copy of it here: Oracle DW Reference Architecture Feb 2010. I also believe the architecture is a little vague (on purpose, I suppose). Depending on how you look at it, you might be looking at how the BI Apps are architected, or at how a real-time Enterprise Data Warehouse is built.
The document lays out a generic DW architecture to support a variety of functions. It discusses each of the primary layers in the architecture and some of their basic characteristics. Unfortunately, it goes into very little detail, and at times brushes off a large topic with a short sentence. The document is also not without its marketing focus, so I’ll try to call out some of those spots. Remember, this is a software company with a lot of different products; this paper needs to align with them. Fortunately, I’m more focused on getting things done right and less on being consistent with all of the software my company sells.
So, what does it look like? Below is my version of it. Note I’ve used different terms in some cases from the whitepaper to help explain it a bit better. I’ve also re-organized it to make it a little more tangible.
- The Staging layer is the entry point, mirrors the source system and is temporary.
- The DW layer is a generic 3NF model abstracted away from the staging layer. It is designed to store data in a flexible manner.
- The Performance layer is all about optimizing data access by whatever tools needed, and therefore frequently is a Star Schema model.
- Having source-mirror, generic 3NF and Star layers allows each to be tailored to its purpose, so you can leverage the strengths of each.
I’ll go through the layers one at a time and summarize what is in the document as well as add my own thoughts. Having implemented this architecture over the last year, I am in a good position to comment on the feasibility and reality of doing so.
The Staging Layer
Oracle calls this layer the Staging Data Layer, but Stage is much shorter and gets the point across so I’ll use it instead. Oracle’s statements are shown first in italics followed by my commentary.
- This is the only entry point into the DW. All data must go through this layer. I will not be so heavy-handed here. The reality is that some data should not go into your Data Warehouse, and instead should bypass it entirely and go straight to the Reporting layer. This can happen for at least two reasons: 1 – it is one-off data for a specific reporting or analytics need (such as a small spreadsheet maintained by one individual for reporting) and 2 – your organization may be split along EDW and BI teams, and the EDW team is too busy/understaffed to get your piece of data loaded in. The real world tends to make clean pictures much messier.
- Use whichever loading technique is applicable – Real Time CDC, Traditional Batch ETL, file loads, messaging, etc. However, if you restrict yourself to just batch, you are missing out on a potentially very powerful benefit – the real-time data warehouse. Clearly each piece of data cannot have the exact same access mechanism – some may be real-time database log reads, others may need to be batch, external files, etc. Use what is appropriate for your business needs of course, but if I were to lay out a DW architecture from scratch, I would want this layer to be as much real time as possible.
- It is temporary in nature. If you are going with a batch-only approach, then yes, this layer will contain only incremental data sets and therefore won’t be of much value on its own. However, if your staging is a real-time layer that brings data in from multiple sources, this layer can have a huge impact on your overall data infrastructure. My current project is doing precisely this – other front-end applications are directly querying our stage layer, as it offers them a place where multiple data sets exist in the same location. More on this in the next few bullets.
- Data is cleansed, records are rejected and data is in better shape than in the source system. This is record-by-record cleaning to make records more usable. For example, transforming fixed-width fields via UPPER(RTRIM()) logic to make them more user friendly, or performing some null-value replacement.
- It closely mirrors the source system in structure – minimal transformations are performed here. Although it doesn’t come out and say so, that is the conclusion you should draw, as they do not mention integration, conforming or consolidation in this layer.
- It’s in the same database as the rest of the EDW. For temporary staging tables, yes. For what I propose below – a real-time, integrated replica of sources, I think you should put it on a separate box.
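The record-by-record cleansing described above can be sketched quite simply. This is a minimal illustration, assuming hypothetical field names; the UPPER(RTRIM()) and null-replacement logic mirrors what the paper describes, so that downstream joins hit consistent values.

```python
# Sketch of record-level cleansing for a staging load (field names are
# hypothetical). Trailing padding from fixed-width fields is stripped,
# case is normalized, and nulls are replaced with a sentinel value.

def cleanse_record(rec: dict) -> dict:
    """Normalize one source record before it lands in the stage layer."""
    out = {}
    for key, value in rec.items():
        if value is None:
            out[key] = "UNKNOWN"               # null-value replacement
        elif isinstance(value, str):
            out[key] = value.rstrip().upper()  # fixed-width padding + case
        else:
            out[key] = value
    return out

raw = {"cust_name": "acme corp   ", "region": None, "credit_limit": 5000}
print(cleanse_record(raw))
# {'cust_name': 'ACME CORP', 'region': 'UNKNOWN', 'credit_limit': 5000}
```

The point of doing this once, at load time, is that every consumer of the stage layer gets join-safe values instead of each applying its own UPPER(RTRIM()) at query time.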
When you look at what the document says about the Stage layer, you can come away with very different perspectives on it. On one hand it seems like nothing more than a bunch of temp tables, as with the BI Apps and the _FS and _DS tables. However, this approach is not very interesting and is, frankly, so last year. If it’s just temp tables, then honestly I don’t think it needs a write-up.
If you instead think of the stage layer as a real-time, integrated and cleansed version of the sources, you have something much more powerful. Limited reporting can occur directly off of it for operational reasons. Systems can use it directly in their applications, as is occurring on my current project. Various applications are already coming, or planning to come, to our staging area instead of going to the actual source systems themselves. Soon we will have created a clearinghouse for all application data across the enterprise.
As we’re still in the beginning phases of doing this, and quite frankly the design never intended this, we’ve noticed some improvements we can make. Our initial approach was to mirror the source as much as possible. We made only one exception: the creation of Oracle DATE fields for DB2 on i5/OS, which stores dates as a numeric field in seconds. However, as we were working with our first stage layer consumer, a few things became clear about the philosophy of the Stage layer:
1. Perform more rigorous cleansing to ensure it is more easily usable. Doing joins on UPPER(RTRIM()) fields does not work well. Assume the data is going to be used directly.
2. Enhance the data model to make it more usable. Perhaps this is a simple denormalization to eliminate a query join. This is the ultimate challenge, however: performing potentially very complex integrations in a real-time manner. It may be difficult or even impossible to accomplish; on our systems, clean integration and consolidation of key entities is simply impossible. However, there are areas where we can integrate more, and provide benefits for those areas. Ultimately, solving tough integration challenges is not a back-end DW job; it’s a front-end job to make sure things are integrated from the beginning.
3. Break the stage layer out onto a separate hardware environment. The nature of our real-time, transaction-oriented stage layer is totally different from that of a batch-oriented DW or DM layer. Our block sizes have been globally set at 32K, which is terrible for what is essentially an OLTP layer. Furthermore, the staging layer will have uptime requirements that are greater than the rest of the solution, so we plan on running it in a clustered, HA system with Oracle RAC.
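The one transformation exception mentioned earlier – turning the source’s numeric seconds field into a real date – looks roughly like this. A minimal sketch: the Unix epoch used here is an assumption, as the actual source system may count seconds from a different base date.

```python
# Sketch of the date-field exception: the source stores dates as a numeric
# count of seconds, which the stage load converts to a usable timestamp.
# The 1970-01-01 epoch is an assumption for illustration only.

from datetime import datetime, timezone

def seconds_to_date(raw_seconds: int) -> datetime:
    """Convert a numeric seconds field from the source into a real date."""
    return datetime.fromtimestamp(raw_seconds, tz=timezone.utc)

print(seconds_to_date(1_000_000_000))   # 2001-09-09 01:46:40+00:00
```

Doing this once at load time means every stage consumer sees a proper DATE instead of each re-implementing the epoch arithmetic.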
If you think of the stage layer as another application, it changes your perspective entirely, as opposed to treating it as a collection of temporary landing tables built with no reuse in mind.
Data Warehouse Layer
Oracle calls this the Foundation Layer, but it is in fact a Data Warehouse, and I prefer that term, as the overall architecture does in fact include both a Data Warehouse and a collection of Data Marts. Explicitly calling them by name helps cut down on confusion around two of the most misused terms in all of DW & BI. BTW, I wrote a lot more on this topic a while ago here. I was pleased that the document discusses 3NF vs. Stars in the beginning – it lays it out very nicely.
The DW layer is where it all comes together – where all data is integrated, consolidated and conformed into common entities.
- Data Management is the main goal, and as such this layer uses 3NF. It aims to capture and represent data in a flexible and dynamic format. It is not designed with high-speed queries or end-user usability in mind; that is for the Performance Layer. Efficient storage and support for unknown requirements are its goals.
- The model is generic. This means that abstraction from the actual source data structures is needed to help insulate from source changes and business changes. For example, one would not make a table for each level of a hierarchy; if the hierarchy changes, the tables would no longer work. Instead, design in generic structures to allow for change. In other words, let the structure be maintained in the data, not in the data model. So in our example, you would model a recursive parent-child relationship with any number of levels indicated by a level type. Adding or removing a level would therefore have no impact on the data model, and possibly not even on the code.
- All Data is Granular. Aggregates and summaries are not needed here; this layer keeps transactions as they are, at the atomic level.
- They recommend starting with a 3rd party industry model. I don’t have much experience with Industry standard models, but they may be a nice jumpstart for you.
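The generic parent-child hierarchy mentioned above is worth a small sketch. This is a minimal illustration with hypothetical node names: the hierarchy lives in the rows, not the schema, so adding or removing a level changes data, not the data model.

```python
# Sketch of a generic parent-child hierarchy (node names are hypothetical).
# One table handles any depth; a "level type" column labels each node
# instead of each level getting its own table.

hierarchy = [
    # (node_id, parent_id, level_type)
    ("CORP",   None,   "COMPANY"),
    ("EAST",   "CORP", "REGION"),
    ("NYC",    "EAST", "DISTRICT"),
    ("STORE7", "NYC",  "STORE"),
]

def path_to_root(node_id, rows):
    """Walk the recursive relationship from a node up to the root."""
    parent_of = {r[0]: r[1] for r in rows}
    path = []
    while node_id is not None:
        path.append(node_id)
        node_id = parent_of[node_id]
    return path

print(path_to_root("STORE7", hierarchy))
# ['STORE7', 'NYC', 'EAST', 'CORP']
```

Inserting a new level (say, a "ZONE" between REGION and DISTRICT) is just two row updates; the table and the traversal code are untouched, which is exactly the insulation the document is after.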
Again the document doesn’t discuss much about how it is built. The assumption then is that it is a combination of real-time and batch as needed.
The Access and Performance Layer
The final layer in the EDW Architecture is the Access and Performance Layer.
- This layer is tailored for both end-user ease of use and query performance. It should be designed top-down, as that is the only way to know what model will perform best for your specific requirements. A bottom-up approach will always miss out on how to best solve a given problem using the BI tool and data model working together.
- There are multiple embedded dependent Data Marts. Embedded means they are on the same hardware platform, and dependent means they are derived from the Data Warehouse. I will lighten up on the definition here somewhat. Oracle is perhaps forgetting about some of their own products such as Essbase cubes which are external. However, as performance allows, it is much easier to build the Data Marts in the same system as the DW with views or CTAS operations than moving it outside the database.
Additionally, I will allow for other data elements to come from external sources (bypassing the DW), such as a user-maintained spreadsheet. Typically this kind of information doesn’t belong in your Enterprise Data Warehouse, so make allowances for it. I have had a customer shoot themselves in the foot with the conflicting mandate: “All Data Marts are 100% dependent on the EDW, but we don’t want all data in the EDW”. So what do you do when there is data that you don’t want in there?
Finally, there is more than one DM. As BI Applications projects come along, each is its own separate entity requiring its own data model to suit its specific needs. Even though they may have some overlapping data, since it is all coming from the same source, one can ensure that there is a single version of the truth.
- Regular Data Marts are Star Schema for performance and ease of use. Almost self-explanatory. This is where you use the appropriate technique for the job. Stars simply are easier to use and are designed with performance in mind.
- Other mart formats may be used as well – cubes, flat structures, etc. Again, suit the technology and format to the specific needs of a system.
- Subject oriented Subsets. This is where specific business needs are solved. A mature system may have several data marts, each performing a specific set of capabilities on a subset of the overall EDW.
- Implemented with Views or Aggregations. Whenever I hear “Views for the Data Mart” I get worried. Views are logic encapsulation devices and do nothing whatsoever for performance. This is the way Teradata looks at the world – views on top of 3NF. I am not convinced it works well (I’ve seen it fail all three times I’ve been on a Teradata project). Perhaps views can work for very small and simple structures, but do not plan on having your Data Mart be all views. What the Oracle document does not mention is the complexity of metric calculations that can take place here, something far too complex for views, possibly even for MVs. Additionally, solutions such as snapshots or SCDs may be implemented here, which cannot be done with views. Plan on persisting the data via materialized views or even regular ETL code.
- User or Process created data from this layer can be fed back to the Stage layer. I’ll add if needed. Also, for some reason they specifically remove EPM from the discussion; not sure why.
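The view-versus-persistence distinction above is easy to demonstrate. A minimal sketch, with sqlite3 standing in for Oracle and hypothetical table names: the view re-runs the aggregation on every query, while the persisted table answers from precomputed rows, which is the effect a materialized view refresh or batch ETL job provides in the performance layer.

```python
# Contrast a plain view with a persisted summary table (sqlite3 stands in
# for Oracle; table and column names are hypothetical).

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_fact (region TEXT, amount REAL);
    INSERT INTO sales_fact VALUES ('EAST', 100), ('EAST', 50), ('WEST', 75);

    -- logic encapsulation only: the GROUP BY runs on every access
    CREATE VIEW v_sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales_fact GROUP BY region;

    -- persisted result: what an MV refresh or ETL job would maintain
    CREATE TABLE mart_sales_by_region AS
        SELECT region, SUM(amount) AS total FROM sales_fact GROUP BY region;
""")

print(con.execute("SELECT * FROM v_sales_by_region ORDER BY region").fetchall())
print(con.execute("SELECT * FROM mart_sales_by_region ORDER BY region").fetchall())
# both return [('EAST', 150.0), ('WEST', 75.0)]
```

Both return identical rows today, but only the persisted table shields query response time from fact-table growth, and only persistence lets you layer in snapshots or SCD history that a view cannot reconstruct.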
The Architecture talks a bit about the BI layer as well, but honestly this is where the document breaks down into a pure marketing piece. Essentially it says: OBI is awesome, can handle anything you throw at it, and all other BI tools should kneel before it:
- Use it for metric calculation on top of the Performance Layer and before other BI tools. So model everything in OBI, then put BO or Cognos on top of OBI for their pretty user interfaces. This is a pretty silly recommendation honestly, one that has been around for a long time. Technically, yes, it is possible. But it is additional work, layers, and support, not to mention additional cost.
- Use OBI to prototype directly against the source systems and then swap out the physical layer. Honestly, this pure marketing recommendation is ridiculous. First off, modeling OLTP sources is much harder to do. Secondly, it will take significantly more effort for the BI developer to figure out the source model. Third, who wants a prototype model and reporting system migrated into production with all of the shortcuts, temporary folders, mock-ups and general junk that accumulate during a prototyping effort? Again, although in theory it is possible, reality says you will end up with complete garbage, and your prototype will move at a snail’s pace. You might as well build your prototype off of MS Access.
- Data in the stage can be combined with data in the performance layer. I’ve been seeing this option come up a bit more recently. Basically, it’s using OBI fragmentation to map both the OLTP and the DM into the same model so that both data sets can be queried seamlessly. I’ve seen this promoted by Siebel/Oracle for years, and quite frankly it doesn’t work except in very specific cases. There are problems mixing data in different states together at query time that choke this solution.
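The fragmentation idea in that last bullet boils down to stitching two physical sources together on a boundary at query time. A hedged sketch, with sqlite3 standing in for the two layers and hypothetical names and boundary rule: history comes from the mart, the current period from the stage copy. The hard part alluded to above is that in practice the two sides rarely share conformed keys and consistent states.

```python
# Sketch of a fragmentation-style query: batch history from the mart,
# current-period rows from the real-time stage, unioned on a date boundary.
# Table names and the boundary date are hypothetical.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stage_orders (order_day TEXT, amount REAL);  -- real-time
    CREATE TABLE mart_orders  (order_day TEXT, amount REAL);  -- batch history
    INSERT INTO stage_orders VALUES ('2010-03-02', 40);
    INSERT INTO mart_orders  VALUES ('2010-03-01', 100);
""")

rows = con.execute("""
    SELECT order_day, amount FROM mart_orders  WHERE order_day <  '2010-03-02'
    UNION ALL
    SELECT order_day, amount FROM stage_orders WHERE order_day >= '2010-03-02'
    ORDER BY order_day
""").fetchall()
print(rows)   # [('2010-03-01', 100.0), ('2010-03-02', 40.0)]
```

It works here because both sides happen to agree on grain and columns; once the stage rows are uncleansed or keyed differently from the conformed mart, this seamlessness falls apart, which is exactly the failure mode described above.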
So in the end you end up with a picture like the following:
The BI Apps
So, how does Oracle reconcile the BI Apps with this architecture? They mention it a few times in a somewhat confusing manner, but ultimately they get to it: the BI Apps exist in parallel with the EDW. Therefore there are two extracts from the same source system. In fact, they are positioned as light analysis and reporting systems when compared to the much more sophisticated EDW solution. In many ways this is true; the benefit of the BI Apps is time to market. It is impossible to develop the same content by hand in as little time as you could by buying the solution.
This is a tricky topic. As they correctly point out, pre-built reporting marts are here to stay, so you’ll need to account for them somehow. I think Oracle’s recommendation is probably a good one, even if it sounds a bit weird. Just like you leverage the benefits of 3NF and Stars for specific purposes, you leverage the BI Apps for time to market, while you leverage the EDW for a more complete and sophisticated solution.
One option that I would also consider is to simply use the BI Apps as the EDW. Of course this is completely against the Reference Architecture, but it is an option. The BI Apps can be extended until you are blue in the face; why not use them as a foundation?
The document is a bit confusing at times, but it makes you think about many things. Its basic structure is a good one, but keep in mind it is not the only way to do Enterprise Data Warehousing. I’d say that since few firm design decisions are presented (e.g., real-time vs. batch), it turns out to be either highly adaptable or almost meaningless (choose your phrase based on your job title). To put some more meat on it, and take it forward a bit so that you are building for the future, here is one way I would modify the architecture:
1. Real-time CDC capture into a Real-Time ODS layer
2. Real-time layer is cleansed and integrated as much as possible based on source system realities
3. Other applications leverage the Real-Time ODS layer
4. A Generic 3NF EDW model is loaded by batch
5. Data Marts built as needed by batch
6. OBI reports on both the DM layers and the Real-Time ODS layer as needed
Of course one could take this further into everything real-time, but that’s a bit of a stretch for now.