Degenerate Dimensions and Aggregates

An interesting topic came up on my latest project that I think is quite revealing about how OBI works under the hood: how to model degenerate dimension fields in OBI when aggregates are present.  Individually, each of these topics is relatively well known, with established solutions, but combined they can get pretty tricky.  This is especially true when you are using the BI Apps, which to a large degree still rely on the old method of manually mapping Time Series (Ago & ToDate) metrics (as opposed to the 10g+ technique of Time Series functions), a method that requires an explosion of Logical Table Sources (LTSs).

The core of this post concerns what to do when some of your aggregates don’t have all of the degenerate fields that your base fact has.  What do you need to do to model them properly?  Along the way I will shed a bit more light on how OBI “thinks”, so that you too can model things properly the first time.

Background

First, a refresher on modeling degenerate dimensions.  If you are familiar with this technique, please skip ahead to the section labeled “Growing the Problem”.  For more info on this topic, see pages 5 & 6 of this advanced preso.

A degenerate dimension is simply a truly dimensional (non-aggregatable) field that, for performance or convenience, is stored in the fact table.  A classic example of this is an Order #, which may be stored in a W_ORDER_F table as opposed to a W_ORDER_D table.  The reasons for this are straightforward: fewer tables, fewer ETL jobs, and no 1:1 join at query time between two tables at the same Order grain, which benefits performance.

In OBI we call this a combo table.  Simply put, a combo table is a physical table (& alias) that maps directly to both an OBI Logical Fact table and a Logical Dimension table.  Because both the Dim & Fact have the same physical alias as their source, no self-join is needed, resulting in nice, fast SQL.
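To make the payoff concrete, here is a hedged sketch of the kind of physical SQL this setup produces for the Order # example (table and column names are illustrative):

    -- Degenerate design: ORDER_NUM lives on the combo fact table, so the
    -- attribute and the metric come from a single table scan
    SELECT f.ORDER_NUM,
           SUM(f.ORDER_AMT) AS ORDER_AMT
    FROM   W_ORDER_F f
    GROUP  BY f.ORDER_NUM;

    -- Versus the dedicated-dimension design, which forces a 1:1 join
    -- between two tables at the same Order grain
    SELECT d.ORDER_NUM,
           SUM(f.ORDER_AMT) AS ORDER_AMT
    FROM   W_ORDER_F f
           JOIN W_ORDER_D d ON d.ROW_WID = f.ORDER_WID
    GROUP  BY d.ORDER_NUM;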

Here is a basic setup of how this might look.  Note that this one has a bit of logic on the fact table field to convert it into something meaningful to the user.

Here is the fact mapping:

And here is the dimensional mapping for Employee Type using the same physical layer table:
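The column mapping expression behind Employee Type would look something like this (the alias name is illustrative; the 1/0 meaning of MANAGER_IND is covered again at the end of this post):

    CASE WHEN "W_HEADCOUNT_F"."MANAGER_IND" = 1
         THEN 'Manager'
         ELSE 'Employee'
    END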

Don’t forget to build a Dimension for it and set the Content tab on both the Dimensional LTS and the Fact LTS (not shown):

To summarize, there are two logical tables, “Facts: Headcount” and “Headcount Degen”, that share a single physical alias as their source.  The Business Model Diagram:

But the Physical Diagram shows only one table:

Growing the Problem

At first glance that wasn’t too difficult.  Now I want to expand the example a bit to show a limitation of this technique when more than one table source is needed.  The tedium of this technique unfortunately shows up even before we get to the real problem discussed in this post: dealing with aggregates at different grains.

In many cases you will have several varieties of a single fact table in order to build your time-based metrics.  This is especially true of the BI Apps, which heavily rely on the older technique for time-based metrics.  Essentially, for each prior period you want to have a metric for, say Last Year Headcount, you’ll need a new Logical Table Source (LTS) on the Logical Fact table: another alias with a different join to the Date dimension, say on YEAR_AGO_WID.

In this example, we need to have a current month headcount, a prior month headcount, and a year ago month headcount.  Thus, 3 fact sources:
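In place of the screenshot, the three sources would look roughly like this (alias and column names are illustrative; each is the same physical snapshot table joined differently to the Date dimension):

    Fact_W_HEADCOUNT_F_Month     joined on  W_DAY_D.ROW_WID       = fact.SNAPSHOT_DT_WID
    Fact_W_HEADCOUNT_F_MonthAgo  joined on  W_DAY_D.MONTH_AGO_WID = fact.SNAPSHOT_DT_WID
    Fact_W_HEADCOUNT_F_YearAgo   joined on  W_DAY_D.YEAR_AGO_WID  = fact.SNAPSHOT_DT_WID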

In order to make queries work properly, the degenerate dimension fields have to be mapped in the logical dimension for each of these Fact LTSs, or else any query that breaks headcount out by, or filters on, Employee Type will not work properly.  I will introduce two different degen fields now in order to set up the aggregation problem later.

Make sure:

  • The fields are mapped in all of the dimensional LTSs
  • Each dimensional LTS’s Content tab is set to the detail level of the Dim Headcount Degen dimension
  • The Fact LTSs for these 3 physical aliases are set to the detail level of the Dim Headcount Degen dimension

Ok, so now you can use either Employee Type or Performance Rating with the 3 different time-based versions of month headcount.

Before moving on to the true purpose of this post, I want to make the situation worse, as in reality the ultimate recommendation of this post can be driven by this problem just as easily as by the aggregate problem.  In this example we are using snapshot fact tables, and you may find it necessary to have multiple time-based versions of these as well – for Quarter and Year.  With snapshots, all you are really doing is filtering a monthly snapshot to find the last month in the quarter – you are not truly aggregating with Group Bys.  Call this “pseudo-aggregation” or whatever you like, but we start to see the problem even more.
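To illustrate the pseudo-aggregation, a Quarter-level LTS can simply be the same monthly snapshot alias with an LTS WHERE clause along these lines (the flag column is hypothetical; any marker for the quarter-ending month works):

    -- Hypothetical LTS WHERE clause: keep only the quarter-ending monthly snapshot
    "W_DAY_D"."QTR_END_MONTH_FLG" = 'Y'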

If this is the fact table mapping:
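(The screenshot listed all 8 sources; one plausible breakdown, with illustrative names, is sketched below.)

    Month:    Current, Month Ago, Year Ago      – the original 3
    Quarter:  Current, Quarter Ago, Year Ago    – 3 more
    Year:     Current, Year Ago                 – 2 more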

The same set of tables needs to be mapped to the logical dimension as well.  Adding 5 more LTSs to the existing 3, we have 8 LTSs that need to be mapped to the dimension.  And don’t forget to reproduce the CASE WHEN logic in each of them.  Starting to get tedious?

Don’t forget to set the content tabs on both the dimension and the fact LTSs.

Mid-Point Summary

By now I imagine you get the point: this is incredibly tedious and error-prone RPD work.  From a pure effort perspective, simply using the 10g+ Time Series functions should cut down on some of this work (there are of course other things to configure with them, and metrics that work with multiple snapshot points still need this treatment).  However, if you are using the BI Apps or a legacy Siebel Analytics system, you probably don’t have a choice and have to do this.  A light bulb should also be going off in your head about a much simpler and less error-prone solution, but let’s hold off on that until the end.

The Aggregate Problem

Finally, on to the true purpose of this post: something actually tricky to model!

Let’s say that through the course of your performance tuning efforts, you’ve identified a few aggregates to build to help improve performance.  You’ve elected to aggregate along an Organization structure, up to the top level of that hierarchy.  Along the way, as with any aggregate, you’ve purposefully left some dimensions off in order to roll records up to a point where there is a benefit.  One of the fields you’ve left off is MANAGER_IND, the underlying column for Employee Type in your degenerate dimension.  However, you have kept the PERFORMANCE_RATING field, the other column used in the degen dimension.  How do you handle this?
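As a hedged sketch (all names here are illustrative), the aggregate might be built along these lines – note that PERFORMANCE_RATING survives while MANAGER_IND is rolled away:

    -- Illustrative aggregate: rolled up to the top-level org, MANAGER_IND dropped
    CREATE TABLE W_HEADCOUNT_TOPORG_A AS
    SELECT f.SNAPSHOT_DT_WID,
           o.TOP_LVL_ORG_WID,
           f.PERFORMANCE_RATING,           -- degen field kept
           SUM(f.HEADCOUNT) AS HEADCOUNT   -- MANAGER_IND intentionally omitted
    FROM   W_HEADCOUNT_F f
           JOIN W_INT_ORG_DH o ON o.ORG_WID = f.ORG_WID
    GROUP  BY f.SNAPSHOT_DT_WID, o.TOP_LVL_ORG_WID, f.PERFORMANCE_RATING;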

What We Expect

Let’s start by considering what the actual problem is before demonstrating it.  When we bring the fact aggregate into our Logical Fact table as a new LTS, we can cleanly map its metrics.  We know that we will have a corresponding LTS in the degenerate dimension, but it will map to only one of the two fields.  Since there is a corresponding dim LTS, we can set the Fact LTS Content tab to the detail level of the degenerate dimension.

Or can we?  Consider a simple query involving Month, Headcount and Employee Type.  As the Employee Type field is not in the dimensional LTS for the aggregate, we would expect that LTS not to be considered, thus the corresponding aggregate Fact LTS not to be considered, and thus the base fact table to be used instead of the aggregate.  Unfortunately, that line of reasoning is the opposite of what happens.

What Actually Happens

We have to recall that the most important thing in OBI modeling is the Fact Content tab.  Everything flows from it, and everything is subordinate to it (except priority groups, which won’t solve this problem).  By setting the fact aggregate LTS to the degen detail level, OBI expects to use the corresponding dimensional LTS for the aggregate – in fact, that is all it will consider.  But as it goes through the process of determining which dimensional LTS to use, it stumbles.  On the one hand, it knows from the Fact LTS Content tab and the physical joins (none, actually; it’s the same table) that this must be the correct dim LTS, but on the other hand that LTS does not map the Employee Type field.  Much like HAL 9000, it has been given conflicting instructions.  And bad things happen then.

Since the problem emanates from the Fact table outwards, perhaps the root cause lies there.  Based on what we’ve told OBI, the highest-grained Fact LTS able to solve the query is in fact the aggregate we just added.  But if OBI actually tries to use it, a variety of bizarre things happen, including multiple fact tables in the query joined through other common dimensions.  Essentially, a completely wrong query is generated.

The root of the problem lies in the granularity of the Fact LTS – we haven’t given OBI enough information inside the Logical Fact table to make the right choice.  To it, all Fact LTSs configured at the degen detail level have identical granularity, so it simply picks whichever it deems best.

Solution

The solution is to give OBI the proper information to make the right call.  In this simple example, that means splitting the degenerate dimension into two different ones, one for Employee Type and one for Performance Rating, each with its own hierarchy.  In each of these dimensions, only the sources that actually contain the field are added as LTSs, and each Fact LTS is set to one or both of the dimensions as applicable.  With this, OBI has adequate information to know that the new aggregate works with Performance Rating but not Employee Type, so it will not be considered for Employee Type queries.

However, this leaves us with one simple thought: Ugh.

Each fact and aggregate table now has to be mapped to potentially two different degenerate dimensions.  In the real world it may be even more, depending on how many fields are included in some fact aggregates and excluded from others.

A brief view of what this mess looks like is shown below.  In this scenario, I have 2 different aggregates: one that maps to Employee Type but not Performance Rating, and one that is the opposite – with Performance Rating and without Employee Type.  The list of table sources for each Logical Table:

Yes, that is 16 LTSs for each dimension, for a total of 32 mapping exercises to support combo tables, in addition to the 16 for the facts.

Once this is done, OBI has the intel on dimensionality it needs to choose correctly: it will generate the proper queries and use the aggregates only when it can.

The content tab for a fact LTS that has both fields in it:

And one that has only Performance Rating:

And one that has only Employee Type:
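In place of the screenshots, the three Content tab configurations boil down to this sketch (level names illustrative):

    Fact LTS                              Employee Type dim   Performance Rating dim
    Base fact (both fields)               Detail              Detail
    Agg with only Performance Rating      (not set)           Detail
    Agg with only Employee Type           Detail              (not set)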

Reality

This, quite frankly, is a horrible exercise to go through: 32 dimensional LTSs to map and configure properly.  And before you dismiss this as a silly example, this is the exact kind of config I recently had to do for an HR Analytics project – in fact it was a bit worse, as there were 3 different combinations instead of only 2.

Ok, so what is a better solution?

Simple: create a true, physical dimension and map it only once.  Use a small W_LOV_D or W_CODE_D style table, joined to your fact tables, to store the values.  The additional effort involved is straightforward and minimal (a SQL sketch follows the list):

  1. Load records into a small lookup table, giving them a code type to distinguish between each set of values.  For this example, MANAGER_IND is either a 1 or a 0; simply have a record for each of those values, with the keys 1/0 and the values ‘Manager’/’Employee’.  For something like a performance rating, put all possible values in the table – it still won’t be that many.
  2. Map the table into the OBI Physical layer.
  3. Join it to each fact alias as with any dimension, using the degenerate fields on the fact side of the join (e.g., MANAGER_IND).  Make sure to have bitmap indexes on these join fields to help with Star Transformation queries.
  4. Map the dimension table for each as normal.  Don’t forget to filter on the right set of keys if using a shared table; e.g., for Performance Rating make sure to filter on CODE_TYPE = ‘PERFRTNG’ or something like that.
  5. Map and link them as any normal dimension.  Note the Content tab of the Fact LTSs will be exactly the same; you are simply saving a lot of manual mapping work on the dimensions.
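Here is a hedged sketch of steps 1 and 3 in SQL (table, column, and index names are illustrative, following the W_LOV_D pattern):

    -- Step 1: seed the shared lookup table, one record per code value
    INSERT INTO W_LOV_D (ROW_WID, CODE_TYPE, LOV_CODE, LOV_NAME)
           VALUES (1, 'MGR_IND', '1', 'Manager');
    INSERT INTO W_LOV_D (ROW_WID, CODE_TYPE, LOV_CODE, LOV_NAME)
           VALUES (2, 'MGR_IND', '0', 'Employee');
    INSERT INTO W_LOV_D (ROW_WID, CODE_TYPE, LOV_CODE, LOV_NAME)
           VALUES (3, 'PERFRTNG', '5', 'Outstanding');
    -- ...one row per remaining performance rating value
    COMMIT;

    -- Step 3: bitmap indexes on the fact-side join columns to help Star Transformation
    CREATE BITMAP INDEX W_HEADCOUNT_F_B1 ON W_HEADCOUNT_F (MANAGER_IND);
    CREATE BITMAP INDEX W_HEADCOUNT_F_B2 ON W_HEADCOUNT_F (PERFORMANCE_RATING);

The RPD-side filter from step 4 is then just a WHERE clause on the dimension LTS, e.g. W_LOV_D.CODE_TYPE = 'PERFRTNG'.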

Using a separate table works well for smaller dimensions – I don’t recommend it where very large joins are needed, say a 1:1 join between W_ORDER_D and W_ORDER_F.  In that case, you may be stuck with the RPD degenerate mapping solution above.  On smaller tables, the performance hit should be negligible.  As mentioned before, using Time Series functions can help cut down the manual mapping effort as well.

The key take-away for making the decision here is to not underestimate the amount of OBI work that has to be done.  Although none of it is particularly difficult, there is a lot of tedious mapping work, and in my experience, the more of that you have, the greater your chances for errors.  That, of course, will bear itself out in a more painful QA period.

Posted on September 4, 2012, in BI Theory and Best Practices, Repository, Data Modeling.

  1. Jeff,
    Thanks for another excellent post.
    We suffered by using the 10g+ time series functions, because fragmentation does not work with them.
    Also, since we have many columns similar to your examples of perf rating and emp type, we put them all in a mini dimension / attribute profile dimension.
    Although we have often used combos to save on performance, it seems like they always come back to haunt us, and lately we have been designing new facts as purely thin, relying on MDs.
    David

    • Yeah, MDs are a good idea to incorporate if you can get away with them for the collection of smaller dimensions.
      Jeff M.

  2. Another point: when using combos, we are always faced with the prompt drop-down/multiselect problem.  Since the attribute is physically in the combo, this can cause terrible performance problems, since a SELECT DISTINCT is performed on the fact.  To alleviate this, we use LOVs in a separate LTS with a WHERE clause.  Physically, the shared LOV table is fictitiously joined to the fact once, and each LTS has its own WHERE.  It’s very tedious, and so much easier to use mini dimensions.

    David

    • Well, for prompt performance, once the first person loads up the values they will be in the cache for others to use.  Beyond that, in the database you can have them indexed so the query comes back in a reasonable time, even from a larger table.

      Jeff M.

  3. Hi,

    We have a requirement where 3 columns should hit one LTS and 3 should hit a different LTS; these are LTSs under a degenerate dimension.  Also, the columns under the first LTS should display values only for Sequence # 500.  Now, when I run it, the last column comes up blank.  Please help me with this.
