OBI’s Nested Aggregation > Ralph Kimball
Ok so I put a provocative title on this post for a reason. This post will explore why some of Kimball’s concepts may be out dated when newer query generation or database capabilities are taken into account. Specifically, I’m going to discuss the OBI concept of Nested Aggregation, also known as Dimension Based Aggregation, in detail. Using this OBI Modeling technique you can relax a bit on one of the Kimball Dimensional Modeling rules. I’ll show how it works in a simplified manner, plus identify some things to be careful with and how it does not alleviate you from doing real modeling work.
I’m going to start the discussion on nested aggregation by showing how it can be used as an alternative to Kimball’s original rules and guidance on dimensional design. One of the main steps that we’ve all been taught about dimensional modeling is to ‘declare the grain’ of the fact or to ‘normalize’ the facts. That is, keep them all on the same exact grain. He clearly states to not mix the grain into a single fact table.
Consider the example of an Order Header and Order Details data model as he does in his industry defining book, “The Data Warehouse Toolkit”. Ideally we would love to have everything all in one fact table at the Details level as it makes things nice and simple. Having two tables is messy and extra work, so lets try to KISS it.
One very common and useable solution is to have one fact table at the header level and another at the details level, with some aggregation going on from Details up to Header. In this scenario you would keep all of the Order Header grained facts in the Header fact table and not in the Details/Line Items table. You would however be able to roll-up the details level info (Sale $) into the header level, treating the header level table as a Fact/Aggregate hybrid. Although this is certainly a decent solution, for very large data sets you will end up with two very large tables with an increased ETL load window which from what I’ve seen is a big deal. Furthermore, to use information from both places you will likely require two queries, one for each table. Plus you would not be able to show any of the header level facts alongside the additional dimensionality in the details table unless you did some nice OBIEE modeling.
A second solution is to allocate the facts from the higher level to the lower level using some algorithm. This is the one that Mr. Kimball frequently proposes as the best solution. Perhaps there is a Discount Amount or a Shipping charge that you would split up over the few line items attached to the Order. By doing so, you are changing the grain of the Header level Discount $ from Order Header to Order Line Items by calling it something new like Discount $ Line Allocation. Once you have done that, all of your facts are at the same grain and all is well with the world.
Unfortunately, this is hardly ever done in the real world. Although Mr. Kimball does mention that this can be difficult, he brushes it aside and says it is usually already done by the business somewhere. Yeah, right. I have never seen this technique used in any of over 50 BI projects so far. Quite frankly I think he dropped the ball on this recommendation as it has no place in the real world.
Luckily, OBI comes to the rescue with third option, one with awesome technology never envisioned back in the late ‘90s. The limitation Mr. Kimball was using back when these ideas were created must have been that he expected hand-written SQL to access things, and at some point doing anything too complex would be unusable by users. I couldn’t agree more. But with a modern query generator, that complexity completely disappears, opening up a new world of possibilities. Now, instead of only using Sum(), Count(), Min(), Max(), and other single functions, you can combine them together in interesting ways. You could for example do a Sum( of a Max()) or a Sum (of a Last()).
In this example we’re going to show how to overcome the scenario above by using nested aggregation. This example has more interesting subject matter than Orders however. The topic is beer ratings, something very near and dear to my heart as you can see from my beer website brewpalace.com
The data model is straightforward for beer and bottles: Company –< Beer Brand –< Bottle. Thus, each beer could have multiple bottles, say of different sizes or from different countries. But a Beer rating is static for each beer – thus it would be the same for all of the bottles attached to the beer.
Consider if we had a mixed grain fact table, at the bottle level of detail, but we had the Beer rating in it. This is the exact same setup as Orders above. A piece of the table might look like:
The metric we’d like to make is called Avg. Rating. Lets start by going through the math to see what the answer should be and what it cannot be to be correct.
If we were to simply throw an Avg() aggregation on the above table, our answer would be 3.43:
However, we see that Extra Stout is listed 6 times in the table, and Harp 3 times – those repeating values should throw off the results. What we really want is an average based on the Beer grain and not the Bottle grain:
This looks better. We should be able mentally to justify that the 3.43 is too high as there are 2 extra 4’s and 5 extra 3.5 rating in there, but with only 1 extra 3 to counteract the extra high numbers.
Looking at the last table above, it is essentially a Beer grained table, which is 100% analogous to the Order Header table. The question is then, how can we dynamically convert the top, bottle grained table with the actual data set into the beer level table so that we can then do a nice clean average?
Well, if you were to build the Beer level table as an aggregate from the base bottle grain table, how would you do it? Depending on the data set, the quality and the existence of Nulls, you would probably build it by picking one of the rows from the detail level using either a First(), Last(), Min() or Max() function. I guess you could also use Avg(), but sometimes strange things can happen with division. So, you would essentially build it as:
Create table Beer_Agg As Select Company, Brand, First(Beer Rating) From Bottle_Facts Group by Company, Brand;
Well this is exactly what nested aggregation does; it just does it on the fly in SQL. When we think about describing what we just did, we can say that we eliminated the Bottle dimension by selecting the First over it. The language and terms I used in italics are important to remember, as it helps clear up our thinking on what is honestly a tricky thing to think about. From that point forward, once we have our “aggregate table”, then you can go ahead and apply an Avg() aggregation rule to accurately compute Avg. Rating. Nested aggregation is the process of combining these two together at query time.
To configure a Nested Aggregation, check off the Based on Dimensions checkbox to bring up the editor. As you see here, we have two aggregation rules set up, and they are done in sequence, from top to bottom. We’ve set it up to issue a first over the Bottle Dimension, then after that any other usage we can just do a regular average. That is exactly what we did to convert the base table to the aggregate table before.
Keep in mind that when it issues one of these aggregations, it will group by all of the other dimensions that are at the fact grain for the LTS being used. In fact, it unfortunately groups them by the Level key of the dimension, meaning it actually joins back tot he dim. For example, it won’t do a group by on Fact.SAMPLE_DT_WID, but rather W_DAY_D.ROW_WID, requiring that join.
Nested Aggregation in Action
Using a regular metric with a simple Avg() agg rule only, we can see the problems that occur:
Looks fine – we want 3.29. But drilling down from Brand into bottle causes a problem:
3.43 – we know that is incorrect, regardless of how we set the aggregation rule in the expression editor in Answers.
Using the Nested aggregation metric, we get a more consistent result:
And drilling down to bottle level:
Perfect. Let’s take a look at what it did in the Physical SQL. Doing so will begin to highlight some of the negatives of this technique, or at least some other things to consider.
The first result, the one with only the average, shows the clear structure of what is being done under the scenes. The other two SQLs get a bit confusing, so I’m only going to use this one:
select avg(case D1.c5 when 1 then D1.c2 else NULL end ) as c1 from (select D1.c2 as c2, D1.c3 as c3, D1.c4 as c4, ROW_NUMBER() OVER (PARTITION BY D1.c3, D1.c4 ORDER BY D1.c3 DESC, D1.c4 DESC) as c5 from (select distinct FIRST_VALUE(T43775.RATING IGNORE NULLS) OVER (PARTITION BY T43788.ROW_WID, T43764.BEER_ID ORDER BY T43788.ROW_WID NULLS LAST, T43764.BEER_ID NULLS LAST, T43775.BOTTLE_ID NULLS LAST, T43775.BOTTLE_SIZE NULLS LAST ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as c2, T43788.ROW_WID as c3, T43764.BEER_ID as c4 from W_DAY_D T43788, BEER T43764, COMPANY T44005, MIX_LEVEL_F T43775 where ( T43775.SAMPLE_DT_WID = T43788.ROW_WID and T43764.BEER_ID = T43775.BEER_ID and T43764.COMPANY_ID = T44005.COMPANY_ID and T44005.COMPANY_NAME = 'Guinness' ) ) D1 ) D1
We see that on the inner portion, it is generating a query similar to the one we made for our aggregate – it is inline building a new table without the Bottle Dimension, and it is using the First() to make it happen. Using a First will generate a SQl that filters on the ROW_NUMBER result = 1 (not shown in the SQL), whereas a Max() function will introduce Group Bys into the creation of the new grained in-line view. From there we see that the outer portion of the query is issuing the AVG(), just as we had hoped.
This simple example becomes much more complex when dealing with multiple levels on a report at the same time – say for the Beer level report with the Total lines – that has two grains on the report. I won’t post the SQL here, but it does have 3 query parts which hit the fact table separately. The use of the WITH syntax cuts that down to one, but remember its just syntax and not necessarily indicative of what the optimizer is actually doing.
A second example: Headcount
The BI Apps themselves use Nested Aggregation with Workforce Analytics to determine Headcount numbers. Headcount snapshots are captured in a table pretty much every day in the main Workforce Month Event table. I say pretty much every day because they are actually more like ranges of a certain employee’s status. If I get transferred on the 10th, I’ll get a new record on the 10th that will last until something else changes. If I get promoted on the 21st, then I’ll have three records for the month – 1st – 9th, 10th -20th, 21st – current or end of month.
To get a headcount of people for some group or status, filter the recordset down as normal, but you want to make sure you are getting the most current record for that person. On a summary level, getting an accurate headcount means getting the latest, most accurate version of these employee snapshots as they change over time. With this thinking in mind, you might begin to see how we want to use a Last() aggregation over time to first pick the most current recordset, then do our counting and summing on top of that. Sure enough, the headcount metrics are Sum(Last () over Time). Banking Balances are identical to Headcount in this manner.
The tricky part is in the sequencing – which goes first? Do I want Sum(Last()) or Last(Sum())? You will have to think a bit for your situation – sometimes saying it out loud or thinking about if you built a specific table for it will help.
In the beginning I outlined three ways to solve a mixed grain problem. All work. But each has its own tradeoffs in ETL performance vs. query performance vs. Business agreement to definitions. When considering Nested Aggregation, think about the following:
- The creation of the in-line view that eliminates one dimension may be too expensive to do at query time, especially on a large data set which is used with limited to no filters. Deriving the Orders table at query time on an unfiltered query with millions of rows is a bad idea. If the metric is frequently used only on a smaller subset, then by all means it is a good solution. If you look closely at the in-line select, it actually joins back to the dimension tables, and you may be stuck with a group by as well.
- Aggregates do not eliminate this problem, unless you simply do not map the metric to the lowest grain table that causes the problem. If you had both header and details fact tables, you could certainly elect to not map in the Discount $ to the details fact table. Of course doing so may cut out some of your query capability, and it may force two queries – almost as if they were conformed queries. Perhaps the two queries is better than one query on the details table – or perhaps it is worse! Building the aggregate and mapping the metric to both places will help in many cases with your performance problem, but not in all cases – your mileage may vary.
- With that in mind, having two tables means having two tables – extra ETL, storage, mapping, QA, etc.
- Don’t go more than 2 aggregation rules – anything more gets to be unwieldy, difficult to understand and QA.