Modeling Header & Details
Originally posted 3/16/2007
A common question I hear is about how to model & configure a Header-Detail scenario. By saying header-details, we are referring to a transaction that is split over more than one granularity. This is common modeling scenario such as Order & Order Line Item, Invoice & Invoice Detail, Agreement & Agreement Line Item, etc.
First, let’s describe the scenario using the classic Order Header & Detail as an example. Think of going to the supermarket and buying a bunch of products. Think about the receipt that you get. You will have a list of products, with Quantities and Unit Prices. At the bottom, in addition to the total, you will also have additional values such as Taxes, Discounts, Points Earned, and Coupons (i.e., Promotions) that apply to the whole of the transaction, not the individual line items. In a transactional system, the Order Header would contain most of the information that pertains to the whole transaction, such as location, time, employee, and customer (if known). The list of line item details would contain all of the above, plus the addition of Product. This is what is meant by different granularities – they are the exact same, but the line item details also contains the product dimension, and is thus at a lower level of granularity.
So how should such a transaction be modeled? The reality is that there is no one right answer.
First, think a bit about how they might be used from a pure data modeling/SQL perspective. Consider joining Order Header and Order Line Item together as follows:
Where Order.OrderID = Order_Line.OrderID;
What will happen if an Order has 5 line items, totaling $200 is sales, and a total Tax Amount of $10? 5 records would return in the query:
There are 5 records returned, each one is for a different product, but we just happened not to show them.
Now what if you aggregated them to the Order Header level?
Where Order_Header.OrderID = Order_Line.OrderID
Group by Order_Header.OrderID;
Here, although the Sale_Amt is correct, the Tax_Amt has been overstated. Whenever you mix granularities, you will ALWAYS encounter this problem. In Dimensional Modeling, Facts must be Normalized. This means that following solutions by themselves will not work, as they all produce the same resulting data set :
- Physical Table: Making one table (W_ORDER_LINEITEM_F) that holds both values from Header and Details.
- View: Making 2 separate fact tables, but making a view that joins between the two
- LTS View: Joining the 2 tables in a single Logical Table Source in OBI
If you are still learning SQL, it is imperative to understand that the joins and WHERE clauses will determine the base data set, which is then aggregated as was done with the above queries. You can kind of think of it like the query below, where the inner portion runs first to get the dataset, then the outer portion runs:
Where Order.OrderID = Order_Line.OrderID) DATA_SET
Group by BASE_DATA_SET.OrderID;
In fact, that is how the database will actually go about executing the query internally.
So, what is the solution if we want Tax_Amt on this query to be $10.00 and not $50.00? There are two main categories to choose from.
#1 Lower the grain with a pro-ration
Essentially, divide the header facts and pro-rate them down to the details level based on some business rule(s). Thus if a header has $10 in tax and 5 detail records, each would get $2.
Or would it? Maybe there is a better more sophisticated way than assigning them equal prorated tax amounts, perhaps based on value, quantity, or some other business rule. The point here is that you are essentially creating a new definition for a new metric, one that begins as a prorated tax amount. Sometimes this is difficult to do, and your user community may not like this idea. This is the primary solution proposed by Ralph Kimball.
This concept can be implemented using the three techniques identified above (Table, View or LTS Join), with the difference being that the pro-ration rule is applied to take a header value and pro-rate it over multiple child records:
1. Physical Table: Use the ETL to calculate the rule and store the prorated values in one table. The main benefit here is that there is only one fact table which contains everything, meaning good performance and simpler BI Tool mapping. Downsides are primarily the additional effort required, and I guess you could throw in additional data storage which I don’t see as a problem for most deployments.
2. Database View: The calculation is contained in the view SQL, but still requires two fact tables. If the business rule to prorate is difficult, then this may not be realistic
3. LTS View: Pretty much the same as the Database view option, except the business rule has been moved into the BI tool and out of the database
Of the three examples above, the first one will be your best bet in most cases, as it is simpler and will outperform the other two due to not having a Header-Details join. With large tables, it will significantly outperform the options with joins due to simply the smaller dataset. In Oracle systems it also aligns with Oracle’s Star Transformation feature better, which is a major performance booster.
There is another variation, one that is essentially the same technique used when displaying metrics with a multi-valued dimension. In the multi-valued situation, a single fact record applied to multiple dimension records, such as Sales Amt and a set of Sales Rep who worked on the deal. In this scenario, a weighting factor is recommended to be added to a bridge or relationship table that captures the M:M relationship. The metric can then be computed by multiplying the header fact with the dimension’s weighting factor for that fact to arrive at a prorated amount.
If you think about it, that’s really what the Order Line Item Detail table is: it holds the M:M relationship between Order Header and Product, plus some additional information that happens to be on the Line Item Detail record. The grain of either solution is Order-Product whether you have a normalized solution using a bridge table or a denormalized Line Item table or a full denormalized one where you merge Header and Line together. Thus, if so inclined you may substitute a pro-rated value in the details fact table for a weighting factor, which can then be multiplied by the header base fact to arrive at a prorated factor. Make sure your weighting factors add up to 1.00000 for each Header record however!
Note that all of the techniques discussed above are essentially the same thing; they are forcing a higher level fact to a lower level by prorating it to a lower level. They simply differ in their techniques, effort, performance characteristics and storage needs.
- Eliminates an extra fact table (storage, ETL processing, DB Management, Less BI tool mapping)
- Can perform exceptionally well
- Mapping into the BI tool is simpler
- An acceptable pro-ration rule must be identified
- Creates a piece of data that does not exist (A header level metrics pro-rated to a lower level)
- A header-only query must use the details table all of the time; can be slower if the ratio between header and details becomes large (> 10 details per Header)
#2 Use multiple queries and merge the results together
This technique puts more of the work and sophistication into the BI Server. With this solution, two separate fact tables are used, one for each grain. This refers to both Physical tables and also in the Business Model layer. When using a metric from Header and a Metric from Details, BI EE will issue two queries and merge the result set. Thus,
From Order_Header, Order_Line
Where Order.OrderID = Order_Line.OrderID
Group by Order.OrderID;
Will be essentially re-written as:
Select D1.OrderID, D1.TTL_TAX_AMT , D2.TTL_SALES_AMT
Group by Order.OrderID ) as D1,
Group by OrderLine.OrderID) AS D2
WHERE D1.OrderID = D2.OrderID;
In reality, the BI server will most commonly issue two separate queries to the database and do the join on the server, but the SQL above illustrates how that work from a data set and logic perspective. With this solution, each inner query maintains its proper grain, most importantly the header query does not get lower to the details level and thus maintains its accuracy.
- Simpler (although more) ETL, as tables map more 1:1 with a source system
- May be the only option when mapping to some data sources such as a transactional one
- Header only queries may be faster as line items are not scanned in the query
- BI Tool configuration becomes more complex and there is more of it
- For queries that use both header and detail, longer query times as 2 queries are executed
- Requires 2 fact tables instead of 1 (storage, ETL processing, DB Management, Less BI tool mapping)
If you can identify a workable and agreeable pro-ration rule, then go for the single, lower level fact table. However this is usually its downfall, in which case use the BI tool solution.