It’s Not ‘To Stitch or Not to Stitch’, But Rather Where to Stitch

It’s been a while since I’ve posted anything aside from an announcement – in fact, for the entire year. That was due to being swamped on my current project, where I was involved in a number of things. But now that we’ve deployed our first 2 releases, things are starting to settle in a bit, so I have some time to touch on some topics. Sorry about the ugly formatting – I’m no expert with this editor.

The first topic came out of the end of the project, when we were performing a few weeks of load testing. Our performance numbers at load (100 concurrent users) were pretty terrible. After a few realizations, I decided to try a tweak in OBI that had an enormous impact on our results.

Who Does The Work?

The general approach to developing with OBI is to push as much processing work to the database as possible.  The goal is generally to let the DB engine handle all the work, all within itself, and reduce as much as possible any performance or latency hits introduced by a network.  In general this is correct.

However, there may be a point where that philosophy becomes overkill.

When OBI generates SQL, it has a number of options available to use, as defined in the Features tab of your database object in the physical layer.

The specific scenario I’m going to discuss is with a conformed query.  A conformed query is simply the merging together of two separate queries along a shared dimension.  There are a few options in the Features tab that have a large degree of influence over how OBI handles a conformed query.

The Scenario

Consider the following simple conformed data model: two fact logical table sources, LTS1 and LTS2, joined to a shared, conformed Person dimension.

LTS stands for Logical Table Source; forget about Logical Tables, they don’t matter here.

Then consider the following simple query from Answers:

SELECT Person.Name, LTS1.Metric1, LTS2.Metric2;

OBI will need to go after 2 result sets and then merge them together:

#1:

SELECT Person.Name, SUM(LTS1.Metric1)
FROM Driver, LTS1
GROUP BY Person.Name;

#2:

SELECT Person.Name, SUM(LTS2.Metric2)
FROM Driver, LTS2
GROUP BY Person.Name;

Then the results of each query will have to be merged together.  It’s not quite as simple as you might first expect, especially if there are more conformed fields than just Person.Name.  The tricky part deals with null values and mismatched result sets, which OBI assumes is always a possibility:

Example Query Fragment Result Sets:

Query #1        Query #2
Ann, 100        Bob, 1000
Bob, 130        Chuck, 1010
Chuck, 150      Dave, 1200

As you can see, the result sets are different, so simply joining on the common field of Person.Name is inadequate – some additional wrapper logic is needed to produce the following results:

Person.Name    Metric1    Metric2
Ann            100
Bob            130        1000
Chuck          150        1010
Dave                      1200

What needs to occur is a FULL OUTER JOIN to stitch the two together.  And OUTER JOINs, let alone FULL OUTER JOINs, are CPU-intensive operations.  If there is more than one conformed field, the merging logic gets even more complicated, as combinations of values with nulls need to be considered.  More on that later.
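To make the stitch concrete, here is a minimal sketch of the FULL OUTER JOIN in question, using the sample data above as inline subqueries.  This is an illustration in Oracle syntax, not OBI’s literal output:

WITH q1 AS (
  SELECT 'Ann' AS name, 100 AS metric1 FROM dual UNION ALL
  SELECT 'Bob', 130 FROM dual UNION ALL
  SELECT 'Chuck', 150 FROM dual
), q2 AS (
  SELECT 'Bob' AS name, 1000 AS metric2 FROM dual UNION ALL
  SELECT 'Chuck', 1010 FROM dual UNION ALL
  SELECT 'Dave', 1200 FROM dual
)
SELECT
  -- the conformed key must be taken from whichever side is not null
  CASE WHEN q1.name IS NOT NULL THEN q1.name ELSE q2.name END AS person_name,
  q1.metric1,
  q2.metric2
FROM q1 FULL OUTER JOIN q2 ON q1.name = q2.name
ORDER BY 1;

Note the CASE expression on the key: a plain q1.name would come back null for Dave, which is exactly the value-checking logic OBI generates for every conformed column.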

OBI’s SQL Generation Choices

The two base inner queries will happen no matter how you configure OBI; they are the data-retrieval part of the problem, and you have to get the data out of the raw tables somehow.  Sure, there are some alterations and strange things (like the WITH syntax), but basically those 2 queries have to get fired regardless.

The merge/stitch/join portion, however, is flexible, and is ultimately what this article is about.  OBI has 2 choices:

a) The database can do the merge

b) OBI can do the merge

There are 2 database features which control where the merge will occur: ROWNUM_SUPPORTED and PERF_PREFER_INTERNAL_STITCH_JOIN.  On many databases, OBI defaults the ROWNUM_SUPPORTED feature to True and PERF_PREFER_INTERNAL_STITCH_JOIN to False.  Let’s take a look at the effect of these 2 options:

     ROWNUM_SUPPORTED    PERF_PREFER_INTERNAL_STITCH_JOIN    Join Location
1    Yes                 No                                  Database
2    Yes                 Yes                                 OBI
3    No                  No                                  OBI
4    No                  Yes                                 OBI

When OBI does the stitching, it simply fires off multiple, but simple, queries to the database, waits for each to come back, and then does the merging on its own:

  • Database Load Reduced
  • Network Load Increased
  • OBI Load Increased

When the database does the stitching, the two base queries are wrapped together in a monster SQL statement with a FULL OUTER JOIN linking them and a ton of additional value-checking SQL in the format of:

case when D1.c4 is not null then D1.c4 when D2.c2 is not null then D2.c2 end as c1,
  • Database CPU Load Increased
  • Network Load Reduced
  • OBI Load Reduced

Let’s look at some examples to show more specifically what I’m talking about.

Here is the first option, where the database does the join.  These are real queries from my current system; don’t focus on the internals of the two inner SELECTs – they are the core data-access code, basically the same regardless of how you glue them together – it’s the outer stitching wrapper that is important.  Note this is a 3-column select with a few filters on it and a strange piece that selects from Dual (ignore it):

select D1.c1 as c1,
D1.c2 as c2,
D1.c3 as c3
from
(select D1.c1 as c1,
D1.c2 as c2,
D1.c3 as c3,
D1.c4 as c4,
D1.c5 as c5
from
(select case  when D1.c4 is not null then D1.c4 when D2.c2 is not null then D2.c2 end  as c1,
D2.c1 as c2,
case  when case  when D1.c3 is not null then D1.c3 when D2.c3 is not null then D2.c3 end  = 'Driver' then D1.c1 else D1.c2 end  as c3,
case  when D1.c3 is not null then D1.c3 when D2.c3 is not null then D2.c3 end  as c4,
case  when D1.c5 is not null then D1.c5 when D2.c4 is not null then D2.c4 end  as c5,
ROW_NUMBER() OVER (PARTITION BY case  when D1.c3 is not null then D1.c3 when D2.c3 is not null then D2.c3 end , case  when D1.c4 is not null then D1.c4 when D2.c2 is not null then D2.c2 end , case  when D1.c5 is not null then D1.c5 when D2.c4 is not null then D2.c4 end  ORDER BY case  when D1.c3 is not null then D1.c3 when D2.c3 is not null then D2.c3 end  ASC, case  when D1.c4 is not null then D1.c4 when D2.c2 is not null then D2.c2 end  ASC, case  when D1.c5 is not null then D1.c5 when D2.c4 is not null then D2.c4 end  ASC) as c6
from
(select D1.c1 as c1,
D1.c2 as c2,
D1.c3 as c3,
D1.c4 as c4,
D1.c5 as c5
from
(select count(distinct T10979.DRIVER_WID) as c1,
count(distinct 0) as c2,
T13395.ENTITY_TYPE as c3,
TRUNC(T804.DAY_DATE) as c4,
T804.DAY_WID as c5,
ROW_NUMBER() OVER (PARTITION BY T804.DAY_WID, T13395.ENTITY_TYPE ORDER BY T804.DAY_WID ASC, T13395.ENTITY_TYPE ASC) as c6
from
(select 'Vehicle' as ENTITY_TYPE  from dual union select 'Driver' from dual) T13395,
W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
W_DRIVER_DAY_AGG T10979 /* Facts: Measure DayDrv (W_DRIVER_DAY_AGG) */
where  ( T804.DAY_WID = T10979.DATE_WID and T806.CUSTOMER_WID = T10979.CUSTOMER_WID and T806.CUSTOMER_WID = 188111.0 and T867.MEASURE_TYPE_WID = T10979.MEASURE_TYPE_WID and T10979.CUSTOMER_WID = 188111.0 and T13395.ENTITY_TYPE = 'Driver' and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
group by T804.DAY_WID, T13395.ENTITY_TYPE,  TRUNC(T804.DAY_DATE)
) D1
where  ( D1.c6 = 1 )
) D1 full outer join (select D1.c1 as c1,
D1.c2 as c2,
D1.c3 as c3,
D1.c4 as c4
from
(select sum(T1021.MEASURE_COUNT) as c1,
TRUNC(T804.DAY_DATE) as c2,
T13395.ENTITY_TYPE as c3,
T804.DAY_WID as c4,
ROW_NUMBER() OVER (PARTITION BY T804.DAY_WID, T13395.ENTITY_TYPE ORDER BY T804.DAY_WID ASC, T13395.ENTITY_TYPE ASC) as c5
from
(select 'Vehicle' as ENTITY_TYPE  from dual union select 'Driver' from dual) T13395,
W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
W_FLEET_AGG T1021 /* Facts: Measure Fleet (W_FLEET_AGG) */
where  ( T804.DAY_WID = T1021.DATE_WID and T806.CUSTOMER_WID = T1021.CUSTOMER_WID and T806.CUSTOMER_WID = 188111.0 and T867.MEASURE_TYPE_WID = T1021.MEASURE_TYPE_WID and T1021.CUSTOMER_WID = 188111.0 and T13395.ENTITY_TYPE = 'Driver' and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
group by T804.DAY_WID, T13395.ENTITY_TYPE,  TRUNC(T804.DAY_DATE)
) D1
where  ( D1.c5 = 1 )
) D2 On D1.c3 = D2.c3 and D1.c5 = D2.c4
) D1
where  ( D1.c6 = 1 )
) D1
order by c1

Now with PERF_PREFER_INTERNAL_STITCH_JOIN = TRUE:

select D1.c4 as c1,
D1.c2 as c2,
D1.c3 as c3,
D1.c1 as c4,
D1.c5 as c5
from
(select D1.c1 as c1,
D1.c2 as c2,
D1.c3 as c3,
D1.c4 as c4,
D1.c5 as c5
from
(select count(distinct T10979.DRIVER_WID) as c1,
count(distinct 0) as c2,
T13395.ENTITY_TYPE as c3,
TRUNC(T804.DAY_DATE) as c4,
T804.DAY_WID as c5,
ROW_NUMBER() OVER (PARTITION BY T804.DAY_WID, T13395.ENTITY_TYPE ORDER BY T804.DAY_WID ASC, T13395.ENTITY_TYPE ASC) as c6
from
(select 'Vehicle' as ENTITY_TYPE  from dual union select 'Driver' from dual) T13395,
W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
W_DRIVER_DAY_AGG T10979 /* Facts: Measure DayDrv (W_DRIVER_DAY_AGG) */
where  ( T804.DAY_WID = T10979.DATE_WID and T806.CUSTOMER_WID = T10979.CUSTOMER_WID and T806.CUSTOMER_WID = 188111.0 and T867.MEASURE_TYPE_WID = T10979.MEASURE_TYPE_WID and T10979.CUSTOMER_WID = 188111.0 and T13395.ENTITY_TYPE = 'Driver' and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
group by T804.DAY_WID, T13395.ENTITY_TYPE,  TRUNC(T804.DAY_DATE)
) D1
where  ( D1.c6 = 1 )
) D1
order by c3, c5;
-------------------------------------------------------------------------------
select D2.c2 as c1,
D2.c1 as c2,
D2.c3 as c3,
D2.c4 as c4
from
(select D1.c1 as c1,
D1.c2 as c2,
D1.c3 as c3,
D1.c4 as c4
from
(select sum(T1021.MEASURE_COUNT) as c1,
TRUNC(T804.DAY_DATE) as c2,
T13395.ENTITY_TYPE as c3,
T804.DAY_WID as c4,
ROW_NUMBER() OVER (PARTITION BY T804.DAY_WID, T13395.ENTITY_TYPE ORDER BY T804.DAY_WID ASC, T13395.ENTITY_TYPE ASC) as c5
from
(select 'Vehicle' as ENTITY_TYPE  from dual union select 'Driver' from dual) T13395,
W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
W_FLEET_AGG T1021 /* Facts: Measure Fleet (W_FLEET_AGG) */
where  ( T804.DAY_WID = T1021.DATE_WID and T806.CUSTOMER_WID = T1021.CUSTOMER_WID and T806.CUSTOMER_WID = 188111.0 and T867.MEASURE_TYPE_WID = T1021.MEASURE_TYPE_WID and T1021.CUSTOMER_WID = 188111.0 and T13395.ENTITY_TYPE = 'Driver' and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
group by T804.DAY_WID, T13395.ENTITY_TYPE,  TRUNC(T804.DAY_DATE)
) D1
where  ( D1.c5 = 1 )
) D2
order by c3, c4;

As you can see, the core of the queries is identical; the difference is only in where the merging occurs.

One side note: some databases do not like the ROW_NUMBER() function, so check whether yours does.  MPP databases, in my Teradata experience, perform this operation very slowly.  I’ve also seen conflicting information from authors on whether ROW_NUMBER is good or bad for performance.
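For reference, here is the ROW_NUMBER() de-duplication pattern in isolation – a sketch with made-up inline data, mirroring the "where ( D1.c6 = 1 )" filter in the generated SQL above, which keeps only the first row per partition:

select d.name, d.metric
from (select t.name,
             t.metric,
             row_number() over (partition by t.name order by t.name asc) as rn
      from (select 'Ann' as name, 100 as metric from dual
            union all
            select 'Ann' as name, 100 as metric from dual) t) d
where d.rn = 1;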

Without the ROWNUM_SUPPORTED setting, the queries are even simpler:

select count(distinct T10979.DRIVER_WID) as c1,
count(distinct 0) as c2,
T13395.ENTITY_TYPE as c3,
TRUNC(T804.DAY_DATE) as c4,
T804.DAY_WID as c5
from
(select 'Vehicle' as ENTITY_TYPE  from dual union select 'Driver' from dual) T13395,
W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
W_DRIVER_DAY_AGG T10979 /* Facts: Measure DayDrv (W_DRIVER_DAY_AGG) */
where  ( T804.DAY_WID = T10979.DATE_WID and T806.CUSTOMER_WID = T10979.CUSTOMER_WID and T806.CUSTOMER_WID = 188111.0 and T867.MEASURE_TYPE_WID = T10979.MEASURE_TYPE_WID and T10979.CUSTOMER_WID = 188111.0 and T13395.ENTITY_TYPE = 'Driver' and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
group by T804.DAY_WID, T13395.ENTITY_TYPE,  TRUNC(T804.DAY_DATE)
order by c5, c3

+++dev2:2a0000:2a0001:----2010/11/03 06:08:29

-------------------- Sending query to database named AM Data Mart (id: <<49984>>):

select sum(T1021.MEASURE_COUNT) as c1,
TRUNC(T804.DAY_DATE) as c2,
T13395.ENTITY_TYPE as c3,
T804.DAY_WID as c4
from
(select 'Vehicle' as ENTITY_TYPE  from dual union select 'Driver' from dual) T13395,
W_DAY_DIM T804 /* Date (W_DAY_DIM) */ ,
W_CUSTOMER_DIM T806 /* Company (W_CUSTOMER_DIM) */ ,
W_MEASURE_TYPE_DIM T867 /* Metric (W_MEASURE_TYPE_DIM) */ ,
W_FLEET_AGG T1021 /* Facts: Measure Fleet (W_FLEET_AGG) */
where  ( T804.DAY_WID = T1021.DATE_WID and T806.CUSTOMER_WID = T1021.CUSTOMER_WID and T806.CUSTOMER_WID = 188111.0 and T867.MEASURE_TYPE_WID = T1021.MEASURE_TYPE_WID and T1021.CUSTOMER_WID = 188111.0 and T13395.ENTITY_TYPE = 'Driver' and T804.DAY_DATE between TO_DATE('2010-07-11 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') and TO_DATE('2010-07-25 00:00:00' , 'YYYY-MM-DD HH24:MI:SS') )
group by T804.DAY_WID, T13395.ENTITY_TYPE,  TRUNC(T804.DAY_DATE)
order by c4, c3

To Internally Stitch or Not?

You’re going to hate me for this, but there is no single answer; it depends very much on a variety of factors.  Personally I prefer the OBI stitch join syntax – it is simply easier to read during development.  QA’ing the monster queries is akin to how the crew read the black and green screens of the Matrix and made perfect sense of them – not for the inexperienced.  I am a firm believer that if the SQL is not perfect, it’s wrong, so in order to debug/tune the SQL, this is the way to go.  But of course what a person prefers is not always what the technology prefers.

When I say perfect, I mean that the proper function shipping and grouping are being pushed to the database.  If these are incorrect, you may end up with a query returning far too many records to OBI across the network.  The difference here can be huge; in some cases your inner query may return a million rows or more across the network, for OBI to then write to disk and run its MAX() computation on.  With a large recordset, this is a deal breaker; with a small one, you will be ok.  This is why it is imperative to know what your queries are doing and how many rows they return, via the OBI log file:

-------------------- Rows 15, bytes 1680 retrieved from database query id: <<49984>> – Good
-------------------- Rows 150000, bytes 16800000 retrieved from database query id: <<49984>> – Bad
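To illustrate the function-shipping difference with a hedged sketch (the table and column names here are invented for illustration): when the aggregation ships, the database returns a handful of rows; when it doesn’t, every detail row crosses the network for OBI to aggregate itself.

-- Shipped: the database aggregates, and one row per person crosses the network.
select person, max(daily_total) as max_total
from daily_totals
group by person;

-- Not shipped: every detail row crosses the network,
-- and OBI computes the MAX() itself.
select person, daily_total
from daily_totals;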

Ok, so now I’m assuming you’ve done your RPD work and have the correct core SQL being generated.  Now it’s on to determining where the stitch should occur.

Recently I had the opportunity to do some load testing on our pre-production system.

We made 3 realizations that are specific to our system (you may or may not have the same ones):

  1. The database CPU was the bottleneck, performing nested loops to process the stitching
  2. The inner SQL all returned only a few rows per query, usually under 10
  3. We had a tremendous amount of unused OBI Server CPU power

With these inputs, we switched over to letting OBI do the stitching.  The results were dramatic:

  1. At 100 concurrent users, the DB load factor went from 75 to 9 (16 CPU cores)
  2. Avg. report response time improved 9x.  Remember, this is at load, not a single query on an idle box
  3. OBI Server load maxed out at 1% across 3 nodes even with stitching

These results speak for themselves.  A substantial improvement from changing a single checkbox!

Be Careful Of

There are a few things to be careful of when you go from one technique to the other – the direction doesn’t matter much:

  1. Make sure your inner SQL is perfect – that is a prerequisite for this entire topic
  2. Regression test everything when you make any SQL generation switches.  I have seen differences appear going from OBI to DB stitch and also the reverse.  I cannot stress this enough.
  3. Resolving these differences can take some work, particularly when letting OBI do the stitching, as it is a little more strict than the DB stitching.  Example: after letting OBI do the stitching, we tracked a discrepancy down to an incorrect mid-level hierarchy key.  We had defined the level key as State, and the Oracle DB was fine with it.  But OBI didn’t like it, and instead wanted the level key to be Country & State.  Expect to find some difficult RPD configuration issues to resolve.
  4. Nested Aggregations force your hand somewhat, in that they frequently require ROW_NUMBER support in order to generate multi-level SQL (i.e., MAX(SUM())) in a single statement; see the sketch below.  Without it, the results of the inner aggregation will come across the network to OBI – this could be millions of records.  If this is the case and you want OBI to do the joins, make sure that your PERF_PREFER_INTERNAL_STITCH_JOIN is set to Yes.
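Here is a sketch of the kind of multi-level SQL meant in item 4 – the table and column names are hypothetical, and OBI’s real output would also wrap ROW_NUMBER() partitions around the inner query, as in the examples earlier.  The point is that both levels of aggregation run in the database, so only the final rows travel to OBI:

-- Inner SUM() per person per day, outer MAX() per person,
-- all in one statement pushed to the database.
select t.person, max(t.day_total) as max_daily_total
from (select person, day_wid, sum(measure_count) as day_total
      from w_fact_table
      group by person, day_wid) t
group by t.person;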

Conclusion

For many systems, where the stitching occurs will not be a problem.  It depends on the complexity of your queries, the type of hardware you have available, where your bottlenecks are, the planned load, and even the specific metrics involved.  Furthermore, you may not see the benefit of going one way or the other until the system is at full user load, when resource capacities and limits are hit and bottlenecks appear.  I strongly recommend not only a load test, but comparing both options at load to identify which is best.  Of course, before you do this it is imperative that you regression test to make sure you get the same results – do not make the mistake of assuming such a small configuration change can’t have large impacts.

Posted on November 11, 2010, in BI Theory and Best Practices, OBI EE Development, Performance, Repository. 5 Comments.

  1. Hi Jeff,

    Thanks for the great OBI input 😉

    The SQL is really difficult to read with the horizontal scroll bar, and I can’t grab it easily to compare the queries. It would be better to provide them as a file, for instance.

    When you talk about DB Load Factor, do you use a formula to calculate it, or do you mean only the CPU utilization?

    What is your definition of 100 OBIEE concurrent users?
    (How do you simulate them? Is it 100 OBIEE sessions asking for 100 concurrent answers? Do these 100 concurrent users open 100 active connections/sessions in the database?)

    And just to add to your article: a lot of other database feature parameters can lead to a stitch join when a feature is not enabled (LEFT_OUTER_JOIN disabled, for instance, when you have vertical fragmentation).

    Once again, thanks. It’s a great white paper.

    Cheers
    Nico

  2. Thanks for the clarifying questions, Nico.

    The formatting – ugh. I suck at the editing features of WordPress, but I’m not taking all of the blame on this one. 🙂

    The DB load factor was reported from Solaris by the process monitoring tool. An LF of 20 means there are 20 processes requesting (using or waiting for) CPU. So in our before scenario, we had about 60 processes backed up, as we had 16 CPU cores.

    100 concurrent meant 100 LoadRunner user sessions firing queries with a 30-second think time. For us this really meant 100 users firing queries all the time, as we’d have at least one or two slow ones per page. Keep in mind that each page view had many reports on it, and each report fired off up to 7 DB queries. These were heavy pages.

    Good point on other features changing the location of the stitch – very true indeed.

    Jeff M.

    • Thanks for the clarifications.

      I know that for WordPress there is a plugin to highlight code, such as http://alexgorbatchev.com/SyntaxHighlighter/. It is made by Alex Gorbatchev. (I don’t know if it’s the same guy as the one from Pythian: http://www.pythian.com/about/management_team/alex_gorbachev/ )

      I never thought to monitor the machine from the CPU side; it’s still a domain that I have to discover. It seems to say that the join (the stitch) is made entirely in the CPU (primary storage). Is that always true, or did you choose the hardware configuration in order to achieve this goal?

      Also for the stitch join (I know that you know it, but just to add all the possibilities): you can also manage where it will take place just by setting the inner table on one database and the outer on another in the physical layer.

      Many thanks for the definition of a concurrent user. I know now how much time I will allow them to think: only 30 seconds 😉

      Good day
      Nico

  3. One other benefit of doing the stitch join in the BI Server: the queries issued will run concurrently, whereas if they’re in a single statement (e.g. after a WITH) they will run serially. Thus a logical query that issues 4 physical queries, each of which takes 60 seconds, will complete in a little over 60 seconds if they are issued separately (and then joined in OBIEE), whereas it will take 4×60 seconds (plus) if issued as a single SQL statement after a WITH. Of course, the other pros and cons that Jeff mentions still pertain.

  1. Pingback: Build a Performance Layer to Make the BI Apps Fly « Achieving Great BI With Oracle BI
