Approaches to Building BI Applications and Data Warehouses

Original Post date 2/27/09

I’d like to comment a bit on the differences between BI and Data Warehousing as I believe they are commonly confused and used interchangeably.  Having just come from a project which mixed the two together with poor results, I think some discussion on the topic is a good idea and one that is fresh in my head.  This is a long post, so go get some coffee (or Red Bull) first.

Before I get into this, I fully understand and accept that there are varied opinions about this topic.  I am not going to propose a one-size-fits-all philosophy.  There also may be some technical features out there which some may claim make my comments obsolete.  I am going to focus on the purpose of the two and how that drives how you build a BI system.  If you are hoping I’ll get into a Kimball vs. Inmon data warehouse architecture or modeling discussion you will be disappointed.  I recommend you read their books and articles for the details of their solutions.

I am going to approach this by defining two very distinct items at different ends of a spectrum, and then discuss how they can and cannot be merged into the same thing.  The point is to contrast the two needs so conflicts become apparent.

What is the Essence of a Data Warehouse?

I believe it is simply to collect and standardize major data elements together for use somehow.  Somehow may mean directly by a tool or user, it may mean feeding back into an operational system, or it may mean feeding dependent data marts, or even virtual ones.  I’m not interested in discussing data mart philosophy at this point.  For purposes of this discussion, an ODS and a DW are effectively the same thing; I don’t care if one is staging to feed the other or not.

What I am interested in is the purest, most elemental function of a Data Warehouse.  Since data is to be used somehow by other consumers, a DW should have some key attributes:

  • Consolidated & Integrated: Simply copying data from multiple systems over to a database doesn’t really provide much value beyond query processing off-load.  Integration of data is where a DW can provide real benefits and value to downstream consumers.  By performing this activity once, all downstream systems a) don’t have to do it and b) use the same results, which c) means they’re speaking the same language.  Maintaining one list of Customers and Products is an example of the goal here – one reference list for all of the company to use.
  • Cleansed: Again, simply copying raw data provides little value.  Taking the effort to apply cleansing rules, integration rules, de-duping, code lookups, null value replacements, etc., allows such rules to be applied once and consumed many times in an identical fashion (see the sketch after this list).
  • Lossless: The DW cannot lose any of the detail contained in its sources.  If some transformation eliminates a piece of information, such as a complex relationship, then a downstream system will not be able to use it.  The goal here is to assume that you know nothing about how this information might be used, and so prepare for all uses.  This is one area where I think storing data in the data warehouse in a pure dimensional model fails, as a dimensional model assumes some type of usage and has therefore optimized the data structure.  Flexibility and Optimization are opposites.
  • Storage Efficient: As Data Warehouses can be rather large and carry significant storage costs, minimizing data redundancy and storing data efficiently are important.  For companies with smaller data volumes this is not too big a concern, but no one wants to throw money away.
  • Timely: No one can use data from the DW if it isn’t there, so feeds to it should be done as quickly as is needed and not one bit sooner.
  • Historical: Some level of data retention is desirable for downstream systems to consume.  Downstream systems may hold onto data longer than the DW if they have a specific need.
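
To make the “apply once, consume many times” idea concrete, here is a minimal Python sketch of a cleansing pass in a DW load.  The field names and rules are invented for illustration – real rule sets come out of your data profiling and governance work.

    # A hypothetical cleansing pass: standardize keys, normalize names,
    # replace nulls, then de-dupe.  Applied once in the DW load, every
    # downstream consumer sees identical results.
    RAW_CUSTOMERS = [
        {"cust_id": "  c001", "name": "ACME corp",  "region": None},
        {"cust_id": "C001",   "name": "Acme Corp.", "region": "WEST"},
        {"cust_id": "C002",   "name": "Bolts Inc",  "region": "east"},
    ]

    def cleanse(row):
        return {
            "cust_id": row["cust_id"].strip().upper(),        # standardize keys
            "name":    row["name"].rstrip(".").title(),       # normalize names
            "region":  (row["region"] or "UNKNOWN").upper(),  # null replacement
        }

    def dedupe(rows):
        seen = {}
        for row in rows:
            seen[row["cust_id"]] = row   # last record wins; real rules vary
        return list(seen.values())

    clean = dedupe(cleanse(r) for r in RAW_CUSTOMERS)
    print(clean)   # the same two customers, for every downstream system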

Anything beyond these attributes quite frankly is pure gravy.  Some organizations can do much more with their systems as they have solved these problems already.

What is the Essence of a BI Application?

A BI Application supports and provides intelligence for Business Process needs.  Aside from the BI tools and their capabilities (e.g., dashboards, ad-hoc query), a BI Application is tailored to the specific users and the specific tasks they perform.  Thus a BI Application may take on many different shapes depending on the demands of its targeted user community.  This is why there are so many BI tools out there, many of which are completely different in nature, using very different assumptions and technologies.  A real-time predictive model is as much BI as a Scorecard application for executives or a marketing segmentation engine.  All use data as needed for their target users and processes.

Some good attributes of a BI Application might be:

  • Insightful Information: Information that makes a business user “smarter” is what we are after here.  Using the power of computers to make humans more intelligent and productive – isn’t that why computers are in the workplace in the first place?  Raw data elements usually do not provide any great insight.  Total Order $, although certainly a useful metric, is hardly insightful for managing and improving a business.  Applying business knowledge to a process, one might transform a simple order amount sum into a deviation from a regional manager’s committed order amount in order to identify where your sales organization’s problems lie and where your efforts should be focused (a sketch of this follows the list).
  • Timely Information:  Information that is stale is like yesterday’s newspaper.  Whatever the timeliness needs are, the BI Application must align with those needs to be effective.
  • Reliable Information: This one is quite simple: if people don’t trust what the BI Application says, they won’t use it.  Distrust can come from either unreliable data or unreliable business knowledge applied to that data.
  • Known: People must know what the system is doing and how it does it.  A high-powered BI system can apply sophisticated business rules to raw data to arrive at powerful metrics.  Users must understand what these metrics are, what goes into them, and what they mean.  However, this also applies to some of the most basic definitions.  Particularly as the visibility of the BI system’s outputs reaches farther into an organization, the basic definitions of elements become very important.  Sales, Customer Service, and Services must have clear definitions of what a Customer is in order to communicate properly with each other.
  • Usable: If the information is too complex or the BI Tools in use are too cumbersome, the impact of the information can be lost.  My mother has a computer at home, but she really doesn’t use it for anything beyond sending a few emails – she doesn’t know how to use it, so all of the power and capabilities it has are lost on her.  Commonly this boils down to using the right tool(s) for the right audience, and deploying those tools in a user-friendly manner.
  • Performant: The days of waiting 24 hours for an analysis report have been gone for a long time.  People internally expect Google-like response times from their queries (luckily they know that’s not realistic all the time!)  Poor performance frustrates users, and they will find other ways of getting their information, even if that means running through longer manual processes from raw data – at least then they have something to start working on.  Performance risk grows with larger data sets, of course, to the point where a system can become utterly unusable.
  • Historical: It is common for BI Applications to have historical information needs that are much greater than what the DW retains.  Identify the data retention needs and plan for them.
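
Here is a quick Python sketch of the “insightful information” bullet above – turning a raw order amount sum into a deviation from a committed amount.  The regions, numbers and commitment figures are all made up for illustration.

    # Hypothetical data: raw order amounts and each regional manager's
    # committed order amount for the period.
    orders = [("WEST", 120_000), ("WEST", 95_000), ("EAST", 310_000)]
    committed = {"WEST": 250_000, "EAST": 280_000}

    # Raw metric: total order $ by region.
    actual = {}
    for region, amount in orders:
        actual[region] = actual.get(region, 0) + amount

    # Insightful metric: deviation from commitment, which points directly
    # at where the sales organization's problems lie.
    for region, commit in committed.items():
        deviation = actual.get(region, 0) - commit
        print(f"{region}: {deviation:+,} ({deviation / commit:+.1%} vs. commitment)")

    # WEST: -35,000 (-14.0% vs. commitment)  <- focus your efforts here
    # EAST: +30,000 (+10.7% vs. commitment)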

The first word of BI is Business, the first word of DW is Data

BI is the application of knowledge to data in order to produce information valuable to a business process – bp(k * d) = i, if you will.  Data comes from one or more sources such as a data warehouse or a transactional system.  By itself, data is useless.  However, when business experience and knowledge go to work on that data, guided by a specific business process, important insights and intelligence can be identified.  As business processes or knowledge about the business change, the same raw data can be used to develop new intelligence.  Given different processes, the same piece of data can be used in multiple ways.  Think of a Web Server log – marketers can use the log for one purpose while IT uses it for a totally different one (a quick sketch of this follows).
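
A tiny Python sketch of that last point, using an invented Web Server log – the same data, two different business processes, two different pieces of information:

    # Hypothetical log entries; the field names are invented.
    log = [
        {"url": "/pricing", "status": 200, "ms": 120},
        {"url": "/pricing", "status": 200, "ms": 95},
        {"url": "/signup",  "status": 500, "ms": 40},
    ]

    def marketing_view(entries):
        """Marketing's process: which pages draw interest?"""
        hits = {}
        for e in entries:
            hits[e["url"]] = hits.get(e["url"], 0) + 1
        return hits

    def it_view(entries):
        """IT's process: where are the errors and slow responses?"""
        return [e for e in entries if e["status"] >= 500 or e["ms"] > 100]

    print(marketing_view(log))   # {'/pricing': 2, '/signup': 1}
    print(it_view(log))          # the 500 error and the 120 ms request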

As you can see, the two types of systems are totally different in their goals.  Data Warehouses are essentially about the capture, storage and integration of information for use by other consumers, even without knowing who they may be or for what purpose they will use the data.  BI Applications are (mostly) used directly by people, through a variety of tools and platforms, for particular and changing business purposes.

Why then would we ever think that the same solution would work for both?

It doesn’t.

As each has different goals and different data needs, they each require different designs and different approaches to being built.

Building Approach

Use a Bottom-Up Design Approach with a 3NF or Hybrid Model for a DW
Assume the data in your Data Warehouse can be used by anyone for any purpose.  This means that you cannot remove any information whatsoever.  Think about compression algorithms: ZIP and RAR are lossless, meaning they do not lose even one bit of information.  JPG and MP3, however, do in fact lose information – information that you can never get back without the original source.  ZIP and RAR are what you want for a DW, while JPG and MP3 create gaps which at some point will be a problem.

Since you will never know how your information is to be used, forget about it and focus on what you have.  The bottom-up approach works best here, as raw data traditionally sits in some sort of 3NF model.  Lightly de-normalize as desired to simplify or improve loading or query performance, but make sure not to lose any information.

This might work as follows:
1.    Go system by system, and module by module within larger systems.
2.    Bring related pieces over one at a time (e.g., Sales Orders or AR).  Be sure to be thorough so as to not forget the need to support varied and unknown uses.  It is better to do it once than multiple times.
3.    As a major common entity is encountered (aka Reference Data), broad input as to its definition should be sought.  For example, the first time Customer is to be brought over, a corporate definition of the Customer Entity should be agreed upon and built into the DW.  Set up Data Governance programs to define and maintain these definitions and their business rules (a minimal schema sketch follows this list).
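
As a minimal sketch of point 3, here is what a conformed Customer entity might look like, with a cross-reference back to each source system so no detail is lost.  I’m using Python’s built-in sqlite3 just to keep the example self-contained; the table and column names are invented.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- the agreed corporate definition of Customer
        CREATE TABLE dw_customer (
            customer_key INTEGER PRIMARY KEY,
            legal_name   TEXT NOT NULL,
            segment      TEXT
        );

        -- lossless link back to every source system's identifier
        CREATE TABLE dw_customer_xref (
            customer_key  INTEGER REFERENCES dw_customer(customer_key),
            source_system TEXT NOT NULL,   -- e.g. 'CRM', 'ERP'
            source_id     TEXT NOT NULL,
            PRIMARY KEY (source_system, source_id)
        );
    """)

    conn.execute("INSERT INTO dw_customer VALUES (1, 'Acme Corp', 'Enterprise')")
    conn.executemany(
        "INSERT INTO dw_customer_xref VALUES (?, ?, ?)",
        [(1, "CRM", "c-001"), (1, "ERP", "10045")],  # one customer, two systems
    )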

Use a Top-Down Design Approach for BI with Whatever Model Suits Your Technology
Designing a BI Application combines three tracks that meet in the middle:
1.    First and foremost, what does your business need to get its job done?  Design a system that perfectly achieves that goal and does so in an optimal manner.
2.    What data is available in your data warehouse or other sources?  Some business informational needs may not be possible, or may require additional steps to acquire.
3.    What does your technology of choice prefer?  Aside from the obvious differences between MOLAP and ROLAP tools, what kind of data model does your tool prefer?  3NF? Star? Snowflake?  Know the features of your database engine.

The design process must marry all three of these together into an optimal design for your business users, their process, and the tool(s) they use.  I would recommend the following as a basic high-level guideline:
1.    Start with understanding what shape your BI tool prefers its data to be in and what capabilities your database platform has to help you.
2.    Design to that model type to perfectly and optimally meet the needs of the business process.  This is your ideal model.  By definition, any model that differs from this one will be less efficient, capable or performant.  This model will have tool-specific features, but most importantly business-process-specific needs.  It may include special tables that combine data from all over into a single structure.  Performance is key here, so the mantra of “build once, use many” is a primary design goal.  Your goal should be a simple BI Tool implementation that produces simple queries – query complexity = query slowness.  In OBIEE, the more your physical model matches your Business Model the better off you are – less translation.
3.    Review the model (or the basics of an initial model) against data availability and quality to see what can be delivered and what cannot without additional effort.
4.    Identify the effort required to build such a model.  If the effort is high relative to time and budget, identify short-cuts, such as database features, that keep power and performance high with less effort, e.g., Views vs. Tables (see the sketch below).
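
To illustrate the “Views vs. Tables” shortcut in step 4, here is a hypothetical sketch (again using Python’s sqlite3 for self-containment): design the ideal star model first, deliver a derived metric as a view when time is short, and materialize it later if performance demands it.  All names are invented.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- the ideal model: a simple star the BI tool maps cleanly onto
        CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_orders (
            region_key INTEGER REFERENCES dim_region(region_key),
            order_amt  REAL
        );

        -- the shortcut: the derived metric as a view, cheap to build
        CREATE VIEW rgn_order_totals AS
        SELECT d.name, SUM(f.order_amt) AS total_amt
        FROM fact_orders f JOIN dim_region d USING (region_key)
        GROUP BY d.name;
    """)

    # later, when performance matters, materialize the same definition
    conn.execute("""
        CREATE TABLE rgn_order_totals_t AS
        SELECT name, total_amt FROM rgn_order_totals
    """)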

BI and DW Together

This is the particular area where I see many errors in approach by customers and other consultants.  Using a bottom-up approach for a BI system will fail, as will a top-down approach for a DW.  Why?

Problems with Top-Down for a DW

  • A Data Warehouse must be flexible enough to support everyone in a large company, regardless of how its information is used.  Identifying up front all of the ways information can be used will take so long that the first requirements will be obsolete long before the last set of requirements is identified.  Instead, work source system by source system, module by module and entity by entity.
  • Advanced metrics and business transformations commonly are not a part of the DW, but rather a part of the various BI systems.  They simply don’t belong there, as they are “too far away” from the BI tools and the user community, and they usually are not shared across many organizations, limiting their reuse.

Problems with Bottom-Up for a BI Application

  • Any model that is not driven directly from the BI tools and Business Process will always be less efficient than the ideal model.  When a model differs from the ideal model, some query-time transformation will always have to be performed, which will have an impact on performance.  There is simply no getting around this.  Faking the ideal model with database views only fools the BI tool or user – it does not fool the database engine, which still has to perform the translation (see the sketch after this list).  This is why heavy use of views is considered a bad idea – they are all too frequently indicative of additional translations, meaning a non-ideal model.  Some metrics require pages of SQL code to build – go ahead, try to do that through a view – I dare you.
  • Sophisticated metrics and information are specific to the BI Application and come directly from the users’ mouths.  Such information cannot come from a centralized DW group, as they are in communication with neither the business users nor the BI tool experts.
  • Additionally, there is a scalability problem: having the DW group not only add new source systems to the DW but also create advanced metrics for multiple business units is simply too much.  Instead, have the DW team focus on migrating over raw data, and let multiple BI teams work on converting that raw data into information tailored to the many business processes.
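
To see that a view only hides work from the user, not from the engine, ask the database itself.  A hypothetical sketch with Python’s sqlite3 – other engines expose the same idea through their own EXPLAIN facilities:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (cust_id INTEGER, amt REAL);
        CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, region TEXT);
        CREATE VIEW v_region_sales AS
        SELECT c.region, SUM(o.amt) AS amt
        FROM orders o JOIN customers c ON o.cust_id = c.cust_id
        GROUP BY c.region;
    """)

    # the plan shows the scan and join still happen at query time;
    # the view changed nothing about the work the engine performs
    for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM v_region_sales"):
        print(row)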

Technical Aspects

So what do BI and DW systems look like on a diagram?  I think there are several architectures that an organization can use, and I’m not going to pick one out over the others.  A few (my names):

  • Totally Independent Data Marts, no DW – Sort of the Kimball Bus Architecture, but he’s altered it over the years as it became apparent that wasn’t going to work.  This is common in smaller and fast growing companies – a bunch of point solutions with no integration at all.  The DW is the sum of all Data Marts, like a virtual DW.  Practically no common definitions exist across marts.
  • Data Marts with DW (Hub & Spoke) – Extremely common today.  The DW feeds much of the information to a variety of Data Marts specific to a project, org, process, source system or region.  These can be broken up into reference and transactional, and the more that is migrated into the DW, the more sharing and common definitions can be achieved.  A major benefit (and reason for this architecture) is the need to have data & information in a Data Mart that is not in the DW due to policy, timing or leverage considerations.  I don’t think 100% dependent data marts are realistic because of this.  Independent marts, and portions of hybrid marts duplicating data that exists in the DW, should be targeted for a replacement feed from the DW to ensure the same data is used throughout.  Some level (none to a lot) of direct DW querying is permitted.  Furthermore, minimal cross-mart definitions exist.
  • Consolidated Marts – This essentially puts all data marts (independent & dependent) onto the same database platform as the DW, usually in different schemas to allow for local control.  Here BI Project teams still function, just on a shared platform.  Data can be shared directly, reducing some redundancy and loading issues.  Some sort of centralized control is advisable at this point to help curtail Data Mart explosion.  Minimal common definitions across marts remain the norm, which is their Achilles heel.  Serious horsepower is needed for this option and all further options.  Hybrid marts requiring some external information will still exist, and policies should be flexible enough to handle them.
  • Virtual Marts – Essentially the opposite of the Kimball Virtual DW option, this relies heavily on the horsepower of the database platform to make virtual Data Marts.  Here BI Project teams still function, just on a shared platform and less independently.  A good level of common definitions can be achieved with this approach.  The DW team will also need to grow substantially in size.  The interesting thing about this is that the marts are designed using a good top-down process (in order to get its benefits), but then the benefits of building tables are thrown out for a pure view-based solution.  Remember, the math involved in the translation does not go away as far as the database is concerned, and hardware does not solve software design flaws.  This can work for smaller data sets, and can be augmented with materialization as needed to improve performance.

One problem with the virtual model is that it has no place for data that you don’t want in the DW.  Need some external marketing data for your marketing system?  Sorry, that data doesn’t belong in the DW, so you’ll have to figure out some horrible workaround.  Not exactly what I would call “business process optimized,” is it?

  • Super DW – Where there is only one of everything, running off of one data model.  I don’t really think this is realistic, for many of the reasons I’ve already mentioned, from performance to tools.  What if you want a little Essbase in your life?  Now you have a dependent external Data Mart, and your Super DW isn’t quite as super anymore.

I think the realistic goal (for some, a very long term goal) is a blend of Consolidated and Virtual, keeping in mind that there will always be a few Independent or Hybrid marts out there.  As long as virtualizing doesn’t cause a performance hit, and as long as the model is driven from a top-down perspective, I think it’s OK to do so.  If you go virtual and performance begins to suffer, then start building tables and aggregates to speed things up.  Remember, an aggregate is not redundant data, as I have heard a few times.  Maintaining two copies of Sales Orders is redundant; data that has been transformed substantially, aggregated, or combined with other data is not (a short sketch of the distinction follows).
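
A short sketch of that distinction, with invented names (Python’s sqlite3 again): the aggregate below is derived from the detail and rebuilt on each load – it is not a second copy of Sales Orders.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales_orders (order_dt TEXT, region TEXT, amt REAL);
        INSERT INTO sales_orders VALUES ('2009-01-15', 'WEST', 500.0);
        INSERT INTO sales_orders VALUES ('2009-01-20', 'WEST', 250.0);

        -- derived, not redundant: a substantial transformation of the detail
        CREATE TABLE agg_sales_monthly AS
        SELECT substr(order_dt, 1, 7) AS yr_month, region, SUM(amt) AS amt
        FROM sales_orders GROUP BY 1, 2;
    """)

    print(conn.execute("SELECT * FROM agg_sales_monthly").fetchall())
    # [('2009-01', 'WEST', 750.0)]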

I’ve also left out a ton of other things – this section could be enormous.  I’m sure your Monster Energy Drink is wearing off by now anyway.

Example

A recent project attempted to build virtual marts using a bottom-up pass-through approach, and the results were quite frankly terrible.  The policies put in place prevented a decent portion of our data from being integrated together, requiring OBIEE to be mapped onto an Essbase cube for some of it – there was zero integration aside from a report on a dashboard page.

However, the major problem they faced was the insistence on building a BI Application with a pure bottom-up approach.  They essentially built their 3NF tables, made some basic views on top of them and mapped them up through OBI.  The analysts on the BI Application project had no idea what the metrics meant or if they could be used together, and certainly couldn’t discuss them with the business community.  I was shocked and horrified to see an analyst asking a DBA what the definition of a metric was.

Sure enough, guess what the big killer of the project was?  Poor performance.  They refused to deviate from the bottom-up virtual DW model and paid the price for it – a maximum of only a few simultaneous users, even on a large box.  During this project, minimal to no design was performed on key aspects of the system – whatever tables the DW group provided were simply mapped up in an ad-hoc manner.  If something didn’t work right, they would go back and address it.  Many gaps were found when report-building time came along, as the project team did not understand the informational and data needs, and how the reports impacted them.  Again, it was all ad-hoc development.  Needless to say, the testing period was hectic for the developers and unusable for the testers.

Conclusion

The key thing that I wanted to communicate in this post is the process used for obtaining and designing the underlying models.  BI and DW have two totally different goals, and as such, different approaches tailored to those goals are required.  It is extremely important to ensure your mart is well tailored to business process and tool needs first and foremost.  Where to build it (in the DW or outside it) and how to build it (virtual or materialized) are secondary concerns for today’s post.
