Big Data for Oracle BI
The term Big Data has existed in some form or another for years but recently has taken on a new and more official meaning. In today’s world of massive internet applications, digital instruments streaming non-stop data, scientific data collection and fraud detection, Big Data has grown far beyond what even a large company used to consider large – into the hundreds of terabytes or even petabytes. Furthermore, Big Data has a large unstructured component to it, whether comments on websites, blog data, internet usage, images or documents. This kind of information typically does not map well to traditional database technologies which rely on a very structured table/column arrangement.
Considering high volume and great variability of data, along with very high uptime and extremely short response times needed, traditional RDBMSs simply won’t work – they will not be able to scale out to provide 1 second response time when a Facebook user posts a picture or visits a friend’s wall when there are millions of users looking at petabytes of data. Thus, completely different kinds of data access and storage technologies are needed, ones which are designed to scale far beyond even a very powerful systems such as Oracle ExaData.
This article discusses Oracle’s view of Big Data and in particular how it pertains to Data Warehousing and Business Intelligence. Keep in mind there are many offerings and capabilities pertaining to the acquisition and use of Big Data which are well beyond the scope of Data Warehousing and BI systems; I’m going to focus on just a slice of it here.
Hadoop and NoSQL
Hadoop is a massively distributed file system approach to storing very large amounts of data with high volumes and a complete lack of predictable structure. Hadoop is a share-nothing architecture which spreads data across many nodes; for some of the larger internet systems we’re talking thousands of separate commodity (read: inexpensive) PC servers. It uses the server’s file system (albeit remapped) to store and access files. An Open Source project, Hadoop was originally derived from Google’s work and is now organized by the Apache project.
NoSQL (confusingly named “Not Only SQL” – a misnomer as there is no SQL access to NoSQL. Go figure.) is another massively distributed system, but it operates with more database like features than Hadoop. Sharing a roughly similar architecture as Hadoop, it relies on spreading data out across a large number of nodes to achieve massive parallelism. Oracle’s NoSQL is a simple Key-Value (KV) pair that is essentially hashed out across its many nodes, all with significant data redundancy and load balancing. A simple way of thinking about a KV store is imagining a 2 column table with 2 fields: Key and Value, with hash partitioning on the Key and each partition is on a different physical node, and within each node it is sorted according to the key. The Value field is just a payload – a document, free form text, etc. – a BLOB in database terminology. Whereas Hadoop is best for completely unstructured data, NoSQL is better for semi-structured data.
Both of these systems are not databases in the traditional sense; they don’t provide a lot of general purpose RDBMS features that we’ve come to rely on. There are no tables or columns, joins, there are no Rollback segments, foreign keys, and no simple SQL language access. Instead, they are lightweight solutions that focus on a specific problem and throw out any overhead features and capabilities not germane to incredible speed with massive amounts of data and massive transaction loads.
It is important to understand that these technologies do not replace a traditional RDBMS; they simply do not have the capabilities and are not optimized for OLTP or DW type usage patterns. You would never put Oracle EBS or SAP on NoSQL or Hadoop, just as you would not move your Excel calculations into PowerPoint. Instead, they should be taken as complimentary technologies to be used when and where the need arises. Much of what Oracle offers and will be discussed in this article pertain to how to make them work together for an integrated solution.
Another way to look at it is that they are simply at a lower level than an RDBMS; they are designed to be big, dumb superfast storage devices and not a database. Their only job is storage and access to large chunks of data; all of the interpretation is up to the application. With this in mind, you might store an entire document as a single object; the database doesn’t know how to interpret the data. Interpretation and organization (for example extracting First Name and Last Name) is entirely up to the application to parse it and make sense of it. Much of the access to these systems is done via program code that you have to specifically write for each use case.
Using the Data in Hadoop and NoSQL for DW & BI
For the purposes of DW and BI, the focus is primarily on how to extract relevant information from the massive quantity of data (aka noise) being collected in these massively parallel systems. This means extracts into a traditional DW system; although you can do some “real-time” analysis, the capabilities are limited at this time. What is up to the application is to determine what information is needed out of all of the noise data and extract only that bit. From there hopefully a much smaller data set can be used within traditional RDBMS solutions, although keep in mind “smaller” can still be very large!
Oracle provides several mechanisms to get data out of Hadoop and NoSQL and into your regular Oracle RDBMS. Most of these technologies are also based on Open Source side projects, but purchasing the Oracle versions provide you with additional features and of course support. Note there are dozens more beyond the scope of this article that may fill a specific need. They all have bizarre names such as Pig, Flume, Oozie, Sqoop, Mahout, and many others; many are included with Oracle’s rebundled version of Cloudera, a company which provides integrated tools to work with Hadoop.
One very key thing to note about using these systems – you must programmatically access it via code. There is no simple, straightforward way to simply do a generic map and access it via SQL; even with some assistance from abstraction layers, there will always be some sort of manual programming that is needed. Time to hire some Java programmers essentially.
Extracting from Hadoop
All data access to Hadoop is via a programming language called MapReduce. These mini programs provide some basic functionality running on each of the nodes, with an aim to reduce the data set as it moves along from all of the data in the HDFS (aka noise) to just the data you want. Note they are to be treated as batch programs; even though they may execute quickly they are still batch, eliminating any sort of true real-time access. There is essentially no getting around using MapReduce and Java programming to do this even though there are some tools to assist.
Oracle’s offering here is in line with their appliance strategy of late: bundle all of the software on a monster piece of hardware (as in 216 CPU cores, 864GB+ memory and 648TB storage) and make it as turn-key as possible. The Oracle Big Data Appliance with its Oracle Big Data Connectors, provides several options for data access to Hadoop:
Method 1: Direct Connector for HDFS (ODCH)
Here an Oracle External Table is defined which points to an OS file that is linked to a MapReduce program responsible for running a batch extract. As an external table, it can then be queried as normal. Note this is a pure “pull” method – define the links and query as needed. Oracle is positioning this connector as an exploratory technology, meaning one would use if for analysis one what you ultimately would like to batch in production.
Method 2: HIVE (Pull) and ODI
HIVE is an open source abstraction and linking layer that maps to output created by a MapReduce program. Using the HIVE processing engine, you define a table in the database and map it to a file on the Hadoop File System (HDFS). When you actually query the table from the database via HIVE, it will execute the MapReduce program to process the query. Recall as the MapReduce is a batch system, it may take some time for it to execute. However, batch is generally acceptable as this is best designed to extract information into a DW table from HDFS in batch mode.
Part of this technique is to use ODI to make the process a bit more automated. Using pre-built coding patterns called Knowledge Modules (KMs) allows you to get up and going more quickly than via a 100% custom coded solution. Note that the KMs leverage HIVE for the access to the HDFS. Note also that other ETL tools such as Informatica have their own connectors and techniques to get at Hadoop data as well.
Method 3: Oracle Loader for Hadoop (Push)
This is analogous to writing an extract and loading it via SQL*Loader or Data Pump – in other words a “push” model. Here you can do this in an online mode over a JDBC/OCI connection directly into the database, or you can stage it in a data file for use by DataPump. Again, custom coded MapReduce code is needed to perform the extract to the HDFS, then a copy command before it can be loaded into the database.
When you look at these 3 methods, they really aren’t any different that writing custom extracts from a different source system. Either you have a gateway from Oracle into the source (ODCH) or write an extract to a file using the source database language and then import the file using either DataPump or an External table.
Extracting from NoSQL
Pulling data from NoSQL is a bit more challenging as the tools are less mature, and Oracle does not offer all of the Open Source options that exist. Staying with Oracle only tools, presently you will need to write Java programs to find the data you want, parse it, and dump it to files for loading by traditional Oracle tools.
Integration with OBI
Based on the above extraction techniques, it is straightforward to see how this data can be integrated into OBI; extract it into a table in your DW/M schema basically. You will need to know what you are looking for out of the massive amount of noise; it’s not a “map everything” kind of exercise. Some of the techniques above allow a pseudo-real-time access to the raw Hadoop data, but keep in mind that batch MapReduce programs will be run when you do this.
Although I don’t want to dig too much into this, there is an additional piece of functionality to add to the mix: R. R is an open source statistical programming language similar to SASS or SPSS. It is designed for statisticians. By itself, R has nothing to do with Big Data, but Oracle provides some enhancements to R to allow it to scale up to much larger volumes.
As part of the R application is an interactive visualization engine. Through scripting and embedding within the database, R output (say an advanced R graph not native to OBI or BI Publisher) can be linked into the UI via output XML files. These XML files can then be consumed by BI Publisher as normal.
All of the technologies mentioned above here come from the Open Source world, but Oracle has done work on each of them to some degree or another and bundled them together. In many cases they have improved the software performance or features, and in all cases they have them pre-configured and optimized for a the Oracle Big Data Appliance.
The Big Data Appliance contains the following pre-configured set of software across its 18 nodes:
- NoSQL Database
- Hadoop, MapReduce
- Cloudera Manager to manage Hadoop. Cloudera is an OEMed product.
- Applications such as HIVE (with MySql metadata store)
- Oracle R – a pure Open source version of R. Not to be confused with Oracle R Enterprise within the Database 11G server
- Oracle Loader for Hadoop
- Oracle Direct Connector for HDFS
- Oracle R connector for Hadoop
For anyone familiar with any of the Exa appliances from Oracle, you know these are massive machines, and of course the BDA is no different: 864GB RAM, 216 CPU Cores, 648TB with 40Gbps connections to other Exa appliances.
It is clear that Oracle is moving everything towards appliances. In the following Oracle diagram this becomes quite obvious:
Although much of Big Data planned to have low-priced commodity hardware, the Oracle Appliances have one nice benefit: a 40Gbps InfiniBand connection between the boxes, something you won’t be able to get with commodity HW connected by traditional Ethernet connections.
Big Data is becoming more and more common at customers in the context of BI and Data Warehousing. Of course it is not for every customer; only those with very massive amounts of unstructured data are likely to need any of these technologies. The challenge in the Business Intelligence world revolves around how to get at this potentially valuable information and integrate it with other data elements to be useful for analysis. As a young technology, expect many iterations and enhancements along the way!
For more information, please read Oracle’s Whitepaper.
Posted on August 29, 2012, in Architecture, Data Warehousing, Other Technologies & Tools and tagged BDA, Big Data, Big Data Appliance, BigData, Cloudera, Hadoop, MapReduce, NoSQL. Bookmark the permalink. 2 Comments.