Liberate your data

Intelligence is all about knowledge. This website is dedicated to sharing expertise on Oracle BI. More »

 

Using Oracle Big Data SQL to Add Dimensions and Attributes to Hadoop Reporting

In a previous post I looked at using Oracle’s new Big Data SQL product with ODI12c, where I used Big Data SQL to expose two Hive tables as Oracle external tables, and then join them using the BETWEEN operator, something that’s not possible with regular HiveQL. In this post I’m going to look at using Oracle Big Data SQL with OBIEE11g, to enable reporting against Hive tables without the need to use Hive ODBC drivers and to bring in reference data without having to stage it in Hive tables in the Hadoop cluster.

In this example I’ve got some webserver log activity from the Rittman Mead Blog stored as a Hive table in Hadoop, which in its raw form only has a limited amount of descriptive data and wouldn’t be all that useful to users reporting against it using OBIEE. Here are the contents of the Hive table as displayed via SQL*Developer.


When I bring this table into OBIEE, I really want to add details of the country that each user is visiting from, and also details of the category that each post referenced in the webserver logs belongs to. Tables for these reference data items can be found in an accompanying Oracle database.


The idea then is to create an ORACLE_HIVE external table over the Hive table containing the log activity, and then import all of these tables into the OBIEE RPD as regular Oracle tables. Back in SQL*Developer, connected to the database that has the link set up to the Hadoop cluster via Big Data SQL, I create the external table using the new ORACLE_HIVE external table access driver.

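The DDL follows the standard ORACLE_HIVE external table pattern; a rough sketch is below, where the column list and the Hive table name in the access parameters are illustrative assumptions and would need to match your own Hive table definition.

-- Sketch only: the columns and the com.oracle.bigdata.tablename value are assumptions
CREATE TABLE bda_output.access_per_post_exttab (
  host          VARCHAR2(100),
  request_date  VARCHAR2(40),
  post_id       NUMBER,
  title         VARCHAR2(500),
  author        VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename = default.access_per_post
  )
)
REJECT LIMIT UNLIMITED;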

And now with the Hive table exposed as the Oracle external table BDA_OUTPUT.ACCESS_PER_POST_EXTTAB, I can import all four tables into the OBIEE repository.


I can now create joins across the two Oracle schemas and four tables.


Then I create a business model and presentation model to define a simple star schema against the combined dataset.


Once the RPD is saved and made available to the Presentation layer, I can go and create some simple reports against the Hive and Oracle tables, with the Big Data SQL feature retrieving the Hive data using Smart Scan technology running directly on the Hadoop cluster – bypassing MapReduce, doing the filtering and projection on the cluster, and returning just the results dataset to the Exadata server running the Oracle SQL query.


In the previous ODI12c and Big Data SQL posting, I used the Big Data SQL feature to enable a join between the Hive table and a table containing IP address range lookups using the BETWEEN operator, so that I could return the country name for each visitor to the website. I can do a similar thing with OBIEE, by first recreating the main incoming fact table source as a view over the ORACLE_HIVE external table and adding an IP integer calculation that I can then use for the join to the IP range lookup table, also taking the opportunity to convert the log-format date string into a proper Oracle DATE datatype.

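As a rough sketch of what that view might look like (the column names in the external table and the exact log date format are assumptions here, so treat this as illustrative rather than the actual DDL):

-- Illustrative sketch only: column names and the log date format are assumptions
CREATE OR REPLACE VIEW bda_output.access_per_post_v AS
SELECT post_id,
       title,
       author,
       -- convert the log-format date string into a proper Oracle DATE
       TO_DATE(SUBSTR(request_date, 1, 20), 'DD/MON/YYYY:HH24:MI:SS') AS request_dt,
       -- turn the dotted-quad IP address into an integer for the range join
         TO_NUMBER(REGEXP_SUBSTR(host, '[0-9]+', 1, 1)) * 16777216
       + TO_NUMBER(REGEXP_SUBSTR(host, '[0-9]+', 1, 2)) * 65536
       + TO_NUMBER(REGEXP_SUBSTR(host, '[0-9]+', 1, 3)) * 256
       + TO_NUMBER(REGEXP_SUBSTR(host, '[0-9]+', 1, 4)) AS ip_integer
FROM   bda_output.access_per_post_exttab;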

I then use that view to join to a new table I’ve imported from the BLOG_REFDATA Oracle schema, which contains the IP range lookups.


Now I can add country as a dimension, and create reports that break down site visits by country of access.
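Conceptually, the physical SQL pushed down to the Oracle database for such a report is along these lines; this is a simplified sketch using the object names assumed above, and the SQL the BI Server actually generates will be rather more verbose:

-- Simplified sketch of a country breakdown query; object and column names are assumptions
SELECT r.country_name,
       COUNT(*) AS page_views
FROM   bda_output.access_per_post_v a
       JOIN blog_refdata.ip_country_ranges r
         ON a.ip_integer BETWEEN r.ip_from AND r.ip_to
GROUP  BY r.country_name
ORDER  BY page_views DESC;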


Similarly, I can break the date column in the view over the Hive external table out into its own logical dimension table, and then create some reports to show site access over time.


The final RPD then has the Hive-sourced fact view at its centre, joined to the post category, country and date dimensions.


If you’re interested in reading more about Oracle Big Data SQL, I also covered it earlier on the blog around the launch date, with this post introducing the feature and another looking at how it extends Oracle security over your Hadoop cluster.

Rittman Mead BI Forum 2015 Call for Papers Now Open – Closes on Jan 18th 2015

The Call for Papers for the Rittman Mead BI Forum 2015 is currently open, with abstract submissions open until January 18th 2015. As in previous years the BI Forum will run over consecutive weeks in Brighton, UK and Atlanta, GA, with the provisional dates and venues as below:

  • Brighton, UK : Hotel Seattle, Brighton, UK : May 6th – 8th 2015
  • Atlanta, GA : Renaissance Atlanta Midtown Hotel, Atlanta, USA : May 13th-15th 2015

Now in its seventh year, the Rittman Mead BI Forum is the only conference dedicated entirely to Oracle Business Intelligence, Oracle Business Analytics and the technologies and processes that support them – data warehousing, data analysis, data visualisation, big data and OLAP analysis. We’re looking for sessions around tips & techniques, project case-studies and success stories, and sessions where you’ve taken Oracle’s BI products and used them in new and innovative ways. Each year we select around eight-to-ten speakers for each event, along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January and editorial input from myself, Jon Mead, Charles Elliott and Jordan Meyer.


Last year we had a big focus on cloud, and a masterclass and several sessions on bringing Hadoop and big data to the world of OBIEE. This year we’re interested in project stories and experiences around cloud and Hadoop, and we’re keen to hear about any Oracle BI Apps 11g implementations or migrations from the earlier 7.9.x releases. Getting back to basics, we’re always interested in sessions around OBIEE, Essbase and data warehouse data modelling, and we’d particularly like to encourage session abstracts on data visualisation, BI project methodologies and the incorporation of unstructured, semi-structured and external (public) data sources into your BI dashboards. For an idea of the types of presentations that have been selected in the past, check out the BI Forum 2014, 2013 and 2012 homepages, or feel free to get in touch via email at mark.rittman@rittmanmead.com.

The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues if you can speak at both. All sessions this year will be 45 minutes long, and we’ll be publishing submissions and inviting potential attendees to vote on their favourite sessions towards the end of January. Other than that – have a think about abstract ideas now, and make sure you get them in by January 18th 2015 – just over a week from now!

Top 10 Rittman Mead Blog Posts from 2014

It’s the afternoon of New Year’s Eve over in the UK, so to round the year off here are the top 10 blog posts from 2014 on the Rittman Mead blog, based on Google Analytics stats (page views for 2014 in brackets; only articles posted in 2014 are included).

  1. Using Sqoop for Loading Oracle Data into Hadoop on the BigDataLite VM – Mark Rittman, March 22, 2014 (8466)
  2. OBIEE Dashboard prompt: at least one mandatory – Gianni Ceresa, March 17th 2014 (7683)
  3. Thoughts on Using Amazon Redshift as a Replacement for an Oracle Data Warehouse – Peter Scott, February 20th 2014 (6993)
  4. The Secret Life of Conditional Formatting in OBIEE – Gianni Ceresa, March 26th 2014 (5606)
  5. Trickle-Feeding Log Files to HDFS using Apache Flume – Mark Rittman, May 18th 2014 (5494)
  6. The State of the OBIEE11g World as of May 2014 – Mark Rittman, May 12th 2014 (4932)
  7. Date formatting in OBIEE 11g – setting the default Locale for users – Robin Moffatt, February 12th 2014 (4840)
  8. Automated Regression Testing for OBIEE – Robin Moffatt, Jan 23rd 2014 (4040)
  9. OBIEE 11.1.1.7, Cloudera Hadoop & Hive/Impala Part 2 : Load Data into Hive Tables, Analyze using Hive & Impala – Mark Rittman, Jan 18th 2014 (3439)
  10. Introduction to Oracle BI Cloud Service : Product Overview – Mark Rittman, Sep 22nd 2014 (3190)

In all, the blog in one form or another has been going for 10 years now, and our most popular post over the same period, including articles from earlier years, is Robin Moffatt’s “Upgrading OBIEE to 11.1.1.7” – well done Robin. To everyone else, have a Happy New Year and a prosperous 2015, and see you next year when it all starts again!

Data Integration Tips: ODI – One Data Server with several Physical Schemas

Yes, I’m hijacking the “Data Integration Tips” series of my colleague Michael Rainey (@mRainey) and I have no shame!

DISCLAIMER
This tip is intended for newcomers to the ODI world and is valid for all versions of ODI. It’s nothing new; it has been covered by other authors on different blogs. But I see so many people struggling with this on the ODI Space on OTN that I wanted to explain it in full detail, with all the context, in my own words. So next time I can just post a link to this instead of explaining it from scratch.

The Problem

I’m loading data from one schema to another on the same Oracle database, but it’s slower than when I write a SQL insert statement manually. The bottleneck of the execution is in the steps from the LKM SQL to SQL. What should I do?

Why does it happen?

Loading Knowledge Modules (LKMs) are used to load data from one Data Server to another. An LKM usually connects to both the source and the target Data Server and executes some steps on each of them. This is required when working with different technologies or different database instances, for example. So if we define two Data Servers to connect to our two database schemas, we will need an LKM.

In this example, I will load a star schema model in the HR_DW schema, using the HR schema from the same database as the source. Let’s start with the approach using two Data Servers. Note that here we connect to each Data Server directly with the corresponding database schema.

Two Data Servers connecting to the same database instance, using directly the database schema to connect.

And here are the definitions of the Physical Schemas:

Physical Schemas

Let’s build a simple mapping using LOCATIONS, COUNTRIES and REGIONS as sources, denormalizing them and loading them into a single flattened DIM_LOCATIONS table. We will use left outer joins to be sure we don’t miss any location, even if it has no associated country or region. We will populate LOCATION_SK from a sequence and use an SCD2 IKM.
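As an illustration of the target, the flattened SCD2 dimension might be defined along these lines; apart from LOCATION_SK and the DIM_LOCATIONS_SEQ sequence used later in this post, the column names here are assumptions and the table in the virtual machine may differ:

-- Illustrative sketch of the flattened SCD2 target table
CREATE TABLE hr_dw.dim_locations (
  location_sk   NUMBER        NOT NULL,  -- surrogate key, populated from the sequence
  location_id   NUMBER        NOT NULL,  -- natural key from HR.LOCATIONS
  city          VARCHAR2(30),
  country_name  VARCHAR2(40),
  region_name   VARCHAR2(25),
  start_date    DATE,                    -- SCD2 validity start
  end_date      DATE,                    -- SCD2 validity end
  current_flag  NUMBER(1)                -- 1 = current version of the row
);

CREATE SEQUENCE hr_dw.dim_locations_seq;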

Mapping - Logical tab

If we check the Physical tab, we can see two different Execution Groups. This means the Datastores are in two different Data Servers and therefore an LKM is required. Here I used LKM SQL to SQL (Built-In), which is quite generic and not particularly designed for Oracle databases. Performance might be better with a technology-specific KM, like LKM Oracle to Oracle Pull (DB Link). By choosing the right KM we can leverage technology-specific concepts – here Oracle database links – which often improve performance. But still, we shouldn’t need any database link, as everything lies in the same database instance.

Mapping - Physical tab

 

Another issue is that the temporary objects needed by the LKM and the IKM are created in the HR_DW schema. These objects are the C$_DIM_LOCATIONS table created by the LKM to bring the data into the target Data Server, and the I$_DIM_LOCATIONS table created by the IKM to detect whether a new row is needed or an existing row must be updated according to the SCD2 rules. Even though these objects are deleted in the clean-up steps at the end of the mapping execution, it would be better to create them in another schema rather than in the target schema, which we want to keep clean.

The Solution

If the source and target Physical Schemas are located on the same Data Server – and the technology can execute code – there is no need for an LKM. So it’s a good idea to reuse the same Data Server as much as possible for data coming from the same place. In fact, the Oracle documentation about setting up the topology recommends creating an ODI_TEMP user/schema on every RDBMS and using it to connect.
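If such a user doesn’t exist yet, creating it only takes a couple of statements, along the lines of this sketch (the password, tablespace and exact privileges are placeholders to adapt to your own standards):

-- Minimal sketch of a dedicated connection/staging user
CREATE USER odi_temp IDENTIFIED BY ChangeMe_123
  DEFAULT TABLESPACE users
  QUOTA UNLIMITED ON users;

GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO odi_temp;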

This time, let’s create only one Data Server with two Physical Schemas under it, and map them to the existing Logical Schemas. Here I will use the name ODI_STAGING instead of ODI_TEMP, because I’m using the excellent ODI Getting Started virtual machine and that schema is already in there.

One Data Server with two Physical Schemas under it

As you can see in the Physical Schema definitions, no other password is provided to connect to HR or HR_DW directly. At run-time, our agent will use a single connection to ODI_STAGING and execute all the code through it, even when it needs to populate HR_DW tables. This means we need to be sure that ODI_STAGING has all the required privileges to do so.

Physical schemas

Here are the privileges I had to grant to ODI_STAGING:

-- read access to the source tables in the HR schema
GRANT SELECT ON HR.LOCATIONS TO ODI_STAGING;
GRANT SELECT ON HR.COUNTRIES TO ODI_STAGING;
GRANT SELECT ON HR.REGIONS TO ODI_STAGING;

-- DML on the target dimension, plus read access to its surrogate key sequence
GRANT SELECT, INSERT, UPDATE, DELETE ON HR_DW.DIM_LOCATIONS TO ODI_STAGING;
GRANT SELECT ON HR_DW.DIM_LOCATIONS_SEQ TO ODI_STAGING;

Let’s now open our mapping again and go to the Physical tab. We now have only one Execution Group and there is no LKM involved. The generated code is a simple INSERT AS SELECT (IAS) statement, selecting directly from the HR schema and loading into the HR_DW schema without any database link. Data is loaded faster and our first problem is addressed.
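As a simplified sketch, the statement has this general shape; in reality the SCD2 IKM first loads an I$_DIM_LOCATIONS flow table in ODI_STAGING and maintains the SCD2 housekeeping columns, which are left out here:

-- Simplified illustration only: SCD2 housekeeping columns omitted
INSERT INTO hr_dw.dim_locations
  (location_sk, location_id, city, country_name, region_name)
SELECT hr_dw.dim_locations_seq.NEXTVAL,
       l.location_id,
       l.city,
       c.country_name,
       r.region_name
FROM   hr.locations l
       LEFT OUTER JOIN hr.countries c ON c.country_id = l.country_id
       LEFT OUTER JOIN hr.regions   r ON r.region_id  = c.region_id;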

Mapping - Physical tab without LKM

Now let’s tackle the second issue we had, with temporary objects being created in the HR_DW schema. If you scroll back up to the Physical Schema definitions, you can see that I used ODI_STAGING as the Work Schema in all the Physical Schemas for that Data Server. This way, all the temporary objects are created in ODI_STAGING instead of the source or target schema. We can also be sure that we won’t hit any issues with missing privileges, because our agent connects directly as ODI_STAGING.

So you can see there are a lot of advantages to using a single Data Server when the sources come from the same place. We get rid of the LKM, and the schema used to connect can also be used as the Work Schema, so we keep the other schemas clean of temporary objects.

The only thing you need to remember is to grant ODI_STAGING (or ODI_TEMP) the right privileges on all the objects it needs to handle. If your IKM has a step to gather statistics, you might also want to grant ANALYZE ANY. If you need to truncate a table before loading it, you have two approaches. You can grant DROP ANY TABLE to ODI_STAGING, but this might be a dangerous privilege to give in production. A safer way is to create a stored procedure called ODI_TRUNCATE in each of the target database schemas. This procedure takes a table name as a parameter and truncates that table using an EXECUTE IMMEDIATE statement. You can then grant execute on that procedure to ODI_STAGING and edit your IKM step to call the procedure instead of using the TRUNCATE syntax.
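A minimal sketch of such a procedure is below; the DBMS_ASSERT call is an addition of mine to guard against bad input, and you would create the procedure in each target schema (HR_DW in this example) so that it truncates that schema’s own tables:

-- Sketch: one copy per target schema, truncating that schema's own tables
CREATE OR REPLACE PROCEDURE hr_dw.odi_truncate (p_table_name IN VARCHAR2) AS
BEGIN
  -- DBMS_ASSERT validates the identifier and protects against SQL injection
  EXECUTE IMMEDIATE 'TRUNCATE TABLE ' || DBMS_ASSERT.SIMPLE_SQL_NAME(p_table_name);
END odi_truncate;
/

GRANT EXECUTE ON hr_dw.odi_truncate TO odi_staging;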

 

That’s it for today. I hope this article helps people understand the reason for that Oracle recommendation and how to implement it. Stay tuned on this blog and on Twitter (@rittmanmead, @mRainey, @markrittman, @JeromeFr, …) for more tips about Data Integration!

Oracle BI By Bakboord – 2014 in review

The WordPress.com stats helper monkeys prepared a 2014 annual report for this blog. Here’s an excerpt: Madison Square Garden can seat 20,000 people for a concert. This blog was viewed about 68,000 times in 2014. If it were a concert at Madison Square Garden, it would take about 3 sold-out performances for that many people… Read more: Oracle BI By Bakboord – 2014 in review