Tag Archives: Obiee

Introducing Rittman Mead’s New User Engagement Service


Although you’ve been hearing a lot on the blog recently about extending Oracle BI&DW with Hadoop and big data technologies, another initiative Rittman Mead have been working on over the past couple of months is user engagement. All too often developers create BI systems that meet all an organisation’s technical goals, but then fail to get adopted by users for reasons we can’t understand; all we know is that our brilliantly-architected system just doesn’t seem to engage users, and that’s just as fatal to a project as the ETL not working or the underlying technology being out-of-date.

Surveys have shown that adoption rates for new BI systems are often as low as 25%, because we don’t focus enough on the user experience (UX) or user interface, queries run too slowly, the data or reports just aren’t relevant, or the overall experience just isn’t up to the standard that internet users expect now. If you’re a BI manager or a CIO for your organisation, it’s essential that you know whether the BI systems you’re providing for your users are actually being used. It’s quite often the case that a BI system provided by the IT department is thought to be well-used by the end-users, but when you check the usage stats you find that engagement has really fallen off since the initial rollout of the system. The worst thing is that you probably don’t really know why; all you know is that users aren’t happy and they’re now looking to implement some self-service tools that’ll break your data governance model.

To help you improve user engagement for your BI system and increase the return on your investment in Oracle technology, Rittman Mead are developing a User Engagement Service to address these issues, along with a User Engagement Toolkit developed by the likes of Robin Moffatt, Jordan Meyer, Tom Underhill and other OBIEE and UX experts within Rittman Mead. Initial details of the service are on our Rittman Mead User Engagement Service homepage and the Rittman Mead User Engagement Service Datasheet. Look out for more information on these services over the coming days and weeks, and if you’re interested in getting involved in our early-adopter program you can sign up using the form below and we’ll get back to you shortly.

Combining Oracle Big Data Discovery and Oracle Visual Analyzer on BICS

So now that Oracle Visual Analyzer is out as part of Oracle BI Cloud Service, and Visual Analyzer (VA) is due to ship on-premise as part of OBIEE12c sometime in the next twelve months, several of our customers have asked us if they need both VA and Oracle Big Data Discovery if they’re looking to analyse Hadoop data as part of a BI project. It’s an interesting question so I thought it’d be useful to go through my thoughts on how the two tools work together, when to use one, and when to use the other.

Taking our standard “big data” dataset of website log activity, Twitter mentions and page details from our WordPress blogging software, before Visual Analyzer came along the two usual ways we’d want to analyze these datasets were either a traditional BI metrics analysis-type scenario, or a data discovery/visualization scenario where we’re more interested in the content of the data than in precise metrics. My half of the recent BI Forum 2015 Masterclass goes through these two scenarios in detail (presentation slides in PDF format here), and it’s Big Data Discovery that provides the more “Tableau”-type experience with fast point-and-click access to both datasets joined together on their common website page URL details.


Now we have Visual Analyzer though, things get interesting; in my article on Visual Analyzer within BICS I showed a number of data visualisations that look pretty similar to what you’d get with Big Data Discovery, and when we have VA available on-site as part of OBIEE12c we’ll be able to connect it directly to Hadoop via Cloudera Impala, potentially analyzing the whole dataset rather than the (representative) sample that Big Data Discovery loads into its Endeca Server-based engine.


So if the customer is looking to analyze data held in Hadoop and Visual Analyzer is available, where’s the value in Big Data Discovery (BDD)? To my mind there are three areas where BDD goes beyond what VA can do, or helps you perform tasks that you’ll need to do before you can work with your data in VA:

  • The initial data discovery, preparation and cleansing that you’d otherwise have to do using HiveQL or an ETL tool such as ODI12c
  • Providing you with a high-level overview and landscape of your data, when that’s more important to you at the time than precise counts and drill-down analysis
  • Understanding how data joins together, and how best to use your datasets in terms of metrics, facts, dimensions and so forth

Taking the data preparation and cleansing part first, I’ve covered in several blogs over the past couple of years how tools such as ODI can be used to create formal, industrialized data pipelines to ingest, prepare and then summarise data coming into your Hadoop system, and how you can drop down to languages such as HiveQL, Pig and Spark to code these data transformations yourself. In the case of my webserver log, Twitter and page details datasets this work would include standardising URL formats across the three sources, geocoding the IP addresses in the access logs to derive the country and city for site visitors, turning dates and times in different formats into ones that work as Hive timestamps, and so forth. Doing this all using ODI and/or HiveQL can be a pretty technical task, so this is where BDD comes in useful, even if VA and an OBIEE RPD are the final destination for the data.


Datasets that you transform and enrich in Big Data Discovery can be saved back to Hive as new Hive tables, or exported out as files for you to load into Oracle using SQL Developer, or upload into BICS to use in Visual Analyzer. Where BDD then becomes useful is giving you a quick, easy-to-use overview of your dataset before you get into the serious business of defining facts, dimensions and aliases against these three Hive tables. The screenshots below show a couple of typical Big Data Discovery Studio data visualisation pages against the webserver logs dataset, and you can see how easy it is to create simple charts, tag clouds and maps against the data you’re working with. The aim is to give you an overview of the data, and help you understand its contents and “shape”, before moving further down the curation process and applying formal structures to the data.


Where things get harder to do within Big Data Discovery is when more-and-more formatting, complex joining and “arranging” of the data is required; for example, BDD gives you a lot of flexibility in how you join datasets, but this flexibility can be confusing for end-users if they’re then presented with every possible variation of a three-table join rather than having the data presented to them as simple facts and dimensions. But this is how we’d really expect it – if you go back to the logical data architecture I went through in the blog post a while ago about the updated Oracle Information Management Reference Architecture, the trade-off in using schema-on-read data reservoirs is that this data, although quick and cheap to store, requires a lot more work to be done each time you access the data to get “value” from it.


OBIEE, in contrast, makes you define your data structures in-full before you present data to end-users, dividing data in the three datasets into measures (for the fact tables) and attributes (for dimensions) and making it possible to add more dimension lookups (for a date dimension, for Twitter users in this case) and separate the overall set of data into more focused subject areas. Working with the dataset on the on-premise version of OBIEE first, the RPD that I created to present this data in a more formal, dimensional and hierarchical way to users looked like this:


I can leave this RPD connected directly to the underlying Hive and Impala tables if I want to use just Answers and Dashboards. For the time being, though, I need to export the underlying Hive tables either into CSV files or into an Oracle Database before uploading into Visual Analyzer; come OBIEE12c this should all be seamless. What users are presented with when they go into Visual Analyzer is then something like this:


Notice how the various attributes of interest are grouped into fact and dimension table folders, and there’s a simple means to add calculations, change the visualisation type and swap chart settings around. Note also that the count on the screen is the actual count of records in the full dataset, not the sample that BDD takes in order to provide an overview of values and distribution in the full dataset. Whilst it’s relatively easy to create a line chart, for example, to show tweets per user within BDD, using Visual Analyzer it’s just a case of double-clicking on the relevant measures and attributes on one side of the page, selecting and arranging the visualisation and applying any filters using dialog boxes and value-selectors – all much more familiar and obvious to BI users.


Enrichment to the data that I’ve done in Big Data Discovery should in most cases be able to come through to Visual Analyzer; for example, I used Big Data Discovery’s text enrichment features to determine the sentiment of blog post titles, tweets and other commentary data, and I could use the latitude and longitude values derived during the visitor IP address geocoding to plot site visitors on a map. Using the sentiment value derived from the post title, tweet contents and other textual data, I can create a chart of our most popular posts mentioned on Twitter and colour the bars to show how positive, or negative, the comments about the post were.


The only thing that Visual Analyzer can’t yet do that would be useful is to include more than one subject area in a project. To analyze the number of tweets and the number of page views for posts in a scatter chart, for example, I currently have to create a separate subject area that includes both sets of facts and dimensions, though I understand VA on BICS will have the ability to include multiple subject areas in a forthcoming release.


So in summary, I’d say that Big Data Discovery, and Visual Analyzer as part of BI Cloud Service, are complementary tools rather than one being able to replace the other in a big data context. I find that Big Data Discovery is a great tool to initially understand, catalog and view at a high level the data sources going into VA, and then to do some user-driven cleaning-up of the data, enhancing it and enriching it before committing it to the formal dimensional model that Visual Analyzer requires.

In its BICS guise there’s the additional step of having to export the Hadoop data out of your Big Data Appliance or other Hadoop cluster and upload it in the form of files using BICS’s data load or the new Data Sync utility, but when VA comes as part of OBIEE12c in the next twelve months you’ll be able to directly connect to the Hadoop cluster using Impala ODBC and analyse the data directly, in-place.

I’ll be covering more on BICS over the next few weeks, including how I got data from Hadoop into BICS using the new Data Sync utility.

OBIEE New Features for Front-End Developers and Business Users

The latest version of OBIEE was released this May, and Robin and Mark wrote great posts about New Features for System Administrators and the new support for HiveServer2 and Cloudera Impala respectively. In this post, we will see some useful new features for front-end development. Some of these features, like the treemap and the new styles, were already included in a previous OBIEE version, but I’m including them here anyway because if you are moving from an earlier release you probably haven’t seen them yet.

Search & Sort options inside the Subject Area

There is a new option to search inside the subject area when we create an analysis. If we press the magnifying glass icon in the Subject Areas pane, a search box appears and OBIEE will return all the objects that contain the text that we entered. We can also use the % wildcard.


In addition, there is also the option to sort the objects inside the subject area, ordering the folders and the presentation columns inside them. By default the objects are shown in the order in which they are saved in the repository.


Save Calculated Items in the Web Catalog

A very useful feature for business users is the ability to create calculated items in an analysis and save them in the catalog in order to reuse them in other analyses for the same subject area.

In the Results tab there is a new icon for creating a new measure. The calculation is created in the same way the column formulas are created.


After the new measure is created you can go to the Criteria tab and in the column pop-up menu select the Save Column As option. In this manner you save this new measure as a new object in the catalog to be reused in other analyses. The ability to create a new calculated measure for tables and pivot tables was included in a previous version, but the ability to save the column in the catalog is a new feature of this OBIEE release.


You can also change the properties of the new column in the Criteria tab but be aware that conditional formatting and conditional action links for the column are not saved to the catalog. Regular action links are saved for the column as well as the format properties.

The saved column can be edited from the catalog pane and catalog page. All the changes that you make in the catalog are applied to every analysis that used this column. If you modify it inside a particular analysis, these changes are valid only for the analysis that you are working on.


To be able to save new measures as columns you should have the Save Column privilege.


Global Variables

In this version we find a new type of variable: the Global Variable. Global Variables can be defined in the context of an analysis and can be used in other analyses. They are useful for intermediate, reusable calculations.

To create a new global variable, select the Edit Formula option for the required column. When you press the Variable button in the Formula Editor you will see the new Global option. The Insert Global Variable dialog appears and you can select an existing global variable to be used in the formula, or you can create a new one. To create a new one you need to enter a name, type and value. If you want to use an expression like in the example (Min(Revenue)) as value, the data type should be Text.


To reference a global variable you need to use the fully qualified name including the context: @{global.variables.variablename}.


Also to manage global variables you should have the Manage Global Variables privilege.

New Visualisation: Treemap

There is a new visualisation called Treemap, which groups the dimensional data that you selected in the analysis into tiles. By default, the tile size is based on the first measure of the analysis and the tile colour is based on the second measure. If you have only one measure in your analysis, it is used for both size and colour.


You can edit it like any other view and change the group-by information, the measures that determine the size and colour of the tiles, and the range of colours that you want to use. You can also choose between showing the information in coloured bins or using a continuous colour fill. If you select the first option, you can also choose how many coloured bins you want to use.


New Preview Styles

When you want to preview an analysis in the Results tab, instead of immediately showing how the analysis will look, OBIEE offers you different Styles in which to see your analysis. This feature was also already included in a previous version. If you want to create a custom style, Oracle recommends using the new Skyros style as a starting point.


Horizontal Layout for Radio Button and Check Box Dashboard Prompts

Radio button and check box dashboard prompts can now be shown horizontally. When you create a new dashboard prompt using check boxes or radio buttons, you can choose between a horizontal and a vertical layout under Options.



Enhancements in Export & Print Options

In this version, you will find more options in the Print and Export Options dialog. You can choose whether to include charts, images and formatting, and specify column properties such as the column width and whether to wrap the text in columns.


In summary, these features are simple but quite useful for business users and front-end developers, and give more flexibility to create better data visualisations.


New Oracle Magazine article on Oracle BI Cloud Service

The May/June 2015 edition of Oracle Magazine is now out, and my Business Intelligence article in this edition is on Oracle BI Cloud Service (BICS). In “Upload, Model, Analyze and Report” I focus on the “departmental power-user” use-case where someone with a small team wants to share data and dashboards with others in the department, is familiar with OBIEE but wants to get something up-and-running quickly without having to include IT or the formal development process. By just uploading a spreadsheet of data and quickly modeling it into a star-schema using BICS’s simple-to-use web-based tools, you can create reports that can be shared with others in your team using OBIEE’s familiar dashboard interface.

Rittman Mead offer a number of services around Oracle BI Cloud Service and now have our own BICS “pod” for demonstration and evaluation purposes. Now that Oracle have released Visual Analyzer early for BICS we’re expecting a lot of interest and demand for services, support and training around Oracle’s cloud version of OBIEE, so if you’re interested in moving part of your OBIEE estate into the cloud, or you’re a departmental BI lead looking to run OBIEE within your department without the need to get IT involved, drop us a line at enquiries@rittmanmead.com and we’ll be pleased to help.

Using HBase and Impala to Add Update and Delete Capability to Hive DW Tables, and Improve Query Response Times

One of our customers is looking to offload part of their data warehouse platform to Hadoop, extracting data out of a source system and loading it into Apache Hive tables for subsequent querying using OBIEE11g. One of the challenges that the project faces though is how to handle updates to dimensions (and in their case, fact table records) when HDFS and Hive are typically append-only filesystems; ideally writes to fact tables should only require INSERTs and filesystem appends but in this case they wanted to use an accumulating fact snapshot table, whilst the dimension tables all used SCD1-type attributes that had their values overwritten when updates to those values came through from the source system.

The obvious answer then was to use Apache HBase as part of the design, a NoSQL database that sits over HDFS but allows updates and deletes to individual rows of data rather than restricting you just to append/inserts. I covered HBase briefly on the blog a few months ago when we used it to store webserver log entries brought into Hadoop via Flume, but in this case it makes an ideal landing point for data coming into our Hadoop system, as we can maintain a current-state record of the data brought in from the source system, updating and overwriting values if we need to. What was also interesting to me though was how well we could integrate this HBase data into our mainly SQL-style data processing; how much Java I’d have to use to work with HBase, and whether we could get OBIEE to connect to the HBase tables and query them directly (with a reasonable response time). In particular, could we use the Hive-on-HBase feature to create Hive tables over the HBase ones, and then query those efficiently using OBIEE, so that the data flow looked like this?


To test this idea out, I took the Flight Delays dataset from the OBIEE11g SampleApp & Exalytics demo data [PDF] and created four HBase tables to hold the data from it, using the BigDataLite 4.1 VM and the HBase Shell. This dataset has four tables:

  • FLIGHT_DELAYS – around 220m US flight records listing the origin airport, destination airport, carrier, year and a bunch of metrics (flights, late minutes, distance etc)
  • GEOG_ORIGIN – a list of all the airports in the US along with their city, state, name and so on
  • GEOG_DEST – a copy of the GEOG_ORIGIN table, used for filtering and aggregating on both origin and destination 
  • CARRIERS – a list of all the airlines associated with flights in the FLIGHT_DELAYS table

HBase is a NoSQL, key/value-store database where individual rows have a key, and then one or more column families made up of one or more columns. When you define a HBase table you only define the column families, and the data load itself creates the columns within them in a similar way to how the Endeca Server holds “jagged” data – individual rows might have different columns to each other and like MongoDB you can define a new column just by loading it into the database.
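
To make the row key / column family / column structure a bit more concrete, here is a minimal Python sketch that models an HBase table as nested dictionaries. It is only an illustration of the data model described above, not HBase client code: the put helper, the COLUMN_FAMILIES set and the dest:timezone column are hypothetical names made up for the example.

```python
# Illustrative model only: plain dicts standing in for an HBase table.
from collections import defaultdict

# Column families are fixed when the table is created ...
COLUMN_FAMILIES = {"dest"}

# ... but columns inside a family only exist once a row writes to them.
table = defaultdict(dict)  # row key -> {"family:qualifier": value}


def put(row_key, column, value):
    """Write one cell, rejecting families that were not declared up front."""
    family, _, _qualifier = column.partition(":")
    if family not in COLUMN_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    table[row_key][column] = value


put("LAX", "dest:city", "Los Angeles, CA")
put("LAX", "dest:state", "California")
put("SFO", "dest:city", "San Francisco, CA")
put("SFO", "dest:timezone", "PST")  # hypothetical column, not in the dataset

# The two rows end up with different sets of columns ("jagged" data).
print(sorted(table["LAX"]))
print(sorted(table["SFO"]))
```

The point to take away is that the second row picked up a column the first row does not have, without any table definition having to change.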

Using the HBase Shell CLI on the BigDataLite VM I therefore create the HBase tables using just these high-level column family definitions, with the individual columns within the column families to be defined later when I load data into them.

hbase shell
create 'carriers','details'
create 'geog_origin','origin'
create 'geog_dest','dest'
create 'flight_delays','dims','measures'

To get data into HBase tables there’s a variety of methods you can use. Most probably for the full project we’ll write a Java application that uses the HBase client to read, write, update and delete rows that are read in from the source application (see this previous blog post for an example where we use Flume as the source), or to set up some example data we can use the HBase Shell and enter the HBase row/cell values directly, like this for the geog_dest table:

put 'geog_dest','LAX','dest:airport_name','Los Angeles, CA: Los Angeles'
put 'geog_dest','LAX','dest:city','Los Angeles, CA'
put 'geog_dest','LAX','dest:state','California'
put 'geog_dest','LAX','dest:id','12892'

and you can then use the “scan” command from the HBase shell to see those values stored in HBase’s key/value store, keyed on LAX as the key.

hbase(main):015:0> scan 'geog_dest'
ROW                                    COLUMN+CELL                                                                                                     
 LAX                                   column=dest:airport_name, timestamp=1432067861347, value=Los Angeles, CA: Los Angeles                           
 LAX                                   column=dest:city, timestamp=1432067861375, value=Los Angeles, CA                                                
 LAX                                   column=dest:id, timestamp=1432067862018, value=12892                                                            
 LAX                                   column=dest:state, timestamp=1432067861404, value=California                                                    
1 row(s) in 0.0240 seconds

For testing purposes though we need a large volume of rows and entering them all in by-hand isn’t practical, so this is where we start to use the Hive integration that now comes with HBase. For the BigDataLite 4.1 VM all you need to do to get this working is install the hive-hbase package using yum (after first installing the Cloudera CDH5 repo into /etc/yum.repos.d), load the relevant JAR files when starting your Hive shell session, and then create a Hive table over the HBase table mapping Hive columns to the relevant HBase ones, like this:

ADD JAR /usr/lib/hive/lib/zookeeper.jar;
ADD JAR /usr/lib/hive/lib/hive-hbase-handler.jar;
ADD JAR /usr/lib/hive/lib/guava-11.0.2.jar;
ADD JAR /usr/lib/hive/lib/hbase-client.jar;
ADD JAR /usr/lib/hive/lib/hbase-common.jar;
ADD JAR /usr/lib/hive/lib/hbase-hadoop-compat.jar;
ADD JAR /usr/lib/hive/lib/hbase-hadoop2-compat.jar;
ADD JAR /usr/lib/hive/lib/hbase-protocol.jar;
ADD JAR /usr/lib/hive/lib/hbase-server.jar;
ADD JAR /usr/lib/hive/lib/htrace-core.jar;
-- Hive table over the 'carriers' HBase table
CREATE EXTERNAL TABLE hbase_carriers
 (key string,
  carrier_desc string
 )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,details:carrier_desc")
TBLPROPERTIES ("hbase.table.name" = "carriers");
-- Hive table over the 'geog_origin' HBase table
CREATE EXTERNAL TABLE hbase_geog_origin
 (key string,
  origin_airport_name string,
  origin_city string,
  origin_state string,
  origin_id string
 )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,origin:airport_name,origin:city,origin:state,origin:id")
TBLPROPERTIES ("hbase.table.name" = "geog_origin");
-- Hive table over the 'geog_dest' HBase table
CREATE EXTERNAL TABLE hbase_geog_dest
 (key string,
  dest_airport_name string,
  dest_city string,
  dest_state string,
  dest_id string
 )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,dest:airport_name,dest:city,dest:state,dest:id")
TBLPROPERTIES ("hbase.table.name" = "geog_dest");
-- Hive table over the 'flight_delays' HBase table (two column families)
CREATE EXTERNAL TABLE hbase_flight_delays
 (key string,
  year string,
  carrier string,
  orig string,
  dest string,
  flights tinyint,
  late   tinyint,
  cancelled bigint,
  distance smallint
 )
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,dims:year,dims:carrier,dims:orig,dims:dest,measures:flights,measures:late,measures:cancelled,measures:distance")
TBLPROPERTIES ("hbase.table.name" = "flight_delays");
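
The hbase.columns.mapping string is matched to the Hive column list by position, with :key standing for the HBase row key. The short Python sketch below pairs up the two lists for the hbase_geog_origin table, so a mapping can be checked by eye before running the DDL. It is an illustration only; pair_columns is a hypothetical helper and not part of Hive or HBase.

```python
def pair_columns(hive_columns, mapping):
    """Pair Hive column names with HBase targets purely by position."""
    targets = mapping.split(",")
    # The two lists are expected to line up one-to-one.
    if len(targets) != len(hive_columns):
        raise ValueError("mapping needs exactly one entry per Hive column")
    return dict(zip(hive_columns, targets))


hive_columns = [
    "key",
    "origin_airport_name",
    "origin_city",
    "origin_state",
    "origin_id",
]
mapping = ":key,origin:airport_name,origin:city,origin:state,origin:id"

pairs = pair_columns(hive_columns, mapping)
for hive_col, target in pairs.items():
    print(f"{hive_col:22} -> {target}")
```

Because the pairing is positional, reordering the Hive columns without reordering the mapping string would silently point a column at the wrong HBase cell, which is worth checking for on wider tables such as hbase_flight_delays.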

Bulk loading data into these Hive-on-HBase tables is then just a matter of loading the source data into a regular Hive table, and then running INSERT INTO TABLE … SELECT commands to copy the regular Hive rows into the HBase tables via their Hive metadata overlays:

insert into table hbase_carriers                           
select carrier, carrier_desc from carriers;
insert into table hbase_geog_origin
select * from geog_origin;
insert into table hbase_geog_dest
select * from geog_dest;
insert into table hbase_flight_delays
select row_number() over (), * from flight_delays;

Note that I had to create a synthetic sequence number key for the fact table, as the source data for that table doesn’t have a unique key for each row – something fairly common for data warehouse fact table datasets. In fact storing fact table data into a HBase table is not a very good idea for a number of reasons that we’ll see in a moment, and bear-in-mind that HBase is designed for sparse datasets and low-latency inserts and row retrievals so don’t read too much into this approach yet.

So going back to the original reason for using HBase to store these tables, updating rows within them is pretty straightforward. Taking the geog_origin HBase table, if we first get the row for SFO using a Hive query over the HBase table, it looks like this:

hive> select * from hbase_geog_origin where key = 'SFO'; 
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
SFO   San Francisco, CA: San Francisco   San Francisco, CA   California   14771
Time taken: 29.126 seconds, Fetched: 1 row(s)

To update that row and others, I can load a new data file into a regular Hive table using HiveQL’s LOAD DATA command, and then INSERT INTO TABLE … SELECT from that Hive table containing the updates into the Hive-on-HBase one, like this:

insert into table hbase_geog_origin    
select * from origin_updates;

To check that the value has in fact updated I can either run the same SELECT query against the Hive table over the HBase one, or drop into the HBase shell and check it there:

hbase(main):001:0> get 'geog_origin','SFO'
COLUMN                                 CELL                                                                                                           
 origin:airport_name                   timestamp=1432050681685, value=San Francisco, CA: San Francisco International                                  
 origin:city                           timestamp=1432050681685, value=San Francisco, CA                                                               
 origin:id                             timestamp=1432050681685, value=14771                                                                           
 origin:state                          timestamp=1432050681685, value=California                                                                      
4 row(s) in 0.2740 seconds

In this case the update file/Hive table changed the SFO airport name from “San Francisco” to “San Francisco International”. I can change it back again using the HBase Shell like this, if I want:

put 'geog_origin','SFO','origin:airport_name','San Francisco, CA: San Francisco'

and then checking it again using the HBase Shell’s get command on that key value shows it’s back to the old value. HBase actually stores several versions of each cell, with a timestamp for each version, but by default it shows you the current one:

hbase(main):003:0> get 'geog_origin','SFO'
COLUMN                                 CELL                                                                                                           
 origin:airport_name                   timestamp=1432064747843, value=San Francisco, CA: San Francisco                                                
 origin:city                           timestamp=1432050681685, value=San Francisco, CA                                                               
 origin:id                             timestamp=1432050681685, value=14771                                                                           
 origin:state                          timestamp=1432050681685, value=California                                                                      
4 row(s) in 0.0130 seconds
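
The way a put adds a new timestamped version rather than editing a value in place can be pictured with a small Python sketch. This is a toy model under a couple of stated assumptions, not how HBase is implemented: a simple counter stands in for HBase’s millisecond timestamps, every version is kept (real HBase only retains a configured number), and the put and get helpers are hypothetical names that merely echo the shell commands.

```python
import itertools

_clock = itertools.count(1)  # stand-in for HBase's millisecond timestamps
cells = {}  # (row key, column) -> list of (timestamp, value), oldest first


def put(row_key, column, value):
    """Append a new version instead of overwriting the previous one."""
    cells.setdefault((row_key, column), []).append((next(_clock), value))


def get(row_key, column, versions=1):
    """Return the newest `versions` values, most recent first."""
    history = cells.get((row_key, column), [])
    return [value for _, value in reversed(history)][:versions]


# Same sequence as in the post: original load, update, then change it back.
put("SFO", "origin:airport_name", "San Francisco, CA: San Francisco")
put("SFO", "origin:airport_name",
    "San Francisco, CA: San Francisco International")
put("SFO", "origin:airport_name", "San Francisco, CA: San Francisco")

print(get("SFO", "origin:airport_name"))              # current value only
print(get("SFO", "origin:airport_name", versions=3))  # full history
```

A default read only sees the newest version, which is why the shell output above shows the old name again even though the “International” value was written in between.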

So, so far so good. We’ve got a way of storing data in Hive-type tables on Hadoop and a way of updating and amending records within them by using HBase as the underlying storage, but what are these tables like to query? Hive-on-HBase tables with just a handful of HBase rows return data almost immediately, for example when I create a copy of the geog_dest HBase table and put just a single row entry into it, then query it using a Hive table over it:

hive> select * from hbase_geog_dest2;
LAX   Los Angeles, CA: Los Angeles   Los Angeles, CA   California   12892
Time taken: 0.257 seconds, Fetched: 1 row(s)

Hive would normally take 30 seconds or more to return even a single row whenever it has to launch a MapReduce job, as the SFO lookup earlier did; but when we move up to larger datasets such as the flight delays fact table itself, running a simple aggregation on the Hive table and then comparing that to the same query running against the Hive-on-HBase version shows a significant time-penalty for the HBase version:

hive> select sum(cast(flights as bigint)) as flight_count from flight_delays;
Total jobs = 1
Launching Job 1 out of 1
Total MapReduce CPU Time Spent: 7 seconds 670 msec
Time taken: 37.327 seconds, Fetched: 1 row(s)

compared to the Hive-on-HBase version of the fact table:

hive> select sum(cast(flights as bigint)) as flight_count from hbase_flight_delays;
Total jobs = 1
Launching Job 1 out of 1
Total MapReduce CPU Time Spent: 1 minutes 19 seconds 240 msec
Time taken: 99.154 seconds, Fetched: 1 row(s)

And that’s to be expected; as I said earlier, HBase is aimed at low-latency single-row operations rather than full table scan, aggregation-type queries, so it’s not unexpected that HBase performs badly here, but the response time is even worse if I try and join the HBase-stored Hive fact table to one or more of the dimension tables also stored in HBase.
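
Putting the two runs side by side makes the penalty easier to see. The short Python calculation below uses only the timings reported in the two query outputs above; the variable names are mine.

```python
# Timings copied from the two Hive query outputs above, in seconds.
hive_elapsed = 37.327
hbase_elapsed = 99.154
hive_cpu = 7.670              # "7 seconds 670 msec"
hbase_cpu = 60 + 19 + 0.240   # "1 minutes 19 seconds 240 msec"

elapsed_ratio = hbase_elapsed / hive_elapsed
cpu_ratio = hbase_cpu / hive_cpu

print(f"elapsed time: {elapsed_ratio:.1f}x the regular Hive table")
print(f"MapReduce CPU time: {cpu_ratio:.1f}x the regular Hive table")
```

On this VM that works out to roughly 2.7 times the elapsed time and about 10 times the MapReduce CPU time for the Hive-on-HBase version of the fact table.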

In our particular customer example though these HBase tables were only going to be loaded once-a-day, so what if we copy the current version of each HBase table row into a snapshot Hive table stored in regular HDFS storage, so that our data loading process looks like this:


and then OBIEE queries the snapshot of the Hive-on-HBase table joined to the dimension table still stored in HBase, so that the query side looks like this:


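The daily snapshot step described above could be sketched in HiveQL along the following lines. This is a minimal sketch rather than the exact process used on the project: the table name flight_delays_snapshot is hypothetical, and it assumes a plain CREATE TABLE … AS SELECT is enough to copy the current state of hbase_flight_delays into regular HDFS storage.

```sql
-- flight_delays_snapshot is a hypothetical name; hbase_flight_delays is the
-- Hive-on-HBase fact table created earlier.
-- Intended to run once a day, after the HBase load has finished.
drop table if exists flight_delays_snapshot;

-- CTAS copies the current state of every HBase row into a regular
-- (HDFS-stored) Hive table
create table flight_delays_snapshot
as select * from hbase_flight_delays;
```

Queries would then read flight_delays_snapshot in place of hbase_flight_delays, while the dimension tables stay in HBase.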
Let’s try it out by taking the original Hive table I used earlier to load the hbase_flight_delays table, and joining that to one of the Hive-on-HBase dimension tables. I’ll start by creating a baseline response time, joining that source Hive fact table to the source Hive dimension table (also used earlier to load the corresponding Hive-on-HBase table):

select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from flight_delays f 
join geog_origin o on f.orig = o.origin                                                             
and o.origin_state = 'California'                                                                       
group by o.origin_airport_name; 
17638	Arcata/Eureka, CA: Arcata
9146	Bakersfield, CA: Meadows Field
125433	Burbank, CA: Bob Hope
1653	Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field
Time taken: 43.896 seconds, Fetched: 27 row(s)

So that’s just under 44 seconds to run the query entirely against regular Hive tables. What if I swap out the regular Hive dimension table for the Hive-on-HBase version; how does that affect the response time?

hive> select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from flight_delays f       
    > join hbase_geog_origin o on f.orig = o.key                                                        
    > and o.origin_state = 'California'                                                                 
    > group by o.origin_airport_name;
17638	Arcata/Eureka, CA: Arcata
9146	Bakersfield, CA: Meadows Field
125433	Burbank, CA: Bob Hope
1653	Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field
Time taken: 51.757 seconds, Fetched: 27 row(s)

That’s interesting: even though we used the (updatable) Hive-on-HBase dimension table in the query, the response time only went up by around eight seconds, to just under 52 seconds, compared to just under 44 when we used regular Hive tables alone. Taking it one step further though, what if we used Cloudera Impala as our query engine and copied the Hive-on-HBase fact table into a Parquet-stored Impala table, so that our inward data flow looked like this:


By using the Impala MPP engine, which runs on Hadoop but reads the underlying data files directly rather than going through MapReduce as Hive does, and in addition storing its data in column-store, query-orientated Parquet storage, we can take advantage of OBIEE’s new support for Impala and potentially bring the query response time down even further. Let’s go into the Impala Shell on the BigDataLite 4.1 VM, refresh Impala’s view of the Hive Metastore data dictionary, and then create the corresponding Impala snapshot fact table using a CREATE TABLE … AS SELECT Impala SQL command:

[oracle@bigdatalite ~]$ impala-shell
[bigdatalite.localdomain:21000] > invalidate metadata;
[bigdatalite.localdomain:21000] > create table impala_flight_delays
                                > stored as parquet
                                > as select * from hbase_flight_delays;
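
Since the HBase tables are reloaded once a day, the Parquet copy would need refreshing on the same schedule. One way this might look in the Impala Shell is sketched below. It assumes the column layout of hbase_flight_delays has not changed since impala_flight_delays was created, and the COMPUTE STATS step is an optional addition, not something run in the session above.

```sql
-- Pick up any changes made to the table definition through Hive
invalidate metadata hbase_flight_delays;

-- Replace the previous day's snapshot with the current state of the
-- HBase-backed fact table
insert overwrite table impala_flight_delays
select * from hbase_flight_delays;

-- Gather table and column statistics to help Impala plan joins
-- against the refreshed snapshot
compute stats impala_flight_delays;
```

Using INSERT OVERWRITE rather than dropping and recreating the table keeps the table definition in place between loads.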

Now let’s use the Impala Shell to join the Impala version of the flight delays table, with its data stored in Parquet files, to the Hive-on-HBase dimension table created earlier within our Hive environment:

[bigdatalite.localdomain:21000] > select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
                                > join hbase_geog_origin o on f.orig = o.key
                                > and o.origin_state = 'California'  
                                > group by o.origin_airport_name;
Query: select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
join hbase_geog_origin o on f.orig = o.key
and o.origin_state = 'California'
group by o.origin_airport_name
| flight_count | origin_airport_name                                              |
| 31907        | Fresno, CA: Fresno Yosemite International                        |
| 125433       | Burbank, CA: Bob Hope                                            |
| 1653         | Santa Maria, CA: Santa Maria Public/Capt. G. Allan Hancock Field |
Fetched 27 row(s) in 2.16s

Blimey: 2.16 seconds, compared to the best time of 44 seconds we got earlier when we used just regular Hive tables, let alone the 52 seconds when we joined to the dimension table stored in HBase. Let’s crank it up a bit and join in another dimension table, filtering on both origin and destination values:

[bigdatalite.localdomain:21000] > select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
                                > join hbase_geog_origin o on f.orig = o.key
                                > join hbase_geog_dest d on f.dest = d.key
                                > and o.origin_state = 'California'  
                                > and d.dest_state = 'New York'
                                > group by o.origin_airport_name;
Query: select sum(cast(f.flights as bigint)) as flight_count, o.origin_airport_name from impala_flight_delays f
join hbase_geog_origin o on f.orig = o.key
join hbase_geog_dest d on f.dest = d.key
and o.origin_state = 'California'
and d.dest_state = 'New York'
group by o.origin_airport_name
| flight_count | origin_airport_name                                   |
| 947          | Sacramento, CA: Sacramento International              |
| 3880         | San Diego, CA: San Diego International                |
| 4030         | Burbank, CA: Bob Hope                                 |
| 41909        | San Francisco, CA: San Francisco International        |
| 3489         | Oakland, CA: Metropolitan Oakland International       |
| 937          | San Jose, CA: Norman Y. Mineta San Jose International |
| 41407        | Los Angeles, CA: Los Angeles International            |
| 794          | Ontario, CA: Ontario International                    |
| 4176         | Long Beach, CA: Long Beach Airport                    |
Fetched 9 row(s) in 1.48s

Even faster. So that’s what we’ll be going with as our initial approach for data loading and querying. We load data into HBase tables as planned at the start, taking advantage of HBase’s CRUD capabilities but bulk-loading and initially reading the data through Hive tables defined over the HBase ones. Then, before we make the data available for querying by OBIEE, we copy the current state of the HBase fact table into a Parquet-stored Impala table, relying on Impala’s ability to work with Hive tables and metadata and to join across both Impala and Hive tables, even when one of the Hive tables uses HBase as its underlying storage.