OBIEE 11g Presentation Server Catalog Replication Across Redundant Servers
The thing about OBIEE 11g is that there is no single way to get things “up and running”. Over the past year I have performed countless OBIEE installs, and almost every one has been different to the last, due to infrastructure challenges, scale requirements or requirements for high-availability and server redundancy.
There is little we can do about infrastructure challenges, typically around policies which are in place across the IT estate and must be adhered to. Scale is simple; more users means more or bigger servers. The last point on the list is HA & server redundancy, and this is arguably the biggest area we need to address before punching in “./runInstaller” at the command-prompt.
Mark Rittman is currently pulling together a wider series of blog posts around options for HA & server redundancy, so I’m not about to get into that in this post. What I will share in my first Rittman Mead blog post is something I have been working on recently that is sometimes implemented as part of an HA/redundancy solution: BI Presentation Server catalog replication.
BI Presentation Server catalog replication is the Oracle-supported method of moving a BI Presentation Server catalog from one running OBIEE instance to another. This is useful in many situations, namely:
- Release cycle management when specific aspects of the BI Presentation Server catalog need to be moved from one OBIEE environment to another, such as development to testing.
- Full BI Presentation Server catalog synchronization between OBIEE instances to provide two OBIEE instances with exactly the same BI Presentation Server catalog for DR.
The second point above is the scenario I am looking to cover here – using this utility we can make sure the BI Presentation Server catalogs on two distinct OBIEE instances are the same. In a DR event we can either point users at the secondary server, or reverse the direction of the utility to pump the data from the DR server back to the primary server once the fault has been cleared. The RPD and other changes, such as user roles, would be managed as part of the release process and copied to the DR server when they are pushed to the production server. It is only the BI Presentation Server catalog which is constantly changing as the OBIEE instance is used, and therefore only the BI Presentation Server catalog which needs to be constantly replicated.
To set up BI Presentation Server catalog replication, we can use a command-line utility which ships with Oracle Business Intelligence 11g, called either “sawrepaj.bat” or “sawrepaj.sh” depending on your operating system (the two are functionally identical, and differ only in the way they are invoked from the command-line). “sawrepaj” has options to synchronise catalogs between servers, or to simply export a catalog for backups or as part of the release lifecycle; in addition, we can control the grain of what is exported down to an individual catalog folder or file.
The utility runs from the command-line and accepts a number of options, or commands, to specify how the replication takes place, using the following syntax:
sawrepaj.sh [/C path] command [command parameters]
- mark – instruct the OBIEE server to ‘watch’ catalog files and record changes.
- run – run the batch, as defined in the config.xml file.
Before using the utility, we need to create a config.xml file which instructs the OBIEE Presentation Services to record changes, and tells sawrepaj what to do when it executes.
The config.xml file can be placed anywhere in the filesystem and is referenced using the /C switch when calling sawrepaj – more about this /C switch later. The config file tells the sawrepaj utility which Oracle BI Presentation Servers are involved in the replication, and how to access them. The example below includes a source and target Oracle BI Presentation Server – however you could add countless other BI Presentation Servers depending on your environment.
<?xml version="1.0" encoding="UTF-8"?> <!-- Oracle BI Presentation Services Replication Agent Configuration File --> <Config> <General> <ExportDirectory>smb://path/to/directory</ExportDirectory> <LogExpiresHours>48</LogExpiresHours> </General> <Server name=”serverA" user="weblogic" pwd="Password01"> <Property name= "oracle.bi.presentation.url" value ="http://servera:9704/analytics/saw.dll"/> <Property name="oracle.bi.presentation.usesoaptcptransport" value="N"/> </Server> <Server name="serverB" user="weblogic" pwd="Password01"> <Property name= "oracle.bi.presentation.url" value ="http://serverb:9704/analytics/saw.dll"/> <Property name="oracle.bi.presentation.usesoaptcptransport" value="N"/> </Server> <Folderset name="all"> <Folder>/</Folder> </Folderset> <ReplicationTask destination="serverA" source="serverB" folders="all" /> </Config>
The three sections of interest in the above file are Server, FolderSet and ReplicationTask:
- Each Server tag points to an OBIEE Presentation Services instance, with credentials for a user which has SOAP access.
- The FolderSet tag contains a list of Folder tags which detail the folders within the webcat to replicate.
- The ReplicationTask tag contains the instruction of what to do when we run sawrepaj. In the above example, serverA is the target and serverB is the source.
Before any replication can begin, we need to enable each BI Presentation Server component for replication. Add the below to the instanceconfig.xml file, between the <Catalog> tags. This needs to be done for all Presentation Server components, whether source or target for the replication.
<Replication>
   <Enabled>true</Enabled>
   <ReadLogRecordsSinceHoursAgo>120</ReadLogRecordsSinceHoursAgo>
   <RecordsInFileLimit>4000</RecordsInFileLimit>
</Replication>
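For reference, here is a minimal sketch of where that change typically goes and how to restart Presentation Services afterwards so it takes effect – the paths below assume a default 11g install with a single Presentation Services component (coreapplication_obips1), so adjust them for your own environment:

# instanceconfig.xml normally lives here on a default install (path is an assumption):
#   $ORACLE_INSTANCE/config/OracleBIPresentationServicesComponent/coreapplication_obips1/instanceconfig.xml

# After editing it, bounce Presentation Services so the Replication settings are picked up
cd $ORACLE_INSTANCE/bin
./opmnctl restartproc ias-component=coreapplication_obips1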
Now, back to the sawrepaj commands:
mark – The mark command is used to tell BI Presentation Services which webcat files to monitor for changes. We can be as specific as a single file, or as vague as an entire webcat. You will need to run this on a BI server with the following command:
sawrepaj mark all /
run – The final part of the process, the run command will instruct sawrepaj to run all the tasks in the config.xml file. The below is an example of the command. If your config.xml file is not in the root path of the sawrepaj tool, you will need to specify its location with the /C switch. This command can be added to a cronjob or a Windows scheduler, or can be run manually or as part of another script.
sawrepaj /C /my/config/file/config.xml run
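As a sketch of the scheduling approach mentioned above – the path to sawrepaj.sh is made up for illustration, and the config file location follows the example above – a crontab entry to run the replication every 15 minutes might look like this:

# Hypothetical crontab entry: run the replication tasks every 15 minutes and keep a log of the output
*/15 * * * * /u01/app/obiee/sawrepaj.sh /C /my/config/file/config.xml run >> /home/oracle/sawrepaj.log 2>&1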
That’s it! Once completed, the above steps will give you all you need to set up full catalog replication between multiple BI servers.
Date formatting in OBIEE 11g – setting the default Locale for users
A short and sweet blog post this one, simply to plug a gap in Google that I found when trying to achieve this today.
How users see things like date formats in OBIEE is determined by their locale, which is usually related to language but not always the same.
The classic example, and the one I was trying to resolve, was that users were seeing dates presented in the ‘wrong’ format. The users are British, and instead of seeing dates in the correct format of DD/MM/YYYY (19/01/2014) they were getting MM/DD/YYYY (01/19/2014). (side note: this lighthearted article in the Guardian refers to this ridiculous format as “Middle-Endianness”…)
In this example I have got the same date value in two columns. One column I have explicitly formatted (“DATE_VERBOSE”) so there is no ambiguity about the date. The second column (“DATE_LOCALE”) is the same date value, but formatted according to the locale setting. It is this default locale setting that I want to demonstrate how to set – manually changing all date columns to a particular format is not a sustainable way to develop reports…
Oracle’s documentation is not very clear in this respect, and Google is a mishmash of half solutions and rubbish, so I humbly offer this short instruction for how to ensure users have the correct locale applied.
Solution 1 – manually
Each user can configure their chosen locale, from the My Account dialog:
Whenever changing the locale, you need to logout and log back in to OBIEE for it to take effect.
This setting, if they override it from the default, is stored in the Presentation Catalog in the user’s userprefsxmlstore.xml file:
<sawprefs:userPreference prefKey="Locale">en-gb</sawprefs:userPreference>
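If you want to see which existing users have already overridden the default, a quick read-only search of the catalog files on disk will show you – /path/to/catalog below is just a placeholder, so point it at wherever your Presentation Catalog actually lives:

# List the user preference files that contain an explicit Locale override
# (/path/to/catalog is a placeholder for your Presentation Catalog root directory)
grep -rl 'prefKey="Locale"' /path/to/catalog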
So for a single user, this method is fine – but for more than one existing user, or any new users, it doesn’t fly. We don’t want to go hacking Presentation Catalog files, that is bad, mmm’kay?
Solution 2 – The USERLOCALE System Session Variable
There is a System Session Variable called USERLOCALE which you can set per user, and will define their locale. Easy! All you have to watch out for is that its value (thanks Christian) is case sensitive. This cost me some hours today, so I’ll say it again – the value you specify for USERLOCALE is case sensitive and OBIEE will not help you out if you specify it wrong (it’ll just ignore it).
You can hardcode the same value for all users, or if you want to get fancy you could drive it from a database table, making it dependent upon a per-user location that you store in that table.
The Session Variable and example Initialisation Block are shown here:
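As a sketch of the table-driven approach – the table user_locale_prefs and its columns are made up for illustration, and the real SELECT is defined against your own lookup table in the Initialisation Block in the Admin Tool – the query might look something like this, and can be sanity-checked from SQL*Plus before you put it in the RPD:

# Hypothetical lookup query for the USERLOCALE init block; in the init block itself
# you would use ':USER' in place of the literal username
sqlplus bi_repo/bi_repo <<'EOF'
SELECT lower(locale)
FROM   user_locale_prefs
WHERE  username = 'weblogic';
EOF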
To validate what is going on with the session variable, you can use a Static Text view with the following content to show the values of the relevant session and presentation variables:
[u][b]System Session Variables[/b][/u]
<p align=left>
[b]NQ_SESSION.USERLOCALE: [/b]@{biServer.variables['NQ_SESSION.USERLOCALE']}[br/]
[b]NQ_SESSION.WEBLANGUAGE: [/b] @{biServer.variables['NQ_SESSION.WEBLANGUAGE']}[br/]
</p>
[u][b]Predefined Presentation Variables[/b][/u][br/]
<p align="left">
[b]session.language:[/b] @{session.language}[br/]
[b]session.locale:[/b] @{session.locale}[br/]
</p>
In my test report the Static Text view is to the right of the data table, and I can now see that the USERLOCALE session variable has been populated. Crucially, the session.locale presentation variable is inheriting the correct value, which in turn is driving the desired formatting of the DATE_LOCALE column.
Method 3 – the hack
I’m including this here for completeness, and because you will come across it in plenty of places on the web so it is good to understand why it is not the correct solution.
Within the OBIEE installation folder (FMW_HOME) you will find the ORACLE_HOME folder, Oracle_BI1, in which all the application’s binaries and internal configuration are stored. This is not a “user serviceable” folder, and Oracle are at liberty to change any of the files within it whenever you patch or upgrade the software.
Within Oracle_BI1 there is a file called localemappings.xml and this file sets the default locale used if one is not explicitly configured. If you change the “catch all” line in this XML file to your desired locale, it will set the default.
<when matches="*"><localeDefinition name="en-gb"/></when>
But, this is not the correct way to do it, and there is no need to because the USERLOCALE method above works just fine.
What about AllowedLocales in instanceconfig.xml?
In combing through the documentation (RTFM, after all), you will come across reference to the Localization tags for the Presentation Services configuration file, instanceconfig.xml. Within this tag you can specify AllowedLocales. However, all this does is provide a way to restrict the dropdown locale list (shown in method 1 above). So this is a nice thing to do for your user base if there is a finite number of locales they will want to use (saving them wading through many options), but it does not influence the default locale – even if you set it to a single value, your desired default.
Summary
If you want to set the default locale, use the system session variable USERLOCALE. Make sure you specify your value in lowercase, otherwise it won’t work.
Rittman Mead BI Forum 2014 Abstract Scoring Now Live – For 1 Week Only!
The call for papers for the Rittman Mead BI Forum 2014 closed at the end of January, and we’ve had some excellent submissions on topics ranging from OBIEE, Visualizations and data discovery through to in-memory analytics, big data and data integration. As always, we’re now opening up the abstract submission list for scoring, so that anyone considering coming to either the Brighton or Atlanta events can have a say in what abstracts are selected.
The voting forms, and event details, are below:
- Brighton, May 7th – 9th 2014, Hotel Seattle Brighton : Abstract Voting Form
- Atlanta, May 14th – 16th 2014, Renaissance Atlanta Midtown : Abstract Voting Form
In case you missed it, we also announced the speaker for the Wednesday masterclass the other day – Lars George from Cloudera, who’ll be talking about Hadoop, HBase, Cloudera and how it all applies to the worlds of analytics, BI and DW – something we’re all really excited about.
Voting is open for just one week, and will close at 5pm PST on Tuesday, 18th Feb. Shortly afterwards we’ll announce the speaker line-up, and open-up registrations for both events. Keep an eye on the blog for more details as they come.
Testing Oracle Direct Connector for HDFS as an Alternative to Hive ODBC for OBIEE11g
In a post on the blog a couple of weeks ago, I looked at loading up a set of flight delays data into Apache Hadoop, then analysing it using Apache Hive, Cloudera Impala and OBIEE. In this scenario, OBIEE connects to the Hadoop cluster using Hive and Impala ODBC drivers, and then either Hive (through MapReduce jobs) or Impala (through its in-memory distributed query engine) returns the data to OBIEE, for display on the dashboard.
In my initial tests, as you’d expect, Hive was fairly slow (as it’s optimised towards large (TB+) batch jobs), whilst Impala was fast, on a par with regular Oracle database queries albeit with a much more limited set of SQL functions supported. One of the questions I got asked after I posted the blog article, though, was how this approach compared to accessing the Hadoop data via Oracle’s “big data connectors” – either Oracle Loader for Hadoop (OLH) or Oracle Direct Connector for HDFS. Could these give us similar performance to Hive or Impala, but make the Hadoop data more “Oracle-compatible” and suitable for querying with OBIEE?
For anyone not all that familiar with Oracle’s big data connectors, there are two that are particularly relevant to BI/DW use-cases. Oracle Loader for Hadoop (OLH) is a bulk-loader between Hadoop and the Oracle Database that uses MapReduce on the Hadoop side to prepare, sort and transform data before efficiently loading it into an Oracle database. There are already lots of ways you can move data in and out of Oracle from a Hadoop source – Sqoop for example – but presumably Oracle feel OLH is a particularly efficient way of doing it, exploiting Oracle direct/bulk-loading capabilities and parallelism, and so forth. It is a loader though, not a transparent reading mechanism, so it’s only really appropriate for ETL-type situations where you want to copy the data out of Hadoop and into an Oracle data warehouse.
Oracle Direct Connector for HDFS (ODCH) is a bit different though, in that it uses the Oracle external table feature to map an Oracle data dictionary table onto HDFS files, or optionally Apache Hive tables (and therefore onto HDFS files, NoSQL databases or whatever). So you could potentially use ODCH to create Oracle database table representations of your Hive tables / HDFS files, then map these into the OBIEE repository and work with them just like any other Oracle (external) table. What this means then is that you’re not reliant on Hive or Impala ODBC drivers, or the BI Server having to generate HiveQL or ImpalaQL queries (with Impala of course not yet being officially supported), making your work from the OBIEE side a lot easier than if you’re trying to work with Hive or Impala directly.
So how well does it work then? I decided to give it a try using the new BigDataLite VM I mentioned on the blog last week, along with the Airline Delays dataset I used in the post on Impala and Hive the other week. My starting point then was a Hive database with four tables – flight performance (19m rows), dest and origin (2k rows each) and carrier (1.5k rows), like this:
Now obviously Oracle’s big data connectors aren’t installed as part of Hadoop by default, but the BigDataLite VM does have them pre-installed, along with an Oracle Database 12c database, so a lot of the work in setting things up is done for you. By default, an install of the big data connectors connects to the Hadoop environment on the same machine, so all of the setup commands I use will assume that the Hive server and so on are on the same server – localhost in this case – although of course you can configure them to connect to a remote Hadoop server or cluster.
The way ODCH works is that it uses the Oracle external table “pre-processor” feature to stream HDFS file content into your query session, with the pre-processor in this scenario being a utility provided by the big data connectors to connect the two environments together. This blog post from Oracle explains the process in a bit more detail, but the key things to understand are that it’s conceptually similar to accessing regular file data via external tables, giving us the benefit of fairly seamless access to this external data (vs. using SQL*Loader, or OLH in this case), but it’s also not “real” Oracle table data, so there are no indexes or any other performance structures that can make queries run faster – though like regular external tables you can run ODCH access in parallel.
So let’s use the setup in the BigDataLite VM to create some external tables in the Oracle database that map to my Hive tables, and underlying HDFS data files. Before we do this though we need to set up a few things; first, we need to create a temporary directory on the Linux filesystem to hold the metadata files created by ODCH, and we also need to specify where the ODCH HDFS file streamer utility can be found. Logging in as oracle/welcome1 on the VM, I first create the temporary directory, and then set an environment variable the rest of the process will be looking for:
mkdir bi_airlines_dir
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*
Now I log into SQL*Plus and create a database user that I’ll create the external tables in, and grant it access to the USERS tablespace:
sqlplus / as sysdba

create user bi_airlines identified by bi_airlines quota unlimited on users;
grant connect, resource to bi_airlines;
Next, I need to create two database directory objects; one to point to where the ODCH HDFS stream pre-processor lives, and the other to point to my temporary directory:
create or replace directory bi_airlines_dir as '/home/oracle/bi_airlines_dir';
grant read, write on directory bi_airlines_dir to bi_airlines;

create or replace directory osch_bin_path as '/u01/connectors/osch/bin';
grant read, write, execute on directory osch_bin_path to bi_airlines;
Now, re-purposing the scripts provided on the BigDataLite VM to create external tables over the Moviework Hive database, each of my tables requires two setup files: a shell script to call the ODCH utility and reference an XML parameter file, and the XML file itself, which contains the definition of the external table. To take an example, the shell script (genloc_origin_hive.sh) to create an external table over my “origin” Hive table looks like this:
# Add HIVE_HOME/lib* to HADOOP_CLASSPATH. This cannot be done
# in the login profiles since this breaks Pig in the previous lab.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*

hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/movie/moviework/osch/origin_hive.xml \
       -createTable
whereas the accompanying XML file (origin_hive.xml) looks like this:
<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>BI_AIRLINES.ORIGIN_EXT_TAB_HIVE</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>hive</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.tableName</name>
    <value>origin</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.databaseName</name>
    <value>bi_airlines</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>jdbc:oracle:thin:@localhost:1521:orcl</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>BI_AIRLINES</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>bi_airlines_dir</value>
  </property>
</configuration>
Within the XML file, we specify the name of the external table that the utility will create in Oracle for you (ORIGIN_EXT_TAB_HIVE within the BI_AIRLINES schema); that it’s a Hive (rather than HDFS file) source; and the Hive database and table name to source data from. Within the operating system directory that you pass to it, the utility stores metadata pointing to the individual HDFS files that contain your data – so whilst the HDFS data itself isn’t copied to Oracle, a hard reference to each file is, and if you add more files to the Hive table’s HDFS directory you’ll need to re-run the utility to register them. It’s not completely automatic beyond this point, then, but more or less OK if you’ve got a single file providing the data, as I have in this slightly non-representative case.
So let’s run the shell script and create one of the external tables, passing in the Oracle database password for the “bi_airlines” user when prompted:
[oracle@bigdatalite osch]$ sh genloc_origin_hive.sh
Oracle SQL Connector for HDFS Release 2.3.0 - Production

Copyright (c) 2011, 2013, Oracle and/or its affiliates. All rights reserved.

[Enter Database Password:]
14/02/06 11:59:58 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
14/02/06 11:59:58 INFO hive.metastore: Trying to connect to metastore with URI thrift://bigdatalite.localdomain:9083
14/02/06 11:59:58 INFO hive.metastore: Waiting 1 seconds before next connection attempt.
14/02/06 11:59:59 INFO hive.metastore: Connected to metastore.
The create table command succeeded.

CREATE TABLE "BI_AIRLINES"."ORIGIN_EXT_TAB_HIVE"
(
 "ORIGIN"       VARCHAR2(4000),
 "ORIGIN_DEST"  VARCHAR2(4000),
 "ORIGIN_CITY"  VARCHAR2(4000),
 "ORIGIN_STATE" VARCHAR2(4000),
 "AIRPORT_ID"   VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "BI_AIRLINES_DIR"
   ACCESS PARAMETERS
   (
     RECORDS DELIMITED BY 0X'0A'
     CHARACTERSET AL32UTF8
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
     FIELDS TERMINATED BY 0X'7C'
     MISSING FIELD VALUES ARE NULL
     (
       "ORIGIN" CHAR(4000) NULLIF "ORIGIN"=0X'5C4E',
       "ORIGIN_DEST" CHAR(4000) NULLIF "ORIGIN_DEST"=0X'5C4E',
       "ORIGIN_CITY" CHAR(4000) NULLIF "ORIGIN_CITY"=0X'5C4E',
       "ORIGIN_STATE" CHAR(4000) NULLIF "ORIGIN_STATE"=0X'5C4E',
       "AIRPORT_ID" CHAR(4000) NULLIF "AIRPORT_ID"=0X'5C4E'
     )
   )
   LOCATION
   (
     'osch-20140206120000-1947-1'
   )
) PARALLEL REJECT LIMIT UNLIMITED;

The following location files were created.

osch-20140206120000-1947-1 contains 1 URI, 150606 bytes

    150606 hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/bi_airlines.db/origin/geog_origin.txt
Note the HDFS file reference at the end – this is the metadata created by the utility, which we’d need to update if more data files are added to the underlying Hive table.
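If I recall the connector documentation correctly, the ExternalTable utility has a -publish option for exactly this situation, which refreshes the location files for an existing external table rather than recreating it – a sketch only, so check the syntax against your version of the connectors:

# Re-register the HDFS files behind an existing external table after new data arrives
# (same -conf file as before; -publish is described in the Oracle SQL Connector for HDFS docs)
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/movie/moviework/osch/origin_hive.xml \
       -publish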
So let’s fire-up SQL*Plus and take a look at the tables:
[oracle@bigdatalite osch]$ sqlplus bi_airlines/bi_airlines

SQL*Plus: Release 12.1.0.1.0 Production on Thu Feb 6 13:20:12 2014

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Last Successful login time: Thu Feb 06 2014 12:21:52 -05:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select table_name from user_tables;

TABLE_NAME
--------------------------------------------------------------------------------
FLIGHT_PERF_EXT_TAB_HIVE
ORIGIN_EXT_TAB_HIVE
DEST_EXT_TAB_HIVE
CARRIER_EXT_TAB_HIVE

SQL> set timing on
SQL> select count(*) from flight_perf_ext_tab_hive;

  COUNT(*)
----------
  19648958

Elapsed: 00:01:09.74
Not bad, but just over a minute to do a quick row-count on the table. Let’s try the same thing using Hive itself – note that ODCH and Hive aren’t using the same access technique to the underlying data, and one might make more sense than the other for particular query situations.
[oracle@bigdatalite ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/oracle/hive_job_log_2b7797b2-1ef8-4ffc-b8b1-ab4f77b19cea_423831297.txt
hive> select count(*) from bi_airlines.flight_performance;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201402052208_0001, Tracking URL = http://bigdatalite.localdomain:50030/jobdetails.jsp?jobid=job_201402052208_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201402052208_0001
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-02-06 13:26:48,223 Stage-1 map = 0%, reduce = 0%
2014-02-06 13:27:23,402 Stage-1 map = 51%, reduce = 0%
2014-02-06 13:27:40,487 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 16.59 sec
2014-02-06 13:27:41,512 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:42,523 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:43,535 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:44,549 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:45,558 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:46,569 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:47,582 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:48,596 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:49,608 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:50,616 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:51,625 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:52,641 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
MapReduce Total cumulative CPU time: 35 seconds 50 msec
Ended Job = job_201402052208_0001
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1  Cumulative CPU: 35.05 sec  HDFS Read: 524501756 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 35 seconds 50 msec
OK
19648958
Time taken: 76.525 seconds
Ok, a bit longer, 76 seconds vs. 69 seconds – but as I said, they’re different access mechanisms and you’d probably get different results if you joined the data, filtered it etc.
Let’s try it in Impala now:
[oracle@bigdatalite ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
Connected to bigdatalite.localdomain:21000
Server version: impalad version 1.2.3 RELEASE (build 1cab04cdb88968a963a8ad6121a2e72a3a623eca)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v1.2.3 (1cab04c) built on Fri Dec 20 19:39:39 PST 2013)
[bigdatalite.localdomain:21000] > select count(*) from bi_airlines.flight_performance;
Query: select count(*) from bi_airlines.flight_performance
+----------+
| count(*) |
+----------+
| 19648958 |
+----------+
Returned 1 row(s) in 1.09s
[bigdatalite.localdomain:21000] >
Ah, that’s more like it – the count this time was returned in just over a second, which is what you’d expect as Impala is optimised for ad-hoc queries, Hive for larger, batch-style SQL access and transformations. Moving over to OBIEE though, after creating a TNSNAMES connection through to the BigDataLite Oracle Database 12c database, I can import the external tables in just like any regular Oracle source:
Building the RPD against these tables is thereafter the same as any Oracle database source – no need to worry about Hive drivers, SQL dialects, Impala not being supported and so on.
But … when you come to run queries, it’s slow. As slow as accessing Hadoop data via Hive, in the order of minutes to return a result set to the dashboard.
And this is what you’d expect if you connected OBIEE to a file data source, or an Oracle external table. There are no indexes on the underlying tables, no aggregates and so forth, and you’re not benefiting from the query optimisations you’d get with technologies such as Impala, Stinger from Hortonworks or whatever. In reality, if what you’re after is the convenience of connecting to Oracle tables rather than Hive or Impala ones, you’d just use the external tables you created to load the data into regular Oracle tables, and query those instead.
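A minimal sketch of that last point, using the external table created earlier (run as the bi_airlines user; the target table name flight_perf_ora is made up for illustration):

sqlplus bi_airlines/bi_airlines <<'EOF'
-- Copy the HDFS-backed external table into a regular (and indexable) Oracle table
create table flight_perf_ora as
  select * from flight_perf_ext_tab_hive;
EOF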
What ODCH is really for is data extraction and ETL, it’s not a high-performance ad-hoc query tool, and realistically you’d only really use it like this for initial prototyping or if your use-case really suited direct query access to HDFS file data. So – back to Impala then for this type of Hadoop access, though it’s a neat feature to be aware of, particularly in the context of ODI and bulk-loading Hadoop data into Oracle.