Tag Archives: Big Data
Rittman Mead BI Forum 2014 Abstract Scoring Now Live – For 1 Week Only!
The call for papers for the Rittman Mead BI Forum 2014 closed at the end of January, and we’ve had some excellent submissions on topics ranging from OBIEE, Visualizations and data discovery through to in-memory analytics, big data and data integration. As always, we’re now opening up the abstract submission list for scoring, so that anyone considering coming to either the Brighton or Atlanta events can have a say in what abstracts are selected.
The voting forms, and event details, are below:
- Brighton, May 7th – 9th 2014, Hotel Seattle Brighton : Abstract Voting Form
- Atlanta, May 14th – 16th 2014, Renaissance Atlanta Midtown : Abstract Voting Form
In case you missed it, we also announced the speaker for the Wednesday masterclass the other day – Lars George from Cloudera, who’ll be talking about Hadoop, HBase, Cloudera and how it all applies to the worlds of analytics, BI and DW – something we’re all really excited about.
Voting is open for just one week, and will close at 5pm PST on Tuesday, 18th Feb. Shortly afterwards we’ll announce the speaker line-up and open up registrations for both events. Keep an eye on the blog for more details as they come.
Testing Oracle Direct Connector for HDFS as an Alternative to Hive ODBC for OBIEE11g
In a post on the blog a couple of weeks ago, I looked at loading up a set of flight delays data into Apache Hadoop, then analysing it using Apache Hive, Cloudera Impala and OBIEE. In this scenario, OBIEE connects to the Hadoop cluster using Hive and Impala ODBC drivers, and then either Hive (through MapReduce jobs) or Impala (through its in-memory distributed query engine) returns the data to OBIEE, for display on the dashboard.
In my initial tests, as you’d expect, Hive was fairly slow (it’s optimised towards large (TB+) batch jobs), whilst Impala was fast, on a par with regular Oracle database queries albeit with a much more limited set of SQL functions supported. One of the questions I got asked after I posted the blog article, though, was how this approach compared to accessing the Hadoop data via Oracle’s “big data connectors” – either Oracle Loader for Hadoop (OLH) or Oracle Direct Connector for HDFS. Could these give us similar performance to Hive or Impala, but make the Hadoop data more “Oracle-compatible” and suitable for querying with OBIEE?
For anyone not all that familiar with Oracle’s big data connectors, there are two that are particularly relevant to BI/DW use-cases. Oracle Loader for Hadoop (OLH) is a bulk-loader between Hadoop and the Oracle Database that uses MapReduce on the Hadoop side to prepare, sort and transform data before efficiently loading it into an Oracle database. There are already lots of ways to move data in and out of Oracle from a Hadoop source – Sqoop, for example – but presumably Oracle feel OLH is a particularly efficient way of doing it, as it exploits Oracle’s direct/bulk-loading capabilities, parallelism and so forth. It is a loader though, not a transparent reading mechanism, so it’s only really appropriate for ETL-type situations where you want to copy the data out of Hadoop and into an Oracle data warehouse.
Oracle Direct Connector for HDFS (ODCH) is a bit different though, in that it uses the Oracle external table feature to map an Oracle data dictionary table onto HDFS files, or optionally Apache Hive tables (and therefore onto HDFS files, NoSQL databases or whatever). So you could potentially use ODCH to create Oracle database table representations of your Hive tables / HDFS files, then map these into the OBIEE repository and work with them just like any other Oracle (external) table. What this means then is that you’re not reliant on Hive or Impala ODBC drivers, or the BI Server having to generate HiveQL or ImpalaQL queries (with Impala of course not yet being officially supported), making your work from the OBIEE side a lot easier than if you’re trying to work with Hive or Impala directly.
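To make that concrete: once the external tables exist (I create them later in this post), you can query them with ordinary Oracle SQL. A minimal sketch against the ORIGIN_EXT_TAB_HIVE table whose generated DDL appears further down:

select origin_state, count(*) as airports
from   bi_airlines.origin_ext_tab_hive
group  by origin_state
order  by 2 desc;

From the BI Server’s point of view this is just another Oracle table, which is the whole attraction.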
So how well does it work then? I decided to give it a try using the new BigDataLite VM I mentioned on the blog last week, along with the Airline Delays dataset I used in the post on Impala and Hive the other week. My starting point then was a Hive database with four tables – flight performance (19m rows), dest and origin (2k rows each) and carrier (1.5k rows), like this:
Now obviously Oracle’s big data connectors aren’t installed as part of Hadoop by default, but the BigDataLite VM does have them pre-installed, along with an Oracle Database 12c database, so a lot of the work in setting things up is done for you. By default, an install of the big data connectors connects to the Hadoop environment on the same machine, so all of the setup commands I use will assume that the Hive server and so on are on the same server – localhost in this case – although of course you can configure them to connect to a remote Hadoop server or cluster.
The way ODCH works is that it uses the Oracle external table “pre-processor” feature to stream HDFS file content into your query session, with the pre-processor in this scenario being a utility provided by the big data connectors to connect the two environments together. This blog post from Oracle explains the process in a bit more detail, but the key things to understand are that it’s conceptually similar to accessing regular file data via external tables, giving us the benefit of fairly seamless access to this external data (vs. using SQL*Loader, or OLH in this case), but it’s also not “real” Oracle table data, so there are no indexes or other performance structures to make queries run faster – though, like regular external tables, you can run ODCH loading in parallel.
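If you’ve not come across the external table pre-processor feature before, it’s the same mechanism you’d use to, say, decompress files on the fly as they’re read. A minimal generic sketch – the directory objects, script name and columns here are hypothetical and nothing to do with the ODCH setup that follows:

-- assumes ext_data_dir points at the data files and exec_dir at the script;
-- unzip_stream.sh just writes the uncompressed file contents to stdout
create table sales_ext (
  sale_id  number,
  amount   number
)
organization external (
  type oracle_loader
  default directory ext_data_dir
  access parameters (
    records delimited by newline
    preprocessor exec_dir:'unzip_stream.sh'
    fields terminated by ','
  )
  location ('sales_2014.csv.gz')
)
reject limit unlimited;

ODCH works the same way, except its hdfs_stream pre-processor streams the file contents out of HDFS rather than off the local filesystem.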
So let’s use the setup in the BigDataLite VM to create some external tables in the Oracle database that map to my Hive tables, and underlying HDFS data files. Before we do this though we need to set up a few things; first, we need to create a temporary directory on the Linux filesystem to hold the metadata files created by ODCH, and we also need to specify where the ODCH HDFS file streamer utility can be found. Logging in as oracle/welcome1 on the VM, I first create the temporary directory, and then set an environment variable the rest of the process will be looking for:
mkdir bi_airlines_dir
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*
Now I log into SQL*Plus and create a database user that I’ll create the external tables in, and grant it access to the USERS tablespace:
sqlplus / as sysdba

create user bi_airlines identified by bi_airlines quota unlimited on users;
grant connect, resource to bi_airlines;
Next, I need to create two database directory objects; one to point to where the ODCH HDFS stream pre-processor lives, and the other to point to my temporary directory:
create or replace directory bi_airlines_dir as '/home/oracle/bi_airlines_dir';
grant read, write on directory bi_airlines_dir to bi_airlines;

create or replace directory osch_bin_path as '/u01/connectors/osch/bin';
grant read, write, execute on directory osch_bin_path to bi_airlines;
Now, re-purposing the scripts on the BigDataLite VM provided to create external tables over the Moviework Hive database, each of my tables requires two setup files; one, a shell script, to call the ODCH utility and reference an XML parameter file, and the second, the XML file, which contains the definition of the external table. To take an example, the shell script (genloc_origin_hive.sh) to create an external table over my “origin” Hive table looks like this:
# Add HIVE_HOME/lib* to HADOOP_CLASSPATH. This cannot be done
# in the login profiles since this breaks Pig in the previous lab.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HIVE_HOME/lib/*

hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
       oracle.hadoop.exttab.ExternalTable \
       -conf /home/oracle/movie/moviework/osch/origin_hive.xml \
       -createTable
whereas the accompanying XML file (origin_hive.xml) looks like this:
<configuration>
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>BI_AIRLINES.ORIGIN_EXT_TAB_HIVE</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>hive</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.tableName</name>
    <value>origin</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.databaseName</name>
    <value>bi_airlines</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.url</name>
    <value>jdbc:oracle:thin:@localhost:1521:orcl</value>
  </property>
  <property>
    <name>oracle.hadoop.connection.user</name>
    <value>BI_AIRLINES</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.defaultDirectory</name>
    <value>bi_airlines_dir</value>
  </property>
</configuration>
Within the XML file, we specify the name of the external table that the utility will create in Oracle for you (ORIGIN_EXT_TAB_HIVE within the BI_AIRLINES schema); that it’s a Hive (rather than HDFS file) source; and the Hive database and table name to source data from. The utility then stores metadata in the operating system directory you pass to it, pointing to the individual HDFS files that contain your data. So whilst the HDFS data itself isn’t copied into Oracle, a hard reference to each file is, and if you add more files to the Hive table’s HDFS directory you’ll need to re-run the utility to register them – it’s not completely automatic beyond this point, but it’s more or less fine if a single file provides the data, as it does in my slightly non-representative case.
So let’s run the shell script and create one of the external tables, passing in the Oracle database password for the “bi_airlines” user when prompted:
[oracle@bigdatalite osch]$ sh genloc_origin_hive.sh
Oracle SQL Connector for HDFS Release 2.3.0 - Production
Copyright (c) 2011, 2013, Oracle and/or its affiliates. All rights reserved.
[Enter Database Password:]
14/02/06 11:59:58 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
14/02/06 11:59:58 INFO hive.metastore: Trying to connect to metastore with URI thrift://bigdatalite.localdomain:9083
14/02/06 11:59:58 INFO hive.metastore: Waiting 1 seconds before next connection attempt.
14/02/06 11:59:59 INFO hive.metastore: Connected to metastore.
The create table command succeeded.

CREATE TABLE "BI_AIRLINES"."ORIGIN_EXT_TAB_HIVE"
(
  "ORIGIN"        VARCHAR2(4000),
  "ORIGIN_DEST"   VARCHAR2(4000),
  "ORIGIN_CITY"   VARCHAR2(4000),
  "ORIGIN_STATE"  VARCHAR2(4000),
  "AIRPORT_ID"    VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "BI_AIRLINES_DIR"
   ACCESS PARAMETERS
   (
     RECORDS DELIMITED BY 0X'0A'
     CHARACTERSET AL32UTF8
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
     FIELDS TERMINATED BY 0X'7C'
     MISSING FIELD VALUES ARE NULL
     (
       "ORIGIN" CHAR(4000) NULLIF "ORIGIN"=0X'5C4E',
       "ORIGIN_DEST" CHAR(4000) NULLIF "ORIGIN_DEST"=0X'5C4E',
       "ORIGIN_CITY" CHAR(4000) NULLIF "ORIGIN_CITY"=0X'5C4E',
       "ORIGIN_STATE" CHAR(4000) NULLIF "ORIGIN_STATE"=0X'5C4E',
       "AIRPORT_ID" CHAR(4000) NULLIF "AIRPORT_ID"=0X'5C4E'
     )
   )
   LOCATION
   (
     'osch-20140206120000-1947-1'
   )
) PARALLEL REJECT LIMIT UNLIMITED;

The following location files were created.

osch-20140206120000-1947-1 contains 1 URI, 150606 bytes

150606 hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/bi_airlines.db/origin/geog_origin.txt
Note the HDFS file reference at the end – this is the metadata created by the utility, which we’d need to update if more data files are added to the underlying Hive table.
So let’s fire up SQL*Plus and take a look at the tables:
[oracle@bigdatalite osch]$ sqlplus bi_airlines/bi_airlines

SQL*Plus: Release 12.1.0.1.0 Production on Thu Feb 6 13:20:12 2014
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Last Successful login time: Thu Feb 06 2014 12:21:52 -05:00

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> select table_name from user_tables;

TABLE_NAME
--------------------------------------------------------------------------------
FLIGHT_PERF_EXT_TAB_HIVE
ORIGIN_EXT_TAB_HIVE
DEST_EXT_TAB_HIVE
CARRIER_EXT_TAB_HIVE

SQL> set timing on
SQL> select count(*) from flight_perf_ext_tab_hive;

  COUNT(*)
----------
  19648958

Elapsed: 00:01:09.74
Not bad, but just over a minute to do a quick row-count on the table. Let’s try the same thing using Hive itself – note that ODCH and Hive aren’t using the same access technique to the underlying data, and one might make more sense than the other for particular query situations.
[oracle@bigdatalite ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/oracle/hive_job_log_2b7797b2-1ef8-4ffc-b8b1-ab4f77b19cea_423831297.txt
hive> select count(*) from bi_airlines.flight_performance;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201402052208_0001, Tracking URL = http://bigdatalite.localdomain:50030/jobdetails.jsp?jobid=job_201402052208_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201402052208_0001
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-02-06 13:26:48,223 Stage-1 map = 0%, reduce = 0%
2014-02-06 13:27:23,402 Stage-1 map = 51%, reduce = 0%
2014-02-06 13:27:40,487 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 16.59 sec
2014-02-06 13:27:41,512 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:42,523 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:43,535 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:44,549 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:45,558 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:46,569 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:47,582 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 34.0 sec
2014-02-06 13:27:48,596 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:49,608 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:50,616 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:51,625 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
2014-02-06 13:27:52,641 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 35.05 sec
MapReduce Total cumulative CPU time: 35 seconds 50 msec
Ended Job = job_201402052208_0001
MapReduce Jobs Launched:
Job 0: Map: 2  Reduce: 1  Cumulative CPU: 35.05 sec  HDFS Read: 524501756  HDFS Write: 9  SUCCESS
Total MapReduce CPU Time Spent: 35 seconds 50 msec
OK
19648958
Time taken: 76.525 seconds
OK, a bit longer then – 76 seconds vs. 69 seconds – but as I said, they’re different access mechanisms and you’d probably get different results if you joined the data, filtered it and so on.
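For what it’s worth, a fairer comparison would be a join and filter across the fact and dimension external tables, something like the sketch below – note that the flight table’s column names here are assumptions on my part, as only the origin table’s DDL is shown above:

select o.origin_state, count(*) as flights
from   bi_airlines.flight_perf_ext_tab_hive f
join   bi_airlines.origin_ext_tab_hive o
  on   f.origin = o.origin
where  o.origin_state = 'GA'
group  by o.origin_state;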
Let’s try it in Impala now:
[oracle@bigdatalite ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
Connected to bigdatalite.localdomain:21000
Server version: impalad version 1.2.3 RELEASE (build 1cab04cdb88968a963a8ad6121a2e72a3a623eca)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v1.2.3 (1cab04c) built on Fri Dec 20 19:39:39 PST 2013)
[bigdatalite.localdomain:21000] > select count(*) from bi_airlines.flight_performance;
Query: select count(*) from bi_airlines.flight_performance
+----------+
| count(*) |
+----------+
| 19648958 |
+----------+
Returned 1 row(s) in 1.09s
[bigdatalite.localdomain:21000] >
Ah, that’s more like it – the count this time was returned in just over a second, which is what you’d expect, as Impala is optimised for ad-hoc queries whilst Hive is aimed at larger, batch-style SQL access and transformations. Moving over to OBIEE though, after creating a TNSNAMES connection through to the BigDataLite Oracle Database 12c database, I can import the external tables just like any regular Oracle source:
Building the RPD against these tables is thereafter the same as any Oracle database source – no need to worry about Hive drivers, SQL dialects, Impala not being supported and so on.
But … when you come to run queries, it’s slow. As slow as accessing Hadoop data via Hive, in the order of minutes to return a result set to the dashboard.
And this is what you’d expect if you connected OBIEE to a file data source, or an Oracle external table. There are no indexes on the underlying tables, no aggregates and so forth, and you’re not benefiting from the query optimisations you’d get with technologies such as Impala, or Stinger from Hortonworks. In reality, if what you’re after is the convenience of connecting to Oracle tables rather than Hive or Impala ones, you’d use the external tables you created to load the data into regular Oracle tables, and query those instead.
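In other words, treat the external tables as a landing point rather than a query target. A minimal sketch of that approach – FLIGHT_PERF is a target table name I’ve made up for the copy:

-- one-off copy into a regular (indexable, partitionable) Oracle table
create table bi_airlines.flight_perf
  as select * from bi_airlines.flight_perf_ext_tab_hive;

-- or, for repeat loads into an existing table, a direct-path insert
insert /*+ append */ into bi_airlines.flight_perf
  select * from bi_airlines.flight_perf_ext_tab_hive;
commit;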
What ODCH is really for is data extraction and ETL; it’s not a high-performance ad-hoc query tool, and realistically you’d only use it like this for initial prototyping, or if your use-case really suited direct query access to HDFS file data. So – back to Impala then for this type of Hadoop access, though ODCH is a neat feature to be aware of, particularly in the context of ODI and bulk-loading Hadoop data into Oracle.
Rittman Mead BI Forum 2014 Call for Papers Closing Soon – And News on This Year’s Masterclass
It’s just a couple of days until the call for papers for the Rittman Mead BI Forum 2014 closes, with suggested topics this year including OBIEE (of course), Essbase, Endeca, Big Data, Visualizations, In-Memory analysis and data integration. So far we’ve had some excellent submissions, but we’re still looking for more – so if you’re considering putting in an abstract, do it now before we close the process late this Friday night!
I’m also very excited to announce that this year’s optional one-day masterclass, on the Wednesday before each event, will be presented by Lars George from Cloudera, who’ll be talking about Hadoop, Cloudera’s distribution of Hadoop and their management and real-time query tools, and how these relate to the world of Oracle BI&DW. Lars is a Cloudera Solutions Architect and their Head of Services in EMEA, and is also an HBase committer and author of the book “HBase: The Definitive Guide”.
You’ll probably have seen a lot about big data, and Cloudera, on this blog over the past few months, and I’m particularly grateful to Justin Kestelyn – who used to run OTN and the Oracle ACE Program, and now has a similar role over at Cloudera – for making it happen. Thanks Justin and Lars, and we’ll look forward to seeing Lars in Brighton and Atlanta in May this year.
Once the call for papers closes, we’ll do the usual vote to allow potential attendees to influence the paper selection, and then we’ll announce the agendas and open the events up for registration later in February. Until then though – get your abstracts in now before it’s too late…
Looking at the ODI12c Hadoop Demos in the New Oracle BigDataLite VM
In a post earlier today on the blog I took a look at the new Oracle BigDataLite virtual machine that’s now downloadable from OTN, and walked through some of the Cloudera Hadoop tools that come with the VM. At the end of the post I mentioned that there was also an install of ODI12c on the VM, and that it comes with a couple of Hadoop integration examples already set up for you. So what do these examples do, and how do they use the Hadoop tools and servers on the VM?
Let’s start with some background first. Hadoop is a framework for executing simple selection, filtering and aggregation batch jobs in a fault-tolerant way across horizontal clusters of servers (the BigDataLite VM is just a single-node cluster, but the process is the same). When you load data into a Hadoop cluster for analysis, it’s put into HDFS (Hadoop Distributed File System), a Unix-like filesystem that spreads data across all nodes in the cluster and has built-in redundancy and fault tolerance – Hadoop server nodes are designed with cheap, higher-failure-rate hardware in mind, and the Hadoop parallel query process detects failed nodes and works around them. In ODI terms, you’d often find data of interest sitting in HDFS, most probably because someone has done some prior processing or analysis using a tool like R, and now you want to load the results into a regular Oracle data warehouse.
Then, so that SQL-based tools such as ODI can access these files, another technology called Hive provides a SQL-like access layer over the files, very similar to how Oracle accesses files through external tables, with a Hive metastore playing the role of the Oracle data dictionary in terms of arranging files into tables, columns and databases.
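A minimal sketch of what that looks like in practice, using a made-up delimited log file directory rather than anything from the VM:

CREATE EXTERNAL TABLE web_logs (
  log_ts    STRING,
  user_id   STRING,
  activity  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/oracle/web_logs';

The files under /user/oracle/web_logs stay exactly where they are; the metastore just records how to interpret them as rows and columns.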
Then, in the background, when you query the Hive tables, the Hive Server creates MapReduce jobs on the fly to return your data, splitting the job into various mapper and reducer activities which then run across the Hadoop cluster. Hive isn’t really (these days) designed for BI-type ad-hoc queries, but it’s great for batch access to Hadoop data, which is why ODI uses it.
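So a simple aggregation against the hypothetical table above, like the one below, gets compiled into a MapReduce job behind the scenes – the GROUP BY key becomes the shuffle key and the COUNT happens on the reduce side – which is fine for batch work, but adds a lot of latency for a dashboard-style query:

SELECT activity, COUNT(*) AS hits
FROM   web_logs
GROUP  BY activity;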
In addition, there are a number of utilities that Oracle provide for connecting Hadoop to the Oracle database, collectively called Oracle’s Big Data Connectors. One of them, Oracle Loader for Hadoop, extracts data from Hadoop by pushing all of the data transformation work into MapReduce jobs, allowing you to leverage the power of the Hadoop cluster whilst easily loading data into Oracle tables. One of the ODI Hadoop knowledge modules uses this utility, along with another one called Oracle Direct Connector for HDFS. The diagram below shows the architecture behind Oracle Loader for Hadoop, and how it leverages MapReduce to do the “heavy lifting” around the data transformation.
Oracle Direct Connector for HDFS is even more conceptually familiar, and allows you to create a special type of external table in Oracle to connect to HDFS files, as opposed to regular filesystem files.
The last piece of the puzzle is an add-in to Oracle Data Integrator, called Oracle Data Integrator Application Adaptor for Hadoop. Available for both ODI11g and 12c, this provides a number of new knowledge modules designed for accessing Hadoop data along with connectivity to Hive and HDFS, and is a pre-requisite for the connectivity we’ll see in this posting.
The knowledge modules that this application adapter provides are:
- IKM File to Hive (Load Data) – for loading file data into an existing Hive table
- IKM Hive Control Append – for loading data to and from Hive tables, for in-Hive-database ETL
- IKM Hive Transform – for transforming Hive data using more complex expressions and SerDes
- IKM File-Hive to Oracle (OLH) – for loading data into an Oracle table from Hive, using OLH/ODCH
- CKM Hive – for applying static and flow controls to Hive tables
- RKM Hive – for reverse-engineering Hive metadata into the ODI repository
So let’s take a closer look at what’s in the ODI12c examples on the BigDataLite VM, starting with the Topology. In the Topology tab in ODI Studio you’ll see the Hive technology and, if you drill into it further, connections to the Hive server on the VM and the various Hive databases.
What this is connecting to is a service within the Hadoop cluster called HiveServer2 – this is an improvement over the old HiveServer1 that came with earlier distributions of Cloudera Hadoop, which could only reliably support a single connection, whereas HiveServer2 can support many concurrent connections. If you go over to Cloudera Manager and look at the Services tab, you’ll see it listed alongside the Hive Metastore server under the main Hive service.
Note that most out-of-the-box Cloudera Hadoop 4 distributions don’t have HiveServer2 enabled and running, so you’ll need to add it from the Services menu if you’re creating your own Hadoop setup.
HiveServer2 runs on port 10000, and ODI connects to it via a JDBC connection. The files and tables that ODI is then going to work with exist in a Hive database, which you can see by looking at Hue and clicking on the Metastore Manager icon. The tables ODI will be working with are movieapp_log_avro and movieapp_log_odistage. In the background, these Hive tables map onto HDFS files, with movieapp_log_avro using the Apache Avro data serialisation tool to parse log data into separate “columns” of data.
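For reference, an Avro-backed Hive table is typically declared along these lines – this is a generic sketch using the standard Hive AvroSerDe and a made-up schema location, not the actual movieapp_log_avro DDL from the VM:

CREATE EXTERNAL TABLE movie_log_avro_example
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/oracle/movie_logs'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/oracle/schemas/movie_log.avsc');

Note there’s no column list in the DDL; the columns come from the Avro schema file, which is how the log data ends up parsed into separate “columns” that ODI can then map from.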
So the ODI project does two things (or three, to be precise):
1. It uses the reusable mapping feature to load data from the avro-format log file into a staging table, also in Hive
2. It then takes that Hive data and loads it into an Oracle database table
All of these are then wrapped up into an ODI package, which calls the first step (and the reusable mapping), and then the second step.
The Model section within the Designer navigator shows the two Hive tables as data sources and targets we can work with, with the avro file’s parsing specification turning the log file into a set of columns we can extract from.
So looking at the first mapping, it reads from the reusable mapping over the avro table, then transforms and loads the data into another Hive table.
Switching to the Physical tab in the mapping editor, you can see that all the work is taking place within a single execution unit – because the transformation is all internal to Hive.
Looking at the target properties, you can see the IKM Hive Control Append knowledge module was used.
Running the mapping shows you the various steps in the process, and as this is an internal Hive transformation, all you see is HiveQL – the SQL dialect used by Hive.
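I won’t paste the generated statements here, but the heart of an IKM Hive Control Append load is an insert-select between the two Hive tables, conceptually something like the sketch below (the column names are illustrative rather than the real movieapp ones):

INSERT INTO TABLE movieapp_log_odistage
SELECT custid,
       movieid,
       activity
FROM   movieapp_log_avro
WHERE  activity IS NOT NULL;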
The second mapping then takes this Hive staging table and loads its data into Oracle, with two execution units shown in the Physical mapping view.
Then, looking at the target object’s properties, you can see that the IKM File-Hive to Oracle (OLH-OSCH) knowledge module is used to move data out of Hadoop and into Oracle.
And when you execute the mapping, you can see the Oracle Loader for Hadoop mapping file being created, the utility run, and then the data moved through the usual staging table and into the target Oracle table.
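The very last leg of that process is plain Oracle SQL. A hedged sketch of the sort of statement the load finishes with – the C$ work table name follows ODI’s usual convention, but the table and column names are otherwise assumptions of mine:

-- final ODI integration step: move rows from the OLH-loaded work table into the target
INSERT /*+ APPEND */ INTO movieapp_log_fact (custid, movieid, activity)
  SELECT custid, movieid, activity
  FROM   "C$_0MOVIEAPP_LOG_FACT";
COMMIT;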
So there you have it. There’s more you can do with ODI and the ODI Application Adaptor for Hadoop but these are two nice examples – take a look if you get a chance.