Tag Archives: Big Data
Adding Oracle Big Data SQL to ODI12c to Enhance Hive Data Transformations
An updated version of the Oracle BigDataLite VM came out a couple of weeks ago, and as well as updating the core Cloudera CDH software to the latest release it also included Oracle Big Data SQL, the SQL access layer over Hadoop that I covered on the blog a few months ago (here and here). Big Data SQL takes the SmartScan technology from Exadata and extends it to Hadoop, presenting Hive tables and HDFS files as Oracle external tables and pushing down the filtering and column-selection of data to individual Hadoop nodes. Any table registered in the Hive metastore can be exposed as an external table in Oracle, and a Big Data SQL agent installed on each Hadoop node gives those nodes the ability to understand full Oracle SQL syntax rather than the cut-down SQL dialect that you get with Hive.
There are two immediate use-cases that come to mind when you think about Big Data SQL in the context of BI and data warehousing: you can use Big Data SQL to include Hive tables in regular Oracle set-based ETL transformations, giving you the ability to reference Hive data during part of your data load; and you can also use Big Data SQL as a way to access Hive tables from OBIEE, rather than having to go through the Hive or Impala ODBC drivers. Let’s start off in this post by looking at the ETL scenario using ODI12c as the data integration environment, and I’ll come back to the BI example later in the week.
You may recall that in a couple of posts earlier in the year on ETL and data integration on Hadoop, I looked at a scenario where I wanted to geo-code web server log transactions using an IP address range lookup file from a company called MaxMind. To determine the country for a given IP address you need to locate the IP address of interest within the ranges listed in the lookup file, something that’s easy to do with a full SQL dialect such as that provided by Oracle:
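As a rough sketch of what that range lookup looks like in Oracle SQL (the table and column names here are illustrative, not the actual MaxMind or log table definitions):

-- Sketch only: geocode each log entry by locating its IP address (converted
-- to an integer) within the start/end range held in the geocoding lookup table
select l.host_ip_integer,
       g.country_name
from   apache_log_entries       l
       join geoip_country_ranges g
         on l.host_ip_integer between g.range_start_integer
                                  and g.range_end_integer;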
In my case, I’d want to join my Hive table of server log entries to a Hive table containing the IP address ranges using the BETWEEN operator – except that Hive doesn’t support any join type other than an equi-join. You can use Impala and a BETWEEN clause there, but because Impala works in-memory, anything other than a relatively small log file Hive table took massive amounts of memory to do the join in my testing, which effectively ruled out doing the geo-lookup set-based. I then did the lookup using Pig and a Python API into the geocoding database, but then you’ve got to learn Pig; I finally came up with my best solution using Hive streaming and a Python script that called that same API, but each of these approaches is fairly involved and requires a degree of skill and experience from the developer.
But this of course is where Big Data SQL could be useful. If I could expose the Hive table containing my log file entries as an Oracle external table and then join that within Oracle to an Oracle-native lookup table, I could do my join using the BETWEEN operator and then output the join results to a temporary Oracle table; once that’s done I could then use ODI12c’s Sqoop functionality to copy the results back down to Hive for the rest of the ETL process. Looking at my Hive database using SQL*Developer 4.0.3’s new ability to work with Hive tables, I can see the table I’m interested in listed there:
and I can also see it listed in the DBA_HIVE_TABLES static view that comes with Big Data SQL on Oracle Database 12c:
SQL> select database_name, table_name, location
  2  from dba_hive_tables
  3  where table_name like 'access_per_post%';

DATABASE_N TABLE_NAME                     LOCATION
---------- ------------------------------ --------------------------------------------------
default    access_per_post                hdfs://bigdatalite.localdomain:8020/user/hive/ware
                                          house/access_per_post
default    access_per_post_categories     hdfs://bigdatalite.localdomain:8020/user/hive/ware
                                          house/access_per_post_categories
default    access_per_post_full           hdfs://bigdatalite.localdomain:8020/user/hive/ware
                                          house/access_per_post_full
There are various ways to create the Oracle external tables over Hive tables in the linked Hadoop cluster, including using the new DBMS_HADOOP package to create the Oracle DDL from the Hive metastore table definitions, or using SQL*Developer Data Modeler to generate the DDL from modelled Hive tables; but if you know the Hive table definition and it’s not too complicated, you might as well just write the DDL statement yourself using the new ORACLE_HIVE external table access driver. In my case, the external table corresponding to the Hive table I want to geo-code looks like this:
CREATE TABLE access_per_post_categories(
  hostname     varchar2(100),
  request_date varchar2(100),
  post_id      varchar2(10),
  title        varchar2(200),
  author       varchar2(100),
  category     varchar2(100),
  ip_integer   number)
organization external
 (type oracle_hive
  default directory default_dir
  access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));
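Once created, the external table behaves (for querying purposes) like any other Oracle table, so the “join and land in a temporary Oracle table” step I described earlier could in principle be expressed as something like the sketch below; the GEOIP_COUNTRY_RANGES lookup table and the staging table name are illustrative names of my own rather than objects from the demo environment:

-- Sketch only: join the Hive-backed external table to a native Oracle
-- geocoding lookup table using BETWEEN, and land the results in a staging
-- table ready to be copied back down to Hive
create table stg_access_per_post_geocoded as
select a.hostname,
       a.request_date,
       a.post_id,
       a.title,
       a.author,
       a.category,
       g.country_name
from   access_per_post_categories a
       join geoip_country_ranges  g
         on a.ip_integer between g.range_start_integer
                             and g.range_end_integer;

In practice I’ll let ODI generate the equivalent SQL from the mapping described next, rather than hand-coding it.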
Then it’s just a case of importing the metadata into ODI’s repository for the external table over Hive, along with the tables I’m going to join to and then load the results into, and then creating a mapping to bring them all together.
Importantly, I can create the join between the tables using the BETWEEN clause, something I just couldn’t do when working with Hive tables on their own.
Running the mapping then joins the webserver log table to the geocoding IP address range lookup table through the Oracle SQL engine, removing all the complexity of using Hive streaming, Pig or the other workaround solutions I used before. What I can then do is add a further step to the mapping to take the output of my join and use that to load the results back into Hive, like this:
I’ll then use the IKM SQL to Hive-HBase-File (SQOOP) knowledge module to set up the export from Oracle into Hive.
Now, when I run the mapping I can see the initial table join taking place between the Oracle native table and the Hive-sourced external table, and the results then being exported back into Hadoop at the end using the Sqoop KM.
Finally, I can view the contents of the downstream Hive table loaded via Sqoop, and see that it does in fact contain the country name for each of the page accesses.
Oracle Big Data SQL isn’t a solution suitable for everyone; it only runs on the BDA and requires Exadata for the database access, and it’s an additional license cost on top of the base BDA software bundle. But if you’ve got it available it’s an excellent way to blend Hive and Oracle data, and a great way around some of the restrictions of HiveQL and the Hive JDBC/ODBC drivers. More on this topic later next week, when I’ll look at using Big Data SQL in conjunction with OBIEE 11g.
News and Updates from Oracle Openworld 2014
It’s the Saturday after Oracle Openworld 2014, and I’m now home from San Francisco and back in the UK. It’s been a great week as usual, with lots of product announcements and updates to the BI, DW and Big Data products we use on current projects. Here’s my take on what was announced this last week.
New Products Announced
From a BI and DW perspective, the most significant product announcements were around Hadoop and Big Data. Up to this point most parts of an analytics-focused big data project required you to code the solution yourself, with the diagram below showing the typical three steps in a big data project – data ingestion, analysis and sharing the results.
At the moment, all of these steps are typically performed from the command-line using languages such as Python, R, Pig, Hive and so on, with tools like Apache Flume and Apache Sqoop used to bring data into and out of the Hadoop cluster. Under the covers, these tools use technologies such as MapReduce or Spark to do their work, automatically running jobs in parallel across the cluster and making use of the easy scalability of Hadoop and NoSQL databases.
You can also neatly divide the work up on a big data project into two phases; the “discovery” phase typically performed by a data scientist, where data is loaded, analysed, correlated and otherwise “understood” to provide the initial insights, and then an “exploitation” phase where we apply governance, provide the output data in a format usable by BI tools and otherwise share the results with the wider corporate audience. The updated Information Management Reference Architecture that we collaborated on with Oracle and launched in June this year had distinct discovery and exploitation phases, and the architecture itself made a clear distinction between the Innovation part that enabled the discovery phase of a project and the Execution part that delivered the insights and data in a more governed, production setting.
This was the theme of the product announcements around analytics, BI, data warehousing and big data during Openworld 2014, with Oracle’s Omri Traub in the photo below taking us through Oracle’s big data product strategy. What Oracle are doing here is productising and “democratising” big data, putting it clearly in context of their existing database, engineered systems and BI products and linking them all together into an overall information management architecture and delivery process.
So, working through from ingestion to data analysis, these steps have typically been performed by data scientists using scripting tools and rudimentary data visualisation engines, making them labour-intensive and reliant on a small set of people conversant with these tools and processes. Oracle Big Data Discovery is aimed squarely at these steps, and combines Apache Spark-based data preparation and transformation capabilities with an analysis and visualisation engine based on Endeca Server.
Key features of Big Data Discovery include:
- Ability to analyse, parse, explore and “wrangle” data using graphical tools and a Spark-based transformation engine
- Create a catalog of the data on your Hadoop cluster, and then search that catalog using Endeca Server search technologies
- Create recommendations of other datasets that might interest you, based on what you’re looking at now
- Visualize your datasets to help understand what they contain, and discover new insights
Under the covers it comprises two parts; the data loading, transformation and profiling part that uses Apache Spark to do its work in parallel across all the nodes in the cluster, and the analysis part, which takes the data prepared by Apache Spark and loads it into the Endeca Server in-memory engine to perform the analysis, aggregation and data visualisation. Unlike the Spark part, the Endeca Server element runs on just one node and limits the size of the analysis dataset to what can fit in-memory in the Endeca Server engine, but in practice you’re going to work with a sample of the data rather than the entire dataset at that stage (in time the assumption is that the Endeca Server engine will be unbundled and run natively on YARN, giving it the same scalability as the Spark-based data ingestion and transformation part). Initially Big Data Discovery will run on-premise with a cloud version later on, and it’s not dependent on Big Data Appliance – expect to see something later this year / early next year.
Another new product that addresses the discovery phase and discovery lab part of a big data project is Oracle Data Enrichment Cloud Service, from the Oracle Data Integration team and designed to complement ODI and Oracle EDQ. Whilst Oracle positioned ODECS as something you’d use as well as Big Data Discovery and typically upstream from BDD, to me there seemed to be a fair bit of overlap between the products, with both tools doing data profiling and transformation but BDD being more focused on the exploration and discovery part, and ODECS being more focused on early-stage data profiling and transformation.
ODECS is clearly more of an ETL tool complement and runs natively in the cloud, right from the start. It’s most probably aimed at customers with their Hadoop dataset already in the cloud, maybe using Amazon Elastic MapReduce or Oracle’s new Hadoop-as-a-Service, and it has more in common with the old Data Quality Option for Oracle Warehouse Builder than Endeca’s search-first analytic interface. It’s got a very nice interface, including a mobile-enabled website and the ability to include and merge in external datasets, including Oracle’s own Data as a Service platform offering. Along with the new Metadata Management tool Oracle also launched at Openworld it’s a great addition to the Oracle Data Integration product suite, but I can’t help thinking that its initial availability only on Oracle’s public cloud platform is going to limit its use with Oracle’s typical customers – we’ll have to just wait and see.
The other major product that addresses big data projects was Oracle Big Data SQL. Partly addressing the discovery phase of big data projects but mostly (to my mind) addressing the exploitation phase, and the execution part of the information management architecture, Big Data SQL gives Oracle Exadata the ability to return data from Hive and NoSQL on the Big Data Appliance as well as data from its normal relational store. I covered Big Data SQL on the blog a few weeks ago and I’ll be posting some more in-depth articles on it next week, but the other main technical innovation with the product is its bringing of Exadata’s SmartScan feature to Hadoop, projecting and filtering data at the Hadoop storage node level and also giving Hadoop the ability to understand regular Oracle SQL, rather than the cut-down version you get with HiveQL.
Where this then leaves us is with the ability to do most of a big data project using (Oracle) tools, bringing big data analysis within reach of organisations with Oracle-style budgets but without access to rare data scientist-type resources. Going back to my diagram earlier, a post-OOW big data project using the new products launched in this last week could look something like this:
Big Data SQL is out now and depends on BDA and Exadata for its use; Big Data Discovery should be out in a few months’ time, runs on-premise but doesn’t require BDA, whilst ODECS is cloud-only and runs on a BDA in the background. Expect more news and more integration/alignment from the products as 2014 ends and 2015 starts, and we’re looking forward to using them on Oracle-centric Hadoop projects in the near future.
Product Updates for BI, Data Integration, Exalytics, BI Applications and OBIEE
Other news announced over the week for products we more commonly use on projects include:
- Oracle BI Cloud Service, now GA and covered on the blog in a five-part series just before Openworld
- Oracle have ended development of the Informatica version of the BI Apps at release 7.9.6.4, and there won’t be an 11g release that uses Informatica as the embedded ETL tool; instead, customers will need to reimplement using ODI to get to BI Apps 11g, and I did hear mention of a migration tool to be released soon
- Oracle Transactional BI Enterprise Edition, a cloud-based BI Apps version for Fusion Apps running in Oracle Public Cloud
- Certification for Oracle Database 12c In-Memory for Exalytics, with TimesTen for Exalytics expected to be de-emphasised over time.
- A new option to install Exalytics in the Big Data Appliance Starter Rack, bringing in-memory BI analysis closer to big data
- More details on OBIEE 12c, including devops improvements and the new Tableau-killer Visual Analyzer data analysis tool
- Further extensions of ODI and GoldenGate into the big data world, including the ability for GoldenGate to stream into Apache Flume
- Examples of ODI integration with cloud and SaaS data sources, including a great demo of ODI Salesforce.com and Amazon Redshift integration
Finally, something that we were particularly pleased to see was the updated Oracle Information Management Architecture I mentioned earlier referenced in most of the analytics sessions, with Oracle’s Balaji Yelamanchili for example introducing it in his big data and business analytics general session mid-way through the week.
We love the way this brings together the big data components and puts them in the context of the wider data warehouse and analytic processes; compared to a few years ago, when Hadoop and big data were considered completely separate from data warehousing and BI and were done by staff completely separate from the core business analytics team, this new reference architecture puts them squarely within the world of BI and analytics we work in. It also emphasises the new abilities Hadoop, NoSQL databases and big data can bring us – support for wider sets of data sources with dynamic schemas, the ability to economically work with and analyse much larger datasets, and support for discovery-type upfront analysis work. Finally, it recognises that to get true value out of analysis that starts on Hadoop, you eventually need to add proper data governance, make the results more widely available using full SQL tools, and use the right tools – relational databases, OLAP servers and the like – to analyse the data once it’s in a more structured form.
If you missed our write-up on the updated Information Management Reference Architecture you can read our two-part blog post here and here, read the Oracle white paper, or listen to the podcast with OTN Archbeat’s Bob Rhubart. For now though I’m looking forward to seeing the family after a week and a half away in San Francisco – thanks to OTN and the Oracle ACE Director Program for sponsoring my visit over to SF for Openworld, and we’ll post our conference presentation slides later next week when we’re back in the UK and US offices.
Oracle Big Data Information Management Reference Architecture
Getting The Users’ Trust – Part 1
Looking back over some of my truly ancient Rittman Mead blogs (so old in fact that they came with me when I joined the company soon after Rittman Mead was launched), I see recurrent themes on why people “do” BI and what makes for successful implementations. After all, why would an organisation wish to invest serious money in a project if it does not give value, either in terms of cost reduction or increased profitability through smart decisions? This requires technology to provide answers, and a workforce that is both able to use this technology and has faith that the answers returned allow them to do their jobs better. Giving users this trust in the BI platform generally boils down to resolving three issues: ease of use of the reporting tool, speed of data return, and the “accuracy” or validity of the response. These last two issues are a fundamental part of my work here at Rittman Mead and underpin all that I do in terms of BI architecture, performance, and data quality. Even today, as we adapt our BI systems to include Big Data and Advanced Analytics, I follow the same sound approaches to ensure usable, reliable data and the ability to analyse it in a reasonable time.
Storage is cheap, so don’t aggregate away your knowledge. If my raw data feed is sales by item by store by customer by day and I only store it in my data warehouse as sales by month by state, I can’t go back and do any analysis on my customers, my stores or my products. Remember that UNGROUP BY only existed in my April Fools’ post. Where you choose to store your ‘unaggregated’ data may well be different these days, with Hadoop and schema-on-read paradigms often being a sensible approach; Mark Rittman has been looking at architectures where both the traditional DWH and Big Data happily co-exist.
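As a sketch of what keeping the grain means in practice, the fact table stays at its natural transaction-level grain and we lean on partitioning to keep it manageable; the table definition below is purely illustrative rather than a prescription:

-- Illustrative only: store sales at item/store/customer/day grain and
-- range-partition by day, rather than pre-aggregating to month/state on load
create table sales_fact (
  sale_date    date          not null,
  store_id     number        not null,
  customer_id  number        not null,
  item_id      number        not null,
  quantity     number,
  amount_sold  number(12,2)
)
partition by range (sale_date)
interval (numtodsinterval(1,'DAY'))
( partition p_initial values less than (date '2014-01-01') );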
When improving performance I tend to avoid tuning specific queries; instead I aim to make frequent access patterns work well. Tuning individual queries is almost never a sustainable approach in BI; this week’s hot, ‘we need the answer immediately’ query may have no business focus next week. Indexes that we create to make a specific query fly may have no positive effect on other queries; indeed, indexes may degrade other aspects of BI performance, for example by increasing data load times, and can have subtle effects such as changing a query plan cost so that groups of materialized views are no longer candidates for query re-write (this is especially true when you use nested views and the base view is no longer accessed).
My favoured performance improvement techniques are: correctly placing the data, be it through clustering, partitioning, compression, table pinning, in-memory or whatever, and making sure that the query optimiser knows all about the nature of the data; again and again, “right” optimiser information is key to good performance. Right is not just about running DBMS_STATS.gather_XXX over tables or schemas every now and then; it is also about telling the optimiser about relationships between data items. Constraints describe the data, for example which columns allow NULL values and which columns are part of parent-child relationships (foreign keys). Extended table statistics can help describe relationships between columns in a single table: for example, in a product dimension table the product sub-category and product category columns will have an interdependence, and without that knowledge cardinality estimates can be very wrong and favour nested-loop-style plans that perform very poorly on large data sets.
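To make that concrete, here is a minimal sketch of the kind of “describe the data to the optimiser” steps I mean; the SH-style table and column names are assumptions for illustration only:

-- Create a column group (extended statistics) so the optimiser knows that
-- product sub-category and product category are interdependent
select dbms_stats.create_extended_stats(
         ownname   => user,
         tabname   => 'PRODUCTS',
         extension => '(prod_subcategory, prod_category)')
from   dual;

-- Re-gather statistics so the new column group gets its own NDV/histogram data
exec dbms_stats.gather_table_stats(user, 'PRODUCTS', method_opt => 'for all columns size auto')

-- Declare the parent-child relationship; RELY DISABLE NOVALIDATE tells the
-- optimiser the relationship holds without the cost of enforcing it
alter table sales add constraint fk_sales_product
  foreign key (prod_id) references products (prod_id)
  rely disable novalidate;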
Sometimes we will need to create aggregates to answer queries quickly; I tend to build ‘generic’ aggregates, those that can be used by many queries. Often I find that although data is loaded frequently, even near-real-time, many business users wish to look at larger time windows such as week, month or quarter; in practice I see little need for day-level aggregates over the whole data warehouse timespan, although there will always be specific cases that might require day-level summaries. If I build summary tables or use Materialized Views I aim to make tables that are at least 80% smaller than the base table, and to avoid aggregates that partially roll up many dimensional hierarchies; customer category by product category by store region by month would probably not be the ideal aggregate for most real-user queries. That said, Oracle does allow us to use fancy grouping semantics when building aggregates (grouping sets, group by rollup and group by cube). The in-database Oracle OLAP cube functionality is still alive and well (and was given a performance boost in Oracle 12c); it may be more appropriate to aggregate in a cube (or its relational look-alike) rather than in individual summaries.
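As an example of the sort of ‘generic’ aggregate I have in mind, here is a hedged sketch of a query-rewrite-enabled summary using GROUP BY ROLLUP; again the SH-style table and column names are illustrative assumptions:

-- Sketch: a month / product-category summary that query rewrite can use for
-- many different reports; ROLLUP gives us the sub-totals for free, and the
-- GROUPING_ID column helps rewrite tell the aggregation levels apart
create materialized view mv_sales_month_prodcat
  build immediate
  refresh complete on demand
  enable query rewrite
as
select t.calendar_month_desc,
       p.prod_category,
       grouping_id(t.calendar_month_desc, p.prod_category) as gid,
       sum(s.amount_sold)                                  as amount_sold,
       count(*)                                            as row_count
from   sales s
       join times    t on s.time_id = t.time_id
       join products p on s.prod_id = p.prod_id
group by rollup (t.calendar_month_desc, p.prod_category);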
Getting the wrong results quickly is no good; we must be sure that the results we display are correct. As professional developers we test to prove that we are not losing or gaining data through incorrect joins and filters, but ETL coding is often the smallest factor in “incorrect results”, and that brings me to part 2, Data Quality.
Using Oracle GoldenGate for Trickle-Feeding RDBMS Transactions into Hive and HDFS
A few months ago I wrote a post on the blog around using Apache Flume to trickle-feed log data into HDFS and Hive, using the Rittman Mead website as the source for the log entries. Flume is a good technology to use for this type of capture requirement as it captures log entries, HTTP calls, JMS queue entries and other “event” sources easily, has a resilient architecture and integrates well with HDFS and Hive. But what if the source you want to capture activity for is a relational database, for example Oracle Database 12c? With Flume you’d need to spool the database transactions to file, whereas what you really want is a way to directly connect to the database engine and capture the changes from source.
Which is exactly what Oracle GoldenGate does, and what most people don’t realise is that GoldenGate can also load data into HDFS and Hive, as well as the usual database targets. Hive and HDFS aren’t fully-supported targets yet, but you can use the Oracle GoldenGate for Java adapter to act as the handler process and then land the data in HDFS files or Hive tables on your target Hadoop platform. My Oracle Support has two tech notes, “Integrating OGG Adapter with Hive (Doc ID 1586188.1)” and “Integrating OGG Adapter with HDFS (Doc ID 1586210.1)”, that give example implementations of the Java adapters you’d need for these two target types, with the overall end-to-end process for landing Hive data looking like the diagram below (the HDFS one just swaps Hive out for HDFS at the handler adapter stage).
This is also a good example of the sorts of technology we’d use to implement the “data factory” concept within the new Oracle Information Management Reference Architecture, the part of the architecture that moves data between the Hadoop and NoSQL-based Data Reservoir and the relationally-stored enterprise information store; in this case, trickle-feeding transactional data from the Oracle database into Hadoop, perhaps to archive it at lower cost than we could in an Oracle database, or to add transaction activity data to a Hadoop-based application.
So I asked my colleague Nelio Guimaraes to set up a GoldenGate capture process on our Cloudera CDH5.1 Hadoop cluster, using GoldenGate 12.1.2.0.0 for our source Oracle 11gR2 database and Oracle GoldenGate for Java, downloadable separately on edelivery.oracle.com under Oracle Fusion Middleware > Oracle GoldenGate Application Adapters 11.2.1.0.0 for JMS and Flat File Media Pack. In our example, we’re going to capture activity on the SCOTT.DEPT table in the Oracle database, and then perform the following steps to set up replication from it into a replica Hive table:
- Create a table in Hive that corresponds to the table in the Oracle database.
- Create a table in the Oracle database and prepare the table for replication.
- Configure the Oracle GoldenGate Capture to extract transactions from the Oracle database and create the trail file.
- Configure the Oracle GoldenGate Pump to read the trail and invoke the custom adapter.
- Configure the property file for the Hive handler.
- Code, compile and package the custom Hive handler.
- Execute a test.
Setting up the Oracle Database Source Capture
Let’s go into the Oracle database first, check the table definition, and then connect to Hadoop to create a Hive table of the same column definition.
[oracle@centraldb11gr2 ~]$ sqlplus scott/tiger

SQL*Plus: Release 11.2.0.3.0 Production on Thu Sep 11 01:08:49 2014
Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options

SQL> describe DEPT
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 DEPTNO                                    NOT NULL NUMBER(2)
 DNAME                                              VARCHAR2(14)
 LOC                                                VARCHAR2(13)

SQL> exit
...
[oracle@centraldb11gr2 ~]$ ssh oracle@cdh51-node1
Last login: Sun Sep 7 16:11:36 2014 from officeimac.rittmandev.com
[oracle@cdh51-node1 ~]$ hive
...
create external table dept (
  DEPTNO string,
  DNAME string,
  LOC string
)
row format delimited fields terminated by '\;'
stored as textfile
location '/user/hive/warehouse/department';

exit
...
Then I install Oracle Golden Gate 12.1.2 on the source Oracle database, just as you’d do for any Golden Gate install, and make sure supplemental logging is enabled for the table I’m looking to capture. Then I go into the ggsci Golden Gate command-line utility, to first register the user it’ll be connecting as, and what table it needs to capture activity for.
[oracle@centraldb11gr2 12.1.2]$ cd /u01/app/oracle/product/ggs/12.1.2/
[oracle@centraldb11gr2 12.1.2]$ ./ggsci

$ggsci> DBLOGIN USERID sys@ctrl11g, PASSWORD password sysdba
$ggsci> ADD TRANDATA SCOTT.DEPT COLS(DEPTNO), NOKEY
GoldenGate uses a number of components to replicate data from source to targets, as shown in the diagram below.
For our purposes, though, there are just three that we need to configure; the Extract component, which captures table activity on the source; the Pump process that moves data (or the “trail”) from source database to the Hadoop cluster; and the Replicat component that takes that activity and applies it to the target tables. In our example, the extract and pump processes will be as normal, but we need to create a custom “handler” for the target Hive table that uses the Golden Gate Java API and the Hadoop FS Java API.
The tool we use to set up the extract and capture process is ggsci, the command-line Golden Gate Software Command Interface. I’ll use it first to set up the Manager process that runs on both source and target servers, giving it a port number and connection details into the source Oracle database.
$ggsci> edit params mgr

PORT 7809
USERID sys@ctrl11g, PASSWORD password sysdba
PURGEOLDEXTRACTS /u01/app/oracle/product/ggs/12.1.2/dirdat/*, USECHECKPOINTS
Then I create two configuration files, one for the extract process and one for the pump process, and then use those to start those two processes.
$ggsci> edit params ehive

EXTRACT ehive
USERID sys@ctrl11g, PASSWORD password sysdba
EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, FORMAT RELEASE 11.2
TABLE SCOTT.DEPT;

$ggsci> edit params phive

EXTRACT phive
RMTHOST cdh51-node1.rittmandev.com, MGRPORT 7809
RMTTRAIL /u01/app/oracle/product/ggs/11.2.1/dirdat/rt, FORMAT RELEASE 11.2
PASSTHRU
TABLE SCOTT.DEPT;

$ggsci> ADD EXTRACT ehive, TRANLOG, BEGIN NOW
$ggsci> ADD EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, EXTRACT ehive
$ggsci> ADD EXTRACT phive, EXTTRAILSOURCE /u01/app/oracle/product/ggs/12.1.2/dirdat/et
$ggsci> ADD RMTTRAIL /u01/app/oracle/product/ggs/11.2.1/dirdat/rt, EXTRACT phive
As the Java event handler on the target Hadoop platform won’t be able to ordinarily get table metadata for the source Oracle database, we’ll use the defgen utility on the source platform to create the parameter file that the replicat process will need.
$ggsci> edit params dept

defsfile ./dirsql/DEPT.sql
USERID ggsrc@ctrl11g, PASSWORD ggsrc
TABLE SCOTT.DEPT;

./defgen paramfile ./dirprm/dept.prm NOEXTATTR
Note that NOEXTATTR means no extra attributes; because the version on target is a generic and minimal version, the definition file with extra attributes won’t be interpreted. Then, this DEPT.sql file will need to be copied across to the target Hadoop platform where you’ve installed Oracle GoldenGate for Java, to the /dirsql folder within the GoldenGate install.
[oracle@centraldb11gr2 12.1.2]$ ssh oracle@cdh51-node1
oracle@cdh51-node1's password:
Last login: Wed Sep 10 17:05:49 2014 from centraldb11gr2.rittmandev.com
[oracle@cdh51-node1 ~]$ cd /u01/app/oracle/product/ggs/11.2.1/
[oracle@cdh51-node1 11.2.1]$ pwd
/u01/app/oracle/product/ggs/11.2.1
[oracle@cdh51-node1 11.2.1]$ ls dirsql/
DEPT.sql
Then, going back to the source Oracle database platform, we’ll start the GoldenGate Manager process, and then the extract and pump processes.
[oracle@cdh51-node1 11.2.1]$ ssh oracle@centraldb11gr2
oracle@centraldb11gr2's password:
Last login: Thu Sep 11 01:08:18 2014 from bdanode1.rittmandev.com

GGSCI (centraldb11gr2.rittmandev.com) 7> start mgr
Manager started.

GGSCI (centraldb11gr2.rittmandev.com) 8> start ehive
Sending START request to MANAGER ...
EXTRACT EHIVE starting

GGSCI (centraldb11gr2.rittmandev.com) 9> start phive
Sending START request to MANAGER ...
EXTRACT PHIVE starting
Setting up the Hadoop / Hive Replicat Process
Setting up the Hadoop side involves a couple of similar steps to the source capture side; first we configure the parameters for the Manager process, then configure the extract process that will pull table activity off of the trail file, sent over by the pump process on the source Oracle database.
[oracle@centraldb11gr2 12.1.2]$ ssh oracle@cdh51-node1
oracle@cdh51-node1's password:
Last login: Wed Sep 10 21:09:38 2014 from centraldb11gr2.rittmandev.com
[oracle@cdh51-node1 ~]$ cd /u01/app/oracle/product/ggs/11.2.1/
[oracle@cdh51-node1 11.2.1]$ ./ggsci

$ggsci> edit params mgr

PORT 7809
PURGEOLDEXTRACTS /u01/app/oracle/product/ggs/11.2.1/dirdat/*, usecheckpoints, minkeepdays 3

$ggsci> add extract tphive, exttrailsource /u01/app/oracle/product/ggs/11.2.1/dirdat/rt
$ggsci> edit params tphive

EXTRACT tphive
SOURCEDEFS ./dirsql/DEPT.sql
CUserExit ./libggjava_ue.so CUSEREXIT PassThru IncludeUpdateBefores
GETUPDATEBEFORES
TABLE SCOTT.DEPT;
Now it’s time to create the Java handler that will write the trail data to the HDFS files and Hive table. The My Oracle Support Doc ID 1586188.1 I mentioned at the start of the article has a sample Java program called SampleHandlerHive.java that writes incoming transactions into an HDFS file within the Hive directory, and also writes them to a file on the local filesystem. To get this working on our Hadoop system, we created a new Java source code file from the content in SampleHandlerHive.java, updated the hadoopConf.addResource paths to point to the correct locations of core-site.xml, hdfs-site.xml and mapred-site.xml, and then compiled it as follows:
export CLASSPATH=/u01/app/oracle/product/ggs/11.2.1/ggjava/ggjava.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/*
javac -d . SampleHandlerHive.java
Successfully executing the above command created SampleHandlerHive.class under /u01/app/oracle/product/ggs/11.2.1/dirprm/com/mycompany/bigdata. To create the JAR file that the GoldenGate for Java adapter will need, I then change directory to the /dirprm directory under the GoldenGate install and run the following commands:
jar cvf myhivehandler.jar com
chmod 755 myhivehandler.jar
I also need to create a properties file for this JAR to use, in the same /dirprm directory. This properties file amongst other things tells the Golden Gate for Java adapter where in HDFS to write the data to (the location where the Hive table keeps its data files), and also references any other JAR files from the Hadoop distribution that it’ll need to get access to.
[oracle@cdh51-node1 dirprm]$ cat tphive.properties

#Adapter Logging parameters.
gg.log=log4j
gg.log.level=info

#Adapter Check pointing parameters
goldengate.userexit.chkptprefix=HIVECHKP_
goldengate.userexit.nochkpt=true

# Java User Exit Property
goldengate.userexit.writers=jvm
jvm.bootoptions=-Xms64m -Xmx512M -Djava.class.path=/u01/app/oracle/product/ggs/11.2.1/ggjava/ggjava.jar:/u01/app/oracle/product/ggs/11.2.1/dirprm:/u01/app/oracle/product/ggs/11.2.1/dirprm/myhivehandler.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/hadoop-common-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-configuration-1.6.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-logging-1.1.3.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-lang-2.6.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/etc/hadoop:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/etc/hadoop/conf.dist:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/guava-11.0.2.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/hadoop-auth-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/hadoop-hdfs-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/commons-cli-1.2.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/protobuf-java-2.5.0.jar

#Properties for reporting statistics
# Minimum number of {records, seconds} before generating a report
jvm.stats.time=3600
jvm.stats.numrecs=5000
jvm.stats.display=TRUE
jvm.stats.full=TRUE

#Hive Handler.
gg.handlerlist=hivehandler
gg.handler.hivehandler.type=com.mycompany.bigdata.SampleHandlerHive
gg.handler.hivehandler.HDFSFileName=/user/hive/warehouse/department/dep_data
gg.handler.hivehandler.RegularFileName=cinfo_hive.txt
gg.handler.hivehandler.RecordDelimiter=;
gg.handler.hivehandler.mode=tx
Now, the final step on the Hadoop side is to start its GoldenGate Manager process, and then start the process that applies the trail data via our custom Hive handler.
GGSCI (cdh51-node1.rittmandev.com) 5> start mgr
Manager started.

GGSCI (cdh51-node1.rittmandev.com) 6> start tphive
Sending START request to MANAGER ...
EXTRACT TPHIVE starting
Testing it All Out
So now that I’ve got the extract and pump processes running on the Oracle Database side, and the apply process running on the Hadoop side, let’s do a quick test to see if it’s working. I’ll start by looking at what data is in each table at the beginning.
SQL> select * from dept;

    DEPTNO DNAME          LOC
---------- -------------- -------------
        10 ACCOUNTING     NEW YORK
        20 RESEARCH       DALLAS
        30 SALES          CHICAGO
        40 OPERATIONS     BOSTON
        50 TESTE          PORTO
        60 NELIO          STS
        70 RAQUEL         AVES

7 rows selected.
Over on the Hadoop side, there’s just one row in the Hive table:
hive> select * from customer;
OK
80MARCIA ST
Now I’ll go back to Oracle and insert a new row in the DEPT table:
SQL> insert into dept (deptno, dname, loc)
  2  values (75, 'EXEC','BRIGHTON');

1 row created.

SQL> commit;

Commit complete.
And, going back over to Hadoop, I can see Golden Gate has added that record to the Hive table, by the Golden Gate for Java adapter writing the transaction to the underlying HDFS file.
hive> select * from customer;
OK
80MARCIA ST
75 EXEC BRIGHTON
So there you have it: GoldenGate replicating Oracle RDBMS transactions into HDFS and Hive, to complement Apache Flume’s ability to replicate log and event data into Hadoop. Moreover, as Michael Rainey explained in this three-part blog series, GoldenGate is closely integrated into the new 12c release of Oracle Data Integrator, making it even easier to manage GoldenGate replication processes within your overall data loading project, and giving Hadoop developers and GoldenGate users access to the full set of load orchestration and data quality features in that product rather than having to rely on home-grown scripting or Oozie.