Tag Archives: Big Data
Realtime BI Show with Kevin and Stewart – BI Forum 2015 Special!
Jordan Meyer and I were very pleased to be invited onto the Realtime BI Show podcast last week, run by Kevin McGinley and Stewart Bryson, to talk about the upcoming Rittman Mead BI Forum running in Brighton and Atlanta in May 2015. Stewart and Kevin are of course speaking at the Atlanta BI Forum event on May 13th-15th 2015 at the Renaissance Atlanta Midtown Hotel, Atlanta, and in the podcast we talk about the one-day masterclass that Jordan and I are running, some of the sessions at the event, and the rise of big data and data discovery within the Oracle BI+DW industry.
Full details on the two BI Forum 2015 events can be found on the event homepage, along with details of the optional one-day masterclass on Delivering the Oracle Information Management and Big Data Reference Architecture, the guest speakers and the inaugural Data Visualization Challenge. Registration is now open and can be done online using the two links below.
- Rittman Mead BI Forum 2015, Brighton – May 6th – 8th 2015
- Hosted at the Hotel Seattle, Brighton Marina.
- Rittman Mead BI Forum 2015, Atlanta – May 13th – 15th 2015
- Hosted at the Renaissance Atlanta Midtown Hotel, Atlanta.
We’ve also set up a special discount code for listeners to the Realtime BI Show, with 10% off both registration and the masterclass fee for the Brighton and Atlanta events – use code RTBI10 on the Eventbrite registration forms to qualify.
Announcing Oracle E-Business Suite for Hadoop and MongoDB
Rittman Mead are very pleased today to announce our special edition of Oracle E-Business Suite R12 running on Apache Hadoop and MongoDB, for customers looking for the ultimate in scalability, flexible data storage and lower cost-of-ownership. Powered by Hadoop technologies such as Apache Hive, HDFS and MapReduce, with optional reference data storage in MongoDB and reporting provided by Apache Pig, we think this represents the ultimate platform for large deployments of Oracle’s premier ERP suite.
In this special edition of Oracle E-Business Suite R12, we’ve replaced the Oracle Database storage engine with Hadoop, MapReduce and Apache Hive, with MapReduce providing the data processing engine and Apache Hive providing a SQL layer integrated with Oracle Forms. We’ve replaced Oracle Workflow with Apache Oozie, and MongoDB becomes the optional web-scale NoSQL database for document and reference data storage, freeing you from the size limitations of relational databases, the hassles of referential integrity and the restrictions of defined schemas. Developer access is provided through Apache Hue, or you can write your own Java MapReduce and/or JavaScript MongoDB API programs to extend E-Business Suite’s functionality. Best of all, there’s no need for expensive DBAs as developers handle all data modeling themselves (with MongoDB’s collections automatically adapting to new data schemas), and HDFS’s three-way replication removes the need for complicated backup & recovery procedures.
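To give a flavour of the developer experience, a purely illustrative Hive session against a hypothetical EBS-style general ledger extract might look something like this (table, column and path names are invented for the example, not part of the actual product):

hive> CREATE EXTERNAL TABLE gl_je_lines_hdp (
        je_header_id int,
        je_line_num  int,
        ledger_id    int,
        entered_dr   double,
        entered_cr   double,
        period_name  string
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/user/ebs/gl/gl_je_lines/';

hive> select period_name, sum(entered_dr) - sum(entered_cr)
      from gl_je_lines_hdp
      group by period_name;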
We’ve also brought Oracle Reports into the 21st century by replacing it with Apache Pig, a high-level abstraction language for Hadoop that automatically compiles your “Pig Latin” programs into MapReduce code, and allows you to bring in data from Facebook and Twitter to combine with your main EBS dataset stored in Hive and MongoDB.
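As a taste of what that reporting layer might look like, here’s a sketch of a Pig Latin job (the dataset paths, aliases and field names are made up purely for illustration) that joins an EBS-style sales extract held in HDFS with a file of tweets:

-- purely illustrative: paths, aliases and field names are invented
sales  = LOAD '/user/ebs/extracts/order_lines' USING PigStorage(',')
         AS (order_id:int, customer:chararray, amount:double);
tweets = LOAD '/user/feeds/twitter/mentions' USING PigStorage('\t')
         AS (customer:chararray, tweet:chararray);
joined  = JOIN sales BY customer, tweets BY customer;
by_cust = GROUP joined BY sales::customer;
summary = FOREACH by_cust GENERATE group, SUM(joined.sales::amount) AS total_sales;
DUMP summary;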
On the longer-term roadmap, features and enhancements we’re planning include:
- Loosening the current INSERT-only restriction to allow UPDATEs, DELETEs and full ACID semantics once HIVE-5317 is implemented.
- Adding MongoDB’s new write-reliability and durability features so that data is always saved when EBS writes it to the underlying MongoDB collection
- Reducing the current 5-30 minute response times to less than a minute by moving to Tez or Apache Spark
- Providing integration with Oracle Discoverer 9iAS to delight end-users, and provide ad-hoc reporting truly at the speed-of-thought
For more details on our special Oracle E-Business Suite for Hadoop edition, contact us at enquiries@rittmanmead.com – but please note we’re only accepting new customers for today, April 1st 2015.
Oracle GoldenGate, MySQL and Flume
Back in September Mark blogged about Oracle GoldenGate (OGG) and HDFS. In this short follow-up post I’m going to look at configuring the OGG Big Data Adapter for Flume, to trickle-feed blog posts and comments from our site to HDFS. If you haven’t done so already, I strongly recommend you read through Mark’s previous post, as it explains in detail how the OGG BD Adapter works. Just like Hive and HDFS, Flume isn’t a fully-supported target, so we will use Oracle GoldenGate for Java Adapter user exits to achieve what we want.
What we need to do now is
- Configure our MySQL database to be fit for duty for GoldenGate.
- Install and configure Oracle GoldenGate for MySQL on our DB server
- Create a new OGG Extract and Trail files for the database tables we want to feed to Flume
- Configure a Flume Agent on our Cloudera cluster to ‘sink’ to HDFS
- Create and configure the OGG Java adapter for Flume
- Create External Tables in Hive to expose the HDFS files to SQL access
Setting up the MySQL Database Source Capture
The MySQL database I will use for this example contains blog posts, comments etc. from our website. We now want to use Oracle GoldenGate to capture new blog posts and our readers’ comments and feed this information into the Hadoop cluster we have running in the Rittman Mead Labs, along with other feeds, such as Twitter and activity logs.
The database has to be configured to use binary logging, and we also need to ensure that the socket file can be found in /tmp/mysql.socket. You can find the details for this in the documentation. We also need to make sure that the tables we want to extract from are using the InnoDB engine and not the default MyISAM one. The engine can easily be changed by issuing
alter table wp_mysql.wp_posts engine=InnoDB;
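For reference, the binary-logging side of this usually comes down to a few lines in my.cnf. The sketch below is an assumed minimal setup rather than the configuration of the actual Rittman Mead server; server-id and log file names will vary by environment:

# /etc/my.cnf (excerpt) - illustrative settings for GoldenGate capture
[mysqld]
# enable the binary log and use row-based logging, as required for change capture
log-bin        = mysql-bin
binlog_format  = ROW
server-id      = 1
# socket location expected by the OGG for MySQL extract
socket         = /tmp/mysql.socket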
Assuming we have already installed OGG for MySQL in /opt/oracle/OGG/, we can now go ahead and configure the Manager process and the Extract for our tables. The tables we are interested in are
wp_mysql.wp_posts
wp_mysql.wp_comments
wp_mysql.wp_users
wp_mysql.wp_terms
wp_mysql.wp_term_taxonomy
First configure the manager
-bash-4.1$ cat dirprm/mgr.prm
PORT 7809
PURGEOLDEXTRACTS /opt/oracle/OGG/dirdat/*, USECHECKPOINTS
Now configure the Extract to capture changes made to the tables we are interested in
-bash-4.1$ cat dirprm/mysql.prm
EXTRACT mysql
SOURCEDB wp_mysql, USERID root, PASSWORD password
discardfile /opt/oracle/OGG/dirrpt/FLUME.dsc, purge
EXTTRAIL /opt/oracle/OGG/dirdat/et
GETUPDATEBEFORES
TRANLOGOPTIONS ALTLOGDEST /var/lib/mysql/localhost-bin.index
TABLE wp_mysql.wp_comments;
TABLE wp_mysql.wp_posts;
TABLE wp_mysql.wp_users;
TABLE wp_mysql.wp_terms;
TABLE wp_mysql.wp_term_taxonomy;
We should now be able to create the extract and start the process, as with a normal extract.
ggsci>add extract mysql, tranlog, begin now
ggsci>add exttrail ./dirdat/et, extract mysql
ggsci>start extract mysql
ggsci>info mysql
ggsci>view report mysql
We will also have to generate metadata to describe the table structures in the MySQL database. This file will be used by the Flume adapter to map columns and data types to the Avro format.
-bash-4.1$ cat dirprm/defgen.prm
-- To generate trail source-definitions for GG v11.2 Adapters, use GG 11.2 defgen,
-- or use GG 12.1.x defgen with "format 11.2" definition format.
-- If using GG 12.1.x as a source for GG 11.2 adapters, also generate format 11.2 trails.
-- UserId logger, Password password
SOURCEDB wp_mysql, USERID root, PASSWORD password
DefsFile dirdef/wp.def
TABLE wp_mysql.wp_comments;
TABLE wp_mysql.wp_posts;
TABLE wp_mysql.wp_users;
TABLE wp_mysql.wp_terms;
TABLE wp_mysql.wp_term_taxonomy;

-bash-4.1$ ./defgen PARAMFILE dirprm/defgen.prm

***********************************************************************
Oracle GoldenGate Table Definition Generator for MySQL
Version 12.1.2.1.0 OGGCORE_12.1.2.1.0_PLATFORMS_140920.0203
...
***********************************************************************
**           Running with the following parameters                  **
***********************************************************************
SOURCEDB wp_mysql, USERID root, PASSWORD ******
DefsFile dirdef/wp.def
TABLE wp_mysql.wp_comments;
Retrieving definition for wp_mysql.wp_comments.
TABLE wp_mysql.wp_posts;
Retrieving definition for wp_mysql.wp_posts.
TABLE wp_mysql.wp_users;
Retrieving definition for wp_mysql.wp_users.
TABLE wp_mysql.wp_terms;
Retrieving definition for wp_mysql.wp_terms.
TABLE wp_mysql.wp_term_taxonomy;
Retrieving definition for wp_mysql.wp_term_taxonomy.
Definitions generated for 5 tables in dirdef/wp.def.
Setting up the OGG Java Adapter for Flume
The OGG Java Adapter for Flume will use the EXTTRAIL created earlier as a source, pack the data up and feed to the cluster Flume Agent, using Avro and RPC. The Flume Adapter thus needs to know
- Where to find the OGG EXTTRAIL to read from
- How to treat the incoming data and operations (e.g. Insert, Update, Delete)
- Where to send the Avro messages
First we create a parameter file for the Flume Adapter
-bash-4.1$ cat dirprm/flume.prm
EXTRACT flume
SETENV ( GGS_USEREXIT_CONF = "dirprm/flume.props")
CUSEREXIT libggjava_ue.so CUSEREXIT PASSTHRU INCLUDEUPDATEBEFORES
GETUPDATEBEFORES
NOCOMPRESSUPDATES
SOURCEDEFS ./dirdef/wp.def
DISCARDFILE ./dirrpt/flume.dsc, purge
TABLE wp_mysql.wp_comments;
TABLE wp_mysql.wp_posts;
TABLE wp_mysql.wp_users;
TABLE wp_mysql.wp_terms;
TABLE wp_mysql.wp_term_taxonomy;
There are two things to note here
- The OGG Java Adapter User Exit is configured in a file called flume.props
- The source tables’ structures are defined in wp.def
The flume.props file is a ‘standard’ User Exit config file
-bash-4.1$ cat dirprm/flume.props
gg.handlerlist=ggflume

gg.handler.ggflume.type=com.goldengate.delivery.handler.flume.FlumeHandler
gg.handler.ggflume.host=bd5node1.rittmandev.com
gg.handler.ggflume.port=4545
gg.handler.ggflume.rpcType=avro
gg.handler.ggflume.delimiter=;
gg.handler.ggflume.mode=tx
gg.handler.ggflume.includeOpType=true
# Indicates if the operation timestamp should be included as part of output in the delimited separated values
# true - Operation timestamp will be included in the output
# false - Operation timestamp will not be included in the output
# Default :- true
gg.handler.ggflume.includeOpTimestamp=true

# Optional properties to use the transaction grouping functionality
#gg.handler.ggflume.maxGroupSize=1000
#gg.handler.ggflume.minGroupSize=1000

### native library config ###
goldengate.userexit.nochkpt=TRUE
goldengate.userexit.timestamp=utc
goldengate.log.logname=cuserexit
goldengate.log.level=INFO
goldengate.log.tofile=true
goldengate.userexit.writers=javawriter
gg.report.time=30sec
gg.classpath=AdapterExamples/big-data/flume/target/flume-lib/*

javawriter.stats.full=TRUE
javawriter.stats.display=TRUE
javawriter.bootoptions=-Xmx32m -Xms32m -Djava.class.path=ggjava/ggjava.jar -Dlog4j.configuration=log4j.properties
Some points of interest here are
- The Flume agent we will send our data to is running on port 4545 on host bd5node1.rittmandev.com
- We want each record to be prefixed with I(nsert), U(pdate) or D(elete)
- We want each record to be postfixed with a timestamp of the transaction date
- The Java class com.goldengate.delivery.handler.flume.FlumeHandler will do the actual work. (The curious reader can view the code in /opt/oracle/OGG/AdapterExamples/big-data/flume/src/main/java/com/goldengate/delivery/handler/flume/FlumeHandler.java)
Before starting up the OGG Flume extract, let’s first make sure that the Flume agent on bd5node1 is configured to receive our Avro messages (Source) and knows what to do with the data (Sink)
a1.channels = c1
a1.sources = r1
a1.sinks = k2

a1.channels.c1.type = memory

a1.sources.r1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = bda5node1
a1.sources.r1.port = 4545

a1.sinks.k2.type = hdfs
a1.sinks.k2.channel = c1
a1.sinks.k2.hdfs.path = /user/flume/gg/%{SCHEMA_NAME}/%{TABLE_NAME}
a1.sinks.k2.hdfs.filePrefix = %{TABLE_NAME}_
a1.sinks.k2.hdfs.writeFormat=Writable
a1.sinks.k2.hdfs.rollInterval=0
a1.sinks.k2.hdfs.rollSize=1048576
a1.sinks.k2.hdfs.rollCount=0
a1.sinks.k2.hdfs.batchSize=100
a1.sinks.k2.hdfs.fileType=DataStream
Here we note that
- The agent’s source (inbound data stream) is to run on port 4545 and to use avro
- The agent’s sink will write to HDFS and store the files in /user/flume/gg/%{SCHEMA_NAME}/%{TABLE_NAME}
- The HDFS files will be rolled over every 1Mb (1048576 bytes)
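On a Cloudera cluster you would normally push this configuration out through Cloudera Manager, but if you want to test the agent standalone, starting it by hand might look something like the sketch below (assuming the configuration above has been saved as gg-agent.conf; the file name and conf directory are illustrative):

flume-ng agent --name a1 --conf /etc/flume-ng/conf --conf-file gg-agent.conf -Dflume.root.logger=INFO,console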
We are now ready to head back to the webserver that runs the MySQL database and start the Flume extract; it will feed all committed MySQL transactions against our selected tables to the Flume Agent on the cluster, which in turn will write the data to HDFS
-bash-4.1$ export LD_LIBRARY_PATH=/usr/lib/jvm/jdk1.7.0_55/jre/lib/amd64/server
-bash-4.1$ export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_55/
-bash-4.1$ ./ggsci
ggsci>add extract flume, exttrailsource ./dirdat/et
ggsci>start flume
ggsci>info flume

EXTRACT    FLUME     Last Started 2015-03-29 17:51   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:00:06 ago)
Process ID           24331
Log Read Checkpoint  File /opt/oracle/OGG/dirdat/et000008
                     2015-03-29 17:51:45.000000  RBA 7742
If I now submit this blog post I should see the results showing up in our Hadoop cluster in the Rittman Mead Labs.
[oracle@bda5node1 ~]$ hadoop fs -ls /user/flume/gg/wp_mysql/wp_posts
-rw-r--r--   3 flume flume       3030 2015-03-30 16:40 /user/flume/gg/wp_mysql/wp_posts/wp_posts_.1427729981456
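Before mapping a table over it, it’s worth sanity-checking the delimited content that Flume has landed; something like the following (output not shown here) prints the first few records, each prefixed with the operation type and suffixed with the operation timestamp as configured in flume.props:

[oracle@bda5node1 ~]$ hadoop fs -cat /user/flume/gg/wp_mysql/wp_posts/wp_posts_.1427729981456 | head -5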
We can quickly create an external table in Hive to view the results with SQL
hive> CREATE EXTERNAL TABLE wp_posts(
        op                    string,
        ID                    int,
        post_author           int,
        post_date             String,
        post_date_gmt         String,
        post_content          String,
        post_title            String,
        post_excerpt          String,
        post_status           String,
        comment_status        String,
        ping_status           String,
        post_password         String,
        post_name             String,
        to_ping               String,
        pinged                String,
        post_modified         String,
        post_modified_gmt     String,
        post_content_filtered String,
        post_parent           int,
        guid                  String,
        menu_order            int,
        post_type             String,
        post_mime_type        String,
        comment_count         int,
        op_timestamp          timestamp
      )
      COMMENT 'External table ontop of GG Flume sink, landed in hdfs'
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
      STORED AS TEXTFILE
      LOCATION '/user/flume/gg/wp_mysql/wp_posts/';

hive> select post_title from gg_flume.wp_posts where op='I' and id=22112;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1427647277272_0017, Tracking URL = http://bda5node1.rittmandev.com:8088/proxy/application_1427647277272_0017/
Kill Command = /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/hadoop/bin/hadoop job -kill job_1427647277272_0017
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2015-03-30 16:51:17,715 Stage-1 map = 0%, reduce = 0%
2015-03-30 16:51:32,363 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1.88 sec
2015-03-30 16:51:33,422 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.38 sec
MapReduce Total cumulative CPU time: 3 seconds 380 msec
Ended Job = job_1427647277272_0017
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2   Cumulative CPU: 3.38 sec   HDFS Read: 3207   HDFS Write: 35   SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 380 msec
OK
Oracle GoldenGate, MySQL and Flume
Time taken: 55.613 seconds, Fetched: 1 row(s)
Please leave a comment and you’ll be contributing to an OGG Flume!
More on the Rittman Mead BI Forum 2015 Masterclass : “Delivering the Oracle Big Data and Information Management Reference Architecture”
Each year at the Rittman Mead BI Forum we host an optional one-day masterclass before the event opens properly on Wednesday evening, with guest speakers over the years including Kurt Wolff, Kevin McGinley and, last year, Cloudera’s Lars George. This year I’m particularly excited that, together with Jordan Meyer, our Head of R&D, I’ll be presenting the masterclass on the topic of “Delivering the Oracle Big Data and Information Management Reference Architecture”.
Last year at the Brighton BI Forum we launched a new reference architecture that Rittman Mead had collaborated on with Oracle, incorporating big data and schema-on-read databases into the Oracle data warehouse and BI reference architecture. In two subsequent blog posts, and in a white paper published on the Oracle website a few weeks later, concepts such as the “Discovery Lab”, “Data Reservoirs” and the “Data Factory” were introduced as a way of incorporating the latest thinking, and product capabilities, into the reference architecture for Oracle-based BI, data warehousing and big data systems.
One of the problems I always have with reference architectures, though, is that they tell you what you should create, but they don’t tell you how. Just how do you go from a set of example files and a vague requirement from the client to do something interesting with Hadoop and data science, and how do you turn the insights produced by that process into a production-ready, enterprise big data system? How do you implement the Data Factory, and how do you use new tools such as Oracle Big Data Discovery and Oracle Big Data SQL as part of this architecture? In this masterclass we’re looking to explain the “how” and the “why” that go with this new reference architecture, based on our experiences working with clients over the past couple of years.
The masterclass will be divided into two sections; the first, led by Jordan Meyer, will focus on the data discovery and “data science” parts of the Information Management architecture, going through initial analysis and discovery of datasets using R and Oracle R Enterprise. Jordan will share techniques he uses from both his work at Rittman Mead and his work with Slacker Radio, a Silicon Valley startup, and will introduce the R and Oracle R Enterprise toolset for uncovering insights, correlations and patterns in sample datasets and productionizing them as database routines. Over his three hours he’ll cover topics including:
Session #1 – Data exploration and discovery with R (2 hours)
1.1 Introduction to R
1.2 Tidy Data
1.3 Data transformations
1.4 Data Visualization
Session #2 – Predictive Modeling in the enterprise (1 hr)
2.1 Classification
2.2 Regression
2.3 Deploying models to the data warehouse with ORE
After lunch, I’ll take the insights and analysis patterns identified in the Discovery Lab and turn them into production big data pipelines and datasets using Oracle Data Integrator 12c, Oracle Big Data Discovery and Oracle Big Data SQL. For a flavour of the topics I’ll be covering, take a look at this Slideshare presentation from a recent Oracle event; in the masterclass itself I’ll concentrate on techniques and approaches for ingesting and transforming streaming and semi-structured data, storing it in Hadoop-based data stores, and presenting it out to users using BI tools like OBIEE and Oracle’s new Big Data Discovery.
Session #3 – Building the Data Reservoir and Data Factory (2 hr)
3.1 Designing and Building the Data Reservoir using Cloudera CDH5 / Hortonworks HDP, Oracle BDA and Oracle Database 12c
3.2 Building the Data Factory using ODI12c & new component Hadoop KM modules, real-time loading using Apache Kafka, Spark and Spark Streaming
Session #4 – Accessing and visualising the data (1 hr)
4.1 Discovering and Analyzing the Data Reservoir using Oracle Big Data Discovery
4.2 Reporting and Dashboards across the Data Reservoir using Oracle Big Data SQL + OBIEE 11.1.1.9
You can register for a place at the two masterclasses when booking your BI Forum 2015 place, but you’ll need to hurry as we limit the number of attendees at each event in order to maximise interaction and networking within each group. Registration is open now and the two events take place in May – hopefully we’ll see you there!