Rittman Mead/Oracle Data Integration Speakeasy @ Oracle Open World

If you are attending Oracle Open World this year and fancy a bit of a different experience, come and join Rittman Mead and Oracle’s Data Integration teams for drinks and networking at 7pm on Tuesday 30th September at the Local Edition speakeasy on Market Street.

We will be providing a couple of hours of free drinks with the opportunity to quiz our leading data integration experts and Oracle’s data integration team about any aspect of the data integration toolset, architecture and our innovative implementation approaches, and to relax and kick back at the end of a long day. So whether you want to know about how ODI can facilitate your big data strategy, or implement data quality and data governance across your enterprise data architecture, please come along.

The Local Edition is located at 691 Market St, San Francisco, CA and the event runs from 7pm to 9pm. Please register here.

For further information on this event and the sessions we are presenting at Oracle Open World contact us at info@rittmanmead.com.

Using Oracle GoldenGate for Trickle-Feeding RDBMS Transactions into Hive and HDFS

A few months ago I wrote a post on the blog around using Apache Flume to trickle-feed log data into HDFS and Hive, using the Rittman Mead website as the source for the log entries. Flume is a good technology to use for this type of capture requirement as it captures log entries, HTTP calls, JMS queue entries and other “event” sources easily, has a resilient architecture and integrates well with HDFS and Hive. But what if the source you want to capture activity for is a relational database, for example Oracle Database 12c? With Flume you’d need to spool the database transactions to file, whereas what you really want is a way to directly connect to the database engine and capture the changes from source.

Which is exactly what Oracle GoldenGate does, and what most people don’t realise is that GoldenGate can also load data into HDFS and Hive, as well as the usual database targets. Although Hive and HDFS aren’t fully-supported targets yet, you can use the Oracle GoldenGate for Java adapter to act as the handler process and then land the data in HDFS files or Hive tables on your target Hadoop platform. My Oracle Support has two tech notes, “Integrating OGG Adapter with Hive (Doc ID 1586188.1)” and “Integrating OGG Adapter with HDFS (Doc ID 1586210.1)”, that give example implementations of the Java adapters you’d need for these two target types, with the overall end-to-end process for landing Hive data looking like the diagram below (the HDFS one just swaps in HDFS for Hive at the handler adapter stage).

NewImage

This is also a good example of the sorts of technology we’d use to implement the “data factory” concept within the new Oracle Information Management Reference Architecture, the part of the architecture that moves data between the Hadoop and NoSQL-based Data Reservoir, and the relationally-stored enterprise information store; in this case, trickle-feeding transactional data from the Oracle database into Hadoop, perhaps to archive it at lower cost than we could do in an Oracle database, or to add transaction activity data to a Hadoop-based application.

NewImage

So I asked my colleague Nelio Guimaraes to set up a GoldenGate capture process on our Cloudera CDH5.1 Hadoop cluster, using GoldenGate 12.1.2.0.0 for our source Oracle 11gR2 database and Oracle GoldenGate for Java, downloadable separately on edelivery.oracle.com under Oracle Fusion Middleware > Oracle GoldenGate Application Adapters 11.2.1.0.0 for JMS and Flat File Media Pack. In our example, we’re going to capture activity on the SCOTT.DEPT table in the Oracle database, and then perform the following steps to set up replication from it into a replica Hive table:

  1. Create a table in Hive that corresponds to the table in Oracle database.
  2. Create a table in the Oracle database and prepare the table for replication.
  3. Configure the Oracle GoldenGate Capture to extract transactions from the Oracle database and create the trail file.
  4. Configure the Oracle GoldenGate Pump to read the trail and invoke the custom adapter
  5. Configure the property file for the Hive handler
  6. Code, Compile and package the custom Hive handler
  7. Execute a test. 

Setting up the Oracle Database Source Capture

Let’s go into the Oracle database first, check the table definition, and then connect to Hadoop to create a Hive table of the same column definition.

[oracle@centraldb11gr2 ~]$ sqlplus scott/tiger
SQL*Plus: Release 11.2.0.3.0 Production on Thu Sep 11 01:08:49 2014
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options
SQL> describe DEPT
 Name Null? Type
 ----------------------------------------- -------- ----------------------------
 DEPTNO NOT NULL NUMBER(2)
 DNAME VARCHAR2(14)
 LOC VARCHAR2(13)
SQL> exit
...
[oracle@centraldb11gr2 ~]$ ssh oracle@cdh51-node1
Last login: Sun Sep 7 16:11:36 2014 from officeimac.rittmandev.com
[oracle@cdh51-node1 ~]$ hive
...
create external table dept
(
 DEPTNO string, 
 DNAME string, 
 LOC string
) row format delimited fields terminated by '\;' stored as textfile
location '/user/hive/warehouse/department'; 
exit
...

Then I install Oracle Golden Gate 12.1.2 on the source Oracle database, just as you’d do for any Golden Gate install, and make sure supplemental logging is enabled for the table I’m looking to capture. Then I go into the ggsci Golden Gate command-line utility, to first register the user it’ll be connecting as, and what table it needs to capture activity for.

[oracle@centraldb11gr2 12.1.2]$ cd /u01/app/oracle/product/ggs/12.1.2/
[oracle@centraldb11gr2 12.1.2]$ ./ggsci
$ggsci> DBLOGIN USERID sys@ctrl11g, PASSWORD password sysdba
$ggsci> ADD TRANDATA SCOTT.DEPT COLS(DEPTNO), NOKEY

GoldenGate uses a number of components to replicate data from source to targets, as shown in the diagram below.

NewImage

For our purposes, though, there are just three that we need to configure: the Extract component, which captures table activity on the source; the Pump process that moves data (or the “trail”) from source database to the Hadoop cluster; and the Replicat component that takes that activity and applies it to the target tables. In our example, the extract and pump processes will be as normal, but we need to create a custom “handler” for the target Hive table that uses the Golden Gate Java API and the Hadoop FS Java API.

The tool we use to set up the extract and capture process is ggsci, the command-line Golden Gate Software Command Interface. I’ll use it first to set up the Manager process that runs on both source and target servers, giving it a port number and connection details into the source Oracle database.

$ggsci> edit params mgr
PORT 7809
USERID sys@ctrl11g, PASSWORD password sysdba
PURGEOLDEXTRACTS /u01/app/oracle/product/ggs/12.1.2/dirdat/*, USECHECKPOINTS

Then I create two configuration files, one for the extract process and one for the pump process, and then use those to start those two processes.

$ggsci> edit params ehive
EXTRACT ehive
USERID sys@ctrl11g, PASSWORD password sysdba
EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, FORMAT RELEASE 11.2
TABLE SCOTT.DEPT;
$ggsci> edit params phive
EXTRACT phive
RMTHOST cdh51-node1.rittmandev.com, MGRPORT 7809
RMTTRAIL /u01/app/oracle/product/ggs/11.2.1/dirdat/rt, FORMAT RELEASE 11.2
PASSTHRU
TABLE SCOTT.DEPT;
$ggsci> ADD EXTRACT ehive, TRANLOG, BEGIN NOW
$ggsci> ADD EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, EXTRACT ehive
$ggsci> ADD EXTRACT phive, EXTTRAILSOURCE /u01/app/oracle/product/ggs/12.1.2/dirdat/et
$ggsci> ADD RMTTRAIL /u01/app/oracle/product/ggs/11.2.1/dirdat/rt, EXTRACT phive
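If you would rather script these steps than type them interactively, GGSCI also reads commands from standard input. The following is a minimal sketch rather than part of the setup described here: the run_ggsci function and the GGSCI_BIN variable are names invented for illustration, and it assumes you run it from the GoldenGate install directory.

```shell
# Hypothetical helper: send each argument to GGSCI as one command line.
# GGSCI_BIN is an assumed override; it defaults to ./ggsci in the
# current directory (the GoldenGate install directory).
run_ggsci() {
    printf '%s\n' "$@" | "${GGSCI_BIN:-./ggsci}"
}

# Example (commands taken from the listing above):
#   run_ggsci \
#     "ADD EXTRACT ehive, TRANLOG, BEGIN NOW" \
#     "ADD EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, EXTRACT ehive"
```

Keeping the commands as plain strings means the same list can be stored in version control and replayed on another environment.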

As the Java event handler on the target Hadoop platform won’t ordinarily be able to get table metadata for the source Oracle database, we’ll use the defgen utility on the source platform to create the source definitions file (DEPT.sql) that the process on the target will need.

$ggsci> edit params dept
defsfile ./dirsql/DEPT.sql
USERID ggsrc@ctrl11g, PASSWORD ggsrc
TABLE SCOTT.DEPT;
./defgen paramfile ./dirprm/dept.prm NOEXTATTR

Note that NOEXTATTR means no extra attributes; because the version on target is a generic and minimal version, the definition file with extra attributes won’t be interpreted. Then, this DEPT.sql file will need to be copied across to the target Hadoop platform where you’ve installed Oracle GoldenGate for Java, to the /dirsql folder within the GoldenGate install. 

[oracle@centraldb11gr2 12.1.2]$ ssh oracle@cdh51-node1
oracle@cdh51-node1's password: 
Last login: Wed Sep 10 17:05:49 2014 from centraldb11gr2.rittmandev.com
[oracle@cdh51-node1 ~]$ cd /u01/app/oracle/product/ggs/11.2.1/
[oracle@cdh51-node1 11.2.1]$ pwd
/u01/app/oracle/product/ggs/11.2.1
[oracle@cdh51-node1 11.2.1]$ ls dirsql/
DEPT.sql

Then, going back to the source Oracle database platform, we’ll start the Golden Gate Manager process, and then the extract and pump processes.

[oracle@cdh51-node1 11.2.1]$ ssh oracle@centraldb11gr2
oracle@centraldb11gr2's password: 
Last login: Thu Sep 11 01:08:18 2014 from bdanode1.rittmandev.com
GGSCI (centraldb11gr2.rittmandev.com) 7> start mgr
Manager started.
 
GGSCI (centraldb11gr2.rittmandev.com) 8> start ehive
 
Sending START request to MANAGER ...
EXTRACT EHIVE starting
 
GGSCI (centraldb11gr2.rittmandev.com) 9> start phive
 
Sending START request to MANAGER ...
EXTRACT PHIVE starting

Setting up the Hadoop / Hive Replicat Process

Setting up the Hadoop side involves a couple of similar steps to the source capture side; first we configure the parameters for the Manager process, then configure the extract process that will pull table activity off of the trail file, sent over by the pump process on the source Oracle database.

[oracle@centraldb11gr2 12.1.2]$ ssh oracle@cdh51-node1
oracle@cdh51-node1's password: 
Last login: Wed Sep 10 21:09:38 2014 from centraldb11gr2.rittmandev.com
[oracle@cdh51-node1 ~]$ cd /u01/app/oracle/product/ggs/11.2.1/
[oracle@cdh51-node1 11.2.1]$ ./ggsci
$ggsci> edit params mgr
PORT 7809
PURGEOLDEXTRACTS /u01/app/oracle/product/ggs/11.2.1/dirdat/*, usecheckpoints, minkeepdays 3
$ggsci> add extract tphive, exttrailsource /u01/app/oracle/product/ggs/11.2.1/dirdat/rt
$ggsci> edit params tphive
EXTRACT tphive
SOURCEDEFS ./dirsql/DEPT.sql
CUserExit ./libggjava_ue.so CUSEREXIT PassThru IncludeUpdateBefores
GETUPDATEBEFORES
TABLE SCOTT.DEPT;

Now it’s time to create the Java handler that will write the trail data to the HDFS files and Hive table. The My Oracle Support Doc ID 1586188.1 I mentioned at the start of the article has a sample Java program called SampleHandlerHive.java that writes incoming transactions into an HDFS file within the Hive directory, and also writes it to a file on the local filesystem. To get this working on our Hadoop system, we created a new Java source code file from the content in SampleHandlerHive.java, updated the paths passed to hadoopConf.addResource to point to the correct locations of core-site.xml, hdfs-site.xml and mapred-site.xml, and then compiled it as follows:

export CLASSPATH=/u01/app/oracle/product/ggs/11.2.1/ggjava/ggjava.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/*
javac -d . SampleHandlerHive.java

Successfully executing the above command created the SampleHandlerHive.class file under /u01/app/oracle/product/ggs/11.2.1/dirprm/com/mycompany/bigdata. To create the JAR file that the GoldenGate for Java adapter will need, I then need to change directory to the /dirprm directory under the Golden Gate install, and then run the following commands:

jar cvf myhivehandler.jar com
chmod 755 myhivehandler.jar

I also need to create a properties file for this JAR to use, in the same /dirprm directory. This properties file amongst other things tells the Golden Gate for Java adapter where in HDFS to write the data to (the location where the Hive table keeps its data files), and also references any other JAR files from the Hadoop distribution that it’ll need to get access to.

[oracle@cdh51-node1 dirprm]$ cat tphive.properties 
#Adapter Logging parameters. 
gg.log=log4j
gg.log.level=info
 
#Adapter Check pointing  parameters
goldengate.userexit.chkptprefix=HIVECHKP_
goldengate.userexit.nochkpt=true
 
# Java User Exit Property
goldengate.userexit.writers=jvm
jvm.bootoptions=-Xms64m -Xmx512M -Djava.class.path=/u01/app/oracle/product/ggs/11.2.1/ggjava/ggjava.jar:/u01/app/oracle/product/ggs/11.2.1/dirprm:/u01/app/oracle/product/ggs/11.2.1/dirprm/myhivehandler.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/hadoop-common-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-configuration-1.6.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-logging-1.1.3.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-lang-2.6.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/etc/hadoop:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/etc/hadoop/conf.dist:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/guava-11.0.2.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/hadoop-auth-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/hadoop-hdfs-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/commons-cli-1.2.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/protobuf-java-2.5.0.jar
 
#Properties for reporting statistics
# Minimum number of {records, seconds} before generating a report
jvm.stats.time=3600
jvm.stats.numrecs=5000
jvm.stats.display=TRUE
jvm.stats.full=TRUE
 
#Hive Handler.  
gg.handlerlist=hivehandler
gg.handler.hivehandler.type=com.mycompany.bigdata.SampleHandlerHive
gg.handler.hivehandler.HDFSFileName=/user/hive/warehouse/department/dep_data
gg.handler.hivehandler.RegularFileName=cinfo_hive.txt
gg.handler.hivehandler.RecordDelimiter=;
gg.handler.hivehandler.mode=tx
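With a classpath this long, a single mistyped JAR path is an easy way to end up with a handler that fails at startup. As a hedged sanity check (not something from the Oracle notes), the sketch below lists any java.class.path entries in the properties file that do not exist on disk. The function name missing_classpath_entries is made up for this example, and it assumes the classpath sits on the jvm.bootoptions line and contains no spaces.

```shell
# Print every -Djava.class.path entry from jvm.bootoptions that is
# missing on disk. Entries containing a wildcard (*) are skipped.
missing_classpath_entries() {
    # Extract the classpath value from the jvm.bootoptions line.
    cp=$(sed -n 's/^jvm\.bootoptions=.*-Djava\.class\.path=\([^ ]*\).*$/\1/p' "$1")
    while [ -n "$cp" ]; do
        entry=${cp%%:*}              # text before the first colon
        case $cp in
            *:*) cp=${cp#*:} ;;      # drop that entry, keep the rest
            *)   cp= ;;              # that was the last entry
        esac
        case $entry in
            ''|*\**) continue ;;     # skip empty and wildcard entries
        esac
        [ -e "$entry" ] || printf '%s\n' "$entry"
    done
}
```

Running `missing_classpath_entries tphive.properties` from the /dirprm directory prints nothing when every entry resolves.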

Now, the final step on the Hadoop side is to start its Golden Gate Manager process, and then start the Replicat and apply process.

GGSCI (cdh51-node1.rittmandev.com) 5> start mgr
 
Manager started. 
 
GGSCI (cdh51-node1.rittmandev.com) 6> start tphive
 
Sending START request to MANAGER ...
EXTRACT TPHIVE starting

Testing it All Out

So now I’ve got the extract and pump processes running on the Oracle Database side, and the apply process running on the Hadoop side, let’s do a quick test and see if it’s working. I’ll start by looking at what data is in each table at the beginning.

SQL> select * from dept;     

    DEPTNO DNAME  LOC
 ---------- -------------- -------------

10 ACCOUNTING  NEW YORK
20 RESEARCH  DALLAS
30 SALES  CHICAGO
40 OPERATIONS  BOSTON
50 TESTE  PORTO
60 NELIO  STS
70 RAQUEL  AVES
 
7 rows selected.

Over on the Hadoop side, there’s just one row in the Hive table:

hive> select * from customer;

OK
80MARCIA   ST

Now I’ll go back to Oracle and insert a new row in the DEPT table:

SQL> insert into dept (deptno, dname, loc)
  2  values (75, 'EXEC','BRIGHTON'); 

1 row created. 
SQL> commit; 

Commit complete.

And, going back over to Hadoop, I can see Golden Gate has added that record to the Hive table, by the Golden Gate for Java adapter writing the transaction to the underlying HDFS file.

hive> select * from customer;

OK
80MARCIA   ST
75 EXEC       BRIGHTON

So there you have it; Golden Gate replicating Oracle RDBMS transactions into HDFS and Hive, to complement Apache Flume’s ability to replicate log and event data into Hadoop. Moreover, as Michael Rainey explained in this three part blog series, Golden Gate is closely integrated into the new 12c release of Oracle Data Integrator, making it even easier to integrate Golden Gate replication processes into your overall data loading project, and giving Hadoop developers and Golden Gate users access to the full set of load orchestration and data quality features in that product rather than having to rely on home-grown scripting, or Oozie.

OBIEE SampleApp in The Cloud: Importing VirtualBox Machines to AWS EC2

Virtualisation has revolutionised how we work as developers. A decade ago, using new software would mean trying to find room on a real tin server to install it, hoping it worked, and if it didn’t, picking apart the pieces probably leaving the server in a worse state than it was to begin with. Nowadays, we can just launch a virtual machine to give a clean environment and if it doesn’t work – trash it and start again.

The sting in the tail of virtualisation is that full-blown VMs are heavy: for disk we need several GB just for a blank OS, and dozens of GB if you’re talking about a software stack such as Fusion Middleware (FMW), and the host machine needs to have the RAM and CPU to support it all too. Technologies such as Linux Containers go some way to making things lighter by abstracting out a chunk of the OS, but this isn’t something that’s reached the common desktop yet.

So whilst VMs are awesome, it’s not always practical to maintain a library of all of them on your local laptop (even 1TB drives fill up pretty quickly), nor will your laptop have the grunt to run more than one or two VMs at most. VMs like this are also local to your laptop or server – but wouldn’t it be neat if you could duplicate that VM and make a server based on it instantly available to anyone in the world with an internet connection? And that’s where The Cloud comes in, because it enables us to store as much data as we can eat (and pay for), and provision “hardware” at the click of a button for just as long as we need it, accessible from anywhere.

Here at Rittman Mead we make extensive use of Amazon Web Services (AWS) and their Elastic Compute Cloud (EC2) offering. Our website runs on it, our training servers run on it, and it scales just as we need it to. A class of 3 students is as easy to provision for as a class of 24 – no hunting around for spare servers or laptops, no hardware sat idle in a cupboard as spare capacity “just in case”.

One of the challenges that we’ve faced up until now is that all servers have had to be built from scratch in the cloud. Obviously we work with development VMs on local machines too, so wouldn’t it be nice if we could build VMs locally and then push them to the cloud? Well, now we can. Amazon offer a route to import virtual machines, and in this article I’m going to show how that works. I’ll use the superb SampleApp v406 VM that Oracle provide, because this is a great real-life example of a VM that is very useful, but that many developers find too memory-intensive to run on their local machines all the time.

This tutorial is based on exporting a Linux guest VM from a Linux host server. A Windows guest probably behaves differently, but a Mac or Windows host should work fine since VirtualBox is supported on both. The specifics are based on SampleApp, but the process should be broadly the same for all VMs. 

Obtain the VM

We’re going to use SampleApp, which can be downloaded from Oracle.

  1. Download the six-part archive from http://www.oracle.com/technetwork/middleware/bi-foundation/obiee-samples-167534.html
  2. Verify the md5 checksums against those published on the download page:
    [oracle@asgard sampleapp406]$ ll
    total 30490752
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 01:33 SampleAppv406.zip.001
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 01:30 SampleAppv406.zip.002
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:03 SampleAppv406.zip.003
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:34 SampleAppv406.zip.004
    -rw-r--r-- 1 oracle oinstall 5242880000 Sep  9 02:19 SampleAppv406.zip.005
    -rw-r--r-- 1 oracle oinstall 4977591522 Sep  9 02:53 SampleAppv406.zip.006
    [oracle@asgard sampleapp406]$ md5sum *
    2b9e11f69ada5f889088dd74b5229322  SampleAppv406.zip.001
    f8a1a5ae6162b20b3e9c6c888698c071  SampleAppv406.zip.002
    68438cfea87e8d3a2e2f15ff00dadf12  SampleAppv406.zip.003
    b71d9ace4f75951198fc8197da1cfe62  SampleAppv406.zip.004
    4f1a5389c9e0addc19dce6bbc759ec20  SampleAppv406.zip.005
    2c430f87e22ff9718d5528247eff2da4  SampleAppv406.zip.006
  3. Unpack the archive using 7zip. The instructions for SampleApp are very clear that you must use 7zip, and not another archive tool such as WinZip.
    [oracle@asgard sampleapp406]$ time 7za x SampleAppv406.zip.001
    
    7-Zip (A) [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
    p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,80 CPUs)
    
    Processing archive: SampleAppv406.zip.001
    
    Extracting SampleAppv406Appliance
    Extracting SampleAppv406Appliance/SampleAppv406ga-disk1.vmdk
    Extracting SampleAppv406Appliance/SampleAppv406ga.ovf
    
    Everything is Ok
    
    Folders: 1
    Files: 2
    Size: 31191990916
    Compressed: 5242880000
    
    real 1m53.685s
    user 0m16.562s
    sys 1m15.578s
  4. Because we need to change a couple of things on the VM first (see below), we’ll have to import the VM to VirtualBox so that we can boot it up and make these changes. You can import using the VirtualBox GUI, or as I prefer, the VBoxManage command line interface. I like to time all these things (just because, numbers), so stick a time command on the front:
    time VBoxManage import --vsys 0 --eula accept SampleAppv406Appliance/SampleAppv406ga.ovf

    This took 12 minutes or so, but that was on a high-spec system, so YMMV.
    [...]
    0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
    Successfully imported the appliance.
    
    real    12m15.434s
    user    0m1.674s
    sys     0m2.807s
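Step 2 above compares the checksums by eye, which is error-prone across six files. A small helper can do the comparison instead. This is only a sketch: verify_md5 is a made-up name, it assumes the GNU md5sum utility is available, and the expected value still has to be copied from Oracle’s download page.

```shell
# Return success only if the file's md5 checksum equals the expected value.
verify_md5() {
    # md5sum prints "<checksum>  <filename>"; keep only the checksum.
    actual=$(md5sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Example, using the first value from the listing in step 2:
#   verify_md5 SampleAppv406.zip.001 2b9e11f69ada5f889088dd74b5229322 \
#     && echo "part 001 OK"
```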

Preparing the VM

Importing Linux VMs to Amazon EC2 will only work if the kernel is supported, which according to an AWS blog post includes Red Hat Enterprise Linux 5.1 – 6.5. Whilst SampleApp v406 is built on Oracle Linux 6.5 (which isn’t listed by AWS as supported), we have the option of telling the VM to use a kernel that is Red Hat Enterprise Linux compatible (instead of the default Unbreakable Enterprise Kernel – UEK). There are some other pre-requisites that you need to check if you’re trying this with your own VM, including a network adaptor configured to use DHCP. The aforementioned blog post has details.

  1. Boot the VirtualBox VM, which should land you straight in the desktop environment, logged in as the oracle user.
  2. We need to modify a file as root (superuser). Here’s how to do it graphically, or use vi if you’re a real programmer:
    1. Open a Terminal window from the toolbar at the top of the screen
    2. Enter
      sudo gedit /etc/grub.conf

      The sudo bit is important, because it tells Linux to run the command as root. (I’m on an xkcd-roll here: 1, 2)

    3. In the text editor that opens, you will see a header to the file and then a set of repeating sections beginning with title. These are the available kernels that the machine can run under. The default is 3, which is zero-based, so it’s the fourth title section. Note that the kernel version details include uek which stands for Unbreakable Enterprise Kernel – and is not going to work on EC2.
    4. Change the default to 0, so that we’ll instead boot to a Red Hat Compatible Kernel, which will work on EC2
    5. Save the file
  3. Optional steps:
    1. Whilst you’ve got the server running, add your SSH key to the image so that you can connect to it easily once it is up on EC2. For more information about SSH keys, see my previous blog post here, and a step-by-step for doing it on SampleApp here.
    2. Disable non-SSH key logins (in /etc/ssh/sshd_config, set PasswordAuthentication no and PubkeyAuthentication yes), so that your server once on EC2 is less vulnerable to attack. Particularly important if you’re using the stock image with Admin123 as the root password.
    3. Set up screen, and OBIEE and the database as a Linux service, both covered in my article here.
  4. Shut down the instance by entering this at a Terminal window:

    sudo shutdown -h now
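If you are preparing more than one VM, the kernel change in step 2 can also be made without opening an editor. The function below is a hedged sketch: set_grub_default is a name invented here, it assumes a GRUB legacy file containing a single default=N line, and it needs to be run as root against /etc/grub.conf. It keeps a .bak copy so you can compare before rebooting.

```shell
# Rewrite the default=N line of a GRUB legacy config, keeping a backup.
# $1 = path to the config file, $2 = zero-based index of the title to boot.
set_grub_default() {
    conf=$1
    index=$2
    # Copy first, then rewrite the original from the backup.
    cp "$conf" "$conf.bak" &&
        sed "s/^default=[0-9]*/default=$index/" "$conf.bak" > "$conf"
}

# Example (as root): set_grub_default /etc/grub.conf 0
```

Check the title entries in the file before choosing an index, since the position of the Red Hat Compatible Kernel may differ on other images.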

Export the VirtualBox VM to Amazon EC2

Now we’re ready to really get going. The first step is to export the VirtualBox VM to a format that Amazon EC2 can work with. Whilst they don’t explicitly support VMs from VirtualBox, they do support the VMDK format – which VirtualBox can create. You can do the export from the graphical interface, or as before, from the command line:

time VBoxManage export "OBIEE SampleApp v406" --output OBIEE-SampleApp-v406.ovf

0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Successfully exported 1 machine(s).

real    56m51.426s
user    0m6.971s
sys     0m12.162s

If you compare the result of this to what we downloaded from Oracle it looks pretty similar – an OVF file and a VMDK file. The only difference is that the VMDK file is updated with the changes we made above, including the modified kernel settings which are crucial for the success of the next step.

[oracle@asgard sampleapp406]$ ls -lh
total 59G
-rw------- 1 oracle oinstall  30G Sep  9 10:55 OBIEE-SampleApp-v406-disk1.vmdk
-rw------- 1 oracle oinstall  15K Sep  9 09:58 OBIEE-SampleApp-v406.ovf

We’re ready now to get all cloudy. For this, you’ll need:

  1. An AWS account
    1. You’ll also need your AWS account’s Access Key and Secret Key
  2. AWS EC2 commandline tools installed, along with a Java Runtime Environment (JRE) 1.7 or greater:

    wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip
    sudo mkdir /usr/local/ec2
    sudo unzip ec2-api-tools.zip -d /usr/local/ec2
    # You might need to fiddle with the following paths and version numbers: 
    sudo yum install -y java-1.7.0-openjdk.x86_64
    # Quote the heredoc delimiter so $PATH and $EC2_HOME are written literally
    cat >> ~/.bash_profile <<'EOF'
    export JAVA_HOME="/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre"
    export EC2_HOME=/usr/local/ec2/ec2-api-tools-1.7.1.1/
    export PATH=$PATH:$EC2_HOME/bin
    EOF

  3. Set your credentials as environment variables:
    export AWS_ACCESS_KEY=xxxxxxxxxxxxxx
    export AWS_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxx
  4. Ideally a nice fat pipe to upload the VM file over, because at 30GB it is not trivial (not in 2014, anyway)

What’s going to happen now is we use an EC2 command line tool to upload our VMDK (virtual disk) file to Amazon S3 (a storage platform), from where it gets converted into an EBS volume (Elastic Block Store, i.e. an EC2 virtual disk), and from there attached to a new EC2 instance (a “server”/“VM”).

Before we can do the upload we need an S3 “bucket” to put the disk image in that we’re uploading. You can create one from https://console.aws.amazon.com/s3/. In this example, I’ve got one called rmc-vms – but you’ll need your own.

Once the bucket has been created, we build the command line upload statement using ec2-import-instance:

time ec2-import-instance OBIEE-SampleApp-v406-disk1.vmdk --instance-type m3.large --format VMDK --architecture x86_64 --platform Linux --bucket rmc-vms --region eu-west-1 --owner-akid $AWS_ACCESS_KEY --owner-sak $AWS_SECRET_KEY

Points to note:

  • m3.large is the spec for the VM. You can see the available list here. In the AWS blog post it suggests only a subset will work with the import method, but I’ve not hit this limitation yet.
  • region is the AWS Region in which the EBS volume and EC2 instance will be built. I’m using eu-west-1 (Ireland), and it makes sense to use the one geographically closest to where you or your users are located. Still waiting for uk-yorks-1.
  • architecture and platform relate to the type of VM you’re importing.

The upload process took just over 45 minutes for me, and that’s from a data centre with a decent upload:

[oracle@asgard sampleapp406]$ time ec2-import-instance OBIEE-SampleApp-v406-disk1.vmdk --instance-type m3.large --format VMDK --architecture x86_64 --platform Linux --bucket rmc-vms --region eu-west-1 --owner-akid $AWS_ACCESS_KEY --owner-sak $AWS_SECRET_KEY
Requesting volume size: 200 GB
TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       0       Status       active  StatusMessage   Pending : Downloaded 0
Creating new manifest at rmc-vms/d77672aa-0e0b-4555-b368-79d386842112/OBIEE-SampleApp-v406-disk1.vmdkmanifest.xml
Uploading the manifest file
Uploading 31191914496 bytes across 2975 parts
0% |--------------------------------------------------| 100%
   |==================================================|
Done
Average speed was 11.088 MBps
The disk image for import-i-fh08xcya has been uploaded to Amazon S3
where it is being converted into an EC2 instance.  You may monitor the
progress of this task by running ec2-describe-conversion-tasks.  When
the task is completed, you may use ec2-delete-disk-image to remove the
image from S3.

real    46m59.871s
user    10m31.996s
sys     3m2.560s
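To judge in advance whether your own connection is up to the job, you can estimate the upload time from the image size and a sustained upload rate. The helper below is a rough sketch: estimate_upload_minutes is a hypothetical name, and it assumes the rate is in MB per second with 1 MB taken as 1,000,000 bytes, which is an assumption about how the tool reports its average speed.

```shell
# Estimate upload time in minutes.
# $1 = size in bytes, $2 = sustained upload rate in MB/s (1 MB = 10^6 bytes).
estimate_upload_minutes() {
    awk -v bytes="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", bytes / (rate * 1000000) / 60 }'
}

# Example with the figures from the run above:
#   estimate_upload_minutes 31191914496 11.088
```

With the figures reported above this gives roughly 47 minutes, in line with the measured elapsed time.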

Once the upload has finished Amazon automatically converts the VMDK (now residing on S3) into an EBS volume, and then attaches it to a new EC2 instance (i.e. a VM). You can monitor the status of this task using ec2-describe-conversion-tasks, optionally filtered on the TaskId returned by the import command above:

ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       3898992128
Status  active  StatusMessage   Pending : Downloaded 31149971456

This is now an ideal time to mention as a side note the Linux utility watch, which simply re-issues a command for you every x seconds (2 by default). This way you can leave a window open and keep an eye on the progress of what is going to be a long-running job.

watch ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya

Every 2.0s: ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya                                                             Tue Sep  9 12:03:24 2014

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  active  StatusMessage   Pending InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       5848511808
Status  active  StatusMessage   Pending : Downloaded 31149971456

And whilst we’re at it, if you’re using a remote server to do this (as I am, to take advantage of the large bandwidth), you will find screen invaluable for keeping tasks running and being able to reconnect at will. You can read more about screen and watch here.

So back to our EC2 import job. To start with, the task will be Pending: (NB unlike lots of CLI tools, you read the output of this one left-to-right, rather than as columns with headings)

$ ec2-describe-conversion-tasks --region eu-west-1
TaskType        IMPORTINSTANCE  TaskId  import-i-ffvx6z86       ExpirationTime  2014-09-12T15:32:01Z    Status  active  StatusMessage   Pending InstanceID      i-b2245ef2
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5021144064      VolumeSize      60      AvailabilityZone        eu-west-1a      ApproximateBytesConverted       4707330352      Status  active  StatusMessage   Pending : Downloaded 5010658304
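Because the output is a run of key/value pairs rather than columns, pulling out a single value takes a little care. The following sketch shows one way to do it with awk. The task_field name is invented for this example, and it assumes pairs are separated by tabs or by runs of two or more spaces, as in the listings here.

```shell
# Read conversion-task output on stdin and print the value that follows
# the first occurrence of the given key (for example Status or TaskId).
task_field() {
    awk -F'\t+|  +' -v key="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == key) { print $(i + 1); exit }
    }'
}

# Example:
#   ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya \
#     | task_field StatusMessage
```

Matching on the exact key keeps Status and StatusMessage apart, which a plain grep would not.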

After a few moments it gets underway, and you can see a Progress percentage indicator: (scroll right in the code snippet below to see)

TaskType        IMPORTINSTANCE  TaskId  import-i-fgr0djcc       ExpirationTime  2014-09-15T15:39:28Z    Status  active  StatusMessage   Progress: 53%   InstanceID      i-c7692e87
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5582545920      VolumeId        vol-f71368f0    VolumeSize      20      AvailabilityZone        eu-west-1a      ApproximateBytesConverted       5582536640      Status  completed

Note that at this point you’ll also see an Instance in the EC2 list, but it won’t launch (no attached disk – because it’s still being imported!).

If something goes wrong you’ll see the Status as cancelled, such as in this example here where the kernel in the VM was not a supported one (observe it is the UEK kernel, which isn’t supported by Amazon):

TaskType        IMPORTINSTANCE  TaskId  import-i-ffvx6z86       ExpirationTime  2014-09-12T15:32:01Z    Status  cancelled       StatusMessage   ClientError: Unsupported kernel version 2.6.32-300.32.1.el5uek       InstanceID      i-b2245ef2
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   5021144064      VolumeId        vol-91b1c896    VolumeSize      60      AvailabilityZone        eu-west-1a      ApproximateBytesConverted    5021128688      Status  completed

After an hour or so, the task should complete:

TaskType        IMPORTINSTANCE  TaskId  import-i-fh08xcya       ExpirationTime  2014-09-16T10:07:44Z    Status  completed       InstanceID      i-b07d3bf0
DISKIMAGE       DiskImageFormat VMDK    DiskImageSize   31191914496     VolumeId        vol-a383f8a4    VolumeSize      200     AvailabilityZone        eu-west-1a      ApproximateBytesConverted       31191855472     Status  completed
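The four states shown above (pending, in progress, cancelled and completed) can be told apart from just the Status and StatusMessage values. Here is a rough sketch of that logic, assuming the fields of the IMPORTINSTANCE line have already been read into a dictionary; the function name and the wording of the returned summaries are hypothetical.

```python
import re


def summarise_task(task):
    """Return a short summary for a dict of IMPORTINSTANCE fields."""
    status = task.get("Status", "")
    message = task.get("StatusMessage", "")
    if status == "completed":
        return "completed"
    if status == "cancelled":
        # The message carries the reason, e.g. an unsupported kernel.
        return "cancelled: " + message
    # An active task reports either "Pending" or "Progress: NN%".
    match = re.search(r"Progress:\s*(\d+)%", message)
    if match:
        return "in progress ({}%)".format(match.group(1))
    if message.startswith("Pending"):
        return "pending"
    return status or "unknown"


print(summarise_task({"Status": "active", "StatusMessage": "Progress: 53%"}))
```

Note that a completed task has no StatusMessage at all, which is why the lookup falls back to an empty string.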

At this point you can (and should) remove the VMDK from S3, otherwise you’ll continue to be charged for it. Follow the instructions for ec2-delete-disk-image to do so.

Booting the new server on EC2

Go to your EC2 control panel, where you should see an instance (EC2 term for “server”) in Stopped state and with no name.

Select the instance, and click Start on the Actions menu. After a few moments a Public IP will be shown in the details pane. But, we’re not home free quite yet…read on.

Firewalls

So this is where it gets a bit tricky. By default, the instance will have launched with Amazon’s Firewall (known as a Security Group) in place which – unless you have an existing AWS account and have modified the default security group’s configuration – is only open on port 22, which is for ssh traffic.

You need to head over to the Security Group configuration page. There are several ways to get there, but the easiest is to click on the security group name in the instance details pane:

Click on the Inbound tab and then Edit, and add “Custom TCP Rule” for the following ports:

  • 7780 (OBIEE front end)
  • 7001 (WLS Console / EM)
  • 5902 (oracle VNC)

You can make things more secure by allowing access to the WLS admin (7001) and VNC port (5902) to a specific IP address or range only.
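To make the effect of those rules concrete, here is a small model of them in Python. It is only an illustration of the idea, not how Security Groups are implemented or configured: the rule table, the function name and the 203.0.113.0/24 range (a reserved documentation range standing in for your own office or home network) are all assumptions.

```python
import ipaddress

# Hypothetical inbound rules: port -> list of allowed source ranges (CIDR).
RULES = {
    22: ["0.0.0.0/0"],         # ssh, open by default
    7780: ["0.0.0.0/0"],       # OBIEE front end, open to everyone
    7001: ["203.0.113.0/24"],  # WLS Console / EM, restricted range
    5902: ["203.0.113.0/24"],  # oracle VNC, restricted range
}


def is_allowed(port, source_ip):
    """Return True if source_ip may connect to port under RULES."""
    address = ipaddress.ip_address(source_ip)
    return any(
        address in ipaddress.ip_network(cidr)
        for cidr in RULES.get(port, [])
    )


print(is_allowed(7780, "198.51.100.7"))  # anyone can reach OBIEE
print(is_allowed(7001, "198.51.100.7"))  # but not the WLS console
```

Any port without a rule is closed to everyone, which is why the server is initially only reachable over ssh.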

Whilst we’re talking about security, your server is now open to the internet and all the nefarious persons out there, so you’ll be wanting to harden your server not least by resetting all the passwords to ones which aren’t publicly documented in the SampleApp user documentation!

Once you’ve updated your Security Group, you can connect to your server! If you installed the OBIEE and database auto start scripts (and if not, why not??) you should find OBIEE running just nicely on http://[your ip]:7780/analytics – note that the port is 7780, not 9704.


If you didn’t install the script, you will need to start the services manually per the SampleApp documentation. To connect to the server you can ssh (using Terminal, PuTTY, etc) to the server or connect on VNC (Admin123 is the password). For VNC clients try Screen Share on Macs (installed by default), or RealVNC on Windows.

Caveats & Disclaimers

  • Running a server on AWS EC2 costs real money, so watch out. Once you’ve put your credit card details in, Amazon will continue to charge your card whilst there are chargeable items on your account (EBS volumes, instances whether running or not, and so on). You can get an idea of the scale of charges here.
  • As mentioned above, a server on the open internet is a lot more vulnerable than one virtualised on your local machine. You will get poked and probed, usually by automated scripts looking for open ports, weak passwords, and so on. SampleApp is designed to open the toybox of a pimped-out OBIEE deployment to you; it is not “hardened”, and you risk learning the tough way about the need for hardening if you’re not careful.

Cloning

Amazon EC2 supports taking a snapshot of a server, either for backup/rollback purposes or spinning up as a clone, using an Amazon Machine Image (AMI). From the Instances page, simply select “Create an Image” to build your AMI. You can then build another instance (or ten) from this AMI as needed, exact replicas of the server as it was at the point that you created the image.

Lather, Rinse, and Repeat

There’s a whole host of VirtualBox “appliances” out there, and some of them such as the developer-tools-focused ones only really make sense as local VMs. But there are plenty that would benefit from a bit of “Cloud-isation”, where they’re too big or heavy to keep on your laptop all the time, but are handy to be able to spin up at will. A prime example of this for me is the EBS Vision demo database that we use for our BI Apps training. Oracle used to provide a pre-built Amazon image (known as an AMI) of this, but has since withdrawn it. However, Oracle do publish Oracle VM VirtualBox templates for EBS 12.1.3 and 12.2.3 (related blog), so from this with a bit of leg-work and a big upload pipe, it’s a simple matter to brew your own AWS version of it, ready to run whenever you need it.

Sunday Times Tech Track 100

Over the weekend, Rittman Mead was listed in the Sunday Times Tech Track 100. We are extremely proud to get recognition for the business as well as our technical capability and expertise.

A lot of the public face of Rittman Mead focuses on the tools and technologies we work with. Since day one we have had a core policy to share as much information as possible. Even before the advent of social media, we shared pretty much everything we knew through either our blog or by speaking at conferences, but we very rarely talk about the business itself. However, a lot of the journey we have gone through over the last 7 years has been about the growth and maintenance of a successful, sustainable, multi-national business. We have been able to talk about, educate and evangelise about the tools and technologies as a result of having the successful business to support this.

I remember during one interview we did several years ago the candidate asked (and I’m paraphrasing): “How do you guys make any money, all I see/read is people sitting in airports writing blog posts about leading edge technologies?”.

One massive benefit from this is we often face the same problems (albeit on a different scale) as those that we talk about with customers, so we have been able to better understand the underlying drivers and proposed solutions for our clients.

From a personal point of view, this has meant spending a lot more time looking at contracts as opposed to code and reading business books/blogs as opposed to technical ones. However, it has been well worth it and I would like to say thanks to all of those both inside and outside of the company who have helped contribute to this success.

Analyzing Twitter Data using Datasift, MongoDB, Hive and ODI12c

Last week I posted an article on the blog around analysing Twitter data using Datasift, MongoDB and Pig, where I used the Datasift service to stream tweets about Rittman Mead into a MongoDB NoSQL database, and then queried the dataset using Pig. The context for this is the idea of a “data reservoir”, where we supplement the more traditional file and relational datasets we find in data warehouses with other data, typically machine generated, unstructured or very low-level, to add context to the numbers in our reporting system. In the example I quoted in the article, it’d be very interesting to take the activity we record against our blog and website and correlate that with the “conversation” that happens about it in the social media world; for example, were the hits for a particular article due to it being mentioned in a tweet, and did a spike in activity correspond to a particularly influential Twitter user retweeting something we’d tweeted?

NewImage

In that previous article I’d used Pig to access and analyse the data, in part because I saw a match between the nested datasets in a typical DataSift Twitter message and the relations, tuples and bags you get in a Pig schema. For example, if you look at the Tweet from Borkur in the screenshot below from RoboMongo, a Mac OS X client for MongoDB that I’ve found useful, you can see the author details nested inside the interaction details, and the Type attribute having many values under the Trends parent attribute – these map well onto Pig tuples and bags respectively.

NewImage

What I’d really like to do with this dataset, though, is to take certain elements of it and use that to supplement the data I’m loading using ODI12c. Whilst ODI can run arbitrary R, Pig and shell scripts using the ODI Procedure feature (as I did here to make use of Sqoop, before Oracle added Sqoop KMs to ODI12.1.3), it gets the best out of Hadoop when it can access data using Hive, the SQL layer over Hadoop that represents HDFS data as rows and columns, and allows us to SELECT and INSERT data using SQL commands – or to be precise, a dialect of SQL called HiveQL. But how will Hive cope with the nested and repeating data structures in a DataSift Twitter message, and allow us to get just the data out that we’re interested in?

In fact, the MongoDB connector for Hadoop that I used for Pig the other day also comes with Hive connectivity, in the form of a SerDe that lets Hive report against data in a MongoDB database (David Allen blogged about another MongoDB Hive storage handler a while ago, in an article about MongoDB and ODI). What’s more, this Hive connector for MongoDB is actually easier to work with than the Pig connector, as instead of worrying about Tuples and Bags you can just pick out the nested attributes that you’re interested in using a dot notation. For example, if I’m only interested in the InteractionID, username, tweet content and number of followers within a particular Twitter dataset, I can create a table that looks like this in Hive:

-- Each key in mongo.columns.mapping is a Hive column; each value is the MongoDB field path in dot notation
CREATE TABLE tweet_data(
  interactionId string,
  username string,
  content string,
  author_followers int)
ROW FORMAT SERDE 
  'com.mongodb.hadoop.hive.BSONSerDe' 
STORED BY 
  'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES ( 
  'mongo.columns.mapping'='{"interactionId":"interactionId",
  "username":"interaction.interaction.author.username",
  "content":"interaction.interaction.content",
  "author_followers":"interaction.twitter.user.followers_count"}'
  )
TBLPROPERTIES (
  'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets'
  );
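If the dot notation in the column mapping is unfamiliar, the idea is simply to walk down the nested document one key at a time. The Python sketch below illustrates that idea only; it is not how the SerDe is actually implemented, and the sample document, its values and the function names are made up for the example.

```python
# Hive column -> MongoDB field path, mirroring the table definition above.
MAPPING = {
    "interactionId": "interactionId",
    "username": "interaction.interaction.author.username",
    "content": "interaction.interaction.content",
    "author_followers": "interaction.twitter.user.followers_count",
}


def resolve(document, dotted_path):
    """Follow a dot-notation path through nested dicts; None if absent."""
    current = document
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_row(document):
    """Flatten one nested document into a row keyed by column name."""
    return {column: resolve(document, path) for column, path in MAPPING.items()}


# Made-up document with the same nesting as the mapping expects.
sample = {
    "interactionId": "abc123",
    "interaction": {
        "interaction": {
            "author": {"username": "example_user"},
            "content": "An example tweet",
        },
        "twitter": {"user": {"followers_count": 42}},
    },
}
print(to_row(sample))
```

Attributes that are missing from a given document simply come back as empty values rather than causing an error, which is handy given how variable tweet documents are.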

And at that point, it’s pretty easy to bring the dataset into ODI12c, through the IKM Hive to Hive Control Append knowledge module, and join up the Twitter dataset with the website log data that’s coming in via Flume. ODI can connect to Hive via JDBC drivers supplied with CDH4/5, and once you register the Hive connection and reverse-engineer the Hive metastore metadata into ODI’s repository, the complexity of the underlying Hive storage is hidden and you’re just presented with tables and columns, just like any other datastore type.

NewImage

Starting with the Twitter data first, I create a Hive table outside of ODI that returns the precise set of tweet attributes that I’m interested in, and then filter that dataset down to just the tweets that link to content on our website, by filtering on the tweet link’s URL matching the start of our website address.

NewImage

Then I load-up the hits from the Rittman Mead website, previously landed into Hadoop using Flume and exposed to ODI as another Hive table, filter out all the non-blog page accesses and keep just the URL part of the Apache Weblog request field, removing the transport mechanism and other bits around it.

NewImage
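The same clean-up of the request field can be expressed outside of ODI in a few lines of Python, which may help if you want to check the logic before building the mapping. This is a sketch under assumptions: the request field is taken to have the usual "METHOD path PROTOCOL" shape, and the rule for what counts as a blog page (a /YYYY/MM/slug/ style path) is a guess for illustration rather than the filter used in the actual mapping.

```python
import re

# Assumed shape of a blog article path: /YYYY/MM/slug/
BLOG_PATH = re.compile(r"^/\d{4}/\d{2}/[^/]+/?")


def request_path(request):
    """Extract the URL part from a request field such as
    'GET /2014/09/some-post/ HTTP/1.1'; None if it is malformed."""
    parts = request.split()
    if len(parts) != 3:
        return None
    _method, path, _protocol = parts
    return path


def is_blog_page(path):
    """Hypothetical filter that keeps only blog article requests."""
    return path is not None and BLOG_PATH.match(path) is not None


requests = [
    "GET /2014/09/some-post/ HTTP/1.1",
    "GET /wp-content/themes/style.css HTTP/1.1",
]
paths = [request_path(r) for r in requests]
print([p for p in paths if is_blog_page(p)])
```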

Then, I use a final ODI mapping to join the two datasets together, using ODI’s ability to apply HiveQL expressions to the incoming datasets so that they’ve got the same format – trailing ‘/’ at the end of the URL, no ampersand and query text at the end of the URL, and so on. Both this and the previous transformation are great examples of where ODI can help with this sort of work, making it pretty easy to munge and correct your data so that you’re then able to match-up the two different sources.

NewImage
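For anyone who wants to see the matching logic spelled out, here is a rough Python equivalent of the normalise-then-join step. It is not the HiveQL that ODI generates; it assumes both sources have already been reduced to URL paths, and the function names and sample URLs are invented for the example.

```python
from collections import Counter


def normalise(url):
    """Drop any query text or ampersand suffix and ensure a trailing '/'."""
    for separator in ("?", "&"):
        url = url.split(separator, 1)[0]
    if not url.endswith("/"):
        url += "/"
    return url


def join_counts(page_hits, tweet_links):
    """Inner join on the normalised URL: url -> (page requests, mentions)."""
    hits = Counter(normalise(u) for u in page_hits)
    mentions = Counter(normalise(u) for u in tweet_links)
    return {url: (hits[url], mentions[url]) for url in hits if url in mentions}


page_hits = ["/2014/09/post-a/", "/2014/09/post-a?utm=x", "/2014/09/post-b/"]
tweet_links = ["/2014/09/post-a", "/2014/09/post-a/&ref=tw"]
print(join_counts(page_hits, tweet_links))
```

Without the normalisation step the four variants of post-a above would be treated as four different URLs and the join would return nothing.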

Then it’s just a case of creating a package or load plan to sequence the mappings, and then run them using the local or standalone agent. You can see the individual KM steps running on the left-hand side, with ODI generating HiveQL queries which in turn are translated into MapReduce and run in parallel across the Hadoop cluster.

NewImage

And then, at the end of the process, I’ve got a Hive table of all of our blog articles that have been mentioned on Twitter (since we started consuming the tweet feed, a day or so ago), with the number of page requests and the number of times that page got mentioned in tweets.

NewImage

Obviously there’s a lot more we can do with this; we can access the number of followers each Twitter user has, along with their location, gender and the sentiment (positive, negative, neutral) of the tweet. From that we can work out some impact from the Twitter activity, and we can also add to it data from other sources such as Facebook, LinkedIn and so on to get a fuller picture of the activity around our site. Then, the data we’re gathering can either be left in MongoDB, or I can use these ODI mappings to either archive it in Hive tables, or export the highlights out to Oracle Database using Sqoop or Oracle Loader for Hadoop.