Tag Archives: Oracle Big Data Appliance

Rittman Mead BI Forum 2014 Registration Now Open – Don’t Miss Out!

Just a quick reminder to say that registration for the Rittman Mead BI Forum 2014 is now open, with the speaker and presentation list now up on the event website. As with previous years, the BI Forum runs in Brighton in the first week, and then moves over to Atlanta in the second, with the dates and venues as follows:

We’ve got a fantastic line-up of sessions and speakers, including:

  • Oracle ACE and past BI Forum best speaker winner Kevin McGinley, on adding third-party visualisations to OBIEE
  • Sessions from TimesTen PMs Chris Jenkins and Susan Cheung on what’s coming with TimesTen
  • Wayne Van Sluys from InterRel, on Essbase optimisation
  • Oracle’s Andrew Bond, and our own Stewart Bryson (Oracle ACE) with an update to Oracle’s reference BI, DW and Big Data Architecture
  • Dan Vlamis on using Oracle Database analytics with the Oracle BI Applications
  • Sessions from Oracle’s Jack Berkowitz, Adam Bloom and Matt Bedin on what’s coming with OBIEE and Oracle BI Applications
  • Peak Indicators’ Alastair Burgess on tuning TimesTen with Aggregate Persistence
  • Endeca sessions from Chris Lynskey (PM) and Omri Traub (Development Manager), along with ones from Branchbird’s Patrick Rafferty and Truls Bergersen
  • And sessions from Rittman Mead’s Robin Moffatt (OBIEE performance), Gianni Ceresa (Essbase) and Michael Rainey (ODI, with Nick Hurt from IFPI)


We’ve also got some excellent keynote sessions including one in the US from Maria Colgan on the new in-memory database option, and another in Brighton from Matt Bedin and Adam Bloom on BI in the Cloud – along with the opening-night Oracle product development keynote in both Brighton and Atlanta.

We’re also very excited to welcome Lars George from Cloudera to deliver this year’s optional one-day masterclass, which this time around is on Hadoop, big data, and how Oracle BI&DW developers can get started with this technology. Lars is Cloudera’s Chief Architect in EMEA and an HBase committer, and he’ll be covering topics such as:

  • What is Hadoop, what’s in the Hadoop ecosystem and how do you design a Hadoop cluster
  • Using tools such as Flume and Sqoop to import data into Hadoop, and then analyse it using Hive, Pig, Impala and Cloudera Search
  • Introduction to NoSQL and HBase
  • Connecting Hadoop to tools such as OBIEE and ODI using JDBC, ODBC, Impala and Hive

If you’ve been meaning to take a look at Hadoop, or if you’ve made a start but would like a chance to discuss techniques with someone who’s out in the field every week designing and building Hadoop systems, this session is aimed at you – it’s on the Wednesday before each event and you can book at the same time as registering for the main BI Forum days.


Attendance is limited to around seventy at each event, and we’re running the Brighton BI Forum back at the Hotel Seattle, whilst the US one is running at the Renaissance Midtown Hotel, Atlanta. We encourage attendees to stay at the hotel as well so as to maximise networking opportunities, and this year you can book US accommodation directly with the hotel so you can collect any Marriott points, corporate discounts etc. As usual, we’ll take good care of you over the two or three days, with meals each night, drinks receptions and lots of opportunities to meet colleagues and friends in the industry.

Full details are on the BI Forum 2014 web page, including links to the registration sites. Book now so you don’t miss out – each year we sell out in advance, so don’t leave it to the last minute if you’re thinking of coming. Hopefully see you all in Brighton and Atlanta in May 2014!

Using Sqoop for Loading Oracle Data into Hadoop on the BigDataLite VM

This is old hat for most Hadoop veterans, but I’ve been meaning to note it on the blog for a while, for anyone whose first encounter with Hadoop is Oracle’s BigDataLite VM.

Most people looking to bring external data into Hadoop do so through flat-file exports that they then import into HDFS, using the “hadoop fs” command-line tool or Hue, the web-based developer tool in BigDataLite, Cloudera CDH, Hortonworks and so on. They then often create Hive tables over them, either from the Hive / Beeswax shell or through Hue, which can create a table for you out of a file you upload from your browser. ODI, through the ODI Application Adapter for Hadoop, also gives you a knowledge module (IKM File to Hive) that’s used in the ODI demos on BigDataLite to load data into Hive tables from an Apache Avro-format log file.
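
As an illustration, that manual route typically looks something like this – the export file, HDFS directory and table definition below are just examples rather than anything that ships with the VM:

[oracle@bigdatalite ~]$ hadoop fs -mkdir /user/oracle/activity_export
[oracle@bigdatalite ~]$ hadoop fs -put activity_export.csv /user/oracle/activity_export
[oracle@bigdatalite ~]$ hive -e "CREATE EXTERNAL TABLE activity_export (activity_id INT, activity_name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/oracle/activity_export';"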


What a lot of people who’re new to Hadoop don’t know is that you can skip the “dump to file” step completely and load data into HDFS directly from the Oracle database. The tool you use for this comes as part of the Cloudera CDH4 Hadoop distribution that’s on BigDataLite, and it’s called “Sqoop”.

“Sqoop”, short for “SQL to Hadoop”, gives you the ability to do the following Oracle data transfer tasks amongst other ones:

  • Import whole tables, or whole schemas, from Oracle and other relational databases into Hadoop’s file system, HDFS
  • Export data from HDFS back out to these databases – with the export and import being performed through MapReduce jobs
  • Import using an arbitrary SQL SELECT statement, rather than grabbing whole tables
  • Perform incremental loads, specifying a key column to determine which new rows to bring across
  • Load directly into Hive tables, creating HDFS files in the background and the Hive metadata automatically

Documentation for Sqoop as shipped with CDH4 can be found on the Cloudera website here, and there are even optimisations and plugins for databases such as Oracle to enable faster, direct loads – for example OraOOP.
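
To give a flavour of the arbitrary-SELECT option, a query-based import needs a $CONDITIONS placeholder in the WHERE clause, a --target-dir, and either a --split-by column or a single mapper; the query and target directory below are purely illustrative:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
  --username MOVIEDEMO -P \
  --query 'SELECT * FROM ACTIVITY WHERE $CONDITIONS' \
  --split-by ACTIVITY_ID \
  --target-dir /user/oracle/ACTIVITY_SUBSET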

Normally, you’d need to download and install JDBC drivers for Sqoop before you can use it, but BigDataLite comes with the required Oracle JDBC drivers, so let’s just have a play around and see some examples of Sqoop in action. I’ll start by importing the ACTIVITY table from the MOVIEDEMO schema that comes with the Oracle 12c database also on BigDataLite (make sure the database is running first, though):

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY

You should then see sqoop process your command in its console output, and then run the MapReduce jobs to bring in the data via the Oracle JDBC driver:

14/03/21 18:21:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.3-cdh4.5.0
14/03/21 18:21:36 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/03/21 18:21:37 INFO manager.SqlManager: Using default fetchSize of 1000
14/03/21 18:21:37 INFO tool.CodeGenTool: Beginning code generation
14/03/21 18:21:38 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM ACTIVITY t WHERE 1=0
14/03/21 18:21:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
14/03/21 18:21:38 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
Note: /tmp/sqoop-oracle/compile/b4949ed7f3e826839679143f5c8e23c1/ACTIVITY.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/03/21 18:21:41 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/b4949ed7f3e826839679143f5c8e23c1/ACTIVITY.jar
14/03/21 18:21:41 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:41 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:42 INFO mapreduce.ImportJobBase: Beginning import of ACTIVITY
14/03/21 18:21:42 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/03/21 18:21:45 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ACTIVITY_ID), MAX(ACTIVITY_ID) FROM ACTIVITY
14/03/21 18:21:45 INFO mapred.JobClient: Running job: job_201403111406_0015
14/03/21 18:21:47 INFO mapred.JobClient: map 0% reduce 0%
14/03/21 18:22:19 INFO mapred.JobClient: map 25% reduce 0%
14/03/21 18:22:27 INFO mapred.JobClient: map 50% reduce 0%
14/03/21 18:22:29 INFO mapred.JobClient: map 75% reduce 0%
14/03/21 18:22:37 INFO mapred.JobClient: map 100% reduce 0%
...
14/03/21 18:22:39 INFO mapred.JobClient: Map input records=11
14/03/21 18:22:39 INFO mapred.JobClient: Map output records=11
14/03/21 18:22:39 INFO mapred.JobClient: Input split bytes=464
14/03/21 18:22:39 INFO mapred.JobClient: Spilled Records=0
14/03/21 18:22:39 INFO mapred.JobClient: CPU time spent (ms)=3430
14/03/21 18:22:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=506802176
14/03/21 18:22:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2714157056
14/03/21 18:22:39 INFO mapred.JobClient: Total committed heap usage (bytes)=506724352
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Transferred 103 bytes in 56.4649 seconds (1.8241 bytes/sec)
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Retrieved 11 records.

By default, sqoop will put the resulting files in a directory named after the table, under your user’s home directory in HDFS. Let’s take a look and see what’s there:

[oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/ACTIVITY
Found 6 items
-rw-r--r-- 1 oracle supergroup 0 2014-03-21 18:22 /user/oracle/ACTIVITY/_SUCCESS
drwxr-xr-x - oracle supergroup 0 2014-03-21 18:21 /user/oracle/ACTIVITY/_logs
-rw-r--r-- 1 oracle supergroup 27 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00000
-rw-r--r-- 1 oracle supergroup 17 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00001
-rw-r--r-- 1 oracle supergroup 24 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00002
-rw-r--r-- 1 oracle supergroup 35 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00003
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00000
1,Rate
2,Completed
3,Pause
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00001
4,Start
5,Browse

What you can see there is that sqoop has imported the data as a series of “part-m” files – CSV files, one per mapper (the import runs as a map-only MapReduce job, so there’s no reduce phase). There are various options in the docs for specifying compression and other performance features for sqoop imports, but the basic format is a series of CSV files, one per mapper.
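
If you do want to change those defaults, a variation like the one below writes gzip-compressed, tab-delimited output to an explicit target directory – the directory name and delimiter here are just examples:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl \
  --username MOVIEDEMO -P --table ACTIVITY \
  --target-dir /user/oracle/ACTIVITY_GZ \
  --compress \
  --fields-terminated-by '\t'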

You can also import Oracle and other RDBMS data directly into Hive, with sqoop creating equivalent datatypes for the data coming in (basic datatypes only, none of the advanced spatial and other Oracle ones). For example, I could import the CREW table in the MOVIEDEMO schema like this, directly into an equivalent Hive table:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import

Taking a look at Hive, I can then see the table this has created, describe it and count the number of rows it contains:

[oracle@bigdatalite ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/oracle/hive_job_log_effb9cb5-6617-49f4-97b5-b09cd56c5661_1747866494.txt

hive> desc CREW;
OK
crew_id double 
name string 
Time taken: 2.211 seconds

hive> select count(*) from CREW;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403111406_0017, Tracking URL = http://bigdatalite.localdomain:50030/jobdetails.jsp?jobid=job_201403111406_0017
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201403111406_0017
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-03-21 18:33:40,756 Stage-1 map = 0%, reduce = 0%
2014-03-21 18:33:46,797 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:47,812 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:48,821 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:49,907 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:50,916 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:51,929 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:52,942 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:53,951 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:54,961 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_201403111406_0017
MapReduce Jobs Launched: 
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 1.75 sec HDFS Read: 135111 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
6860
Time taken: 20.095 seconds

I can even do an incremental import to bring in new rows, appending their contents to the existing ones in Hive/HDFS:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import --incremental append --check-column CREW_ID
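
On subsequent runs you’d normally also pass --last-value so Sqoop knows which rows it has already brought across (it prints the value to use for the next run at the end of each incremental import); something like the command below, where the value itself is just an example:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import --incremental append --check-column CREW_ID --last-value 6860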

The data I’ve now loaded can be processed by a tool such as ODI, or you can use something like Pig to do some further analysis or number-crunching, like this:

grunt> RAW_DATA = LOAD 'ACTIVITY' USING PigStorage(',') AS
>>     (act_type: int, act_desc: chararray);
grunt> B = FILTER RAW_DATA by act_type < 5;
grunt> STORE B into 'FILTERED_ACTIVITIES' USING PigStorage(',');

the output of which is another set of files on HDFS. Finally, I can export this data back to my Oracle database using the sqoop export feature:

[oracle@bigdatalite ~]$ sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY_FILTERED --export-dir FILTERED_ACTIVITIES

So there’s a lot more to sqoop than just this, including features and topics such as compression, transforming data and so on, but it’s a useful tool and also something you could call from the command line, for example from an ODI Tool if you want to. Recent versions of Hue also come with a GUI for Sqoop, giving you the ability to create jobs graphically and also schedule them using Oozie, the Hadoop workflow scheduler in CDH4.

Running R on Hadoop using Oracle R Advanced Analytics for Hadoop

When most people think about analytics and Hadoop, they tend to think of technologies such as Hive, Pig and Impala as the main tools a data analyst uses. When you talk to data analysts and data scientists though, they’ll usually tell you that their primary tool when working on Hadoop and big data sources is in fact “R”, the open-source statistical and modelling language inspired by the S language but now with its own rich ecosystem, and particularly suited to the data preparation, data analysis and data correlation tasks you’ll often do on a big data project.

You can see examples where R has been used in recent copies of Oracle’s OBIEE SampleApp, where R is used to predict flight delays, with the results then rendered through Oracle R Enterprise, part of the Oracle Database Enterprise Edition Advanced Analytics Option, which allows you to run R scripts in the database and output the results in the form of database functions.


Oracle actually distribute two R products as part of a wider set of Oracle R technologies – Oracle R, their own distribution of open-source R that you can download via the Oracle Public Yum repository, and Oracle R Enterprise (ORE), an add-in to the database that provides efficient connectivity between R and Oracle and also allows you to run R scripts directly within the database. ORE is actually surprisingly good, with its main benefit being that you can perform R analysis directly against data in the database, avoiding the need to dump data to a file and giving you the scalability of the Oracle Database to run your R models, rather than being constrained by the amount of RAM in your laptop. In the screenshot below, you can see part of an Oracle R Enterprise script we’ve written that analyses data from the flight delays dataset:

[Screenshot: Oracle R Enterprise script analysing the flight delays dataset]

with the results then output in the R console:

[Screenshot: the results output in the R console]
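
For context, working with ORE from a client R session starts with something along these lines – a minimal sketch, with the connection details (the MOVIEDEMO schema on the local database) shown purely as an example:

> library(ORE)
> ore.connect(user="MOVIEDEMO", sid="orcl", host="localhost",
+             password="welcome1", port=1521, all=TRUE)
> ore.ls()          # list the tables now visible to R as ore.frame proxy objects
> head(ACTIVITY)    # work with a table like a data frame, with the work done in the database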

In many cases, though, clients won’t want you to run R analysis on their main databases due to the load it’ll put on them, so what do you do when you need to analyse large datasets? A common option is to run your R queries on Hadoop, giving you the flexibility and power of the R language whilst taking advantage of the horizontal scalability of Hadoop, HDFS and MapReduce. There are quite a few options for doing this – the open-source RHIPE and the R package “parallel” both provide R-on-Hadoop capabilities – but Oracle also have a product in this area, “Oracle R Advanced Analytics for Hadoop” (ORAAH), previously known as “Oracle R Connector for Hadoop”, which according to the docs is particularly well-designed for parallel reads and writes, has resource management and database connectivity features, and comes as part of Oracle Big Data Appliance, Oracle Big Data Connectors and the recently released BigDataLite VM. The payoff here, then, is that by using ORAAH you can scale R up to work at Hadoop scale, giving you an alternative to the more set-based Hive and Pig languages when working with super-large datasets.


Oracle R Advanced Analytics for Hadoop is already set up and configured on the BigDataLite VM, and you can try it out by opening a command-line session and running the “R” executable:

[Screenshot: starting R from the command line on the BigDataLite VM]

Type in library(ORCH) to load the ORAAH R libraries (ORCH being the old name of the product), and you should see something like this:

[Screenshot: the ORCH packages loading in the R console]

At that point you can run some demo programs to see examples of what ORAAH can do; for example, read and write data to HDFS using demo("hdfs_cpmv.basic"), or use Hive as a source for R data frames with demo("hive_analysis","ORCH"), like this:

[Screenshot: running the ORAAH demo programs]
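
A quick way to see what else is available is to list the demos shipped with the package and then run the ones you’re interested in from the R prompt, along these lines:

> demo(package = "ORCH")            # list all of the demos shipped with the ORCH packages
> demo("hdfs_cpmv.basic", "ORCH")   # copy and move files between R and HDFS
> demo("hive_analysis", "ORCH")     # use Hive tables as a source for R data frames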

The most important feature, though, is being able to run R jobs on Hadoop, with ORAAH converting the R scripts you write into Hadoop jobs via the Hadoop Streaming utility, which gives MapReduce the ability to use any executable or script as the mapper or reducer. The “mapred_basic” R demo that ships with ORAAH shows a basic example of this working, and you can see the MapReduce jobs being kicked off in the R console, and in the Hadoop JobTracker web UI:

[Screenshot: the mapred_basic demo running in the R console, and the corresponding job in the Hadoop JobTracker web UI]

But what if you want to use ORAAH on your own Hadoop cluster? Let’s walk through a typical setup process using a three-node Cloudera CDH 4.5 cluster running on a VMware ESXi server I’ve got at home, where the three nodes are configured like this:

  • cdh45d-node1.rittmandev.com : 8GB RAM VM running OEL6.4 64-bit and with a full desktop available
  • cdh45d-node2.rittmandev.com : 2GB RAM VM running CentOS 6.2 64-bit, and with a minimal Linux install with no desktop etc.
  • cdh45d-node3.rittmandev.com, as per cdh45d-node2

The idea here is that “cdh45d-node1” is where the data analyst can run their R session, or of course they can just SSH into it. I’ve also got Cloudera Manager installed on there using the free “standard” edition of CDH4.5, with the cluster spanning the other two nodes like this:

[Screenshot: Cloudera Manager showing the CDH4.5 cluster hosts]

If you’re going to use R as a regular OS user (for example, “oracle”) then you’ll also need to use Hue to create a home directory for that user in HDFS, as ORAAH will automatically set that directory as the user’s working directory when they load up the ORAAH libraries.
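
If you’d rather do this from the command line than through Hue, the equivalent is a couple of HDFS commands run as the hdfs superuser – the username here is just an example:

[root@cdh45d-node1 ~]# sudo -u hdfs hadoop fs -mkdir /user/oracle
[root@cdh45d-node1 ~]# sudo -u hdfs hadoop fs -chown oracle:oracle /user/oracle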

Step 1: Configuring the Hadoop Node running R Client Software

On the first node where I’m using OEL6.4, Oracle R 2.15.3 is already installed by Oracle, but we want R 3.0.1 for ORAAH 2.3.1, so you can install it using Oracle’s Public Yum repository like this:

sudo yum install R-3.0.1

Once you’ve done that, you need to download and install the R packages in the ORAAH 2.3.1 zip file that you can get hold of from the Big Data Connectors download page on OTN. Assuming you’ve downloaded and unzipped them to /root/ORCH2.3.1, then, running as root, install the R packages within the zip file in the correct order, like this:

cd /root/ORCH2.3.1/
R --vanilla CMD INSTALL OREbase_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
R --vanilla CMD INSTALL OREstats_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
R --vanilla CMD INSTALL OREmodels_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
R --vanilla CMD INSTALL OREserver_1.4_R_x86_64-unknown-linux-gnu.tar.gz
R --vanilla CMD INSTALL ORCHcore_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
R --vanilla CMD INSTALL ORCHstats_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
R --vanilla CMD INSTALL ORCH_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz

You’ll also need to download and install the “png” source package from the R CRAN website (http://cran.r-project.org/web/packages/png/index.html), and you may need to install the libpng-devel libraries from the Oracle Public Yum site before it will compile:

sudo yum install libpng-devel
R --vanilla CMD INSTALL png_0.1-7.tar.gz 

At that point, you should be good to go. If you’ve installed CDH4.5 using parcels rather than packages though, you’ll need to set a couple of environment variables in your .bash_profile script to tell ORAAH where to find the Hadoop scripts (parcels install Hadoop to /opt/cloudera, whereas packages install to /usr/lib, which is where ORAAH looks for them normally):

# .bash_profile
 
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi
 
# User specific environment and startup programs
 
PATH=$PATH:$HOME/bin
 
export PATH
export ORCH_HADOOP_HOME=/opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30
export HADOOP_STREAMING_JAR=/opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30/lib/hadoop-0.20-mapreduce/contrib/streaming

You should then be able to log into the R console from this particular machine and load up the ORCH libraries:

[oracle@cdh45d-node1 ~]$ . ./.bash_profile
[oracle@cdh45d-node1 ~]$ R
 
Oracle Distribution of R version 3.0.1  (--) -- "Good Sport"
Copyright (C)  The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
  Natural language support but running in an English locale
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
You are using Oracle's distribution of R. Please contact
Oracle Support for any problems you encounter with this
distribution.
 
> library(ORCH)
Loading required package: OREbase
 
Attaching package: ‘OREbase’
 
The following objects are masked from ‘package:base’:
 
    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table
 
Loading required package: OREstats
Loading required package: MASS
Loading required package: ORCHcore
Oracle R Connector for Hadoop 2.3.1 (rev. 288)
Info: using native C base64 encoding implementation
Info: Hadoop distribution is Cloudera's CDH v4.5.0
Info: using ORCH HAL v4.1
Info: HDFS workdir is set to "/user/oracle"
Info: mapReduce is functional
Info: HDFS is functional
Info: Hadoop 2.0.0-cdh4.5.0 is up
Info: Sqoop 1.4.3-cdh4.5.0 is up
Warning: OLH is not found
Loading required package: ORCHstats
> 
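
A quick sanity check at this point is to push a small R data frame into HDFS and read it back, using the same hdfs.* calls the bundled demos use – a minimal sketch:

> df <- data.frame(id = 1:3, val = c("a", "b", "c"))
> dfs.id <- hdfs.put(df, key = "id")   # write the data frame to the HDFS working directory
> hdfs.get(dfs.id)                     # read it back into the local R session
> hdfs.rm(dfs.id)                      # clean up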

Step 2: Configuring the other Nodes in the Hadoop Cluster

Once you’ve set up your main Hadoop node with the R client software, you’ll also need to install R 3.0.1 onto all of the other nodes in your Hadoop cluster, along with a subset of the ORCH library files. If your other nodes are also running OEL you can just repeat the “yum install R-3.0.1” step from before, but in my case I’m running a very stripped-down CentOS 6.2 install on my other nodes, so I need to install wget first, then grab the Oracle Public Yum repo file along with the GPG key file that it’ll require before allowing you to download from that server:

[root@cdh45d-node2 yum.repos.d]# wget http://public-yum.oracle.com/public-yum-ol6.repo
 
--2014-03-10 07:50:25--  http://public-yum.oracle.com/public-yum-ol6.repo
Resolving public-yum.oracle.com... 109.144.113.166, 109.144.113.190
Connecting to public-yum.oracle.com|109.144.113.166|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4233 (4.1K)
Saving to: `public-yum-ol6.repo'
 
100%[=============================================>] 4,233       --.-K/s   in 0s      
 
2014-03-10 07:50:26 (173 MB/s) - `public-yum-ol6.repo' saved [4233/4233]
 
[root@cdh45d-node2 yum.repos.d]# wget http://public-yum.oracle.com/RPM-GPG-KEY-oracle-ol6 -O /etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
 
--2014-03-10 07:50:26--  http://public-yum.oracle.com/RPM-GPG-KEY-oracle-ol6
Resolving public-yum.oracle.com... 109.144.113.190, 109.144.113.166
Connecting to public-yum.oracle.com|109.144.113.190|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1011
Saving to: `/etc/pki/rpm-gpg/RPM-GPG-KEY-oracle'
 
100%[=============================================>] 1,011       --.-K/s   in 0s      
 
2014-03-10 07:50:26 (2.15 MB/s) - `/etc/pki/rpm-gpg/RPM-GPG-KEY-oracle' saved [1011/1011]

You’ll then need to edit the public-yum-ol6.repo file to enable the correct repository (RHEL/CentOS/OEL 6.4 in my case) for your VMs, and also enable the add-ons repository; the file contents below show the repo file after I made the required changes.

[ol6_latest]
name=Oracle Linux $releasever Latest ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/latest/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
 
[ol6_addons]
name=Oracle Linux $releasever Add ons ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/addons/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
 
[ol6_ga_base]
name=Oracle Linux $releasever GA installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/0/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u1_base]
name=Oracle Linux $releasever Update 1 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/1/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u2_base]
name=Oracle Linux $releasever Update 2 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/2/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u3_base]
name=Oracle Linux $releasever Update 3 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/3/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u4_base]
name=Oracle Linux $releasever Update 4 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/4/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

Once you’ve saved the file, you can then run the install of R-3.0.1 as before, copy the ORCH files across to the server and then install just the OREbase, OREstats, OREmodels and OREserver packages, like this:

[root@cdh45d-node2 yum.repos.d]# yum install R-3.0.1
ol6_UEK_latest                                                  | 1.2 kB     00:00     
ol6_UEK_latest/primary                                          |  13 MB     00:03     
ol6_UEK_latest                                                                 281/281
ol6_addons                                                      | 1.2 kB     00:00     
ol6_addons/primary                                              |  42 kB     00:00     
ol6_addons                                                                     169/169
ol6_latest                                                      | 1.4 kB     00:00     
ol6_latest/primary                                              |  36 MB     00:10     
ol6_latest                                                                 24906/24906
ol6_u4_base                                                     | 1.4 kB     00:00     
ol6_u4_base/primary                                             | 2.7 MB     00:00     
ol6_u4_base                                                                  8396/8396
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package R.x86_64 0:3.0.1-2.el6 will be installed
 
[root@cdh45d-node2 ~]# cd ORCH2.3.1\ 2/
[root@cdh45d-node2 ORCH2.3.1 2]# ls
ORCH_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
ORCHcore_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
ORCHstats_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
OREbase_1.4_R_x86_64-unknown-linux-gnu.tar.gz
OREmodels_1.4_R_x86_64-unknown-linux-gnu.tar.gz
OREserver_1.4_R_x86_64-unknown-linux-gnu.tar.gz
OREstats_1.4_R_x86_64-unknown-linux-gnu.tar.gz
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREbase_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREbase’ ...
* DONE (OREbase)
Making 'packages.html' ... done
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREstats_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREstats’ ...
* DONE (OREstats)
Making 'packages.html' ... done
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREmodels_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREmodels’ ...
* DONE (OREmodels)
Making 'packages.html' ... done
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREserver_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREserver’ ...
* DONE (OREserver)
Making 'packages.html' ... done

Now, you can open a terminal window on the main node with the R client software, or SSH into the server, and run one of the ORAAH demos where the R job gets run on the Hadoop cluster.

> demo("mapred_basic","ORCH")
 
 
demo(mapred_basic)
---- ~~~~~~~~~~~~
 
> #
> #     ORACLE R CONNECTOR FOR HADOOP DEMOS
> #
> #     Name: mapred_basic
> #     Description: Demonstrates running a mapper and a reducer containing 
> #                  R script in ORCH.
> #
> #
> #
> 
> 
> ##
> # A simple example of how to operate with key-values. Input dataset - cars.
> # Filter cars with with "dist" > 30 in mapper and get mean "dist" for each 
> # "speed" in reducer.
> ##
> 
> ## Set page width
> options(width = 80)
 
> # Put the cars dataset into HDFS
> cars.dfs <- hdfs.put(cars, key='speed')
 
> # Submit the hadoop job with mapper and reducer R scripts
> x <- try(hadoop.run(
+     cars.dfs,
+     mapper = function(key, val) {
+             orch.keyvals(key[val$dist > 30], val[val$dist > 30,])
+     },
+     reducer = function(key, vals) {
+         X <- sum(vals$dist)/nrow(vals)
+         orch.keyval(key, X)
+     },
+     config = new("mapred.config",
+         map.tasks = 1,
+         reduce.tasks = 1
+     )
+ ), silent = TRUE)
 
> # In case of errors, cleanup and return
> if (inherits(x,"try-error")) {
+  hdfs.rm(cars.dfs)
+  stop("execution error")
+ }
 
> # Print the results of the mapreduce job
> print(hdfs.get(x))
   val1     val2
1    10 34.00000
2    13 38.00000
3    14 58.66667
4    15 54.00000
5    16 36.00000
6    17 40.66667
7    18 64.50000
8    19 50.00000
9    20 50.40000
10   22 66.00000
11   23 54.00000
12   24 93.75000
13   25 85.00000
 
> # Remove the HDFS files created above
> hdfs.rm(cars.dfs)
[1] TRUE
 
> hdfs.rm(x)
[1] TRUE

with the job tracker web UI on Hadoop confirming the R script ran on the Hadoop cluster.

[Screenshot: Hadoop JobTracker web UI confirming the MapReduce job ran on the cluster]

So that’s a basic tech intro to ORAAH and some tips on the install. Documentation for ORAAH and the rest of the Big Data Connectors is available on OTN (including a list of all the R commands you get as part of the ORAAH/ORCH package). Keep an eye on the blog, though, for more on R, ORE and ORAAH as I try to share some examples from datasets we’ve worked on.

Rittman Mead BI Forum 2014 Now Open for Registration!

I’m very pleased to announce that the Rittman Mead BI Forum 2014, running in Brighton and Atlanta in May 2014, is now open for registration. Keeping the format as before – a single stream at each event, world-class speakers and expert-level presentations, and a strictly limited number of attendees – this is the premier Oracle BI tech conference for developers looking for something beyond marketing and beginner-level content.

This year we have a fantastic line-up of speakers and sessions, including:

  • Oracle ACE and past BI Forum best speaker winner Kevin McGinley, on adding third-party visualisations to OBIEE
  • Tony Heljula, winner of multiple best speaker awards and this year presenting on Exalytics and TimesTen Columnar Storage
  • Sessions from TimesTen PMs Chris Jenkins and Susan Cheung on what’s coming with TimesTen
  • Edward Roske, author of multiple books on Essbase, on Essbase optimisation
  • Oracle’s Andrew Bond, and our own Stewart Bryson (Oracle ACE) with an update to Oracle’s reference BI, DW and Big Data Architecture
  • Sessions from Oracle’s Jack Berkowitz, Adam Bloom and Matt Bedin on what’s coming with OBIEE and Oracle BI Applications
  • Endeca sessions from Chris Lynskey (PM) and Omri Traub (Development Manager), along with ones from Branchbird’s Patrick Rafferty and Truls Bergersen
  • And sessions from Rittman Mead’s Robin Moffatt (OBIEE performance), Gianni Ceresa (Essbase) and Michael Rainey (ODI, with Nick Hurt from IFPI)

We’ve also got some excellent keynote sessions including one in the US from Maria Colgan on the new in-memory database option, and another in Brighton from Matt Bedin and Adam Bloom on BI in the Cloud – along with the opening-night Oracle product development keynote in both Brighton and Atlanta.

We’re also very excited to welcome Lars George from Cloudera to deliver this year’s optional one-day masterclass, which this time around is on Hadoop, big data, and how Oracle BI&DW developers can get started with this technology. Lars is Cloudera’s Chief Architect in EMEA and an HBase committer, and he’ll be covering topics such as:

  • What is Hadoop, what’s in the Hadoop ecosystem and how do you design a Hadoop cluster
  • Using tools such as Flume and Sqoop to import data into Hadoop, and then analyse it using Hive, Pig, Impala and Cloudera Search
  • Introduction to NoSQL and HBase
  • Connecting Hadoop to tools such as OBIEE and ODI using JDBC, ODBC, Impala and Hive

If you’ve been meaning to take a look at Hadoop, or if you’ve made a start but would like a chance to discuss techniques with someone who’s out in the field every week designing and building Hadoop systems, this session is aimed at you – it’s on the Wednesday before each event and you can book at the same time as registering for the main BI Forum days.


Attendance is limited to around seventy at each event, and we’re running the Brighton BI Forum back at the Hotel Seattle, whilst the US one is running at the Renaissance Midtown Hotel, Atlanta. We encourage attendees to stay at the hotel as well so as to maximise networking opportunities, and this year you can book US accommodation directly with the hotel so you can collect any Marriott points, corporate discounts etc. As usual, we’ll take good care of you over the two or three days, with meals each night, drinks receptions and lots of opportunities to meet colleagues and friends in the industry.

Full details are on the BI Forum 2014 web page, including links to the registration sites. Book now so you don’t miss out – each year we sell out in advance, so don’t leave it to the last minute if you’re thinking of coming. Hopefully see you all in Brighton and Atlanta in May 2014!

Rittman Mead BI Forum 2014 Abstract Scoring Now Live – For 1 Week Only!

The call for papers for the Rittman Mead BI Forum 2014 closed at the end of January, and we’ve had some excellent submissions on topics ranging from OBIEE, visualisations and data discovery through to in-memory analytics, big data and data integration. As always, we’re now opening up the abstract submission list for scoring, so that anyone considering coming to either the Brighton or Atlanta events can have a say in which abstracts are selected.

The voting forms, and event details, are below:

In case you missed it, we also announced the speaker for the Wednesday masterclass the other day – Lars George from Cloudera, who’ll be talking about Hadoop, HBase, Cloudera and how it all applies to the worlds of analytics, BI and DW – something we’re all really excited about.

Voting is open for just one week, and will close at 5pm PST on Tuesday, 18th Feb. Shortly afterwards we’ll announce the speaker line-up, and open up registrations for both events. Keep an eye on the blog for more details as they come.