Tag Archives: Big Data

Rittman Mead BI Forum 2014 Registration Now Open – Don’t Miss Out!

Just a quick reminder to say that registration for the Rittman Mead BI Forum 2014 is now open, with the speaker and presentation list now up on the event website. As with previous years, the BI Forum runs in Brighton in the first week and then moves over to Atlanta in the second, with full dates and venues listed on the BI Forum 2014 web page.

We’ve got a fantastic line-up of sessions and speakers, including:

  • Oracle ACE and past BI Forum best speaker winner Kevin McGinley, on adding third-party visualisations to OBIEE
  • Sessions from TimesTen PMs Chris Jenkins and Susan Cheung on what’s coming with TimesTen
  • Wayne Van Sluys from InterRel, on Essbase optimisation
  • Oracle’s Andrew Bond, and our own Stewart Bryson (Oracle ACE) with an update to Oracle’s reference BI, DW and Big Data Architecture
  • Dan Vlamis on using Oracle Database analytics with the Oracle BI Applications
  • Sessions from Oracle’s Jack Berkowitz, Adam Bloom and Matt Bedin on what’s coming with OBIEE and Oracle BI Applications
  • Peak Indicators’ Alastair Burgess on tuning TimesTen with Aggregate Persistence
  • Endeca sessions from Chris Lynskey (PM) and Omri Traub (Development Manager), along with ones from Branchbird’s Patrick Rafferty and Truls Bergersen
  • And sessions from Rittman Mead’s Robin Moffatt (OBIEE performance), Gianni Ceresa (Essbase) and Michael Rainey (ODI, with Nick Hurt from IFPI)

NewImage

We’ve also got some excellent keynote sessions including one in the US from Maria Colgan on the new in-memory database option, and another in Brighton from Matt Bedin and Adam Bloom on BI in the Cloud – along with the opening-night Oracle product development keynote in both Brighton and Atlanta.

We’re also very excited to welcome Lars George from Cloudera to deliver this year’s optional one-day masterclass, which is on Hadoop, big data, and how Oracle BI&DW developers can get started with this technology. Lars is Cloudera’s Chief Architect in EMEA and an HBase committer, and he’ll be covering topics such as:

  • What is Hadoop, what’s in the Hadoop ecosystem and how do you design a Hadoop cluster
  • Using tools such as Flume and Sqoop to import data into Hadoop, and then analyse it using Hive, Pig, Impala and Cloudera Search
  • Introduction to NoSQL and HBase
  • Connecting Hadoop to tools such as OBIEE and ODI using JDBC, ODBC, Impala and Hive

If you’ve been meaning to take a look at Hadoop, or if you’ve made a start but would like a chance to discuss techniques with someone who’s out in the field every week designing and building Hadoop systems, this session is aimed at you – it’s on the Wednesday before each event and you can book at the same time as registering for the main BI Forum days.

NewImage

Attendance is limited to around seventy at each event, and we’re running the Brighton BI Forum back at the Hotel Seattle, whilst the US one is running at the Renaissance Midtown Hotel, Atlanta. We encourage attendees to stay at the hotel as well so as to maximise networking opportunities, and this year you can book US accommodation directly with the hotel so you can collect any Marriott points, corporate discounts etc. As usual, we’ll take good care of you over the two or three days, with meals each night, drinks receptions and lots of opportunities to meet colleagues and friends in the industry.

Full details are on the BI Forum 2014 web page, including links to the registration sites. Book now so you don’t miss out – each year we sell out in advance, so don’t leave it to the last minute if you’re thinking of coming. Hopefully see you all in Brighton and Atlanta in May 2014!

Using Oracle R Enterprise to Analyze Large In-Database Datasets

The other week I posted an article on the blog about Oracle R Advanced Analytics for Hadoop, part of Oracle’s Big Data Connectors and used for running certain types of R analysis over a Hadoop cluster. ORAAH lets you move data in and out of HDFS and Hive and into in-memory R data frames, and gives you the ability to create Hadoop MapReduce jobs but using R commands and syntax. If you’re looking to use R to analyse, prepare and explore your data, and you’ve got access to a large Hadoop cluster, ORAAH is a useful way to go beyond the normal memory constraints of R running on your laptop.

But what if the data you want to analyse is currently in an Oracle database? You can export the relevant tables to flat files and then import them into HDFS, or you can use a tool such as Sqoop to copy the data directly into HDFS and Hive tables. Another option to consider, though, is to run your R analysis directly on the database tables, avoiding the need to move data around and taking advantage of the scalability of your Oracle database – which is where Oracle R Enterprise comes in.

Oracle R Enterprise is part of the Oracle Database Enterprise Edition “Advanced Analytics Option”, so it’s licensed separately from ORAAH and the Big Data Connectors. It gives you three things:

image2

  • Some client packages to install locally on your desktop, into regular R (or ideally, Oracle’s R distribution)
  • Some database server-side R packages to provide a “transparency layer”, converting R commands into SQL ones, along with extra SQL stats functions to support R
  • The ability to spawn off R engines within the Oracle Database using the extproc mechanism, for performing R analysis directly on the data rather than through the client on your laptop

Where this gets interesting for us is that the ORE transparency layer not only makes it simple to move data in and out of the Oracle Database, but more importantly lets us use database tables and views as R “ore.frames” – proxies for “data frames”, the R equivalent of database tables and the basic data structure that R commands work on. Going down this route avoids the need to export the data we’re interested in out of the Oracle Database, with the ORE transparency layer converting most R function calls into Oracle Database SQL ones – meaning we can use the data analyst-friendly R language whilst Oracle does the heavy lifting under the covers.

NewImage
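As a minimal sketch of the idea (using the PERFORMANCE table and columns from the walkthrough that follows, rather than code from the post itself), a filter and an aggregate on an ore.frame both stay in the database, and nothing comes back to the R client until you explicitly pull it:

# Minimal sketch - none of the first three statements moves data to the client
flights <- PERFORMANCE[PERFORMANCE$ARRDELAY > 15, ]        # filter, pushed down as a SQL WHERE clause
by_dest <- aggregate(flights$ARRDELAY,                      # aggregate, pushed down as a SQL GROUP BY
                     by = list(flights$DEST), FUN = length)
class(by_dest)                                              # still an ore.frame, still in the database
local_df <- ore.pull(by_dest)                               # only now are the results pulled back into R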

There’s more to ORE than just the transparency layer, but let’s take a look at how you might use ORE and this feature, using the same “flight delays” dataset I used in my post a couple of months ago on Hadoop, Hive and Impala. We’ll use the OBIEE 11.1.1.7.1 SampleApp v309R2 that you can download from OTN as it’s got Oracle R Enterprise already installed, although you’ll need to follow step 10 in the accompanying deployment guide to install the R packages that Oracle couldn’t distribute along with SampleApp.

In the following examples, we’ll:

  • Connect to the main PERFORMANCE fact table in the BI_AIRLINES schema, read in its metadata (columns), and then set it up as a “virtual” R data frame that actually points through to the database table
  • Then we’ll perform some basic analysis, binning and totalling for that table, to give us a sense of what’s in it
  • And then we’ll run some more R analysis on the table, outputting the results in the form of graphs and answering questions such as “which days of the week are best to fly out on?” and “how has airlines’ relative on-time performance changed over time?”

Let’s start off, then, by starting the R console and connecting to the database schema containing the flight delays data.

[oracle@obieesample ~]$ R
 
Oracle Distribution of R version 2.15.1  (--) -- "Roasted Marshmallows"
Copyright (C)  The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
  Natural language support but running in an English locale
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
You are using Oracle's distribution of R. Please contact
Oracle Support for any problems you encounter with this
distribution.
 
[Previously saved workspace restored]
 
> library(ORE)
Loading required package: OREbase
 
Attaching package: ‘OREbase’
 
The following object(s) are masked from ‘package:base’:
 
    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table
 
Loading required package: OREstats
Loading required package: MASS
Loading required package: OREgraphics
Loading required package: OREeda
Loading required package: OREdm
Loading required package: lattice
Loading required package: OREpredict
Loading required package: ORExml
> ore.connect("bi_airlines","orcl","localhost","BI_AIRLINES",all=TRUE)
Loading required package: ROracle
Loading required package: DBI
> 

Note that “library(ORE)” loads up the Oracle R Enterprise R libraries, and “ore.connect” connects the R session to the relevant Oracle database.

I then synchronise R’s metadata with the objects in this database schema, list out what tables are available to us in that schema, and attach that schema to my R session so I can manipulate the tables from there.

> ore.sync()
> ore.ls()
 [1] "AIRCRAFT_GROUP"           "AIRCRAFT_TYPE"           
 [3] "AIRLINE_ID"               "AIRLINES_USER_DATA"      
 [5] "CANCELLATION"             "CARRIER_GROUP_NEW"       
 [7] "CARRIER_REGION"           "DEPARBLK"                
 [9] "DISTANCE_GROUP_250"       "DOMESTIC_SEGMENT"        
[11] "OBIEE_COUNTY_HIER"        "OBIEE_GEO_AIRPORT_BRIDGE"
[13] "OBIEE_GEO_ORIG"           "OBIEE_ROUTE"             
[15] "OBIEE_TIME_DAY_D"         "OBIEE_TIME_MTH_D"        
[17] "ONTIME_DELAY_GROUPS"      "PERFORMANCE"             
[19] "PERFORMANCE_ENDECA_MV"    "ROUTES_FOR_LINKS"        
[21] "SCHEDULES"                "SERVICE_CLASS"           
[23] "UNIQUE_CARRIERS"         
> ore.attach("bi_airlines")
> 

Now although we know these objects as database tables, ORE presents them to R as “data frames”, using ore.frame as a proxy for the fundamental data structure in R that looks just like a table in the relational database world. Behind the scenes, ORE maps these proxy data frames to the underlying Oracle tables and turns R commands into SQL function calls, including a set of new SQL stats functions added specifically for ORE. Note that this is conceptually different to Oracle R Advanced Analytics for Hadoop, which doesn’t map (or overload) standard R functions to their Hadoop (MapReduce or Hive) equivalents – instead it gives you a set of new R functions for creating MapReduce jobs, which you then submit to a Hadoop cluster for processing, giving you a more R-native way of writing MapReduce code. ORE, in contrast, tries to map as much of R’s functionality as possible to Oracle Database functions, so you can run a normal R session whilst the Oracle Database processes the bigger queries close to the data.
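To make the contrast concrete, here’s a rough sketch of the two programming models (simplified, and not taken from the post – the ORAAH half assumes an HDFS dataset called delays.dfs created with hdfs.put(), and mirrors the hadoop.run() demo that appears in the ORAAH article further down this page):

# ORE: ordinary, overloaded R syntax that the transparency layer runs as SQL in the database
avg_delay <- aggregate(PERFORMANCE$ARRDELAY,
                       by = list(PERFORMANCE$DEST), FUN = mean)
 
# ORAAH: an explicit MapReduce job written in R and submitted to the Hadoop cluster
res <- hadoop.run(delays.dfs,                               # hypothetical HDFS dataset from hdfs.put()
                  mapper  = function(key, val) { orch.keyvals(key, val) },
                  reducer = function(key, vals) { orch.keyval(key, mean(vals$ARRDELAY)) })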

Let’s use another two R commands to see how it views the PERFORMANCE table in the flight delays data set, and get some basic sizing metrics.

> class(PERFORMANCE)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> dim(PERFORMANCE)
[1] 6362422     112

Now at this point I could pull the data from one of those tables directly into an in-memory R data frame, like this:

> carriers <- ore.pull(UNIQUE_CARRIERS)
Warning message:
ORE object has no unique key - using random order 
> class(UNIQUE_CARRIERS)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> class(carriers)
[1] "data.frame"
> 

As you see, R sees the UNIQUE_CARRIERS object as an ore.frame, whilst carriers (into which data from UNIQUE_CARRIERS was loaded) is a regular data.frame object. In some cases you might want to load data from Oracle tables into a regular data.frame, but what’s interesting here is that we can work directly with ore.frame objects and let the Oracle database do the hard work. So let’s get to work on the PERFORMANCE ore.frame object and do some initial analysis and investigation.

> df <- PERFORMANCE[,c("YEAR","DEST","ARRDELAY")]
> class(df)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
> head(df)
  YEAR DEST ARRDELAY
1 2010  BOI      -13
2 2010  BUF       44
3 2010  BUF      -14
4 2010  BUR       -6
5 2010  BUR       -2
6 2010  BUR       -9
Warning messages:
1: ORE object has no unique key - using random order 
2: ORE object has no unique key - using random order 
> options(ore.warn.order = FALSE)
> head(PERFORMANCE[,c(1,4,23)])
  YEAR DAYOFMONTH DESTWAC
1 2010         16      83
2 2010         16      22
3 2010         16      22
4 2010         16      91
5 2010         16      91
6 2010         16      91
>

In the above script, the first command creates a temporary ore.frame object made up of just three of the columns from the PERFORMANCE table / ore.frame. Then I switch off the warning about these tables not having unique keys (“options(ore.warn.order = FALSE)”), and then I select three more columns directly from the PERFORMANCE table / ore.frame.

> aggdata <- aggregate(PERFORMANCE$DEST,
+                      by = list(PERFORMANCE$DEST),
+                      FUN = length)
> class(aggdata)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
 
> head(aggdata)
    Group.1     x
ABE     ABE  4104
ABI     ABI  2497
ABQ     ABQ 33298
ABR     ABR     5
ABY     ABY  1028
ACK     ACK   346
 
> (t <- table(PERFORMANCE$DAYOFWEEK))
 
     1      2      3      4      5      6      7 
943305 924442 932113 942066 956123 777203 887170
 
> dat = PERFORMANCE[PERFORMANCE$ARRDELAY<100 & PERFORMANCE$ARRDELAY>-100,]
> ad = with(dat, split(ARRDELAY,UNIQUECARRIER))
> boxplot(ad,col = "blue", notch = TRUE, cex = 0.5, varwidth = TRUE)

In the above set of scripts, I first aggregate flights by destination airports, then count flights by day of week. In the final set of commands I get a bit more advanced and create a box plot graph showing the range of flight delays by airline, which produces the following graph from the R console:

NewImage

whereas in the next one I create a histogram of flight delays (minutes), showing the vast majority of delays are just a few minutes.

> ad = PERFORMANCE$ARRDELAY
> ad = subset(ad, ad>-200&ad<200)
> hist(ad, breaks = 100, main = "Histogram of Arrival Delay")

NewImage

All of this so far, to be fair, you could do just as easily in SQL or in a tool like Excel, but these are the sort of commands an R analyst would want to run before getting onto the interesting stuff, and it’s great that they can now do this on the full dataset in an Oracle database, not just on what they can pull into memory on their laptop. Let’s do something more interesting now, and answer the question “which day of the week is best for flying out, in terms of not hitting delays?”

> ontime <- PERFORMANCE
> delay <- ontime$ARRDELAY
> dayofweek <- ontime$DAYOFWEEK
> bd <- split(delay, dayofweek)
> boxplot(bd, notch = TRUE, col = "red", cex = 0.5,
+         outline = FALSE, axes = FALSE,
+         main = "Airline Flight Delay by Day of Week",
+         ylab = "Delay (minutes)", xlab = "Day of Week")

NewImage

Looks like Tuesday’s the best. So how has a selection of airlines performed over the past few years?

> ontimeSubset <- subset(PERFORMANCE, UNIQUECARRIER %in% c("AA", "AS", "CO", "DL","WN","NW")) 
> res22 <- with(ontimeSubset, tapply(ARRDELAY, list(UNIQUECARRIER, YEAR), mean, na.rm = TRUE))
> g_range <- range(0, res22, na.rm = TRUE)
> rindex <- seq_len(nrow(res22))
> cindex <- seq_len(ncol(res22))
> par(mfrow = c(2,3))
> for(i in rindex) {
+   temp <- data.frame(index = cindex, avg_delay = res22[i,])
+   plot(avg_delay ~ index, data = temp, col = "black",
+        axes = FALSE, ylim = g_range, xlab = "", ylab = "",
+        main = attr(res22, "dimnames")[[1]][i])
+        axis(1, at = cindex, labels = attr(res22, "dimnames")[[2]]) 
+        axis(2, at = 0:ceiling(g_range[2]))
+        abline(lm(avg_delay ~ index, data = temp), col = "green") 
+        lines(lowess(temp$index, temp$avg_delay), col="red")
+ } 
>

NewImage

See this presentation from the BIWA SIG for more examples of ORE queries against flight delays data; those examples are written against the ONTIME_S dataset that ships with ORE as part of the install, and can be adapted to the dataset used here.

Now where R and ORE get really interesting, in the context of BI and OBIEE, is when you embed R scripts directly in the Oracle Database and use them to provide forecasting, modelling and other “advanced analytics” features, using R engines that the database spins up on demand through the extproc mechanism mentioned earlier. Once you’ve done this, you can expose the calculations through an OBIEE RPD, as Oracle have done in the OBIEE 11.1.1.7.1 SampleApp, shown below:

NewImage
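As a flavour of what’s involved, here’s a minimal sketch (not taken from the SampleApp, and using a made-up script name) of ORE’s embedded R execution feature: ore.scriptCreate() registers an R function in the database, and ore.tableApply() then runs it in a database-spawned R engine against an ore.frame, so the data never leaves the server. The SQL-side equivalents of these calls (rqTableEval and friends) are then what you’d typically reference from the OBIEE RPD.

# Minimal sketch of ORE embedded R execution - the script name and logic are illustrative only,
# and creating scripts requires the RQADMIN privilege
ore.scriptCreate("delay_summary", function(dat) {
  # this function runs inside an R engine spawned on the database server
  data.frame(avg_delay = mean(dat$ARRDELAY, na.rm = TRUE),
             flights   = nrow(dat))
})
 
# Run the stored script against the PERFORMANCE ore.frame
res <- ore.tableApply(PERFORMANCE[, c("YEAR", "ARRDELAY")],
                      FUN.NAME = "delay_summary")
res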

But that’s really an article in itself – so I’ll cover this process and how you surface it all through OBIEE in a follow-up post soon.

Using Sqoop for Loading Oracle Data into Hadoop on the BigDataLite VM

This is old hat for most Hadoop veterans, but I’ve been meaning to note it on the blog for a while, for anyone whose first encounter with Hadoop is Oracle’s BigDataLite VM.

Most people looking to bring external data into Hadoop do so through flat-file exports that they then import into HDFS, using the “hadoop fs” command-line tool or Hue, the web-based developer tool in BigDataLite, Cloudera CDH, Hortonworks and so on. They then often create Hive tables over them, either from the Hive / Beeswax shell or through Hue, which can create a table for you out of a file you upload from your browser. ODI, through the ODI Application Adapter for Hadoop, also gives you a knowledge module (IKM File to Hive) that’s used in the ODI demos on BigDataLite to load data into Hive tables from an Apache Avro-format log file.

NewImage

What a lot of people new to Hadoop don’t know is that you can skip the “dump to file” step completely and load data into HDFS directly from the Oracle database. The tool you use for this comes as part of the Cloudera CDH4 Hadoop distribution that’s on BigDataLite, and it’s called “Sqoop”.

“Sqoop”, short for “SQL to Hadoop”, gives you the ability to do the following Oracle data transfer tasks amongst other ones:

  • Import whole tables, or whole schemas, from Oracle and other relational databases into Hadoop’s file system, HDFS
  • Export data from HDFS back out to these databases – with the export and import being performed through MapReduce jobs
  • Import using an arbitrary SQL SELECT statement, rather than grabbing whole tables
  • Perform incremental loads, specifying a check column to determine which rows are new
  • Load directly into Hive tables, creating HDFS files in the background and the Hive metadata automatically

Documentation for Sqoop as shipped with CDH4 can be found on the Cloudera website here, and there are even optimisations and plugins for databases such as Oracle to enable faster, direct loads – for example OraOOP.

Normally, you’d need to download and install JDBC drivers for Sqoop before you can use it, but BigDataLite comes with the required Oracle JDBC drivers, so let’s just have a play around and see some examples of Sqoop in action. I’ll start by importing the ACTIVITY table from the MOVIEDEMO schema that comes with the Oracle 12c database also on BigDataLite (make sure the database is running first, though):

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY

You should then see sqoop process your command in its console output, and then run the MapReduce jobs to bring in the data via the Oracle JDBC driver:

14/03/21 18:21:36 INFO sqoop.Sqoop: Running Sqoop version: 1.4.3-cdh4.5.0
14/03/21 18:21:36 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/03/21 18:21:37 INFO manager.SqlManager: Using default fetchSize of 1000
14/03/21 18:21:37 INFO tool.CodeGenTool: Beginning code generation
14/03/21 18:21:38 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM ACTIVITY t WHERE 1=0
14/03/21 18:21:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
14/03/21 18:21:38 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
Note: /tmp/sqoop-oracle/compile/b4949ed7f3e826839679143f5c8e23c1/ACTIVITY.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/03/21 18:21:41 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-oracle/compile/b4949ed7f3e826839679143f5c8e23c1/ACTIVITY.jar
14/03/21 18:21:41 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:41 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:42 INFO mapreduce.ImportJobBase: Beginning import of ACTIVITY
14/03/21 18:21:42 INFO manager.OracleManager: Time zone has been set to GMT
14/03/21 18:21:44 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/03/21 18:21:45 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ACTIVITY_ID), MAX(ACTIVITY_ID) FROM ACTIVITY
14/03/21 18:21:45 INFO mapred.JobClient: Running job: job_201403111406_0015
14/03/21 18:21:47 INFO mapred.JobClient: map 0% reduce 0%
14/03/21 18:22:19 INFO mapred.JobClient: map 25% reduce 0%
14/03/21 18:22:27 INFO mapred.JobClient: map 50% reduce 0%
14/03/21 18:22:29 INFO mapred.JobClient: map 75% reduce 0%
14/03/21 18:22:37 INFO mapred.JobClient: map 100% reduce 0%
...
14/03/21 18:22:39 INFO mapred.JobClient: Map input records=11
14/03/21 18:22:39 INFO mapred.JobClient: Map output records=11
14/03/21 18:22:39 INFO mapred.JobClient: Input split bytes=464
14/03/21 18:22:39 INFO mapred.JobClient: Spilled Records=0
14/03/21 18:22:39 INFO mapred.JobClient: CPU time spent (ms)=3430
14/03/21 18:22:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=506802176
14/03/21 18:22:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2714157056
14/03/21 18:22:39 INFO mapred.JobClient: Total committed heap usage (bytes)=506724352
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Transferred 103 bytes in 56.4649 seconds (1.8241 bytes/sec)
14/03/21 18:22:39 INFO mapreduce.ImportJobBase: Retrieved 11 records.

By default, Sqoop puts the resulting files in a directory named after the table, under your user’s home directory in HDFS. Let’s take a look and see what’s there:

[oracle@bigdatalite ~]$ hadoop fs -ls /user/oracle/ACTIVITY
Found 6 items
-rw-r--r-- 1 oracle supergroup 0 2014-03-21 18:22 /user/oracle/ACTIVITY/_SUCCESS
drwxr-xr-x - oracle supergroup 0 2014-03-21 18:21 /user/oracle/ACTIVITY/_logs
-rw-r--r-- 1 oracle supergroup 27 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00000
-rw-r--r-- 1 oracle supergroup 17 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00001
-rw-r--r-- 1 oracle supergroup 24 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00002
-rw-r--r-- 1 oracle supergroup 35 2014-03-21 18:22 /user/oracle/ACTIVITY/part-m-00003
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00000
1,Rate
2,Completed
3,Pause
[oracle@bigdatalite ~]$ hadoop fs -cat /user/oracle/ACTIVITY/part-m-00001
4,Start
5,Browse

What you can see there is that Sqoop has imported the data as a series of “part-m” files – CSV files, one per mapper. There are various options in the docs for specifying compression and other performance features for Sqoop imports, but the basic format is a series of CSV files, one per map task.

You can also import Oracle and other RDBMS data directly into Hive, with Sqoop creating equivalent datatypes for the data coming in (basic datatypes only, none of the advanced spatial and other Oracle ones). For example, I could import the CREW table in the MOVIEDEMO schema like this, directly into an equivalent Hive table:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import

Taking a look at Hive, I can then see the table this has created, describe it and count the number of rows it contains:

[oracle@bigdatalite ~]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.5.0.jar!/hive-log4j.properties
Hive history file=/tmp/oracle/hive_job_log_effb9cb5-6617-49f4-97b5-b09cd56c5661_1747866494.txt

hive> desc CREW;
OK
crew_id double 
name string 
Time taken: 2.211 seconds

hive> select count(*) from CREW;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201403111406_0017, Tracking URL = http://bigdatalite.localdomain:50030/jobdetails.jsp?jobid=job_201403111406_0017
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201403111406_0017
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-03-21 18:33:40,756 Stage-1 map = 0%, reduce = 0%
2014-03-21 18:33:46,797 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:47,812 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:48,821 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:49,907 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:50,916 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.74 sec
2014-03-21 18:33:51,929 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:52,942 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:53,951 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
2014-03-21 18:33:54,961 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.75 sec
MapReduce Total cumulative CPU time: 1 seconds 750 msec
Ended Job = job_201403111406_0017
MapReduce Jobs Launched: 
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 1.75 sec HDFS Read: 135111 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 750 msec
OK
6860
Time taken: 20.095 seconds

I can even do an incremental import to bring in new rows, appending their contents to the existing ones in Hive/HDFS:

[oracle@bigdatalite ~]$ sqoop import --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table CREW --hive-import --incremental append --check-column CREW_ID

The data I’ve now loaded can be processed by a tool such as ODI, or you can use a tool such as Pig to do some further analysis or number crunching, like this:

grunt> RAW_DATA = LOAD 'ACTIVITY' USING PigStorage(',') AS  
grunt> (act_type: int, act_desc: chararray);  
grunt> B = FILTER RAW_DATA by act_type < 5; 
grunt> STORE B into 'FILTERED_ACTIVITIES' USING PigStorage(','); 

the output of which is another file on HDFS. Finally, I can export this data back to my Oracle database using the sqoop export feature:

[oracle@bigdatalite ~]$ sqoop export --connect jdbc:oracle:thin:@localhost:1521/orcl --username MOVIEDEMO --password welcome1 --table ACTIVITY_FILTERED --export-dir FILTERED_ACTIVITIES

So there’s a lot more to Sqoop than just this, including features and topics such as compression, transforming data and so on, but it’s a useful tool and also something you could call from the command line, using an ODI Tool if you want to. Recent versions of Hue also come with a GUI for Sqoop, giving you the ability to create jobs graphically and also schedule them using Oozie, the Hadoop scheduler in CDH4.

Running R on Hadoop using Oracle R Advanced Analytics for Hadoop

When most people think about analytics and Hadoop, they tend to think of technologies such as Hive, Pig and Impala as the main tools a data analyst uses. When you talk to data analysts and data scientists though, they’ll usually tell you that their primary tool when working on Hadoop and big data sources is in fact “R”, the open-source statistical and modelling language that grew out of the S language and now has its own rich ecosystem, and which is particularly suited to the data preparation, data analysis and data correlation tasks you’ll often do on a big data project.

You can see examples where R has been used in recent copies of Oracle’s OBIEE SampleApp, where R is used to predict flight delays, with the results then rendered through Oracle R Enterprise, part of the Oracle Database Enterprise Edition Advanced Analytics Option, which allows you to run R scripts in the database and expose the results through database functions.

NewImage

Oracle actually distribute two R offerings as part of a wider set of Oracle R technology products – Oracle R, their own distribution of open-source R that you can download via the Oracle Public Yum repository, and Oracle R Enterprise (ORE), an add-in to the database that provides efficient connectivity between R and Oracle and also allows you to run R scripts directly on the database server. ORE is actually surprisingly good, with its main benefit being that you can perform R analysis directly against data in the database, avoiding the need to dump data to a file and giving you the scalability of the Oracle Database to run your R models, rather than being constrained by the amount of RAM in your laptop. In the screenshot below, you can see part of an Oracle R Enterprise script we’ve written that analyses data from the flight delays dataset:

NewImage

with the results then output in the R console:

NewImage

In many cases though, clients won’t want you to run R analysis on their main databases because of the load it’ll put on them, so what do you do when you need to analyse large datasets? A common option is to run your R queries on Hadoop, giving you the flexibility and power of the R language whilst taking advantage of the horizontal scalability of Hadoop, HDFS and MapReduce. There are quite a few options for doing this – the open-source RHIPE and the R package “parallel” both provide R-on-Hadoop capabilities – but Oracle also have a product in this area, “Oracle R Advanced Analytics for Hadoop” (ORAAH), previously known as “Oracle R Connector for Hadoop”, which according to the docs is particularly well-designed for parallel reads and writes, has resource management and database connectivity features, and comes as part of Oracle Big Data Appliance, Oracle Big Data Connectors and the recently released BigDataLite VM. The payoff is that by using ORAAH you can scale R up to work at Hadoop scale, giving you an alternative to the more set-based Hive and Pig languages when working with very large datasets.

NewImage

Oracle R Advanced Analytics for Hadoop is already set up and configured on the BigDataLite VM, and you can try it out by opening a command-line session and running the “R” executable:

NewImage

Type in library(ORCH) to load the ORAAH R libraries (ORCH being the old name of the product), and you should see something like this:

NewImage

At that point you can run some of the demo programs to see examples of what ORAAH can do: for example, read and write data to HDFS using the “hdfs_cpmv.basic” demo, or use Hive as a source for R data frames with the “hive_analysis” demo, like this:

NewImage
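For reference, assuming library(ORCH) has already been loaded, those two demos can be launched from the R console like this:

demo("hdfs_cpmv.basic", package = "ORCH")   # copy and move files in and out of HDFS
demo("hive_analysis", package = "ORCH")     # use Hive tables as a source for R data frames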

The most important feature, though, is being able to run R jobs on Hadoop, with ORAAH converting the R scripts you write into Hadoop jobs via the Hadoop Streaming utility, which gives MapReduce the ability to use any executable or script as the mapper or reducer. The “mapred_basic” R demo that ships with ORAAH shows a basic example of this working, and you can see the MapReduce jobs being kicked off in the R console, and in the Hadoop Jobtracker web UI:

NewImage

But what if you want to use ORAAH on your own Hadoop cluster? Let’s walk through a typical setup process using a three-node Cloudera CDH 4.5 cluster running on a VMWare ESXi server I’ve got at home, where the three nodes are configured like this:

  • cdh45d-node1.rittmandev.com : 8GB RAM VM running OEL6.4 64-bit and with a full desktop available
  • cdh45d-node2.rittmandev.com : 2GB RAM VM running Centos 6.2 64-bit, and with a minimal Linux install with no desktop etc.
  • cdh45d-node3.rittmandev.com, as per cdh45d-node2

The idea here is that “cdh45d-node1” is where the data analyst can run their R session – or of course they can just SSH into it. I’ve also got Cloudera Manager installed on there using the free “standard” edition of CDH4.5, with the cluster spanning the other two nodes like this:

NewImage

If you’re going to use R as a regular OS user (for example, “oracle”) then you’ll also need to use Hue to create a home directory for that user in HDFS, as ORAAH will automatically set that directory as the user’s working directory when they load up the ORAAH libraries.

Step 1: Configuring the Hadoop Node running R Client Software

On the first node where I’m using OEL6.4, Oracle R 2.15.3 is already installed by Oracle, but we want R 3.0.1 for ORAAH 2.3.1, so you can install it using Oracle’s Public Yum repository like this:

sudo yum install R-3.0.1

Once you’ve done that, you need to download and install the R packages in the ORAAH 2.3.1 zip file, which you can get hold of from the Big Data Connectors download page on OTN. Assuming you’ve downloaded and unzipped them to /root/ORCH2.3.1, install the R packages within the zip file as root, in the correct order, like this:

cd /root/ORCH2.3.1/
R --vanilla CMD INSTALL OREbase_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
R --vanilla CMD INSTALL OREstats_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
R --vanilla CMD INSTALL OREmodels_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
R --vanilla CMD INSTALL OREserver_1.4_R_x86_64-unknown-linux-gnu.tar.gz
R --vanilla CMD INSTALL ORCHcore_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
R --vanilla CMD INSTALL ORCHstats_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
R --vanilla CMD INSTALL ORCH_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz

You’ll also need to download and install the “png” source package from the R CRAN website (http://cran.r-project.org/web/packages/png/index.html), and possibly download and install the libpng-devel libraries from the Oracle Public Yum site before it’ll compile.

sudo yum install libpng-devel
R --vanilla CMD INSTALL png_0.1-7.tar.gz 

At that point, you should be good to go. If you’ve installed CDH4.5 using parcels rather than packages though, you’ll need to set a couple of environment variables in your .bash_profile script to tell ORAAH where to find the Hadoop scripts (parcels install Hadoop to /opt/cloudera, whereas packages install to /usr/lib, which is where ORAAH looks for them normally):

# .bash_profile
 
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
        . ~/.bashrc
fi
 
# User specific environment and startup programs
 
PATH=$PATH:$HOME/bin
 
export PATH
export ORCH_HADOOP_HOME=/opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30
export HADOOP_STREAMING_JAR=/opt/cloudera/parcels/CDH-4.5.0-1.cdh4.5.0.p0.30/lib/hadoop-0.20-mapreduce/contrib/streaming

You should then be able to log into the R console from this particular machine and load up the ORCH libraries:

[oracle@cdh45d-node1 ~]$ . ./.bash_profile
[oracle@cdh45d-node1 ~]$ R
 
Oracle Distribution of R version 3.0.1  (--) -- "Good Sport"
Copyright (C)  The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
  Natural language support but running in an English locale
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
You are using Oracle's distribution of R. Please contact
Oracle Support for any problems you encounter with this
distribution.
 
> library(ORCH)
Loading required package: OREbase
 
Attaching package: ‘OREbase’
 
The following objects are masked from ‘package:base’:
 
    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table
 
Loading required package: OREstats
Loading required package: MASS
Loading required package: ORCHcore
Oracle R Connector for Hadoop 2.3.1 (rev. 288)
Info: using native C base64 encoding implementation
Info: Hadoop distribution is Cloudera's CDH v4.5.0
Info: using ORCH HAL v4.1
Info: HDFS workdir is set to "/user/oracle"
Info: mapReduce is functional
Info: HDFS is functional
Info: Hadoop 2.0.0-cdh4.5.0 is up
Info: Sqoop 1.4.3-cdh4.5.0 is up
Warning: OLH is not found
Loading required package: ORCHstats
> 

Step 2: Configuring the other Nodes in the Hadoop Cluster

Once you’ve set up your main Hadoop node with the R client software, you’ll also need to install R 3.0.1 onto all of the other nodes in your Hadoop cluster, along with a subset of the ORCH library files. If your other nodes are also running OEL you can just repeat the “yum install R-3.0.1” step from before, but in my case I’m running a very stripped-down Centos 6.2 install on my other nodes, so I need to install wget first, then grab the Oracle Public Yum repo file along with a GPG key file that it’ll require before allowing you to download from that server:

[root@cdh45d-node2 yum.repos.d]# wget http://public-yum.oracle.com/public-yum-ol6.repo
 
--2014-03-10 07:50:25--  http://public-yum.oracle.com/public-yum-ol6.repo
Resolving public-yum.oracle.com... 109.144.113.166, 109.144.113.190
Connecting to public-yum.oracle.com|109.144.113.166|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4233 (4.1K)
Saving to: `public-yum-ol6.repo'
 
100%[=============================================>] 4,233       --.-K/s   in 0s      
 
2014-03-10 07:50:26 (173 MB/s) - `public-yum-ol6.repo' saved [4233/4233]
 
[root@cdh45d-node2 yum.repos.d]# wget http://public-yum.oracle.com/RPM-GPG-KEY-oracle-ol6 -O /etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
 
--2014-03-10 07:50:26--  http://public-yum.oracle.com/RPM-GPG-KEY-oracle-ol6
Resolving public-yum.oracle.com... 109.144.113.190, 109.144.113.166
Connecting to public-yum.oracle.com|109.144.113.190|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1011
Saving to: `/etc/pki/rpm-gpg/RPM-GPG-KEY-oracle'
 
100%[=============================================>] 1,011       --.-K/s   in 0s      
 
2014-03-10 07:50:26 (2.15 MB/s) - `/etc/pki/rpm-gpg/RPM-GPG-KEY-oracle' saved [1011/1011]

You’ll then need to edit the public-yum-ol6.repo file to enable the correct repository for your VMs (RHEL/Centos/OEL 6.4 in my case), and also enable the add-ons repository; the file contents below show the repo file after I made the required changes.

[ol6_latest]
name=Oracle Linux $releasever Latest ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/latest/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
 
[ol6_addons]
name=Oracle Linux $releasever Add ons ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/addons/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1
 
[ol6_ga_base]
name=Oracle Linux $releasever GA installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/0/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u1_base]
name=Oracle Linux $releasever Update 1 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/1/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u2_base]
name=Oracle Linux $releasever Update 2 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/2/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u3_base]
name=Oracle Linux $releasever Update 3 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/3/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=0
 
[ol6_u4_base]
name=Oracle Linux $releasever Update 4 installation media copy ($basearch)
baseurl=http://public-yum.oracle.com/repo/OracleLinux/OL6/4/base/$basearch/
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-oracle
gpgcheck=1
enabled=1

Once you’ve saved the file, you can then run the install of R-3.0.1 as before, copy the ORCH files across to the server and then install just the OREbase, OREstats, OREmodels and OREserver packages, like this:

[root@cdh45d-node2 yum.repos.d]# yum install R-3.0.1
ol6_UEK_latest                                                  | 1.2 kB     00:00     
ol6_UEK_latest/primary                                          |  13 MB     00:03     
ol6_UEK_latest                                                                 281/281
ol6_addons                                                      | 1.2 kB     00:00     
ol6_addons/primary                                              |  42 kB     00:00     
ol6_addons                                                                     169/169
ol6_latest                                                      | 1.4 kB     00:00     
ol6_latest/primary                                              |  36 MB     00:10     
ol6_latest                                                                 24906/24906
ol6_u4_base                                                     | 1.4 kB     00:00     
ol6_u4_base/primary                                             | 2.7 MB     00:00     
ol6_u4_base                                                                  8396/8396
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package R.x86_64 0:3.0.1-2.el6 will be installed
 
[root@cdh45d-node2 ~]# cd ORCH2.3.1\ 2/
[root@cdh45d-node2 ORCH2.3.1 2]# ls
ORCH_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
ORCHcore_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
ORCHstats_2.3.1_R_x86_64-unknown-linux-gnu.tar.gz
OREbase_1.4_R_x86_64-unknown-linux-gnu.tar.gz
OREmodels_1.4_R_x86_64-unknown-linux-gnu.tar.gz
OREserver_1.4_R_x86_64-unknown-linux-gnu.tar.gz
OREstats_1.4_R_x86_64-unknown-linux-gnu.tar.gz
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREbase_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREbase’ ...
* DONE (OREbase)
Making 'packages.html' ... done
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREstats_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREstats’ ...
* DONE (OREstats)
Making 'packages.html' ... done
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREmodels_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREmodels’ ...
* DONE (OREmodels)
Making 'packages.html' ... done
[root@cdh45d-node2 ORCH2.3.1 2]# R --vanilla CMD INSTALL OREserver_1.4_R_x86_64-unknown-linux-gnu.tar.gz 
* installing to library ‘/usr/lib64/R/library’
* installing *binary* package ‘OREserver’ ...
* DONE (OREserver)
Making 'packages.html' ... done

Now, you can open a terminal window on the main node with the R client software, or SSH into the server, and run one of the ORAAH demos where the R job gets run on the Hadoop cluster.

> demo("mapred_basic","ORCH")
 
 
demo(mapred_basic)
---- ~~~~~~~~~~~~
 
> #
> #     ORACLE R CONNECTOR FOR HADOOP DEMOS
> #
> #     Name: mapred_basic
> #     Description: Demonstrates running a mapper and a reducer containing 
> #                  R script in ORCH.
> #
> #
> #
> 
> 
> ##
> # A simple example of how to operate with key-values. Input dataset - cars.
> # Filter cars with with "dist" > 30 in mapper and get mean "dist" for each 
> # "speed" in reducer.
> ##
> 
> ## Set page width
> options(width = 80)
 
> # Put the cars dataset into HDFS
> cars.dfs <- hdfs.put(cars, key='speed')
 
> # Submit the hadoop job with mapper and reducer R scripts
> x <- try(hadoop.run(
+     cars.dfs,
+     mapper = function(key, val) {
+             orch.keyvals(key[val$dist > 30], val[val$dist > 30,])
+     },
+     reducer = function(key, vals) {
+         X <- sum(vals$dist)/nrow(vals)
+         orch.keyval(key, X)
+     },
+     config = new("mapred.config",
+         map.tasks = 1,
+         reduce.tasks = 1
+     )
+ ), silent = TRUE)
 
> # In case of errors, cleanup and return
> if (inherits(x,"try-error")) {
+  hdfs.rm(cars.dfs)
+  stop("execution error")
+ }
 
> # Print the results of the mapreduce job
> print(hdfs.get(x))
   val1     val2
1    10 34.00000
2    13 38.00000
3    14 58.66667
4    15 54.00000
5    16 36.00000
6    17 40.66667
7    18 64.50000
8    19 50.00000
9    20 50.40000
10   22 66.00000
11   23 54.00000
12   24 93.75000
13   25 85.00000
 
> # Remove the HDFS files created above
> hdfs.rm(cars.dfs)
[1] TRUE
 
> hdfs.rm(x)
[1] TRUE

with the job tracker web UI on Hadoop confirming the R script ran on the Hadoop cluster.

NewImage

So that’s a basic tech intro and tips on the install. Documentation for ORAAH and the rest of the Big Data Connectors is available for reading on OTN (including a list of all the R commands you get as part of the ORAAH/ORCH package). Keep an eye on the blog though for more on R, ORE and ORAAH as I try to share some examples from datasets we’ve worked on.

Rittman Mead BI Forum 2014 Now Open for Registration!

I’m very pleased to announce that the Rittman Mead BI Forum 2014 running in Brighton and Atlanta, May 2014, is now open for registration. Keeping the format as before – a single stream at each event, world-class speakers and expert-level presentations, and a strictly-limited number of attendees – this is the premier Oracle BI tech conference for developers looking for something beyond marketing and beginner-level content.

This year we have a fantastic line-up of speakers and sessions, including:

  • Oracle ACE and past BI Forum best speaker winner Kevin McGinley, on adding third-party visualisations to OBIEE
  • Tony Heljula, winner of multiple best speaker awards and this year presenting on Exalytics and TimesTen Columnar Storage
  • Sessions from TimesTen PMs Chris Jenkins and Susan Cheung on what’s coming with TimesTen
  • Edward Roske, author of multiple books on Essbase, on Essbase optimisation
  • Oracle’s Andrew Bond, and our own Stewart Bryson (Oracle ACE) with an update to Oracle’s reference BI, DW and Big Data Architecture
  • Sessions from Oracle’s Jack Berkowitz, Adam Bloom and Matt Bedin on what’s coming with OBIEE and Oracle BI Applications
  • Endeca sessions from Chris Lynskey (PM) and Omri Traub (Development Manager), along with ones from Branchbird’s Patrick Rafferty and Truls Bergersen
  • And sessions from Rittman Mead’s Robin Moffatt (OBIEE performance), Gianni Ceresa (Essbase) and Michael Rainey (ODI, with Nick Hurt from IFPI)
NewImage

We’ve also got some excellent keynote sessions including one in the US from Maria Colgan on the new in-memory database option, and another in Brighton from Matt Bedin and Adam Bloom on BI in the Cloud – along with the opening-night Oracle product development keynote in both Brighton and Atlanta.

We’re also very excited to welcome Lars George from Cloudera to deliver this year’s optional one-day masterclass, which is on Hadoop, big data, and how Oracle BI&DW developers can get started with this technology. Lars is Cloudera’s Chief Architect in EMEA and an HBase committer, and he’ll be covering topics such as:

  • What is Hadoop, what’s in the Hadoop ecosystem and how do you design a Hadoop cluster
  • Using tools such as Flume and Sqoop to import data into Hadoop, and then analyse it using Hive, Pig, Impala and Cloudera Search
  • Introduction to NoSQL and HBase
  • Connecting Hadoop to tools such as OBIEE and ODI using JDBC, ODBC, Impala and Hive

If you’ve been meaning to take a look at Hadoop, or if you’ve made a start but would like a chance to discuss techniques with someone who’s out in the field every week designing and building Hadoop systems, this session is aimed at you – it’s on the Wednesday before each event and you can book at the same time as registering for the main BI Forum days.

NewImage

Attendance is limited to around seventy at each event, and we’re running the Brighton BI Forum back at the Hotel Seattle, whilst the US one is running at the Renaissance Midtown Hotel, Atlanta. We encourage attendees to stay at the hotel as well so as to maximise networking opportunities, and this year you can book US accommodation directly with the hotel so you can collect any Marriott points, corporate discounts etc. As usual, we’ll take good care of you over the two or three days, with meals each night, drinks receptions and lots of opportunities to meet colleagues and friends in the industry.

Full details are on the BI Forum 2014 web page, including links to the registration sites. Book now so you don’t miss out – each year we sell out in advance, so don’t leave it to the last minute if you’re thinking of coming. Hopefully see you all in Brighton and Atlanta in May 2014!