Tag Archives: User Groups & Conferences

One Day to the Brighton Rittman Mead BI Forum 2015 – Here’s the Agenda!

It’s the night before the Brighton Rittman Mead BI Forum 2015, and some delegates are already here ready for the masterclass tomorrow. Everyone else will either be arriving later in the day for the drinks reception, Oracle Keynote and dinner, or getting here early Thursday morning ready for the event proper. Safe travels for everyone coming down to Brighton, the official Twitter hashtag for the event is #biforum2015, and in the meantime here’s the final agenda for this week’s event:

Rittman Mead BI Forum 2015
Hotel Seattle, Brighton, UK 

Wednesday 6th May 2015

  • 10.00 – 10.00 Registration for Masterclass attendees
  • 10.30 – 12.30 Masterclass Part 1
  • 12.30 – 13.30 Lunch
  • 13.30 – 16.30 Masterclass Part 2
  • 18.00 – 19.00 Drinks Reception in Hotel Seattle Bar
  • 19.00 – 20.00 Oracle Keynote – Nick Tuson & Philippe Lions
  • 20.00 – 22.00 Dinner at Hotel Seattle

Thursday 7th May 2015

  • 08.45 – 09.00 Welcome and Opening Comments
  • 09.00 – 09.45 Steve Devine (Independent) – The Art and Science of Creating Effective Data Visualisations
  • 09.45 – 10.30 Chris Royles (Oracle Corporation) – Big Data Discovery
  • 10.30 – 11.00 Coffee
  • 11.00 – 11.45 Christian Screen (Sierra-Cedar) – 10 Tenets for Making Your Oracle BI Applications Project Succeed Like a Boss
  • 11.45 – 12.30 Philippe Lions and Nick Tuson (Oracle Corporation) Looking Ahead to Oracle BI 12c and Visual Analyzer
  • 12.30 – 13.30 Lunch
  • 13.30 – 14.30 Day 1 Debate – “Self-Service BI – The Answer to Users’ Prayers, or the Path to Madness?”
  • 14.30 – 15.15 Emiel van Bockel (CB) Watch and see 12c on Exalytics
  • 15.15 – 15.45 Coffee
  • 15.45 – 16.30 Philippe Lions (Oracle Corporation) – Solid Standing for Analytics in the Cloud
  • 16.30 – 17.15 Manuel Martin Marquez (C.E.R.N.) – Governed Information Discovery: Data-driven decisions for more efficient operations at CERN
  • 18.00 – 18.45 Guest Speaker/Keynote – Reiner Zimmermann (Oracle Corporation) – Hadoop or not Hadoop …. this is the question
  • 19.00 – 20.00 Depart for dinner at restaurant
  • 20.00 – 22.00 Dinner at external venue

Friday 8th May 2015

  • 09.00 – 09.45 Daniel Adams (Rittman Mead) User Experience First: Guided information and attractive dashboard design
  • 09.45 – 10.30 André Lopes (Liberty Global) A Journey into Big Data and Analytics
  • 10.30 – 11.00 Coffee 
  • 11.00 – 11.45 Antony Heljula (Peak Indicators) – Predictive BI – Using the Past to Predict the Future
  • 11.45 – 12.30 Robin Moffatt (Rittman Mead) Data Discovery and Systems Diagnostics with the ELK stack
  • 12.30 – 13.00 Short Lunch
  • 13.00 – 14.00 Data Visualization Bake-off
  • 14.00 – 14.45 Gerd Aiglstorfer (G.A. itbs GmbH) Driving OBIEE Join Semantics on Multi Star Queries as User
  • 14.45 – 15.00 Closing Remarks, and Best Speaker Award

See you all at the Hotel Seattle, Brighton, tomorrow!

So What’s the Real Point of ODI12c for Big Data Generating Pig and Spark Mappings?

Oracle ODI12c for Big Data came out the other week, and my colleague Jérôme Françoisse put together an introductory post on the new features shortly after, covering ODI’s new ability to generate Pig and Spark transformations as well as the traditional Hive ones. How this works is that you can now select Apache Pig, or Apache Spark (through pySpark, the Spark API through Python) as the implementation language for an ODI mapping, and ODI will generate one of those languages instead of HiveQL commands to run the mapping.

NewImage

How this works is that ODI12c 12.1.3.0.1 adds a bunch of new component-style KMs to the standard 12c ones, providing filter, aggregate, file load and other features that generate pySpark and Pig code rather than the usual HiveQL statement parts. Component KMs have also been added for Hive as well, making it possible now to include non-Hive datastores in a mapping and join them all together, something it was hard to do in earlier versions of ODI12c where the Hive IKM expected to do the table data extraction as well.

But when you first look at this you may well be tempted to think “…so what?”, in that Pig compiles down to MapReduce in the end, just like Hive does, and you probably won’t get the benefits of running Spark for just a single batch mapping doing largely set-based transformations. To my mind where this new feature gets interesting is its ability to let you take existing Pig and Spark scripts, which process data in a different, dataflow-type way compared to Hive’s set-based transformations and which also potentially also use Pig and Spark-specific function libraries, and convert them to managed graphical mappings that you can orchestrate and run as part of a wider ODI integration process.

Pig, for example, has the LinkedIn-originated DataFu UDF library that makes it easy to sessionize and further transform log data, and the Piggybank community library that extends Pig’s loading and saving capabilities to additional storage formats, and provides additional basic UDFs for timestamp conversion, log parsing and so forth. We’ve used these libraries in the past to process log files from our blog’s webserver and create classification models to help predict whether a visitor will return, with the Pig script below using the DataFu and Piggybank libraries to perform these tasks easily in Pig.

register /opt/cloudera/parcels/CDH/lib/pig/datafu.jar;
register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar;

DEFINE Sessionize datafu.pig.sessions.Sessionize('60m');
DEFINE Median datafu.pig.stats.StreamingMedian();
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.9','0.95');
DEFINE VAR datafu.pig.VAR();
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();

--------------------------------------------------------------------------------
-- Import and clean logs
raw_logs = LOAD '/user/flume/rm_logs/apache_access_combined' USING TextLoader AS (line:chararray);

-- Extract individual fields
logs_base = FOREACH raw_logs
GENERATE FLATTEN
(REGEX_EXTRACT_ALL(line,'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"')) AS
(remoteAddr: chararray, remoteLogName: chararray, user: chararray, time: chararray, request: chararray, status: chararray, bytes_string: chararray, referrer:chararray, browser: chararray);

-- Remove Bots and convert timestamp
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp|Bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*');

-- Remove uselesss columns and convert timestamp
clean_logs = FOREACH logs_base_nobots GENERATE CustomFormatToISO(time,'dd/MMM/yyyy:HH:mm:ss Z') as time, remoteAddr, request, status, bytes_string, referrer, browser;

--------------------------------------------------------------------------------
-- Sessionize the data

clean_logs_sessionized = FOREACH (GROUP clean_logs BY remoteAddr) {
ordered = ORDER clean_logs BY time;
GENERATE FLATTEN(Sessionize(ordered))
AS (time, remoteAddr, request, status, bytes_string, referrer, browser, sessionId);
};

-- The following steps will generate a tsv file in your home directory to download and work with in R
store clean_logs_sessionized into '/user/jmeyer/clean_logs' using PigStorage('t','-schema');

If you know Pig (or read my previous articles on this theme), you’ll know that pig has the concept of an “alias”, a dataset you define using filters, aggregations, projections and other operations against other aliases, with a typical pig script starting with a large data extract and then progressively whittling it down to just the subset of data, and derived data, you’re interested in. When it comes to script execution, Pig only materializes these aliases when you tell it to store the results in permanent storage (file, Hive table etc) with the intermediate steps just being instructions on how to progressively arrive at the final result. Spark works in a similar way with its RDDs, transformations and operations which either create a new dataset based off of an existing one, or materialise the results in permanent storage when you run an “action”. So let’s see if ODI12c for Big Data can create a similar dataflow, based as much as possible on the script I’ve used above.

… and in-fact it can. The screenshot below shows the logical mapping to implement this same Pig dataflow, with the data coming into the mapping as a Hive table, an expression operator creating the equivalent of a Pig alias based off of a filtered, transformed version of the original source data using the Piggybank CustomFormatToISO UDF, and then runs the results of that through an ODI table function that in the background transforms the data using Pig’s GENERATE FLATTEN command and a call to the DataFu Sessionize UDF.

NewImage

And this is the physical mapping to go with the logical mapping. Note that all of the Pig transformations are contained within a separate execution unit, that contains operators for the expression to transform and filter the initial dataset, and another for the table function.

NewImage

The table function operator runs the input fields through an arbitrary Pig Latin script, in this case defining another alias to match the table function operator name and using the DataFu Sessionize UDF within a FOREACH to first sort, and then GENERATE FLATTEN the same columns but with a session ID for user sessions with the same IP address and within 60 seconds of each other.

NewImage

If you’re interested in the detail of how this works and other usages of the new ODI12c for Big Data KMs, then come along to the masterclass I’m running with Jordan Meyer at the Brighton and Atlanta Rittman Mead BI Forums where I’ll go into the full details as part of a live end-to-end demo. Looking at the Pig Latin that comes out of it though, you can see it more or less matches the flow of the hand-written script and implements all of the key steps.

NewImage

Finally, checking the output of the mapping I can see that the log entries have been sessionized and they’re ready to pass on to the next part of the classification model.

NewImage

So that to my mind is where the value is in ODI generating Pig and Spark mappings. It’s not so much taking an existing Hive set-based mapping and just running it using a different language, it’s more about being able to implement graphically the sorts of data flows you can create with Pig and Spark, and being able to get access to the rich UDF and data access libraries that these two languages benefit from. As I said, come along to the masterclass Jordan and I are running, and I’ll go into much more detail and show how the mapping is set up, along with other mappings to create an end-to-end Hadoop data integration process.

Setting up Security and Access Control on a Big Data Appliance

Like all Oracle Engineered Systems, Oracle’s field servicing and Advanced Customer Services (ACS) teams go on-site once a BDA has been sold to a customer and do the racking, installation and initial setup. They will usually ask the customer a set of questions such as “do you want to enable Kerberos authentication”, “what’s the range of IP addresses you want to use for each of the network interfaces”, “what password do you want to use” and so on. It’s usually enough to get a customer going, but in-practice we’ve found most customers need a number of other things set-up and configured before they use the BDA in development and production; for example:

  • Integrating Cloudera Manager, Hue and other tools with the corporate LDAP directory
  • Setting up HDFS and SSH access for the development and production support team, so they can log in with their usual corporate credentials
  • Come up with a directory layout and file placement strategy for loading data into the BDA, and then moving it around as data gets processed
  • Configuring some sort of access control to the Hive tables (and sometimes HDFS directories) that users use to get access to the Hadoop data
  • Devising a backup and recovery strategy, and thinking about DR (disaster recovery)
  • Linking the BDA to other tools and products in the Oracle Big Data and Engineered Systems family; Exalytics, for example, or setting up ODI and OBIEE to access data in the BDA

The first task we’re usually asked to do is integrate Cloudera Manager, the web-based admin console for the Hadoop parts of the BDA, with the corporate LDAP server. By doing this we can enable users to log into Cloudera Manager with their usual corporate login (and restrict access to just certain LDAP groups, and further segregate users into admin ones and stop/start/restart services-type ones), and similarly allow users to log into Hue using their regular LDAP credentials. In my experience Cloudera Manager is easier to set up than Hue, but let’s look at a high-level at what’s involved.

LDAP Integration for Hue, Cloudera Manager, Hive etc

In our Rittman Mead development lab, we have OpenLDAP running on a dedicated appliance VM and a number of our team setup as LDAP users. We’ve defined four LDAP groups, two for Cloudera Manager and two for Hue, with varying degrees of access for each product.

NewImage

Setting up Cloudera Manager is pretty straightforward, using the Administration > Settings menu in the Cloudera Manager web UI (note this option is only available for the paid, Cloudera Enterprise version, not the free Cloudera Express version). Hue security integration is configured through the Hue service menu, and again you can configure the LDAP search credentials, any LDAPS or certificate setup, and then within Hue itself you can define groups to determine what Hue features each set of users can use.

NewImage

Where Hue is a bit more fiddly (last time I looked) is in controlling access to the tool itself; Cloudera Manager lets you explicitly define which LDAP groups can access the tool with other users then locked-out, but Hue either allows all authenticated LDAP users to login to the tool or makes you manually import each authorised user to grant them access (you can then either have Hue check-back to the LDAP server for their password each login, or make a copy of the password and store it within Hue for later use, potentially getting out-of-sync with their LDAP directory password version). In practice what I do is use the manual authorisation method but then have Hue link back to the LDAP server to check the users’ password, and then map their LDAP groups into Hue groups for further role-based access control. There’s a similar process for Hive and Impala too, where you can configure the services to authenticate against LDAP, and also have Hive use user impersonation so their LDAP username is passed-through the ODBC or JDBC connection and queries run as that particular user.

Configuring SSH and HDFS Access and Setting-up Kerberos Authentication

Most developers working with Hadoop and the BDA will either SSH (Secure Shell) into the cluster and work directly on one of the nodes, or connect into their workstation which has been configured as a Hadoop client for the BDA. If they SSH in directly to the cluster they’ll need Linux user accounts there, and if they go in via their workstation the Hadoop client installed there will grant them access as the user they’re logged-into the workstation as. On the BDA you can either set-up user accounts on each BDA node separately, or more likely configure user authentication to connect to the corporate LDAP and check credentials there.

NewImage

One thing you should definitely do, either when your BDA is initially setup by Oracle or later on post-install, is configure your Hadoop cluster as a secure cluster using Kerberos authentication. Hadoop normally trusts that each user accessing Hadoop services via the Hadoop Filesystem API (FS API) is who they say they are, but using the example above I could easily setup an “oracle” user on my workstation and then access all Hadoop services on the main cluster without the Hadoop FS API actually checking that I am who I say I am – in other words the Hadoop FS API shell doesn’t check your password, it merely runs a “whoami” Linux command to determine my username and grants me access as them.

NewImage

The way to address this is to configure the cluster for Kerberos authentication, so that users have to have a valid Kerberos ticket before accessing any secured services (Hive, HDFS etc) on the cluster. I covered this as part of an article on configuring OBIEE11g to connect to Kerberos-secured Hadoop clusters last Christmas and you can either do it as part of the BDA install, or later on using a wizard in more recent versions of CDH5, the Cloudera Hadoop distribution that the BDA uses.

NewImage

The complication with Kerberos authentication is that your organization needs to have a Kerberos KDC (Key Distribution Center) server setup already, which will then link to your corporate LDAP or Active Directory service to check user credentials when they request a Kerberos ticket. The BDA installation routine gives you the option of creating a KDC as part of the BDA setup, but that’s only really useful for securing inter-cluster connections between services as it won’t be checking back to your corporate directory. Ideally you’d set up a connection to an existing, well-tested and well-understood Kerberos KDC server and secure things that way – but beware that not all Oracle and other tools that run on the BDA are setup for Kerberos authentication – OBIEE and ODI are, for example, but the current 1.0 version of Big Data Discovery doesn’t yet support Kerberos-secured clusters.

Coming-up with the HDFS Directory Layout

It’s tempting with Hadoop to just have a free-for-all with the Hadoop HDFS filesystem setup, maybe restricting users to their own home directory but otherwise letting them put files anywhere. HDFS file data for Hive tables typically goes in Hive’s own filesystem area /user/hive/warehouse, but users can of course create Hive tables over external data files stored in their own part of the filesystem.

What we tend to do (inspired by Gwen Shapira’a “Scaling ETL with Hadoop” presentation) is create separate areas for incoming data, ETL processing data and process output data, with developers then told to put shared datasets in these directories rather than their own. I generally create additional Linux users for each of these directories so that these can own the HDFS files and directories rather than individual users, and then I can control access to these directories using HDFS’s POSIX permissions. A typical user setup script might look like this:

[oracle@bigdatalite ~]$ cat create_mclass_users.sh 
sudo groupadd bigdatarm
sudo groupadd rm_website_analysis_grp
useradd mrittman -g bigdatarm
useradd ryeardley -g bigdatarm
useradd mpatel -g bigdatarm
useradd bsteingrimsson -g bigdatarm
useradd spoitnis -g bigdatarm
useradd rm_website_analysis -g rm_website_analysis_grp
echo mrittman:welcome1 | chpasswd
echo ryeardley:welcome1 | chpasswd
echo mpatel:welcome1 | chpasswd
echo bsteingrimsson:welcome1 | chpasswd
echo spoitnis:welcome1 | chpasswd
echo rm_website_analysis:welcome1 | chpasswd

whilst a script to setup the directories for these users, and the application user, might look like this:

[oracle@bigdatalite ~]$ cat create_hdfs_directories.sh 
set echo on
#setup individual user HDFS directories, and scratchpad areas
sudo -u hdfs hadoop fs -mkdir /user/mrittman
sudo -u hdfs hadoop fs -mkdir /user/mrittman/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/ryeardley
sudo -u hdfs hadoop fs -mkdir /user/ryeardley/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/mpatel
sudo -u hdfs hadoop fs -mkdir /user/mpatel/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/bsteingrimsson
sudo -u hdfs hadoop fs -mkdir /user/bsteingrimsson/scratchpad
sudo -u hdfs hadoop fs -mkdir /user/spoitnis
sudo -u hdfs hadoop fs -mkdir /user/spoitnis/scratchpad
 
#setup etl directories
sudo -u hdfs hadoop fs -mkdir -p /data/rm_website_analysis/logfiles/incoming
sudo -u hdfs hadoop fs -mkdir /data/rm_website_analysis/logfiles/archive/
sudo -u hdfs hadoop fs -mkdir -p /data/rm_website_analysis/tweets/incoming
sudo -u hdfs hadoop fs -mkdir /data/rm_website_analysis/tweets/archive
 
#change ownership of user directories
sudo -u hdfs hadoop fs -chown -R mrittman /user/mrittman
sudo -u hdfs hadoop fs -chown -R ryeardley /user/ryeardley
sudo -u hdfs hadoop fs -chown -R mpatel /user/mpatel
sudo -u hdfs hadoop fs -chown -R bsteingrimsson /user/bsteingrimsson
sudo -u hdfs hadoop fs -chown -R spoitnis /user/spoitnis
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/mrittman
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/ryeardley
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/mpatel
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/bsteingrimsson
sudo -u hdfs hadoop fs -chgrp -R bigdatarm /user/spoitnis
 
#change ownership of shared directories
sudo -u hdfs hadoop fs -chown -R rm_website_analysis /data/rm_website_analysis
sudo -u hdfs hadoop fs -chgrp -R rm_website_analysis_grp /data/rm_website_analysis

Giving you a directory structure like this (with the directories for Hive, Impala, HBase etc removed for clarity)

NewImage

In terms of Hive and Impala data, there’s varying opinions on whether to create tables as EXTERNAL and store the data (including sub-directories for table partitions) in the /data/ HDFS area or let Hive store them in its own /user/hive/warehouse area – I tend to let Hive store them within its area as I use Apache Sentry to then control access to those Tables’s data.

Setting up Access Control for HDFS, Hive and Impala Data

At its simplest level, access control can be setup on the HDFS directory structure by using HDFS’s POSIX security model:

  • Each HDFS file or directory has an owner, and a group
  • You can add individual Linux users to a group, but an HDFS object can only have one group owning it

What this means in-practice though is you have to jump through quite a few hoops to set up finer-grained access control to these HDFS objects. What we tend to do is set RW access to the /data/ directory and subdirectories to the application user account (rm_website_analysis in this case), and RO access to that user’s associated group (rm_website_analysis_grp). If users then want access to that application’s data we add them to the relevant application group, and a user can belong to more than one group, making it possible to grant access to more than one application data area

[oracle@bigdatalite ~]$ cat ./set_hdfs_directory_permissions.sh 
sudo -u hdfs hadoop fs -chmod -R 750 /data/rm_website_analysis
usermod -G rm_website_analysis_grp mrittman

making it possible for the main application owner to write data to the directory, but group members only have read access. What you can also now do with more recent versions of Hadoop (CDH5.3 onwards, for example) is define access control lists to go with individual HDFS objects, but this feature isn’t enabled by default as it consumes more namenode memory than the traditional POSIX approach. What I prefer to do though is control access by restricting users to only accessing Hive and Impala tables, and using Apache Sentry, or Oracle Big Data SQL, to provide role-based access control over them.

Apache Sentry is a project originally started by Cloudera and then adopted by the Apache Foundation as an incubating project. It aims to provide four main authorisation features over Hive, Impala (and more recently, the underlying HDFS directories and datafiles):

  • Secure authorisation, with LDAP integration and Kerberos prerequisites for Sentry enablement
  • Fine-grained authorisation down to the column-level, with this feature provided by granting access to views containing subsets of columns at this point
  • Role-based authorisation, with different Sentry roles having different permissions on individual Hive and Impala tables
  • Multi-tenant administration, with a central point of administration for Sentry permissions

From this Cloudera presentation on Sentry on Slideshare, Sentry inserts itself into the query execution process and checks access rights before allowing the rest of the Hive query to execute. Sentry is configured through security policy files, or through a new web-based interface introduced with recent versions of CDH5, for example.

NewImage

The other option for customers using Oracle Exadata,Oracle Big Data Appliance and Oracle Big Data SQL is to use the Oracle Database’s access control mechanisms to govern access to Hive (and Oracle) data, and also set-up fine-grained access control (VPD), data masking and redaction to create a more “enterprise” access control system.

NewImage

So these are typically tasks we perform when on-boarding an Oracle BDA for a customer. If this is of interest to you and you can make it to either Brighton, UK next week or Atlanta, GA the week after, I’ll be covering this topic at the Rittman Mead BI Forum 2015 as part of the one-day masterclass with Jordan Meyer on the Wednesday of each week, along with topics such as creating ETL data flows using Oracle Data Integrator for Big Data, using Oracle Big Data Discovery for faceted search and cataloging of the data reservoir, and reporting on Hadoop and NoSQL data using Oracle Business Intelligence 11g. Spaces are still available so register now if you’d like to hear more on this topic.

Last Chance to Register for the Brighton Rittman Mead BI Forum 2015!

It’s just a week to go until the start of the Brighton Rittman Mead BI Forum 2015, with the optional one-day masterclass starting on Wednesday, May 6th at 10am and the event opening with a reception and Oracle keynote later in the evening. Spaces are still available if you want to book now, but we can’t guarantee places past this Friday so register now if you’re planning to attend.

NewImage

As a reminder, here’s some earlier blog posts and articles about events going on at the Brighton event, and at the Atlanta event the week after:

We’re also running our first “Data Visualisation Challenge” at both events, where we’re asking attendees to create their most impressive and innovative data visualisation within OBIEE using the Donors Choose dataset, with the rule being that you can use any OBIEE or related technology as long as the visualisation runs with OBIEE and can respond to dashboard prompt controls. We’re also opening it up to OBIEE running as part of Oracle BI Cloud Service (BICS), so if you want to give Visual Analyser a spin within BICS we’d be interested in seeing the results.

Registration is still open for the Atlanta BI Forum event too, running the week after Brighton on the 13th-15th May 2015 at the Renaissance Atlanta Midtown hotel. Full details of both events are on the event homepage, with the registration links for Brighton and Atlanta given below.

  • Rittman Mead BI Forum 2015, Brighton –  May 6th – 8th 2015 
We look forward to seeing you all in Brighton next week, or Atlanta the week after – but remember to book soon, before we close registration!

BI Forum 2015 Preview — OBIEE Regression Testing, and Data Discovery with the ELK stack

I’m pleased to be presenting at both of the Rittman Mead BI Forums this year; in Brighton it’ll be my fourth time, whilst Atlanta will be my first, and my first trip to the city too. I’ve heard great things about the food, and I’m sure the forum content is going to be awesome too (Ed: get your priorities right).

OBIEE Regression Testing

In Atlanta I’ll be talking about Smarter Regression testing for OBIEE. The topic of Regression Testing in OBIEE is one that is – at last – starting to gain some real momentum. One of the drivers of this is the recognition in the industry that a more Agile approach to delivering BI projects is important, and to do this you need to have a good way of rapidly testing changes made. The other driver that I see is OBIEE 12c and the Baseline Validation Tool that Oracle announced at Oracle OpenWorld last year. Understanding how OBIEE works, and therefore how changes made can be tested most effectively, is key to a successful and efficient testing process.

In this presentation I’ll be diving into the OBIEE stack and explaining where it can be tested and how. I’ll discuss the common approaches and the relative strengths of each.

If you’ve not registered for the Atlanta BI Forum then do so now as places are limited and selling out fast. It runs May 14–15 with an optional masterclass on Wednesday 13th May from Mark Rittman and Jordan Meyer.

Data Discovery with the ELK Stack

My second presentation is at the Brighton forum the week before Atlanta, and I’ll be talking about Data Discovery and Systems Diagnostics with the ELK stack. The ELK stack is a set of tools from a company called Elastic, comprising Elasticsearch, Logstash and Kibana (E – L – K!). Data Discovery is a crucial part of the life cycle of acquiring, understanding, and exploiting data (one could even say, leverage the data). Before you can operationalise your reporting, you need to understand what data you have, how it relates, and what insights it can give you. This idea of a “Discovery Lab” is one of the key components of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:

ELK gives you great flexibility to ingest data with loose data structures and rapidly visualise and analyse it. I wrote about it last year with an example of analysing data from our blog and associated tweets with data originating in Hadoop, and more recently have been analysing twitter activity using it. The great power of Kibana (the “K” of ELK) is the ability to rapidly filter and aggregate data, as well as see a summary of values within a data field:

The second aspect of my presentation is still on data discovery, but “discovering data” within the logfiles of an application stack such as OBIEE. ELK is perfectly suited to in-depth diagnostics against dense volumes of log data that you simply could not handle within simple log viewers or Enterprise Manager, such as the individual HTTP requests and types of value passed within the interactions of a single user session:

By its nature of log streaming and full text search, ELK also lends itself well to near real time system monitoring dashboards reporting the status of systems including OBIEE and ODI, and I’ll be discussing this in more detail during my talk.

The Brighton BI Forum is on 7–8 May, with an optional masterclass on Wednesday 6th May from Mark Rittman and Jordan Meyer. If you’ve not registered for the Brighton BI Forum then do so now as places are very limited!


Don’t forget, we’re running a Data Visualisation Challenge at each of the forums, and if you need to convince your boss to let you go you can find a pre-written ‘justification’ letter here.