Tag Archives: Oracle Data Integrator
Rittman Mead at ODTUG KScope’15, Hollywood Florida
ODTUG KScope’15 takes place in Hollywood, Florida next week, and Rittman Mead are running a number of sessions during the week on OBIEE, Essbase, ODI and Big Data. I’ve personally been attending ODTUG KScope (or “Kaleidoscope”, as it used to be known) for many years now and it’s the best developer-centric conference we go to, coupled with amazing venues and a great community atmosphere.
Sessions we’re running over the week include:
- Gianni Ceresa : 2-in-1: RPD Magic and Hyperion Planning “Adapter”
- Jérôme Françoisse : Manage Your Oracle Data Integrator Development Lifecycle
- Michael Rainey : Practical Tips for Oracle Business Intelligence Applications 11g Implementations
- Michael Rainey : GoldenGate and Oracle Data Integrator: A Perfect Match
- Mark Rittman : Bringing Oracle Big Data SQL to OBIEE and ODI
- Mark Rittman : End-to-End Hadoop Development Using OBIEE, ODI, and Oracle Big Data
- Mark Rittman : Thursday Deep Dive – Business Intelligence: Bringing Oracle Tools to Big Data
- Andy Rocha & Pete Tamisin : OBIEE Can Help You Achieve Your GOOOOOOOOOALS!
We’ll also be taking part in various “Lunch and Learn” sessions and community and ACE/ACE Director events, and you can talk to us about our new OBIEE “User Engagement” initiative and how to get involved as an early adopter. Details and the agenda for KScope’15 can be found on the event website, and if you’re coming, we’ll look forward to seeing you in sunny Hollywood, Florida!
Presentation Slides and Photos from the Rittman Mead BI Forum 2015, Brighton and Atlanta
It’s now the Saturday after the two Rittman Mead BI Forum 2015 events, last week in Atlanta, GA and the week before in Brighton, UK. Both events were a great success and I’d like to say thanks to the speakers, attendees, our friends at Oracle and my colleagues within Rittman Mead for making the two events so much fun. If you’re interested in taking a look at some photos from the two events, I’ve put together two Flickr photosets that you can access using the links below:
- Flickr Photoset from the Brighton Rittman Mead BI Forum 2015
- Flickr Photoset from the Atlanta Rittman Mead BI Forum 2015
We’ve also uploaded the presentation slides from the two events (where we’ve been given permission to share them) to our website; you can download them, including the Delivering the Oracle Information Management and Big Data Reference Architecture masterclass, using the links below:
Delivering the Oracle Information Management & Big Data Reference Architecture (Mark Rittman & Jordan Meyer, Rittman Mead)
- Part 1 : Delivering the Discovery Lab (Jordan Meyer, Head of R&D at Rittman Mead)
- Part 2 : Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture (Mark Rittman, CTO, Rittman Mead)
Brighton, May 7th and 8th 2015
- Steve Devine, Independent : “The Art and Science of Creating Effective Data Visualisations”
- Chris Royles, Oracle Corporation : “Big Data Discovery”
- Christian Screen, Sierra-Cedar : “10 Tenets for Making Your Oracle BI Applications Project Succeed Like a Boss”
- Emiel van Bockel, CB : “Watch and see 12c on Exalytics”
- Daniel Adams, Rittman Mead : “User Experience First: Guided information and attractive dashboard design”
- Robin Moffatt, Rittman Mead : “Data Discovery and Systems Diagnostics with the ELK stack”
- André Lopes / Roberto Manfredini, Liberty Global : “A Journey into Big Data and Analytics”
- Antony Heljula, Peak Indicators : “Predictive BI – Using the Past to Predict the Future”
- Gerd Aiglstorfer, G.A. itbs GmbH : “Driving OBIEE Join Semantics on Multi Star Queries as User”
- Manuel Martin Marquez, CERN – European Laboratory for Particle Physics : “Governed Information Discovery: Data-driven decisions for more efficient operations at CERN”
Atlanta, May 14th and 15th 2015
- Robin Moffatt, Rittman Mead : “Smarter Regression Testing for OBIEE”
- Mark Rittman : “Oracle Big Data Discovery Tips and Techniques from the Field”
- Hasso Schaap, Qualogy : “Developing strategic analytics applications on OBICS PaaS”
- Tim German / Cameron Lackpour, Qubix / CLSolve : “Hybrid Mode – An Essbase Revolution”
- Stewart Bryson, Red Pill Analytics : “Supercharge BI Delivery with Continuous Integration”
- Andy Rocha & Pete Tamisin, Rittman Mead : “OBIEE Can Help You Achieve Your GOOOOOOOOOALS!”
- Christian Screen, Sierra-Cedar : “10 Tenets for Making Your Oracle BI Applications Project Succeed Like a Boss”
- Sumit Sarkar, Progress Software : “Make sense of NoSQL data using OBIEE”
Congratulations also to Emiel van Bockel and Robin Moffatt, who jointly won the Best Speaker award at the Brighton event, and to Andy Rocha and Pete Tamisin, who won Best Speaker in Atlanta for their joint session. It’s time for a well-earned rest now and then back to work, and hopefully we’ll see some of you at KScope’15, Oracle OpenWorld 2015 or the UKOUG Tech and Apps 2015 conferences later in 2015.
So What’s the Real Point of ODI12c for Big Data Generating Pig and Spark Mappings?
Oracle ODI12c for Big Data came out the other week, and my colleague Jérôme Françoisse put together an introductory post on the new features shortly after, covering ODI’s new ability to generate Pig and Spark transformations as well as the traditional Hive ones. In short, you can now select Apache Pig or Apache Spark (through pySpark, Spark’s Python API) as the implementation language for an ODI mapping, and ODI will generate code in that language instead of HiveQL commands to run the mapping.
Under the covers, ODI12c 12.1.3.0.1 adds a set of new component-style KMs to the standard 12c ones, providing filter, aggregate, file load and other features that generate pySpark and Pig code rather than the usual HiveQL statement parts. Component KMs have also been added for Hive, making it possible to include non-Hive datastores in a mapping and join them all together, something that was hard to do in earlier versions of ODI12c, where the Hive IKM expected to do the table data extraction as well.
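To give a feel for the style of code these component KMs emit, here’s a minimal hand-written sketch of the kind of Pig Latin a simple filter-plus-aggregate mapping might translate to; the alias, path and column names are my own illustrations, not ODI’s actual generated output:

-- Source datastore: a tab-separated access log extract on HDFS (illustrative path)
raw_data = LOAD '/user/odi/src/access_log' USING PigStorage('\t') AS (host:chararray, status:chararray, bytes:long);
-- Filter component: keep successful requests only
filtered = FILTER raw_data BY status == '200';
-- Aggregate component: total bytes served per host
grouped = GROUP filtered BY host;
totals = FOREACH grouped GENERATE group AS host, SUM(filtered.bytes) AS total_bytes;
-- Target datastore: materialise the result
STORE totals INTO '/user/odi/tgt/host_totals' USING PigStorage('\t');

Conceptually, each component in the mapping contributes a statement or two along these lines to the overall script, which is what makes the translation from a graphical mapping to a Pig dataflow a fairly natural one.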
But when you first look at this you may well be tempted to think “…so what?” Pig compiles down to MapReduce in the end, just like Hive does, and you probably won’t get the benefits of running Spark for a single batch mapping doing largely set-based transformations. To my mind, where this new feature gets interesting is its ability to let you take existing Pig and Spark scripts, which process data in a dataflow-type way rather than through Hive’s set-based transformations and which potentially also use Pig- and Spark-specific function libraries, and convert them to managed graphical mappings that you can orchestrate and run as part of a wider ODI integration process.
Pig, for example, has the LinkedIn-originated DataFu UDF library that makes it easy to sessionize and further transform log data, and the Piggybank community library that extends Pig’s loading and saving capabilities to additional storage formats and provides additional basic UDFs for timestamp conversion, log parsing and so forth. We’ve used these libraries in the past to process log files from our blog’s webserver and create classification models to help predict whether a visitor will return; the Pig script below uses the DataFu and Piggybank libraries to perform these tasks.
register /opt/cloudera/parcels/CDH/lib/pig/datafu.jar;
register /opt/cloudera/parcels/CDH/lib/pig/piggybank.jar;

DEFINE Sessionize datafu.pig.sessions.Sessionize('60m');
DEFINE Median datafu.pig.stats.StreamingMedian();
DEFINE Quantile datafu.pig.stats.StreamingQuantile('0.9','0.95');
DEFINE VAR datafu.pig.VAR();
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();

--------------------------------------------------------------------------------
-- Import and clean logs
raw_logs = LOAD '/user/flume/rm_logs/apache_access_combined' USING TextLoader AS (line:chararray);

-- Extract individual fields
logs_base = FOREACH raw_logs GENERATE FLATTEN
  (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'))
  AS (remoteAddr: chararray, remoteLogName: chararray, user: chararray, time: chararray,
      request: chararray, status: chararray, bytes_string: chararray, referrer: chararray, browser: chararray);

-- Remove bots
logs_base_nobots = FILTER logs_base BY NOT (browser matches '.*(spider|robot|bot|slurp|Bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*');

-- Remove unneeded columns and convert timestamp to ISO format
clean_logs = FOREACH logs_base_nobots GENERATE CustomFormatToISO(time,'dd/MMM/yyyy:HH:mm:ss Z') AS time,
             remoteAddr, request, status, bytes_string, referrer, browser;

--------------------------------------------------------------------------------
-- Sessionize the data
clean_logs_sessionized = FOREACH (GROUP clean_logs BY remoteAddr) {
    ordered = ORDER clean_logs BY time;
    GENERATE FLATTEN(Sessionize(ordered)) AS (time, remoteAddr, request, status, bytes_string, referrer, browser, sessionId);
};

-- The following step will generate a tsv file in your home directory to download and work with in R
STORE clean_logs_sessionized INTO '/user/jmeyer/clean_logs' USING PigStorage('\t','-schema');
If you know Pig (or read my previous articles on this theme), you’ll know that Pig has the concept of an “alias”, a dataset you define using filters, aggregations, projections and other operations against other aliases; a typical Pig script starts with a large data extract and then progressively whittles it down to just the subset of data, and derived data, you’re interested in. When it comes to script execution, Pig only materialises these aliases when you tell it to store the results in permanent storage (a file, a Hive table and so on), with the intermediate steps just being instructions on how to progressively arrive at the final result. Spark works in a similar way with its RDDs: transformations create a new dataset based on an existing one, and results are only materialised in permanent storage when you run an “action”. So let’s see if ODI12c for Big Data can create a similar dataflow, based as closely as possible on the script I’ve used above.
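First, though, to make that lazy-evaluation point concrete, here’s a trivial hand-written Pig example, with the paths and field names invented for illustration; the first four statements only build up a logical plan, and it’s the final STORE that triggers a MapReduce job:

-- Each alias below just extends Pig's logical plan; nothing executes yet
logs = LOAD '/tmp/example/logs.tsv' USING PigStorage('\t') AS (ip:chararray, url:chararray, bytes:long);
big_requests = FILTER logs BY bytes > 10000L;
by_ip = GROUP big_requests BY ip;
counts = FOREACH by_ip GENERATE group AS ip, COUNT(big_requests) AS hits;
-- Only this STORE causes the plan above to be compiled to MapReduce and run
STORE counts INTO '/tmp/example/hits_by_ip';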
And in fact ODI12c for Big Data can create just such a dataflow. The screenshot below shows the logical mapping that implements this same Pig dataflow: the data comes into the mapping as a Hive table; an expression operator creates the equivalent of a Pig alias based on a filtered, transformed version of the original source data, using the Piggybank CustomFormatToISO UDF; and the results of that are then run through an ODI table function that, in the background, transforms the data using Pig’s GENERATE FLATTEN command and a call to the DataFu Sessionize UDF.
And this is the physical mapping to go with the logical mapping. Note that all of the Pig transformations are contained within a separate execution unit, with one operator for the expression that transforms and filters the initial dataset and another for the table function.
The table function operator runs the input fields through an arbitrary Pig Latin script, in this case defining another alias to match the table function operator name and using the DataFu Sessionize UDF within a FOREACH to first sort the data, then GENERATE FLATTEN the same columns plus a session ID that groups together requests from the same IP address arriving within 60 minutes of each other.
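Judging from the hand-written script earlier, the Pig Latin embedded in that table function would look something like the sketch below; the table_fn_input alias is my own placeholder for the operator’s input, and it assumes Sessionize has been DEFINEd against the DataFu jar just as in that script:

-- Group the incoming rows by IP address, order each group by time,
-- then flatten the Sessionize output to add a sessionId column
sessionized = FOREACH (GROUP table_fn_input BY remoteAddr) {
    ordered = ORDER table_fn_input BY time;
    GENERATE FLATTEN(Sessionize(ordered)) AS (time, remoteAddr, request, status, bytes_string, referrer, browser, sessionId);
};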
If you’re interested in the detail of how this works and other usages of the new ODI12c for Big Data KMs, then come along to the masterclass I’m running with Jordan Meyer at the Brighton and Atlanta Rittman Mead BI Forums where I’ll go into the full details as part of a live end-to-end demo. Looking at the Pig Latin that comes out of it though, you can see it more or less matches the flow of the hand-written script and implements all of the key steps.
Finally, checking the output of the mapping I can see that the log entries have been sessionized and they’re ready to pass on to the next part of the classification model.
So that to my mind is where the value is in ODI generating Pig and Spark mappings. It’s not so much taking an existing Hive set-based mapping and just running it using a different language, it’s more about being able to implement graphically the sorts of data flows you can create with Pig and Spark, and being able to get access to the rich UDF and data access libraries that these two languages benefit from. As I said, come along to the masterclass Jordan and I are running, and I’ll go into much more detail and show how the mapping is set up, along with other mappings to create an end-to-end Hadoop data integration process.
Last Chance to Register for the Brighton Rittman Mead BI Forum 2015!
It’s just a week to go until the start of the Brighton Rittman Mead BI Forum 2015, with the optional one-day masterclass starting on Wednesday, May 6th at 10am and the event opening with a reception and Oracle keynote later in the evening. Spaces are still available if you want to book now, but we can’t guarantee places past this Friday so register now if you’re planning to attend.
As a reminder, here are some earlier blog posts and articles about events going on at the Brighton event, and at the Atlanta event the week after:
- Announcing the Special Guest Speakers for Brighton & Atlanta BI Forum 2015
- More on the Rittman Mead BI Forum 2015 Masterclass : “Delivering the Oracle Big Data and Information Management Reference Architecture”
- Announcing the BI Forum 2015 Data Visualisation Challenge
- RM BI Forum 2015 : Justification Letters for Employers
- Realtime BI Show with Kevin and Stewart – BI Forum 2015 Special!
- Previewing Three Sessions at the Brighton Rittman Mead BI Forum 2015
- Previewing Four Sessions at the Atlanta Rittman Mead BI Forum 2015
- BI Forum 2015 Preview — OBIEE Regression Testing, and Data Discovery with the ELK stack
We’re also running our first “Data Visualisation Challenge” at both events, where we’re asking attendees to create their most impressive and innovative data visualisation within OBIEE using the Donors Choose dataset, with the rule being that you can use any OBIEE or related technology as long as the visualisation runs with OBIEE and can respond to dashboard prompt controls. We’re also opening it up to OBIEE running as part of Oracle BI Cloud Service (BICS), so if you want to give Visual Analyser a spin within BICS we’d be interested in seeing the results.
Registration is still open for the Atlanta BI Forum event too, running the week after Brighton on the 13th-15th May 2015 at the Renaissance Atlanta Midtown hotel. Full details of both events are on the event homepage, with the registration links for Brighton and Atlanta given below.
- Rittman Mead BI Forum 2015, Brighton – May 6th – 8th 2015
- Hosted at the Hotel Seattle, Brighton Marina.
- Rittman Mead BI Forum 2015, Atlanta – May 13th – 15th 2015
- Hosted at the Renaissance Atlanta Midtown Hotel, Atlanta.
BI Forum 2015 Preview — OBIEE Regression Testing, and Data Discovery with the ELK stack
I’m pleased to be presenting at both of the Rittman Mead BI Forums this year; in Brighton it’ll be my fourth time, whilst Atlanta will be my first, and my first trip to the city too. I’ve heard great things about the food, and I’m sure the forum content is going to be awesome too (Ed: get your priorities right).
OBIEE Regression Testing
In Atlanta I’ll be talking about Smarter Regression Testing for OBIEE. The topic of regression testing in OBIEE is one that is – at last – starting to gain some real momentum. One of the drivers of this is the recognition in the industry that a more Agile approach to delivering BI projects is important, and to do this you need a good way of rapidly testing the changes you make. The other driver that I see is OBIEE 12c and the Baseline Validation Tool that Oracle announced at Oracle OpenWorld last year. Understanding how OBIEE works, and therefore how changes made can be tested most effectively, is key to a successful and efficient testing process.
In this presentation I’ll be diving into the OBIEE stack and explaining where it can be tested and how. I’ll discuss the common approaches and the relative strengths of each.
If you’ve not registered for the Atlanta BI Forum then do so now as places are limited and selling out fast. It runs May 14–15 with an optional masterclass on Wednesday 13th May from Mark Rittman and Jordan Meyer.
Data Discovery with the ELK Stack
My second presentation is at the Brighton forum the week before Atlanta, and I’ll be talking about Data Discovery and Systems Diagnostics with the ELK stack. The ELK stack is a set of tools from a company called Elastic, comprising Elasticsearch, Logstash and Kibana (E – L – K!). Data Discovery is a crucial part of the life cycle of acquiring, understanding, and exploiting data (one could even say, leverage the data). Before you can operationalise your reporting, you need to understand what data you have, how it relates, and what insights it can give you. This idea of a “Discovery Lab” is one of the key components of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:
ELK gives you great flexibility to ingest data with loose data structures and rapidly visualise and analyse it. I wrote about it last year with an example of analysing data from our blog and associated tweets, with the data originating in Hadoop, and more recently I’ve been using it to analyse Twitter activity. The great power of Kibana (the “K” of ELK) is the ability to rapidly filter and aggregate data, as well as see a summary of values within a data field:
The second aspect of my presentation is still on data discovery, but “discovering data” within the logfiles of an application stack such as OBIEE. ELK is perfectly suited to in-depth diagnostics against dense volumes of log data that you simply could not handle within simple log viewers or Enterprise Manager, such as the individual HTTP requests and types of value passed within the interactions of a single user session:
With its log streaming and full-text search capabilities, ELK also lends itself well to near-real-time system monitoring dashboards reporting the status of systems including OBIEE and ODI, and I’ll be discussing this in more detail during my talk.
The Brighton BI Forum is on 7–8 May, with an optional masterclass on Wednesday 6th May from Mark Rittman and Jordan Meyer. If you’ve not registered for the Brighton BI Forum then do so now as places are very limited!
Don’t forget, we’re running a Data Visualisation Challenge at each of the forums, and if you need to convince your boss to let you go you can find a pre-written ‘justification’ letter here.