BI Forum 2015 Preview — OBIEE Regression Testing, and Data Discovery with the ELK stack
I’m pleased to be presenting at both of the Rittman Mead BI Forums this year; in Brighton it’ll be my fourth time, whilst Atlanta will be my first, and my first trip to the city too. I’ve heard great things about the food, and I’m sure the forum content is going to be awesome too (Ed: get your priorities right).
OBIEE Regression Testing
In Atlanta I’ll be talking about Smarter Regression testing for OBIEE. The topic of Regression Testing in OBIEE is one that is – at last – starting to gain some real momentum. One of the drivers of this is the recognition in the industry that a more Agile approach to delivering BI projects is important, and to do this you need to have a good way of rapidly testing changes made. The other driver that I see is OBIEE 12c and the Baseline Validation Tool that Oracle announced at Oracle OpenWorld last year. Understanding how OBIEE works, and therefore how changes made can be tested most effectively, is key to a successful and efficient testing process.
In this presentation I’ll be diving into the OBIEE stack and explaining where it can be tested and how. I’ll discuss the common approaches and the relative strengths of each.
If you’ve not registered for the Atlanta BI Forum then do so now as places are limited and selling out fast. It runs May 14–15 with an optional masterclass on Wednesday 13th May from Mark Rittman and Jordan Meyer.
Data Discovery with the ELK Stack
My second presentation is at the Brighton forum the week before Atlanta, and I’ll be talking about Data Discovery and Systems Diagnostics with the ELK stack. The ELK stack is a set of tools from a company called Elastic, comprising Elasticsearch, Logstash and Kibana (E – L – K!). Data Discovery is a crucial part of the life cycle of acquiring, understanding, and exploiting data (one could even say, leverage the data). Before you can operationalise your reporting, you need to understand what data you have, how it relates, and what insights it can give you. This idea of a “Discovery Lab” is one of the key components of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:
ELK gives you great flexibility to ingest data with loose data structures and rapidly visualise and analyse it. I wrote about it last year with an example of analysing data from our blog and associated tweets with data originating in Hadoop, and more recently have been analysing twitter activity using it. The great power of Kibana (the “K” of ELK) is the ability to rapidly filter and aggregate data, as well as see a summary of values within a data field:
The second aspect of my presentation is still on data discovery, but “discovering data” within the logfiles of an application stack such as OBIEE. ELK is perfectly suited to in-depth diagnostics against dense volumes of log data that you simply could not handle within simple log viewers or Enterprise Manager, such as the individual HTTP requests and types of value passed within the interactions of a single user session:
By its nature of log streaming and full text search, ELK also lends itself well to near real time system monitoring dashboards reporting the status of systems including OBIEE and ODI, and I’ll be discussing this in more detail during my talk.
The Brighton BI Forum is on 7–8 May, with an optional masterclass on Wednesday 6th May from Mark Rittman and Jordan Meyer. If you’ve not registered for the Brighton BI Forum then do so now as places are very limited!
Don’t forget, we’re running a Data Visualisation Challenge at each of the forums, and if you need to convince your boss to let you go you can find a pre-written ‘justification’ letter here.
Data Integration Tips: ODI 12.1.3 – Convert to Flow
The many who have already upgraded Oracle Data Integrator from the 11g version to 12c probably know about this great feature called “convert to flow”. If not, well…here you go!
First, a little background on why I think this is an excellent bit of functionality. The ODI Upgrade Assistant will convert objects from 11g to 12c and it does a pretty decent job of it. When converting Interfaces, the upgrade process creates a Mapping in ODI 12c by taking the logical “mapping” layout and loading it into a Dataset object. I assume the reason is that it isn’t easy to convert an Interface directly to a full-on flow-based mapping, which is what you would typically develop in ODI 12.1.3 rather than using the limited Dataset (only joins, filters, and lookups allowed). After the upgrade, you would then be stuck with loads of mappings that are not using the latest flow-based features and components.
Now, in ODI 12.1.3, we have the ability to convert our Dataset into the standard ODI 12c flow based components within the Mapping. With a right-click on the Dataset component, we can see the “Convert to Flow” option.
Select Convert to Flow and accept the warning that our Mapping will be changed forever…and boom! No more Dataset!
This is great for my individual Mappings, but now I want to convert my migrated Reusable Mapping Datasets to flow based components.
Wait, what? No option to Convert to Flow! It looks like the Reusable Mappings (which were upgraded from my ODI 11g Temporary Interfaces) cannot be converted to flow for some reason. Hmm… Well, let’s finish converting my Datasets to flow based components for the rest of my 342 upgraded Mappings…one-by-one. Yikes! Actually, we can find a better way to do this. Time to get Groovy with the ODI SDK!
Using Groovy, I can create a simple script to loop through all of my mappings, find the dataset, and call the convertToFlow function on that dataset component. Here’s a look at the guts of the script.
for (mapping in mappingsList) {
    // Find every Dataset component in this mapping
    componentsList = mapping.getAllComponentsOfType("DATASET")
    for (component in componentsList) {
        // Results list passed to convertToFlow; it must be instantiated as an ArrayList
        java.util.List convertIssues = new ArrayList()
        blnConvert = 1
        try {
            blnConvert = component.convertToFlow(convertIssues)
            if (blnConvert) {
                // Print whatever the conversion reported
                for (item in convertIssues) {
                    out.println item.toString()
                }
            }
        } catch (Exception e) {
            out.println e
        }
        out.println mapping.getName() + " had a dataset converted to flow."
    }
}
Just remember to pass the results list object as a parameter to the convertToFlow call (and make sure the List object is properly instantiated as an ArrayList – as I was humbly reminded by David Allan via Twitter!). Once completed, you should be able to open each mapping and see that the dataset has been removed and only flow-based components exist.
Excellent, now we’ve completed our conversion in no time at all. But wait, what about those Reusable Mappings? Remember, we don’t have the right-click menu option to convert to flow as we did with the standard Mapping. Well, let’s see what our friend the ODI SDK has to say about that!
With a slight tweak to the code, replacing Mapping classes with ReusableMapping classes, we can batch convert our Reusable Mapping dataset components to flow based components in an instant. The reason it works via the API is down to the inheritance of the ReusableMapping class. It inherits its component methods from the interface oracle.odi.domain.mapping.IMapComponentOwner, which in turn gives access to the same methods and functions, such as convertToFlow, that we had available in the Mapping class. I’m not quite sure why ODI Studio doesn’t expose “Convert to Flow” in the Reusable Mappings, but I’m sure it’s a simple fix we’ll see in an ODI 12c release down the road.
So there you have it, another Data Integration Tip from Rittman Mead – this time, a little help post-migration from ODI 11g to ODI 12c. If you would like more details on how Rittman Mead can help your migration of Oracle Data Integrator to the latest version, send us a note at info@rittmanmead.com. We’d love to help!
Oracle Data Integrator Enterprise Edition Advanced Big Data Option Part 1- Overview and 12.1.3.0.1 install
Oracle recently announced Oracle Data Integrator Enterprise Edition Advanced Big Data Options as part of the new 12.1.3.0.1 release of ODI. It includes a range of great new functionality for working with the Hadoop ecosystem. Let’s have a look at the new features and how to install it on the Big Data Lite 4.1 Virtual Machine.
Note that some of these new features, for example Pig and Spark support and use of Oozie, require the new ODI EE Advanced Big Data Option license on top of base ODI EE.
Pig and Spark support
So far ODI 12c allowed us to use Hive for any Hadoop-based transformation. With this new release, we can now use Pig and Spark as well. Depending on the use case, we can choose which technology will give better performance and switch from one to another with very few changes. That’s the beauty of ODI: all you need to do is create the logical dataflow in your mapping and choose your technology. There is no need to be a Pig Latin expert or a PySpark ninja, as all of this will be generated for you! These two technologies are now available in the Topology, along with the Hadoop Data Server to define where the data lies. You can also see some Loading Knowledge Modules for Pig and Spark.
Pig, as Mark wrote before, is a dataflow language, which makes it a really good fit for the new “flow paradigm” introduced in ODI 12c. The idea is to write a data pipeline in Pig Latin. Under the covers, that code will create MapReduce jobs that will then be executed.
Quoting Mark one more time, Spark is a cluster processing framework that can be used from different programming languages, the two most common being Python and Scala. It supports operations such as filters, joins and aggregates. All of this can be done in memory, which can provide much better performance than MapReduce. The ODI team chose Python as the programming language for Spark, so the Knowledge Modules use PySpark.
New Hive Driver and LKMs
This release also brings significant improvements to the existing Hive technology. A new driver has been introduced under the name DataDirect Apache Hive JDBC Driver. It is actually the WebLogic Hive JDBC driver, which aims to improve performance and stability.
New Knowledge Modules have been introduced to benefit from this new driver, and they are LKMs rather than the multi-connection IKMs used previously. Thanks to that, they can be combined with other LKMs in the same mapping, which was not the case before.
Oozie Agent
Oozie is another Apache project and they define it as “a workflow scheduler system to manage Apache Hadoop jobs”. We can create workflows of different jobs in the Hadoop stack, and then schedule them at a certain time or trigger them when data becomes available.
What Oozie does is similar to the role of the ODI agent, and it’s now possible to use an existing Oozie engine directly instead of deploying a standalone agent on the Hadoop cluster.
The Oozie engine will do what your ODI agent usually does (execution, scheduling, monitoring) but it is integrated in the Hadoop ecosystem. So we will be able to schedule and monitor our ODI jobs in the same place as all our other Hadoop jobs that we use outside of ODI. Oozie can also automatically retrieve the Hadoop logs. We also lower the footprint, because no ODI-specific component needs to be installed on the cluster. However, according to the white paper (link below), it looks like Load Plans are not supported. So the idea would be to execute the Load Plans with a standalone or JEE agent that will delegate the execution of Big Data-related scenarios to the Oozie Engine.
HDFS support in file-related ODI Tools
Most of the ODI tools handling files can also do it on HDFS now. So you can delete, move, copy files and folders. You can also append files and transfer them to HDFS via FTP. It’s even possible to detect when a file is created on HDFS. All you need to do is to indicate your Hadoop Logical Schema for source, target or both. In the following example I’m copying a file from the Unix filesystem to HDFS.
I think this is a huge step forward. If we want to use ODI 12c for our Hadoop data integration, it must be able to do everything end-to-end. The maintenance or administrative tasks such as archiving, deleting or copying should also be done using ODI. So far it was a bit tedious to create a shell script using hdfs dfs commands and then launch it using the OdiOsCommand tool. Now we can directly use the file tools in a package or a procedure!
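For comparison, here is roughly what the scripted approach amounts to. This is a minimal Python sketch rather than the shell script described above, and it is not anything ODI generates. The paths are hypothetical, and only the basic hdfs dfs subcommands (-put, -cp, -mv, -rm -r) are used. It defaults to a dry run, so nothing is executed unless you ask for it.

```python
import shlex
import subprocess

# Map a friendly action name onto the matching `hdfs dfs` subcommand.
HDFS_FLAGS = {
    "copy_from_local": ["-put"],
    "copy": ["-cp"],
    "move": ["-mv"],
    "delete": ["-rm", "-r"],
}


def hdfs_command(action, *paths):
    """Build the argument list for a basic `hdfs dfs` file operation."""
    if action not in HDFS_FLAGS:
        raise ValueError(f"unsupported action: {action}")
    return ["hdfs", "dfs", *HDFS_FLAGS[action], *paths]


def run(argv, dry_run=True):
    """Print the command; only execute it when dry_run is False."""
    print(shlex.join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)


# Hypothetical paths, for illustration only.
run(hdfs_command("copy_from_local", "/home/oracle/movie.csv", "/user/oracle/landing/"))
run(hdfs_command("delete", "/user/oracle/archive/2014"))
```

Every maintenance task handled this way means another script to write, deploy and call through OdiOsCommand, which is the overhead the HDFS-aware file tools remove.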
New mapping components : Jagged and Flatten
The two new components can be used in a Big Data context but also in your traditional data integration. The first one, Jagged, will pivot a set of key-value pairs into columns with their values.
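To make the pivot idea concrete, here is a minimal Python sketch of turning key-value pairs into columns. It is only an illustration of the concept, not how the Jagged component is implemented. The field names are hypothetical, and it assumes each pair carries an explicit record id, which may not match how the real component delimits records.

```python
from collections import defaultdict


def pivot_key_values(rows, id_field="id", key_field="key", value_field="value"):
    """Turn (id, key, value) rows into one record per id, one column per key."""
    pivoted = defaultdict(dict)
    for row in rows:
        pivoted[row[id_field]][row[key_field]] = row[value_field]
    # Put the id back into each record so the output rows are self-describing.
    return [{id_field: record_id, **columns} for record_id, columns in pivoted.items()]


# Hypothetical input: one row per key-value pair.
rows = [
    {"id": 1, "key": "name", "value": "Alice"},
    {"id": 1, "key": "city", "value": "Brighton"},
    {"id": 2, "key": "name", "value": "Bob"},
]

for record in pivot_key_values(rows):
    print(record)
```

Note that the second record has no city column at all, which is exactly the kind of uneven structure a key-value layout allows.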
The Flatten component can be used with advanced files when you have nested attributes, as in JSON. Using a Flatten component will generate more rows if needed, to extract the different values of an attribute nested inside another attribute.
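As a rough illustration of what flattening a nested attribute means, the following Python sketch emits one output row per nested element, repeating the parent attributes on each row. The JSON document and the dotted column naming are assumptions made for the example, and this is not the code ODI generates.

```python
import json


def flatten(record, nested_field):
    """Emit one row per element of record[nested_field], repeating the parent attributes."""
    parent = {k: v for k, v in record.items() if k != nested_field}
    # Keep a single row with an empty value when there is nothing nested.
    children = record.get(nested_field) or [None]
    rows = []
    for child in children:
        row = dict(parent)
        if isinstance(child, dict):
            # Prefix child attributes so they cannot clash with parent names.
            for key, value in child.items():
                row[f"{nested_field}.{key}"] = value
        else:
            row[nested_field] = child
        rows.append(row)
    return rows


# Hypothetical JSON document with a nested, repeating attribute.
order = json.loads(
    '{"order_id": 10, "customer": "Acme",'
    ' "lines": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}'
)

for row in flatten(order, "lines"):
    print(row)
```

One input record with two order lines becomes two output rows, which is the "more rows if needed" behaviour described above.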
You can see the detail of all the new features in the white paper “Advancing Big Data Integration” for ODI 12c.
How to install it?
This patch must be applied on top of an existing Oracle Data Integrator 12.1.3.0.0 installation. It is not a bundled patch and it only relates to the Big Data Options, so there is no point installing it if you don’t need that functionality. Also make sure you are licensed for the ODI EE Advanced Big Data Option if you plan to use the Spark or Pig technologies/KMs or execute your jobs using the Oozie engine.
To showcase this, I used the excellent (and free!) Big Data Lite 4.1 VM, which already has ODI 12.1.3 and all the Hadoop components we need. So this example will be on an Oracle Enterprise Linux environment.
The first step is to download the patch from OTN or My Oracle Support. Also make sure you close ODI Studio and shut down the agents. The README then recommends updating OPatch and checking the OUI. So let’s do that, and also set some environment variables and unzip the ODI patch.
[oracle@bigdatalite ~]$ mkdir /home/oracle/bck
[oracle@bigdatalite ~]$ ORACLE_HOME=/u01/ODI12c/
[oracle@bigdatalite ~]$ cd $ORACLE_HOME
[oracle@bigdatalite ODI12c]$ unzip /home/oracle/Desktop/p6880880_132000_Generic.zip -d $ORACLE_HOME
[oracle@bigdatalite ODI12c]$ OPatch/opatch lsinventory -jre /usr/java/latest/
[oracle@bigdatalite ODI12c]$ export PATH=$PATH:/u01/ODI12c/OPatch/
[oracle@bigdatalite ODI12c]$ unzip -d /home/oracle/bck/ /home/oracle/Desktop/p20042369_121300_Generic.zip
[oracle@bigdatalite ODI12c]$ cd /home/oracle/bck/
This patch is actually composed of three pieces. One of them, the second one, is only needed if you have an enterprise installation. If you have a standalone install, you can just skip it. Note that I always specify the JRE to be used by OPatch to be sure everything works fine.
[oracle@bigdatalite bck]$ unzip p20042369_121300_Generic.zip
[oracle@bigdatalite ODI12c]$ cd 20042369/
[oracle@bigdatalite 20042369]$ opatch apply -jre /usr/java/latest/
[oracle@bigdatalite 20042369]$ cd /home/oracle/bck/

// ONLY FOR ENTERPRISE INSTALL
//[oracle@bigdatalite bck]$ unzip p20674616_121300_Generic.zip
//[oracle@bigdatalite bck]$ cd 20674616/
//[oracle@bigdatalite 20674616]$ opatch apply -jre /usr/java/latest/
//[oracle@bigdatalite 20674616]$ cd /home/oracle/bck/

[oracle@bigdatalite bck]$ unzip p20562777_121300_Generic.zip
[oracle@bigdatalite bck]$ cd 20562777/
[oracle@bigdatalite 20562777]$ opatch apply -jre /usr/java/latest/
Now we need to run the upgrade assistant, which will execute some scripts to upgrade our repositories. But in Big Data Lite, the tables of the repository have been compressed, so we first need to uncompress them and rebuild the invalid indexes, as David Allan pointed out on Twitter. Here are the SQL queries that will create the DDL statements you need to run if you are also using the Big Data Lite VM:
select 'alter table '||t.owner||'.'||t.table_name||' move nocompress;' q
from all_tables t
where owner = 'DEV_ODI_REPO'
and table_name <> 'SNP_DATA';

select 'alter index '||owner||'.'||index_name||' rebuild tablespace '||tablespace_name ||';'
from all_indexes
where owner = 'DEV_ODI_REPO'
and status = 'UNUSABLE';
Once it’s done we can start the upgrade assistant :
[oracle@bigdatalite 20562777]$ cd /u01/ODI12c/oracle_common/upgrade/bin
[oracle@bigdatalite bin]$ ./ua
The steps are quite straightforward so I’ll leave it to you. Here I selected Schemas, but if you have a standalone agent you will have to run it again and select “Standalone System Component Configurations” to upgrade the domain as well.
Before opening ODI Studio we will clear the JDev cache so we are sure everything looks nice.
[oracle@bigdatalite bin]$ rm -rf /home/oracle/.odi/system12.1.3.0.0/
We can now open ODI Studio. Don’t worry: the version mentioned there and in the upgrade assistant is still 12.1.3.0.0, but if you can see the new features then it has been installed properly.
The last step is to go into the Topology and change the driver used for all the Hive Data Servers. As all the new LKMs use the new WebLogic driver, we need to define the JDBC URL for it in place of the existing one. We simply select “DataDirect Apache Hive JDBC Driver” instead of the existing Apache driver.
And that’s it, we can now enjoy all the new Big Data features in ODI 12c! A big thanks to David Allan and Denis Gray for their technical and licensing help. Stay tuned as I will soon publish a second blog post detailing some features.
Analysing ODI performance with Flame Graphs
Flame Graphs are a visualisation that I learnt about through the excellent Linux systems performance work of Brendan Gregg, and saw Luca Canali talk about recently at UKOUG Tech 14. They’re a brilliant way of summarising extremely dense information in a way from which the main components accounting for the most time can be identified. I was recently doing some analysis for a client on their ODI batch runtime and I thought it would be a good idea to try them out. Load Plans can have complex hierarchies to them and working out which main sections account for what time can be tricky, as can following a load plan step through to a session and on to a session step and its constituent parts.
A flame graph is made up of the “stack trace” on the y-axis, and the amount of time spent in each on the x-axis. This is different from most other standard visualisations where the x-axis represents the passage of time, and instead summarises the data at multiple levels of the stack trace hierarchy. The “stack trace” in this case with ODI is Load plan -> load plan step (load plan step […]) -> session -> session step -> task. It’s as easy to see the overall run time as it is to see a load plan step part way down, or a constituent task of a session step. And what’s more, flame graphs look nice! This may seem a flimsy reason for using them on its own, but it’s a bonus over trawling through dull tables of data alone.
Looking at the flame graph above (taken from a demo BI Apps implementation) it’s nice and easy to see that the Warehouse Load Phase accounts for c.75% of the time, within which the two areas accounting for most time are AP and AR balances. This is from literally a single glance at one graphic. Flame Graphs are built as SVGs which enables them to be interactive (here’s an example). Clicking on any of the stack trace boxes drills into that area, so for the tasks taking less time (and so displaying less text) this is useful to see the specifics. Here’s the GL balance load in detail, showing how long the row inserts take in proportion to the index build:
Creating the flame graph is simple. You just need a stack trace that is semi-colon separated, followed by a space-delimited counter value at the end. A bit of recursive SQL magic with the SNP_ tables (helpfully documented by Oracle here) gives us this kind of output file with one line for every task executed and its duration:
;Start_Load_Plan;Global_Variable_Refresh;Source_Extract_Phase;1_General;2_General_PRE-SDE;3_PRE-SDE_Day;Finalize_Day;Finalize_W_DAY_D;CREATE_INDEXES;Create_Indexes_:_W_DAY_D_2/2;EXEC_TABLE_MAINT_PROC;TABLE_MAINT_PROC;Create Indexes 3 [...]
which you then run through the Flame Graph tool:
cat /tmp/odi.out |~/git/FlameGraph/flamegraph.pl --title "EBSVISION FIN HR_21_20141021_223159 / 2014-10-24 15:41:42" > /tmp/odi-flame-graph.svg
Simply load the resulting SVG into a web browser such as Chrome, and you’re done. Here’s an example that you can download and try out.
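The input format described above (a semi-colon separated stack, a space, then a counter) is easy to produce from any parent/child hierarchy. The article uses recursive SQL against the SNP_ tables for this; the following is a minimal Python sketch of the same idea, assuming the tasks have already been extracted into rows with hypothetical id, parent_id, name and seconds fields, which are not the real repository column names.

```python
def folded_stacks(tasks):
    """Return flamegraph.pl input lines of the form 'frame;frame;frame <count>'."""
    by_id = {task["id"]: task for task in tasks}
    parents = {task["parent_id"] for task in tasks if task["parent_id"] is not None}

    def stack_path(task):
        # Walk up the parent links, then reverse to get root-first order.
        frames = []
        while task is not None:
            frames.append(task["name"])
            task = by_id.get(task["parent_id"])
        return ";".join(reversed(frames))

    # Only leaf tasks carry a duration; their ancestors are implied by the path.
    return [
        f"{stack_path(task)} {task['seconds']}"
        for task in tasks
        if task["id"] not in parents
    ]


# Hypothetical extract of a load plan hierarchy.
tasks = [
    {"id": 1, "parent_id": None, "name": "Start_Load_Plan", "seconds": 0},
    {"id": 2, "parent_id": 1, "name": "Source_Extract_Phase", "seconds": 0},
    {"id": 3, "parent_id": 2, "name": "Insert_Rows", "seconds": 42},
    {"id": 4, "parent_id": 2, "name": "Create_Indexes", "seconds": 3},
]

print("\n".join(folded_stacks(tasks)))
```

Writing those lines to a file gives something that can be piped through flamegraph.pl in the same way as /tmp/odi.out.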
An Introduction to Analysing ODI Runtime Data Through Elasticsearch and Kibana 4
An important part of working with ODI is analysing the performance when it runs, and identifying steps that might be inefficient as well as variations in runtime against a baseline trend. The Operator tool in ODI itself is great for digging down into individual sessions and load plan executions, but for broader analysis we need a different approach. We also need to make sure we keep the data available for trend analysis, as it’s often the case that tables behind Operator are frequently purged for performance reasons.
In this article I’m going to show how we can make use of a generic method of pulling information out of an RDBMS such as Oracle and storing it in Elasticsearch, from where it can be explored and analysed through Kibana. It’s standalone, it’s easy to do, it’s free open source – and it looks and works great! Here I’m going to use it for supporting the analysis of ODI runtime information, but it is equally applicable to any time-based data you’ve got in an RDBMS (e.g. OBIEE Usage Tracking data).
Kibana is an open-source data visualisation and analysis tool, working with data stored in Elasticsearch. These tools work really well for very rapid analysis of any kind of data that you want to chuck at them quickly and work with. By skipping the process of schema definition and data modelling, the time taken to get the first results is drastically reduced. It enables you to quickly start “chucking about” data and getting meaning out of it before you commit full-scale to how you want to analyse it, which is what the traditional modelling route can sometimes force you to do prematurely.
ODI writes runtime information to the database, about sessions run, steps executed, time taken and rows processed. This data is important for analysing things like performance issues, and batch run times. Whilst with the equivalent runtime data (Usage Tracking) from OBIEE there is the superb RPD/Dashboard content that Oracle ship in SampleApp v406, for ODI the options aren’t as vast, ultimately being based on home-brew SQL against the repository tables using the repository schema documentation from Oracle. Building an OBIEE metadata model against the ODI schema is one option, but then requires an OBIEE server on which to run it – or merging into an existing OBIEE deployment – which means that it can become more hassle than it’s worth. It also means a bunch of up-front modelling before you get any kind of visualisations and data out. By copying the data across into Elasticsearch it’s easy to quickly build analyses against it, and has the additional benefit of retaining the data as long as you’d like meaning that it’s still available for long-term trend analysis once the data’s been purged from the ODI repository itself.
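To give a feel for what “copying the data across” involves, here is a minimal Python sketch that turns rows already fetched from a database into the newline-delimited JSON body used by Elasticsearch’s bulk API. It is not the loading method used for this dashboard (that is covered in part 2). The index name and row fields are hypothetical, and the exact contents of the action line vary by Elasticsearch version.

```python
import json


def to_bulk_body(rows, index_name, id_field):
    """Build a bulk request body: one action line plus one document line per row."""
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": str(row[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(row))
    # The bulk format requires every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical session rows, with timestamps already converted to ISO strings.
rows = [
    {"session_no": 101, "name": "LOAD_FACT", "start": "2015-01-10T02:00:00", "duration_s": 900},
    {"session_no": 102, "name": "REFRESH_VAR", "start": "2015-01-10T02:15:00", "duration_s": 5},
]

print(to_bulk_body(rows, "odi-sessions", "session_no"), end="")
```

Using the session number as the document id means that re-sending the same rows overwrites rather than duplicates them, which is handy for a load that runs on a regular basis.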
Let’s take a bit of a walk through the ODI dashboard that I’ve put together. First up is a view on the number of sessions that have run over time, along with their duration. For duration I’ve shown 50th (median), 75th and 95th percentiles to get an idea of the spread of session runtimes. At the moment we’re looking at all sessions, so it’s not surprising that there is a wide range since there’ll always be small sessions and longer ones:
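To see why a spread of percentiles is more informative than a single average, here is a minimal Python sketch using the simple nearest-rank definition. The durations are made up, and Elasticsearch calculates its percentiles differently (approximately), so this shows the idea rather than reproducing the dashboard’s numbers.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical session durations in seconds, with one long-running outlier.
durations = [12, 15, 14, 13, 16, 15, 14, 13, 240, 15]

for pct in (50, 75, 95):
    print(f"{pct}th percentile: {percentile(durations, pct)}s")
print(f"mean: {sum(durations) / len(durations):.1f}s")
```

The median and 75th percentile stay close together at 14 and 15 seconds, while the 95th jumps to 240 and exposes the outlier. The mean of 36.7 seconds describes none of the sessions well.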
Next up on the dashboard comes a summary of top sessions by runtime, both cumulative and per-session. The longest running sessions are an obvious point of interest, but cumulative runtime is also important; something may only take a short while to run when compared to some singular long-running sessions, but if it runs hundreds of times then it all adds up and can give a big performance boost if time is shaved off it.
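The difference between the two rankings can be sketched in a few lines of Python. The session names and durations here are hypothetical, chosen so that a short but frequent session outweighs a single long one.

```python
from collections import defaultdict


def runtime_summary(executions):
    """Aggregate (session_name, seconds) pairs into run count, total and longest runtime."""
    summary = defaultdict(lambda: {"runs": 0, "total": 0, "longest": 0})
    for name, seconds in executions:
        stats = summary[name]
        stats["runs"] += 1
        stats["total"] += seconds
        stats["longest"] = max(stats["longest"], seconds)
    return dict(summary)


# Hypothetical data: one 15-minute session, and a 5-second one that runs 400 times.
executions = [("LOAD_FACT", 900)] + [("REFRESH_VAR", 5)] * 400

summary = runtime_summary(executions)
by_total = sorted(summary, key=lambda name: summary[name]["total"], reverse=True)
by_longest = sorted(summary, key=lambda name: summary[name]["longest"], reverse=True)

print("Top by cumulative runtime:", by_total[0], summary[by_total[0]]["total"])
print("Top by single execution:", by_longest[0], summary[by_longest[0]]["longest"])
```

Here the 5-second session accumulates 2,000 seconds in total, more than twice the long-running one, so shaving even a second off it would save more time overall.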
Plotting out session execution times is useful for seeing when the longest running sessions ran:
The final element on this first dashboard is one giving the detail for each of the top x long-running session executions, including the session number so that it can be examined in further detail through the Operator tool.
Kibana dashboards are interactive, so you can click on a point in a graph to zoom in on that time period, as well as click & drag to select an arbitrary range. The latter technique is sometimes known as “Brushing”, and if I’m not describing it very well have a look at this example here and you’ll see in an instant what I mean.
As you focus on a time period in one graph the whole dashboard’s time filter changes, so where you have a table of detail data it then just shows it for the range you’ve selected. Notice also that the granularity of the aggregation changes as well, from a summary of every three hours in the first of the screenshots through to 30 seconds in the last. This is a nice way of presenting a summary of data, but isn’t always desirable (it can mask extremes and abnormalities) so can be configured to be fixed as well.
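The masking effect of a coarse aggregation interval is easy to demonstrate. This minimal Python sketch averages the same made-up data points into buckets of two different widths. It illustrates the general principle only, not how Kibana chooses its intervals.

```python
def bucket_means(points, width):
    """Average (timestamp_seconds, value) points into fixed-width time buckets."""
    buckets = {}
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % width
        buckets.setdefault(bucket_start, []).append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical runtimes sampled every 30 seconds, with one spike.
points = [(0, 10), (30, 10), (60, 100), (90, 10)]

print("30s buckets: ", bucket_means(points, 30))
print("120s buckets:", bucket_means(points, 120))
```

At 30-second granularity the spike of 100 is obvious; at 120 seconds all four points collapse into a single average of 32.5 and the spike disappears.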
Time isn’t the only interaction on the dashboard: anything that’s not a metric can be clicked on to apply a filter. So in the above example, where the top sessions by cumulative time are listed out, we might want to find out more about the one with several thousand executions.
Simply clicking on it then filters the dashboard and now the session details table and graph show information just for that session, including duration, and rows processed:
Session performance analysis
As an example of the benefit of using a spread of percentiles, here is a particular session that had an erratic runtime with great variation, which then stabilised. The purple line is the 95th percentile response time; the green and blue are 50th and 75th respectively. It’s clear that whilst up to 75% of the sessions completed in about the same kind of time each time they ran, the remaining quarter took anything up to five times as long.
One of the most important things in performance is ensuring consistent performance, and that is what happens here from about half way along the horizontal axis at c.February:
But what was causing the variation? By digging a notch deeper and looking at the runtime of the individual steps within the given session it can be seen that the inconsistent runtime was caused by a single step (the green line in this graph) within the execution. When this step’s runtime stabilises, so does the overall performance of the session:
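In the dashboard this was spotted by eye, but the same question can be asked of the data directly. As a minimal sketch, assuming step runtimes have been extracted as (step name, seconds) pairs with hypothetical step names, the following Python ranks steps by their coefficient of variation (standard deviation divided by mean). That measure is a choice made for this example, not something taken from the dashboard.

```python
from collections import defaultdict
from statistics import mean, pstdev


def most_erratic_step(step_runs):
    """Return (step_name, coefficient_of_variation) for the least consistent step."""
    by_step = defaultdict(list)
    for step, seconds in step_runs:
        by_step[step].append(seconds)
    # Dividing by the mean makes short and long steps comparable.
    spread = {
        step: pstdev(runs) / mean(runs)
        for step, runs in by_step.items()
        if mean(runs) > 0
    }
    worst = max(spread, key=spread.get)
    return worst, spread[worst]


# Hypothetical runtimes for three steps across three executions of one session.
step_runs = [
    ("extract", 30), ("extract", 31), ("extract", 29),
    ("load", 60), ("load", 300), ("load", 65),
    ("index", 10), ("index", 11), ("index", 10),
]

step, variation = most_erratic_step(step_runs)
print(f"Least consistent step: {step} (coefficient of variation {variation:.2f})")
```

With this data the load step stands out clearly, in the same way the green line does in the graph.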
This is performing a post-mortem on a resolved performance problem to illustrate how useful the data is. Obviously, if there were still a performance problem, we’d have a clear path of investigation to pursue thanks to this data.
If you’re interested in how to put together this kind of dashboard in Kibana and Elasticsearch, loading in data on a regular basis from an RDBMS, then stay tuned for part 2 when all will be revealed…