Category Archives: Rittman Mead

The greatest OBIEE usage study ever…

Rittman Mead is excited to announce the first ever global OBIEE usage survey. Our goal is to provide deep analysis into how organisations use their OBIEE systems and to create a set of industry benchmarks. We will be releasing the results of this research for free to the OBIEE community.

The types of metrics we are looking at are listed below, with a rough sketch after the list of how they can be derived from your usage-tracking data:

  • DAU/MAU (daily active users/monthly active users) – this gives an approximation of utility;
  • Frequency of use – how often your users access dashboards;
  • Recency of use – when reports and dashboards were last accessed;
  • Reach of system – how many users are accessing the system.
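
As a purely illustrative example (this is not the extraction script we send out), the sort of calculation involved could look something like this in Python, assuming a CSV extract with one row per dashboard access and made-up column names:

# Illustrative only, not the extract script we send out. Assumes a CSV with one
# row per dashboard access, holding an (optionally obfuscated) user name and an
# ISO-formatted access date; the file and column names here are made up.
import csv
from datetime import datetime

users_by_day = {}   # access date -> set of distinct users seen that day
last_seen = {}      # user -> most recent access date

with open("usage_tracking_extract.csv") as f:
    for row in csv.DictReader(f):
        day = datetime.strptime(row["access_date"], "%Y-%m-%d").date()
        users_by_day.setdefault(day, set()).add(row["user_name"])
        last_seen[row["user_name"]] = max(last_seen.get(row["user_name"], day), day)

recent_days = sorted(users_by_day)[-30:]    # the last 30 days present in the extract
dau = sum(len(users_by_day[d]) for d in recent_days) / float(len(recent_days))
mau = len(set().union(*(users_by_day[d] for d in recent_days)))

print("DAU/MAU (approximate utility): %.2f" % (dau / mau))
print("Reach (distinct users): %d" % len(last_seen))
print("Most recent access: %s" % max(last_seen.values()))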

Here’s how it works: we need your Usage Tracking data. To make providing this data easier, we can send you a script for your system administrator to run to extract it. We even have an option to obfuscate key attributes so we can’t see any usernames or sensitive details.
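
The obfuscation itself is nothing exotic. As a rough illustration of the sort of transformation we mean (not the actual script), a salted one-way hash keeps usernames unreadable while still allowing distinct users to be counted and grouped:

# Illustration of the kind of obfuscation we mean: a one-way hash, plus a local
# salt that never leaves your site, makes usernames unreadable to us while still
# allowing distinct-user counts. The salt value here is just a placeholder.
import hashlib

def obfuscate(username, salt="choose-your-own-secret"):
    return hashlib.sha256((salt + username).encode("utf-8")).hexdigest()[:16]

print(obfuscate("jsmith"))   # stable per user, but not reversible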

Once we receive your data, we will analyse it individually and provide you with a free report designed to give you unique insight into your system’s usage; an example is available here.

We will also add your obfuscated, depersonalised and aggregated data to our benchmarking database and let you know how your system performs against industry standards.

Please note: you do have to be running the usage tracking feature of OBIEE for this to work. We strongly advise having this running in any system and can help you get it turned on, if required. Also, any data you send to Rittman Mead is completely anonymous and holds no personal or sensitive attributes; it will only be used for benchmarking.

At the end of the survey we will perform a detailed analysis of the complete OBIEE usage database and publish the results.

How do I take part?

Please email us at ux@rittmanmead.com and we will send over the scripts and full instructions.

Why are we doing this?

We are currently focused on user engagement with BI and analytics systems and have been conducting research over the last few months. We have found very few tangible studies about enterprise BI usage, and OBIEE usage in particular.

We are creating this database from OBIEE users around the world and will use this as the academic basis for furthering our research into user engagement and OBIEE.

Three Easy Ways to Stream Twitter Data into ElasticSearch

For the past few months a friend has been driving me crazy with all his praise for Splunk. He was going on about how easy it is to install, integrate different sources and build reports. I eventually started playing around to see if it could be used for a personal project I’m working on. In no time at all I understood what he was on about and I could see the value and ease of use of the product. Unfortunately the price of such a product means it is not a solution for everyone so I started looking around for alternatives and ElasticSearch caught my eye as a good option.

In this post we will focus on how we can stream Twitter data into Elasticsearch and explore the different options for doing so. Storing data in Elasticsearch is just the first step, but you only gain real value when you start analysing that data. In the next post we will add sentiment analysis to our Twitter messages and see how we can analyse this data by building Kibana dashboards. But for now we will dig a bit deeper into the following three configuration options:

  • logstash
  • the Twitter River plugin
  • Tweepy (a Python library)

We will look at the installation and configuration of each of these and see how we can subscribe to Twitter using the Twitter API. Data will then get processed, if required, and sent to Elasticsearch.

Why Use Elasticsearch

Elasticsearch can store large quantities of semi-structured (JSON) data and lets you query that data quickly and easily. This makes it a good option for storing Twitter data, which is delivered as JSON, and a perfect candidate for the project I’m working on.
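
As a quick, illustrative example of this (using the official elasticsearch Python client; the index and field names below are made up), storing and then searching a JSON document takes only a couple of calls:

# Quick illustration of Elasticsearch's schema-less JSON storage and querying,
# using the official Python client; index and field names are just examples.
from elasticsearch import Elasticsearch

es = Elasticsearch()   # defaults to localhost:9200

# Index a tweet-like JSON document; no schema needs to be defined up front
es.index(index="idx_demo", doc_type="tweet", body={
    "user": "rittmanmead",
    "text": "Streaming Twitter data into Elasticsearch",
    "retweet_count": 3
})

# Full-text search across the stored documents
results = es.search(index="idx_demo", body={"query": {"match": {"text": "twitter"}}})
print("%d hit(s)" % results["hits"]["total"])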

Prerequisites

You will need a server to host all the required components. I used an AWS free tier (t2.micro) instance running Amazon Linux 64-bit. This post assumes you already have an Elasticsearch cluster up and running and that you have a basic understanding of Elasticsearch. There are some good blog posts, written by Robin Moffatt, which were very useful during the installation and configuration stages.

Twitter Stream API

In order to access the Twitter Streaming API, you need to register an application at http://apps.twitter.com. Once created, you should be redirected to your app’s page, where you can get the consumer key and consumer secret and create an access token under the “Keys and Access Tokens” tab. These values will be used as part of the configuration for all the sample configurations to follow.

The API allows two types of subscription: you can either subscribe to specific keywords or to a user timeline (similar to what you see as a Twitter user).

logstash

We’ll start with logstash as this is probably the easiest one to configure and seems to be the recommended approach for integrating sources with Elasticsearch in recent versions. At the time of writing, logstash only supported streaming based on keywords, which meant it was not suitable for my needs, but it’s still a very useful option to cover.

logstash installation

To install logstash you need to download the correct archive based on the version of elasticsearch you are running.

curl -O https://download.elasticsearch.org/logstash/logstash/logstash-x.x.x.tar.gz

Extract the archived file and move the extracted folder to a location of your choice

tar zxf logstash-x.x.x.tar.gz
mv logstash-x.x.x /usr/share/logstash/

logstash configuration

To configure logstash we need to provide input, output and filter elements. For our example we will only specify input (twitter) and output (elasticsearch) elements as we will be storing the full twitter message.

For a full list of logstash twitter input settings see the official documentation.

Using your favourite text editor, create a file called twitter_logstash.conf and copy the below text. Update the consumer_key, consumer_secret, oauth_token and oauth_token_secret values with the values from your Twitter Stream App created earlier.

input {
    twitter {
        # add your data
        consumer_key => "CONSUMER_KEY_GOES_HERE"
        consumer_secret => "CONSUMER_SECRET_GOES_HERE"
        oauth_token => "ACCESS_TOKEN_GOES_HERE"
        oauth_token_secret => "ACCESS_TOKEN_SECRET_GOES_HERE"
        keywords => ["obiee","oracle"]
        full_tweet => true
    }
}
output {
    elasticsearch_http {
        host => "localhost"
        index => "idx_ls"
        index_type => "tweet_ls"
    }
}

This configuration will receive all tweets containing the keywords obiee or oracle and store them in an index called idx_ls in Elasticsearch.

To run logstash, execute the following command from the installed location

bin/logstash -f twitter_logstash.conf

If you subscribed to active Twitter keywords you should see data within a few seconds. To confirm that your data is flowing, you can navigate to http://server_ip:9200/_cat/indices?v, which will show you a list of indices with some relevant information.
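
If you would rather check from code than from a browser, the same confirmation can be done with the elasticsearch Python client (also used in the Tweepy section below); this is an optional extra rather than part of the logstash setup:

# Optional check from Python instead of the browser: count the documents in the
# index that logstash is writing to (the index name matches the config above).
from elasticsearch import Elasticsearch

es = Elasticsearch()   # defaults to localhost:9200
print(es.count(index="idx_ls")["count"])   # should keep growing as tweets arrive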

With this easy configuration you can get Twitter data flowing in no time at all.

Twitter River Plugin

Next we will look at using the Twitter River plugin to stream Twitter data. The only reason to use this approach over logstash is if you want to subscribe to a user timeline. Using this feature returns the same information you would see in the Twitter application or when viewing your timeline online.

Twitter River Plugin installation

Before installing the plugin you need to determine which version is compatible with your version of Elasticsearch. You can check this at https://github.com/elasticsearch/elasticsearch-river-twitter and select the correct version.

To install you need to use the elasticsearch plugin installation script. From the elasticsearch installation directory, execute:

bin/plugin -install elasticsearch/elasticsearch-river-twitter/x.x.x

Then restart your Elasticsearch service.

Twitter River Plugin configuration

To configure the Twitter subscriber we will again create a .conf file with the necessary configuration elements. Create a new file called twitter_river.conf and copy the following text. As with logstash, update the required fields with the values from the Twitter app created earlier.

{
  "type": "twitter",
  "twitter" : {
        "oauth" : {
            "consumer_key" : "CONSUMER_KEY_GOES_HERE",
            "consumer_secret" : "CONSUMER_SECRET_GOES_HERE",
            "access_token" : "ACCESS_TOKEN_GOES_HERE",
            "access_token_secret" : "ACCESS_TOKEN_SECRET_GOES_HERE"
        },
        "filter" : {
            "tracks" : ["obiee", "oracle"]
        },
        "raw" : true,
        "geo_as_array" : true
  },
  "index": {
    "index": "idx_rvr",
    "type": "tweet_rvr",
    "bulk_size": 100,
    "flush_interval": "5s"
  }
}

This configuration is equivalent to the logstash configuration and will receive the same tweets from Twitter. To subscribe to a user timeline instead of keywords, replace the filter configuration element:

"filter" : {
      "tracks" : ["obiee", "oracle"],
},

with a user type element

"type" : "user",

To start the plugin you need to execute the following from a terminal window.

curl -XPUT localhost:9200/_river/idx_rvr/_meta -d @twitter_river.conf

Depending on how active your subscribed keywords are, you should see data in Elasticsearch within a few seconds. You can again navigate to http://server_ip:9200/_cat/indices?v to confirm that your data is flowing. Note that this time you should see two new rows: one index called _river and the other called idx_rvr, which is where your Twitter data will be stored.

To stop the plugin (or change between keywords and user timeline), execute the following from a terminal window:

curl -XDELETE 'localhost:9200/_river/idx_rvr';

Tweepy

Finally we will look at the most flexible solution of them all. It is a bit more complicated to install and configure but, given what you gain, the small amount of extra time spent is well worth the effort. Once you have Tweepy working you will be able to write your own Python code to manipulate the data as you see fit.

Tweepy installation

As Tweepy is a Python package, we will use pip to install the required packages. If you don’t have pip installed, execute one of the following, depending on your Linux distribution.

yum -y install python-pip

or

apt-get install python-pip

Next we will install the Tweepy and elasticsearch packages

pip install tweepy
pip install elasticsearch

Tweepy configuration

Create a new file called twitter_tweepy.py and copy the following text to the file

import tweepy
import sys
import json
from textwrap import TextWrapper
from datetime import datetime
from elasticsearch import Elasticsearch


consumer_key="CONSUMER_KEY_GOES_HERE"
consumer_secret="CONSUMER_SECRET_GOES_HERE"

access_token="ACCESS_TOKEN_GOES_HERE"
access_token_secret="ACCESS_TOKEN_SECRET_GOES_HERE"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

es = Elasticsearch()

class StreamListener(tweepy.StreamListener):
    status_wrapper = TextWrapper(width=60, initial_indent='    ', subsequent_indent='    ')

    def on_status(self, status):
        try:
            #print '\n%s %s' % (status.author.screen_name, status.created_at)

            json_data = status._json
            #print json_data['text']

            es.create(index="idx_twp",
                      doc_type="twitter_twp",
                      body=json_data
                     )

        except Exception, e:
            print e
            pass

streamer = tweepy.Stream(auth=auth, listener=StreamListener(), timeout=3000000000 )

#Fill in your own keywords below
terms = ['obiee','oracle']

streamer.filter(None,terms)
#streamer.userstream(None)

As with the river plugin, you can subscribe to the user timeline by changing the subscription type. To do this, replace the last line in the script with:

streamer.userstream(None)

To start the listener you need to execute the python file

python twitter_tweepy.py

Navigate to the Elasticsearch index list again to ensure you are receiving data.
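
This is also where Tweepy’s flexibility pays off: because the listener is just Python, you can reshape the tweet before it is stored rather than indexing the full JSON. As a sketch (reusing the es connection from the script above, and keeping an arbitrary selection of fields), the listener could instead look like this:

import tweepy

class TrimmedStreamListener(tweepy.StreamListener):
    # Variation on the listener above: index a trimmed document instead of the
    # full tweet JSON. The fields kept here are just an example.
    def on_status(self, status):
        try:
            doc = {
                "user": status.author.screen_name,
                "text": status.text,
                "created_at": status.created_at,
                "retweet_count": status.retweet_count
            }
            es.create(index="idx_twp", doc_type="twitter_twp", body=doc)
        except Exception, e:
            print e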

Conclusion

Getting Twitter data into Elasticsearch is actually pretty simple. Logstash is by far the easiest one to configure and, if subscribing to keywords is your only requirement, it should be the preferred solution. Now that we have the foundation in place, in the next post we will have a look at how we can enhance this data by adding sentiment analysis and how we can use this data to make decisions.

Combining Spark Streaming and Data Frames for Near-Real Time Log Analysis & Enrichment

A few months ago I posted an article on the blog around using Apache Spark to analyse activity on our website, using Spark to join the site activity to some reference tables for some one-off analysis. In this article I’ll be taking an initial look at Spark Streaming, a component within the overall Spark platform that allows you to ingest and process data in near real-time whilst keeping the same overall code-base as your batch-style Spark programs.

Like regular batch-based Spark programs, Spark Streaming builds on the concept of RDDs (Resilient Distributed Datasets) and provides an additional high-level abstraction called a “discretized stream” or DStream, representing a continuous stream of RDDs over a defined time period. In the example I’m going to create I’ll use Spark Streaming’s DStream feature to hold in-memory the last 24 hours’ worth of website activity, and use it to refresh a “Top Ten Pages” Impala table once a minute.

To create the example I started with the Log Analyzer example in the set of DataBricks Spark Reference Applications, and adapted the Spark Streaming / Spark SQL example to work with our CombinedLogFormat log format that contains two additional log elements. In addition, I’ll also join the incoming data stream with some reference data sitting in an Oracle database and then output a parquet-format file to the HDFS filesystem containing the top ten pages over that period.

The parts of the Log Analyzer reference application that we reused comprise two scripts that compile into a single JAR file: one that creates a Scala object to parse the incoming CombinedLogFormat log files, and another containing the main program. The log parsing object contains a single function that takes a set of log lines and returns a Scala class that breaks each log entry down into its individual elements (IP address, endpoint (URL), referrer and so on). Compared to the DataBricks reference application I had to add two extra log file elements to the ApacheAccessLog class (referer and agent), and add some code to deal with “-“ values that could be in the log for the content size; I also added some extra code to ensure the URLs (endpoints) quoted in the log matched the format used in the data extracted from our WordPress install, which stores all URLs with a trailing forward-slash (“/“).

package com.databricks.apps.logs
case class ApacheAccessLog(ipAddress: String, clientIdentd: String,
 userId: String, dateTime: String, method: String,
 endpoint: String, protocol: String,
 responseCode: Int, contentSize: Long, 
 referer: String, agent: String) {
}
 
object ApacheAccessLog {
val PATTERN = """^(\S+) (\S+) (\S+) \[([\w\d:/]+\s[+-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) ([\d-]+) "([^"]+)" "([^"]+)"""".r
def parseLogLine(log: String): ApacheAccessLog = {
 val res = PATTERN.findFirstMatchIn(log)
 if (res.isEmpty) {
 ApacheAccessLog("", "", "", "","", "", "", 0, 0, "", "")
 }
 else {
 val m = res.get
 val contentSizeSafe : Long = if (m.group(9) == "-") 0 else m.group(9).toLong
 val formattedEndpoint : String = (if (m.group(6).charAt(m.group(6).length-1).toString == "/") m.group(6) else m.group(6).concat("/"))
 
 ApacheAccessLog(m.group(1), m.group(2), m.group(3), m.group(4),
 m.group(5), formattedEndpoint, m.group(7), m.group(8).toInt, contentSizeSafe, m.group(10), m.group(11))
 }
 }
}

The body of the main application script starts by importing Scala classes for Spark, Spark SQL and Spark Streaming, and then defines two variables that determine the amount of log data the application will consider: WINDOW_LENGTH (86,400,000 milliseconds, or 24 hours), which determines the window of log activity that the application will consider, and SLIDE_INTERVAL, set to 60,000 milliseconds or one minute, which determines how often the statistics are recalculated. Using these values means that our Spark Streaming application will recompute, every minute, the top ten most popular pages over the last 24 hours.

package com.databricks.apps.logs.chapter1
import com.databricks.apps.logs.ApacheAccessLog
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{StreamingContext, Duration}
object LogAnalyzerStreamingSQL {
 val WINDOW_LENGTH = new Duration(86400 * 1000)
 val SLIDE_INTERVAL = new Duration(60 * 1000)

In our Spark Streaming application, we’re also going to load up reference data from our WordPress site, exported and stored in an Oracle database, to add post title and post author values to the raw log entries that come in via Spark Streaming. In the next part of the script we define a new Spark context and then a Spark SQL context off-of the base Spark context, then create a Spark SQL data frame to hold the Oracle-sourced WordPress data to later-on join to the incoming DStream data. Using Spark’s new Data Frame feature and the Oracle JDBC drivers that I separately downloaded from the Oracle website, I can pull in reference data from Oracle or other database sources, or bring it in from a CSV file as I did in the previous Spark example, to supplement my raw incoming log data.

def main(args: Array[String]) {
 val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
 val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._
 
 val postsDF = sqlContext.load("jdbc", Map(
 "url" -> "jdbc:oracle:thin:blog_refdata/password@bigdatalite.rittmandev.com:1521:orcl",
 "dbtable" -> "BLOG_REFDATA.POST_DETAILS"))
 
 postsDF.registerTempTable("posts")

Note also how Spark SQL lets me declare a data frame (or indeed any RDD with an associated schema) as a Spark SQL table, so that I can later run SQL queries against it – I’ll come back to this at the end.

Now comes the first part of the Spark Streaming code. I start by defining a new Spark Streaming context off-of the same base Spark context that I created the Spark SQL one from, then I use that Spark Streaming context to create a DStream that reads newly-arrived files landed in an HDFS directory – for this example I’ll manually copy the log files into an “incoming” HDFS directory, whereas in real life I’d connect Spark Streaming to Flume using FlumeUtils for a more direct connection to activity on the webserver.

val streamingContext = new StreamingContext(sc, SLIDE_INTERVAL)
val logLinesDStream = streamingContext.textFileStream("/user/oracle/rm_logs_incoming")

Then I call the Scala “map” transformation to convert the incoming DStream into an ApacheAccessLog-formatted DStream, and cache this new DStream in-memory. Next and as the final part of this stage, I call the Spark Streaming “window” function which packages the input data into in this case a 24-hour window of data, and creates a new Spark RDD every SLIDE_INTERVAL – in this case 1 minute – of time.

val accessLogsDStream = logLinesDStream.map(ApacheAccessLog.parseLogLine).cache()
val windowDStream = accessLogsDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL)

Now that Spark Streaming is creating RDDs for me to represent all the log activity over my 24 hour period, I can use the .foreachRDD control structure to turn that RDD into its own data frame (using the schema I’ve inherited from the ApacheAccessLog Scala class earlier on), and filter out bot activity and references to internal WordPress pages so that I’m left with actual page accesses to then calculate the top ten list from.

windowDStream.foreachRDD(accessLogs => {
 if (accessLogs.count() == 0) {
 println("No logs received in this time interval")
 } else {
 accessLogs.toDF().registerTempTable("accessLogs")
// Filter out bots 
 val accessLogsFilteredDF = accessLogs
 .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*"))
 .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin|wp-includes|favicon.ico|xmlrpc.php|wp-comments-post.php).*")).toDF()
 .registerTempTable("accessLogsFiltered")

Then I use Spark SQL’s ability to join tables created against the windowed log data and the Oracle reference data I brought in earlier, to create a Parquet-formatted file containing the ten most popular pages over the past 24 hours. Parquet is the default storage format used by Spark SQL and is best suited to BI-style columnar queries, but I could use Avro, CSV or another file format if I brought the correct library imports in.

val topTenPostsLast24Hour = sqlContext.sql("SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) as total FROM accessLogsFiltered a JOIN posts p ON a.endpoint = p.POST_SLUG GROUP BY p.POST_TITLE, P.POST_AUTHOR ORDER BY total DESC LIMIT 10 ") 
 
 // Persist top ten table for this window to HDFS as parquet file
 
 topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet", "parquet", SaveMode.Overwrite) 
 }
 })

Finally, the last piece of the code starts-off the data ingestion process and then continues until the process is interrupted or stopped.

streamingContext.start()
    streamingContext.awaitTermination()
  }
}

I can now go over to Hue and move some log files into the HDFS directory that the Spark application is running on, like this:

Then, based on the SLIDE_INTERVAL I defined in the main Spark application earlier on (60 seconds, in my case), the Spark Streaming application picks up the new files and processes them, outputting the results as a Parquet file back on the HDFS filesystem.

So what to do with the top-ten pages parquet file that the Spark Streaming application creates? The most obvious thing to do would be to create an Impala table over it, using the schema metadata embedded into the parquet file, like this:

CREATE EXTERNAL TABLE rm_logs_24hr_top_ten
LIKE PARQUET '/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet/part-r-00001.parquet'
STORED AS PARQUET
LOCATION '/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet';

Then I can query the table using Hue again, or I can import the Impala table metadata into OBIEE and analyse it using Answers and Dashboards.

So that’s a very basic example of Spark Streaming, and I’ll be building on this example over the next few weeks to add features such as persistent storage of all processed data, and classification and clustering of the data using Spark MLlib. More importantly, copying files into HDFS for ingestion into Spark Streaming adds quite a lot of latency and it’d be better to connect Spark directly to the webserver using Flume or, even better, Kafka – I’ll add examples showing these features in the next few posts in this series.

Oracle Data Integrator to load Oracle BICS and Oracle Storage Cloud Service

Oracle is focusing its sales on the cloud this year, and in the BI world we have seen the introduction of Oracle Business Intelligence Cloud Service – BICS – which we have already covered on the blog. Another great product, used to store files and unstructured data, is Oracle Storage Cloud Service, which can also be used as a staging area before loading a BICS schema. To provision our BI system in the cloud or to back up our files, it’s important to find a reliable and maintainable way to move on-premise data to the cloud. And when we speak about a reliable and maintainable way to move data, I can’t help but think about Oracle Data Integrator. Luckily, Ayush Ganeriwal wrote two excellent blog posts explaining how to use ODI to load these two cloud services! He even gave away the Knowledge Modules, ODI Open Tools, Jar files and other things he developed to make it work.

Load data to Oracle Storage Cloud Service with a package.

Oracle Storage Cloud Service is used to store files and unstructured data in the cloud, very much like Amazon S3. A company that has moved to the cloud needs somewhere to store the files that will be used by its other cloud services, and Oracle Storage Cloud Service is the place to store all these files. If you want to use it as a cloud backup of your local filesystem, you might even choose Oracle Storage Cloud Archive Service, which has cheaper storage but costs more when retrieving or editing data, similar to Amazon Glacier.

Oracle Storage Cloud Service offers a RESTful web service API to load data, as well as a Java library that directly wraps this API and supports client-side encryption. When it comes to loading data, what could be better than using the tool we already use for everything else? In this blog post, Ayush details how he created an ODI Open Tool that can load data to Oracle Storage Cloud Service using the Java library.

We first need to download the Java library and place it in our ODI Studio or agent classpaths. Below are the default locations for these classpaths, but we can also specify an additional classpath if we want. Don’t forget to close ODI Studio and shut down the agent(s) before adding the files.

ODI Studio on Unix : ~/.odi/oracledi/userlib
ODI Studio on Windows : %APPDATA%\odi\oracledi\userlib
12c Standalone Agent : /odi/agent/lib
11g Standalone Agent : /oracledi/agent/drivers/

For a JEE agent, we need to create a server template as described in the doc: https://docs.oracle.com/middleware/1212/odi/ODIDG/setup_topology.htm#CHDHGIGI.

In the same folder, we can also drop the Jar file for the Open Tool that Ayush created and posted on java.net (last link in the folder at the moment of writing). Remember that all the drivers, libraries and Open Tools have to be present in every ODI Studio or agent install.

We can now reopen ODI Studio, as we need to register the new Open Tool in the master repository so everyone can use it. In the ODI menu, click on Add Remove Open Tools and then on the goggles icon to search for the Open Tools. In our case, the class is under the package oracle.odi.rest.

Adding an ODI Open Tool

Select the OracleStorageCloudGet tool and click the Plus icon to add it. There we can see the description of the tool, the command, the parameters and the icon used. Let’s add the OracleStorageCloudPut as well and click OK.

Open Tool description

That’s it! We can now see two new ODI Open Tools under the Plugins category in the Package Toolbox. We can add them to a package and fill in all the parameters, or copy/paste the command to use it in an ODI Procedure.

Open Tools in a package

Load data to Oracle Business Intelligence Cloud Service with a mapping.

BICS offers a Data Loader, which is actually an Apex application taking Excel, CSV, text or zip files as input. But this is more of a manual process, used when working with sample data or a really small dataset. If we plan to use BICS in a production environment, we need something more resilient that can be scheduled and that handles errors. We could use Data Sync, but maybe you don’t want to introduce a new tool. Luckily, BICS also provides a RESTful web service API we can use to load data. Again, ODI is the perfect tool for creating our own reusable components to integrate with new sources or targets. And once again, Ayush did all the work for us, as described in this article.

The first step is similar to what we did previously: we need to make all the drivers and libraries available to ODI Studio and the agent. The steps in the two Knowledge Modules contain Jython code, calling some Java methods from the odi-bics.jar file. These methods – just like the Oracle Storage Cloud Service Java library – use the Jersey libraries to make the RESTful web service calls, so we will need them as well. Basically, just take all the jar files from the archive you can find on java.net under the name “RKM and IKM for Oracle BI Cloud Service”. As mentioned above, we need to add them to the classpath of the ODI Studio and agent installs and restart them.

The article mentions that we need to create a new Technology for BI Cloud Service, but an XML export is already included in the archive. Go to the Topology, right-click on Technology and import the XML file.

Adding a Technology

We can also import the two Knowledge Modules, either globally or under a single project. Ayush wrote one RKM to reverse-engineer the tables from our BICS schema and one Multi-Connections IKM to integrate the data from the on-premise database into the BICS schema in the cloud. I had to tweak the IKM a little bit, but that might be down to my particular setup: the steps were not displayed in the order set in the XML file and I had to reorder them properly to place the drop step before the create and truncate steps. Probably a bug – I will investigate that later.

Reordering the steps in the KM

We are done with the install of new components and we can now set up the topology. First create a Dataserver with the newly imported technology, providing a login/password and the URL of the BICS instance as the Service URL. From my testing, and unlike Ayush’s screenshot, it should start with https and should not include the port number. Of course, we need to create a physical schema, where we need to provide our Identity Domain in both Tenant fields. We can leave the Database fields empty. The last step in the Topology is to create a Logical schema and associate it with the Physical schema through a context.

We can then switch to the Designer tab and create a new Model with the BI Cloud Service technology and that Logical Schema. On the Reverse Engineer tab we select Customized and the RKM BI Cloud Service. We can also specify a mask to restrict the metadata import to specific table names before hitting the Reverse Engineer button at the top. I set the mask to DIM_LOC% and only my DIM_LOCATION table has been reverse-engineered from the BICS schema.

Model and reverse engineering

Finally, we can create a mapping. In this example I load two target tables. DIM_LOCATIONS (plural) sits in an on-premise Oracle Database for users accessing OBIEE from our HQ, while DIM_LOCATION (singular) is in the schema linked to our BICS instance used by remote users. There is nothing special here, except that I unselected the insert and update checkboxes for my surrogate key (LOCATION_SK) because I asked BICS to automatically populate it when I created the table – behind the scenes, a sequence is created and a trigger is added to the table to populate the field with the next value of the sequence on each insert.

On the Physical tab, the LKM should be set to LKM SQL Multi-Connect because we will delegate the data transfer to the Multi-Connections IKM. This IKM is the one we imported: IKM SQL to BI Cloud Service. I chose to enable the truncate option here.

Let’s hit the execute button and watch the result in the operator and in our BICS schema.

The two articles demonstrate one more time how easy it is to plug a new technology into ODI. This tool is a wonderful framework that can be used to move data from and to any technology, even Big Data or Cloud as we have seen here. By having every integration job happening in the same place, we get easier maintenance, better monitoring, and we can schedule all the jobs together. It’s also way easier to see the big picture in our projects.

If you have any questions about this post, feel free to reach out – we love this kind of interesting challenge at Rittman Mead.

Thanks a lot to Ayush for the great components and libraries he provided. I’m glad I will share the stage with him and a fellow ODI expert, Holger Friedrich, at Oracle Open World, where we will speak about Best Practices for Development Lifecycle Management. Come see us and say hi!

Taking a Closer Look at Knowledge Modules in ODI12c – Component-Style and Multi-Connect KMs

Another question that came up from the ODI12c Bootcamp Course I’m delivering for a client in London at the moment is how to choose between the different knowledge modules that come with ODI12c. What with the choice now between template-style KMs and the new component-style KMs, the new option of multi-connect KMs, and the general question around which KM you pick within a KM type when building a table mapping, I thought it’d be interesting to take a closer look at how knowledge modules work with ODI12c and how you go about making the right choice of KM when creating a mapping.

As a quick primer, Oracle Data Integrator up until the recent 12c release had six types of knowledge module you could use in a data mapping:

They were:

  • Load Knowledge Modules, for loading source data out of the source database server and into a staging table typically on the target database platform
  • Integrate Knowledge Modules, for taking that staging data and integrating (inserting, updating, merging etc) it into the target table
  • Reverse Knowledge Modules, for reverse-engineering the table metadata from a source system
  • Check Knowledge Modules, for performing data quality checks on source and target tables
  • Journalise Knowledge Modules, for setting up change data capture on a source table or table set
  • Service Knowledge Modules, for exposing tables or other datastores as CRUD-type web services

Using ODI11g as an example, when you created a new mapping you selected an LKM for extracting data out of your source database, an IKM for integrating the results into the target table, and optionally a CKM or JKM if you needed to run data quality checks or use table journalization (CDC). In all cases you had to first import the knowledge module definitions into the ODI11g Work Repository and your project before you could use them. To take an example, an ODI11g mapping where the source was a file and the target, an Oracle database, might look like this as a Mapping diagram:

Looking at the Flow diagram, at the start there are no knowledge modules to select from as none have yet been imported:

I therefore import a selection of IKM, LKM and other KMs for the technologies I’m using, and then I’m able to assign an LKM and IKM to my flow diagram.

For both Load Knowledge Modules (LKMs) and Integration Knowledge Modules (IKMs), you have a number of options ranging from generic JDBC/SQL-type modules that connect to a source and then transfer the data using JDBC batch routines, through to highly-specialized ones that leverage particular platform technologies. I typically start the prototyping phase of my ODI projects by selecting the simplest, most generic LKM and IKM I can find, and once I’ve got the mapping logic correct I then shift to one that uses more of the underlying database’s features – the docs also have a nice guide for making your KM selection. For example, I might assign the LKM SQL to Oracle knowledge module to the source table and the IKM SQL Control Append one to the target, like this:

In this case, ODI will first create a Java routine that extracts the rows of data from the file via a JDBC connection and loads that data into a staging table on the target database server. Then, that staging data will be integrated into the target table using a regular SQL INSERT statement.

Of course I could do this more efficiently using an Oracle External Table. Let’s select the LKM File to Oracle (EXTERNAL TABLE) load knowledge module and also change the IKM to IKM Oracle Incremental Update; note that this would only work if the target database server could see the file we’re loading in via this mechanism – if the file had to stay on a remote server then I’d have to stick with the IKM SQL Control Append and ODI would effectively ignore the request to use an Oracle External Table.

With ODI12c, things are a bit different due to the introduction of another type of knowledge module: “Component-Style” Knowledge Modules. Component-style KMs were introduced for a few reasons with ODI12c: they made migration of mappings and projects from OWB easier, as OWB mappings are made up of lots of arbitrarily-arranged mapping operator components that can be combined into all types of data flow, and they also made it possible for Oracle to create many more of these granular mapping components and use them across all technology types. For example, as Oracle’s David Allen talks about in this comment on one of our previous blog posts and in this follow-up blog of his own, Oracle could create a generic Table Function component-style KM and have it apply a SQL table function for Oracle sources, or run a Pig relation through an arbitrary Pig Latin script as I did in a more recent blog post.

These new component-style KMs come built-in to ODI12c which means that you don’t actually have to import any template-style KMs in to get started with a mapping; in the example below from my blog post earlier today on ODI12c and Oracle Streams, I can run the mapping I’ve just created by just selecting from the built-in component-style KMs that ship with ODI12c out-of-the-box.

So should we now use component-style KMs when creating mappings and avoid the old template-style ones? The docs don’t say this explicitly, but my impression is that component-style KMs are the way Oracle wants to take things forward. In most cases ODI will automatically select suitable component-style LKMs and IKMs when you create the physical mapping, and this is usually the best option; the only time I switch to a template-style IKM or LKM is if I’m using a platform technology that component-style KMs don’t yet cover, or I’m working on some edge case – by default, though, I go with component-style LKMs and IKMs. Of course you still need to import JKMs and other KM types, and presumably Oracle will extend component KMs beyond IKMs and LKMs over time, but that’s my recommendation for now.

So now we’ve covered what component-style KMs are, there’s another new KM concept that came along with ODI12c – “Multi-Connect” KMs. I used a multi-connect IKM File-Hive to Oracle knowledge module towards the end of another article, where I used Oracle Loader for Hadoop to export data out of Hadoop and into an Oracle Database; normally when you use an IKM the staging and target areas are on the same database, but in this case the staging table for the mapping was on the Hadoop (Hive) side and I therefore had to select LKM SQL Multi-Connection as the load knowledge module, whereupon I could then select IKM File-Hive to Oracle (OLH/OSCH) as the integration knowledge module.

Similarly, with our ODI12c mapping if I moved the staging area to the source database, or more commonly a separate “ETL-hub”-style database where the customer wants a more ETL (compared to ELT)-style integration setup, I can request that the staging location for the mapping moves to this hub database using a new feature in the logical mapping editor:

Then in the mapping I can add some transformations that take place on the ETL hub database, something you’d most probably do to avoid licensing ODI on their full data warehouse database server.

Then when I switch to the Physical mapping view I can see this additional execution unit with the transformations in it; I first select the LKM SQL Multi-Connect to bring the file data in, then I can select an LKM and IKM combination that supports multi-connect but uses an Oracle-specific technology (in this case, dblinks) to move the data from the ETL hub to the target database.

Or, if I didn’t like using dblinks or the ETL hub was on a non-Oracle platform, I could use the LKM SQL Multi-Connect to expose the ETL hub tables and then use one of the more generic IKMs in the list that also support multi-connections, bringing the data across to the target database using a Java routine and batching, with the data flowing through the agent rather than Oracle server-to-Oracle server.

So – a few thoughts on the new knowledge module setup in ODI12c, and what these new component KMs and multi-connect KMs are actually for. You can find more articles on ODI12c and data integration over on our blog, if you’re interested in reading more about ODI12c development and internals.