The Oracle Business Intelligence Developers Guide, written by Mark Rittman

The definitive guide to developing BI applications on the Oracle Business Intelligence 11g platform. More »

With Oracle Exadata into the highest gears of speed!

Everything needs to get faster, but how fast ? (Dutch) More »

Share knowledge

Intelligence is all about knowledge. This website is dedicated sharing expertise on Oracle BI. More »

Meet the experts

Meet worlds best OBIEE experts during the BI Brighton event by Rittman Mead More »

 

Three Easy Ways to Stream Twitter Data into ElasticSearch

For the past few months a friend has been driving me crazy with all his praise for Splunk. He was going on about how easy it is to install, integrate different sources and build reports. I eventually started playing around to see if it could be used for a personal project I’m working on. In no time at all I understood what he was on about and I could see the value and ease of use of the product. Unfortunately the price of such a product means it is not a solution for everyone so I started looking around for alternatives and ElasticSearch caught my eye as a good option.

In this post we will focus on how we can stream Twitter data into ElasticSearch and explore the different options for doing so. Storing data in ElasticSearch is just the first step but you only gain real value when you start analysing this data. In the next post we will add sentiment analysis to our Twitter messages and see how we can analyse this data by building Kibana dashboards. But for now we will dig a bit deeper into the following three configuration options:

We will look at the installation and configuration of each of these and see how we can subscribe to twitter using the Twitter API. Data will then get processed, if required, and sent to Elasticsearch.

twitter_es_dataflow_shadow

Why Use Elasticsearch

Elasticsearch has the ability to store large quantities of semi-structured (JSON) data and provides the ability to quickly and easily query this data. This makes it a good option for storing Twitter data which is delivered as JSON and a perfect candidate for the project I’m working on.

Prerequisites

You will need a server to host all the required components. I used an AWS free tier (t2.micro) instance running Amazon Linux 64-bit. This post assumes you already have an elasticsearch cluster up and running and that you have a basic understanding of elasticsearch. There are some good blog posts, written by Robin Moffatt, which were very useful during the installation and configuration stages.

Twitter Stream API

In order to access the Twitter Streaming API, you need to register an application at http://apps.twitter.com. Once created, you should be redirected to your app’s page, where you can get the consumer key and consumer secret and create an access token under the “Keys and Access Tokens” tab. These values will be used as part of the configuration for all the sample configurations to follow.

The API allows two types of subscriptions. Either subscribe to specific keywords or to a user timeline (similar to what you see as a twitter user).

logstash

We'll start with logstash as this is probably the easiest one to configure and seems to be the recommended approach for integrating sources with elasticsearch in recent versions. At the time of writing this post, logstash only supported streaming based on keywords which meant it was not suitable for my needs but it’s still a very useful option to cover.

logstash installation

To install logstash you need to download the correct archive based on the version of elasticsearch you are running.

curl -O https://download.elasticsearch.org/logstash/logstash/logstash-x.x.x.tar.gz

Extract the archived file and move the extracted folder to a location of your choice

tar zxf logstash-x.x.x.tar.gz
mv logstash-x.x.x /usr/share/logstash/

logstash configuration

To configure logstash we need to provide input, output and filter elements. For our example we will only specify input (twitter) and output (elasticsearch) elements as we will be storing the full twitter message.

For a full list of logstash twitter input settings see the official documentation.

Using your favourite text editor, create a file called twitter_logstash.conf and copy the below text. Update the consumer_key, consumer_secret, oath_token and oath_token_secret values with the values from your Twitter Stream App created earlier.

input {
    twitter {
        # add your data
        consumer_key => "CONSUMER_KEY_GOES_HERE"
        consumer_secret => "CONSUMER_SECRET_GOES_HERE"
        oauth_token => "ACCESS_TOKEN_GOES_HERE"
        oauth_token_secret => "ACCESS_TOKEN_SECRET_GOES_HERE"
        keywords => ["obiee","oracle"]
        full_tweet => true
    }
}
output {
    elasticsearch_http {
        host => "localhost"
        index => "idx_ls"
        index_type => "tweet_ls"
    }
}

This configuration will receive all tweets tagged with obiee or oracle and store them to an index called idx_ls in elasticsearch.

To run logstash, execute the following command from the installed location

bin/logstash -f twitter_logstash.conf

If you subscribed to active twitter tags you should see data within a few seconds. To confirm if your data is flowing you can navigate to http://server_ip:9200/_cat/indices?v which will show you a list of indices with some relevant information.

es_index_list_shadow

With this easy configuration you can get Twitter data flowing in no time at all.

Twitter River Plugin

Next we will look at using the River Plugins to stream Twitter data. The only reason to use this approach over logstash is if you want to subscribe to a user timeline. Using this feature will show the same information as the Twitter application or viewing your timeline online.

Twitter River Plugin installation

Before installing the plugin you need to determine which version is compatible with your version of elasticsearch. You can confirm this at https://github.com/elasticsearch/elasticsearch-river-twitter and selecting the correct one.

To install you need to use the elasticsearch plugin installation script. From the elasticsearch installation directory, execute:

bin/plugin -install elasticsearch/elasticsearch-river-twitter/x.x.x

Then restart your Elasticsearch service.

Twitter River Plugin configuration

To configure the twitter subscriber we will again create a .conf file with the necessary configuration elements. Create a new file called twitter_river.conf and copy the following text. As with logstash, update the required fields with the values from the twitter app created earlier.

{
  "type": "twitter",
  "twitter" : {
        "oauth" : {
            "consumer_key" : "CONSUMER_KEY_GOES_HERE",
            "consumer_secret" : "CONSUMER_SECRET_GOES_HERE",
            "access_token" : "ACCESS_TOKEN_GOES_HERE",
            "access_token_secret" : "ACCESS_TOKEN_SECRET_GOES_HERE"
        },
        "filter" : {
            "tracks" : ["obiee", "oracle"]
        },
        "raw" : true,
        "geo_as_array" : true
  },
  "index": {
    "index": "idx_rvr",
    "type": "tweet_rvr",
    "bulk_size": 100,
    "flush_interval": "5s"
  }
}

This configuration is identical to the logstash configuration and will receive the same tweets from twitter. To subscribe to a user timeline instead of keywords, replace the filter configuration element:

"filter" : {
      "tracks" : ["obiee", "oracle"],
},

with a user type element

"type" : "user",

To start the plugin you need to execute the following from a terminal window.

curl -XPUT localhost:9200/_river/idx_rvr/_meta -d @twitter_river.conf

Depending on how active your subscribed tags are you should see data within a few seconds in elasticsearch. You can again navigate to http://server_ip:9200/_cat/indices?v to confirm if your data is flowing. Note this time that you should see two new rows, one index called _river and the other idx_rvr. idx_rvr is where your twitter data will be stored.

To stop the plugin (or change between keywords and user timeline), execute the following from a terminal window:

curl -XDELETE 'localhost:9200/_river/idx_rvr';

Tweepy

Finally we will look at the most flexible solution of them all. It is a bit more complicated to install and configure but, given what you gain, the small amount of extra time spent is well worth the effort. Once you have Tweepy working you will be able to write you own python code to manipulate the data as you see fit.

Tweepy installation

As Tweepy is a python package we will use pip to install the required packages. If you don't have pip installed. Execute one of the following, depending on your linux distribution.

yum -y install python-pip

or

apt-get install python-pip

Next we will install the Tweepy and elasticsearch packages

pip install tweepy
pip install elasticsearch

Tweepy configuration

Create a new file called twitter_tweepy.py and copy the following text to the file

import tweepy
import sys
import json
from textwrap import TextWrapper
from datetime import datetime
from elasticsearch import Elasticsearch


consumer_key="CONSUMER_KEY_GOES_HERE"
consumer_secret="CONSUMER_SECRET_GOES_HERE"

access_token="ACCESS_TOKEN_GOES_HERE"
access_token_secret="ACCESS_TOKEN_SECRET_GOES_HERE"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

es = Elasticsearch()

class StreamListener(tweepy.StreamListener):
    status_wrapper = TextWrapper(width=60, initial_indent='    ', subsequent_indent='    ')

    def on_status(self, status):
        try:
            #print 'n%s %s' % (status.author.screen_name, status.created_at)

            json_data = status._json
            #print json_data['text']

            es.create(index="idx_twp",
                      doc_type="twitter_twp",
                      body=json_data
                     )

        except Exception, e:
            print e
            pass

streamer = tweepy.Stream(auth=auth, listener=StreamListener(), timeout=3000000000 )

#Fill with your own Keywords bellow
terms = ['obiee','oracle']

streamer.filter(None,terms)
#streamer.userstream(None)

As with the river plugin you can subscribe to the user timeline by changing the subscription type. To do this replace the last line in the script with

streamer.userstream(None)

To start the listener you need to execute the python file

python twitter_tweepy.py

Navigate to the elasticsearch index list again to ensure you are receiving data.

Conclusion

Getting Twitter data into Elasticsearch is actually pretty simple. Logstash is by far the easiest one to configure and if subscribing to keywords is your only requirement it should be the preferred solution. Now that we have the foundation in place, in the next post we will have a look at how we can enhance this data by adding sentiment analysis and how we can use this data to make decisions.

Dynamic Dashboard Documentation – Flexible Page Content Control in OBIEE

There is a need within the OBI user community to have an OBIEE dashboard page where users can update the page content on one or more dashboard pages with metadata (information about the data contained on the dashboard) in free form. What we at Art of BI have done to fill that need is to […]

The post Dynamic Dashboard Documentation – Flexible Page Content Control in OBIEE appeared first on Art of Business Intelligence Blog.

Dynamic Dashboard Documentation – Flexible Page Content Control in OBIEE

There is a need within the OBI user community to have an OBIEE dashboard page where users can update the page content on one or more dashboard pages with metadata (information about the data contained on the dashboard) in free form. What we at Art of BI have done to fill that need is to […]

The post Dynamic Dashboard Documentation – Flexible Page Content Control in OBIEE appeared first on Art of Business Intelligence Blog.

Combining Spark Streaming and Data Frames for Near-Real Time Log Analysis & Enrichment

A few months ago I posted an article on the blog around using Apache Spark to analyse activity on our website, using Spark to join the site activity to some reference tables for some one-off analysis. In this article I’ll be taking an initial look at Spark Streaming, a component within the overall Spark platform that allows you to ingest and process data in near real-time whilst keeping the same overall code-based as your batch-style Spark programs.

NewImage

Like regular batch-based Spark programs, Spark Streaming builds on the concept of RDDs (Resilient Distributed Datasets) and provides an additional high-level abstraction called a “discretized stream” or DStream, representing a continuous stream of RDDs over a defined time period. In the example I’m going to create I’ll use Spark Streaming’s DStream feature to hold in-memory the last 24hrs worth of website activity, and use it to update a “Top Ten Pages” Impala table that’ll get updated once a minute.

NewImage

To create the example I started with the Log Analyzer example in the set of DataBricks Spark Reference Applications, and adapted the Spark Streaming / Spark SQL example to work with our CombinedLogFormat log format that contains two additional log elements. In addition, I’ll also join the incoming data stream with some reference data sitting in an Oracle database and then output a parquet-format file to the HDFS filesystem containing the top ten pages over that period.

The bits of the Log Analyzer reference application that we reused comprise of two scripts that compile into a single JAR file; a script that creates a Scala object to parse the incoming CombinedLogFormat log files, and other with the main program in. The log parsing object contains a single function that takes a set of log lines, then returns a Scala class that breaks the log entries down into the individual elements (IP address, endpoint (URL), referrer and so on). Compared to the DataBricks reference application I had to add two extra log file elements to the ApacheAccessLog class (referer and agent), and add some code in to deal with “-“ values that could be in the log for the content size; I also added some extra code to ensure the URLs (endpoints) quoted in the log matched the format used in the data extracted from our WordPress install, which stores all URLs with a trailing forward-slash (“/“).

package com.databricks.apps.logs
case class ApacheAccessLog(ipAddress: String, clientIdentd: String,
 userId: String, dateTime: String, method: String,
 endpoint: String, protocol: String,
 responseCode: Int, contentSize: Long, 
 referer: String, agent: String) {
}
 
object ApacheAccessLog {
val PATTERN = """^(S+) (S+) (S+) [([wd:/]+s[+-]d{4})] "(S+) (S+) (S+)" (d{3}) ([d-]+) "([^"]+)" "([^"]+)"""".r
def parseLogLine(log: String): ApacheAccessLog = {
 val res = PATTERN.findFirstMatchIn(log)
 if (res.isEmpty) {
 ApacheAccessLog("", "", "", "","", "", "", 0, 0, "", "")
 }
 else {
 val m = res.get
 val contentSizeSafe : Long = if (m.group(9) == "-") 0 else m.group(9).toLong
 val formattedEndpoint : String = (if (m.group(6).charAt(m.group(6).length-1).toString == "/") m.group(6) else m.group(6).concat("/"))
 
 ApacheAccessLog(m.group(1), m.group(2), m.group(3), m.group(4),
 m.group(5), formattedEndpoint, m.group(7), m.group(8).toInt, contentSizeSafe, m.group(10), m.group(11))
 }
 }
}

The body of the main application script starts by importing Scala classes for Spark, Spark SQL and Spark Streaming, and then defines two variable that determine the amount of log data the application will consider; WINDOW_LENGTH (86400 milliseconds, or 24hrs) which determines the window of log activity that the application will consider, and SLIDE_INTERVAL, set to 60 milliseconds or one minute, which determines how often the statistics are recalculated. Using these values means that our Spark Streaming application will recompute every minute the top ten most popular pages over the last 24 hours.

package com.databricks.apps.logs.chapter1
import com.databricks.apps.logs.ApacheAccessLog
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.{StreamingContext, Duration}
object LogAnalyzerStreamingSQL {
 val WINDOW_LENGTH = new Duration(86400 * 1000)
 val SLIDE_INTERVAL = new Duration(60 * 1000)

In our Spark Streaming application, we’re also going to load-up reference data from our WordPress site, exported and stored in an Oracle database, to add post title and post author values to the raw log entries that come in via Spark Streaming. In the next part of the script then we define a new Spark context and then a Spark SQL context off-of the base Spark context, then create a Spark SQL data frame to hold the Oracle-sourced WordPress data to later-on join to the incoming DStream data – using Spark’s new Data Frame feature and the Oracle JDBC drivers that I separately download off-of the Oracle website, I can pull in reference data from Oracle or other database sources, or bring it in from a CSV file as I did in the previous Spark example, to supplement my raw incoming log data. 

def main(args: Array[String]) {
 val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
 val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._
 
 val postsDF = sqlContext.load("jdbc", Map(
 "url" -> "jdbc:oracle:thin:blog_refdata/password@bigdatalite.rittmandev.com:1521:orcl",
 "dbtable" -> "BLOG_REFDATA.POST_DETAILS"))
 
 postsDF.registerTempTable("posts")

Note also how Spark SQL lets me declare a data frame (or indeed any RDD with an associated schema) as a Spark SQL table, so that I can later run SQL queries against it – I’ll come back to this at the end).

Now comes the first part of the Spark Streaming code. I start by defining a new Spark Streaming content off of the same base Spark context that I created the Spark SQL one off-of, then I use that Spark Streaming context to create a DStream that reads newly-arrived files landed in an HDFS directory  – for this example I’ll manually copy the log files into an “incoming” HDFS directory, whereas in real-life I’d connect Spark Streaming to Flume using FlumeUtils for a more direct-connection to activity on the webserver. 

val streamingContext = new StreamingContext(sc, SLIDE_INTERVAL)
val logLinesDStream = streamingContext.textFileStream("/user/oracle/rm_logs_incoming")

Then I call the Scala “map” transformation to convert the incoming DStream into an ApacheAccessLog-formatted DStream, and cache this new DStream in-memory. Next and as the final part of this stage, I call the Spark Streaming “window” function which packages the input data into in this case a 24-hour window of data, and creates a new Spark RDD every SLIDE_INTERVAL – in this case 1 minute – of time.

val accessLogsDStream = logLinesDStream.map(ApacheAccessLog.parseLogLine).cache()
val windowDStream = accessLogsDStream.window(WINDOW_LENGTH, SLIDE_INTERVAL)

Now that Spark Streaming is creating RDDs for me to represent all the log activity over my 24 hour period, I can use the .foreachRDD control structure to turn that RDD into its own data frame (using the schema I’ve inherited from the ApacheAccessLog Scala class earlier on), and filter out bot activity and references to internal WordPress pages so that I’m left with actual page accesses to then calculate the top ten list from.

windowDStream.foreachRDD(accessLogs => {
 if (accessLogs.count() == 0) {
 println("No logs received in this time interval")
 } else {
 accessLogs.toDF().registerTempTable("accessLogs")
// Filter out bots 
 val accessLogsFilteredDF = accessLogs
 .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*"))
 .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin|wp-includes|favicon.ico|xmlrpc.php|wp-comments-post.php).*")).toDF()
 .registerTempTable("accessLogsFiltered")

Then, I use Spark SQL’s ability to join tables created against the windowed log data and the Oracle reference data I brought in earlier, to create a parquet-formatted file containing the top-ten most popular pages over the past 24 hours. Parquet is the default storage format used by Spark SQL and is suited best to BI-style columnar queries, but I could use Avro, CSV or another file format If I brought the correct library imports in.

val topTenPostsLast24Hour = sqlContext.sql("SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) as total FROM accessLogsFiltered a JOIN posts p ON a.endpoint = p.POST_SLUG GROUP BY p.POST_TITLE, P.POST_AUTHOR ORDER BY total DESC LIMIT 10 ") 
 
 // Persist top ten table for this window to HDFS as parquet file
 
 topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet", "parquet", SaveMode.Overwrite) 
 }
 })

Finally, the last piece of the code starts-off the data ingestion process and then continues until the process is interrupted or stopped.

streamingContext.start()
    streamingContext.awaitTermination()
  }
}

I can now go over to Hue and move some log files into the HDFS directory that the Spark application is running on, like this:

file_upload

Then, based on the SLIDE_INTERVAL I defined in the main Spark application earlier on (60 seconds, in my case) the Spark Streaming application picks up the new files and processes them, outputting the results as a Parquet file back on the HDFS filesystem (these two screenshots should display as animated GIFs)

spark_processing

So what to do with the top-ten pages parquet file that the Spark Streaming application creates? The most obvious thing to do would be to create an Impala table over it, using the schema metadata embedded into the parquet file, like this:

CREATE EXTERNAL TABLE rm_logs_24hr_top_ten <br />LIKE PARQUET '/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet/part-r-00001.parquet'<br /> STORED AS PARQUET<br /> LOCATION '/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet';

Then I can query the table using Hue again, or I can import the Impala table metadata into OBIEE and analyse it using Answers and Dashboards.

NewImage

So that’s a very basic example of Spark Streaming, and I’ll be building on this example over the new few weeks to add features such as persistent storing of all processed data, and classification and clustering the data using Spark MLlib. More importantly, copying files into HDFS for ingestion into Spark Streaming adds quite a lot of latency and it’d be better to connect Spark directly to the webserver using Flume or even better, Kafka – I’ll add examples showing these features in the next few posts in this series.

Oracle Data Integrator to load Oracle BICS and Oracle Storage Cloud Service

It is known that Oracle focuses its sales on cloud this year and in the BI world, we have seen the introduction of Oracle Business Intelligence Cloud Service – BICS – that we already covered on the blog. Another great product to store files and unstructured data is Oracle Storage Cloud Service, that can also be used as a staging area before loading a BICS schema. To provision our BI system in the cloud or to backup our files, it’s important to find a reliable and maintainable way to move on-premise data to the cloud. And when we speak about reliable and maintainable way to move data, I can’t help but think about Oracle Data Integrator. Luckily, Ayush Ganeriwal wrote two excellent blog posts explaining how to use ODI to load these two cloud services! And he even gave away the Knowledge Modules, ODI Open Tools, Jar files and other things he developed to make it work.

Load data to Oracle Storage Cloud Service with a package.

Oracle Storage Cloud Service is used to store files and unstructured data in the cloud, very much like Amazon S3. In the case of a company that has moved to the cloud, they need some space to store their files that will be used by other cloud services. Oracle Storage Cloud Service is the place to store all these files. If you want to use it as a cloud backup of your local filesystem, you might even choose Oracle Storage Cloud Archive Service which has cheaper storage but cost more when retrieving or editing data, similarly to Amazon Glacier .

Oracle Storage Cloud Service offers a RESTful web service API to load data as well as a Java Library directly wrapping this API and supporting client-side encryption. When it comes to load data, what could be better than using the tool we already use for all the rest? In this blog post , Ayush details how he created an ODI OpenTool that can load data to the Oracle Storage Cloud Service using the Java Library.

We first need to download the Java Library and place it in our ODI Studio or Agent classpaths. Below are the default locations for these classpaths, but we can also specify additional classpath if we want. Don’t forget to close ODI Studio and shutdown the agent(s) before adding the files.

ODI Studio on Unix : ~/.odi/oracledi/userlib
ODI Studio on Windows : %APPDATA%odioraclediuserlib
12c Standalone Agent : /odi/agent/lib
11g Standalone Agent : /oracledi/agent/drivers/

For a JEE agent, we need to create a server template as described in the doc : https://docs.oracle.com/middleware/1212/odi/ODIDG/setup_topology.htm#CHDHGIGI.

In the same folder, we can also drop the Jar file for the Open Tool that Ayush created and posted on java.net (last link in the folder at the moment of writing). Remember that all the drivers, libraries and Open Tools have to be present in every ODI Studio or agent install.

We can now reopen ODI Studio as we need to register the new Open Tool in the master repository so everyone can use it. In the ODI Menu, click on Add Remove Open  Tools and then on the googles icon to search for the Open Tools. In our case, the class is under the package oracle.odi.rest.

Adding an ODI Open Tool

Select the OracleStorageCloudGet tool and click the Plus icon to add it. There we can see the description of the tool, the command, the parameters and the icon used. Let’s add the OracleStorageCloudPut as well and click OK.

Open Tool description

That’s it! We can now see two new ODI Open Tools under the Plugins category in the Package Toolbox. We can add it to the package and fill all the parameters or copy/paste the command to use it in an ODI Procedure.

Open Tools in a package

Load data to Oracle Business Intelligence Cloud Service with a mapping.

BICS offers a Data Loader which is actually an Apex application taking Excel, CSV, text or zip files as input. But this is more a manual process used when working with sample data or really small dataset. If we plan to use BICS in a production environment, we need something more resilient that can be scheduled and that handles errors. We could use Data Sync but maybe you don’t want to introduce a new tool. Luckily, BICS also provides a RESTful web service API we can use to load data. Again, ODI is the perfect tool to create our own reusable components to integrate with new sources or targets. And once again, Ayush did all the job for us as described in this article.

The first step is similar to what we did previously, we need make all the drivers and libraries available for ODI Studio and the agent. The steps in the two Knowledge Modules contains Jython code, calling some java methods from the odi-bics.jar file. These methods – just like the Oracle Storage Cloud Service java library – will use the Jersey Libraries to do the RESTful web service calls so we will need them as well. Basically, just take all the jar files from the archive you can find on java.net under the name “RKM and IKM for Oracle BI Cloud Service”. As mentioned above, we need to add it to the classpath of ODI Studio and the agent installs and restart them.

The article mentions that we need to create a new Technology for BI Cloud Service but an XML export is already created in the archive. Go in the Topology, right click on Technology and import the xml file.

Adding a Technology

We can also import the two Knowledge Modules, either globally or under a single project. Ayush wrote one RKM to reverse-engineer the tables from our BICS schema and one Multi-Connections IKM to integrate the data from the on-premise database to the BICS schema in the cloud. I had to tweak the IKM a little bit but it might be linked to my particular setup. The steps were not displayed in the order it was set in the XML file and I had to reorder it properly to place the drop before the create and truncate. Probably a bug, I will investigate that later.

Reordering the steps in the KM

We are done with the install of new components and we can now set up the topology. First create a Dataserver with the newly imported technology and provide a login/password and the URL to the BICS instance as the Service URL. From my testing and unlike Ayush’s screenshot, it should start with https and should not include the port number. Of course, we need to create a physical schema where we need to provide our Identity Domain in both Tenant fields. We can leave the Database fields empty. The last step to do in the Topology is to create a Logical schema and associate it to the Physical schema through a context.

topo_DS

We can then switch to the Designer tab and create a new Model with BI Cloud Service technology and that Logical Schema. On the Reverse Engineer tab we select Customized and the RKM BI Cloud Service. We can also specify a mask to restrict the metadata import to specific table names before hitting the Reverse Engineer button at the top. I set the mask to DIM_LOC% and only my DIM_LOCATION as been reverse engineered from the BICS schema.

Model and reverse engineering

Finally, we can create a mapping. In this example I load two target tables. DIM_LOCATIONS (plural) is sitting on an on-premise Oracle Database for users accessing OBIEE from our HQ while DIM_LOCATION (singular) is in the schema linked to the our BICS instance used by remote users. There is nothing special here, except that I unselected the insert and update checkboxes for my surrogate key (LOCATION_SK) because I asked BICS to automatically populate it when I created the table – under the scene, a sequence is created and a trigger added to the table to populate the field with the next value of the sequence on each insert.

ODI-BICS-mapping

On the Physical tab, the LKM should be set on LKM SQL Multi-connect because we will delegate the data transfer to the Multi-Connections IKM. This IKM is the one we imported : IKM SQL to BI Cloud Service. I choose to enable the truncate option here.

mapping_2DS

Let’s hit the execute button and watch the result in the operator and in our BICS schema.

bics_result_DS

The two articles demonstrate one more time how easy it is to plug a new technology within ODI. This tool is a wonderful framework that can be used to move data from and to any technologies, even if it’s Big Data or Cloud as we have seen here. By having every integration jobs happening in the same place, we can have an easier maintenance, a better monitoring and we can schedule all the jobs together. It’s also way easier to see the big picture in our projects.

If you have any question about this post feel free to reach me, we love that kind of interesting challenges at Rittman Mead.

Thanks a lot to Ayush for such good components and libraries he provided. I’m glad I will share the stage with him and a fellow ODI expert, Holger Friedrich, at the Oracle Open World. We will speak about Best Practices for Development Lifecycle Management . Come to see us and say hi!