Rittman Mead at Oracle OpenWorld 2019

Oracle OpenWorld is coming soon! 16th-20th September at Moscone Center, San Francisco. It's Oracle's biggest conference, and I'll be representing Rittman Mead there with the talk "Become a Data Scientist", exploring how Oracle Analytics Cloud can speed up any analyst's path to data science. If you are an analyst looking to take your first steps in data science, or a manager trying to understand how to optimize your business analytics workforce, look no further: this presentation is your kickstarter!

For an introduction to the topic, have a look at my blog post series: episodes I, II and III.

If you'll be at OOW2019 and you see me around, don't hesitate to stop me! I’d be pleased to speak with you about OAC, Analytics, ML, and more important topics like food or wine as well!

Spatial Analytics Made Easy: Oracle Spatial Studio

Let's say we need to understand where our company should open a new shop. Most of the time the decision is driven by gut feeling and some knowledge of the market and client base, but what if we could have visual insight into where the high-density zones of customers not covered by a nearby shop are, like in the map below?

Well... welcome Oracle Spatial Studio!

Spatial Studio is Oracle's new tool for creating spatial analytics with a visual GUI. It uses Oracle Spatial database functions in the backend, exposed through an interface in line with the Oracle Analytics Cloud one. Let's see how it works!

QuickStart Installation

First of all we need to download Spatial Studio from the Oracle web page. For this initial test I downloaded the "Quick Start", a self-contained version pre-deployed in a lightweight application server. For more robust applications you may want to download the EAR file, deployable in WebLogic.

Once the file is downloaded and unzipped, we just need to verify we have a Java JDK 8 (update 181 or higher) under the hood, and we can immediately start Oracle Spatial Studio with the ./start.sh command.
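For reference, the whole bootstrap boils down to a few shell commands (the archive and folder names below are illustrative, so adjust them to your download):

# check the JDK version first: the Quick Start requires JDK 8 update 181 or higher
java -version

# unzip the Quick Start archive and launch Spatial Studio
unzip OracleSpatialStudio.zip -d spatial-studio
cd spatial-studio
./start.sh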

The command starts the service on the local machine, which can be accessed at https://localhost:4040/spatialstudio. By default the Oracle Spatial Studio Quick Start uses the HTTPS protocol with self-signed certificates, thus the first time you access the URL you will need to add a security exception in your browser. Configuration settings such as port, JVM parameters, host and HTTP/HTTPS protocol can be changed in the conf/server.json file.

We can then log in with the default credentials admin/welcome1.

The first step in the Spatial Studio setup is the definition of the metadata connection. This needs to point to an Oracle database with the Spatial option. For my example I initially used an Oracle Autonomous Data Warehouse, for which I had to drop in the wallet and specify the schema details.

Once logged in, the layout and workflows are very similar to Oracle Analytics Cloud, making the transition between the two very easy (more details on this later on). In the left menu we can access, as in OAC, Projects (visualizations), Data, Jobs and the Console.

In order to do spatial analysis we need to start from a Dataset; this can be an existing table or view, or we can upload local files. To create a Dataset, click on Create and then Dataset.

We then have three options:

  • Upload a Spreadsheet containing spatial information (e.g. Addresses, Postcodes, Regions, Cities etc)
  • Upload a Shapefile containing geometric locations and associated attributes.
  • Use spatial data from one of the existing connections; this can point to any connection containing spatial information (e.g. a table in a database containing customer addresses)

Sample Dataset with Mockaroo

I used Mockaroo, a realistic data generator service, to create two Excel files: one containing customers with related locations, and a second one with shops and their latitude and longitude. All I had to do was select which fields I wanted to include in my file and the related datatype.

For example, the shop dataset contained the following columns:

  • Id: as row number
  • Shop Name: as concatenation of Shop and the Id
  • Lat: Latitude
  • Long: Longitude
  • Dept: the Department (e.g. Grocery, Books, Health&Beauty)

Mockaroo offers a great service, with a free tier for datasets of fewer than 1,000 rows, which is useful for demo purposes. For each column defined, you can select from a good variety of column types. You can also define your own type using regular expressions!

Adding the Datasets to Oracle Spatial Studio

Once we have the two datasources in Excel format, it's time to start playing with Spatial Studio. We first need to upload the datasets, which we can do via Create and then Dataset, starting with Customer.xlsx. Once the file to upload is selected, Spatial Studio provides (like OAC) an overview of the dataset, together with options to change settings like the dataset name, target destination (metadata database) and column names.

Once we have changed the table name to TEST_CUSTOMERS and clicked on Submit, Spatial Studio starts inserting all the rows into the SPATIAL_STUDIO connection, a routine that can take seconds or minutes depending on the dataset volume. When the upload routine finishes, we can see the TEST_CUSTOMERS table appearing in the list of datasets.

We can immediately see the yellow warning sign next to the dataset name. It's there because we have a dataset with no geocoded information; we can solve this problem by clicking on the options button and then Prepare and Geocode Addresses.

Oracle Spatial Studio will suggest, based on the column content, some geo-type matches, e.g. City Name, Country and Postal Code. We can use the defaults or modify them if we feel they are wrong.

Once we click on Apply, the geocoding job starts.

Once the job ends, we can see the location icon next to our dataset name.

We can do the same for the Shops.xlsx dataset, uploading it and storing it as the TEST_SHOPS dataset.

Once the dataset is uploaded, I can geo-locate the information based on Latitude and Longitude by clicking on the options button and then selecting Prepare and Create Lon/Lat Index. Then I need to assign the Longitude and Latitude columns correctly and click on Ok.

Spatial Analytics

Now it's time to do some spatial analysis, so I click on Create and then Project, and by default I'm presented with an empty canvas.

The first step is to add a Map; I can do that by selecting the visualizations menu and then dragging the map onto the canvas.

The next step is to add some data by clicking on Data Elements and then Add Dataset.

I select the TEST_CUSTOMERS dataset and add it to the project, then I drag it on top of the map to visualize my customer data.

Oracle Spatial Studio offers several options to change the data visualization, like color, opacity and blur.

Now I can add the TEST_SHOPS dataset and visualize it on the map with the same set of steps followed before.

It's finally time for spatial analysis! Let's say, as per the initial example, that I want to know which of my customers don't have any shop within 200 km. To achieve that, I first need to create buffer areas of 200 km around the shops, by selecting the TEST_SHOPS datasource and then clicking on Spatial Analysis.

This opens a popup window listing a good number of spatial analyses; by clicking on the Transform tab I can see the Add a buffer of a specified distance option.

Unfortunately the buffer function is not available in ADW at the moment.

I had to rely on an Oracle Database Cloud Service 18c Enterprise Edition - High Performance (which includes the Spatial option) for my metadata storage and processing. A few takeaways:

  • Select 18c (or anything above 12.2): I hit the error ORA-00972: identifier is too long when importing the data into a 12.1 database, which (thanks, StackOverflow) is fixed as of 12.2.
  • High Performance: this edition includes the Spatial option.

Once I switched to the DBCS as metadata store, I could finally use the buffer function and set a buffer of 200 km around the shops.

The TEST_SHOPS_BUFFER is now visible under Analysis and can be added on top of the Map, correctly showing the 200 km buffer zone.

I can find out which customers have a shop within 200 km by creating an analysis and selecting the option "Return shapes within a specified distance of another".

In the parameters I select TEST_CUSTOMERS as the layer to be filtered, TEST_SHOPS as the layer to be used as the filter, and 200 km as the distance.

I can then visualize the result by adding the TEST_CUSTOMERS_WITHIN_DISTANCE layer in the map.

TEST_CUSTOMERS_WITHIN_DISTANCE contains the customers already "covered" by a shop within the 200 km range. What I may want to do now is remove them from my list of customers in order to analyse the remaining ones. How can I do that? Unfortunately, in this first Spatial Studio version there is no visual way of doing DATASET_A MINUS DATASET_B but, hey, it's just the first incarnation, and we can expect that type of function and many others to be available in future releases!

The following section is an in-depth look at the database functions that will probably be exposed in Spatial Studio's future versions, so if you're not interested, skip directly to the section named "Progressing in the Spatial Analysis".

A Look in the Database

Since we still want to achieve our goal of getting the customers not covered by a shop, we need to look a bit deeper at where the data is stored: in the database. This gives us two opportunities: to check how Spatial Studio works under the covers, and to freely use SQL to achieve our goal (DATASET_A MINUS DATASET_B).

First let's have a look at the tables created by Spatial Studio: we can see some metadata tables used by Studio, as well as the database representation of our two Excel files, TEST_CUSTOMERS and TEST_SHOPS.

Looking deeper at the metadata, we can also see a table named SGTECH$TABLE followed by an ID. That table collects the information from the geocoding job we executed against our customers dataset, which was located starting from postcodes and addresses. We can associate the table with the TEST_CUSTOMERS dataset with the following query against the SGTECH_OBJECT metadata table.

SELECT NAME, 
  JSON_VALUE(data, '$.gcHelperTableName') DATASET  
FROM SGTECH_OBJECT 
WHERE OBJECTTYPE='dataset'
AND NAME='TEST_CUSTOMERS';

The SGTECH$TABLEA004AA549110B928755FC05F01A3EF89 table contains, as expected, a row for each customer in the dataset, together with the related geometry if the geocoding was successful, and some metadata flags like GC_ATTEMPTED, GC_STATUS and GC_MATCH_CODE indicating the accuracy of the geocoding match.
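As a quick sanity check, we can aggregate on those flags to see how much of the dataset was geocoded successfully; a hypothetical query using the helper table name from above:

SELECT GC_STATUS,
       GC_MATCH_CODE,
       COUNT(*) AS ROWS_COUNT
FROM "SGTECH$TABLEA004AA549110B928755FC05F01A3EF89"
GROUP BY GC_STATUS, GC_MATCH_CODE
ORDER BY ROWS_COUNT DESC;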

What about all the analyses, like the buffer and the customers within distance? For each analysis, Spatial Studio creates a separate view with the SGTECH$VIEW prefix followed by an ID.

To understand which view refers to which analysis, we need to query the metadata table SGTECH_OBJECT with a query like:

SELECT NAME, 
  JSON_VALUE(data, '$.tableName') DATASET  
FROM SGTECH_OBJECT 
WHERE OBJECTTYPE='dataset';

With the following result

We know then that TEST_CUSTOMERS_WITHIN_DISTANCE can be accessed via the view SGTECH$VIEW0B2B36785A28843F74B58B3CCF1C51E3, and when checking its SQL we can clearly see that it executes the SDO_WITHIN_DISTANCE function using TEST_CUSTOMERS.GC_GEOMETRY, the TEST_SHOPS columns LONGITUDE and LATITUDE, and the distance=200 unit=KILOMETER parameters we set in the front end.

CREATE OR REPLACE FORCE EDITIONABLE VIEW "SPATIAL_STUDIO"."SGTECH$VIEW0B2B36785A28843F74B58B3CCF1C51E3" AS
SELECT
    ...
FROM
    "TEST_CUSTOMERS"   "t1",
    "TEST_SHOPS"       "t2"
WHERE
    sdo_within_distance("t1"."GC_GEOMETRY",
    spatial_studio.sgtech_ptf("t2"."LONGITUDE", "t2"."LATITUDE"), 
    'distance=200 unit=KILOMETER'
    ) = 'TRUE';

OK, now that we know which view contains the data, we can create a new view containing only the customers that are not within the 200 km distance:

CREATE VIEW TEST_CUSTOMERS_NOT_WITHIN_DISTANCE AS
SELECT
    t1.id            AS id,
    t1.first_name    AS first_name,
    t1.last_name     AS last_name,
    t1.email         AS email,
    t1.gender        AS gender,
    t1.postal_code   AS postal_code,
    t1.street        AS street,
    t1.country       AS COUNTRY,
    t1.city          AS city,
    t1.studio_id     AS studio_id,
    t1.gc_geometry   AS gc_geometry
FROM
    test_customers t1
WHERE
    id NOT IN (
        SELECT
            id
        FROM
            spatial_studio.sgtech$view0b2b36785a28843f74b58b3ccf1c51e3
    );
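For the record, the same filtering of IDs could also be expressed with Oracle's MINUS set operator (the very operation missing from Spatial Studio's GUI):

-- all customer IDs minus the ones already "covered" by a shop
SELECT id
FROM test_customers
MINUS
SELECT id
FROM spatial_studio.sgtech$view0b2b36785a28843f74b58b3ccf1c51e3;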

Progressing in the Spatial Analysis

In the previous section we created a view in the database named TEST_CUSTOMERS_NOT_WITHIN_DISTANCE containing the customers without a shop in a 200 km radius. We can now import it into Spatial Studio by creating a new dataset, selecting the connection to the database (in our case named SPATIAL_STUDIO) as the source and then the newly created TEST_CUSTOMERS_NOT_WITHIN_DISTANCE view.

The dataset is added, but it has a yellow warning icon next to it.

Spatial Studio asks us to define a primary key; we can do that by accessing the properties of the dataset, selecting the Columns tab, choosing which column acts as the primary key and validating it. After this step I can visualize these customers on a map.

What's next? Well, if I want to open a new shop, I may want to do so where there is a concentration of customers, which is easily visualized with Spatial Studio by changing the Render Style to Heatmap.

With the following output

We can clearly see some major concentrations around Dallas, Washington and Minneapolis. Focusing on Dallas, Spatial Studio also offers the option to simulate a new shop on the map and calculate the 200 km buffer around it. I can clearly see that adding a shop halfway between Oklahoma City and Dallas would allow me to cover both clusters of customers within the 200 km radius.

Please remember that this is a purely demonstrative analysis, and some of the choices, like the 200 km buffer, are deliberately simplistic. Other factors could come into play when choosing a shop location, like the revenue generated by some customers. And here comes the second beauty of Oracle Spatial Studio: we can export datasets as GeoJSON or CSV and include them in Data Visualization.

For example, I can export the data of TEST_CUSTOMERS_NOT_WITHIN_DISTANCE from Spatial Studio and include it in a Data Visualization project, blending it with the Sales related to the same customers.

I can now focus not only on the customers' position but also on other metrics, like Profit or Sales Amount, that I may have in other datasets. For another example of Oracle Spatial Studio and Data Visualization interoperability, check out this video from Oracle Analytics Senior Director Philippe Lions.

Conclusions

Spatial analytics made easy: this is the focus of Oracle Spatial Studio. Before, spatial queries were locked away at the database level, with limited access from an analyst's point of view. Now we have a visual tool with a simple GUI (in line with OAC) that easily enables spatial queries for everybody!

But this is only the first part of the story: the combination of capabilities achievable when mixing Oracle Spatial Studio and Oracle Analytics Cloud takes any type of analytics to the next level!

KSQL in Football: FIFA Women’s World Cup Data Analysis

One of the football (as per European terminology) highlights of the summer is the FIFA Women’s World Cup. France, Brazil, and the USA are the favourites, and this year Italy is present at the event for the first time in 20 years.

From a data perspective, the World Cup represents an interesting source of information. There's a lot of dedicated press coverage, as well as the standard social media excitement following any kind of big event.

The idea in this blog post is to mix information coming from two distinct channels: the RSS feeds of sport-related newspapers and Twitter feeds of the FIFA Women’s World Cup. The goal will be to understand how the sentiment of official news related to the two teams involved in the final compares to that of the tweets.

In order to achieve our targets, we'll use pre-built connectors available in Confluent Hub to source data from RSS and Twitter feeds, KSQL to apply the necessary transformations and analytics, Google’s Natural Language API for sentiment scoring, Google BigQuery for data storage, and Google Data Studio for visual analytics.

Data sources

The beginning of our journey starts with connecting to various data sources. Twitter represents the default source for most event streaming examples, and it's particularly useful in our case because it contains high-volume event streaming data with easily identifiable keywords that can be used to filter for relevant topics.

Ingesting Twitter data

Ingesting Twitter data is very easy with Kafka Connect, a framework for connecting Kafka with external systems. Among the pre-built connectors we can find Kafka Connect Twitter; all we need to do is install it using the Confluent Hub client:

confluent-hub install jcustenborder/kafka-connect-twitter:latest

To start ingesting the Twitter data, we need to create a configuration file containing the following important bits:

  • filter.keywords: We need to list all the keywords we are interested in, separated by a comma. Since we want to check tweets from the FIFA Women’s World Cup, we’ll use FIFAWWC, representing both the World Cup Twitter handle and the most common related hashtag.
  • kafka.status.topic: The topic that will be used to store the tweets we selected, twitter_avro in our case, since the connector output format is AVRO.
  • twitter.oauth: These are the Twitter credentials. More information can be found on Twitter's developer website.

After the changes, our configuration file looks like the following:

filter.keywords=FIFAWWC
kafka.status.topic=twitter_avro
twitter.oauth.accessToken=<TWITTER ACCESS TOKEN>
twitter.oauth.accessTokenSecret=<TWITTER ACCESS TOKEN SECRET>
twitter.oauth.consumerKey=<TWITTER ACCESS CUSTOMER KEY>
twitter.oauth.consumerSecret=<TWITTER CUSTOMER SECRET>

It's time to start it up! We can use the Confluent CLI load command:

confluent load twitter -d $TWITTER_HOME/twitter.properties

$TWITTER_HOME is the folder containing the configuration file. We can check the Kafka Connect status by querying the REST APIs with the following:

curl -s "http://localhost:8083/connectors/twitter/status" | jq '[.connector.state]'
[
  "RUNNING"
]

We can also check if all the settings are correct by consuming the AVRO messages in the twitter_avro topic with a console consumer:

confluent consume twitter_avro --value-format avro

And the result is, as expected, an event stream of tweets.

RSS feeds as another data source

The second data source that we'll use for our FIFA Women’s World Cup sentiment analytics is RSS feeds from sports-related newspapers. RSS feeds are useful because they share official information about teams and players, like results, match events, and injuries. RSS feeds should be considered neutral since they should only report facts. For this blog post, we'll use RSS feeds as a way to measure the average sentiment of the news. As in the Twitter case above, a prebuilt Kafka Connect RSS Source exists, so all we need to do is install it via the Confluent Hub client:

confluent-hub install kaliy/kafka-connect-rss:latest

Then, create a configuration file with the following important parameters:

  • rss.urls: This is a list of space-separated RSS feed URLs. For our Women’s World Cup example, we’ve chosen the following sources: La Gazzetta dello Sport, Transfermarkt, Eurosport, UEFA, The Guardian, Daily Mail, The Sun Daily, BBC
  • topic: The Kafka topic to write to, which is rss_avro in our case

The full configuration file looks like the following:

name=RssSourceConnector
tasks.max=1
connector.class=org.kaliy.kafka.connect.rss.RssSourceConnector
rss.urls=https://www.transfermarkt.co.uk/rss/news https://www.eurosport.fr/rss.xml https://www.uefa.com/rssfeed/news/rss.xml https://www.theguardian.com/football/rss https://www.dailymail.co.uk/sport/index.rss https://www.thesundaily.my/rss/sport http://feeds.bbci.co.uk/news/rss.xml https://www.gazzetta.it/rss/home.xml
topic=rss_avro

And again, we can start the ingestion of RSS feeds with the Confluent CLI:

confluent load RssSourceConnector -d $RSS_HOME/RssSourceConnector.properties

We can test the status of the Kafka connectors using this simple procedure, calling it like:

./connect_status.sh
RssSourceConnector  |  RUNNING  |  RUNNING
twitter             |  RUNNING  |  RUNNING
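The script itself is not reproduced here, but such a status check can be as simple as the following sketch, assuming Kafka Connect's REST API is on its default port 8083:

#!/bin/bash
# connect_status.sh: print each connector's name, connector state and first task state
for c in $(curl -s "http://localhost:8083/connectors" | jq -r '.[]'); do
  curl -s "http://localhost:8083/connectors/$c/status" | \
    jq -r '[.name, .connector.state, (.tasks[0].state // "n/a")] | join("  |  ")'
done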

We can see that both the RssSourceConnector and the twitter connector are up and running. We can then check the actual data with the console consumer:

confluent consume rss_avro --value-format avro

The output is, as expected, a stream of RSS news items.

Shaping the event streams

After ingesting the Twitter and RSS event streams into topics, it’s time to shape them with KSQL. Shaping the topics accomplishes two purposes:

  1. It makes the topics queryable from KSQL
  2. It defines additional structures that can be reused in downstream applications

The Twitter stream lands in Avro format with the fields listed in the related GitHub repo. We can easily declare a TWITTER_STREAM KSQL stream on top of TWITTER_AVRO with:

CREATE STREAM TWITTER_STREAM WITH (
KAFKA_TOPIC='TWITTER_AVRO',
VALUE_FORMAT='AVRO',
TIMESTAMP='CREATEDAT'
);

There is no need to define the individual fields in the event stream declaration, because they are already in AVRO and thus will be sourced from the Confluent Schema Registry. Schema Registry is the component within the Kafka ecosystem in charge of storing, versioning and serving the topics' Avro schemas. When a topic is in AVRO format, its schema is stored in the Schema Registry, where downstream applications (like KSQL in this case) can retrieve it and use it to "shape" the messages in the topic.
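For instance, assuming Schema Registry is running on its default port 8081 and uses the default <topic>-value subject naming, we can peek at the registered schema directly:

curl -s "http://localhost:8081/subjects/twitter_avro-value/versions/latest" | jq '.schema'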

The important bits of the above KSQL for our definition are:

  • KAFKA_TOPIC='TWITTER_AVRO': the definition of the source topic
  • VALUE_FORMAT='AVRO': the definition of the source topic format
  • TIMESTAMP='CREATEDAT': the Tweet's creation date, which is used as the event timestamp

We can now check that the field definitions have been correctly retrieved from the Schema Registry with:

DESCRIBE TWITTER_STREAM;

Or we can use the REST API:

curl -X "POST" "http://localhost:8088/ksql" \
-H "Content-Type: application/vnd.ksql.v1+json; charset=utf-8" \
-d $'{
"ksql": "DESCRIBE TWITTER_STREAM;",
"streamsProperties": {}
}'

The resulting fields section will be:

"fields": [
{
"name": "ROWTIME",
"schema": {
"type": "BIGINT",
"fields": null,
"memberSchema": null
},
{
"name": "ROWKEY",
"schema": {
"type": "STRING",
"fields": null,
"memberSchema": null
},
{
"name": "CREATEDAT",
"schema": {
"type": "BIGINT",
"fields": null,
"memberSchema": null
},
{
"name": "ID",
"schema": {
"type": "BIGINT",
"fields": null,
"memberSchema": null
},
...
}

The same applies to the RSS feed contained in rss_avro:

create stream RSS_STREAM
WITH(
KAFKA_TOPIC='rss_avro',
TIMESTAMP='DATE',
TIMESTAMP_FORMAT='yyyy-MM-dd''T''HH:mm:ss''Z''',
VALUE_FORMAT='AVRO'
);

The result will be:

ksql> describe RSS_STREAM;
Name                 : RSS_STREAM
Field   | Type
--------------------------------------------------------------
ROWTIME | BIGINT           (system)
ROWKEY  | VARCHAR(STRING)  (system)
FEED    | STRUCT<TITLE VARCHAR(STRING), URL VARCHAR(STRING)>
TITLE   | VARCHAR(STRING)
ID      | VARCHAR(STRING)
LINK    | VARCHAR(STRING)
CONTENT | VARCHAR(STRING)
AUTHOR  | VARCHAR(STRING)
DATE    | VARCHAR(STRING)
--------------------------------------------------------------

We can also use the URL manipulation functions added in KSQL 5.2 to extract useful information from the LINK column with:

CREATE STREAM RSS_STREAM_URL_DECODING AS
SELECT LINK,
URL_EXTRACT_HOST(LINK) HOST,
URL_EXTRACT_PATH(LINK) PATH,
URL_EXTRACT_PROTOCOL(LINK) PROTOCOL,
URL_EXTRACT_QUERY(LINK) QUERY_TEXT
FROM RSS_STREAM;

The result will be:

ksql> SELECT HOST, PATH, PROTOCOL, QUERY_TEXT FROM RSS_STREAM_URL_DECODING LIMIT 5;
www.dailymail.co.uk | /sport/football/article-6919585/Paul-Scholes-backs-Manchester-United-spring-surprise-Barcelona.html | https | ns_mchannel=rss&ns_campaign=1490&ito=1490
www.dailymail.co.uk | /sport/formulaone/article-6916337/Chinese-Grand-Prix-F1-race-LIVE-Shanghai-International-Circuit.html | https | ns_mchannel=rss&ns_campaign=1490&ito=1490
www.dailymail.co.uk | /sport/football/article-6919403/West-Brom-make-approach-Preston-manager-Alex-Neil.html | https | ns_mchannel=rss&ns_campaign=1490&ito=1490
www.dailymail.co.uk | /sport/football/article-6919373/Danny-Murphy-Jermaine-Jenas-fascinating-mind-games-thrilling-title-race.html | https | ns_mchannel=rss&ns_campaign=1490&ito=1490
www.dailymail.co.uk | /sport/football/article-6919215/Brazilian-legend-Pele-successfully-undergoes-surgery-remove-kidney-stone-Sao-Paulo-hospital.html | https | ns_mchannel=rss&ns_campaign=1490&ito=1490
Limit Reached
Query terminated

Sentiment analytics and Google’s Natural Language APIs

Text processing is a part of machine learning and is continuously evolving, with a huge variety of techniques and related implementations. Sentiment analysis is a branch of text analytics that aims to identify and quantify the affective states contained in a text corpus.

Natural Language APIs provide sentiment scoring as a service using two dimensions:

  1. Score: positive (score > 0) or negative (score < 0) emotion
  2. Magnitude: the amount of emotional content

For more information about sentiment score and magnitude interpretation, refer to the documentation.

Using Natural Language APIs presents various benefits:

  • Model training: Natural Language is a pre-trained model, ideal in situations where we don't have a set of already-scored corpora.
  • Multi-language: RSS feeds and tweets can be written in multiple languages. Google Natural Language is capable of scoring several languages natively.
  • API call: Natural Language can be called via an API, making the integration easy with other tools.

Sentiment scoring in KSQL with user-defined functions (UDFs)

The Natural Language APIs are available via client libraries in various languages, including Python, C#, and Go. For the purposes of this blog post, we'll be looking at the Java implementation, since Java is currently the language used to implement KSQL user-defined functions (UDFs). For more details on how to build a UD(A)F, please refer to How to Build a UDF and/or UDAF in KSQL 5.0 by Kai Waehner, which we'll use as the basis for the GSentiment class definition.

The basic steps to implementing Natural Language API calls in a UDF are the following:

  1. Add the Google Cloud Language JAR dependency to your project. If you are using Maven, you just need to add the following to your pom.xml:

<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-language</artifactId>
    <version>1.25.0</version>
</dependency>

  2. Create a new Java class called GSentiment.
  3. Import the required classes:

//KSQL UDF classes
import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;
//Google NL classes
import com.google.cloud.language.v1.LanguageServiceClient;
import com.google.cloud.language.v1.Sentiment;
import com.google.cloud.language.v1.AnalyzeSentimentResponse;
import com.google.cloud.language.v1.Document;
import com.google.cloud.language.v1.Document.Type;

  4. Define the Gsentiment class and add the Java annotation to it: @UdfDescription(name = "gsentiment", description = "Sentiment scoring using Google NL API") public class Gsentiment { ... }.
  5. Within the class, declare gsentiment as the method accepting a String text as input. As of now, UDFs can't return two output values, so we return the sentiment score and magnitude as an array of doubles: @Udf(description = "return sentiment scoring") public List<Double> gsentiment(String text) { ... }.
  6. Within the gsentiment method, invoke the Natural Language API sentiment scoring and cast the result as an array. Since a UDF can currently return only one value, we pipe the sentiment magnitude and score into an array of two elements (the full class is assembled after this list):
Double[] arr = new Double[2];
try (LanguageServiceClient languageServiceClient = LanguageServiceClient.create()) {
    Document document = Document.newBuilder()
            .setContent(text)
            .setType(Type.PLAIN_TEXT)
            .build();
    AnalyzeSentimentResponse response = languageServiceClient.analyzeSentiment(document);
    Sentiment sentiment = response.getDocumentSentiment();

    // note the ordering: the magnitude goes into [0] and the score into [1]
    arr[0] = (double) sentiment.getMagnitude();
    arr[1] = (double) sentiment.getScore();
}
catch (Exception e) {
    // fall back to a neutral result if the API call fails
    arr[0] = 0.0;
    arr[1] = 0.0;
}
return Arrays.asList(arr);

  7. As mentioned in How to Build a UDF and/or UDAF in KSQL 5.0, build an uber JAR that includes the KSQL UDF and any dependencies, and copy it to the KSQL extension directory (defined in the ksql.extension.dir parameter of ksql-server.properties).
  8. Add an environment variable GOOGLE_APPLICATION_CREDENTIALS pointing to the service account key that will be used to authenticate with Google services.
  9. Restart KSQL.
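Putting the pieces above together, the whole UDF might look like the following sketch (the package name is hypothetical, and remember the java.util imports for Arrays and List):

package com.example.ksql.udfs; // hypothetical package name

import java.util.Arrays;
import java.util.List;

import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;

import com.google.cloud.language.v1.AnalyzeSentimentResponse;
import com.google.cloud.language.v1.Document;
import com.google.cloud.language.v1.Document.Type;
import com.google.cloud.language.v1.LanguageServiceClient;
import com.google.cloud.language.v1.Sentiment;

@UdfDescription(name = "gsentiment", description = "Sentiment scoring using Google NL API")
public class Gsentiment {

    @Udf(description = "return sentiment scoring")
    public List<Double> gsentiment(String text) {
        Double[] arr = new Double[2];
        try (LanguageServiceClient client = LanguageServiceClient.create()) {
            // build a plain-text document and ask the NL API for its sentiment
            Document document = Document.newBuilder()
                    .setContent(text)
                    .setType(Type.PLAIN_TEXT)
                    .build();
            AnalyzeSentimentResponse response = client.analyzeSentiment(document);
            Sentiment sentiment = response.getDocumentSentiment();
            arr[0] = (double) sentiment.getMagnitude(); // magnitude in [0]
            arr[1] = (double) sentiment.getScore();     // score in [1], can be negative
        } catch (Exception e) {
            // neutral result if the API call fails
            arr[0] = 0.0;
            arr[1] = 0.0;
        }
        return Arrays.asList(arr);
    }
}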

At this point, we should be able to call the GSentiment UDF from KSQL:

ksql> SELECT GSENTIMENT(text) FROM TWITTER_STREAM LIMIT 5;
[0.10000000149011612, -0.10000000149011612]
[0.20000000298023224, 0.20000000298023224]
[0.5                , 0.10000000149011612]
[0.10000000149011612, 0.10000000149011612]
[0.0                , 0.0]
Limit Reached
Query terminated

As expected, the UDF returns an ARRAY of numbers. In order to get the sentiment score and magnitude in separate columns, we simply need to extract the relevant values:

ksql> SELECT GSENTIMENT(TEXT)[0] SCORE, GSENTIMENT(TEXT)[1] MAGNITUDE FROM TWITTER_STREAM LIMIT 5;
0.20000000298023224 | 0.10000000149011612
0.30000001192092896 | 0.10000000149011612
0.800000011920929   | 0.800000011920929
0.0                 | 0.0
0.30000001192092896 | 0.30000001192092896
Limit Reached
Query terminated

However, we should note that the Natural Language APIs are priced per API call and, in the above SQL, we are calling the API twice, once for each GSENTIMENT call. The above SQL will therefore cost us two API calls per document. To optimise the cost, we can create a new event stream, TWITTER_STREAM_WITH_SENTIMENT, which will physicalise the array in Kafka.

CREATE STREAM TWITTER_STREAM_WITH_SENTIMENT AS
SELECT
*,
GSENTIMENT(TEXT) AS SENTIMENT
FROM TWITTER_STREAM;

Next, parse the sentiment SCORE and MAGNITUDE from the TWITTER_STREAM_WITH_SENTIMENT event stream:

CREATE STREAM TWITTER_STREAM_WITH_SENTIMENT_DETAILS as
SELECT *,
SENTIMENT[0] SCORE,
SENTIMENT[1] MAGNITUDE
FROM TWITTER_STREAM_WITH_SENTIMENT;

With this second method, we optimise the cost with a single Natural Language API call per tweet. We can do the same with the RSS feeds by declaring the following (RSS_STREAM_FLATTENED being a version of RSS_STREAM with the nested FEED struct flattened out):

CREATE STREAM RSS_STREAM_WITH_SENTIMENT AS
SELECT
*,
GSENTIMENT(CONTENT) SENTIMENT
FROM RSS_STREAM_FLATTENED;
CREATE STREAM RSS_STREAM_WITH_SENTIMENT_DETAILS as
SELECT *,
SENTIMENT[0] SCORE,
SENTIMENT[1] MAGNITUDE
FROM RSS_STREAM_WITH_SENTIMENT;

Sink to Google BigQuery

The following part of this blog post focuses on pushing the dataset into Google BigQuery and visual analysis in Google Data Studio.

Pushing the data into BigQuery is very easy—just install the BigQuery Sink Connector with:

confluent-hub install wepay/kafka-connect-bigquery:latest

Next, configure it while applying the following parameters (amongst others):

  • topics: defines the topic to read (in our case RSS_STREAM_WITH_SENTIMENT_DETAILS and TWITTER_STREAM_WITH_SENTIMENT_DETAILS)
  • project: the name of the Google project that we’ll use for billing
  • datasets=.*=wwc: defines the BigQuery dataset name
  • keyfile=$GOOGLE_CRED/myGoogleCredentials.json: points to the JSON file containing Google's credentials (which in our case is the same file used in the Google Natural Language scoring)

Before starting the connector, we need to ensure the BigQuery dataset named wwc (as per the configuration file) exists; otherwise the connector will fail. To do so, we can log into BigQuery, select the same project defined in the configuration file, and click on CREATE DATASET. Then we’ll need to fill in all the details (more information about dataset creation in the Google documentation).
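Alternatively, assuming the Google Cloud SDK's bq CLI is installed and authenticated, the dataset can be created from the command line (replace <project-id> with your billing project):

# create the "wwc" dataset used by the BigQuery sink connector
bq mk --dataset --description "FIFA WWC sentiment data" <project-id>:wwc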

After creating the dataset, it’s time to start the connector with the Confluent CLI:

confluent load bigquery-connector -d $RSS_HOME/connectorBQ.properties

If the Kafka sink works, we should see one table per topic defined in the configuration file, which are RSS_STREAM_WITH_SENTIMENT_DETAILS and TWITTER_STREAM_WITH_SENTIMENT_DETAILS in our case.

Of course, we can query the data from BigQuery itself.
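For example, a quick check of the average sentiment per tweet language could look like the following; LANG is an assumed column name, as the exact columns depend on the connector output schema:

SELECT LANG,
       AVG(SCORE)     AS AVG_SCORE,
       AVG(MAGNITUDE) AS AVG_MAGNITUDE,
       COUNT(*)       AS TWEETS
FROM wwc.TWITTER_STREAM_WITH_SENTIMENT_DETAILS
GROUP BY LANG
ORDER BY TWEETS DESC;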

Visual Analysis in Google Data Studio

To start analysing the data in Google Data Studio, simply connect to the related console and select “Blank Report.”

We’ll be asked which data source to use for the project. Thus, we need to set up a connection to the wwc dataset by clicking on CREATE NEW DATASOURCE and selecting BigQuery as the connection. Then, select the project, dataset, and table (TWITTER_STREAM_WITH_SENTIMENT_DETAILS).


We can then review the list of columns, types, and aggregations before adding the data source to the report.

Finally, we can start creating visualisations like tiles to show record counts, line charts for the sentiment trend, and bar charts defining the most used languages.

A more advanced visualisation, like a scatterplot, shows the most common hashtags and the associated average sentiment value.

Below is a map visualising the average sentiment by country.

Analysing and comparing sentiment scores

Now that we have the two streams of data coming from Twitter and RSS feeds, we can do the analysis in KSQL and, in parallel, visually in Google Data Studio. We can, for example, examine the average sentiment over a timeframe and check how one source sentiment score compares to the other.

On the 27th of June, the quarterfinal match between Norway and England was played, with the result being that England beat Norway 3–0. Let’s check if we can somehow find significant similarities in the sentiment scoring of our dataset.

Starting with the Twitter feed, we can check all the tweets about ENGLAND and NORWAY by filtering on the related hashtag #NORENG. To obtain an overall score per team, I’m then assigning to each team all the tweets containing the country's full name, and aggregating the SENTIMENTSCORE with the following SQL:

CREATE STREAM TWITTER_NORWAY_ENGLAND AS
SELECT
CASE
WHEN UCASE(TEXT) LIKE '%NORWAY%' THEN SENTIMENTSCORE
END AS NORWAY_SENTIMENTSCORE,
CASE
WHEN UCASE(TEXT) LIKE '%ENGLAND%' THEN SENTIMENTSCORE
END AS ENGLAND_SENTIMENTSCORE
FROM TWITTER_STREAM_WITH_SENTIMENT_DETAILS
WHERE TEXT LIKE '%#NORENG%';

We can check the overall sentiment score associated with the two teams using:

SELECT
SUM(NORWAY_SENTIMENTSCORE)/COUNT(NORWAY_SENTIMENTSCORE) AS NORWAY_AVG_SCORE,
SUM(ENGLAND_SENTIMENTSCORE)/COUNT(ENGLAND_SENTIMENTSCORE) AS ENGLAND_AVG_SCORE
FROM TWITTER_NORWAY_ENGLAND
GROUP BY 1;

The GROUP BY 1 is necessary since KSQL currently requires a GROUP BY clause when using aggregation functions like SUM. Since we don’t aggregate by any column other than the window time, we can use the constant 1 as a fixed aggregator for the total. The result of the above query is in line with the final score: the winner (England) has an average sentiment score of 0.212, while the loser (Norway) has a score of 0.0979.

We can also look at the behaviour per hour with the TUMBLING KSQL windowing function:

SELECT
TIMESTAMPTOSTRING(WINDOWSTART(), 'EEE dd MMM HH') AS START_WINDOW,
SUM(NORWAY_SENTIMENTSCORE)/COUNT(NORWAY_SENTIMENTSCORE) AS NORWAY_AVG_SCORE,
SUM(ENGLAND_SENTIMENTSCORE)/COUNT(ENGLAND_SENTIMENTSCORE) AS ENGLAND_AVG_SCORE
FROM TWITTER_NORWAY_ENGLAND
WINDOW TUMBLING(SIZE 1 HOURS)
GROUP BY 1;

The query yields the following result:

Thu 27 Jun 16 | 0.18409091 | 0.16075758
Thu 27 Jun 17 | 0.14481481 | 0.13887096
Thu 27 Jun 18 | 0.14714406 | 0.12107647
Thu 27 Jun 19 | 0.07926398 | 0.34757579
Thu 27 Jun 20 | 0.10077705 | 0.13762544
Thu 27 Jun 21 | 0.08387538 | 0.17832865

We can clearly see that towards match time (19:00 BST), the ENGLAND average score has a spike, coinciding with England’s first goal in the third minute. We can see the same on the line chart in Google Data Studio.

We can do a similar exercise on top of the RSS feed stream, but first we need to filter it to get only FIFA Women’s World Cup 2019 data, since the predefined connector ingests all the news from the RSS sources without any topic filter. To do so, we create a new stream keeping only content containing WOMEN and CUP:

CREATE STREAM RSS_STREAM_WITH_SENTIMENT_DETAILS_WWC as
SELECT *
FROM RSS_STREAM_WITH_SENTIMENT_DETAILS
WHERE UCASE(CONTENT) LIKE '%WOMEN%'
AND UCASE(CONTENT) LIKE '%CUP%';

We can now analyse the overall RSS sentiment with:

SELECT SUM(SENTIMENTSCORE)/COUNT(SENTIMENTSCORE)
FROM RSS_STREAM_WITH_SENTIMENT_DETAILS_WWC
GROUP BY 1;

As before, SUM(SENTIMENTSCORE)/COUNT(SENTIMENTSCORE) calculates the average. We can then calculate the sentiment average for a selected team. Taking the same example of ENGLAND and NORWAY used previously, we just declare a stream extracting the sentiment for the two nations. For example:

CREATE STREAM RSS_NORWAY_ENGLAND AS
SELECT
CASE WHEN UCASE(CONTENT) LIKE '%NORWAY%' THEN SENTIMENTSCORE END NORWAY_SENTIMENTSCORE,
CASE WHEN UCASE(CONTENT) LIKE '%ENGLAND%' THEN SENTIMENTSCORE END ENGLAND_SENTIMENTSCORE
FROM RSS_STREAM_WITH_SENTIMENT_DETAILS_WWC;

Then, we can analyse the separate scoring with:

SELECT
SUM(NORWAY_SENTIMENTSCORE)/COUNT(NORWAY_SENTIMENTSCORE) NORWAY_AVG_SENTIMENT,
SUM(ENGLAND_SENTIMENTSCORE)/COUNT(ENGLAND_SENTIMENTSCORE) ENGLAND_AVG_SENTIMENT
FROM RSS_NORWAY_ENGLAND
WHERE ROWTIME > STRINGTODATE('2019-06-27', 'yyyy-MM-dd')
GROUP BY 1;

The result is an average sentiment of 0.0575 for Norway and 0.111 for England, again in line with the match result where England won 3–0.

We can also understand the variation of the sentiment over time by using KSQL windowing functions like TUMBLING:

SELECT
TIMESTAMPTOSTRING(WINDOWSTART(), 'EEE dd MMM HH') START_WINDOW,
SUM(NORWAY_SENTIMENTSCORE)/COUNT(NORWAY_SENTIMENTSCORE) NORWAY_AVG_SENTIMENT,
SUM(ENGLAND_SENTIMENTSCORE)/COUNT(ENGLAND_SENTIMENTSCORE) ENGLAND_AVG_SENTIMENT
FROM RSS_NORWAY_ENGLAND WINDOW TUMBLING(SIZE 1 HOURS)
WHERE ROWTIME >= STRINGTODATE('2019-06-27', 'yyyy-MM-dd')
GROUP BY 1;

This yields the following results:

Thu 27 Jun 17 | 0.12876364 | 0.12876364
Thu 27 Jun 18 | 0.24957054 | 0.24957054
Thu 27 Jun 19 | 0.15606978 | 0.15606978
Thu 27 Jun 20 | 0.09970317 | 0.09970317
Thu 27 Jun 21 | 0.00809077 | 0.00809077
Thu 27 Jun 23 | 0.41298701 | 0.12389610

As expected from this source, most of the scores for the two countries are the same, since the number of articles is limited and almost all articles mention both ENGLAND and NORWAY.


Strangely, as we can see in the graph above, the NORWAY sentiment score on the 27th of June at 11:00 pm GMT (so after the match ended) is much higher than the ENGLAND one.

We can look at the data closely with:

SELECT ROWKEY, NORWAY_SENTIMENTSCORE, ENGLAND_SENTIMENTSCORE
FROM RSS_NORWAY_ENGLAND WHERE TIMESTAMPTOSTRING(ROWTIME,'dd/MM HH') = '27/06 23';
https://www.theguardian.com/...      | 0.375974032  | 0.375974032
https://www.bbc.co.uk/.../48794550   | null         | -0.45428572
https://www.bbc.co.uk/.../48795487   | 0.449999988  | 0.449999988

We can see that NORWAY is associated with two articles: one from The Guardian with a positive 0.375 score, and one from the BBC with a positive 0.449 score. ENGLAND, on the other hand, is also associated with another BBC article, which has a negative -0.454 score.
We can also compare the hourly Twitter and RSS sentiment scores by creating two tables:

CREATE TABLE TWITTER_NORWAY_ENGLAND_TABLE AS
SELECT
TIMESTAMPTOSTRING(WINDOWSTART(), 'dd HH') START_WINDOW,
SUM(NORWAY_SENTIMENTSCORE)/COUNT(NORWAY_SENTIMENTSCORE) NORWAY_AVG_SCORE,
SUM(ENGLAND_SENTIMENTSCORE)/COUNT(ENGLAND_SENTIMENTSCORE) ENGLAND_AVG_SCORE
FROM TWITTER_NORWAY_ENGLAND
WINDOW TUMBLING(SIZE 1 HOURS)
GROUP BY 1;
CREATE TABLE RSS_NORWAY_ENGLAND_TABLE AS
SELECT
TIMESTAMPTOSTRING(WINDOWSTART(), 'dd HH') START_WINDOW,
SUM(NORWAY_SENTIMENTSCORE)/COUNT(NORWAY_SENTIMENTSCORE) NORWAY_AVG_SCORE,
SUM(ENGLAND_SENTIMENTSCORE)/COUNT(ENGLAND_SENTIMENTSCORE) ENGLAND_AVG_SCORE
FROM RSS_NORWAY_ENGLAND WINDOW TUMBLING(SIZE 1 HOURS)
WHERE ROWTIME >= STRINGTODATE('2019-06-27', 'yyyy-MM-dd')
GROUP BY 1;

The key of both the tables is the window start date, as we can see from:

ksql> select rowkey from RSS_NORWAY_ENGLAND_TABLE limit 1;
1 : Window{start=1561557600000 end=-}

We can then join the results together with the following statement:

SELECT A.START_WINDOW,
A.NORWAY_AVG_SCORE TWITTER_NORWAY_SCORE,
A.ENGLAND_AVG_SCORE TWITTER_ENGLAND_SCORE,
B.NORWAY_AVG_SCORE RSS_NORWAY_SCORE,
B.ENGLAND_AVG_SCORE RSS_ENGLAND_SCORE
FROM
TWITTER_NORWAY_ENGLAND_TABLE A JOIN
RSS_NORWAY_ENGLAND_TABLE B
ON A.ROWKEY = B.ROWKEY;

This yields the following result:

Thu 27 Jun 17 | 0.14481481 | 0.13887096 | 0.12876364 | 0.12876364
Thu 27 Jun 18 | 0.14714406 | 0.12107647 | 0.24957054 | 0.24957054
Thu 27 Jun 19 | 0.07926398 | 0.34757579 | 0.15606978 | 0.15606978
Thu 27 Jun 20 | 0.10077705 | 0.13762544 | 0.09970317 | 0.09970317
Thu 27 Jun 21 | 0.08387538 | 0.17832865 | 0.00809077 | 0.00809077

And we can see the end result similarly in a Data Studio line chart.

Interested in more?

If you’re interested in what KSQL can do, you can download the Confluent Platform to get started with the event streaming SQL engine for Apache Kafka. To help you get started, Rittman Mead provides a 30-day Kafka quick start package.

This article was originally posted on the Confluent blog.

Merging OBIEE 12c .RPD binary files directly in Git

Let's talk about OBIEE concurrent development!

Enabling concurrent development for the OBIEE RPD is a recurring theme in OBIEE blogs. Full support for RPD development with Gitflow has long been part of Rittman Mead's BI Developer Toolkit and is described in great detail in Minesh's blog post. What you are currently reading is a follow-up to Minesh's post, taking it one step further: instead of calling Python scripts to perform Gitflow steps, we want to perform all those steps directly from our Git client (including the ones performing a merge, like feature finish), be it the command line or a visual application like Sourcetree.

RPD versioning directly in Git - do we need that?

How is versioning directly in a Git client better than calling Python scripts? First of all, there is the convenience of using the same approach and the same tool for all content you need to version control. A Python script has to come with instructions for its use, whereas every developer knows how to use Git. Last but not least, a 3-way merge, which is used for Gitflow's feature finish, release finish and hotfix finish commands, requires three RPD files that need to be passed to the script in the right order. Doing merges in your Git client is quicker and less error prone.

What is a Git merge?

Before we proceed with discussing our options for merging OBIEE RPDs, let us quickly recap on how Git merges work.

There are two types of Git merges: Fast-forward Merges and 3-way Merges. Strictly speaking, Fast-forward Merges are no merges at all. If the base branch has not seen any changes whilst you worked on your feature branch, merging the feature back into the base simply means 'fast-forwarding' the base branch tip to your feature branch tip, i.e. your feature becomes the new base. That is allowed because the two branches have not diverged: the histories of the base and the feature branches form a single history line.

When the two branches have diverged, i.e. when the base has been modified by the time we want to merge our feature, a 3-way merge is the only option.

In the above diagram, feature 1 can be fast-forward merged whereas feature 2 must be 3-way merged into the develop branch.

Note that because a Fast-forward Merge is not an actual merge but rather a replacement, it is not relevant what content is being merged. A 3-way Merge however, depending on the content being merged, can be quite challenging or even impossible, and can result in merge conflicts that require manual resolution.

So... can Git 3-way merge RPDs?

The OBIEE RPD can be saved in two formats: a single binary .rpd file, or one or many .xml files (depending on which rpd-to-xml conversion method you use). The choice here seems obvious: it is common knowledge that Git cannot reliably 3-way merge binary files. So XML format it is. Or is it?

Like any other text file, Git certainly can merge XML files. But will it produce an XML that is still recognised as a consistent OBIEE RPD? Well, there are some OBIEE developer teams that have reported success with this approach. My own experience, even with the most trivial of RPD changes, shows that somewhere during the .xml to .rpd conversion, then introducing changes in the .rpd and in the end converting it back to .xml, the XML tags get reshuffled and sometimes their identifiers can change as well. (Equalising RPD objects is supposed to help with the latter.) I found no standard Git merge algorithm that would reliably and consistently perform an RPD merge for XML produced this way, be it a single large XML file or a collection of small XML files.

Fortunately, there is a better (and less risky) way.

Creating a Git custom merge driver

It is possible to create custom Git merge drivers and then assign them to specific file extensions (like .rpd) in the .gitattributes file, as described in the Git documentation. According to the guide, defining a custom merge driver in Git is really straightforward: just add a new entry to the .git/config file:

[merge "filfre"]
	name = feel-free merge driver
	driver = filfre %O %A %B %L %P
	recursive = binary

Here, filfre is the code name of the custom merge driver, feel-free merge driver is its descriptive name (hardly used anywhere) and the driver value is where we define the driver itself. It is a shell command for your operating system; typically it would call a shell script or a binary executable. It can be a java -jar execution or a python my-python-script.py call. The latter is what we want: we have already got a 3-way merge script for the OBIEE RPD in Rittman Mead's BI Developer Toolkit, as blogged by Minesh.

For the script to know about the content to be merged, it receives the following command line arguments: %O %A %B %L %P. These are the values that Git passes to the custom merge driver:

  • %O - this is the Base or the Original for the 3-way merge. If we are using Git Flow, this is the develop branch's version, from which our feature branch was created;
  • %A - this is the Current version for the 3-way merge. If we are using Git Flow, this is the feature branch that we want to merge back into develop;
  • %B - this is the Other or the Modified version of the 3-way merge. If we are using Git Flow, this is the develop branch as it is currently (diverged from the original Base), when we want to merge our feature branch back into it.

There are two more values, which we do not need and will ignore: %L is the conflict marker size, e.g. 7 for '>>>>>>>'; this is irrelevant for us because we are handling binary files. %P is the full path name where the merge result will be stored, again irrelevant for us, because Python is capable of getting full paths for the files it is handling, in case it needs them.

Creating a Git custom merge driver for OBIEE .rpd binary files

What we need here is a Python script that performs a 3-way RPD merge by calling the OBIEE commands comparerpd and patchrpd from the command line. Please note that OBIEE creates a fourth file as the output of the merge, whereas a Git merge driver is expected to overwrite the Current (%A) input with the merge result. In Python, that is quite doable.

Another important thing to note is that the script must return exit code 0 in case of success, and exit code 1 in case there were merge conflicts and an automatic merge could not be performed. Git determines the success of the merge solely based on the exit code.
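To make this concrete, here is a minimal, hypothetical sketch of such a driver, assuming the OBIEE comparerpd and patchrpd utilities are on the PATH and all three RPDs share the same password; the real script in the BI Developer Toolkit is more thorough:

#!/usr/bin/env python
# git-rpd-merge-driver.py: sketch of a custom merge driver for binary .rpd files
import shutil
import subprocess
import sys

RPD_PASSWORD = 'Admin123'  # placeholder: your real repository password

def main():
    # Git passes the Base (%O), Current (%A) and Modified (%B) file names
    base, current, modified = sys.argv[1], sys.argv[2], sys.argv[3]
    patch_file = current + '.patch.xml'
    merged_file = current + '.merged.rpd'

    # 1. create an XML patch describing the Base -> Modified changes
    if subprocess.call(['comparerpd', '-P', RPD_PASSWORD, '-C', modified,
                        '-W', RPD_PASSWORD, '-G', base, '-D', patch_file]) != 0:
        sys.exit(1)

    # 2. apply that patch on top of Current to produce the merged RPD
    if subprocess.call(['patchrpd', '-P', RPD_PASSWORD, '-C', current,
                        '-Q', RPD_PASSWORD, '-G', base, '-I', patch_file,
                        '-O', merged_file]) != 0:
        sys.exit(1)  # exit code 1 tells Git the merge failed

    # 3. Git expects the merge result to overwrite the Current (%A) file
    shutil.move(merged_file, current)
    sys.exit(0)  # exit code 0 tells Git the merge succeeded

if __name__ == '__main__':
    main()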

Once we have the Python script ready and have tested it standalone, we open our local Git repository folder where our OBIEE .rpd files will be versioned and open the file <repo root>/.git/config for editing and add the following lines to it:

[merge "rpdbin"]
    name = binary RPD file merge driver
    driver = python C:/Developer/bi_developer_toolkit/git-rpd-merge-driver.py %O %A %B

Our Python script expects three command line arguments, the names of the .rpd files: Base (%O), Current (%A) and Modified (%B). These will be temporary files, created by Git at run time.

Once the config file is modified, create a new file <repo root>/.gitattributes and add the following line to it:

*.rpd merge=rpdbin

This assumes that your binary RPD files will always have the extension .rpd. If they have a different extension, the custom merge driver will not be applied to them.
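To confirm that the attribute is picked up, we can ask Git directly (sample.rpd being any test file with the .rpd extension):

git check-attr merge sample.rpd
# expected output: sample.rpd: merge: rpdbin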

And that is it - we are done!

Note: if you see that the custom merge driver works from the Git command line tool but does not work in Sourcetree, you may need to run Sourcetree as Administrator.

Trying it out

We will use Sourcetree as our Git/Gitflow client; it is really good at visualising the versioning flow and shows the currently available Gitflow commands for the currently checked-out branch.

We will use the RPD from Oracle Sample Application v602 for OBIEE 12c 12.2.1.1.0 for our testing.

After initialising Gitflow in our Git repository, we add the out-of-the-box Sample Apps RPD to our repository's develop branch - that will be our Base.

Then we create two copies of it and modify each copy to introduce changes we would like to see merged. In the screenshots below, you can see Business Models and Databases renamed, but I also changed the content of those Business Models.

Repo 1:

Repo 2:

Now we create a new feature branch and overwrite the Base rpd it contains with our Repo 1 rpd.

As the next step, we check out the develop branch again and replace the Base rpd there with Repo 2 rpd.

Note that we need to make sure the develop branch is different from the original Base when we finish our feature. If the develop branch is the same as the original Base when we finish the feature, a fast-forward merge will be done instead, and our custom merge driver will not be applied.

The result should look like this in Sourcetree. You can see a fork, indicating that the develop and the feature8 branches have diverged:

We are ready to test our custom 3-way merge driver. In Sourcetree, from the Gitflow menu, select Finish Feature.

Confirm your intention to merge the feature back into develop.

If all goes as planned, Git will call your custom merge driver. In Sourcetree, click the Show Full Output checkbox to see the output from your script. In my script, I tagged all output with a [Git RPD Merge Driver] prefix (except the output coming from external functions). This is what my output looks like:

Now let us check the result: make sure the develop branch is checked out, then open the merged RPD in the Admin tool.

We can see that it worked - we can now do full Gitflow lifecycle for OBIEE .rpd files directly in Git.

But what if the merge fails?

If the merge fails, the feature branch will not be deleted and you will have to merge the .rpd files manually in the OBIEE Admin tool. Note that you can get the Current, the Modified and the Base .rpd files from Git. Once you are happy with your manual merge result, check out the develop branch and add it there.