How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau
I don't trust statistics and personally believe that at least 74% of them are wrong.... but I bet nearly 100% of people with any interest in fantasy (or just any) TV shows are watching the 7th series of Game of Thrones (GoT) by HBO.
If you are one of those, join me in the analysis of the latest tweets regarding the subject. Please also be aware that, if you're not up to date with the latest episode, some spoilers may be revealed by this article. My suggestion is to go back, watch the episodes first and then come back here for the analysis!
If you aren't part of the above group then ¯\_(ツ)_/¯. Still, this post contains a lot of details on how to perform analysis on any set of tweets with Tableau and BigQuery, together with Kafka source and sink configurations. I'll leave it to you to find another topic to put all this into practice.
Overall Setup
As described in my previous post on analysing Wimbledon tweets, I've used Kafka for the tweet extraction phase. In this case, however, instead of querying the data directly in Kafka with Presto, I'm landing the data into a Google BigQuery table. This last step is optional, since, as in the last blog, I could have queried Kafka directly, but in my opinion it represents the perfect use case for both technologies: Kafka for streaming and BigQuery for storing and querying data.
The endpoint is Tableau, which has a native connector to BigQuery. The following image shows the complete flow.
One thing to notice: at this point in time I'm using an on-premises installation of Kafka, which I kept from my previous blog. However, since source and target are natively cloud applications, I could easily move Kafka to the cloud as well, using, for example, the recently announced Confluent Kafka as a Service.
Now let's add some details about the overall setup.
Kafka
For the purpose of this blog post I've switched from the original Apache Kafka distribution to the Confluent open source one. I've chosen the Confluent distribution since it includes Kafka Connect, which is:
A framework for scalably and reliably streaming data between Apache Kafka and other data systems
Using this framework anybody can write a connector to push data from any system into Kafka (Source Connector) or pull data out of Kafka (Sink Connector). This is a list of available connectors developed and maintained either by Confluent or by the community. Moreover, Kafka Connect provides the benefit of parsing the message body and storing it in Avro format, which makes it easier to access and faster to retrieve.
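As a side note, Kafka Connect also exposes a REST API (by default on port 8083) through which connectors can be registered and managed instead of using properties files. Below is a minimal sketch in Python; the hostname, port and connector configuration values are assumptions mirroring the Twitter source described in the next section, not something taken from the original setup.

import json
import requests

# Assumed Kafka Connect REST endpoint (default port 8083)
CONNECT_URL = "http://localhost:8083/connectors"

# Hypothetical connector definition, mirroring the Twitter source configured below
connector = {
    "name": "connector1",
    "config": {
        "connector.class": "com.github.jcustenborder.kafka.connect.twitter.TwitterSourceConnector",
        "tasks.max": "1",
        "kafka.status.topic": "rm.got",
        "filter.keywords": "#got,gameofthrones,stark,lannister,targaryen"
    }
}

# Register the connector and print the REST API response
resp = requests.post(CONNECT_URL,
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
print(resp.status_code, resp.text)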
Kafka Source for Twitter
In order to source from Twitter I've been using this connector. The setup is pretty easy: copy the source folder named kafka-connect-twitter-master under $CONFLUENT_HOME/share/java and modify the file TwitterSourceConnector.properties located under the config subfolder to include the connection details and the topics.
The configuration file in my case looked like the following:
name=connector1
tasks.max=1
connector.class=com.github.jcustenborder.kafka.connect.twitter.TwitterSourceConnector
# Set these required values
twitter.oauth.accessTokenSecret=<TWITTER_TOKEN_SECRET>
process.deletes=false
filter.keywords=#got,gameofthrones,stark,lannister,targaryen
kafka.status.topic=rm.got
kafka.delete.topic=rm.got
twitter.oauth.consumerSecret=<TWITTER_CONSUMER_SECRET>
twitter.oauth.accessToken=<TWITTER_ACCESS_TOKEN>
twitter.oauth.consumerKey=<TWITTER_CONSUMER_KEY>
A few things to notice:
- process.deletes=false: I'll not delete any message from the stream
- kafka.status.topic=rm.got: I'll write against a topic named rm.got
- filter.keywords=#got,gameofthrones,stark,lannister,targaryen: I'll take all the tweets containing at least one of these keywords. The list could be expanded; this was just a test case.
All the work is done! The next step is to start the Kafka Connect execution via the following call from $CONFLUENT_HOME/share/java/kafka-connect-twitter:
$CONFLUENT_HOME/bin/connect-standalone config/connect-avro-docker.properties config/TwitterSourceConnector.properties
I can see the flow of messages in Kafka using the avro-console-consumer command:
./bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 --property schema.registry.url=http://localhost:8081 --property print.key=true --topic rm.got --from-beginning
You can see (or maybe it's a little bit difficult from the GIF) that the message body was transformed from JSON to Avro format; the following is an example:
{"CreatedAt":{"long":1502444851000},
"Id":{"long":895944759549640704},
"Text":{"string":"RT @haranatom: Daenerys Targaryen\uD83D\uDE0D https://t.co/EGQRvLEPIM"},
[...]
,"WithheldInCountries":[]}
Kafka Sink to BigQuery
Once the data is in Kafka, the next step is to push it to the selected datastore: BigQuery. I can rely on Kafka Connect for this task too, with the related code written and supported by the community and available on GitHub.
All I had to do was download the code and change the file kcbq-connector/quickstart/properties/connector.properties:
...
topics=rm.got
..
autoCreateTables=true
autoUpdateSchemas=true
...
# The name of the BigQuery project to write to
project=<NAME_OF_THE_BIGQUERY_PROJECT>
# The name of the BigQuery dataset to write to (leave the '.*=' at the beginning, enter your
# dataset after it)
datasets=.*=<NAME_OF_THE_BIGQUERY_DATASET>
# The location of a BigQuery service account JSON key file
keyfile=/home/oracle/Big-Query-Key.json
The changes included:
- the topic name to source from Kafka
- the project, dataset and Keyfile which are the connection parameters to BigQuery. Note that the Keyfile is automatically generated when creating a BigQuery service.
After verifying the settings, as per the Kafka Connect instructions, I had to create the tarball of the connector and extract its contents:
cd /path/to/kafka-connect-bigquery/
./gradlew clean confluentTarBall
mkdir bin/jar/ && tar -C bin/jar/ -xf bin/tar/kcbq-connector-*-confluent-dist.tar
The last step is to launch the connector by moving into the kcbq-connector/quickstart/ subfolder and executing
./connector.sh
Note that you may need to specify CONFLUENT_DIR if the Confluent installation home is not in a sibling directory:
export CONFLUENT_DIR=/path/to/confluent
When everything starts up without any errors, a table named rm_got (the name is automatically generated) appears in the BigQuery dataset I defined previously and starts populating.
A side note: I encountered a Java Heap Space error during the run of the BigQuery sink. This was resolved by increasing the heap space setting of the connector via the following call:
export KAFKA_HEAP_OPTS="-Xms512m -Xmx1g"
BigQuery
BigQuery, based on Dremel's paper, is Google's proposition for an enterprise cloud data warehouse, combining speed and scalability with separate pricing for storage and compute. While the cost of storage is common knowledge in the IT world, the cost of compute is a fairly new concept. What this means is that the cost of the same query can vary depending on how the data is organized. In Oracle terms, we are used to associating the query cost with the one defined in the explain plan. In BigQuery that concept is translated from the "performance cost" to also the "financial cost" of a query: the more data a single query has to scan, the higher its cost. This makes the work of optimizing data structures visible not only performance-wise but also on the financial side.
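Because the bill is driven by the bytes a query scans, it can be worth estimating a query before running it. The following is a minimal sketch using the google-cloud-bigquery Python client in dry-run mode; the project, dataset and table names are placeholders, not the ones used in this post.

from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at the service account keyfile
client = bigquery.Client(project="my-gcp-project")  # hypothetical project id

# Dry run: the query is planned and priced but never executed
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Scanning a single column costs less than SELECT *, since BigQuery bills per column scanned
query = "SELECT Text FROM `my-gcp-project.my_dataset.rm_got`"  # hypothetical table reference
job = client.query(query, job_config=job_config)

print("This query would process {} bytes".format(job.total_bytes_processed))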
For the purpose of the blog post, I had almost no settings to configure other than creating a Google Cloud Platform account, a BigQuery project and a dataset.
During the project creation phase, a Keyfile is generated and stored locally on the computer. This file contains all the credentials needed to connect to BigQuery from any external application; my suggestion is to store it in a secure place.
{
"type": "service_account",
"project_id": "<PROJECT_ID>",
"private_key_id": "<PROJECT_KEY_ID>",
"private_key": "<PRIVATE_KEY>",
"client_email": "<E-MAIL>",
"client_id": "<ID>",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://accounts.google.com/o/oauth2/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "<URL>"
}
This file is used in the Kafka sink as we saw above.
Tableau
Once the data has landed in BigQuery, it's time to analyse it with Tableau!
The connection is really simple: from the Tableau home I just need to select Connect -> To a Server -> Google BigQuery, fill in the connection details and select the project and datasource.
An important feature to set is the Use Legacy SQL checkbox in the datasource definition. Without this setting checked I wasn't able to properly query the BigQuery datasource. This is due to the fact that "Standard SQL" doesn't support nested columns while Legacy SQL (also known as BigQuery SQL) does; for more info check the related Tableau documentation.
Analysing the data
Now the fun part starts: analysing the data! The integration between Tableau and BigQuery automatically exposes all the columns of the selected tables together with the correctly mapped datatypes, so I can immediately start playing with the dataset without having to worry about datatype conversions or date formats. I can simply include in the analysis the CreatedAt date and the Number of Records measure (renamed # of Tweets) and display the number of tweets over time.
Now I want to analyse where the tweets are coming from. I can use either the Place.Country or the Geolocation.Latitude and Geolocation.Longitude fields in the tweet detail. Latitude and Longitude are more detailed, while the Country is rolled up at country level, but both solutions have the same problem: they are available only for tweets with geolocation activated.
After adding Place.Country and # of Tweets to the canvas, I can then select the map as the visualization. Two columns, Latitude (generated) and Longitude (generated), are created on the fly, mapping the country locations, and the selected visualization is shown.
However, as mentioned before, this map shows only a subset of the tweets, since the vast majority (almost 99%) have no location.
The fields User.Location and User.TimeZone suffer from a different problem: they are either null or free text, with values not coming from a predefined list but left to the creativity of the account owner, who can type whatever string they like. As you can see, it seems we have some tweets coming directly from Winterfell, Westeros, and, interestingly enough... Hogwarts!
Checking the most engaged accounts based on the User.Name field clearly shows that Daenerys and Jon Snow take the time to tweet in between fighting Cersei and the Whitewalkers.
The field User.Lang can be used to identify the language of the user. However, when analysing the raw data, it can be noticed that there are splits for regional language settings (note en vs en-gb). We can solve the problem by creating a new field User.Lang.Clean taking only the first part of the string with a formula like:
IF FIND([User.Lang],'-') =0
THEN [User.Lang]
ELSE
LEFT([User.Lang],FIND([User.Lang],'-')-1)
END
The interesting result is Italian being the 4th most used language, overtaking Portuguese, and showing the high interest in the show in my home country.
Character and House Analysis
Still with me? So far we've done some pretty basic analysis on top of pre-built fields or with little transformation... now it's time to go deep into the tweet's Text field and check what people are talking about!
The first thing I wanted to do was check mentions of the characters and the related houses. The more a house is mentioned, the more relevant it should be, correct?
The first text analysis I wanted to perform was the Stark vs Targaryen mention war: showing how many tweets mentioned both, only one, or neither of the two main houses. I achieved it with the below IF statement:
IF contains(upper([Text]), 'STARK') AND contains(upper([Text]),'TARGARYEN')
THEN 'Both'
ELSEIF contains(upper([Text]), 'STARK')
THEN 'Stark'
ELSEIF contains(upper([Text]), 'TARGARYEN')
THEN 'Targaryen'
ELSE 'None'
END
The results support House Stark.
I can do the same at the single character level, counting the mentions in separate columns, for example for Jon Snow:
IIF(contains(upper([Text]), 'JON')
OR contains(upper([Text]),'SNOW'), 1,0)
Note the OR condition: I want to count as mentions both the words JON and SNOW, since those can uniquely be referred to the same character. Similarly I can create a column counting the mentions of Arya Stark with the following formula:
IIF(contains(upper([Text]), 'ARYA'), 1,0)
Note that in this case I'm filtering only on the first name (ARYA), since Stark can be a reference to multiple characters (Sansa, Bran...). I created several columns like the two above for some characters and displayed them in a histogram ordered by # of Mentions in descending order.
As expected, after looking at the Houses results above, Jon Snow is leading the user mentions with a big margin over the others with Daenerys in second place.
The methods mentioned above, however, have some big limitations:
- I need to create a different column for every character/house I want to analyse
- The formula complexity increases if I want to analyse more houses/characters at the same time
My goal would be to have an Excel file where I set the search Key (like JON and SNOW) together with the related character and house, and mash this data up with the BigQuery table.
The joining key would be like
CONTAINS([BigQuery].[Text], [Excel].[Key]) >0
Unfortunately Tableau allows only the = operator in join conditions during data blending, making the above syntax impossible to implement. I now have three options:
- Give up: never, if there is still hope!
- Move the Excel into a BigQuery table and resolve the problem there by writing a view on top of the data: this works but increases the complexity on the BigQuery side, plus most Tableau users will not have write access to the related datasources.
- Find an alternative way of joining the data: if the CONTAINS join is not possible during the data-blending phase, I may use it a little bit later...
Warning: the method mentioned below is not optimal performance-wise and should be used carefully, since it causes data duplication if not handled properly.
Without the option of using CONTAINS, I had to create a cartesian join during the data-blending phase. With a cartesian join, every row in the BigQuery table is repeated for every row in the Excel table. I managed to create the cartesian join by simply putting a 1-1 condition in the data-blending section.
I can then apply a filter on the resulting dataset to keep only the BigQuery rows mentioning one (or more) Key from the Excel file, with the following formula:
IIF(CONTAINS(UPPER([Text]),[Key]),[Id],NULL)
This formula returns the tweet Id where the Excel's [Key] field is contained in the UPPER([Text]) coming from Twitter. Since there are multiple Keys assigned to the same character/house (see Jon Snow with both keywords JON and SNOW), the aggregation for this column is a count distinct, which in Tableau is achieved with the COUNTD formula.
I can now simply drag the Name from the Excel file and the # of Mentions column, with the above formula and count distinct as the aggregation method.
The beauty of this solution is that if I now need to do the same graph by house, I don't need to create columns with new formulas; I simply remove the Name field and replace it with the Family field coming from the Excel file.
Also, if I forgot a character or family, I simply need to add the relevant rows in the Excel lookup file and reload it; there is nothing to change in the formulas.
Sentiment Analysis
Another goal I had in mind when analysing the GoT data was the sentiment analysis of tweets and the average sentiment associated with a character or house. Doing sentiment analysis in Tableau is not too hard, since we can reuse existing packages coming from R.
For the Tableau-R integration to work, I had to install and run the Rserve package on a workstation where R was already installed, and set up the connection in Tableau. More details on this configuration can be found in the Tableau documentation.
Once Tableau is configured to call R functions, it's time to analyse the sentiment. I used the Syuzhet package (previously downloaded) for this purpose. The Sentiment calculation is done by the following formula:
SCRIPT_INT(
"library(syuzhet);
r<-(get_sentiment(.arg1,method = 'nrc'))",
ATTR([Text]))
Where:
- SCRIPT_INT: the method will return an integer score for each tweet, with positive sentiments having positive scores and negative sentiments negative scores
- get_sentiment(.arg1, method = 'nrc'): the R function used
- ATTR([Text]): the input parameter of the function, which is the tweet text
At this point I can see the score associated with every tweet, and since that R package uses dictionaries, I limited my research to tweets in the English language (filtering the column User.Lang.Clean mentioned above to en).
The next step is to average the sentiment by character. It seems an easy step, but the devil is in the details! Tableau treats the output of the SCRIPT_INT call to R as an aggregated metric, thus not giving any visual options to re-aggregate it! Plus, the tweet Text field must be present in the layout for the sentiment to be calculated, otherwise the metric is NULL.
Fortunately there are functions, specifically window functions like WINDOW_AVG, allowing a post-aggregation based on a formula defining the start and end. The other cool fact is that window functions work per partition of the data, and the start and end of the window can be defined using the FIRST() and LAST() functions.
We can now create an aggregated version of our Sentiment column with the following formula:
WINDOW_AVG(FLOAT([Sentiment]), FIRST(), LAST())
This column will be repeated with the same value for all rows within the same "partition", in this case the character Name.
Be aware that this solution doesn't re-aggregate the data; we'll still see the data by single tweet Text and character Name. However, the metric is calculated as a total per character, so graphs can be displayed.
I wanted to show a scatter plot based on the # of Mentions and Sentiment of each character. With the window functions defined above, it's as easy as dragging the fields into the proper place and selecting the scatter plot viz.
The default view is not very informative, since I can't really associate a character with its position in the chart until I hover over the related mark. Fortunately, Tableau allows the definition of custom shapes, and I could easily assign character photos to the related names.
While negative sentiment for Littlefinger and Cersei was somewhat expected, the characters with the most negative sentiment are Sansa Stark, probably due to the mysterious letter found by Arya in Baelish's room, and Ellaria Sand. On the opposite side, we strangely see the Night King and, more generally, the Whitewalkers with a very positive sentiment associated with them. Strange, this needs further investigation.
Deep Dive on Whitewalkers and Sansa
I can create a view per character with the associated tweets and sentiment score, and filter it for the Whitewalkers. It looks like there are great expectations for this character in the next episodes (the battle is coming), which are associated with positive sentiments.
When analysing the detail of the number of tweets falling into each sentiment score category, it's clear why Sansa and the Whitewalkers have such different sentiment averages. Both appear as normal distributions, but the center of the Whitewalkers curve is around 1 (positive sentiment), while for Sansa it is between -1 and 0 (negative sentiment).
This explanation, however, doesn't give me enough information; I want to understand more about the most used words included in tweets mentioning the Whitewalkers or the Night King.
Warning: the method mentioned below is not optimal performance-wise and should be used carefully, since it causes data duplication if not handled properly.
There is no easy way to do so directly in Tableau, even using R, since all the functions expect the output to be 1-1 with the input, like the sentiment score and text.
For this purpose I created a view on top of the BigQuery table directly in Tableau using the New Custom SQL option. The SQL used is the following
SELECT ID, REPLACE(REPLACE(SPLIT(UPPER(TEXT),' '),'#',''),'@','') word FROM [Dataset.rm_got]
The SPLIT function divides the Text field into multiple rows, one for every word separated by a space. This is a very basic split and can of course be enhanced if needed. On top of that, the SQL removes references to # and @. Since the view contains the tweet's Id field, this can be used to join this dataset with the main table.
The graph showing the overall words belonging to characters is not really helpful, since the number of words (even after including only those longer than a few characters) is too large to be analysed properly.
When analysing the single words in particular tweets, I can clearly see that the Whitewalkers' sentiment is driven by words like King, Iron and Throne, which have a positive sentiment. On the other hand, Sansa Stark is penalized by words like Kill and Fight, probably due to the possible troubles with Arya.
One thing to mention is that the word Stark is classified with a negative sentiment due to the general English dictionary used for the scoring. This affects all the tweets, and in particular the average scores of all the characters belonging to House Stark. A new "GoT" dictionary should be created and used in order to avoid that kind of misinterpretation.
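To illustrate the idea of a domain-specific dictionary (this is purely an illustration in Python, not part of the Tableau/R workflow used here, and the scores are invented), a tiny custom-lexicon scorer shows how mapping "stark" to a neutral value stops the house name from dragging the score down:

# Hypothetical "GoT-aware" lexicon: word -> sentiment score
got_lexicon = {
    "kill": -1, "hate": -1, "victim": -1,
    "finally": 1, "king": 1,
    "stark": 0,  # neutral here, negative in a general English dictionary
}

def score_tweet(text, lexicon):
    """Sum the lexicon scores of the words appearing in the tweet."""
    words = text.lower().replace("#", "").replace("@", "").split()
    return sum(lexicon.get(word, 0) for word in words)

# "stark" contributes 0 instead of -1, so the overall score is 1 rather than 0
print(score_tweet("Sansa Stark will finally fight back", got_lexicon))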
Also, when talking about "Game of Thrones", words like Kill or Death can have a positive or negative meaning depending on the sentence. An imaginary tweet like
Finally Arya kills Cersei
should have a positive sentiment for Arya and a negative one for Cersei, but this is where automatic sentiment classification techniques show their limits. Not even a new dictionary could help in this case.
The chart below shows the percentage of words classified as positive (score 1 or 2) or negative (score -1 or -2) for the two selected characters. We can clearly see that, as expected, Sansa has more negative words than positive ones, while the Whitewalkers are on the opposite side.
Furthermore, the overall sentiment for the two characters may be explained by the following graph. This shows, for every sentence sentiment category (divided into the bins Positive, Neutral and Negative), a histogram based on the count of words by single word sentiment. We can clearly see how words with positive sentiment drive the Positive sentence category (and vice versa).
Finally, the last graph shows the words that have most impacted the overall positive and negative sentiment for both characters.
We can clearly see that Sansa's negative sentiment is due to Stark, Hate and Victim. On the other side, the Whitewalkers' positive sentiment is due to words like King (the Night King is the character) and Finally, probably due to the battle coming in the next episode. As you can see, there are also multiple instances of the word King due to the different punctuation preceding or following the word. I stated above that the BigQuery SQL extracting the words via the SPLIT function was very basic; we can now see why. Small enhancements in the function would aggregate the words properly.
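The kind of enhancement meant here is straightforward to sketch; the snippet below shows the idea in Python purely for illustration (in practice the fix would live in the BigQuery view): stripping punctuation before splitting makes 'King', 'King.' and 'King!' collapse into a single word.

import re

def tokenize(text):
    """Upper-case, drop punctuation/hashes/handles and split into words."""
    cleaned = re.sub(r"[^A-Za-z\s]", " ", text)  # punctuation, #, @ and digits become spaces
    return [word for word in cleaned.upper().split() if len(word) > 2]

print(tokenize("The Night King!! Finally... #GoT"))
# ['THE', 'NIGHT', 'KING', 'FINALLY', 'GOT']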
Are you still there? Do you wonder what's left? Well, there is a whole set of analyses that can be done on top of this dataset, including checking the sentiment behaviour over time during the live event, or comparing this week's dataset with next episode's. The latter may happen next week, so... keep in touch!
Hope you enjoyed the analysis... otherwise... Dracarys!
Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3
Last week there was Wimbledon; if you are a fan of Federer, Nadal or Djokovic, then it was one of the events not to be missed. I deliberately excluded Andy Murray from the list above since he knocked out my favourite player: Dustin Brown.
Two weeks ago I was at Kscope17, and one of the common themes, which reflected where the industry is going, was the usage of Kafka as a central hub for all data pipelines. I won't go into detail on Kafka's specific role and how it accomplishes it; you can grab the idea from two slides taken from a recent presentation by Confluent.
One of the key points of all Kafka-related discussions at Kscope was that Kafka is widely used to take data from providers and push it to specific data-stores (like HDFS) that are then queried by analytical tools. However, the "parking to data-store" step can sometimes be omitted, with analytical tools querying Kafka directly for real-time analytics.
At the beginning of the year we wrote a blog post about doing this with Spark Streaming and Python; however, that setup was more data-scientist oriented and didn't provide the simple ANSI SQL familiar to the beloved end-users.
As usual, Oracle announced a new release during Kscope. This year it was Oracle Data Visualization Desktop 12.2.3.0.0, with a bunch of new features covered in my previous blog post.
The enhancement, amongst others, that made my day was the support for JDBC and ODBC drivers. It opened up a whole bundle of opportunities to query tools not officially supported by DVD but that expose those types of connectors.
One of the tools that fits in this category is Presto, a distributed query engine belonging to the same family as Impala and Drill, commonly referred to as sql-on-Hadoop. A big plus of this tool, compared to the other two mentioned above, is that it natively queries Kafka via a dedicated connector.
I then found a way of fitting two of the main Kscope17 topics, a new sql-on-Hadoop tool and one of my favourite sports (tennis), into the same blog post: analysing real-time Twitter feeds with Kafka, Presto and Oracle DVD v3. Not bad as an idea... let's check if it works...
Analysing Twitter Feeds
Let's start from the actual fun: analysing the tweets! We can navigate to the Oracle Analytics Store and download some interesting add-ins we'll use: the Auto Refresh plugin that enables the refresh of the DV project, the Heat Map and Circle Pack visualizations and the Term Frequency advanced analytics pack.
Importing the plugin and new visualizations can be done directly in the console, as explained in my previous post. In order to use the advanced analytics function, we need to unzip the related file and move the .xml file it contains into the %INSTALL_DIR%\OracleBI1\bifoundation\advanced_analytics\script_repository folder. In the Advanced Analytics zip file there is also a .dva project that we can import into DVD (password Admin123), which gives us a hint on how to use the function.
We can now build a DVD project about the Wimbledon gentlemen's singles final containing:
- A table view showing the latest tweets
- A horizontal bar chart showing the number of tweets containing mentions to Federer, Cilic or Both
- A circle view showing the most tweeted terms
- A heatmap showing tweet locations (only for tweets with geolocation activated)
- A line chart showing the number of tweets over time
The project is automatically refreshed using the auto-refresh plugin mentioned above. A quick view of the result is provided by the following image.
So far all good and simple! Now it's time to go back and check how the data is collected and queried. Let's start from Step #1: pushing Twitter data to Kafka!
Kafka
We covered Kafka installation and setup in a previous blog post, so I won't repeat this part.
The only piece I want to mention, since it gave me trouble, is the advertised.host.name setting: it's a configuration line in /opt/kafka*/config/server.properties that tells Kafka which host it's listening on.
If you leave the default localhost and try to push content to a topic from an external machine, it will not show up, so as a prerequisite change it to a hostname/IP that can be resolved externally.
The rest of the Kafka setup is the creation of a Twitter producer. I took this Java project as an example and changed it to use the latest Kafka release available in Maven. It allowed me to create a Kafka topic named rm.wimbledon storing tweets containing the word Wimbledon.
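For readers who prefer Python over Java, a roughly equivalent producer can be sketched with the tweepy (3.x API) and kafka-python libraries; this is an assumption-laden sketch, not the project used in this post, and the credentials and broker address are placeholders.

from kafka import KafkaProducer
import tweepy

# Placeholder broker address and Twitter credentials
producer = KafkaProducer(bootstrap_servers="linuxsrv.local.com:9092")
auth = tweepy.OAuthHandler("<CONSUMER_KEY>", "<CONSUMER_SECRET>")
auth.set_access_token("<ACCESS_TOKEN>", "<ACCESS_TOKEN_SECRET>")

class KafkaListener(tweepy.StreamListener):
    def on_data(self, raw_data):
        # Push the raw tweet JSON into the rm.wimbledon topic
        producer.send("rm.wimbledon", raw_data.encode("utf-8"))
        return True

    def on_error(self, status_code):
        print("Stream error:", status_code)
        return True  # keep the stream alive

# Track any tweet containing the word Wimbledon
tweepy.Stream(auth, KafkaListener()).filter(track=["Wimbledon"])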
The same output could be achieved using Kafka Connect and its source and sink for Twitter. Kafka Connect also has the benefit of being able to transform the data before landing it in Kafka, making the data parsing easier and retrieval faster. I'll cover the usage of Kafka Connect in a future post; for more information about it, check this presentation from Robin Moffatt of Confluent.
One final note about Kafka: I ran a command to limit the retention to a few minutes
bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic rm.wimbledon --config retention.ms=300000
This limits the amount of data that is kept in Kafka, providing better performance at query time. This is not always possible due to data collection needs, and there are other ways of optimizing the query if necessary.
At this point of our project we have a dataflow from Twitter to Kafka, but no known way of querying it with DVD. It's time to introduce the query engine: Presto!
Presto
Presto was developed at Facebook and is in the family of sql-on-Hadoop tools. However, like Apache Drill, it could be called sql-on-everything, since data doesn't need to reside on a Hadoop system. Presto can query local file systems, MongoDB, Hive, and a big variety of other datasources.
Like the other sql-on-Hadoop technologies, it works with always-on daemons, which avoid the latency Hive incurs in starting a MapReduce job. Presto, differently from the others, divides the daemons into two types: the Coordinator and the Worker. A Coordinator is a node that receives the query from the clients; it analyses and plans the execution, which is then passed on to Workers to carry out.
In other tools like Impala and Drill, every node by default can act as both worker and coordinator. The same can also happen in Presto, but it is not the default, and the documentation suggests dedicating a single machine to coordination tasks only, for best performance in large clusters (reference to the doc).
The following image, taken from the Presto website, explains the flow when using the Hive metastore as a datasource.
Installation
The default Presto installation procedure is pretty simple and can be found in the official documentation. We just need to download the presto-server-0.180.tar.gz tarball and unpack it.
tar -xvf presto-server-0.180.tar.gz
This creates a folder named presto-server-0.180, which is the installation directory; the next step is to create a subfolder named etc which contains the configuration settings.
Then we need to create five configuration files and a folder within the etc folder:
- node.properties: configuration specific to each node, enabling the configuration of a cluster
- jvm.config: options for the Java Virtual Machine
- config.properties: specific coordinator/worker settings
- log.properties: specifies log levels
- catalog: a folder that will contain the data source definitions
For basic functionality we need the following configurations:
node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
The environment parameter is shared across all the nodes in the cluster, the id is a unique identifier of the node, and the data-dir is the location where Presto will store logs and data.
jvm.config
-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
I reduced the -Xmx parameter to 4GB as I'm running in a test VM. The parameters can of course be changed as needed.
config.properties
Since we want to keep it simple, we'll create a single node acting both as coordinator and as worker; the related config file is:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://linuxsrv.local.com:8080
Here coordinator=true tells Presto to function as a coordinator, http-server.http.port defines the port, and discovery.uri is the URI of the Discovery server (in this case the same process).
log.properties
com.facebook.presto=INFO
We can keep the default INFO level; other levels are DEBUG, WARN and ERROR.
catalog
The last step in the configuration is the datasource setting: we need to create a folder named catalog within etc and create a file for each connection we intend to use.
For the purpose of this post we want to connect to the Kafka topic named rm.wimbledon. We need to create a file named kafka.properties within the catalog folder created above. The file contains the following lines:
connector.name=kafka
kafka.nodes=linuxsrv.local.com:9092
kafka.table-names=rm.wimbledon
kafka.hide-internal-columns=false
where kafka.nodes points to the Kafka brokers and kafka.table-names defines the list of topics, delimited by a comma.
The last bit needed is to start the Presto server by executing
bin/launcher start
We can append the --verbose parameter to debug the installation, with logs that can be found in the var/log folder.
Presto Command Line Client
In order to query Presto via the command line interface, we just need to download the associated client (see official doc), which comes in the form of a presto-cli-0.180-executable.jar file. We can now rename the file to presto and make it executable.
mv presto-cli-0.180-executable.jar presto
chmod +x presto
Then we can start the client by executing
./presto --server linuxsrv.local.com:8080 --catalog kafka --schema rm
Remember that the client has JDK 1.8 as a prerequisite, otherwise you will face an error. Once the client is successfully set up, we can start querying Kafka.
You may notice that the schema (rm) we're connecting to is just the prefix of the rm.wimbledon topic used in Kafka. In this way I could potentially store other topics using the same rm prefix and be able to query them all together.
We can check which schemas can be used in Kafka with
presto:rm> show schemas;
Schema
--------------------
information_schema
rm
(2 rows)
We can also check which topics are contained in the rm schema by executing
presto:rm> show tables;
Table
-----------
wimbledon
(1 row)
or change schema by executing
use information_schema;
Going back to the Wimbledon example we can describe the content of the topic by executing
presto:rm> describe wimbledon;
Column | Type | Extra | Comment
-------------------+---------+-------+---------------------------------------------
_partition_id | bigint | | Partition Id
_partition_offset | bigint | | Offset for the message within the partition
_segment_start | bigint | | Segment start offset
_segment_end | bigint | | Segment end offset
_segment_count | bigint | | Running message count per segment
_key | varchar | | Key text
_key_corrupt | boolean | | Key data is corrupt
_key_length | bigint | | Total number of key bytes
_message | varchar | | Message text
_message_corrupt | boolean | | Message data is corrupt
_message_length | bigint | | Total number of message bytes
(11 rows)
We can immediately start querying it like
presto:rm> select count(*) from wimbledon;
_col0
-------
42295
(1 row)
Query 20170713_102300_00023_5achx, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:00 [27 rows, 195KB] [157 rows/s, 1.11MB/s]
Remember all the queries go against Kafka in real time, so the more messages we push, the more results we'll have available. Let's now check what the messages look like
presto:rm> SELECT _message FROM wimbledon LIMIT 5;
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
{"created_at":"Thu Jul 13 10:22:46 +0000 2017","id":885444381767081984,"id_str":"885444381767081984","text":"RT @paganrunes: Ian McKellen e Maggie Smith a Wimbl
{"created_at":"Thu Jul 13 10:22:46 +0000 2017","id":885444381913882626,"id_str":"885444381913882626","text":"@tomasberdych spricht vor dem @Wimbledon-Halbfinal
{"created_at":"Thu Jul 13 10:22:47 +0000 2017","id":885444388645740548,"id_str":"885444388645740548","text":"RT @_JamieMac_: Sir Andrew Murray is NOT amused wit
{"created_at":"Thu Jul 13 10:22:49 +0000 2017","id":885444394404503553,"id_str":"885444394404503553","text":"RT @IBM_UK_news: What does it take to be a #Wimbled
{"created_at":"Thu Jul 13 10:22:50 +0000 2017","id":885444398929989632,"id_str":"885444398929989632","text":"RT @PakkaTollywood: Roger Federer Into Semifinals \
(5 rows)
As expected, tweets are stored in JSON format. We can now use the Presto JSON functions to extract the relevant information from them. In the following we're extracting the user.name part of every tweet. Note the LIMIT 10 clause (common among all the sql-on-Hadoop technologies) limiting the number of rows returned.
presto:rm> SELECT json_extract_scalar(_message, '$.user.name') FROM wimbledon LIMIT 10;
_col0
---------------------
pietre --
BLICK Sport
Neens
Hugh Leonard
••••Teju KaLion••••
Charlie Murray
Alex
The Daft Duck.
Hotstar
Raj Singh Chandel
(10 rows)
We can also create summaries like the top 10 users by number of tweets.
presto:rm> SELECT json_extract_scalar(_message, '$.user.name') as screen_name, count(json_extract_scalar(_message, '$.id')) as nr FROM wimbledon GROUP BY json_extract_scalar(_message, '$.user.name') ORDER BY count(json_extract_scalar(_message, '$.id')) desc LIMIT 10;
screen_name | nr
---------------------+-----
Evarie Balan | 125
The Master Mind | 104
Oracle Betting | 98
Nichole | 85
The K - Man | 75
Kaciekulasekran | 73
vientrainera | 72
Deporte Esp | 66
Lucas Mc Corquodale | 64
Amal | 60
(10 rows)
Adding a Description file
We saw above that it's possible to query with ANSI SQL statements using the Presto JSON functions. The next step is to define a structure on top of the data stored in the Kafka topic, to turn the raw data into a table format. We can achieve this by writing a topic description file. The file must be in JSON format and stored under the etc/kafka folder; it is recommended, but not necessary, that the name of the file matches the Kafka topic (in our case rm.wimbledon). The file in our case would be the following:
{
"tableName": "wimbledon",
"schemaName": "rm",
"topicName": "rm.wimbledon",
"key": {
"dataFormat": "raw",
"fields": [
{
"name": "kafka_key",
"dataFormat": "LONG",
"type": "BIGINT",
"hidden": "false"
}
]
},
"message": {
"dataFormat": "json",
"fields": [
{
"name": "created_at",
"mapping": "created_at",
"type": "TIMESTAMP",
"dataFormat": "rfc2822"
},
{
"name": "tweet_id",
"mapping": "id",
"type": "BIGINT"
},
{
"name": "tweet_text",
"mapping": "text",
"type": "VARCHAR"
},
{
"name": "user_id",
"mapping": "user/id",
"type": "VARCHAR"
},
{
"name": "user_name",
"mapping": "user/name",
"type": "VARCHAR"
},
[...]
]
}
}
After restarting Presto, when we execute the DESCRIBE operation we can see all the fields available.
presto:rm> describe wimbledon;
Column | Type | Extra | Comment
-------------------+-----------+-------+---------------------------------------------
kafka_key | bigint | |
created_at | timestamp | |
tweet_id | bigint | |
tweet_text | varchar | |
user_id | varchar | |
user_name | varchar | |
user_screenname | varchar | |
user_location | varchar | |
user_followers | bigint | |
user_time_zone | varchar | |
_partition_id | bigint | | Partition Id
_partition_offset | bigint | | Offset for the message within the partition
_segment_start | bigint | | Segment start offset
_segment_end | bigint | | Segment end offset
_segment_count | bigint | | Running message count per segment
_key | varchar | | Key text
_key_corrupt | boolean | | Key data is corrupt
_key_length | bigint | | Total number of key bytes
_message | varchar | | Message text
_message_corrupt | boolean | | Message data is corrupt
_message_length | bigint | | Total number of message bytes
(21 rows)
Now I can use the newly defined columns in my query
presto:rm> select created_at, user_name, tweet_text from wimbledon LIMIT 10;
and the related results
We can always mix the defined columns with Presto's custom JSON parsing syntax if we need to extract some other fields.
select created_at, user_name, json_extract_scalar(_message, '$.user.default_profile') from wimbledon LIMIT 10;
Oracle Data Visualization Desktop
As mentioned at the beginning of the article, the overall goal was to analyse the Wimbledon Twitter feed in real time with Oracle Data Visualization Desktop via JDBC, so let's complete the picture!
JDBC drivers
The first step is to download the Presto JDBC driver version 0.175; I found it on the Maven website. I also tried the 0.180 version, downloadable directly from the Presto website, but I had several errors with the connection.
After downloading, we need to copy the driver presto-jdbc-0.175.jar under the %INSTALL_DIR%\lib folder, where %INSTALL_DIR% is the Oracle DVD installation folder, and start DVD. Then I just need to create a new connection like the following:
Note that:
- URL: it also includes the /kafka postfix; this tells Presto which catalog I want to query
- Driver Class Name: this setting puzzled me a little bit; I was able to discover the string (with the help of Gianni Ceresa) by concatenating the folder name and the driver class name after unpacking the jar file
- Username/password: these strings can be anything, since for the basic test we didn't set up any security on Presto.
The whole JDBC process setting is described in this youtube video provided by Oracle.
We can then define the source by just selecting the columns we want to import and creating a few additional ones, like Lat and Long, parsed from the coordinates column which is in the form [Lat, Long]. The dataset is now ready to be analysed, as we saw at the beginning of the article, with the final result being:
Conclusions
As we can see from the above picture, the whole process works (phew...), however it has some limitations: there is no pushdown of functions to the source, so most of the queries we see against Presto are in the form of
select tweet_text, tweet_id, user_name, created_at from (
select coordinates,
coordinates_lat_long,
created_at,
tweet_id,
tweet_text,
user_followers,
user_id,
user_location,
user_name,
user_screenname,
user_time_zone
from rm.wimbledon)
This means that the whole dataset is retrieved every time, making this solution far from optimal for big volumes of data. In those cases the "parking" to a datastore step would probably be necessary. Another limitation is related to the transformations: the Lat and Long extractions from the coordinates field, along with other column transformations, are done directly in DVD, meaning that the formula is applied in the visualization phase. In a second post we'll see how the source parsing phase and query performance can be enhanced using Kafka Connect, the framework allowing easy integration between Kafka and other sources or sinks.
One last word: winning Wimbledon eight times, fourteen years after the first victory and five years after the last one, is something impressive! Chapeau, Mr Federer!
Streaming Global Cyber Attack Analytics with Tableau and Python
Introduction and Hacks
As grandiose a notion as the title may imply, there have been some really promising and powerful moves made in the advancement of smoothly integrating real-time and/or streaming data technologies into most any enterprise reporting and analytics architecture. When used in tandem with functional programming languages like Python, we now have the ability to create enterprise grade data engineering scripts to handle the manipulation and flow of data, large or small, for final consumption in all manner of business applications.
In this cavalcade of coding, we're going to use a combination of Satori, a free data streaming client, and Python to stream live world cyber attack activity via an API. We'll consume the records as JSON, and then use a few choice Python libraries to parse, normalize, and insert the records into a MySQL database. Finally, we'll hook it all up to Tableau and watch cyber attacks happen in real time with a really cool visualization.
The Specs
For this exercise, we're going to bite things off a chunk at a time. We're going to utilize a service called Satori, a streaming data source aggregator that will make it easy for us to hook up to any number of streams to work with as we please. In this case, we'll be working with the Live Cyber Attack Threat Map data set. Next, we'll set up our producer code that will do a couple of things. First it will create the API client from which we will be ingesting a constant flow of cyber attack records. Next, we'll take these records and convert them to a data frame using the Pandas library for Python. Finally, we will insert them into a MySQL database. This will allow us to use this live feed as a source for Tableau in order to create a geo mapping of countries that are currently being targeted by cyber attacks.
The Data Source
Satori is a new-ish service that aggregates the web's streaming data sources and provides developers with a client and some sample code that they can then use to set up their own live data streams. While your interests may lie in how you can stream your own company's data, it then simply becomes a matter of using Python's requests library to get at whatever internal sources you might need. Find more on the requests library here.
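As a quick illustration of that point (the endpoint URL and parameters below are made up), polling an internal HTTP API with requests and feeding the records into the same kind of pipeline could look like this:

import requests

# Hypothetical internal endpoint returning a JSON list of records
INTERNAL_API = "http://internal.example.com/api/events"

response = requests.get(INTERNAL_API, params={"limit": 100}, timeout=10)
response.raise_for_status()

# Each record could then be normalized and inserted exactly like the Satori messages below
for record in response.json():
    print(record)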
Satori has taken a lot of the guesswork out of the first step of the process for us, as they provide basic code samples in a number of popular languages to access their streaming service and to generate records. You can find the link to this code here. Note that you'll need to install their client and get your own app key. I've added a bit of code at the end to handle the insertion of records, and to continue the flow, should any records produce a warning.
Satori Code
# Imports
from __future__ import print_function
import sys
import threading
from pandas import DataFrame
from satori.rtm.client import make_client, SubscriptionMode
# Local Imports
from create_table import engine
# Satori Variables
channel = "live-cyber-attack-threat-map"
endpoint = "wss://open-data.api.satori.com"
appkey = " "
# Local Variables
table = 'hack_attacks'
def main():
    with make_client(
            endpoint=endpoint, appkey=appkey) as client:
        print('Connected!')

        mailbox = []
        got_message_event = threading.Event()

        class SubscriptionObserver(object):
            def on_subscription_data(self, data):
                for message in data['messages']:
                    mailbox.append(message)
                got_message_event.set()

        subscription_observer = SubscriptionObserver()
        client.subscribe(
            channel,
            SubscriptionMode.SIMPLE,
            subscription_observer)

        if not got_message_event.wait(30):
            print("Timeout while waiting for a message")
            sys.exit(1)

        for message in mailbox:
            # Create dataframe
            data = DataFrame([message],
                             columns=['attack_type', 'attacker_ip', 'attack_port',
                                      'latitude2', 'longitude2', 'longitude',
                                      'city_target', 'country_target', 'attack_subtype',
                                      'latitude', 'city_origin', 'country_origin'])
            # Insert records to table
            try:
                data.to_sql(table, engine, if_exists='append')
            except Exception as e:
                print(e)


if __name__ == '__main__':
    main()
Creating a Table
Now that we've set up the streaming code that we'll use to fill our table, we'll need to set up the table in MySQL to hold them all. For this we'll use the SQLAlchemy ORM (object relational mapper). It's a high falutin' term for a tool that simply abstracts SQL commands to be more 'pythonic'; that is, you don't necessarily have to be a SQL expert to create tables in your given database. Admittedly, it can be a bit daunting to get the hang of, but give it a shot. Many developers choose to interact with a given database either via direct SQL or using an ORM. It's good practice to use a separate Python file, in this case settings.py (or some variation thereof), to hold your database connection string, entitled SQLALCHEMY_DATABASE_URI, in the following format (the mysqldb tag after the + is a result of the MySQL driver library you'll need to install for Python):
'mysql+mysqldb://db_user:pass@db_host/db_name'
Don't forget to sign in to your database to validate success!
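The create_table module imported by the producer above isn't shown in the post; a minimal sketch of what it might contain, using SQLAlchemy Core, is below. The column list mirrors the DataFrame built earlier, but the exact types and lengths are assumptions.

# create_table.py -- hypothetical sketch, not the original module
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, Float
from settings import SQLALCHEMY_DATABASE_URI

# Engine shared with the producer via "from create_table import engine"
engine = create_engine(SQLALCHEMY_DATABASE_URI)
metadata = MetaData()

# Columns mirror the DataFrame built in the producer; types/lengths are assumptions
hack_attacks = Table(
    'hack_attacks', metadata,
    Column('id', Integer, primary_key=True, autoincrement=True),
    Column('attack_type', String(100)),
    Column('attacker_ip', String(50)),
    Column('attack_port', String(10)),
    Column('latitude', Float),
    Column('longitude', Float),
    Column('latitude2', Float),
    Column('longitude2', Float),
    Column('city_target', String(100)),
    Column('country_target', String(100)),
    Column('attack_subtype', String(100)),
    Column('city_origin', String(100)),
    Column('country_origin', String(100)),
)

# Create the table if it doesn't already exist
metadata.create_all(engine)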
Feeding MySQL and Tableau
Now all we need to do is turn on the hose and watch our table fill up. Running producer.py, we can then open a new tab, log in to our database to make sure our table is being populated, and go to work. Create a new connection to your MySQL database (I called my db 'hacks') in Tableau and verify that everything is in order once you navigate to the data preview. There are lots of nulls in this data set, but this will simply be a matter of filtering them out on the front end.
Tableau should pick up right away on the geo data in the dataset, as denoted by the little globe icon next to the field.
We can now simply double-click on the corresponding geo data field, in this case Country Target, and then the Number of Records field in the Measures area.
I've chosen to use the 'Dark' map theme for this example as it just really jives with the whole cyber attack, international espionage vibe. Note that you'll need to maintain a live connection, via Tableau, to your datasource and refresh at the interval you'd like, if using Tableau Desktop. If you're curious about how to automagically provide for this functionality, a quick google search will come up with some solutions.
Enabling A Modern Analytics Platform
Over recent years, bi-modal analytics has gained interest and, dare I say it, a level of notoriety, thanks to Gartner’s repositioning of its Magic Quadrant in 2016. I’m going to swerve the debate, but if you are not up to speed, then I recommend taking a look here first.
Regardless of your chosen stance on the subject, one thing is certain: the ability to provision analytic capabilities in more agile ways and with greater end user flexibility is now widely accepted as an essential part of any modern analytics architecture.
But are there any secrets or clues that could help you in modernising your analytics platform?
What Is Driving the Bi-Modal Shift?
The demand for greater flexibility from our analytics platforms has its roots in the significant evolutions seen in the business environment. Specifically, we are operating in/with:
- increasingly competitive marketplaces, requiring novel ideas, more tailored customer relationships and faster decisions;
- turbulent global economies, leading to a drive to reduce (capex) costs, maximise efficiencies and a need to deal with increased regulation;
- broader and larger, more complex and more externalised data sets, which can be tapped into with much reduced latency;
- empowered and tech-savvy departmental users, with an increased appetite for analytical decision making, combined with great advances in data discovery and visualisation technologies to satisfy this appetite;
In a nutshell, the rate at which change occurs is continuing to gather pace and so to be an instigator of change (or even just a reactor to it as it happens around you) requires a new approach to analytics and data delivery and execution.
Time to Head Back to the Drawing Board?
Whilst the case for rapid, user-driven analytics is hard to deny, does it mean that our heritage BI and Analytics platforms are obsolete and ready for the scrap heap?
I don’t think so: The need to be able to monitor operational processes, manage business performance and plan for the future have not suddenly disappeared; The need for accurate, reliable and trusted data which can be accessed securely and at scale is as relevant now as it was before. And this means that, despite what some might have us believe, all the essential aspects of the enterprise BI platforms we have spent years architecting, building and growing cannot be simply wiped away.
[Phew!]
Instead, our modern analytics platforms must embrace both ends of the spectrum equally: highly governed, curated and trustworthy data to support business management and control, coupled with highly available, flexible, loosely governed data to support business innovation. In other words, both modes must coexist and function in a relative balance.
The challenge now becomes a very different one: how can we achieve this in an overarching, unified business architecture which supports departmental autonomy, encourages analytical creativity and innovation, whilst minimising inefficiency and friction? Now that is something we can really get our teeth into!
What’s IT All About?
Some questions:
- Do you have a myriad of different analytics tools spread across the business which are all being used to fulfil the same ends?
- Are you constantly being asked to provide data extracts or have you resorted to cloning your production database and provisioning SQL Developer to your departmental analysts?
- Are you routinely being asked to productionise things that you have absolutely no prior knowledge of?
If you can answer Yes to these questions, then you are probably wrestling with an unmanaged or accidental bi-modal architecture.
At Rittman Mead, we have seen several examples of organisations who want to hand greater autonomy to departmental analysts and subject matter experts, so that they can get down and dirty with the data to come up with novel and innovative business ideas. In most of the cases I have observed, this has been driven at a departmental level and instead of IT embracing the movement and leading the charge, results have often been achieved by circumventing IT. Even in the few examples where IT have engaged in the process, the scope of their involvement has normally been focused on the provision of hardware and software, or increasingly, the rental of some cloud resources. It seems to me that the bi-modal shift is often perceived as a threat to traditional IT, that it is somehow the thin end of a wedge leading to full departmental autonomy and no further need for IT! In reality, this has never been (and will never be) the ambition or motivation of departmental initiatives.
In my view, this slow and faltering response from IT represents a massive missed opportunity. More importantly though, it increases the probability that the two modes of operation will be addressed in isolation and this will only ever lead to siloed systems, siloed processes and ultimately, a siloed mentality. The creation of false barriers between IT and business departments can never be a positive thing.
That’s not to say that there won’t be any positive results arising from un-coordinated initiatives, it’s just that unwittingly, they will cause an imbalance in the overall platform: You might deliver an ultra-slick, flexible, departmentally focused discovery lab, but this will encourage the neglect and stagnation of the enterprise platform. Alternatively, you may have a highly accurate, reliable and performant data architecture with tight governance control which creates road-blocks for departmental use cases.
Finding the Right Balance
So, are there any smart steps that you can take if you are looking to build out a bi-modal analytics architecture? Well, here are a few ideas that you should consider as factors in a successful evolution:
1. Appreciate Your Enterprise Data Assets
You’ve spent a lot of time and effort developing and maintaining your data warehouse and defining the metadata so that it can be exposed in an easily understandable and user friendly way. The scope of your enterprise data also provides a common base for the combined data requirements for all of your departmental analysts. Don’t let this valuable asset go to waste! Instead provide a mechanism whereby your departmental analysts can access enterprise data quickly, easily, when needed and as close to the point of consumption as possible. Then, with good quality and commonly accepted data in their hands, give your departmental analysts a level of autonomy and the freedom to cut loose.
2. Understand That Governance Is Not a Dirty Word
In many organisations, data governance is synonymous with red tape, bureaucracy and hurdles to access. This should not be the case. Don’t be fooled into thinking that more agile means less control. As data begins to be multi-purposed, moved around the business, combined with disparate external data sources and used to drive creativity in new and innovative ways, it is essential that the provenance of the enterprise data is known and quantifiable. That way, departmental initiatives will start with a level of intrinsic confidence, arising from the knowledge that the base data has been sourced from a well known, consistent and trusted source. Having this bedrock will increase confidence in your analytical outputs and lead to stronger decisions. It will also drive greater efficiencies when it comes to operationalising the results.
3. Create Interdependencies
Don’t be drawn into thinking “our Mode 1 solution is working well, so let’s put all our focus and investment into our Mode 2 initiatives”. Instead, build out your Mode 2 architecture with as much integration into your existing enterprise platform as possible. The more interdependencies you can develop, the more you will be able to reduce data handling inefficiencies and increase benefits of scale down the line. Furthermore, interdependency will eliminate the risk of creating silos and allowing your enterprise architecture to stagnate, as both modes will have a level of reliance on one another. It will also encourage good data management practice, with data-workers talking in a common and consistent language.
4. Make the Transition Simple
Probably the single most important factor in determining the success of your bi-modal architecture is how smoothly you can transition a Mode 2 model into something operational and production-ready in Mode 1. The more effective this process is, the more likely you are to maximise your opportunities (be it new sales revenue, reduced operating costs, etc.) and increase your RoI. The biggest barriers to smoothing this transition arise when departmental outputs need to be reanalysed, respecified and redesigned so that they can be slotted back into the enterprise platform. If both Mode 1 and Mode 2 activity is delivered with the same tools and software vendors, then you will have a head start…but even if disparate tools are used for the differing purposes, there are always things you can do that will help. Firstly, make sure that the owners of the enterprise platform have a level of awareness of departmental initiatives, so that there is a ‘no surprises’ culture…who knows, their experience of the enterprise data could even be exploited to add value to departmental initiatives. Secondly, ensure that departmental outputs can always be traced back to the enterprise data model easily (note: this will come naturally if the other three suggestions are followed!). And finally, define a route to production that is not overbearing or cumbersome. Whilst all due diligence should be taken to ensure the production environment is risk-free, creating artificial barriers (such as a quarterly or monthly release cycle) will render a lot of the good work done in Mode 2 useless.
Unify – bringing together the best of both worlds
Since I started teaching OBIEE in 2011, I have had the pleasure of meeting many fascinating people who work with Business Intelligence.
In talking to my students, I would generally notice three different situations:
Folks were heavy users of OBIEE and were just ready to take their skills to the next level.
They were happily transitioning to OBIEE from a legacy reporting tool that didn’t have the power they needed.
There were not-so-good times, like when people were being forced to transition to OBIEE. They felt that they were moving away from their comfort zone and diving into a world of complicated mappings that would first require them to become rocket scientists. They were resistant to change.
It was this more challenging crowd that most sparked my interest in other analytics tools. I received questions like: “Why are we switching to another system? What are the benefits?”
I wanted to have a good answer to these questions. Over the years, different projects have allowed me the opportunity to work with diverse reporting tools. My students’ questions were always in mind: Why? And what are the benefits? So, I always took the time to compare/contrast the differences between OBIEE and these other tools.
I noticed that many of them did a fantastic job of answering the questions being asked, and so did OBIEE. It didn’t take me long to find the answer I needed: the main differentiator of OBIEE is the RPD!
The RPD is where so much of the Business Intelligence happens. There, developers spend mind-boggling amounts of time connecting the data, deriving complex metrics and hierarchies, joining hundreds of tables, and making everything a beautiful drag-and-drop dream for report writers.
Yes, many other tools will allow us to do magic with metadata, but most of them require this magic to be redefined every time we need a new report, or the report has different criteria. Yes, the RPD requires a lot of work upfront, but that work is good for years to come. We never lose any of our previous work, we just enhance our model. Over time, the RPD becomes a giant pool of knowledge for a company and, impressively, it is all saved as a single file.
For tapping into the RPD metadata, traditionally we have used BI Publisher and OBIEE. They are both very powerful and generally complement each other well. Other tools have become very popular in the past few years. Tableau is an example that quickly won the appreciation of the BI community and has maintained consistent leadership in Gartner’s BI Magic Quadrant since 2013. With a very slick interface and super-fast reporting capability, Tableau introduced less complex methods to create amazing dashboards - and fast! So, what is there not to like? There is really so much TO like!
Going back to the comparing and contrasting, the main thing that Tableau doesn’t offer is… the RPD. It lacks a repository with the ability to save the join definitions, calculations and the overall intelligence that can be used for all future reports.
At Rittman Mead, we’ve been using these tools and appreciate their substantial capabilities, but we really missed the RPD as a data source. We wanted to come up with a solution that would allow our clients to take advantage of the many hours they had likely already put into metadata modeling by creating a seamless transition from OBIEE’s metadata layer to Tableau.
This past week, I was asked to test our new product, called Unify. Wow. Once again, I am so proud of my fellow coworkers. Unify has a simple interface and uses a Tableau web connector to create a direct line to your OBIEE repository for use in Tableau reports, stories and dashboards.
In Unify, we select the subject areas from our RPD presentation layer and choose our tables and columns as needed. Below is a screenshot of Unify using the OBIEE 12c Sample App environment. If you are not familiar with OBIEE 12c, Oracle provides the Sample App - a standalone virtual image with everything that you need to test the product. You can download the Sample App here: http://www.oracle.com/technetwork/middleware/bi-foundation/obiee-samples-167534.html
We are immediately able to leverage all joins, calculated columns, hierarchies, RPD variables, session variables and that’s not all… our RPD security too! Yes, even row level security is respected when we press the “Unify” button and data is brought back into Tableau. So now, there is no reason to lose years of metadata work because one team prefers to visualize with Tableau instead of OBIEE.
Unify allows us to import only the data needed for the report, as we can utilize ‘in-tool’ filtering, keeping our query sets small and our performance high.
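To make that idea a little more concrete, here is a minimal sketch of what a query against the RPD presentation layer looks like: logical SQL sent to the BI Server, which resolves the joins, calculated measures and row-level security defined in the repository. This is not how Unify itself is built (Unify goes through the Tableau web connector); the ODBC DSN name, the weblogic credentials and the Sample App column names below are illustrative assumptions.

```python
# A minimal sketch, assuming an ODBC DSN ("OBIEE_SampleApp") configured against
# the Oracle BI Server and the 12c Sample App subject area "A - Sample Sales".
# Not Unify's implementation; it only illustrates the kind of logical SQL that
# the RPD answers on our behalf.
import pyodbc

# Logical SQL references presentation-layer names only; the BI Server resolves
# the physical joins, the "1- Revenue" calculation and any row-level security.
# The WHERE clause plays the role of 'in-tool' filtering, keeping the result
# set small.
LOGICAL_SQL = """
SELECT
    "Time"."T05 Per Name Year",
    "Base Facts"."1- Revenue"
FROM "A - Sample Sales"
WHERE "Time"."T05 Per Name Year" = '2015'
"""

# Placeholder credentials: whichever OBIEE user connects, their RPD security
# profile determines which rows come back.
conn = pyodbc.connect("DSN=OBIEE_SampleApp;UID=weblogic;PWD=<PASSWORD>")
cursor = conn.cursor()
cursor.execute(LOGICAL_SQL)

for per_name_year, revenue in cursor.fetchall():
    print(per_name_year, revenue)

conn.close()
```

The point is that the query only ever references presentation-layer names; the physical joins, aggregation and security filters are applied by the BI Server, and that is exactly the metadata investment Unify lets Tableau reuse.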
In sum, Unify unites it all - have your cake and eat it too. No matter which tool you love the most, add them together and you will certainly love them both more.