Category Archives: Rittman Mead

Kafka and Football: KSQL, Google Natural Language APIs, BigQuery and DataStudio

If you missed it, yesterday I wrote a guest blog post for Confluent! The blog post mixes two of my favorite topics: Apache Kafka and Football! The post starts by defining the data ingestion from Twitter and sport news RSS feeds via Kafka Connect, continues with the definition of a KSQL UDF Function using Google Natural Language APIs for Sentiment Analysis. Then it's time to define the data sink to Google Big Query and the data visualization with Google Data Studio.

The last bit of the post is dedicated to data analysis with both KSQL and DataStudio on top of the quarterfinal match won by England against Norway. If you are interested in the full article, check it out here!

Oracle Analytics Summit 2019

Last week I attended the Oracle Analytics Summit at Skywalker Ranch. The event was live-streamed on Facebook and widely Tweeted, so a fair number of people may be up to speed with the announcements, I’ll summarise what I thought the key points were.

The purpose of the event, I think, was to relaunch/re-energise Oracle’s analytics products and services, make some specific announcements, and to try and better engage customers and analysts.

I think a lot of this stems from T. K. Anand taking over last summer and the changes in the team and direction he is looking to make. He was the first speaker at talked of openness and simplicity.

New Product Line

First, he announced a rationalisation of the product line with three offerings:

  • OAC (Oracle Analytics Cloud) - this is OAC on the second generation OCI (Oracle Cloud Infrastructure), with a simpler pricing model, details below.
  • OAS (Oracle Analytics Server) - this is a feature parity version of OAC available on premise.
  • OAA (Oracle Analytics Applications) - this is a new ‘BI Apps’ product for Oracle’s SaaS offerings such as ERP Cloud and HCM Cloud, more details below.

New Pricing

Second, he announced new pricing for OAC:

  • $20/month for Professional Edition
  • $2,000/month per OCPU for Enterprise Edition


Image from Doug Henschen

This also streamlined the different versions of OAC. Professional is essentially DV (Data Visualisation), and Enterprise is OBIEE and DV. I am not sure what this means for the existing Data Lake edition, however, I believe Essbase is being moved to a separate PaaS service.

I believe this also means that in the future if you are an on-premise OAS customer with the Enterprise Edition, then DV will be included in the price.

At the time of writing this, I’m not sure of the minimum number of users for the Professional Edition, however, given the price point, it looks like Oracle are targeting Power BI, Tableau etc., so I would expect a low number.

The Future of BI Apps

BI Apps has risen from the flames, and I believe there are now a couple of options:

  1. The existing ODI ‘version’ of BI Apps is being updated to support ODI 12c, OBIEE 12c/OAC and ADW. This means whether you have your application reporting on-premise or in the cloud, there is an upgrade path for the traditional BI Apps components, plus, due to the support of ADW, you will be able to migrate an existing on-premise system to Oracle’s cloud. Independent of anything that happens to DIPC, ODI will be supported on Oracle’s IaaS.
  2. There is a new cloud-only version of packaged applications, see OAA above. These provide a managed data pipeline between the SaaS application and an instance of ADW. Information is surfaced through what looks like DV, with some extended capabilities, allowing KPIs to be defined and these KPIs to arranged as a series of Tiles on a dashboard, not unlike the look of the Day by Day app.

Customer Commitment

One clear objective is more interaction and integration with customers, and to this end Oracle is increasing the number of people in support, there is more online content available, such as their Udemy channel and they are also creating something called the Idea Lab for the analytics community, where users can make suggestions and interact with the Product Managers.


I felt throughout the event that Oracle are aware of their shortcomings in the past and are committed to listening to customers, giving them a better experience and putting them more at the centre of their world.

The product line and pricing changes act as proof that they are committed to acting on this; it’s not just lip service. Through the pricing of the Professional Edition, I think you can also see Oracle going for the departmental or shadow IT world of data exploration and visualisation.

One customer commented at the end of the event that no one tried to sell him anything, that, if anything, is a good indicator for change.

Rittman Mead at Kscope 2019

Rittman Mead at Kscope 2019

June is time for one of my favourite conferences: Kscope! This year the location is Seattle and the Agenda is impressive. The event starts on Sunday with the Symposiums driven by Oracle Product Managers and divided by stream of interest.

The main conference is Monday-Wednesday with the Thursday dedicated to Deep Dive sessions. On Tuesday morning the Analytics track attention will be shifted to Skywalker Ranch for live-stream of the Oracle Analytics Summit!

I'll be representing Rittman Mead with two talks: One about "Become an Equilibrista: Find the Right Balance in the Analytics Tech Ecosystem" (Jun 24, 2019 03:45 PM - 04:45 PM, Room 3B, Level 3) discussing how to make self-service and centralized Analytics coexist successfully thus avoiding the "Excel Hell".

Rittman Mead at Kscope 2019

The second session is about two topics I love: Machine Learning and Wine! The "Is It Corked? Wine Machine Learning Predictions with OAC" (Jun 25, 2019 03:45 PM - 04:45 PM, Room 3B, Level 3) will show the details on how Oracle Analytics Cloud democratizes data science using a Wine Dataset as example. During the session we'll explore how OAC provides an easy and visual interface to Machine Learning and how a predictive model can be built, tested and evaluated within the same platform.

Rittman Mead at Kscope 2019

If you'll be at Kscope 2019 and you see me around, don't hesitate to stop me! I’d be pleased to speak with you about Wine, Food, Coffee, and Analytics of course!

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google’s Natural Language API and Python

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

Hi, Game of Thrones aficionados, welcome to GoT Series 8 and my tweet analysis! If you missed any of the prior season episodes, here are I, II and III. Finally, after almost two years, we have a new series and something interesting to write about! If you didn't watch Episode 1, do it before reading this post as it might contain spoilers!

Let's now start with a preview of the starting scene of Episode 2:

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

If you followed the previous season blog posts you may remember that I was using Kafka Connect to source data from Twitter, doing some transformations with KSQL and then landing the data in BigQuery using Connect again. On top of it, I was using Tableau to analyze the data.

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

The above infrastructure was working fine and I have been able to provide insights like the sentiment per character and the "game of couples" analysing how a second character mentioned in the same tweet could change the overall sentiment.

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

The sentiment scoring was however done at visualization time, with the data extracted from BigQuery into Tableau at tweet level, scored with an external call to R, then aggregated and finally rendered.

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

As you might understand the solution was far from optimal since:

  • The Sentiment scoring was executed for every query sent to the database, so possibly multiple times per dashboard
  • The data was extracted from the source at tweet level, rather than aggregated

The dashboard indeed was slow to render and the related memory consumption huge (think about data volumes being moved around). Furthermore, Sentiment Scores were living only inside Tableau: if any other people/application/visualization tool wanted to use them, they had to recalculate from scratch.

My question was then: where should I calculate Sentiment Scores in order to:

  • Do it only once per tweet, not for every visualization
  • Provide them to all the downstream applications

The answer is simple, I need to do it as close to the source as possible: in Apache Kafka!

Sentiment Scoring in Apache Kafka

There are a gazillion different ways to implement Sentiment Scoring in Kafka, so I chose a simple method based on Python and Google's Natural Language API.

Google Natural Language API

Google's NL APIs is a simple interface over a pre-trained Machine Learning model for language Analysis and as part of the service it provides sentiment scoring.

The Python implementation is pretty simple, you just need to import the correct packages

from import language_v1
from import enums

Instantiate the LanguageServiceClient

client = language_v1.LanguageServiceClient()

Package the tweet string you want to be evaluated in a Python dictionary

content = 'I'm Happy, #GoT is finally back!'
type_ = enums.Document.Type.PLAIN_TEXT
document = {'type': type_, 'content': content}

And parse the response

response = client.analyze_sentiment(document)
sentiment = response.document_sentiment
print('Score: {}'.format(sentiment.score))
print('Magnitude: {}'.format(sentiment.magnitude))

The result is composed by Sentiment Score and Magnitude:

  • Score indicated the emotion associated with the content as Positive (Value > 0) or Negative (Value < 0)
  • Magnitude indicates the power of such emotion, and is often proportional with the content length.

Please note that Google's Natural Language API is priced per document so the more content you send for scoring, the bigger your bill will be!

Creating a Kafka Consumer/Producer in Python

Once we fixed how to do Sentiment Scoring, it's time to analyze how we can extract a tweet from Kafka in Python. Unfortunately, there is no Kafka Streams implementation in Python at the moment, so I created an Avro Consumer/Producer based on Confluent Python Client for Apache Kafka. I used the jcustenborder/kafka-connect-twitter Connect, so it's always handy to have the Schema definition around when prototyping.

Avro Consumer

The implementation of an Avro Consumer is pretty simple: as always first importing the packages

from confluent_kafka import KafkaError
from confluent_kafka.avro import AvroConsumer
from confluent_kafka.avro.serializer import SerializerError

then instantiating the AvroConsumer passing the list of brokers, useful, as we'll see later, to add multiple consumers to the same topic, and the location of the schema registry service in schema.registry.url.

c = AvroConsumer({
    'bootstrap.servers': 'mybroker,mybroker2',
    '': 'groupid',
    'schema.registry.url': ''})

Next step is to subscribe to a topic, in my case got_avro


and start polling the messages in loop

while True:
        msg = c.poll(10)

    except SerializerError as e:
        print("Message deserialization failed for {}: {}".format(msg, e))



In my case, the message was returned as JSON and I could extract the tweet Text and Id using the json package


Avro Producer

The Avro Producer follows a similar set of steps, first including needed packages

from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

Then we define the Avro Key and Value Schemas, in my case I used the tweet Id as key and included the text in the value together with the sentiment score and magnitude.

key_schema_str = """
   "namespace": "my.test",
   "name": "value",
   "type": "record",
   "fields" : [
       "name" : "id",
       "type" : "long"
value_schema_str = """
   "namespace": "my.test",
   "name": "key",
   "type": "record",
   "fields" : [
       "name" : "id",
       "type" : "long"
       "name" : "text",
       "type" : "string"
       "name" : "sentimentscore",
       "type" : "float"
       "name" : "sentimentmagnitude",
       "type" : "float"

Then it's time to load the Key and the Value

value_schema = avro.loads(value_schema_str)
key_schema = avro.loads(key_schema_str)
key = {"id": id}
value = {"id": id, "text": text,"sentimentscore": score ,"sentimentmagnitude": magnitude}

Creating the instance of the AvroProducer passing the broker(s), the schema registry URL and the Key and Value schemas as parameters

avroProducer = AvroProducer({
    'bootstrap.servers': 'mybroker,mybroker2',
    'schema.registry.url': 'http://schem_registry_host:port'
    }, default_key_schema=key_schema, default_value_schema=value_schema)

And finally produce the event defining as well the topic that will contain it, in my case got_avro_sentiment.

avroProducer.produce(topic='got_avro_sentiment', value=value, key=key)

The overall Producer/Consumer flow is needless to say, very easy

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

And it works!

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

Parallel Sentiment Scoring

One thing I started noticing immediately, however, is that especially on tweeting peaks, the scoring routine couldn't cope with the pace of the incoming tweets: a single python Consumer/Producer was not enough. No problem! With Kafka, you can add multiple consumers to the same topic, right?

Of course Yes! But you need to be careful.

Consumer Groups and Topic Partitions

You could create multiple consumers on different Consumer Groups (defined by the parameter mentioned above), but by doing this you're telling Kafka that those consumers are completely independent, thus Kafka will send each one a copy of every message. In our case, we'll simply end up scoring N times the same message, one for each consumer.

If, on the other hand, you create multiple consumers with the same consumer group, Kafka will treat them as unique consuming process and will try to share the load amongst them. However, it will do so only if the source topic is partitioned and will exclusively associate each consumer to one (or more) topic partitions! To read more about this check the Confluent documentation.

The second option is what we're looking for, having multiple threads reading from the same topic and splitting the tweet workload, but how do we split an existing topic into partitions? Here is where KSQL is handy! If you don't know about KSQL, read this post!

With KSQL we can define a new STREAM sourcing from an existing TOPIC or STREAM and the related number of partitions and partition key (the key's hash will be used to assign deterministically a message to a partition). The code is the following


Few things to keep in mind:

  • Choose the number of partitions carefully, the more partitions for the same topic, the more throughput but at the cost of extra complexity.
  • Choose the <PARTITION_KEY> carefully: if you have 10 partitions but only 3 distinct Keys, then 7 partitions will not be used. If you have 10 distinct keys but 99% of the messages have just 1 key, you'll end up using almost always the same partition.

Yeah! We can now create one consumer per partition within the same Consumer Group!

Joining the Streams

As the outcome of our process so far we have:

  • The native GOT_AVRO Stream coming from Kafka Connect, which we divided into 6 partitions using the tweet id as Key and named GOT_AVRO_PARTITIONED.
  • A GOT_AVRO_SENTIMENT Stream that we created using Python and Google's Natural Language API, with id as Key.

The next logical step would be to join them, which is possible with KSQL by including the WITHIN clause specifying the temporal validity of the join. The statement is, as expected, the following:

    ON A.ID=B.ID; 

Please note that I left a two minute window to take into account some delay in the scoring process. And as you would expect I get............ 0 results!

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

Reading the documentation better gave me the answer: Input data must be co-partitioned in order to ensure that records having the same key on both sides of the join are delivered to the same stream task.

Since the GOT_AVRO_PARTITIONED stream had 6 partitions and GOT_AVRO_SENTIMENT only one, the join wasn't working. So let's create a 6-partitioned version of GOT_AVRO_SENTIMENT.


Now the join actually works!

Game of Thrones Series 8: Real Time Sentiment Scoring with Apache Kafka, KSQL, Google's Natural Language API and Python

Next topics are: pushdown to Google's BigQuery and visualization using Google's Data Studio! But this, sadly, will be for another post! See you soon, and enjoy Game of Thrones!

Democratize Data Science with Oracle Analytics Cloud – Data Analysis and Machine Learning

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Welcome back! In my previous Post, I described how the democratization of Data Science is a hot topic in the analytical industry. We then explored how Oracle Analytics Cloud can act as an enabler for the transformation from Business Analyst to Data Scientist and covered the first steps in a Data Science project: problem definition, data connection & cleaning. In today's post, we'll cover the second part of the path: from the data transformation and enrichment, the analysis, the machine learning model training and evaluation. Let's Start!

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Step #3: Transform & Enrich

In the previous post, we understood how to clean data in order to handle wrong values, outliers, perform aggregation, feature scaling and divide our dataset between train and test. Cleaning the data, however, is only the first step in data processing, and should be followed by what in Data Science is called Feature Engineering.

Feature Engineering is a fancy name to call what in ETL terms we always called data transformation, we take a set of columns in input and we apply transformation rules to create new columns. The aim of Feature Engineering is to create good predictors for the following machine learning model. Feature Engineering is a bit of black art and to achieve excellent results requires a deep understanding of the ML Model we intend to use. However, most of the basic transformations are actually driven by domain knowledge: we should create new columns that we think will improve the problem explanation. Let's see some examples:

  • If we're planning to predict the Taxi Fare in New York between any two given points and we have source and destination, a good predictor for the fare probably would be the Euclidean distance between the two.
  • If we have Day/Month/Year on separate columns, we may want to condense the information in a unique column containing the Date
  • In case our dataset contains location names (Cities, Regions, Countries) we may want to geo-tag those properly with ZIP codes or ISO Codes.
  • If we have personal information like Credit Cards details or Person Name, we may want to decide to obfuscate or extract features like the person's sex from the name (on this topic please check the blog post about GDPR and ML from Brendan Tierney).
  • If we have continuous values like the person's age, do we think there is much difference between a 35, 36 or 37 year-old person? If not we should think about binning them in the same category.
  • Most Machine Learning Models can't cope with categorical data, thus we need to transform them to numbers (aka encoding). The standard process, when no ordering exists between the labels, is to create a new column for each value and mark the rows with 1/0 accordingly.
Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Oracle Analytics Cloud again covers all the above cases with two tools: Euclidean distance, generic data transformation like data condensation and binning are standard steps of the Dataflow component. We only need to set the correct parameters or write simple SQL-like statements. Moreover, for binning, there are options to do it manually as well as automatically providing equal-width and equal-height bins therefore taking out the manual labour and related BIAS.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

On the other side the geo-tagging, data obfuscation, automatic feature extraction (like person's sex based on name) is something that with most of the other tools needs to be resolved by hand, with complex SQL statements or dedicated Machine Learning efforts.

OAC again does a great job during the Data Preparation Recommendation step: after defining a data source, OAC will scan column names and values in order to find interesting features and propose some recommendations like geo-tagging, obfuscation, data splitting (e.g. Full Name split into First and Last Name) etc.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

The accepted recommendations will be added to a Data Preparation Script that can be automatically applied when updating our dataset.

Step #4: Data Analysis

Data Analysis is declared as Step #4 however since the Data Transformation and Enrichment phase we started a circular flow in order to optimize our predictive model output.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

The analysis is a crucial step for any Data Science project; in R or Python one of the first steps is to check dataset head() that will show a first overview of the data like the below

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

OAC does a similar job with the Metadata Overview where we can see for each column the name, type and sample values as well as the Attribute/Metric definition and associated aggregation than we can then change later on.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Analysing Data is always a complex task and is where the expert eye of a data scientist makes the difference. OAC, however, can help with the excellent Explain feature. As described in the previous post, by right clicking on any column in the dataset and selecting Explain, OAC will start calculating statistics and metrics related to the column and display the findings in graphs that we can incorporate in the Data Visualization project.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Even more, there are additional tabs in the Explain window that provide Key Drivers, Segments and Anomalies.

  • Key Drivers provides the statistically significant drivers for the column we are examining.
  • Segments shows hidden groups in the dataset that can predict outcomes in the column
  • Anomalies does an outlier detection, showing which are corner cases in our dataset
Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

Some Data Science projects could already end here. If the objective was to find insights, anomalies or particular segments in our dataset, Explain already provides that information in a clear and reusable format. We can add the necessary visualization to a Project and create a story with the Narrate option.

If on the other side, our scope is to build a predictive model, then it's time to tackle the next phase: Model Training & Evaluation.

Step #5: Train & Evaluate

Exciting: now it's time to tackle Machine Learning! The first thing to do is to understand what type of problem we are trying to solve. OAC allows us to solve problems in the following categories:

  • Supervised when we have a history of the problem's solution and we want to predict future outcomes, we can then identify two subcategories
    • Regression when we are trying to predict a continuous numerical value
    • Classification when we are trying to assign every sample to a category out of two or more
  • Unsupervised when we don't have a history of the solution, but we ask the ML tool to help us understanding the dataset.
    • Clustering when we try to label our dataset in categories based on similarity.
Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

OAC provides two different ways to apply Machine Learning on a dataset: On the Fly or via DataFlows. The On the Fly method is provided directly in the data visualization: when we create any chart, OAC provides the option to add Clusters, Outliers, Trend and Forecast Lines.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

When adding one of the Analytics, we have some control over the behaviour of the predictive model. For the clustering image above we can decide which algorithm to implement (between K-means and Hierarchical Clustering), the number of clusters and the trellis scope in case we visualize multiple scatterplots, one for each value of a dimension.

Applying Machine Learning models on the fly is very useful and could provide some great insights, however, it suffers from a limitation: the columns analysed by the model are only the ones included in the visualization, we have no control over other columns we may want to add to the model to increase predictions accuracy.

If we want to have granular control over columns, algorithm and parameters to use, OAC provides the Train Model step in the DataFlow component.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

As described above OAC provides the option to solve Regression problems via Numeric Prediction, apply Binary or Multi-Classifier for Classification, and Clustering. There is also an option to train Custom Models which can be scripted by a Data Scientist, wrapped in XML tags and included in OAC (more about this topic in a later post).

Once we've selected the class of problem we're aiming to solve, OAC lets us select which Model to train between various prebuilt ones. After selecting the model, we need to identify which is the target column (for Supervised ML classes) and fix the parameters. Note the Train Partition Percent providing an automated way to split the dataset in train/test and Categorical/Numerical Column Imputation to handle the missing values. As part of this process, the encoding for categorical data is executed.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

... But which Model should we use? What parameters should we pick? One lesson I got from my knowledge of Machine Learning is that there is no golden model and parameters set to solve all problems. Data Scientist will try to use different models, compare them and tune parameters based on experimentation (aka trial and error).

OAC allows us to create an initial Dataflow, select a model, set the parameters then save the Dataflow and model output.  Then restart by opening the Dataflow changing the model or the parameters and storing the artefacts with different names to compare them.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

After creating one or more Models, it's time to evaluate them, on OAC we can select a Model and Click on Inspect. In the Overview tab, Inspect shows the model description and properties. Far more interesting is the Quality tab which provides a set of Model scoring metrics based on the test dataset created following the Train Partition Percent parameter. In case of a Numeric Prediction problem, the Quality tab will show for each model quality metrics like the Root Mean Squared Error. OAC will provide similar metrics no matter which ML algorithm you're implementing, making the analysis and comparison easy.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

In the case of Classification, the Quality Tab will show the confusion matrix together with some pre-calculated metrics like Precision, Recall etc.

Democratize Data Science with Oracle Analytics Cloud - Data Analysis and Machine Learning

The model selection then becomes an optimization problem for the metric (or set of) we picked during the problem definition (see TEP in the previous post). After trying several models, parameters, features, we'll then choose the model that minimizes the error (or increase the accuracy) of our prediction.

Note: as part of the model training, it's very important to select which columns will be used for the prediction. A blind option is to use all columns but adding irrelevant columns isn't going to provide better results and, for big or wide (huge number of columns) datasets, it becomes computationally very expensive. As written before, the Explain function provides the list of columns that represent statistically significant predictors. The columns listed there should represent the basics of the model training.

Ok, part II done, we saw how to perform Feature Engineering and Model Training and Evaluation, check my next post for the final piece of the Data Science journey: Predictions and final considerations!