Category Archives: Rittman Mead

An Introduction to Oracle Stream Analytics

Oracle Stream Analytics (OSA) is a graphical tool that provides “Business Insight into Fast Data”. In layman's terms, that translates into an intuitive web-based interface for exploring, analysing, and manipulating streaming data sources in realtime. These sources can include REST, JMS queues, and Kafka. The inclusion of Kafka opens OSA up to integration with many new-build data pipelines that use it as a backbone technology.

Previously known as Oracle Stream Explorer, it is part of the SOA component of Fusion Middleware (just as OBIEE and ODI are part of FMW too). In a recent blog it was positioned as “[…] part of Oracle Data Integration And Governance Platform.”. Its Big Data credentials include support for Kafka as source and target, as well as the option to execute across multiple nodes for scaling performance and capacity using Spark.

I've been exploring OSA from the comfort of my own Mac, courtesy of Docker and a Docker image for OSA created by Guido Schmutz. The benefits of Docker are many and covered elsewhere, but what I loved about it in this instance was that I didn't have to download a VM that was tens of GB, nor spend time learning how to install OSA from scratch, which, whilst interesting, wasn't a priority compared to just trying the tool out and seeing what it could do. [Update] It turns out that installation is a piece of cake, and the download is less than 1GB, but in general the principle still stands: Docker is a great way to get up and running quickly with something.

In this article we’ll take OSA for a spin, looking at some of the functionality and terminology, and then real examples of use with live Twitter data.

To start with, we sign in to Oracle Stream Analytics:

From here, click on the Catalog link, which lists all of the available resources. Some of these resource types include:

  • Streams – definitions of sources of data such as Kafka, JMS, and a dummy data generator (event generator)
  • Connections – servers etc. from which Streams are defined
  • Explorations – front-ends for seeing the contents of Streams in realtime, as well as applying light transformations
  • Targets – destination for transformed streams

Viewing Realtime Twitter Data with OSA

The first example I'll show is the canonical big data/streaming example seen everywhere: Twitter. Twitter is even built into OSA as a Stream source. If you go to https://dev.twitter.com you can get yourself a set of credentials enabling you to query the live Twitter firehose for given hashtags or users.

With my Twitter dev credentials, I create a new Connection in OSA:

Now we have an entry in the Catalog, for the Twitter connection:

from which we can create a Stream, using the connection and a set of hashtags or users for whom we want to stream tweets:

The Shape is basically the schema or data model that is applied to the stream. There is a built-in one for Twitter, which we'll use here:

When you click Save, if you get the error Unable to deploy OEP application, check the OSA log file for causes such as being unable to reach Twitter or invalid credentials.

Assuming the Stream is created successfully, you are then prompted to create an Exploration, from which you can see the Stream in realtime:

Explorations can have multiple stream sources, and be used to transform the contents, which we’ll see later. For now, after clicking Create, we get our Exploration window, which shows the contents of the stream in realtime:

At the bottom of the screen there's the option to plot one or more charts showing any of the numeric values in the stream, as can be seen in the animation above.

I’ll leave this example here for now, but finish by using the Publish option from the Actions menu, which makes it available as a source for subsequent analyses.

Adding Lookup Data to Streams

Let’s look now at some more of the options available for transforming and ‘wrangling’ streaming data with OSA. Here I’m going to show how two streams can be joined together (but not crossed) based on a common field, and the resulting stream used as the input for a subsequent process. The data is simulated, using a CSV file (read by OSA on a loop) and OSA’s Event Generator.

From the Catalog page I create a new Stream, using Event Generator as the Type:

On the second page of the setup I define how frequently I want the dummy events to be generated, and the specification for the dummy data:

The last bit of setup for the stream is to define the Shape, which is the schema of data that I’d like generated:

The Exploration for this stream shows the dummy data:

The second stream is going to be sourced from a very simple key/value CSV file:

attr_id,attr_value  
1,never  
2,gonna  
3,give  
4,you  
5,up

The stream type is CSV, and I can configure how often OSA reads from it, as well as telling OSA to loop back to the beginning when it's read to the end, thus simulating a proper stream. The ‘shape’ is picked up automatically from the file, based on the first row (headers) and the data types inferred from the values:

The Exploration for the stream shows the five values repeatedly streamed through (since I ticked the box to ‘loop’ the CSV file in the stream):

Back on the Catalog page I’m going to create a new Exploration, but this time based on a Pattern. Patterns are pre-built templates for stream manipulation and processing. Here we’ll use the pattern for a “left outer join” between streams.

The Pattern has a set of pre-defined fields that need to be supplied, including the stream names and the common field with which to join them. Note also that I’ve increased the Window Range. This is necessary so that a greater range of CSV stream events are used for the lookup. If the Range is left at the default of 1 second then only events from both streams occurring in the same second that match on attr_id would be matched. Unless both streams happen to be in sync on the same attr_id from the outset then this isn’t going to happen that often, and certainly wouldn’t in a real-life data stream.

So now we have the two joined streams:

Within an Exploration it is possible to do light transformation work. By right-clicking on a column you can rename or remove it, which I've done here for the duplicated attr_id (duplicated since it appears in both streams), as well as renaming attr_value:

Daisy-Chaining, Targets, and Topology

Once an Exploration is Published it can be used as the Source for subsequent Explorations, enabling you to map out a pipeline based on multiple source streams and transformations. Here we’re taking the exploration created just above that joined the two streams together, and using the output as the source for a new Exploration:

Since the Exploration is based on a previous one, the same stream data is available, but with the joins and transformations already applied.

From here another transformation could be applied, such as replacing the value of one column conditionally based on that of another.

Whilst OSA enables rapid analysis and transformation of inbound streams, it also lets you stream the transformed results outside of OSA, to a Target. As well as Kafka, other technologies are supported as targets, including a REST endpoint or a simple CSV file.

With a target configured, as well as an Exploration based on the output of another, the Topology view comes in handy for visualising the flow of data. You can access this from the Topology icon on an Exploration page, or from the dropdown menu against a given object on the Catalog page.



In the next post I will look at how Oracle Stream Analytics can be used to analyse, enrich, and publish data to and from Kafka. Stay tuned!

The post An Introduction to Oracle Stream Analytics appeared first on Rittman Mead Consulting.

Announcing OBI Remote Training

Since the release of OBIEE 12c in 2015, we have received countless inquiries about how we would be offering our training. Our customers are familiar with our ability to provide on-site private training for a team and we are well known for hosting training classes in our offices in the UK and the US. But what most people aren’t aware of is that we now offer OBI remote training.

Our public training schedule offers a variety of courses monthly, some of which are offered exclusively as remote classes. And for any one of our public classes that is hosted in our U.S. offices, we also offer a limited number of seats to remote attendees. What does this mean for you? This means you have options!

One of our goals here at Rittman Mead is to provide unhindered access to the great wealth of information our team has accumulated through their extensive real-world experience. Now we’ve translated this goal into more accessible training. We understand budgets can be tight and travel may not always be an option for you or your team, but we don’t want that to be the reason you can’t attend our training.

In mid-2015 we started testing our ability to deliver remote training. Our main concern as we began testing was whether we’d be able to deliver the same value to our customers in a digital classroom that we’ve traditionally been able to deliver in a physical classroom. Our fear was that when you lost the face-to-face interaction between the instructor and students, you would also lose some of the rhythm and chemistry of the training, and, consequently, our students would feel less engaged. Other more technical concerns were on our minds, ranging from sound and video quality to connectivity. Much to our surprise and satisfaction, however, our concerns quickly dissolved as, time after time, we were able to deliver the training without issue.

So after plenty of testing, we are pleased to offer remote training as a regular option in our training schedule.

We are aware that remote training (or online training) has been around for some time, and we are not claiming to be innovators in the ways of online learning, but we feel that the platform for online learning has finally reached a level that is in line with the quality we demand for our training.

In fact, we have consistently received high marks from customers who have attended our remote training, solidifying our confidence that it does in fact live up to our standards. We invite you to check out our training options. Whether it be on-site training (public or private) or remote training, rest assured that you will be receiving expert-level training from Rittman Mead’s best.

For a full list of our scheduled training courses, see our US or UK calendars.

The post Announcing OBI Remote Training appeared first on Rittman Mead Consulting.

System Metrics Collectors

The need to monitor and control system performance is not new. What is new is the trend of clever, lightweight, easy-to-set-up, open source metric collectors on the market, along with timeseries databases to store these metrics, and user-friendly front ends through which to display and analyse the data.

In this post I will compare Telegraf, Collectl and Topbeat as lightweight metric collectors. All of them do a great job of collecting a variety of useful system and application statistics with minimal overhead on the servers. Each is easy to configure and well documented, but there are differences in the range of inputs and outputs they support, how they extract the data, what metrics they collect, and where they store them.

  • Telegraf is part of the Influx TICK stack, and works with a vast variety of useful input plugins such as Elasticsearch, nginx, AWS and so on. It also supports a variety of outputs, obviously InfluxDB being the primary one. (Find out more…)
  • Topbeat is a new tool from Elastic, the company behind Elasticsearch, Logstash, and Kibana. The Beats platform is evolving rapidly, and includes topbeat, winlogbeat, and packetbeat. In terms of metric collection its support for detailed metrics such as disk IO is relatively limited currently. (Find out more…)
  • Collectl is a long-standing favourite of systems performance analysts, providing a rich source of data. This depth of complexity comes at a bit of a cost when it comes to the documentation’s accessibility, it being aimed firmly at a systems programmer! (Find out more…)

In this post I have used InfluxDB as the backend for storing the data, and Grafana as the front end visualisation tool. I will explain more about both tools later in this post.

In the screenshot below I have used Grafana dashboards to show “Used CPU”, “Used Memory” and “Network Traffic” stats from each of the collectors. As you can see, the output of all three is almost the same. What makes them different is:

    • What can your infrastructure support? For example, you cannot install Telegraf on an old version of OS X Server.
    • What input plugins do you need? The current version of Topbeat doesn't support more detailed metrics such as disk IO and network stats.
    • What storage do you want/need to use for the outputs? InfluxDB is the best match for Telegraf data, whilst Beats pairs naturally with Elasticsearch.
    • What is your visualisation tool, and what does it work with best? In all cases the front end should natively support time series visualisations.


Next I am going to provide more details on how to download and install each of these metric collector services. The example commands are written for a Linux system.

Telegraf

“An open source agent written in Go for collecting metrics and data on the system it’s running on or from other services. Telegraf writes data it collects to InfluxDB in the correct format.”

  1. Download and install InfluxDB: sudo yum install -y https://s3.amazonaws.com/influxdb/influxdb-0.10.0-1.x86_64.rpm
  2. Start the InfluxDB service: sudo service influxdb start
  3. Download Telegraf: wget http://get.influxdb.org/telegraf/telegraf-0.12.0-1.x86_64.rpm
  4. Install Telegraf: sudo yum localinstall telegraf-0.12.0-1.x86_64.rpm
  5. Start the Telegraf service: sudo service telegraf start
  6. Done!

The default configuration file for Telegraf sits in /etc/telegraf/telegraf.conf, or a new config file can be generated using the -sample-config flag at a location of your choice: telegraf -sample-config > telegraf.conf. Update the config file to enable/disable/set up the different input or output plugins, e.g. I enabled network inputs: [[inputs.net]]. Finally, to test the config file and verify the output metrics, run: telegraf -config telegraf.conf -test
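
As a rough sketch of the relevant sections (the InfluxDB URL, database name and list of enabled inputs below are assumptions to adjust for your own environment), telegraf.conf looks something like this:

[[outputs.influxdb]]
  # assumed local InfluxDB instance; metrics land in the "telegraf" database
  urls = ["http://localhost:8086"]
  database = "telegraf"

[[inputs.cpu]]
[[inputs.mem]]
# the network input enabled above
[[inputs.net]]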

Once everything is ready and started, a new database called ‘telegraf’ will be added to the InfluxDB storage, which you can connect to and query. You will read more about InfluxDB later in this post.

 

Collectl

Unlike most monitoring tools that either focus on a small set of statistics, format their output in only one way, run either interactively or as a daemon but not both, collectl tries to do it all. You can choose to monitor any of a broad set of subsystems which currently include buddyinfo, cpu, disk, inodes, infiniband, lustre, memory, network, nfs, processes, quadrics, slabs, sockets and tcp.

  • Install collectl: sudo yum install collectl
  • Update the Collectl config file at /etc/collectl.conf to turn on/off different switches, and also to send Collectl's output to a database, i.e. InfluxDB (see the sketch after this list)
  • Restart the Collectl service: sudo service collectl restart
  • Collectl will then write its metrics to a new InfluxDB database called “graphite”.
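
To receive these metrics, InfluxDB needs its Graphite-protocol listener enabled. How the Collectl side is configured depends on your Collectl version and chosen export module, so treat this as a hedged sketch of the InfluxDB side only; the bind address, protocol and database name below are common defaults and may differ between InfluxDB versions:

[[graphite]]
  enabled = true
  # listen for Graphite-protocol metrics on TCP port 2003
  bind-address = ":2003"
  protocol = "tcp"
  # metrics arriving on this listener are written to the "graphite" database
  database = "graphite"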

 

Topbeat

Topbeat is a lightweight way to gather CPU, memory, and other per-process and system wide data, then ship it to (by default) Elasticsearch to analyze the results.

  • Download Topbeat: wget https://download.elastic.co/beats/topbeat/topbeat-1.2.1-x86_64.rpm
  • Install: sudo yum localinstall topbeat-1.2.1-x86_64.rpm
  • Edit the topbeat.yml configuration file at /etc/topbeat and set the output to elasticsearch or logstash (see the sketch after this list).
  • If choosing elasticsearch as the output, you need to load the index template, which lets Elasticsearch know which fields should be analyzed in which way. The recommended template file is installed by the Topbeat packages. You can either configure Topbeat to load the template automatically, or you can run a shell script to load the template: curl -XPUT 'http://localhost:9200/_template/topbeat' -d@/etc/topbeat/topbeat.template.json
  • Run topbeat: sudo /etc/init.d/topbeat start
  • To test your Topbeat Installation try: curl -XGET 'http://localhost:9200/topbeat-*/_search?pretty'
  • Topbeat logs are written to /var/log
  • Reference to output fields 
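
As a minimal sketch of the parts that matter here (this assumes the Beats 1.x nested YAML layout and a local Elasticsearch instance; adjust the hosts and collection period for your environment), topbeat.yml looks something like this:

input:
  # how often to collect metrics, in seconds
  period: 10
  # regular expression matching the processes to monitor
  procs: [".*"]

output:
  elasticsearch:
    # assumed local Elasticsearch; change host/port as needed
    hosts: ["localhost:9200"]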

 

Why write another metrics collector?

From everything that I have covered above, it is obvious that there is no shortage of open source agents for collecting metrics. Still, you may come across a situation where none of the options can be used, e.g. a specific operating system (in this case, MacOS on an Xserve) that can't support any of the options above. The code below is my version of a light metric collector, keeping track of disk IO, network, CPU and memory stats for the host on which the simple bash script is run.

The code will run through an indefinite loop until it is force-quit. Within the loop, first I have used a curl request (InfluxDB API Reference) to create a database called OSStat; if the database already exists, nothing will happen. Then I have used a variety of built-in OS tools to extract the data I needed. In my example sar -u for CPU, sar -n for network, vm_stat for memory, and iotop for disk IO returned the values I needed. With a quick search you will find many more options. I also used a combination of awk, sed and grep to transform the values from these tools into a structure that was easier to use on the front end. Finally I pushed the results to InfluxDB using curl requests.

#!/bin/bash
export INFLUX_SERVER=$1
while [ 1 -eq 1 ];
do
 
#######CREATE DATABASE ########
curl -G http://$INFLUX_SERVER:8086/query  -s --data-urlencode "q=CREATE DATABASE OSStat" > /dev/null
 
####### CPU  #########
sar 1 1 -u | tail -n 1 | awk -v MYHOST=$(hostname)   '{  print "cpu,host="MYHOST"  %usr="$2",%nice="$3",%sys="$4",%idle="$5}' | curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary @- > /dev/null
 
####### Memory ##########
# vm_stat reports page counts with a trailing "."; the escaped dot in sed strips it
FREE_BLOCKS=$(vm_stat | grep free | awk '{ print $3 }' | sed 's/\.//')
INACTIVE_BLOCKS=$(vm_stat | grep inactive | awk '{ print $3 }' | sed 's/\.//')
SPECULATIVE_BLOCKS=$(vm_stat | grep speculative | awk '{ print $3 }' | sed 's/\.//')
WIRED_BLOCKS=$(vm_stat | grep wired | awk '{ print $4 }' | sed 's/\.//')
 
FREE=$((($FREE_BLOCKS+SPECULATIVE_BLOCKS)*4096/1048576))
INACTIVE=$(($INACTIVE_BLOCKS*4096/1048576))
TOTALFREE=$((($FREE+$INACTIVE)))
WIRED=$(($WIRED_BLOCKS*4096/1048576))
ACTIVE=$(((4096-($TOTALFREE+$WIRED)))) # 4096 here is the host's total physical RAM in MB (a 4GB machine); adjust to suit
TOTAL=$((($INACTIVE+$WIRED+$ACTIVE)))
 
curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary  "memory,host="$(hostname)" Free="$FREE",Inactive="$INACTIVE",Total-free="$TOTALFREE",Wired="$WIRED",Active="$ACTIVE",total-used="$TOTAL > /dev/null
 
####### Disk IO ##########
iotop -t 1 1 -P | head -n 2 | grep 201 | awk -v MYHOST=$(hostname) '{ print "diskio,host="MYHOST" io_time="$6",read_bytes="$8*1024",write_bytes="$11*1024}' | curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary @- > /dev/null
 
###### NETWORK ##########
sar -n DEV 1 | grep -v IFACE | grep -v Average | grep -v -E ^$ | awk -v MYHOST="$(hostname)" '{print "net,host="MYHOST",iface="$2" pktin_s="$3",bytesin_s="$4",pktout_s="$5",bytesout_s="$6}' | curl -i -XPOST "http://${INFLUX_SERVER}:8086/write?db=OSStat"  -s --data-binary @- > /dev/null
 
sleep 10;
done

 

 

InfluxDB Storage

“InfluxDB is a time series database built from the ground up to handle high write and query loads. It is the second piece of the TICK stack. InfluxDB is meant to be used as a backing store for any use case involving large amounts of timestamped data, including DevOps monitoring, application metrics, IoT sensor data, and real-time analytics.”

InfluxDB's SQL-like query language is called InfluxQL. You can connect to and query InfluxDB via curl requests (as mentioned above), the command line, or a browser. The following samples cover useful basic InfluxQL command line statements to get you started:

influx -- Connect to the database

SHOW DATABASES  -- Show existing databases; _internal is the embedded database used for internal metrics

USE telegraf -- Make 'telegraf' the current database

SHOW MEASUREMENTS -- show all tables within current database

SHOW FIELD KEYS -- show the table (measurement) definitions within the current database
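
And a hedged example of querying the collected data itself (this assumes the Telegraf cpu input is enabled, which writes a cpu measurement with fields such as usage_idle; adjust the measurement and field names to whatever your collector produces):

SELECT mean(usage_idle) FROM cpu WHERE time > now() - 1h GROUP BY time(1m), host -- average CPU idle percentage per host, per minute, over the last hour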

InfluxDB also has a browser-based admin console, by default accessible on port 8083 (the HTTP API itself listens on port 8086). (Official Reference) (Read more on the Rittman Mead blog)


 

Grafana Visualisation

“Grafana provides rich visualisation options best for working with time series data for Internet infrastructure and application performance analytics.”

It is best to use InfluxDB as the datasource for Grafana, as the Elasticsearch datasource doesn't support all of Grafana's features, e.g. the functions behind the panels. Here is a good introduction video to visualisation with Grafana.


The post System Metrics Collectors appeared first on Rittman Mead Consulting.

Connecting Oracle Data Visualization Desktop to OBIEE

Recently at Rittman Mead we have been asked a lot of questions surrounding Oracle’s new Data Visualization Desktop tool and how it integrates with OBIEE. Rather than referring people to the Oracle docs on DVD, I decided to share with you my experience connecting to an OBIEE 12c instance and take you through some of the things I learned through the process.

In a previous blog, I went through database connections with Data Visualization Desktop and how to create reports using data pulled directly from the database. Connecting DVD to OBIEE is largely the same process, but allows the user to pull in data at the pre-existing report level. I decided to use our 12c ChitChat demo server as the OBIEE source and created some sample reports in Answers to test out with DVD.

From the DVD Data Sources page, clicking “Create New Data Source” brings up a selection pane with the option to select “From Oracle Applications.”


Clicking this option brings up a connection screen with fields for a connection name, URL (the location of the reports you want to pull in as a source), username, and password. This seems like a pretty straightforward process. The Oracle docs on connectivity to OBIEE with DVD say to navigate to the web catalog, select the folder containing the analysis you want to use as a source, and then copy and paste the URL from your browser into the URL field in DVD. However, using this method will cause the connection to fail.


To get Data Visualization Desktop to connect properly, you have to use the URL that you would normally use to log into OBIEE analytics, along with the proper username and password.
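
For example, with a hypothetical hostname, and assuming a default OBIEE 12c installation where the analytics application listens on port 9502, the URL takes the form:

http://obiee-host.example.com:9502/analytics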


Once connected, the web catalog folders are displayed.


From here, you can navigate to the analyses you want to use for data sources.


Selecting the analysis you want to use as your data source is the same process as selecting schemas and tables from a database source. Once the selection is made, a new screen is displayed with all of the tables and columns that were used for the analysis within OBIEE.


From here you can specify each column as an attribute or measure column and change the aggregation for your measures to something other than what was imported with the analysis.

Clicking “Add to Project” loads all the data into DVD under Data Elements and is displayed on the right hand side just like subject area contents in OBIEE.


The objective of pulling data in from existing analyses is described by Oracle as revisualization. Keep in mind that Data Visualization Desktop is meant to be a discovery tool and not so much a day-to-day report generator.

The original report was a pivot table with Revenue and Order information for geographical, product and time series dimensions. Let’s say that I just wanted to look at the revenue for all cities located in the Americas by a specific brand for the year 2012.

Dragging in the appropriate columns and adding filters took seconds and the data loaded almost instantaneously. I changed the view to horizontal bar and added a desc sort to Revenue and this was my result:


Notice how the revenue for San Francisco is much higher than that of any of the other cities. Let's say I want to get a closer look at all the other cities without seeing the revenue data for San Francisco. I could create a new filter for City and exclude San Francisco from the list, or I could just create a filter range for Revenue. Choosing the latter gave me the option of moving a slider to change my revenue value distribution and showed me the results in real time. Pretty cool, right?


Taking one report and loading it in can open up a wide range of data discovery opportunities, but what if there are multiple reports I want to pull data from? You can do this and combine the data together in DVD, as long as the two reports contain common columns with which to join them.

Going back to my OBIEE connection, there are two reports I created on the demo server that both contain customer data.


By pulling in both the Customer Information and Number of Customer Orders Per Year report, Data Visualization Desktop creates two separate data sources which show up under Data Elements.


Inspecting one of the data sources shows the match between the two is made on both Customer Number and Customer Name columns.


Note: It is possible to make your own column matches manually using the Add Another Match feature.

By using two data sets from two different reports, you can blend the data together to discover trends, show outliers and view the data together without touching the database or having to create new reports within OBIEE.


The ability to connect directly to OBIEE with Data Visualization Desktop and pull in data from individual analyses is a very powerful feature that makes DVD that much greater. Combining data from multiple analyses and blending them together internally creates some exciting data discovery possibilities for users with existing OBIEE implementations.

The post Connecting Oracle Data Visualization Desktop to OBIEE appeared first on Rittman Mead Consulting.

Using R with Jupyter Notebooks and Big Data Discovery

Oracle’s Big Data Discovery encompasses a good amount of exploration, transformation, and visualisation capabilities for datasets residing in your organisation’s data reservoir. Even with this though, there may come a time when your data scientists want to unleash their R magic on those same datasets. Perhaps the data domain expert has used BDD to enrich and cleanse the data, and now it’s ready for some statistical analysis? Maybe you’d like to use R’s excellent forecast package to predict the next six months of a KPI from the BDD dataset? And not only predict it, but write it back into the dataset for subsequent use in BDD? This is possible using BDD Shell and rpy2. It enables advanced analysis and manipulation of datasets already in BDD. These modified datasets can then be pushed back into Hive and then BDD.

BDD Shell provides a native Python environment, and you may opt to use the pandas library to work with BDD datasets as detailed here. In other cases you may simply prefer working with R, or have a particular library in mind that only R offers natively. In this article we’ll see how to do that. The “secret sauce” is rpy2 which enables the native use of R code within a python-kernel Jupyter Notebook.

As with previous articles I’m using a Jupyter Notebook as my environment. I’ll walk through the code here, and finish with a copy of the notebook so you can see the full process.

First we’ll see how you can use R in Jupyter Notebooks running a python kernel, and then expand out to integrate with BDD too. You can view and download the first notebook here.

Import the RPY2 environment so that we can call R from Jupyter

import readline is necessary to work around the error: /u01/anaconda2/lib/libreadline.so.6: undefined symbol: PC

import readline

%load_ext rpy2.ipython

Example usage

Single inline command, prefixed with %R

%R X=c(1,4,5,7); sd(X); mean(X)

array([ 4.25])

R code block, marked by %%R

%%R
Y = c(2,4,3,9)
summary(lm(Y~X))

Call:  
lm(formula = Y ~ X)

Residuals:  
    1     2     3     4  
 0.88 -0.24 -2.28  1.64 

Coefficients:  
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   0.0800     2.3000   0.035    0.975  
X             1.0400     0.4822   2.157    0.164

Residual standard error: 2.088 on 2 degrees of freedom  
Multiple R-squared:  0.6993,    Adjusted R-squared:  0.549  
F-statistic: 4.651 on 1 and 2 DF,  p-value: 0.1638

Graphics plot, output to the notebook

%R plot(X, Y)


Pass Python variable to R using -i

import numpy as np
Z = np.array([1,4,5,10])

%R -i Z mean(Z)

array([ 5.])

For more information see the documentation

Working with BDD Datasets from R in Jupyter Notebooks

Now that we’ve seen calling R in Jupyter Notebooks, let’s see how to use it with BDD in order to access datasets. The first step is to instantiate the BDD Shell so that you can access the datasets in BDD, and then to set up the R environment using rpy2

execfile('ipython/00-bdd-shell-init.py')  
%load_ext rpy2.ipython

I also found that I had to make readline available otherwise I got an error (/u01/anaconda2/lib/libreadline.so.6: undefined symbol: PC)

import readline

After this, we can import a BDD dataset, convert it to a Spark dataframe and then a pandas dataframe, ready for passing to R

ds = dss.dataset('edp_cli_edp_8d6fd230-8e99-449c-9480-0c2bddc4f6dc')  
spark_df = ds.to_spark()  
import pandas as pd  
pandas_df = spark_df.toPandas()

Note that there is a lot of passing of the same dataframe into different memory structures here – from BDD dataset context to Spark to Pandas, and that’s before we’ve even hit R. It’s fine for ad-hoc wrangling but might start to be painful with very large datasets.

Now we use the rpy2 integration with Jupyter Notebooks and invoke R parsing of the cell’s contents, using the %%R syntax. Optionally, we can pass across variables with the -i parameter, which we’re doing here. Then we assign the dataframe to an R-notation variable (optional, but stylistically nice to do), and then use R’s summary function to show a summary of each attribute:

%%R -i pandas_df
R.df <- pandas_df
summary(R.df)

vendorid     tpep_pickup_datetime tpep_dropoff_datetime passenger_count
 Min.   :1.000   Min.   :1.420e+12    Min.   :1.420e+12     Min.   :0.000  
 1st Qu.:1.000   1st Qu.:1.427e+12    1st Qu.:1.427e+12     1st Qu.:1.000  
 Median :2.000   Median :1.435e+12    Median :1.435e+12     Median :1.000  
 Mean   :1.525   Mean   :1.435e+12    Mean   :1.435e+12     Mean   :1.679  
 3rd Qu.:2.000   3rd Qu.:1.443e+12    3rd Qu.:1.443e+12     3rd Qu.:2.000  
 Max.   :2.000   Max.   :1.452e+12    Max.   :1.452e+12     Max.   :9.000  
 NA's   :12      NA's   :12           NA's   :12            NA's   :12     
 trip_distance      pickup_longitude  pickup_latitude    ratecodeid    
 Min.   :    0.00   Min.   :-121.93   Min.   :-58.43   Min.   : 1.000  
 1st Qu.:    1.00   1st Qu.: -73.99   1st Qu.: 40.74   1st Qu.: 1.000  
 Median :    1.71   Median : -73.98   Median : 40.75   Median : 1.000  
 Mean   :    3.04   Mean   : -72.80   Mean   : 40.10   Mean   : 1.041  
 3rd Qu.:    3.20   3rd Qu.: -73.97   3rd Qu.: 40.77   3rd Qu.: 1.000  
 Max.   :67468.40   Max.   : 133.82   Max.   : 62.77   Max.   :99.000  
 NA's   :12         NA's   :12        NA's   :12       NA's   :12      
 store_and_fwd_flag dropoff_longitude dropoff_latitude  payment_type 
 N   :992336        Min.   :-121.93   Min.   : 0.00    Min.   :1.00  
 None:    12        1st Qu.: -73.99   1st Qu.:40.73    1st Qu.:1.00  
 Y   :  8218        Median : -73.98   Median :40.75    Median :1.00  
                    Mean   : -72.85   Mean   :40.13    Mean   :1.38  
                    3rd Qu.: -73.96   3rd Qu.:40.77    3rd Qu.:2.00  
                    Max.   :   0.00   Max.   :44.56    Max.   :5.00  
                    NA's   :12        NA's   :12       NA's   :12    
  fare_amount          extra            mta_tax          tip_amount     
 Min.   :-170.00   Min.   :-1.0000   Min.   :-1.7000   Min.   :  0.000  
 1st Qu.:   6.50   1st Qu.: 0.0000   1st Qu.: 0.5000   1st Qu.:  0.000  
 Median :   9.50   Median : 0.0000   Median : 0.5000   Median :  1.160  
 Mean   :  12.89   Mean   : 0.3141   Mean   : 0.4977   Mean   :  1.699  
 3rd Qu.:  14.50   3rd Qu.: 0.5000   3rd Qu.: 0.5000   3rd Qu.:  2.300  
 Max.   : 750.00   Max.   :49.6000   Max.   :52.7500   Max.   :360.000  
 NA's   :12        NA's   :12        NA's   :12        NA's   :12       
  tolls_amount      improvement_surcharge  total_amount       PRIMARY_KEY     
 Min.   : -5.5400   Min.   :-0.3000       Min.   :-170.80   0-0-0   :      1  
 1st Qu.:  0.0000   1st Qu.: 0.3000       1st Qu.:   8.75   0-0-1   :      1  
 Median :  0.0000   Median : 0.3000       Median :  11.80   0-0-10  :      1  
 Mean   :  0.3072   Mean   : 0.2983       Mean   :  16.01   0-0-100 :      1  
 3rd Qu.:  0.0000   3rd Qu.: 0.3000       3rd Qu.:  17.80   0-0-1000:      1  
 Max.   :503.0500   Max.   : 0.3000       Max.   : 760.05   0-0-1001:      1  
 NA's   :12         NA's   :12            NA's   :12        (Other) :1000560

We can use native R code and R libraries including the excellent dplyr to lightly wrangle and then chart the data:

%%R

library(dplyr)  
library(ggplot2)

R.df %>%  
    filter(fare_amount > 0) %>%  
    ggplot(aes(y=fare_amount, x=tip_amount,color=passenger_count)) +  
    geom_point(alpha=0.5 )


Finally, using the -o flag on the %%R invocation, we can pass variables from the R context back to pandas:

%%R -o R_output  
R_output <-  
    R.df %>%  
    mutate(foo = 'bar')

and from there back to Spark and write the results to Hive:

spark_df2 = sqlContext.createDataFrame(R_output)  
spark_df2.write.mode('Overwrite').saveAsTable('default.updated_dataset')

and finally ingest the new Hive table to BDD:

from subprocess import call  
call(["/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI","--table default.updated_dataset"])

You can download the notebook here.

The post Using R with Jupyter Notebooks and Big Data Discovery appeared first on Rittman Mead Consulting.