Tag Archives: Big Data
Using R with Jupyter Notebooks and Oracle Big Data Discovery
Oracle's Big Data Discovery encompasses a good amount of exploration, transformation, and visualisation capabilities for datasets residing in your organisation's data reservoir. Even so, there may come a time when your data scientists want to unleash their R magic on those same datasets. Perhaps the data domain expert has used BDD to enrich and cleanse the data, and now it's ready for some statistical analysis? Maybe you'd like to use R's excellent forecast package to predict the next six months of a KPI from the BDD dataset? And not only predict it, but write it back into the dataset for subsequent use in BDD? This is possible using BDD Shell and rpy2, which enable advanced analysis and manipulation of datasets already in BDD. These modified datasets can then be pushed back into Hive and from there into BDD.
BDD Shell provides a native Python environment, and you may opt to use the pandas library to work with BDD datasets as detailed here. In other cases you may simply prefer working with R, or have a particular library in mind that only R offers natively. In this article we'll see how to do that. The "secret sauce" is rpy2, which enables the native use of R code within a python-kernel Jupyter Notebook.
As with previous articles I’m using a Jupyter Notebook as my environment. I’ll walk through the code here, and finish with a copy of the notebook so you can see the full process.
First we'll see how you can use R in Jupyter Notebooks running a python kernel, and then expand out to integrate with BDD too. You can view and download the first notebook here.
Import the RPY2 environment so that we can call R from Jupyter. Note that import readline is necessary to work around the error /u01/anaconda2/lib/libreadline.so.6: undefined symbol: PC:

import readline
%load_ext rpy2.ipython
Example usage
Single inline command, prefixed with %R
%R X=c(1,4,5,7); sd(X); mean(X)
array([ 4.25])
R code block, marked by %%R
%%R
Y = c(2,4,3,9)
summary(lm(Y~X))
Call:
lm(formula = Y ~ X)
Residuals:
1 2 3 4
0.88 -0.24 -2.28 1.64
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0800 2.3000 0.035 0.975
X 1.0400 0.4822 2.157 0.164
Residual standard error: 2.088 on 2 degrees of freedom
Multiple R-squared: 0.6993, Adjusted R-squared: 0.549
F-statistic: 4.651 on 1 and 2 DF, p-value: 0.1638
Graphics plot, output to the notebook
%R plot(X, Y)
Pass Python variable to R using -i
import numpy as np
Z = np.array([1,4,5,10])
%R -i Z mean(Z)
array([ 5.])
For more information see the documentation
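The %R and %%R magics are the most convenient route, but rpy2 can also be called directly from Python code if you prefer to avoid the magics. This is only a minimal sketch of that API, not something used in the notebooks below:

import rpy2.robjects as robjects

# Build an R vector from Python values and bind it into R's global environment
robjects.globalenv['X'] = robjects.FloatVector([1, 4, 5, 7])

# Evaluate R expressions; results come back as R vectors, indexable from Python
print(robjects.r('mean(X)')[0])
print(robjects.r('sd(X)')[0])

The rpy2.ipython extension used throughout the rest of this article is built on these same objects under the covers.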
Working with BDD Datasets from R in Jupyter Notebooks
Now that we've seen calling R in Jupyter Notebooks, let's see how to use it with BDD in order to access datasets. The first step is to instantiate the BDD Shell so that you can access the datasets in BDD, and then to set up the R environment using rpy2:

execfile('ipython/00-bdd-shell-init.py')
%load_ext rpy2.ipython
I also found that I had to make readline available, otherwise I got an error (/u01/anaconda2/lib/libreadline.so.6: undefined symbol: PC):

import readline
After this, we can import a BDD dataset, convert it to a Spark dataframe and then a pandas dataframe, ready for passing to R
ds = dss.dataset('edp_cli_edp_8d6fd230-8e99-449c-9480-0c2bddc4f6dc')
spark_df = ds.to_spark()
import pandas as pd
pandas_df = spark_df.toPandas()
Note that there is a lot of passing of the same dataframe into different memory structures here - from BDD dataset context to Spark to Pandas, and that’s before we’ve even hit R. It’s fine for ad-hoc wrangling but might start to be painful with very large datasets.
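If the dataset is on the larger side, one option (not shown in the original notebook) is to sample or limit the Spark dataframe before converting it to pandas; the fraction and row cap below are arbitrary:

# Roughly 10% sample, without replacement, before the pandas conversion
pandas_sample_df = spark_df.sample(False, 0.1, seed=42).toPandas()

# Or cap the number of rows explicitly
pandas_top_df = spark_df.limit(100000).toPandas()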
Now we use the rpy2 integration with Jupyter Notebooks and invoke R parsing of the cell's contents, using the %%R syntax. Optionally, we can pass across variables with the -i parameter, which we're doing here. Then we assign the dataframe to an R-notation variable (optional, but stylistically nice to do), and then use R's summary function to show a summary of each attribute:
%%R -i pandas_df
R.df <- pandas_df
summary(R.df)
   vendorid      tpep_pickup_datetime   tpep_dropoff_datetime   passenger_count
 Min.   :1.000   Min.   :1.420e+12      Min.   :1.420e+12       Min.   :0.000
 1st Qu.:1.000   1st Qu.:1.427e+12      1st Qu.:1.427e+12       1st Qu.:1.000
 Median :2.000   Median :1.435e+12      Median :1.435e+12       Median :1.000
 Mean   :1.525   Mean   :1.435e+12      Mean   :1.435e+12       Mean   :1.679
 3rd Qu.:2.000   3rd Qu.:1.443e+12      3rd Qu.:1.443e+12       3rd Qu.:2.000
 Max.   :2.000   Max.   :1.452e+12      Max.   :1.452e+12       Max.   :9.000
 NA's   :12      NA's   :12             NA's   :12              NA's   :12
 trip_distance       pickup_longitude   pickup_latitude    ratecodeid
 Min.   :    0.00    Min.   :-121.93    Min.   :-58.43     Min.   : 1.000
 1st Qu.:    1.00    1st Qu.: -73.99    1st Qu.: 40.74     1st Qu.: 1.000
 Median :    1.71    Median : -73.98    Median : 40.75     Median : 1.000
 Mean   :    3.04    Mean   : -72.80    Mean   : 40.10     Mean   : 1.041
 3rd Qu.:    3.20    3rd Qu.: -73.97    3rd Qu.: 40.77     3rd Qu.: 1.000
 Max.   :67468.40    Max.   : 133.82    Max.   : 62.77     Max.   :99.000
 NA's   :12          NA's   :12         NA's   :12         NA's   :12
 store_and_fwd_flag   dropoff_longitude   dropoff_latitude   payment_type
 N   :992336          Min.   :-121.93     Min.   : 0.00      Min.   :1.00
 None:    12          1st Qu.: -73.99     1st Qu.:40.73      1st Qu.:1.00
 Y   :  8218          Median : -73.98     Median :40.75      Median :1.00
                      Mean   : -72.85     Mean   :40.13      Mean   :1.38
                      3rd Qu.: -73.96     3rd Qu.:40.77      3rd Qu.:2.00
                      Max.   :   0.00     Max.   :44.56      Max.   :5.00
                      NA's   :12          NA's   :12         NA's   :12
 fare_amount        extra             mta_tax           tip_amount
 Min.   :-170.00    Min.   :-1.0000   Min.   :-1.7000   Min.   :  0.000
 1st Qu.:   6.50    1st Qu.: 0.0000   1st Qu.: 0.5000   1st Qu.:  0.000
 Median :   9.50    Median : 0.0000   Median : 0.5000   Median :  1.160
 Mean   :  12.89    Mean   : 0.3141   Mean   : 0.4977   Mean   :  1.699
 3rd Qu.:  14.50    3rd Qu.: 0.5000   3rd Qu.: 0.5000   3rd Qu.:  2.300
 Max.   : 750.00    Max.   :49.6000   Max.   :52.7500   Max.   :360.000
 NA's   :12         NA's   :12        NA's   :12        NA's   :12
 tolls_amount       improvement_surcharge   total_amount      PRIMARY_KEY
 Min.   : -5.5400   Min.   :-0.3000         Min.   :-170.80   0-0-0   :      1
 1st Qu.:  0.0000   1st Qu.: 0.3000         1st Qu.:   8.75   0-0-1   :      1
 Median :  0.0000   Median : 0.3000         Median :  11.80   0-0-10  :      1
 Mean   :  0.3072   Mean   : 0.2983         Mean   :  16.01   0-0-100 :      1
 3rd Qu.:  0.0000   3rd Qu.: 0.3000         3rd Qu.:  17.80   0-0-1000:      1
 Max.   :503.0500   Max.   : 0.3000         Max.   : 760.05   0-0-1001:      1
 NA's   :12         NA's   :12              NA's   :12        (Other) :1000560
We can use native R code and R libraries, including the excellent dplyr, to lightly wrangle and then chart the data:
%%R
library(dplyr)
library(ggplot2)
R.df %>%
  filter(fare_amount > 0) %>%
  ggplot(aes(y=fare_amount, x=tip_amount, color=passenger_count)) +
  geom_point(alpha=0.5)
Finally, using the -o flag on the %%R invocation, we can pass variables from the R context back to pandas:
%%R -o R_output
R_output <- R.df %>% mutate(foo = 'bar')
and from there back to Spark and write the results to Hive:
spark_df2 = sqlContext.createDataFrame(R_output)
spark_df2.write.mode('Overwrite').saveAsTable('default.updated_dataset')
and finally ingest the new Hive table to BDD:
from subprocess import call
call(["/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI","--table default.updated_dataset"])
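To do something more useful than the trivial mutate above - for example the six-month KPI forecast mentioned at the start - the same -i/-o pattern works with R's forecast package. This is only a sketch: kpi_df and its kpi_value column are hypothetical, and assume a monthly series already prepared in pandas:

%%R -i kpi_df -o forecast_df
library(forecast)
# Treat the (hypothetical) KPI column as a monthly time series and fit a model
kpi_ts <- ts(kpi_df$kpi_value, frequency = 12)
fit <- auto.arima(kpi_ts)
fc <- forecast(fit, h = 6)   # predict the next six months
forecast_df <- data.frame(month_ahead = 1:6, kpi_forecast = as.numeric(fc$mean))

forecast_df comes back to Python as a pandas dataframe, so it can be pushed into Hive and re-ingested into BDD exactly as shown above.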
You can download the notebook here.
Using SparkSQL and Pandas to Import Data into Hive and Big Data Discovery
Big Data Discovery (BDD) is a great tool for exploring, transforming, and visualising data stored in your organisation’s Data Reservoir. I presented a workshop on it at a recent conference, and got an interesting question from the audience that I thought I’d explore further here. Currently the primary route for getting data into BDD requires that it be (i) in HDFS and (ii) have a Hive table defined on top of it. From there, BDD automagically ingests the Hive table, or the data_processing_CLI is manually called, which prompts the BDD DGraph engine to go and sample (or read in full) the Hive dataset.
This is great, and works well where the dataset is vast (this is Big Data, after all) and needs the sampling that DGraph provides. It’s also simple enough for Hive tables that have already been defined, perhaps by another team. But – and this was the gist of the question that I got – what about where the Hive table doesn’t exist already? Because if it doesn’t, we now need to declare all the columns as well as choose the all-important SerDe in order to read the data.
SerDes are brilliant, in that they enable the application of a schema-on-read to data in many forms, but at the very early stages of a data project there are probably going to be lots of formats of data (such as TSV, CSV, JSON, as well as log files and so on) from varying sources. Choosing the relevant SerDe for each one, making sure that BDD is also configured with the necessary jar, and manually listing each column to be defined in the table all adds overhead to the project. Wouldn’t it be nice if we could side-step this somehow? In this article we’ll see how!
Importing Datasets through BDD Studio
Before we get into more fancy options, don’t forget that BDD itself offers the facility to upload CSV, TSV, and XLSX files, as well as connect to JDBC datasources. Data imported this way will be stored by BDD in a Hive table and ingested to DGraph.
This is great for smaller files held locally. But what about files on your BDD cluster, that are too large to upload from local machine, or in other formats – such as JSON?
Loading a CSV file
As we’ve just seen, CSV files can be imported to Hive/BDD directly through the GUI. But perhaps you’ve got a large CSV file sat local to BDD that you want to import? Or a folder full of varying CSV files that would be too time-consuming to upload through the GUI one-by-one?
For this we can use BDD Shell with the Python Pandas library, and I’m going to do so here through the excellent Jupyter Notebooks interface. You can read more about these here and details of how to configure them on BigDataLite 4.5 here. The great thing about notebooks, whether Jupyter or Zeppelin, is that I don’t need to write any more blog text here – I can simply embed the notebook inline and it is self-documenting:

https://gist.github.com/76b477f69303dd8a9d8ee460a341c445
Note that at the end of this we call data_processing_CLI to automatically bring the new table into BDD’s DGraph engine for use in BDD Studio. If you’ve got BDD configured to automagically add new Hive tables, or you don’t want to run this step, you can just comment it out.
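The gist holds the full notebook; as a rough sketch of the pattern it follows (the CSV path and table name below are placeholders, not the ones used in the notebook; sqlContext comes from the BDD Shell initialisation):

import pandas as pd
from subprocess import call

# Read the local CSV with pandas, letting it infer column names and types
pandas_df = pd.read_csv('/home/oracle/incoming/my_data.csv')

# Convert to a Spark dataframe and persist it as a Hive table
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.write.mode('Overwrite').saveAsTable('default.my_data')

# Trigger BDD ingest of the new Hive table (comment out if auto-ingest is enabled)
call(["/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI",
      "--table", "default.my_data"])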
Loading simple JSON data
Whilst CSV files are tabular by definition, JSON records can contain nested objects (recursively), as well as arrays. Let’s look at an example of using SparkSQL to import a simple flat JSON file, before then considering how we handle nested and array formats. Note that SparkSQL can read datasets from both local (file://) storage as well as HDFS (hdfs://):

https://gist.github.com/8b7118c230f34f7d57bd9b0aa4e0c34c
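Again the notebook itself isn’t reproduced inline here, but the core of it is short; the file path and table name below are illustrative:

# SparkSQL infers the schema of a flat JSON file automatically
df = sqlContext.read.json('file:///home/oracle/incoming/simple_data.json')

# Persist it straight to Hive, ready for BDD to ingest
df.write.mode('Overwrite').saveAsTable('default.simple_data')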
Once loaded into Hive, it can be viewed in Hue:
Loading nested JSON data
What’s been great so far, whether loading CSV, XLS, or simple JSON, is that we’ve not had to list out column names. All that needs modifying in the scripts above to import a different file with a different set of columns is to change the filename and the target tablename. Now we’re going to look at an example of a JSON file with nested objects – which is very common in JSON – and we’re going to have to roll our sleeves up a tad and start hardcoding some schema details.
First up, we import the JSON to a SparkSQL dataframe as before (although this time I’m loading it from HDFS, but local works too):
df = sqlContext.read.json('hdfs:///user/oracle/incoming/twitter/2016/07/12/')
Then I declare this as a temporary table, which enables me to subsequently run queries with SQL against it
df.registerTempTable("twitter")
A very simple example of a SQL query would be to look at the record count:
result_df = sqlContext.sql("select count(*) from twitter")
result_df.show()

+----+
| _c0|
+----+
|3011|
+----+
The result of a sqlContext.sql invocation is a dataframe, which above I’m assigning to a new variable, but I could as easily run:
sqlContext.sql("select count(*) from twitter").show()
for the same result.
The sqlContext has inferred the JSON schema automagically, and we can inspect it using
df.printSchema()
The twitter schema is huge, so I’m just quoting a few choice sections of it here to illustrate subsequent points:
root
 |-- created_at: string (nullable = true)
 |-- entities: struct (nullable = true)
 |    |-- hashtags: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- text: string (nullable = true)
 |    |-- user_mentions: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: long (nullable = true)
 |    |    |    |-- id_str: string (nullable = true)
 |    |    |    |-- indices: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- screen_name: string (nullable = true)
 |-- source: string (nullable = true)
 |-- text: string (nullable = true)
 |-- timestamp_ms: string (nullable = true)
 |-- truncated: boolean (nullable = true)
 |-- user: struct (nullable = true)
 |    |-- followers_count: long (nullable = true)
 |    |-- following: string (nullable = true)
 |    |-- friends_count: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- screen_name: string (nullable = true)
Points to note about the schema:
- In the root of the schema we have attributes such as text and created_at
- There are nested elements (“struct”) such as user, and within it screen_name, followers_count, etc.
- There are also array objects, where an attribute can occur more than once, such as hashtags and user_mentions.
Accessing root and nested attributes is easy – we just use dot notation:
sqlContext.sql("SELECT created_at, user.screen_name, text FROM twitter").show()

+--------------------+--------------+--------------------+
|          created_at|   screen_name|                text|
+--------------------+--------------+--------------------+
|Tue Jul 12 16:13:...|  Snehalstocks|"Students need to...|
|Tue Jul 12 16:13:...|   KingMarkT93|Ga caya :( https:...|
We can save this as a dataframe that’s then persisted to Hive, for ingest into BDD:
subset02 = sqlContext.sql("SELECT created_at, user.screen_name, text FROM twitter")
tablename = 'twitter_user_text'
qualified_tablename = 'default.' + tablename
subset02.write.mode('Overwrite').saveAsTable(qualified_tablename)
Which in Hue looks like this:
Attributes in an array are a bit more tricky. Here’s an example tweet with multiple user_mentions and a hashtag too:
@johnnyq72 @orcldoug @rmoff @markrittman @mikedurran #ImALLin 😋
— Sabine ∞ Heimsath (@flederbine) July 12, 2016
Here we use the LATERAL VIEW syntax, with the optional OUTER operator since not all tweets have these additional entities, and we want to make sure we show all tweets, including those that don’t have these entities. Here’s the SQL formatted for reading:
SELECT id,
       created_at,
       user.screen_name,
       text as tweet_text,
       hashtag.text as hashtag,
       user_mentions.screen_name as mentioned_user
from   twitter
LATERAL VIEW OUTER explode(entities.user_mentions) user_mentionsTable as user_mentions
LATERAL VIEW OUTER explode(entities.hashtags) hashtagsTable AS hashtag
Which, when run from sqlContext.sql(), gives us:
+------------------+--------------------+---------------+--------------------+-------+---------------+
|                id|          created_at|    screen_name|          tweet_text|hashtag|    screen_name|
+------------------+--------------------+---------------+--------------------+-------+---------------+
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|      johnnyq72|
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|       orcldoug|
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|          rmoff|
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|    markrittman|
|752940179569115136|Tue Jul 12 18:58:...|     flederbine|@johnnyq72 @orcld...|ImALLin|     mikedurran|
+------------------+--------------------+---------------+--------------------+-------+---------------+
and written back to Hive for ingest to BDD:
You can use these SQL queries both for simply flattening JSON, as above, or for building summary tables, such as this one showing the most common hashtags in the dataset:
sqlContext.sql("SELECT hashtag.text,count(*) as inst_count from twitter LATERAL VIEW OUTER explode(entities.hashtags) hashtagsTable AS hashtag GROUP BY hashtag.text order by inst_count desc").show(4)

+-----------+----------+
|       text|inst_count|
+-----------+----------+
|     Hadoop|       165|
|     Oracle|       151|
|        job|       128|
|    BigData|       112|
You can find the full Jupyter Notebook with all these nested/array JSON examples here:

https://gist.github.com/a38e853d3a7dcb48a9df99ce1e3505ff
You may decide after looking at this that you’d rather just go back to Hive and SerDes, and as is frequently the case in ‘data wrangling’ there are multiple ways to achieve the same end. The route you take comes down to personal preference and familiarity with the toolsets. In this particular case I’d still go for SparkSQL for the initial exploration, as it’s quicker to ‘poke around’ the dataset than to define and re-define Hive tables; YMMV. A final point to consider is that importing JSON with SparkSQL and saving it back to HDFS/Hive is a static process, so if your underlying data is changing (e.g. streaming to HDFS from Flume) you would probably want a Hive table over the HDFS file so that it is live when queried.
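For reference, that ‘live’ alternative is an external Hive table over the HDFS folder with a JSON SerDe - which of course brings back the column-listing and jar overhead this post set out to avoid. A sketch, assuming a Hive-enabled sqlContext and the hive-hcatalog-core jar available to Hive (both are assumptions here), might look like this:

sqlContext.sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS twitter_live (
    created_at   STRING,
    text         STRING,
    timestamp_ms STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 'hdfs:///user/oracle/incoming/twitter/2016/07/12/'
""")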
Loading an Excel workbook with many sheets
This was the use-case that led me to researching programmatic import of datasets in the first place. I was doing some work with a dataset of road traffic accident data, which included a single XLS file with over 30 sheets, each a lookup table for a separate set of dimension attributes. Importing each sheet one by one through the BDD GUI was tedious, and being a lazy geek, I looked to automate it.
Using the Pandas read_excel function and a smidge of Python to loop through each sheet, it was easily done. You can see the full notebook here:

https://gist.github.com/rmoff/3fa5d857df8ca5895356c22e420f3b22
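The notebook above has the full version; the heart of it is a loop along these lines (the path and table-name prefix here are illustrative, not the ones from the notebook):

import pandas as pd

xl = pd.ExcelFile('/home/oracle/incoming/road_accident_lookups.xls')

# One Hive table per worksheet, named after the (sanitised) sheet name
for sheet in xl.sheet_names:
    pandas_df = xl.parse(sheet)
    spark_df = sqlContext.createDataFrame(pandas_df)
    tablename = 'default.lookup_' + sheet.lower().replace(' ', '_')
    spark_df.write.mode('Overwrite').saveAsTable(tablename)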
Using Jupyter Notebooks with Big Data Discovery 1.2
New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets. With the ability to push back to Hive and thus BDD data modified in this environment, this is important functionality that will make BDD even more useful for navigating and exploring big data.
Whilst BDD Shell is command-line based, there’s also the option to run Jupyter Notebooks (previously IPython Notebooks), a web-based interactive “Notebook”. This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a ‘Notebook’ enables you to modify and re-run commands, and then once correct retain them as a fully functioning script for future use.
The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you’d find on the Big Data Appliance, all configured and integrated for use. Version 4.5 was released recently, which included BDD 1.2.
For information on how to set up BDD Shell and Jupyter Notebooks, see this previous post. For the purpose of this article I’m running Jupyter on port 18888 so as not to clash with Hue:
cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell
/u01/anaconda2/bin/jupyter-notebook --port 18888
Important points to note:
- It’s important that you run this from the bdd-shell folder, otherwise the BDD shell won’t initialise properly.
- Jupyter by default only listens locally, so you need to use a web browser local to the server, or use port-forwarding if you want to access Jupyter from your local web browser.
Go to http://localhost:18888 in your web browser, and from the New menu select a Python 2 notebook:
You should then see an empty notebook, ready for use:
The ‘cell’ (grey box after the In [ ]:) is where you enter code to run – type in execfile('ipython/00-bdd-shell-init.py') and press shift-Enter. This will execute it – if you don’t press shift you just get a newline. Whilst it’s executing you’ll notice the line prefix changes from [ ] to [*], and in the terminal window from which you launched Jupyter you’ll see some output related to the BDD Shell starting:
WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).
WARNING: Running spark-class from user-defined location.
spark.driver.cores is set but does not apply in client mode.
Now back in the Notebook, enter the following – use Enter, not Shift-enter, between lines:
dss = bc.datasets()
dss.count
Now press shift-enter to execute it. This uses the pre-defined bc BDD context to get the datasets object, and return a count from it.
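From here the same bc context can hand an individual dataset over to Spark, and on to pandas, for further work. The dataset ID below is a placeholder; the to_spark() and toPandas() calls are the same ones used elsewhere in these posts:

# Fetch a specific BDD dataset by ID (placeholder shown) and convert it
ds = dss.dataset('edp_cli_edp_xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx')
spark_df = ds.to_spark()

import pandas as pd
pandas_df = spark_df.toPandas()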
By clicking the + button on the toolbar, using the up and down arrows on the toolbar, and the Code/Markdown dropdown, it’s possible to insert “cells” which are not code but instead commentary on what the code is. This way you can produce fully documented, but executable, code objects.
From the File menu give the notebook a name, and then Close and Halt, which destroys the Jupyter process (‘kernel’) that was executing the BDD Shell session. Back at the Jupyter main page, you’ll note that an ipynb file has been created, which holds the notebook definition and can be downloaded, sent to colleagues, uploaded to blogs to share, saved in source control, and so on. Here’s the file for the notebook above – note that it’s hosted on gist, which automagically previews it as a Notebook, so click on Raw to see the actual code behind it.
The fantastically powerful thing about the Notebooks is that you can modify and re-run steps as you go, but you never lose the history of how you got somewhere. Most people will be familiar with learning or exploring a tool and its capabilities and eventually getting it to work, but having no idea how they got there. Even for experienced users of a tool, being able to prove how to replicate a final result is important for (a) showing the evidence for how they got there and (b) enabling others to take that work and build on it.
With an existing notebook file, whether a saved one you created or one that someone sent you, you can reopen it in Jupyter and re-execute it, in order to replicate the results previously seen. This is an important tenet of [data] science in general – show your workings – and it’s great that Big Data Discovery supports this option. Obviously, showing the count of datasets is not so interesting or important to replicate. The real point here is being able to take datasets that you’ve already joined and wrangled in BDD, taking advantage of the GUI, and then dive deep into the data science and analytics world of things like Spark MLlib, Pandas, and so on. As a simple example, I can use a couple of Python libraries (installed by default with Anaconda) to plot a correlation matrix for one of my BDD datasets:
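The exact plotting code isn’t shown above; one way to do it, assuming the dataset has already been pulled into a pandas dataframe called pandas_df as in the earlier sketch, is with seaborn and matplotlib (both bundled with Anaconda):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix across the numeric columns, rendered as a heatmap
corr = pandas_df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()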
As well as producing visualisations or calculations within BDD shell, the real power comes in being able to push the modified data back into Hive, and thus continue to work with it within BDD.
With Jupyter Notebooks not only can you share the raw notebooks for someone else to execute, you can export the results to HTML, PDF, and so on. Here’s the notebook I started above, developed out further and exported to HTML – note how you can see not only the results, but exactly the code that I ran in order to get them. In this I took the dataset from BDD, added a column into it using a pandas windowing function, and then saved it back to a new Hive table:
(you can view the page natively here, and the ipynb here)

https://gist.github.com/rmoff/f1024dd043565cb8a58a0c54e9a782f2
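The notebook has the full code; the pattern it follows - with hypothetical column and table names - is a pandas rolling calculation (pandas 0.18+) followed by a write back to Hive:

# Add a three-period rolling average of a (hypothetical) measure column
pandas_df['acc_count_rolling_avg'] = pandas_df['acc_count'].rolling(window=3).mean()

# Convert back to Spark and persist as a new Hive table for BDD to ingest
spark_df2 = sqlContext.createDataFrame(pandas_df)
spark_df2.write.mode('Overwrite').saveAsTable('default.updated_accident_data')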
Once the data’s been written back to Hive from the Python processing, I ran BDD’s data_processing_CLI to add the new table back into BDD:
/u01/bdd/v1.2.0/BDD-1.2.0.31.813/dataprocessing/edp_cli/data_processing_CLI --table updated_accident_data
And once that’s run, I can then continue working with the data in BDD:
This workflow enables a continual loop of data wrangling, enrichment, advanced processing, and visualisation – all using the most appropriate tools for the job.
You can also use BDD Shell/Jupyter as another route for loading data into BDD. Whilst you can import CSV and XLS files into BDD directly through the web GUI, there are limitations – for example, an XLS workbook with multiple sheets has to be imported one sheet at a time. I had an XLS file with over 40 sheets of reference data in it, which was not going to be time-efficient to load one at a time into BDD.
Pandas supports a lot of different input types – including Excel files. So by using Pandas to pull the data in and then converting it to a Spark dataframe, I can write it to Hive, from where it can be imported to BDD. As before, the beauty of the Notebook approach is that I could develop and refine the code, and then simply share the Notebook here:

https://gist.github.com/rmoff/3fa5d857df8ca5895356c22e420f3b22