Tag Archives: Oracle Big Data Appliance

News and Updates from Oracle Openworld 2014

It’s the Saturday after Oracle Openworld 2014, and I’m now home from San Francisco and back in the UK. It’s been a great week as usual, with lots of product announcements and updates to the BI, DW and Big Data products we use on current projects. Here’s my take on what was announced this last week.

New Products Announced

From a BI and DW perspective, the most significant product announcements were around Hadoop and Big Data. Up to this point most parts of an analytics-focused big data project required you to code the solution yourself, with the diagram below showing the typical three steps in a big data project – data ingestion, analysis and sharing the results.

NewImage

At the moment, all of these steps are typically performed from the command-line using languages such as Python, R, Pig, Hive and so on, with tools like Apache Flume and Apache Sqoop used to bring data into and out of the Hadoop cluster. Under the covers, these tools use technologies such as MapReduce or Spark to do their work, automatically running jobs in parallel across the cluster and making use of the easy scalability of Hadoop and NoSQL databases.

You can also neatly divide the work up on a big data project into two phases; the “discovery” phase typically performed by a data scientist where data is loaded, analysed, correlated and otherwise “understood” to provide the initial insights, and then an “exploitation” phase where we apply governance, provide the output data in a format usable by BI tools and otherwise share the results with the wider corporate audience. The updated Information Management Reference Architecture we collaborated on with Oracle and launched by in June this year had distinct discovery and exploitation phases, and the architecture itself made a clear distinction between the Innovation part that enabled the discovery phase of a project and the Execution part that delivered the insights and data in a more governed, production setting.

NewImage

This was the theme of the product announcements around analytics, BI, data warehousing and big data during Openworld 2014, with Oracle’s Omri Traub in the photo below taking us through Oracle’s big data product strategy. What Oracle are doing here is productising and “democratising” big data, putting it clearly in context of their existing database, engineered systems and BI products and linking them all together into an overall information management architecture and delivery process.

NewImage

So working through from ingestion through to data analysis, these steps have typically been performed by data scientists using scripting tools and rudimentary data visualisation engines, making them labour-intensive and reliant on a small set of people conversant with these tools and process. Oracle Big Data Discovery is aimed squarely at these steps, and combines Apache Spark-based data preparation and transformation capabilities with an analysis and visualisation engine based on Endeca Server.

NewImage

Key features of Big Data Discovery include:

  • Ability to analyse, parse, explore and “wrangle” data using graphical tools and a Spark-based transformation engine
  • Create a catalog of the data on your Hadoop cluster, and then search that catalog using Endeca Server search technologies
  • Create recommendations of other datasets that might interest you, based on what you’re looking at now
  • Visualize your datasets to help understand what they contain, and discover new insights

Under the covers it comprises two parts; the data loading, transformation and profiling part that uses Apache Spark to do its work in parallel across all the nodes in the cluster, and the analysis part, which takes data prepared by Apache Spark and loads into the Endeca Server in-memory engine to perform the analysis, aggregation and data visualisation. Unlike the Spark part the Endeca server element runs just on one node and limits the size of the analysis dataset to what can run in-memory in the Endeca Server engine, but in practice you’re going to work with a sample of the data rather than the entire dataset at that stage (in time the assumption is that the Endeca Server engine will be unbundled and run natively on YARN, giving it the same scalability as the Spark-based data ingestion and transformation part). Initially Big Data Discovery will run on-premise with a cloud version later on, and it’s not dependent on Big Data Appliance – expect to see something later this year / early next year.

Another new product that addresses the discovery phase and discovery lab part of a big data project is Oracle Data Enrichment Cloud Service, from the Oracle Data Integration team and designed to complement ODI and Oracle EDQ. Whilst Oracle positioned ODECS as something you’d use as well as Big Data Discovery and typically upstream from BDD, to me there seemed to be a fair bit of overlap between the products, with both tools doing data profiling and transformation but BDD being more focused on the exploration and discovery part, and ODECS being more focused on early-stage data profiling and transformation.

NewImage

ODECS is clearly more of an ETL tool complement and runs natively in the cloud, right from the start. It’s most probably aimed at customers with their Hadoop dataset already in the cloud, maybe using Amazon Elastic MapReduce or Oracle’s new Hadoop-as-a-Service and has more in common with the old Data Quality Option for Oracle Warehouse Builder than Endeca’s search-first analytic interface. It’s got a very nice interface including a mobile-enabled website and the ability to include and merge in external datasets, including Oracle’s own Data as a Service platform offering. Along with the new Metadata Management tool Oracle also launched at Openworld it’s a great addition to the Oracle Data Integration product suite, but I can’t help thinking that its initial availability only on Oracle’s public cloud platform is going to limit its use with Oracle’s typical customers – we’ll have to just wait and see.

The other major product that addresses big data projects was Oracle Big Data SQL. Partly addressing the discovery phase of big data projects but mostly (to my mind) addressing the exploitation phase, and the execution part of the information management architecture, Big Data SQL gives Oracle Exadata the ability to return data from Hive and NoSQL on the Big Data Appliance as well as data from its normal relational store. I covered Big Data SQL on the blog a few weeks ago and I’ll be posting some more in-depth articles on it next week, but the other main technical innovation with the product is its bringing of Exadata’s SmartScan feature to Hadoop, projecting and filtering data at the Hadoop storage node level and also giving Hadoop the ability to understand regular Oracle SQL, rather than the cut-down version you get with HiveQL.

NewImage

Where this then leaves us is with the ability to do most of a big data project using (Oracle) tools, bringing big data analysis within reach of organisations with Oracle-style budgets but without access to rare data scientist-type resources. Going back to my diagram earlier, a post-OOW big data project using the new products launched in this last week could look something like this:

NewImage

Big Data SQL is out now and depends on BDA and Exadata for its use; Big Data Discovery should be out in a few months time, runs on-premise but doesn’t require BDA, whilst ODECS is cloud-only and runs on a BDA in the background. Expect more news and more integration/alignment from the products as 2014 ends and 2015 starts, and we’re looking forward to using them on Oracle-centric Hadoop projects in the near future. 

Product Updates for BI, Data Integration, Exalytics, BI Applications and OBIEE

Other news announced over the week for products we more commonly use on projects include:

Finally, something that we were particularly pleased to see was the updated Oracle Information Management Architecture I mentioned earlier referenced in most of the analytics sessions, with Oracle’s Balaji Yelamanchili for example introducing it in his big data and business analytics general session mid-way through the week. 

NewImage
 

We love the way this brings together the big data components and puts them in the context of the wider data warehouse and analytic processes, and compared to a few years ago when Hadoop and big data was considered completely separate to data warehousing and BI and done by staff completely different to the core business analytics team, this new reference architecture puts it squarely within the world of BI and analytics we work in. It also emphasises the new abilities Hadoop, NoSQL databases and big data can bring us – support for wider sets of data sources with dynamic schemas, the ability to economically work with and analyse much larger datasets, and support discovery-type upfront analysis work. Finally, it recognises that to get true value out of analysis you start on Hadoop, you eventually need to add proper data governance, make the results more widely available using full SQL tools, and use the right tools – relational databases, OLAP servers and the like – to analyse the data once its in a more structured form. 

If you missed our write-up on the updated Information Management Reference Architecture you can can read our two-part blog post here and here, read the Oracle white paper, or listen to the podcast with OTN Archbeat’s Bob Rhubart. For now though I’m looking forward to seeing the family after a week and a half away in San Francisco – thanks to OTN and the Oracle ACE Director Program for sponsoring my visit over to SF for Openworld, and we’ll post our conference presentation slides later next week when we’re back in the UK and US offices.

Using Oracle GoldenGate for Trickle-Feeding RDBMS Transactions into Hive and HDFS

A few months ago I wrote a post on the blog around using Apache Flume to trickle-feed log data into HDFS and Hive, using the Rittman Mead website as the source for the log entries. Flume is a good technology to use for this type of capture requirement as it captures log entries, HTTP calls, JMS queue entries and other “event” sources easily, has a resilient architecture and integrates well with HDFS and Hive. But what if the source you want to capture activity for is a relational database, for example Oracle Database 12c? With Flume you’d need to spool the database transactions to file, whereas what you really want is a way to directly connect to the database engine and capture the changes from source.

Which is exactly what Oracle GoldenGate does, and what most people don’t realise is that GoldenGate can also load data into HDFS and Hive, as well as the usual database targets. Hive and HDFS aren’t fully-supported targets yet, you can use the Oracle GoldenGate for Java adapter to act as the handler process and then land the data in HDFS files or Hive tables on your target Hadoop platform. My Oracle Support has two tech nodes, “Integrating OGG Adapter with Hive (Doc ID 1586188.1)” and “Integrating OGG Adapter with HDFS (Doc ID 1586210.1)” that give example implementations of the Java adapters you’d need for these two target types, with the overall end-to-end process for landing Hive data looking like the diagram below (and the HDFS one just swapping out HDFS for Hive at the handler adapter stage)

NewImage

This is also a good example of the sorts of technology we’d use to implement the “data factory” concept within the new Oracle Information Management Reference Architecture, the part of the architecture that moves data between the Hadoop and NoSQL-based Data Reservoir, and the relationally-stored enterprise information store; in this case, trickle-feeding transactional data from the Oracle database into Hadoop, perhaps to archive it at lower-cost than we could do in an Oracle database, or to add transaction activity data to a Hadoop-based application

NewImage

So I asked my colleague Nelio Guimaraes to set up a GoldenGate capture process on our Cloudera CDH5.1 Hadoop cluster, using GoldenGate 12.1.2.0.0 for our source Oracle 11gR2 database and Oracle GoldenGate for Java, downloadable separately on edelivery.oracle.com under Oracle Fusion Middleware > Oracle GoldenGate Application Adapters 11.2.1.0.0 for JMS and Flat File Media Pack. In our example, we’re going to capture activity on the SCOTT.EMP table in the Oracle database, and then perform the following step to set up replication from it into a replica Hive table:

  1. Create a table in Hive that corresponds to the table in Oracle database.
  2. Create a table in the Oracle database and prepare the table for replication.
  3. Configure the Oracle GoldenGate Capture to extract transactions from the Oracle database and create the trail file.
  4. Configure the Oracle GoldenGate Pump to read the trail and invoke the custom adapter
  5. Configure the property file for the Hive handler
  6. Code, Compile and package the custom Hive handler
  7. Execute a test. 

Setting up the Oracle Database Source Capture

Let’s go into the Oracle database first, check the table definition, and then connect to Hadoop to create a Hive table of the same column definition.

[oracle@centraldb11gr2 ~]$ sqlplus scott/tiger
SQL*Plus: Release 11.2.0.3.0 Production on Thu Sep 11 01:08:49 2014
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options
SQL> describe DEPT
 Name Null? Type
 ----------------------------------------- -------- ----------------------------
 DEPTNO NOT NULL NUMBER(2)
 DNAME VARCHAR2(14)
 LOC VARCHAR2(13)
SQL> exit
...
[oracle@centraldb11gr2 ~]$ ssh oracle@cdh51-node1
Last login: Sun Sep 7 16:11:36 2014 from officeimac.rittmandev.com
[oracle@cdh51-node1 ~]$ hive
...
create external table dept
(
 DEPTNO string, 
 DNAME string, 
 LOC string
) row format delimited fields terminated by '\;' stored as textfile
location '/user/hive/warehouse/department'; 
exit
...

Then I install Oracle Golden Gate 12.1.2 on the source Oracle database, just as you’d do for any Golden Gate install, and make sure supplemental logging is enabled for the table I’m looking to capture. Then I go into the ggsci Golden Gate command-line utility, to first register the user it’ll be connecting as, and what table it needs to capture activity for.

[oracle@centraldb11gr2 12.1.2]$ cd /u01/app/oracle/product/ggs/12.1.2/
[oracle@centraldb11gr2 12.1.2]$ ./ggsci
$ggsci> DBLOGIN USERID sys@ctrl11g, PASSWORD password sysdba
$ggsci> ADD TRANDATA SCOTT.DEPT COLS(DEPTNO), NOKEY

GoldenGate uses a number of components to replicate data from source to targets, as shown in the diagram below.

NewImageFor our purposes, though, there are just three that we need to configure; the Extract component, which captures table activity on the source; the Pump process that moves data (or the “trail”) from source database to the Hadoop cluster; and the Replicat component that takes that activity and applies it to the target tables. In our example, the extract and pump processes will be as normal, but we need to create a custom “handler” for the target Hive table that uses the Golden Gate Java API and the Hadoop FS Java API.

The tool we use to set up the extract and capture process is ggsci, the command-line Golden Gate Software Command Interface. I’ll use it first to set up the Manager process that runs on both source and target servers, giving it a port number and connection details into the source Oracle database.

$ggsci> edit params mgr
PORT 7809
USERID sys@ctrl11g, PASSWORD password sysdba
PURGEOLDEXTRACTS /u01/app/oracle/product/ggs/12.1.2/dirdat/*, USECHECKPOINTS

Then I create two configuration files, one for the extract process and one for the pump process, and then use those to start those two processes.

$ggsci> edit params ehive
EXTRACT ehive
USERID sys@ctrl11g, PASSWORD password sysdba
EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, FORMAT RELEASE 11.2
TABLE SCOTT.DEPT;
$ggsci> edit params phive
EXTRACT phive
RMTHOST cdh51-node1.rittmandev.com, MGRPORT 7809
RMTTRAIL /u01/app/oracle/product/ggs/11.2.1/dirdat/rt, FORMAT RELEASE 11.2
PASSTHRU
TABLE SCOTT.DEPT;
$ggsci> ADD EXTRACT ehive, TRANLOG, BEGIN NOW
$ggsci> ADD EXTTRAIL /u01/app/oracle/product/ggs/12.1.2/dirdat/et, EXTRACT ehive
$ggsci> ADD EXTRACT phive, EXTTRAILSOURCE /u01/app/oracle/product/ggs/12.1.2/dirdat/et
$ggsci> ADD RMTTRAIL /u01/app/oracle/product/ggs/11.2.1/dirdat/rt, EXTRACT phive

As the Java event handler on the target Hadoop platform won’t be able to ordinarily get table metadata for the source Oracle database, we’ll use the defgen utility on the source platform to create the parameter file that the replicat process will need.

$ggsci> edit params dept
defsfile ./dirsql/DEPT.sql
USERID ggsrc@ctrl11g, PASSWORD ggsrc
TABLE SCOTT.DEPT;
./defgen paramfile ./dirprm/dept.prm NOEXTATTR

Note that NOEXTATTR means no extra attributes; because the version on target is a generic and minimal version, the definition file with extra attributes won’t be interpreted. Then, this DEPT.sql file will need to be copied across to the target Hadoop platform where you’ve installed Oracle GoldenGate for Java, to the /dirsql folder within the GoldenGate install. 

[oracle@centraldb11gr2 12.1.2]$ ssh oracle@cdh51-node1
oracle@cdh51-node1's password: 
Last login: Wed Sep 10 17:05:49 2014 from centraldb11gr2.rittmandev.com
[oracle@cdh51-node1 ~]$ cd /u01/app/oracle/product/ggs/11.2.1/
[oracle@cdh51-node1 11.2.1]
$ pwd/u01/app/oracle/product/ggs/11.2.1
[oracle@cdh51-node1 11.2.1]$ ls dirsql/
DEPT.sql

Then, going back to the source Oracle database platform, we’ll start the Golden Gate Monitor process, and then the extract and pump processes.

[oracle@cdh51-node1 11.2.1]$ ssh oracle@centraldb11gr2
oracle@centraldb11gr2's password: 
Last login: Thu Sep 11 01:08:18 2014 from bdanode1.rittmandev.com
GGSCI (centraldb11gr2.rittmandev.com) 7> start mgr
Manager started.
 
GGSCI (centraldb11gr2.rittmandev.com) 8> start ehive
 
Sending START request to MANAGER ...
EXTRACT EHIVE starting
 
GGSCI (centraldb11gr2.rittmandev.com) 9> start phive
 
Sending START request to MANAGER ...
EXTRACT PHIVE starting

Setting up the Hadoop / Hive Replicat Process

Setting up the Hadoop side involves a couple of similar steps to the source capture side; first we configure the parameters for the Manager process, then configure the extract process that will pull table activity off of the trail file, sent over by the pump process on the source Oracle database.

[oracle@centraldb11gr2 12.1.2]$ ssh oracle@cdh51-node1
oracle@cdh51-node1's password: 
Last login: Wed Sep 10 21:09:38 2014 from centraldb11gr2.rittmandev.com
[oracle@cdh51-node1 ~]$ cd /u01/app/oracle/product/ggs/11.2.1/
[oracle@cdh51-node1 11.2.1]$ ./ggsci
$ggsci> edit params mgr
PORT 7809
PURGEOLDEXTRACTS /u01/app/oracle/product/ggs/11.2.1/dirdat/*, usecheckpoints, minkeepdays 3
$ggsci> add extract tphive, exttrailsource /u01/app/oracle/product/ggs/11.2.1/dirdat/rt
$ggsci> edit params tphive
EXTRACT tphive
SOURCEDEFS ./dirsql/DEPT.sql
CUserExit ./libggjava_ue.so CUSEREXIT PassThru IncludeUpdateBefores
GETUPDATEBEFORES
TABLE SCOTT.DEPT;

Now it’s time to create the Java hander that will write the trail data to the HDFS files and Hive table. The My Oracle Support Doc.ID 1586188.1 I mentioned at the start of the article has a sample Java program called SampleHandlerHive.java that writes incoming transactions into an HDFS file within the Hive directory, and also writes it to a file on the local filesystem. To get this working on our Hadoop system, we created a new java source code file from the content in SampleHandlerHive.java, updated the path from hadoopConf.addResource to point the the correct location of core-site.xml, hdfs-site.xml and mapred-site.xml, and then compiled it as follows:

export CLASSPATH=/u01/app/oracle/product/ggs/11.2.1/ggjava/ggjava.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/*
javac -d . SampleHandlerHive.java

Successfully executing the above command created the SampleHiveHandler.class under /u01/app/oracle/product/ggs/11.2.1//dirprm/com/mycompany/bigdata. To create the JAR file that the GoldenGate for Java adapter will need, I then need to change directory to the /dirprm directory under the Golden Gate install, and then run the following commands:

jar cvf myhivehandler.jar com
chmod 755 myhivehandler.jar

I also need to create a properties file for this JAR to use, in the same /dirprm directory. This properties file amongst other things tells the Golden Gate for Java adapter where in HDFS to write the data to (the location where the Hive table keeps its data files), and also references any other JAR files from the Hadoop distribution that it’ll need to get access to.

[oracle@cdh51-node1 dirprm]$ cat tphive.properties 
#Adapter Logging parameters. 
gg.log=log4j
gg.log.level=info
 
#Adapter Check pointing  parameters
goldengate.userexit.chkptprefix=HIVECHKP_
goldengate.userexit.nochkpt=true
 
# Java User Exit Property
goldengate.userexit.writers=jvm
jvm.bootoptions=-Xms64m -Xmx512M -Djava.class.path=/u01/app/oracle/product/ggs/11.2.1/ggjava/ggjava.jar:/u01/app/oracle/product/ggs/11.2.1/dirprm:/u01/app/oracle/product/ggs/11.2.1/dirprm/myhivehandler.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/hadoop-common-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-configuration-1.6.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-logging-1.1.3.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/commons-lang-2.6.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/etc/hadoop:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/etc/hadoop/conf.dist:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/lib/guava-11.0.2.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/hadoop-auth-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/hadoop-hdfs-2.3.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/commons-cli-1.2.jar:/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/hadoop/client/protobuf-java-2.5.0.jar
 
#Properties for reporting statistics
# Minimum number of {records, seconds} before generating a report
jvm.stats.time=3600
jvm.stats.numrecs=5000
jvm.stats.display=TRUE
jvm.stats.full=TRUE
 
#Hive Handler.  
gg.handlerlist=hivehandler
gg.handler.hivehandler.type=com.mycompany.bigdata.SampleHandlerHive
gg.handler.hivehandler.HDFSFileName=/user/hive/warehouse/department/dep_data
gg.handler.hivehandler.RegularFileName=cinfo_hive.txt
gg.handler.hivehandler.RecordDelimiter=;
gg.handler.hivehandler.mode=tx

Now, the final step on the Hadoop side is to start its Golden Gate Manager process, and then start the Replicat and apply process.

GGSCI (cdh51-node1.rittmandev.com) 5> start mgr
 
Manager started. 
 
GGSCI (cdh51-node1.rittmandev.com) 6> start tphive
 
Sending START request to MANAGER ...
EXTRACT TPHIVE starting

Testing it All Out

So now I’ve got the extract and pump processes running on the Oracle Database side, and the apply process running on the Hadoop side, let’s do a quick test and see if it’s working. I’ll start by looking at what data is in each table at the beginning.

SQL> select * from dept;     

    DEPTNO DNAME  LOC
 ---------- -------------- -------------

10 ACCOUNTING  NEW YORK
20 RESEARCH  DALLAS
30 SALES  CHICAGO
40 OPERATIONS  BOSTON
50 TESTE  PORTO
60 NELIO  STS
70 RAQUEL  AVES
 
7 rows selected.

Over on the Hadoop side, there’s just one row in the Hive table:

hive> select * from customer;

OK 80MARCIA   ST

Now I’ll go back to Oracle and insert a new row in the DEPT table:

SQL> insert into dept (deptno, dname, loc)
  2  values (75, 'EXEC','BRIGHTON'); 

1 row created. 
SQL> commit; 

Commit complete.

And, going back over to Hadoop, I can see Golden Gate has added that record to the Hive table, by the Golden Gate for Java adapter writing the transaction to the underlying HDFS file.

hive> select * from customer;

OK 80MARCIA   ST
75 EXEC       BRIGHTON

So there you have it; Golden Gate replicating Oracle RBDMS transactions into HDFS and Hive, to complement Apache Flume’s ability to replicate log and event data into Hadoop. Moreover, as Michael Rainey explained in this three part blog series, Golden Gate is closely integrated into the new 12c release of Oracle Data Integrator, making it even easier to manage Golden Gate replication processes into your overall data loading project, and giving Hadoop developers and Golden Gate users access to the full set of load orchestration and data quality features in that product rather than having to rely on home-grown scripting, or Oozie.

Analyzing Twitter Data using Datasift, MongoDB, Hive and ODI12c

Last week I posted an article on the blog around analysing Twitter data using Datasift, MongoDB and Pig, where I used the Datasift service to stream tweets about Rittman Mead into a MongoDB NoSQL database, and then queried the dataset using Pig. The context for this is the idea of a “data reservoir”, where we supplement the more traditional file and relational datasets we find in data warehouses with other data, typically machine generated, unstructured or very low-level, to add context to the numbers in our reporting system. In the example I quoted in the article, it’d be very interesting to take the activity we record against our blog and website and correlate that with the “conversation” that happens about it in the social media world; for example, were the hits for a particular article due to it been mentioned in a tweet, and did a spike in activity correspond to a particularly influential Twitter user retweeting something we’d tweeted?

NewImage

In that previous article I’d used Pig to access and analyse the data, in part because I saw a match between the nested datasets in a typical DataSift Twitter message and the relations, tuples and bags you get in a Pig schema. For example, if you look at the Tweet from Borkur in the screenshot below from RoboMongo, a Mac OS X client for MongoDB that I’ve found useful, you can see the author details nested inside the interaction details, and the Type attribute having many values under the Trends parent attribute – these map well onto Pig tuples and bags respectively.

NewImage

What I’d really like to do with this dataset, though, is to take certain elements of it and use that to supplement the data I’m loading using ODI12c. Whilst ODI can run arbitrary R, Pig and shell scripts using the ODI Procedure feature (as I did here to make use of Sqoop, before Oracle added Sqoop KMs to ODI12.1.3), it gets the best out of Hadoop when it can access data using Hive, the SQL layer over Hadoop that represents HDFS data as rows and columns, and allows us to SELECT and INSERT data using SQL commands – or to be precise, a dialect of SQL called HiveQL. But how will Hive cope with the nested and repeating data structures in a DataSift Twitter message, and allow us to get just the data out that we’re interested in?

In fact, the MongoDB connector for Hadoop that I used for Pig the other day also comes with Hive connectivity, in the form of a SerDe that lets Hive report against data in a MongoDB database (David Allen blogged about another MongoDB Hive storage handler a while ago, in an article about MongoDB and ODI). What’s more, this Hive connector for MongoDB is actually easier to work with that the Pig connector, as instead of worrying about Tuples and Bags you can just pick out the nested attributes that you’re interested in using a dot notation. For example, if I’m only interested in the InteractionID, username, tweet content and number of followers within a particular Twitter dataset, I can create a table that looks like this in Hive:

CREATE TABLE tweet_data(
  interactionId string,
  username string,
  content string,
  author_followers int)
ROW FORMAT SERDE 
  'com.mongodb.hadoop.hive.BSONSerDe' 
STORED BY 
  'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES ( 
  'mongo.columns.mapping'='{"interactionId":"interactionId",
  "username":"interaction.interaction.author.username",
  "content":\"interaction.interaction.content",
  "author_followers_count":"interaction.twitter.user.followers_count"}'
  )
TBLPROPERTIES (
  'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets'
  )

And at that point, it’s pretty easy to bring the dataset into ODI12c, through the IKM Hive to Hive Control Append knowledge module, and join up the Twitter dataset with the website log data that’s coming in via Flume. ODI can connect to Hive via JDBC drivers supplied with CDH4/5, and once you register the Hive connection and reverse-engineer the Hive metastore metadata into ODI’s repository, the complexity of the underlying Hive storage is hidden and you’re just presented with tables and columns, just like any other datastore type.

NewImage

Starting with the Twitter data first, I create a Hive table outside of ODI that returns the precise set of tweet attributes that I’m interested in, and then filter that dataset down to just the tweets that link to content on our website, by filtering on the tweet link’s URL matching the start of our website address.

NewImage

Then I load-up the hits from the Rittman Mead website, previously landed into Hadoop using Flume and exposed to ODI as another Hive table, filter out all the non-blog page accesses and keep just the URL part of the Apache Weblog request field, removing the transport mechanism and other bits around it.

NewImage

Then, I use a final ODI mapping to join the two datasets together, using ODI’s ability to apply HiveQL expressions to the incoming datasets so that’ve got the same format – trailing ‘/‘ at the end of the URL, no ampersand and query text at the end of the URL, and so on. Both this and the previous transformation are great examples of where ODI can help with this sort of work, making it pretty easy to munge and correct your data so that you’re then able to match-up the two different sources.

NewImage

Then it’s just a case of creating a package or load plan to sequence the mappings, and then run them using the local or standalone agent. You can see the individual KM steps running on the left-hand side, with ODI generating HiveQL queries which in turn are translated into MapReduce and run in parallel across the Hadoop cluster.

NewImage

And then, at the end of the process, I’ve got a Hive table of all of our blog articles that have been mentioned on Twitter (since we started consuming the tweet feed, a day or so ago), with the number of page requests and the number of times that page got mentioned in tweets.

NewImage

Obviously there’s a lot more we can do with this; we can access the number of followers each twitter user has, along with their location, gender and the sentiment (positive, negative, neutral) of the tweet. From that we can work out some impact from the twitter activity, and we can also add to it data from other sources such as Facebook, LinkedIn and so on to get a fuller picture of the activity around our site. Then, the data we’re gathering in can either be left in MongoDB, or I can use these ODI mappings to either archive it in Hive tables, or export the highlights out to Oracle Database using Sqoop or Oracle Loader for Hadoop.

Analyzing Twitter Data using Datasift, MongoDB and Pig

If you followed our recent postings on the updated Oracle Information Management Reference Architecture, one of the key concepts we talk about is the “data reservoir”. This is a pool of additional data that you can add to your data warehouse, typically stored on Hadoop or NoSQL databases, where you store unstructured, semi-structured or unprocessed structured data in large volume and at low cost. Adding a data reservoir gives you the ability to leverage the types of data sources that previously were thought of as too “messy”, too high-volume to store for any serious amount of time, or require processing or storing by tools that aren’t in the usual relational data warehouse toolset.

NewImage

By formally including them in your overall information management architecture though, with common tools, security and data governance over the entire dataset, you give your users the ability to consider the whole “360-degree view” of their customers and their interactions with the market.

To take an example, a few weeks ago I posted a series of articles on the blog where I captured user activity on our website, http://www.rittmanmead.com, transported it to one of our Hadoop clusters using Apache Flume, and then analysed it using Hive, Pig and finally Spark. In one of the articles I used Pig and a geocoding API to determine the country that each website visitor came from, and then in a final five-part series I automated the whole process using ODI12c and then copied the final output tables to Oracle using Oracle Loader for Hadoop. This is quite a nice example of ETL-offloading into Hadoop, with an element of Hadoop-native event capture using Flume, but once the processing has finished the data moves out of Hadoop and into the Oracle database.

NewImage

What would be interesting though would be to start adding data into Hadoop that’s permanent, not transitory as part of an ETL process, to start building out this concept of the “data reservoir”. Taking our website activity dataset, something that would really add context to the visits to our site would be corresponding activity on social networks, to see who’s linking to our posts, who’s discussing them, whether those discussions are positive or negative, and which wider networks those people belong to. Twitter is a good place to start with this as it’s the place we see our articles and activities most discussed, but it’s be good to build out this picture over time to add in activity on social networks such as Facebook, Youtube, LinkedIn and Google+; if we did this, we’d be able to consider a much broader and richer picture when looking at activity around Rittman Mead, potentially correlating activity and visits to our website with mentions of us in the press, comments made by our team and the wider picture of what’s going on in our world.

NewImage

There are a number of ways you can bring Twitter data into your Hadoop cluster or data warehouse, but the most convenient way we’ve found is to use DataSift, a social media aggregation site and service that license raw feeds from the likes of Twitter, Facebook, WordPress and other social media platforms, enhancing the data feeds with sentiment scores and other attributes, and then sell access to the feeds via a number of formats and APIs. Accessing Twitter data through DataSift costs money, particularly if you want to go back and look at historical activity vs. just filtering on a few keywords in new Twitter activity, but they’re very developer-friendly and able to provide greater volumes of firehose activity than the standard Twitter developer API allows.

So assuming you can get access to a stream of Twitter data on a particular topic – in our case, all mentions of our website, our team’s Twitter handles, retweets of our content etc – the question then becomes one of how to store the data. Looking at the Datasift Sample Output page, each of these streams delivers their payloads via JSON documents, XML-like structures that nest categories of tweet metadata within parent structures that make up the total tweet data and metadata dataset.

NewImage

And there’s a good reason for this; individual tweets might not use every bit of possible tweet metadata, for example not including entries under “mentions” or “retweets” if those aren’t used in a  particular message. Certain bits of metadata might be repeated X numbers of times – @ mentions, for example, and the JSON document might have a different structure altogether if a different JSON schema version is used for a particular tweet. Altogether not an easy type of data structure for a relational database to hold – though Oracle 12.1.0.2.0 has just introduced native JSON support to the core Oracle database – but NoSQL databases in contrast find these sorts of data structures easy, and one of the most popular for this type of work is MongoDB.

MongoDB is a open-source “document” database that’s probably best known to the Oracle world through this internet cartoon; what the video is getting at is NoSQL advocates recommending databases such as MongoDB for large-scale web work when something much more mainstream like mySQL would do the job better, but where NoSQL and document-style database come into their own is storing just this type of semi-structured, schema-on-read datasets. In fact, Datasift support MongoDB as an API end-point for their Twitter feed, so let’s go ahead and set up a MongoDB database, prepare it for the Twitter data, and then set-up a Datasift feed into it.

MongoDB installation on Linux, for example to run alongside a Hadoop installation, is pretty straightforward and involves adding a YUM repository and then running “sudo yum install mongodb-org” (there’s an OS X installation too, but I wanted to run this server-side on my Hadoop cluster). Once you’ve installed the MongoDB software, you start the mongod service to enable the server element, and then log into the mongo command-shell to create a new database.

MongoDB, being a schema-on-read database, doesn’t require you to set up a database schema up-front; instead, the schema comes from the data you load into it, with MongoDB’s equivalent of tables called “collections”, and with those collections made up of documents, analogous to rows in Oracle. Where it gets interesting though is that collections and databases only get created when you first start using them, and individual documents can have slightly, or even completely, different schema structures to each other – which makes them ideal for holding the sorts of datasets generated by Twitter, Facebook and DataSift.

[root@cdh51-node1 ~]# mongo
MongoDB shell version: 2.6.4
connecting to: test
> use datasift2
switched to db datasift2

Let’s create a couple of simple documents, and then add those to a collection. Note that the document becomes available just by declaring it, as does the collection when I add documents to it. Note also that the query language we’re using to work with MongoDB is Javascript, again making it particularly suited to JSON documents, and web-type environments.

> a = { name : "mark" }
{ "name" : "mark" }
> b = { product : "chair", size : "L" }
{ "product" : "chair", "size" : "L" }
> db
datasift2
> db.testData.insert(a)
WriteResult({ "nInserted" : 1 })
> db.testData.insert(b)
WriteResult({ "nInserted" : 1 })
> db.testData.find()
{ "_id" : ObjectId("54094081b5b6021fe9bc8b10"), "name" : "mark" }
{ "_id" : ObjectId("54094088b5b6021fe9bc8b11"), "product" : "chair", "size" : "L" }

And note also how the second entry (document) in the collection has a different schema to the entry above it – perfect for our semi-structured Twitter data, and something we could store as-is in MongoDB in this loose data format and then apply more formal structures and schemas to when we come to access the data – as we’ll do in a moment using Pig, and more formally using ODI and Hive in the next article in this series.

Setting up the Twitter feed from DataSift is a two-stage process, once you’ve got an account with them and an API key; first you define your search terms against a nested document model for the data source, then you activate the feed, in this case into my MongoDB database, and wait for the tweets to roll in. For my feed I selected tweets written by myself and some of the Rittman Mead team, tweets mentioning us, and tweets that included links to our blog in the main tweet contents (there’s also a graphical query designer, but I prefer to write them by hand using what DataSift call their “curated stream definition language” (CSDL).

NewImage

You can then preview the feed, live, or go back and sample historic data if you’re interested in loading old tweets, rather than incoming new ones. Once you’re ready you then need to activate the feed, in my case by calling a URL using CURL with a bunch of parameters (our API key and other sensitive data has been masked):

curl -X POST 'https://api.datasift.com/v1/push/create' \
-d 'name=connectormongodb' \
-d 'hash=65bd9dc4943ec426b04819exxxxxxxxx' \
-d 'output_type=mongodb' \
-d 'output_params.host=rittmandev.com' \
-d 'output_params.port=27017' \
-d 'output_params.use_ssl=no' \
-d 'output_params.verify_ssl=no' \
-d 'output_params.db_name=datasiftmongodb' \
-d 'output_params.collection_name=rm_tweets' \
-H 'Auth: rittmanmead:xxxxxxxxxxxxxxxxxxxxxxxxxxx'

The “hash” in the parameter list is the specific feed to activate, and the output type is MongoDB. The collection name is new, and will be created by MongoDB when the first tweet comes in; let’s run the curl command now and sit back for a while, and wait for some twitter activity to arrive in MongoDB …

… and a couple of hours later, eight tweets have been captured by the DataSift filter, with the last of them being one from Michael Rainey about his trip tonight to the Seahawks game:

> db.rm_tweets.count()
8
> db.rm_tweets.findOne()
{
    "_id" : ObjectId("54089a879ad4ec99158b4d78"),
    "interactionId" : "1e43454b1a16a880e074e49c51369eac",
    "subscriptionId" : "f6cf211e03dca5da384786676c31fd3e",
    "hash" : "65bd9dc4943ec426b04819e6291ef1ce",
    "hashType" : "stream",
    "interaction" : {
        "demographic" : {
            "gender" : "male"
        },
        "interaction" : {
            "author" : {
                "avatar" : "http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg",
                "id" : 14551637,
                "language" : "en",
                "link" : "https://twitter.com/mRainey",
                "name" : "Michael Rainey",
                "username" : "mRainey"
            },
            "content" : "Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD",
            "created_at" : "Thu, 04 Sep 2014 16:58:29 +0000",
            "hashtags" : [
                "GoHawks",
                "kickoff2014",
                "GBvsSEA"
            ],
            "id" : "1e43454b1a16a880e074e49c51369eac",
            "link" : "https://twitter.com/mRainey/status/507573423334100992",
            "mention_ids" : [
                23642374
            ],
            "mentions" : [
                "Seahawks"
            ],
            "received_at" : 1409849909.2967,
            "schema" : {
                "version" : 3
            },
            "source" : "Instagram",
            "type" : "twitter"
        },
        "language" : {
            "tag" : "en",
                "tag_extended" : "en",
            "confidence" : 98
        },
        "links" : {
            "code" : [
                200
            ],
            "created_at" : [
                "Thu, 04 Sep 2014 16:58:29 +0000"
            ],
            "meta" : {
                "charset" : [
                    "CP1252"
                ],
                "lang" : [
                    "en"
                ],
                "opengraph" : [
                    {
                        "description" : "mrainey's photo on Instagram",
                        "image" : "http://photos-d.ak.instagram.com/hphotos-ak-xfa1/10655141_1470641446544147_1761180844_n.jpg",
                        "site_name" : "Instagram",
                        "type" : "instapp:photo",
                        "url" : "http://instagram.com/p/sh_h6sQBYT/"
                    }
                ]
            },
            "normalized_url" : [
                "http://instagram.com/p/sh_h6sQBYT"
            ],
            "title" : [
                "Instagram"
            ],
            "url" : [
                "http://instagram.com/p/sh_h6sQBYT/"
            ]
        },
        "salience" : {
            "content" : {
                "sentiment" : 0,
                "topics" : [
                    {
                        "name" : "Video Games",
                        "hits" : 0,
                        "score" : 0.5354745388031,
                        "additional" : "Greyson and I will be ready for the @Seahawks game tonight!"
                    }
                ]
            }
        },
        "trends" : {
            "type" : [
                "San Jose",
                "United States"
            ],
            "content" : [
                "seahawks"
            ],
            "source" : [
                "twitter"
            ]
        },
        "twitter" : {
            "created_at" : "Thu, 04 Sep 2014 16:58:29 +0000",
            "display_urls" : [
                "instagram.com/p/sh_h6sQBYT/"
            ],
            "domains" : [
                "instagram.com"
            ],
            "filter_level" : "medium",
            "hashtags" : [
                "GoHawks",
                "kickoff2014",
                "GBvsSEA"
            ],
            "id" : "507573423334100992",
            "lang" : "en",
            "links" : [
                "http://instagram.com/p/sh_h6sQBYT/"
            ],
            "mention_ids" : [
                23642374
            ],
            "mentions" : [
                "Seahawks"
            ],
            "source" : "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>",
            "text" : "Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD",
            "user" : {
                "created_at" : "Sat, 26 Apr 2008 21:18:01 +0000",
                "description" : "Data Integration (#ODI #GoldenGate #OBIA) consultant / blogger / speaker @RittmanMead.\nOracle ACE.\n#cycling #Seahawks #travel w/ @XiomaraRainey\n#GoCougs!",
                "favourites_count" : 746,
                "followers_count" : 486,
                "friends_count" : 349,
                "geo_enabled" : true,
                "id" : 14551637,
                "id_str" : "14551637",
                "lang" : "en",
                "listed_count" : 28,
                "location" : "Pasco, WA",
                "name" : "Michael Rainey",
                "profile_image_url" : "http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg",
                "profile_image_url_https" : "https://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg",
                "screen_name" : "mRainey",
                "statuses_count" : 8549,
                "time_zone" : "Pacific Time (US & Canada)",
                "url" : "http://www.linkedin.com/in/rainey",
                "utc_offset" : -25200,
                "verified" : false
            }
        }
    }
}

If you’ve not looked at Twitter metadata before, it’s surprising how much metadata accompanies what’s ostensibly an 140-character tweet. As well as details on the author, where the tweet was sent from, what Twitter client sent the tweet and details of the tweet itself, there’s details and statistics on the sender, the number of followers they’ve got and where they’re located, a list of all other Twitter users mentioned in the tweet and any URLs and images referenced.

Not every tweet will use every element of metadata, and some tweets will repeat certain attributes – other Twitter users you’ve mentioned in the tweet, for example – as many times as there are mentions. Which makes Twitter data a prime candidate for analysis using Pig and Spark, which handle easily the concept of nested data structures, tuples (ordered lists of data, such as attribute sets for an entity such as “author”), and bags (sets of unordered attributes, such as the list of @ mentions in a tweet).

There’s a MongoDB connector for Hadoop on Github which allows MapReduce to connect to MongoDB databases, running MapReduce jobs on MongoDB storage rather than HDFS (or S3, or whatever). This gives us the ability to use languages such as Pig and Hive to filter and aggregate our MongoDB data rather than MongoDB’s Javascript API, which isn’t as fully-featured and scaleable as MapReduce and has limitations in terms of the number of documents you can include in aggregations; let’s start then by connecting Pig to our MongoDB database, and reading in the documents with no Pig schema applied:

grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' using com.mongodb.hadoop.pig.MongoLoader;                                                                                                                                                  2014-09-05 06:40:51,773 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.                                                                                                                                        
2014-09-05 06:40:51,838 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> tweets_count = FOREACH (GROUP tweets ALL) GENERATE COUNT (tweets);                                             
2014-09-05 06:41:07,772 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:41:07,817 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> dump tweets_count
...
(9)
grunt>

So there’s nine tweets in the MongoDB database now. Let’s take a look at one of the documents by creating a Pig alias containing just a single record.

grunt> tweets_limit_1 = LIMIT tweets 1;
2014-09-05 06:43:12,351 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:43:12,443 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> dump tweets_limit_1
...
([interaction#{trends={source=(twitter), content=(seahawks), type=(San Jose,United States)}, twitter={filter_level=medium, text=Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD, mention_ids=(23642374), domains=(instagram.com), links=(http://instagram.com/p/sh_h6sQBYT/), lang=en, id=507573423334100992, source=<a href="http://instagram.com" rel="nofollow">Instagram</a>, created_at=Thu, 04 Sep 2014 16:58:29 +0000, hashtags=(GoHawks,kickoff2014,GBvsSEA), mentions=(Seahawks), user={profile_image_url_https=https://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, location=Pasco, WA, geo_enabled=true, statuses_count=8549, lang=en, url=http://www.linkedin.com/in/rainey, utc_offset=-25200, id=14551637, time_zone=Pacific Time (US & Canada), favourites_count=746, verified=false, friends_count=349, description=Data Integration (#ODI #GoldenGate #OBIA) consultant / blogger / speaker @RittmanMead.
Oracle ACE.
#cycling #Seahawks #travel w/ @XiomaraRainey
#GoCougs!, name=Michael Rainey, created_at=Sat, 26 Apr 2008 21:18:01 +0000, screen_name=mRainey, id_str=14551637, profile_image_url=http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, followers_count=486, listed_count=28}, display_urls=(instagram.com/p/sh_h6sQBYT/)}, salience={content={topics=([score#0.5354745388031,additional#Greyson and I will be ready for the @Seahawks game tonight!,hits#0,name#Video Games]), sentiment=0}}, links={created_at=(Thu, 04 Sep 2014 16:58:29 +0000), title=(Instagram), code=(200), normalized_url=(http://instagram.com/p/sh_h6sQBYT), url=(http://instagram.com/p/sh_h6sQBYT/), meta={lang=(en), charset=(CP1252), opengraph=([image#http://photos-d.ak.instagram.com/hphotos-ak-xfa1/10655141_1470641446544147_1761180844_n.jpg,type#instapp:photo,site_name#Instagram,url#http://instagram.com/p/sh_h6sQBYT/,description#mrainey's photo on Instagram])}}, interaction={schema={version=3}, id=1e43454b1a16a880e074e49c51369eac, content=Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD, author={id=14551637, username=mRainey, language=en, avatar=http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, name=Michael Rainey, link=https://twitter.com/mRainey}, received_at=1.4098499092967E9, source=Instagram, mention_ids=(23642374), link=https://twitter.com/mRainey/status/507573423334100992, created_at=Thu, 04 Sep 2014 16:58:29 +0000, hashtags=(GoHawks,kickoff2014,GBvsSEA), type=twitter, mentions=(Seahawks)}, language={tag=en, confidence=98, tag_extended=en}, demographic={gender=male}},interactionId#1e43454b1a16a880e074e49c51369eac,_id#54089a879ad4ec99158b4d78,hash#65bd9dc4943ec426b04819e6291ef1ce,subscriptionId#f6cf211e03dca5da384786676c31fd3e,hashType#stream])

And there’s Michael’s tweet again, with all the attributes from the MongoDB JSON document appended together into a single record. But in this format the data isn’t all that useful as we can’t easily access individual elements in the Twitter record; what would be better would be to apply a Pig schema definition to the LOAD statement, using the MongoDB document field listing that we saw when we displayed a single record from the MongoDB collection earlier.

I can start by referencing the document fields that become simple Pig dataypes; ID and interactionId, for example:

grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.my_first_test' using com.mongodb.hadoop.pig.MongoLoader('id:chararray,interactionId:chararray','id');
2014-09-05 06:57:57,985 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:57:58,022 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> describe tweets
2014-09-05 06:58:11,611 [main] INFO  com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
tweets: {id: chararray,interactionId: chararray}
grunt> tweets_limit_1 = LIMIT tweets 1;
...
(53fae22e9ad4ec93658b513e,1e42c2747542a100e074fff55100414a)
grunt>

Where the MongoDB document has fields nested within other fields, you can reference these as a tuple if they’re a set of attributes under a common header, or a bag if they’re just a list of values for a single attribute; for example, the “username” field is contained within the author tuple, which in-turn is contained within the interaction tuple, so to count tweets by author I’d need to first flatten the author tuple to turn its fields into scalar fields, then project out the username and other details; then I can group the relation in the normal way on those author details, and generate a count of tweets, like this:

grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' using com.mongodb.hadoop.pig.MongoLoader('id:chararray,interactionId:chararray,interaction:tuple(interaction:tuple(author:tuple(id:int,language:chararray,link:chararray,name:chararray,username:chararray)))','id');
grunt> tweets_author_tuple_flattened = FOREACH tweets GENERATE id, FLATTEN(interaction.$0);                                            
grunt> tweets_with_authors = FOREACH tweets_author_tuple_flattened GENERATE id, interaction::author.username, interaction::author.name;
grunt> tweets_author_group = GROUP tweets_with_authors by username; 
grunt> tweets_author_count = FOREACH tweets_author_group GENERATE group, COUNT(tweets_with_authors); 
...
(rmoff,1)
(dw_pete,1)
(mRainey,3)
(P_J_FLYNN,3)
(davidhuey,7)
(EdelweissK,1)
(JamesOickle,3)
(markrittman,3)
(rittmanmead,2)
(RedgraveChris,1)
grunt>

So there’s obviously a lot more we can do with the Twitter dataset as it stands, but where it’ll get really interesting is combining this with other social media interaction data – for example from Facebook, LinkedIn and so on – and then correlating that with out main site activity data. Check back in a few days when we’ll be covering this second stage in a further blog article, using ODI12c to orchestrate the process.

Upcoming Big Data and Hadoop for Oracle BI, DW and DI Developers Presentations

If you’ve been following our postings on the blog over the past year, you’ll probably have seen quite a lot of activity around big data and Hadoop and in particular, what these technologies bring to the world of Oracle Business Intelligence, Oracle Data Warehousing and Oracle Data Integration. For anyone who’s not had a chance to read the posts and articles, the three links below are a great introduction to what we’ve been up to:

In addition, we recently took part in an OTN ArchBeat podcast with Stewart Bryson and Andrew Bond on the updated Oracle Information Management Reference Architecture we co-developed with Oracle’s Enterprise Architecture team, where you can hear me talk with Stewart and Andrew about how the updated architecture came about, the thinking behind it, and how concepts like the data reservoir and data factory can be delivered in an agile way.

I’m also pleased to be delivering a number of presentations and seminars over the next few months, on Oracle and Cloudera’s Hadoop technology and how it applies to Oracle BI, DW and DI developers – if you’re part of a local Oracle user group and you’d like me to deliver one of them for your group, drop me an email at mark.rittman@rittmanmead.com.

Slovenian Oracle User Group / Croatian Oracle User Group Conferences, October 2014

These two events run over consecutive days in Slovenia and Croatia, and I’m delivering the keynote at each on Analytics and Big Data, and a one-day seminar running on the Tuesday in Slovenia, and over the Wednesday and Thursday in Croatia. The theme of the seminar is around applying Hadoop and big data technologies to Oracle BI, DW and data integration, and is made up of four sessions:

Part 1 : Introduction to Hadoop and Big Data Technologies for Oracle BI & DW Developers

“In this session we’ll introduce some key Hadoop concepts including HDFS, MapReduce, Hive and NoSQL/HBase, with the focus on Oracle Big Data Appliance and Cloudera Distribution including Hadoop. We’ll explain how data is stored on a Hadoop system and the high-level ways it is accessed and analysed, and outline Oracle’s products in this area including the Big Data Connectors, Oracle Big Data SQL, and Oracle Business Intelligence (OBI) and Oracle Data Integrator (ODI).”

Part 2 : Hadoop and NoSQL Data Ingestion using Oracle Data Integrator 12c and Hadoop Technologies

“There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.”

Part 3 : Big Data Analysis using Hive, Pig, Spark and Oracle R Enterprise / Oracle R Advanced Analytics for Hadoop

“Data within a Hadoop cluster is typically analysed and processed using technologies such as Pig, Hive and Spark before being made available for wider use using products like Oracle Big Data SQL and Oracle Business Intelligence. In this session, we’ll introduce Pig and Hive as key analysis tools for working with Hadoop data using MapReduce, and then move on to Spark as the next-generation analysis platform typically being used on Hadoop clusters today. We’ll also look at the role of Oracle’s R technologies in this scenario, using Oracle R Enterprise and Oracle R Advanced Analytics for Hadoop to analyse and understand larger datasets than we could normally accommodate with desktop analysis environments.”

Part 4 : Visualizing Hadoop Datasets using Oracle Business Intelligence, Oracle BI Publisher and Oracle Endeca Information Discovery

“Once insights and analysis have been produced within your Hadoop cluster by analysts and technical staff, it’s usually the case that you want to share the output with a wider audience in the organisation. Oracle Business Intelligence has connectivity to Hadoop through Apache Hive compatibility, and other Oracle tools such as Oracle BI Publisher and Oracle Endeca Information Discovery can be used to visualise and publish Hadoop data. In this final session we’ll look at what’s involved in connecting these tools to your Hadoop environment, and also consider where data is optimally located when large amounts of Hadoop data need to be analysed alongside more traditional data warehouse datasets.”

Oracle Openworld 2014 (ODTUG Sunday Symposium), September 2014

Along with another session later in the week on the upcoming Oracle BI Cloud Services, I’m doing a session on the User Group Sunday for ODTUG on ODI12c and the Big Data Connectors for ETL on Hadoop:

Deep Dive into Big Data ETL with Oracle Data Integrator 12c and Oracle Big Data Connectors [UGF9481]

“Much of the time required to work with big data sources is spent in the data acquisition, preparation, and transformation stages of a project before your data reaches a state suitable for analysis by your users. Oracle Data Integrator, together with Oracle Big Data Connectors, provides a means to efficiently load and unload data to and from Oracle Database into a Hadoop cluster and perform transformations on the data, either in raw form or in technologies such as Apache Hive or R. This presentation looks at how Oracle Data Integrator can form the centerpiece of your big data ETL strategy, within either a custom-built big data environment or one based on Oracle Big Data Appliance.”

UK Oracle User Group Tech’14 Conference, December 2014

I’m delivering an extended version of my OOW presentation on the UKOUG Tech’14’s “Super Sunday” event, this time over 90 minutes rather than the 45 at OOW, giving me a bit more time for demos and discussion:

Deep-Dive into Big Data ETL using ODI12c and Oracle Big Data Connectors

“Much of the time required to work with Big Data sources is spent in the data aquisition, preparation and transformation stages of a project; before your data is in a state suitable for analysis by your users.Oracle Data Integrator, together with Oracle Big Data Connectors, provides a means to efficiently load and unload data from Oracle Database into a Hadoop cluster, and perform transformations on the data either in raw form or technologies such as Apache Hive or R. In this presentation, we will look at how ODI can form the centrepiece of your Big Data ETL strategy, either within a custom-built Big Data environment or one based on Oracle Big Data Appliance.”

Oracle DW Global Leaders’ Meeting, Dubai, December 2014

The Oracle DW Global Leaders forum is an invite-only group organised by Oracle and attended by select customers and associate partners, one of which is Rittman Mead. I’ll be delivering the technical seminar at the end of the second day, which will run over two sessions and will be based on the main points from the one-day seminars I’m running in Croatia and Slovenia.

From Hadoop to dashboards, via ODI and the BDA – the complete trail : Part 1 and Part 2

“Join Rittman Mead for this afternoon workshop, taking you through data acquisition and transformation in Hadoop using ODI, Cloudera CDH and Oracle Big Data Appliance, through to reporting on that data using OBIEE, Endeca and Oracle Big Data SQL. Hear our project experiences, and tips and techniques based on real-world implementations”

Keep an eye out for more Hadoop and big data content over the next few weeks, including a look at MongoDB and NoSQL-type databases, and how they can be used alongside Oracle BI, DW and data integration tools.