Tag Archives: Big Data
Analyzing Twitter Data using Datasift, MongoDB, Hive and ODI12c
Last week I posted an article on the blog around analysing Twitter data using Datasift, MongoDB and Pig, where I used the Datasift service to stream tweets about Rittman Mead into a MongoDB NoSQL database, and then queried the dataset using Pig. The context for this is the idea of a “data reservoir”, where we supplement the more traditional file and relational datasets we find in data warehouses with other data, typically machine-generated, unstructured or very low-level, to add context to the numbers in our reporting system. In the example I quoted in the article, it’d be very interesting to take the activity we record against our blog and website and correlate that with the “conversation” that happens about it in the social media world; for example, were the hits for a particular article due to it being mentioned in a tweet, and did a spike in activity correspond to a particularly influential Twitter user retweeting something we’d tweeted?
In that previous article I’d used Pig to access and analyse the data, in part because I saw a match between the nested datasets in a typical DataSift Twitter message and the relations, tuples and bags you get in a Pig schema. For example, if you look at the Tweet from Borkur in the screenshot below from RoboMongo, a Mac OS X client for MongoDB that I’ve found useful, you can see the author details nested inside the interaction details, and the Type attribute having many values under the Trends parent attribute – these map well onto Pig tuples and bags respectively.
What I’d really like to do with this dataset, though, is to take certain elements of it and use that to supplement the data I’m loading using ODI12c. Whilst ODI can run arbitrary R, Pig and shell scripts using the ODI Procedure feature (as I did here to make use of Sqoop, before Oracle added Sqoop KMs to ODI12.1.3), it gets the best out of Hadoop when it can access data using Hive, the SQL layer over Hadoop that represents HDFS data as rows and columns, and allows us to SELECT and INSERT data using SQL commands – or to be precise, a dialect of SQL called HiveQL. But how will Hive cope with the nested and repeating data structures in a DataSift Twitter message, and allow us to get just the data out that we’re interested in?
In fact, the MongoDB connector for Hadoop that I used for Pig the other day also comes with Hive connectivity, in the form of a SerDe that lets Hive report against data in a MongoDB database (David Allen blogged about another MongoDB Hive storage handler a while ago, in an article about MongoDB and ODI). What’s more, this Hive connector for MongoDB is actually easier to work with than the Pig connector, as instead of worrying about tuples and bags you can just pick out the nested attributes that you’re interested in using a dot notation. For example, if I’m only interested in the InteractionID, username, tweet content and number of followers within a particular Twitter dataset, I can create a table that looks like this in Hive:
CREATE TABLE tweet_data(
  interactionId string,
  username string,
  content string,
  author_followers int)
ROW FORMAT SERDE
  'com.mongodb.hadoop.hive.BSONSerDe'
STORED BY
  'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES (
  'mongo.columns.mapping'='{"interactionId":"interactionId",
                            "username":"interaction.interaction.author.username",
                            "content":"interaction.interaction.content",
                            "author_followers":"interaction.twitter.user.followers_count"}'
)
TBLPROPERTIES (
  'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets'
)
And at that point, it’s pretty easy to bring the dataset into ODI12c, through the IKM Hive to Hive Control Append knowledge module, and join up the Twitter dataset with the website log data that’s coming in via Flume. ODI can connect to Hive via JDBC drivers supplied with CDH4/5, and once you register the Hive connection and reverse-engineer the Hive metastore metadata into ODI’s repository, the complexity of the underlying Hive storage is hidden and you’re just presented with tables and columns, just like any other datastore type.
Starting with the Twitter data first, I create a Hive table outside of ODI that returns the precise set of tweet attributes that I’m interested in, and then filter that dataset down to just the tweets that link to content on our website, by filtering on the tweet link’s URL matching the start of our website address.
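To give a flavour of what that filter looks like underneath, here’s a rough HiveQL sketch rather than the actual ODI-generated code – the tweet_data_with_links table and its link_url column are assumptions, standing in for a Hive table over the MongoDB collection with the tweet’s link URL mapped in as an extra column:

SELECT interactionid,
       username,
       content,
       author_followers,
       link_url
FROM   tweet_data_with_links                          -- hypothetical variant of tweet_data with the link URL mapped in
WHERE  link_url LIKE 'http://www.rittmanmead.com%';   -- keep only tweets that link to our own site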
Then I load-up the hits from the Rittman Mead website, previously landed into Hadoop using Flume and exposed to ODI as another Hive table, filter out all the non-blog page accesses and keep just the URL part of the Apache Weblog request field, removing the transport mechanism and other bits around it.
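The sort of HiveQL expression involved looks something like this – again just a sketch, with the apache_access_log table and its request column standing in for the Flume-landed weblog table, and a dated-path pattern used as one crude way of separating blog-post hits from other page accesses:

SELECT regexp_extract(request, '^[^ ]+ ([^ ]+)', 1) AS page_url   -- strip the "GET " prefix and " HTTP/1.1" suffix
FROM   apache_access_log
WHERE  request RLIKE '/20[0-9]{2}/';                               -- keep only the dated blog-post URLs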
Then, I use a final ODI mapping to join the two datasets together, using ODI’s ability to apply HiveQL expressions to the incoming datasets so that they’ve got the same format – trailing ‘/‘ at the end of the URL, no ampersand and query text at the end of the URL, and so on. Both this and the previous transformation are great examples of where ODI can help with this sort of work, making it pretty easy to munge and correct your data so that you’re then able to match up the two different sources.
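The join itself then boils down to something like the following – a hedged sketch with hypothetical table names rather than the exact HiveQL that ODI generates, with the page-request counts in the final output coming from a separate aggregation of the weblog dataset joined back in on the cleaned-up URL:

SELECT w.page_url,
       COUNT(DISTINCT t.interactionid) AS tweet_mentions
FROM   blog_page_requests w
JOIN   site_tweets t
  ON   regexp_replace(regexp_replace(t.link_url, '\\?.*$', ''), '/$', '')        -- drop any query string and trailing slash
     = regexp_replace(concat('http://www.rittmanmead.com', w.page_url), '/$', '')
GROUP BY w.page_url;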
Then it’s just a case of creating a package or load plan to sequence the mappings, and then run them using the local or standalone agent. You can see the individual KM steps running on the left-hand side, with ODI generating HiveQL queries which in turn are translated into MapReduce and run in parallel across the Hadoop cluster.
And then, at the end of the process, I’ve got a Hive table of all of our blog articles that have been mentioned on Twitter (since we started consuming the tweet feed, a day or so ago), with the number of page requests and the number of times that page got mentioned in tweets.
Obviously there’s a lot more we can do with this; we can access the number of followers each Twitter user has, along with their location, gender and the sentiment (positive, negative, neutral) of the tweet. From that we can work out some measure of the impact of the Twitter activity, and we can also add to it data from other sources such as Facebook, LinkedIn and so on to get a fuller picture of the activity around our site. Then, the data we’re gathering in can either be left in MongoDB, or I can use these ODI mappings to archive it in Hive tables, or export the highlights out to Oracle Database using Sqoop or Oracle Loader for Hadoop.
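For that last step, a Sqoop export along these lines would do the job – a minimal sketch, with the connection details, Oracle schema and target table name all hypothetical:

sqoop export \
  --connect jdbc:oracle:thin:@//exadata-scan:1521/DWPDB \
  --username BLOG_ANALYTICS \
  --password-file /user/oracle/.sqoop_pwd \
  --table BLOG_TWEET_SUMMARY \
  --export-dir /user/hive/warehouse/blog_tweet_summary \
  --input-fields-terminated-by '\001'    # Hive's default field delimiter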
Analyzing Twitter Data using Datasift, MongoDB and Pig
If you followed our recent postings on the updated Oracle Information Management Reference Architecture, one of the key concepts we talk about is the “data reservoir”. This is a pool of additional data that you can add to your data warehouse, typically stored on Hadoop or NoSQL databases, where you store unstructured, semi-structured or unprocessed structured data in large volume and at low cost. Adding a data reservoir gives you the ability to leverage the types of data sources that previously were thought of as too “messy”, too high-volume to store for any serious amount of time, or as requiring processing or storage by tools that aren’t in the usual relational data warehouse toolset.
By formally including them in your overall information management architecture though, with common tools, security and data governance over the entire dataset, you give your users the ability to consider the whole “360-degree view” of their customers and their interactions with the market.
To take an example, a few weeks ago I posted a series of articles on the blog where I captured user activity on our website, http://www.rittmanmead.com, transported it to one of our Hadoop clusters using Apache Flume, and then analysed it using Hive, Pig and finally Spark. In one of the articles I used Pig and a geocoding API to determine the country that each website visitor came from, and then in a final five-part series I automated the whole process using ODI12c and then copied the final output tables to Oracle using Oracle Loader for Hadoop. This is quite a nice example of ETL-offloading into Hadoop, with an element of Hadoop-native event capture using Flume, but once the processing has finished the data moves out of Hadoop and into the Oracle database.
What would be interesting though would be to start adding data into Hadoop that’s permanent, not transitory as part of an ETL process, to start building out this concept of the “data reservoir”. Taking our website activity dataset, something that would really add context to the visits to our site would be corresponding activity on social networks, to see who’s linking to our posts, who’s discussing them, whether those discussions are positive or negative, and which wider networks those people belong to. Twitter is a good place to start with this as it’s the place we see our articles and activities most discussed, but it’d be good to build out this picture over time to add in activity on social networks such as Facebook, YouTube, LinkedIn and Google+; if we did this, we’d be able to consider a much broader and richer picture when looking at activity around Rittman Mead, potentially correlating activity and visits to our website with mentions of us in the press, comments made by our team and the wider picture of what’s going on in our world.
There are a number of ways you can bring Twitter data into your Hadoop cluster or data warehouse, but the most convenient way we’ve found is to use DataSift, a social media aggregation site and service that license raw feeds from the likes of Twitter, Facebook, WordPress and other social media platforms, enhance the data feeds with sentiment scores and other attributes, and then sell access to the feeds via a number of formats and APIs. Accessing Twitter data through DataSift costs money, particularly if you want to go back and look at historical activity vs. just filtering on a few keywords in new Twitter activity, but they’re very developer-friendly and able to provide greater volumes of firehose activity than the standard Twitter developer API allows.
So assuming you can get access to a stream of Twitter data on a particular topic – in our case, all mentions of our website, our team’s Twitter handles, retweets of our content etc – the question then becomes one of how to store the data. Looking at the Datasift Sample Output page, each of these streams delivers its payload as JSON documents: nested structures that group categories of tweet metadata within parent elements, together making up the full tweet data and metadata record.
And there’s a good reason for this; individual tweets might not use every bit of possible tweet metadata, for example not including entries under “mentions” or “retweets” if those aren’t used in a particular message. Certain bits of metadata might be repeated an arbitrary number of times – @ mentions, for example – and the JSON document might have a different structure altogether if a different JSON schema version is used for a particular tweet. Altogether not an easy type of data structure for a relational database to hold – though Oracle 12.1.0.2.0 has just introduced native JSON support to the core Oracle database – but NoSQL databases in contrast find these sorts of data structures easy, and one of the most popular for this type of work is MongoDB.
MongoDB is an open-source “document” database that’s probably best known to the Oracle world through this internet cartoon; what the video is getting at is NoSQL advocates recommending databases such as MongoDB for large-scale web work when something much more mainstream like mySQL would do the job better, but where NoSQL and document-style databases come into their own is in storing just this type of semi-structured, schema-on-read dataset. In fact, Datasift support MongoDB as an API end-point for their Twitter feed, so let’s go ahead and set up a MongoDB database, prepare it for the Twitter data, and then set up a Datasift feed into it.
MongoDB installation on Linux, for example to run alongside a Hadoop installation, is pretty straightforward and involves adding a YUM repository and then running “sudo yum install mongodb-org” (there’s an OS X installation too, but I wanted to run this server-side on my Hadoop cluster). Once you’ve installed the MongoDB software, you start the mongod service to enable the server element, and then log into the mongo command-shell to create a new database.
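For reference, the install on a RHEL/CentOS-type node came down to something like the steps below; the repository definition is the one MongoDB published for the 2.6.x releases at the time, so treat it as a point-in-time example and check the current MongoDB install docs rather than taking it verbatim:

sudo tee /etc/yum.repos.d/mongodb.repo <<'EOF'
[mongodb]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
gpgcheck=0
enabled=1
EOF
sudo yum install -y mongodb-org     # server, shell and tools
sudo service mongod start           # start the server element, then connect using the mongo shell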
MongoDB, being a schema-on-read database, doesn’t require you to set up a database schema up-front; instead, the schema comes from the data you load into it, with MongoDB’s equivalent of tables called “collections”, and with those collections made up of documents, analogous to rows in Oracle. Where it gets interesting though is that collections and databases only get created when you first start using them, and individual documents can have slightly, or even completely, different schema structures to each other – which makes them ideal for holding the sorts of datasets generated by Twitter, Facebook and DataSift.
[root@cdh51-node1 ~]# mongo
MongoDB shell version: 2.6.4
connecting to: test
> use datasift2
switched to db datasift2
Let’s create a couple of simple documents, and then add those to a collection. Note that the document becomes available just by declaring it, as does the collection when I add documents to it. Note also that the query language we’re using to work with MongoDB is Javascript, again making it particularly suited to JSON documents, and web-type environments.
> a = { name : "mark" }
{ "name" : "mark" }
> b = { product : "chair", size : "L" }
{ "product" : "chair", "size" : "L" }
> db
datasift2
> db.testData.insert(a)
WriteResult({ "nInserted" : 1 })
> db.testData.insert(b)
WriteResult({ "nInserted" : 1 })
> db.testData.find()
{ "_id" : ObjectId("54094081b5b6021fe9bc8b10"), "name" : "mark" }
{ "_id" : ObjectId("54094088b5b6021fe9bc8b11"), "product" : "chair", "size" : "L" }
And note also how the second entry (document) in the collection has a different schema to the entry above it – perfect for our semi-structured Twitter data, and something we could store as-is in MongoDB in this loose data format and then apply more formal structures and schemas to when we come to access the data – as we’ll do in a moment using Pig, and more formally using ODI and Hive in the next article in this series.
Setting up the Twitter feed from DataSift is a two-stage process, once you’ve got an account with them and an API key; first you define your search terms against a nested document model for the data source, then you activate the feed, in this case into my MongoDB database, and wait for the tweets to roll in. For my feed I selected tweets written by myself and some of the Rittman Mead team, tweets mentioning us, and tweets that included links to our blog in the main tweet contents (there’s also a graphical query designer, but I prefer to write them by hand using what DataSift call their “curated stream definition language”, or CSDL).
You can then preview the feed, live, or go back and sample historic data if you’re interested in loading old tweets, rather than incoming new ones. Once you’re ready you then need to activate the feed, in my case by calling a URL using CURL with a bunch of parameters (our API key and other sensitive data has been masked):
curl -X POST 'https://api.datasift.com/v1/push/create' \
  -d 'name=connectormongodb' \
  -d 'hash=65bd9dc4943ec426b04819exxxxxxxxx' \
  -d 'output_type=mongodb' \
  -d 'output_params.host=rittmandev.com' \
  -d 'output_params.port=27017' \
  -d 'output_params.use_ssl=no' \
  -d 'output_params.verify_ssl=no' \
  -d 'output_params.db_name=datasiftmongodb' \
  -d 'output_params.collection_name=rm_tweets' \
  -H 'Auth: rittmanmead:xxxxxxxxxxxxxxxxxxxxxxxxxxx'
The “hash” in the parameter list is the specific feed to activate, and the output type is MongoDB. The collection name is new, and will be created by MongoDB when the first tweet comes in; let’s run the curl command now and sit back for a while, and wait for some twitter activity to arrive in MongoDB …
… and a couple of hours later, eight tweets have been captured by the DataSift filter, with the last of them being one from Michael Rainey about his trip tonight to the Seahawks game:
> db.rm_tweets.count() 8 > db.rm_tweets.findOne() { "_id" : ObjectId("54089a879ad4ec99158b4d78"), "interactionId" : "1e43454b1a16a880e074e49c51369eac", "subscriptionId" : "f6cf211e03dca5da384786676c31fd3e", "hash" : "65bd9dc4943ec426b04819e6291ef1ce", "hashType" : "stream", "interaction" : { "demographic" : { "gender" : "male" }, "interaction" : { "author" : { "avatar" : "http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg", "id" : 14551637, "language" : "en", "link" : "https://twitter.com/mRainey", "name" : "Michael Rainey", "username" : "mRainey" }, "content" : "Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD", "created_at" : "Thu, 04 Sep 2014 16:58:29 +0000", "hashtags" : [ "GoHawks", "kickoff2014", "GBvsSEA" ], "id" : "1e43454b1a16a880e074e49c51369eac", "link" : "https://twitter.com/mRainey/status/507573423334100992", "mention_ids" : [ 23642374 ], "mentions" : [ "Seahawks" ], "received_at" : 1409849909.2967, "schema" : { "version" : 3 }, "source" : "Instagram", "type" : "twitter" }, "language" : { "tag" : "en", "tag_extended" : "en", "confidence" : 98 }, "links" : { "code" : [ 200 ], "created_at" : [ "Thu, 04 Sep 2014 16:58:29 +0000" ], "meta" : { "charset" : [ "CP1252" ], "lang" : [ "en" ], "opengraph" : [ { "description" : "mrainey's photo on Instagram", "image" : "http://photos-d.ak.instagram.com/hphotos-ak-xfa1/10655141_1470641446544147_1761180844_n.jpg", "site_name" : "Instagram", "type" : "instapp:photo", "url" : "http://instagram.com/p/sh_h6sQBYT/" } ] }, "normalized_url" : [ "http://instagram.com/p/sh_h6sQBYT" ], "title" : [ "Instagram" ], "url" : [ "http://instagram.com/p/sh_h6sQBYT/" ] }, "salience" : { "content" : { "sentiment" : 0, "topics" : [ { "name" : "Video Games", "hits" : 0, "score" : 0.5354745388031, "additional" : "Greyson and I will be ready for the @Seahawks game tonight!" } ] } }, "trends" : { "type" : [ "San Jose", "United States" ], "content" : [ "seahawks" ], "source" : [ "twitter" ] }, "twitter" : { "created_at" : "Thu, 04 Sep 2014 16:58:29 +0000", "display_urls" : [ "instagram.com/p/sh_h6sQBYT/" ], "domains" : [ "instagram.com" ], "filter_level" : "medium", "hashtags" : [ "GoHawks", "kickoff2014", "GBvsSEA" ], "id" : "507573423334100992", "lang" : "en", "links" : [ "http://instagram.com/p/sh_h6sQBYT/" ], "mention_ids" : [ 23642374 ], "mentions" : [ "Seahawks" ], "source" : "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>", "text" : "Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD", "user" : { "created_at" : "Sat, 26 Apr 2008 21:18:01 +0000", "description" : "Data Integration (#ODI #GoldenGate #OBIA) consultant / blogger / speaker @RittmanMead.\nOracle ACE.\n#cycling #Seahawks #travel w/ @XiomaraRainey\n#GoCougs!", "favourites_count" : 746, "followers_count" : 486, "friends_count" : 349, "geo_enabled" : true, "id" : 14551637, "id_str" : "14551637", "lang" : "en", "listed_count" : 28, "location" : "Pasco, WA", "name" : "Michael Rainey", "profile_image_url" : "http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg", "profile_image_url_https" : "https://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg", "screen_name" : "mRainey", "statuses_count" : 8549, "time_zone" : "Pacific Time (US & Canada)", "url" : "http://www.linkedin.com/in/rainey", "utc_offset" : -25200, "verified" : false } } } }
If you’ve not looked at Twitter metadata before, it’s surprising how much metadata accompanies what’s ostensibly a 140-character tweet. As well as details on the author, where the tweet was sent from, what Twitter client sent the tweet and details of the tweet itself, there are details and statistics on the sender, the number of followers they’ve got and where they’re located, a list of all other Twitter users mentioned in the tweet and any URLs and images referenced.
Not every tweet will use every element of metadata, and some tweets will repeat certain attributes – other Twitter users you’ve mentioned in the tweet, for example – as many times as there are mentions. Which makes Twitter data a prime candidate for analysis using Pig and Spark, which easily handle the concept of nested data structures – tuples (ordered lists of data, such as attribute sets for an entity such as “author”) and bags (sets of unordered attributes, such as the list of @ mentions in a tweet).
There’s a MongoDB connector for Hadoop on GitHub which allows MapReduce to connect to MongoDB databases, running MapReduce jobs on MongoDB storage rather than HDFS (or S3, or whatever). This gives us the ability to use languages such as Pig and Hive to filter and aggregate our MongoDB data rather than MongoDB’s JavaScript API, which isn’t as fully-featured and scalable as MapReduce and has limitations in terms of the number of documents you can include in aggregations; let’s start then by connecting Pig to our MongoDB database, and reading in the documents with no Pig schema applied:
grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' using com.mongodb.hadoop.pig.MongoLoader;
2014-09-05 06:40:51,773 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:40:51,838 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> tweets_count = FOREACH (GROUP tweets ALL) GENERATE COUNT (tweets);
2014-09-05 06:41:07,772 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:41:07,817 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> dump tweets_count
...
(9)
grunt>
So there’s nine tweets in the MongoDB database now. Let’s take a look at one of the documents by creating a Pig alias containing just a single record.
grunt> tweets_limit_1 = LIMIT tweets 1; 2014-09-05 06:43:12,351 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode. 2014-09-05 06:43:12,443 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode. grunt> dump tweets_limit_1 ... ([interaction#{trends={source=(twitter), content=(seahawks), type=(San Jose,United States)}, twitter={filter_level=medium, text=Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD, mention_ids=(23642374), domains=(instagram.com), links=(http://instagram.com/p/sh_h6sQBYT/), lang=en, id=507573423334100992, source=<a href="http://instagram.com" rel="nofollow">Instagram</a>, created_at=Thu, 04 Sep 2014 16:58:29 +0000, hashtags=(GoHawks,kickoff2014,GBvsSEA), mentions=(Seahawks), user={profile_image_url_https=https://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, location=Pasco, WA, geo_enabled=true, statuses_count=8549, lang=en, url=http://www.linkedin.com/in/rainey, utc_offset=-25200, id=14551637, time_zone=Pacific Time (US & Canada), favourites_count=746, verified=false, friends_count=349, description=Data Integration (#ODI #GoldenGate #OBIA) consultant / blogger / speaker @RittmanMead. Oracle ACE. #cycling #Seahawks #travel w/ @XiomaraRainey #GoCougs!, name=Michael Rainey, created_at=Sat, 26 Apr 2008 21:18:01 +0000, screen_name=mRainey, id_str=14551637, profile_image_url=http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, followers_count=486, listed_count=28}, display_urls=(instagram.com/p/sh_h6sQBYT/)}, salience={content={topics=([score#0.5354745388031,additional#Greyson and I will be ready for the @Seahawks game tonight!,hits#0,name#Video Games]), sentiment=0}}, links={created_at=(Thu, 04 Sep 2014 16:58:29 +0000), title=(Instagram), code=(200), normalized_url=(http://instagram.com/p/sh_h6sQBYT), url=(http://instagram.com/p/sh_h6sQBYT/), meta={lang=(en), charset=(CP1252), opengraph=([image#http://photos-d.ak.instagram.com/hphotos-ak-xfa1/10655141_1470641446544147_1761180844_n.jpg,type#instapp:photo,site_name#Instagram,url#http://instagram.com/p/sh_h6sQBYT/,description#mrainey's photo on Instagram])}}, interaction={schema={version=3}, id=1e43454b1a16a880e074e49c51369eac, content=Greyson and I will be ready for the @Seahawks game tonight! #GoHawks! #kickoff2014 #GBvsSEA http://t.co/4u16ziBhnD, author={id=14551637, username=mRainey, language=en, avatar=http://pbs.twimg.com/profile_images/476898781821018113/YRkKyGDl_normal.jpeg, name=Michael Rainey, link=https://twitter.com/mRainey}, received_at=1.4098499092967E9, source=Instagram, mention_ids=(23642374), link=https://twitter.com/mRainey/status/507573423334100992, created_at=Thu, 04 Sep 2014 16:58:29 +0000, hashtags=(GoHawks,kickoff2014,GBvsSEA), type=twitter, mentions=(Seahawks)}, language={tag=en, confidence=98, tag_extended=en}, demographic={gender=male}},interactionId#1e43454b1a16a880e074e49c51369eac,_id#54089a879ad4ec99158b4d78,hash#65bd9dc4943ec426b04819e6291ef1ce,subscriptionId#f6cf211e03dca5da384786676c31fd3e,hashType#stream])
And there’s Michael’s tweet again, with all the attributes from the MongoDB JSON document appended together into a single record. But in this format the data isn’t all that useful as we can’t easily access individual elements in the Twitter record; what would be better would be to apply a Pig schema definition to the LOAD statement, using the MongoDB document field listing that we saw when we displayed a single record from the MongoDB collection earlier.
I can start by referencing the document fields that become simple Pig datatypes; ID and interactionId, for example:
grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.my_first_test' using com.mongodb.hadoop.pig.MongoLoader('id:chararray,interactionId:chararray','id');
2014-09-05 06:57:57,985 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
2014-09-05 06:57:58,022 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
grunt> describe tweets
2014-09-05 06:58:11,611 [main] INFO com.mongodb.hadoop.pig.MongoStorage - Initializing MongoLoader in dynamic schema mode.
tweets: {id: chararray,interactionId: chararray}
grunt> tweets_limit_1 = LIMIT tweets 1;
...
(53fae22e9ad4ec93658b513e,1e42c2747542a100e074fff55100414a)
grunt>
Where the MongoDB document has fields nested within other fields, you can reference these as a tuple if they’re a set of attributes under a common header, or a bag if they’re just a list of values for a single attribute; for example, the “username” field is contained within the author tuple, which in-turn is contained within the interaction tuple, so to count tweets by author I’d need to first flatten the author tuple to turn its fields into scalar fields, then project out the username and other details; then I can group the relation in the normal way on those author details, and generate a count of tweets, like this:
grunt> tweets = LOAD 'mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets' using com.mongodb.hadoop.pig.MongoLoader('id:chararray,interactionId:chararray,interaction:tuple(interaction:tuple(author:tuple(id:int,language:chararray,link:chararray,name:chararray,username:chararray)))','id');
grunt> tweets_author_tuple_flattened = FOREACH tweets GENERATE id, FLATTEN(interaction.$0);
grunt> tweets_with_authors = FOREACH tweets_author_tuple_flattened GENERATE id, interaction::author.username, interaction::author.name;
grunt> tweets_author_group = GROUP tweets_with_authors by username;
grunt> tweets_author_count = FOREACH tweets_author_group GENERATE group, COUNT(tweets_with_authors);
...
(rmoff,1)
(dw_pete,1)
(mRainey,3)
(P_J_FLYNN,3)
(davidhuey,7)
(EdelweissK,1)
(JamesOickle,3)
(markrittman,3)
(rittmanmead,2)
(RedgraveChris,1)
grunt>
So there’s obviously a lot more we can do with the Twitter dataset as it stands, but where it’ll get really interesting is combining this with other social media interaction data – for example from Facebook, LinkedIn and so on – and then correlating that with our main site activity data. Check back in a few days when we’ll be covering this second stage in a further blog article, using ODI12c to orchestrate the process.
Upcoming Big Data and Hadoop for Oracle BI, DW and DI Developers Presentations
If you’ve been following our postings on the blog over the past year, you’ll probably have seen quite a lot of activity around big data and Hadoop and in particular, what these technologies bring to the world of Oracle Business Intelligence, Oracle Data Warehousing and Oracle Data Integration. For anyone who’s not had a chance to read the posts and articles, the three links below are a great introduction to what we’ve been up to:
- A list of some of our “getting started with Hadoop” articles
- Rittman Mead Announce New Partnerships with Cloudera and Amazon Web Services
- Rittman Mead and Oracle Big Data Appliance
- Presentation at the recent Oracle Virtual Technology Summit on ODI12c for Hadoop / Big Data Integration
In addition, we recently took part in an OTN ArchBeat podcast with Stewart Bryson and Andrew Bond on the updated Oracle Information Management Reference Architecture we co-developed with Oracle’s Enterprise Architecture team, where you can hear me talk with Stewart and Andrew about how the updated architecture came about, the thinking behind it, and how concepts like the data reservoir and data factory can be delivered in an agile way.
- Podcast Show Notes: Redefining Information Management Architecture (including links to the three parts in the podcast)
- Introducing the Updated Oracle / Rittman Mead Information Management Reference Architecture : Part 1: Information Architecture and the Data Factory
- Introducing the Updated Oracle / Rittman Mead Information Management Reference Architecture : Part 2: Delivering the Data Factory
I’m also pleased to be delivering a number of presentations and seminars over the next few months, on Oracle and Cloudera’s Hadoop technology and how it applies to Oracle BI, DW and DI developers – if you’re part of a local Oracle user group and you’d like me to deliver one of them for your group, drop me an email at mark.rittman@rittmanmead.com.
Slovenian Oracle User Group / Croatian Oracle User Group Conferences, October 2014
These two events run over consecutive days in Slovenia and Croatia, and I’m delivering the keynote at each on Analytics and Big Data, and a one-day seminar running on the Tuesday in Slovenia, and over the Wednesday and Thursday in Croatia. The theme of the seminar is around applying Hadoop and big data technologies to Oracle BI, DW and data integration, and is made up of four sessions:
Part 1 : Introduction to Hadoop and Big Data Technologies for Oracle BI & DW Developers
“In this session we’ll introduce some key Hadoop concepts including HDFS, MapReduce, Hive and NoSQL/HBase, with the focus on Oracle Big Data Appliance and Cloudera Distribution including Hadoop. We’ll explain how data is stored on a Hadoop system and the high-level ways it is accessed and analysed, and outline Oracle’s products in this area including the Big Data Connectors, Oracle Big Data SQL, and Oracle Business Intelligence (OBI) and Oracle Data Integrator (ODI).”
Part 2 : Hadoop and NoSQL Data Ingestion using Oracle Data Integrator 12c and Hadoop Technologies
“There are many ways to ingest (load) data into a Hadoop cluster, from file copying using the Hadoop Filesystem (FS) shell through to real-time streaming using technologies such as Flume and Hadoop streaming. In this session we’ll take a high-level look at the data ingestion options for Hadoop, and then show how Oracle Data Integrator and Oracle GoldenGate leverage these technologies to load and process data within your Hadoop cluster. We’ll also consider the updated Oracle Information Management Reference Architecture and look at the best places to land and process your enterprise data, using Hadoop’s schema-on-read approach to hold low-value, low-density raw data, and then use the concept of a “data factory” to load and process your data into more traditional Oracle relational storage, where we hold high-density, high-value data.”
Part 3 : Big Data Analysis using Hive, Pig, Spark and Oracle R Enterprise / Oracle R Advanced Analytics for Hadoop
“Data within a Hadoop cluster is typically analysed and processed using technologies such as Pig, Hive and Spark before being made available for wider use using products like Oracle Big Data SQL and Oracle Business Intelligence. In this session, we’ll introduce Pig and Hive as key analysis tools for working with Hadoop data using MapReduce, and then move on to Spark as the next-generation analysis platform typically being used on Hadoop clusters today. We’ll also look at the role of Oracle’s R technologies in this scenario, using Oracle R Enterprise and Oracle R Advanced Analytics for Hadoop to analyse and understand larger datasets than we could normally accommodate with desktop analysis environments.”
Part 4 : Visualizing Hadoop Datasets using Oracle Business Intelligence, Oracle BI Publisher and Oracle Endeca Information Discovery
“Once insights and analysis have been produced within your Hadoop cluster by analysts and technical staff, it’s usually the case that you want to share the output with a wider audience in the organisation. Oracle Business Intelligence has connectivity to Hadoop through Apache Hive compatibility, and other Oracle tools such as Oracle BI Publisher and Oracle Endeca Information Discovery can be used to visualise and publish Hadoop data. In this final session we’ll look at what’s involved in connecting these tools to your Hadoop environment, and also consider where data is optimally located when large amounts of Hadoop data need to be analysed alongside more traditional data warehouse datasets.”
Oracle Openworld 2014 (ODTUG Sunday Symposium), September 2014
Along with another session later in the week on the upcoming Oracle BI Cloud Services, I’m doing a session on the User Group Sunday for ODTUG on ODI12c and the Big Data Connectors for ETL on Hadoop:
Deep Dive into Big Data ETL with Oracle Data Integrator 12c and Oracle Big Data Connectors [UGF9481]
“Much of the time required to work with big data sources is spent in the data acquisition, preparation, and transformation stages of a project before your data reaches a state suitable for analysis by your users. Oracle Data Integrator, together with Oracle Big Data Connectors, provides a means to efficiently load and unload data to and from Oracle Database into a Hadoop cluster and perform transformations on the data, either in raw form or in technologies such as Apache Hive or R. This presentation looks at how Oracle Data Integrator can form the centerpiece of your big data ETL strategy, within either a custom-built big data environment or one based on Oracle Big Data Appliance.”
UK Oracle User Group Tech’14 Conference, December 2014
I’m delivering an extended version of my OOW presentation at the UKOUG Tech’14 “Super Sunday” event, this time over 90 minutes rather than the 45 at OOW, giving me a bit more time for demos and discussion:
Deep-Dive into Big Data ETL using ODI12c and Oracle Big Data Connectors
“Much of the time required to work with Big Data sources is spent in the data acquisition, preparation and transformation stages of a project, before your data is in a state suitable for analysis by your users. Oracle Data Integrator, together with Oracle Big Data Connectors, provides a means to efficiently load and unload data from Oracle Database into a Hadoop cluster, and perform transformations on the data either in raw form or in technologies such as Apache Hive or R. In this presentation, we will look at how ODI can form the centrepiece of your Big Data ETL strategy, either within a custom-built Big Data environment or one based on Oracle Big Data Appliance.”
Oracle DW Global Leaders’ Meeting, Dubai, December 2014
The Oracle DW Global Leaders forum is an invite-only group organised by Oracle and attended by select customers and associate partners, one of which is Rittman Mead. I’ll be delivering the technical seminar at the end of the second day, which will run over two sessions and will be based on the main points from the one-day seminars I’m running in Croatia and Slovenia.
From Hadoop to dashboards, via ODI and the BDA – the complete trail : Part 1 and Part 2
“Join Rittman Mead for this afternoon workshop, taking you through data acquisition and transformation in Hadoop using ODI, Cloudera CDH and Oracle Big Data Appliance, through to reporting on that data using OBIEE, Endeca and Oracle Big Data SQL. Hear our project experiences, and tips and techniques based on real-world implementations”
Keep an eye out for more Hadoop and big data content over the next few weeks, including a look at MongoDB and NoSQL-type databases, and how they can be used alongside Oracle BI, DW and data integration tools.
Rittman Mead and Oracle Big Data Appliance
Over the past couple of years Rittman Mead have been broadening our skills and competencies out from core OBIEE, ODI and Oracle data warehousing into the new “emerging” analytic platforms: R and database advanced analytics, Hadoop, cloud and clustered/distributed systems. As we talked about in the recent series of updated Oracle Information Management Reference Architecture blog posts and my initial look at the Oracle Big Data SQL product, our customers are increasingly looking to complement their core Oracle analytics platform with ones to handle unstructured and big data, and as technologists we’re always interested in what else we can use to help our customers get more insight out of their (total) dataset.
An area we’ve particularly focused on over the past year has been Hadoop and R analysis, with the recent announcement of our partnering with Cloudera and the recruitment of a big data and advanced analytics team operating out of our Brighton, UK office. We’ve also started to work on a number of projects and proofs of concept with customers in the UK and Europe, working mainly with core Oracle BI, DW and ETL customers looking to make their first move into Hadoop and big data. The usual pattern of engagement is for us to engage with some business users looking to analyse a dataset hitherto too large or too unstructured to load into their Oracle data warehouse, or where they recognise the need for more advanced analytics tools such as R, MapReduce and Spark but need some help getting started. Most often we put together a PoC Hadoop cluster for them using virtualization technology on existing hardware they own, allowing them to get started quickly and with no initial licensing outlay, with our preferred Hadoop distribution being Cloudera CDH, the same Hadoop distribution that comes on the Oracle Big Data Appliance. Projects then typically move on to Hadoop running directly on physical hardware, in a couple of cases Oracle’s Big Data Appliance, usually in conjunction with Oracle Database, Oracle Exadata and Oracle Exalytics for reporting.
One such project started off with the customer wanting to analyse a dataset that was too large for the space available in their Oracle database and that they couldn’t easily process or analyse using the SQL-based tools they usually used; in addition, like most large organisations, database and hardware provisioning took a long time and they needed to get the project moving quickly. We came in and quickly put together a virtualised Hadoop cluster for them, on re-purposed hardware and using the free (Standard) edition of Cloudera CDH4, and then used the trial version of Oracle Big Data Connectors along with SFTP transfers to get data into the cluster ready for analysis.
The PoC itself then ran for just over a month with the bulk of the analysis being done using Oracle R Advanced Analytics for Hadoop, an extension to R that allows you to use Hive tables as a data source and create MapReduce jobs from within R itself; the output from the exercise was a series of specific-answer-to-specific-question R graphs that solved an immediate problem for the client, and showed the value of further investment in the technology and our services – the screenshot below shows a typical ORAAH session, in this case analyzing the flight delays dataset that you can also find on the Exalytics server and in smaller form in OBIEE 11g’s SampleApp dataset.
That project has now moved onto a larger phase of work with Oracle Big Data Appliance used as the Hadoop platform rather than VMs, and Cloudera Hadoop upgraded from the free, unsupported Standard version to Cloudera Enterprise. The VMs in fact worked pretty well and had the advantage that they could be quickly spun-up and housed temporarily on an existing server, but were restricted by the RAM that we could assign to each VM – 2GB initially, quickly upgraded to 8GB per VM – and the fact that they were sharing CPU and IO resources. Big Data Appliance, by contrast, has 64GB of RAM per node – something that’s increasingly important now that in-memory tools like Impala are being used – and has InfiniBand networking between the nodes as well as fast network connections out to the wider network, something that’s often overlooked when speccing up a Hadoop system.
The support setup for the BDA is pretty good as well; from a sysadmin perspective there’s a lights-out ILOM console for low-level administration, as well as plugins for Oracle Enterprise Manager 12c (screenshot below), and Oracle support the whole package, typically handling the hardware support themselves and delegating to Cloudera for more Hadoop-specific queries. I’ve raised several SRs on client support contracts since starting work on BDAs, and I’ve not had any problem with questions not being answered or buck-passing between Oracle and Cloudera.
One thing that’s been interesting is the amount of actual work that you need to do with the Big Data Appliance beyond the actual installation and initial configuration by Oracle to “on-board” it into the typical enterprise environment. BDAs are left with customers in a fully-working state, but like Exalytics and Exadata, the initial install and configuration is just the start, and you’ve then got to integrate the platform in with your corporate systems and get developers on-boarded onto the platform. Tasks we’ve typically provided assistance with on projects like these include:
- Configuring Cloudera Manager and Hue to connect to the corporate LDAP directory, and working with their security team to create LDAP groups for developer and administrative access that we then used to restrict and control access to these tools
- Configuring other tools such as RStudio Server so that developers can be more productive on the platform
- Putting in place an HDFS directory structure to support incoming data loads and data archiving, as well as directories to hold the output datasets from the analysis work we’re doing – all within the POSIX security setup that HDFS currently uses, which limits us to just granting owner, group and world permissions on directories (there’s a sketch of this just after the list)
- Working with the client’s infrastructure team on things like alerting, troubleshooting and setting up backup and recovery – something that’s surprisingly tricky in the Hadoop world as Cloudera’s backup tools only backup from Hadoop-to-Hadoop, and by definition your Hadoop system is going to hold a lot of data, the volume of which your current backup tools aren’t going to easily handle
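As an example of that third point, the HDFS layout and its POSIX-style permissions are typically put in place with the standard Hadoop FS shell commands; the directory, user and group names below are of course just placeholders:

hadoop fs -mkdir -p /data/incoming /data/archive /data/analysis_output
hadoop fs -chown etl_svc:dataload /data/incoming /data/archive
hadoop fs -chown analyst1:analysts /data/analysis_output
hadoop fs -chmod 750 /data/incoming /data/archive      # owner full access, group read/traverse, others nothing
hadoop fs -chmod 770 /data/analysis_output             # the analysts group can write its outputs here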
Once things are set up though you’ve got a pretty comprehensive platform that can be expanded up from the initial six nodes our customers’ systems typically start with to the full eighteen node cluster, and can use tools such as ODI to do data loading and movement, Spark and MapReduce to process and analyse data, and Hive, Impala and Pig to provide end-user access. The diagram below shows a typical future-state architecture we propose for clients on this initial BDA “starter config” where we’ve moved up to CDH5.x, with Spark and YARN generally used as the processing framework and with additional products such as MongoDB used for document-type storage and analysis:
Something that’s turned out to be more of an issue on projects than I’d originally anticipated is complying with corporate security policies. By definition, most customers who buy an Oracle Big Data Appliance are going to be large customers with an existing Oracle database estate, and if they deal with the public they’re going to have pretty strict security and privacy rules you’ll need to adhere to. Something that’s surprising therefore to most customers new to Hadoop is how insecure, or at least easily compromised, the average Hadoop cluster is, with Hadoop FS shell security relying on trusted networks and incoming user connections, and interfaces such as ODBC not checking passwords at all.
Hadoop and the BDA only become what’s termed “secure” when you link them to a Kerberos server, but not every customer has Kerberos set up, and unless you enable this feature right at the start when you set up the BDA, it’s a fairly involved task to add retrospectively. Moreover, customers are used to fine-grained access control to their data, a single security model over their data and a good understanding in their heads as to how security works on their database, whereas Hadoop is still a collection of fairly loosely-coupled components with pretty primitive access controls, and no easy way to delete or redact data, for example, when a particular country’s privacy laws in theory mandate this.
Like everything there’s a solution if you’re creative enough, with tools such as Apache Sentry providing role-based access control over Hive and Impala tables, alternative storage tools like HBase that permit read, write, update and delete operations on data rather than just HDFS’s insert and (table or partition-level) delete, and tools like Cloudera Navigator and BDA features like Oracle Audit Vault that provide administrators with some sort of oversight as to who’s accessing what data and when. As I mentioned in my blog post a couple of weeks ago, Oracle’s Big Data SQL product addresses this requirement pretty well, potentially allowing us to apply Oracle security over both relational, and Hadoop, datasets, but for now we’re working within current CDH4 capabilities and planning on introducing Apache Sentry for role-based access control to Hive and Impala in the coming weeks. We’re also looking at implementing Cloudera’s “secure gateway” cluster topology with all access restricted to just a single gateway Hadoop node, and the cluster itself firewalled-off with external access to just that gateway node and HTTP / REST API access to the various cluster services, for example as shown in the diagram below:
My main focus on Hadoop projects has been on the overall Hadoop system architecture, and interacting with the client’s infrastructure and security teams to help them adopt the BDA and take over its maintenance. From the analysis side, it’s been equally interesting, with a number of projects using tools such as R, Oracle R Advanced Analytics for Hadoop and core Hive/MapReduce for data analysis, Flume, Java and Python for data ingestion and processing, and most recently OBIEE11g for publishing the results out to a wider audience. Following the development model that we outlined in the second post in our updated Information Management Reference Architecture blog series, we typically split delivery of each project’s output into two distinct phases: a discovery phase, typically done using RStudio and Oracle R Advanced Analytics for Hadoop, where we explore and start understanding the dataset, presenting initial findings to the business and using their feedback and direction to inform the second phase; and a commercial exploitation phase where we use the discovery phase’s outputs and models to drive a more structured dimensional model, with output being in the form of OBIEE analyses and dashboards.
We looked at several options for providing the datasets for OBIEE to query, with our initial idea being to connect OBIEE directly to Hive and Impala and let the users query the data in-place, directly on the Hadoop cluster, with an architecture like the one in the diagram below:
In fact this turned out to not be possible, as whilst OBIEE 11.1.1.7 can access Apache Hive datasources, it currently only ships with HiveServer1 ODBC support, and no support for Cloudera Impala, which means we need to wait for a subsequent release of OBIEE11g to be able to report against the ODBC interfaces provided by CDH4 and CDH5 on the BDA (although ironically, you can get HiveServer2 and Impala working on OBIEE 11.1.1.7 on Windows, though this platform isn’t officially supported by Oracle for Hadoop access, only Linux). Whichever way though, it soon became apparent that even if we could get Hive and Impala access working, in reality it made more sense to use Hadoop as the data ingestion and processing platform – providing access to data analysts at this point if they wanted access to the raw datasets – but with the output of this then being loaded into an Oracle Exadata database, either via Sqoop or via Oracle Loader for Hadoop and ideally orchestrated by Oracle Data Integrator 12c, and users then querying these Oracle tables rather than the Hive and Impala ones on the BDA, as shown in the diagram below.
In practice, Oracle SQL is far more complete and expressive than HiveQL and Impala SQL and it makes more sense to use Oracle as the query platform for the vast majority of users, with data analysts and data scientists still able to access the raw data on Hadoop using tools like Hive, R and (when we move to CDH5) Spark.
The final thing that’s been interesting about working on Hadoop and Big Data Appliance projects is that 80% of it, in my opinion, is just the same as working on large enterprise data warehouse projects, with 20% being “the magic”. A large portion of your time is spent on analysing and setting up feeds into the system, just in this case you use tools like Flume instead of GoldenGate (though GoldenGate can also load into HDFS and Hive, something that’s useful for transactional database data sources vs. Flume’s focus on file and server log data sources). Another big part of the work is data processing, ingestion, reformatting and combining, again skills an ETL developer would have (though there’s much more reliance, at this point, on command-line tools and Unix utilities, albeit with a place for tools like ODI once you get to the set-based filtering, joining and aggregating phase). In most cases, the output of your analysis and processing will be Hive and Impala tables so that results can be analysed using tools such as OBIEE, and you therefore need skills in areas such as dimensional modelling, business analysis and dashboard prototyping as well as tool-specific skills such as OBIEE RPD development.
Where the “magic” happens, of course, is the data preparation and analysis that you do once the data is loaded, quite intensively and interactively in the discovery phase and then in the form of MapReduce and Spark jobs, Sqoop loads and Oozie workflows once you know what you’re after and need to process the data into something more tabular for tools like OBIEE to access. We’re building up a team competent in techniques such as large-scale data analysis, data visualisation, statistical analysis, text classification and sentiment analysis, and use of NoSQL and JSON-type data sources, which combined with our core BI, DW and ETL teams allows us to cover the project from end-to-end. It’s still relatively early days but we’re encouraged by the response from our project customers so far, and – to be honest – the quality of the Oracle big data products and the Cloudera platform they’re based around – and we’re looking forward to helping other Oracle customers get the most out of their adoption of these new technologies.
If you’re an Oracle customer looking to make their first move into the worlds of Hadoop, big data and advanced analytics techniques, feel free to drop me an email at mark.rittman@rittmanmead.com for some initial advice and guidance – the fact we come from an Oracle-centric background as well typically makes it easier for us to relate these new concepts to the ones you’re typically more familiar with. Similarly, if you’re about to bring on-board an Oracle Big Data Appliance system and want to know how best to integrate it in with your existing Oracle BI, DW, data integration and systems management estate, get in contact and I’d be happy to share experiences and our delivery approach.
Why Oracle Big Data SQL Potentially Solves a Big Issue with Hadoop Security
Oracle announced their Big Data SQL product a couple of weeks ago, which effectively extends Exadata’s query-offloading to Hadoop data sources. I covered the launch a few days afterwards, focusing on how it implements Exadata’s SmartScan on Hive and NoSQL data sources and provides a single metadata catalog over both relational, and Hadoop, data sources. In a Twitter conversation later in the day though, I made the comment that in my opinion, the biggest benefit of Big Data SQL will be in its ability to extend Oracle’s security model to Hadoop data sources, because Hadoop security is still a bit of a mess:
“To me the greatest benefit of Big Data SQL is the single security model; even with Sentry, Hadoop security is fragmented and a mess (IMO)” — Mark Rittman (@markrittman) July 17, 2014
I’ve been working on an Oracle Big Data Appliance project over the past few weeks, as the technical architect and initial sysadmin for the cluster, and it’s given me a first-hand experience of what security’s like on a Hadoop cluster. Over the past few weeks I’ve had to come up with a security policy covering HDFS, Hive and the Cloudera management tools (Cloudera Manager, Hue etc), and try and implement an access and authorisation approach that ensures only designated people can log in, and when they’re in, they can only see the data they’re supposed to see. Hadoop at this point, to my mind, suffers from a couple of major issues when it comes to security:
- It’s fragmented, in that each tool or Hadoop product tends to have its own security setup, and the documentation is all split up, rapidly goes out of date, and is more of a reference than a tutorial (Cloudera’s Security documentation is one of the better examples, but it still splits the key information you need over several sections and several other docs)
- It’s full of holes, such that the default security setup is considered insecure in terms of users being able to spoof their details, and making it more secure is again an exercise in hunting through docs, with some pretty complex configuration steps you need to perform (for example, configuring Kerberos authentication, a prerequisite for things like Apache Sentry)
If we take a typical security policy that a large enterprise customer’s going to want to put in place, it’ll look something like this:
- Users should only be able to log in via their corporate LDAP account, and we’ll want that login process to be secure so it can’t easily be bypassed
- We want to be able to secure our datasets, so that only authorised users can view particular datasets, and there’s likely to be some groups we grant read-only access to, and others we grant read-write
- The data loading processes for the Hadoop cluster need to be locked-down so they can’t overwrite the datasets of other applications
- Our security policy ideally needs to sync-up, or be an extension of, our existing enterprise security policy, not something we maintain separately
- We need to be able to audit and review who’s actually accessing what dataset, to ensure that these policies are being followed and enforced
- We also need the ability to obfuscate or depersonalise data before it gets into the cluster, and also have the option of encrypting the data at-rest as well as on-the-wire
Back in the early days of Hadoop these types of security policy weren’t often needed, as the users of the Hadoop cluster were typically a small set of data scientists or analysts who’d been cleared already to view and work with the data in the cluster (or more likely, they did it and just didn’t tell anyone). But as we move to enterprise information management architectures such as the one outlined in my two-part blog post series a few weeks ago (pt.1, pt.2), the users of Hadoop and other “data reservoir” data sources are likely to increase significantly in number as data from these systems becomes just another part of the general enterprise data set.
But in practice, this is hard to do. Let’s start with HDFS first, the Hadoop Distributed File System on which most Hadoop data is stored. HDFS aims to look as similar to a Linux or Unix-type filesystem as possible, with similar commands (mkdir, ls, chmod etc) and the same POSIX permissions model, where files and directories are associated with an owner and a group and where permissions are set for that owner, the group and all others. For example, in the HDFS file listing below, the “/user/cust_segment_analysis” directory is owned by the user “mrittman” and the group “marketing”, with the directory owner having full read, write and subdirectory traversal access to the directory, the group having read-only and subdirectory traversal access, and all others having no access at all.
[root@bdanode1 ~]# hadoop fs -ls /user
Found 13 items
drwxrwxrwx   - admin    admin           0 2014-06-02 16:06 /user/admin
drwxr-x---   - mrittman marketing       0 2014-07-26 21:31 /user/cust_segment_analysis
drwxr-xr-x   - hdfs     supergroup      0 2014-05-27 13:19 /user/hdfs
drwxrwxrwx   - mapred   hadoop          0 2014-05-25 20:47 /user/history
drwxrwxr-t   - hive     hive            0 2014-06-04 16:31 /user/hive
drwxr-xr-x   - hue      hue             0 2014-05-31 18:51 /user/hue
drwxrwxr-x   - impala   impala          0 2014-05-25 20:54 /user/impala
drwxrwxr-x   - oozie    oozie           0 2014-05-25 20:52 /user/oozie
drwxrwxrwx   - oracle   oracle          0 2014-06-09 21:38 /user/oracle
drwxr-xr-x   - root     root            0 2014-06-06 16:25 /user/root
drwxr-xr-x   - sample   sample          0 2014-05-31 18:51 /user/sample
drwxr-x--x   - spark    spark           0 2014-05-25 20:45 /user/spark
drwxrwxr-x   - sqoop2   sqoop           0 2014-05-25 20:53 /user/sqoop2
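For illustration (this isn’t taken from the actual BDA build, and the user and group names are just the ones from the listing above), a directory locked down like /user/cust_segment_analysis could be created and permissioned with standard HDFS commands along these lines:

# create the directory as the hdfs superuser, then hand it over to the owning user and group
sudo -u hdfs hadoop fs -mkdir /user/cust_segment_analysis
sudo -u hdfs hadoop fs -chown mrittman:marketing /user/cust_segment_analysis
# 750 = rwx for the owner, r-x for the marketing group, no access for anyone else,
# which is the drwxr-x--- pattern shown in the listing
sudo -u hdfs hadoop fs -chmod 750 /user/cust_segment_analysis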
Which all sounds great until you then have another group that needs read-write access to the directory, but you’re limited to just one group permission setting for the directory, which you’ve already used to set up read-only access for that first group. If you therefore need different security access for different groups, you typically end up creating multiple HDFS directories and multiple copies of the dataset in question, assigning each copy to a different group, which isn’t all that convenient and gives you other problems in terms of maintenance and keeping it all in sync.
What you of course need is something like the “access control lists” (ACLs) you get with operating systems like Windows NT and Mac OS X, where you can define an arbitrary number of user groups and assign each of them its own permission set on the directory and the files it contains. The most recent versions of Hadoop actually implement a form of ACL for HDFS, and the feature made its way into the recently-released Cloudera CDH5.1, but these ACLs are an addition to the standard POSIX owner, group, others model and aren’t recommended for every file in your HDFS filesystem; according to the Hadoop docs, “Best practice is to rely on traditional permission bits to implement most permission requirements, and define a smaller number of ACLs to augment the permission bits with a few exceptional rules. A file with an ACL incurs an additional cost in memory in the NameNode compared to a file that has only permission bits.” Still, it’s better than not having them at all, and I’d imagine using this feature for the particular directories and sets of files that need more than one set of group permissions configured for them.
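As a minimal sketch of how that might look on CDH5.1 or another Hadoop 2.4+ distribution (the second group name, “sales”, is just an example, and the dfs.namenode.acls.enabled property has to be set to true on the NameNode before any of this works), you’d add a named-group ACL entry alongside the existing permission bits and then check the result:

# grant the sales group read and traverse access, in addition to the existing POSIX permissions
hdfs dfs -setfacl -m group:sales:r-x /user/cust_segment_analysis
# make new files and subdirectories created under it inherit the same ACL entry
hdfs dfs -setfacl -m default:group:sales:r-x /user/cust_segment_analysis
# review the combined permission bits and ACL entries
hdfs dfs -getfacl /user/cust_segment_analysis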
In most cases though, the way you’ll present data out to non-technical end-users and applications is through Hive and Impala tables, or through tools like Pig and Spark. Under the covers, these tools still use HDFS permissions to control access to the data within Hive and Impala tables, but again by default you’re limited to granting access to whole HDFS directories, or the files contained within those directories. Something that addresses this issue is a product called Apache Sentry, an open-source project within the Hadoop family that enables role-based access control for Hive and Impala tables. Oracle are one of the co-founders of the Sentry project and include it in the base software on the Big Data Appliance, and using Sentry you can grant SELECT, INSERT or ALL privileges to a group on a particular Hive or Impala table, rather than on the underlying HDFS directories and files. A form of fine-grained access control can be set up using Sentry by creating views with particular row-level security settings, giving you the basics of a database-like security policy that you can apply over the main way that users access data in the cluster.
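As a rough sketch of the Sentry model (the role, view, column and group names below are made up for illustration), you define a role, grant it SELECT on a Hive table or on a row-restricted view, and then map the role onto a group coming from your LDAP or OS directory:

-- run as a user with Sentry admin privileges, in Hive or Impala
CREATE ROLE marketing_analyst;
-- restrict the role to a row-limited view rather than the underlying table
CREATE VIEW cust_segment_analysis_uk AS
  SELECT * FROM cust_segment_analysis WHERE country = 'UK';
GRANT SELECT ON TABLE cust_segment_analysis_uk TO ROLE marketing_analyst;
-- map the role onto the group the analysts belong to
GRANT ROLE marketing_analyst TO GROUP marketing;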
But Sentry itself has a few significant prerequisites: you have to enable Kerberos authentication on your cluster (which you should do anyway because of the risk of account spoofing, but it’s still a significant thing to set up), and you need to link Hive and Impala to your corporate LDAP server and configure them to work in the way that Sentry requires. Most importantly though, you’re still left with two separate security setups: one for your corporate data warehouse and relational data sources, and another for data accessed on Hadoop. With all the disparate and partially-complete open-source products involved, it’s still hard to be sure that data in your Hadoop cluster is really secure (though products like Cloudera Navigator aim to provide some form of data governance and auditing over these datasets); and there’s still no straightforward way to remove individual customers’ data from the Hadoop dataset (“data redaction”), no easy way to obfuscate or mask data, and no easy way (apart from the Hive views mentioned before) to restrict users to accessing only certain columns in a Hive or Impala table.
And so this is where Oracle’s Big Data SQL product could be very interesting. Big Data SQL takes the Exadata model of moving as much filtering and column-projection as it can to the storage server, adding Oracle SmartScan functionality to the Hadoop node and allowing it to understand the full Oracle SQL dialect (and PL/SQL security functions), rather than just the subset of SQL provided by HiveQL and Impala SQL.
More importantly, it’ll enable a single unified data dictionary over both Oracle and Hadoop data sources, presenting Hive tables and NoSQL data as regular Oracle tables and allowing the DBA to create data security, redaction and row-level filtering policies over both relational and Hadoop data, potentially giving you the ability to define a single security policy across all data in your overall information management architecture.
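As a purely illustrative sketch (the product wasn’t available to test at the time of writing, so the access driver name, access parameters and object names below are assumptions rather than confirmed syntax), the idea is that a Hive table gets presented as an Oracle external table, after which standard 12c features such as Data Redaction can be applied to it just like any other table:

-- hypothetical Oracle external table over a Hive table called cust_segment_analysis
CREATE TABLE cust_segment_analysis_ext (
  cust_id   NUMBER,
  segment   VARCHAR2(50),
  email     VARCHAR2(200)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.cust_segment_analysis)
)
REJECT LIMIT UNLIMITED;

-- a standard 12c Data Redaction policy then hides the email column from querying users
BEGIN
  DBMS_REDACT.ADD_POLICY(
    object_schema => 'BDA_DEMO',
    object_name   => 'CUST_SEGMENT_ANALYSIS_EXT',
    column_name   => 'EMAIL',
    policy_name   => 'redact_cust_email',
    function_type => DBMS_REDACT.FULL,
    expression    => '1=1');
END;
/

If it does work like that in practice, the same redaction, VPD and auditing policies you already use for relational data would then apply to queries against the Hadoop-resident data too, which is exactly the single security model I was after.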
So I think this is actually a “big deal”, and potentially even more game-changing than the SmartScan functionality that got most of the attention at the Big Data SQL product launch. How well it’ll work in practice, and how much will be enabled on day one, is hard to say, but this feature meets a real need our customers have right now, so I’ll be very interested to try it out when the product becomes available (presumably) later in the year.