Announcing the Special Guest Speakers for Brighton & Atlanta BI Forum 2015
As well as a great line-up of speakers and sessions at each of the Brighton & Atlanta Rittman Mead BI Forum 2015 events in May, I’m very pleased to announce our two guest speakers who’ll give the second keynotes, on the Thursday evening of the two events just before we leave for the restaurant and the appreciation events. This year our special guest speaker in Atlanta is John Foreman, Chief Data Scientist at MailChimp and author of the book “Data Smart: Using Data Science to Transform Information into Insight”; and in Brighton we’re delighted to have Reiner Zimmerman, Senior Director of Product Management at Oracle US and the person behind the Oracle DW & Big Data Global Leaders program.
I first came across John Foreman when somebody recommended his book “Data Smart” to me a year or so ago. At that time Rittman Mead were getting more and more requests from our customers asking us to help with their advanced analytics and predictive modelling needs, and I was looking around for resources to help myself and the team get to grips with some of the more advanced modelling and statistical techniques Oracle’s tools now support – techniques such as clustering and pattern matching, linear regression and genetic algorithms.
One of the challenges when learning these sorts of techniques is not getting too caught up in the tools and technology – R was our favoured technology at the time, and there’s lots to it – so John’s book was particularly well-timed as it goes through these types of “data science” techniques but focuses on Microsoft Excel as the analysis tool, with simple examples and a very readable style.
Back in his day job, John is Chief Data Scientist at MailChimp and has become a particularly in-demand speaker following the success of his book, and I was very excited to hear from Charles Elliott, our Practice Manager for Rittman Mead America, that he lived near John in Atlanta and had arranged for him to keynote at our Atlanta BI Forum event. His keynote will be entitled “How Mailchimp used qualitative and quantitative analysis to build their next product” and we’re very much looking forward to meeting him at our event in Atlanta on May 13th-15th 2015.
Our second keynote speaker at the Brighton Rittman Mead BI Forum 2015 event is none other than Reiner Zimmerman, best known in EMEA for organising the Oracle DW Global Leaders Program. We’ve known Reiner for several years now as Rittman Mead are one of the associate sponsors for the program, which aims to bring together the leading organizations building data warehouse and big data systems on the Oracle Engineered Systems platform.
A bit like the BI Forum (but even more exclusive), the DW Global Leaders program holds meetings in the US, EMEA and AsiaPac over the year and is a fantastic networking and knowledge-sharing group for an exclusive set of customers putting together the most cutting-edge DW and big data systems on the latest Oracle technology. Reiner’s also an excellent speaker and a past visitor to the BI Forum, and his session entitled “Hadoop and Oracle BDA customer cases from around the world” will be a look at what customers are really doing, and the value they’re getting, from building big data systems on the Oracle platform.
Registration is now open for both the Brighton and Atlanta BI Forum 2015 events, with full details including the speaker line-up and how to register on the event website. Keep an eye on the blog for more details of both events later this week including more on the masterclass by myself and Jordan Meyer, and a data visualisation “bake-off” we’re going to run on the second day of each event. Watch this space…!
Rittman Mead BI Forum 2015 Now Open for Registration!
I’m very pleased to announce that the Rittman Mead BI Forum 2015, running in Brighton and Atlanta in May 2015, is now open for registration.
Back for its seventh successful year, the Rittman Mead BI Forum once again will be showcasing the best speakers and presentations on topics around Oracle Business Intelligence and data warehousing, with two events running in Brighton, UK and Atlanta, USA in May 2015. The Rittman Mead BI Forum is different to other Oracle tech events in that we keep the numbers attending limited, topics are all at the intermediate-to-expert level, and we concentrate on just one topic – Oracle Business Intelligence Enterprise Edition, and the technologies and products that support it.
As in previous years, the BI Forum will run on two consecutive weeks, starting in Brighton and then moving over to Atlanta for the following week. Here are the dates and venue locations:
- Rittman Mead BI Forum 2015 – Hotel Seattle, Brighton, May 6th – 8th 2015
- Rittman Mead BI Forum 2015 – Renaissance Atlanta Midtown Hotel, May 13th – 15th 2015
This year our optional one-day masterclass will be delivered by Jordan Meyer, our Head of R&D, and myself, and will be on the topic of “Delivering the Oracle Big Data and Information Management Reference Architecture” that we launched last year at our Brighton event. Details of the masterclass, and the speaker and session line-up at the two events, are on the Rittman Mead BI Forum 2015 homepage.
Each event has its own agenda, but both will focus on the technology and implementation aspects of Oracle BI, DW, Big Data and Analytics. Most of the sessions run for 45 minutes, but on the first day we’ll be holding a debate and on the second we’ll be running a data visualization “bake-off” – details on these, along with the masterclass, the keynotes and our special guest speakers, will be revealed on this blog over the next few weeks – watch this space!
Creating Real-Time Search Dashboards using Apache Solr, Hue, Flume and Cloudera Morphlines
Late last week Cloudera published a blog post on their developer site on building a real-time log analytics dashboard using Apache Kafka, Cloudera Search and Hue. As I’d recently been playing around with Oracle Big Data Discovery with our website log data as the data source, and as we’ve also been doing the same exercise in our development labs using Elasticsearch and Kibana, I thought it’d be interesting to give it a go; partly out of curiosity around how Solr, Kafka and Hue search work and compare to Elasticsearch, but also to try and work out what extra benefit Big Data Discovery gives you above and beyond free and open-source tools.
In the example, Apache web log data is read from the Linux server via a Flume syslog source, then fed into Apache Kafka as the transport mechanism before being loaded into Solr using a data transformation utility called “morphlines”. I’ve been looking at Kafka as an alternative to Flume for ingesting data into a Hadoop system for a while, mainly because of the tireless advocacy of Cloudera’s Gwen Shapira (Oracle ACE, ex-Pythian, now at Cloudera), whom I respect immensely and who has a great background in Oracle database administration as well as Hadoop, and because it potentially offers some useful benefits if used instead of, or more likely alongside, Flume – a publish-subscribe model vs. push, the ability to have multiple consumers as well as publishers, and a more robust transport mechanism that should avoid data loss when an agent node goes down. Kafka is now available as a parcel and service descriptor that you can download and then install within CDH5, and so I set up a separate VM in my Hadoop cluster as a Kafka broker and also installed Solr at the same time.
Working through the example, in the end I went with a slightly different and simplified approach that swapped the syslog Flume source for an Apache server file-tailing source, as our webserver was on a different host to the Flume agent and I’d already set this up for an earlier blog post. I also dropped the Kafka element, as it wasn’t clear to me from the Cloudera article whether it’d work in its published form or needed amending to use with Kafka (“To get data from Kafka, parse it with Morphlines, and index it into Solr, you can use an almost identical configuration”), and so I went with an architecture that looked like this:
Compared to Big Data Discovery, this approach has some drawbacks, but also some interesting benefits. From a drawback perspective, Apache Solr (or Cloudera Search as it’s called in CDH5, where Cloudera have integrated Solr with HDFS storage) needs some quite fiddly manual setup that’s definitely an IT task, rather than the point-and-click dataset setup that you get with Big Data Discovery. In terms of benefits though, apart from being free it’s potentially more scalable than Big Data Discovery, as BDD has to sample the full Hadoop dataset and fit that sample (typically 1m rows, or 1-5% of the full dataset) into BDD’s Endeca Server-based DGraph engine; Solr, however, indexes the whole Hadoop dataset and can store its indexes and log files within HDFS across the cluster – potentially very interesting if it works.
Back to drawbacks though, the first complication is that Solr’s configuration settings in this Cloudera Search incarnation are stored in Apache ZooKeeper, so you first have to download a template copy of the collection files (schema, index settings etc.) from ZooKeeper using solrctl, the command-line tool for SolrCloud (Solr running on a distributed cluster, as it is with Cloudera Search):
solrctl --zk bda5node2:2181/solr instancedir --generate $HOME/accessCollection
Then – and this again is a tricky part compared to Big Data Discovery – you have to edit the schema.xml file that Solr uses to determine which fields to index, what their datatypes are and so on. The Cloudera blog post points to a Github repo with the required schema.xml file for Apache Combined Log Format input files, but I found I had to add an extra entry for the “text” field name before Solr would index properly, added at the end of the file excerpt here:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="time" type="tdate" indexed="true" stored="true" />
<field name="record" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="client_ip" type="string" indexed="true" stored="true" />
<field name="code" type="string" indexed="true" stored="true" />
<field name="user_agent" type="string" indexed="true" stored="true" />
<field name="protocol" type="string" indexed="true" stored="true" />
<field name="url" type="string" indexed="true" stored="true" />
<field name="request" type="string" indexed="true" stored="true" />
<field name="referer" type="string" indexed="true" stored="true" />
<field name="bytes" type="string" indexed="true" stored="true" />
<field name="method" type="string" indexed="true" stored="true" />
<field name="extension" type="string" indexed="true" stored="true" />
<field name="app" type="string" indexed="true" stored="true" />
<field name="subapp" type="string" indexed="true" stored="true" />
<field name="device_family" type="string" indexed="true" stored="true" />
<field name="user_agent_major" type="string" indexed="true" stored="true" />
<field name="user_agent_family" type="string" indexed="true" stored="true" />
<field name="os_family" type="string" indexed="true" stored="true" />
<field name="os_major" type="string" indexed="true" stored="true" />
<field name="region_code" type="string" indexed="true" stored="true" />
<field name="country_code" type="string" indexed="true" stored="true" />
<field name="city" type="string" indexed="true" stored="true" />
<field name="latitude" type="float" indexed="true" stored="true" />
<field name="longitude" type="float" indexed="true" stored="true" />
<field name="country_name" type="string" indexed="true" stored="true" />
<field name="country_code3" type="string" indexed="true" stored="true" />
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<dynamicField name="ignored_*" type="ignored"/>
Then you have to upload the Solr configuration settings to ZooKeeper, and then configure Solr to use this particular set of ZooKeeper Solr settings (note the “--create” before the accessCollection collection name in the second command; this was missing from the Cloudera steps but is needed for a valid solrctl command):
solrctl --zk bda5node2:2181/solr instancedir --create accessCollection $HOME/accessCollection
solrctl --zk bda5node2:2181/solr collection --create accessCollection -s 1
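As a quick sanity check at this point (my own addition, not one of the Cloudera steps), solrctl can also list the instance directories and collections it now knows about in ZooKeeper, using the same ZooKeeper address as above:

# list instance directories and collections registered in ZooKeeper
solrctl --zk bda5node2:2181/solr instancedir --list
solrctl --zk bda5node2:2181/solr collection --list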
At this point you should be able to go to the Solr web admin page within the CDH cluster (http://bda5node5.rittmandev.com:8983/solr/#/, in my case), and see the collection (a distributed Solr index) listed with the updated index schema.
Next I configured the Flume source agent on the RM webserver, using this Flume conf file:
## SOURCE AGENT ##
## Local installation: /etc/flume1.5.0
## configuration file location: /etc/flume1.5.0/conf/conf
## bin file location: /etc/flume1.5.0/conf/bin
## START Agent: bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent

# http://flume.apache.org/FlumeUserGuide.html#exec-source
source_agent.sources = apache_server
source_agent.sources.apache_server.type = exec
source_agent.sources.apache_server.command = tail -f /etc/httpd/logs/access_log
source_agent.sources.apache_server.batchSize = 1
source_agent.sources.apache_server.channels = memoryChannel
source_agent.sources.apache_server.interceptors = itime ihost itype

# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
source_agent.sources.apache_server.interceptors.itime.type = timestamp

# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
source_agent.sources.apache_server.interceptors.ihost.type = host
source_agent.sources.apache_server.interceptors.ihost.useIP = false
source_agent.sources.apache_server.interceptors.ihost.hostHeader = host

# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
source_agent.sources.apache_server.interceptors.itype.type = static
source_agent.sources.apache_server.interceptors.itype.key = log_type
source_agent.sources.apache_server.interceptors.itype.value = apache_access_combined

# http://flume.apache.org/FlumeUserGuide.html#memory-channel
source_agent.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 100

## Send to Flume Collector on Hadoop Node
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.channel = memoryChannel
source_agent.sinks.avro_sink.hostname = rittmandev.com
source_agent.sinks.avro_sink.port = 4545
and then I set up a Flume sink agent as part of the Flume service using Cloudera Manager, initially set as “stopped”.
The Flume configuration file for this sink agent is where the clever stuff happens.
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = bda5node5
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

collector.channels = mc1 mc2
collector.channels.mc1.type = memory
collector.channels.mc1.transactionCapacity = 1000
collector.channels.mc1.capacity = 100000
collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100000
collector.channels.mc2.transactionCapacity = 1000

collector.sinks = LocalOut MorphlineSolrSink

collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /tmp/flume/website_logs
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

collector.sinks.MorphlineSolrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
collector.sinks.MorphlineSolrSink.morphlineFile = /tmp/morphline.conf
collector.sinks.MorphlineSolrSink.channel = mc2
The interesting bit here is the MorphlineSolrSink Flume sink. This Flume sink type routes Flume events to a morphline script that in turn copies the log data into the HDFS storage area used by Solr, and passes it to Solr for immediate indexing. Cloudera Morphlines is a command-based, lightweight ETL framework designed to transform streaming data from Flume, Spark and other sources and load it into HDFS, HBase or, in our case, Solr. Morphlines config files define ETL routines that then call extensible Kite SDK morphline commands to perform transformations on incoming data streams, such as:
- Splitting webserver request fields into the HTTP protocol, method and URL requested
- Generating the country, city and geocode for a given IP address, using the MaxMind GeoIP database
- Converting dates and times in string format to a Solr-format date and timestamp
The output is then passed to Solr in this instance, along with the UUID and other metadata Solr needs, for loading into the Solr index, or “collection” as it’s termed when it’s running across the cluster (note that the full log files aren’t stored by this process into HDFS, just the Solr indexes and transaction logs). The morphlines config file I used is below, based on the one provided in the Github repo accompanying the Cloudera blog post – note though that you need to download and set up the MaxMind GeoIP database file, and install the Python pip utility and a couple of pip packages before this will work:
# Specify server locations in a SOLR_LOCATOR variable;
# used later in variable substitutions
# Change the zkHost to point to your own Zookeeper quorum
SOLR_LOCATOR : {
  # Name of solr collection
  collection : accessCollection
  # ZooKeeper ensemble
  zkHost : "bda5node2:2181/solr"
}

# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more (potentially
# nested) commands. A morphline is a way to consume records (e.g. Flume events,
# HDFS files or blocks), turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on its way to
# Solr (or a MapReduceIndexerTool RecordWriter that feeds via a Reducer into Solr).
morphlines : [
  {
    # Name used to identify a morphline. E.g. used if there are multiple morphlines in a
    # morphline config file
    id : morphline1

    # Import all morphline commands in these java packages and their subpackages.
    # Other commands that may be present on the classpath are not visible to this morphline.
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]

    commands : [
      {
        ## Read the email stream and break it up into individual messages.
        ## The beginning of a message is marked by regex clause below
        ## The reason we use this command is that one event can have multiple
        ## messages
        readCSV {
          separator: " "
          columns: [client_ip,C1,C2,time,dummy1,request,code,bytes,referer,user_agent,C3]
          ignoreFirstLine : false
          quoteChar : "\""
          commentPrefix : ""
          trim : true
          charset : UTF-8
        }
      }
      {
        split {
          inputField : request
          outputFields : [method, url, protocol]
          separator : " "
          isRegex : false
          #separator : """\s*,\s*"""
          #isRegex : true
          addEmptyStrings : false
          trim : true
        }
      }
      {
        split {
          inputField : url
          outputFields : ["", app, subapp]
          separator : "/"
          isRegex : false
          #separator : """\s*,\s*"""
          #isRegex : true
          addEmptyStrings : false
          trim : true
        }
      }
      {
        userAgent {
          inputField : user_agent
          outputFields : {
            user_agent_family : "@{ua_family}"
            user_agent_major : "@{ua_major}"
            device_family : "@{device_family}"
            os_family : "@{os_family}"
            os_major : "@{os_major}"
          }
        }
      }
      {
        # Extract GEO information
        geoIP {
          inputField : client_ip
          database : "/tmp/GeoLite2-City.mmdb"
        }
      }
      {
        # extract parts of the geolocation info from the Jackson JsonNode Java
        # object contained in the _attachment_body field and store the parts in
        # the given record output fields:
        extractJsonPaths {
          flatten : false
          paths : {
            country_code : /country/iso_code
            country_name : /country/names/en
            region_code : /continent/code
            #"/subdivisions[]/names/en" : "/subdivisions[]/names/en"
            #"/subdivisions[]/iso_code" : "/subdivisions[]/iso_code"
            city : /city/names/en
            #/postal/code : /postal/code
            latitude : /location/latitude
            longitude : /location/longitude
            #/location/latitude_longitude : /location/latitude_longitude
            #/location/longitude_latitude : /location/longitude_latitude
          }
        }
      }
      #{logInfo { format : "BODY : {}", args : ["@{}"] } }

      # add Unique ID, in case our message_id field from above is not present
      {
        generateUUID {
          field : id
        }
      }

      # convert the timestamp field to "yyyy-MM-dd'T'HH:mm:ss.SSSZ" format
      {
        # 21/Nov/2014:22:08:27
        convertTimestamp {
          field : time
          inputFormats : ["[dd/MMM/yyyy:HH:mm:ss", "EEE, d MMM yyyy HH:mm:ss Z", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
          inputTimezone : America/Los_Angeles
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : UTC
        }
      }

      # Consume the output record of the previous command and pipe another
      # record downstream.
      #
      # This command sanitizes record fields that are unknown to Solr schema.xml
      # by deleting them. Recall that Solr throws an exception on any attempt to
      # load a document that contains a field that isn't specified in schema.xml
      {
        sanitizeUnknownSolrFields {
          # Location from which to fetch Solr schema
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # load the record into a SolrServer or MapReduce SolrOutputFormat.
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]
Then it’s just a case of starting the target sink agent using Cloudera Manager, and the source agent on the RM webserver using the flume-ng command-line utility, and then (hopefully) watching the web activity log entries start to arrive as documents in the Solr index/collection – which, after a bit of fiddling around and correcting typos, it did:
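For reference, the command to start the source agent is the one given in the comments at the top of the source agent conf file above, and as an extra sanity check of my own (not part of the Cloudera write-up) a quick Solr query against the collection shows how many documents have been indexed so far – the numFound value in the JSON response gives the current document count:

# on the RM webserver - start the Flume source agent
cd /etc/flume1.5.0
bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent

# from anywhere that can reach the Solr node - count documents indexed so far
curl "http://bda5node5.rittmandev.com:8983/solr/accessCollection/select?q=*:*&rows=0&wt=json"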
What’s neat here is that instead of having to use either an ETL tool such as ODI to process and parse the log entries (as I did here, in an earlier blog post series on ODI on Hadoop), or use the Hive-to-DGraph data reload feature in BDD, I’ve instead just got a Flume sink running this morphlines process and my data is added in real-time to my Solr index, and as you’ll see in a moment, a Hue Search dashboard.
To get Hue to work with my Solr service and new index, you first have to add the Solr service URL details to the Hue configuration settings using Cloudera Manager, like this:
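For reference, the setting that ends up in Hue’s configuration is the solr_url entry in the [search] section of hue.ini (in Cloudera Manager this is typically added via the Hue safety-valve configuration snippet); a minimal sketch, using my Solr node’s hostname, would look like this:

[search]
  # URL of the Solr server that Hue's Search app should query
  # (bda5node5 is the Solr node in my cluster - change this for your own environment)
  solr_url=http://bda5node5.rittmandev.com:8983/solr/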
Then, you can select the index from the list presented by the Search application within Hue, and start creating your data discovery and faceted search dashboard.
The end result, after a few minutes of setup, looked like this for me:
So how do Solr, Hue, Flume and Morphlines compare to Oracle Big Data Discovery as a potential search-and-discovery solution on Hadoop? What’s impressive is how little work, once I’d figured it out, it took to set this up, including the real-time loading and indexing of data for the dashboard. Compared to loading HDFS and Hive using ODI, and manually refreshing the BDD DGraph data store, it’s much more lightweight and pretty elegant. But it’s clearly an IT / developer solution, and I spent a fair few late nights getting it all to work – getting the Solr schema.xml right was a tricky task, and the morphlines / Solr ingestion process was particularly hard to debug and to understand why it wasn’t working.
Oracle Big Data Discovery, by contrast, makes the data loading, transformation and enrichment process available to the business or data analyst, and provides much richer tools for cataloging and exploring the full universe of datasets on the Hadoop cluster. Morphlines compares well to the Groovy transformations provided by Big Data Discovery and Solr is extensible to add functionality such as sentiment analysis and text parsing, but again these are IT tasks and not something the average data analyst will want to do.
In summary then – Hue, Solr and the Morphlines transformation framework can be an excellent tool in the hands of IT professionals and can create surprisingly featureful and elegant solutions with just a bit of code and process configuration – but where Big Data Discovery comes into its own is putting significant parts of this capability in the hands of the business and the data analyst, and providing tools for data upload and wrangling, combining that data with other datasets, analyzing that whole dataset (or “data reservoir”) and then collaborating with others around the organization.
Introducing Oracle Big Data Discovery Part 3: Data Exploration and Visualization
In the first two posts in this series, we looked at what Oracle Big Data Discovery is and how you can use it to sample, cleanse and then catalog data in your Hadoop-based data reservoir. At the end of that second post we’d loaded some webserver log data into BDD, and then uploaded some additional reference data that we then joined to the log file dataset to provide descriptive attributes to add to the base log activity. Once you’ve loaded the datasets into BDD you can do some basic searching and graphing of your data directly from the “Explore” part of the interface, selecting and locating attribute values from the search bar and displaying individual attributes in the “Scratchpad” area.
With Big Data Discovery though you can go one step further and build complete applications to search and analyse your data, using the “Discover” part of the application. Using this feature you can add one or more charts to a dashboard page that go much further than the simple data visualisations you get on the Explore part of the application, based on the chart types and UI interactions that you first saw in Oracle Endeca Information Discovery Studio.
Components you can add include thematic maps, summary bars (like OBIEE’s performance tiles, but for multiple measures), various bar, line and bubble charts, all of which can then be faceted-searched using an OEID-like search component.
Each visualisation component is tied to a particular “view” that points to one or more underlying BDD datasets – samples of the full dataset held in the Hadoop cluster stored in the Endeca Server-based DGraph engine. For example, the thematic map above was created against the post comments dataset, with the theme colours defined using the number of comments metric and each country defined by a country name attribute derived from the calling host IP address.
Views are auto-generated by BDD when you import a dataset, or when you join two or more datasets together. You can also use the Endeca EQL language to define your own views using a SQL-type language, and then define which columns represent attributes, which ones are metrics (measures) and how those metrics are aggregated.
Like OEID before it, Big Data Discovery isn’t a substitute for a regular BI tool like OBIEE – beyond simple charts and visualizations it’s tricky to create more complex data selections, drill-paths in hierarchies, subtotals and so forth, and users will need to understand the concept of multiple views and datatypes, when to drop into EQL and so on – but for non-technical users working in an organization’s big data team it’s a great way to put a visual front-end onto the data in the data reservoir without having to understand tools like R Studio.
So that’s it for this three-part overview of Oracle Big Data Discovery and how it works with the Hadoop-based data reservoir. Keep an eye on the blog over the next few weeks as we get to grips with this new tool, and we’ll be covering it as part of the optional masterclass at the Brighton and Atlanta Rittman Mead BI Forum 2015 events this May.
Introducing Oracle Big Data Discovery Part 2: Data Transformation, Wrangling and Exploration
In yesterday’s post I looked at Oracle Big Data Discovery and how it brought the search and analytic capabilities of Endeca to Hadoop. We looked at how the Oracle Endeca Information Discovery Studio application works with a version of the Endeca Server engine to analyse and visualise sample sets of data from the Hadoop cluster, and how it uses Apache Spark to retrieve data from Hadoop and then transform that data to make it more suitable for data discovery and data analysis applications. Oracle Big Data Discovery is designed to work alongside ODI and GoldenGate for Big Data once you’ve decided on your main data flows, and Oracle Big Data SQL for BI tool and application access to the entire “data reservoir”. So how does Big Data Discovery work, and what role does it play in the overall big data project workflow?
The best way to think of Big Data Discovery, to my mind, is “Endeca on Hadoop”. Endeca Information Discovery had three main parts to it: the data loading part, performed using Endeca Information Discovery Integrator or, more recently, the personal data upload feature in Endeca Information Discovery Studio; the Endeca Server engine, into which data was ingested and stored in a key/value-store NoSQL database, indexed, parsed and enriched; and the graphical user interface provided by Studio, used to analyze that data. As I explained in more detail in my first post in the series yesterday, Big Data Discovery runs the Studio and DGraph (Endeca Server) elements on one or more dedicated nodes, and then reads data in from Hadoop and writes it back in transformed states using Apache Spark, as shown in the diagram below:
As the data discovery and analysis features in Big Data Discovery rely on getting data into the DGraph (Endeca Server) engine first of all, this implies two things: first, we’ll need to take a subset or sample of the entire Hadoop dataset and load just that into the DGraph engine, and second, we’ll need some means of transforming and “massaging” that data so it works well as a data discovery set, and then writing those changes back to the full Hadoop dataset if we want to use it with some other tool – OBIEE or Big Data SQL, for example. To see how this process works, let’s use the same Rittman Mead Apache webserver logs that I’ve used in my previous examples, and bring that data and some additional reference data into Big Data Discovery.
The log data from the RM webserver is in Apache Combined Log Format and a sample of the rows looks like this:
For data to be eligible to be ingested into Big Data Discovery, it has to be registered in the Hive Metastore, with the metadata made available for use by external tools via the HCatalog service. This means that you already need to have created a Hive table over each datasource, either pointing this table to regular fixed-width or delimited files, or using a SerDe to translate another file format – say a compressed/column-store format like Parquet – into a format that Hive can understand. In our case I can use the RegEx SerDe that I first used in this blog post a while ago to create a Hive table over the log file and split out the various log file elements, with the resulting DDL looking like this:
CREATE EXTERNAL TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\" [^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE
LOCATION '/user/oracle/rm_logs';
If I then register the SerDe with Big Data Discovery I could ingest the table and file at this point, or I can use a Hive CTAS statement to remove the dependency on the SerDe and ingest into BDD without any further configuration.
create table access_logs as select * from apachelog;
At this point, if you’ve got the BDD Hive Table Detector running, it should pick up the presence of the new hive table and ingest it into BDD (you can whitelist table names, and restrict it to certain Hive databases if needed). Or, you can manually trigger the ingestion from the Data Processing CLI on the BDD node, like this:
[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t access_logs;
The data processing process then creates an Apache Oozie job to load a statistically relevant sample of the data into Apache Spark – with a 1% sample providing 95% sample accuracy – which is then profiled, enriched and loaded into the Big Data Discovery DGraph engine for further transformation, then exploration and analysis within Big Data Discovery Studio.
The profiling step in this process scans the incoming data and helps BDD determine the datatype of each Hive table column, the distribution of values within the column and so on, whilst the enrichment part identifies key words and phrases and other key lexical facts about the dataset. A key concept here also is that BDD typically works with a representative sample of your Hive table contents, not the whole contents, as all the data you analyse has to fit within the memory space of the DGraph engine, just as it used to with Endeca Server. At some point it’s likely that the functionality of the DGraph engine will be unbundled from the Endeca Server and run natively across the actual Hadoop cluster, but for now you have to separately ingest data into the DGraph engine (which can run clustered on BDD nodes) and analyse it there. The rules of sampling, however, are that if you’ve got a sufficiently big sample – say, 1m rows – then regardless of the actual size of the main dataset that sample is considered sufficiently representative – 95% in this case – as to make loading a bigger sample set not really worth the effort. But bear in mind when working with a BDD dataset that you’re working with a sample, not the full set, so if a value you’re looking for is missing it might be because it’s not in this particular sample.
Once you’ve ingested the new dataset into BDD, you see it listed amongst the others that have previously been ingested, like this:
At this point you can explore the dataset, to take an initial look at the patterns and values in the dataset in its raw form.
Unfortunately, in this raw form the data in the access_logs table isn’t all that useful – details of the page request URL are mixed in with the HTTP protocol and method, for example; dates are in strings; details of the person accessing the site are in IP address format rather than a geographical location, and so on. In previous examples on this blog I’ve looked at various methods to cleanse, transform and enhance the data in log file tables like this, using tools and techniques such as Hive table transformations, Pig and Apache Spark scripts, and ODI mappings, but all of these typically require some IT involvement, whereas one of the hallmarks of recent versions of Endeca Information Discovery Studio was giving power-users the ability to transform and enrich data themselves. Big Data Discovery provides tools to cleanse, transform and enrich data, with menu items for common transformations and a Groovy script editor for more complex ones, including deriving sentiment values from textual data and stripping out HTML and formatting characters from text.
Once you’ve finished transforming and enriching the dataset, you can either save (commit) the changes back to the sample dataset in the BDD DGraph engine, or you can use the transformation rules you’ve defined to apply those transformations to the entire Hive table contents back on Hadoop, with the transformation work being done using Apache Spark. Datasets are loaded into “projects” and each project can have its own transformed view of the raw data, with copies of the dataset being kept in the BDD DGraph engine to represent each team’s specific view onto the raw datasets.
In practice I found this didn’t, at the current state of the product, completely replace the need for a Hadoop developer or R data analyst – you need to get your data files into Hive and HCatalog at the start, which involves parsing and interpreting semi-structured data files, and I often did some transformations in BDD, then applied the transformations to the whole Hive dataset and then re-imported the results back into BDD to start from a simple known state. But it certainly made tasks such as turning IP addresses into countries and cities, splitting out URLs and removing HTML tags much easier, and I got the data cleansing process done in a matter of hours compared to the days it would have taken with manual Hive, Pig and Spark scripting.
Now the data in my log file dataset is much more usable and easy to understand, with URLs split out, status codes grouped into high-level descriptors, and other descriptive and formatting changes made.
I can also at this point bring in additional datasets, either created manually outside of BDD and ingested into the DGraph from Hive, or manually uploaded using the Studio interface. These dataset uploads then live in the BDD DGraph engine, and are then written back to Hive for long-term persistence or for sharing with other tools and processes.
These datasets can then be joined to the main dataset on matching dataset columns, giving you a table-join interface not unlike OBIEE’s physical model editor.
So now we’re in a position where our datasets have been ingested into BDD, and we’ve cleansed, transformed and joined them into a combined web activity dataset. In tomorrow’s final post I’ll look at the data visualisation part of Big Data Discovery and see how it brings the capabilities of Endeca Information Discovery Studio to Hadoop.