Category Archives: Rittman Mead

More on the Rittman Mead BI Forum 2015 Masterclass : “Delivering the Oracle Big Data and Information Management Reference Architecture”

Each year at the Rittman Mead BI Forum we host an optional one-day masterclass before the event opens properly on Wednesday evening, with guest speakers over the years including Kurt Wolff, Kevin McGinley and, last year, Cloudera’s Lars George. This year I’m particularly excited that, together with Jordan Meyer, our Head of R&D, I’ll be presenting the masterclass on the topic of “Delivering the Oracle Big Data and Information Management Reference Architecture”.

Last year at the Brighton BI Forum event we launched a new reference architecture that Rittman Mead had collaborated on with Oracle, incorporating big data and schema-on-read databases into the Oracle data warehouse and BI reference architecture. In two subsequent blog posts, and in a white paper published on the Oracle website a few weeks later, concepts such as the “Discovery Lab”, “Data Reservoirs” and the “Data Factory” were introduced as a way of incorporating the latest thinking, and product capabilities, into the reference architecture for Oracle-based BI, data warehousing and big data systems.

One of the problems I’ve always had with reference architectures, though, is that they tell you what you should create, but they don’t tell you how. Just how do you go from a set of example files and a vague requirement from the client to do something interesting with Hadoop and data science, and how do you turn the insights produced by that process into a production-ready, enterprise big data system? How do you implement the Data Factory, and how do you use new tools such as Oracle Big Data Discovery and Oracle Big Data SQL as part of this architecture? In this masterclass we’re looking to explain the “how” and the “why” to go with this new reference architecture, based on our experiences working with clients over the past couple of years.

The masterclass will be divided into two sections; the first, led by Jordan Meyer, will focus on the data discovery and “data science” parts of the Information Management architecture, going through initial analysis and discovery of datasets using R and Oracle R Enterprise. Jordan will share techniques he uses from both his work at Rittman Mead and his work with Slacker Radio, a Silicon Valley startup, and will introduce the R and Oracle R Enterprise toolset for uncovering insights, correlations and patterns in sample datasets and productionizing them as database routines. Over his three hours he’ll cover topics including: 

Session #1 – Data exploration and discovery with R (2 hrs) 

1.1 Introduction to R 

1.2 Tidy Data  

1.3 Data transformations 

1.4 Data Visualization 

Session #2 – Predictive Modeling in the enterprise (1 hr)

2.1 Classification

2.2 Regression

2.3 Deploying models to the data warehouse with ORE

After lunch, I’ll take the insights and analysis patterns identified in the Discovery Lab and turn them into production big data pipelines and datasets using Oracle Data Integrator 12c, Oracle Big Data Discovery and Oracle Big Data SQL. For a flavour of the topics I’ll be covering, take a look at this Slideshare presentation from a recent Oracle event; in the masterclass itself I’ll concentrate on techniques and approaches for ingesting and transforming streaming and semi-structured data, storing it in Hadoop-based data stores, and presenting it out to users using BI tools like OBIEE and Oracle’s new Big Data Discovery.

Session #3 – Building the Data Reservoir and Data Factory (2 hrs)

3.1 Designing and Building the Data Reservoir using Cloudera CDH5 / Hortonworks HDP, Oracle BDA and Oracle Database 12c

3.2 Building the Data Factory using ODI12c & new component Hadoop KM modules, real-time loading using Apache Kafka, Spark and Spark Streaming

Session #4 – Accessing and visualising the data (1 hr)

4.1 Discovering and Analyzing the Data Reservoir using Oracle Big Data Discovery

4.2 Reporting and Dashboards across the Data Reservoir using Oracle Big Data SQL + OBIEE 11.1.1.9

You can register for a place at the two masterclasses when booking your BI Forum 2015 place, but you’ll need to hurry as we limit the number of attendees at each event in order to maximise interaction and networking within each group. Registration is open now and the two events take place in May – hopefully we’ll see you there!

An Introduction to Analysing ODI Runtime Data Through Elasticsearch and Kibana 4

An important part of working with ODI is analysing its performance when it runs, and identifying steps that might be inefficient as well as variations in runtime against a baseline trend. The Operator tool in ODI itself is great for digging down into individual sessions and load plan executions, but for broader analysis we need a different approach. We also need to make sure we keep the data available for trend analysis, as the tables behind Operator are often purged for performance reasons.

In this article I’m going to show how we can make use of a generic method for pulling information out of an RDBMS such as Oracle and storing it in Elasticsearch, from where it can be explored and analysed through Kibana. It’s standalone, it’s easy to do, it’s free and open source – and it looks and works great! Here I’m going to use it to support the analysis of ODI runtime information, but it is equally applicable to any time-based data you’ve got in an RDBMS (e.g. OBIEE Usage Tracking data).

Kibana is an open-source data visualisation and analysis tool that works with data stored in Elasticsearch. These tools work really well for very rapid analysis of any kind of data that you want to chuck at them and work with quickly. By skipping the process of schema definition and data modelling, the time taken to get to the first results is drastically reduced. It enables you to quickly start “chucking about” data and getting meaning out of it before you commit full-scale to how you want to analyse it, which is what the traditional modelling route can sometimes force you to do prematurely.

ODI writes runtime information to the database: sessions run, steps executed, time taken and rows processed. This data is important for analysing things like performance issues and batch run times. Whilst for the equivalent runtime data in OBIEE (Usage Tracking) there is the superb RPD and dashboard content that Oracle ship in SampleApp v406, for ODI the options aren’t as vast, ultimately coming down to home-brew SQL against the repository tables using the repository schema documentation from Oracle. Building an OBIEE metadata model against the ODI repository schema is one option, but that then requires an OBIEE server on which to run it – or merging into an existing OBIEE deployment – which means it can become more hassle than it’s worth. It also means a bunch of up-front modelling before you get any kind of visualisations and data out. By copying the data across into Elasticsearch it’s easy to quickly build analyses against it, with the additional benefit of retaining the data for as long as you’d like, meaning it’s still available for long-term trend analysis once it’s been purged from the ODI repository itself.

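As a rough sketch of that generic extract-and-load approach, the snippet below reads recent session records from the ODI work repository and bulk-indexes them into Elasticsearch using the cx_Oracle and elasticsearch Python libraries. The SNP_SESSION table and column names are as per the ODI repository schema documentation (check them against your ODI version), whilst the connection details, index name and seven-day window are placeholder assumptions for illustration.

import cx_Oracle
from elasticsearch import Elasticsearch, helpers

ora = cx_Oracle.connect("odi_repo", "password", "oradb01:1521/pdborcl")   # hypothetical work repository connection
es  = Elasticsearch(["http://monitor01:9200"])                            # hypothetical Elasticsearch node

# Session runtime data from the ODI work repository; column names as per the
# repository schema documentation - verify against your ODI version
sql = """select sess_no, sess_name, sess_beg, sess_end, sess_dur, sess_status
         from   snp_session
         where  sess_beg > sysdate - 7"""

def actions():
    for sess_no, sess_name, sess_beg, sess_end, sess_dur, sess_status in ora.cursor().execute(sql):
        yield {"_index"  : "odi-sessions",
               "_type"   : "session",        # document type, as used by Elasticsearch 1.x / Kibana 4
               "_id"     : sess_no,          # session number as document id keeps repeated loads idempotent
               "_source" : {"sess_name"   : sess_name,
                            "sess_beg"    : sess_beg,
                            "sess_end"    : sess_end,
                            "sess_dur"    : sess_dur,
                            "sess_status" : sess_status}}

helpers.bulk(es, actions())

Run on a schedule, this keeps a copy of the runtime data available in Elasticsearch even after the equivalent rows have been purged from the ODI repository.
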
Let’s take a bit of a walk through the ODI dashboard that I’ve put together. First up is a view of the number of sessions that have run over time, along with their duration. For duration I’ve shown the 50th (median), 75th and 95th percentiles to get an idea of the spread of session runtimes. At the moment we’re looking at all sessions, so it’s not surprising that there is a wide range, since there will always be small sessions and longer ones.

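Kibana builds this kind of chart for you, but under the covers it is just an Elasticsearch date_histogram aggregation with a percentiles sub-aggregation; the equivalent raw query is sketched below, using the assumed index and field names from the loading example earlier.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://monitor01:9200"])      # hypothetical Elasticsearch node

result = es.search(index="odi-sessions", body={
    "size" : 0,                                    # aggregations only, no individual hits
    "aggs" : {
        "sessions_over_time" : {
            "date_histogram" : {"field" : "sess_beg", "interval" : "1d"},
            "aggs" : {
                "duration_pct" : {"percentiles" : {"field" : "sess_dur", "percents" : [50, 75, 95]}}
            }
        }
    }
})

# one bucket per day: session count plus the 50th/75th/95th percentile durations
for bucket in result["aggregations"]["sessions_over_time"]["buckets"]:
    pct = bucket["duration_pct"]["values"]
    print(bucket["key_as_string"], bucket["doc_count"], pct["50.0"], pct["75.0"], pct["95.0"])
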
Next up on the dashboard comes a summary of top sessions by runtime, both cumulative and per-session. The longest running sessions are an obvious point of interest, but cumulative runtime is also important; something may only take a short while to run when compared to some singular long-running sessions, but if it runs hundreds of times then it all adds up and can give a big performance boost if time is shaved off it.

Plotting out session execution times also makes it easy to see when the longest-running sessions ran.

The final element on this first dashboard is one giving the detail for each of the top x long-running session executions, including the session number so that it can be examined in further detail through the Operator tool.

Kibana dashboards are interactive, so you can click on a point in a graph to zoom in on that time period, as well as click & drag to select an arbitrary range. The latter technique is sometimes known as “Brushing”, and if I’m not describing it very well have a look at this example here and you’ll see in an instant what I mean.

As you focus on a time period in one graph the whole dashboard’s time filter changes, so where you have a table of detail data it then shows just the rows for the range you’ve selected. Notice also that the granularity of the aggregation changes too, from a summary of every three hours in the first of the screenshots through to 30 seconds in the last. This is a nice way of presenting a summary of the data, but isn’t always desirable (it can mask extremes and abnormalities), so the interval can be configured to a fixed value as well.

Time isn’t the only interaction on the dashboard – anything that’s not a metric can be clicked on to apply a filter. So in the example above, where the top sessions by cumulative time are listed out, we might want to find out more about the one with several thousand executions.

Simply clicking on it then filters the dashboard, and now the session details table and graph show information just for that session, including duration and rows processed.

Session performance analysis

As an example of the benefit of using a spread of percentiles, here is a particular session that had an erratic runtime with great variation, which then stabilised. The purple line is the 95th percentile response time; the green and blue are the 50th and 75th respectively. It’s clear that whilst up to 75% of the sessions completed in about the same kind of time each time they ran, the remaining quarter took anything up to five times as long.

One of the most important things in performance is consistency, and that is what happens here from about halfway along the horizontal axis, at around February.

But what was causing the variation? Digging a notch deeper and looking at the runtime of the individual steps within the given session shows that the inconsistent runtime was caused by a single step (the green line in this graph) within the execution. When this step’s runtime stabilises, so does the overall performance of the session.

This is a post-mortem of a resolved performance problem, to illustrate how useful the data is – obviously, if there were still a performance problem, we’d have a clear path of investigation to pursue thanks to this data.

If you’re interested in how to put together this kind of dashboard in Kibana and Elasticsearch, loading in data on a regular basis from an RDBMS, then stay tuned for part 2 when all will be revealed…

Announcing the Special Guest Speakers for Brighton & Atlanta BI Forum 2015

As well as a great line-up of speakers and sessions at each of the Brighton & Atlanta Rittman Mead BI Forum 2015 events in May, I’m very pleased to announce our two guest speakers who’ll give the second keynotes, on the Thursday evening of the two events just before we leave for the restaurant and the appreciation events. This year our special guest speaker in Atlanta is John Foreman, Chief Data Scientist at MailChimp and author of the book “Data Smart: Using Data Science to Transform Information into Insight”; and in Brighton we’re delighted to have Reiner Zimmerman, Senior Director of Product Management at Oracle US and the person behind the Oracle DW & Big Data Global Leaders program.

I first came across John Foreman when somebody recommended his book “Data Smart” to me a year or so ago. At that time Rittman Mead were getting more and more requests from our customers asking us to help with their advanced analytics and predictive modelling needs, and I was looking around for resources to help myself and the team get to grips with some of the more advanced modelling and statistical techniques Oracle’s tools now support – techniques such as clustering and pattern matching, linear regression and genetic algorithms.

One of the challenges when learning these sorts of techniques is not getting too caught up in the tools and technology – R was our favoured technology at the time, and there’s lots to it – so John’s book was particularly well-timed, as it goes through these types of “data science” techniques but focuses on Microsoft Excel as the analysis tool, with simple examples and a very readable style.

Back in his day job, John is Chief Data Scientist at MailChimp and has become a particularly in-demand speaker following the success of his book, and I was very excited to hear from Charles Elliott, our Practice Manager for Rittman Mead America, that he lived near John in Atlanta and had arranged for him to keynote at our Atlanta BI Forum event. His keynote will be entitled “How Mailchimp used qualitative and quantitative analysis to build their next product”, and we’re very much looking forward to meeting him at our event in Atlanta on May 13th-15th 2015.

Our second keynote speaker at the Brighton Rittman Mead BI Forum 2015 event is none other than Reiner Zimmerman, best known in EMEA for organising the Oracle DW Global Leaders Program. We’ve known Reiner for several years now, as Rittman Mead are one of the associate sponsors of the program, which aims to bring together the leading organizations building data warehouse and big data systems on the Oracle Engineered Systems platform.

A bit like the BI Forum (but even more exclusive), the DW Global Leaders program holds meetings in the US, EMEA and AsiaPac over the year and is a fantastic networking and knowledge-sharing group for an exclusive set of customers putting together the most cutting-edge DW and big data systems on the latest Oracle technology. Reiner’s also an excellent speaker and a past visitor to the BI Forum, and his session entitled “Hadoop and Oracle BDA customer cases from around the world” will be a look at what customers are really doing, and the value they’re getting, from building big data systems on the Oracle platform.

Registration is now open for both the Brighton and Atlanta BI Forum 2015 events, with full details including the speaker line-up and how to register on the event website. Keep an eye on the blog for more details of both events later this week including more on the masterclass by myself and Jordan Meyer, and a data visualisation “bake-off” we’re going to run on the second day of each event. Watch this space…!

Rittman Mead BI Forum 2015 Now Open for Registration!

I’m very pleased to announce that the Rittman Mead BI Forum 2015, running in Brighton and Atlanta in May 2015, is now open for registration.

Back for its seventh successful year, the Rittman Mead BI Forum once again will be showcasing the best speakers and presentations on topics around Oracle Business Intelligence and data warehousing, with two events running in Brighton, UK and Atlanta, USA in May 2015. The Rittman Mead BI Forum is different to other Oracle tech events in that we keep the numbers attending limited, topics are all at the intermediate-to-expert level, and we concentrate on just one topic – Oracle Business Intelligence Enterprise Edition, and the technologies and products that support it.

As in previous years, the BI Forum will run over two consecutive weeks, starting in Brighton and then moving over to Atlanta the following week; full dates and venue details are on the event website.

This year our optional one-day masterclass will be delivered by Jordan Meyer, our Head of R&D, and myself, and will be on the topic of “Delivering the Oracle Big Data and Information Management Reference Architecture” that we launched last year at our Brighton event. Details of the masterclass, and the speaker and session line-up at the two events, are on the Rittman Mead BI Forum 2015 homepage.

Each event has its own agenda, but both will focus on the technology and implementation aspects of Oracle BI, DW, Big Data and Analytics. Most of the sessions run for 45 minutes, but on the first day we’ll be holding a debate and on the second we’ll be running a data visualization “bake-off” – details on this, the masterclass and the keynotes and our special guest speakers will be revealed on this blog over the next few weeks – watch this space!

Creating Real-Time Search Dashboards using Apache Solr, Hue, Flume and Cloudera Morphlines

Late last week Cloudera published a blog post on their developer site on building a real-time log analytics dashboard using Apache Kafka, Cloudera Search and Hue. As I’d recently been playing around with Oracle Big Data Discovery using our website log data as the data source, and as we’ve also been doing the same exercise in our development labs using Elasticsearch and Kibana, I thought it’d be interesting to give it a go; partly out of curiosity around how Solr, Kafka and Hue search work and compare to Elasticsearch, but also to try and work out what extra benefit Big Data Discovery gives you above and beyond these free and open-source tools.

In the example, Apache web log data is read from the Linux server via a Flume syslog source, then fed into Apache Kafka as the transport mechanism before being loaded into Solr using a data transformation utility called “Morphlines”. I’ve been looking at Kafka as an alternative to Flume for ingesting data into a Hadoop system for a while, mainly because of the tireless advocacy of Cloudera’s Gwen Shapira (Oracle ACE, ex-Pythian, now at Cloudera), who I respect immensely and who has a great background in Oracle database administration as well as Hadoop, and because it potentially offers some useful benefits if used instead of, or more likely alongside, Flume – a publish-subscribe model vs. push, the ability to have multiple consumers as well as publishers, and a more robust transport mechanism that should avoid data loss when an agent node goes down. Kafka is now available as a parcel and service descriptor that you can download and then install within CDH5, so I set up a separate VM in my Hadoop cluster as a Kafka broker and installed Solr at the same time.

Working through the example, in the end I went with a slightly different and simplified approach that swapped the syslog Flume source for an Apache server file-tailing source, as our webserver was on a different host to the Flume agent and I’d already set this up for an earlier blog post. I also dropped the Kafka element, as it wasn’t clear to me from the Cloudera article whether it would work in its published form or needed amending to use with Kafka (“To get data from Kafka, parse it with Morphlines, and index it into Solr, you can use an almost identical configuration”), and so I went with a simpler architecture: a Flume agent on the webserver tailing the Apache access log and sending events over Avro to a Flume collector agent on the Hadoop cluster, which in turn passes them through Morphlines into Solr.

Compared to Big Data Discovery, this approach has some drawbacks, but also some interesting benefits. From a drawback perspective, Apache Solr (or Cloudera Search as it’s called in CDH5, where Cloudera have integrated Solr with HDFS storage) needs some quite fiddly manual setup that’s definitely an IT task, rather than the point-and-click dataset setup that you get with Big Data Discovery. In terms of benefits though, apart from being free it’s potentially more scalable than Big Data Discovery, as BDD has to sample the full Hadoop dataset and fit that sample (typically 1m rows, or 1-5% of the full dataset) into BDD’s Endeca Server-based DGraph engine; Solr, however, indexes the whole Hadoop dataset and can store its indexes and log files within HDFS across the cluster – potentially very interesting if it works.

Back to the drawbacks though: the first complication is that Solr’s configuration settings in this Cloudera Search incarnation are stored in Apache Zookeeper, so you first have to generate a template copy of the collection configuration files (schema.xml and so on) using solrctl, the command-line tool for SolrCloud (Solr running on a distributed cluster, as it is with Cloudera Search):

solrctl --zk bda5node2:2181/solr instancedir --generate $HOME/accessCollection

Then – and this again is a tricky part compared to Big Data Discovery – you have to edit the schema.xml file that Solr uses to determine which fields to index, what their datatypes are and so on. The Cloudera blog post points to a Github repo with the required schema.xml file for Apache combined log format input files, but I found I had to add an extra entry for the “text” field name before Solr would index properly, which I added towards the end of the file as shown in the excerpt here:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
   <field name="time" type="tdate" indexed="true" stored="true" />
   <field name="record" type="text_general" indexed="true" stored="false" multiValued="true"/>
   <field name="client_ip" type="string" indexed="true" stored="true" />
   <field name="code" type="string" indexed="true" stored="true" />
   <field name="user_agent" type="string" indexed="true" stored="true" />
   <field name="protocol" type="string" indexed="true" stored="true" />   
   <field name="url" type="string" indexed="true" stored="true" />   
   <field name="request" type="string" indexed="true" stored="true" />
   <field name="referer" type="string" indexed="true" stored="true" />
   <field name="bytes" type="string" indexed="true" stored="true" />
   <field name="method" type="string" indexed="true" stored="true" />
   
   <field name="extension" type="string" indexed="true" stored="true" />   
   <field name="app" type="string" indexed="true" stored="true" />      
   <field name="subapp" type="string" indexed="true" stored="true" />
      
   <field name="device_family" type="string" indexed="true" stored="true" />
   <field name="user_agent_major" type="string" indexed="true" stored="true" />   
   <field name="user_agent_family" type="string" indexed="true" stored="true" />
   <field name="os_family" type="string" indexed="true" stored="true" />   
   <field name="os_major" type="string" indexed="true" stored="true" />
   
   <field name="region_code" type="string" indexed="true" stored="true" />
   <field name="country_code" type="string" indexed="true" stored="true" />
   <field name="city" type="string" indexed="true" stored="true" />
   <field name="latitude" type="float" indexed="true" stored="true" />
   <field name="longitude" type="float" indexed="true" stored="true" />
   <field name="country_name" type="string" indexed="true" stored="true" />
   <field name="country_code3" type="string" indexed="true" stored="true" />
 
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
   <dynamicField name="ignored_*" type="ignored"/>

Then you have to upload the Solr configuration settings to Zookeeper, and then configure Solr to use this particular set of Zookeeper Solr settings (note the “--create” before the accessCollection collection name in the second command; this was missing from the Cloudera steps but is needed for it to be a valid solrctl command):

solrctl --zk bda5node2:2181/solr instancedir --create accessCollection $HOME/accessCollection
solrctl --zk bda5node2:2181/solr collection --create accessCollection -s 1

At this point you should be able to go to the Solr web admin page within the CDH cluster (http://bda5node5.rittmandev.com:8983/solr/#/, in my case), and see the collection (a distributed Solr index) listed with the updated index schema.

Next I configure the Flume source agent on the RM webserver, using this Flume conf file:

## SOURCE AGENT ##
## Local installation: /etc/flume1.5.0
## configuration file location:  /etc/flume1.5.0/conf/conf
## bin file location: /etc/flume1.5.0/conf/bin
## START Agent: bin/flume-ng agent -c conf -f conf/flume-src-agent.conf -n source_agent
 
# http://flume.apache.org/FlumeUserGuide.html#exec-source
source_agent.sources = apache_server
source_agent.sources.apache_server.type = exec
source_agent.sources.apache_server.command = tail -f /etc/httpd/logs/access_log
source_agent.sources.apache_server.batchSize = 1
source_agent.sources.apache_server.channels = memoryChannel
source_agent.sources.apache_server.interceptors = itime ihost itype
 
# http://flume.apache.org/FlumeUserGuide.html#timestamp-interceptor
source_agent.sources.apache_server.interceptors.itime.type = timestamp
 
# http://flume.apache.org/FlumeUserGuide.html#host-interceptor
source_agent.sources.apache_server.interceptors.ihost.type = host
source_agent.sources.apache_server.interceptors.ihost.useIP = false
source_agent.sources.apache_server.interceptors.ihost.hostHeader = host
 
# http://flume.apache.org/FlumeUserGuide.html#static-interceptor
source_agent.sources.apache_server.interceptors.itype.type = static
source_agent.sources.apache_server.interceptors.itype.key = log_type
source_agent.sources.apache_server.interceptors.itype.value = apache_access_combined
 
# http://flume.apache.org/FlumeUserGuide.html#memory-channel
source_agent.channels = memoryChannel
source_agent.channels.memoryChannel.type = memory
source_agent.channels.memoryChannel.capacity = 100
 
## Send to Flume Collector on Hadoop Node
# http://flume.apache.org/FlumeUserGuide.html#avro-sink
source_agent.sinks = avro_sink
source_agent.sinks.avro_sink.type = avro
source_agent.sinks.avro_sink.channel = memoryChannel
source_agent.sinks.avro_sink.hostname = rittmandev.com
source_agent.sinks.avro_sink.port = 4545

and then I set up a Flume sink agent as part of the Flume service using Cloudera Manager, initially set as “stopped”.

The Flume configuration file for this sink agent is where the clever stuff happens.

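# Avro source receiving events on port 4545 from the webserver source agent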
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = bda5node5
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2

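# Two in-memory channels, one feeding each sink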
collector.channels = mc1 mc2
collector.channels.mc1.type = memory
collector.channels.mc1.transactionCapacity = 1000
collector.channels.mc1.capacity = 100000
collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100000
collector.channels.mc2.transactionCapacity = 1000

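# Two sinks: a local file copy of the raw events, plus the Morphlines/Solr indexing sink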
collector.sinks = LocalOut MorphlineSolrSink

collector.sinks.LocalOut.type = file_roll
collector.sinks.LocalOut.sink.directory = /tmp/flume/website_logs
collector.sinks.LocalOut.sink.rollInterval = 0
collector.sinks.LocalOut.channel = mc1

collector.sinks.MorphlineSolrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
collector.sinks.MorphlineSolrSink.morphlineFile = /tmp/morphline.conf
collector.sinks.MorphlineSolrSink.channel = mc2

The interesting bit here is the MorphlineSolrSink Flume sink. This Flume sink type routes Flume events to a morphline script that in turn copies the log data into the HDFS storage area used by Solr, and passes it to Solr for immediate indexing. Cloudera Morphlines is a lightweight, command-based ETL framework designed to transform streaming data from Flume, Spark and other sources and load it into HDFS, HBase or, in our case, Solr. A morphlines config file defines an ETL routine as a chain of extensible Kite SDK morphline commands that perform transformations on the incoming data stream, such as:

  • Splitting the webserver request field into the HTTP protocol, method and URL requested
  • Generating the country, city and geocode for a given IP address using the MaxMind GeoIP database
  • Converting dates and times in string format to a Solr-format date and timestamp

with the output then being passed to Solr in this instance, along with the UUID and other metadata Solr needs, for loading into the Solr index, or “collection” as it’s termed when it’s running across the cluster (note that the full log files aren’t stored by this process into HDFS, just the Solr indexes and transaction logs). The morphlines config file I used is below, based on the one provided in the Github repo accompanying the Cloudera blog post – note though that you need to download and set up the MaxMind GeoIP database file, and install the Python pip utility and a couple of pip packages, before this will work:

# Specify server locations in a SOLR_LOCATOR variable;
# used later in variable substitutions
# Change the zkHost to point to your own Zookeeper quorum
SOLR_LOCATOR : {
    # Name of solr collection
    collection : accessCollection
    # ZooKeeper ensemble
    zkHost : "bda5node2:2181/solr"
}
 
# Specify an array of one or more morphlines, each of which defines an ETL
# transformation chain. A morphline consists of one or more (potentially
# nested) commands. A morphline is a way to consume records (e.g. Flume events,
# HDFS files or blocks), turn them into a stream of records, and pipe the stream
# of records through a set of easily configurable transformations on its way to
# Solr (or a MapReduceIndexerTool RecordWriter that feeds via a Reducer into Solr).
morphlines : [
{
    # Name used to identify a morphline. E.g. used if there are multiple morphlines in a
    # morphline config file
    id : morphline1
    # Import all morphline commands in these java packages and their subpackages.
    # Other commands that may be present on the classpath are not visible to this morphline.
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
    {
        ## Parse each Apache combined log format event as a space-separated
        ## CSV record, mapping the positional values to named fields
        readCSV {
            separator : " "
            columns : [client_ip,C1,C2,time,dummy1,request,code,bytes,referer,user_agent,C3]
            ignoreFirstLine : false
            quoteChar : "\""
            commentPrefix : ""
            trim : true
            charset : UTF-8
        }
    }
    {
        split {
            inputField : request
            outputFields : [method, url, protocol]
            separator : " "
            isRegex : false
            # separator : """\s*,\s*"""
            # isRegex : true
            addEmptyStrings : false
            trim : true
        }
    }
    {
        split {
            inputField : url
            outputFields : ["", app, subapp]
            separator : "/"
            isRegex : false
            # separator : """\s*,\s*"""
            # isRegex : true
            addEmptyStrings : false
            trim : true
        }
    }
    {
        userAgent {
            inputField : user_agent
            outputFields : {
                user_agent_family : "@{ua_family}"
                user_agent_major  : "@{ua_major}"
                device_family     : "@{device_family}"
                os_family         : "@{os_family}"
                os_major          : "@{os_major}"
            }
        }
    }
    {
        # Extract GEO information
        geoIP {
            inputField : client_ip
            database : "/tmp/GeoLite2-City.mmdb"
        }
    }
    {
        # extract parts of the geolocation info from the Jackson JsonNode Java
        # object contained in the _attachment_body field and store the parts in
        # the given record output fields:
        extractJsonPaths {
            flatten : false
            paths : {
                country_code : /country/iso_code
                country_name : /country/names/en
                region_code : /continent/code
                # "/subdivisions[]/names/en" : "/subdivisions[]/names/en"
                # "/subdivisions[]/iso_code" : "/subdivisions[]/iso_code"
                city : /city/names/en
                # /postal/code : /postal/code
                latitude : /location/latitude
                longitude : /location/longitude
                # /location/latitude_longitude : /location/latitude_longitude
                # /location/longitude_latitude : /location/longitude_latitude
            }
        }
    }
      #{logInfo { format : "BODY : {}", args : ["@{}"] } }
    # add a unique ID to each record, to populate the mandatory Solr id field
    {
        generateUUID {
            field:id
        }
    }
    # convert the timestamp field to "yyyy-MM-dd'T'HH:mm:ss.SSSZ" format
    {
       #  21/Nov/2014:22:08:27
        convertTimestamp {
            field : time 
            inputFormats : ["[dd/MMM/yyyy:HH:mm:ss", "EEE, d MMM yyyy HH:mm:ss Z", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", "yyyy-MM-dd'T'HH:mm:ss", "yyyy-MM-dd"]
            inputTimezone : America/Los_Angeles
           outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
            outputTimezone : UTC
        }
    }
    # Consume the output record of the previous command and pipe another
    # record downstream.
    #
    # This command sanitizes record fields that are unknown to Solr schema.xml
    # by deleting them. Recall that Solr throws an exception on any attempt to
    # load a document that contains a field that isn't specified in schema.xml
    {
        sanitizeUnknownSolrFields {
            # Location from which to fetch Solr schema
            solrLocator : ${SOLR_LOCATOR}
        }
    }
    # load the record into a SolrServer or MapReduce SolrOutputFormat.
    {
        loadSolr {
            solrLocator : ${SOLR_LOCATOR}
        }
    }
    ]
}
]

Then it’s just a case of starting the target sink agent using Cloudera Manager, and the source agent on the RM webserver using the flume-ng command-line utility, and then (hopefully) watching the web activity log entries start to arrive as documents in the Solr index/collection – which, after a bit of fiddling around and correcting typos, they did.

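One quick way to confirm that documents really are arriving (and a handy sanity check while debugging) is to poll the collection’s document count through Solr’s standard select handler; a minimal sketch is below, using the hostname and collection name from my cluster.

import time
import requests

SOLR_SELECT = "http://bda5node5.rittmandev.com:8983/solr/accessCollection/select"

# print the number of indexed documents every 30 seconds; the count should
# climb as new web activity flows through Flume and the morphline into Solr
while True:
    resp = requests.get(SOLR_SELECT, params={"q": "*:*", "rows": 0, "wt": "json"})
    print(resp.json()["response"]["numFound"], "documents indexed")
    time.sleep(30)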

What’s neat here is that instead of having to use an ETL tool such as ODI to process and parse the log entries (as I did here, in an earlier blog post series on ODI on Hadoop), or use the Hive-to-DGraph data reload feature in BDD, I’ve just got a Flume sink running this morphlines process and my data is added in real-time to my Solr index and, as you’ll see in a moment, to a Hue Search dashboard.

To get Hue to work with my Solr service and new index, you first have to add the Solr service URL details to the Hue configuration settings using Cloudera Manager.

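The setting in question is the Hue Search application’s Solr URL, which can be added through the Hue configuration safety-valve in Cloudera Manager with a snippet along the lines of the one below; the [search] section and solr_url property are standard Hue settings, whilst the hostname is the one from my cluster, so adjust it for yours.

[search]
  # URL of the SolrCloud instance the Hue Search application should query
  solr_url=http://bda5node5.rittmandev.com:8983/solr/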

Then, you can select the index from the list presented by the Search application within Hue, and start creating your data discovery and faceted search dashboard.

The end result, after a few minutes of setup, was an interactive faceted search dashboard over the incoming website activity data.

So how do Solr, Hue, Flume and Morphlines compare to Oracle Big Data Discovery as a potential search-and-discovery solution on Hadoop? What’s impressive is how little work it took, once I’d figured it out, to set this up, including the real-time loading and indexing of data for the dashboard. Compared to loading HDFS and Hive using ODI and manually refreshing the BDD DGraph data store, it’s much more lightweight and pretty elegant. But it’s clearly an IT / developer solution, and I spent a fair few late nights getting it all to work – getting the Solr schema.xml right was a tricky task, and the morphlines / Solr ingestion process was particularly hard to debug and to understand why it wasn’t working.

Oracle Big Data Discovery, by contrast, makes the data loading, transformation and enrichment process available to the business or data analyst, and provides much richer tools for cataloging and exploring the full universe of datasets on the Hadoop cluster. Morphlines compares well to the Groovy transformations provided by Big Data Discovery and Solr is extensible to add functionality such as sentiment analysis and text parsing, but again these are IT tasks and not something the average data analyst will want to do.

In summary then – Hue, Solr and the Morphlines transformation framework can be excellent tools in the hands of IT professionals, and can create surprisingly featureful and elegant solutions with just a bit of code and process configuration – but where Big Data Discovery comes into its own is in putting significant parts of this capability in the hands of the business and the data analyst, and providing tools for data upload and wrangling, combining that data with other datasets, analyzing that whole dataset (or “data reservoir”) and then collaborating with others around the organization.