Category Archives: Rittman Mead
Australian March Training Offer
Autumn is almost upon us here in Australia so why not hold off going into hibernation and head into the classroom instead.
For March and April only, Rittmanmead courses in Australia* are being offered at significantly discounted prices.
Heading up this promotion is the popular TRN202 OBIEE 11g Bootcamp course which will be held in Melbourne, Australia* on March 16th-20th 2015.
This is not a cut down version of the regular course but the entire 5 day content. Details
To enrol for this specially priced course, visit the Rittmanmead website training page. Registration is only open between March 1st – March 9th 2015 so register quickly to secure a spot.
Further specially priced courses will be advertised in the coming weeks.
*This offer is only available for courses run in Australia.
Registration Period: 01/03/2015 12:00am – 09/03/2015 11:59:59pm
Further Terms and Conditions can be found during registration
Introducing Oracle Big Data Discovery Part 3: Data Exploration and Visualization
In the first two posts in this series, we looked at what Oracle Big Data Discovery is and how you can use it to sample, cleanse and then catalog data in your Hadoop-based data reservoir. At the end of that second post we’d loaded some webserver log data into BDD, and then uploaded some additional reference data that we then joined to the log file dataset to provide descriptive attributes to add to the base log activity. Once you’ve loaded the datasets into BDD you can do some basic searching and graphing of your data directly from the “Explore” part o the interface, selecting and locating attribute values from the search bar and displaying individual attributes in the “Scratchpad” area.
With Big Data Discovery though you can go one step further and build complete applications to search and analyse your data, using the “Discover” part of the application. Using this feature you can add one or more charts to a dashboard page that go much further than the simple data visualisations you get on the Explore part of the application, based on the chart types and UI interactions that you first saw in Oracle Endeca Information Discovery Studio.
Components you can add include thematic maps, summary bars (like OBIEE’s performance tiles, but for multiple measures), various bar, line and bubble charts, all of which can then be faceted-searched using an OEID-like search component.
Each visualisation component is tied to a particular “view” that points to one or more underlying BDD datasets – samples of the full dataset held in the Hadoop cluster stored in the Endeca Server-based DGraph engine. For example, the thematic map above was created against the post comments dataset, with the theme colours defined using the number of comments metric and each country defined by a country name attribute derived from the calling host IP address.
Views are auto-generated by BDD when you import a dataset, or when you join two or more datasets together. You can also use the Endeca EQL language to define your own views using a SQL-type language, and then define which columns represent attributes, which ones are metrics (measures) and how those metrics are aggregated.
Like OEID before it, Big Data Discovery isn’t a substitute for a regular BI tool like OBIEE – beyond simple charts and visualizations its tricky to create more complex data selections, drill-paths in hierarchies, subtotals and so forth, and users will need to understand the concept of multiple views and datatypes, when to drop into EQL and so on – but for non-technical users working in an organization’s big data team it’s a great way to put a visual front-end onto the data in the data reservoir without having to understand tools like R Studio.
So that’s it for this three-part overview of Oracle Big Data Discovery and how it works with the Hadoop-based data reservoir. Keep an eye on the blog over the next few weeks as we get to grips with this new tool, and we’ll be covering it as part of the optional masterclass at the Brighton and Atlanta Rittman Mead BI Forum 2015 events this May.
Introducing Oracle Big Data Discovery Part 2: Data Transformation, Wrangling and Exploration
In yesterday’s post I looked at Oracle Big Data Discovery and how it brought the search and analytic capabilities of Endeca to Hadoop. We looked at how the Oracle Endeca Information Discovery Studio application works with a version of the Endeca Server engine to analyse and visualise sample sets of data from the Hadoop cluster, and how it uses Apache Spark to retrieve data from Hadoop and then transform that data to make it more suitable for data discovery and data analysis applications. Oracle Big Data Discovery is designed to work alongside ODI and GoldenGate for Big Data once you’ve decided on your main data flows, and Oracle Big Data SQL for BI tool and application access to the entire “data reservoir”. So how does Big Data Discovery work, and what role does it play in the overall big data project workflow?
The best way to think of Big Data Discovery, to my mind, is “Endeca on Hadoop”. Endeca Information Discovery had three main parts to it; the data loading part performed using Endeca Information Discovery Integrator and more recently, the personal data upload feature in Endeca Information Discovery Studio. Data was then ingested into the Endeca Server engine and stored in a key/value-store NoSQL database, indexed, parsed and enriched, and then analyzed using the graphical user interface provided by Studio. As I explained in more detail in my first post in the series yesterday, Big Data Discovery runs the Studio and DGraph (Endeca Server) elements on one or more dedicated nodes, and then reads data in from Hadoop and then writes it back in transformed states using Apache Spark, as shown in the diagram below:
As the data discovery and analysis features in Big Data Discovery rely on getting data into the DGraph (Endeca Server) engine first of all, this implies two things; first, we’ll need to take a subset or sample of the entire Hadoop dataset and load just that into the DGraph engine, and second we’ll need some means of transforming and “massaging” that data so it works well as a data discovery set, and then writing those changes back to the full Hadoop dataset if we want to use it with some other tool – OBIEE or Big Data SQL, for example. To see how this process works, let’s use the same Rittman Mead Apache webserver logs that I’ve used in my previous examples, and bring that data and some additional reference data into Big Data Discovery.
The log data from the RM webserver is in Apache Combined Log Format and a sample of the rows looks like this:
For data to be eligible to be ingested into Big Data Discovery, it has to be registered in the Hive Metastore and with the metadata available to use by external tools using the HCatalog service. This means that you already need to have created a Hive table over each datasource, either pointing this table to regular fixed-width or delimited files, or using a SerDe to translate another file format – say a compressed/column-store format like Parquet – into a format that Hive can understand. In our case I can use the RegEx SerDe that I first used in this blog post a while ago to create a Hive table over the log file and split out the various log file elements, with the resulting DDL looking like this:
CREATE EXTERNAL TABLE apachelog ( host STRING, identity STRING, user STRING, time STRING, request STRING, status STRING, size STRING, referer STRING, agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\" [^\"]*\") ([^ \"]*|\"[^\"]*\"))?", "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" ) STORED AS TEXTFILE LOCATION '/user/oracle/rm_logs';
If I then register the SerDe with Big Data Discovery I could ingest the table and file at this point, or I can use a Hive CTAS statement to remove the dependency on the SerDe and ingest into BDD without any further configuration.
create table access_logs as select * from apachelog;
At this point, if you’ve got the BDD Hive Table Detector running, it should pick up the presence of the new hive table and ingest it into BDD (you can whitelist table names, and restrict it to certain Hive databases if needed). Or, you can manually trigger the ingestion from the Data Processing CLI on the BDD node, like this:
[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli [oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t access_logs;
The data processing process then creates an Apache Oozie job to sample a statistically relevant sample set of data into Apache Spark – with a 1% sample providing 95% sample accuracy – that is the profiled, enriched and then loaded into the Big Data Discovery DGraph engine for further transformation, then exploration and analysis within Big Data Discovery Studio.
The profiling step in this process scans the incoming data and helps BDD determine the datatype of each Hive table column, the distribution of values within the column and so on, whilst the enrichment part identifies key words and phrases and other key lexical facts about the dataset. A key concept here also is that BDD typically works with a representative sample of your Hive table contents, not the whole contents, as all the data you analyse has to fit within the memory space with the DGraph engine, just like it used to with Endeca Server. At some point its likely that the functionality of the DGraph engine will be unbundled from the Endeca Server and run natively across the actual Hadoop cluster, but for now you have to separately ingest data into the DGraph engine (which can run clustered on BDD nodes) and analyse it there – however the rules of sampling are that if you’ve got a sufficiently big sample – say, 1m rows – regardless of the actual main dataset size this sample set is considered sufficiently representative – 95% in this case – as to make loading a bigger sample set not really worth the effort. But bear in mind when working with a BDD dataset that you’re working a sample, not the full set, so if a value you’re looking for is missing it might be because it’s not in this particular sample.
Once you’ve ingested the new dataset into BDD, you see it listed amongst the others that have previously been ingested, like this:
At this point you can explore the dataset, to take an initial look at the patterns and values in the dataset in its raw form.
Unfortunately, in this raw form the data in the access_logs table isn’t all that useful – details of the page request URL are mixed in with the HTTP protocol and method, for example; dates are in strings; details of the person accession the site are in IP address format rather than a geographical location, and so on. In previous examples on this blog I’ve looked at various methods to cleanse, transform and enhance the data in log file tables like this, using tools and techniques such as Hive table transformations, Pig and Apache Spark scripts, and ODI mappings but all of these typically require some IT invovement whereas one of the hallmarks of recent versions of Endeca Information Discovery Studio was giving power-users the ability to transform and enrich data themselves. Big Data Discovery provides tools to cleanse, transform and enrich data, with menu items for common transformations and a Groovy script editor for more complex ones, including deriving sentiment values from textual data and stripping out HTML and formatting characters from text.
Once you’ve finished transforming and enriching the dataset, you can either save (commit) the changes back to the sample dataset in the BDD DGraph engine, or you can use the transformation rules you’ve defined to apply those transformations to the entire Hive table contents back on Hadoop, with the transformation work being done using Apache Spark. Datasets are loaded into “projects” and each project can have its own transformed view of the raw data, with copies of the dataset being kept in the BDD DGraph engine to represent each team’s specific view onto the raw datasets.
In practice I found this didn’t, at the current product state, completely replace the need for a Hadoop developer or R data analyst – you need to get your data files into Hive and HCatalog at the start which involves parsing and interpreting semi-structured data files, and I often did some transformations in BDD, then applied the transformations to the whole Hive dataset and then re-imported the results back into BDD to start from a simple known state. But it certainly made tasks such as turning IP addresses into countries and cities, splitting our URLs and removing HTML tags much easier and I got the data cleansing process done in a matter of hours compared to the days with manual Hive, Pig and Spark scripting.
Now the data in my log file dataset is much more usable and easy to understand, with URLs split out, status codes grouped into high-level descriptors, and other descriptive and formatting changes made.
I can also at this point bring in additional datasets, either created manually outside of BDD and ingested into the DGraph from Hive, or manually uploaded using the Studio interface. These dataset uploads then live in the BDD DGraph engine, and are then written back to Hive for long-term persistence or for sharing with other tools and processes.
These datasets can then be joined to the main dataset on matching dataset columns, giving you a table-join interface not unlike OBIEE’s physical model editor.
So now we’re in a position where our datasets have been ingested into BDD, and we’ve cleansed, transformed and joined them into a combined web activity dataset. In tomorrow’s final post I’ll look at the data visualisation part of Big Data Discovery and see how it brings the capabilities of Endeca Information Discovery Studio to Hadoop.
Introducing Oracle Big Data Discovery Part 1: “The Visual Face of Hadoop”
Oracle Big Data Discovery was released last week, the latest addition to Oracle’s big data tools suite that includes Oracle Big Data SQL, ODI and it’s Hadoop capabilities and Oracle GoldenGate for Big Data 12c. Introduced by Oracle as “the visual face of Hadoop”, Big Data Discovery combines the data discovery and visualisation elements of Oracle Endeca Information Discovery with data loading and transformation features built on Apache Spark to deliver a tool aimed at the “Discovery Lab” part of the Oracle Big Data and Information Management Reference Architecture.
Most readers of this blog will probably be aware of Oracle Endeca Information Discovery, based on the Endeca Latitude product acquired as part of the Endeca aquisition. Oracle positioned Endeca Information Discovery (OEID) in two main ways; on the one hand as a data discovery tool for textual and unstructured data that complemented the more structured analysis capabilities of Oracle Business Intellligence, and on the other hand, as a fast click-and-refine data exploration tool similar to Qlikview and Tableau.
The problem for Oracle though was that data discovery against files and documents is a bit of a “solution looking for a problem” and doesn’t have a naturally huge market (especially considering the license cost of OEID Studio and the Endeca Server engine that stores and analyzes the data), whereas Qlikview and Tableau are significantly cheaper than OEID (at least at the start) and are more focused on BI-type tasks, making OEID a good too but not one with a mass market. To address this, whilst OEID will continue as a standalone tool the data discovery and unstructured data analysis parts of OEID are making their way into this new product called Oracle Big Data Discovery, whilst the fast click-and-refine features will surface as part of Visual Analyzer in OBIEE12c.
More importantly, Big Data Discovery will run on Hadoop making it a solution for a real problem – how to catalog, explore, refine and visualise the data in the data reservoir, where data has been landed that might be in schema-on-read databases, might need further analysis and understanding, and users need large-scale tooling to extract the nuggets of information that in time make their way into the “Execution” part of the Big Data and Information Management Reference Architecture. As some who’s admired the technology behind Endeca Information Discovery but sometimes struggled to find real-life use-cases or customers for it, I’m really pleased to see its core technology applied to a problem space that I’m encountering every day with Rittman Mead’s customers.
In this first post, I’ll look at how Big Data Discovery is architected and how it works with Cloudera CDH5, the Hadoop distribution we use with our customers (Hortonworks HDP support is coming soon). In the next post I’ll look at how data is loaded into Big Data Discovery and then cataloged and transformed using the BDD front-end; then finally, we’ll take a look at exploring and analysing data using the visual capabilities of BDD evolved from the Studio tool within OEID. Oracle Big Data Discovery 1.0 is now GA (Generally Available) but as you’ll see in a moment you do need a fairly powerful setup to run it, at least until such time as Oracle release a compact install version running on VM.
To run Big Data Discovery you’ll need access to a Hadoop install, which in most cases will consist of 6 (minumum 3 or 4, but 6 is the minimum we use) to 18 or so Hadoop nodes running Cloudera CDH5.3. BDD generally runs on its own server nodes and itself can be clustered, but for our setup we ran 1 BDD node alongside 6 CDH5.3 Hadoop nodes looking like this:
Oracle Big Data Discovery is made up of three component types highlighted in red in the above diagram, two of which typically run on their own dedicated BDD nodes and another which runs on each node in the Hadoop cluster (though there are various install types including all on one node, for demo purposes)
- The Studio web user interface, which combines the faceted search and data discovery parts of Endeca Information Discovery Studio with a lightweight data transformation capability
- The DGraph Gateway, which brings Endeca Server search/analytics capabilities to the world of Hadoop, and
- The Data Processing component that runs on each of the Hadoop nodes, and uses Hive’s HCatalog feature to read Hive table metadata and Apache Spark to load and transform data in the cluster
The Studio component can run across several nodes for high-availability and load-balancing, which the DGraph element can run on a single node as I’ve set it up, or in a cluster with a single “leader” node and multiple “follower” nodes again for enhanced availability and throughput. The DGraph part them works alongside Apache Spark to run intensive search and analytics on subsets of the whole Hadoop dataset, with sample sets of data being moved into the DGraph engine and any resulting transformations then being applied to the whole Hadoop dataset using Apache Spark. All of this then runs as part of the wider Oracle Big Data product architecture, which uses Big Data Discovery and Oracle R for the discovery lab and Oracle Exadata, Oracle Big Data Appliance and Oracle Big Data SQL to take discovery lab innovations to the wider enterprise audience.
So how does Oracle Big Data Discovery work in practice, and what’s a typical workflow? How does it give us the capability to make sense of structured, semi-structured and unstructured data in the Hadoop data reservoir, and how does it look from the perspective of an Oracle Endeca Information Discovery developer, or an OBIEE/ODI developer? Check back for the next parts in this three part series where I’ll first look at the data transformation and exploration capabilities of Big Data Discovery, and then look at how the Studio web interface brings data discovery and data visualisation to Hadoop.