Oracle Exalytics, Oracle R Enterprise and Endeca Part 3 : Flight Delays Analysis using OBIEE, Endeca and Oracle R Enterprise on Exalytics

So far in this series we’ve looked at Oracle’s BI, Big Data and Advanced Analytics strategy, and seen how Exalytics can benefit both Endeca Information Discovery and Oracle R Enterprise through its 40 CPU cores and 1TB of RAM. For a recap on the two earlier postings in this series, the links are provided below, and in this final posting we’ll look at an example where all three tools are brought together on the Exalytics platform to provide different types of “big data” analysis, with big data defined as data of high volume, high variety and high velocity:

In this example, the big data bit is slightly tenuous as the data we’re using is transactional data (numbers of flights, numbers of cancellations etc.) accompanied by some unstructured/semi-structured data (reasons for flight delays). There’s no sensor data, or data arriving at high velocity, but the dataset is large (around 123M rows) and diverse (sort of), and Oracle R Enterprise could easily connect to a Hadoop cluster via the Oracle R Connector for Hadoop if required. All of this could also run on a regular server, but if it’s hosted on Exalytics then we additionally benefit from the 1TB of RAM and 40 cores that this hardware can bring to bear on the analysis, with the R client running on Exalytics along with Endeca Information Discovery, and the Exalytics server then connecting via its two InfiniBand ports to an Exadata server, an Oracle Big Data Appliance server, or other data sources via its 10Gb and 1Gb Ethernet interfaces.

[Image: Exalytics connectivity to Exadata, Big Data Appliance and other data sources]

The flight delays dataset is a fairly well-known set of public data made available by the Bureau of Transportation Statistics, Research and Innovative Technology Administration within the United States Department of Transportation, and in its full incarnation contains 123M rows of non-stop US domestic flight legs. For each flight leg it contains the source and destination airports, operator, and aircraft type, whilst for delays it holds the type and duration of delay, the delay reason and other supporting numeric and textual information. If you’ve seen the standard Exalytics Airlines demo, this is the dataset it uses, but it can also be used by Endeca and Oracle R Enterprise, as we’ll see in this post.

So given the three Exalytics tools we’ll be using for this example (OBIEE, Endeca Information Discovery and Oracle R Enterprise), what, at a high level, is each tool good at? A reasonable starting point would be to say:

  • OBIEE is good for dashboard analysis of structured data (numeric + attribute data with a clearly-defined data model), together with ad-hoc analysis, scorecards and other traditional BI-type analysis
  • Oracle Endeca Information Discovery is good for the initial exploration and discovery of the data set, allowing us to quickly bring in disparate structured and unstructured data and then aggregate and analyse it, usually as the precursor to more structured dashboard analysis using OBIEE
  • R, and Oracle R Enterprise, is good at providing deep insight into specific questions, such as “are some airports more prone to delays than others”, and “for American Airlines, how has the distribution of delays for departures and arrivals evolved over time?” 

If we take this model, with Endeca used first to discover the full dataset, then OBIEE for answers to the questions we’ve now defined, and R/ORE to dig deeper into specific topics, our BI approach on Exalytics would look something like this:

[Image: Endeca, OBIEE and R/ORE roles within the Exalytics BI platform]

So let’s start with the Endeca element. If you read my series of postings on Endeca just after the Oracle acquisition, you’ll have read how one of the main use-cases for Endeca Latitude and the MDEX engine (now known as Oracle Endeca Information Discovery, and the Endeca Server, respectively) was in situations where you had a whole range of potentially interesting data that you wanted to load up and quickly analyse, but you didn’t want to spend an inordinate amount of time creating a conformed dimensional data model; instead, the key-value pair database loads data up as records, each one of which contains a number of attributes that effectively carry their own schema. What you often end up with then is what Endeca termed a “jagged database”, where each record has at least one attribute in common with the others (typically more than one, as shown in the diagram below), but records originating from different source systems or database tables might have attribute sets that differ from each other, or even from other records in the same dataset. The net effect of this is that upfront data modelling is minimised and you don’t need to reject incoming data just because it doesn’t fit into your conformed data model. The diagram below shows a conceptual view of such an Endeca Server datastore, with the first incoming set of rows containing sales transaction data made up of dimension IDs and some attributes unique to sales data, and the next set of rows containing market research information that shares some key values with the previous dataset, but then contains its own unique attributes that may or may not be present in all of its records.

[Image: Conceptual view of a “jagged” Endeca Server datastore]

Endeca Server datastores (as its databases are called) are created and loaded via web service calls, typically constructed using Endeca Information Discovery Integrator, an ETL tool built on the Eclipse/CloverETL open-source platform and enhanced with specific components for Endeca Server administration. Once the datastore is loaded, the front-end dashboard application is created using Endeca Information Discovery Studio, with the two GUI tools looking as in the screenshots below. For more details of the Endeca Information Discovery development process, see this series of postings that I put together earlier in the year, where I go through an end-to-end development process using the Quickstart/Bikestore Endeca demo dataset, and the set of videos on our YouTube page that take you through the process with narrative explaining what’s going on.

[Image: Endeca Information Discovery Integrator and Studio]

Where the Endeca Server differentiates itself from OBIEE’s BI Server and its traditional RDBMS sources, and Essbase and other multi-dimensional OLAP servers, is that it’s got a bunch of features and capabilities for analysing textual data and extracting meaning, sentiment and other semantics from it. Using Integrator or direct web service calls to the Endeca Server, incoming unstructured data can be analysed using features such as:

  • Keyword search, boolean search, parametric search, wildcard search, dimension search and dimension filters
  • Dimension precedence rules
  • Numeric range, geospatial, date/time and security filters
  • Spell correction/suggestion, and “did you mean”-type alternative presentation
  • Find similar, and 1 and 2-way synonyms
  • Stemming and lemmatisation
  • Keyword-in-context snippeting
  • Results clustering, relevance ranking, sorting and paging
  • Support for multiple languages

So what do Endeca Information Discovery dashboards look like once they’re created, and connected to a suitable Endeca Server datastore? In the example of the flight delays data we’re using across the various tools, there are a number of unique features that EID brings to the dataset, starting with the first dashboard we’ll look at, below.

[Image: Endeca Information Discovery flight delays dashboard]

The flight delays dataset contains lots of free-form text, so that there are, for example, many different mis-spellings of McDonnell Douglas, an aircraft manufacturer. After being loaded into the Endeca Server datastore and then processed using the Endeca Server’s document analysis capabilities, cleaned-up and standardised versions of these mis-spellings are used to populate a manufacturer attribute that groups all of them together, for easy analysis.

I mentioned earlier that one of the main uses of Endeca Information Discovery is searching across the entire dataset to find attributes and records of interest, which will then form the focus of the more structured data model that we’ll use as a data source for OBIEE. In the screenshot below, the Value Search feature is initially used to display all occurrences of the typed-in value across all attributes in the recordset, with matching attributes highlighted as the search term is typed in. In addition, what’s termed a record search can then be performed, which takes an attribute value and uses it to filter the displayed set of records based on groups of attributes called “search interfaces”. As the set of records is narrowed down by the record search, graphs and other visuals on the dashboard page immediately show metric numbers aggregated for this record set, demonstrating the dual search/analytic capabilities of the Endeca Server. When run on the Exalytics platform, all of this potentially takes place much more quickly, as the Endeca Server can parallelise search operations as well as any indexing that needs to take place in the datastore. The 1TB of RAM on the server can also be useful, as the Endeca Server will try and keep as much of the analysis dataset in memory as possible, with the disk-based column store database there more as a persistence store.

[Image: Value search and record search in Endeca Information Discovery Studio]
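
As an aside, everything Studio does against a datastore goes through those same Endeca Server web services, so a datastore can also be queried programmatically. The sketch below, in R, is purely illustrative: the host, port, datastore name and request body are hypothetical, simplified stand-ins for a real Conversation Service SOAP request, and it assumes the httr package is installed.

library(httr)

# Hypothetical endpoint: the Endeca Server conversation web service for a
# datastore; host, port and datastore name are made up for this sketch.
endpoint <- "http://exalytics01:7770/ws/conversation/flightdelays"

# A deliberately simplified request body. A real request is a SOAP envelope
# conforming to the Conversation Service schema, with typed query, filter
# and results-configuration elements.
request_xml <- '<Request>
  <SearchFilter key="delay_reason">weather</SearchFilter>
  <RecordsToReturn>10</RecordsToReturn>
</Request>'

response <- POST(endpoint, body = request_xml, content_type("text/xml"))

# Print the raw XML the server sends back
cat(content(response, as = "text"))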

Finally, the text search and analysis features in the Endeca Server are useful for pulling out themes and sentiments from the incoming data; in the screenshot below, we can see that MD-88 aircraft are typically involved in delays that are down to the age of the aircraft, whilst delays involving the newer Boeing 777 are more often down to issues such as lights not working, crew areas not being serviceable and so on.

[Image: Text analysis of delay reasons by aircraft type]

Armed with all of this information and a subsequent better understanding of the data available to us, we can now start thinking about a more structured data model for use with OBIEE.

The flight delays dataset, once you look into it in more detail, really contains two main star schemas we’re interested in: one based around a flight leg fact, dimensioned by carrier, flight month, origin and destination airport, route and so forth; the other based around the actual flight delays, sharing some of these dimensions but also recording the reason for each delay, as in the diagram below:

[Image: Flight legs and flight delays star schemas]

This dimensional model would then map fairly easily into an Oracle BI Repository semantic model, with a single data source and business model, and with two subject areas, one for each star schema. As we’re running on Exalytics though, we can then generate some aggregate recommendations, based either on database optimiser statistics (if the system is as yet unused by end-users) or on actual query patterns taken from the usage tracking and summary statistics tables maintained by the BI Server. To generate these recommendations and then create a script for their implementation, you’d use the Oracle BI Summary Advisor that’s only available on Exalytics systems.

[Image: Oracle BI Summary Advisor]

Full details on what happens when you use the Summary Advisor are in this previous blog post and my article on the topic for Oracle Magazine, but once you’ve generated your aggregates and created your dashboards and analyses, your dashboards would look something like the screenshots below. Note that whilst these examples focus on Exalytics, a cut-down version of the Flight Delays data, along with dashboards and analyses, is available as part of SampleApp v207, together with the R dashboards that we’ll see later on.

[Image: OBIEE flight delays dashboards]

What OBIEE does well here is display, in a very rich graphical form, lots of aggregated data with supporting attributes to enable slice-and-dice analysis, KPIs, scorecards and maps. When run on Exalytics, all of the prompts have their “Apply” buttons removed so that changes in parameter values are reflected in the dashboard immediately, whilst the TimesTen in-memory database ensures that response times stay in the sub-second range, even when the underlying dataset has millions of detail-level rows within it.

So now on to R, and Oracle R Enterprise. R is typically used to answer more in-depth, focused questions using more advanced statistical functions than you’d get in regular SQL, such as:

  • Are some airports more prone to delays than others? Are some days of the week likely to see fewer delays than others? And are these differences significant? 
  • How do arrival delay distributions differ for the best and worst 3 airlines compared to the industry? Moreover, are there significant differences among airlines?
  • For American Airlines, how has the distribution of delays for departures and arrivals evolved over time? 
  • How do average annual arrival delays compare across select airlines, and what is the underlying trend for each airline?

To analyse the airlines dataset using R, a cut-down version of the dataset, ONTIME_S, conveniently ships with Oracle R Enterprise and comes pre-installed with it (ONTIME_S is also described in this Oracle R Enterprise blog post, where you can see examples of R functions being used on the dataset). To work with the flight delays dataset, then, you’d go through a process of creating data frames within ORE using data from the Oracle database, and then create R scripts to manipulate the dataset and provide answers to your questions. Again, teaching R is outside the scope of this posting, but the screenshots below show the ONTIME_S dataset being loaded up in the R client that’s included in SampleApp v207, along with an R script that provides one of the analyses used in the dashboard I’ll show in a moment.

[Image: ONTIME_S dataset and an analysis script in the R client]
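
To give a flavour of that process, here’s a minimal sketch, assuming an ORE client connected to a database where the ONTIME_S demo table is visible to the connected schema; the password, and any details that differ on your system, are illustrative.

library(ORE)

# Connection details mirror the SampleApp console session shown later in
# this series; the password is illustrative, so adjust for your environment.
ore.connect(user = "rquser", password = "rquser", sid = "orcl",
            host = "localhost", port = 1521)
ore.sync()     # refresh ORE's view of the tables in the schema
ore.attach()   # expose them to R as ore.frame proxy objects

class(ONTIME_S)   # "ore.frame": a proxy for the table, not an in-memory copy
dim(ONTIME_S)

# "Are some days of the week likely to see fewer delays than others?"
# The overloaded aggregate() below is translated to SQL and executed in
# the database; only the small aggregated result returns to the client.
delays_by_day <- aggregate(ONTIME_S$ARRDELAY,
                           by = list(day = ONTIME_S$DAYOFWEEK),
                           FUN = mean)
head(delays_by_day)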

Scripts created using R can be utilised within Oracle BI in a couple of main ways: R scripts stored within the Oracle database using ORE can be referenced directly from BI Publisher, with R’s XML output then being used to create an image that can be displayed using RTF templates; or you can reference R scripts held within ORE directly within OBIEE’s BI Repository, as database functions called in a similar way to regular ones such as AVG, LAG/LEAD and REGEXP (with details explained in a training PDF on Operationalizing R Scripts on the Oracle website). The OBIEE SampleApp v207 comes with a set of dashboards that show how both types of output might look, with the dashboard page on the left displaying an embedded, parameterised BI Publisher report showing flight delays per airport, calculated live by R engines on the Exalytics server. The dashboard page on the right, by contrast, shows a regression analysis calculated using functions referenced in the BI Repository RPD, displaying the output as both a table and an interactive map.

[Image: R output surfaced via BI Publisher and the BI Repository in OBIEE dashboards]
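
For both routes the common first step is getting a named R script into the database so that SQL, and therefore BI Publisher or the RPD, can invoke it. The sketch below shows the general pattern with an invented script name and model; recent ORE releases provide ore.scriptCreate() for this, whilst earlier releases used the equivalent PL/SQL call sys.rqScriptCreate.

library(ORE)
ore.connect(user = "rquser", password = "rquser", sid = "orcl",
            host = "localhost", port = 1521)   # illustrative connection

# Register a named script in the database script repository; the name and
# the model it builds are invented for this sketch.
ore.scriptCreate("FlightDelayTrend", function(dat) {
  # Inside the database-spawned R engine, dat arrives as a plain data.frame
  lm(ARRDELAY ~ YEAR, data = dat)
})

# SQL (and hence a BI Publisher data model, or an RPD function call) can
# then invoke the script through the rq*Eval table functions, which return
# rows or XML for the report to render.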

So, it was a bit of a whistle-stop tour, but hopefully it sets out the different types of analysis made available by Oracle Endeca Information Discovery, OBIEE and Oracle R Enterprise, and how you might use one, two or all of them on a typical BI project. I’ve left out Essbase, of course, which also has a role to play, and the “big data” element is a bit superficial as I’m not doing anything with Hadoop, MapReduce and so on. But hopefully it gives you a flavour of the different tools and how they might benefit from being run on the Exalytics platform. For more information on Rittman Mead and Endeca, check out the Rittman Mead Endeca homepage, whilst for more information on Exalytics, check out our Exalytics resource centre, where you can also read about our Exalytics Test Centre in London, UK, where we can prototype this sort of analysis using our own, dedicated Exalytics server, working in conjunction with our OBIEE, Endeca and R consulting and implementation team.

Oracle Exalytics, Oracle R Enterprise and Endeca Part 2 : Oracle Endeca, the Advanced Analytics Option and Oracle Exalytics

In this week of postings we’re going to look at Oracle Exalytics and how it enables “big data” and unstructured data analytics, using Oracle Endeca, Oracle Exadata, Oracle Big Data Appliance and the Oracle Database Advanced Analytics Option. In case you’ve arrived via a Google search and you’re interested in the rest of the postings in this series, here are the links to the articles (to be completed as postings are published).

So in the first post in this series we looked at Exalytics as part of the Oracle database tech stack, and how Oracle’s analytics strategy is to handle all types of data, using a number of optimised analytic tools and analysis engines, with packaged applications where appropriate, delivered via the web, via mobile devices, in the cloud and embedded in business applications and processes. We closed the post with a mention of a new database option called the Advanced Analytics Option, and in this second posting we’ll look at just what this new option contains and how it relates to Oracle’s engineered systems strategy.

The Advanced Analytics Option is an option to Oracle Database Enterprise Edition, available from version 11.2 of the database onwards. It includes two major components:

  • Oracle Data Mining, which prior to the Advanced Analytics Option was an option in itself (typically bought along with the OLAP Option, which is still a separate option)
  • Oracle R Enterprise, Oracle’s take on R, the statistical language used widely in academia and rapidly replacing Base SAS and SPSS within commercial organisations

For both data mining and R, the key premise with the Advanced Analytics Option is to bring the algorithms to the data; instead of having to extract data from a database, along with files and other sources, and then load this into a statistics engine such as SAS, you can instead embed R scripts and data mining algorithms directly within the database, making it easy to score and classify data in real-time, such as in a call-centre application or as part of an ETL routine.

[Image: Oracle Advanced Analytics Option]
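
As a concrete, though hypothetical, illustration of that “algorithms to the data” idea in ORE, the sketch below ships a scoring function into the database rather than extracting the data first; the CALLS table, its columns and the churn model are all invented for the example.

library(ORE)
ore.connect(user = "rquser", password = "rquser", sid = "orcl",
            host = "localhost", port = 1521)   # illustrative connection
ore.sync(); ore.attach()   # make schema tables visible as ore.frames

# ore.tableApply() sends the function to an R engine running inside the
# database, which reads the (invented) CALLS table in place; no extract
# to a separate statistics engine is needed.
scores <- ore.tableApply(
  CALLS,   # an ore.frame proxy for a database table
  function(dat) {
    model <- glm(churned ~ call_length + complaints,
                 data = dat, family = binomial())
    data.frame(cust_id    = dat$cust_id,
               churn_prob = predict(model, dat, type = "response"))
  })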

Oracle Data Mining has been around for a number of years now, but R is a new addition to Oracle’s analytic toolset, and is probably new to most Oracle BI & DW developers. So just what is R, and Oracle’s version of it, Oracle R Enterprise?

In fact, there are actually two R offerings that Oracle have put together; one is free, the other is a database option. The free one is Oracle’s own distribution of open-source R, the same as you’d get by downloading it from the R Project’s website, but with additional libraries to make it run faster on x86 hardware. Open-source R can also be downloaded from the Oracle website along with the licensable Oracle R Enterprise, and installed directly onto Oracle Linux or other Unix OSes using Oracle’s yum repository. Oracle R Enterprise, however, is basic R extended to work more closely with the Oracle database by adding the following (licensed) elements:

  • R packages to add to and extend the standard packages provided with open-source R
  • A database library for connecting to Oracle and running R scripts within the database
  • SQL extensions to allow R functionality to be called from SQL and PL/SQL

These elements then provide four main Oracle R Enterprise features:

  • A “Transparency” layer that intercepts standard R functions and extends them to allow certain R functions and datatypes to reside in the Oracle database
  • A Statistics Engine providing a set of statistical functions and procedures for commonly-used statistical libraries, which then execute in the Oracle database
  • SQL extensions, which allow database server execution of R code, and support parallelism, SQL access to R and XML output
  • A Hadoop connector, for running R scripts and functions against a Hadoop cluster, with the data held in HDFS, an Oracle database, or local files.

When you work with R, you typically have the R environment installed on your laptop or workstation, delivered as a single executable for Windows, Linux or Unix. Whilst this has the virtue of simplicity, it also means that you are limited by the amount of RAM and CPU on your local machine, which can quickly become an issue when you try to spin up multiple R engines to process a model in parallel, as each engine loads the full data set into memory before starting work. Even on a 2-4 core laptop with 16GB of RAM you can quickly run out of memory, which is where Oracle R Enterprise comes in: the basic data structure that you work with in R, called a “data frame” and analogous to a relational table, can with Oracle R Enterprise actually be stored in the database, giving you the ability to process much larger sets of data, with many more R engines running, than if you were running standalone. Typically this would be a large, multi-core Oracle database, though you can also connect R and ORE to the TimesTen in-memory database using the new ROracle R interface, detailed in this blog post by Jason Feldhaus on the Oracle R Enterprise blog.
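
Here’s a minimal sketch of that idea, again with illustrative connection details: data can be pushed from client R memory into the database as an “ore.frame”, worked on there via the transparency layer, and pulled back only when a local copy is genuinely needed.

library(ORE)
ore.connect(user = "rquser", password = "rquser", sid = "orcl",
            host = "localhost", port = 1521)   # illustrative connection

local_df <- data.frame(x = rnorm(1000), y = rnorm(1000))

db_frame <- ore.push(local_df)   # materialises a temporary table in the database
class(db_frame)                  # "ore.frame": a proxy, not a local copy

mean(db_frame$x)                 # computed in the database, not on the client

back_home <- ore.pull(db_frame)  # back to an ordinary in-memory data.frame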

Oracle R Enterprise also has the ability to spin up (or “spawn”) its own R engines within the database server, providing a lights-out environment that allows R computations to be carried out even when you’re not at your workstation, with these database-resident R engines having full access to the database, SQL and PL/SQL. Coupled with the Oracle R Connector for Hadoop, a typical ORE (as we’ll shorten Oracle R Enterprise to from now on) topology looks like the diagram below.

[Image: Typical ORE topology, including the Oracle R Connector for Hadoop]
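
This embedded R execution is driven from the client through functions such as ore.groupApply(); in the sketch below (connection details illustrative, and assuming the ONTIME_S demo table is visible), the database spawns R engines, in parallel where resources allow, to fit one model per airline carrier.

library(ORE)
ore.connect(user = "rquser", password = "rquser", sid = "orcl",
            host = "localhost", port = 1521)   # illustrative connection
ore.sync(); ore.attach()

# One database-side R engine invocation per carrier; the result comes back
# as a list of fitted models keyed by carrier code.
models <- ore.groupApply(
  ONTIME_S,
  INDEX = ONTIME_S$UNIQUECARRIER,
  function(dat) {
    lm(ARRDELAY ~ DEPDELAY, data = dat)
  })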

So where does Exalytics come into this? If you’ve followed along so far you may well have spotted that, as ORE is in fact a database option and therefore runs as part of the Oracle database, it shouldn’t really be installed (along with an Oracle database) on the Exalytics server; apart from having to license 20 processors of Oracle Database Enterprise Edition plus the Advanced Analytics Option, Exalytics is really meant for just OBIEE, WebLogic, TimesTen and Essbase, with ORE supposed to reside on Exadata, or at least a separate database server. What Exalytics does do well, though, is play the role of a supercharged client for ORE, with open-source R running on Exalytics connecting to ORE on Exadata; the R client can then spin up multiple R engines to process models in parallel, making use of Exalytics’ 40 cores, whilst the 1TB of RAM allows multiple copies of the models’ data to be held in memory without the machine breaking a sweat. Couple this with ORE’s ability to spin up its own R engines on the Exalytics server, the InfiniBand connection between the two servers, and an Oracle Big Data Appliance if you’ve also got one, and your R topology now looks like the diagram below.

[Image: R topology across Exalytics, Exadata and Big Data Appliance]

The question you’re probably asking at this point, seeing as we’ve established where R fits into the Oracle BI and big data architecture, is: just what is R, and what can it do for Oracle BI if it’s just a statistical programming language? Well, if you’ve got the latest OBIEE 11g SampleApp (v207), downloadable from OTN, it’s actually got R, and Oracle R Enterprise, already installed and set up, ready to go. So assuming you’ve got SampleApp v207 installed and all of the OBIEE and other servers running, you can start your first R session by selecting Applications > Accessories > Terminal from the Linux desktop menu bar, then typing in “R” to start the R console, part of the standard R client, like this:

[oracle@obieesampleapp ~]$ R
 
R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: i686-redhat-linux-gnu (32-bit)
 
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
 
  Natural language support but running in an English locale
 
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
 
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
Loading Oracle R Enterprise Packages
Connecting to Oracle RDBMS
    User: rquser
    SID : orcl
    Host: localhost
    Port: 1521
Done.
> 

Note how starting the R console first displays the open-source license messages, then displays details about ORE, finishing up by displaying the connection details of the database account it’s now going to connect us to, which is the standard ORE account on a database that’s been configured to use the Advanced Analytics Option.

Obviously explaining the full syntax and capabilities of R is outside the scope of this blog post (try the online ORE docs, and the R Project’s online manuals), but ORE comes with a number of sample R scripts as part of the base ORE package, which you can look at to get a flavour of the language and its syntax. Whilst connected to the R console you can list out the demo ORE scripts like this:

> demo (package = "ORE")
 
basic                   Basic connectivity to database
binning                 Binning logic
columnfns               Column functions
cor                     Correlation matrix
crosstab                Frequency cross tabulations
derived                 Handling of derived columns
distributions           Distribution, density, and quantile functions
do_eval                 Embedded R processing
freqanalysis            Frequency cross tabulations
graphics                Demonstrates visual analysis
group_apply             Embedded R processing by group
hypothesis              Hypothesis testing functions
matrix                  Matrix related operations
nulls                   Handling of NULL in SQL vs. NA in R
push_pull               RDBMS <-> R data transfer
rank                    Attribute-based ranking of observations
reg                     Ordinary least squares linear regression
row_apply               Embedded R processing by row chunks
sql_like                Mapping of R to SQL commands
stepwise                Stepwise OLS linear regression
summary                 Summary functionality
table_apply             Embedded R processing of entire table

To run one of these, for example the correlation matrix one, type in the command:

> demo ("cor", package = "ORE")

R also ships with a number of graphics demos that show off some of the graphs and other visualisations that R can produce. To run these, from the R console type in:

> demo (graphics)

The R console will then step you through a number of graph demos, displaying each graph when you press the enter key.

[Image: Output from the R graphics demos]

Compared to the basic statistical functions provided by Oracle SQL, R provides a much wider variety of statistical and graphical techniques, including the following (a short worked example follows the list):

  • Linear and non-linear modelling
  • Classical statistical tests and time-series analysis
  • Classification, clustering and other capabilities
  • Matrix arithmetic, along with scalar, vector, list and data frame types (a data frame being analogous to a relational table)
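
As a two-minute taste of the first item on that list, here’s plain base R, with no Oracle components involved, fitting and plotting a simple linear model on the cars data frame that ships with R:

# Ordinary least squares on a built-in data frame
model <- lm(dist ~ speed, data = cars)
summary(model)

# Scatter plot with the fitted regression line overlaid
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(model)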

In addition, R is extensible through community-contributed packages at the Comprehensive R Archive Network (CRAN), which is probably the main attraction for users of R, and it can connect to “big data” sources such as Hadoop through Oracle’s R Connector for Hadoop. So now that we’ve seen the basics of R and how it might benefit from Exalytics and the Oracle R Enterprise features, how might R, OBIEE and Endeca work together if used on Exalytics? In the final posting in this series we’ll look at a case study that takes the publicly-available Flight Delays dataset and analyses it using OBIEE, Endeca and Oracle R Enterprise, to see what each tool can contribute and how they might look to a typical end-user.

Oracle Exalytics, Oracle R Enterprise and Endeca Part 1 : Oracle’s Analytics, Engineered Systems, and Big Data Strategy

One of the presentations Rittman Mead gave at last week’s Oracle OpenWorld was entitled “High Speed, Big Data Analysis using Oracle Exalytics” [PDF]. Although it was the last of my presentations, it was probably the one I most looked forward to delivering, as it talked about how Oracle’s in-memory analytics server could also be used for advanced analytics and unstructured data analysis, along with OBIEE’s traditional dashboards and ad-hoc reports. So how do Endeca Information Discovery and R, the most recent additions to Oracle’s analytics toolkit, relate to Exalytics, and what benefits does this high-end server provide for these tools? Over the next week I’ll be looking at this topic in detail, with the following postings in the series (links will be added as each post is published).

For anyone new to the product, Oracle Exalytics In-Memory Machine is Oracle’s “engineered system” for business intelligence and analytics. Typically used alongside Oracle Exadata Database Machine, an analogy is that Exalytics is the “set top box” to Exadata’s 50″ flat-screen TV, in that it provides query acceleration and highly-interactive visuals to accompany the terabytes of data typically managed by an Exadata Database Machine server. So far on the blog we’ve mostly talked about Exalytics in the context of OBIEE, but it also hosts Oracle Essbase (a multi-dimensional OLAP server) and is certified for use with Oracle Endeca Information Discovery, Oracle’s discovery/analytics tool for unstructured, semi-structured and structured data. Another way of thinking of Exalytics (idea courtesy of Oracle’s Jack Berkowitz, who looks after the BI Presentation Server part of OBIEE) is that it’s like the Akamai web caching service: for a single end-user Akamai generally provides faster page delivery than Oracle could provide from its own web servers, but it comes into its own when there are 10,000 or 1m people trying to access Oracle’s website at a time. Akamai’s cache, like Exalytics’ cache, guarantees fast service when user numbers scale beyond just a few test users, due in Exalytics’ case to the TimesTen in-memory database that provides a mid-tier cache between OBIEE’s BI Server and Presentation Server and the various data sources accessed in the dashboard.

[Image: Exalytics within the Oracle Tech Stack]

As I mentioned before though, Exalytics also supports Essbase and the rest of the EPM product stack (where the product runs on Linux), with Essbase included in the Oracle BI Foundation Suite product bundle that comes with the base Exalytics server. Exalytics, from version 1.1, is also certified to run Endeca Information Discovery, details of which are in a series of blog posts that you can read on Rittman Mead’s Endeca homepage here. In fact, this wide range of query tools and analytic engines is one of the four pillars of Oracle’s current business analytics strategy, which, as the diagram below shows, covers data from any source, analytics using multiple query tools and engines, packaged applications, and delivery via the web, mobile, desktop, or embedded in business processes and applications.

[Image: Oracle’s Analytics Strategy]

“Big Data” is, as I’m sure most readers will be aware, along with Cloud the current buzzword and hot topic within Oracle and the wider IT world, and refers to much larger data sets than we’re used to with relational databases, holding much more granular data such as meter readings, bus movements, sensor data and the like. The interest in big data comes from its ability to provide us with much more context about people, activities and events of interest than we get with traditional data such as sales figures and product inventories, and it is now made possible by rising server specs coupled with a bunch of new database and analysis techniques that eschew regular SQL and relational stores in favour of file-based databases, “NoSQL”-type languages and distributed processing tools that first crunch numbers and then extract useful information (Hadoop and MapReduce, for example). Oracle of course have put together a bunch of products to address big data requirements, including another engineered system called Oracle Big Data Appliance, which couples a third-party distribution of Hadoop and MapReduce with new Oracle products such as Oracle NoSQL Database and Oracle R Enterprise; Big Data Appliance therefore sits alongside Exadata as the second part of Oracle’s engineered systems data management hardware/software product set.

[Image: Big Data Appliance and Exadata]

The idea here then is that Big Data Appliance acts as a data gatherer/processor/cruncher for a big-data-enabled analytics environment, with Big Data Appliance linked to Exadata via InfiniBand, and ODI, via Oracle’s Big Data Adapters, taking nuggets of pre-processed data from Big Data Appliance and then loading them into Exadata for later analysis by Exalytics, Endeca or Oracle RTD.

[Image: Oracle’s Big Data Topology]

Big Data Appliance is mostly concerned with acquiring and organising data from unstructured sources, then processing it into a structured form (via Hadoop and MapReduce) for loading into Exadata, or querying via tools such as Endeca and Oracle Real-Time Decisions. But Big Data Appliance also comes with something called R, and Oracle have also recently released a new database option called the Advanced Analytics Option that comes with Oracle R Enterprise, Oracle’s added-value version of R that leverages the scale and capacities of the Oracle Database. So what is the Advanced Analytics Option and what is Oracle R Enterprise, and what can these new analytic capabilities provide for the BI Developer? We’ll look at this topic in more detail in the second posting in this series, tomorrow.