Tag Archives: Oracle Endeca

Introducing Oracle Big Data Discovery Part 2: Data Transformation, Wrangling and Exploration

In yesterday’s post I looked at Oracle Big Data Discovery and how it brought the search and analytic capabilities of Endeca to Hadoop. We looked at how the Oracle Endeca Information Discovery Studio application works with a version of the Endeca Server engine to analyse and visualise sample sets of data from the Hadoop cluster, and how it uses Apache Spark to retrieve data from Hadoop and then transform that data to make it more suitable for data discovery and data analysis applications. Oracle Big Data Discovery is designed to work alongside ODI and GoldenGate for Big Data once you’ve decided on your main data flows, and Oracle Big Data SQL for BI tool and application access to the entire “data reservoir”. So how does Big Data Discovery work, and what role does it play in the overall big data project workflow?

The best way to think of Big Data Discovery, to my mind, is “Endeca on Hadoop”. Endeca Information Discovery had three main parts: data loading, performed using Endeca Information Discovery Integrator and, more recently, the personal data upload feature in Endeca Information Discovery Studio; the Endeca Server engine, into which data was ingested and stored in a key/value-store NoSQL database, indexed, parsed and enriched; and the Studio graphical user interface, used to analyse the results. As I explained in more detail in my first post in the series yesterday, Big Data Discovery runs the Studio and DGraph (Endeca Server) elements on one or more dedicated nodes, reading data in from Hadoop and writing it back in transformed form using Apache Spark.

As the data discovery and analysis features in Big Data Discovery rely on getting data into the DGraph (Endeca Server) engine first of all, this implies two things: first, we’ll need to take a subset or sample of the entire Hadoop dataset and load just that into the DGraph engine, and second, we’ll need some means of transforming and “massaging” that data so it works well as a data discovery set, and then writing those changes back to the full Hadoop dataset if we want to use it with some other tool – OBIEE or Big Data SQL, for example. To see how this process works, let’s use the same Rittman Mead Apache webserver logs that I’ve used in my previous examples, and bring that data and some additional reference data into Big Data Discovery.

The log data from the RM webserver is in Apache Combined Log Format and a sample of the rows looks like this:

For data to be eligible for ingestion into Big Data Discovery, it has to be registered in the Hive Metastore, with its metadata made available to external tools via the HCatalog service. This means you need to have already created a Hive table over each datasource, either pointing the table at regular fixed-width or delimited files, or using a SerDe to translate another file format – say, a compressed/column-store format like Parquet – into a format that Hive can understand. In our case I can use the RegEx SerDe that I first used in this blog post a while ago to create a Hive table over the log file and split out the various log file elements, with the resulting DDL looking like this:

CREATE EXTERNAL TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE
LOCATION '/user/oracle/rm_logs';

If I registered the SerDe with Big Data Discovery I could ingest the table and file at this point, but instead I can use a Hive CTAS statement to remove the dependency on the SerDe and ingest into BDD without any further configuration:

create table access_logs as
select * 
from apachelog;

At this point, if you’ve got the BDD Hive Table Detector running, it should pick up the presence of the new Hive table and ingest it into BDD (you can whitelist table names, and restrict it to certain Hive databases if needed). Alternatively, you can manually trigger the ingestion from the Data Processing CLI on the BDD node, like this:

[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t access_logs;

The data processing step then creates an Apache Oozie job to load a statistically relevant sample of the data into Apache Spark – with a 1% sample providing 95% sample accuracy – which is then profiled, enriched and loaded into the Big Data Discovery DGraph engine for further transformation, exploration and analysis within Big Data Discovery Studio.
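
BDD does this sampling itself via Spark, but as a rough illustration of what a ~1% sample amounts to in Hive terms, a hand-rolled equivalent might look like the statement below – a sketch only, reusing the table name from the earlier examples rather than anything BDD actually runs:

-- Illustrative only: a hand-rolled ~1% random sample of the full log table
CREATE TABLE access_logs_sample AS
SELECT *
FROM   access_logs
WHERE  rand() <= 0.01;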

The profiling step in this process scans the incoming data and helps BDD determine the datatype of each Hive table column, the distribution of values within each column and so on, whilst the enrichment part identifies key words, phrases and other lexical facts about the dataset. A key concept here is that BDD typically works with a representative sample of your Hive table contents, not the whole table, as all the data you analyse has to fit within the memory space of the DGraph engine, just as it did with Endeca Server. At some point it’s likely that the functionality of the DGraph engine will be unbundled from the Endeca Server and run natively across the actual Hadoop cluster, but for now you have to separately ingest data into the DGraph engine (which can run clustered across BDD nodes) and analyse it there. The rules of sampling, though, are that a sufficiently big sample – say, 1m rows – is considered sufficiently representative (95% in this case) regardless of the size of the full dataset, making it not really worth the effort to load a bigger sample. Bear in mind when working with a BDD dataset that you’re working with a sample, not the full set, so if a value you’re looking for is missing it might simply not be in this particular sample.
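
As a back-of-the-envelope check on that claim – this is standard sampling theory rather than anything BDD-specific – the sample size needed for a given margin of error is, for large tables, essentially independent of how big the underlying table is:

n = \frac{z^2 \, p(1-p)}{e^2} \approx \frac{1.96^2 \times 0.25}{0.01^2} \approx 9{,}600 \quad \text{(95\% confidence, } \pm 1\% \text{ margin, worst-case } p = 0.5\text{)}

so a sample of a million rows is comfortably representative whatever the size of the full dataset, and the practical constraint is DGraph memory rather than statistical accuracy.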

Once you’ve ingested the new dataset into BDD, you see it listed amongst the others that have previously been ingested, like this:

At this point you can explore the dataset, to take an initial look at the patterns and values in the dataset in its raw form.

Unfortunately, in this raw form the data in the access_logs table isn’t all that useful – details of the page request URL are mixed in with the HTTP protocol and method, for example; dates are stored as strings; details of the person accessing the site are given as an IP address rather than a geographical location, and so on. In previous examples on this blog I’ve looked at various methods to cleanse, transform and enhance the data in log file tables like this, using tools and techniques such as Hive table transformations, Pig and Apache Spark scripts, and ODI mappings, but all of these typically require some IT involvement, whereas one of the hallmarks of recent versions of Endeca Information Discovery Studio was giving power-users the ability to transform and enrich data themselves. Big Data Discovery provides tools to cleanse, transform and enrich data, with menu items for common transformations and a Groovy script editor for more complex ones, including deriving sentiment values from textual data and stripping out HTML and formatting characters from text.
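
To give a sense of what those point-and-click transformations replace, here’s the kind of cleanup you’d otherwise hand-code in Hive against the table created earlier – an illustrative sketch only, with the column and table names carried over from the previous examples, and the IP-to-geography lookup left out as it needs an external reference table:

-- Illustrative sketch: splitting the request string and parsing the timestamp by hand
CREATE TABLE access_logs_enhanced AS
SELECT host,
       -- strip the square brackets and convert the Apache timestamp into a standard format
       from_unixtime(unix_timestamp(regexp_replace(`time`, '\\[|\\]', ''),
                                    'dd/MMM/yyyy:HH:mm:ss Z'))     AS request_time,
       -- split the quoted request into HTTP method, page URL and protocol
       regexp_extract(request, '([A-Z]+) ([^ ]+) (HTTP[^ "]+)', 1) AS http_method,
       regexp_extract(request, '([A-Z]+) ([^ ]+) (HTTP[^ "]+)', 2) AS page_url,
       regexp_extract(request, '([A-Z]+) ([^ ]+) (HTTP[^ "]+)', 3) AS http_protocol,
       status,
       size,
       referer,
       agent
FROM   access_logs;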

Once you’ve finished transforming and enriching the dataset, you can either save (commit) the changes back to the sample dataset in the BDD DGraph engine, or you can use the transformation rules you’ve defined to apply those transformations to the entire Hive table contents back on Hadoop, with the transformation work being done using Apache Spark. Datasets are loaded into “projects” and each project can have its own transformed view of the raw data, with copies of the dataset being kept in the BDD DGraph engine to represent each team’s specific view onto the raw datasets.

In practice I found that, in the product’s current state, this didn’t completely replace the need for a Hadoop developer or R data analyst – you need to get your data files into Hive and HCatalog at the start, which involves parsing and interpreting semi-structured data files, and I often did some transformations in BDD, applied them to the whole Hive dataset and then re-imported the results back into BDD to start from a simple known state. But it certainly made tasks such as turning IP addresses into countries and cities, splitting out URLs and removing HTML tags much easier, and I got the data cleansing process done in a matter of hours compared to the days it would have taken with manual Hive, Pig and Spark scripting.

Now the data in my log file dataset is much more usable and easy to understand, with URLs split out, status codes grouped into high-level descriptors, and other descriptive and formatting changes made.

I can also at this point bring in additional datasets, either created manually outside of BDD and ingested into the DGraph from Hive, or uploaded manually using the Studio interface. These dataset uploads live in the BDD DGraph engine, and are then written back to Hive for long-term persistence or for sharing with other tools and processes.

These datasets can then be joined to the main dataset on matching dataset columns, giving you a table-join interface not unlike OBIEE’s physical model editor.
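
If you’d rather prepare the same kind of join in Hive before ingestion, the equivalent is a straightforward SQL join – again a sketch only, where the page_categories lookup table is hypothetical and access_logs_enhanced is the transformed table sketched earlier:

-- Illustrative sketch: joining the log data to a hypothetical reference dataset
CREATE TABLE access_logs_categorised AS
SELECT l.*,
       c.category
FROM   access_logs_enhanced l
LEFT OUTER JOIN page_categories c
  ON   l.page_url = c.page_url;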

So now we’re in a position where our datasets have been ingested into BDD, and we’ve cleansed, transformed and joined them into a combined web activity dataset. In tomorrow’s final post I’ll look at the data visualisation part of Big Data Discovery and see how it brings the capabilities of Endeca Information Discovery Studio to Hadoop.

Introducing Oracle Big Data Discovery Part 1: “The Visual Face of Hadoop”

Oracle Big Data Discovery was released last week, the latest addition to Oracle’s big data tools suite that includes Oracle Big Data SQL, ODI and its Hadoop capabilities, and Oracle GoldenGate for Big Data 12c. Introduced by Oracle as “the visual face of Hadoop”, Big Data Discovery combines the data discovery and visualisation elements of Oracle Endeca Information Discovery with data loading and transformation features built on Apache Spark, to deliver a tool aimed at the “Discovery Lab” part of the Oracle Big Data and Information Management Reference Architecture.

Most readers of this blog will probably be aware of Oracle Endeca Information Discovery, based on the Endeca Latitude product acquired as part of the Endeca acquisition. Oracle positioned Endeca Information Discovery (OEID) in two main ways: on the one hand as a data discovery tool for textual and unstructured data that complemented the more structured analysis capabilities of Oracle Business Intelligence, and on the other hand as a fast click-and-refine data exploration tool similar to Qlikview and Tableau.

The problem for Oracle, though, was that data discovery against files and documents is a bit of a “solution looking for a problem” and doesn’t have a naturally huge market (especially considering the license cost of OEID Studio and the Endeca Server engine that stores and analyses the data), whereas Qlikview and Tableau are significantly cheaper than OEID (at least at the start) and are more focused on BI-type tasks – making OEID a good tool, but not one with a mass market. To address this, whilst OEID will continue as a standalone tool, the data discovery and unstructured data analysis parts of OEID are making their way into this new product called Oracle Big Data Discovery, whilst the fast click-and-refine features will surface as part of Visual Analyzer in OBIEE12c.

More importantly, Big Data Discovery will run on Hadoop, making it a solution for a real problem – how to catalog, explore, refine and visualise the data in the data reservoir, where data has been landed that might be in schema-on-read databases and might need further analysis and understanding, and where users need large-scale tooling to extract the nuggets of information that in time make their way into the “Execution” part of the Big Data and Information Management Reference Architecture. As someone who’s admired the technology behind Endeca Information Discovery but sometimes struggled to find real-life use-cases or customers for it, I’m really pleased to see its core technology applied to a problem space that I’m encountering every day with Rittman Mead’s customers.

In this first post, I’ll look at how Big Data Discovery is architected and how it works with Cloudera CDH5, the Hadoop distribution we use with our customers (Hortonworks HDP support is coming soon). In the next post I’ll look at how data is loaded into Big Data Discovery and then cataloged and transformed using the BDD front-end; then finally, we’ll take a look at exploring and analysing data using the visual capabilities of BDD, evolved from the Studio tool within OEID. Oracle Big Data Discovery 1.0 is now GA (Generally Available), but as you’ll see in a moment you do need a fairly powerful setup to run it, at least until such time as Oracle release a compact install version running on a VM.

To run Big Data Discovery you’ll need access to a Hadoop install, which in most cases will consist of anywhere from 6 (the product minimum is 3 or 4 nodes, but 6 is the minimum we use) to 18 or so Hadoop nodes running Cloudera CDH5.3. BDD generally runs on its own server nodes and can itself be clustered, but for our setup we ran 1 BDD node alongside 6 CDH5.3 Hadoop nodes.

Oracle Big Data Discovery is made up of three component types, two of which typically run on their own dedicated BDD nodes and one of which runs on each node in the Hadoop cluster (though there are various install types, including everything on one node for demo purposes):

  • The Studio web user interface, which combines the faceted search and data discovery parts of Endeca Information Discovery Studio with a lightweight data transformation capability
  • The DGraph Gateway, which brings Endeca Server search/analytics capabilities to the world of Hadoop, and
  • The Data Processing component that runs on each of the Hadoop nodes, and uses Hive’s HCatalog feature to read Hive table metadata and Apache Spark to load and transform data in the cluster

The Studio component can run across several nodes for high-availability and load-balancing, whilst the DGraph element can run on a single node as I’ve set it up, or in a cluster with a single “leader” node and multiple “follower” nodes, again for enhanced availability and throughput. The DGraph part then works alongside Apache Spark to run intensive search and analytics on subsets of the whole Hadoop dataset, with sample sets of data being moved into the DGraph engine and any resulting transformations then being applied to the whole Hadoop dataset using Apache Spark. All of this runs as part of the wider Oracle Big Data product architecture, which uses Big Data Discovery and Oracle R for the discovery lab, and Oracle Exadata, Oracle Big Data Appliance and Oracle Big Data SQL to take discovery lab innovations to the wider enterprise audience.

So how does Oracle Big Data Discovery work in practice, and what’s a typical workflow? How does it give us the capability to make sense of structured, semi-structured and unstructured data in the Hadoop data reservoir, and how does it look from the perspective of an Oracle Endeca Information Discovery developer, or an OBIEE/ODI developer? Check back for the next parts in this three part series where I’ll first look at the data transformation and exploration capabilities of Big Data Discovery, and then look at how the Studio web interface brings data discovery and data visualisation to Hadoop.

Endeca Information Discovery Integration with Oracle BI Apps 11.1.1.8.1

One of the new features that made its way into Oracle BI Apps 11.1.1.8.1, and that completely passed me by at the time, was integration with Oracle Endeca Information Discovery 3.1 via a new ODI11g plug-in. As this new integration brings Endeca’s “faceted search” and semi-structured data analysis capabilities to the BI Apps, it’s actually a pretty big deal, so let’s take a quick look at what this integration brings and how it works under the covers.

Oracle BI Apps 11.1.1.8.1 integration with Oracle Endeca Information Discovery 3.1 requires OEID to be installed either on the same host as BI Apps or on its own server, and uses a custom ODI11g KM called “IKM SQL to Endeca Server 3.1.0” that needs to be downloaded separately from Oracle’s eDelivery site. The KM download also comes with an ODI Technology XML import file that adds Endeca Server as a technology type in the ODI Topology Navigator, and it’s here that the connection to the OEID 3.1 server is set up for subsequent data loads from BI Apps 11g.

As the IKM has “SQL” specified as its loading technology, any source that can be extracted from using standard JDBC drivers can be used to load the Endeca Server, and the KM itself has options for creating the schema to go with the data upload, creating the Endeca Server record spec, specifying the collection name and so on. BI Apps 11.1.1.8.1 ships with a number of interfaces (mappings) and ODI load plans to populate Endeca Server domains for specific BI Apps fact groups, with each interface taking all of the relevant repository fact measures and dimension attributes and loading them into one big denormalized Endeca Server schema.

When you take a look at the physical (flow) tab for the interface, you can see that the data is actually extracted via the BI Apps RPD and the BI Server, then staged in the BI Apps DW database before being loaded into the Endeca Server domain. Column metadata in the BI Server repository is then used to define the datatypes, names, sizes and other details of the Endeca Server domain schema, an improvement over just pointing Endeca Information Discovery Integrator at the BI Apps database schema.

The 11.1.1.8.1 ODI repository also provides a pre-built Endeca Server load plan that allows you to turn on and turn off loads for particular fact groups, and sets up the data domain name and other variables needed for the Endeca Server data load.

Once you’ve loaded the BI Apps data into one or more Endeca Server data domains, there are a number of pre-built sample Endeca Studio applications you can use to get a feel for the capabilities of the Endeca Information Discovery Studio interface. For example, the Student Information Analytics Admissions and Recruiting sample application lets you filter on any attribute from the dataset on the left-hand side, and then instantly returns the filtered dataset displayed in the form of a number of graphs, word clouds and other visualizations.

Another interesting sample application is around employee expense analysis. Using features such as word clouds, it’s relatively easy to see which expense categories are being used the most, and you can use Endeca Server’s search and text parsing capabilities to extract keywords and other useful bits of information out of free-text areas, for example the supporting text for expense claims.

Typically, you’d use Endeca Information Discovery as a “dig around the data, let’s look for something interesting” tool, with all of your data attributes listed in one place and automatic filtering and focusing on the filtered dataset using the Endeca Server in-memory search and aggregation engine. The sample applications that ship with 11.1.1.8.1 are just meant as a starting point though (and unlike the regular BI Apps dashboards and reports, aren’t themselves supported as an official part of the product), but as you can see from the Manufacturing Endeca Studio application below, there’s plenty of scope to create returns, production and warranty-type applications that let you get into the detail of the attribute and textual information held in the BI Apps data warehouse.

More details on the setup process are in the online docs, and if you’re new to Endeca Information Discovery take a look at our past articles on the blog on this product.

Rittman Mead BI Forum 2015 Call for Papers Now Open!

I’m very pleased to announce that the Call for Papers for the Rittman Mead BI Forum 2015 is now open, with abstract submissions accepted until January 18th 2015. As in previous years, the BI Forum will run over consecutive weeks in Brighton, UK and Atlanta, GA, with the provisional dates and venues as below:

  • Brighton, UK : Hotel Seattle, Brighton, UK : May 6th – 8th 2015
  • Atlanta, GA : Renaissance Atlanta Midtown Hotel, Atlanta, USA : May 13th-15th 2015

Now in its seventh year, the Rittman Mead BI Forum is the only conference dedicated entirely to Oracle Business Intelligence, Oracle Business Analytics and the technologies and processes that support them – data warehousing, data analysis, data visualisation, big data and OLAP analysis. We’re looking for sessions around tips & techniques, project case-studies and success stories, and sessions where you’ve taken Oracle’s BI products and used them in new and innovative ways. Each year we select around eight to ten speakers for each event, along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January and editorial input from myself, Jon Mead, Charles Elliott and Jordan Meyer.

Last year we had a big focus on cloud, and a masterclass and several sessions on bringing Hadoop and big data to the world of OBIEE. This year we’re interested in project stories and experiences around cloud and Hadoop, and we’re keen to hear about any Oracle BI Apps 11g implementations or migrations from the earlier 7.9.x releases. Getting back to basics, we’re always interested in sessions around OBIEE, Essbase and data warehouse data modelling, and we’d particularly like to encourage session abstracts on data visualization, BI project methodologies and the incorporation of unstructured, semi-structured and external (public) data sources into your BI dashboards. For an idea of the types of presentations that have been selected in the past, check out the BI Forum 2014, 2013 and 2012 homepages, or feel free to get in touch via email at mark.rittman@rittmanmead.com.

The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues if you can speak at both. All sessions this year will be 45 minutes long, and we’ll be publishing submissions and inviting potential attendees to vote on their favourite sessions towards the end of January. Other than that – have a think about abstract ideas now, and make sure you get them in by January 18th 2015.

News and Updates from Oracle Openworld 2014

It’s the Saturday after Oracle Openworld 2014, and I’m now home from San Francisco and back in the UK. It’s been a great week as usual, with lots of product announcements and updates to the BI, DW and Big Data products we use on current projects. Here’s my take on what was announced this last week.

New Products Announced

From a BI and DW perspective, the most significant product announcements were around Hadoop and Big Data. Up to this point most parts of an analytics-focused big data project required you to code the solution yourself, with the diagram below showing the typical three steps in a big data project – data ingestion, analysis and sharing the results.

At the moment, all of these steps are typically performed from the command-line using languages such as Python, R, Pig, Hive and so on, with tools like Apache Flume and Apache Sqoop used to bring data into and out of the Hadoop cluster. Under the covers, these tools use technologies such as MapReduce or Spark to do their work, automatically running jobs in parallel across the cluster and making use of the easy scalability of Hadoop and NoSQL databases.

You can also neatly divide the work on a big data project into two phases: the “discovery” phase, typically performed by a data scientist, where data is loaded, analysed, correlated and otherwise “understood” to provide the initial insights, and then an “exploitation” phase where we apply governance, provide the output data in a format usable by BI tools and otherwise share the results with the wider corporate audience. The updated Information Management Reference Architecture we collaborated on with Oracle and launched in June this year had distinct discovery and exploitation phases, and the architecture itself made a clear distinction between the Innovation part that enabled the discovery phase of a project and the Execution part that delivered the insights and data in a more governed, production setting.

This was the theme of the product announcements around analytics, BI, data warehousing and big data during Openworld 2014, with Oracle’s Omri Traub taking us through Oracle’s big data product strategy. What Oracle are doing here is productising and “democratising” big data, putting it clearly in the context of their existing database, engineered systems and BI products and linking them all together into an overall information management architecture and delivery process.

So, working through from ingestion to data analysis, these steps have typically been performed by data scientists using scripting tools and rudimentary data visualisation engines, making them labour-intensive and reliant on a small set of people conversant with these tools and processes. Oracle Big Data Discovery is aimed squarely at these steps, and combines Apache Spark-based data preparation and transformation capabilities with an analysis and visualisation engine based on Endeca Server.

Key features of Big Data Discovery include:

  • Ability to analyse, parse, explore and “wrangle” data using graphical tools and a Spark-based transformation engine
  • Create a catalog of the data on your Hadoop cluster, and then search that catalog using Endeca Server search technologies
  • Create recommendations of other datasets that might interest you, based on what you’re looking at now
  • Visualize your datasets to help understand what they contain, and discover new insights

Under the covers it comprises two parts: the data loading, transformation and profiling part, which uses Apache Spark to do its work in parallel across all the nodes in the cluster, and the analysis part, which takes data prepared by Apache Spark and loads it into the Endeca Server in-memory engine to perform the analysis, aggregation and data visualisation. Unlike the Spark part, the Endeca Server element runs on just one node and limits the size of the analysis dataset to what can fit in-memory in the Endeca Server engine, but in practice you’re going to work with a sample of the data rather than the entire dataset at that stage (in time the assumption is that the Endeca Server engine will be unbundled and run natively on YARN, giving it the same scalability as the Spark-based data ingestion and transformation part). Initially Big Data Discovery will run on-premise with a cloud version to follow, and it’s not dependent on Big Data Appliance – expect to see something later this year / early next year.

Another new product that addresses the discovery phase and discovery lab part of a big data project is Oracle Data Enrichment Cloud Service (ODECS), which comes from the Oracle Data Integration team and is designed to complement ODI and Oracle EDQ. Whilst Oracle positioned ODECS as something you’d use alongside Big Data Discovery and typically upstream from BDD, to me there seemed to be a fair bit of overlap between the products, with both tools doing data profiling and transformation but BDD being more focused on the exploration and discovery part, and ODECS being more focused on early-stage data profiling and transformation.

ODECS is clearly more of a complement to an ETL tool, and runs natively in the cloud right from the start. It’s most probably aimed at customers with their Hadoop dataset already in the cloud, maybe using Amazon Elastic MapReduce or Oracle’s new Hadoop-as-a-Service, and has more in common with the old Data Quality Option for Oracle Warehouse Builder than with Endeca’s search-first analytic interface. It’s got a very nice interface, including a mobile-enabled website and the ability to include and merge in external datasets, including Oracle’s own Data as a Service platform offering. Along with the new Metadata Management tool Oracle also launched at Openworld, it’s a great addition to the Oracle Data Integration product suite, but I can’t help thinking that its initial availability only on Oracle’s public cloud platform is going to limit its use with Oracle’s typical customers – we’ll just have to wait and see.

The other major product announcement addressing big data projects was Oracle Big Data SQL. Partly addressing the discovery phase of big data projects but mostly (to my mind) addressing the exploitation phase, and the execution part of the information management architecture, Big Data SQL gives Oracle Exadata the ability to return data from Hive and NoSQL stores on the Big Data Appliance as well as data from its normal relational store. I covered Big Data SQL on the blog a few weeks ago and I’ll be posting some more in-depth articles on it next week, but the other main technical innovation in the product is that it brings Exadata’s SmartScan feature to Hadoop, projecting and filtering data at the Hadoop storage node level and also giving Hadoop the ability to understand regular Oracle SQL, rather than the cut-down version you get with HiveQL.
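
For anyone who hasn’t seen Big Data SQL yet, the basic mechanism is an Oracle external table defined over a Hive table using the ORACLE_HIVE access driver, which you can then query (and join to relational tables) with ordinary Oracle SQL. The sketch below broadly follows the documented syntax, with the table, column and directory names just examples carried over from the earlier posts rather than taken from a real system:

-- Sketch only: an Oracle external table over the Hive access_logs table
CREATE TABLE access_logs_ora (
  host    VARCHAR2(100),
  request VARCHAR2(4000),
  status  VARCHAR2(10)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.access_logs)
)
REJECT LIMIT UNLIMITED;

-- ...which can then be queried with regular Oracle SQL
SELECT status, COUNT(*) FROM access_logs_ora GROUP BY status;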

Where this then leaves us is with the ability to do most of a big data project using (Oracle) tools, bringing big data analysis within reach of organisations with Oracle-style budgets but without access to rare data scientist-type resources. Going back to my diagram earlier, a post-OOW big data project using the new products launched in this last week could look something like this:

Big Data SQL is out now and depends on BDA and Exadata for its use; Big Data Discovery should be out in a few months time, runs on-premise but doesn’t require BDA, whilst ODECS is cloud-only and runs on a BDA in the background. Expect more news and more integration/alignment from the products as 2014 ends and 2015 starts, and we’re looking forward to using them on Oracle-centric Hadoop projects in the near future. 

Product Updates for BI, Data Integration, Exalytics, BI Applications and OBIEE

Other news announced over the week for products we more commonly use on projects include:

Finally, something that we were particularly pleased to see was the updated Oracle Information Management Architecture I mentioned earlier being referenced in most of the analytics sessions, with Oracle’s Balaji Yelamanchili, for example, introducing it in his big data and business analytics general session mid-way through the week.

We love the way this brings together the big data components and puts them in the context of the wider data warehouse and analytic processes; compared to a few years ago, when Hadoop and big data were considered completely separate to data warehousing and BI and were handled by staff completely separate from the core business analytics team, this new reference architecture puts them squarely within the world of BI and analytics we work in. It also emphasises the new capabilities Hadoop, NoSQL databases and big data can bring us – support for wider sets of data sources with dynamic schemas, the ability to economically work with and analyse much larger datasets, and support for discovery-type upfront analysis work. Finally, it recognises that to get true value out of the analysis you start on Hadoop, you eventually need to add proper data governance, make the results more widely available using full SQL tools, and use the right tools – relational databases, OLAP servers and the like – to analyse the data once it’s in a more structured form.

If you missed our write-up on the updated Information Management Reference Architecture you can read our two-part blog post here and here, read the Oracle white paper, or listen to the podcast with OTN ArchBeat’s Bob Rhubart. For now though, I’m looking forward to seeing the family after a week and a half away in San Francisco – thanks to OTN and the Oracle ACE Director Program for sponsoring my visit over to SF for Openworld, and we’ll post our conference presentation slides later next week when we’re back in the UK and US offices.