Tag Archives: Oracle BI Suite EE

OBIEE, Cloudera Hadoop & Hive/Impala Part 1 : Install and Set-up an EC2 Hadoop Cluster

I’ve been over in San Francisco this last week for BIWA Summit 2014, and one of the things I demo’d during the week was OBIEE connecting to a Hadoop cluster running on Amazon EC2, and analysing the flight delays dataset that ships with recent SampleApps and Exalytics. There are quite a few interesting steps and concepts in setting this up, so I thought it’d be worth going through them on the blog, so that others can have a try if they’re interested. Don’t take this as a definitive, 100%-complete set of steps you’ll need to work through to set up the example – I’m currently writing this in the BA lounge at SFO, trying to get it finished before my flight leaves, and I might have inadvertently missed a couple of steps – but this should give you the gist of what’s involved and show what’s possible.

What the example will do is create the following setup:


In this setup, we’ll initially create an Amazon EC2 instance that we’ll then install the free version of Cloudera Manager 4.5 onto. Cloudera are a company that have created a distribution of Hadoop which they then sell alongside their own management tools (similar to how Red Hat took Linux, made it “enterprise” and sold software and services around it), but who also provide a freely-downloadable version of their tools (“Cloudera Standard”) that has special setup routines when run on Amazon EC2.

We’ll then use this install of Cloudera Manager to automatically create and provision four Amazon EC2 instances which we’ll then install Hadoop onto, along with other tools like Impala (for in-memory SQL access over the cluster), Hive, HDFS and so on. Then, in the second part of this two-part series, we’ll upload some data from the Flight Delays dataset into the cluster, connect OBIEE to it via the Cloudera Impala ODBC drivers, and analyse it from Answers. I’m assuming with this that you’ve got some familiarity with Amazon AWS, EC2 and the rest of their cloud platform, and that you’ve got yourself set up with an account, your secret access keys and so on – if not, do that first before you try any of these steps.

Let’s start by setting up the initial EC2 virtual server instance onto which we’ll install Cloudera Manager.

Installing the EC2 Hadoop Cluster

1. What we’re going to do before anything else is create what’s called a “security group”, a collection of firewall settings that we’ll apply to the Cloudera Manager virtual server so that it can then connect out to the nodes it’s going to set up to run Hadoop (and so that we can connect to it to run the web interface). To do this, log into the AWS Management Console, and from the Amazon Web Services menu navigate to EC2 > Network & Security > Security Groups.

Then when the Security Groups page is displayed, press the Create Security Group button, then enter the following details when prompted:

Name : CDH-Manager
Description : Security group for CDH4 Manager instance

Then, with this new security group selected, use the Add Rule button to add the following inbound rules:

SSH  :
7180 :
7182 :
7183 :
7432 :
Custom ICMP rule : Echo Reply

Once you’ve done this, the security group area should look like this:


Then, press the Apply Rule Changes button to register the security settings.
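
If you’d rather script step 1 than click through the console, the inbound rules above can be written down as plain data. Treat the sketch below as an illustration only: the function name `build_ingress_rules` and the `source_cidr` parameter are made up for this example, the source address range is left for you to supply, and the dictionary layout is intended to mirror the `IpPermissions` structure used by the AWS SDK for Python (boto3), so check it against the SDK documentation before relying on it.

```python
# Hypothetical helper: describes the step-1 inbound rules as plain data.
# Port numbers come from the list above; SSH is TCP port 22.
CDH_MANAGER_TCP_PORTS = [22, 7180, 7182, 7183, 7432]
ICMP_ECHO_REPLY_TYPE = 0  # ICMP type 0 is "Echo Reply"


def build_ingress_rules(source_cidr):
    """Return the CDH-Manager inbound rules for one source address range.

    source_cidr is an assumption: the source range is not stated in the
    steps above, so the caller has to choose it.
    """
    rules = [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": source_cidr}],
        }
        for port in CDH_MANAGER_TCP_PORTS
    ]
    # For ICMP, EC2 uses FromPort for the ICMP type and ToPort for the
    # ICMP code (-1 meaning any code).
    rules.append(
        {
            "IpProtocol": "icmp",
            "FromPort": ICMP_ECHO_REPLY_TYPE,
            "ToPort": -1,
            "IpRanges": [{"CidrIp": source_cidr}],
        }
    )
    return rules


if __name__ == "__main__":
    # 203.0.113.0/24 is a documentation-only range, used as a placeholder.
    for rule in build_ingress_rules("203.0.113.0/24"):
        print(rule["IpProtocol"], rule["FromPort"], rule["ToPort"])
```

Keeping the rules as data like this also makes it easy to review exactly which ports you are opening before anything is applied.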

2. Next we’ll create an Amazon EC2 virtual server instance to run Cloudera Manager on, using this security group’s settings to ensure the right ports are open – then we’ll use that instance and its install of Cloudera Manager to set up the Hadoop cluster.

To do this with the EC2 Dashboard web page still open, click on the Instances menu item on the left-hand side of the page, then press Launch Instance, noting the EC2 region you’ll be working in at the same point (for me, it’s the EU Ireland region).

For this initial virtual server instance, use the Ubuntu Server 12.04 LTS 64-bit image – Cloudera Manager 4.5 Free can install onto either Ubuntu or CentOS, and will adjust what it installs accordingly, so for now let’s select Ubuntu.


Then, when prompted, select the m1.medium instance type, and on the Step 7: Review Instance Launch page, select the security group you created a moment ago for the instance’s security group settings. Once done, press the Launch button, create or select an SSH key pair and then download that key pair to your local laptop or PC so you can connect to the virtual server once it’s spun-up.

3. Now you need to SSH into this new EC2 virtual server and download the Cloudera Manager software to it, to then create the Hadoop cluster. To do this, first make a note of the instance name that the EC2 launch instance process gave you, like this:


Click on that link to then show the status of the virtual server, and more importantly, its public DNS address. Once the virtual server shows a status of “running”, you can then SSH into it and download and run the Cloudera Manager software; note that “EC2-cluster.pem” is the name of the keypair I created in the previous steps, and this file will need to be “chmod 400”-protected before EC2 and SSH will let you use it – see this blog article on setting up EC2 command-line access on the Mac for example details.

To SSH into the virtual server and install and run Cloudera Manager, type in the following (using your own SSH key file name and virtual server DNS address):

ssh -i EC2-cluster.pem ubuntu@ec2-54-216-126-144.eu-west-1.compute.amazonaws.com

Then, once you’re connected, download and install Cloudera Manager like this:

wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin 

You’ll then be walked through a wizard that will get you to agree to a couple of licenses, and then download and install the Cloudera Manager software for your instance type. Note that this is something CM does when it detects it’s running on Amazon EC2 – for other types of install it’s a slightly different process.

4. Once the Cloudera Manager software install has completed, give it a couple of minutes and then use your web browser to navigate to the Cloudera Manager website, at machine-name:7180, in my case:


Log in as “admin/admin” and when prompted, select the free Cloudera Standard option. Press Continue so that you’re then presented with the Provide instance specification page. Using this page, you can select the EC2 instance size and type, the number of nodes in your cluster, and a group name for your instances. In this example, we’ll create a four-node cluster using the Ubuntu 12.04 LTS 64-bit image, select m1.large as the instance type, and call it “impala-demo-cdh”.


Then, on the Provide Credentials page, paste in your AWS access key ID and Secret Access Key, let Cloudera Manager generate a new key pair for use with the cluster (or upload your own one from before), and then press the Start Installation button on the next page to have Cloudera Manager start provisioning the cluster instances. Once the instances are created, download the additional key file and place it with the other one, “chmod 400”-ing it as before so it’ll work with SSH into EC2.

5. Once the instance provisioning completes, Cloudera Manager will then install the relevant software onto the different nodes. The Installation in Progress page will show you the progress of these installs, with the screenshot below showing it mid-way through the process.


Assuming all the cluster nodes install properly, walk through the rest of the steps to confirm what’s installed where, check all of the services are running OK and complete the process.


Configuring and Setting up Hadoop

So assuming all of the install and service startup steps went OK, what you have now is a four-node Hadoop cluster running on Amazon EC2, with additional management tools and services provided by Cloudera – think of it like a Linux distribution by Red Hat or Suse, where the core is standard open-source software and the vendor provides other complementary tools, and tools they write themselves, to enhance the product. The screenshot below is the overall summary page for your cluster, as provided by Cloudera Manager – don’t worry too much about the warnings, they’re down to log file disk space and can be ignored for this particular exercise.


If you select Services > All Services from the Cloudera Manager menu, you’ll see what’s been installed on your cluster:


Some of the key services are:

  • HDFS – the cluster filesystem that Hadoop processes data on, and that we’ll use later on to upload text files containing the flight delays data we’re going to analyse. HDFS is Unix-like in how you work with it, but it stores data redundantly across multiple nodes in the cluster, enabling parallel operations and providing fault-tolerance.
  • HBase – a NoSQL database that we won’t use here, but that stores data in key/value pairs using the HDFS filesystem
  • Hive – a SQL-like access layer over Hadoop, typically used for ETL access, and currently by OBIEE
  • Impala – a faster alternative to Hive that runs in-memory and bypasses MapReduce code creation, the thing that slows Hive down
  • Hue – a web UI that we’ll be using later on to run Hive and Impala queries, and create tables in Hive’s HCatalog
  • MapReduce – the framework and server within Hadoop that typically crunches, filters and transforms the data

Before we go into Hue to create some Hive tables, there’s one task we need to do if we’re to access this cluster via Hive – we need to install something called “HiveServer2”, a server process that the Hive ODBC drivers OBIEE uses will need in order to connect to the cluster, but that isn’t installed by default.

To install HiveServer2, from the Cloudera Manager website select Services > hive1, and then click on the Instances tab. Then, scroll across so that you can see the HiveServer2 column, locate the cluster node with the majority of services and the Hive Metastore Server installed on it, and check the checkbox to select that service for install.


Press Continue, and then back on the Role Instances page, select the new hiveserver2 service, and select Actions for Selected > Start to start the service.

Now we’re at the point where we can use Hue to set up a Hive database, upload some files and create some tables for analysis. Check back tomorrow for the second part in this series where we’ll do just that.

OBIEE Regression Testing – An Introduction

In this article I’m going to look at ways to test changes that you make to OBIEE to ensure that they don’t break existing functionality. In all but the simplest IT systems it’s common for one (planned) action to inadvertently cause another (unplanned).

What IS Regression Testing?

When we make a change to a system we use functional unit tests to ensure that it does do what it is supposed to do. We should also make sure that the same changes don’t do what they’re not supposed to, that is, cause functionality already existing in the system to change behaviour. If this does happen it is known as a regression and is something we want to ensure doesn’t happen without us knowing. Some examples of regressions seen in standard OBIEE development changes include:

  • Reports stop returning data, showing an error instead
  • Reports start to show the wrong data
  • Some combinations of dimensions and facts no longer show data, or show an error
  • Dashboards that reference a particular analysis stop working

As well as these, less common system changes can also cause regressions, for example:

  • An OBIEE version upgrade causes certain types of graph to render in a different way from the previous version
  • An OBIEE patch introduces a bug in the front end user interface

What drives Regression Testing?

The requirement for regression testing OBIEE broadly comes from two different types of change:

  • New binaries – that is, an upgrade (or patch) of OBIEE
  • New “application code” – changes to the RPD, the underlying database schema, and so on.

These two requirements have the same aim – make sure nothing breaks when we make the change – but differ in ways that make how we address them important:

  1. Frequency : OBIEE may get patched once or twice a year, and upgraded every few years. Compare this to development changes made to the RPD et al, which users would often like to see happening on a frequent basis (sometimes daily at the beginning of an implementation). If these changes are happening with great regularity then (a) we don’t want to be the ones causing the bottleneck because we can’t regression test them and thus (b) we need to find a repeatable way to perform these tests accurately and quickly.
  2. Delta Visibility : When Oracle change the OBIEE code base, we are blind as to what has gone on under the covers. Sure, we know what’s changed in the documentation, but as a starting point for “what might have broken” we can only assume everything has and test accordingly. Conversely, in a planned development we know exactly what we changed and we can therefore work out the scope of the necessary testing.

The points in bold above are what I aim to address in this article. Regression testing OBIEE doesn’t have to mean one technique alone – it can be refined based on what we know has changed.

Why Regression Test?

If you don’t regression test then you place a wager that you’ll be able to fix any problems that arise. As soon as they arise. In Production. With angry users on the phone. And the project manager screaming blue murder because their change is getting blamed for breaking everything.

This is a recipe for compounding errors upon errors, not a stable system. Testing, in all flavours, is about gaining confidence about the impact of a proposed change to a system. Functional testing reassures us that the change will do what it was designed to do. Performance testing helps us understand how a system behaves from a response time and capacity perspective. Regression testing gives us the confidence that a change, whilst doing what it ought to, isn’t going to affect something else.

The confidence in what is (and isn’t) going to happen when we deploy a change enables us to make these changes more frequently as required by the users. Instead of a long development cycle with a huge number of changes bunched in together, and one big bang test and release, we can take a more rapid, flexible, and responsive approach to development and release because we have the confidence that an individual change is going to work.

As well as giving confidence in further releases to new deployments, a good regression testing framework enables us to have confidence in making changes to long-standing “big ball of mud” systems. So long as we understand the relevant interface points in OBIEE, we can build a pass/fail test framework on top of the most complex RPD/schema.

Targeting Regression Testing Effectively

Regression testing is easy. You pay a troop of monkeys to sit at a set of computers and run every single dashboard, build every permutation of adhoc report, and if you’ve just upgraded or patched OBIEE, go through the user interface with a fine-toothed comb. After the appropriate period of several weeks, any difference they find from before your change was made is a regression. Congratulations. All you need to do now is fix the problem – and then of course, regression test your new change. So monkeys are one option, but they’re expensive (you should see the wholesale market peanut price these days), they’re not infallible (monkeys get distracted by YouTube too), and they are slow.

Better than monkeys is automated regression testing, targeted smartly at the area of OBIEE that has been changed. We will now take a look at which changes can cause regressions in which area, and from that derive a list of testing methods appropriate for each type of change made.

Regression testing points in the OBIEE stack

To understand how we can regression test OBIEE, let us look at where a regression can be detected. The following diagram illustrates the request/response flow through the components in the OBIEE stack. We can use it to see where regressions may expose themselves, and thus understand at what points we can consider testing for them.

Starting from the point of view of the actual end user:

  • The user interface may regress. There may be actual bugs that weren’t there before, or ‘regressions’ in the sense that functionality or icons/layout have changed. These changes would typically only come about through software changes (patching/upgrades).
    Regressions could also occur if you are manipulating the UI through the analysis itself (eg narrative view) and the behaviour changes, but this type of UI modification is less common.
  • Regressions caused by changes to the underlying data, RPD or analyses are going to manifest themselves through a dashboard. This could be in the data or the presentation of the data (tables, graphs, etc).
  • Considering a dashboard by its constituent parts, an individual analysis could exhibit differences in its data or the presentation of the data


Next to consider is that each analysis sends a “Logical” SQL request to the BI Server. It is not common, but it is possible that a change to the binaries (version upgrade/patch) could introduce a regression that caused the Logical SQL to be generated incorrectly. Specific changes to the RPD can also cause the Logical SQL generation to change, potentially erroneously.

The Logical SQL that is generated is executed by the BI Server which in turn returns the requested logical resultset data. This resultset may expose a regression in how the BI Server is handling the logical request.

A “Logical” SQL request on the BI Server is parsed through the metadata layer, the RPD, and one or more “Physical” SQL statements are sent to the underlying data source(s). An error in the RPD could result in the Physical SQL being generated incorrectly.

Finally, each “Physical” SQL request at the data source returns data back to the BI Server. Any errors in changes to the physical sources will show themselves through the physical query failing, or the results being incorrect.

Regression testing opportunities

To summarise the previous section, our testing points for regression are as follows.

  1. The logical query generated by Presentation Services for an analysis
  2. The physical query/queries generated by the BI Server to retrieve the data from the data source(s)
  3. The data supplied by the data source to the BI server
  4. The data supplied by the BI server for an analysis (logical resultset)
  5. User interface, including the dashboard/analysis, taking into account both rendered data and presentation/UI.

Regression testing is based around comparing one state (before a planned change) to another (after the planned change). In considering how we are going to perform our testing, let’s take a very simplistic view on what we need to test:

  1. Does it look the same
  2. Are the numbers the same

Of these two, one is very easy to get a computer to do (and conversely, very laborious to perform manually), and the other is very difficult to explain to a computer (and relatively easy to do manually):

  • Telling a computer to fetch some data twice and compare the first result with the second is bread and butter automation.
  • Trying to explain to a computer what a page “looks” like, or what a user interface “does” is extremely time consuming, and inevitably specific to the single item in question. Of course, we can programmatically compare the underlying code for a dashboard before and after a change, but the question I pose is whether we should.
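
To show how little is involved in the first of these, here is a minimal sketch in Python of comparing a “before” and an “after” resultset. It assumes each resultset has already been fetched into a list of row tuples by some means not shown here; the function names and the sample rows are invented for illustration, and row order is deliberately ignored.

```python
from collections import Counter


def compare_resultsets(before, after):
    """Compare two resultsets (lists of row tuples), ignoring row order.

    Returns (missing, unexpected): rows present only in `before`, and rows
    present only in `after`. Duplicate rows are counted, so a row that
    appears twice before and once after is reported as missing once.
    """
    before_counts = Counter(before)
    after_counts = Counter(after)
    missing = sorted((before_counts - after_counts).elements())
    unexpected = sorted((after_counts - before_counts).elements())
    return missing, unexpected


def is_regression(before, after):
    """Any change in the data between the two runs counts as a regression."""
    missing, unexpected = compare_resultsets(before, after)
    return bool(missing or unexpected)


if __name__ == "__main__":
    # Invented sample data: (carrier, year, number of delayed flights)
    baseline = [("AA", 2008, 1200), ("BA", 2008, 340)]
    after_change = [("BA", 2008, 340), ("AA", 2008, 1250)]
    print(compare_resultsets(baseline, after_change))
```

Sorting the differences keeps the output stable between runs, but it does assume the rows can be compared with each other; rows containing NULLs would need extra handling.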

Computers are blind

The user interface for an OBIEE end user is a web browser, and OBIEE builds its web pages through a set of languages and protocols that used to be quaintly referred to as “Web 2.0”. It uses HTML, CSS, XML, and JavaScript, taking plentiful advantage of asynchronous page loading and in-flight modifications to the Document Object Model (DOM) too. AJAX is a term which certainly covers some of the magic that goes on. The resulting user interface is pretty slick with drop down menus, expanding hierarchy trees, and partial dashboard rendering as data is returned rather than waiting for all analyses to complete. All of this omits the knockout blow that is Flash, used for rendering all graph objects in OBIEE and the subject of a notable UI bug in OBIEE.

The “Developer Tools” option in modern web browsers gives us a glimpse into what is going on under the covers. We can see the number of resources that go into rendering a single page…

…and how many layers there are to the object model:

Getting a computer to interface with all of this, simulating a user interaction and parsing the response is possible with functional testing tools such as Selenium, Oracle Application Testing Suite, and HP’s QuickTest Professional. Each of these tools is capable of simulating a user (often by ‘recording’ a session as the starting point) and parsing the responses from OBIEE.

But there is a fundamental complication to using these tools. For all the AJAX/CSS/DOM magic to happen, the page that OBIEE generates is littered with element identifiers (so that the JavaScript code can identify the element to manipulate). For example, the following table cell has the ID in this particular execution of e_saw_14485_10_1_0_0:

Some of these IDs may change between report executions or sessions, but either way, cannot be relied on to be consistent. Therefore, getting our testing tool (such as Selenium) to compare the before/after results to detect a regression becomes a whole heap more tricky. It is possible to work out element paths based on their relative position within the page rather than an absolute ID, but that becomes even more page specific and complex to implement. Therefore to compare a before and after page programmatically we have to either

  • define a particular part of the page alone to check remains the same (and risk chucking the baby out with the bath water, that is, missing other genuine regressions elsewhere on the page)

or we have to

  • compile a list of elements that we expect may change but that we don’t count as a regression (i.e. exceptions).

The latter is going to be prone to causing false positives (i.e. failing regression tests that aren’t genuine regressions) because it relies on reverse engineering the full Document Object Model of the OBIEE page. All of this is also without even taking into account software patching and upgrades – so far as Oracle are concerned, how a page is rendered is their own business, and they are thus at full liberty to completely change the internal structure of a page as they desire. Given this complication, it becomes clear that building a test against a single page is time consuming, and it will typically be specific to that page only. This becomes a bigger problem the greater the scale of the deployment you are trying to test. Hardcoding the testing for one specific page might be fine, but given more than a handful of pages you risk ending up with a large inflexible regression test code base (that itself may become error prone and need regression testing when it’s changed…).
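
As a rough illustration of the “exceptions” approach, and of why it is fragile, the sketch below masks element IDs before comparing two copies of a page. The regular expression is an assumption based solely on the one example ID shown earlier (e_saw_14485_10_1_0_0); I have not verified the full range of ID formats that OBIEE generates, and the function names and HTML snippets are hypothetical.

```python
import re

# Assumed pattern: "e_saw_" followed by one or more underscore-separated
# numbers, as in the example ID e_saw_14485_10_1_0_0. Any other ID format
# would need its own entry in this exceptions list.
VOLATILE_ID_PATTERNS = [
    re.compile(r"\be_saw_\d+(?:_\d+)*\b"),
]


def mask_volatile_ids(html):
    """Replace every ID matching a known-volatile pattern with a fixed token."""
    for pattern in VOLATILE_ID_PATTERNS:
        html = pattern.sub("VOLATILE_ID", html)
    return html


def pages_match(before_html, after_html):
    """Compare two page sources after masking the IDs we expect to change."""
    return mask_volatile_ids(before_html) == mask_volatile_ids(after_html)


if __name__ == "__main__":
    # Hypothetical page fragments: same cell value, different generated ID.
    before = '<td id="e_saw_14485_10_1_0_0">42</td>'
    after = '<td id="e_saw_20917_10_1_0_0">42</td>'
    print(pages_match(before, after))
```

Every new ID format means another pattern in the list, and a change to the page structure after a patch or upgrade can invalidate all of them at once, which is the maintenance burden described above.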


So, we come back to not how we test the front end but more should we, in every case? Given a finite amount of time, what are you going to get most benefit from in your regression tests? In the next post I will demonstrate one of the ways you can get the most “bang for your buck” when regression testing OBIEE, by concentrating your automation efforts on the query part of the OBIEE stack, and not the front end. Stay tuned!


Many thanks to Gianni Ceresa for his thoughts and assistance on this subject.

New Oracle Magazine Article on BI Mobile App Designer


My new article at Oracle Magazine is on Oracle BI Mobile App Designer, the new HTML5-based mobile BI tool for OBIEE built on Oracle BI Publisher technology. In the article, I walk the reader through creating a simple Mobile App Designer app, then publishing it to the Apps Library for use with iOS, Android, Blackberry and other HTML5-compatible mobile devices.

You can also read my “first look” post on BI Mobile App Designer from our blog when the feature first came out, and we’re also running a promotion where we’ll implement your first Mobile App Designer app within a week, including patching up your OBIEE installation to the required version. More details on the offer, and on BI Mobile App Designer in general, are on this QuickStart Mobile Analytic Apps for OBIEE 11g with Rittman Mead data sheet.

Rittman Mead BI Forum 2014 Call for Papers Now Open!

It’s that time of year again when we start planning out next year’s BI Forum, which like this year’s event will be running in May 2014 in Brighton and Atlanta. This will be our sixth annual event, and as with previous years the most important part is the content – and as such I’m pleased to announce that the Call for Papers for BI Forum 2014 is now open, running through to January 31st 2014.

If you’ve not been to one of our BI Forum events in past years, the Rittman Mead BI Forum is all about Oracle Business Intelligence, and the technologies and techniques that surround it – data warehousing, data analysis, big data, unstructured data analysis, OLAP analysis and this year – in-memory analytics. Each year we select around ten speakers for Brighton, and ten for Atlanta, along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January, and editorial input from myself, Jon Mead and Stewart Bryson.


Last year we had sessions on OBIEE internals and new features, OBIEE visualisations and data analysis, OBIEE and “big data”, along with sessions on Endeca, Exalytics, Exadata, Essbase and anything else that starts with an “E”. This year we’re continuing the theme, but are particularly looking for sessions on what’s hot this year and next – integration with unstructured and big data sources, use of engineered systems and in-memory analysis, advanced and innovative data visualisations, cloud deployment and analytics, and anything that “pushes the envelope” around Oracle BI, data warehousing and analytics.


The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues. We’re also looking for presenters for ten-minute “TED”-style sessions, and if you have any ideas for keynote speakers, send them directly to me at mark.rittman@rittmanmead.com. Other than that – have a think about abstract ideas now, and make sure you get them in by January 31st 2014.

Thoughts on Running OBIEE in the Cloud : Part 2 – Data Sources and ETL

In yesterday’s post on running OBIEE in the cloud, I looked at a number of options for hosting the actual OBIEE element; hosting it in a public cloud service such as Amazon EC2, using Oracle’s upcoming BI-as-a-Service offering, or partner offerings such as our upcoming ExtremeBI in the Cloud service. But the more you think about this sort of thing, the more you realise that the OBIEE element is actually the easy part – it’s what you do about data storage, security, LDAP directories and ETL that makes things more complicated.

Take the example I gave yesterday where OBIEE was run in the cloud, with the multi-tenancy option enabled, the main data warehouse in the cloud, and data sourced from cloud and on-premise sources.


In this type of setup, there are a number of things you need to consider beyond how OBIEE is hosted. For example:

  • If your corporate LDAP directory is on-premise, how do we link OBIEE to it? Or does the LDAP server also need to be in the cloud?
  • What sort of database do we use if we’re hosting it in the cloud? Oracle? If so, self-hosted in a public cloud, or through one of the Oracle DB-in-the-cloud offerings?
  • If not Oracle database, what other options are available?
  • And how do we ETL data into the cloud-based data warehouse? Do we continue to use a tool like ODI, or use a cloud-based option – or even a service such as Amazon’s AWS Data Pipeline?

What complicates things at this stage in the development of “cloud” is that most companies won’t move 100% to cloud in one go; more likely, individual applications and systems might migrate to the cloud, but for a long time we’ll be left with a “hybrid” architecture where some infrastructure stays on premise, some might sit in a public cloud, and others might be hosted on third-party private clouds. So again, what are the options?

Well, Oracle’s upcoming BI-as-a-service offering sits at one extreme end of the spectrum; the only data source it’ll initially work with is Oracle’s own database-as-a-service, which in its initial incarnation provides a single schema, with no SQL*Net access and with data instead uploaded via a web interface (this may well change when Oracle launch their database instance-as-a-service later in 2014). No SQL*Net access means no ETL tool access though, in practice, as they all use SQL*Net or ODBC to connect to the database, so this offer to my mind is aimed at either (a) small BI applications where it’s practical to upload the data via Excel files etc, or (b) wider Oracle Cloud-based systems that might use database-as-a-service to hold their data, Java-as-a-service for the application and so forth. What this service does promise though is new capabilities within OBIEE where users can upload their own data, again via spreadsheets, to the cloud OBIEE system, and have that “mashed-up” with the existing corporate data – the aim being to avoid data being downloaded into Excel to do this type of work, and with user metrics clearly marked in the catalog so they’re distinct from the corporate ones.

But assuming you’re not going for the Oracle cloud offer, what are the other options over data? Well, hosting OBIEE in the cloud is conceptually no different from hosting anywhere else, in that it can connect to various data sources via the various connection methods, so in principle you’ve got just the same options open to you as if running on-premise. But the driver for moving OBIEE into the cloud might be that your applications, data etc are already in the cloud, and you might also be looking to take advantage of cloud features in your database such as dynamic provisioning and scaling, or indeed use one of the new cloud-native databases such as Amazon Redshift.


I covered alternative databases for use with OBIEE a few months ago in a blog post, and Amazon Redshift at the time looked like an interesting option; it’s based on ParAccel, a mature analytic database offering, it’s column-store and tightly integrated with Amazon’s other offerings, and a few customers have asked us about it as an option. These alternative databases are certainly interesting – in practice, not all that different in pricing to Oracle database as a source but with some interesting analytic features – but they all suffer from the common issue that they’re not officially supported as data sources. Amazon Redshift, for example, uses Postgres-derived ODBC drivers to connect to it, but Postgres itself isn’t officially supported as a source, which means you could well get sub-optimal queries and you certainly won’t get specific support from Oracle for that source. But if it works for you – then this could be an option, along with more left-field data sources such as Hadoop.

But to my mind, it’s the ETL element that’s the most interesting, and most challenging, part of the equation. Going back to Openworld, Oracle made a few mentions of ETL in their general Cloud Analytics talks, including talk about an upcoming data source adapter for the BI Apps that’ll enable loading from Fusion Apps in the cloud, like this:


There were also a number of other deployment options discussed, including hybrid architectures where some sources were in the cloud, some on-premise, but all involved running the ETL elements of the BI Apps – ODI or Informatica – on-premise, the same way as installs today. And to my mind, this is where the Oracle cloud offering is the weakest, around cloud-based and cloud-native ETL and data integration – the only real option at the moment is to run ODI agents in the cloud and connect them back to an on-premise ODI install, or move the whole thing into the cloud in what could be quite a heavyweight data integration architecture.

Other vendors are, in my opinion, quite a way further forward with their cloud data integration tools strategy than Oracle, who instead seem to be still focused on on-premise (mainly), database-to-database (mostly) ETL. To take two examples; Informatica have an Informatica Cloud service which appears to be a platform-as-a-service play, with customers presumably signing-up for the service, designing their ETL flows and then paying for what they use, with a focus on cloud APIs as well as database connectivity, and full services around data quality, MDM and so forth.


Another vendor in this space is SnapLogic, a pure-play cloud ETL vendor selling a component-based product with a big focus on cloud, application and big data sources. What’s interesting about this and other similar vendors’ approaches though is that they appear to be “cloud-first” – written for the cloud, sold as a service, as much focused on APIs as database connectivity – a contrast to Oracle’s current data integration tools strategy which to my mind still assumes an on-premise architecture. What’s more concerning is the lack of any announcement around ETL-in-the-cloud at the last Openworld – if you look at the platform-as-a-service products announced at the event, whilst database, BI, documents, BPM and so forth-as-a-service were announced, there was no mention of data integration:


What I’d like to see added to this platform in terms of data integration would be something like:

  • On-demand data integration, sold as a service, available as a package along with database, Java and BI
  • Support for Oracle and non-Oracle application APIs – for example Salesforce.com, Workday and SAP – see for example what SnapLogic support in this area.
  • No need for an install – it’s already installed and it’s a shared platform, as they’re doing with OBIEE
  • Good support for big data, unstructured and social data sources

I think it’s pretty likely this will happen – whilst products such as the BI Apps can have their ETL in the cloud, via ODI in the BI Apps 11g version for example, these are inherently single-tenant, and I’d fully expect Oracle to plan at some point to offer BI Apps-as-a-service, with a corresponding data integration element designed from the ground-up to work cloud-native and integrate with the rest of Oracle’s platform-as-a-service offering.

So there we have it – some thoughts on the database and ETL elements in an OBIEE-in-the-cloud offering. Keep an eye on the blog over the next few months as I build out a few examples, and I’ll be presenting on the topic at the upcoming BIWA 2014 event in San Francisco in January – watch this space as they say.