Tag Archives: Oracle BI Suite EE

Automated RPD Builds with OBIEE

Introduction

Some time ago, I wrapped up a four-part series on using MDS XML as a basis for RPD development. Of course… you can (re-)read the blog posts to get up to speed on the concept… but I’ll offer a quick refresher here. Using an XML-based “data-store” for our RPD allows us to treat our RPD the way Java developers treat their class files. At the end of the day… when it’s text, we can work with it. I paired this new(-ish) RPD storage format with the distributed version-control system (VCS) Git to highlight some truly radical development strategies. I speak on this approach quite often… and I always include a live demo of working on multiple branches of development at the same time, and then using the power of Git at the end to merge the branches together to build a single, consistent RPD file, or an OBIEE RPD “release”.

The question I always get asked: why Git? Can we do this with Subversion? My answer: sort of. The difference with Git should be appreciated by RPD developers: it’s all about metadata. Anyone who has set up a Subversion repository knows that branching was an afterthought. Our ability to “branch” relies on the manual creation of a directory called “branches” (right between “trunk” and “tags”) that plays no special part in the Subversion architecture. Git, on the other hand, has branching functionality built way down into its metadata. When it comes time to merge two branches together, Git knows ahead of time whether a conflict will arise, and has already identified the closest common ancestor of the two branches to use for a three-way merge.
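
To make that concrete, here is a minimal sketch using plain Git commands (the branch name is just an example):

git checkout -b feature/new-logical-column     # isolated, "virtual" copy of the repository
# ... develop and commit RPD changes (as MDS XML) on the feature branch ...
git checkout master
git merge feature/new-logical-column           # Git locates the common ancestor and performs a three-way merge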

What do we Get with Git?

Let’s run through the spoils. First… developers work on feature, or “topic” branches to complete their work. This is an isolated, “virtual” copy of the repository, branched off from the master branch at a certain point in time. Below, we can see an illustration of the “Git Flow” methodology, our preferred RPD multi-user development strategy at Rittman Mead:

Git Branch Strategy

Developers no longer have to concern themselves with conflict resolution the way they have always had to when using MUDE. The merging is done after the fact… at the time when releases are assembled. And this brings me to the second, and most impactful, spoil: we are able to easily “cherry-pick” (yes… that is a technical term) features to go into our releases. For Agile methodologies that focus on frequently delivering value to their users… cherry-picking is a requirement, and it simply is not possible with either serialized, online development… or with OBIEE’s built-in MUDE. As depicted in the Git Flow image above, we are able to cherry-pick the features we want to include in a particular release and merge them into our develop branch, and then prepare them for final delivery in the master branch, where we tag them as releases.
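
As a hedged illustration (the commit ID and version tag here are hypothetical), that cherry-pick-and-release cycle boils down to a handful of Git commands:

git checkout develop
git cherry-pick a1b2c3d                   # pull just the chosen feature commit into the release
git checkout master
git merge --no-ff develop                 # prepare the release on master
git tag -a v1.1 -m "OBIEE RPD release 1.1"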

Is it Really So Radical?

Earlier, I described this methodology as “radical”, which certainly requires context. For BI developers using OBIEE, Cognos, MicroStrategy, etc., this is indeed radical. But this is all old hat to Java developers, or Groovy developers, or Ruby on Rails developers, etc. For those folks, SDLC and methodologies like Git Flow are just how development is done. What is it about BI developers that makes these concepts so foreign to us? Why are proper software development lifecycle (SDLC) strategies always an afterthought in the BI tools? All of them, Red Stack or otherwise, seem to “bolt on” custom revisioning features after the fact instead of just integrating with proven solutions (Git, SVN, etc.) used by the rest of the world. When I speak to customers who are “uneasy” about stepping away from the OBIEE-specific MUDE approach, I reassure them that branch-based SDLC is actually the proven standard, while bespoke, siloed alternatives are the exception. Which gets us thinking… what else is the rest of the world doing that we aren’t? The easy answer: automated builds.

Should We Automate Our Builds?

The answer is yes. The development industry is abuzz right now with discussions about build automation and continuous integration, with oft-mentioned open-source projects such as Apache Maven, Jenkins CI, Grunt and Gradle. Although these open-source projects are certainly on my radar (I’m currently testing Gradle for possible fit), an automated build process doesn’t have to be nearly that fancy, and for most environments it could be produced with simple scripting in any interpreted language. To explain how an automated build would work for OBIEE metadata development, let’s first explore the steps required if we were building a simple Java application. The process goes something like this:

  1. Developers use Git (or other VCS) branches to track feature development, and those branches are used to cherry-pick combinations of features to package into a release.
  2. When the release has been assembled (or “merged” using Git terminology), then an automated process performs a build.
  3. The end result of our build process is typically an EAR file. As builds are configurable for different environments (QA, Production, etc.), our build process would be aware of this, and be capable of generating builds specific to multiple environments simultaneously.

Taking these steps and digesting them in an OBIEE world, using the built-in features we have for comparing and patching, it seems our build process would look something like the following:

[Diagram: the automated RPD build and deployment process]

We’ll explain this diagram in more detail as we go, because it’s truly pivotal to understanding how our build process works. But in general… we version control our MDS XML repository for development, and use different versions of that MDS XML repository to generate patch files each time we build a release. That patch file is then applied to one or more target binary RPD files downstream in an automated fashion. Make sense?

To prepare ourselves for using MDS XML for automated builds, we have to start by generating binary RPD files for the different target environments we want to build for, in our case, QA and Production versions. We’ll refer to these as our “baseline” RPD files: initially, they look identical to our development repository, but exist in binary format. These target binary RPD files will also be stored in Git to make it easy to pull back previous versions of our target RPDs, as well as to make it easy to deploy these files to any environment. We only have to generate our baselines once, so we could obviously do that manually using the Admin Tool, but we’ll demonstrate the command-line approach here:
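
As a sketch, using the same repository paths and utilities that appear in the build log later in this post (adjust paths and passwords for your own environment), generating and version-controlling the two baselines might look like this:

cd /home/oracle/source/gcbc
# -D points at the MDS XML repository directory, -O is the binary RPD to write, -P the repository password
biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /home/oracle/source/gcbc/rpd/qa/gcbc.rpd -P Admin123
biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /home/oracle/source/gcbc/rpd/production/gcbc.rpd -P Admin123
git add rpd/qa/gcbc.rpd rpd/production/gcbc.rpd
git commit -m "Baseline binary RPDs for QA and Production"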

Once we have our binary QA and Production RPDs ready, we’ll need to make any environment-specific configuration changes necessary for the different target environments. One obvious example of this kind of configuration is our connection pools. This is why we don’t simply generate an RPD file from our MDS XML repository and use it for all the downstream environments: specifically, we want to be able to manage connection pool differences, but generically, we are really describing any environment-specific configurations, which is possible with controlled, automated builds. This is another step that only has to be performed once (or whenever our connection pool information needs updating, such as regular password updates) so it too can be performed manually using the Admin Tool.

Once our baselines are set, we can enable our build processes. We’ll walk through the manual steps required to do a build, and after that, we’ll have a quick look at Rittman Mead’s automated build process.

Staging the Prior Version

After we’ve completed development of one or more new features, those branches will be merged into our develop branch (or the master branch if we aren’t using Git Flow), giving us a consistent MDS XML repository that we are ready to push to our target environments. We’ll need to compare our current MDS XML repository with the last development version that we produced so we can generate a patch file to drive our build process. There’s a very important distinction to make here: we aren’t going to compare our development repository directly against our QA or Production repositories, which is the approach I often see people recommending in blogs and other media. Hopefully, the reason will be clear in a moment.

So where are we going to find the previous version of our MDS XML repository? Since we are using version control, we are able to checkout our previous build using either a branch or a tag. I’ll demonstrate this using a branch, but it makes little difference from a process perspective, and most VCS (including Git) use the same command for either approach. To make this clear, I’ve created two branches: current and previous. The current branch is the one we’ve been diligently working on with new content… the previous branch is the branch we used to drive the build process last time around. So we need to “stage” a copy of the prior MDS XML repository, which we do by checking out the previous branch and extracting that repository to a location outside of our Git repository. Notice that we’ll extract it as a binary RPD file; this is not a requirement, it’s just easier to deal with a single file for this process:

[Screenshot: staging the prior version of the RPD]
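
The commands behind that step are the same ones the automated build issues in the log shown later in this post; done manually, the staging looks like this:

git checkout previous                                                                 # switch to the branch used for the last build
biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /stage/gcbc.rpd -P Admin123   # extract it as a binary RPD outside the Git repository
git checkout current                                                                  # back to the branch holding the new development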

Generating the Patch File

Now that we have staged our previous RPD file and returned to the current development branch, we are ready to proceed with generating a patch file. The patch file is generated by simply doing a comparison between our development MDS XML repository and the staged binary RPD file. The important thing to note about this approach: this patch file will not have any environment-specific information in it, such as connection pool configurations, etc. It only contains new development that’s occurred in the development environment between the time of the last release and now:

[Screenshot: generating the patch file with comparerpd]
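
Manually, that comparison is a single comparerpd call, identical to the one that appears in the build log further down:

# -C is the current (development) MDS XML repository, -G the staged prior RPD,
# -D the patch file to generate; -P and -W are the repository passwords
comparerpd -P Admin123 -C /home/oracle/source/gcbc/rpd/gcbc -W Admin123 -G /stage/gcbc.rpd -D /home/oracle/source/gcbc/patch/patch.xml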

Applying Patch File to our Target(s)

With the staged, prior version of the binary RPD, and our patch file, we have all we need to apply the latest release to our target binary RPD files. In our case, we’ll be applying the changes to both the QA and Production binary RPD files. Compared to everything we’ve done up until this point, this is really the easiest part. Below is a sample patchrpd process for our production RPD:

Patching Production RPD
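
In command form, and matching the command that shows up in the DEBUG build log later in this post, the production patch step is:

# -C the target RPD to patch, -I the patch file, -G the staged original RPD,
# -O where to write the patched RPD; -P and -Q are the repository passwords
patchrpd -C /home/oracle/source/gcbc/rpd/production/gcbc.rpd -I /home/oracle/source/gcbc/patch/patch.xml -G /stage/gcbc.rpd -O /home/oracle/source/gcbc/rpd/production/gcbc.rpd -P Admin123 -Q Admin123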

There are a lot more options we could incorporate into this process from a patching perspective. We could use the -D option to introduce a decision file, which gives us the ability to highly customize this process. What we’ve demonstrated is a fairly simple option, but to be honest, in a controlled, SDLC approach, it’s rare that we would ever need anything more than this simple process.

Automated Builds with the Rittman Mead Delivery Framework

As you can imagine, with all the brain-power floating around at Rittman Mead, we’ve packaged up a lot of automated processes over the years, and we have active development underway to refactor those processes into the Rittman Mead Delivery Framework. As these disparate pieces get incorporated into the Framework as a whole, we tend to rewrite them in either Groovy or Python… languages we have adopted because of their strategic place in Oracle’s products. I personally wrote our automated OBIEE build tool, and chose Groovy because of its existing hooks into Gradle, Maven, Ant and all the rest. Also, the domain-specific language (DSL) capability of Groovy (used in both Grails and Gradle) is perfect for writing task-specific processes such as builds. The build process in our framework is driven by the buildConf.groovy config file, shown below:

[Screenshot: the buildConf.groovy configuration file]

With Groovy, most things Java are pretty easy to incorporate, so I added Apache log4j functionality. Running the automated build process first in a logging mode of INFO, we see the following standard output:

[Screenshot: build output with logging at INFO level]

And now, in DEBUG mode:

[oracle@oracle obiee]$ groovy BuildOBI.groovy 
[main] DEBUG common.GitRepo - currentBranch: current 
[main] DEBUG common.Util - command: git checkout previous 
[main] DEBUG common.GitRepo - currentBranch: previous 
[main] DEBUG common.Util - command: biserverxmlexec -D /home/oracle/source/gcbc/rpd/gcbc -O /stage/gcbc.rpd -P Admin123 
[main] DEBUG common.Util - command: git checkout current 
[main] DEBUG common.GitRepo - currentBranch: current 
[main] INFO obiee.Build - /stage/gcbc.rpd staged 
[main] DEBUG common.Util - command: comparerpd -P Admin123 -C /home/oracle/source/gcbc/rpd/gcbc -W Admin123 -G /stage/gcbc.rpd -D /home/oracle/source/gcbc/patch/patch.xml 
[main] INFO obiee.Build - /home/oracle/source/gcbc/patch/patch.xml created 
[main] DEBUG common.Util - command: patchrpd -C /home/oracle/source/gcbc/rpd/qa/gcbc.rpd -I /home/oracle/source/gcbc/patch/patch.xml -G /stage/gcbc.rpd -O /home/oracle/source/gcbc/rpd/qa/gcbc.rpd -P Admin123 -Q Admin123 
[main] INFO obiee.Build - /home/oracle/source/gcbc/rpd/qa/gcbc.rpd patched 
[main] DEBUG common.Util - command: patchrpd -C /home/oracle/source/gcbc/rpd/production/gcbc.rpd -I /home/oracle/source/gcbc/patch/patch.xml -G /stage/gcbc.rpd -O /home/oracle/source/gcbc/rpd/production/gcbc.rpd -P Admin123 -Q Admin123 
[main] INFO obiee.Build - /home/oracle/source/gcbc/rpd/production/gcbc.rpd patched 
[oracle@oracle obiee]$

The real power that Git brings (and frankly, other VCSs such as Subversion do as well) is the ability to configure commit hooks, such that scripts are executed whenever revisions are committed. You’ll notice that the buildConf.groovy configuration file has a list option called validBranches. This allows us to configure the build to run only when the current branch is included in that list. We have that list unpopulated at the moment (meaning the build is valid for all branches), but if we populated it, the build process would only run for certain branches. This would allow us to execute the build as a commit hook (either pre- or post-commit), but only have it run when committing to particular branches, such as the develop branch or the master branch, when the code is being prepared for release.
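
As a minimal sketch (assuming BuildOBI.groovy sits at the root of the repository; adjust the path for your own layout), a post-commit hook can simply hand off to the build and let validBranches decide whether anything actually happens:

#!/bin/sh
# .git/hooks/post-commit -- runs after every commit in the RPD repository
# BuildOBI.groovy reads buildConf.groovy; with validBranches populated, it
# only performs a build when the commit lands on one of the listed branches
cd "$(git rev-parse --show-toplevel)" || exit 1
groovy BuildOBI.groovy

Remember to make the hook executable (chmod +x .git/hooks/post-commit).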

Conclusion

So what do we think? We’ve put a lot of thought into implementing enterprise SDLC for our customers, and this process works very well. I even use our Delivery Framework in single-user development scenarios (when I’m creating metadata repositories for presentations, for instance) because the tools are repeatable and they just work. We have additional processes that allow us to complete the build by uploading the completed binary RPDs to the appropriate servers and restarting the services.

Rittman Mead BI Forum 2014 Call for Papers Closing Soon – And News on This Year’s Masterclass

It’s a couple of days to go until the call for papers for the Rittman Mead BI Forum 2014 closes, with suggested topics this year including OBIEE (of course), Essbase, Endeca, Big Data, Visualizations, In-Memory analysis and data integration. So far we’ve had some excellent submissions but we’re still looking for more – so if you’re considering putting an abstract in, do it now before we close the process late this Friday night!

I’m also very excited to announce that this year’s optional one-day masterclass on the Wednesday before each event will be presented by Lars George from Cloudera, who will be talking about Hadoop, Cloudera’s distribution of Hadoop and their management and real-time query tools, and how these relate to the world of Oracle BI&DW. Lars is a Cloudera Solutions Architect and Head of Services for them in EMEA, and is also an HBase committer and author of the book “HBase: The Definitive Guide”.

You’ll probably have seen a lot on big data, and Cloudera, on this blog over the past few months, and I’m particularly grateful to Justin Kestelyn who used to run OTN and the Oracle ACE Program, but now does a similar role over at Cloudera, for making it happen. Thanks Justin and Lars, and we’ll look forward to seeing Lars in Brighton and Atlanta in May this year.

Once the call for papers closes, we’ll do the usual vote to allow potential attendees to influence the paper selection, and then we’ll announce the agendas and open the events up for registration later in February. Until then though – get your abstracts in now before it’s too late… 

Exploring OBIEE Web Services through Python

My very first blog post for Rittman Mead was on the subject of Web Services in BI Publisher 11g, and now I want to return to the subject here, looking at the Web Services of OBIEE itself.

Web Services are basically a way of enabling an application or service to offer up an API to as-yet-undefined clients. This may be an API for fetching data, sending data, or invoking a service or action. In this age of The Cloud, systems being able to interact in this abstracted manner (i.e. without requiring one to be aware of the other at design time) is a Good Thing, and an application that does not support it is rather frowned upon.

For a simple explanation of Web Services, and details of how to explore and test them using SoapUI then refer to my article Web Services in BI Publisher 11g.

OBIEE itself has supported Web Services for a long time now, but they are exposed through SOAP, an HTTP-based protocol that relies on carefully formed XML messages. This is fine, but it has meant that to actually utilise them beyond tinkering within SoapUI you had to create a heavy-weight JDeveloper project or similar. One of the many great things about working at Rittman Mead is that as a company we use Macs, meaning that a unix-like command line, and Python, is just a click away. Recently I discovered that Python has a library called SUDS. This makes calling a web service as simple as:

from suds.client import Client
client = Client('http://obiee-server:9704/analytics-ws/saw.dll/wsdl/v7')
sessionid = client.service['SAWSessionService'].logon('weblogic','Password01')

With this code I have just logged into OBIEE and returned a session ID token that I can now use in subsequent web service calls. In the background, SUDS sorts out the forming of the XML SOAP messages to send to the web service, and the parsing of the returned XML SOAP message into a matching Python object. So, now I can actually start exploring and programming using the Web Services straight from my command line…nice.

The Web Services discussed below are all session based, meaning that you have to first authenticate and retrieve a session ID, which is then used for subsequent Web Service calls.

Checking a user’s Subject Area grants

What prompted me to try again with exploring OBIEE’s Web Services was my discussions with my colleagues over ways to Regression Test OBIEE, above and beyond what I have already covered here and here, where I use nqcmd and Catalog Manager to work with the Logical SQL for an analysis. What about security – how can we test what a user can and cannot see from the point of view of an RPD’s Presentation Layer permissions?

Once authenticated, we can call a Web Service which shows the Subject Areas that the user may see. The MetadataService and getSubjectAreas are both straight from the documentation, transposed into SUDS/Python syntax.

sa_list=client.service['MetadataService'].getSubjectAreas(sessionid)
print sa_list

This returns:

[(SASubjectArea){
   name = ""Sales - Fact Sales""
   displayName = "Sales - Fact Sales"
   description = None
 }, (SASubjectArea){
   name = ""Sales - Store Quality""
   displayName = "Sales - Store Quality"
   description = None
 }]

And now for each subject area, examine the available tables and columns within:

for sa in sa_list:
    print '\t%s'%(sa.name)
    sa_contents=client.service['MetadataService'].describeSubjectArea(sa.name,'IncludeTablesAndColumns',sessionid)
    for table in sa_contents.tables:
        print '\t\t%s' % (table.name)
        for col in table.columns:
            print '\t\t\t%s' % (col.name)

Which returns:

"Sales - Fact Sales"
        "Fact Sales"
                "Cost"  
                "Revenue"
                "Sale Amount" 
                "Sale Amount Month Ago"
                "Sale Amount Target" 
        "Dim Times"
                "Month YYYYMM" 
                "Month" 
                "Quarter"
                "Quarter YYYYQ"
[...]

Nice huh? And with a bit of Python scripting magic we could easily dump that out to disk for comparison again once an RPD change has been made.

Listing users

Listing out the available RPD objects for a user is nice, but how do we know which users there are? Not everyone’s going to be logging in as weblogic, right? (right?!). There is a SecurityService Web Service that offers a getAccounts method, but how do we call it? Unlike the above MetadataService.getSubjectAreas call, where we just passed in the session ID token, here we need to form a proper object describing the account query that we want to run. SUDS offers a way to do this, using an object ‘factory’:

accountlist = client.factory.create('Account')

From the interactive Python shell we can see what an “Account” object looks like:

>>> print accountlist
(Account){
   name = None
   accountType = None
   guid = None
   displayName = None
 }

And this matches the AccountStructure in the documentation (displayName doesn’t do anything in this method). We can set the object’s attributes to define the accounts we want to see – any account name (* is wildcard), and show users (accountType=0) only, not application roles:

accountlist.name = '*'
accountlist.accountType = 0

And now call the getAccounts method, storing the results in an object:

accounts=client.service['SecurityService'].getAccounts(accountlist,sessionid)

Dumping the object’s contents to the screen in the interactive shell shows:

>>> print accounts
[(Account){
   name = "BISystemUser"
   accountType = 0
   guid = "208001E0834C11E3AF35BD098D6B48E3"
   displayName = "BISystemUser"
 }, (Account){
   name = "OracleSystemUser"
   accountType = 0
   guid = "E33C7570834111E3BF03EF3A9BA6EB6D"
   displayName = "OracleSystemUser"
 }, (Account){
   name = "author_user"
   accountType = 0
   guid = "8FB2B570873F11E3BF851386414CD489"
   displayName = "author_user"
 }, (Account){
   name = "consumer_user"
   accountType = 0
   guid = "A38EEFF0873F11E3BF851386414CD489"
   displayName = "consumer_user"
 }, (Account){
   name = "weblogic"
   accountType = 0
   guid = "E33CC390834111E3BF03EF3A9BA6EB6D"
   displayName = "weblogic"
 }]

Iterating through user accounts

Now we have an object that gives us each user defined on the system. We can iterate through this object, and use Impersonation (which I have written about in detail here) to login as each user. Once logged in as the user, we could use the above example to iterate through the subject areas available to the user and record their actual RPD security privileges, and so on.

for account in accounts:
        print '\n----\n\n'
        this_acc = account.name
        try:   
                print 'Logging in as %s' % (this_acc)
                imp_client = Client('http://obiee-server:9704/analytics-ws/saw.dll/wsdl/v7')
                new_session = imp_client.service['SAWSessionService'].impersonate('weblogic','Password01',this_acc)
                print 'New session ID %s' % (new_session)
                try:   
                        user=imp_client.service['SAWSessionService'].getCurUser(new_session)
                        print 'Logged in as %s'%(user)
                        #
                        # Do cool stuff as impersonated user here
                except:
                        print 'failed to get user session'
        except:
                print 'failed to login as %s' % (this_acc)

NB: to use impersonation you need to have the oracle.bi.server.impersonateUser Application Policy granted, which in a default 11.1.1.7 environment is not granted even to the BIAdministrators Application Role.

SUDS logging

Use this little snippet to dump out the raw request/response XML SOAP messages to help debug issues you may^H^H^H will encounter:

import logging
logging.basicConfig(level=logging.INFO)
logging.getLogger('suds.client').setLevel(logging.DEBUG)

See the SUDS documentation for more detail.

Toolkit

Web Services, even through SUDS, cannot be blagged nor approached through the standard code-by-Google method. You have to be extremely precise with your syntax and object formations, which means that you will need the OBIEE Web Services reference documentation, the interactive Python shell, and ideally SoapUI close to hand.

To install SUDS on a *nix platform with Python installed already (which it is by default on Macs, and most Linux distros), use Python’s easy_install program:

sudo easy_install suds

And off you go!

This is all very cool, but is it useful?

(Well, I think it’s cool, but then I laughed at this so go figure)

Being able to tap directly into core OBIEE functionality with a few lines of script, callable natively from any Linux OBIEE server (and any Windows server with Python installed) with minimal dependencies, is very useful. It opens up much of OBIEE’s functionality as – in effect – a Python library. Whether it’s for building out some form of testing suite, automating web catalog management, auditing user privileges, or tasks like a light-touch way to kick off an Agent (or create one from scratch), it’s another option handy to have up one’s sleeve.

Automated Regression Testing for OBIEE

In the first article of this series I explored what regression testing is, why it matters, and, by breaking down the OBIEE stack into its constituent parts, where it is possible to do it for OBIEE. In this posting, I explore some approaches that lend themselves well to automation for testing that existing analyses and dashboards are not affected by RPD changes.

Easy Automated RPD regression testing – it’s all about the numbers

“Bring me solutions not problems” goes the mantra, and in the first article all I did was rain on the parade of the de facto regression testing approach, looking at the front end using functional testing tools such as Selenium. So if not at the front end, then where should we focus our automated regression testing of OBIEE? Answer: the data.

The data should arguably be what is most important to our users. If it’s wrong, that’s bad, and if it’s right, hopefully they’re going to be happy. Obviously, there are other factors in making users happy, not least performance and the visual appearance of the data. But a system that gives users wrong data, or no data, is fundamentally a failed one.

Looking at the following diagram of a request/response through the OBIEE stack we can see that so far as data is concerned, it is the BI Server doing all the work, handling both logical and physical SQL and data sets:

The data that it passes back up to Presentation Services for rendering in the user’s web browser is the raw data that feeds into what the user will see. Obviously the data gets processed further in graphs, pivot tables, narrative views, and so on – but the actual filtering, aggregation and calculation applied to data is all complete by the time that it leaves the BI Server.

How the BI Server responds to data requests (in the form of Logical SQL queries) is governed by the RPD, the metadata model that abstracts the physical source(s) into a logical Business model. Because of this abstraction, all Logical queries (i.e. analysis/dashboard data requests) are compiled into Physical SQL for sending to the data source(s) at runtime only. Whenever the RPD changes, the way in which any logical query is handled may change.

So if we focus on regression testing the impact that changes to the RPD have on the data alone then the available methods become clearer and easier. We can take the Logical SQL alone that Presentation Services generates for an analysis and sends to the BI Server, and we can run it directly against the BI Server ourselves to get the resulting [logical] dataset. This can be done using any ODBC or JDBC client, such as nqcmd (which is supplied with OBIEE at installation).

Faith in reason

If:

  • the Logical SQL remains the same (i.e. the analysis has not changed, nor have the Presentation Services binaries changed – but see caveat below)
  • the data returned by the BI Server as a result of the Logical SQL before and after the RPD change is made is the same

Then we can reason (through the above illustration of the OBIEE stack) that

  • the resulting report shown to the user will remain the same.

Using this logic we can strip away the top layer of testing (trying to detect if a web page matches another) and test directly against the data without having to actually run the report itself.

In practice

To use this method in practice, the process is as follows:

  1. Obtain the Logical SQL for each analysis in your chosen dashboards
  2. Run the Logical SQL through BI Server, save the data
  3. Make the RPD changes
  4. Rerun the Logical SQL through BI Server, save the data
  5. Compare data before & after to detect if any changes occurred

The Logical SQL can be obtained from Usage Tracking, for example:

SELECT QUERY_TEXT   
FROM   S_NQ_ACCT   
WHERE  START_TS > SYSDATE - 7  

or you can take it directly from the nqquery.log. For more than a few analyses, Usage Tracking is definitely the more practical option.


You can also generate Logical SQL directly from an analysis in the catalog using runcat.sh – see later in this post for details.

If you don’t know which dashboards to take the Logical SQL from, ask yourself which are going to cause the most upset if they stop working, as well as making sure that you have a representative sample across all usage of your RPD’s Subject Areas.

Give the Logical SQL of each analysis an ID and have a log book of which dashboard it is associated with, when it was taken, etc. Then run it through nqcmd (or alternative) to return the first version of the data.

nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s analysis01.lsql -o analysis01.before.csv  

where:

  • -d is the BI Server DSN. For remote testing this is defined as part of the configuration/installation of the client. For testing local to the server it will probably be AnalyticsWeb on Linux and coreapplication_OHxxxxxx on Windows
  • -u and -p are the credentials of the user under whose ID the test should be executed
  • -s specifies the Logical SQL script to run
  • -o the output file to write the returned data to.

For more information about nqcmd see the manual.

Once you’ve run the initial data sample, make your RPD changes, and then rerun the data collection with the same command as before (except to a different output file). If the nqcmd command fails then it’s an indication that your RPD has failed regression testing already (because it means that the actual analysis will presumably also fail).

An important point here is that if your underlying source data changes, or any time-based filter results change, then the test results will be invalid. If you are running an analysis looking at “Sales for Yesterday”, and the regression test takes several days, then “Yesterday” may change (depending on your init-block approach) and so will the results.

A second important point to note is that you must take into account the BI Server cache. If enabled, either disable it during your testing, or use the DISABLE_CACHE_HIT request variable with your Logical SQL statements.
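
If you go the request-variable route, the variable is set as a prefix on each Logical SQL statement in your test scripts. A minimal sketch (the SELECT is just an illustrative Logical SQL statement; check the exact variable syntax against the documentation for your version):

SET VARIABLE DISABLE_CACHE_HIT=1;
SELECT "Fact Sales"."Revenue" FROM "Sales - Fact Sales"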

Having taken the before and after data collections for each analysis, it’s a simple matter of comparing the two and reporting any differences. Any differences are typically going to mean a regression has occurred, and you can use the before/after data files to identify exactly where. On Linux the diff command works perfectly well for this.
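
A minimal sketch, assuming the post-change run was written to analysis01.after.csv:

diff analysis01.before.csv analysis01.after.csv
if [ $? -eq 0 ]; then
    echo "analysis01: no change in data"
else
    echo "analysis01: datasets differ (or the rerun failed) - investigate"
fi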

If the ‘after’ execution fails (for example, with a missing table error because an object has been removed from the RPD), the comparison will flag it immediately. If both files are identical (meaning there is no difference in the data before and after the RPD change), there is no output from diff at all.

Tools like diff are not pretty but in practice you wouldn’t be running all this manually, it would be scripted, reporting on exceptions only.

So a typical regression test suite for an existing RPD would be a set of these nqcmd calls to an indexed list of Logical SQL statements, collecting the results for comparison with the same executions once the RPD changes have been made.

Tips

  • Instead of collecting actual data, you could run the results of nqcmd directly through md5 and store just the hash of each resultset, making for faster comparisons (see the sketch after this list). The drawback of this approach would be that to examine any discrepancies you’d need to rerun both the before & after tests. There is also the theoretical risk of a hash collision (where the same hash is generated for two non-matching datasets) to be aware of.
  • diff sets a shell return code depending on whether there is a difference in the data (RC=1) or not (RC=0), which makes it handy for scripting into if/then/else shell script statements
  • nqcmd uses stdout and stderr, so instead of specifying -o for an output file, you can redirect the output of the Logical SQL for each analysis to a results file (file descriptor 1) and an error file (file descriptor 2), making spotting errors easier:
    nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s analysis01.lsql 1>analysis01.out 2>analysis01.err  
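
A sketch of the hashing approach from the first tip above (file names are the illustrative ones used earlier):

md5sum analysis01.before.csv analysis01.after.csv
# identical hashes mean identical resultsets; to script the check:
[ "$(md5sum < analysis01.before.csv)" = "$(md5sum < analysis01.after.csv)" ] && echo "match" || echo "differ"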

Taking it one step further

We can strip away the layers even further, in two additional stages:

  • Instead of examining the data returned by the generated Physical SQL query, simply compare the generated Physical SQL itself, before and after an RPD change. If the query is the same, then the data returned will be the same, and therefore the report will be the same.
    One thing to watch for is that the Physical query logged in Usage Tracking (although not in nqquery.log) has a Session ID embedded at the front of it which will make direct comparison more difficult.
  • The Physical SQL is dependent on the RPD; if the RPD changes then the Physical SQL may change. However, if neither the RPD nor the inbound Logical SQL has changed and only the underlying data source has changed (for example, a schema modification or database migration), then we can ignore the OBIEE stack itself and simply test the results of the Physical SQL statement(s) associated with the analysis/dashboard in question, making sure that the same data is returned before and after the change

The fly in the ointment – Logical SQL generation

This may all sound a bit too good to be true; and there is indeed a catch of which you should be aware.

Presentation Services does not save the Logical SQL of an analysis, but rather regenerates it at execution time. The implication of this is that the above nqcmd method could be invalid in certain circumstances where the generated Logical SQL changes even when the analysis and patch level remain unchanged. If an analysis’ Logical SQL changes then we cannot use the same before/after dataset comparison as described above — because the ‘after’ dataset would not actually match what would be returned. In reality, if the Logical SQL changes then the corresponding Logical resultset is also going to be different.

Two factors that will cause Presentation Services to generate different Logical SQL for an analysis without changing the analysis at all are modifications to an RPD Logical Column’s Sort order column or Descriptor ID column configuration.

As an example, consider a standard “Month” column. By default, the column will be sorted alphabetically, so starting with April (not January).

The Logical SQL for this can be seen in the Advanced tab.

Now without modifying the analysis at all, we change the RPD to add in a Sort order column:

Reload the RPD in Presentation Services (Reload Files and Metadata) and reload the analysis and examine the Logical SQL. Even though we have not changed the analysis at all, the Logical SQL has changed:

When executed, the analysis results are now sorted according to the Month_YYYYMM column, i.e. chronologically:

The same happens with the Descriptor ID Column setting for a Logical Column – the generated Logical SQL will change if this is modified. Changes to Logical Dimensions can also affect the Logical SQL that is generated if an analysis is using hierarchical columns. For example, if the report has a hierarchical column expanded down one level, and that level is then deleted from the logical dimension in the RPD, the analysis will instead show the next level down when next run.

Regression Testing Logical SQL generation

It is important that Logical SQL is considered as part of regression testing if we are using this targeted approach – it is the price to pay for selectively testing elements of the stack using reasoning to exclude others. In this case, if the Logical SQL changes then we cannot compare datasets (because the source query will have changed when the actual analysis is run). In addition, if the Logical SQL changes then this is a regression in itself. Consider the above Sort order column example – if that were removed from an RPD where it had been present, users would see the effect and quite rightly raise it as a regression.

There are at least two ways to get the Logical SQL for an analysis programmatically: the generateReportSQL web service, and the Presentation Services Catalog Manager tool. We will look at the latter option here. The Catalog Manager can be run interactively through a GUI, or from the command line. As a command line utility it offers a rich set of tools for working with objects in the Presentation Catalog, including generating the Logical SQL for a given analysis. The logical outline for using it is shown below. If the Logical SQL is not identical, or runcat.sh fails to generate Logical SQL for the analysis after the RPD is changed, then a regression has occurred. If the Logical SQL has remained the same then the testing can proceed to the nqcmd method to compare resulting datasets.

[Diagram: regression test flow for Logical SQL generation using runcat.sh]

runcat.sh is a powerful utility but a bit of a sensitive soul for syntax. First off, it’s easiest to call it from its home folder:

cd $FMW_HOME/instance1/bifoundation/OracleBIPresentationServicesComponent/coreapplication_obips1/catalogmanager

To see all the things that it can do, run

 ./runcat.sh -help

Or further information for a particular command (in our case, we’re using the report command):

./runcat.sh -cmd report -help

So to generate the Logical SQL for a given analysis, call it as follows:

./runcat.sh -cmd report -online http://server:port/analytics/saw.dll -credentials creds.txt -forceOutputFile output.lsql -folder "/path/to/analysis" -type "Analysis" "SQL"

Where you need to replace:

  • server:port with your BI Server and Managed Server port number (for example, biserver:9704)
  • creds.txt is a file with your credentials in, see below for further details
  • output.lsql is the name of the file to which the Logical SQL will be written. Remove the ‘force’ prefix if you want runcat.sh to abort if the file exists already rather than overwrite it
  • /path/to/analysis is the full path to the analysis (!), which you can get from both OBIEE (Catalog -> Object -> Properties) and from the Catalog Manager in GUI mode (Object -> Properties). In the screenshots here the full path is /shared/Standard/Analyses/Sales Report

Having called runcat.sh once, you then make the RPD change, reload the RPD in Presentation Services, and then call runcat.sh again and compare the generated Logical SQL (e.g. using diff) - if it’s the same then you can be sure that when the analysis runs it is going to do so with the same Logical SQL and thus use the nqcmd method above for comparing before/after datasets.

To call runcat.sh you need the credentials in a flatfile that looks like this:

login=weblogic
pwd=Password01

If the plaintext password makes you uneasy then consider the partial workaround that is proposed in a blog post that I wrote last year : Make Use of OBIEE’s Command Line Tools with Reduced Exposure of Plain Text Passwords

Bringing it together

Combining both nqcmd and runcat.sh gives us a logic flow as follows.

  1. Run Logical SQL through nqcmd to generate initial dataset
  2. Run analysis to which the Logical SQL corresponds through runcat.sh to generate initial Logical SQL
  3. Make RPD changes
  4. Run analysis through runcat.sh again. If it fails, or the Logical SQL doesn’t match the previous, then regression occurred.
  5. Rerun nqcmd, and compare the before/after datasets. If they don’t match, or nqcmd fails, then regression occurred.

You may wonder why we have two Logical SQL statements present – that for use with nqcmd, and that from runcat.sh. The Logical SQL for use with nqcmd will typically come from actual analysis execution (nqquery.log / Usage Tracking) with filter values present. To compare generated Logical SQL, the source analysis needs to be used.
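
Pulling these steps together, a sketch of a per-analysis check might look like the following. The file names, credentials and catalog path are the illustrative ones used earlier in this post, and the ‘before’ files are assumed to have been captured the same way prior to the RPD change:

#!/bin/sh
# Regression-check a single analysis after an RPD change (a sketch; add error handling to taste)
ID=analysis01
CATPATH="/shared/Standard/Analyses/Sales Report"
cd $FMW_HOME/instance1/bifoundation/OracleBIPresentationServicesComponent/coreapplication_obips1/catalogmanager

# 1. Regenerate the Logical SQL and compare it with the pre-change version
./runcat.sh -cmd report -online http://server:port/analytics/saw.dll -credentials creds.txt \
    -forceOutputFile $ID.after.lsql -folder "$CATPATH" -type "Analysis" "SQL"
diff $ID.before.lsql $ID.after.lsql || { echo "$ID: Logical SQL changed - regression"; exit 1; }

# 2. Logical SQL unchanged, so the dataset comparison is valid
nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s $ID.lsql -o $ID.after.csv
diff $ID.before.csv $ID.after.csv && echo "$ID: passed" || echo "$ID: data changed - regression"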

Summary

So to summarise, automated regression testing of OBIEE can be done using just the tools that are shipped with OBIEE (and a wee bit of scripting to automate them). In this article I’ve demonstrated how automated regression testing of OBIEE can be done, and suggested how it should be done if the changes are just in the RPD. Working directly with the BI Server, Logical SQL and resultset is much more practical and easier to automate at scale. Understanding the caveat to this approach – that it relies on the Logical SQL remaining the same – and understanding in what situations this may apply is important. I demonstrated another automated method that could be used to automatically flag any tests for which the dataset comparison results would be invalid.

Testing the data that analyses request and receive from the BI Server can be done using nqcmd by passing in the raw Logical SQL, and this Logical SQL we can also programmatically validate using the Catalog Manager tool in command line mode, runcat.sh.

Looking back at the diagram from the first post in this series, you can see the opportunities for regression testing in the OBIEE stack, with the point being that a clear comprehension of this would allow one to accurately target testing, rather than assuming that it must take place at the front end:

If we now add to this diagram the tools that I have discussed in the article, it looks like this:

I know I’ve not covered Selenium, but I’ve included it in the diagram for completeness. The other tool that I plan to cover in a future posting is the OBIEE Web Services, as they could also have a role to play in testing for specific regressions.

Conclusion

Taking a step back from the detail, I have shown here a viable and pragmatic approach to regression testing OBIEE in a manner that can actually be implemented and automated at scale. It is important to be aware that this is not 100% test coverage. For example, it omits important considerations such as testing for security regressions (can people now see what they shouldn’t?). However, I would argue that some regression testing is better than none, and regression testing “with one’s eyes open” to the limitations, which can then be addressed manually, is a sensible approach.

Regression testing doesn’t have to be automated. A sensible mix of automation and manual checking is a good idea to try and maximise the test coverage, perhaps following an 80/20 rule. The challenges around regression testing the front end mean that it is sensible to explore more focussed testing further down the stack where possible. Just because the front end regression testing can’t be automated, it doesn’t mean that the front end shouldn’t be regression tested – but perhaps it is more viable to spend time visually checking and confirming something than investing orders of magnitude more hours in building an automated solution that will only ever be less accurate and less flexible.

Regression Testing OBIEE is not something that can be solved by throwing software at it. By definition, software must be told what to do and OBIEE is too flexible, too complex, to be able to be constrained in such a manner that a single software solution will be able to accurately detect all regressions.

Successfully planning and executing regression testing for OBIEE is something that needs to be not only part of a project from the outset, but also something that the developers take active responsibility for. They have the user interviews, the functional specs, they know what the system should be doing, and they know what changes they have made to it; they should know what needs testing and in what way. A siloed approach to development where regression testing is “someone else’s problem” is never going to be as effective (in terms of accuracy and time/money) as one in which developers actively participate in the design of regression tests for the specific project and development tasks at hand.


Many thanks to Gianni Ceresa for his thoughts and assistance on this subject.

OBIEE 11.1.1.7, Cloudera Hadoop & Hive/Impala Part 2 : Load Data into Hive Tables, Analyze using Hive & Impala

In yesterday’s post on analyzing Hadoop data using Cloudera CDH4, Amazon EC2 and OBIEE 11.1.1.7, I went through the setup process for Cloudera Manager Standard and then used it to set up a four-node Hadoop cluster in Amazon AWS’s EC2 service. In today’s post, we’ll use a tool called “Hue” to upload some flight delay stats from OBIEE’s SampleApp / Exalytics demos, create Hive tables over those files and then analyse them, first using Hive and then using Cloudera Impala. If you’ve not seen yesterday’s post then take a look at that first for the setup instructions, and then we’ll move on to today’s topics.

Using Hue to Create a Database and Upload Data to It

1. Hue is a web-based application that ships with Cloudera’s distribution of Hadoop, and is used to run queries against Hadoop and perform general data activities – think of it as SQL*Developer or ApEx for Hadoop, as compared to Cloudera Manager which is more like Oracle Enterprise Manager. To first navigate to Hue, go back to Cloudera Manager, select Services > hue1, then select Hue Web UI from the tabs over the Hue service details. You’re then taken to a setup screen where you can create a new Hue user for admin purposes – I’ll use the username/password “admin/admin”, and then press the Sign-Up button – which then takes me to Hue’s Quick Start Wizard, like this:


Press Next, and then click on the All button under the Install all the application examples label. This installs demo data for Hive, Impala, Pig (a procedural tool used for PL/SQL-type data transformation) and Oozie (a workflow tool). We won’t use this example data in this exercise, but it’s handy to have around for playing around with later.


On the Step 3: Users page, click on the User Admin button and create a new user called “airlines”, password “airlines” – we’ll use this in a moment when uploading data to the cluster. Then return to the quick start wizard, press Next and then click on the Hue Home button to go into Hue proper.

2. You’re now at the Hue homepage where you can upload and work with the HDFS filesystem, create Hive tables, use Sqoop 2 to connect to and load data from a relational database, and perform other tasks.


Select admin > Sign Out, and then log in again this time as the “airlines” user. We’re going to now use this user to create a new Hive database called “bi_airlines”, and then create tables out of four pipe-delimited files exported earlier from an Oracle database, and that I’ve uploaded to Dropbox in case you want to use them too.

3. To create the new Hive database, click on the Tables link, and then the Databases link at the top of the page, like this:


Then, when the next screen is displayed, click on the Create a new database link and call it “bi_airlines”, accepting the default location (in HDFS) for the files it uses. Press the Create Database button that’s then displayed, and check the log and the output to make sure it’s created OK. At that point then, you should see two databases displayed – the “bi_airlines” one you just created, and the default one.

4. Click on the bi_airlines database to select it; another page will then be displayed that will list the tables within that database (which at this point is of course empty), and links to create a new table from a file, or to create one manually.


Now in the real world, you’d create your Hive tables manually as you’re most likely going to map them onto a directory of files (or set of directories, if you want to use Hive table partitioning), and you’re also likely to have done some processing using MapReduce, Pig, R or another tool before having the data in something resembling table file extracts. In this example though, we’re just going to use four pipe-delimited files and use Hue’s ability to upload a file and create a table from it automatically.

The four files we’ll be using are:

  • flight_performance_2008-10.txt : three years of flight delay stats including origin and departure airport, number of flights, distance, arrival and departure delay in minutes (524MB)
  • geog_dest.txt : destination airports, their state and city, as referenced in the flight stats (150KB)
  • geog_origin.txt : the same set of data as the destinations file, but for the origin airports (150KB)
  • carriers.txt : carrier (airline) codes and descriptions

Given the size of the cluster and the potential data available, you could easily use a larger dataset if you’ve got access to an Exalytics demo environment. As it is, there are about 20m flight legs in the main file, which is enough to give things a bit of a spin.

5. Still logged in as the “airlines” user, click on the Create a new table from a file link. Type in “flight_performance” as the table name, and then use the ellipses (“..”) button next to the file path area. When the Choose a File dialog is shown, press the Upload a File button and then navigate to the flight_performance_2008-10.txt file on your local machine. Double-click on the file to select it, and the Hue web UI will then upload the file to the Hadoop cluster, storing it on the HDFS distributed file-system (note that on a Mac I had to switch to Firefox to get this uploader to display).


After a while the file will finish uploading; when it has, click on the link for it in the dialog to select it, and then move onto the next page in the wizard.

The Choose Delimiter page is then displayed. This file is pipe (“|”) delimited, so select Other as the Delimiter and key in “|” (no quotes); press Preview to then display the file data, which will look like this:


Press Next, and then define the columns in the table like this:

  • Year (int)
  • uniquecarrier (string)
  • origin (string)
  • dest (string)
  • arrivaldelaymins (int)
  • depdelaymins (int)
  • flights (int)
  • distance (int)

Then, press the Create Table button, and the table will then be created and the file used to populate it. Once complete, review the Columns tab in the table display and then the Sample tab, which should output something like this:


Then, repeat this process for the other three files, creating the following tables:

origin (based on geog_origin.txt)

  • origin (string)
  • origin_desc (string)
  • origin_city (string)
  • origin_state (string)
  • origin_airportid (string)

destination (based on geog_dest.txt)

  • dest (string)
  • dest_desc (string)
  • dest_city (string)
  • dest_state (string)
  • dest_airportid (string)

carriers (based on carriers.txt)

  • carrier (string)
  • carrier_desc (string)

We’ve now uploaded all the data to the Hadoop cluster; let’s take a look at it now before we move over to OBIEE.

6. When you upload files via Hue into Hadoop’s HDFS filesystem, it normally puts them into the home directory for that user in HDFS (for example /user/airlines/). If you choose to create a table from that file though, Hue and Hive move the file into Hive’s part of the HDFS filesystem, creating a sub-directory first for that new database. You can see where your files have gone by clicking on the File Browser button at the top of the Hue page, then navigating to /user/hive/warehouse/bi_airlines.db – you should see your files there (or more correctly, directories that contain your files). You can also map Hive tables to files outside of the /user/hive/warehouse directory (they’re called “external tables”), but this behaviour is the default, and we’ll leave things as they are for now.
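
As an aside, here’s a hedged sketch of what such an external table might look like, assuming the files had been left in a directory such as /user/airlines/flight_data (the table name and path are hypothetical; the columns follow the flight_performance definition above), run from a shell on one of the cluster nodes:

hive -e "
USE bi_airlines;
CREATE EXTERNAL TABLE flight_performance_ext (
  year INT, uniquecarrier STRING, origin STRING, dest STRING,
  arrivaldelaymins INT, depdelaymins INT, flights INT, distance INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/user/airlines/flight_data';"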


So where are these files kept in this Hadoop cluster in the EC2 cloud? To find out, click on the flight_performance entry, and notice that the file (with the .txt extension) is actually contained within it – we’d actually clicked on a directory for that file. In fact, Hive tables can just as easily map onto a directory of files, so you could add in other years’ data here, or in fact thousands of files – this is usually how incoming data is received in big data-type applications. With the single flight_performance.txt file displayed, click on the file to view its contents, then click on the View File link and notice the First Block | Previous Block | Next Block | Last Block and bytes areas – HDFS in fact breaks the file into blocks, and stores the file in several (redundant) places on nodes in the cluster, to give us fault-tolerance and make it easy for multiple nodes to process the dataset in block chunks.

7. So let’s start by running a couple of queries in Hive, using Hue again. Click on the Beeswax (Hive UI) icon at the top of the page, ensure bi_airlines is selected as the Database, and then type in the following HiveQL query:

select count(*) from flight_performance;

Execute the query and then watch the log output. You’ll see Hive creating and then submitting for execution the MapReduce jobs to select your columns (the “map” part) and then aggregate the results (the “reduce” bit).


The count should return about 19m rows, and the query should take around a minute and a half to run. Now let’s try something more interesting:

select sum(f.flights)
from flight_performance f join origin o on (f.origin = o.origin)
where o.origin = 'SFO';

This time the query takes a bit longer, and when it completes you can see links for the two MapReduce jobs that it used to sum the flight data, as shown in the screenshot below:


Click on one of the MR Jobs links and you can see a bit more detail about the MapReduce job that provided that part of the dataset – in the example below, there were three mappers that ran initially, then another two to setup and cleanup the job, and then one reducer to aggregate the data. Clicking on the other link is a similar story – a single mapper for the main data selection, then a reducer and control mappers to control and aggregate the dataset.


When you think about it, it’s pretty amazing what Hive does, compared to writing the MapReduce code yourself and then running it. And it’s probably fine for ETL-type access, where most probably there’s a lot more data to load than just this small fact table, but it wouldn’t really be good for BI-type queries as we’re talking 1, 2 or 3 minutes to return data. And that’s what Impala is for – access to the same data, using the same Hive catalog, but much faster queries that don’t use MapReduce to retrieve the data.

8. So let’s run the same query using Impala. Click on the Cloudera Impala (TM) Query UI icon at the top of the Hue page, and select from the drop-down menu under Database – note how the bi_airlines database isn’t showing there. To have it show, go into the query editor area and type in:

invalidate metadata;

This will have Impala re-load the Hive catalog metadata, and the bi_airlines database should then be listed. Select it, and then try the same query as before:


select sum(f.flights)
from flight_performance f join origin o on (f.origin = o.origin)
where o.origin = 'SFO';

This time when you run it, it returns in a couple of seconds. So now we’ve got some data and some options for querying it, let’s move over to OBIEE and try and connect it to the cluster.

Connecting OBIEE 11.1.1.7 to Cloudera Hadoop on Amazon EC2 using Hive and Impala

For simplicity’s sake, we’ll use OBIEE 11.1.1.7 running on Windows (Windows Server 2008 R2 64-bit, in my case), and we’ll use Cloudera’s own ODBC drivers to make the connection. Oracle’s recommendation is that you use Linux for Apache Hadoop / Hive connectivity, and they provide their own drivers as part of the 11.1.1.7 install and on OTN; however I don’t think these connect to the HiveServer2 service that recent CDH4 installs use, and I know the Cloudera ones do work. So starting with a standard install of OBIEE 11.1.1.7 on Windows 64-bit, follow these steps to initially connect via Hive.

1. Start by downloading the Impala and Hive ODBC drivers from the Cloudera website, which at the time of writing can be found here:

Install the Hive one first, and we’ll try those before going over to the Impala ones. Run the MSI installer for Hive, and then open the 64-bit ODBC Data Source Administrator utility in Windows, so we can create the ODBC connection through to Hive.

2. Next we need to find the external EC2 DNS name for the virtual server we added the HiveServer2 service to in the previous article. Open up the Cloudera Manager website on the instance you created right back at the start of yesterday’s article to host Cloudera Manager, and navigate to the Hosts > All Hosts page, like this:

In my example, the virtual server that’s running HiveServer2 is displayed with its internal EC2 DNS name, like this:


This name only works when you’re internal to the EC2 network though, so you’ll need to go over to the AWS Management Console, find the entry for that instance using the private DNS name, and then use that to retrieve the public one, like this:


3. Now you’re good to go. Back in the Windows desktop, click on System DSN, and then Add. In the list of ODBC data source drivers, you should see Cloudera ODBC Driver for Apache Hive; select it and press Finish.

Then, when the Cloudera ODBC Driver for Apache Hive DSN Setup dialog is shown, enter the following details, substituting the host name that you just retrieved in the previous step, that’s running HiveServer2:

Data Source Name : hive_demo
Host : <your host name with HiveServer2 running>
Port : 10000
Database : bi_airlines
Hive Server Type : Hive Server 2
Authentication Mechanism : User Name
User Name : airlines

so that the dialog looks like this:


Press Test, and check that the test results are successful (note I’m using a slightly older version of the drivers, so the dialog might look a bit different in the latest version). Then press OK, and OK again to close the dialog and save the system DSN.

Now, create a new RPD or log into an online one that’s also on this OBIEE host server, so that any online access can also use the ODBC drivers you just installed. When the RPD is open for editing, select File > Import Metadata .., and then when prompted, select the DSN you created a moment ago – in my case, “hive_demo”, enter the airlines/airlines username and password, and then press Next to proceed to the table import page.

Make sure “Tables” is still checked, press Next, and then select and bring across the bi_airlines database you created earlier, as shown in the Data source view:


Press Finish to complete the metadata import.

4. Now click on the Hive physical database in the Physical panel in the BI Administration tool, to display the Database properties dialog. Change the Database type: from ODBC Basic to Apache Hadoop, and press No when asked if you’d like to edit the connection pool properties for this database, then press OK to close this dialog, then check in the changes to the RPD. To now check connectivity to the Hadoop cluster via Hive ODBC, right-click on one of the tables and select View Data…


Once all is working OK, create keys on the origin.origin, destination.dest and carriers.carrier tables and then connect the fact table to them, so you’ve got a physical model that looks like this:


Then, finally, pull the rest of the RPD together and create a simple report in answers; the key thing is that you get some data through, as I’ve managed to do in the screenshot below.


But of course – it’s not very fast – queries typically take 2, 3 minutes to run, and these are just simple ones. As I said earlier – fine for ETL, particularly when the dataset is likely to be a lot bigger, but not great for ad-hoc BI queries. So let’s set up an Impala connection instead, and see how that goes.

5. Before this will work though, we’ll need to add the Impala port – 21050 – to the security group that Cloudera Manager created when it provisioned the Hadoop nodes yesterday. To do this, go back into the AWS Management Console, click on the Security Groups menu item and navigate to the security group set up by Cloudera Manager, in my case called “jclouds#impala-demo-cdh”. Click on it to select it, and use the Inbound tab to add an additional security rule like this:

Port Range : 21050
Source : 0.0.0.0/0

Then, press Add Rule and then Apply Rule Changes to add this additional port to the security group. Finally, check the list of ports now open for that security group to see that 21050 is now listed.

You can now go over to the Windows environment, install the Impala ODBC drivers, and use the ODBC Administrator utility in Windows to set up the Impala ODBC connection. In my case, I use the following values:

Data Source Name : impala_demo
Host : ec2-46-137-25-2.eu-west-1.compute.amazonaws.com
Port : 21050
Database : bi_airlines
Mechanism : No Authentication


Press Test to check that it’s all working OK, and then import the bi_airlines tables into the RPD as you did with the Hive import.


Double-click on the new physical database and set the database type to “Apache Hadoop” again. In addition though, click on the Features tab in the Database Properties dialog and uncheck the ORDERBY_SUPPORTED checkbox – Impala SQL requires a LIMIT clause after each ORDER BY but OBIEE doesn’t currently provide this (Impala’s an unsupported source at this point in time, so it’s not unexpected). By unchecking this property we get the BI Server to do the results ordering, and queries will then run OK.


Right-click and select View Data… on one of the imported Impala tables to check that it returns data OK, and then build-out the rest of the RPD as you did with the Hive data. Now when results come back, they come back in a matter of seconds (and the results look more correct, too).


So there you have it – a more-or-less step-by-step guide to setting up a Hadoop cluster in Amazon EC2, then analysing data on it using OBIEE and Hive / Impala. Hopefully it was useful – more on this topic over the next few weeks.