Tag Archives: Oracle BI Suite EE

Can Endeca Provide an Alternative to OBIEE for Fast, Complex Ad-hoc Analysis?

Towards the end of last week I was pulling together an article for Oracle Magazine on Oracle Endeca Information Discovery, and how it can now use OBIEE’s BI Server and repository as an additional data source. I won’t go into the details of that connectivity now as it’ll be covered in the article itself, but one of the reasons I wanted to write this column was because of the dilemma I outlined in this blog post earlier in the month – is it possible to provide an ad-hoc query tool that gives users ultimate flexibility in how they create the report, but always guarantees sub-second response times? In that blog post, and in the Kevin McGinley blog post that inspired it, I speculated that Endeca Information Discovery 2.x might be able to provide that combination of features, and I was keen to try this out in a “real-world” reporting scenario. If you’re new to Endeca and want to do a bit of background reading first, check out our Endeca focus area and our earlier postings on the Endeca technology stack.

Update 25-Mar-2013: Endeca Information Discovery 3.0 is now out, and you can read about my first impressions of this updated release here.

So in the article, I used the OBIEE 11.1.1.6 v207 SampleApp as my data source, in particular the Airline Flight Delays dataset that exists in cut-down form in this VirtualBox VM, and in expanded form on our Exalytics demo server. The screenshot below shows this dataset in an OBIEE dashboard that allows the end-user to analyse flight statistics and associated delays over a number of years using graphs, pivot tables, maps and other interactive visualisations.

[Screenshot: OBIEE dashboard showing the Airline Flight Delays dataset]

All of this works well, and a set of pre-built summary tables that also come with the VM ensure that the reports in the demo run fairly fast – as you’d expect from a demo system provided by Oracle to showcase the tool. But if you go outside of the dashboard and use the analysis editor to create a new report, particularly one that involves lots of subtotals, time-series functions, outer joins across multiple fact tables and so forth, then it can sometimes be many seconds, maybe minutes, before it returns data – basically, if you try hard enough you’ll construct a query that returns results too slowly for a demanding user. I had this issue a couple of weeks ago when I visited a client looking to upgrade from OBIEE 10g to 11g, who were trying to reconcile what appeared to be two contradictory aims – provide a reporting environment where the user could request just about anything, however complex or expensive the query, while still expecting the answer to come back straightaway.

Of course, in reality it’s a tough call to do this consistently with OBIEE, but could Endeca Information Discovery meet this requirement, with its in-memory, column-store search-analytic database back-end and its web-based dashboard interface? Looking at the screenshot below from the Quickstart demo application that comes as part of the Endeca Information Discovery 2.4 download, it’s not a million miles away from an OBIEE dashboard, and if it provides guaranteed split-second response times, could it be a viable replacement for Answers, or at least an option that we could turn to for a particular type of user?

[Screenshot: Endeca Information Discovery 2.4 Quickstart demo dashboard]

It’s worth saying at this point that Oracle, quite clearly, don’t position Endeca Information Discovery as a replacement or upgrade for OBIEE; instead, as I explain in this presentation from last year’s Oracle OpenWorld, Oracle position Endeca Information Discovery in the following way:

  • Primarily, as a BI tool that enables analysis of unstructured and semi-structured data, as well as more traditional structured (measures, dimensions etc) datasets
  • As a “data discovery” tool, through its ability to bring together loosely-related datasets and analyse them using search and lexical analysis tools
  • As an agile, rapid application development tool for BI, because the key/value-pair record-orientated database provided by the Endeca Server requires little upfront data modelling, using a “schema on read” approach rather than the traditional “schema on write” used by relational databases and OBIEE’s semantic layer.

But … Endeca has a number of properties that might be of interest to more traditional BI users looking for in-memory ad-hoc analysis:

  • The in-memory key-value store database it uses doesn’t have the same costs around data manipulation, table joins and disk access that traditional databases have, and the column-based storage it uses is particularly suited to selecting from sets of dimension members
  • The search-and-click-orientated interface might actually suit users better now, given that everyone uses Google and tools such as QlikView have taken off
  • It’s got a good range of data visualisation components, and a web-based dashboard that appears to do a similar job to OBIEE’s interactive dashboard.

So it seemed a good opportunity to take some data from an OBIEE 11g repository, load it up into an Endeca Server datastore, and see how well it actually worked as a replacement for Answers, even though this isn’t what the product was intended for.

The article goes through the steps involved in loading up the Endeca Server datastore through a connection to OBIEE’s BI Server, and it works pretty well, although I did have to create a special version of the SampleApp Flight Delays business model that removes the numbering from the business model tables, as Endeca doesn’t allow numbers as prefixes for its datastore attribute names. I took most of the tables from the Flight Delays subject area, and all of the rows, with the new BI Server integration feature creating a skeleton project for me in Endeca Information Discovery Integrator.

[Screenshot: skeleton project created in Endeca Information Discovery Integrator]

The article then goes on to create a basic dashboard, using a few search and guided navigation components on the left-hand side of the dashboard page, and a selection of basic visualisations on the right, looking something like this:

[Screenshot: basic Endeca Studio dashboard built on the Flight Delays dataset]

So far so good, and for an article whose focus is mainly around the Endeca-OBIEE integration piece, it illustrates the feature well. But how well does this work as an ad-hoc query tool, how fast are the queries, and how might a typical end-user get on trying to create their own reports, graphs and other visuals?

Initial results were encouraging; response times were consistently fast, albeit within the limited, controlled environment provided by the dashboard. All interactions were really in the form of filtering and searching of data, but as you can see from the performance stats below provided by the Endeca Information Discovery Studio control panel, all response times on all dashboards were under a second.

[Screenshot: Endeca Studio performance metrics showing sub-second response times]

Compare this with similar stats for the OBIEE dashboards and analyses, and whilst most queries also came in around a second or so, some of them, particularly the more complex ones (and, ironically, the ones using data from TimesTen), took ten, fifteen or more seconds on average – and this is for a system with predefined queries, aggregate tables and, most probably, caching enabled as well. Loading the SampleApp flight delays data into Endeca took around 20 minutes and used around 600MB of memory on the VM, so based on this initial test, Endeca looks like an interesting alternative to Answers and a regular RDBMS back-end database.

But it’s when you come to use Endeca Information Discovery’s dashboard as an ad-hoc reporting tool that you begin to see the limitations, at least with Studio, the current web-based front-end for the Endeca Server. Going back to the Quickstart dashboard for a moment, the basic analysis process that an end-user would use with this tool would be as follows:

  1. Use the search and guided navigation tools to identify, and then focus in on, the records that are of interest to you
  2. Then, using the graphing, tabular, tag cloud and other visualisations, aggregate and analyse the measures within that subset of records

To take an example, you might use the search and guided navigation tools in Quickstart to focus in on mountain bike sales in the UK, made in the second quarter of calendar year 2007, and once you’ve done this you would see a dashboard page looking like the one below.

[Screenshot: Quickstart dashboard filtered to UK mountain bike sales in Q2 CY2007]

The Endeca Server engine is good at aggregating data on the fly, and as the whole dataset should be held in-memory, results should return fast. In fact, this dashboard doesn’t look a million miles away from an OBIEE dashboard, with the Breadcrumbs component on the left looking just like a set of dashboard prompts on an OBIEE dashboard. And this works pretty well, with the main issue then being how suitable a tool Studio is for users creating their own reports.

For example, going back to the dashboard I created for the Oracle Magazine article, suppose I wanted to add a cross tab (pivot table) component to the dashboard. Selecting it from the list of new components is easy enough, but then the first dialog you’re presented with when you come to configure the rows and columns is this one:

[Screenshot: Cross Tab component configuration dialog prompting for an EQL query]

Errrrr … what’s an EQL query? In fact, EQL is much like OBIEE’s logical SQL and therefore easy to write, but you still need to know the names of attributes, the format aggregation functions need to take, and of course the particular syntax EQL uses. In this example, to create a cross tab that breaks down flights by the percentage that are cancelled, by region and quarter, the EQL would look something like this:

[Screenshot: EQL query behind the flight cancellations crosstab]
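For readers without the screenshot to hand, here is a rough, hypothetical sketch of the general shape such an EQL statement takes – the attribute names and the percentage arithmetic are invented for illustration (and I haven’t validated this against a 2.4 datastore), so treat it as indicative of the syntax rather than the query used in the article:

    RETURN FlightStats AS
    SELECT
       SUM(Flights)                         AS TotalFlights,
       SUM(Cancelled)                       AS CancelledFlights,
       100 * SUM(Cancelled) / SUM(Flights)  AS PctCancelled
    GROUP BY Dest_Region, Dep_Quarter

The point is less the syntax itself and more that the user has to type it in at all – it assumes familiarity with the attribute names and with EQL’s aggregation grammar.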

Once you know EQL, and particularly if you’ve got a copy of Endeca Information Discovery Integrator open in front of you with the attribute names listed, it’s not exactly rocket science, but it’s still a process more like creating ADF Data Visualization Components than working with OBIEE’s Analysis Editor. According to reports around the upcoming 3.0 release of Endeca Information Discovery, the tabular and crosstab components are being reworked, presumably to make them easier for end-users to set up, but where we are now is a long way from the usability of tools like the OBIEE Analysis Editor or even Microsoft Excel.

Once the crosstab is configured and on the screen, options are limited for hierarchy drilling, pivoting, subtotalling and all the other pivot table-style operations that “power users” would expect to see. Attributes, equivalent to columns in OBIEE’s subject areas, can be nested within each other, but there’s no equivalent to OBIEE 11g’s row and column-swapping, selection steps, calculated items and groups, and all the other pivot table features that power users require – and that are typically the cause of the expensive and complex SQL you sometimes see associated with an OBIEE analysis. The screenshot on the left below shows a typical Endeca Studio crosstab, whilst the one on the right shows a pivot table from OBIEE 11.1.1.6, featuring right-click interactions to further manipulate the report dataset after it’s returned from the underlying database.

[Screenshot: Endeca Studio crosstab (left) and OBIEE 11.1.1.6 pivot table with right-click interactions (right)]

It’s a similar story around charts in Endeca Studio. Adding a chart component to a dashboard page and selecting the Properties dialog this time requires that you select from a list of “views”, rather than type in an EQL query, with the average end-user probably not knowing what a view is or how to get hold of one.

[Screenshot: Chart component configuration requiring selection of a view]

Views are in fact metadata objects within Studio that allow you to define the equivalent of a database view over records and attributes in the datastore, designating some of the attributes as “dimensions” and others as “measures”. Except the dimensions are more like member lists, with no obvious place to define hierarchies or any of the other structures associated with OLAP-style reporting. Views are actually quite cool and, once you get the hang of them, a great addition when building Studio applications, but it’s another technique to learn compared to creating crosstab reports, and with the creation of the view we’re starting to get into RPD and catalog-style territory, which probably isn’t where you want end-users spending their time.

[Screenshot: defining a view in Endeca Studio, designating dimensions and measures]

Update 24-March-2013: Oracle Endeca Information Discovery 3.0 has now been released, and the Crosstab component (now renamed Pivot Table component) now uses the same “views” datasource as Chart components, taking away the need to type in raw EQL when defining a crosstab component’s underlying query.

Another issue is a bit more subtle and concerns how the Endeca Server arranges data in the record-based datastore. When you load data into an Endeca Server datastore, you effectively denormalize all the input tables and files into a single large “table” of records, with each record being made up of a number of attributes (used for data selection) and measures (also attributes, but typically aggregated and displayed in numeric form on reports). This works reasonably well if you’re just loading the equivalent of a single fact table and its dimensions into a datastore, with the datastore records taking on the granularity of fact table rows or transactions in your source system. But typically, OBIEE users want to run queries that span multiple fact tables, and once you start trying to mix in fact and transaction sources of differing granularity, you start having to use workarounds and partner solutions such as this in order to have it make some sort of sense to the end user.

None of this is meant to imply any criticism of Endeca Information Discovery, or indeed to let OBIEE off the hook; Endeca was designed first and foremost as a data discovery tool – in a way, the visualisations and analytics are a bonus – and Oracle have clearly stated that the graphing and analysis elements are one of the major areas of the product they’re looking to improve, probably making more use of Oracle’s own data visualisation components and presumably moving away from the Liferay portal and Tomcat on which Endeca Studio is built. But it’s no good comparing the relatively simplistic Endeca data analysis components with the more complex ones that OBIEE uses and saying they are faster, because it’s precisely those extra features in OBIEE’s components that make their queries run slower – who knows how well Endeca would perform given a query of similar complexity?

I think what this exercise said to me is that, whilst the Endeca Server engine is most probably an excellent in-memory back-end for an ad-hoc query tool, the current generation of end-user reporting tools that come with the product just aren’t up to the standard that OBIEE power users would require; instead, they’re clearly aimed more at developers who will build a dashboard and then either use it themselves, as part of a data discovery exercise, or give it to users in a “locked-down” form. And this is the key thing – if we’re considering Endeca as an alternative to OBIEE for these “power users”, they’re going to want to create their own reports, manipulate their own data and so forth, and not having this ability will more or less make it unusable for them. The issue is compounded with Endeca in that the only interface into the Endeca Server datastore is via web service calls, rather than ODBC or JDBC, so Studio is more or less the only game in town until Oracle either make the interface more standard, or start to use OBIEE technologies as a way of analysing and visualising Endeca data.

So … close, but realistically a non-runner due to usability issues. If the requirement really is high-speed ad-hoc analysis of data, your best bet is still probably Essbase or OBIEE, ideally in conjunction with Oracle Exalytics, but one wonders what Oracle have planned in terms of products to meet this need – some sort of combination of Endeca (for in-memory filtering and navigating across all types of data), Essbase (for aggregation and calculation) and OBIEE (for scalability, metadata management and data federation), combined with the best of each tool’s front-end capabilities, would be very interesting – a kind of “Project Fusion” for BI, if you will.

Performance and OBIEE – Summary and FAQ

This article is the final one in a series about OBIEE and performance. You can find the previous posts here:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise

Summary

  • The key to long-term, repeatable, successful performance optimisation of a system is KNOWING where problems lie and then resolving them accordingly. This means investing time up front to develop an understanding of the system and its instrumentation and metrics. Once you have this, you can apply the same method over and over to consistently and successfully resolve performance problems.
  • GUESSING what the problem is and throwing best practice checklists at it will only get you so far. If you hit a lucky streak it may get you far enough to convince yourself that guesswork alone is enough. But sooner or later you will hit a dead-end with your guesswork and have to start truly diagnosing problems – and when that happens, you are starting from scratch in learning and developing your method for doing so.
Don’t guess: know.

Getting the best performance out of OBIEE is all about good design and empirical validation.

To be able to improve performance you must first identify the cause of the problem. If you start ‘tuning’ without identifying the actual cause you risk making things much worse.

The occasionally acerbic tone of these articles may betray my frustration with the incorrect approach that I see people take all too often. A methodical approach is the correct one, and I am somewhat passionate about this, because:

  1. It works! You gather the data to diagnose the issue, and then you fix the issue. There is no guessing and there is no luck.
  2. By approaching it methodically, you learn so much more about how the OBIEE stack works, which, aside from being useful in itself, means that you will design better OBIEE systems and troubleshoot performance more easily. You actually progress in your understanding, rather than remaining in the dark about how OBIEE works and throwing fixes at it in the hope that one works.

FAQ

Q: How do I improve the performance of my OBIEE dashboards/reports?

A: Start here. Use the method described to help understand where your performance is slow, and why. Then set about resolving it as described in this series of blog articles.

Q: No seriously, I don’t have time to read that stuff… how do I fix the performance? My boss is mad and I must fix it urgently!

A: You can either randomly try changing things, in which case Google will turn up several lists of settings to play with, or you can diagnose the real cause of the performance problem. If you’ve run a test then see here for how to analyse the performance and understand where the problem lies.

Q: Why are you being such a bore? Can’t you just tell me the setting to change?

A: I’m the guy putting £0.50 bets on horses because I don’t want to really risk my money with big bets. In my view, changing settings to fix performance without knowing which setting actually needs changing is a gamble. Sometimes the gamble pays off, but in the end the house always wins.

Q: Performance is bad. When I run the report SQL against the database manually it is fast. Why is OBIEE slow?

A1: You can see from nqquery.log on the BI Server how long the SQL takes on the database, so you don’t necessarily need to run it manually. Bear in mind the network between your BI Server and the database, and also the user ID that the query is being executed as. Finally, the database may have cached the query so could be giving a better impression of the speed.

A2: If the query really does run faster manually against the database then look at nqquery.log to see where the rest of the query time is being spent. It could be the BI Server is having to do additional work on the data before it is returned up to the user. For more on this, see response time profiling.

Q: This all sounds like overkill to me. In my day we just created indexes and were done.

A: I’ve tried to make my method as comprehensive as possible, and usable both for large-scale performance tests and for isolated performance issues. If a particular report is slow for one user, then the test define, design and build is pretty much done already – you know the report, and to start with, running it manually is probably fine. Then you analyse why the performance isn’t as good as you want and, based on that, you optimise it.

Q: Should I disable query logging for performance? I have read it is a best practice.

A: Query logging is a good thing and shouldn’t be disabled, although it shouldn’t be set to too detailed a level either.

Reading & References

The bulk of my inspiration and passion for trying to understand more about how to ‘do’ performance properly has come from three key places:

plus the OBIEE gurus at rittmanmead including markrittman and krisvenkat, along with lots of random twitter conversations and stalking of neilkod, martinberx, alexgorbachev, kevinclosson, nialllitchfield, orcldoug, martindba, jamesmorle, and more.

Comments

I’d love your feedback on this. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

I’ve enabled comments on this article in the series only, to keep the discussion in one place. You can tell me what you think on twitter too, @rmoff

Performance and OBIEE – part VII – Optimising OBIEE performance (“Tuning”)

Potato potato?

When I first drafted this blog post, I gave it the title “Tuning”, rather than “Optimising”. Tuning is the word so often used in the context of performance, but it can actually be a bit misleading. A system that has fundamental design flaws doesn’t need “tuning” – it needs redesigning. Tuning can end up being a catch-all phrase meaning “fix performance”, regardless of the scale of the task.

Why does this matter? It matters because, if you are in the performance test phase of a project with a deadline to keep to, it’s useful to be honest with yourself and the project manager in describing the work that needs to be undertaken to improve performance. If your analysis has shown that the design is sound but you’re hitting bottlenecks in performance somewhere along the way, then it is “tuning” you’re doing. However, if your analysis is showing that the metadata model you’ve built in the RPD stinks, the data model on the database sucks, and the dashboard design is suspect – it’s not going to take a bit of “tuning” to fix, it’s going to take some serious redevelopment. From the point of view of time estimation and expectations management, this is important.

A rough analogy would be with piano tuning. A structurally sound piano needs tuning periodically to keep it from being out of tune. However, a piano with woodworm needs a lot more work doing to it than simple tuning – it needs taking to the workshop.

The piano tuning analogy serves one more purpose too: asking a piano tuner to give you a step-by-step guide to tuning a piano is almost as nonsensical as expecting a simple checklist to provide a way to comprehensively fix (“tune”) the performance of OBIEE.

<rant>

This section is deliberately at the end of a long and detailed series of articles describing how to test and analyse the performance of OBIEE. ‘Tuning’ can be a bit of a weasel word in IT, implying light-touch, silver bullet changes that magically improve a system’s performance, when actually it can mean randomly changing a bunch of things in the blind and desperate hope of making bad performance better but with no way of knowing if it worked or what ‘it’ was.

</rant>

Tuning should be undertaken only as part of a performance cycle. It should be approached in one of two ways:

  1. A performance problem has been identified, and you need to resolve it. You should have test data showing where the problem lies, and the analysis you’ve done should have pointed to the root cause. In this case, the tuning you do ought to be nothing more than fixing the problem. It might take lots of diagnosis and investigation to establish the resolution of a diagnosed cause, but crucially all the work you do will be focussed on the root cause that you’ve identified.
  2. You have a performance baseline, and you want to see if you can improve performance or capacity in general. There is no specific problem to resolve, but reducing the runtimes of queries would be nice. To be precise, we’re looking to optimise a system.

Anti-patterns in performance optimisation

There are anti-patterns (“Bad Practices”, if you will) to good performance troubleshooting; here are some of them. You should have your guard up if you see these, and challenge them!

  • Sacred Cows
  • Hero worship
  • Best practice!
  • Tweaking & Tinkering
  • Silver bullets
  • Golden rules
  • Cast iron guarantees
  • Never
  • Always
  • Take my word for it
  • Long bullet point lists

The only “best practice” you should be using all the time is “Use Your Brain”. – Tom Kyte

Generally good design principles to observe in OBIEE

These are not Best Practices! These are things that can work well, but have to be done with brain engaged!

They’re not bad practices either, they’re just good practices.

  • Ensure that the RPD has a correct dimensional Star model in the Logical (BMM) layer
  • Push the dimensional Star schema into the database; don’t build against a normalised data model if performance is a key requirement
  • Create aggregate tables to support user queries
    • Use Usage Tracking to spot dimension levels that would benefit, or Summary Advisor on Exalytics
    • Aggregate Persistence Wizard can do the initial hard work in plumbing the metadata into the RPD for you
      • You still need to maintain the aggregate, unless you are happy to just drop/recreate each time the data changes
    • Don’t forget to create associated aggregated dimension tables. These are particularly useful for good performance of prompts, where a distinct list of values at a level in a dimension is returned.
  • Make sure that aggregates get used when appropriate. Check the SQL that OBIEE is generating isn’t using a less-efficient source table.
  • OBIEE is not an extraction tool, especially not in the front-end.
    • If users really want a data dump, consider doing that for them outside of the tool, for example with SQL*Plus.
    • To still make use of the metadata model in the RPD, but without causing big problems in Presentation Services, use an ODBC or JDBC call into the BI Server directly to get the data dump out (see the nqcmd sketch after this list). Using this method, you could also hook Excel up directly to the BI Server.
  • The fastest query is one that never runs – challenge dashboard & reports designs. Don’t just copy what an existing system does. Analyse the user’s workflow, to see if the reports you build can support and make more efficient what the user does.
  • Generally you should avoid building the RPD against database views, as they can hinder OBIEE’s SQL generation with the result of sub-optimal queries. Database views can also hide inefficient or unnecessary joins and logic. Air your dirty washing in public, and put the underlying tables into the Physical layer of the RPD instead and let OBIEE work with them.
    • This is not a hard and fast rule, and it is not a “Best Practice” (sigh). There will be some genuine cases where a database view is a pragmatic solution to a particular data model issue.
  • Minimise the work being done by the BI Server. When using federation to join data across databases this is unavoidable, but it is generally to be frowned upon within the same database. Wherever possible, all work should be seen to be pushed down to the database.
    • Check how much data the BI Server pulls back from the database as a percentage of rows returned to the user (low % is bad).
    • Monitor the BI Server’s temp directory – if this is filling up with large files it means that the BI Server is having to crunch lots of data
    • How many database queries does one report trigger? (higher is generally less efficient).
    • This SQL will help identify reports for investigation, using existing Usage Tracking data:
      SELECT SAW_SRC_PATH, 
             ROW_COUNT, 
             CUM_NUM_DB_ROW, 
            ( ROW_COUNT / CUM_NUM_DB_ROW ) * 100 AS ROWS_PCT, 
             TOTAL_TIME_SEC, 
             NUM_DB_QUERY 
      FROM   S_NQ_ACCT 
      WHERE  ROW_COUNT > 0 
             AND CUM_NUM_DB_ROW > 0 
          -- Update these predicates to tailor your results as required
          --   AND TOTAL_TIME_SEC > 10 
          --   AND CUM_NUM_DB_ROW > 10000 
      ORDER  BY 4 DESC 
      
  • Size hardware correctly to support the BI Server and Presentation Services (based on past experience and/or Oracle’s documentation)
  • Make sure that there is a balanced hardware configuration throughout the stack (c.f. Greg Rahn and Oracle documentation)
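Following up on the data-dump bullet above: a minimal sketch of how an extract might be pulled over the BI Server’s ODBC interface using nqcmd, rather than through Presentation Services. The DSN, credentials and the Logical SQL are placeholder examples only, not taken from a real system:

    # extract.lsql contains Logical SQL written against the Presentation layer,
    # e.g. (subject area and column names are hypothetical):
    #   SELECT "Geography"."Dest Region", "Facts"."Flights" FROM "Flight Delays";
    #
    # -d = BI Server ODBC DSN, -u/-p = credentials, -s = script, -o = output file
    nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s extract.lsql -o extract_output.txt

Because the query still goes through the BI Server, the extract benefits from the RPD’s metadata and security, but the heavy lifting of spooling the rows happens outside of Presentation Services.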

Optimising OBIEE further

The above section outlines some of the principles you should always be aiming to follow, or have a clear reason why you’re not. There are some other techniques that can be worth investigating when you’re looking to optimise the performance of OBIEE further, discussed below.

BI Server Caching

How have I got this far and still not mentioned caching? To horribly mix two food metaphors, caching is the icing on the cake, it is not the bread and butter of good performance. If you are using caching as your sole means of ensuring good performance then you are skating on thin ice.

That is not to say caching is bad. Caching is good, when its use is thought through and evaluated carefully. Caching has an overhead in terms of management, so you cannot just chuck it in to the mix and forget about it. You need to manage the cache to make sure you’re not serving up stale data to your end users. It might be fast, but it’ll be wrong.

Caching can improve performance for several reasons:

  • It is typically faster to return the results of a query already run and stored in the cache than it is to re-execute the necessary Physical SQL against the source database(s)
  • By not executing the Physical SQL on the database(s), we reduce both the load on the database, and the amount of network traffic, freeing up these resources for non-cached queries
  • The cache stores the results of a Logical SQL query sent to the BI Server, not the data that the database returns. If the BI Server is doing lots of work, for example federating and aggregating lots of data across sources, then by caching the post-processed result, all of this work can be avoided on subsequent cache hits.

The BI Server cache is more advanced than a ‘dumb’ cache where only a direct match on a previous request will result in a hit. The BI Server will parse a Logical SQL query and recognise if it is either a direct match, a subset, or an aggregate of an existing cache entry. So a cache entry for sales by day could well satisfy a subsequent Logical SQL query for sales by year.

Pre-seeding the cache is a good idea, so that users all benefit from the cache, not just those who come along after the first user has run a report that gets stored in the cache. There are a couple of ways to pre-seed the cache:

  1. A BI Delivers Agent with the Destination set to System Services: Oracle BI Server Cache. This agent could optionally be set to run straight after your ETL batch has loaded the data.
  2. Directly from an ETL tool via ODBC/JDBC using the SASeedQuery statement.

If you don’t pre-seed the cache then only users running queries based on queries already run by other users will benefit from the cache.

The flip-side of pre-seeding the cache is purging it, and there are two sensible ways to do this:

  1. Event Polling Table
  2. An ODBC/JDBC command to the BI Server, triggered by the completion of a data load (ETL) – sketched below
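To make the second purge option a little more concrete, here is a minimal sketch of scripting it with nqcmd once the ETL finishes. SAPurgeAllCache (and its more targeted siblings such as SAPurgeCacheByTable and SAPurgeCacheByQuery) are documented BI Server ODBC procedures; the DSN and credentials below are invented examples:

    # purge_cache.lsql contains a single statement:
    #   Call SAPurgeAllCache();
    #
    # Run it against the BI Server once the data load has completed
    nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s purge_cache.lsql

The same mechanism can then be used to re-seed the cache, either by running the Logical SQL of key reports through nqcmd, or by triggering the Delivers agent described in the pre-seeding list above.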

Watch out for the Cache Persistence Time setting on the Physical Table – this defines how long an entry remains in the cache, rather than how frequently to purge it. If you have a daily data load, setting the cache persistence time to 24 hours will not do what you may think. If your data is loaded at 0300, and the first user queries it and creates a cache entry at 0900, that cache entry will remain until 0900 the following day, even though the cached data would have been stale for six hours (since the subsequent data load at 0300).

Where cache persistence time can be useful is in systems where the source data changes frequently and you want to deliberately introduce a lag in the data the user sees, for the benefit of generally faster response times for end-users. For example, you may have a trickle-fed ODS from which you are running OBIEE reports. If the data is being loaded in near-real-time, and the users want to see it 100% current, then evidently you cannot use caching. However, if the users would be happy with a lag in the data, for example ten minutes, then you could enable caching and set the cache persistence time for the relevant physical table to 10 minutes. For the ten minutes that the data is in the cache, the users get fast response times. This could be a pragmatic balance between freshness of data and response times to get the data. Bear in mind that a query taking 2 minutes to run is going to be reporting on data that is 2 minutes out of date already.

Use fast disk for cache and/or temporary files

OBIEE writes various temporary files, including cache data and work files, to disk. By default, these reside in $FMW_HOME/instances/instance1/tmp. You may find that using fast disk (e.g. SSD) or even RAM disk to hold some or all of these temporary files instead could improve performance. Be aware that some of the work files that OBIEE writes can be very big (as big as the data being pulled back from the database, so in the order of gigabytes if you’re unlucky).
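The locations are controlled through NQSConfig.INI – note that in 11g some of these parameters are centrally managed via Enterprise Manager, so check before hand-editing the file. A sketch of the sort of entries involved, with example paths and sizes only:

    # NQSConfig.INI (BI Server) -- illustrative values, not recommendations
    [CACHE]
    ENABLE = YES;
    # put cache files on fast storage; the size is the capacity for this path
    DATA_STORAGE_PATHS = "/mnt/fastdisk/obiee/cache" 500 MB;

    [GENERAL]
    # BI Server work/temporary files
    WORK_DIRECTORY_PATHS = "/mnt/fastdisk/obiee/tmp";

As with any change of this kind, measure the before and after effect rather than assuming faster storage automatically equals faster reports.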

Web Tier for static content caching

In a standard OBIEE installation, WebLogic acts both as the application server (running Java code, communicating with Presentation Services, etc.) and as the HTTP server, handling inbound connections from the web browser and serving up static files such as CSS stylesheets.

It can sometimes be beneficial to introduce a separate HTTP server such as Oracle HTTP Server (OHS), leaving WebLogic to just act as the application server. A dedicated HTTP server such as OHS can be configured to cache and compress static files which can improve the response time for users especially if the network is not a fast one.

For more information, see Venkat’s article Anatomy of BI EE Page Rendering – Use of HTTP/Web Server & Compression
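As an illustration of the kind of configuration a dedicated HTTP server opens up, here is a hedged sketch of Apache/OHS directives for compressing and browser-caching static content. The content types and cache durations are arbitrary examples and would need validating (and testing) against your own environment and OHS version:

    # httpd.conf / OHS snippet -- example values only
    # Compress text-based responses before sending them to the browser
    <IfModule mod_deflate.c>
        AddOutputFilterByType DEFLATE text/html text/css application/javascript
    </IfModule>

    # Allow browsers to cache static files so repeat page loads are lighter
    <IfModule mod_expires.c>
        ExpiresActive On
        ExpiresByType text/css               "access plus 1 week"
        ExpiresByType application/javascript "access plus 1 week"
        ExpiresByType image/png              "access plus 1 week"
    </IfModule>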

Scale out / Scale up

Increasing the number of instances of the system components can help maximise the capacity of the overall BI Domain and enable it to use the full resource of the hardware on which it is running.

Scaling out means adding additional physical servers and extending the BI Domain onto them. Scaling up means increasing the number of one or more of the components running on an existing server.

An example of where this can be useful is the Javahost component. It is configured with a default maximum number of jobs that it can handle. Certain workload types and volumes can hit this maximum with relative ease, so increasing the number of Javahosts can improve the performance by reducing contention for the process.

In general, if you don’t have any evidence of a capacity limit being reached, or in danger of being reached, I would be reluctant to ‘inoculate’ a system by scaling (adding additional component instances) ‘just in case’. You will only add to the number of moving parts to keep track of (and increase the complexity of the configuration, such as the shared presentation catalog folder), without any real reassurance that the change you’ve made will help. It gives a false sense of security, since you’re just presuming – guessing – that the first bottleneck your system will reach is one which is resolved by scaling out/up.

Mark Rittman wrote a post recently in which he discussed the idea of scaling the BI Server (nqsserver) component in order to take advantage of multiple CPU cores, and whether this was in fact necessary. You can read his post here: Does the BI Server System Component Make Full Use of My Multi-Socket, Multi-Core Server?

Optimising the database

This section is most definitely not a comprehensive study; it is a set of a few pointers that I would be looking for before speaking to my honourable DBA colleagues who have longer beards than I and know this stuff inside out.

  • DBAs:
    • You cannot tune the SQL that OBIEE sends to the database; it is generated by OBIEE. If there is a better way to write the SQL query then you need to get the BI Server to generate it that way by amending the RPD. Missing joins etc. indicate a problem with the RPD.
    • You can, however, try optimising the physical implementation of the underlying database objects to make a query that OBIEE generates run faster.

Techniques to evaluate in your physical implementation of the data model include:

  • Appropriate use of Parallel query, including Auto DOP (as written about by my colleague Pete Scott recently)
  • Partitioning
  • Appropriate – but not excessive – indexing
  • Correct data types, particularly on join columns
  • Materialized Views for holding pre-built aggregations of your fact data (a sketch follows this list)
    • (including PCT to incrementally refresh them)
  • Statistics
    • Make sure that they’re representative/accurate
    • Have a proactive statistics management strategy. Don’t just rely on the auto stats jobs.
    • Incremental statistics can be useful
  • Resource Manager is useful for granular control of resources such as parallelism, optionally between different groups of users. For example, power users could be given a greater DOP than normal users.
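Picking up the materialized view point flagged above, here is a minimal sketch of an aggregate MV with query rewrite enabled, so that suitable OBIEE-generated SQL can be transparently redirected to the summary. The table and column names are invented for illustration:

    -- Hypothetical month/region aggregate of a flight delays fact table
    CREATE MATERIALIZED VIEW mv_flights_month_region
      BUILD IMMEDIATE
      REFRESH FORCE ON DEMAND      -- or FAST, if MV logs / PCT are in place
      ENABLE QUERY REWRITE
    AS
    SELECT d.calendar_month,
           g.dest_region,
           SUM(f.flights)   AS flights,
           SUM(f.cancelled) AS cancelled
    FROM   flights_fact   f
           JOIN dim_date      d ON f.date_key = d.date_key
           JOIN dim_geography g ON f.dest_key = g.geog_key
    GROUP  BY d.calendar_month, g.dest_region;

Whether the optimizer actually rewrites OBIEE’s queries against it still needs verifying (for example via the execution plan), which is exactly the kind of empirical check this series has been arguing for.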

Oracle’s tuning document

You can get the Oracle official OBIEE tuning document from here: Best Practices Guide for Tuning Oracle® Business Intelligence Enterprise Edition.

When Oracle released this document in December 2012 it caused a mini-twitter-storm amongst some OBIEE professionals. On the one hand, publishing a list of settings to evaluate is of great help. On the other, publishing a list of settings to evaluate with no context or method with which to validate them is no help whatsoever. In my [not very humble] opinion, a supplemental list of configuration parameters – especially from the software vendor themselves – should only go hand-in-hand with details of how to properly evaluate them. Here is why I think that:

  1. Turn the dial to 42. If Oracle says so, then that’s what we’ll do. Time is wasted changing configuration without understanding why.
  2. Each configuration change is a move away from the default, and so increases the chance of it being missed in environment migrations and product upgrades.
  3. If there is a suggested default, with no context or caveat, then why isn’t it an instruction in the installation guide? (“For a stable / an even performance over time, you should at least set two Presentation Services per server and two Javahost per server.”)
  4. The document suggests enabling BI Server caching, with no observation that this could lead to users getting wrong (stale) results.
  5. It propagates the idea that performance is only a case of setting the correct values in configuration files. In the tuning guide there is a generic mention of “Application Design”, but in a document that discusses parameter changes throughout the OBIEE stack, where is the lengthy discussion of the underlying data model, appropriate RPD model, query push-down? These fundamental design principles count a thousand times over against how many tcp_listeners an OS is configured for.
  6. No explanation of (a) why a setting should be changed and (b) in what situation. For example, changing tcp settings will make no difference on a low-concurrency system horrendously bottlenecked on poor database queries resulting from a bad data model. If a tcp setting needs changing, which corresponding OS network stat will show a bottleneck? Or response time profile to show excessive wait times at the network layer?
  7. A shopping list of parameters to change should be the last step of performance optimisation, but by being the only tuning document available, it gives the impression that this is all there is to performance – turn some knobs and you’re done.
  8. A little knowledge is a dangerous thing. By putting these low-level configuration options out as a generally available document it increases the chances of more harm than good being done by people. If someone doesn’t understand a system then it is best to leave it alone rather than tinker with some changes that have an Oracle stamp of authority on them.
    If my car is running badly, I won’t start trying to remap the engine. I’ll take it to the garage. When they tell me what the problem was and what they did to fix it, I won’t take that recommendation and tell my friend whose car is running slowly to do the same.

This isn’t to say the document isn’t useful. It is very useful. It’s just a shame that it is the only OBIEE tuning document from Oracle. It’s really useful to have at the end of a performance test once test results are in and diagnosis and resolution of the problems start. But it’s necessary to have the diagnosis and means to validate a change first, before throwing changes at a system.

Summary

This is the penultimate article in this series about OBIEE and performance. For an overview and conclusion of my method for improving performance in OBIEE, see the final post in this series, here.

Other articles in this series

This article is part of a series on the subject of Performance and OBIEE:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
  8. Summary and FAQ

Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to retain the thread of comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

Performance and OBIEE – part VII – Optimising OBIEE performance (“Tuning”)

Potato potato?

When I first drafted this blog post, I gave it the title “Tuning”, rather than “Optimising”. Tuning is the word used so often in the context of performance, but it can actually be a bit misleading. A system that has fundamental design flaws doesn’t need “tuning” – it needs redesigning. Tuning can end up being a catch-all phrase meaning “fix performance ”, regardless of the scale of the task.

Why does this matter? It matters, because if you are in the performance test phase of a project with a deadline to keep to it’s useful to be honest with yourself and the project manager in describing the work that needs to be undertaken to improve performance. If your analysis has shown that the design is sound but you’re hitting bottlenecks in performance somewhere along the way, then it is “tuning” you’re doing. However, if your analysis is showing that the metadata model you’ve built in the RPD stinks, the data model on the database sucks, and the dashboard design is suspect – it’s not going to take a bit of “tuning” to fix, it’s going to take some serious redevelopment. From a point of view of time estimation and expectations management, this is important.

A rough analogy would be with piano tuning. A structurally sound piano needs tuning periodically to keep it from being out of tune. However, a piano with woodworm needs a lot more work doing to it than simple tuning – it needs taking to the workshop.

The piano tuning analogy serves one more purpose too: asking a piano tuner to give you a step by step guide to tuning a piano is almost as nonsensical as it is to expect a simple checklist to provide an way to comprehensively fix (“tune”) the performance of OBIEE.

<rant>

This section is deliberately at the end of a long and detailed series of articles describing how to test and analyse the performance of OBIEE. ‘Tuning’ can be a bit of a weasel word in IT, implying light-touch, silver bullet changes that magically improve a system’s performance, when actually it can mean randomly changing a bunch of things in the blind and desperate hope of making bad performance better but with no way of knowing if it worked or what ‘it’ was.

</rant>

Tuning should be undertaken only as part of a performance cycle. It should be approached from one of two ways:

  1. A performance problem has been identified, and you need to resolve it. You should have test data showing where the problem lies, and the analysis you’ve done should have pointed to the root cause. In this case, the tuning you do ought to be nothing more than fixing the problem. It might take lots of diagnosis and investigation to establish the resolution of a diagnosed cause, but crucially all the work you do will be focussed on the root cause that you’ve identified.
  2. You have a performance baseline, and you want to see if you can improve performance or capacity in general. There is no specific problem to resolve, but reducing the runtimes of queries would be nice. To be precise, we’re looking to optimise a system.

Anti-patterns in performance optimisation

There are anti-patterns (“Bad Practices”, if you will) to good performance troubleshooting; here are some of them. You should have your guard up if you see these, and challenge them!

  • Sacred Cows
  • Hero worship
  • Best practice!
  • Tweaking & Tinkering
  • Silver bullets
  • Golden rules
  • Cast iron guarantees
  • Never
  • Always
  • Take my word for it
  • Long bullet point lists

The only “best practice” you should be using all the time is “Use Your Brain”. – Tom Kyte

Generally good design principles to observe in OBIEE

These are not Best Practices! These are things that can work well, but have to be done with brain engaged!

They’re not bad practices either, they’re just good practices.

  • Ensure that the RPD has a correct dimensional Star model in the Logical (BMM) layer
  • Push the dimensional Star schema into the database; don’t build against a normalised data model if performance is key requirement
  • Create aggregate tables to support user queries
    • Use Usage Tracking to spot dimension levels that would benefit, or Summary Advisor on Exalytics
    • Aggregate Persistence Wizard can do the initial hard work in plumbing the metadata into the RPD for you
      • You still need to maintain the aggregate, unless you are happy to just drop/recreate each time the data changes
    • Don’t forget to create associated aggregated dimension tables. These are particularly useful for good performance of prompts where a distinct list of values at a level in a dimension are returned.
  • Make sure that aggregates get used when appropriate. Check the SQL that OBIEE is generating isn’t using a less-efficient source table.
  • OBIEE is not an extraction tool, especially not in the front-end.
    • If users really want a data dump, consider doing that for them outside of the tool, for example with sql*plus.
    • To still make use of the metadata model in the RPD, but without causing big problems in Presentation Services, use an ODBC or JDBC call into the BI Server directly to get the data dump out. Using this method, you could hook in Excel directly to the BI Server.
  • The fastest query is one that never runs – challenge dashboard & reports designs. Don’t just copy what an existing system does. Analyse the user’s workflow, to see if the reports you build can support and make more efficient what the user does.
  • Generally you should avoid building the RPD against database views, as they can hinder OBIEE’s SQL generation with the result of sub-optimal queries. Database views can also hide inefficient or unnecessary joins and logic. Air your dirty washing in public, and put the underlying tables into the Physical layer of the RPD instead and let OBIEE work with them.
    • This is not a hard and fast rule, and it is not a “Best Practice” (sigh). There will be some genuine cases where a database view is a pragmatic solution to a particular data model issue.
  • Minimise the work being done by the BI Server. When using federation to join data across databases it is unavoidable but generally it is to be frowned upon if within the same database. Wherever possible, all work should be seen to be pushed down to the database.
    • Check how much data the BI Server pulls back from the database as a percentage of rows returned to the user (low % is bad).
    • Monitor the BI Server’s temp directory – if this is filling up with large files it means that the BI Server is having to crunch lots of data
    • How many database queries does one report trigger? (higher is generally less efficient).
    • This SQL will help identify reports for investigation, using existing Usage Tracking data:
      SELECT SAW_SRC_PATH, 
             ROW_COUNT, 
             CUM_NUM_DB_ROW, 
            ( ROW_COUNT / CUM_NUM_DB_ROW ) * 100 AS ROWS_PCT, 
             TOTAL_TIME_SEC, 
             NUM_DB_QUERY 
      FROM   S_NQ_ACCT 
      WHERE  ROW_COUNT > 0 
             AND CUM_NUM_DB_ROW > 0 
          -- Update these predicates to tailor your results as required
          --   AND TOTAL_TIME_SEC > 10 
          --   AND CUM_NUM_DB_ROW > 10000 
      ORDER  BY 4 DESC 
      
  • Size hardware correctly to support the BI Server and Presentation Server (based on past experience and/or Oracle’s documentation)
  • Make sure that there a balanced hardware configuration throughout the stack (c.f. Greg Rahn and Oracle documentation)

Optimising OBIEE further

The above section outlines some of the principles you should always be aiming to follow, or have a clear reason why you’re not. There are some other techniques that can be worth investigating when you’re looking to optimise the performance of OBIEE further, discussed below.

BI Server Caching

How have I got this far and still not mentioned caching? To horribly mix two food metaphors, caching is the icing on the cake, it is not the bread and butter of good performance. If you are using caching as your sole means of ensuring good performance then you are skating on thin ice.

That is not to say caching is bad. Caching is good, when its use is thought through and evaluated carefully. Caching has an overhead in terms of management, so you cannot just chuck it in to the mix and forget about it. You need to manage the cache to make sure you’re not serving up stale data to your end users. It might be fast, but it’ll be wrong.

Caching can improve performance for several reasons:

  • It is typically faster to return the results of a query already run and stored in the cache than it is to re-execute the necessary Physical SQL against the source database(s)
  • By not executing the Physical SQL on the database(s), we reduce both the load on the database, and the amount of network traffic, freeing up these resources for non-cached queries
  • The cache stores the results of a Logical SQL query sent to the BI Server, not the data that the database returns. If the BI Server is doing lots of work, for example, federating and aggregating lots of data across sources, then by caching the result post-processing, all of this work can be avoided by subsequent cache hits.

The BI Server cache is more advanced than a ‘dumb’ cache where only a direct match on a previous request will result in a hit. The BI Server will parse a Logical SQL query and recognise if it is either a direct match, a subset , or an aggregate of an existing cache entry. So a cache entry for sales by day could well satisfy a subsequent Logical SQL query for sales by year.

Pre-seeding the cache is a good idea, so that users all benefit from the cache, not just those who come along after the first user has run a report that gets stored in the cache. There are a couple of ways to pre-seed the cache:

  1. A BI Delivers Agent with the Destination set to System Services : Oracle BI Server Cache. This agent could optionally be set to run straight after your ETL batch has loaded the data.
  2. Directly from an ETL tool via ODBC/JDBC using the SASeedQuery statement.

If you don’t pre-seed the cache then only users running queries based on queries already run by others users will benefit from the cache.

The flip-side of pre-seeding the cache is purging it, and there are two sensible ways to do this :

  1. Event Polling Table
  2. ODBC/JDBC command to the BI Server, triggered by the completion of a data load (ETL)

Watch out for the Cache Persistence time in the Physical Table – this defines how long an entry remains in the cache, rather than how frequently to purge it. If you have a daily data load, setting the cache persistence time to 24 hours will not do what you may think. If your data is loaded at 0300, the first user queries it and creates a cache entry at 0900, that cache entry will remain until 0900 the following day, even though the cached data would have been stale for six hours (since the subsequent data load at 0300).

Where Cache persistence time can be useful is in systems with frequent changes in the source data and you want to deliberately introduce a lag in the data the user sees for the benefit of generally faster response times for end-users. For example, you may have a trickle-fed ODS from which you are running OBIEE reports. If the data is being loaded in near-real-time, and the users want to see it 100% current, then evidently you cannot use caching. However, if the users would be happy with a lag in the data, for example ten minutes, then you could enable caching and set the cache persistence time for the relevant physical table to 10 minutes. For the ten minutes that the data is in the cache, the users get fast response times. This could be a pragmatic balance between freshness of data and response times to get the data. Bear in mind that a query taking 2 minutes to run is going to be reporting on data that is 2 minutes out of date already.

Use fast disk for cache and/or temporary files

OBIEE writes various temporary files, including cache data and work files, to disk. By default, these reside in $FMW_HOME/instances/instance1/tmp. You may find that using fast disk (e.g. SSD) or even RAM disk to hold some or all of these temporary files instead could improve performance. Be aware that some of the work files that OBIEE writes can be very big (as big as the data being pulled back from the database, so in the order of gigabytes if you’re unlucky).

Web Tier for static content caching

In a standard OBIEE installation, WebLogic acts as both the application server (running java code, communicating with Presentation Services etc) as well as the HTTP server, handling inbound connections from the web browser, serving up static files such as CSS stylesheets.

It can sometimes be beneficial to introduce a separate HTTP server such as Oracle HTTP Server (OHS), leaving WebLogic to just act as the application server. A dedicated HTTP server such as OHS can be configured to cache and compress static files which can improve the response time for users especially if the network is not a fast one.

For more information, see Venkat’s article Anatomy of BI EE Page Rendering – Use of HTTP/Web Server & Compression

Scale out / Scale up

Increasing the number of instances of the system components can help maximise the capacity of the overall BI Domain and enable it to use the full resource of the hardware on which it is running.

Scaling out is to add additional physical servers and extend the BI Domain onto them. Scaling up is to just increase the number of one or more of the components that are running on an existing server.

An example of where this can be useful is the Javahost component. It is configured with a default maximum number of jobs that it can handle. Certain workload types and volumes can hit this maximum with relative ease, so increasing the number of Javahosts can improve the performance by reducing contention for the process.

In general, if you don’t have any evidence of a capacity limit being reached or in danger of being reached, I would be reluctant to ‘inoculate’ a system by scaling (adding additional component instances) ‘just in case’. You will only add to the number of moving parts to keep track of (and increase complexity of configuration such as shared presentation catalog folder), and without really a reassurance that the change you’ve made will help. It gives a false sense of security, since you’re just presuming, guessing, that the first bottleneck your system will reach is one which is resolved by scaling out/up.

Mark Rittman wrote a post recently in which he discussed the idea of scaling the BI Server (nqsserver) component in order to take advantage of multiple CPU cores, and whether this was in fact necessary. You can read his post here: Does the BI Server System Component Make Full Use of My Multi-Socket, Multi-Core Server?

Optimising the database

This section is most definitely not a comprehensive study; it is a few pointers that I would look at before speaking to my honourable DBA colleagues, who have longer beards than I and know this stuff inside out.

  • DBAs:
    • You cannot tune the SQL that OBIEE sends to the database; it is generated by OBIEE. If there is a better way to write the SQL query, then you need to get the BI Server to generate it that way by amending the RPD. Missing joins etc. indicate a problem with the RPD.
    • You can, however, try optimising the physical implementation of the underlying database objects to make a query that OBIEE generates run faster.

Techniques to evaluate in your physical implementation of the data model include:

  • Appropriate use of Parallel query, including Auto DOP (as written about by my colleague Pete Scott recently)
  • Partitioning
  • Appropriate – but not excessive – indexing
  • Correct data types, particularly on join columns
  • Materialized Views for holding pre-built aggregations of your fact data
    • (including Partition Change Tracking (PCT) to refresh them incrementally)
  • Statistics
    • Make sure that they’re representative/accurate
    • Have a proactive statistics management strategy. Don’t just rely on the auto stats jobs.
    • Incremental statistics can be useful
  • Resource Manager is useful for granular control of resources such as parallelism, optionally between different groups of users. For example, power users could be given a greater DOP than normal users.

Oracle’s tuning document

You can get the Oracle official OBIEE tuning document from here: Best Practices Guide for Tuning Oracle® Business Intelligence Enterprise Edition.

When Oracle released this document in December 2012 it caused a mini-twitter-storm amongst some OBIEE professionals. On the one hand, publishing a list of settings to evaluate is of great help. On the other, publishing a list of settings to evaluate with no context or method with which to validate them is no help whatsoever. In my [not very humble] opinion, a supplemental list of configuration parameters – especially from the software vendor themselves – should only go hand-in-hand with details of how to properly evaluate them. Here is why I think that:

  1. Turn the dial to 42. If Oracle says so, then that’s what we’ll do. Time is wasted changing configuration without understanding why.
  2. Each configuration change is a move away from the default, and thus increases the chances of it being missed in environment migrations and product upgrades
  3. If there is a suggested default, with no context or caveat, then why isn’t it an instruction in the installation guide? (“For a stable / an even performance over time, you should at least set two Presentation Services per server and two Javahost per server.”)
  4. The document suggests enabling BI Server caching, with no observation that this could lead to users getting wrong (stale) results
  5. It propagates the idea that performance is only a case of setting the correct values in configuration files. The tuning guide makes a generic mention of “Application Design”, but in a document that discusses parameter changes throughout the OBIEE stack, where is the lengthy discussion of the underlying data model, an appropriate RPD model, query push-down? These fundamental design principles matter a thousand times more than how many tcp_listeners an OS is configured for
  6. There is no explanation of (a) why a setting should be changed and (b) in what situation. For example, changing TCP settings will make no difference on a low-concurrency system horrendously bottlenecked on poor database queries resulting from a bad data model. If a TCP setting needs changing, which corresponding OS network statistic will show the bottleneck? Or which entry in a response time profile will show excessive wait times at the network layer?
  7. A shopping list of parameters to change should be the last step of performance optimisation, but by being the only tuning document available, it gives the impression that this is all there is to performance – turn some knobs and you're done.
  8. A little knowledge is a dangerous thing. By putting these low-level configuration options out as a generally available document, it increases the chances of more harm than good being done. If someone doesn't understand a system, it is better to leave it alone than to tinker with changes just because they have an Oracle stamp of authority on them.
    If my car is running badly, I won't start trying to remap the engine; I'll take it to the garage. And when they tell me what the problem was and what they did to fix it, I won't take that recommendation and tell my friend whose car is running slowly to do the same.

This isn't to say the document isn't useful – it is, very. It's just a shame that it's the only OBIEE tuning document from Oracle. It is most valuable at the end of a performance test, once the results are in and diagnosis and resolution of the problems begin. But you need the diagnosis, and the means to validate a change, before throwing changes at a system.

Summary

This is the penultimate article in this series about OBIEE and performance. For an overview and conclusion of my method for improving performance in OBIEE, see the final post in this series, here.

Other articles in this series

This article is part of a series on the subject of Performance and OBIEE:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
  8. Summary and FAQ

Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to retain the thread of comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

Performance and OBIEE – part VI – Analysing results

This part of the OBIEE performance cycle is the one which arguably matters most. Having defined what we're going to test, built a means by which to test it, and executed that test, we now need to sift through the tea leaves and work out what the data we collected is telling us. Except we're not going to use hocus-pocus like tea-leaf reading, gut feeling or best-practice checklists; we're going to use cold, hard data and analysis.

Analysing the data breaks down into several stages, and is often an iterative process:

  1. Analyse the net response time. Is it as fast as it needs to be, at the required level of user concurrency?
  2. If the response time is too slow (“too slow” being defined by you or your users, in advance of the test), then diagnose why. This is another phase of analysis: breaking down the net response time into its constituent parts and analysing system and OBI metrics for signs of a bottleneck. The output of this phase will be a hypothesis as to the cause of the performance problem
  3. Based on the diagnosis of the issue, apply one change to the system to improve it, that is, to resolve the performance issue. Having made one change (and one change only), repeat the original test and the analysis cycle to determine the impact of the change.

Analysis

How you analyse your data determines whether you will be accurately and fairly representing the results of the test in your diagnoses and conclusions.

Avoid Averages

From your test execution you will have a series of response times. You need to summarise – that is, aggregate – these into a single headline figure for your test report. If you take only one thing away from reading this, let it be the following point: don't use averages! I'll say it again for emphasis: averages are not a good way to represent your test data. This is nothing that you won't read in every other decent article written on performance testing and analysis.

When you average out a series of data, you mask and muddy your data by inadvertently hiding extreme values in the series. A much better summary to use is the percentile.

Consider a performance test of a single dashboard for a single user. It is run ten times, so as to get a representative set of data for the response time. Here are two series of data, both with an average response time of five seconds. If we look at the 90th Percentile for the same two series of data, we can see that series ‘A’ has a response time of ten seconds, whilst series ‘B’ has a response time of six seconds.

As a user, if you run this dashboard, which behaviour would you prefer? Series ‘A’, where it might take a second to run or it might take ten seconds, or Series ‘B’ where it is going to be five seconds, give or take one second either side? As human beings we like consistency and certainty. Sure, it’d be nice if the dashboard ran in a second, but most people would rather know that it’s definitely going to run within six seconds and not almost double that. That uncertainty can also be seen in the standard deviation figure in the two series. The lower the standard deviation, the more consistent the response times are.
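To make this concrete, here is a minimal Python sketch using made-up response times that mirror the two series above: both average out at five seconds, but the 90th percentile (and the standard deviation) tell a very different story.

    import statistics

    # Hypothetical response times (seconds) for ten runs of the same dashboard
    series_a = [1, 1, 2, 2, 3, 4, 6, 10, 10, 11]   # erratic
    series_b = [4, 4, 5, 5, 5, 5, 5, 5, 6, 6]      # consistent

    def percentile(data, pct):
        # Nearest-rank percentile - crude, but fine for illustration
        ordered = sorted(data)
        rank = max(1, int(round(pct / 100.0 * len(ordered))))
        return ordered[rank - 1]

    for name, series in (("A", series_a), ("B", series_b)):
        print("Series %s: mean %.1f, 90th percentile %d, stdev %.1f"
              % (name, statistics.mean(series), percentile(series, 90),
                 statistics.stdev(series)))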

For more detail and clear statistical explanations, read “Averages Only” in Zed Shaw’s Programmers Need To Learn Statistics and “Percentile Specifications” in Cary Millsap’s Thinking Clearly about Performance.

Throw away your test data

Well, not literally. But if you are testing user concurrency, make sure that when you calculate your percentile (e.g. 90th percentile) response time, you do it for a given number of users – otherwise you are distorting the figure. Typically a test will have a 'ramp-up' period during which the concurrent users are gradually introduced onto the system, until the target number is active, at which point the system is in 'steady state'. It is from this point, the steady state, that you should derive your response time calculation. It is useful to look at how the response time varies as the workload increases, but for an accurate figure of the response time at a given number of users you should discard the data from any period where the full number of users was not running, as in the sketch below.
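As a sketch of what this filtering might look like, assuming each sample has been logged with a timestamp into a CSV file (the file name, column names and steady-state window here are all hypothetical):

    import csv

    STEADY_START = "2013-02-01 10:15:00"   # full user concurrency reached
    STEADY_END   = "2013-02-01 10:45:00"   # ramp-down begins

    samples = []
    with open("response_times.csv") as f:          # columns: timestamp,dashboard,seconds
        for row in csv.DictReader(f):
            # Lexicographic comparison works for ISO-style timestamps
            if STEADY_START <= row["timestamp"] <= STEADY_END:
                samples.append(float(row["seconds"]))

    samples.sort()
    p90 = samples[max(0, int(round(0.9 * len(samples))) - 1)]
    print("Steady-state 90th percentile: %.1fs from %d samples" % (p90, len(samples)))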

Analysis summary

The output of this phase of the analysis should be very simple:

The 90th Percentile response time for dashboard <xx> is <yy> at a user concurrency of <zz>

And this should then satisfy a pass/fail criterion that was specified when you defined the test.

  • If the test passes, great. Record all your test parameters and data, and move on to the next test.
  • If the test doesn’t pass, then you need to work out why, and for that, see below.

I’m oversimplifying, since there is other data (eg standard deviation) that you might want to include in your summary, along with some commentary around variances observed and so on.

Diagnosing poor OBIEE performance

Get to the root of the problem

So, the test results showed that the dashboard(s) run too slowly. Now what? Now, you need to work out why there is a performance problem. I am deliberately spelling this out, because too many people jump forward to attempting to fix a performance problem without actually understanding exactly what the problem is. They whip out their six-shooters loaded with silver bullets and start blasting, which is a bad idea for two reasons:

  1. You may never know what the problem was – so you won’t be able to avoid doing it again! Everyone makes mistakes; the mark of a good programmer is one who learns from them.
    If I run a dashboard on a 2 CPU / 4GB server and find it's slow, one option could be to run it on an 8 CPU / 32GB server. Tada! It's faster. But does that mean that every report now needs to be run on the big server? Well, yes, it'd be nice – but how do we know that the original performance problem was actually down to machine capacity, rather than, say, a missing filter in the report? Or a wrong join in the RPD? It could be an expensive assumption to make that the problem's root cause was a lack of hardware capacity.
  2. In determining the root cause, you will learn more about OBIEE. This better understanding means you are less likely to make performance errors in the future. You will also become better at performance diagnostics, making live problems in Production, as well as future performance tests, easier and faster to resolve.

“I broke things, so now I will jiggle things randomly until they unbreak” is not acceptable. – Linus Torvalds

There are always exceptions, but exceptions can be justified and supported with data. Just beware of the silver bullet syndrome… The unfortunate part […] is that rarely anyone goes back and does the root cause analysis. It tends to fall into the bucket of “problem…solved”. – Greg Rahn

Performance vs Capacity

I always try to split it into #performance tuning (response time) and capacity tuning (throughput/scalability) – Alex Gorbachev

Performance issues can be local to a report, or global to a system implementation and exacerbated by a particular report or set of reports – or both.

If an individual dashboard doesn't perform with a single user running it, then it certainly isn't going to with 100, and there is clearly a performance problem in the design (of the dashboard, the RPD, or the physical data model) or in its implementation.

However, if an individual dashboard runs fine with a single user but performance gets worse and worse the more users run it concurrently, this indicates a capacity problem in the configuration or the physical resources of your system.

So which is which? An easy way to shortcut it is this: before you launch into your mega-multi-user-concurrency tests, test the dashboard with a single user. Is the response time acceptable? If not, then you have a performance problem. You’ve eliminated user concurrency from the equation entirely. If the response time is acceptable, then you can move onto your user concurrency tests.

If you have already run a big user concurrency test and are trying to identify whether the issue is performance or capacity, then look at what happens to your response time as the number of users running increases. If the response time is constant throughout, it indicates a performance problem; if it increases as more users are added, it shows a capacity (which can include configuration) problem. Being able to identify this difference is why I'd never run a user concurrency test without a ramp-up period – without one, you don't get to observe the behaviour of the system as users are added.
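If you have recorded how many users were active when each sample was taken, a quick way to see which pattern you have is to bucket the response times by concurrency. A minimal sketch, with hypothetical file and column names:

    import csv
    from collections import defaultdict

    by_users = defaultdict(list)
    with open("ramp_up_results.csv") as f:         # columns: active_users,seconds
        for row in csv.DictReader(f):
            by_users[int(row["active_users"])].append(float(row["seconds"]))

    # Flat 90th percentiles across the buckets suggest a performance problem;
    # percentiles that climb with the user count suggest a capacity problem.
    for users in sorted(by_users):
        times = sorted(by_users[users])
        p90 = times[max(0, int(round(0.9 * len(times))) - 1)]
        print("%3d active users : 90th percentile %.1fs (%d samples)"
              % (users, p90, len(times)))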

Response time vs active users

In the above graph there are two points evident:

  1. Up to ten users the response time is consistent, around 30 seconds. If the response time needs to be faster than this then there is a performance problem
  2. If 30 seconds is the upper limit of an acceptable response time then we can say that the system has a capacity of 10 concurrent users, and if the user concurrency needs to be greater than this then there is a capacity problem

Errors

Don’t overlook analysing the errors that may come out during your testing. Particularly as you start to hit limits within the stock OBIEE configuration, you might start to see things like:

  • Too many running queries. Server is too busy to process any more queries at this time.
  • com.siebel.analytics.javahost.standalone.SAJobManagerImpl$JobQueueFullException
  • Graph server does not appear to be responding in a timely fashion. It may be under heavy load or unavailable.
  • The queue for the thread pool ChartThreadPool is at it's maximum capacity of 512 jobs.

If you see errors such as these then they will often explain response time variances and problems that you observe in your test data, and they should be at the top of your list to investigate further, either to resolve or to explain.
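A simple way to line error activity up against your response time data is to count occurrences of these messages in the component logs. A minimal sketch – the error strings are those listed above, but the log location is hypothetical, so point it at wherever your collected OBIEE logs live:

    import glob

    ERRORS = [
        "Server is too busy to process any more queries",
        "JobQueueFullException",
        "Graph server does not appear to be responding",
        "ChartThreadPool is at it's maximum capacity",
    ]

    counts = dict.fromkeys(ERRORS, 0)
    for path in glob.glob("logs/*.log*"):          # hypothetical path to collected logs
        with open(path, errors="ignore") as f:
            for line in f:
                for err in ERRORS:
                    if err in line:
                        counts[err] += 1

    for err, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print("%6d  %s" % (n, err))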

Response time profiling

A good way to get started with identifying the root cause(s) of a problem is to build a time profile of the overall response time. This is something that I learnt from reading about Method R, and is just as applicable to OBIEE as it is to the Oracle RDBMS about which it was originally written. This link gives a good explanation of what Method R is.

You can improve a system without profiling, and maybe you can even optimize one without profiling. But you can’t know whether a system is optimal without knowing whether its tasks are efficient, and you can’t know whether a given task is efficient without profiling it. – Cary Millsap

Given the number of moving parts in any complex software stack, there's often more than one imperfection. The trick is to find the most significant one – the one that will yield the best response time improvement when resolved. A profile also lets you identify which fix will give the “biggest bang for your buck” – maybe there are several problems, but the top one requires a complete redesign whilst the second is an easy resolution that will improve response times sufficiently.

So in the context of OBIEE, what does a response time profile look like? If you hark back to the OBIEE stack that I described previously, a simple example profile could look something like this:

OBIEE response time profile

Here we can see that whatever we might do to speed up the chart rendering (5 seconds), the focus of our investigation should really be the 20-second query run on the database, along with the 10 seconds it takes the BI Server to join the results together. Can we eliminate the need for two queries, and can we do something on the database to improve the query run time?
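A response time profile doesn't need fancy tooling; it can be as simple as a list of named steps and their durations. Here is a minimal sketch using the figures above (the second query's five seconds is invented just to round out the example):

    # Illustrative figures only - build yours from the log sources listed later
    profile = [
        ("Database query 1",            20.0),
        ("Database query 2",             5.0),
        ("BI Server join / processing", 10.0),
        ("Chart rendering",              5.0),
    ]

    total = sum(secs for _, secs in profile)
    for step, secs in sorted(profile, key=lambda s: -s[1]):
        print("%-30s %5.1fs  %5.1f%%" % (step, secs, 100 * secs / total))
    print("%-30s %5.1fs" % ("Total", total))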

When building a time profile, start at the highest level and break down the steps based on the data you have. For example, determining the time it takes Presentation Services to send a query to the BI Server is quite a complex process involving low-level log files. Yet it probably isn't a significant line entry on the profile, so by all means note it down, but spend your time on the bigger steps – usually the fetching and processing of the report data.

OBIEE response time profile

A more complicated profile might be something like this:

OBIEE response time profile

Graphing a response time profile can help us comprehend at a glance what's happening, and gives a 'template' against which to compare the profiles that are created. In general you would want to see the split of a time profile heavily weighted towards the database:

OBIEE response time profile

If the response time profile shows that just as much of the total response time is being spent on the BI Server as on the database, then I would want to see what could be done to shift the weight of the work back to the database:

OBIEE response time profile
For more on this subject of where work should ideally be occurring, see the section below “Make sure that the database is sweating”.

Here are the sources you can look to for response time profile data, starting at the user interface and going down to the database:

  • Rendering time – Web browser profiler such as Chrome Developer Tools, YSlow, Google Page Speed
  • WebLogic – access.log will show the HTTP requests coming in
  • Presentation Services – sawlog0.log, but may require custom log levels to get at low-level information
  • BI Server
    • nqquery.log
      • Time to create physical SQL(s), i.e. compile time
      • Time to connect to DB
      • Time to execute on DB
      • Time to process on BI server and return to PS
    • Usage Tracking
      • S_NQ_ACCT
      • S_NQ_DB_ACCT
  • Database – whilst profiling can be extended down to the DB (for example, using a 10046 trace in Oracle), it makes more sense to do this as a standalone analysis piece on an individual query where necessary. In extreme examples the profiling could go beyond the database, down into the SAN, for example.

Diagnosing capacity problems

If a dashboard is performing acceptably under a single-user load, but performance deteriorates unacceptably as the user concurrency increases, then you have a capacity issue. This capacity could be hardware; for example, you have exhausted your CPUs or saturated your I/O pipe. Capacity can also refer to the application and how it is configured. OBIEE is a powerful piece of software, but to make it so flexible there are by definition a lot of ways in which it can be configured – including badly! Particularly as user concurrency (as in, concurrently executing reports) increases into three figures and above, it may be that the default configuration options are not sufficient. Note that this “three figures and above” should be taken with a large pinch of salt, since it could be lower for very 'heavy' dashboards, or much higher for 'light' dashboards. By 'heavy' and 'light' I am primarily referring to the amount of work they cause on the BI Server (e.g. federation of large datasets), Presentation Services (e.g. large pivot tables) and Javahost (e.g. lots of chart requests, such as you'd see with trellis views).

To diagnose a capacity problem, you need data. You need to analyse the response time over time against the measures of how the system was performing over time, and then investigate any apparent correlations in detail to ascertain if there is causation.

ObXKCD

This is where you may need to re-run your performance test if you didn't collect this data the first time around; see the System Metrics section above for details on how and what. The easy stuff to collect is OS metrics, including CPU, memory, disk I/O and network I/O, and you should include both the OBI and database server(s) in this. Look at how these behave over time compared to the performance test response times. Using a relatively gradual user ramp-up is a good idea, to pinpoint where things start to come unstuck rather than just plain break.

Network bottleneck observed as load increases beyond c.9 active users
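Once you have OS metrics and response times sampled over the same period, even a crude correlation figure can point you at which metric to dig into first (remembering, per the XKCD, that correlation is only a prompt for investigation, not proof of causation). A minimal sketch with hypothetical file and column names:

    import csv
    from statistics import mean

    def pearson(xs, ys):
        # Simple Pearson correlation coefficient
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy) if sx and sy else 0.0

    cpu, resp = [], []
    with open("metrics_and_response.csv") as f:    # columns: minute,cpu_pct,p90_seconds
        for row in csv.DictReader(f):
            cpu.append(float(row["cpu_pct"]))
            resp.append(float(row["p90_seconds"]))

    print("CPU%% vs response time correlation: %.2f" % pearson(cpu, resp))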

If the OS metrics are unremarkable – that is, there is plenty of capacity left in all of the areas but response times are still suffering as user concurrency increases – then you need to start digging deeper. This could include:

  • OBI Metrics
  • Analysis of the performance of the database against which queries are running
  • End-to-end stack capacity, eg Network, SAN, etc.

OBI Metrics can be particularly enlightening in diagnosing configuration issues – for example, an undersized connection pool or a saturated Javahost.

Don't forget to include the OBI logs in your analysis, since the errors or warnings that they record may also point to the cause of the issues you're seeing.

Additional diagnosis tips

Having profiled the response time, you should hopefully have pinpointed an area to investigate in coming up with your diagnosis. The additional analysis you may need to do to determine the root cause is very dependent upon the area you have identified. Below are some pointers to help you.

Make sure that the database is sweating

As mentioned above, a healthy OBI system will, wherever possible, push the 'heavy lifting' work such as filtering, calculations and aggregations down to the database. You want to see as little difference as possible between the data volume returned from the database to the BI Server and that returned to the user.

Use nqquery.log to look at the bytes and rows that OBIEE is pulling back from the database. For example, you don’t want to see entries such as this:

Rows 13894550, bytes 3260497648 retrieved from database query id: xxxx

(13.8 million rows / 3GB of data!)
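Rather than eyeballing the log, you can scan it for these lines and flag the worst offenders. A minimal sketch – the regular expression assumes the log line format shown above, and the thresholds are arbitrary, so adjust both to suit your environment:

    import re

    PATTERN = re.compile(r"Rows (\d+), bytes (\d+) retrieved from database")

    with open("nqquery.log", errors="ignore") as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                rows, nbytes = int(m.group(1)), int(m.group(2))
                if rows > 100000 or nbytes > 100 * 1024 * 1024:   # arbitrary thresholds
                    print("%12d rows  %8.1f MB  %s"
                          % (rows, nbytes / 1048576.0, line.strip()))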

If you return lots of data from the database to the BI Server, performance suffers because:

  • You’re shifting lots of data across the network, each time the query runs
  • As well as the database processing lots of data, the BI Server now has to process the same volume of data to pare it down to the results that the user wants
  • If the data volumes are large the BI Server will start having to write .TMP files to disk, which can have its own overhead and implications in terms of available disk space

You can read more on this topic here.

N.B. If you’re using cross-database federation then this processing of data by the BI Server can be unavoidable, and is of course a powerful feature of OBIEE to take advantage of when needed.

A related point is the general principle of Filter Early. If dashboards are pulling back data for all months and all product areas, but the user is only looking at last month and their own product area, then change the dashboard to filter accordingly. And if you use dashboard prompts but have left them unset by default, then every time the user first navigates to the dashboard they'll be pulling back all data – so set defaults, or a filter in the constituent reports.
As a last point on this particular subject – what if there are 13 million rows pulled back from the database because the user wants 13 million rows in their report? Well, other than this:
shudder

I would say: use the right tool for the right job. Would the user's workflow be better served by an exception-based report rather than a vast grid of data, just 'because we've always done it that way'? If they really do need all the data, then it's clear the user is not going to analyse 13 million rows in OBIEE; they're probably going to dump it into Excel or some other tool – and if so, write a data extract to do it more efficiently and leave OBIEE out of the equation. If you want to make use of the metadata model you've built in the RPD, you could always use an ODBC or JDBC connection directly into the BI Server to get the data out. Just don't try and do it through Answers/Dashboards.
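As an illustration of that last point, here is a minimal sketch of an extract over ODBC using pyodbc, assuming a DSN has been configured against the BI Server ODBC driver; the DSN, credentials, subject area and column names are all hypothetical, so substitute your own:

    import csv
    import pyodbc

    conn = pyodbc.connect("DSN=OBIEE_BIServer;UID=extract_user;PWD=secret")
    cursor = conn.cursor()
    # Logical SQL against the presentation layer, so the RPD metadata still applies
    cursor.execute('SELECT "Sales"."Product", "Sales"."Revenue" FROM "Sales Subject Area"')

    with open("extract.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])
        for row in cursor:
            writer.writerow(row)
    conn.close()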

Instrumenting connection pools

For a detailed understanding of how the database behaves under load as the result of BI queries, consider using instrumentation in your connection pools as a way of correlating [problematic] workload on the database with the originating OBIEE queries and users.

I have written previously about how to do this, here.

Why’s it doing what it’s doing

If a report 'ought' to be running well but isn't, there are two optimisers involved whose behaviour you can investigate to see why it is running the way it is. When the inbound Logical SQL is received by the BI Server from Presentation Services, it is parsed ('compiled') by the BI Server through the RPD to generate the Physical SQL statement(s) to run against the database.

To see how OBIEE analyses the Logical SQL and decides how to run it, use a LOGLEVEL setting of 4 or greater. This writes the execution plan to nqquery.log, but be aware that it's low-level stuff and typically for Oracle Support use only. To read more about log levels, see here. The execution plan is based entirely upon the contents of the RPD, so if you want different Physical SQL generated, you need to influence it through the RPD.

The second optimiser is the database optimiser, which takes the Physical SQL that OBIEE generates and decides how best to execute it on the database. On Oracle this is the Cost-Based Optimiser (CBO), about which plenty has been written already, and with which your friendly local DBA will be able to help.

Footnote: Hypotheses

Finally, in analysing your data to come up with a diagnosis or hypothesis as to the root cause of the problem, bear this quotation in mind:

If you take a skeptical attitude toward your analysis you’ll look just as hard for data that refutes your hypothesis as you will for data that confirms it. A skeptic attacks the same question from many different angles and dramatically increases their confidence in the results. – John Rauser

What next?

If your testing has shown a performance problem then you should by now have a hypothesis or diagnosis of the root cause. Read all about optimisation here. If your testing has shown performance is just fine, you might want to read it anyway …

Other articles in this series

This article is part of a series on the subject of Performance and OBIEE:

  1. Introduction
  2. Test – Define
  3. Test – Design
  4. Test – Build
  5. Execute
  6. Analyse
  7. Optimise
  8. Summary and FAQ

Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to retain the comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.