Oracle BI Suite EE | OBIEE.nl

March 18, 2013 Rittman Mead

Performance and OBIEE – part VI – Analysing results

This part of the OBIEE performance cycle is the one which arguably matters most. Having defined what we’re going to test, built a means by which to test it, and executed that test, we now need to sift through the tealeaves and work out what the data we collected is telling us. Except we’re not going to use hocus-pocus like tea leaf reading, gut feeling or best practice checklists, we’re going to use cold hard data and analysis.

Analysing the data breaks down into several stages, and is often an iterative process:

Analyse the net response time. Is it as fast as it needs to be, at the required level of user concurrency?
If the response time is too slow (“too slow” being defined by you or your users, in advance of the test), then diagnose to determine why. This is another phase of analysis, breaking down the net response time into its constituent parts, analysing system and OBI metrics for signs of a bottleneck. The output of this phase will be an hypothesis as to the cause of the performance problem
Based on the diagnosis of the issue, apply one change to the system to improve it, that is, resolve the performance issue. Having made one change (and one change only), the original test should be repeated and the analysis cycle repeated to determine the impact of the tuning.

Analysis

How you analyse your data determines whether you will be accurately and fairly representing the results of the test in your diagnoses and conclusions.

Avoid Averages

From your test execution you will have a series of response times. You need to summarise, that is, aggregate these into a single figure to give as a headline figure in your test report. If you take only one thing away from reading this, let it be the following point: don’t use average figures! I’ll say it again for emphasis : Averages are not a good way to represent your test data. What I am saying here is nothing that you won’t read in every other decent article written on performance testing and analysis.

When you average out a series of data, you mask and muddy your data by inadvertently hiding extreme values in the series. A much better summary to use is the percentile.

Consider a performance test of a single dashboard for a single user. It is run ten times, so as to get a representative set of data for the response time. Here are two series of data, both with an average response time of five seconds. If we look at the 90th Percentile for the same two series of data, we can see that series ‘A’ has a response time of ten seconds, whilst series ‘B’ has a response time of six seconds.

As a user, if you run this dashboard, which behaviour would you prefer? Series ‘A’, where it might take a second to run or it might take ten seconds, or Series ‘B’ where it is going to be five seconds, give or take one second either side? As human beings we like consistency and certainty. Sure, it’d be nice if the dashboard ran in a second, but most people would rather know that it’s definitely going to run within six seconds and not almost double that. That uncertainty can also be seen in the standard deviation figure in the two series. The lower the standard deviation, the more consistent the response times are.

For more detail and clear statistical explanations, read “Averages Only” in Zed Shaw’s Programmers Need To Learn Statistics and “Percentile Specifications” in Cary Millsap’s Thinking Clearly about Performance.

Throw away your test data

Well, not literally. But, if you are testing user concurrency, make sure that when you calculate your percentile (eg 90th percentile) response time, do it for a given number of users. Otherwise you are distorting the figure. Typically a test will have a ‘ramp up’ period where the concurrent users are gradually introduced onto the system, until the target number is active, at which point the system is in ‘steady state’. It is from this point, the steady state, that you should be deriving your response time calculation. It is useful to look at how the response time might vary as the workload is increased, but for an accurate figure of a response time at a given number of users, you should be ignoring the data except where the full number of users was running.

Analysis summary

The output of this phase of the analysis should be very simple:

The 90th Percentile response time for dashboard <xx> is <yy> at a user concurrency of <zz>

And this should then satisfy a pass/fail criterion that was specified when you defined the test.

If the test passes, great. Record all your test parameters and data, and move on to the next test.
If the test doesn’t pass, then you need to work out why, and for that, see below.

I’m oversimplifying, since there is other data (eg standard deviation) that you might want to include in your summary, along with some commentary around variances observed and so on.

Diagnosing poor OBIEE performance

Get to the root of the problem

So, the test results showed that the dashboard(s) run too slowly. Now what? Now, you need to work out why there is a performance problem. I am deliberately spelling this out, because too many people jump forward to attempting to fix a performance problem without actually understanding exactly what the problem is. They whip out their six-shooters loaded with silver bullets and start blasting, which is a bad idea for two reasons:

You may never know what the problem was – so you won’t be able to avoid doing it again! Everyone makes mistakes; the mark of a good programmer is one who learns from them.
If I run a dashboard on a 2 CPU 4GB server and find it’s slow, one option could be to run it on a 8 CPU 32GB server. Tada! It’s faster. But, does that mean that every report now needs to be run on the big server? Well, yes it’d be nice – but how do we know that the original performance problem wasn’t down to machine capacity but perhaps a missing filter in the report? Or a wrong join in the RPD? It could be an expensive assumption to make that the problem’s root cause was lack of hardware capacity.
In determining the root cause, you will learn more about OBIEE. This better understanding of OBIEE will mean you are less likely to make performance errors in the future. You will also become better at performance diagnostics, making solving live problems in Production as well as future performance tests easier and faster to resolve.

“I broke things, so now I will jiggle things randomly until they unbreak” is not acceptable – Linus Torvalds

There are always exceptions, but exceptions can be justified and supported with data. Just beware of the the silver bullet syndrome…The unfortunate part […] is that rarely anyone goes back and does the root cause analysis. It tends to fall into the bucket of “problem…solved”. – Greg Rahn

Performance vs Capacity

I always try to split it into #performance tuning (response time) and capacity tuning (throughput/scalability) – Alex Gorbachev

Performance issues can be local to a report, or global to a system implementation and exacerbated by a particular report or set of reports – or both.

If an individual dashboard doesn’t perform with a single user running it, then it certainly isn’t going to with a 100, and there is clearly a performance problem in the design (of the dashboard, RPD, or physical data model design or implementation).

However, if an individual dashboard runs fine with a single user but performance gets worse and worse the more users that run it concurrently, this would indicate a capacity problem in the configuration or physical capacity of your system.

So which is which? An easy way to shortcut it is this: before you launch into your mega-multi-user-concurrency tests, test the dashboard with a single user. Is the response time acceptable? If not, then you have a performance problem. You’ve eliminated user concurrency from the equation entirely. If the response time is acceptable, then you can move onto your user concurrency tests.

If you have already run a big user concurrency test and are trying to identify whether the issue is performance or capacity, then look at what happens to your response time compared to the number of users running. If the response time is constant throughout then it indicates a performance problem; if it is increasing as more users are added it shows a capacity (which can include configuration) problem. Being able to identify this difference is why I’d never run a user concurrency test without a ramp-up period, since you don’t get to observe the behaviour of the system as users are added.

In the above graph there are two points evident:

Up to ten users the response time is consistent, around 30 seconds. If the response time needs to be faster than this then there is a performance problem
If 30 seconds is the upper limit of an acceptable response time then we can say that the system has a capacity of 10 concurrent users, and if the user concurrency needs to be greater than this then there is a capacity problem

Errors

Don’t overlook analysing the errors that may come out during your testing. Particularly as you start to hit limits within the stock OBIEE configuration, you might start to see things like:

Too many running queries. Server is too busy to process any more queries at this time.
com.siebel.analytics.javahost.standalone.SAJobManagerImpl$JobQueueFullException
Graph server does not appear to be responding in a timely fashion. It may be under heavy load or unavailable.
The queue for the thread pool ChartThreadPool is at it's maximum capacity of 512 jobs.

If you see errors such as these then they will often explain response time variances and problems that you observe in your test data, and should be top of your list for investigating further to resolve or explain.

Response time profiling

A good way to get started with identifying the root cause(s) of a problem is to build a time profile of the overall response time. This is something that I learnt from reading about Method R, and is just as applicable to OBIEE as it is to the Oracle RDBMS about which it was originally written. This link gives a good explanation of what Method R is.

You can improve a system without profiling, and maybe you can even optimize one without profiling. But you can’t know whether a system is optimal without knowing whether its tasks are efficient, and you can’t know whether a given task is efficient without profiling it. –Cary Millsap

Given the number of moving parts in any complex software stack there’s often more than one imperfection. The trick is to find the most significant that will yield the best response time improvement when resolved. It also lets you identify which will give the “biggest bang for your buck” – maybe there are several problems, but the top one requires a complete redesign whilst the second one is an easy resolution and will improve response times sufficiently.

So in the context of OBIEE, what does a response time profile look like? If you hark back to the OBIEE stack that I described previously, a simple example profile could look something like this:

Here we can see that whatever we might do the speed up the chart rendering (5 seconds) the focus of our investigation should really be on the 20 second query run on the database, as well as the 10 seconds it takes BI Server to join the results together. Can we eliminate the need for two queries, and can we do something on the database to improve the query run time?

When building a time profile, start at the highest level, and break down the steps based on the data you have. For example, to determine the time it takes Presentation Services to send a query to BI Server is quite a complex process involving low-level log files. Yet, it probably isn’t a significant line entry on the profile, so by all means mark it down but spend the time on the bigger steps – which is usually the fetching and processing of the report data.

A more complicated profile might be something like this:

Graphing a response time profile can also help us comprehend at a glance what’s happening, and also gives a ‘template’ to hold up to profiles that are created. In general you would want to see the split of a time profile heavily weighted to the database:

If the response time profile shows that just as much of the total response time is happening on the BI Server then I would want to see what could be done to shift the weight of the work back to the database:

For more on this subject of where work should ideally be occurring, see the section below “Make sure that the database is sweating”.

Here are the sources you can look for response time profile data, starting at the user interface and going down to the database

Rendering time – Web browser profiler such as Chrome Developer Tools, YSlow, Google Page Speed
WebLogic – access.log will show the HTTP requests coming in
Presentation Services – sawlog0.log, but may require custom log levels to get at low-level information
BI Server
- nqquery.log
  - Time to create physical SQL(s), i.e. compile time
  - Time to connect to DB
  - Time to execute on DB
  - Time to process on BI server and return to PS
- Usage Tracking
  - S_NQ_ACCT
  - S_NQ_DB_ACCT
Database – whilst profiling can be extended down to the DB (for example, using an 10046 trace in Oracle), it makes more sense to do as a standalone analysis piece on an individual query where necessary. In extreme examples the profiling could actually go beyond the database down into the SAN, for example.

Diagnosing capacity problems

If a dashboard is performing acceptably under a single user load, but performance deteriorates unacceptably as the user currency increases, then you have a capacity issue. This capacity could be hardware, for example, you have exhausted your CPUs or saturated your I/O pipe. Capacity can also refer to the application and how it is configured. OBIEE is a powerful piece of software but to make it so flexible there are by definition a lot of ways in which is can be configured – including badly! Particularly as user concurrency (as in, concurrent executing reports) increases into three figures and above it may be the default configuration options are not sufficient. Note that this “three figures and above” should be taken with a large pinch of salt, since it could be lower for very ‘heavy’ dashboards, or much higher for ‘light’ dashboards. By ‘heavy’ and ‘light’ I am primarily referring to the amount of work they cause on the BI Server (e.g. federation of large datasets), Presentation Services (e.g. large pivot tables) and Javahost (e.g. lots of chart requests such as you’d see with trellis views).

To diagnose a capacity problem, you need data. You need to analyse the response time over time against the measures of how the system was performing over time, and then investigate any apparent correlations in detail to ascertain if there is causation.

This is where you may need to re-run your performance test if you didn’t collect this data the first time around. See the System Metrics section above for details on how and what. The easy stuff to collect is OS metrics, including CPU, Memory, Disk IO, and Network IO. You should include both the OBI and Database server(s) in this. Look at how this behaves over time compared to the performance test response times. Using a relatively gradual user ramp-up is a good idea to pinpoint where things might start to get unstuck, rather than just plain break.

Network bottleneck observed as load increases beyond c.9 active users

If the OS metrics are unremarkable – that is, there is plenty of capacity left in all of the areas but response times are still suffering as user concurrency increases – then you need to start digging deeper. This could include:

OBI Metrics
Analysis of the performance of the database against which queries are running
End-to-end stack capacity, eg Network, SAN, etc.

OBI Metrics can be particularly enlightening in diagnosing configuration issues. For example, an undersized connection pool or saturated javahost.

Don’t forget to also include the OBI logs in your analysis, since they may also point to any issues you’re seeing in the errors or warnings that they record.

Additional diagnosis tips

Having profiled the response time you should hopefully have pinpointed an area for investigation for coming up with your diagnosis. The additional analysis that you may need to do to determine root cause is very dependent upon the area you have identified. Below are some pointers to help you.

Make sure that the database is sweating

As mentioned above, a healthy OBI system will wherever possible generally push all of the ‘heavy lifting’ work such as filtering, calculations, and aggregations down to the database. You want to see as little difference between the data volume returned from the database to the BI Server, and that returned to the user.

Use nqquery.log to look at the bytes and rows that OBIEE is pulling back from the database. For example, you don’t want to see entries such as this:

Rows 13894550, bytes 3260497648 retrieved from database query id: xxxx

(13.8 million rows / 3GB of data!)

If you return lots of of data from the database to the BI server, performance suffers because:

You’re shifting lots of data across the network, each time the query runs
As well as the database processing lots of data, the BI Server now has to process the same volume of data to pare it down to the results that the user wants
If the data volumes are large the BI Server will start having to write .TMP files to disk, which can have its own overhead and implications in terms of available disk space

You can read more on this topic here.

N.B. If you’re using cross-database federation then this processing of data by the BI Server can be unavoidable, and is of course a powerful feature of OBIEE to take advantage of when needed.

A related point to this is the general principle of Filter Early. If dashboards are pulling back data for all months and all product areas, but the user is only looking at last month and their own product area then change the dashboard to filter it so. And if you use dashboard prompts but have left them unset by default then every time the user initially navigates to the dashboard they’ll be pulling back all data, so set defaults or a filter in the constituent reports.
As a last point on this particular subject – what if there are 13 million rows pulled back from the database because the user wants 13 million rows in their report? Well, other than this:
shudder

I would say: use the right tool for the right job. Would the user’s workflow be better served by an exception-based report rather than a vast grid of data just ‘because we’ve always done it that way’? If they really need all the data, then it’s clear that the user is not going to analyse 13 million rows of data in OBIEE, they’re probably going to dump it into Excel, or some other tool – and if so, then write a data extract to do it more efficiently and leave OBIEE out of the equation. If you want to make use of the metadata model you’ve built in the RPD, you could always use an ODBC or JDBC connection directly into the BI Server to get the data out. Just don’t try and do it through Answers/Dashboards.

Instrumenting connection pools

For a detailed understanding of how the database behaves under load as the result of BI queries, consider using instrumentation in your connection pools as way of correlating [problematic] workload on the database with originating OBIEE queries and users.

I have written previously about how to this, here

Why’s it doing what it’s doing

If a report ‘ought’ to be running well, but isn’t, there are two optimisers involved to investigate to see why it is running how it is. When the inbound Logical SQL is received by the BI Server from Presentation Services, it is parsed (‘compiled’) by the BI Server through the RPD to generate the Physical SQL statement(s) to run against the database.

To see how OBIEE analyses the Logical SQL and decides how to run it, use a LOGLEVEL setting of 4 or greater. This writes the execution plan to nqquery.log, but be aware, it’s low-level stuff and typically for Oracle support use only. To read more about log levels, see here. The execution plan is based entirely upon the contents of the RPD, so if you want different Physical SQL generated, you need to influence it through the RPD.

The second optimiser is the database optimiser, which will take the Physical SQL OBIEE is generating and decide how best to execute it on the database. On Oracle this is the Cost-Based Optimiser (CBO), about which there is plenty written already and your local friendly DBA will be able to help with.

Footnote: Hypotheses

Finally, in analysing your data to come up with a diagnosis or hypothesis as to the root cause of the problem, bear this quotation in mind:

If you take a skeptical attitude toward your analysis you’ll look just as hard for data that refutes your hypothesis as you will for data that confirms it. A skeptic attacks the same question from many different angles and dramatically increases their confidence in the results. John Rauser

What next?

If your testing has shown a performance problem then you should by now have a hypothesis or diagnosis of the root cause. Read all about optimisation here. If your testing has shown performance is just fine, you might want to read it anyway …

Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to retain the comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

March 18, 2013 Rittman Mead

Performance and OBIEE – part V – Execute and Measure

Having designed and built our tests, we now move on to looking at the real nitty-gritty – how we run them and collect data. The data that we collect is absolutely crucial in getting comprehensible test results and as a consequence ensuring valid test conclusions.

There are several broad elements to the data collected for a test:

Response times
System behaviour
Test details

The last one is very important, because without it you just have some numbers. If someone wants to reproduce the test, or if you want to rerun it to check a result or test a change, you’ve got to be able to run it as it was done originally. This is the cardinal sin that too many performance tests I’ve seen commit. A set of response times in isolation is interesting, sure, but unless I can trace back exactly how they were obtained so that I can:

Ensure or challenge their validity
Rerun the test myself

then they’re just numbers on a piece of paper.
The other common mistake committed in performance test execution is measuring the wrong thing. It might be a wrong metric, or the right metric but in the wrong context. For example, if I build a test that runs through multiple dashboards, I could get a response time for “Go to Dashboard” transaction:

But what does this tell me? All it tells me really is that some of my dashboard transactions take longer than others to run. Sure, we can start aggregating and analysing the data, talking about percentile response times – but by only measuring the transaction generically, rather than per dashboard or dashboard type, I’m already clouding the data. Much better to accurately identify each transaction and easily see the clear difference in performance behaviour:

How about this enticing looking metric:

We could use that to record the report response times for our test, yes? Well, honestly, I have no idea. That’s because a “Request” in the context of the OBIEE stack could be any number of things. I’d need to check the documentation to find out what this number was actually showing and how it was summarised. Don’t just pick a metric because it’s the first one you find that looks about right. Make sure it actually represents what you think it does.

As Zed Shaw puts it:

It’s pretty simple: If you want to measure something, then don’t measure other shit.

Please consider the environment before running this test

Your test should be done on as ‘clean’ an environment as possible. The more contaminating factors there are, the less confidence you can have in your test results, to the point of them becoming worthless.

Work with a fixed code version. This can be difficult to do during a project particularly, but there is little point testing the performance of code release 1.00 if when you come to test your tuning changes 1.50 is in use. Who knows what the developers changed between 1.00 and 1.50? It whips the rug from out under your original test results. By fixed code, I mean:
- Database, including:
  - DDL
  - Object statistics
- BI Server Repository (RPD)
- Dashboard and report definitions (Webcat)
  If you can’t insist on this – and sometimes pragmatism dictates so – then at least have a very clear understanding of the OBIEE stack. This way you can understand the potential impact of an external code change and caveat your test results accordingly. For example, if a change was made to the RPD but in a different Business Model from the one you are testing then it may not matter. If, however, they have partitioned an underlying fact table, then this could drastically change your results to the extent you should be discarding your first results and retesting.
In an ideal world, all the above code artefacts will be under source control, and you can quote the revision/commit number in your test log.

Make sure the data in the tables from which you are reporting is both unchanging and representative of Production. Unchanging is hopefully obvious, but representative may benefit from elaboration. If you are going live with 10M rows of data then you’d be pretty wise to do your utmost to run your performance test against 10M rows of data. Different types of reports might behave differently, and this is where judgement comes in. For example, a weekly report that is based on a fact table partitioned by week might perform roughly the same whether all or just one partition is loaded. However, the same fact table as the source for a historical query going back months, or a query cutting across partitions, is going need more data in to be representative. Finally, don’t neglect future growth in your testing. If it’s a brand new system with brand new data, you’ll be starting with zero rows on day one, but if there’ll be millions of rows within a month or so you need to be testing against a million rows or so in your performance tests.
The configuration of the software should be constant. This means obvious configuration such as BI Server caching, but also things like version numbers and patch levels of the OBIEE stack. Consider taking a snapshot of all main configuration files (NQSConfig.INIinstanceconfig.xml, etc) to store alongside your test data.
- You should aim to turn off BI Server caching for your initial tests, and then re-enable it if required as a properly tested optimisation step. The appropriate use and implementation of BI Server caching is discussed in the optimisation article of this series.

Measure

Before you execute your performance test, work out what data you want to collect and how you will collect it. The reason that it is worth thinking about in advance is that once you’ve run your test you can’t usually retrospectively collect additional data.

The data you should collect for your test itself includes:

Response times at the lowest grain possible/practical – see Dashboard example above. Response times should be 2-dimensional; transaction name, plus time offset from beginning of test.
Number of test users running over time (i.e. offset from the start of your test)
Environment details – a diagram of the top-to-bottom stack, software versions, code levels, and data volumes. Anything that is going to be relevant in assessing the validity of the test results, or rerunning the test in the future.
System metrics – if response times and user numbers are the eye-catching numbers in a test, system metrics are the oft-missed but vital numbers that give real substance to a test and make it useful. If response times are bad, we need to know why. If they’re good, we need to know how good. Both these things come from the system metrics.
Query Metrics – depending on the level at which the testing is being done, collecting metrics for individual query executions can also be vital for aiding performance analysis. Consider this more of a second round, drill down, layer of metrics rather than one to always collect in a large test since it can be a large volume of data.

Response times

Depending on your testing method, how you capture response time will be different. Always capture the raw response time and test duration offset – don’t just record one aggregate figure for the whole test. Working in BI you hopefully are clear on the fact that you can always aggregate data up, but can’t break a figure down if you don’t have the base data.

JMeter has “Sample time” or “Response Time”. Use the setSampleLabel trick to make sure you get a response time per specific dashboard, not just per dashboard refresh call. Some useful listeners to try out include:

jp@gc - Response Times Over Time
jp@gc - Response Times vs Threads (although be aware that this shows an average response time, which is not the best summary of the metric to use)
View Results in Table

JMeter can also write data to csv file, which can be a very good starting point for your own analysis of the data,

If you are doing a test for specific tuning, you might well want to capture response times at other points in the stack too; for example from the database, BI Server, and Presentation Server.

Whichever time you capture, make sure you record what that time represents – is it response time of the query at the BI Server, response time back to the HTTP client, is it response time including rendering – and so on. Don’t forget, standard load test tools such as JMeter don’t include page render time.

System metrics

There are two key areas of system metrics:

Environment metrics – everything outside OBI – the host OS, the database, the database OS.
OBI metrics – internal metrics that let us understand how OBI is ticking along and where any problems may lie

One of the really important things about system metrics is that they are useful come rain or shine. If the performance is not as desired, we use them to analyse where the bottlenecks are. If performance is good, we use them to understand system behaviour at a “known-good” point in time, as reference for if things do go bad, and also as a long-term capacity planning device.

Some of the tools described below for capturing system metrics could be considered for putting in place as standard monitoring for OBIEE, whilst others are a bit more detailed than you’d want to be collecting all the time.

OS metrics

OS stats should be collected on every server involved in the end-to-end serving of OBIEE dashboards. If you have a separate web tier, a 2-node OBIEE cluster and a database, monitor them all. The workload can manifest itself in multiple places and the more “eyes-on” you have the easier it is to spot the anomalies in behaviour and not just be stuck with a “it was slow” response time summary.

As well as the whole stack, you should also be monitoring the server(s) generating your load test. If you’re testing large numbers of concurrent users the overhead on the generator can be very high, so you need to be monitoring to ensure you’re not hitting a ceiling there, rather than in what is being monitored.

On Windows, I would use the built-in Performance Monitor (perfmon) tool to collect and analyse data. You can capture CPU, IO, Memory, Network and process-specific data to file, and analyse it to your heart’s content afterwards within the tool or exported to CSV.

On Linux and other *nix systems there is a swathe of tools available, my tool of choice being collectl, optionally visualised through graphite or graphiti. There are plenty of alternatives, including sar, glance, and so on
collectl data rendered in graphite

collectl data rendered in graphiti

Finally, it’s worth noting that JMeter also offers collection of OS stats through the JMeter plugins project.

OBI metrics

The OBIEE performance counters are a goldmine of valuable information, and one it’s well worth the time mining for nuggets. The counters give you both a better picture of how different workloads are executed within OBIEE but also where any bottlenecks may arise.

The counters can be accessed in several ways:

Through Fusion Middleware Control, under Capacity Management -> Metrics -> View the full set of system metrics. This gives a point-in-time view of the data for the last 15 minutes, and does not store history.
Presentation Services includes its own Performance Monitor which can be accessed at http://<yourserver>:<analytics port>/saw.dll?perfmon. In OBIEE 10g it showed BI Server metrics too, but in 11g seems to only show Presentation Services (Oracle BI PS) metrics. It is point in time with no history.
Similar to Performance Monitor but with a lot more metrics available, DMS Spy is a java deployment hosted on the Admin Server by default, available at http://<yourserver>:<adminserver port>/dms/Spy

Through a manual call to opmn on the commandline. For example:


    [oracle@obieesampleapp bin]$ ./opmnctl metric op=query COMPONENT_TYPE=OracleBIServerComponent
    HTTP/1.1 200 OK
    Connection: close
    Content-Type: text/html

    <?xml version='1.0'?>
    <!DOCTYPE pdml>
    <pdml version='11.0' name='Oracle BI Server' host='obieesampleapp' id='4399' timestamp='1359703796346'>
    <statistics>
    <noun name="Oracle BI Server" type="Oracle_BI_Applications">
    <noun name="Usage_Tracking" type="Oracle_BI_Thread_Pool">
    <metric name="Peak_Thread_Count.value">
    <value type="integer"><![CDATA[5]]></value>
    </metric>
    <metric name="Current_Thread_Count.value">
    <value type="integer"><![CDATA[2]]></value>
    </metric>
    <metric name="Lowest_Queued_Time_(milliseconds).value">
    <value type="integer"><![CDATA[0]]></value>
    </metric>
    […]

See the documentation for details

DMS through WLST
BI Server diagnostics through ODBC/JDBC
It is worth noting that one place you cannot get the OBI metrics from any more is through JMX. In 10g this was an option and interfaced very well with industry-standard monitoring tools. In 11g, JMX is available for metrics outside core OBI, but not the core metrics themselves.

In addition to the out-of-the-box options above, here at RittmanMead we have developed our own OBIEE monitoring tool. DMS metrics are stored directly on disk or through a database, enabling both immediate and retrospective analysis. Custom dashboards enable the display of both OBIEE and OS data side-by-side for ease of analysis. Integration with third-party tools is also an option.

Query Metrics

If you are running a test for a specific performance issue then capturing query metrics is important as these will feed into the diagnosis that you do including building a time profile.

A Logical SQL query has two elements to it for which we capture information:

BI Server (Logical SQL)
- Query runtime
- Total Rows returned from the database
- Total Bytes returned from the database
- Number of database queries
Database (x n Physical SQL queries)
- Response time
- Rows returned
- Bytes returned

All of the above information can be obtained from the BI Server’s nqquery.log, or most of it from Usage Tracking tables S_NQ_ACCT and S_NQ_DB_ACCT.

For really detailed analysis you may want to capture additional information from the database about how a query ran such as its execution plan. On Oracle, Real Time SQL Monitoring is a very useful tool, along with several others. For further details, speak to a DBA … this is an OBIEE blog ;-)

Execute

And now the moment you’ve all been waiting for … it’s party time!

Here is a checklist to work through for executing your test:

Clear down the logs of any component you’re going to be analysing (eg nqquery.log, sawlog.log)
Record any configuration/environment changes from the original baseline
Record test start time
Restart the OBIEE stack (i.e. WLS, OPMN)
Start OS metric collection
Start OBIEE metric collection (if used)
Run the test!
Monitor for errors and excessively bad response times
Record test end time
Record any errors observed during test
Copy all logs, metric outputs, etc to a named/timestamped folder

Monitoring the test as it executes is important. If something goes wrong then time can be saved by abandoning the test rather than let the test script run to completion. There’s no point letting a big test run for hours if the results are going to be useless. Some of the things that could go wrong include:

Your test script is duff. For example, there’s a typo in the login script and no users are logging in let alone executing dashboards. All your test will tell you is how fast OBIEE can reject user logins.
Your test has hit a performance bottleneck in the stack. If you leave your test running beyond a certain point, all you’re doing is collecting data to show how bad things still are. If response times are flat on their back at 50 concurrent users, what’s the point leaving a test to run all the way up to 200? It’s best to curtail it and move swiftly on with the analysis and tuning stage
Your test framework has hit a bottleneck in itself. For example, the host machine cannot sustain the CPU or network traffic required. If this happens then your test data is worthless because all you’re now measuring is the capacity of the host machine, not the OBIEE stack.

Monitoring for errors is also vital for picking up messages that OBIEE might start to produce if it is hitting an internal threshold that could constrain performance.

Don’t fiddle the books!

If your test execution doesn’t work, or you spot an improvement or fix – resist the biggest temptation which is to ‘just fix it’. Hours become days with this approach and you lose complete track of what you changed.

Take a step back, make a note of what needs fixing or changing, and document it as part of the full cycle.

There is nothing wrong whatsoever with aborting a test for the reason that “I defined it incorrectly” or “I forgot to change a config setting”. Better to have a half dozen aborted tests lying around showing that you got your hands dirty than a suspiciously pristine set of perfectly executed tests.

Don’t forget that pesky documentation

Always document your testing, including method, definition, and results.

You will not remember precisely how you ran the test, even a few days later
How can you identify possible confounding of results, without recording a clear picture of the environment?
If you find something unexpected, you can quickly seek a second opinion
Without things written down, people will not be able to reproduce your testing
Test results on their own are worthless; they are just a set of numbers.
If it’s not written down, it didn’t happen

What next?

With a test run completed and a set of data collected, it’s time to make sense of the numbers and understand what they can tell us by analysing the results

Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to keep comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

March 18, 2013 Rittman Mead

Performance and OBIEE – part V – Execute and Measure

Having designed and built our tests, we now move on to looking at the real nitty-gritty – how we run them and collect data. The data that we collect is absolutely crucial in getting comprehensible test results and as a consequence ensuring valid test conclusions.

There are several broad elements to the data collected for a test:

Response times
System behaviour
Test details

The last one is very important, because without it you just have some numbers. If someone wants to reproduce the test, or if you want to rerun it to check a result or test a change, you’ve got to be able to run it as it was done originally. This is the cardinal sin that too many performance tests I’ve seen commit. A set of response times in isolation is interesting, sure, but unless I can trace back exactly how they were obtained so that I can:

Ensure or challenge their validity
Rerun the test myself

then they’re just numbers on a piece of paper.
The other common mistake committed in performance test execution is measuring the wrong thing. It might be a wrong metric, or the right metric but in the wrong context. For example, if I build a test that runs through multiple dashboards, I could get a response time for “Go to Dashboard” transaction:

But what does this tell me? All it tells me really is that some of my dashboard transactions take longer than others to run. Sure, we can start aggregating and analysing the data, talking about percentile response times – but by only measuring the transaction generically, rather than per dashboard or dashboard type, I’m already clouding the data. Much better to accurately identify each transaction and easily see the clear difference in performance behaviour:

How about this enticing looking metric:

We could use that to record the report response times for our test, yes? Well, honestly, I have no idea. That’s because a “Request” in the context of the OBIEE stack could be any number of things. I’d need to check the documentation to find out what this number was actually showing and how it was summarised. Don’t just pick a metric because it’s the first one you find that looks about right. Make sure it actually represents what you think it does.

As Zed Shaw puts it:

It’s pretty simple: If you want to measure something, then don’t measure other shit.

Please consider the environment before running this test

Your test should be done on as ‘clean’ an environment as possible. The more contaminating factors there are, the less confidence you can have in your test results, to the point of them becoming worthless.

Work with a fixed code version. This can be difficult to do during a project particularly, but there is little point testing the performance of code release 1.00 if when you come to test your tuning changes 1.50 is in use. Who knows what the developers changed between 1.00 and 1.50? It whips the rug from out under your original test results. By fixed code, I mean:
- Database, including:
  - DDL
  - Object statistics
- BI Server Repository (RPD)
- Dashboard and report definitions (Webcat)
  If you can’t insist on this – and sometimes pragmatism dictates so – then at least have a very clear understanding of the OBIEE stack. This way you can understand the potential impact of an external code change and caveat your test results accordingly. For example, if a change was made to the RPD but in a different Business Model from the one you are testing then it may not matter. If, however, they have partitioned an underlying fact table, then this could drastically change your results to the extent you should be discarding your first results and retesting.
In an ideal world, all the above code artefacts will be under source control, and you can quote the revision/commit number in your test log.

Make sure the data in the tables from which you are reporting is both unchanging and representative of Production. Unchanging is hopefully obvious, but representative may benefit from elaboration. If you are going live with 10M rows of data then you’d be pretty wise to do your utmost to run your performance test against 10M rows of data. Different types of reports might behave differently, and this is where judgement comes in. For example, a weekly report that is based on a fact table partitioned by week might perform roughly the same whether all or just one partition is loaded. However, the same fact table as the source for a historical query going back months, or a query cutting across partitions, is going need more data in to be representative. Finally, don’t neglect future growth in your testing. If it’s a brand new system with brand new data, you’ll be starting with zero rows on day one, but if there’ll be millions of rows within a month or so you need to be testing against a million rows or so in your performance tests.
The configuration of the software should be constant. This means obvious configuration such as BI Server caching, but also things like version numbers and patch levels of the OBIEE stack. Consider taking a snapshot of all main configuration files (NQSConfig.INIinstanceconfig.xml, etc) to store alongside your test data.
- You should aim to turn off BI Server caching for your initial tests, and then re-enable it if required as a properly tested optimisation step. The appropriate use and implementation of BI Server caching is discussed in the optimisation article of this series.

Measure

Before you execute your performance test, work out what data you want to collect and how you will collect it. The reason that it is worth thinking about in advance is that once you’ve run your test you can’t usually retrospectively collect additional data.

The data you should collect for your test itself includes:

Response times at the lowest grain possible/practical – see Dashboard example above. Response times should be 2-dimensional; transaction name, plus time offset from beginning of test.
Number of test users running over time (i.e. offset from the start of your test)
Environment details – a diagram of the top-to-bottom stack, software versions, code levels, and data volumes. Anything that is going to be relevant in assessing the validity of the test results, or rerunning the test in the future.
System metrics – if response times and user numbers are the eye-catching numbers in a test, system metrics are the oft-missed but vital numbers that give real substance to a test and make it useful. If response times are bad, we need to know why. If they’re good, we need to know how good. Both these things come from the system metrics.
Query Metrics – depending on the level at which the testing is being done, collecting metrics for individual query executions can also be vital for aiding performance analysis. Consider this more of a second round, drill down, layer of metrics rather than one to always collect in a large test since it can be a large volume of data.

Response times

Depending on your testing method, how you capture response time will be different. Always capture the raw response time and test duration offset – don’t just record one aggregate figure for the whole test. Working in BI you hopefully are clear on the fact that you can always aggregate data up, but can’t break a figure down if you don’t have the base data.

JMeter has “Sample time” or “Response Time”. Use the setSampleLabel trick to make sure you get a response time per specific dashboard, not just per dashboard refresh call. Some useful listeners to try out include:

jp@gc - Response Times Over Time
jp@gc - Response Times vs Threads (although be aware that this shows an average response time, which is not the best summary of the metric to use)
View Results in Table

JMeter can also write data to csv file, which can be a very good starting point for your own analysis of the data,

If you are doing a test for specific tuning, you might well want to capture response times at other points in the stack too; for example from the database, BI Server, and Presentation Server.

Whichever time you capture, make sure you record what that time represents – is it response time of the query at the BI Server, response time back to the HTTP client, is it response time including rendering – and so on. Don’t forget, standard load test tools such as JMeter don’t include page render time.

System metrics

There are two key areas of system metrics:

Environment metrics – everything outside OBI – the host OS, the database, the database OS.
OBI metrics – internal metrics that let us understand how OBI is ticking along and where any problems may lie

One of the really important things about system metrics is that they are useful come rain or shine. If the performance is not as desired, we use them to analyse where the bottlenecks are. If performance is good, we use them to understand system behaviour at a “known-good” point in time, as reference for if things do go bad, and also as a long-term capacity planning device.

Some of the tools described below for capturing system metrics could be considered for putting in place as standard monitoring for OBIEE, whilst others are a bit more detailed than you’d want to be collecting all the time.

OS metrics

OS stats should be collected on every server involved in the end-to-end serving of OBIEE dashboards. If you have a separate web tier, a 2-node OBIEE cluster and a database, monitor them all. The workload can manifest itself in multiple places and the more “eyes-on” you have the easier it is to spot the anomalies in behaviour and not just be stuck with a “it was slow” response time summary.

As well as the whole stack, you should also be monitoring the server(s) generating your load test. If you’re testing large numbers of concurrent users the overhead on the generator can be very high, so you need to be monitoring to ensure you’re not hitting a ceiling there, rather than in what is being monitored.

On Windows, I would use the built-in Performance Monitor (perfmon) tool to collect and analyse data. You can capture CPU, IO, Memory, Network and process-specific data to file, and analyse it to your heart’s content afterwards within the tool or exported to CSV.

On Linux and other *nix systems there is a swathe of tools available, my tool of choice being collectl, optionally visualised through graphite or graphiti. There are plenty of alternatives, including sar, glance, and so on
collectl data rendered in graphite

collectl data rendered in graphiti

Finally, it’s worth noting that JMeter also offers collection of OS stats through the JMeter plugins project.

OBI metrics

The OBIEE performance counters are a goldmine of valuable information, and one it’s well worth the time mining for nuggets. The counters give you both a better picture of how different workloads are executed within OBIEE but also where any bottlenecks may arise.

The counters can be accessed in several ways:

Through Fusion Middleware Control, under Capacity Management -> Metrics -> View the full set of system metrics. This gives a point-in-time view of the data for the last 15 minutes, and does not store history.
Presentation Services includes its own Performance Monitor which can be accessed at http://<yourserver>:<analytics port>/saw.dll?perfmon. In OBIEE 10g it showed BI Server metrics too, but in 11g seems to only show Presentation Services (Oracle BI PS) metrics. It is point in time with no history.
Similar to Performance Monitor but with a lot more metrics available, DMS Spy is a java deployment hosted on the Admin Server by default, available at http://<yourserver>:<adminserver port>/dms/Spy

Through a manual call to opmn on the commandline. For example:


    [oracle@obieesampleapp bin]$ ./opmnctl metric op=query COMPONENT_TYPE=OracleBIServerComponent
    HTTP/1.1 200 OK
    Connection: close
    Content-Type: text/html

    <?xml version='1.0'?>
    <!DOCTYPE pdml>
    <pdml version='11.0' name='Oracle BI Server' host='obieesampleapp' id='4399' timestamp='1359703796346'>
    <statistics>
    <noun name="Oracle BI Server" type="Oracle_BI_Applications">
    <noun name="Usage_Tracking" type="Oracle_BI_Thread_Pool">
    <metric name="Peak_Thread_Count.value">
    <value type="integer"><![CDATA[5]]></value>
    </metric>
    <metric name="Current_Thread_Count.value">
    <value type="integer"><![CDATA[2]]></value>
    </metric>
    <metric name="Lowest_Queued_Time_(milliseconds).value">
    <value type="integer"><![CDATA[0]]></value>
    </metric>
    […]

See the documentation for details

DMS through WLST
BI Server diagnostics through ODBC/JDBC
It is worth noting that one place you cannot get the OBI metrics from any more is through JMX. In 10g this was an option and interfaced very well with industry-standard monitoring tools. In 11g, JMX is available for metrics outside core OBI, but not the core metrics themselves.

In addition to the out-of-the-box options above, here at RittmanMead we have developed our own OBIEE monitoring tool. DMS metrics are stored directly on disk or through a database, enabling both immediate and retrospective analysis. Custom dashboards enable the display of both OBIEE and OS data side-by-side for ease of analysis. Integration with third-party tools is also an option.

Query Metrics

If you are running a test for a specific performance issue then capturing query metrics is important as these will feed into the diagnosis that you do including building a time profile.

A Logical SQL query has two elements to it for which we capture information:

BI Server (Logical SQL)
- Query runtime
- Total Rows returned from the database
- Total Bytes returned from the database
- Number of database queries
Database (x n Physical SQL queries)
- Response time
- Rows returned
- Bytes returned

All of the above information can be obtained from the BI Server’s nqquery.log, or most of it from Usage Tracking tables S_NQ_ACCT and S_NQ_DB_ACCT.

For really detailed analysis you may want to capture additional information from the database about how a query ran such as its execution plan. On Oracle, Real Time SQL Monitoring is a very useful tool, along with several others. For further details, speak to a DBA … this is an OBIEE blog ;-)

Execute

And now the moment you’ve all been waiting for … it’s party time!

Here is a checklist to work through for executing your test:

Clear down the logs of any component you’re going to be analysing (eg nqquery.log, sawlog.log)
Record any configuration/environment changes from the original baseline
Record test start time
Restart the OBIEE stack (i.e. WLS, OPMN)
Start OS metric collection
Start OBIEE metric collection (if used)
Run the test!
Monitor for errors and excessively bad response times
Record test end time
Record any errors observed during test
Copy all logs, metric outputs, etc to a named/timestamped folder

Monitoring the test as it executes is important. If something goes wrong then time can be saved by abandoning the test rather than let the test script run to completion. There’s no point letting a big test run for hours if the results are going to be useless. Some of the things that could go wrong include:

Your test script is duff. For example, there’s a typo in the login script and no users are logging in let alone executing dashboards. All your test will tell you is how fast OBIEE can reject user logins.
Your test has hit a performance bottleneck in the stack. If you leave your test running beyond a certain point, all you’re doing is collecting data to show how bad things still are. If response times are flat on their back at 50 concurrent users, what’s the point leaving a test to run all the way up to 200? It’s best to curtail it and move swiftly on with the analysis and tuning stage
Your test framework has hit a bottleneck in itself. For example, the host machine cannot sustain the CPU or network traffic required. If this happens then your test data is worthless because all you’re now measuring is the capacity of the host machine, not the OBIEE stack.

Monitoring for errors is also vital for picking up messages that OBIEE might start to produce if it is hitting an internal threshold that could constrain performance.

Don’t fiddle the books!

If your test execution doesn’t work, or you spot an improvement or fix – resist the biggest temptation which is to ‘just fix it’. Hours become days with this approach and you lose complete track of what you changed.

Take a step back, make a note of what needs fixing or changing, and document it as part of the full cycle.

There is nothing wrong whatsoever with aborting a test for the reason that “I defined it incorrectly” or “I forgot to change a config setting”. Better to have a half dozen aborted tests lying around showing that you got your hands dirty than a suspiciously pristine set of perfectly executed tests.

Don’t forget that pesky documentation

Always document your testing, including method, definition, and results.

You will not remember precisely how you ran the test, even a few days later
How can you identify possible confounding of results, without recording a clear picture of the environment?
If you find something unexpected, you can quickly seek a second opinion
Without things written down, people will not be able to reproduce your testing
Test results on their own are worthless; they are just a set of numbers.
If it’s not written down, it didn’t happen

What next?

With a test run completed and a set of data collected, it’s time to make sense of the numbers and understand what they can tell us by analysing the results

Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to keep comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

March 18, 2013 Rittman Mead

Performance and OBIEE – part I – Introduction

Performance matters. Performance really matters. And performance can actually be easy, but it takes some thinking about. It can’t be brute-forced, or learnt by rote, or solved in a list of Best Practices, Silver Bullets and fairy dust.

The problem with performance is that it is too easy to guess and sometimes strike lucky, to pick at a “Best Practice Tuning” setting that by chance matches an issue on your system. This leads people down the path of thinking that performance is just about tweaking parameters, tuning settings, and twiddling knobs. The trouble with trusting this magic beans approach is that down this path leads wasted time, system instability, uncertainty, and insanity. Your fix that worked on another system might work this time, or a setting you find in a “Best Practice” document might work. But would it not be better to know that it would?

I wanted to write this series of posts as a way of getting onto paper how I think analysing and improving performance in OBIEE should be done and why. It is intended to address the very basic question of how do we improve the performance of OBIEE. Lots of people work with OBIEE, and many of them will have lots of ideas about performance, but not all have a clear picture of how to empirically test and improve performance.

Why does performance matter?

Last and only Stopwatch graphic, promise! (image src)

Why does performance matter? Why are some people (me) so obsessed with testing and timing and tuning things? Can’t we just put the system live and see how it goes, since it seems fast enough in Dev?…

Why performance matters to a project’s success

Slow systems upset users. No-one likes to be kept waiting. If you’re withdrawing cash from an ATM, you’re going to be quite cross if it takes five minutes. In fact, a pause of five seconds will probably get you fidgeting.
Once users dislike a system, regaining their favour is an uphill battle. “Trust is hard to win and easily lost”. One of the things about performance is perception of speed, and if a user has decided a system is slow you will have to work twice as hard to get them to simply recognise a small improvement. You not only have to fix the performance problem, you also have to win round the user again and prove that it is genuinely faster.
From a cost point of view, poorly performing systems are inefficient:
- They waste hardware resource, increasing the machine capacity required, decreasing the time between hardware upgrades
- They cost more to support, particularly as performance bottlenecks can cause unpredictable stability issues
- They cost more to maintain, in two ways. Firstly, each quick-win used in an attempt to resolve the problem will probably add to the complexity or maintenance overhead of the system. Secondly, a proper resolution of the problem may involve a redesign on such a scale that it can become a rewrite of the entire system in all but name.
- They cost more to use. User waiting = user distracted = less efficient at his job. Eventually, User waiting = disgruntled user = poor system usage and support from the business.

Why performance matters to the techie

Performance is not a fire-and-forget task, and box on a checklist. It has many facets and places in a project’s life cycle.

Red Herring

Done properly, you will have confidence in the performance of your system, knowledge of the limits of its capacity, a better understanding of the workings of it, and a repeatable process for validating any issues that arise or prospective configuration changes.

Performance Goblin

Done badly, or not at all, you might hit lucky and not have any performance problems for a while. But when they do happen, you’ll be starting from a position of ignorance, trying to learn at speed and under pressure how to diagnose and resolve the problems. Silver bullets appear enticing and get fired at the problem in the vain hope that one will work. Time will be wasted chasing red herrings. You have no real handle on how much capacity your server has for an increasing user base. Version upgrades fill you with fear of the unknown. You don’t dare change your system for fear of upsetting the performance goblin lest he wreak havoc.

Building a good system is not just about one which cranks out the correct numbers. A good system is one which not only cranks out the good numbers, but performs well when it does so. Performance is a key component of any system design.

OBIEE and Performance

dotmatrix report

Gone are the days of paper reports, when a user couldn’t judge the performance of a computer system except by whether the paper reports were on their desk by 0800 on Monday morning. Now, users are more and more technologically aware. They are more aware of the concept and power of data. Most will have smartphones and be used to having their email, music and personal life at the finger-swipe of a screen. They know how fast computers can work.

One of the many strengths of OBIEE is that it enables “self-service” BI. The challenge that this gives us is that users will typically expect their dashboards and analyses to run as fast as all their other interactions with technology. A slow system risks being an unsuccessful system, as users will be impatient, frustrated, even angry with it.

Below I propose an approach, a method, which will support the testing and tuning of the performance of OBIEE during all phases of a project. Every method must have a silly TLA or catchy name, and this one is no different….

Fancy a brew? Introducing T.E.A., the OBIEE Performance Method

In working with performance one of the most important things is to retain a structured and logical approach to it. Here is mine:

Test creation
- A predefined, repeatable, workload
Execute and Measure
- Run the test and collect data
Analyse
- Analyse the test results, and if necessary apply a change to the system which is then validated through a repeat of the cycle

The emphasis is on this method being applicable at any time in a system’s lifecycle, not just the “Performance Test” phase. Here are a few examples to put it in context:

Formal performance test stage of a project
1. Test : define and build a set of tests simulating users, including at high concurrency
2. Execute and Measure: run test and collect detailed statistics about system profile
3. Analyse : check for bottlenecks, diagnose, redesign or reconfigure system and retest
Continual Monitoring of performance
1. Test could be a standard prebuilt report with known run time (i.e. a baseline)
2. Execute could be just when the report gets run on demand, or a scheduled version of the report for monitoring purposes. Measure just the response time, alongside standard OS metrics
3. Analyse – collect response times to track trends, identify problems before they escalate. Provides a baseline against which to test changes
Troubleshooting a performance problem
1. Test could be existing reports with known performance times taken from OBIEE’s Usage Tracking data
2. Execute Rerun reports and measure response times and detailed system metrics
3. Analyse Diagnose root cause, fix and retest

Re-inventing the wheel

T.E.A. is nothing new in the overall context of Performance. It is almost certainly in existence elsewhere under another name or guise. I have deliberately split it into three separate parts to make it easier to work with in the context of OBIEE. The OBIEE stack is relatively complex and teasing apart the various parts for consideration has to be done carefully. For example, designing how we generate the test against OBIEE should be done in isolation from how we are going to monitor it. Both have numerous ways of doing so, and in several places can interlink. The most important thing is that they’re initially considered separately.

The other reason for defining my own method is that I wanted to get something in writing on which I can then hang my various OBIEE-specific performance rants without being constrained by the terminology of another method.

obXKCD

What’s to come

This series of articles is split into the following :

I’m tempted to hyperlink these in the fashion of Choose Your Own Adventure and if you click straight from here onto the last section, Optimise, without having read the other parts first, it will redirect you back to them ;-)

Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to keep the thread of comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

March 18, 2013 Rittman Mead

Performance and OBIEE – part I – Introduction

Performance matters. Performance really matters. And performance can actually be easy, but it takes some thinking about. It can’t be brute-forced, or learnt by rote, or solved in a list of Best Practices, Silver Bullets and fairy dust.

The problem with performance is that it is too easy to guess and sometimes strike lucky, to pick at a “Best Practice Tuning” setting that by chance matches an issue on your system. This leads people down the path of thinking that performance is just about tweaking parameters, tuning settings, and twiddling knobs. The trouble with trusting this magic beans approach is that down this path leads wasted time, system instability, uncertainty, and insanity. Your fix that worked on another system might work this time, or a setting you find in a “Best Practice” document might work. But would it not be better to know that it would?

I wanted to write this series of posts as a way of getting onto paper how I think analysing and improving performance in OBIEE should be done and why. It is intended to address the very basic question of how do we improve the performance of OBIEE. Lots of people work with OBIEE, and many of them will have lots of ideas about performance, but not all have a clear picture of how to empirically test and improve performance.

Why does performance matter?

Last and only Stopwatch graphic, promise! (image src)

Why does performance matter? Why are some people (me) so obsessed with testing and timing and tuning things? Can’t we just put the system live and see how it goes, since it seems fast enough in Dev?…

Why performance matters to a project’s success

Slow systems upset users. No-one likes to be kept waiting. If you’re withdrawing cash from an ATM, you’re going to be quite cross if it takes five minutes. In fact, a pause of five seconds will probably get you fidgeting.
Once users dislike a system, regaining their favour is an uphill battle. “Trust is hard to win and easily lost”. One of the things about performance is perception of speed, and if a user has decided a system is slow you will have to work twice as hard to get them to simply recognise a small improvement. You not only have to fix the performance problem, you also have to win round the user again and prove that it is genuinely faster.
From a cost point of view, poorly performing systems are inefficient:
- They waste hardware resource, increasing the machine capacity required, decreasing the time between hardware upgrades
- They cost more to support, particularly as performance bottlenecks can cause unpredictable stability issues
- They cost more to maintain, in two ways. Firstly, each quick-win used in an attempt to resolve the problem will probably add to the complexity or maintenance overhead of the system. Secondly, a proper resolution of the problem may involve a redesign on such a scale that it can become a rewrite of the entire system in all but name.
- They cost more to use. User waiting = user distracted = less efficient at his job. Eventually, User waiting = disgruntled user = poor system usage and support from the business.

Why performance matters to the techie

Performance is not a fire-and-forget task, and box on a checklist. It has many facets and places in a project’s life cycle.

Red Herring

Done properly, you will have confidence in the performance of your system, knowledge of the limits of its capacity, a better understanding of the workings of it, and a repeatable process for validating any issues that arise or prospective configuration changes.

Performance Goblin

Done badly, or not at all, you might hit lucky and not have any performance problems for a while. But when they do happen, you’ll be starting from a position of ignorance, trying to learn at speed and under pressure how to diagnose and resolve the problems. Silver bullets appear enticing and get fired at the problem in the vain hope that one will work. Time will be wasted chasing red herrings. You have no real handle on how much capacity your server has for an increasing user base. Version upgrades fill you with fear of the unknown. You don’t dare change your system for fear of upsetting the performance goblin lest he wreak havoc.

Building a good system is not just about one which cranks out the correct numbers. A good system is one which not only cranks out the good numbers, but performs well when it does so. Performance is a key component of any system design.

OBIEE and Performance

dotmatrix report

Gone are the days of paper reports, when a user couldn’t judge the performance of a computer system except by whether the paper reports were on their desk by 0800 on Monday morning. Now, users are more and more technologically aware. They are more aware of the concept and power of data. Most will have smartphones and be used to having their email, music and personal life at the finger-swipe of a screen. They know how fast computers can work.

One of the many strengths of OBIEE is that it enables “self-service” BI. The challenge that this gives us is that users will typically expect their dashboards and analyses to run as fast as all their other interactions with technology. A slow system risks being an unsuccessful system, as users will be impatient, frustrated, even angry with it.

Below I propose an approach, a method, which will support the testing and tuning of the performance of OBIEE during all phases of a project. Every method must have a silly TLA or catchy name, and this one is no different….

Fancy a brew? Introducing T.E.A., the OBIEE Performance Method

In working with performance one of the most important things is to retain a structured and logical approach to it. Here is mine:

Test creation
- A predefined, repeatable, workload
Execute and Measure
- Run the test and collect data
Analyse
- Analyse the test results, and if necessary apply a change to the system which is then validated through a repeat of the cycle

The emphasis is on this method being applicable at any time in a system’s lifecycle, not just the “Performance Test” phase. Here are a few examples to put it in context:

Formal performance test stage of a project
1. Test : define and build a set of tests simulating users, including at high concurrency
2. Execute and Measure: run test and collect detailed statistics about system profile
3. Analyse : check for bottlenecks, diagnose, redesign or reconfigure system and retest
Continual Monitoring of performance
1. Test could be a standard prebuilt report with known run time (i.e. a baseline)
2. Execute could be just when the report gets run on demand, or a scheduled version of the report for monitoring purposes. Measure just the response time, alongside standard OS metrics
3. Analyse – collect response times to track trends, identify problems before they escalate. Provides a baseline against which to test changes
Troubleshooting a performance problem
1. Test could be existing reports with known performance times taken from OBIEE’s Usage Tracking data
2. Execute Rerun reports and measure response times and detailed system metrics
3. Analyse Diagnose root cause, fix and retest

Re-inventing the wheel

T.E.A. is nothing new in the overall context of Performance. It is almost certainly in existence elsewhere under another name or guise. I have deliberately split it into three separate parts to make it easier to work with in the context of OBIEE. The OBIEE stack is relatively complex and teasing apart the various parts for consideration has to be done carefully. For example, designing how we generate the test against OBIEE should be done in isolation from how we are going to monitor it. Both have numerous ways of doing so, and in several places can interlink. The most important thing is that they’re initially considered separately.

The other reason for defining my own method is that I wanted to get something in writing on which I can then hang my various OBIEE-specific performance rants without being constrained by the terminology of another method.

obXKCD

What’s to come

This series of articles is split into the following :

I’m tempted to hyperlink these in the fashion of Choose Your Own Adventure and if you click straight from here onto the last section, Optimise, without having read the other parts first, it will redirect you back to them ;-)

Comments?

I’d love your feedback. Do you agree with this method, or is it a waste of time? What have I overlooked or overemphasised? Am I flogging a dead horse?

Because there are several articles in this series, and I’d like to keep the thread of comments in one place, I’ve enabled comments on the summary and FAQ post here, and disabled comments on the others.

Tag Archives: Oracle BI Suite EE

Analysis

Avoid Averages

Throw away your test data

Analysis summary

Diagnosing poor OBIEE performance

Get to the root of the problem

Performance vs Capacity

Errors

Response time profiling

Diagnosing capacity problems

Additional diagnosis tips

Make sure that the database is sweating

Instrumenting connection pools

Why’s it doing what it’s doing

Footnote: Hypotheses

What next?

Other articles in this series

Comments?

Please consider the environment before running this test

Measure

Response times

System metrics

OS metrics

OBI metrics

Query Metrics

Execute

Don’t fiddle the books!

Don’t forget that pesky documentation

What next?

Other articles in this series

Comments?

Please consider the environment before running this test

Measure

Response times

System metrics

OS metrics

OBI metrics

Query Metrics

Execute

Don’t fiddle the books!

Don’t forget that pesky documentation

What next?

Other articles in this series

Comments?

Why does performance matter?

Why performance matters to a project’s success

Why performance matters to the techie

OBIEE and Performance

Fancy a brew? Introducing T.E.A., the OBIEE Performance Method

Re-inventing the wheel

What’s to come

Comments?

Why does performance matter?

Why performance matters to a project’s success

Why performance matters to the techie

OBIEE and Performance

Fancy a brew? Introducing T.E.A., the OBIEE Performance Method

Re-inventing the wheel

What’s to come

Comments?

Social