Using Linux Control Groups to Constrain Process Memory

Linux Control Groups (cgroups) are a nifty way to limit the amount of resource, such as CPU, memory, or IO throughput, that a process or group of processes may use. Frits Hoogland wrote a great blog demonstrating how to use them to constrain the I/O that a particular process could use, and it was the inspiration for this one. I have been doing some digging into the performance characteristics of OBIEE under certain conditions, including how it behaves under memory pressure. I’ll write more about that in a future blog, but wanted to write this short post to demonstrate how cgroups can be used to constrain the memory that a given Linux process can be allocated.

This was done on Amazon EC2 running an image imported originally from Oracle’s OBIEE SampleApp, built on Oracle Linux 6.5.

$ uname -a  
Linux demo.us.oracle.com 2.6.32-431.5.1.el6.x86_64 #1 SMP Tue Feb 11 11:09:04 PST 2014 x86_64 x86_64 x86_64 GNU/Linux

First off, install the necessary package in order to use cgroups, and start the service. Throughout this blog, where I quote shell commands, those prefixed with # are run as root and those prefixed with $ as non-root:

# yum install libcgroup  
# service cgconfig start
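
If you want the cgconfig service to come back after a reboot, enable it at boot time too (optional; a sketch using Oracle Linux 6’s standard chkconfig):

# chkconfig cgconfig on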

Create a cgroup (I’m shamelessly ripping off Frits’ code here, hence the same cgroup name ;-) ):

# cgcreate -g memory:/myGroup

You can use cgget to view the current limits, usage, & high watermarks of the cgroup:

# cgget -g memory:/myGroup|grep bytes  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.memsw.max_usage_in_bytes: 0  
memory.memsw.usage_in_bytes: 0  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 9223372036854775807  
memory.max_usage_in_bytes: 0  
memory.usage_in_bytes: 0

For more information about the meaning of each field, see the doc here.

To test out the cgroup’s ability to limit the memory used by a process we’re going to use the tool stress, which can be used to generate CPU, memory, or IO load on a server. It’s great for testing what happens to a server under resource pressure, and also for testing the memory allocation capabilities of a process, which is what we’re using it for here.
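
Note that stress isn’t in the base Oracle Linux repositories; I’d expect to install it from EPEL or similar (an assumption about your repository setup):

# yum install stress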

We’re going to configure cgroups to add stress to the myGroup group whenever it runs:

$ cat /etc/cgrules.conf  
*:stress memory myGroup

[Re-]start the cg rules engine service:

# service cgred restart
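
As an aside, the rules engine isn’t the only way to get a process into a cgroup – you can also attach an already-running process by PID. A sketch, assuming the default Oracle Linux 6 mount point of /cgroup and a hypothetical PID of 12345:

# cgclassify -g memory:/myGroup 12345

or, equivalently, write the PID straight into the cgroup’s tasks file:

# echo 12345 > /cgroup/memory/myGroup/tasks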

Now we’ll use the watch command to re-issue the cgget command every second, enabling us to watch the cgroup’s metrics in real time:

# watch --interval 1 cgget -g memory:/myGroup  
/myGroup:  
memory.memsw.failcnt: 0  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.memsw.max_usage_in_bytes: 0  
memory.memsw.usage_in_bytes: 0  
memory.oom_control: oom_kill_disable 0  
        under_oom 0  
memory.move_charge_at_immigrate: 0  
memory.swappiness: 60  
memory.use_hierarchy: 0  
memory.stat: cache 0  
        rss 0  
        mapped_file 0  
        pgpgin 0  
        pgpgout 0  
        swap 0  
        inactive_anon 0  
        active_anon 0  
        inactive_file 0  
        active_file 0  
        unevictable 0  
        hierarchical_memory_limit 9223372036854775807  
        hierarchical_memsw_limit 9223372036854775807  
        total_cache 0  
        total_rss 0  
        total_mapped_file 0  
        total_pgpgin 0  
        total_pgpgout 0  
        total_swap 0  
        total_inactive_anon 0  
        total_active_anon 0  
        total_inactive_file 0  
        total_active_file 0  
        total_unevictable 0  
memory.failcnt: 0  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 9223372036854775807  
memory.max_usage_in_bytes: 0  
memory.usage_in_bytes: 0

In a separate terminal (or even better, use screen!) run stress, telling it to grab 150MB of memory:

$ stress --vm-bytes 150M --vm-keep -m 1

Review the cgroup, and note that the usage fields have increased:

/myGroup:  
memory.memsw.failcnt: 0  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.memsw.max_usage_in_bytes: 157548544  
memory.memsw.usage_in_bytes: 157548544  
memory.oom_control: oom_kill_disable 0  
        under_oom 0  
memory.move_charge_at_immigrate: 0  
memory.swappiness: 60  
memory.use_hierarchy: 0  
memory.stat: cache 0  
        rss 157343744  
        mapped_file 0  
        pgpgin 38414  
        pgpgout 0  
        swap 0  
        inactive_anon 0  
        active_anon 157343744  
        inactive_file 0  
        active_file 0  
        unevictable 0  
        hierarchical_memory_limit 9223372036854775807  
        hierarchical_memsw_limit 9223372036854775807  
        total_cache 0  
        total_rss 157343744  
        total_mapped_file 0  
        total_pgpgin 38414  
        total_pgpgout 0  
        total_swap 0  
        total_inactive_anon 0  
        total_active_anon 157343744  
        total_inactive_file 0  
        total_active_file 0  
        total_unevictable 0  
memory.failcnt: 0  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 9223372036854775807  
memory.max_usage_in_bytes: 157548544  
memory.usage_in_bytes: 157548544

Both memory.memsw.usage_in_bytes and memory.usage_in_bytes are 157548544 bytes = 150.25MB.

Having a look at the process stats for stress shows us:

$ ps -ef|grep stress  
oracle   15296  9023  0 11:57 pts/12   00:00:00 stress --vm-bytes 150M --vm-keep -m 1  
oracle   15297 15296 96 11:57 pts/12   00:06:23 stress --vm-bytes 150M --vm-keep -m 1  
oracle   20365 29403  0 12:04 pts/10   00:00:00 grep stress

$ cat /proc/15297/status

Name:   stress  
State:  R (running)  
[...]  
VmPeak:   160124 kB  
VmSize:   160124 kB  
VmLck:         0 kB  
VmHWM:    153860 kB  
VmRSS:    153860 kB  
VmData:   153652 kB  
VmStk:        92 kB  
VmExe:        20 kB  
VmLib:      2232 kB  
VmPTE:       328 kB  
VmSwap:        0 kB  
[...]

The man page for proc gives us more information about these fields, but of particular note are:

  • VmSize: Virtual memory size.
  • VmRSS: Resident set size.
  • VmSwap: Swapped-out virtual memory size by anonymous private pages

Our stress process has a VmSize of 156MB, VmRSS of 150MB, and zero swap.

Kill the stress process, and set a memory limit of 100MB for any process in this cgroup:

# cgset -r memory.limit_in_bytes=100m myGroup

Run cgget and you should see the new limit. Note that at this stage we’re just setting memory.limit_in_bytes and leaving memory.memsw.limit_in_bytes unchanged.

# cgget -g memory:/myGroup|grep limit|grep bytes  
memory.memsw.limit_in_bytes: 9223372036854775807  
memory.soft_limit_in_bytes: 9223372036854775807  
memory.limit_in_bytes: 104857600
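
Bear in mind that a limit set with cgset lasts only until the cgroup is destroyed (for example by a cgconfig restart). To make it permanent you could define it in /etc/cgconfig.conf instead (a sketch based on the stock libcgroup config syntax; 104857600 bytes = 100MB):

group myGroup {
    memory {
        memory.limit_in_bytes = 104857600;
    }
}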

Let’s see what happens when we try to allocate the memory, observing the cgroup and process Virtual Memory process information at each point:

  • 15MB:

    $ stress --vm-bytes 15M --vm-keep -m 1  
    stress: info: [31942] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
    
    # cgget -g memory:/myGroup|grep usage|grep -v max  
    memory.memsw.usage_in_bytes: 15990784  
    memory.usage_in_bytes: 15990784
    
    $ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
    VmPeak:    21884 kB  
    VmSize:    21884 kB  
    VmLck:         0 kB  
    VmHWM:     15616 kB  
    VmRSS:     15616 kB  
    VmData:    15412 kB  
    VmStk:        92 kB  
    VmExe:        20 kB  
    VmLib:      2232 kB  
    VmPTE:        60 kB  
    VmSwap:        0 kB

  • 50MB:

    $ stress --vm-bytes 50M --vm-keep -m 1  
    stress: info: [32419] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
    
    # cgget -g memory:/myGroup|grep usage|grep -v max  
    memory.memsw.usage_in_bytes: 52748288  
    memory.usage_in_bytes: 52748288     
    
    $ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
    VmPeak:    57724 kB  
    VmSize:    57724 kB  
    VmLck:         0 kB  
    VmHWM:     51456 kB  
    VmRSS:     51456 kB  
    VmData:    51252 kB  
    VmStk:        92 kB  
    VmExe:        20 kB  
    VmLib:      2232 kB  
    VmPTE:       128 kB  
    VmSwap:        0 kB

  • 100MB:

    $ stress --vm-bytes 100M --vm-keep -m 1  
    stress: info: [20379] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd        
    # cgget -g memory:/myGroup|grep usage|grep -v max  
    memory.memsw.usage_in_bytes: 105197568  
    memory.usage_in_bytes: 104738816
    
    $ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
    VmPeak:   108924 kB  
    VmSize:   108924 kB  
    VmLck:         0 kB  
    VmHWM:    102588 kB  
    VmRSS:    101448 kB  
    VmData:   102452 kB  
    VmStk:        92 kB  
    VmExe:        20 kB  
    VmLib:      2232 kB  
    VmPTE:       232 kB  
    VmSwap:     1212 kB

Note that VmSwap has now gone above zero, despite the machine having plenty of usable memory:

# vmstat -s  
     16330912  total memory  
     14849864  used memory  
     10583040  active memory  
      3410892  inactive memory  
      1481048  free memory  
       149416  buffer memory  
      8204108  swap cache  
      6143992  total swap  
      1212184  used swap  
      4931808  free swap

So it looks like the memory cap has kicked in and the stress process is being forced to get the additional memory that it needs from swap.
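
If you want to watch this happen as you experiment, the same pgrep trick used below can be combined with watch to track the worker’s swap usage (a sketch – note that the PID is resolved once, when watch starts):

$ watch -n 1 "grep VmSwap /proc/$(pgrep stress|tail -n1)/status"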

Let’s tighten the screw a bit further:

$ stress --vm-bytes 200M --vm-keep -m 1  
stress: info: [21945] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

The process is now using 100MB of swap (since we’ve asked it to grab 200MB but the cgroup is constraining it to 100MB of real memory):

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
VmPeak:   211324 kB  
VmSize:   211324 kB  
VmLck:         0 kB  
VmHWM:    102616 kB  
VmRSS:    102600 kB  
VmData:   204852 kB  
VmStk:        92 kB  
VmExe:        20 kB  
VmLib:      2232 kB  
VmPTE:       432 kB  
VmSwap:   102460 kB

The cgget command confirms that we’re using swap, as the memsw value shows:

# cgget -g memory:/myGroup|grep usage|grep -v max  
memory.memsw.usage_in_bytes: 209788928  
memory.usage_in_bytes: 104759296

So now what happens if we curtail the use of all memory, including swap? To do this we’ll set the memory.memsw.limit_in_bytes parameter. Note that running cgset whilst a task in the cgroup is executing seems to get ignored if the new limit is below the current usage (per the usage_in_bytes field). If it is above the current usage then the change is instantaneous:

  • Current state

    # cgget -g memory:/myGroup|grep bytes  
    memory.memsw.limit_in_bytes: 9223372036854775807  
    memory.memsw.max_usage_in_bytes: 209915904  
    memory.memsw.usage_in_bytes: 209784832  
    memory.soft_limit_in_bytes: 9223372036854775807  
    memory.limit_in_bytes: 104857600  
    memory.max_usage_in_bytes: 104857600  
    memory.usage_in_bytes: 104775680

  • Set the limit below what is currently in use (150m limit vs 200m in use)

    # cgset -r memory.memsw.limit_in_bytes=150m myGroup

  • Check the limit – it remains unchanged

    # cgget -g memory:/myGroup|grep bytes  
    memory.memsw.limit_in_bytes: 9223372036854775807  
    memory.memsw.max_usage_in_bytes: 209993728  
    memory.memsw.usage_in_bytes: 209784832  
    memory.soft_limit_in_bytes: 9223372036854775807  
    memory.limit_in_bytes: 104857600  
    memory.max_usage_in_bytes: 104857600  
    memory.usage_in_bytes: 104751104

  • Set the limit above what is currently in use (250m limit vs 200m in use)

    # cgset -r memory.memsw.limit_in_bytes=250m myGroup

  • Check the limit – it’s taken effect

    # cgget -g memory:/myGroup|grep bytes  
    memory.memsw.limit_in_bytes: 262144000  
    memory.memsw.max_usage_in_bytes: 210006016  
    memory.memsw.usage_in_bytes: 209846272  
    memory.soft_limit_in_bytes: 9223372036854775807  
    memory.limit_in_bytes: 104857600  
    memory.max_usage_in_bytes: 104857600  
    memory.usage_in_bytes: 104816640

So now we’ve got limits in place of 100MB real memory and 250MB total (real + swap). What happens when we test that out?

$ stress --vm-bytes 245M --vm-keep -m 1  
stress: info: [25927] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

The process is using 245MB total (VmData), of which 95MB is resident (VmRSS) and 150MB is swapped out (VmSwap):

$ cat /proc/$(pgrep stress|tail -n1)/status|grep Vm  
VmPeak:   257404 kB  
VmSize:   257404 kB  
VmLck:         0 kB  
VmHWM:    102548 kB  
VmRSS:     97280 kB  
VmData:   250932 kB  
VmStk:        92 kB  
VmExe:        20 kB  
VmLib:      2232 kB  
VmPTE:       520 kB  
VmSwap:   153860 kB

The cgroup stats reflect this:

# cgget -g memory:/myGroup|grep bytes  
memory.memsw.limit_in_bytes: 262144000  
memory.memsw.max_usage_in_bytes: 257159168  
memory.memsw.usage_in_bytes: 257007616  
[...]  
memory.limit_in_bytes: 104857600  
memory.max_usage_in_bytes: 104857600  
memory.usage_in_bytes: 104849408

If we try to go above this absolute limit (memory.memsw.limit_in_bytes) then the cgroup kicks in and stops the process getting the memory, which in turn causes stress to fail:

$ stress --vm-bytes 250M --vm-keep -m 1  
stress: info: [27356] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd  
stress: FAIL: [27356] (415) <-- worker 27357 got signal 9  
stress: WARN: [27356] (417) now reaping child worker processes  
stress: FAIL: [27356] (451) failed run completed in 3s
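
That “signal 9” is the kernel’s OOM killer acting within the confines of the cgroup. You should be able to confirm this from the kernel log (a sketch – the exact message text varies by kernel version):

# dmesg | grep -i "memory cgroup"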

This gives you an indication of how careful you need to be when using this type of low-level process control. Most tools will not be happy if they are starved of resources, including memory, and may well behave in unstable ways.


Thanks to Frits Hoogland for reading a draft of this post and providing valuable feedback.


BI Forum 2015 Preview — OBIEE Regression Testing, and Data Discovery with the ELK stack

I’m pleased to be presenting at both of the Rittman Mead BI Forums this year; in Brighton it’ll be my fourth time, whilst Atlanta will be my first, and my first trip to the city too. I’ve heard great things about the food, and I’m sure the forum content is going to be awesome too (Ed: get your priorities right).

OBIEE Regression Testing

In Atlanta I’ll be talking about Smarter Regression testing for OBIEE. The topic of Regression Testing in OBIEE is one that is – at last – starting to gain some real momentum. One of the drivers of this is the recognition in the industry that a more Agile approach to delivering BI projects is important, and to do this you need to have a good way of rapidly testing changes made. The other driver that I see is OBIEE 12c and the Baseline Validation Tool that Oracle announced at Oracle OpenWorld last year. Understanding how OBIEE works, and therefore how changes made can be tested most effectively, is key to a successful and efficient testing process.

In this presentation I’ll be diving into the OBIEE stack and explaining where it can be tested and how. I’ll discuss the common approaches and the relative strengths of each.

If you’ve not registered for the Atlanta BI Forum then do so now as places are limited and selling out fast. It runs May 14–15 with an optional masterclass on Wednesday 13th May from Mark Rittman and Jordan Meyer.

Data Discovery with the ELK Stack

My second presentation is at the Brighton forum the week before Atlanta, and I’ll be talking about Data Discovery and Systems Diagnostics with the ELK stack. The ELK stack is a set of tools from a company called Elastic, comprising Elasticsearch, Logstash and Kibana (E – L – K!). Data Discovery is a crucial part of the life cycle of acquiring, understanding, and exploiting data (one could even say, leverage the data). Before you can operationalise your reporting, you need to understand what data you have, how it relates, and what insights it can give you. This idea of a “Discovery Lab” is one of the key components of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:

ELK gives you great flexibility to ingest data with loose data structures and rapidly visualise and analyse it. I wrote about it last year with an example of analysing data from our blog and associated tweets with data originating in Hadoop, and more recently have been analysing twitter activity using it. The great power of Kibana (the “K” of ELK) is the ability to rapidly filter and aggregate data, as well as see a summary of values within a data field:

The second aspect of my presentation is still on data discovery, but “discovering data” within the logfiles of an application stack such as OBIEE. ELK is perfectly suited to in-depth diagnostics against dense volumes of log data that you simply could not handle within simple log viewers or Enterprise Manager, such as the individual HTTP requests and types of value passed within the interactions of a single user session:

By its nature of log streaming and full text search, ELK also lends itself well to near real time system monitoring dashboards reporting the status of systems including OBIEE and ODI, and I’ll be discussing this in more detail during my talk.

The Brighton BI Forum is on 7–8 May, with an optional masterclass on Wednesday 6th May from Mark Rittman and Jordan Meyer. If you’ve not registered for the Brighton BI Forum then do so now as places are very limited!


Don’t forget, we’re running a Data Visualisation Challenge at each of the forums, and if you need to convince your boss to let you go you can find a pre-written ‘justification’ letter here.

Visual Regression Testing of OBIEE with PhantomCSS

Earlier this year I wrote a couple of blog posts (here and here) discussing the topic of automated Regression Testing and OBIEE. One of the points that I was keen to make was that OBIEE is a stack of elements, and depending on the change being tested it may be sensible to focus on certain elements of the stack instead of all of it. For example, if you are changing the RPD, there is little value in doing a web-based test when you can actually test for the vast majority of regressions using the nqcmd tool alone.

I also argued that testing the front end of OBIEE using tools such as Selenium is difficult to do comprehensively; it can be inflexible, time-consuming, and in some cases just not a sensible use of effort. These tools work around the idea of parsing the web page that is served up and checking for the presence (or absence) of a particular piece of text or an element on the page. So for example, you could run a test and tell it to fail if it finds the text “Error” on the page, or you could say only pass the test if some known content is present, such as a report title or data figure. This type of testing is prone to a great deal of false negatives, because to efficiently build any kind of test case you must focus on something specific to check for in the page, but you cannot code for every possible error or failure. It is also usually based heavily on the internal IDs of elements on the page in locating the ‘something’ to check for. As the OBIEE Document Object Model (DOM) is undocumented code, Oracle are presumably at liberty to change it whenever they feel like it, and thus any tests written based on it may fail. Finally, OBIEE 11g still defaults to serving up graphs as Flash objects, which Selenium et al just cannot handle, and so these cannot be tested.

So, what do we do about regression testing the OBIEE front end?

What do we need to test in the front end?

There is still a strong case for regression testing the OBIEE front end. Analyses get changed, Dashboards break, permissions are updated – all these things can cause errors or problems for the end user, and they are things that testing further down the OBIEE stack (using something like nqcmd) will not cover.

Consider a simple dashboard:

If one of the dashboard pages that are linked to in the central section gets moved in the Presentation Catalog, then this happens:

OK, so “Invalid Link Path” is pretty easy to code in as an error check in Selenium. But what about if the permissions on an analysis used in the dashboard get changed and the user can no longer access it when running the dashboard?

This is a different problem altogether. We need to check for the absence of something. There’s no error; there just isn’t the analysis that ought to be present. One way around this would be to code for the presence of the analysis title text or content – but that is not going to scale, nor be maintainable, for every dashboard being tested.

Another thing that is important to check in the front end is that authorisations are enforced as they should be. That is, a user can see the dashboards that they should be able to, and cannot see the ones they’re not supposed to. Changes made in the LDAP directory holding users and their groups, or a configuration change in the Application Roles, could easily mean that a user can no longer see the dashboards they should be able to. We could code for this specific issue using something like Web Services to programmatically check each and every actual permission – but that could well be overkill.

What I would like to introduce here is the idea of testing OBIEE for regressions visually – but automated, of course.

Visual Regression Testing

Driven by the huge number of applications that are accessed solely on the web (sorry, “Cloud”), a new set of tools has been developed to support the idea of testing web pages for regressions visually. Instead of ‘explaining’ to the computer specifically what to look for in a page (no error text, etc), visual regression testing compares images of a web page, checking a sample taken afterwards against a baseline. This means that the number of false negatives (missing genuine errors because the test didn’t detect them) drops drastically, because instead of relying on coding a test program to parse the Document Object Model (DOM) of an OBIEE web page (which is extremely complex), it simply considers whether two snapshots of the resulting rendered page look the same.

The second real advantage of this method is that typically the tools (including the one I have been working with and will demonstrate below, PhantomCSS) are based on the actual engine that drives the web browsers in use by real end-users. So it’s not a case of parsing the HTML and CSS that the web server sends us and trying to determine if there’s a problem or not – it is actually rendering it the same as Chrome etc and taking a snapshot of it. PhantomCSS uses PhantomJS, which uses the engine that Safari is built on, WebKit.

Let’s Pretend…

So, we’ve got a tool – that I’ll demonstrate shortly – that can programmatically fetch and snapshot OBIEE pages, and compare the snapshots to check for any changes. But what about graphs rendered in Flash? These are usually a blind spot. Well, here we can be a bit cheeky. If you pretend (in the User-Agent HTTP request header) to be an iPhone or iPad (devices that don’t support Flash) then OBIEE obligingly serves up PNG graphs plus some javascript to do the hover tooltips. Because it’s a PNG image, it will be rendered correctly in our “browser”, and so included in the snapshot for comparison.
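
CasperJS, which I introduce below, makes this spoofing a one-liner (a sketch; the User-Agent string here is just an example iPad one):

casper.userAgent('Mozilla/5.0 (iPad; CPU OS 7_0 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11A465 Safari/9537.53');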

CasperJS

Let’s see this scripting in action. Some clarification of the programs we’re going to use first:

  • PhantomJS is the core functionality we’re using: a headless browser sporting JavaScript (JS) APIs
  • CasperJS provides a set of APIs on top of PhantomJS that make working with web page forms, navigation etc much easier
  • PhantomCSS provides the regression testing bit, taking snapshots and running code to compare them and report differences.

We’ll consider a simple CasperJS example first, and come on to PhantomCSS after. Because PhantomCSS uses CasperJS for its core interactions, it makes sense to start with the basics.

Here is a bare-bones script. It loads the login page for OBIEE, echoes the page title to the console, takes a snapshot, and exits:

var casper = require('casper').create();

casper.start('http://rnm-ol6-2:9704/analytics', function() {
  this.echo(this.getTitle());
  this.capture('casper_screenshots/login.png');
});

casper.run();

I run it from the command line:

$ casperjs casper_example_01.js
Oracle Business Intelligence Sign In
$

As you can see, it outputs the title of the page, and then in the screenshots folder I have this:

I want to emphasise again to make clear why this is so useful: I ran this from the command line only. I didn’t run a web browser, I didn’t take any snapshots by hand – it was all automatic.

Now, let’s build a bit of a bigger example, where we login to OBIEE and see what dashboards are available to us:

// Set the size of the browser window as part of the 
// Casper instantiation
var casper = require('casper').create({viewportSize: {
        width: 800,
        height: 600
    }});

// Load the login page
casper.start('http://rnm-ol6-2:9704/analytics', function() {
  this.echo(this.getTitle());
  this.capture('casper_screenshots/login.png');
});

// Do login
casper.then(function(){
  this.fill('form#logonForm', { NQUser: 'weblogic' ,
                                NQPassword: 'Password01'
                              }, true);
}).
waitForUrl('http://rnm-ol6-2:9704/analytics/saw.dll?bieehome',function(){
  this.echo('Logged into OBIEE','INFO')
  this.capture('casper_screenshots/afterlogin.png');
  });

// Now "click" the Dashboards menu
casper.then(function() {
  this.echo('Clicking Dashboard menu','INFO')
  casper.click('#dashboard');
  this.waitUntilVisible('div.HeaderPopupWindow', function() {
    this.capture('casper_screenshots/dashboards.png');
  });
});

casper.run();

So I now get a screenshot of the page after logging in:

and after “clicking” the Dashboard menu:

The only bit of the script above that isn’t self-explanatory is where I am referencing elements. The references are CSS3 selectors, and are easily found using something like Chrome Developer Tools. Where the click on Dashboards is simulated, there is a waitUntilVisible function, which is crucial for making sure that the page has rendered fully. A user clicking the menu would obviously wait until it appears, but computers work much faster, so functions like this are important for reining them back.

To round off the CasperJS script, let’s add to the above navigating to a Dashboard, snapshotting it (with graphs!), and then logging out.

[...]
casper.then(function(){
  this.echo('Navigating to GCBC Dashboard','INFO')
  casper.clickLabel('GCBC Dashboard');
})

casper.waitForUrl('http://rnm-ol6-2:9704/analytics/saw.dll?dashboard', function() {
  casper.waitWhileVisible('div.AjaxLoadingOpacity', function() {
    casper.waitWhileVisible('div.ProgressIndicatorDiv', function() {
      this.capture('casper_screenshots/dashboard.png');
    })
  })
});

casper.then(function() {
  this.echo('Signing out','INFO')
  casper.clickLabel('Sign Out');
});

Again, there are a couple of waitWhileVisible functions in there, necessary to get CasperJS to wait until the dashboard has rendered properly. The dashboard rendered is captured thus:

PhantomCSS

So now let’s see how we can use the above CasperJS code in conjunction with PhantomCSS to generate a viable regression test scenario for OBIEE.

The script remains pretty much the same, except CasperJS’s capture gets replaced with a phantomcss.screenshot based on an element (html for the whole page), and there’s some extra “footer” code to include that executes the actual test.
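
For reference, that footer looks broadly like this (a sketch based on the PhantomCSS project’s own examples; the require path is an assumption about where you’ve unpacked the library):

// Initialise PhantomCSS, telling it where to keep baseline and failure images
var phantomcss = require('./phantomcss.js');
phantomcss.init({
    screenshotRoot: './screenshots',
    failedComparisonsRoot: './failures'
});

// [... casper.start()/casper.then() steps as before, using phantomcss.screenshot() ...]

// Compare every screenshot taken in this run against its baseline
casper.then(function() {
    phantomcss.compareAll();
});

casper.run(function() {
    // Exit non-zero if any comparison failed, so a calling script can pick it up
    phantom.exit(phantomcss.getExitStatus());
});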

So let’s see how the proposed test method holds up to the examples above – broken links and disappearing reports.

First, we run the baseline capture, the “known good”. The console output shows that this is the first time it’s been run, because there are no existing images against which to compare:

In the screenshots folder is the ‘baseline’ image for each of the defined snapshots:

Now let’s break something! First off I’ll rename the target page for one of the links in the central pane of the dashboard, which will cause the ‘Invalid Link Path’ message to display.

Now I run the same PhantomCSS test again, and this time it tells me there’s a problem:

When an image is found to differ, a composite of the two highlighting the differences is created:

OK, so first test passed (or rather, failed), but arguably this could have been picked up simply by parsing the page returned from the OBIEE server for known error strings. But what about a disappearing analysis – that’s more difficult to ascertain from the page source alone.

Again, PhantomCSS picks up the difference, and highlights it nice and clearly in the generated image:

The baseline image that you capture should be against a “gold” version of a dashboard – there’s no point including ad-hoc reports or dashboards under development. You’d also want to work with data that is unchanging, so where available use a time filter fixed at a point in the past, rather than ‘current day’, which will change frequently.

Belts and Braces?

So visual regression testing is a great thing, but I think a hybrid approach, of parsing the page contents for text too, is worthwhile. CasperJS provides its own test APIs (which PhantomCSS uses), and we can write simple tests such as the following:

this.test.assertTextDoesntExist('Invalid Link Path', 'Check for error text on page');
this.test.assertTextDoesntExist('View Display Error', 'Check for error text on page');
phantomcss.screenshot('div.DashboardPageContentDiv','GCBC Dashboard page 1');

So check for a couple of well-known errors, and then snapshot the page too for subsequent automatic comparison. If an assertion fails, it shows in the console:

This means that what is already being done in Selenium (or for which Selenium is an assumed default tool) could even be brought into the same single test rig based around CasperJS/PhantomCSS.

Frame of Reference

The eagle-eyed of you will have noticed that the snapshots generated by PhantomCSS above are not the entire OBIEE webpage, whereas the ones from CasperJS earlier in this article are. That is because PhantomCSS deliberately wants to focus on an area of the page to test, identified using a CSS3 selector. So if you are testing a dashboard, then considering the toolbar is irrelevant and can only lead to false-positives.

phantomcss.screenshot('div.DashboardPageContentDiv','GCBC Dashboard page 1');

Similarly, checking the available dashboard list (to validate enforced authorisations) just needs to look at the list itself, not the rest of the page. (And yes, that does say “Protals” – even developers have fat fingers sometimes ;-) )

phantomcss.screenshot('div.HeaderSharedProtals','Dashboard list');

Using this functionality means that the generated snapshots used for comparison can be done to exclude things like the alerts bar (which may appear or disappear between tests).

The Devil’s in the Detail

I am in no doubt that the method described above has definitely got its place in the regression testing arsenal for OBIEE. What I am yet to be fully convinced of is quite to what extent. My beef with Selenium et al is the level of detail one has to get into when writing tests – identifying strings to test for, their location in the DOM, and so on. Yet above, in my CasperJS/PhantomCSS examples, I have DOM selectors too, so is this just the same problem? At the moment, I don’t think so. With Selenium, to build a comprehensive test you have to dissect the DOM for every single test you want to build. Whereas with CasperJS/PhantomCSS I think there is the need to write a basic framework for OBIEE (the basics of which are provided in this post; you’re welcome), which can then be parameterised based on dashboard name and page only. Sure, additional types of tests may need new code, but it would be more reusable.

Given that OBIEE doesn’t come with an out-of-the-box test rig, whatever we build to test it is going to be bespoke, whether it’s nqcmd, Selenium, JMeter, LoadRunner, OATS, QTP, and so on – the smart money is on picking the option that will be the most flexible, most scalable, easiest to maintain, and take the least effort to develop. There is no one “program to rule them all” – an accurate, comprehensive, and flexible test suite is invariably going to utilise multiple components focussing on different areas.

In the case of regression testing – what is the aim of the testing? What are you looking to validate hasn’t broken after what kind of change?  If all that’s changed in the system is the DBAs adding some indexes or partitioning to the data, I really would not be going anywhere near the front end of OBIEE. However, more complex changes affecting the Presentation Catalog and the RPD can be well covered by this technique in conjunction with nqcmd. Visual regression testing will give you a pass/fail, but then it’s up to you to decipher the images, whereas nqcmd will give you a pass/fail but also an actual set of data to show what has changed.

Don’t forget that other great tool – you! Or rather, you and your minions, who can sit at OBIEE for five minutes and spot certain regressions that would take orders of magnitude longer to build a test to locate. Things like testing for UI/UX changes between OBIEE versions are realistically handled manually. Some checks can be done manually faster than I can even type the requirement, let alone build a test to validate it – does clicking on the save icon bring up the save box? Well, go click for yourself – done? Next test.

Summary

I have just scratched the surface of what is possible with headless browser scripting for testing OBIEE. Being able to automate and capture the results of browser interactions as we’ve seen above is hugely powerful. You can find the CasperJS API reference here if you want to find out more about how it is possible to interact with the web page as a “user”.

I’ve put the complete PhantomCSS script online here. Let me know in the comments section or via twitter if you do try it out!

Thanks to Christian Berg and Gianni Ceresa for reading drafts of this article and providing valuable feedback. 

Built-In OBIEE Load Testing with nqcmd

nqcmd ships with all installations of OBIEE and includes some very useful hidden functionality – the ability to generate load tests against OBIEE. There are lots of ways of generating load against OBIEE, but most require third party tools of varying degrees of complexity to work with.

It’s easy to try this out. First set the OBIEE environment:  [I'm using SampleApp v309R2 as an example; your FMW_HOME path will vary]

. ~/obiee/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh

and then the “open sesame” setting which enables the hidden nqcmd functionality:

export SA_NQCMD_ADVANCED=Yes

On Windows, run set SA_NQCMD_ADVANCED=YES instead. If you don’t set this environment variable then nqcmd just throws an error if you try to use one of the hidden options.

Now if you list the available options for nqcmd you’ll see lots of new options in addition to the usual ones:

Command: nqcmd - a command line client which can issue SQL statements
                 against either Oracle BI server or a variety
                 of ODBC compliant backend databases.
SYNOPSIS
         nqcmd [OPTION]...
DESCRIPTION
         -d<data source name>
         -u<user name>
         -p<password>
         -s<sql input file name>
         -o<output result file name>
         -D<Delimiter>
         -b<super batch file name>
         -w<# wait seconds>
         -c<# cancel interval seconds>
         -C<# number of fetched rows by column-wise binding>
         -n<# number of loops>
         -r<# number of requests per shared session>
         -R<# number of fetched rows by row-wise binding>
         -t<# number of threads>
         -T (a flag to turn on time statistics)
         -a (a flag to enable async processing)
         -f (a flag to enable to flush output file for each write)
         -H (a flag to enable to open/close a request handle for each query)
         -z (a flag to enable UTF8 in the output result file
         -utf16 (a flag to enable UTF16 for communicating to Oracle BI ODBC driver)
         -q (a flag to turn off row output)
         -NoFetch (a flag to disable data fetch with query execution)
         -SmartDiff (a flag to enable SmartDiff tags in output)
         -NotForwardCursor (a flag to disable forwardonly cursor)
         -v (a flag to display the version)
         -P<the percent of statements to disable cache hit>
         -impersonate <the impersonate username>
         -runas <the runas username>
         -td <the time duration to run >
         -qsel <the query selection>
         -ds <the dump statistics duration in secs>
         -qstats <print Query statistics at end of run>
         -login <login scenario for PSR. login/execute sqls/logout for sql file>
         -ShowQueryLog <to display query log from server, -H is required for this setting>
         -i <ramup interval for each user for load testing, -i is required for this setting>
         -ONFormat<FormatString, i.e. TM9, 0D99>

You’re on your own figuring out the new options, as they’re not documented (and therefore presumably not supported and liable to change or be dropped at any time). What I’ve done below is my best guess at how to use them – don’t take this as gospel. The one source that I did find is a post on Oracle’s CEAL blog: OBIEE 11.1.1 – Advanced Usage of nqcmd command, from which I’ve taken some of the detail below.

Let’s have a look at how we can generate a load test. First off, I’ll create a very simple query:

and from the Advanced tab extract the Logical SQL from it:

SELECT
   0 s_0,
   "A - Sample Sales"."Products"."P2  Product Type" s_1,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2
FROM "A - Sample Sales"
ORDER BY 1, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY

This Logical SQL I’ve saved to a file, report01.lsql.

To run this Logical SQL from nqcmd I use the standard (documented) syntax, passing the Logical SQL filename with the -s flag:

[oracle@obieesample loadtest]$ nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql

-------------------------------------------------------------------------------
          Oracle BI ODBC Client
          Copyright (c) 1997-2013 Oracle Corporation, All rights reserved
-------------------------------------------------------------------------------

Connection open with info:
[0][State: 01000] [DataDirect][ODBC lib] Application's WCHAR type must be UTF16, because odbc driver's unicode type is UTF16
SELECT
   0 s_0,
   "A - Sample Sales"."Products"."P2  Product Type" s_1,
   "A - Sample Sales"."Base Facts"."1- Revenue" s_2
FROM "A - Sample Sales"
ORDER BY 1, 2 ASC NULLS LAST
FETCH FIRST 5000001 ROWS ONLY
[...]

0            Smart Phones   6773120.36
--------------------
Row count: 11
--------------------

Processed: 1 queries

Adding the -q flag will do the same, but suppress the data output:

oracle@obieesample loadtest]$ nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q

[...]
----------------------------------------------------------------------
Row count: 11
-------------------------------------------------------------------------------------------------------------   
Processed: 1 queries

The basic parameters for load testing are:

  • -t – how many threads [aka Virtual Users]
  • -td – test duration
  • -ds – how frequently to write out load test statistics
  • -T – enable time statistics [without this they will not be reported correctly]

You also need to supply -o with an output filename. Even if you’re not writing the data returned from the query to disk (which you shouldn’t, and -q disables), nqcmd needs this in order to be able to write its load test statistics properly (I got a lot of zeros and nan otherwise). In addition, the -T (Timer) flag should be enabled for accurate timings.

So to run a test for a minute with 5 threads, writing load test stats to disk every 5 seconds, you’d run:

nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q -T -td 60 -t 5 -ds 5 -o output

The load test stats are written to a file based on the name given in the -o parameter, with a _Counters.txt suffix:

$ cat output_Counters.txt
                        nQcmd Load Testing
TimeStamp       Sqls/Sec        Avg RT  CumulativePrepareTime   CumulativeExecuteTime   CumulativeFetchTime
00:00:05        56.200000       0.065925        2.536000                13.977000               2.012000
00:00:10        66.800000       0.065009        5.641000                33.479000               4.306000
00:00:15        69.066667       0.066055        8.833000                52.234000               7.366000
00:00:20        73.100000       0.063984        11.978000               71.944000               9.622000
[...]
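
Since the counters file is just whitespace-delimited text, it’s easy to pull out, say, throughput over time for plotting elsewhere (a sketch – NR>2 skips the two header lines):

$ awk 'NR>2 {print $1,$2}' output_Counters.txt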

Using obi-metrics-agent to pull out the OBIEE metrics and Graphite to render them, we can easily visualise what happened when we ran the test. The Oracle_BI_General.Total_sessions metric shows:


Ramping Up the Load

nqcmd also has a -i parameter, to specify the ramp up per thread. Most load tests should incorporate a “ramp up”, whereby the load is introduced gradually. This is important so that you don’t overwhelm a server all at once. It might be the server will not support the total number of users planned, so by using a ramp up period you can examine the server’s behaviour as the load increases gradually, spotting the point at which the wheels begin to come off.

The -i parameter for nqcmd is the delay between each thread launching, and this has an interesting effect on the duration of the test. If you specify a test duration (-td) of 5 seconds, five threads (-t), and a rampup (-i) of 10 seconds the total elapsed will be c.55 seconds (5×10 + 5).

I’ve used the standard time command on Linux to validate this by specifying it before the nqcmd call.

$ time nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q -td 5 -t 5 -ds 1 -o $(date +%Y-%m-%d-%H%M%S) -T -i 10 

[...]

real    0m56.896s
user    0m2.350s
sys     0m1.434s

So basically the -td is the “Steady State” once all threads are ramped up, and the literal test duration is equal to (ramp up * number of threads) + (desired steady state).

The above ramp-up can be clearly seen:


BTW a handy trick I’ve used here is to use a timestamp for the output name, so that the _Counters.txt file from one test doesn’t overwrite another’s, by specifying the date using an inline bash command:

nqcmd [...]   -o $(date +%Y-%m-%d-%H%M%S)   [...]

Whilst we’re at it for tips & tricks – if you want to stop nqcmd running but Ctrl-C isn’t instant enough for you, the following will stop it in its tracks:

pkill -9 nqcmd

Wait a Moment…

…or two. Wait time, or “think time”, is also important in producing a realistic load test. Unless you want to hammer your server just for the lulz to see how fast you can overload it, you’ll want to make sure the workload you’re simulating represents how it is actually used — and in reality users will be pausing (thinking) between report requests. The -w flag provides this option to nqcmd.
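
So to re-run the earlier one-minute, five-thread test with a five-second think time between each query, it would look like this (a sketch reusing the parameters from above):

nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s report01.lsql -q -T -td 60 -t 5 -ds 5 -w 5 -o output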

In this test below, whilst the Total Sessions is as before (no ramp up), the Connection Pool shows far fewer busy connections. On previous tests the busy connections were equal to the number of active threads, because the server was continuously running queries.


And the CPU, which in the previous test was exhausted at five users with no wait time, is now a bit more relaxed.


For comparison, this was the CPU in the first test we ran (5 threads, no wait time, no ramp up). Note that ‘idle’ drops to zero, i.e. the CPU is flat-out.


Load Test in Action

Let’s combine ramp up and wait times to run a load test and see what we can see in the underlying OBIEE metrics. I’m specifying:

  • Write the output to a file with the current timestamp (date, in the format YYYY-MM-DD HH:MM:SS)
    -o $(date +%Y-%m-%d-%H%M%S)
  • 20 threads
    -t 20
  • 10 second gap between starting each new thread
    -i  10
  • 5 second wait between each thread submitting a new query
    -w 5
  • Run for a total of 230 seconds (20 thread x 10 second ramp up = 200 seconds, plus 30 second steady state)
    -td 230

$ date;time nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s queries.lsql -q -T -o $(date +%Y-%m-%d-%H%M%S) -t 20 -ds 5 -td 230 -w 5 -i 10;date

Here’s what happened.

  • At first, as the users ramp up, the Connection Pool gets progressively busier.
  • However, when we hit c.14 threads, things start to go awry. The busy count stays at 10, even though the user count is increasing.
    (This was displayed in flot, which you can get to on the /graphlot URL of your Graphite server.)
  • So the user count is increasing, but we’re not seeing increasing activity on the Connection Pool… so what does that do for the response times?
    OK, so the Average Query Elapsed Time is a metric I’d normally be wary of, but this is a dedicated server running just my load test workload (and a single query within it), so in this case it’s a valid indicator – and it’s showing that the response time is going up. Why’s it going up?
  • Looking more closely at the Connection Pool we can see a problem – we’re hitting the capacity of ten connections, and requests are starting to queue up.
    Note how once the Current Busy Connection Count hits the Capacity of ten, the Current Queued Requests value starts to increase – because the number of users is increasing, trying to run more queries, but having to wait.

So this is a good example of where users would see slow performance, but some of the usual “Silver Bullets” around hardware and the database would completely miss the target, because the bottleneck here is actually in the configuration of the Connection Pool.


If you’re interested in hearing more about this subject, make sure you register for the BI Forum in Brighton, 7-9 May, where I’m delighted to be speaking for the second time, presenting “No Silver Bullets: OBIEE Performance in the Real World”.