Patch 11.1.1.6.4 has been released for OBIEE

Patch 11.1.1.6.4 is here and can be downloaded from MOS.

Once more it's a 7-pack patch and here are the patch numbers and direct links:

Patch 14538078: Patch 11.1.1.6.4 (1 of 7) Oracle Business Intelligence Installer
Patch 14538128: Patch 11.1.1.6.4 (2 of 7) Oracle Real Time Decisions
Patch 14285344: Patch 11.1.1.6.4 (3 of 7) Oracle Business Intelligence Publisher
Patch 14538164: Patch 11.1.1.6.4 (4 of 7) Oracle Business Intelligence ADF Components
Patch 14415773: Patch 11.1.1.6.4 (5 of 7) Enterprise Performance Management Components Installed from BI Installer 11.1.1.6.x
Patch 14405222: Patch 11.1.1.6.4 (6 of 7) Oracle Business Intelligence
Patch 14409674: Patch 11.1.1.6.4 (7 of 7) Oracle Business Intelligence Platform Client Installers and MapViewer

Documentation's here: https://updates.oracle.com/Orion/Services/download?type=readme&aru=15499675

Cheers!

Advanced monitoring of OBIEE with Nagios

Introduction

In the previous articles in this series, I described an overview of monitoring OBIEE, and then a hands-on tutorial for setting up Nagios to monitor OBIEE. Nagios is an Enterprise Systems Management tool that can monitor multiple systems and servers, send out alerts for pre-defined criteria, and so on.

In this article I’m going to demonstrate creating custom plugins for Nagios to extend its capability to monitor additional elements of the OBIEE stack. The intention is not to document an exhaustive list of plugins and comprehensive configurations, but to show how the plugins can be created and to get you started if you want to implement this yourself.

Most of these plugins run locally on the OBIEE server, and the assumption is that you are using the NRPE mechanism for communication with the Nagios server, as described in the previous article. For each plugin, I’ve included:

  • The plugin code, to be located in the Nagios plugins folder (default is /usr/lib64/nagios/plugins)
  • If required, an entry for the NRPE configuration file on the BI Server
  • An entry for the service definition, on the Nagios server

Whenever you change the configuration of NRPE or Nagios, don’t forget to restart the appropriate service:

sudo service nrpe restart

or

sudo service nagios restart

A very brief introduction to writing Nagios plugins

There’s plenty on Google, but a Nagios plugin boils down to:

  • Something executable from the command line as the nagios or nrpe user
  • One or more lines of output to stdout. You can include performance data relevant to the check after a pipe | symbol too, but this is optional.
  • The exit code reflects the check state – 0,1,2 for OK, Warning or Critical respectively
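
To make this concrete, here is about the smallest plugin imaginable – a hypothetical sketch for illustration, not one of the OBIEE plugins below – which simply checks that a given file exists:

#!/bin/bash
# check_file_exists.sh - minimal example Nagios plugin (illustrative only)
# Usage: check_file_exists.sh /path/to/file
if [ -e "$1" ]
then
        # One line to stdout; optional performance data can follow a | symbol
        echo "OK - $1 exists|exists=1"
        exit 0
else
        echo "CRITICAL - $1 does not exist|exists=0"
        exit 2
fi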

Application Deployments

Here is a plugin for Nagios that will report on the state of a given WebLogic application deployment. OBIEE will not work properly without several of the JEE applications that are hosted within WebLogic, so it is important to monitor them.

Because of how WLST is invoked and Nagios’ use of a script’s exit code to determine the service status, there are two scripts required. One is the WLST python code, the other is a wrapper to parse the output and set the exit code accordingly.

Note that this plugin invokes WLST each time, so running it for every Application Deployment concurrently at very regular intervals may not be a great idea, since each invocation will spin up its own Java instance on your BI Server. Using the Nagios service option parallelize_check=0 ought to prevent this, but it didn’t seem to when I tested it. Another possibility would be to run WLST remotely from the Nagios server, but this is not a ‘light touch’ option.


check_wls_app_deployment.sh:   (put this in the Nagios plugins folder on the BI Server)

# check_wls_app_deployment.sh
# Put this in your Nagios plugins folder on the BI Server
#
# Check the status of an Application Deployment
# Takes five arguments - connection details, plus application name, and server
#
# This is a wrapper for check_wls_app_deployment necessary to make sure a proper exit code
# is passed back to Nagios. Because the py is called as a parameter to wlst, it cannot set the exit
# code itself (since it is the wlst.sh which exits).
#
# RNM 2012-09-03
#
# Set this to your FMW home path:
FMW_HOME=/home/oracle/obiee
#
# No user serviceable parts below this line
# -----------------------------------------------------------------------------------------------
if [ $# -ne 5 ]; then
        echo
        echo "ERROR: wrong number of parameters"
        echo "USAGE: check_wls_app_deployment.sh WLS_USER WLS_PASSWORD WLS_URL app_name target_server"
        exit 255
fi

output=$($FMW_HOME/oracle_common/common/bin/wlst.sh /usr/lib64/nagios/plugins/check_wls_app_deployment.py $1 $2 $3 $4 $5 | tail -n1)

echo $output

test=$(echo $output|awk '{print $1}'|grep OK)
ok=$?

if [ $ok -eq 0 ]
then
        exit 0
else
        exit 2
fi

check_wls_app_deployment.py:    (put this in the Nagios plugins folder on the BI Server)

# check_wls_app_deployment.py
# Put this in your Nagios plugins folder on the BI Server
#
# Check the status of an Application Deployment
# Takes five arguments - connection details, plus application name, and server
# RNM 2012-09-03
#
# You shouldn't need to change anything in this script
#
import sys
import os
# Check the arguments to this script are as expected.
# argv[0] is script name.
argLen = len(sys.argv)
if argLen -1 < 5:
        print "ERROR: got ", argLen -1, " args."
        print "USAGE: wlst.sh check_app_state.py WLS_USER WLS_PASSWORD WLS_URL app_name target_server"
        sys.exit(255)
WLS_USER = sys.argv[1]
WLS_PW = sys.argv[2]
WLS_URL = sys.argv[3]
appname = sys.argv[4]
appserver = sys.argv[5]

# Connect to WLS
connect(WLS_USER, WLS_PW, WLS_URL);

# Set Application run time object
nav=getMBean('domainRuntime:/AppRuntimeStateRuntime/AppRuntimeStateRuntime')
state=nav.getCurrentState(appname,appserver)
if state == 'STATE_ACTIVE':
        print 'OK : %s - %s on %s' % (state,appname,appserver)
else:
        print 'CRITICAL : State is "%s" for %s on %s' %  (state,appname,appserver)
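
With both scripts in place, you can test the wrapper from the command line before wiring it into NRPE. The credentials and names here match the NRPE configuration below, and the OK line is the one printed by the python script:

$ ./check_wls_app_deployment.sh weblogic welcome1 t3://localhost:7001 analytics#11.1.1 bi_server1
OK : STATE_ACTIVE - analytics#11.1.1 on bi_server1
$ echo $?
0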

NRPE configuration:

command[check_wls_analytics]=/usr/lib64/nagios/plugins/check_wls_app_deployment.sh weblogic welcome1 t3://localhost:7001 analytics#11.1.1 bi_server1

Service configuration:

define service{
        use                             obi-service
        host_name                       bi1
        service_description             WLS Application Deployment : analytics
        check_command                   check_nrpe_long!check_wls_analytics
        }

By default, NRPE waits 10 seconds for a command to execute before returning a timeout error to Nagios. WLST can sometimes take a while to crank up, so I created a new command, check_nrpe_long which increases the timeout:

define command{
        command_name    check_nrpe_long
        command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -t 30
        }

nqcmd OBIEE plugin for Nagios

Using the OBIEE command line utility nqcmd it is simple to create a plugin for Nagios which will run a Logical SQL statement (as Presentation Services would pass to the BI Server). This plugin will validate that the Cluster Controller, BI Server and source database are all functioning. With a bit more coding, we can include a check for the response time on the query, raising an alert if it breaches defined thresholds.

This script can be used to just run a Logical SQL and return pass/fail, or if you include the additional command line parameters, check for the response time. To use the plugin, you need to create a file holding the logical SQL that you want to run. You can extract it from usage tracking, nqquery.log, or from the Advanced tab of a report in Answers. In the example given below, the logical SQL was copied into a file called q1.lsql located in /usr/lib64/nagios/plugins/.
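
For reference, the logical SQL file is just plain text with the statement in it. Here is a sketch of creating one – the subject area and column are from the SampleApp RPD and are purely illustrative, so substitute something valid in your own repository:

cat > /usr/lib64/nagios/plugins/q1.lsql <<'EOF'
SELECT "Time"."T05 Per Name Year" FROM "A - Sample Sales"
EOF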

check_obi_nqcmd.sh:    (put this in the Nagios plugins folder on the BI Server)

# check_obi_nqcmd.sh
# Put this in your Nagios plugins folder on the BI Server
# 
# Nagios plugin to check OBI status using nqcmd.
# Assumes that your DSN is AnalyticsWeb - modify the nqcmd call in this script if it is different
# 
# RNM September 2012
#
#
# Set this to your FMW home path:
FMW_HOME=/home/oracle/obiee
#
# No user serviceable parts below this line
# -----------------------------------------------------------------------------------------------
case $# in
	3)
		lsql_file=$1
		username=$2
		password=$3
		checktime=0
	;;
	5) 
		lsql_file=$1
		username=$2
		password=$3
		checktime=1
		warn_msec=$4
		crit_msec=$5
	;;
	*)
		echo " "
		echo "Usage: check_obi_nqcmd.sh <lsql-filename> <username> <password> [warn msec] [crit msec]"
		echo " "
		echo "eg: check_obi_nqcmd.sh /home/oracle/nagios/q1.lsql weblogic welcome1"
		echo "eg: check_obi_nqcmd.sh /home/oracle/nagios/q1.lsql weblogic welcome1 1000 5000"
		echo " "
		echo " "
		exit 255
esac
# Initialise BI environment
. $FMW_HOME/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh

outfile=$(mktemp)
errfile=$(mktemp)
grpfile=$(mktemp)

nqcmd -d AnalyticsWeb -u $username -p $password -s $lsql_file -q -T 1>$outfile 2>$errfile
grep Cumulative $outfile > /dev/null
nqrc=$?
if [ $nqrc -eq 0 ]
then
	responsetime=$(grep Cumulative $outfile |awk '{print $8 * 1000}')
	if [ $checktime -eq 1 ]
	then
		if [ $responsetime -lt $warn_msec ]
		then
			echo "OK - response time (msec) is  "$responsetime" |"$responsetime
			exitcode=0
		elif [ $responsetime -lt $crit_msec ]
		then
			echo "WARNING - response time is at or over warning threshold ("$warn_msec" msec). Response time is  "$responsetime" |"$responsetime 
			exitcode=1
		else
			echo "CRITICAL - response time is at or over critical threshold ("$crit_msec" msec). Response time is  "$responsetime" |"$responsetime 
			exitcode=2
		fi
	else
		echo "OK - response time (msec) is  "$responsetime" |" $responsetime
		exitcode=0
	fi
else
	grep -v "Connection open" $errfile > $grpfile
	grep failed $outfile >> $grpfile
	echo "CRITICAL - " $(tail -n1 $grpfile)
	exitcode=2
fi
rm $outfile $errfile $grpfile
exit $exitcode
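
As with the WLST plugin, it is worth testing this manually before adding it to NRPE. With thresholds supplied you should see something along these lines (the response time, repeated after the pipe as performance data, will obviously vary):

$ ./check_obi_nqcmd.sh /usr/lib64/nagios/plugins/q1.lsql weblogic welcome1 500 1000
OK - response time (msec) is  120 |120
$ echo $?
0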

NRPE configuration:

# Check nqcmd
command[check_nqcmd_q1]=/usr/lib64/nagios/plugins/check_obi_nqcmd.sh /usr/lib64/nagios/plugins/q1.lsql weblogic welcome1
command[check_nqcmd_q1_with_time_check]=/usr/lib64/nagios/plugins/check_obi_nqcmd.sh /usr/lib64/nagios/plugins/q1.lsql weblogic welcome1 500 1000

The first of the above commands runs the logical SQL file q1.lsql and will just do a pass/fail check. The second one checks how long it takes and raises a warning if it’s above half a second, or a critical alert if it’s over a second.

Nagios service configuration (use either or both, if you want the time checking):

define service{
        use                             obi-service
        host_name                       bi1
        service_description             NQCmd - Q1
        check_command                   check_nrpe!check_nqcmd_q1
        }

define service{
        use                             obi-service
        host_name                       bi1
        service_description             NQCmd - Q1 with time check
        check_command                   check_nrpe!check_nqcmd_q1_with_time_check
        }


The plugin also supports the performance data output format, returning the time it took to run the logical SQL.

Test a real user with JMeter

All of the checks and monitors described so far consider only a particular aspect of the stack. The nqcmd check above is fairly comprehensive in that it tests both the BI Server and the database. What it doesn’t test is the front end into OBIEE – the web server and Presentation Services. For full confidence that OBIEE is working as it should be, we need a full end-to-end test, and to do that we simulate an actual user logging into the system and running a report.


To do this, I am using JMeter plus some shell scripting. JMeter executes the same web requests that a user’s browser would send when using OBIEE. The shell script looks at the result and sets the exit status, and the time taken to perform the test is also recorded.

This check, like the nqcmd one above, could be set up as a pass/fail, or also to consider how long it takes to run and raise a warning if it is above a threshold.

An important thing to note here is that this plugin is going to run local to the Nagios server, rather than on the BI Server like the two plugins above. This is deliberate, so that the network connectivity to the OBIEE server external to the host is also checked.

To set this up, you need:

  • JMeter (download the Binary from here). Unarchive it into a folder, for example /u01/app/apache-jmeter-2.7/. Set the files in the bin folder to executable
    chmod -R ugo+rx /u01/app/apache-jmeter-2.7/bin

    Make sure also that the nagios user (under which this check will run) has read/execute access to the folders above where jmeter is kept

  • A JMeter jmx script with the user actions that you want to test. The one I’m using does two simple things:
    • Login
    • Run dashboard

    I’m using assertions to check that each step runs correctly.

  • The actual plugin script which Nagios will use. Put this in the plugins folder (eg /usr/lib64/nagios/plugins)

    check_obi_user.sh:    (put this in the Nagios plugins folder on the Nagios server)

    # check_obi_user.sh
    # Put this in your Nagios plugins folder on the Nagios server
    #
    # RNM September 2012
    #
    # This script will invoke JMeter using the JMX script passed as an argument                            
    # It parses the output and sets the script exit code to 0 for a successful test                        
    # and to 2 for a failed test. 
    # 
    # Tested with jmeter 2.7 r1342410
    #
    # Set JMETER_PATH to the folder holding your jmeter files
    JMETER_PATH=/u01/app/apache-jmeter-2.7
    #
    # No user serviceable parts below this line
    # -----------------------------------------------------------------------------------------------
    JMETER_SCRIPT=$1
    output_file=$(mktemp)
    
    /usr/bin/time -p $JMETER_PATH/bin/jmeter -n -t $JMETER_SCRIPT -l /dev/stdout 1>$output_file 2>&1
    status_of_run=$?
    realtime=$(tail -n3 $output_file|grep real|awk '{print $2}')
    if [ $status_of_run -eq 0 ]
    then
            result=$(grep "<failure>true" $output_file)
            status=$?
            if [ $status -eq 1 ]
            then
                    echo "OK user test run successfully |"$realtime
                    rc=0
            else
                    failstep=$(grep --text "<httpSample" $output_file|tail -n1|awk -F="\"" '{print $6}'|awk -F="\"" '{sub(/\" rc/,"");print $1}')
                    echo "CRITICAL user test failed in step: "$failstep
                    rc=2
            fi
    else
            echo "CRITICAL user test failed"
            rc=2
    fi
    #echo "Temp file exists : "$output_file
    rm $output_file
    exit $rc
  • Because we want to test OBIEE as if a user were using it, we run this test from the Nagios server. If we used NRPE to run it locally on the OBIEE server we wouldn’t be checking any of the network gremlins that can cause problems. On the Nagios server define a command to call the plugin, as well as the service definition as usual:
    define command{
            command_name    check_obi_user
            command_line    $USER1$/check_obi_user.sh $ARG1$
            }
    
    define service{
            use                             obi-service
            host_name                       bi1
            service_description             OBI user : Sample Sales - Product Details
            check_command                   check_obi_user!/u01/app/jmeter_scripts/bi1.jmx
            }
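
Before leaving Nagios to run this on a schedule, it is worth executing the plugin by hand as the nagios user – this flushes out any permission problems with the JMeter folders early on. The elapsed time is reported as performance data after the pipe:

$ sudo -u nagios /usr/lib64/nagios/plugins/check_obi_user.sh /u01/app/jmeter_scripts/bi1.jmx
OK user test run successfully |12.34
$ echo $?
0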
    

The proof is in the pudding

After the configuration we’ve done, we now have a full set of checks in place for the OBIEE deployment.

Now I’m going to test what happens when we start breaking things, and see if the monitoring does as it ought to. To test it, I’m using a script pull_the_trigger.sh which will randomly break things on an OBIEE system. It is useful for putting a support team through its paces, and for validating a monitoring setup.

Strike 1

First I run the script, then I check Nagios. Two critical errors are being reported: a network port error, and a process error, which sounds like a BI process has been killed. Drilling into the Service Group shows the detail, and a manual check on the command line and in EM confirms it:

$ ./opmnctl status

Processes in Instance: instance1
---------------------------------+--------------------+---------+---------
ias-component                    | process-type       |     pid | status
---------------------------------+--------------------+---------+---------
coreapplication_obiccs1          | OracleBIClusterCo~ |    5549 | Alive
coreapplication_obisch1          | OracleBIScheduler~ |    5546 | Alive
coreapplication_obijh1           | OracleBIJavaHostC~ |     N/A | Down
coreapplication_obips1           | OracleBIPresentat~ |    5543 | Alive
coreapplication_obis1            | OracleBIServerCom~ |    5548 | Alive

So, 1/1, 100% so far …

Strike 2

After restarting Javahost, I run the test script again. This time Nagios shows an alert for the user simulation. Drilling into it shows the step in which the failure occurs, and verifying this manually confirms there’s a problem.

The Nagios plugins poll at configurable intervals, so by now some of the other checks have also raised errors. We can see that a process has clearly failed, and since user logon and the nqcmd tests are failing it is probably the BI Server process itself that is down. I was almost right: it was the Cluster Controller which was down.

Strike 3

I’ve manufactured a high CPU load on the BI server, to see how (and if) it manifests itself in the alerts.

The load average check raises a warning.

All the Application Deployment checks are failing with a timeout, presumably because WLST takes too long to start up under the high CPU load.

And the nqcmd check is raising a warning because the BI Server is taking longer to return the test query result than it should.

Strike 4

The last test I do is to make sure that my alerts for any problem with what the end user actually sees are picked up. Monitoring processes and ports is fine, but it’s the “unknown unknowns” that will get you eventually. In this example, I’ve locked the database account that the report data comes from. Obviously, we could write an alert which checks each database account status and raises an error if it’s locked, but the point here is that we don’t need to think of all these possible errors in advance.

When the user ends up seeing an error instead of their report (which is bad, m’kay?), our monitoring picks up the problem: both the logical SQL check with nqcmd and the end-user simulation with JMeter report it.

Summary

There are quite a few things that can go wrong with OBIEE, and the monitoring that we’ve built up in Nagios is doing a good job of picking up when things do go wrong.

An introduction to monitoring OBIEE with Nagios

Introduction

This is the second post in a mini-series on monitoring OBIEE. The previous post, Automated Monitoring of OBIEE in the Enterprise – an overview, looked at the theory of why and what we should be monitoring. In this post I am going to walk through implementing a set of automated checks on OBIEE using the Systems Management tool Nagios.

Nagios

There are at least three different flavours of Nagios, and only one of them, Nagios Core, is free (open source). The others are Nagios XI and Nagios Fusion.

Brace yourself

One of the formal pre-requisites of open source software is either no documentation, or a vast swath of densely written documentation with no overview or map. OK, I’m kidding. But, be aware that with open source you have to be a bit more self-sufficient and prepared to roll up your sleeves than is normally the case with commercially produced software. I’m not trolling here, and there are exceptions on either side – but if you want to get Nagios working with OBIEE, be aware that it’s not simply click-click-done. :)

Nagios has a thriving community of plugins, addons, and companion applications such as alternative frontends. This is both a blessing and a curse. It’s great, because whatever you want to do with it, you probably can. It can be troublesome though because it means there’s no single point of reference to look up how something is done — it could be done in many different ways. Some plugins will be excellent, others may be a bit ropey – you may find yourself navigating this with just your google-fu to guide you.

Right tool for the right job

As with any bit of software, make sure you’re not trying to hit the proverbial nail with a pick axe. Plugins and so on are great for extending a product, but always keep an eye on the product’s core purpose and whether you’re straying too far from it to be sensible. Something which works now might not in future product upgrades. Also sense-check whether two complementary tools might be better suited than trying to do everything within one.

Getting started

I’m working with two servers, both Oracle Linux 6.3.

  • The first server has OBIEE 11.1.1.6.2 BP1 installed in a standard single-node cluster with two WebLogic servers (AdminServer/Managed Server).
  • The second server is going to be my Nagios monitoring server

In theory you could install Nagios on the OBIEE server, but that’s not a great idea for Production usage, as it would be subject to all of the bad things which could happen to the OBIEE server, and wouldn’t be able to alert on them if the monitoring runs on the same server.

Installing Nagios

There is documentation provided on how to install Nagios from source which looks comprehensive and easy to follow.

Alternatively, using the EPEL repository, install nagios and the default set of nagios plugins using the package manager yum:

 yum install nagios nagios-plugins-all 

If you use the yum method, you might want to follow this step, which will set Nagios to start up automatically at boot:

 chkconfig --level 35 nagios on 

Testing the installation

If the installation has worked, you should be able to go to the address http://[server]/nagios and login using the credentials you created or the default nagiosadmin/nagiosadmin.

If you don’t get this, check the following:

  • Is nagios running?
    $ ps -ef|grep [n]agios
    nagios 7959 1 0 14:16 ? 00:00:00 /usr/sbin/nagios -d /etc/nagios/nagios.cfg

    If it’s not, use

    service nagios start
  • Is Apache web server running?
    $ ps -ef|grep [h]ttpd 
    root 8016 1 0 14:19 ? 00:00:00 /usr/sbin/httpd
    apache 8018 8016 0 14:19 ? 00:00:00 /usr/sbin/httpd
    […] 

    If it’s not, use

    service httpd start
  • If the firewall’s enabled, is port 80 open?

Nagios configuration

Nagios is configured, by default, through a series of files held on the server. There are GUI front ends for these files, but in order to properly understand what’s going on under the covers I am working with the files themselves here.

The documentation refers to Nagios config being in /usr/local/nagios, but my install (via yum) put it in /etc/nagios/.

Object types

To successfully work with Nagios it is necessary to understand some of the terminology and object types used. For a complete list with proper definitions, see the documentation.

  • A host is a physical server
  • A host has services defined against it
  • Each service defines a command to use
  • A command specifies a plugin to execute

For a detailed explanation of Nagios’ plugin architecture, see here

Examining the existing configuration

From your Nagios installation home page, click on Hosts and you should see localhost listed. Click on Services and you’ll see eight pre-configured checks (‘services’) for localhost. Let’s dissect this existing configuration to start with. First off, the nagios.cfg file (probably in /etc/nagios or /usr/local/nagios) includes the line:

cfg_file=/etc/nagios/objects/localhost.cfg

The localhost.cfg file defines the host and services for localhost.

Open up localhost.cfg and you’ll see the line define host which is the definition for the machine, including an alias, its physical address, and the name by which it is referred to in later Nagios configuration.

Scrolling down, there is a set of define service statements. Taking the first one:

define service{
use local-service ; Name of service template to use 
host_name localhost 
service_description PING 
check_command check_ping!100.0,20%!500.0,60% 
}

We can see the following:

  1. It’s based on a local-service template
  2. The hostname to use in it is localhost, defined previously
  3. The (arbitrary) name of the service is PING
  4. The command to be run for this service (to determine the service’s state) is in the check_command. The syntax here is the command (check_ping) followed by arguments separated by the ! symbol (pling/bang/exclamation mark)

The command that a service runs (and the arguments that it accepts) is defined by default in the commands.cfg file. Open this up, and search for ‘check_ping’ (the command we saw in the PING service definition above). We’re now getting closer to the actual execution, but not quite there yet. The define command gives us the command name (eg. check_ping), and then the command line that is executed for it. In this case, the command line is also called check_ping, and is an executable that is installed with nagios-plugins (nagios-plugins-all if you’re using a yum installation).

In folder /usr/lib64/nagios/plugins you will find all of the plugins that were installed by default, including check_ping. You can execute any of them from the command line, which is a good way to both test them and understand how they work with arguments passed to them. Many will support a -h help flag, including check_ping:

 $ cd /usr/lib64/nagios/plugins/ 
$ ./check_ping -h
check_ping v1.4.15 (nagios-plugins 1.4.15)
Copyright (c) 1999 Ethan Galstad <nagios@nagios.org>
Copyright (c) 2000-2007 Nagios Plugin Development Team
	<nagiosplug-devel@lists.sourceforge.net>

Use ping to check connection statistics for a remote host.

Usage:
check_ping -H <host_address> -w <wrta>,<wpl>% -c <crta>,<cpl>%
 [-p packets] [-t timeout] [-4|-6]
 […] 

Note the -w and -c parameters – this is where Warning and Critical thresholds are passed to the plugin, for it to then return the necessary status code back to Nagios.

Working back through the config, we can see the plugin is going to be executed with

command_line $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5

(from the command definition) and the arguments passed to it are

 check_command check_ping!100.0,20%!500.0,60%

(from the service definition). Remember the arguments are separated by the ! symbol, so the first argument ($ARG1$) is 100.0,20% and the second argument ($ARG2$) is 500.0,60%. $HOSTADDRESS$ is resolved from the address of the host named in the service definition.

So, we can now execute the plugin ourselves to see how it works and to validate what we think Nagios should be picking up:

./check_ping -H localhost -w 100.0,20% -c 500,60% -p 5 
PING OK - Packet loss = 0%, RTA = 0.05 ms|rta=0.052000ms;100.000000;500.000000;0.000000 pl=0%;20;60;0

A picture may be worth a thousand words

To visualise how the configuration elements relate and in which files they are located by default, see the following diagram. NB this is not a fully comprehensive illustration, but a simplified one of the default configuration.

tl;dr?

If you’re skimming through this looking for nuggets, you’d be well advised to try to digest the above section, or at least the diagram. It will save you time in the long run, as all of Nagios is based around the same design principle.

Adding a new host

Let us start our OBIEE configuration of Nagios by adding in the OBIEE server. Currently Nagios has a single host defined, localhost, which is the Nagios server itself.

The first step is to specify where our new configuration will reside. We can either

  1. bolt it on to one of the existing default config files
  2. Create a new config file, and reference it in nagios.cfg with a new cfg_file entry
  3. Create a new config file directory, and add a line to nagios.cfg for cfg_dir

Option 1 is quick ‘n dirty. Option 2 is fine for small modifications. Option 3 makes the most sense, as any new configuration files we create after this one we just add to the directory and they will get picked up automagically. We’ll also see that keeping certain configuration elements in their own file makes it easier to deploy to additional machines later on.

First, create the configuration folder

mkdir -p /etc/nagios/config

Then add the following line to nagios.cfg

cfg_dir=/etc/nagios/config

Now, in the tradition of all good technology learning, we will copy the existing configuration and modify it for the new host.

Copy objects/localhost.cfg to config/bi1.cfg, and then modify it so it resembles this:

 define host{ 
use linux-server
host_name bi1 
alias DEV OBIEE server 1 
address 192.168.56.101 
}

define service{ 
use local-service 
host_name bi1 
service_description PING 
check_command check_ping!100.0,20%!500.0,60% 
}

Substitute your server’s IP address as required. host_name is just a label; it doesn’t have to match the server’s hostname (although it is sensible to do so).

So we have a very simple configuration – our host, and a single service, PING.

Before the configuration change is activated, we need to validate the configuration, by getting Nagios to parse it and check for errors:

nagios -v /etc/nagios/nagios.cfg

(Remember, nagios.cfg is the main configuration file which points to all the others).

Once the configuration has been validated, we restart nagios to pick up the new configuration:

service nagios restart

Returning to the Nagios web front end (http://[server]/nagios) you should now see the second host listed.

Running Nagios checks on a remote machine

Nagios checks are all based on a command line executable run locally on the Nagios server. This works fine for things like ping, but when it comes to checking the CPU load or for a given process, we need a way of finding this information out from the remote machine. There are several ways of doing this, including check_by_ssh, NRPE and NSCA. We’re going to use NRPE here. There is a good diagram here of how it fits in the Nagios architecture, and documentation for NRPE here.

NRPE works as follows:

  1. Nagios server calls a check_nrpe plugin locally
  2. check_nrpe communicates with NRPE daemon on the remote server
  3. NRPE daemon on the remote server executes the required nagios plugin locally, and passes the results back to the Nagios server

You can see from points 2 and 3 that there is installation required on the remote server, of both the NRPE daemon and the Nagios plugins that you want to be available for the remote server.

Setting up NRPE

On the remote server, install the Nagios plugins and the NRPE daemon:

$ sudo yum install nagios-plugins-all nagios-plugins-nrpe nrpe

If you’re running a firewall, make sure you open the port for NRPE (by default, 5666).

Amend the NRPE configuration (/etc/nagios/nrpe.cfg) to add the IP of your Nagios server (in this example, 192.168.56.102) to the allowed_hosts line

allowed_hosts=127.0.0.1,192.168.56.102

(You might need to use sudo to edit the file)

Now set nrpe to start at boot, and restart the nrpe service to pick up the configuration changes made

$ sudo chkconfig --level 35 nrpe on
$ sudo service nrpe restart

Normally Nagios will be running check_nrpe from the Nagios server, but before we do that, we can use the plugin locally on the remote server to check that NRPE is functioning, before we get the network involved:

$ cd /usr/lib64/nagios/plugins 
$ ./check_nrpe -H localhost 
NRPE v2.12

If that works, then move on to testing the connection between the Nagios server and the remote server. On the Nagios server, install the check_nrpe plugin:

$ sudo yum install nagios-plugins-nrpe

And then run it manually:

$ cd /usr/lib64/nagios/plugins 
$ ./check_nrpe -H 192.168.56.101 
NRPE v2.12

(in this example, my remote server’s IP is 192.168.56.101)

NRPE, commands and plugins

In a local Nagios service check, the service specifies a command which in turn calls a plugin. When we do a remote service check using NRPE the same chain exists, except the service always calls the NRPE command and plugin. The difference is that it passes to the NRPE plugin the name of a command executed on the NRPE remote server.

So there are actually two commands to be aware of :

  • The command defined on the Nagios server, which is specified from the service
    These commands are defined as objects using the define command syntax
  • The command on the remote server in the NRPE configuration, which specifies the actual plugin executable that is executed
    The command is defined in the nrpe.cfg file, with the syntax
    command[<command name>]=<command line execution statement>

An example NRPE service configuration

One of the default service checks that comes with Nagios is Check Load. It uses the check_load plugin. We’ll see how the same plugin can be used on the remote server through NRPE.

  1. Determine the commandline call for the plugin on the remote server. In the plugins folder execute the plugin manually to determine its syntax
    $ cd /usr/lib64/nagios/plugins/
    $ ./check_load -h 
    […]
    Usage: check_load [-r] -w WLOAD1,WLOAD5,WLOAD15 -c CLOAD1,CLOAD5,CLOAD15
    

    So for example:

    ./check_load -w 15,10,5 -c 30,25,20 
    OK - load average: 0.02, 0.04, 0.05|load1=0.020;15.000;30.000;0; load5=0.040;10.000;25.000;0; load15=0.050;5.000;20.000;0;
    
  2. Specify the NRPE command in nrpe.cfg file with the command line determined in the previous step:
    command[check_load]=/usr/lib64/nagios/plugins/check_load -w 15,10,5 -c 30,25,20

    You’ll see this in the default nrpe.cfg file. Note that “check_load” is entirely arbitrary, and “command” is a literal.

  3. On the Nagios server, configure the generic check_nrpe command. This should be added to an existing .cfg file, or a new one in the cfg_dir folder that we configured earlier
    define command{
    command_name check_nrpe 
    command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ 
    }

    Note here the -c argument, which passes $ARG1$ as the command to execute on the NRPE daemon.

  4. Define a service which will call the plugin on the NRPE server. I’ve added this into the configuration file for the new host created above (config/bi1.cfg)
    define service{ 
    use local-service
    host_name bi1 
    service_description Check Load 
    check_command check_nrpe!check_load 
    }

    Note that check_nrpe is the name of the command that we defined in step 3. check_load is the arbitrary command name that we’ve configured on the remote server in nrpe.cfg
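
You can exercise the whole chain by hand from the Nagios server before touching the service definition, passing the remote command name with the -c flag. The output should match what you saw when running check_load locally on the remote server:

$ cd /usr/lib64/nagios/plugins
$ ./check_nrpe -H 192.168.56.101 -c check_load
OK - load average: 0.02, 0.04, 0.05|load1=0.020;15.000;30.000;0; load5=0.040;10.000;25.000;0; load15=0.050;5.000;20.000;0;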

As before, validate the configuration:

nagios -v /etc/nagios/nagios.cfg

and then restart the Nagios service:

sudo service nagios restart

Log in to your Nagios console and you should see the NRPE-based service working.

Nagios and OBIEE

Did someone say something about OBIEE? As I warned at the beginning of this article, Nagios is fairly complex to configure and has a steep learning curve. What I’ve written so far is hopefully sufficient to guide you through the essentials and give you a head start in using it.

The rest of this article looks at the kinds of alerts we can build into Nagios for OBIEE.

Process checks

To check for the processes in the OBIEE stack we can use the check_procs plugin. This is a flexible plugin with a variety of invocation approaches, but we are going to use it to raise a critical alert if there is no process running which matches an argument or command that we specify.

As with all of these checks, it is best to develop it from the ground up, so start with the plugin on the command line and work out the correct syntax. Once the syntax is determined it is simple to incorporate it into the Nagios configuration.

The syntax for the plugin is obtained by running it with the -h flag:

 ./check_procs -h |more 
check_procs v1.4.15 (nagios-plugins 1.4.15)
Copyright (c) 1999 Ethan Galstad <nagios@nagios.org>
Copyright (c) 2000-2008 Nagios Plugin Development Team
	<nagiosplug-devel@lists.sourceforge.net>

Checks all processes and generates WARNING or CRITICAL states if the specified
metric is outside the required threshold ranges. The metric defaults to number
of processes.  Search filters can be applied to limit the processes to check.


Usage:
check_procs -w <range> -c <range> [-m metric] [-s state] [-p ppid]
 [-u user] [-r rss] [-z vsz] [-P %cpu] [-a argument-array]
 [-C command] [-t timeout] [-v][…]

So to check for Presentation Services, which runs as sawserver, we would use the -C parameter to specify the process command to match. In addition, we need to specify the warning and critical thresholds. For the OBI processes these thresholds are pretty simple – if there are zero processes then sound the alarm, and if there’s one process then all is OK.

./check_procs -C sawserver -w 1: -c 1: 
 PROCS OK: 1 process with command name 'sawserver'

And if we bring down Presentation Services and run the same command:

./check_procs -C sawserver -w 1: -c 1: 
 PROCS CRITICAL: 0 processes with command name 'sawserver'

To add this into Nagios, do the following:

  1. On the remote server, add the command into NRPE.
    I’ve created a new file called custom.cfg in /etc/nrpe.d (the contents of which are read by NRPE for configuration as well as nrpe.cfg itself)
    The command I’ve defined is called check_obips:
    command[check_obips]=/usr/lib64/nagios/plugins/check_procs -w 1: -c 1: -C sawserver
  2. Because we’ve added a new command into NRPE, the NRPE service needs restarting:
    service nrpe restart
  3. On the Nagios server define a new service for the BI server which will use the check_obips command, via NRPE:
    define service{ 
    use local-service 
    host_name bi1 
    service_description Process: Presentation Services 
    check_command check_nrpe!check_obips 
    }
  4. As before, validate the nagios configuration and if it passes, restart the service
    nagios -v /etc/nagios/nagios.cfg 
    service nagios restart

Looking in the Nagios frontend, the new Presentation Services alert should be present. Initially the alert status is Critical because there are no Presentation Services (sawserver) processes running; once I restart Presentation Services the alert returns to OK.

Network ports

To double-check that OBIEE is working, monitoring the state of the network ports is a good idea.

If you are using a firewall then you will need to run this check on the OBI server itself, through NRPE. If you’re not firewalled, then you could run it from the Nagios server. If you are firewalled but only want to check for the public-facing ports of OBIEE (for example, 9704) then you could run it locally on Nagios too.

Whichever way you run the alert, it is easily done using the check_tcp plugin:

./check_tcp -p 9704 
TCP OK - 0.001 second response time on port 9704|time=0.001384s;;;0.000000;10.000000

The only parameter that we need to specify is the port, -p. As with the check_procs plugin, there are different ways to use it: check_tcp can raise warnings/alerts if there’s a specified delay connecting to the port, and it can also match a send/expect string. For our purpose, it will return OK if the port we specify can be connected to, and fail if not.
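
If you did want response time thresholds as well, they are just the -w and -c flags, in seconds. For example, to warn if the connection takes over one second and go critical over five (thresholds here are purely illustrative):

./check_tcp -H localhost -p 9704 -w 1 -c 5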

The NRPE configuration:

command[check_obis_port]=/usr/lib64/nagios/plugins/check_tcp -H localhost -p 9703

The Nagios service configuration:

define service{
use local-service
host_name bi1
service_description Port: BI Server
check_command check_nrpe!check_obis_port
}

Log files

check_logwarn is not provided by the default set of Nagios plugins, and must be downloaded and installed separately. Once installed, it can be used thus:

NRPE command:

 command[check_log_nqserver]=/usr/lib64/nagios/plugins/check_logwarn -p -d /tmp /u01/app/oracle/product/fmw/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log ERROR 

Service definition:

 define service{ 
use local-service 
host_name bi1 
service_description Logs: BI Server nqserver.log 
max_check_attempts 1 
check_command check_nrpe!check_log_nqserver 
} 

Be aware that this method is only really useful for alerting you that there is something to look at in the logs — it doesn’t give you the log to browse through. For that you would need to go to the log file on disk, or the log viewer in EM. Tips:

  • Set max_check_attempts in the service definition to 1, so that an alert is raised straight away.
    Unlike monitoring something like a network port where a glitch might mean a service should check it more than once before alerting, if an error is found in a log file it is still going to be there if you check again.
  • For this service, the action_url option for a service could be used to include a link through to the EM log viewer
  • Make sure that the NRPE user has permissions on the OBI log files.

Database

The check_oracle plugin can check that a database is running locally, or remotely using a TNS entry. Since the OBIEE server that I’m using here is a sandpit environment, the database is also running on it, so the check can be run locally via NRPE.

NRPE configuration:

command[check_db]=/usr/lib64/nagios/plugins/check_oracle --db ORCL

Service definition:

define service{ 
use local-service 
host_name bi1 
service_description Database 
check_command check_nrpe!check_db 
}

Final Nagios configuration

Service Groups

Having covered the basic setup for monitoring an OBIEE server, we will now look at a couple of Nagios configuration options to improve the monitoring setup that’s been built. The first is Service Groups. These are a way of grouping services together (how did you guess). For example, all the checks for OBIEE network ports. In the Nagios frontend Service Groups can be examined individually and drilled into. The syntax is self-explanatory, except the members clause, which is a comma-separated list of host,service pairings:

 define servicegroup{ 
servicegroup_name obiports 
alias OBIEE network ports 
members bi1,Port: OPMN remote,bi1,Port: BI Server,bi1,Port: Javahost,bi1,Port: OPMN local port,bi1,Port: BI Server - monitor,bi1,Port: Cluster Controller,bi1,Port: Cluster Controller - monitor,bi1,Port: BI Scheduler - monitor,bi1,Port: BI Scheduler - Script RPC,bi1,Port: Presentation Services,bi1,Port: BI Scheduler,bi1,Port: Weblogic Managed Server - bi_server1,bi1,Port: Weblogic Admin Server 
}

NB: the object definition for the servicegroups is best placed in its own configuration file, or at least not in the same file as the host/service configurations. If it’s in the same file as the host/service config then it’s less easy to duplicate that file for new hosts.

A note about templates

All of the objects that we have configured have included a use clause. This is a template object definition that specifies generic settings so that you don’t have to configure them each time you create an object of that type. It also means if you want to change a setting, you can do so in one place instead of dozens.

For example, services have a check_interval setting, which is how often Nagios will check the service, and a retry_interval, which is how soon Nagios re-checks the service after it first detects a problem. A related setting, max_check_attempts, is how many times Nagios will check the service again after the initial error before raising an alert.

All the templates by default are defined in objects/templates.cfg, but note that templates are not an object type in themselves; they are just an object (eg a service) which can be inherited. Templates can inherit other templates too. Examine the generic-service and local-service default templates to see more.

To see the final object definitions with all their inherited values, go to the Nagios web front end and choose the System > Configuration option from the left menu.

Email alerts

A silent alerting system is not much use if we want a hands-off approach to monitoring OBIEE. Getting Nagios to send out emails is pleasantly easy. In essence, you just need to configure a contact object. However, I’m going to show how to set it up a bit more neatly, and illustrate the use of templates in the process.

  1. First step is to test that your Nagios server can send outbound email. In an enterprise this shouldn’t be too difficult, but if you’re trying this at home then some ISPs do block it.
    To test it, run:
    echo 'Email works from the Nagios server' | mailx -s 'Test message from Nagios' foo@bar.com

    Substitute your email address, and if you receive the email then you know the host can send emails. Note you’re not testing the Nagios email functionality here, just the ability of the Nagios host server to send email.
    If the email doesn’t come through then check /var/log/maillog for errors

  2. In your Nagios configuration, create a contact and contactgroup object. For ease of manageability, I’ve created mine as config/contacts.cfg but anywhere that Nagios will pick up your object definition is fine.
    define contact { 
    use generic-contact 
    contact_name rnm 
    alias Robin Moffatt 
    email foo@bar.com 
    }
    
    define contactgroup { 
    contactgroup_name obiadmins 
    alias OBI Administrators 
    members rnm 
    }

    A contact group is pretty self-explanatory – it is made up of one or more contacts.

  3. To associate a contact group with a service, so that it receives notifications when the service goes into error, use the contact_groups clause in the service definition.
    Instead of adding this into each service that we’ve defined (currently about 30), I am going to add it into the service template. At the moment the services use the local-service template, one of the defaults with Nagios. I’ve created a new template, called obi-service, which inherits the existing local-service definition but also includes the contact_groups clause:
    define service{ 
    name obi-service 
    use local-service 
    contact_groups obiadmins 
    }

    Now a simple search & replace in my configuration file for the OBIEE server (I called it config/bi1.cfg) changes all use local-service to use obi-service:

    […]
    define service{ 
    use obi-service 
    host_name bi1 
    service_description Process: BI Server 
    check_command check_nrpe!check_obis 
    } 
    […]
  4. Validate the configuration and then restart Nagios

All going well, you should now receive alerts when services go into error.

You can see what alerts have been sent by looking in the Nagios web front end under Reports > Notifications on the left-hand menu.

Deployment on other OBIEE servers

To deploy the same setup as above, for a new OBIEE server, do the following:

  1. Install nagios plugins and nrpe daemon on the new server
    sudo yum install nagios-plugins-all nagios-plugins-nrpe nrpe
  2. Add Nagios server IP to allowed_hosts in /etc/nagios/nrpe.cfg
  3. Start NRPE service
    service nrpe start
  4. Test nrpe locally on the new OBIEE server:
    $ /usr/lib64/nagios/plugins/check_nrpe -H localhost 
    NRPE v2.12
  5. Test nrpe from Nagios server:
    $ /usr/lib64/nagios/plugins/check_nrpe -H bi2 
    NRPE v2.12
  6. From the first OBIEE server, copy /etc/nrpe.d/custom.cfg to the same path on the new OBIEE server.
    Restart NRPE again
  7. On the Nagios server, define a new host and set of services associated with it. The quick way to do this is copy the existing bi1.cfg file (which has the host and service definitions for the original OBIEE server) to bi2.cfg and do a search and replace. Amend the host definition for the new server IP.
  8. Update the service group definition to include the list of bi2 services too.
  9. Validate the configuration and restart Nagios

The new host should now appear in the Nagios front end.


Summary

Nagios is a powerful but complex beast to configure. Once you get into the swing of it, it does make sense though.

At a high-level, the way that you monitor OBIEE with Nagios is:

  • Define OBIEE server as a host on Nagios
  • Install and configure NRPE on the OBIEE server
  • Configure the checks (process, network port, etc) on NRPE on the OBIEE server
  • Create a corresponding set of service definitions on the Nagios server to call the NRPE commands

The final part of this series looks at how plugins can be created to do more advanced monitoring with Nagios, including simulating user requests and alerting if they fail : Advanced monitoring of OBIEE with Nagios

Documentation

Nagios Core documentation

Automated Monitoring of OBIEE in the Enterprise – an overview

A lot of time is given to the planning, development and testing of OBIEE solutions. Well, hopefully it is. Yet sometimes, the resulting deployment is marked Job Done and chucked over the wall to the Operations team, with little thought given to how it is looked after once it is running in Production.

Of course, at the launch and deployment into Production, everyone is paying very close attention to the system. The slightest cough or sneeze will make everyone jump and come running to check that the shiny new project hasn’t embarrassed itself. But what about weeks 2, 3, 4…six months later…performance has slowed, the users are unhappy, and once in a while someone thinks to load up a log file to check for errors.

This post is the first of a mini-series on monitoring, and will examine some of the areas to consider in deploying OBIEE as a Production service. Two further posts will look at some of this theory in practice.

Monitoring software

The key to happy users is to know there’s a problem before they do, and even better, fix it before they realise. How do you do this? You either sit and watch your system 24 hours a day, or you set up some automated monitoring. There are lots of companies willing to take lots of money off your hands for very complex and fancy pieces of software that will do this, and there are lots of open-source solutions (some of them also very complex, and some of them very fancy) that will do the same. They all fall under the umbrella title of Systems Management.

Which you choose may be dictated to you by corporate policy or your own personal choice, but ultimately all the good ones will do pretty much the same and require configuring in roughly the same kind of way. Some examples of the software include:

  • HP OpenView
  • Nagios
  • Zabbix
  • Tivoli
  • Zenoss
  • Oracle Enterprise Manager
  • list on Wikipedia

Some of these tools can take the output of a bash script and use it as the basis for logging an alert or not. This means that pretty much anything you can think of to check can be checked, so long as you can script it.

I’m not aware of any which come with out-of-the-box templates for monitoring OBIEE 11g – if there are, please let me know. If a company says they have, make sure they’re not mistaking “is capable of” for “is actually implemented”.

What to monitor

It’s important to try and look at the entire map of your OBIEE deployment and understand where things could go wrong. Start thinking of OBIEE as the end-to-end stack, or service, and not simply the components that you installed. Once you’ve done that, you can start to plan how to monitor those things, or at least be aware of the fault potential. For example, it’s obvious to check that the OBIEE server is running, but what about the AD server which you use for authentication? Or the SAN your webcat resides on? Or the corporate load balancer you’re using?

Here are some of the common elements to all OBIEE deployments that you should be considering:

OBIEE services

An easy one to start with, and almost so easy it could be overlooked. You need to have in place something which is going to check that the OBIEE processes are currently running. Don’t forget to include the WebLogic Server process(es) in this too.

The simplest way to build this into a monitoring tool is to have it check with the OS (be it Linux/Windows/whatever) that a process with the relevant name is running, and raise an alert if it’s not. For example, to check that the Presentation Services process is running, you could do this on Linux:

ps -ef|grep [s]awserver
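
Most tools only care about a script’s output and exit code, so to turn that check into an alert you just wrap it so that a missing process produces a non-zero exit code. A minimal sketch:

# Sketch: exit 0 if Presentation Services is running, 2 if not
if ps -ef | grep -q "[s]awserver"
then
        echo "OK - sawserver is running"
        exit 0
else
        echo "CRITICAL - sawserver is not running"
        exit 2
fi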

You could use opmnctl to query the state of the processes, but be aware that OPMN is going to report how it sees the processes. If there’s something funny with opmn, then it may not pick up a service failure. Of course, if there’s something funny with opmn then you may have big trouble anyway.

A final point on process monitoring; note that OPMN manages the OBIEE system components and by default will restart them if they crash. This is different behaviour from OBIEE 10g, where when a process died it stayed dead. In 11g, processes come back to life, and it can be most confusing if an alert fires saying that a process is down but when you check it appears to be running.

Network ports

This is a belts and braces counterpart to checking that processes are running. It makes sense to also check that the network ports that OBIEE uses to communicate both externally with users and internally with other OBIEE processes are listening for traffic. Why do this? Two reasons spring to mind. The first is that you misconfigure your process-check alert, or it fails, or it gets accidentally disabled. The second, less likely, is that an OBIEE process is running (so doesn’t trigger the process-not-running alert) but has hung in some way and isn’t accepting TCP traffic.

The ports that your particular OBIEE deployment uses will vary, particularly if you’ve got multiple deployments on one host. To see which ports are used by the BI System Components, look at the file $FMW_HOME/instances/instance1/config/OPMN/opmn/ports.prop. The ports used by WebLogic will be in $FMW_HOME/user_projects/domains/bifoundation_domain/config/config.xml

A simple check that Presentation Services was listening on its default port would be:

netstat -ln | grep tcp | grep 9710 | wc -l

If a zero is returned that means there are no ports listening, i.e. there’s a problem.
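
The same wrapping applies here. A compact sketch that a monitoring tool could call directly:

# Sketch: exit 2 (critical) if nothing is listening on the Presentation Services port
netstat -ln | grep tcp | grep -q 9710 || { echo "CRITICAL - port 9710 not listening"; exit 2; }
echo "OK - port 9710 is listening"
exit 0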

Application Deployments

WebLogic Server hosts various JEE Application Deployments, some of which are crucial to the well-being of OBIEE. An example of one of these is analytics (which handles the traffic between the web browser and Presentation Services). Just because WebLogic is running, you cannot assume that the application deployment is. You can check automatically using WLST:

connect('weblogic','welcome1','t3://localhost:7001')
nav=getMBean('domainRuntime:/AppRuntimeStateRuntime/AppRuntimeStateRuntime')
state=nav.getCurrentState('analytics#11.1.1','bi_server1')
print "\033[1;32m " + state + "\033[1;m"

You would invoke the above script (assuming you'd saved it as /tmp/check_app.py) using:

$FMW_HOME/oracle_common/common/bin/wlst.sh /tmp/check_app.py

Checking application deployment health using WLST
Because WLST is verbose when you invoke it, you might want to pipe the command through tail so that you just get the output

$FMW_HOME/oracle_common/common/bin/wlst.sh /tmp/check_app.py | tail -n 1
 STATE_ACTIVE

If you want to explore more detail around this functionality a good starting point is the MBeans involved, which you can find in Enterprise Manager under Runtime MBeans > com.bea > Domain: bifoundation_domain

Log files

The log files from OBIEE are crucial for spotting problems which have happened, and indicators of problems which may be about to happen. You'll find the OBIEE logs in:

  • $FMW_HOME/instances/instance1/diagnostics

and the Web Logic Server related logs primarily in

  • $FMW_HOME/user_projects/domains/bifoundation_domain/servers/AdminServer/logs
  • $FMW_HOME/user_projects/domains/bifoundation_domain/servers/bi_server1/logs

There are others dotted around but these are the main places to start. For a more complete list, look in Enterprise Manager under coreapplication > Diagnostics > Log Viewer > Selected Targets.

Once you've located your logs, there's no prescribed list of what to monitor for - it's down to your deployment and the kind of errors you [expect to] see. Life is made easier because FMW already categorises log messages by severity, so you could start with simply watching WLS logs for <Error> and OBIEE logs for [ERROR (yes, no closing bracket).
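
A quick way to prototype this before involving any monitoring software is simply to count matches from the command line. The paths here assume the default log locations listed above:

# WLS logs: count lines flagged as Error
grep -c "<Error>" $FMW_HOME/user_projects/domains/bifoundation_domain/servers/bi_server1/logs/bi_server1.log
# OBIEE logs: count BI Server errors (note the unclosed bracket)
grep -c "\[ERROR" $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIServerComponent/coreapplication_obis1/nqserver.log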

If you find there are errors regularly causing alerts which you don't want then set up exceptions in your monitoring software to ignore them or downgrade their alert severity. Of course, if there are regular errors occurring then the correct long-term action is to resolve the root cause so that they don't happen in the first place!

I would also watch the server logs for any indication of the processes shutting down, and any database errors thrown. You can monitor the Presentation Services log (sawlog0.log) for errors which are being passed back to the user - it's always good to get a head start on a user raising a support call if you're already investigating the error that they're about to phone up and report.

Monitoring log files should be the bread and butter of any decent systems management software, and so each will probably have its own way of doing so. You'll need to ensure that it copes with rotating logs - if you have configured them - otherwise it will read a log to position 94, the log rotates and the latest entry is now 42, but the monitoring tool will still be looking at 94.
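
If you did want to roll your own check, here is a hedged sketch for the Presentation Services log: it tracks a byte offset in a state file and resets it when the file shrinks (i.e. the log has rotated). The log path and the choice of a warning rather than critical state are assumptions - adjust for your deployment:

#!/bin/bash
# Sketch: count new [ERROR entries in sawlog0.log since the last run.
# The stored byte offset is reset to zero if the file has shrunk (log rotated).
LOG=$FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/coreapplication_obips1/sawlog0.log
STATE=/tmp/sawlog0.offset
SIZE=$(stat -c %s "$LOG")
LAST=$(cat "$STATE" 2>/dev/null || echo 0)
[ "$SIZE" -lt "$LAST" ] && LAST=0
ERRORS=$(tail -c +$((LAST + 1)) "$LOG" | grep -c '\[ERROR')
echo "$SIZE" > "$STATE"
if [ "$ERRORS" -gt 0 ]; then
    echo "WARNING - $ERRORS new error(s) in sawlog0.log"
    exit 1
fi
echo "OK - no new errors in sawlog0.log"
exit 0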

Server OS stats

In an Enterprise environment you may find that your Ops team will monitor all server OS stats generically, since CPU is CPU, whether it's on an OBI server or SMTP server. If they don't, then you need to make sure that you do. You may find that whatever Systems Management tool you pick supports OS stats monitoring.

As well as CPU, make sure you're monitoring memory usage, disk IO, file system usage, and network IO.
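
If Nagios is your tool, the standard plugin set already covers all of these. As a sketch, NRPE command definitions on the BI server might look like this - the thresholds are purely illustrative assumptions, not recommendations:

command[check_load]=/usr/lib64/nagios/plugins/check_load -w 5,4,3 -c 10,8,6
command[check_disk]=/usr/lib64/nagios/plugins/check_disk -w 20% -c 10% -p /
command[check_swap]=/usr/lib64/nagios/plugins/check_swap -w 50% -c 25%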

Even if another team does this for your server already, it is a good idea to find out what alert thresholds have been set, and to get access to the metrics themselves. Different teams have different aims in collecting metrics, and it may be that the Ops team will only look at a server which hits 90% CPU. If you know your OBIEE server typically runs at 30% CPU then you should be getting involved and investigating as soon as CPU hits, say, 40%. Certainly, by the time it hits 90% there may already be serious problems.

OBI Performance Metrics

Just as you should monitor the host OS for important metrics, you can monitor OBIEE too. Using the Dynamic Monitoring Service (DMS), you can examine metrics such as:

  • Logged in user count
  • Active users
  • Active connections to each database
  • Running queries
  • Cache hits

This is just a handful - there are literally hundreds of metrics available.

You can see the metrics in Enterprise Manager (Fusion Middleware Control), but there is no history retained and no integrated alerting, making it of little use as a hands-off monitoring tool.
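
If you want to pull the DMS metrics out programmatically yourself, one route is the DMS WLST commands. A sketch follows - the credentials and port match the earlier WLST example, and the Oracle_BI_General metric table name is an assumption (use displayMetricTableNames() to see what's available on your system):

# Sketch: dump a DMS metric table via WLST.
cat > /tmp/dms_dump.py <<'EOF'
connect('weblogic','welcome1','t3://localhost:7001')
displayMetricTables('Oracle_BI_General')
disconnect()
EOF
$FMW_HOME/oracle_common/common/bin/wlst.sh /tmp/dms_dump.py 2>/dev/null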

At Rittman Mead we have developed a solution which records the OBIEE performance data and makes it available for realtime monitoring and alerting for OBIEE:
Realtime monitoring of OBIEE metrics

The kind of alerting you might want on these metrics could include:

  • High number of failed logins
  • High number of query errors
  • Excessive number of database connections
  • Low cache hit ratio

Usage Tracking

I take this as such a given that I almost omitted it from this list. If you haven't got Usage Tracking in place, then you really should. It's easy to configure, and once it's in place you can forget about it if you want to. The important thing is that you're building up an accurate picture of your system usage, which is impossible to do easily any other way. Some good reasons for having Usage Tracking in place:

  • How many people logged into OBIEE this morning?
  • What was the busiest time period of the day?
  • Which reports are used the most?
  • Which users are running reports which take longer than x seconds to complete? (Can we help optimise the query?)
  • Which users are running reports which bring back more than x rows of data? (Can we help them get the data more efficiently?)

In addition to these analytical reasons, going back to the monitoring aspect of this post, Usage Tracking can be used as a data source to trigger alerts for long running reports, large row fetches, and so on. An example query which would list reports from the last day that took longer than five minutes to run, returned more than 60000 rows, or used more than four queries against the database, would be:

SELECT user_name, 
       TO_CHAR(start_ts, 'YYYY-MM-DD HH24:MI:SS'), 
       row_count, 
       total_time_sec, 
       num_db_query, 
       saw_dashboard, 
       saw_dashboard_pg, 
       saw_src_path 
FROM   dev_biplatform.s_nq_acct 
WHERE  start_ts > SYSDATE - 1 
       AND ( row_count > 60000 
              OR total_time_sec &gt; 300 
              OR num_db_query &gt; 4 ) 

This kind of monitoring would normally be used to trigger an informational alert, rather than sirens-blazing code red type alert. It's important to be aware of potentially bad queries on the system, but it can wait until after a cup of tea.

Some tools will support database queries natively; with others you may have to fashion a SQL*Plus call yourself.
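
For the latter, here is a hedged sketch of a Nagios-style plugin wrapping the query above - the connection details are placeholders, and the thresholds match the example query:

#!/bin/bash
# Sketch: raise a Nagios warning if the Usage Tracking query above returns rows.
# Credentials and TNS alias are placeholders - substitute your own.
COUNT=$(sqlplus -s dev_biplatform/password@ORCL <<'EOF' | tr -d '[:space:]'
SET HEADING OFF FEEDBACK OFF PAGESIZE 0
SELECT COUNT(*) FROM s_nq_acct
WHERE start_ts > SYSDATE - 1
AND (row_count > 60000 OR total_time_sec > 300 OR num_db_query > 4);
EOF
)
if [ "$COUNT" -gt 0 ]; then
    echo "WARNING - $COUNT potentially bad queries in the last day"
    exit 1
fi
echo "OK - no problem queries in the last day"
exit 0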

Databases - both reporting sources and repository schemas (BIPLATFORM)

Without the database, OBIEE is not a great deal of use. It needs the database in place to provide the data for reports, and it also needs the repository schemas that are created by the RCU (MDS and BIPLATFORM).

As with the OS monitoring, it may be that your databases are monitored by a DBA team. But as with OS monitoring, it is a good idea to get involved and understand exactly what is being monitored and what isn't. A DBA may have generic alerts in place, maybe for disk usage and deadlocks. It might also be useful to monitor the DW for long-running queries or high session counts. Long-running queries aren't necessarily going to bring the database down, but they might be a general indicator of performance problems that you should be investigating sooner rather than later.
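
As a sketch of the session-count idea - the credentials, TNS alias, and the threshold of 100 active sessions are all assumptions:

#!/bin/bash
# Sketch: warn on a high active session count in the reporting database.
ACTIVE=$(sqlplus -s system/password@DWH <<'EOF' | tr -d '[:space:]'
SET HEADING OFF FEEDBACK OFF PAGESIZE 0
SELECT COUNT(*) FROM v$session WHERE status = 'ACTIVE' AND type = 'USER';
EOF
)
if [ "$ACTIVE" -gt 100 ]; then
    echo "WARNING - $ACTIVE active sessions"
    exit 1
fi
echo "OK - $ACTIVE active sessions"
exit 0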

ETL

Getting further away from the core point of monitoring OBIEE, don't forget the ancillary components to your deployment. For your reports to have data the database needs to be functioning (see previous point) but there also needs to be data loaded into it.

OBIEE is the front-end of service you are providing to users, so even if a problem lies further down the line in a failed ETL batch, the users may perceive that as a fault in OBIEE.

So make sure that alerting is in place on your ETL batch too and there's a way that problems can be efficiently communicated to users of the system.

Active Monitoring

The above areas are crucial for "passive" monitoring of OBIEE. That is, when something happens which could be symptomatic of a problem, raise an alert. For real confidence in the OBIEE deployment, consider what I term active monitoring. Instead of looking for symptoms that everything is working (or not), actually run tests to confirm that it is. Otherwise you end up only putting in place alerts for things which have failed in the past and for which you have determined the failure symptom. Consider it the OBIEE equivalent of a doctor reading someone's vital signs chart versus interacting with the person and directly ascertaining their health.

OBIEE stack components involved in a successful report request

This diagram shows the key components involved in a successful report request in OBIEE, and illustrated on it are the three options for actively testing it described below. Use this as a guide to understand what you are and are not confirming by running one of these tests.

sawping

This is a command line utility provided by Oracle, and it runs a "ping" of the Presentation Services server. Not complicated, and not overly useful if you're already monitoring for the sawserver process and network port. But it's easy to set up, so maybe worth including anyway. Note that this doesn't check the BI Server, database, or Web Logic.

[oracle@rnm ~]$ sawping -s myremoteobiserver.company.com -p 9710 -v
Server alive and well


[oracle@rnm ~]$ sawping -s myremoteobiserver.company.com -p 9710 -v
Unable to connect to server. The server may be down or may be too busy to accept additional connections.
An error occurred during connect to "myremoteobiserver.company.com:9710". Connection refused [Socket:6]
Error Codes: YM52INCK
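
A sketch of a Nagios plugin around sawping follows - it keys off the output text rather than trusting sawping's exit code (which I haven't verified), and the host and port are the same assumptions as above:

#!/bin/bash
# Sketch: Nagios check wrapping sawping. Needs the OBIEE environment,
# hence the dot-sourced bi-init.sh.
. $FMW_HOME/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh
OUTPUT=$(sawping -s myremoteobiserver.company.com -p 9710 -v 2>&1)
if echo "$OUTPUT" | grep -q "Server alive and well"; then
    echo "OK - Presentation Services responded to sawping"
    exit 0
fi
echo "CRITICAL - $OUTPUT"
exit 2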

nqcmd

nqcmd is a command line utility provided with OBIEE which acts as an ODBC client to the BI Server. It can be used to run Logical SQL (the query that Presentation Services generates to fetch data for a report) against the BI Server. Using nqcmd you can validate that the BI Cluster Controller, BI Server and Database are functioning correctly.

You could use nqcmd in several ways here:

  • Simple yes/no test that this part of the stack is functioning
  • nqcmd returns the results of a query, so you could test that the data being returned by the database is correct (compare it to what you know it should be)
  • Measure how long it takes nqcmd to run the query, and trigger an alert if the query is slow-running

This example runs a query extracted from nqquery.log and saved as query01.lsql. It uses grep and awk to parse the output to show just the row count retrieved, and the total time it took to run nqcmd. It uses the \ character to split lines for readability. If you want to understand more about how it works, run the nqcmd bit before the pipe | symbol and then add each of the pipe-separated statements back in one by one.

. $FMW_HOME/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh

time nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s ~/query01.lsql -q -T \
2>/dev/null | grep Row | awk '{print $13}'

NB don’t forget the bi-init step, which sets up the environment variables for OBIEE. On Linux it’s “dot-sourced” – with a dot space as the first two characters of the line.
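
Building on that, here is a sketch of a simple yes/no-plus-timing Nagios plugin. It assumes nqcmd sets a non-zero exit code on failure (if it doesn't on your install, parse its output instead), and the 30-second threshold is arbitrary:

#!/bin/bash
# Sketch: run a saved Logical SQL query via nqcmd, alert on failure or slowness.
. $FMW_HOME/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup/bi-init.sh
START=$(date +%s)
if ! nqcmd -d AnalyticsWeb -u Prodney -p Admin123 -s ~/query01.lsql -q > /dev/null 2>&1; then
    echo "CRITICAL - nqcmd query failed"
    exit 2
fi
ELAPSED=$(( $(date +%s) - START ))
if [ "$ELAPSED" -gt 30 ]; then
    echo "WARNING - query took ${ELAPSED}s | elapsed=${ELAPSED}s"
    exit 1
fi
echo "OK - query completed in ${ELAPSED}s | elapsed=${ELAPSED}s"
exit 0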

Web user automation

Pretty much the equivalent of logging on to OBIEE in person and checking it is working, this method uses generic web application testing tools to simulate a user running a report, and raises an alert if the report doesn’t work. As with the nqcmd option previously, you could keep it simple and just confirm that a report runs, or you could start analyzing run times for performance trending and alerting.

To implement this option you need a tool which lets you record a user’s OBIEE session and can replay it simulating the browser activity. Then script the tool to replay the session periodically and raise an alert if it fails. Two tools I’m aware of that could be used for this are JMeter and HP’s BAC/LoadRunner.

A final point on this method – if possible run it remotely from the OBIEE server. If there are network problems, you want to pick those up too rather than just hitting the local loopback interface.
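
If a full record-and-replay harness is more than you need to begin with, a much cruder sketch is simply to check that the analytics application serves a page remotely. The host, port and URL here are assumptions, and note what this does and doesn't prove: it confirms Web Logic and the analytics deployment are responding over the network, not that a report actually runs end-to-end:

#!/bin/bash
# Crude sketch: confirm the analytics web app answers over HTTP from a remote host.
HTTP_CODE=$(curl -s -L -o /dev/null -w '%{http_code}' --max-time 30 \
    "http://myremoteobiserver.company.com:9704/analytics")
if [ "$HTTP_CODE" = "200" ]; then
    echo "OK - analytics returned HTTP 200"
    exit 0
fi
echo "CRITICAL - analytics returned HTTP $HTTP_CODE"
exit 2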

If you think that this all sounds like overkill, then consider this real-life case here, where all the OBIEE processes were up, the database was up, the network ports were open, the OS stats were fine — but the users still couldn’t run their reports. Only by simulating the end-to-end user process can you get proper confidence that your monitoring will alert you when there is a problem.

Enterprise Manager

This article is focussed on the automated monitoring of OBIEE. Enterprise Manager (Fusion Middleware Control) as it currently stands is very good for diagnosing and monitoring live systems, but doesn’t have the kind of automated monitoring seen in EM for Oracle DB.

There has always been the BI Management Pack available as an extra for EM Grid Control, but it’s not currently available for OBI 11g. Updated: It looks like there is now, or soon will be, an OBI 11g management pack for EM 12c, see here and here (h/t Srinivas Malyala)

Capacity Planning

Part of planning a monitoring strategy is building up a profile of your systems’ “normal” behaviour so that “abnormal” behaviour can be spotted and alerted on. In building up this profile you should find that you easily start to capture valuable information which feeds naturally into capacity planning.

Or, to put it another way: a pleasant side-effect of decent monitoring is a head start on capacity planning and understanding your system’s usage versus the available resource.

Diagnostics

This post is focused on automated monitoring; in the middle of the night when all is quiet except the roar of your data centre’s AC, something is keeping an eye on OBIEE and will raise alarms if it’s going wrong. But what about if it is going wrong, or if it went wrong and you need to pick up the pieces?

This is where diagnostics and “interactive monitoring” come in to play. Enterprise Manager (Fusion Middleware Control) is the main tool here, along with Web Logic Console. You may also delve into the Web Logic Diagnostic Framework (WLDF) and the Web Logic Dashboard.

For further reading on this see Adam Bloom’s presentation from this year’s BI Forum: Oracle BI 11g Diagnostics

What next

Over the next few days I will be posting further articles in this series, looking at how we can put some of this monitoring theory into practice:

Using FMAP and AnalyticsRes in an Oracle BI High Availability Implementation

The fmap syntax has been used for a long time in Oracle BI / Siebel Analytics when referencing images inherent in the application as well as custom images. This syntax is used on Analysis requests and dashboards. The syntax is usually fmap:images/myimage.png, or just images/myimage.png in some fields when developing. In legacy Oracle BI environments custom images were usually tossed into the \web\res\ folder, and referencing custom images using fmap worked the same as it did for those inherent images. If the legacy OBIEE environment was scaled out in a cluster then most integrations just deployed the custom images or files to each server in the cluster and called it a day. There are better approaches than this latter one.