Monthly Archives: March 2012

Statistical Analysis in the Database

My first ever presentation at the UKOUG conference (long before I started working with Mark Rittman and Jon Mead) was an introduction to analytic functions. Since then analytics have remained a key part of my work as a DW developer; many of the processes that I have implemented would be far less efficient without functions such as SUM, AVG, LAG, LEAD, FIRST_VALUE and the RANK functions. Analytics, however, are not just for reporting; they are also a major feature of performant ETL. For example, we have put LAG, LEAD and LAST_VALUE to good use in change data capture processes for one of our clients.
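
To give a flavour of that kind of use – this is a minimal sketch rather than the client's actual code, and the ACCOUNT_HISTORY table and its columns are hypothetical – LAG lets us compare each row with the previous row for the same key and keep only the genuine changes:

-- keep only rows whose balance changed since the previous snapshot
SELECT *
FROM  (SELECT account_id,
              effective_date,
              balance,
              LAG(balance) OVER (PARTITION BY account_id
                                 ORDER BY effective_date) AS prev_balance
       FROM   account_history)
WHERE  prev_balance IS NULL
   OR  balance <> prev_balance;

Rows where the balance is unchanged are discarded before they ever reach the target.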

Recently, I have been writing a new presentation, “BI is More Than Slice and Dice”, for April’s Collaborate 12 conference in Las Vegas. In it I will be looking at some BI techniques beyond simple aggregation and dicing up the data based on the dimensions. As part of the talk I will be considering some of the things you can do inside the Oracle 11gR2 database – either for free or by paying out some extra license cash.

One of the “for free” things is the use of the statistical SQL database functions, which neatly ties in with my love of analytic functions. Although most of these statistical SQL functions can be used in their analytic form (with an OVER (PARTITION BY…) clause), some are “aggregate” only (that is, with an optional GROUP BY). In addition to the familiar AVG and the lesser-used MEDIAN and STATS_MODE there are functions to calculate standard deviations, correlations, linear regressions and variance. Other statistics (such as standard error) can be calculated by combining multiple statistical functions. We can carry out some very sophisticated analysis directly on our data warehouse data using SQL. Of course, to make this analysis valuable we need to apply the correct analysis to our data – and that requires an understanding of the data and an appreciation of how to use statistical techniques; we should no more think of using a socket wrench to drive a nail than use the wrong statistical technique on our data. Recently, Kim Berg Hansen blogged a good worked example of using the analytic linear regression slope function to create a sales forecast based on previous sales. We can of course get a lot more sophisticated in our analysis and take other factors into account, but it is a good introductory example of what can be done using in-database functionality. More examples can be found in that must-read Oracle manual, the Data Warehousing Guide.
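
As a taste of the idea (a minimal sketch, not Kim's worked example – the SALES table, its columns and the baseline date are hypothetical), REGR_SLOPE and REGR_INTERCEPT can fit a straight-line trend to monthly sales and project it one month ahead:

-- fit a linear trend to monthly sales and forecast the next month
SELECT REGR_SLOPE(amount, month_no)     AS slope,
       REGR_INTERCEPT(amount, month_no) AS intercept,
       REGR_INTERCEPT(amount, month_no)
         + REGR_SLOPE(amount, month_no) * (MAX(month_no) + 1) AS next_month_forecast
FROM  (SELECT MONTHS_BETWEEN(TRUNC(sale_date, 'MM'), DATE '2011-01-01') AS month_no,
              SUM(sale_amount)                                          AS amount
       FROM   sales
       GROUP  BY TRUNC(sale_date, 'MM'));

A real forecast would want more history and some treatment of seasonality, but it shows how far plain SQL can take us before reaching for an external tool.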

As you may have seen elsewhere on the Rittman Mead blog, it is a season of product launches, acquisitions and new versions at Oracle – Mark has recently blogged mini-series on Endeca, Exalytics and OBI 11.1.1.6, and I have written a little about ODI 11.1.1.6, but we have not really mentioned Oracle's involvement with R – the open source programming language and environment for statistical computing and graphics. If you have not come across R before, it has its origins in Auckland University's Statistics Department and has, with community contribution, evolved into a powerful environment that compares well with commercial packages such as SAS and SPSS. Oracle recently released its own open source R Distribution, contributed the ROracle package which provides robust OCI database connectivity to R distributions, and introduced two licensable products: Oracle R Enterprise (ORE) and the R Connector for Hadoop (part of the Big Data Connectors license), which allows the use of R functionality on data stored in a Hadoop file system.

ORE is installed directly into the database (currently 64-bit Linux is supported) and allows us to execute R code and use R packages where the data is, that is in the Oracle database (whereas conventional R distributions extract the data to the R client's memory space). This is a great advantage for data warehouse-sized tables, which could be too big to process in client memory. ORE is licensed as part of the Advanced Analytics database option (along with Oracle Data Mining, ODM) – this is not a “use either A or B” bundling of products, since ORE can give direct R-language access to the advanced algorithms in ODM, while on the other hand R Enterprise permits some forms of analysis that are not possible within ODM. With ORE we can develop very sophisticated statistical analyses within the database (such as forecasting, relationship discovery and what-if analysis) and visualize the results either through the R client application or OBIEE; something which my colleague, Jordan Meyer, has been doing recently. Hopefully Jordan will be able to share some of his experiences with both ORE and Hadoop soon.

See you in Brighton!

Want to know if the White Stork really brings the babies? See me and many others speak at the Rittman Mead BI-Forum: http://www.rittmanmead.com/biforum2012

See you in Brighton!

Oracle Data Integrator Designer for OWB Developers

Last time I described some aspects of ODI Topology. I only scratched the surface of that subject and mainly kept to database sources; if you attend one of the Rittman Mead ODI bootcamps (we have public versions of the ODI course coming up in Australia, India, the USA and the UK) you will learn a lot more about this topic over the four days of the course.

I suppose the component of ODI that feels most like home to an OWB developer is the Designer Navigator. Functionally it is like an amalgam of the table editor, the mapping editor and the workflow editor, with a load of other stuff thrown in. Again, just like the Topology Navigator, it is organised in accordion tabs. In this blog I am going to talk about just two of the accordions on the navigator: Models and Projects. These two are possibly all you need to design an ETL process; the model defines the structure of the data sources and targets (the tables in my example) and the project describes how data is mapped from source model(s) to the target model (usually in an interface) and how interfaces, procedures and more are orchestrated in a package (a package is similar to a process flow in OWB).

Models

In the same way that the Physical Architecture in the Topology Navigator contains data servers, the Designer Navigator's Models accordion holds datastores. Datastores describe the structure of the data sources (think tables and files) and targets we place on to our interfaces. Right-clicking in the Models section gives us the option to create new folders, sub-folders, models and the datastores themselves. Folders and sub-folders are optional components that help you organize your work. In the picture below we are naming the folder based on business purpose, but another good approach is to name folders by technology – for example an Oracle folder, an XML folder and so on.

(Screenshot: Models)

The model is the link back to the Topology Navigator items: we name the model, specify the technology and give the logical schema; the actual connection to the data server comes through the context. From the model we can right-click to define the datastores themselves or, optionally, generate them by reverse engineering the connection. Generally, I prefer to reverse engineer as this process brings in the table, the columns and the constraints, but it is equally feasible to create your own datastores in the navigator. Just as in OWB, reverse engineering a second time does not remove any alterations you made to the datastore design – though you can get caught out if columns change name between imports.

(Screenshot: Datastores)

As you can see, we can also define filters and constraints – filters are effectively a “where clause” that gets appended to the select query used for data sources; for example, we may only want to load rows where STATUS = ‘ACTIVE’. Constraints are where we define primary and alternate keys, unique constraints and so on. These constraints need not be physically implemented in the database, but they are important to ODI: the primary or alternate keys are used in update operations to select the rows to update, and unique and not-null constraints will be used if you elect to use a check knowledge module as part of the interface implementation.
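
In SQL terms – a minimal sketch, with the SRC_CUSTOMERS table and its columns invented for illustration – that STATUS = ‘ACTIVE’ filter simply ends up in the extraction query ODI generates against the source:

-- SRC_CUSTOMERS and its columns are hypothetical
SELECT cust_id,
       cust_name,
       status
FROM   src_customers
WHERE  status = 'ACTIVE';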

Projects

At last I get to the section where we actually implement the data flow from source to target. The first step on the Projects accordion is to create a named project. This generates sections for variables, sequences and knowledge modules (this may be somewhat different in the new 11.1.1.6 release, as we can now use versioned global knowledge modules). We can also create a folder to hold all of our interfaces, procedures and packages.

(Screenshot: Projects)

Interfaces are similar to OWB mappings in that they describe how data flows from table A, column B to table Z, column X. Just as in OWB, we can implement filters on the data sources, joins between source tables (inner, left, right, full outer and cartesian joins are supported), look-ups and set-based operators such as UNION and MINUS. Where it differs is that an ODI interface has a single target table, and we do not have a palette of operators to place on the dataflow in the way we add expressions and constants to OWB mappings. Neither of these is a particular problem: constants and expressions can be typed directly into the mapping or quick-edit tabs of the ODI interface editor (in fact ODI is more flexible here, as we can choose where to apply the expressions – on the source, the target or an intermediate staging table – which is a very useful feature), and not having multi-table inserts is again fairly simple to work around. Where the ODI interface really differs from the OWB mapping is that we have a further two tabs of settings that we may need to complete. The flow tab is where we specify the knowledge modules our interface uses: there will always be one Integration Knowledge Module (IKM), which describes how data goes into the target datastore, and, depending on the technologies used in the interface, possibly a Load Knowledge Module (LKM) – this may not be needed if the source and target datastores are on the same data server. The flow tab also has a selector to allow us to implement data validation on load through the use of a Check Knowledge Module (CKM).

(Screenshot: Interfaces)

The knowledge modules I referred to above are, under the covers, ODI procedures – that is, programs of ODI actions to be executed in sequence. An ODI procedure consists of one or more steps (units of code) written in one of the supported scripting technologies (SQL, Jython, Groovy); within a procedure we can mix technologies, for example the first step may execute an OS command from Jython and a subsequent step perform some SQL action on a database table. In addition to the built-in functionality of Jython we can add our own functionality by importing Java routines from .JAR files included in the ODI code path. Procedures, like interfaces, can access ODI variables and metadata functions; both of these features allow us to develop very flexible processes.
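
For illustration only (the project, variable and table names here are hypothetical, not taken from a real procedure), a SQL step in a procedure could purge old rows from a load-audit table, with an ODI variable supplying the retention period:

-- hypothetical SQL step; #MY_PROJECT.RETENTION_DAYS is an ODI variable
DELETE FROM etl_load_audit
WHERE  load_date < SYSDATE - #MY_PROJECT.RETENTION_DAYS;

At execution time ODI substitutes the variable value into the text before the statement is sent to the target data server.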

The final part of the ODI project I will consider in this blog is the package. Packages are similar in concept to process flows in OWB; however, unlike OWB, they do not require the use of Oracle Workflow. As with the other accordions in the Designer Navigator, right-clicking in the package area gives us the option to create a new package. The package designer allows us to drop on to the screen the various interfaces and procedures we need to implement our package. In addition there is a rich palette of ODI actions that can also be added to perform tasks such as file unzipping, decryption and FTP transfers. We can also include ODI variables in the package and thus introduce conditional logic. We specify the first step of the package by right-clicking on it and selecting “First Step”, then link it to the subsequent tasks using the OK arrow tool (for success) or the KO arrow (for failure); in the case of the variable comparison operator, OK means TRUE and KO means FALSE.

Of course, there is a lot more to ODI than I have covered in just three blog posts (there is more to OWB than this too!), but hopefully this has given a flavour of how ODI looks, and perhaps the encouragement to try it out (or book an ODI course).

Anatomy of BI EE Page Rendering – Use of HTTP/Web Server & Compression

There are many aspects/variables that can affect the performance of a BI deployment. Whenever a BI dashboard is slow, the general tendency is to attribute the slowness to either the BI reports or the underlying databases. Of course, in 90% of cases this will be true, but there are cases where other factors affect the page load times. In this blog post we will investigate some of those other factors, such as the browser cache, the HTTP server and HTTP server compression. We will also try to understand what happens, from a communication standpoint, between the browser and the server: how much network traffic gets passed, what the optimal bandwidth is, and so on.

To find out what happens at a protocol/network level we will be using the Firebug & YSlow network monitoring extensions. There are other network monitoring and sniffing tools available to do this, but I prefer Firebug/YSlow (& LiveHTTPHeaders) as they give a graphical representation of each GET/POST request made, along with the times and sizes. Let's load up the extension and point it at the SampleAppLite Overview dashboard (directly using a URL without the HTTP server enabled) from a browser that has had its cache cleared completely.

Instead of a single HTTP call we see a lot of HTTP calls. So, let's trace the output (rendered HTML) of the first HTTP call (http://108.166.89.170:7001/analytics/saw.dll?Dashboard&PortalPath=%2Fshared%2FSample%20Lite%2F_portal%2FQuickStart&page=Overview)

So what we have is: the first HTTP call brings all the necessary data from all the reports (with table, pivot views etc.), but it does not bring in all the images, CSS & JS files. Those are all fetched in incremental GET calls made in parallel. To confirm, let's look at the HTTP response (just the response from the server – not the HTML) of the first GET call (the same URL above).

If you notice, the HTTP response back from the server has server references (like relative URLs to the location of the JS, image & stylesheet files etc.). To retrieve them, the browser has to make further calls back to the server and display them. Let's look at the overall size breakup of the Overview dashboard. The idea is to find out the overall size of an average dashboard and the breakup of that size by individual image files, JS files and so on.

Interesting. JavaScript files alone take up the bulk of the overhead; the actual data is comparatively small. A total of 122 JS files and a total of 49 image files – that's a lot for a single page. Ok, now let's try to understand where the browser cache comes into the picture. To understand that, let's take a look at the breakup shown below of the overall network data size with and without browser cache.

If you notice, with browser caching alone (on image files, JS files etc.) there is a significant reduction in the amount of data transferred from the server to the client, i.e. from 4MB to 140KB. That's a lot of caching. Also, notice that there is no caching for the main/first HTTP request, as expected. So, as a standard, we need to have browser caching enabled for production environments, as it can significantly reduce both the number of HTTP calls and the size of the data transferred. What happens is that, after receiving the first HTTP response, the client browser decides whether to send HTTP requests to fetch the relative-URL images and JS files or whether to get them from the local browser cache (if it is populated). The LiveHTTPHeaders output shows the requests below (after the browser cache is warmed up).

Very interesting. As you see, as soon as the browser cache is warmed up there are only a limited number of GET requests back to the server.

To summarize, there are three things to notice:

  • BI EE does not make just a single HTTP call from the browser to the server (obvious, but a definite thing to keep in mind)
  • With browser caching, a significant amount of data can be kept from being passed over the network
  • To render a single web page, BI EE makes close to 300-400 HTTP requests to get almost 4MB of data

So the question is: what role does an HTTP server play in all of this? Can HTTP compression reduce the amount of traffic?

At a high level this is what happens

  • Step 1: The user enters a BI EE URL and hits Enter in the browser. An HTTP request is passed on to the server.
  • Step 2: The server, on receiving the request, does some internal processing (basically fires SQL to get the data and then creates the HTML) and formulates the response.
  • Step 3: The browser receives the response. This response is nothing but the source of the HTML output page.
  • Step 4: Based on the first HTTP response, the browser decides whether to use its cache (if populated) or to make further HTTP requests to the server.
  • Step 5: If the browser cache is enabled, only those images, CSS and JS files that are not cached will generate HTTP requests. These requests are made in parallel; the number of parallel requests depends on the browser.
  • Step 6: HTTP requests are generated for all the other remaining files required for rendering. Modern browsers can start rendering the page as soon as the first HTTP response is received, and then render the individual images incrementally as they arrive (hence the impression that the page is being painted, especially on slow network installs).

Pretty straightforward. But this does offer insights into what can be done to reduce the traffic and thereby reduce contention.

  • Browser caching, once enabled, can reduce the traffic significantly, at least for static images. This can reduce the number of HTTP requests.
  • BI EE transfers around 4 to 10 MB (or more) of traffic per page (not as a single transfer – this happens in parallel in chunks). There is scope for compression.
  • The browser cache, though very good, still cannot cache every static file/resource. This is best done by a separate HTTP server. Also, the first-time access of new users to a site will be slow if we depend on the browser cache alone (until it warms up). So an HTTP server is ideally the best place to cache static resources, so that static file requests from all end users can be served from that cache.

Remember, HTTP server caching does not reduce the number of GET requests; it just keeps unnecessary GET requests away from the application server. So, if you want to reduce the CPU/memory usage on the main application server by means of HTTP server caching, then the HTTP server has to reside on a completely different box from the main application server.

Having the web server and app server on the same box might not give that much of a performance gain (though it still makes sense to isolate the main app server process from serving static resources – the HTTP server should do that, as it can serve them from memory as well).

So, let's take a look at the BI EE SampleAppLite Overview dashboard page through the web server URL (rather than the app server port).

As you see, not much of an actual gain. But the gains will become apparent as more users start accessing the URL – static resources will get cached in-memory in the HTTP server's process space and hence can be served out faster than from the app server.

Now let's try to understand how the page load times relate to the network bandwidth. As we saw earlier, almost 4MB of data gets moved over the network to load one page. The majority of it can be cached either at the HTTP server layer or at the browser level, but 4MB is still a lot of data.

So, let's try adding compression at the HTTP server layer. The idea is that the HTTP server compresses the response with gzip before sending it (this is flagged in the HTTP headers so that the client browser knows to uncompress it), and the client browser then uncompresses the packets on receipt. To enable compression, the following needs to be done in the httpd.conf file of the HTTP server.

1. Ensure that the following lines are not commented out in the httpd.conf file.

LoadModule expires_module "${ORACLE_HOME}/ohs/modules/mod_expires.so"
LoadModule deflate_module "${ORACLE_HOME}/ohs/modules/mod_deflate.so"

2. Add the following after the LoadModule section
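
The exact directives are not reproduced in the text; a typical mod_deflate/mod_expires configuration along these lines would do the job (an illustrative sketch – adjust the MIME types and expiry periods to suit your installation):

<IfModule mod_deflate.c>
    # Compress text-based responses before they leave the HTTP server
    AddOutputFilterByType DEFLATE text/html text/plain text/css text/xml application/x-javascript application/javascript
</IfModule>

<IfModule mod_expires.c>
    # Let browsers cache static resources so repeat visits skip those GET requests
    ExpiresActive On
    ExpiresByType image/gif "access plus 1 week"
    ExpiresByType image/png "access plus 1 week"
    ExpiresByType text/css "access plus 1 week"
    ExpiresByType application/x-javascript "access plus 1 week"
</IfModule>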

Now let's take a look at the size of the data that gets transferred over the network.

As you see, there is a reduction of almost 3MB in the size of the data that gets transferred. Also, if you take a look at the header of the response, there is now a Content-Encoding of gzip.

Compression is recommended only if you are on a WAN. Remember, compression requires CPU both on the HTTP server side and on the client side. On LAN networks, where bandwidth is not an issue, compression ideally should not be enabled.

HTTP Server Support:

As a footnote, if you are in the process of enabling HTTP servers for your existing BI EE 11g installs, ensure that you are on the right patch sets of Oracle Web Tier. For example, BI EE 11.1.1.5 supports only the 11.1.1.4 version of Oracle Web Tier, and the 11.1.1.4 version of Web Tier is not a direct install – instead one has to install the 11.1.1.2 version and then upgrade it to 11.1.1.4 using the patch set. For BI EE 11.1.1.6, however, Oracle Web Tier 11.1.1.6 is supported directly (no patch set needed).

New Training Courses and Seminars from Rittman Mead

I’ve recently joined Rittman Mead to head up our training area, and for the past few weeks have been working on a number of new topics, modules and delivery channels for our courses. We’re now going to be offering public, focused half-day virtual training seminars on the internet, plus some new OBIEE 11g course modules on Exalytics, and some classroom-based training in our Brighton training centre for intermediate-to-advanced level OBIEE 11g developers. Here are the details:

Webinar Series

We are excited to be launching a series of web-based seminars, presented by Mark Rittman and Robin Moffatt. These half-day sessions will enable you to learn about specific areas of a product from the comfort of your own desk. All you need is an internet connection and headphones/speakers!

We are running them on a weekly basis starting at the end of March 2012, every Wednesday afternoon (UK/Europe time) so that listeners from North America can also join in. The cost is £250 inc VAT per session, payable via Paypal, or you can buy a “season pass” for all eight sessions for just £1250 inc VAT.

To see the details of the sessions, and to book your place, please visit our Webinars page.

Exalytics Training

We are very proud to be launching one of the first training courses for the new Exalytics machine from Oracle.

The one-day course covers the administration and configuration of an Exalytics machine, as well as how to best use the new front-end functionality to make the most of the power of in-memory analytics.

See our Exalytics training page for more information and to book a place on the course.

New public courses

In May, we will be running two new public courses at our new Brighton-based training centre.

Less than an hour from London, Brighton is a great setting in which to learn about OBIEE in-depth from some of the industry’s most respected and experienced trainers.

OBIEE 11g Bootcamp

There are just a few places left on our TRN202 OBIEE 11g bootcamp running at our Brighton training centre the week of April 13th. Click here to book your place now!