Real Business Value with Predictive BI
By Antony Heljula (UKOUG Apps 2014 presentation)
Predictive BI is all about using your past to predict the future! It is an effective and exciting way to extract more value from the significant volumes of historical data gathered by most organisations. Predictive BI is applicable to just about any organisation or department, especially those with Oracle BI and/or Oracle BI Applications.
This presentation will explain what Predictive BI can achieve for your business and discuss the Oracle products which can be used to deliver it (Oracle Advanced Analytics and Oracle Real-Time Decisions). A live demonstration will also show the impact of having Predictive capabilities on your Business Intelligence dashboards.
To view the presentation, click the image below, and feel free to leave any comments or questions using the form underneath.
Advanced Modelling with OBIEE: Data-mart Automation
By Antony Heljula (UKOUG Tech 2014 presentation)
Based on real-life customer projects, this presentation outlines the advanced modelling features of the Oracle BI Server, which can significantly reduce your development timeframes and simplify the process of building and populating data-marts to optimise your analytical queries. Topics covered include Modelling Principles, Fragmentation, Cross Database Joins, Federated Queries, Aggregate Persistence and BI Server "Populate" commands. Examples show how Oracle BI was able to consolidate data from several operational systems into a central data-mart and provide near real-time reporting with fast response times - all data was loaded via the BI Server, no ETL tool required.
To view the presentation, click the image below, and feel free to leave any comments or questions using the form underneath.
Connecting OBIEE11g on Windows to a Kerberos-Secured CDH5 Hadoop Cluster using Cloudera HiveServer2 ODBC Drivers
In a few previous posts and magazine articles I’ve covered connecting OBIEE11g to a Hadoop cluster, using OBIEE 11.1.1.7 and Cloudera CDH4 and CDH5 as the examples. Things get a bit complicated in that the DataDirect Apache Hive ODBC drivers that Oracle ship are only for HiveServer1 and not the HiveServer2 version that CDH4 and CDH5 use, and the Linux version of OBIEE 11.1.1.7 won’t work with the Cloudera Hive ODBC drivers that you have to use to connect to Hive on CDH4/5. You can however connect OBIEE 11.1.1.7 on Windows to HiveServer2 on CDH4 and CDH5 if you use the Cloudera Hive ODBC drivers for Windows, and although this isn’t supported by Oracle, in my experience it does work, albeit with the general OBIEE11g Hive restrictions and caveats detailed in the Metadata Repository Builder’s Guide, and the fact that in practice Hive is too slow to use for ad-hoc reporting.
However … most enterprise-type customers who run Hadoop on their internal networks have their clusters configured as “secured”, rather than the unsecured cluster examples that you see in most OBIEE connection examples. By default, Hadoop clusters are very trusting of incoming network and client connections and assume that whoever’s connecting is who they say they are, and HDFS and the other cluster components don’t perform any authentication themselves of incoming client connections. In addition, by default all network connections between Hadoop cluster components run in clear text and without any mutual authentication, which is great for a research cluster or PoC but not really appropriate for enterprise customers looking to use Hadoop to store and analyse customer data.
Instead, these customers configure their clusters to run in secured mode, using Kerberos authentication to secure incoming connections, encrypt network traffic and secure connections between the various services in the cluster. How this affects OBIEE though is that your Hive connections through to the cluster also need to use Kerberos authentication, and you (and the OBIEE BI Server) need to have a valid Kerberos ticket when connecting through the Hive ODBC driver. So how do we set this up, and how do we get hold of a secure Hadoop cluster using Kerberos authentication to test against? A few of our customers have asked this question recently, so I thought it’d be worth jotting down a few notes on how to set this up.
At a high level, if you want to connect OBIEE 11.1.1.7 to a secure, Kerberos-authenticated CDH cluster, there are three main steps you need to carry out:
- Get hold of a Kerberos-secured CDH cluster, and establish the connection details you’ll need to use to connect to it
- Make sure the Kerberos server has the correct entries/principals/user details for the user you’re going to securely-connect as
- Configure the host environment for OBIEE to work with Kerberos authentication, and then create the connection from OBIEE to the CDH cluster using the correct Kerberos credentials for your user
In my case, I’ve got a Cloudera CDH5.3.0 cluster running in the office that’s been configured to use MIT Kerberos 5 for authentication, set up using an OEL6 VM as the KDC (Key Distribution Centre) and the cluster configured using the new Kerberos setup wizard that was introduced with CDH5.1. Using this wizard automates the creation of the various Kerberos service account and host principals in the Kerberos database, and configures each of the cluster components – YARN, Hive, HDFS and so on – to authenticate with each other using Kerberos authentication and use encrypted network connections for inter-service and inter-node communication.
Along with the secured Hadoop cluster, key bits of information and configuration data you’ll need for the OBIEE side are:
- The krb5.conf file from the Kerberos KDC, which contains details of the Kerberos realm, URL for the KDC server, and other key connection details
- The name of the Kerberos principal used for the Hive service name on the Hadoop cluster – typically this is “hive”; if you want to connect to Hive first using a JDBC tool such as beeline, you’ll also need the full principal name for this service, in my case “hive/bda3node2.rittmandev.com@RITTMANDEV.COM”
- The hostname (FQDN) of the node in the CDH cluster that contains the HiveServer2 RPC interface that OBIEE connects to, to run HiveQL queries
- The port that HiveServer2 is running on – typically this is “10000”, and the Hive database name (for example, “default”)
- The name of the Kerberos realm you’ll be connecting to – for example, MYCOMPANY.COM or in my case, RITTMANDEV.COM (usually in capitals)
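Putting these pieces together, the HiveServer2 JDBC URL you’ll later pass to beeline can be assembled as in this short sketch (the hostnames, realm and principal below are the example values from this post – substitute your own):

```shell
# Sketch: building the Kerberos-enabled HiveServer2 JDBC URL from the
# connection details listed above. All values here are this post's examples.
HIVE_HOST="bda3node2.rittmandev.com"   # FQDN of the HiveServer2 node
HIVE_PORT="10000"                      # default HiveServer2 port
HIVE_DB="default"                      # Hive database name
KRB_REALM="RITTMANDEV.COM"             # Kerberos realm, usually in capitals
HIVE_PRINCIPAL="hive/${HIVE_HOST}@${KRB_REALM}"   # full Hive service principal

JDBC_URL="jdbc:hive2://${HIVE_HOST}:${HIVE_PORT}/${HIVE_DB};principal=${HIVE_PRINCIPAL}"
echo "${JDBC_URL}"
```

The resulting URL is the one used in the beeline !connect command later in this post.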
In my case, the krb5.conf file that is used to configure Kerberos connections to my KDC looks like this – in your company it might be a bit more complex, but this example defines a simple MIT Kerberos 5 domain:
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = RITTMANDEV.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 RITTMANDEV.COM = {
  kdc = auth.rittmandev.com
  admin_server = auth.rittmandev.com
 }

[domain_realm]
 .rittmandev.com = RITTMANDEV.COM
 rittmandev.com = RITTMANDEV.COM
In my setup, the CDH Hadoop cluster has been configured to use Kerberos authentication for all communications between cluster components and for any connections from the outside that use those components; the cluster itself can still be accessed via unsecured (non-Kerberos-authenticated) SSH, though of course this aspect could be secured too. To test out the Hive connectivity before we get into the OBIEE details, you can use the beeline CLI that ships with CDH5. To do this you’ll need to be able to SSH into one of the cluster nodes (if you’ve not got beeline installed on your own workstation), and you’ll need an account (principal) created for you in the Kerberos database to correspond to the Linux user and HDFS/Hive user that has access to the Hive tables you’re interested in. To create such a Kerberos principal for my setup, I used the kadmin.local command on the KDC VM to create a user that matched my Linux/HDFS username and gave it a password:
kadmin.local: addprinc mrittman
WARNING: no policy specified for mrittman@RITTMANDEV.COM; defaulting to no policy
Enter password for principal "mrittman@RITTMANDEV.COM":
Re-enter password for principal "mrittman@RITTMANDEV.COM":
Principal "mrittman@RITTMANDEV.COM" created.
SSH’ing into one of the secure CDH cluster nodes, I first have to authenticate using the kinit command which, when successful, creates a Kerberos ticket that is cached for a set amount of time and that beeline can thereafter use as part of its own authentication process:
officeimac:.ssh markrittman$ ssh mrittman@bda3node4
mrittman@bda3node4's password:
[mrittman@bda3node4 ~]$ kinit -p mrittman
Password for mrittman@RITTMANDEV.COM:
[mrittman@bda3node4 ~]$
Now I can use beeline, and pass the Hive service principal name in the connection details along with the usual host, port and database name. When beeline prompts for my username and password, I use the Kerberos principal name that matches the Linux/HDFS one, and enter that principal’s password:
[mrittman@bda3node4 ~]$ beeline
Beeline version 0.13.1-cdh5.3.0 by Apache Hive
beeline> !connect jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM
scan complete in 2ms
Connecting to jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM
Enter username for jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM: mrittman
Enter password for jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM: ********
Connected to: Apache Hive (version 0.13.1-cdh5.3.0)
Driver: Hive JDBC (version 0.13.1-cdh5.3.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://bda3node2:10000/default> show tables;
+------------------+--+
|     tab_name     |
+------------------+--+
| posts            |
| things_mrittman  |
+------------------+--+
2 rows selected (0.162 seconds)
0: jdbc:hive2://bda3node2:10000/default> select * from things_mrittman;
+---------------------------+-----------------------------+--+
| things_mrittman.thing_id  | things_mrittman.thing_name  |
+---------------------------+-----------------------------+--+
| 1                         | Car                         |
| 2                         | Dog                         |
| 3                         | Hat                         |
+---------------------------+-----------------------------+--+
3 rows selected (0.251 seconds)
In this particular example we’re using Windows to host OBIEE 11.1.1.7, as this is the only platform on which we can get the HiveServer2 ODBC drivers to work, in this case the Cloudera Hive ODBC drivers available on their website (free download, but registration may be needed). Before we can get this ODBC driver to work though, we need to install the Kerberos client software on the Windows machine so that we can generate the Kerberos ticket that the ODBC driver will need to pass over as part of the authentication process.
To configure the Windows environment for Kerberos authentication, in my case I used the Kerberos for Windows 4.x client software downloadable for free from the MIT website, and copied across the krb5.conf file from the KDC server, renaming it to krb5.ini and storing it in the default location of c:\ProgramData\MIT\Kerberos5.
You also need to define a system environment variable, KRB5CCNAME, to point to a directory where the Kerberos tickets can be cached; in my case I used c:\temp\krb5cache. Once this is done, reboot the Windows environment and you should then be prompted after login to authenticate yourself to the Kerberos KDC.
The ticket then stays valid for a set number of days/hours, or you can configure OBIEE itself to authenticate and cache its own ticket – for now though, we’ll create the ticket manually and connect to the secured cluster using these cached ticket details.
After installing the Cloudera Hive ODBC drivers, I create the connection using Kerberos as the Authentication Mechanism, and enter the realm name, HiveServer2 host and the Hive Kerberos principal name, like this:
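For reference, the DSN settings entered in the driver’s setup dialog correspond roughly to the fragment below, shown here in odbc.ini form for clarity (on Windows these values are entered in the GUI and stored in the registry). The key names follow the Cloudera ODBC Driver for Apache Hive documentation – AuthMech=1 selects Kerberos – but exact names can vary between driver versions, and the host, realm and DSN name are this post’s example values:

```ini
[Hive CDH5 Kerberos]
Host=bda3node2.rittmandev.com
Port=10000
Schema=default
AuthMech=1
KrbRealm=RITTMANDEV.COM
KrbHostFQDN=bda3node2.rittmandev.com
KrbServiceName=hive
```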
In my case both the BI Administration tool and the OBIEE BI Server were on the same Windows VM, and therefore shared the same ODBC driver install, so I then moved over to the BI Administration tool to import the Hive table metadata details into the RPD and create the physical, logical and presentation layer RPD elements. Depending on how your CDH cluster is set up you might be able to test the connection now by using the View Data… menu item in BI Administration, but in my case I had to do two more things on the CDH cluster itself before I could get Hive queries under this Kerberos principal to run properly.
First, secured CDH Hadoop clusters usually configure HiveServer2 to use “user impersonation” (connecting to Hive as the user you authenticate as, not the user that HiveServer2 authenticates to the Hive service as), so YARN and MapReduce jobs run under your account rather than the usual “hive” account that unsecured Hive connections use. Where this causes a problem on CDH installations on RHEL-derived platforms (RHEL, OEL, CentOS etc) is that YARN normally blocks jobs running on behalf of users with a UID below 1000 (on many Linux distributions such UIDs signify system accounts), but RHEL starts regular user UIDs at 500, so YARN blocks those users from running jobs. To fix this, you need to go into Cloudera Manager and edit the YARN configuration settings to lower this UID threshold to something under 500, for example 250:
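Under the covers, the Cloudera Manager setting (labelled something like “Minimum User ID” in the YARN configuration) corresponds to a single container-executor property; a sketch of the resulting value, using this post’s example threshold of 250, would be:

```ini
# container-executor.cfg (managed by Cloudera Manager - change the value via
# the YARN "Minimum User ID" configuration property, not by editing this file)
min.user.id=250
```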
I also needed to alter the group ownership of the temporary directory each node used for the YARN NodeManager’s user files so that YARN could write its temporary files correctly; on each node in the cluster I ran the following Linux commands as root to clear down any files YARN had created before, and recreate the directories with the correct permissions (Hive jobs would fail until I did this, with OBIEE just reporting an ODBC error):
rm -rf /yarn
mkdir -p /yarn/nm
chown -R yarn /yarn
chgrp -R yarn /yarn
Once this is done, queries from the BI Administration tool and from the OBIEE BI Server should connect to the Kerberos-secured CDH cluster successfully, using the Kerberos ticket you obtained via the MIT Kerberos Ticket Manager at login and passing across the user details under which the YARN, and then Hive, jobs should run.
If you’re interested, you can go back to the MIT Kerberos Ticket Manager and see the other Kerberos tickets that were requested and then cached by the Cloudera Hive ODBC driver when it mutually authenticated with the HiveServer2 RPC interface – Kerberos authenticates both ways to ensure that whoever you’re connecting to is actually who they say they are, in this case checking that the HiveServer2 connection you’re connecting to isn’t being spoofed by someone else.
So that’s the process for connecting OBIEE to a Kerberos-secured CDH Hadoop cluster in a nutshell; in the New Year I’ll put something together on using Apache Sentry to provide role-based access control for Hive and Impala tables and as of CDH 5.3, HDFS directories, and I’ll also take a look at the new extended ACLs feature in CDH5.2 that goes beyond HDFS’s standard POSIX security model.
Happy Christmas from Rittman Mead, and Where To Find Us in 2015
It’s the day before Christmas Eve over in the UK, and we’re finishing up for the Christmas break and looking forward to seeing friends and family over the next few days. To all of our readers, customers, friends and partners, have a great Holiday season and New Year, and we’re looking forward to seeing you at various events and on customer projects in 2015.
If you’re thinking of submitting an abstract for the Rittman Mead BI Forum 2015, the call for papers is now open, with abstracts accepted through to January 18th, 2015. In addition, as well as the BI Forum in May you can also catch up with us at these events in the first half of the New Year:
- BIWA Summit 2015, San Francisco January 27th – 29th 2015
Francesco Tisiot, Jordan Meyer, Daniel Adams, Andy Rocha and I will be presenting on Big Data SQL, ODI & OBIEE; data science for Oracle professionals; and Oracle BICS and data visualisation, amongst other topics
- Riga Dev Day, Riga, Latvia, 22nd January 2015
Robin Moffatt and I will be speaking on OBIEE, source control and release management, and Robin will deliver his award-winning session on OBIEE performance optimization
- Oracle Gebruikersclub Holland, Utrecht, 14th January 2015
This is an evening session for the Oracle Users’ Club Holland where I’ll be talking about our project experiences delivering Big Data and BI projects on Oracle Big Data Appliance, Exadata and Exalytics
- Middle East Oracle University Expert Summit, Dubai, Feb 15th-19th 2015 [PDF]
I’m one of nine speakers at this event, and I’ll be speaking about OBIEE development “best practices” on the first day, and then OBIEE futures on the second
- OUGN Conference, March 12th-14th 2015
Robin Moffatt, myself and others will be speaking at the OUGN Conference in March, on topics around OBIEE and Big Data
- Collaborate’15, Las Vegas, April 12th-16th 2015
- Rittman Mead BI Forum 2015, 6th-8th May 2015 (Brighton), 13th-15th May 2015 (Atlanta)
- ODTUG KScope’15, Hollywood Florida, June 21st-25th 2015
Look out for details of these events in early 2015 as agendas and sessions get finalised – and don’t forget to submit your BI Forum 2015 abstract by January 18th, 2015, with final agenda and speaker details to follow in February 2015.
That’s it for now though – have a great Christmas and New Year, and see you in 2015!