Thoughts on Running OBIEE in the Cloud : Part 1 – The BI Platform
One of the major announcements at this year’s Oracle Openworld in San Francisco was around “OBIEE in the cloud”. Part of a wide-ranging set of announcements from Oracle about cloud services and infrastructure, the idea with this service is that you’ll be able to access an Oracle-hosted future version of OBIEE running in Oracle’s public cloud, so that you can create dashboards and reports without having to perpetually license lots of software, or stand up lots of infrastructure in your data centre. But of course “cloud” is a hot topic at the moment, and lots of our customers are asking us about options to run OBIEE in the cloud, what’s involved, and how we deal with the stuff that surrounds an OBIEE installation – the data, the ETL, the security, and so forth. So I thought I’d put a couple of blog posts together to go through what options are available to OBIEE customers looking to deploy in the cloud, and see what the state of the industry is around this requirement.
I’ll start off, then, by looking at the most obvious element – the BI platform itself. If you want to run OBIEE “in the cloud”, what are your basic options?
Probably the conceptually simplest option is just to run the OBIEE server applications in a cloud-based, hosted environment, for example Amazon EC2. In this instance, OBIEE runs as it would do normally, but instead of the host server being in your datacenter, it’s in a public cloud. This type of setup has been available for some time in the form of services such as Oracle On-Demand, but the difference here is that you’re using a public cloud service: there are no service contracts to set up, usually no minimum contract size, and everything to do with security, failover, maintenance and so on is down to you. We’ve been doing this sort of thing for a long time, typically for our cloud-based training environments, or for ad-hoc development work where it makes more sense for a server to be centrally available on the internet, rather than sitting on someone’s laptop or on an internal server. One potential complication here is licensing – for products such as the Oracle Database, there are formal schemes such as Amazon RDS where either the license is included, or there are set metrics or arrangements to either bring your own license, or buy them based on virtual CPUs. If you’re prepared to take care of the licensing, and all of the security and maintenance aspects, this is an interesting approach.
In practice though, any setup like this is going to be what’s termed a “hybrid” cloud deployment; at least the administration side of OBIEE (the BI Administration Tool) is still going to be “on-premise”, as there’s no way you can deploy that “thin-client”, unless you create a Windows VM in the cloud too, and run the BI Administration tool from there. Moreover, your data sources are more than likely to still be on-premise, with just a few at this point hosted in the cloud, so most likely your OBIEE-in-the-cloud architecture will look like this:
There are other things to think about too: how do you do your ETL when your target data warehouse might be in the cloud, or your data sources might be cloud-based; how do you connect to your LDAP server, and so forth – I’ll cover these in the next postings in the series. But for now, this is conceptually simple, and its main benefit is avoiding the need to host your own servers, buy your own hardware, and so forth.
So what if you want to use OBIEE, but you don’t want the hassle even of setting up your own servers in a public cloud? This is where Oracle’s new “BI in the Cloud” service comes in – the idea here is that you swipe your credit card, fill in your details, self-provision your system, and then start loading data into it.
Whilst the underlying technology is core OBIEE, it’s designed for departmental, power user-type scenarios where the scope of data is limited, and the focus really is on ease-of-use, self-service and “software-as-a-service”. Looking at Oracle’s slides from Openworld, you can see the focus is quite different to on-premise OBIEE – the competition here for Oracle is the various internet startups, and products like Tableau, that make it easy to get started, provide consumer-class user interfaces, and focus on the single user rather than enterprise-type requirements.
But this focus on simplicity means a limitation in functionality as well. At the start, at least, Oracle’s BI in the Cloud will only offer Answers and Dashboards; no KPIs, BI Publisher, Maps or Delivers. Moreover, at least in its first iteration, it’ll only support Oracle’s own database schema-as-a-service as the sole data source, so no hybrid cloud/on-premise federated queries, no Essbase, and no ETL tools – hence the focus on single-user, departmental use-cases.
What you do get with this setup though is a cloud-native, fully-managed service where patching is taken care of for you, all the new features appear first, and the administration element has been re-thought to be more appropriate for deployment to the cloud – in particular, a new version of the BI Administration tool that’s cloud-native, focused on simple use-cases, and doesn’t require desktop client installs or VPN connections through to the OBIEE server. The diagram below shows the architecture for this option, and you can see it’s all pretty self-contained; great for simplicity, but maybe a bit limiting at least in the initial iteration.
The third major variant that I can see around OBIEE in the cloud is where partners (such as Rittman Mead) offer shared access to their cloud OBIEE installations, either as part of a development package or through some sort of multi-tenant reporting product. For example, we’re about to launch a service we’re calling “ExtremeBI in the Cloud”, where we combine our “ExtremeBI” agile development method with cloud-based development server hosting, with the cloud element there to make it easier for clients to start on a project *today*. Endeca Information Discovery is another product that could benefit from this approach, as the infrastructure behind an Endeca install can be fairly substantial, but users are typically more focused on the data modelling and report-building element. In our case, the back-end cloud hosting will typically be done on a service like Amazon EC2 or VMWare’s Hybrid Cloud product, but the focus is more on the development piece – cloud is there to provide business agility.
The other third-party approach to this centres around multi-tenancy; typically, a partner or company will provide a reporting or analytics element to a wider product set, and use OBIEE as its embedded dashboarding tool as part of this. In the past, setting up multi-tenancy for OBIEE was fairly difficult, to the point where most companies set up separate installs for each end-customer, but 11.1.1.7 brought a number of multi-tenant features that presumably were put there to support Oracle’s own cloud product. The product docs describe the new multi-tenant features well, but the basic premise is that a single RPD and catalog are still set up, but OBIEE can then run in a “multi-tenant” mode where GUIDs demarcate each tenant, a new BIGlobalAdministrator role is created that assumes what were the old BIAdministrator role’s privileges, with other administrator roles then set up with limited privileges and no access to Fusion Middleware Control, for example.
What this does give you though is the framework to set up tenant groupings of users, separate areas of the catalog for each tenant, additional identity store attributes to hold tenant details, and a security framework that enables more limited forms of administrator account more suited to tenant-type situations. Again, not all OBIEE features are available when running in multi-tenant mode, and presumably we’ll see this feature evolve over time as more requirements come in from Oracle’s own cloud BI product, but it’s a feature you can use now if you’re looking to set up a similar type of environment.
So that’s the three basic options if you want to run OBIEE in the cloud; host it in a public cloud service like Amazon EC2 but then just run it like regular OBIEE; go for Oracle’s upcoming BI-in-the-cloud service, if the use-case suits you and you can live with the limitations, or consider one of the services from partners such as ourselves where we bundle cloud hosting of OBIEE up with a larger service offering like our “ExtremeBI in the Cloud” offer. But of course OBIEE is only one part of the overall platform – where do you store the data, and how do you get the data from wherever it normally is up into the cloud – in other words, how do we do ETL in the cloud? Check back tomorrow for the follow-up post to this one: Part 2 – Data Warehouse and ETL.
Make Use of OBIEE’s Command Line Tools with Reduced Exposure of Plain Text Passwords
Introduction
We all love a good commandline utility. It gives us that warm feeling of control and puts hairs on our chests. Either that, or it means we can script the heck out of a system, automate many processes, and concentrate on more important matters.
However, some of the OBIEE commandline utilities can’t be used in Production environments at many sites because they need the credentials for OBIEE stored in a plain-text file. Passwords in plain-text are bad, mmmm’kay?
Two utilities in particular that it’s a shame can’t be scripted up and deployed in Production because of this limitation are the Presentation Services Catalog Manager, and the Presentation Services Replication Agent. Both serve very useful purposes, and what I want to share here is a way of invoking them more securely.
Caveat
IANAC : I Am Not A Cryptographer! Nor am I a trained security professional. Always consult a security expert for the final word on security matters.
The rationale behind developing the method described below is that some sites will have a “No Plaintext Passwords” policy which flat-out prevents the use of these OBIEE utilities. However, at the same sites the use of SSH keys to enable one server to connect to another automatically is permitted. On that basis, key-based encryption for the OBIEE credentials may therefore be considered an acceptable risk. As per Culp’s 9th law of security administration, it’s all about striking the balance between enabling functionality and mitigating risk.
The method described below is, I believe, a bit more secure than plaintext credentials, but it is not totally secure. It uses key-based encryption to secure the previously-plaintext credentials that the OBI utility requires. This is one step better than plaintext alone, but is still not perfect. If an attacker gained access to the machine they could still decrypt the file, because the key is held on the machine without a passphrase to protect it. The risk here is that we are using security by obscurity (because the OBIEE credentials are in an encrypted file it appears secure, even though the key is held locally), and like the emperor’s new clothes, if someone takes the time to look closely enough there is still a security vulnerability.
My final point on this caveat is that you should always bear in mind that if an attacker gains access to your OBIEE machine then they will almost certainly be able to do whatever they want regardless, including decrypting the weblogic superuser credentials or resetting the password to one of their own choosing.
Overview
Two new shiny tools I’ve acquired recently and am going to put to use here are GnuPG (gpg) and mkfifo. GPG provides key-based encryption and decryption and is available by default on common Linux distributions including Oracle Linux. mkfifo is also commonly available and is a utility that creates named pipes, enabling two unrelated processes to communicate. For a detailed description and advanced usage of named pipes, see here.
Setting up a secure credentials file
This is a one-time setup activity. We create a key in gpg, and then encrypt the plain text credentials file using it.
The first step is to create a gpg key, using gpg --gen-key. You need to specify a “Real name” to associate with the key – I just used “obiee”. Make sure you don’t specify a passphrase (otherwise you’ll be back in the position of passing plain text credentials around when you use this script).
$ gpg --gen-key
gpg (GnuPG) 1.4.5; Copyright (C) 2006 Free Software Foundation, Inc.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions. See the file COPYING for details.
Please select what kind of key you want:
(1) DSA and Elgamal (default)
(2) DSA (sign only)
(5) RSA (sign only)
Your selection?
DSA keypair will have 1024 bits.
ELG-E keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048) 2048
[...]
Real name: obiee
Email address:
Comment:
[...]
You don't want a passphrase - this is probably a *bad* idea!
I will do it anyway. You can change your passphrase at any time,
using this program with the option "--edit-key".
[...]
gpg: key 94DF4ABA marked as ultimately trusted
public and secret key created and signed.
Once this is done, you can encrypt the credentials file you need for the utility. For example, the Catalog Manager credentials file has the format:
login=weblogic
pwd=Password01
To encrypt it, use gpg --encrypt:
gpg --recipient obiee --output saw_creds.gpg --encrypt saw_creds.txt
Now remove the plaintext password file
rm saw_creds.txt
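As a small aside, rm only unlinks the file rather than scrubbing its contents; if you want to be a little more thorough (and your filesystem makes it meaningful), you could overwrite the plaintext before deleting it. A minimal sketch, assuming GNU coreutils’ shred is available:
# Overwrite the plaintext credentials file and then delete it
shred --remove saw_creds.txt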
Using the secure credentials file
Once we have our encrypted credentials file we need a way of using it with the utility it is intended for. The main thing we’re doing is making sure we don’t expose the plaintext contents. We do this using the named pipes method:
In this example I am going to show how to use the secure credentials file with runcat.sh, the Catalog Manager utility, to purge the Presentation Services cache. However, it should work absolutely fine with any utility that expects credentials passed to it in a file (or stdin).
There is a three-step process:
1. Create a named pipe with mkfifo. This appears on a disk listing with the p bit to indicate that it is a pipe. Access to it can be controlled by the same chmod process as a regular file. With a pipe, a process can request to consume from it, and anything that is passed to it by another process will go straight to the consuming process, in a FIFO fashion. What we’re doing through the use of a named pipe is ensuring that the plain text credentials are not visible in a plain text file on the disk.
2. Invoke the OBIEE utility that we want to run. Where it expects the plaintext credentials file, we pass it the named pipe. The important bit here is that the utility will wait until it receives the input from the named pipe – so we call the utility with an ampersand so that it returns control whilst still running in the background.
3. Use gpg to decrypt the credentials file, and pass the decrypted contents to the named pipe. The OBIEE utility is already running and listening on the named pipe, so will receive (and remove from the pipe) the credentials as soon as they are passed from gpg.
The script that will do this is as follows:
# Change folder to where we're invoking the utility from
cd $FMW_HOME/instances/instance1/bifoundation/OracleBIPresentationServicesComponent/coreapplication_obips1/catalogmanager
# Create a named pipe
mkfifo cred_pipe
# Let's make sure only we can access it
chmod 600 cred_pipe
# Invoke Catalog Manager. Because we're using a named pipe, it's actually going to sit and wait until it gets input on the pipe, so we need to put the ampersand in there so that it returns control to the script
./runcat.sh -cmd clearQueryCache -online http://localhost:9704/analytics/saw.dll -credentials cred_pipe &
# Decrypt the credentials and send them to the named pipe
gpg --quiet --recipient obiee --decrypt saw_creds.gpg > cred_pipe
# Remove the named pipe
rm cred_pipe
Depending on the utility that you are invoking, you may need to customise this script. For example, if the utility reads the credentials file multiple times then the named pipes method will fail after the first read. Your options would be to write the credentials into the pipe multiple times (possibly a bit hacky), or land the plaintext credentials on disk and delete them after the utility completes (which could be less secure if the delete doesn’t get invoked).
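For illustration, here’s a rough sketch of the “write to the pipe more than once” option – the utility name is made up, and the loop count needs to match however many times your particular utility actually opens the credentials file:
# Hypothetical utility that reads its credentials file twice
mkfifo cred_pipe
chmod 600 cred_pipe
./some_obiee_utility -credentials cred_pipe &
# Feed the decrypted credentials into the pipe once per read that the utility makes
for i in 1 2; do
    gpg --quiet --recipient obiee --decrypt saw_creds.gpg > cred_pipe
done
rm cred_pipe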
Using a secure credentials file for command line arguments
Whilst the sticking point that triggered this article was utilities requiring whole files with credentials in them, it is also common to see command line utilities that want a password passed as an argument. For example, nqcmd:
nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s myscript.lsql
Let’s assume we’ve created an encrypted file containing “Password01” (using the gpg --encrypt method shown above) and saved it as password.gpg.
To invoke the utility and pass across the decrypted password, there’s no need for named pipes. Instead we can just use a normal (“unnamed”) pipe to send the output straight from gpg to the target utility (nqcmd in this example), via xargs:
gpg --batch --quiet --recipient obiee --decrypt ~/password.gpg | xargs -I GPGOUT nqcmd -d AnalyticsWeb -u weblogic -p GPGOUT -s input.lsql
Tip: xargs has a --interactive option that makes it a lot easier when developing piped commands such as the above.
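For example, you could run the same nqcmd pipeline with the prompt switched on while you’re developing it, so you can see the exact command xargs has built before it runs (note the decrypted password will be shown in the prompt, so only do this in a safe development session):
gpg --batch --quiet --recipient obiee --decrypt ~/password.gpg | \
  xargs --interactive -I GPGOUT nqcmd -d AnalyticsWeb -u weblogic -p GPGOUT -s input.lsql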
Limitations
Because there is no passphrase on the gpg key, a user who obtained access to the server would still be able to decrypt the credentials file. In many ways this is the same situation that would arise if a server was configured to use ssh-key authentication to carry out tasks or transfer files on another server.
Uses
Here are some of the utilities that the above now enables us to run more securely:
- nqcmd is a mainstay of my OBIEE toolkit, being useful for performance testing, regression testing, aggregate building, and more. Using the method above, it’s now easy to avoid storing a plaintext password in a script that calls it.
- Keeping the Presentation Catalog in sync on an OBIEE warm standby server, using Presentation Services Replication
- Purging the Presentation Services Cache from the command line (with Catalog Manager, per the above example)
- SampleApp comes with four excellent utilities that Oracle have provided; however, all but one of them requires plaintext credentials by default. If you’ve not looked at the utilities closely yet, you should! You can see them in action in SampleApp itself, or get an idea of what they do by looking at the SampleApp User Guide pages 14–17 or watching the YouTube video.
Why ODI, DW and OBIEE Developers Should Be Interested in Hadoop
Over the past few months I’ve been posting a number of articles about Hadoop, and how you can connect to it from ODI and OBIEE. From an ODI perspective, I covered Hadoop as one of a number of new data sources ODI11g could connect to, then looked at how it leveraged Hive to issue SQL-like data extraction commands to Hadoop, and how it used Oracle Hadoop connector tools to transfer Hadoop data into the Oracle Database, and directly work with data in HDFS files. For OBIEE, I went through the background to Hadoop, Hive and the other “big data” technologies, stepped through a typical Hive query session, then showed how OBIEE 11.1.1.7 could connect to Hadoop through its newly-added Hive adaptor, then finally built a proof-of-concept OBIEE connection through to Cloudera Impala, then extended that to a multi-node Hadoop cluster.
But why all this interest in Hadoop – what’s it really got to do with OBIEE and ODI, and why should you as developers be interested in what’s probably yet another niche BI/DW datasource? Well in my opinion, Hadoop is the classic disruptive technology – cheap, and starting off with far less functionality than regular, relational databases – but it’s improving fast, and to us as BI&DW developers it offers both massive benefits – significantly lower TCO for basic DW work, and support for lots of modern, internet-scale use-cases – and threats – in that if we don’t understand it and see how it can benefit our customers and end-users, we risk being left behind as technology moves on.
To my mind, there are two main ways in which Hadoop, Hive, HDFS and the other big-data ecosystem technologies are used, in the BI/DW context:
1. Standalone, with their own query tools, database tools, query languages and so forth – your typical “data scientist” use case, originating from customers such as Facebook, LinkedIn etc. In this context, there’s typically no Oracle footprint, users are pretty self-sufficient, any output we see is in the form of “insights”, marketing campaigns etc.
2. Alongside more mainstream, for example Oracle, technologies. In this instance, Hadoop, Hive, HDFS, NoSQL etc are used as complementary, and supporting, technologies to enhance existing Oracle-based data warehouses, capture processes, BI systems. In some cases, Hadoop-type technologies can replace more traditional relational ones, but mostly they’re used to make BI&DW systems more scaleable, cheaper to run, able to work with a wider range of data sources and so forth. This is the context in which Hadoop can be relevant to more traditional Oracle BI, ETL and DW developers.
To understand how this happened, let’s go through a bit of a history lesson. Five years ago or so, your typical DW+BI architecture looked like this:
The data warehouse was typically made-up of three layers – staging, foundation/ODS and performance/dimensional, with data stored in relational databases with some use made of OLAP servers, or some of the newer in-memory databases like Qlikview. But over the intervening years, the scale and types of data sources have increased, with customers now looking to store data from unstructured and semi-structured sources in their data warehouse, take in feeds from social media and other “streaming” sources, and access data in cloud systems typically via APIs, rather than traditional ETL loading. So now we end up with a data warehouse architecture that looks like this:
But this poses challenges for us. From an ETL perspective, how do we access these non-traditional sources, and once we’ve accessed them – how do we efficiently process them? The scale and “velocity” of some of these sources can be challenging for traditional ETL processes that expect to log every transformation in a database with transactional integrity and multi-version concurrency control, whilst in some cases it doesn’t make sense to try and impose a formal data structure on incoming data as you’re capturing it, instead giving it the structure when we finally need it, or when we choose to access it in a query.
And then came “Hadoop”, and its platform and tool ecosystem. At its core, Hadoop is a framework for processing, in a massively-parallel and low-cost fashion, large amounts of data using simple transformation building blocks – filtering (mapping) and aggregating (reducing). Hadoop and MapReduce came out of the US West Coast Internet scene as a way of processing web and behavioural data in the same massively-distributed way that companies provided web search and other web 2.0 activities, and a core part of it was that it was (a) open-source, like Linux and (b) cheap, both in being open-source but also because it was designed from the outset to run on low-cost, commodity hardware that’s expected to fail. Pretty much the opposite of Oracle’s business model, but also obviously very attractive to anyone looking to lower the TCO of their data warehouse system.
So as I said – the Hadoop pioneers went out and built their systems without much reference to vendors such as Oracle, IBM, Microsoft and the like, and being blunt, they won’t have much time for traditional Oracle BI&DW developers like ourselves. But those customers who are largely invested in Oracle technology, but see advantages in deploying Hadoop and big data technologies to make their systems more flexible, scaleable and cheaper to run – that’s where ODI and OBIEE’s connectivity to these technologies becomes interesting.
To take the example of customers who are looking to deploy Hadoop technologies to enhance their Oracle data warehouse – a typical architecture going down this route would look like this:
In this example, we’re using HDFS – Hadoop Distributed File System – as a pre-staging area for the data warehouse, storing incoming files cheaply, and with built-in fault tolerance, to the point where storage is so cheap that you might as well keep stuff you’re not interested in now, but you think might be interesting in the future. Using Oracle Direct Connector for HDFS, you can set up Oracle Database external tables that map onto HDFS just like any other file system, so you can extract from and otherwise work with these files without worrying about writing MapReduce jobs; and through the Oracle Data Integration Adaptor for Hadoop, you can connect ODI to these table sources as well, and work with them just like any other topology source, as I show in the slide below from my upcoming UKOUG Tech’13 session on ODI, OBIEE and Hadoop that’s running in a couple of weeks’ time in Manchester:
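To make the pre-staging idea a little more concrete, here’s a minimal sketch of landing an incoming extract file in HDFS from the command line – the paths and filenames are just made-up examples:
# Create a dated landing directory in HDFS and copy today's extract file into it
hadoop fs -mkdir -p /user/oracle/staging/sales/2013-11-25
hadoop fs -put /data/extracts/sales_20131125.csv /user/oracle/staging/sales/2013-11-25/
# Check the file has arrived (HDFS will have replicated its blocks across the cluster)
hadoop fs -ls /user/oracle/staging/sales/2013-11-25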
As well as storing data, you can also do simple filtering and transformation on that data, using the Hadoop framework. Most upfront data processing you do as part of an ETL process involves filtering out data you’re not interested in, joining data sets, grouping and aggregating data, and other large-scale data transformation tasks, before you then load it into the foundation/ODS layer and do more complex work. And this simple filtering and transformation is what Hadoop does best, on cheap hardware or even in the cloud – and if your customer is already invested in ODI and runs the rest of their ETL process using it, it’s relatively simple to add Hadoop capabilities to it, using ODI to orchestrate the data processing steps but using Hadoop to do the heavy lifting, as my slide below shows:
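As an illustration of the sort of set-based filtering and aggregation you might push down to Hadoop in that ETL flow, here’s a hedged sketch using the Hive command-line client – the table and column names are invented for the example:
# Filter and pre-aggregate raw weblog activity in Hive before loading it onwards
hive -e "
  INSERT OVERWRITE TABLE stg_page_views_daily
  SELECT to_date(view_time), page_url, COUNT(*) AS views
  FROM   raw_web_activity
  WHERE  http_status = 200
  GROUP  BY to_date(view_time), page_url;"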
Some customers, and of course Hadoop vendors, say that in reality you don’t even need the Oracle database if you’re going to build a data warehouse, or more realistically a data mart. That’s a bigger question and probably one that depends on the particular customer and circumstances, but a typical architecture that takes this approach might look like this:
In this case, ODI again has capabilities to transform data entirely within Hadoop – with ODI acting as the ETL framework and co-ordinator, but Hadoop doing the heavy lifting – and there’s always the ability to get the data out of Hadoop and into a main Oracle data warehouse, if the Hadoop system is more of a data mart or department-specific analysis. But either way – in most cases the customer is going to want to continue to use their existing BI tool, particularly if their BI strategy involves bringing together data from lots of different systems, as you can do with OBIEE’s federated query capability – giving you an overall architecture that looks like this:
So it’s this context that makes OBIEE’s connectivity to Hadoop so important; I’m not saying that someone creating a Hadoop system from scratch is going to go out and buy OBIEE as their query tool – more typically, they’ll use other open-source tools or create models in tools like R; or they might go out and buy a lightweight data visualisation tool like Tableau and use that to connect solely to their Hadoop source. But the customers we work with have typically got much wider requirements for BI, have a need for an enterprise metadata model, recognise the value of data and report governance, and (at least at present) access most of their data from traditional relational and OLAP sources. But they will still be interested in accessing data from Hadoop sources, and OBIEE’s new capability to connect to this type of data, together with closer integration with Endeca and its unstructured and semi-structured sources, addresses this need.
So there you have it – that’s why I think OBIEE and ODI’s ability to connect to Hadoop is a big deal, and it’s why I think developers using those tools should be interested in how it works, and should try and set up their own Hadoop systems and see how it all works. As I said, I’ll be covering this topic in some detail at the UKOUG Tech’13 Conference in Birmingham in a couple of weeks time, so if you’re there on the Sunday come along and I’ll try and explain how I think it all fits together.
Creating a Multi-Node Hadoop/Impala Cluster as a Datasource for OBIEE 11.1.1.7
The other day I posted an article on the blog about connecting OBIEE 11.1.1.7 to Cloudera Impala, a new “in-memory” SQL engine for Hadoop that’s much faster than Hive for interactive queries. In this example, I connected OBIEE 11.1.1.7 to the Cloudera Quickstart CDH4 VM, which comes with all the Hadoop and Cloudera tools pre-installed and configured, making it easy to get going with the Hadoop platform.
Whilst the example worked though, I couldn’t help thinking that using Impala against a single-node Hadoop install isn’t really how it’d be used in real life; in reality, if you used OBIEE in this way, you’re much more likely to be connecting to a full Hadoop cluster, with multiple server nodes handling the incoming queries and potentially gigabytes, terabytes or petabytes of data being processed. So is it possible to set up a Hadoop cluster that gets a bit nearer to this multi-node architecture, so we can practice connecting to a cluster and not a single server, and see Hadoop process our queries across all of the nodes – as we’d see in real life, given that this low-cost MPP processing is the key benefit of Hadoop as a whole?
Hadoop, as you’re probably aware, was designed from the ground up to run across multiple nodes, with those nodes typically either being small, low-cost servers, or in many cases servers running in the “cloud”. As such, you’re as likely to see Hadoop running on a cluster of Amazon EC2 servers as running on physical servers in a datacenter, and in most cases the underlying OS running on those servers is Linux – most usually, Ubuntu 64-bit. So if we want to set up our own Hadoop cluster, there are a few options open to us:
1. Get hold of a bunch of physical servers (maybe, old PCs or blade servers), install Linux and Hadoop on them, and then do the configuration and setup manually.
2. Buy a preconfigured solution – Oracle’s Big Data Appliance, for example, which has all the software pre-installed along with connectivity to ODI, Oracle Database etc
3. Spin-up a bunch of servers in the Cloud, or
4. Spin-up a bunch of Linux VMs, for example using VirtualBox or VMWare Fusion/Workstation
In the past I’ve done the Hadoop setup myself, manually, using the Hadoop distribution files available on the Hadoop website, but more recently vendors such as Hortonworks, MapR and Cloudera have put together their own Hadoop + added value tools distributions, and it’s Cloudera I’ve been looking at in most detail recently (if only because there’s a lot of ex-Oracle people there who I know, and it’s the bundled Hadoop distribution that comes with Oracle Big Data Appliance). What’s particularly good about Cloudera’s Hadoop offering is their “Cloudera Manager” utility – of which there’s a free version – and which simplifies the whole process of setting up a cluster by automating most of the process.
What’s also particularly interesting about Cloudera and Cloudera Manager is that there are a number of solutions available out there that automate the process of spinning up clusters. One of them, described in this blog post on Cloudera’s website, involves using a built-in feature in Cloudera Manager to automatically create, provision and configure X number of Amazon EC2 virtual servers, with these servers then able to handle your queries in parallel and you just paying by the hour for the compute resource you need. So let’s give it a try.
I won’t go into a tutorial or explanation of Amazon Web Services and their EC2 (“Elastic Compute Cloud”) service here, suffice to say that you can create on-demand cloud-based VMs, paying by the hour and with pricing based on the size of instance, amount of memory needed, OS needed and so forth. We use Amazon AWS and EC2 extensively within Rittman Mead for development work, training environments and so forth, and what’s particularly interesting about AWS is the fact it’s all scriptable, there’s a public API and so forth. It’s this public API that Cloudera Manager uses to provision and spin up the cluster VMs, something Cloudera Manager will automatically offer to do if it detects it’s running on Amazon EC2. Once you’ve provisioned the basic VMs, Cloudera Manager will automatically install the required Hadoop software on each of them, meaning you can just sit back and watch the install, and then at the end, log in and check it’s all working.
And it did work – compared with the problems I’d had with earlier versions of Cloudera Manager, where I’d set up the servers myself and installed Linux and Cloudera Manager by hand, it all went amazingly well – to the point where I could upload some data into Impala itself, run some queries, and make use of my cloud-based Hadoop cluster.
And for one-off tasks, particularly where you need a very large amount of compute resource for a relatively small amount of time, Amazon AWS is great, but the cost soon starts to mount up if you leave it running for too long – reckon on a cost of around $200-$400 a month for a reasonably high-specced instance, multiplied by the number of servers in your cluster.
So what are the alternatives? Well as I said before, you could set up a cluster yourself, installing the various bits of software, and potentially creating lots of VMs on a test server to host the Hadoop nodes. But another route you could take is to use one of the various “devops” tools out there to automate the build of a number of Hadoop nodes, using virtualisation tools such as VMWare or VirtualBox and a reasonably well-specced desktop or laptop. Back on the Cloudera website again, I saw a blog post and how-to just along these lines – one that used a devops tool called Vagrant to script and automate the build of the cluster, including setting up Ubuntu Linux on the VMs, and downloading and installing Cloudera Manager, just leaving the job of configuring the Hadoop cluster to us afterwards. This is the approach I finally went with in the end, and to do the same you’d need the following bits of kit and software:
- A desktop or server with a lot of RAM – I’ve used my iMac which has 32GB RAM; you could use a 16GB server or laptop but you’ll need to alter the Vagrant scripts to reflect this
- Either VirtualBox (supported by default by Vagrant) or VMWare Fusion / Workstation (which require an extra-cost plugin for Vagrant)
- Vagrant itself – a free download with install instructions here
Vagrant itself is an easy install and is available for OS X, Windows, Linux etc, and you can download the Vagrant configuration file for setting everything up from the Cloudera blog post. Then, it was a case of running the Vagrant script, and watching it create my VMs.
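Once Vagrant and a virtualisation provider are installed, kicking the whole thing off is only a couple of commands – a sketch, assuming the Vagrantfile from the Cloudera blog post is sitting in the current directory (the VM names are whatever that file defines):
# Bring up and provision all of the VMs defined in the Vagrantfile
vagrant up
# Check their status, and ssh into one if you want to poke around
vagrant status
vagrant ssh vm1    # "vm1" is a placeholder - use a machine name from your Vagrantfile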
In the terminal screenshot above, you can see the VMs being created (I started off using VirtualBox; later on I switched to VMWare Fusion), and in the screenshot below, you can see the various Cloudera packages being downloaded and installed.
Once the Vagrant set-up of the VMs was complete, I then logged into the Cloudera Manager website, and similar to how I’d done it with the EC2-based install, I just selected the other VMs to configure, chose the software components, and let the configuration complete.
At the end of the install process, I had six VMs running to provide my Hadoop cluster, each one using about 4GB of memory, and playing the following roles:
To be honest – you still need to know a bit about Hadoop, what the various bits do and so on, to get it working – but then the same is true of getting Oracle or SQL Server installed. I guess the real barrier is having a machine big enough to run multiple Hadoop server nodes – too few and you don’t really see how the query processing works – so I guess this is why the cloud / EC2 route is so popular. But for me, I’ve got the six nodes working now, along with an OBIEE Windows VM with 11.1.1.7 installed to test out the connectivity. The screenshot below shows Cloudera Manager listing out the nodes in the cluster:
whilst the screenshot below this shows the various Hadoop platform elements listed out alongside the cluster nodes (a.k.a. VirtualBox/VMWare VMs) they’re running on.
and with OS X’s Activity Monitor showing they’re (just about) comfortably running within the overall 32GB RAM in the iMac.
So – the moment of truth – let’s try out some queries. I’ll start with Hive first of all, as Hue (the web-based UI for Hadoop) has some nice tools for uploading files and creating Hive tables out of them – or of course you can use ODI and its Hadoop adapter to upload data to the cluster as part of an ETL process. To use a more meaningfully-large dataset, I unloaded some of the tables from the full Airline Delays dataset to CSV files (around 180m rows of flight leg data), and then created Hive tables out of those – the screenshot below shows data from the main flight leg fact table.
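If you’d rather script the table creation than use Hue, the equivalent from the Hive command line looks roughly like this – a sketch only, with a made-up HDFS location and a cut-down column list rather than the real flight leg schema:
# Define a Hive table over CSV files already sitting in HDFS
hive -e "
  CREATE EXTERNAL TABLE flight_legs (
    flight_date  STRING,
    origin       STRING,
    dest         STRING,
    distance     INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/user/cloudera/flight_delays/';"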
In the background, two things happen when you upload new data into Hive; first, the file containing the data is stored in Hadoop’s filesystem, called HDFS (Hadoop Distributed File System), a unix-like distributed filesystem that breaks data down into blocks, and stores the blocks redundantly across the nodes in the cluster. If we take a look at the file I uploaded with the flight delays fact table data in it, you can see that it’s been broken down into blocks as shown at the bottom of the page:
If you click on an individual block, you can also see that the block is stored primarily on one node, and then redundantly on three other nodes in the cluster.
HDFS does this for two reasons; first, by spreading the data file over multiple servers, it can take advantage of the parallel processing provided by the Hadoop framework. Second, though, this redundancy means that if any node goes down, there’s copies of the data blocks elsewhere in the cluster, giving you the ability to use low-cost, commodity hardware (or cloud-based servers) whilst still protecting uptime, and your data.
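You can see the same block and replica placement information from the command line too, if you prefer – a sketch, with a hypothetical path to the uploaded file:
# List the blocks making up the file, and which datanodes hold each replica
hadoop fsck /user/hive/warehouse/flight_legs/flight_legs.csv -files -blocks -locations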
So let’s run a query via the Hue UI, using Hive first of all. I put together a simple query that sums up flights, and averages distances, for all flights with California as the destination. As you can see from the Hue screenshot below, the query triggered two MapReduce jobs, one to find all flights with Californian destinations (the “map”), and one to aggregate the results (the “reduce”).
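The query itself was nothing fancy – something along these lines (the column names are illustrative rather than the exact dataset schema):
# Count flights into California and average the distance flown
hive -e "
  SELECT COUNT(*)        AS num_flights,
         AVG(f.distance) AS avg_distance
  FROM   flight_legs f
  JOIN   airports a ON (f.dest = a.airport_code)
  WHERE  a.state = 'CA';"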
Looking at the MapReduce jobs being spun-up, run and then results gathered in, you can see that the MapReduce element (i.e., query time) took just under three minutes.
Going over the Job Tracker / MapReduce admin pages in Cloudera Manager, you can see the MapReduce jobs that were triggered by the Hive query – see how it handles the join, and how the filtering (mapping) is handled separately to the aggregating (reducing).
You can also bring up the Hadoop task tracker page, to see how the task tracker gave out chunks of the work to the various nodes in the cluster, and then gathered the results back in at the end.
So – what about the Impala equivalent of the same query, then? Let’s give it a go. Well the query is more or less the same, but this time the results come back in around ten seconds, as we’d expect with Impala.
Looking inside Cloudera Manager, you can see the various Impala server processes working in the background, bypassing the need to generate MapReduce code and instead, using their own in-memory MPP framework to query the HDFS files and return the results.
And finally – the $64,000 question – can we connect OBIEE to the cluster? To do this, you’ll need to download the Cloudera Impala ODBC drivers, as I outlined in my previous blog post on the subject, but once you do, it should work – see the screenshot below where I’m querying the flight delays data using OBIEE 11.1.1.7.
So there you have it – a couple of ways you can spin-up your own multi-node Hadoop cluster, and confirmation that it should all still work with OBIEE once it’s put together.
Connecting OBIEE 11.1.1.7 to Cloudera Impala
A few months ago I posted a series of articles about connecting OBIEE 11.1.1.7, Exalytics and ODI to Apache Hadoop through Hive, an SQL-interface layer for Hadoop. Hadoop/Hive connectivity is a cool new feature in OBIEE 11g but suffers from the problem common to Hive – it’s actually quite slow, particularly when you’re used to split-second response times from your BI tool. The reason it’s slow is because Hive, in the background, generates MapReduce jobs which in turn query Hadoop, batch-processing-style, for each of your queries; each one of these MapReduce jobs requires a Java program to be generated, then submitted to the Hadoop job control framework and run within a Java JVM, which adds latency to your query. It’s not uncommon for a Hive query to take a minute or so to retrieve even a small set of “rows” from the Hadoop cluster, which isn’t really what it’s for – Hive and Hadoop are typically used with very large datasets spread over multiple servers – but you can’t help thinking there must be a better way to query Hadoop data.
And there is a better way – several organisations and companies have come up with improvements to Hive, the idea being to take the central idea of SQL-access-over-Hadoop, but remove the dependency on writing MapReduce jobs on the fly, and instead create separate, typically in-memory, server processes that provide similar functionality but with much improved response times. Probably the most well-known product like this is Cloudera’s “Impala”, an open-source but commercially-sponsored Hive replacement that’s available as part of Cloudera’s “Quickstart” demo VM, downloadable for free from the Cloudera website. The architecture image below is from the Cloudera Impala datasheet, which nicely explains the differences between Hive and Impala, and where it fits in as part of the overall Hadoop framework.
Now whilst Impala isn’t officially supported by Oracle as a data source, Hive is, so I thought it’d be interesting to see if we could swap-out Hive for Impala and connect more efficiently to a Hadoop datasource. I managed to get it working, with a couple of workarounds, so I thought I’d share it here – note that in a real-world installation, where the server is on Linux/Unix, or where your server isn’t on the same machine as your BI Administration client, it’s a bit more involved as you’ll need both server-side and client-side ODBC driver install and configuration.
Keeping it simple for now though, to get this working you’ll need:
- OBIEE 11.1.1.7, for either Windows or Linux – in my case, I’ve used Windows. Oracle’s recommendation is you use Linux for Hadoop access but Windows seems to work OK.
- The Cloudera Quickstart CDH4 VM – you’ll need to add some data to Impala, I’ll leave this to you – if this article makes sense to you, I’m sure you can add some sample data ;-)
- The Cloudera ODBC Driver for Impala – the Windows 64-bit ones are here, and the rest of the drivers are on this page.
Once you’ve downloaded the Quickstart VM and got Impala up and running, and set up OBIEE 11.1.1.7 on a separate server, start by installing the ODBC drivers so you’re ready to configure them. In my case, my Impala tables were held in the standard “default” schema, and my Quickstart VM was running on the hostname cdh4.rittmandev.com, so my ODBC configuration settings looked like this:
“cloudera” is the default username on the Quickstart VM, with a password also of “cloudera”, so when I press the Test… button and put in the password, I see the success message:
So far so good. So now over to the BI Administration tool, where the process to import the Impala table metadata is the same as with Hive. First, select the new Impala DSN from the list of ODBC connections, then bring in the Impala tables that you want to include in the RPD – in this case, two tables called “product” and “product_sales” that I added myself to Impala.
Next, double-click on the new physical database entry that the import just created, and set the Database type from ODBC Basic to Apache Hadoop, like this:
When you’re prompted to change the connection pool settings as well – ignore this and press No, and leave them as they are.
Then, create your business model and presentation layer subject area as you would do normally – in my case, I add a primary key to the products table, join it in the physical layer to the product_sales fact table, and then create corresponding BMM and Presentation Layer models so that it’s then ready to report on.
Running a quick test on the datasource, displaying some sample rows from the Impala tables, indicates it’s working OK.
So over to the dashboard. I run a simple query that sums up sales by product, and … it doesn’t work.
If you take a look at the logical and physical SQL that the BI Server is generating for the query, it all looks OK …
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-0] [] [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] ############################################## [[
-------------------- SQL Request, logical request hash:
2f2b87c8
SET VARIABLE QUERY_SRC_CD='Report';SELECT
0 s_0,
"Impala Sample"."products"."prod_desc" s_1,
"Impala Sample"."products"."prod_id" s_2,
"Impala Sample"."product_sales"."amt_sold" s_3
FROM "Impala Sample"
ORDER BY 1, 3 ASC NULLS LAST, 2 ASC NULLS LAST
FETCH FIRST 65001 ROWS ONLY
]]
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-23] [] [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] -------------------- General Query Info: [[
Repository: Star, Subject Area: Impala Sample, Presentation: Impala Sample
]]
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-18] [] [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] -------------------- Sending query to database named ClouderaCDH4 (id: <<10894>>), connection pool named Connection Pool, logical request hash 2f2b87c8, physical request hash ee7aff05: [[
select 0 as c1,
D1.c2 as c2,
D1.c3 as c3,
D1.c1 as c4
from
(select sum(T43766.amt_sold) as c1,
T43769.prod_desc as c2,
T43769.prod_id as c3
from
products T43769 inner join
product_sales T43766 On (T43766.prod_id = T43769.prod_id)
group by T43769.prod_id, T43769.prod_desc
) D1
order by c3, c2
]]
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-34] [] [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] -------------------- Query Status: [nQSError: 16015] SQL statement execution failed. [[
[nQSError: 16001] ODBC error state: S1000 code: 110 message: [Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] : NotImplementedException: ORDER BY without LIMIT currently not supported.
[nQSError: 43119] Query Failed:
But the problem is that Impala doesn’t like ORDER BY clauses without a corresponding LIMIT clause, something the product insists on presumably because of the potential size of Impala/Hadoop datasets and the need to contain the rows returned in-memory. If you’re running Impala queries directly within the Impala shell, you can issue the command set DEFAULT_ORDER_BY_LIMIT = x; to provide a default LIMIT value when one isn’t specified, but I can’t as yet see how to provide that setting over an ODBC connection, so the workaround I used was to uncheck the ORDERBY_SUPPORTED database setting in the physical database properties dialog, so that the BI Server leaves off the ORDER BY clause entirely, and does the result ordering itself after Impala returns the (now un-ordered) result set.
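For reference, inside the Impala shell itself the session-level workaround looks something like this – whether the option exists, and its exact name, depends on your Impala version, so treat it as a sketch:
impala-shell -i cdh4.rittmandev.com
# ...then, at the impala-shell prompt:
#   set DEFAULT_ORDER_BY_LIMIT=65001;
#   select prod_desc, sum(amt_sold)
#   from   products join product_sales on (products.prod_id = product_sales.prod_id)
#   group  by prod_desc
#   order  by prod_desc;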
Saving the RPD again now, and refreshing the results, brings back the query as expected – and in under a second or so, rather than the 30, 60 seconds etc that Hive would have taken.
So – there you go. As I said, Impala’s not officially supported which means it may work, but Oracle haven’t tested it properly and you won’t be able to raise SRs etc – but it’s an interesting alternative to Hive if you’re serious about connecting OBIEE 11g to your Hadoop datasources.