Tag Archives: Big Data
Creating a Multi-Node Hadoop/Impala Cluster as a Datasource for OBIEE 11.1.1.7
The other day I posted an article on the blog about connecting OBIEE 11.1.1.7 to Cloudera Impala, a new “in-memory” SQL engine for Hadoop that’s much faster than Hive for interactive queries. In this example, I connected OBIEE 11.1.1.7 to the Cloudera Quickstart CDH4 VM, which comes with all the Hadoop and Cloudera tools pre-installed and configured, making it easy to get going with the Hadoop platform.
Whilst the example worked, I couldn't help thinking that using Impala against a single-node Hadoop install isn't really how it'd be used in real life; in reality, if you used OBIEE in this way, you're much more likely to be connecting to a full Hadoop cluster, with multiple server nodes handling the incoming queries and potentially gigabytes, terabytes or petabytes of data being processed. So is it possible to set up a Hadoop cluster that gets a bit nearer to this multi-node architecture, so that we can practice connecting to a cluster rather than a single server, and see Hadoop process our queries across all of the nodes – as we'd see in real life, given that this low-cost MPP processing is the key benefit of Hadoop as a whole?
Hadoop, as you’re probably aware, was designed from the ground-up to run across multiple nodes, with those nodes typically either being small, low-cost servers, or in many cases servers running in the “cloud”. As such, you’re as likely to see Hadoop running on a cluster of Amazon EC2 server as running on physical servers in a datacenter, and in most cases the underlying OS running on those servers is Linux – most usually, Ubuntu 64-bit. So if we want to set up our own Hadoop cluster, there’s a few options open to us:
1. Get hold of a bunch of physical servers (maybe, old PCs or blade servers), install Linux and Hadoop on them, and then do the configuration and setup manually.
2. Buy a preconfigured solution – Oracle’s Big Data Appliance, for example, which has all the software pre-installed along with connectivity to ODI, Oracle Database etc
3. Spin-up a bunch of servers in the Cloud, or
4. Spin-up a bunch of Linux VMs, for example using VirtualBox or VMWare Fusion/Workstation
In the past I’ve done the Hadoop setup myself, manually, using the Hadoop distribution files available on the Hadoop website, but more recently vendors such as Hortonworks, MapR and Cloudera have put together their own Hadoop + added value tools distributions, and it’s Cloudera I’ve been looking at in most detail recently (if only because there’s a lot of ex-Oracle people there who I know, and it’s the bundled Hadoop distribution that comes with Oracle Big Data Appliance). What’s particularly good about Cloudera’s Hadoop offering is their “Cloudera Manager” utility – of which there’s a free version – and which simplifies the whole process of setting up a cluster by automating most of the process.
What’s also particularly interesting about Cloudera and Cloudera Manager, is that there are a number of solution available out there that automate the process of spinning-up clusters. One of them, described in this blog post on Cloudera’s website, involves using a built-in feature in Cloudera Manager to automatically create, provision and configure X number of Amazon EC2 virtual servers, with this servers then able to handle your queries in parallel and you just paying by the hour for the compute resource you need. So let’s give it a try.
I won’t go into a tutorial or explanation into Amazon Web Services and their EC2 (“Elastic Compute Cloud”) service here, suffice to say that you can create on-demand cloud-based VMs, paying by the hour and with pricing based on the size of instance, amount of memory needed, OS needed and so forth. We use Amazon AWS and EC2 extensively within Rittman Mead for development work, training environments and so forth, and what’s particularly interesting about AWS is the fact it’s all scriptable, there’s a public API and so forth. It’s this public API that Cloudera Manager uses to provision and spin-up the cluster VMs, something Cloudera Manager will automatically offer to do if it detects it’s running on Amazon EC2. Once you’ve provisioned the basic VMs, Cloudera Manager will automatically install the required Hadoop software on each of the VMs, meaning you can just sit back and watch the install, and then at the end, log in and check it’s all working.
And it did work – compared with the problems I'd had with earlier versions of Cloudera Manager, where I'd set up the servers and installed Linux and Cloudera Manager myself, it all went amazingly well – to the point where I could upload some data into Impala itself, run some queries, and make use of my cloud-based Hadoop cluster.
And for one-off tasks, particularly where you need a very large amount of compute resource for a relatively small amount of time, Amazon AWS is great, but the cost soon starts to mount up if you leave it running for too long – reckon on around $200-$400 per month for a reasonably highly-specced instance, multiplied by the number of servers in your cluster.
So what are the alternatives? Well as I said before, you could set up a cluster yourself, installing the various bits of software, and potentially creating lots of VMs on a test server to host the Hadoop nodes. But another route you could take is to use one of the various “devops” tools out there to automate the build of a number of Hadoop nodes, using virtualisation tools such as VMWare or VirtualBox and a reasonably well-specced desktop or laptop. Back on the Cloudera website again, I saw a blog post and how-to just along these lines – one that used a devops tool called Vagrant to script and automate the build of the cluster, including setting up Ubuntu Linux on the VMs, and downloading and installing Cloudera Manager, just leaving the job of configuring the Hadoop cluster to us afterwards. This is the approach I finally went with in the end, and to do the same you’d need the following bits of kit and software:
- A desktop or server with a lot of RAM – I’ve used my iMac which has 32GB RAM; you could use a 16GB server or laptop but you’ll need to alter the Vagrant scripts to reflect this
- Either VirtualBox (supported by default by Vagrant) or VMWare Fusion / Workstation (which require an extra-cost plugin for Vagrant)
- Vagrant itself – a free download with install instructions here
Vagrant itself is an easy install and is available for OS X, Windows, Linux and so on, and you can download the Vagrant configuration file for setting everything up from the Cloudera blog post. Then it was a case of running the Vagrant script, and watching it create my VMs.
In the terminal screenshot above, you can see the VMs being created (I started off using VirtualBox; later on I switched to VMWare Fusion), and in the screenshot below, you can see the various Cloudera packages being downloaded and installed.
Once the Vagrant set-up of the VMs was complete, I then logged into the Cloudera Manager website, and similar to how I’d done it with the EC2-based install, I just selected the other VMs to configure, chose the software components, and let the configuration complete.
At the end of the install process, I had six VMs running to provide my Hadoop cluster, each one using about 4GB of memory and each playing a different role in the cluster, as you'll see in the Cloudera Manager screenshots below.
To be honest, you still need to know a bit about Hadoop – what the various bits do and so on – to get it working, but then the same is true of getting Oracle or SQL Server installed. I guess the real barrier is having a machine big enough to run multiple Hadoop server nodes – too few and you don't really see how the query processing works – which is why the cloud/EC2 route is so popular. But for me, I've now got the six nodes working, along with an OBIEE Windows VM with 11.1.1.7 installed to test out the connectivity. The screenshot below shows Cloudera Manager listing out the nodes in the cluster:
whilst the screenshot below this shows the various Hadoop platform elements listed out alongside the cluster nodes (a.k.a. VirtualBox/VMWare VMs) they’re running on.
and with OS X’s Activity Monitor showing they’re (just about) comfortably running within the overall 32GB RAM in the iMac.
So – the moment of truth – let's try out some queries. I'll start with Hive first of all, as Hue (the web-based UI for Hadoop that comes with CDH) has some nice tools for uploading files and creating Hive tables out of them – or of course you can use ODI and its Hadoop Adapter and upload some data to the cluster as part of an ETL process. To use a more meaningfully-large dataset, I unloaded some of the tables from the full Airline Delays dataset to CSV files (around 180m rows of flight leg data), and then created Hive tables out of those – the screenshot below shows data from the main flight leg fact table.
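For reference, creating a Hive table over CSV files like these boils down to a couple of HiveQL statements along the lines of the sketch below – the table name, columns and file path are purely illustrative, not the exact definitions I used:

-- illustrative table name, columns and file path only
CREATE TABLE flight_delays (
  flight_year  INT,
  flight_month INT,
  dest         STRING,
  origin       STRING,
  distance     INT,
  flights      INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/cloudera/flight_delays.csv' INTO TABLE flight_delays;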
In the background, two things happen when you upload new data into Hive. First, the file containing the data is stored in Hadoop's filesystem, HDFS (Hadoop Distributed File System), a Unix-like distributed filesystem that breaks data down into blocks and stores those blocks redundantly across the nodes in the cluster; second, a table definition is registered in the Hive metastore, mapping column names and datatypes onto the contents of that file. If we take a look at the file I uploaded with the flight delays fact table data in it, you can see that it's been broken down into blocks as shown at the bottom of the page:
If you click on an individual block, you can also see that the block is stored primarily on one node, and then redundantly on three other nodes in the cluster.
HDFS does this for two reasons. First, by spreading the data file over multiple servers, it can take advantage of the parallel processing provided by the Hadoop framework. Second, this redundancy means that if any node goes down there are copies of the data blocks elsewhere in the cluster, giving you the ability to use low-cost, commodity hardware (or cloud-based servers) whilst still protecting uptime and your data.
So let’s run a query via the Hue UI, using Hive first of all. I put together a simple query that sums up flights, and averages distances, for all flights with California as the destination. As you can see from the Hue screenshot below, the query triggered two MapReduce jobs, one to find all flights with Californian destinations (the “map”), and one to aggregate the results (the “reduce”).
Looking at the MapReduce jobs being spun-up, run and then results gathered in, you can see that the MapReduce element (i.e., query time) took just under three minutes.
Going over the Job Tracker / MapReduce admin pages in Cloudera Manager, you can see the MapReduce jobs that were triggered by the Hive query – see how it handles the join, and how the filtering (mapping) is handled separately from the aggregating (reducing).
You can also bring up the Hadoop task tracker page, to see how the task tracker handed out chunks of the work to the various nodes in the cluster, and then gathered the results back in at the end.
So – what about the Impala equivalent of the same query, then? Let’s give it a go. Well the query is more or less the same, but this time the results come back in around ten seconds, as we’d expect with Impala.
Looking inside Cloudera Manager, you can see the various Impala server processes working in the background, bypassing the need to generate MapReduce code and instead, using their own in-memory MPP framework to query the HDFS files and return the results.
And finally – the $64,000 question – can we connect OBIEE to the cluster? To do this, you’ll need to download the Cloudera Impala ODBC drivers, as I outlined in my previous blog post on the subject, but once you do, it should work – see the screenshot below where I’m querying the flight delays data using OBIEE 11.1.1.7.
So there you have it – a couple of ways you can spin-up your own multi-node Hadoop cluster, and confirmation that it should all still work with OBIEE once it’s put together.
Connecting OBIEE 11.1.1.7 to Cloudera Impala
A few months ago I posted a series of articles about connecting OBIEE 11.1.1.7, Exalytics and ODI to Apache Hadoop through Hive, an SQL-interface layer for Hadoop. Hadoop/Hive connectivity is a cool new feature in OBIEE 11g but suffers from the problem common to Hive – it's actually quite slow, particularly when you're used to split-second response times from your BI tool. The reason it's slow is that Hive, in the background, generates MapReduce jobs which in turn query Hadoop, batch-processing style, for each of your queries; each one of these MapReduce jobs requires a Java program to be generated on the fly, submitted to the Hadoop job control framework and run within a JVM, which adds latency to your query. It's not uncommon for a Hive query to take a minute or so to retrieve even a small set of "rows" from the Hadoop cluster, which isn't really what it's for – Hive and Hadoop are typically used with very large datasets spread over multiple servers – but you can't help thinking there must be a better way to query Hadoop data.
And there is a better way – several organisations and companies have come up with improvements to Hive, the idea being to take the central idea of SQL-access-over-Hadoop, but remove the dependency on writing MapReduce jobs on the fly, and instead create separate, typically in-memory, server processes that provide similar functionality but with much improved response times. Probably the most well-known product like this is Cloudera’s “Impala”, an open-source but commercially-sponsored Hive replacement that’s available as part of Cloudera’s “Quickstart” demo VM, downloadable for free from the Cloudera website. The architecture image below is from the Cloudera Impala datasheet, which nicely explains the differences between Hive and Impala, and where it fits in as part of the overall Hadoop framework.
Now whilst Impala isn’t officially supported by Oracle as a data source, Hive is, so I thought it’d be interesting to see if we could swap-out Hive for Impala and connect more efficiently to a Hadoop datasource. I managed to get it working, with a couple of workarounds, so I thought I’d share it here – note that in a real-world installation, where the server is on Linux/Unix, or where your server isn’t on the same machine as your BI Administration client, it’s a bit more involved as you’ll need both server-side and client-side ODBC driver install and configuration.
Keeping it simple for now though, to get this working you’ll need:
- OBIEE 11.1.1.7, for either Windows or Linux – in my case, I’ve used Windows. Oracle’s recommendation is you use Linux for Hadoop access but Windows seems to work OK.
- The Cloudera Quickstart CDH4 VM – you’ll need to add some data to Impala, I’ll leave this to you – if this article makes sense to you, I’m sure you can add some sample data ;-)
- The Cloudera ODBC Driver for Impala – the Windows 64-bit ones are here, and the rest of the drivers are on this page.
Once you’ve downloaded the Quickstart VM and got Impala up and running, and set up OBIEE 11.1.1.7 on a separate server, start by installing the ODBC drivers so you’re ready to configure them. In my case, my Impala tables were held in the standard “default” schema, and my Quickstart VM was running on the hostname cdh4.rittmandev.com, so my ODBC configuration settings looked like this:
“cloudera” is the default username on the Quickstart VM, with a password also of “cloudera”, so when I press the Test… button and put in the password, I see the success message:
So far so good. So now over to the BI Administration tool, where the process to import the Impala table metadata is the same as with Hive. First, select the new Impala DSN from the list of ODBC connections, then bring in the Impala tables that you want to include in the RPD – in this case, two tables called “product” and “product_sales” that I added myself to Impala.
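If you want to mock up something similar, the two tables only need the handful of columns that appear in the queries later in this post – something like the sketch below, with the datatypes being my assumptions; you can then populate them from a delimited file in HDFS or with an INSERT…SELECT from another table.

-- minimal sketch of the two demo tables; datatypes are assumptions
CREATE TABLE products (
  prod_id   INT,
  prod_desc STRING
);

CREATE TABLE product_sales (
  prod_id  INT,
  amt_sold DOUBLE
);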
Next, double-click on the new physical database entry that the import just created, and set the Database type from ODBC Basic to Apache Hadoop, like this:
When you’re prompted to change the connection pool settings as well – ignore this and press No, and leave them as they are.
Then, create your business model and presentation layer subject area as you would do normally – in my case, I add a primary key to the products table, join it in the physical layer to the product_sales fact table, and then create corresponding BMM and Presentation Layer models so that it’s then ready to report on.
Running a quick test on the datasource, displaying some sample rows from the Impala tables, indicates it’s working OK.
So over to the dashboard. I run a simple query that sums up sales by product, and … it doesn’t work.
If you take a look at the logical and physical SQL that the BI Server is generating for the query, it all looks OK …
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-0] [] [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] ############################################## [[
-------------------- SQL Request, logical request hash:
2f2b87c8
SET VARIABLE QUERY_SRC_CD='Report';SELECT
0 s_0,
"Impala Sample"."products"."prod_desc" s_1,
"Impala Sample"."products"."prod_id" s_2,
"Impala Sample"."product_sales"."amt_sold" s_3
FROM "Impala Sample"
ORDER BY 1, 3 ASC NULLS LAST, 2 ASC NULLS LAST
FETCH FIRST 65001 ROWS ONLY
]]
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-23] [] [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] -------------------- General Query Info: [[
Repository: Star, Subject Area: Impala Sample, Presentation: Impala Sample
]]
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-18] [] [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] -------------------- Sending query to database named ClouderaCDH4 (id: <<10894>>), connection pool named Connection Pool, logical request hash 2f2b87c8, physical request hash ee7aff05: [[
select 0 as c1,
D1.c2 as c2,
D1.c3 as c3,
D1.c1 as c4
from
(select sum(T43766.amt_sold) as c1,
T43769.prod_desc as c2,
T43769.prod_id as c3
from
products T43769 inner join
product_sales T43766 On (T43766.prod_id = T43769.prod_id)
group by T43769.prod_id, T43769.prod_desc
) D1
order by c3, c2
]]
[2013-11-11T19:20:17.000+00:00] [OracleBIServerComponent] [TRACE:2] [USER-34] [] [ecid: ea4e409c7d956b38:-112518e2:14248823f42:-8000-00000000000001b4] [tid: c] [requestid: 60460014] [sessionid: 60460000] [username: weblogic] -------------------- Query Status: [nQSError: 16015] SQL statement execution failed. [[
[nQSError: 16001] ODBC error state: S1000 code: 110 message: [Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] : NotImplementedException: ORDER BY without LIMIT currently not supported.
[nQSError: 43119] Query Failed:
But the problem is that Impala doesn't like ORDER BY clauses without a corresponding LIMIT clause, something the product insists on presumably because of the potential size of Impala/Hadoop datasets and the need to contain the rows returned in memory. If you're running Impala queries directly within the Impala shell, you can issue the command set DEFAULT_ORDER_BY_LIMIT = x; to provide a default LIMIT value when one isn't specified, but I can't as yet see how to provide that setting over an ODBC connection. The workaround I used instead was to uncheck the ORDERBY_SUPPORTED database setting in the physical database properties dialog, so that the BI Server leaves off the ORDER BY clause entirely and does the result ordering itself after Impala returns the (now un-ordered) result set.
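To make that concrete: inside the Impala shell the setting looks like the first statement below, whilst the second shows roughly what the physical SQL becomes once ORDERBY_SUPPORTED is unchecked – the same query as before, minus the final ORDER BY (the LIMIT value here is just an example).

-- Impala shell only; the value here is just an example
set DEFAULT_ORDER_BY_LIMIT = 65001;

-- roughly what the BI Server sends with ORDERBY_SUPPORTED unchecked
select 0 as c1,
D1.c2 as c2,
D1.c3 as c3,
D1.c1 as c4
from
(select sum(T43766.amt_sold) as c1,
T43769.prod_desc as c2,
T43769.prod_id as c3
from
products T43769 inner join
product_sales T43766 On (T43766.prod_id = T43769.prod_id)
group by T43769.prod_id, T43769.prod_desc
) D1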
Saving the RPD again now, and refreshing the results, brings back the query as expected – and in under a second or so, rather than the 30, 60 seconds etc that Hive would have taken.
So – there you go. As I said, Impala’s not officially supported which means it may work, but Oracle haven’t tested it properly and you won’t be able to raise SRs etc – but it’s an interesting alternative to Hive if you’re serious about connecting OBIEE 11g to your Hadoop datasources.
Accelerating Hadoop/Hive OBIEE Queries Using Exalytics and the Summary Advisor
A few weeks ago I blogged about the new Hadoop/Hive support in OBIEE 11.1.1.7, and how it enables users to query “big data” sources with no need to know Hadoop or MapReduce. As this feature is currently based around Apache Hive, one of its drawbacks is that it can be a bit slow, at least in terms of latency and initial response time. This is because each call to Hive spins up a Java VM and creates MapReduce jobs on the fly, whereas more recent implementations of SQL-like access over Hadoop (Apache Drill, Cloudera Impala etc) bypass MapReduce and use their own in-memory engines instead.
One workaround that's worth considering though, if you've got access to an Exalytics server, is to use TimesTen for Exalytics and the Summary Advisor to cache aggregations from these Hadoop data sources into memory, giving you the lightning-fast response times you're used to with Exalytics against this new data source. I'm presenting on OBIEE, Exalytics and Hadoop at next week's ilOUG Tech Days event in Jerusalem, so I thought I'd knock up a quick demo as a proof of concept.
In this example then, I’ve got an OBIEE 11.1.1.7 repository that’s connecting to a Hadoop source via HiveODBC. OBIEE’s support for Hive access is fairly primitive at this point, so I’m sourcing all the data for my business model from a single Hive table, and modelling it into a simple two-dimensional star schema and subject area, like this:
I can then run a series of queries against this subject area, to create some entries in the S_NQ_ACCT, S_NQ_DB_ACCT and S_NQ_SUMMARY_ADVISOR tables.
Note also the logical dimension I created in the BMM layer, which has levels for all customers, salesperson and customer. The Summary Advisor will need these to be in place in order to make recommendations based on measure aggregation by level.
Response time on queries such as the one above is typically around 20-30 seconds, which generally wouldn’t be acceptable to most users running queries against a BI dashboard. Taking a look at the query in the BI Server query log, I can see the Hive query that the BI Server is sending across to the Hadoop cluster.
select avg(T44678.age) as c1,
T44678.sales_pers as c2,
sum(T44678.age) as c3,
count(T44678.age) as c4
from
dwh_customer T44678
group by T44678.sales_pers
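For reference, the dwh_customer table being queried here only needs a simple structure – something like the DDL below, where age and sales_pers are the columns used in the query above and the remaining columns are just assumptions:

-- indicative structure only; age and sales_pers are the columns queried above
CREATE TABLE dwh_customer (
  cust_id    INT,
  cust_name  STRING,
  age        INT,
  sales_pers STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';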
So let’s move over to the Exalytics Summary Advisor, which lets me focus on the Hive-related logical fact table sources for the aggregation recommendations, just like any other data source.
Running through the steps, I eventually end-up with a recommendation for a single aggregate, aggregating the various measures by the salesperson level in the logical dimension.
Running this through nqcmd, the Aggregate Persistence feature within the BI Server creates the TimesTen aggregate tables as expected.
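The script that the Summary Advisor hands to nqcmd is standard Aggregate Persistence logical SQL, along the lines of the sketch below – the business model, dimension and connection pool names are all illustrative assumptions, although the aggregate name and measures correspond to the ones queried a little later on:

create aggregates
"ag_1567321688"
for "Hive Sales"."Fact Customer"("Total Age","Cust Count","Avg Age")
at levels ("Hive Sales"."Customers"."Salesperson")
using connection pool "TimesTen for Exalytics"."Connection Pool"
in "TimesTen for Exalytics".."EXALYTICS";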
Then, moving back to the Repository, I can see the new in-memory aggregates plugged-into the Hive-related business model.
And, running the query again and checking the query log file shows TimesTen for Exalytics being queried instead, and with data now being returned instantaneously.
select distinct T44969.Total_Age0000AF82 as c1,
T44969.Cust_Count0000AF88 as c2,
T44969.Avg_Age0000AF2C as c3,
T44963.sales_pers0000AF56 as c4
from
EXALYTICS.SA_Salespe0000AF78 T44963,
EXALYTICS.ag_1567321688 T44969
where ( T44963.sales_pers0000AF56 = T44969.sales_pers0000AF56 )
The other thing to bear in mind is that, if I'd been running Hadoop on Oracle's Big Data Appliance connected to Exalytics via InfiniBand, I'd also have benefited from the extra bandwidth between the two boxes, something that might help with the aggregation-into-TimesTen part too. Not bad, and a viable solution to the latency issue with Hive and OBIEE.
Previewing the BI Forum 2013 Data Integration Masterclass
I guess it’s a British thing to not blow our own trumpet (does that translate the same over in the US?), but something I’m particularly proud about with the upcoming Rittman Mead BI Forum 2013 events is our Oracle Data Integration Masterclass, running on the Wednesday before each event properly starts, and put together by myself, Stewart Bryson and Michael Rainey. Although the main theme for the BI Forum is OBIEE, virtually every BI system that we all work with has a data warehouse of some sort underneath it, and most OBIEE professionals to one extent or another have to understand data warehousing principles, and how Oracle’s data integration tools work. So this year, we thought we’d take a deep-dive into Oracle Data Integrator and the wider Oracle Data Integration Suite, and in this preview positing I’ll be giving you a bit of a preview of what’s coming in the session – and places are still available for the US BI Forum event, and for the masterclass itself if you’ve only registered for just the main conference event.
The masterclass, made up of six sections delivered by myself, Stewart and Michael, assumes a basic understanding of data warehousing and ETL tools but otherwise gets down into the detail of what we've found works well "in the field". Stewart Bryson, Oracle ACE and Managing Director for Rittman Mead America, will open the session with an overview of ODI and the Oracle Data Integration Suite, taking a look at the product history and walking the audience through the major elements of the ODI product architecture. If you've ever wondered what agents do within ODI, why there are two repositories and where WebLogic comes into it, Stewart's session will make everything clear before we get into the rest of the detail.
Then, after coffee, Stewart will carry on and talk about what’s called the Oracle Information Management Reference Architecture, Oracle’s next-generation blueprint for data warehousing and information management that combines the best of Kimball and Inmon with new thinking around “big data” and “data discovery”. ODI and Oracle Data Integration Suite is the enabling technology for this new framework. At Rittman Mead, we use this framework for the majority of our DW customer engagements and we’ll be talking later on in the masterclass about how big data sources, for example, can be leveraged by ODI and brought into your BI environment in the same way as any other regular, relational datasource.
The third section of the masterclass sees Michael Rainey take the stage and talk to us about ODI's integration with Oracle GoldenGate, Oracle's product for real-time data replication and loading. Michael has taken part in several ODI and GoldenGate customer engagements over in the States, and has worked with Stewart on a number of custom ODI knowledge modules that make better use of this powerful data integration tool. If you've read through any of Michael's blog posts on ODI and GoldenGate and are interested in hearing a bit more detail on how it all works, as well as some real-world practical tips and tricks, this will be an invaluable session for you.
So far I’ve got away with just making the tea, but straight-after Michael is my session, where I’ll be talking about ODI and its new integration with Hadoop, NoSQL and the wider “big data” technology area. I’ve been covering ODI and Hadoop in some blog posts over the past week, but there’s only so much that I can get into a blog post and this session will be the first airing of this new material, where I’ll be demoing all the main integration points and talking about what works well, and where the main value is, with this very interesting new feature.
Then it’s back to Stewart again, where he’ll be talking about creating highly-resilient ETL code that’s also resumable, using features such as ODI 11g’s load plans and the Oracle Database’s resumable space allocation feature. Stewart and I were particularly keen to put together this session as it brings together work Stewart did a few years ago on fault-tolerant ETL in the Oracle Database, with some blog posts I put together over the 2012 Christmas break around highly-resilient ETL with ODI11g. What this session does is explain the background to the ETL resilience features in the Oracle Database, and ODI’s use of WebLogic JEE agents, and demonstrates through some custom knowledge modules how they can be brought together for your project.
Finally, Michael concludes the masterclass with a look at a feature you're probably vaguely aware of and intend to learn something about, but which sounds a bit complex: Groovy scripting and the ODI SDK. In fact, like WLST scripting for OBIEE, learning Groovy and the SDK is the key to automating tedious tasks such as mass-importing and reverse-engineering tables and files, as well as making it possible to add functionality to ODI or integrate it with other standards-based products. In a session almost entirely made up of live demos, Michael will take us through the basics of Groovy and the SDK, and show us a few examples of where this could add value to your data integration projects.
So there we have it – Brighton is now fully-booked up, but if you’ve already registered for the main event but want to come to the masterclass now too, you can log back into the registration site and update your booking to include the additional masterclass fee. Atlanta is running a week later and so still has a few main event passes left, and again if you’ve already registered for the main conference, use the link in your registration confirmation to go back in and add the masterclass to your booking. And – hopefully we’ll see you all in Brighton or Atlanta for the Rittman Mead BI Forum 2013 in the next two weeks!
OBIEE, ODI and Hadoop Part 4: Hive Data Transformation & Integration via ODI 11g
In the previous three articles in this series (and well done to everyone who's still with us), we looked at how OBIEE 11.1.1.7 and ODI 11.1.1.6+ can access Hadoop data sources via a related technology called Hive, looking in more detail in the second article at the OBIEE side, including how the OBIEE metadata import is set up and how the new HiveODBC drivers supplied by Oracle make this all possible. In the previous posting in the series, we saw how ODI can be used to populate the Hive "warehouse" tables, stored on Hadoop's HDFS (Hadoop Distributed File System) storage grid, that are then analysed via HiveQL and MapReduce; in this final post, we'll take a look at how ODI can go beyond simply loading files into Hive tables and start to do some of the data transformation and integration tasks that we regularly perform with more traditional data sources and targets.
In the previous ODI and Hive/Hadoop example, we used an ODI interface to load data from a local file into a Hive table, using the IKM File to Hive knowledge module. In the first example, columns in the source file were mapped 1:1 onto "columns" in the target Hive table, like this:
whilst in the second example, we used a Hadoop “SerDe” serializer-deserializer transformation to parse incoming weblog rows into the target Hive column format, again using this same IKM File to Hive knowledge module.
Which of course is great if you're loading data directly from files into corresponding Hive warehouse tables, but what if you've already got data in Hive and want to "reshape" or transform it, creating the equivalent of a star schema, say, out of a set of "normalised" tables?
Hive, compared to Oracle at least, is pretty simplistic in terms of the data manipulation you can do with it, and is more akin to working with Oracle external tables than a full insert-update-delete, ACID-compliant database (but then again, it was never intended to be that). You can only insert new data into Hive tables, not (ordinarily) update or delete rows, and the only way to delete data in a Hive table is to drop it and re-create it afresh. HiveQL has syntax for joins, group by and some functions, and you can drop in your own custom MapReduce scripts for the more complex stuff – which is not bad going, and probably more than enough for your average Hadoop/Big Data analyst.
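To illustrate the point, day-to-day data manipulation in Hive boils down to variations on the statements below (table names are just examples), with no UPDATE or DELETE equivalents:

-- append new rows to an existing table
INSERT INTO TABLE sales_summary
SELECT prod_id, SUM(amount) FROM sales GROUP BY prod_id;

-- replace a table's contents wholesale
INSERT OVERWRITE TABLE sales_summary
SELECT prod_id, SUM(amount) FROM sales GROUP BY prod_id;

-- the only way to "delete" data is to drop the table and start again
DROP TABLE sales_summary;
CREATE TABLE sales_summary (prod_id INT, total_amount DOUBLE);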
But the scenarios we’re considering probably won’t have a Hadoop expert around, and are typically something like:
- We’ve got some marketing or analyst-type users who want to access some behavioural, weblog or activity data sitting in Hadoop/HDFS/Hive, they’re happy users of OBIEE, and we want to be able to connect OBIEE to this data so that they can report on it – which probably means “reshaping” it somewhat to fit OBIEE’s preference for star-schema (or at least simplified, denormalized) source data models, or
- We’ve got some data sitting in Hadoop/HDFS that we’d like to add in as a new source into our data warehouse, and ODI is the preferred tool for bringing in new sources
In neither case do we have much in the way of Hadoop or MapReduce skills, so we can either (a) use OBIEE’s BI Repository to do some source data reshaping, or even better (b) do the job properly, maybe as part of a much wider ETL process, using ODI. The key thing is fitting Hadoop into ODI’s way of doing things, and giving ODI the ability to do Hadoop and Hive-specific tasks through its extensible “knowledge module” framework.
In the example above that I used to illustrate ODI's ability to work with Hive, I used the IKM File to Hive knowledge module that comes as part of the ODI Application Adaptor for Hadoop (ODIAAH); ODIAAH is one of a number of Fusion Middleware "application adapters", and is licensed separately from (but used in conjunction with) ODI EE, costing around the same as GoldenGate (around $17,000/processor, according to the current tech price list). For some more background on ODIAAH this self-study training session goes into the basics, but at a high level this adapter ships six mainly Hive-based ODI KMs that enable data loading, transformation and integration between Hadoop/Hive/HDFS and the Oracle database (amongst other sources/targets):
- IKM File to Hive (Load Data) : what we’ve been using so far, used to load Hive tables from local and HDFS-stored files
- IKM Hive Control Append : used for loading data into Hive tables, with data sourced from one or more other Hive tables – the equivalent of loading one relational table by joining, filtering and transforming other tables
- IKM Hive Transform : a variation on the previous KM that allows incoming data to be transformed via Python or Perl scripts
- IKM File-Hive to Oracle : load data from either a Hive table, or from files stored on HDFS, into an Oracle database using the separately-licensed Oracle Loader for Hadoop
- CKM Hive : a check knowledge module for Hive (which because of its “schema on read” rather than “schema on write” approach, doesn’t natively support keys or constraints)
- RKM Hive : a “reverse-engineering” KM that allows ODI to read table definitions from the Hive metastore, and use them to create ODI datastore definitions
A typical Oracle BI & DW project is going to use these KMs for a couple of main reasons. One, as stated before, is that there might be information sitting in Hadoop that the BI/DW system wants access to, and the team would otherwise have to start writing MapReduce, Pig or Sqoop code to get it into their data warehouse. The other might be to leverage Hadoop's ability to crunch and count large sets of data in massively-parallel fashion, at relatively low cost, with ODI then initiating and monitoring the process, and loading the results into a more traditional Oracle data store. Other technologies, some implemented or licensed by Oracle, also surround this process – Oracle NoSQL Database, for example – but for now let's concentrate on Hadoop and Hive, and see what else these KMs can do for us.
Starting off with IKM Hive Control Append, you might have a situation where you've already got some data in Hive tables (perhaps through loading them up using ODI's IKM File to Hive (Load Data) KM), but the data needs joining, transforming, filtering or otherwise reshaping before you can connect a tool such as OBIEE to it. In this case it doesn't make sense to use IKM File to Hive (Load Data), as that KM is for when the data sits outside Hive in source files; instead we can use IKM Hive Control Append to truncate and then insert-append new rows into an existing target Hive table.
This process works very similarly to regular table-loading IKMs (except of course that you can't insert/update into the target, only insert-append); for example, in the screenshot below, two Hive tables, one for customer information and one for salesperson information, are joined and transformed using HiveQL, with the results loaded into another Hive table.
Looking at the loading code generated by ODI for this interface, you can see that the HiveQL used to extract and load the data looks very similar to Oracle SQL, and in fact this KM is about the closest one to “regular” relational database ones out of the set of ODIAAH knowledge modules.
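As an illustration, the flavour of statement that IKM Hive Control Append ends up issuing is along these lines – the table and column names here are invented for the example:

-- illustrative only: join two source Hive tables and append the results to a target
INSERT INTO TABLE customer_salesperson_dim
SELECT c.cust_id,
       c.cust_name,
       c.age,
       s.sales_pers_name
FROM   customers c
JOIN   salesperson s
ON     (c.sales_pers_id = s.sales_pers_id)
WHERE  c.cust_id IS NOT NULL;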
Things get more interesting with IKM Hive Transform, a knowledge module that takes data from any source and loads it, via custom transformation scripts, into a target Hive table. These transformation scripts are typically written in Python or Perl, and give you the ability to write your own custom pre-processing or transformation code that (via ODI's temporary interfaces feature) can then be used as "inline views" or multi-step processes when performing more complex data loading around Hive. In the example below, a three-step data loading process first creates and then loads an Apache weblog file using IKM File to Hive (Load Data), then sessionizes it (tags the log file rows with IDs that identify all activity within a particular browser session) using IKM Hive Transform:
Taking a look at the final “Sessionize Weblog” step in more detail, you can see in the interface Flow tab that this step uses the IKM Hive Transform module, and a script called “sessionize.pl” to do the transform work.
The Perl script itself then parses through the log file information and works out the start and stop points for each individual user session, outputting the results, which are then transformed into the correct target columns by the PRE_TRANSFORM_DISTRIBUTION settings in the KM options.
Finally, the IKM File-Hive to Oracle knowledge module takes things in the other direction, extracting from Hive tables or HDFS files into an Oracle database via the Oracle Loader for Hadoop big data connector – but that's really a topic in itself and for another day, when I'd like to look in more detail at the big data connectors in general, and at how you can leverage Hadoop and HDFS from within SQL and PL/SQL commands. For now though, this concludes my look at Hadoop connectivity from within OBIEE 11.1.1.7 and ODI 11.1.1.6, but if you're like me this brings up as many questions as it answers; for example:
- How would I go about setting up my own Hadoop/Hive/HDFS development environment, and can I use Windows or does only Linux make sense?
- Just what could I do with Hadoop as a data source that I can't do with regular Oracle and file sources, in terms of scale and also complexity/speed?
- Can ODI also make use of Hadoop data loading/transformation tools like Pig, or Sqoop?
- Do any of the related/commercialized Hive/Hadoop technologies add anything to these scenarios – for example, Cloudera’s Impala (for BI metadata/reporting) or Cloudera Manager (for Hadoop administration)?
- And – is there a time/volume threshold where Hadoop makes more sense as a data processing platform than an Oracle database?
If these sound interesting to you, they’re exactly what I’ll be covering during my part of the Data Integration Masterclass at the Rittman Mead BI Forum 2013 events in Brighton and Atlanta, running in just a few weeks time. We’ve still got a few places left, so if you’re interested and want to see all this technology in action, sign-up now and I’ll hopefully see you soon.