Tag Archives: Oracle BI Suite EE
Using Oracle Big Data SQL to Add Dimensions and Attributes to Hadoop Reporting
In a previous post I looked at using Oracle’s new Big Data SQL product with ODI12c, where I used Big Data SQL to expose two Hive tables as Oracle external tables, and then join them using the BETWEEN operator, something that’s not possible with regular HiveQL. In this post I’m going to look at using Oracle Big Data SQL with OBIEE11g, to enable reporting against Hive tables without the need to use Hive ODBC drivers and to bring in reference data without having to stage it in Hive tables in the Hadoop cluster.
In this example I’ve got some webserver log activity from the Rittman Mead Blog stored as a Hive table in Hadoop, which in its raw form only has a limited amount of descriptive data and wouldn’t be all that useful to users reporting against it using OBIEE. Here are the contents of the Hive table as displayed via SQL*Developer:
When I bring this table into OBIEE, I really want to add details of the country that each user is visiting from, and also details of the category that each post referenced in the webserver logs belongs to. Tables for these reference data items can be found in an accompanying Oracle database, like this:
The idea then is to create an ORACLE_HIVE external table over the Hive table containing the log activity, and then import all of these tables into the OBIEE RPD as regular Oracle tables. Back in SQL*Developer, connected to the database that has the link set up to the Hadoop cluster via Big Data SQL, I create the external table using the new ORACLE_HIVE external table access driver:
And now with the Hive table exposed as the Oracle external table BDA_OUTPUT.ACCESS_PER_POST_EXTTAB, I can import all four tables into the OBIEE repository.
I can now create joins across the two Oracle schemas and four tables:
and then create a business model and presentation model to define a simple star schema against the combined dataset:
Once the RPD is saved and made available to the Presentation layer, I can create some simple reports against the Hive and Oracle tables. The Big Data SQL feature retrieves the Hive data using SmartScan technology running directly on the Hadoop cluster, bypassing MapReduce and filtering and projecting the data there, so that only the results dataset is returned to the Exadata server running the Oracle SQL query.
In the previous ODI12c and Big Data SQL posting, I used the Big Data SQL feature to enable a join between the Hive table and a table containing IP address range lookups using the BETWEEN operator, so that I could return the country name for each visitor to the website. I can do a similar thing with OBIEE, by first recreating the main incoming fact table source as a view over the ORACLE_HIVE external table and adding an IP integer calculation that I can then use for the join to the IP range lookup table (and also take the opportunity to convert the log-format date string into a proper Oracle DATE datatype):
and then using that to join to a new table I’ve imported from the BLOG_REFDATA Oracle schema that contains the IP range lookups:
Now I can add country as a dimension, and create reports that break down site visits by country of access.
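As a rough illustration of the logic behind that view and join, here is a minimal Python sketch of the IP-to-integer calculation, the BETWEEN-style range lookup and the date conversion. The range rows, the country labels and the log date format are assumptions made for the example; they are not taken from the actual BLOG_REFDATA tables or the Hive data.

```python
import ipaddress
from datetime import datetime

def ip_to_int(ip: str) -> int:
    """Turn a dotted-quad IPv4 address a.b.c.d into a*2^24 + b*2^16 + c*2^8 + d."""
    return int(ipaddress.IPv4Address(ip))

def parse_log_date(raw: str) -> datetime:
    """Parse a log-format timestamp (assumed here to be Apache-style)."""
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")

# Hypothetical lookup rows: (range_start, range_end, country).
# These use reserved documentation address ranges, not real country data.
IP_RANGES = [
    (ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"), "Country A"),
    (ip_to_int("198.51.100.0"), ip_to_int("198.51.100.255"), "Country B"),
]

def country_for(ip: str):
    """Equivalent of joining ON ip_integer BETWEEN range_start AND range_end."""
    ip_integer = ip_to_int(ip)
    for range_start, range_end, country in IP_RANGES:
        if range_start <= ip_integer <= range_end:
            return country
    return None  # no matching range, like an unmatched outer-join row

print(ip_to_int("192.0.2.10"), country_for("192.0.2.10"))
print(parse_log_date("22/Mar/2014:10:15:32 +0000"))
```

The point of precomputing the integer in the view is that the join condition then becomes a plain numeric range comparison, which is exactly the kind of non-equijoin that Oracle can run but regular HiveQL cannot.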
Similarly, I can break the date column in the view over the Hive external table out into its own logical dimension table, and then create some reports to show site access over time.
The final RPD then looks like this:
If you’re interested in reading more about Oracle Big Data SQL I also covered it earlier on the blog around the launch date, with this post introducing the feature and another looking at how it extends Oracle security over your Hadoop cluster.
Top 10 Rittman Mead Blog Posts from 2014
It’s the afternoon of New Year’s Eve over in the UK, so to round the year off here’s the top 10 blog posts from 2014 from the Rittman Mead blog, based on Google Analytics stats (page views for 2014 in brackets, only includes articles posted in 2014)
- Using Sqoop for Loading Oracle Data into Hadoop on the BigDataLite VM – Mark Rittman, March 22, 2014 (8466)
- OBIEE Dashboard prompt: at least one mandatory – Gianni Ceresa, March 17th 2014 (7683)
- Thoughts on Using Amazon Redshift as a Replacement for an Oracle Data Warehouse – Peter Scott, February 20th 2014 (6993)
- The Secret Life of Conditional Formatting in OBIEE – Gianni Ceresa, March 26th 2014 (5606)
- Trickle-Feeding Log Files to HDFS using Apache Flume – Mark Rittman, May 18th 2014 (5494)
- The State of the OBIEE11g World as of May 2014 – Mark Rittman, May 12th 2014 (4932)
- Date formatting in OBIEE 11g – setting the default Locale for users – Robin Moffatt, February 12th 2014 (4840)
- Automated Regression Testing for OBIEE – Robin Moffatt, Jan 23rd 2014 (4040)
- OBIEE 11.1.1.7, Cloudera Hadoop & Hive/Impala Part 2 : Load Data into Hive Tables, Analyze using Hive & Impala – Mark Rittman, Jan 18th 2014 (3439)
- Introduction to Oracle BI Cloud Service : Product Overview – Mark Rittman, Sep 22nd 2014 (3190)
In all, the blog in one form or another has been going for 10 years now, and our most popular post of all time over the same period is Robin Moffatt’s “Upgrading OBIEE to 11.1.1.7” – well done Robin. To everyone else, have a Happy New Year and a prosperous 2015, and see you next year when it all starts again!
Connecting OBIEE11g on Windows to a Kerberos-Secured CDH5 Hadoop Cluster using Cloudera HiveServer2 ODBC Drivers
In a few previous posts and magazine articles I’ve covered connecting OBIEE11g to a Hadoop cluster, using OBIEE 11.1.1.7 and Cloudera CDH4 and CDH5 as the examples. Things get a bit complicated in that the DataDirect Apache Hive ODBC drivers that Oracle ship are only for HiveServer1 and not the HiveServer2 version that CDH4 and CDH5 use, and the Linux version of OBIEE 11.1.1.7 won’t work with the Cloudera Hive ODBC drivers that you have to use to connect to Hive on CDH4/5. You can however connect OBIEE 11.1.1.7 on Windows to HiveServer2 on CDH4 and CDH5 if you use the Cloudera Hive ODBC drivers for Windows. Although this isn’t supported by Oracle, in my experience it does work, albeit with the general OBIEE11g Hive restrictions and caveats detailed in the Metadata Repository Builder’s Guide, and the fact that in practice Hive is too slow to use for ad-hoc reporting.
However … most enterprise-type customers who run Hadoop on their internal networks have their clusters configured as “secured”, rather than the unsecured cluster examples that you see in most OBIEE connection examples. By default, Hadoop clusters are very trusting of incoming network and client connections and assume that whoever’s connecting is who they say they are, and HDFS and the other cluster components don’t perform any authentication themselves of incoming client connections. In addition, by default all network connections between Hadoop cluster components run in clear text and without any mutual authentication, which is great for a research cluster or PoC but not really appropriate for enterprise customers looking to use Hadoop to store and analyse customer data.
Instead, these customers configure their clusters to run in secured mode, using Kerberos authentication to secure incoming connections, encrypt network traffic and secure connections between the various services in the cluster. How this affects OBIEE though is that your Hive connections through to the cluster also need to use Kerberos authentication, and you (and the OBIEE BI Server) need to have a valid Kerberos ticket when connecting through the Hive ODBC driver. So how do we set this up, and how do we get hold of a secure Hadoop cluster using Kerberos authentication to test against? A few of our customers have asked this question recently, so I thought it’d be worth jotting down a few notes on how to set this up.
At a high level, if you want to connect OBIEE 11.1.1.7 to a secure, Kerberos-authenticated CDH cluster, there are three main steps you need to carry out:
- Get hold of a Kerberos-secured CDH cluster, and establish the connection details you’ll need to use to connect to it
- Make sure the Kerberos server has the correct entries/principals/user details for the user you’re going to securely-connect as
- Configure the host environment for OBIEE to work with Kerberos authentication, and then create the connection from OBIEE to the CDH cluster using the correct Kerberos credentials for your user
In my case, I’ve got a Cloudera CDH5.3.0 cluster running in the office that’s been configured to use MIT Kerberos 5 for authentication, set up using an OEL6 VM as the KDC (Key Distribution Centre) and the cluster configured using the new Kerberos setup wizard that was introduced with CDH5.1. Using this wizard automates the creation of the various Kerberos service account and host principals in the Kerberos database, and configures each of the cluster components – YARN, Hive, HDFS and so on – to authenticate with each other using Kerberos authentication and use encrypted network connections for inter-service and inter-node communication.
Along with the secured Hadoop cluster, key bits of information and configuration data you’ll need for the OBIEE side are:
- The krb5.conf file from the Kerberos KDC, which contains details of the Kerberos realm, URL for the KDC server, and other key connection details
- The name of the Kerberos principal used for the Hive service name on the Hadoop cluster – typically this is “hive”; if you want to connect to Hive first using a JDBC tool such as beeline, you’ll also need the full principal name for this service, in my case “hive/bda3node2.rittmandev.com@RITTMANDEV.COM”
- The hostname (FQDN) of the node in the CDH cluster that contains the HiveServer2 RPC interface that OBIEE connects to, to run HiveQL queries
- The port that HiveServer2 is running on – typically this is “10000” – and the Hive database name (for example, “default”)
- The name of the Kerberos realm you’ll be connecting to – for example, MYCOMPANY.COM or in my case, RITTMANDEV.COM (usually in capitals)
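To show how these pieces of information fit together, here is a small Python sketch that assembles them into the HiveServer2 JDBC URL that a tool such as beeline expects. The function name and its parameters are hypothetical helpers for illustration only; the URL layout follows the connection string used later in this post.

```python
def hive_jdbc_url(host: str, realm: str, service_host: str,
                  port: int = 10000, database: str = "default",
                  service_name: str = "hive") -> str:
    """Build a Kerberos-enabled HiveServer2 JDBC URL.

    The full service principal has the form name/fqdn@REALM, for example
    hive/bda3node2.rittmandev.com@RITTMANDEV.COM.
    """
    principal = f"{service_name}/{service_host}@{realm}"
    return f"jdbc:hive2://{host}:{port}/{database};principal={principal}"

url = hive_jdbc_url(host="bda3node2",
                    realm="RITTMANDEV.COM",
                    service_host="bda3node2.rittmandev.com")
print(url)
```

Note that the principal in the URL is the Hive service’s principal, not your own; your own identity comes from the Kerberos ticket you obtain separately.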
In my case, the krb5.conf file that is used to configure Kerberos connections to my KDC looks like this – in your company it might be a bit more complex, but this example defines a simple MIT Kerberos 5 domain:
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log
[libdefaults]
 default_realm = RITTMANDEV.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true
[realms]
 RITTMANDEV.COM = {
  kdc = auth.rittmandev.com
  admin_server = auth.rittmandev.com
 }
[domain_realm]
 .rittmandev.com = RITTMANDEV.COM
 rittmandev.com = RITTMANDEV.COM
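If you want to pull the realm and KDC details out of a file like this programmatically, for example to double-check them before configuring a client, a minimal Python sketch might look like the following. It is a hypothetical helper that handles only the simple subset of the krb5.conf syntax shown here (sections, key = value pairs and one level of braces, one setting per line), keeps only the last value when a key is repeated, and is not a replacement for the real Kerberos libraries.

```python
SAMPLE = """
[libdefaults]
 default_realm = RITTMANDEV.COM
[realms]
 RITTMANDEV.COM = {
  kdc = auth.rittmandev.com
  admin_server = auth.rittmandev.com
 }
[domain_realm]
 .rittmandev.com = RITTMANDEV.COM
"""

def parse_krb5_conf(text: str) -> dict:
    """Parse a simple krb5.conf into nested dicts keyed by section name."""
    config = {}
    stack = []  # innermost dict currently being filled is stack[-1]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            stack = [config.setdefault(line[1:-1], {})]  # new section
        elif line == "}":
            stack.pop()  # close a "name = { ... }" block
        elif stack:
            key, _, value = (part.strip() for part in line.partition("="))
            if value == "{":
                stack.append(stack[-1].setdefault(key, {}))  # open a block
            else:
                stack[-1][key] = value
    return config

conf = parse_krb5_conf(SAMPLE)
realm = conf["libdefaults"]["default_realm"]
print(realm, conf["realms"][realm]["kdc"])
```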
In my setup, the CDH Hadoop cluster has been configured to use Kerberos authentication for all communications between cluster components and any connections from the outside that use those components; the cluster itself though can still be accessed via unsecured (non-Kerberos authenticated) SSH, though of course this aspect could be secured too. To test out the Hive connectivity before we get into the OBIEE details you can use the beeline CLI that ships with CDH5, and to do this you’ll need to be able to SSH into one of the cluster nodes (if you’ve not got beeline installed on your own workstation) and you’ll need an account (principal) created for you in the Kerberos database to correspond to the Linux user and HDFS/Hive user that has access to the Hive tables you’re interested in. To create such a Kerberos principal for my setup, I used the kadmin.local command on the KDC VM to create a user that matched my Linux/HDFS username and gave it a password:
kadmin.local: addprinc mrittman
WARNING: no policy specified for mrittman@RITTMANDEV.COM; defaulting to no policy
Enter password for principal "mrittman@RITTMANDEV.COM":
Re-enter password for principal "mrittman@RITTMANDEV.COM":
Principal "mrittman@RITTMANDEV.COM" created.
SSH’ing into one of the secure CDH cluster nodes, I first have to authenticate using the kinit command which, when successful, creates a Kerberos ticket that gets cached for a set amount of time and that beeline can thereafter use as part of its own authentication process:
officeimac:.ssh markrittman$ ssh mrittman@bda3node4
mrittman@bda3node4's password:
[mrittman@bda3node4 ~]$ kinit -p mrittman
Password for mrittman@RITTMANDEV.COM:
[mrittman@bda3node4 ~]$
Now I can use beeline, and pass the Hive service principal name in the connection details along with the usual host, port and database name. When beeline prompts for my username and password, I use the Kerberos principal name that matches the Linux/HDFS one, and enter that principal’s password:
[mrittman@bda3node4 ~]$ beeline
Beeline version 0.13.1-cdh5.3.0 by Apache Hive
beeline> !connect jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM
scan complete in 2ms
Connecting to jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM
Enter username for jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM: mrittman
Enter password for jdbc:hive2://bda3node2:10000/default;principal=hive/bda3node2.rittmandev.com@RITTMANDEV.COM: ********
Connected to: Apache Hive (version 0.13.1-cdh5.3.0)
Driver: Hive JDBC (version 0.13.1-cdh5.3.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://bda3node2:10000/default> show tables;
+------------------+--+
|     tab_name     |
+------------------+--+
| posts            |
| things_mrittman  |
+------------------+--+
2 rows selected (0.162 seconds)
0: jdbc:hive2://bda3node2:10000/default> select * from things_mrittman;
+---------------------------+-----------------------------+--+
| things_mrittman.thing_id  | things_mrittman.thing_name  |
+---------------------------+-----------------------------+--+
| 1                         | Car                         |
| 2                         | Dog                         |
| 3                         | Hat                         |
+---------------------------+-----------------------------+--+
3 rows selected (0.251 seconds)
In this particular example we’re using Windows to host OBIEE 11.1.1.7, as this is the only platform on which we can get the HiveServer2 ODBC drivers to work, in this case the Cloudera Hive ODBC drivers available on their website (free download but registration may be needed). Before we can get this ODBC driver to work though, we need to install the Kerberos client software on the Windows machine so that we can generate the Kerberos ticket that the ODBC driver will need to pass over as part of the authentication process.
To configure the Windows environment for Kerberos authentication, in my case I used the Kerberos for Windows 4.x client software downloadable for free from the MIT website and copied across the krb5.conf file from the KDC server, renaming it to krb5.ini and storing it in the default location of c:\ProgramData\MIT\Kerberos5.
You also need to define a system environment variable, KRB5CCNAME, to point to a directory where the Kerberos tickets can be cached; in my case I used c:\temp\krb5cache. Once this is done, reboot the Windows environment and you should then be prompted after login to authenticate yourself to the Kerberos KDC.
The ticket then stays valid for a set number of days/hours, or you can configure OBIEE itself to authenticate and cache its own ticket – for now though, we’ll create the ticket manually and connect to the secured cluster using these cached ticket details.
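Before moving on to the ODBC driver, it can be worth sanity-checking the Windows-side Kerberos setup described above. The following Python sketch is a hypothetical helper, not part of MIT Kerberos for Windows or OBIEE; it only checks that the krb5.ini file exists and that KRB5CCNAME is set to a location whose parent folder exists, using the paths from my setup as defaults.

```python
import os
from pathlib import Path

# Default location used in this post; adjust for your own install.
DEFAULT_INI = r"c:\ProgramData\MIT\Kerberos5\krb5.ini"

def check_kerberos_client(ini_path: str = DEFAULT_INI, env=None) -> list:
    """Return a list of problems found; an empty list means the basics are in place."""
    env = os.environ if env is None else env
    problems = []
    if not Path(ini_path).is_file():
        problems.append(f"missing Kerberos config file: {ini_path}")
    cache = env.get("KRB5CCNAME")
    if not cache:
        problems.append("KRB5CCNAME environment variable is not set")
    elif not Path(cache).parent.is_dir():
        problems.append(f"parent folder of KRB5CCNAME does not exist: {cache}")
    return problems

for problem in check_kerberos_client():
    print("WARNING:", problem)
```

This does not prove that a valid ticket has been cached, only that the files and settings the ticket manager relies on are where they should be.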
After installing the Cloudera Hive ODBC drivers, I create the connection using Kerberos as the Authentication Mechanism, and enter the realm name, HiveServer2 host and the Hive Kerberos principal name, like this:
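For reference, the same settings can be written down as a DSN-less ODBC connection string, which is handy for testing from a script before involving OBIEE. The Python sketch below only builds the string; the key names (AuthMech, KrbRealm, KrbHostFQDN, KrbServiceName, HiveServerType) and the driver name are as I recall them from the Cloudera driver documentation and should be treated as assumptions to verify against the install guide for your driver version. OBIEE itself uses the DSN created in the ODBC administrator rather than a string like this.

```python
def hive_odbc_connection_string(host: str, realm: str, krb_host_fqdn: str,
                                port: int = 10000, schema: str = "default",
                                service_name: str = "hive",
                                driver: str = "Cloudera ODBC Driver for Apache Hive") -> str:
    """Assemble a Kerberos-authenticated HiveServer2 ODBC connection string.

    Key names are assumed from the Cloudera Hive ODBC driver documentation.
    """
    settings = [
        ("Driver", driver),
        ("Host", host),
        ("Port", str(port)),
        ("Schema", schema),
        ("HiveServerType", "2"),   # HiveServer2 rather than HiveServer1
        ("AuthMech", "1"),         # 1 = Kerberos in the Cloudera driver
        ("KrbRealm", realm),
        ("KrbHostFQDN", krb_host_fqdn),
        ("KrbServiceName", service_name),
    ]
    return ";".join(f"{key}={value}" for key, value in settings)

conn_str = hive_odbc_connection_string(host="bda3node2.rittmandev.com",
                                       realm="RITTMANDEV.COM",
                                       krb_host_fqdn="bda3node2.rittmandev.com")
print(conn_str)
```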
In my case both the BI Administration tool and the OBIEE BI Server were on the same Windows VM, and therefore shared the same ODBC driver install, so I then moved over to the BI Administration tool to import the Hive table metadata details into the RPD and create the physical, logical and presentation layer RPD elements. Depending on how your CDH cluster is set up you might be able to test the connection now by using the View Data… menu item in BI Administration, but in my case I had to do two more things on the CDH cluster itself before I could get Hive queries under this Kerberos principal to run properly.
First, as secured CDH Hadoop clusters usually configure HiveServer2 to use “user impersonation” (connecting to Hive as the user you authenticate as, not the user that HiveServer2 authenticates to the Hive service as), YARN and MapReduce jobs run under your account and not the usual “Hive” account that unsecured Hive connections use. This causes a problem on CDH installations on RHEL-derived platforms (RHEL, OEL, CentOS etc): YARN normally blocks jobs running on behalf of users with a UID below 1000 (as on other Linux distributions this typically signifies a system account), but RHEL starts user UIDs at 500, so YARN blocks those users from running jobs. To fix this, you need to go into Cloudera Manager and edit the YARN configuration settings to lower this UID threshold to something under 500, for example 250:
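The effect of that threshold is easy to illustrate. The Python sketch below is a simplified model of the check, not YARN’s actual code, and the usernames and UIDs are hypothetical; it simply lists which accounts would be refused for a given minimum UID.

```python
def blocked_users(users: dict, min_user_id: int = 1000) -> list:
    """Return the users whose UID falls below the minimum YARN will run jobs for."""
    return sorted(name for name, uid in users.items() if uid < min_user_id)

# Hypothetical accounts: on a RHEL 6-style system, the first regular user gets UID 500.
users = {"mrittman": 500, "analyst1": 501, "yarn": 485}

print(blocked_users(users))                   # default threshold of 1000 blocks everyone here
print(blocked_users(users, min_user_id=250))  # lowered threshold lets them all through
```

Lowering the threshold to 250 rather than 0 keeps the protection in place for low-numbered accounts such as root while letting ordinary users with UIDs from 500 upwards submit jobs.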
I also needed to alter the group ownership of the temporary directory each node used for the YARN NodeManager’s user files so that YARN could write its temporary files correctly; on each node in the cluster I ran the following Linux commands as root to clear down any files YARN had created before, and recreate the directories with the correct permissions (Hive jobs would fail until I did this, with OBIEE just reporting an ODBC error):
rm -rf /yarn
mkdir -p /yarn/nm
chown -R yarn /yarn
chgrp -R yarn /yarn
Once this is done, queries from the BI Administration tool and from the OBIEE BI Server should connect to the Kerberos-secured CDH cluster successfully, using the Kerberos ticket you obtained using the MIT Kerberos Ticket Manager on login and then passing across the user details under which the YARN and then Hive jobs should run.
If you’re interested, you can go back to the MIT Kerberos Ticket Manager and see the other Kerberos tickets that were requested and then cached by the Cloudera Hive ODBC driver when it mutually authenticated with the HiveServer2 RPC interface – Kerberos authenticates both ways to ensure that who you’re connecting to is actually who they say they are, in this case checking the HiveServer2 connection you’re connecting to isn’t being spoofed by someone else.
So that’s the process for connecting OBIEE to a Kerberos-secured CDH Hadoop cluster in a nutshell; in the New Year I’ll put something together on using Apache Sentry to provide role-based access control for Hive and Impala tables and as of CDH 5.3, HDFS directories, and I’ll also take a look at the new extended ACLs feature in CDH5.2 that goes beyond HDFS’s standard POSIX security model.
OBIEE and ODI on Hadoop : Next-Generation Initiatives To Improve Hive Performance
The other week I posted a three-part series (part 1, part 2 and part 3) on going beyond MapReduce for Hadoop-based ETL, where I looked at a typical Apache Pig dataflow-style ETL process and showed how Apache Tez and Apache Spark can potentially make these processes run faster and make better use of in-memory processing. I picked Pig as a data processing environment as the multi-step data transformations it creates translate into lots of separate MapReduce jobs in traditional Hadoop ETL environments, but run as a single DAG (directed acyclic graph) under Tez and Spark, which can potentially use memory to pass intermediate results between steps rather than writing all those intermediate datasets to disk.
But tools such as OBIEE and ODI use Apache Hive to interface with the Hadoop world, not Pig, so it’s improvements to Hive that will have the biggest immediate impact on the tools we use today. What’s interesting is the development work that’s going on around Hive in this area, with four different “next-generation Hive” initiatives under way that could end up making OBIEE and ODI on Hadoop run faster:
- Hive-on-Tez (or “Stinger”), principally championed by Hortonworks, along with Stinger.next which will enable ACID transactions in HiveQL
- Hive-on-Spark, a more limited port of Hive to run on Spark and backed by Cloudera amongst others
- Spark SQL within Apache Spark, which enables SQL queries against Spark RDDs (and Hive tables), and exposes a HiveServer2-compatible Thrift Server for JDBC access
- Vendor initiatives that build on Hive but are mainly around integration with their RDBMS engines, for example Oracle Big Data SQL
Vendor initiatives like Oracle’s Big Data SQL and Cloudera Impala have the benefit of working now (and are supported), but usually come with some sort of penalty for not working directly within the Hive framework. Oracle’s Big Data SQL, for example, can read data from Hive (very efficiently, using Exadata SmartScan-type technology) but then can’t write back to Hive, and currently pulls all the Hive data into Oracle if you try and join Oracle and Hive data together. Cloudera’s Impala, on the other hand, is lightning-fast and works directly on the Hadoop platform, but doesn’t support the same ecosystem of SerDes and storage handlers that Hive supports, taking away one of the key flexibility benefits of working with Hive.
So what about the attempts to extend and improve Hive, or include Hive interfaces and compatibility in Spark? In most cases an ETL routine written as a series of Hive statements isn’t going to be as fast or resource-efficient as a custom Spark program, but if we can make Hive run faster or have a Spark application masquerade as a Hive database, we could effectively give OBIEE and ODI on Hadoop a “free” platform performance upgrade without having to change the way they access Hadoop data. So what are these initiatives about, and how usable are they now with OBIEE and ODI?
Probably the most ambitious next-generation Hive project is the Stinger initiative, backed by Hortonworks and based on the Apache Tez framework that runs on Hadoop 2.0 and YARN. Stinger aimed first to port Hive to run on Tez (which runs MapReduce jobs but enables them to potentially run as a single DAG), and then add ACID transaction capabilities so that you can UPDATE and DELETE from a Hive table as well as INSERT and SELECT, using a transaction model that allows you to roll back uncommitted changes (diagram from the Hortonworks Stinger.next page).
Tez is more of a set of developer APIs rather than the full data discovery / data analysis platform that Spark aims to provide, but it’s a technology that’s available now as part of the Hortonworks HDP2.2 platform and, as I showed in the blog post a few days ago, an existing Pig script that you run as-is on a Tez environment typically runs twice as fast as when it’s using MapReduce to move data around (with usual testing caveats applying, YMMV etc). Hive should be the same as well, giving us the ability to take Hive transformation scripts and run them unchanged except for specifying Tez at the start as the execution engine.
Hive on Tez is probably the first of these initiatives we’ll see working with ODI and OBIEE, as ODI has just been certified for Hortonworks HDP2.1, and the new HDP2.2 release is the one that comes with Tez as an option for Pig and Hive query execution. I’m guessing ODI will need to have its Hive KMs updated to add a new option to select Tez or MapReduce as the underlying Hive execution engine, but otherwise I can see this working “out of the box” once ODI support for HDP2.2 is announced.
Going back to the last of the three blog posts I wrote on going beyond MapReduce, many in the Hadoop industry back Spark as the successor to MapReduce rather than Tez, as it’s a more mature implementation that goes beyond the developer-level APIs that Tez aims to provide to make Pig and Hive scripts run faster. As we’ll see in a moment Spark comes with its own SQL capabilities and a Hive-compatible JDBC interface, but the other “swap-out-the-execution-engine” initiative to improve Hive is Hive on Spark, a port of Hive that allows Spark to be used as Hive’s execution engine instead of just MapReduce.
Hive on Spark is at an earlier stage in development than Hive on Tez with the first demo being given at the recent Strata + Hadoop World New York, and specific builds of Spark and versions of Hive needed to get it running. Interestingly though, a post went on the Cloudera Blog a couple of days ago announcing an Amazon AWS AMI machine image that you could use to test Hive on Spark, which though it doesn’t come with a full CDH or HDP installation or features such as a HiveServer JDBC interface, comes with a small TPC-DS dataset and some sample queries that we can use to get a feeling for how it works. I used the AMI image to create an Amazon AWS m3.large instance and gave it a go.
By default, Hive in this demo environment is configured to use Spark as the underlying execution engine. Running a couple of the TPC-DS queries first using this Spark engine, and then switching back to MapReduce by running the command “set hive.execution.engine=mr” within the Hive CLI, I generally found queries using Spark as the execution engine ran 2-3x faster than the MapReduce ones.
You can’t read too much into this timing as the demo AMI is really only to show off the functional features (Hive using Spark as the execution engine) and no work on performance optimisation has been done, but it’s encouraging even at this point that it’s significantly faster than the MapReduce version.
Long-term the objective is to have both Tez and Spark available as options as execution engines under Hive, along with MapReduce, as the diagram below from a presentation by Cloudera’s Szehon Ho shows; the advantage of building on Hive like this rather than creating your own new SQL-on-Hadoop engine is that you can make use of the library of SerDes, storage handlers and so on that you’d otherwise need to recreate for any new tool.
The third major SQL-on-Hadoop initiative I’ve been looking at is Spark SQL within Apache Spark. Unlike Hive on Spark, which aims to swap out the compiler and execution engine parts of Hive but otherwise leave the rest of the product unchanged, Apache Spark as a whole is a much more freeform, flexible data query and analysis environment that’s aimed more at analysts than business users looking to query their dataset using SQL. That said, Spark has some cool SQL and Hive integration features that make it an interesting platform for doing data analysis and ETL.
In my Spark ETL example the other day, I loaded log data and some reference data into RDDs and then filtered and transformed them using a mix of Scala functions and Spark SQL queries. Running on top of the set of core Spark APIs, Spark SQL allows you to register temporary tables within Spark that map onto RDDs, and gives you the option of querying your data using either familiar SQL relational operators, or the more functional programming-style Scala language.
You can also create connections to the Hive metastore though, and create Hive tables within your Spark application for when you want to persist results to a table rather than work with the temporary tables that Spark SQL usually creates against RDDs. In the code example below, I create a HiveContext as opposed to the sqlContext that I used in the example on the previous blog, and then use that to create a table in my Hive database, running on a Hortonworks HDP2.1 VM with Spark 1.0.0 pre-built for Hadoop 2.4.0:
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.hql("CREATE TABLE posts_hive (post_id int, title string, postdate string, post_type string, author string, post_name string, generated_url string) row format delimited fields terminated by '|' stored as textfile")
scala> hiveContext.hql("LOAD DATA INPATH '/user/root/posts.psv' INTO TABLE posts_hive")
If I then go into the Hive CLI, I can see this new table listed there alongside the other ones:
hive> show tables;
OK
dummy
posts
posts2
posts_hive
sample_07
sample_08
src
testtable2
Time taken: 0.536 seconds, Fetched: 8 row(s)
What’s even more interesting is that Spark also comes with a HiveServer2-compatible Thrift Server, making it possible for tools such as ODI that connect to Hive via JDBC to run Hive queries through Spark, with the Hive metastore providing the metadata but Spark running as the execution engine.
This is subtly different to Hive-on-Spark as Hive’s metastore, support for SerDes and storage handlers runs under the covers but Spark provides you with a full programmatic environment, making it possible to just expose Hive tables through the Spark layer, or mix and match data from RDDs, Hive tables and other sources before storing and then exposing the results through the Hive SQL interface. For example then, you could use Oracle SQL*Developer 4.1 with the Cloudera Hive JDBC drivers to connect to this Spark SQL Thrift Server and query the tables just like any other Hive source, but crucially the Hive execution is being done by Spark, rather than MapReduce as would normally happen.
Like Hive-on-Spark, Spark SQL and Hive support within Spark SQL are at early stages, with Spark SQL not yet being supported by Cloudera whereas the core Spark API is. From the work I’ve done with it, it’s not yet possible to expose Spark SQL temporary tables through the HiveServer2 Thrift Server interface, and I can’t see a way of creating Hive tables out of RDDs unless you stage the RDD data to a file in-between. But it’s clearly a promising technology and if it becomes possible to seamlessly combine RDD data and Hive data, and expose Spark RDDs registered as tables through the HiveServer2 JDBC interface it could make Spark a very compelling platform for BI and data analyst-type applications. Oracle’s David Allen, for example, blogged about using Spark and the Spark SQL Thrift Server interface to connect ODI to Hive through Spark, and I’d imagine it’d be possible to use the Cloudera HiveServer2 ODBC drivers along with the Windows version of OBIEE 11.1.1.7 to connect to Spark in this way too – if I get some spare time over the Christmas break I’ll try and get an example working.
Rittman Mead BI Forum 2015 Call for Papers Now Open!
I’m very pleased to announce that the Call for Papers for the Rittman Mead BI Forum 2015 is now open, with abstract submissions open to January 18th 2015. As in previous years the BI Forum will run over consecutive weeks in Brighton, UK and Atlanta, GA, with the provisional dates and venues as below:
- Brighton, UK : Hotel Seattle, Brighton, UK : May 6th – 8th 2015
- Atlanta, GA : Renaissance Atlanta Midtown Hotel, Atlanta, USA : May 13th-15th 2015
Now in its seventh year, the Rittman Mead BI Forum is the only conference dedicated entirely to Oracle Business Intelligence, Oracle Business Analytics and the technologies and processes that support it – data warehousing, data analysis, data visualisation, big data and OLAP analysis. We’re looking for sessions around tips & techniques, project case-studies and success stories, and sessions where you’ve taken Oracle’s BI products and used them in new and innovative ways. Each year we select around eight-to-ten speakers for each event along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January, and editorial input from myself, Jon Mead, Charles Elliott and Jordan Meyer.
Last year we had a big focus on cloud, and a masterclass and several sessions on bringing Hadoop and big data to the world of OBIEE. This year we’re interested in project stories and experiences around cloud and Hadoop, and we’re keen to hear about any Oracle BI Apps 11g implementations or migrations from the earlier 7.9.x releases. Getting back to basics we’re always interested in sessions around OBIEE, Essbase and data warehouse data modelling, and we’d particularly like to encourage session abstracts on data visualization, BI project methodologies and the incorporation of unstructured, semi-structured and external (public) data sources into your BI dashboards. For an idea of the types of presentations that have been selected in the past, check out the BI Forum 2014, 2013 and 2012 homepages, or feel free to get in touch via email at mark.rittman@rittmanmead.com.
The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues if you can speak at both. All sessions this year will be 45 minutes long, and we’ll be publishing submissions and inviting potential attendees to vote on their favourite sessions towards the end of January. Other than that – have a think about abstract ideas now, and make sure you get them in by January 18th 2015.