Tag Archives: Oracle Data Integrator

Rittman Mead BI Forum 2015 Now Open for Registration!

I’m very pleased to announce that the Rittman Mead BI Forum 2015, running in Brighton and Atlanta in May 2015, is now open for registration.

Back for its seventh successful year, the Rittman Mead BI Forum once again will be showcasing the best speakers and presentations on topics around Oracle Business Intelligence and data warehousing, with two events running in Brighton, UK and Atlanta, USA in May 2015. The Rittman Mead BI Forum is different to other Oracle tech events in that we keep the numbers attending limited, topics are all at the intermediate-to-expert level, and we concentrate on just one topic – Oracle Business Intelligence Enterprise Edition, and the technologies and products that support it.


As in previous years, the BI Forum will run on two consecutive weeks, starting in Brighton and then moving over to Atlanta the following week. Here are the dates and venue locations:

This year our optional one-day masterclass will be delivered by Jordan Meyer, our Head of R&D, and myself, on the topic of “Delivering the Oracle Big Data and Information Management Reference Architecture” that we launched last year at our Brighton event. Details of the masterclass, and the speaker and session line-up at the two events, are on the Rittman Mead BI Forum 2015 homepage.

Each event has its own agenda, but both will focus on the technology and implementation aspects of Oracle BI, DW, Big Data and Analytics. Most of the sessions run for 45 minutes, but on the first day we’ll be holding a debate and on the second we’ll be running a data visualization “bake-off” – details on these, the masterclass, the keynotes and our special guest speakers will be revealed on this blog over the next few weeks – watch this space!

Data Integration Tips: ODI – Use Scenarios in packages


Here is another question I often hear during ODI Bootcamp, or often read in the ODI Space on OTN:
In a package, should I directly use mappings (or interfaces prior to 12c), or is it better to use the scenarios generated from these mappings?

An ODI package is a workflow used to sequence the different steps of execution of a data integration process. Some of these steps might change the value of a variable, evaluate a variable, or perform an administrative task like sending an email or retrieving a file from an FTP server. But the core step is the execution of a mapping, and this can be done in two different ways.


Direct Use

It is possible to directly drag and drop a mapping in the package.

Mapping in a Package

In this example the mapping M_FACT_SALES_10 is directly executed within the package, in the same session. This means that the values of the variables will be the same as the ones in the package.

Execution of a package holding a mapping


If there are several mappings, they will be executed sequentially; there is no parallelism here. (Damn, I will never get the double L at the right place in that word… Thanks, autocorrect!)

The code for the mapping is generated using the current state of the mapping in the work repository, including any changes made since the package was created. But if a scenario is created for the package PCK_FACT_LOAD_MAP, it will need to be regenerated – or a new version generated – to take any change to M_FACT_SALES_10 into account.

One good point for this approach is that we can build a data integration process that will be entirely rolled back in case of failure. If one of the steps fails, the data will be exactly as it was before the execution started. This can be done by using the same transaction for all the mappings and disabling the commit option for all of them but the last one. So it’s only when everything succeeds that the changes are committed and available for everyone. Actually, it’s not even required to commit on the last step, as ODI will issue a commit at the end of a successful session anyway. Of course, this technique only works if you don’t have any DDL statements on the target tables. You will find more information about it in this nice blog post from Bhabani Rajan: Transaction Control in ODI. Thanks to David Allan for reminding me of this!
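To make that concrete, here is a minimal plain-SQL sketch of what the pattern amounts to, with hypothetical staging and fact tables (in ODI itself this is driven by the Transaction and Commit options of the mappings, not hand-written SQL):

-- Both loads run in the same transaction; nothing is visible to
-- other sessions until the final COMMIT.
INSERT INTO FACT_SALES (sale_id, amount)
  SELECT sale_id, amount FROM STG_SALES;      -- commit option disabled

INSERT INTO FACT_RETURNS (return_id, amount)
  SELECT return_id, amount FROM STG_RETURNS;  -- commit option disabled

COMMIT; -- only reached if every step succeeded; a failure before this
        -- point rolls back and leaves the target exactly as it was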


Using Scenarios

But you can also generate a scenario and then drag and drop it in a package.

Scenario in a package

Sessions and variables

In this case, the execution of the mapping will not be done within the package session. Instead, one step of this session will use the OdiStartScen command to trigger a new session to execute the scenario.

Execution of a package holding a scenario

We can see here that the second step of the package session (401) has only one task, which runs the command to start a new session. The only step in the new session (402) is the mapping, and it has the same three tasks as in the previous example. Because this is a different session, you could choose to execute it using another context or agent. My colleague Michael brought an interesting use case to me: when one step of your package must extract data from a file on a file server that has its own standalone agent installed to avoid permission issues, you can specify that agent in the scenario step of your package. The session for that scenario will then be started on that standalone agent, while all the other sessions use the execution agent of the parent session.


But what about variables configured to keep no history? As this is a different session, their values are lost outside the scope of a single session. We therefore need to pass them as startup parameters to the scenario of the mapping.

Passing variables as startup parameters

To do this, I go to the Command tab on the scenario step and add my variables there with the syntax

"-<PROJECT_CODE>.<VARIABLE_NAME>=<VALUE_TO_ASSIGN>"

where <VALUE_TO_ASSIGN> can be the value of a variable in the current session itself (401 in our case).
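As an illustration, with a hypothetical DEMO project and V_LOAD_DATE variable, the full OdiStartScen call in the Command tab could end up looking like the line below; the # prefix passes the value of the variable from the current session (401) to the child session:

OdiStartScen "-SCEN_NAME=M_FACT_SALES_10" "-SCEN_VERSION=-1" "-SYNC_MODE=1" "-DEMO.V_LOAD_DATE=#DEMO.V_LOAD_DATE"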

Code executed

The scenario step name was originally “Execution of the Scenario M_FACT_SALES_10 version 001” and I slightly modified it to remove the mention of a version number. As I want to always execute the latest version of the scenario, I also changed the version in the parameters and set it to -1 instead of 001. There is a small bug in the user interface: if you click somewhere else, it will change back to 001. The workaround is to press the Return (or Enter) key after typing your value.

So by using a scenario, you can choose which version of the scenario you want to execute: either a specific version (e.g. 003) or the latest one, using the value -1. It also means that you won’t break anything if the package is executed while you are working on the mapping. The package will still use the frozen code of the scenario even if you have changed the mapping.

If you create a scenario for the package PCK_FACT_LOAD_SCEN, there is no need to regenerate it when a new scenario is (re-)generated for the mapping. By using -1, it will reference the newly generated scenario.

Asynchronous Execution

Another advantage of using scenarios is that they support parallelism (Yes, I got it right!!).

Asynchronous scenario executions

Here I set the “Synchronous / Asynchronous” option to “Asynchronous Mode” for my two scenarios, so the two sessions start at the same time. By adding an OdiWaitForChildSession step, I can wait for the end of all the child sessions before doing anything else. This step also defines in which cases an error should be reported: by default, it’s when all the child sessions are in error, but I usually change that parameter to zero so that any failure causes my package execution to fail as well.
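For reference, the command behind such a step is a sketch along these lines, mirroring the settings described above (the parent session number defaults to the current session anyway, and the error threshold is set so that failed child sessions fail the step):

OdiWaitForChildSession "-PARENT_SESS_NO=<%=odiRef.getSession("SESS_NO")%>" "-MAX_CHILD_ERROR=0" "-POLL_INT=5"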

getPrevStepLog()

Just a short note: be careful when using the method getPrevStepLog() from the substitution API in a step that follows the execution of a scenario. That method retrieves information about the previous step in the current session – that is, about the OdiStartScen command execution, not about the execution of the scenario itself.


Summary

Here is a small recap table to compare both approaches:

comparison table

In conclusion, development is generally more robust when using scenarios in packages. There is more control over what is executed and over the parallelism. The only good reason to use mappings directly is to keep everything within the same transaction.

Also keep in mind that Load Plans might be a better alternative to packages, with better parallelism, error handling and restartability – unless you need loops or a persistent session. A comparison between Load Plans and Packages can be found in the Oracle blog post introducing Load Plans.


More tips and demystification of core principles coming soon, so stay tuned. And follow @mRainey, @markrittman and myself – @JeromeFr – on Twitter to be informed of anything happening in the ODI world.

Endeca Information Discovery Integration with Oracle BI Apps 11.1.1.8.1

One of the new features that made its way into Oracle BI Apps 11.1.1.8.1, and that completely passed me by at the time, was integration with Oracle Endeca Information Discovery 3.1 via a new ODI11g plug-in. As this new integration brings Endeca’s “faceted search” and semi-structured data analysis capabilities to the BI Apps, it’s actually a pretty big deal, so let’s take a quick look now at what this integration brings and how it works under the covers.

Oracle BI Apps 11.1.1.8.1 integration with Oracle Endeca Information Discovery 3.1 requires OEID to be installed either on the same host as BI Apps or on its own server, and uses a custom ODI11g KM called “IKM SQL to Endeca Server 3.1.0” that needs to be downloaded separately from Oracle’s eDelivery site. The KM download also comes with an ODI Technology XML import file that adds Endeca Server as a technology type in the ODI Topology Navigator, and it’s here that the connection to the OEID 3.1 server is set up for subsequent data loads from BI Apps 11g.


As the IKM has “SQL” as its loading technology, any source that can be extracted from using standard JDBC drivers can be used to load the Endeca Server, and the KM itself has options for creating the schema to go with the data upload, creating the Endeca Server record spec, specifying the collection name and so on. BI Apps 11.1.1.8.1 ships with a number of interfaces (mappings) and ODI load plans to populate Endeca Server domains for specific BI Apps fact groups, with each interface taking all of the relevant repository fact measures and dimension attributes and loading them into one big denormalized Endeca Server schema.


When you take a look at the physical (flow) tab for the interface, you can see that the data is actually extracted via the BI Apps RPD and the BI Server, then staged in the BI Apps DW database before being loaded into the Endeca Server domain. Column metadata in the BI Server repository is then used to define the datatypes, names, sizes and other details of the Endeca Server domain schema, an improvement over just pointing Endeca Information Discovery Integrator at the BI Apps database schema.


The 11.1.1.8.1 ODI repository also provides a pre-built Endeca Server load plan that allows you to turn-on and turn-off loads for particular fact groups, and sets up the data domain name and other variables needed for the Endeca Server data load.


Once you’ve loaded the BI Apps data into one or more Endeca Server data domains, there are a number of pre-built sample Endeca Studio applications you can use to get a feel for the capabilities of the Endeca Information Discovery Studio interface. For example, the Student Information Analytics Admissions and Recruiting sample application lets you filter on any attribute from the dataset on the left-hand side, and then instantly returns the filtered dataset, displayed in the form of a number of graphs, word clouds and other visualizations.


Another interesting sample application is around employee expense analysis. Using features such as word clouds, it’s relatively easy to see which expense categories are used the most, and you can use Endeca Server’s search and text-parsing capabilities to extract keywords and other useful bits of information out of free-text areas, for example the supporting text for expense claims.


Typically, you’d use Endeca Information Discovery as a “dig around the data, let’s look for something interesting” tool, with all of your data attributes listed in one place and automatic filtering and focusing on the filtered dataset using the Endeca Server in-memory search and aggregation engine. The sample applications that ship with 11.1.1.8.1 are just meant as a starting point (and, unlike the regular BI Apps dashboards and reports, aren’t themselves supported as an official part of the product), but as you can see from the Manufacturing Endeca Studio application below, there’s plenty of scope to create returns, production and warranty-type applications that let you get into the detail of the attribute and textual information held in the BI Apps data warehouse.


More details on the setup process are in the online docs, and if you’re new to Endeca Information Discovery take a look at our past articles on the blog on this product.

Data Integration Tips: ODI – One Data Server with several Physical Schemas

Yes, I’m hijacking the “Data Integration Tips” series of my colleague Michael Rainey (@mRainey) and I have no shame!

DISCLAIMER
This tip is intended for newcomers to the ODI world and is valid for all versions of ODI. It’s nothing new; it has been posted by other authors on different blogs. But I see so many people struggling with this on the ODI Space on OTN that I wanted to explain it in full detail, with all the context and in my own words. So next time I can just post a link to this instead of explaining it from scratch.

The Problem

I’m loading data from one schema to another on the same Oracle database, but it’s slower than when I write a SQL insert statement manually. The bottleneck of the execution is in the steps of the LKM SQL to SQL. What should I do?

Why does it happen?

Loading Knowledge Modules (LKMs) are used to load data from one Data Server to another. A LKM usually connects to both the source and the target Data Server and executes some steps on each of them. This is required when working with different technologies or different database instances. So if we define two Data Servers to connect to our two database schemas, we will need a LKM.

In this example, I will load a star schema model into the HR_DW schema, using the HR schema from the same database as the source. Let’s start with the approach using two Data Servers. Note that here we use the database schemas directly to connect to our Data Servers.

Two Data Servers connecting to the same database instance, using directly the database schema to connect.

And here are the definitions of the physical schemas:

Physical Schemas

Let’s build a simple mapping using LOCATIONS, COUNTRIES and REGIONS as sources, denormalizing them into a single flattened DIM_LOCATIONS table. We will use left outer joins to be sure we don’t miss any location, even if it has no associated country or region. We will populate LOCATION_SK from a sequence and use an SCD2 IKM.

Mapping - Logical tab

If we check the Physical tab, we can see two different Execution Groups. This means the Datastores are in two different Data Servers and therefore a LKM is required. Here I used LKM SQL to SQL (Built-In), which is quite a generic one, not particularly designed for Oracle databases. Performance might be better with a technology-specific KM, like LKM Oracle to Oracle Pull (DB Link). By choosing the right KM we can leverage technology-specific concepts – here, Oracle database links – which often improve performance. But still, we shouldn’t need any database link, as everything lies in the same database instance.

Mapping - Physical tab


Another issue is that temporary objects needed by the LKM and the IKM are created in the HR_DW schema. These objects are the C$_DIM_LOCATIONS table, created by the LKM to bring the data into the target Data Server, and the I$_DIM_LOCATIONS table, created by the IKM to detect when a new row is needed or when a row needs to be updated according to the SCD2 rules. Even though these objects are deleted in the clean-up steps at the end of the mapping execution, it would be better to use another schema for these temporary objects instead of the target schema, which we want to keep clean.

The Solution

If the source and target Physical Schemas are located on the same Data Server – and the technology can execute code – there is no need for a LKM. So it’s a good idea to reuse the same Data Server as much as possible for data coming from the same place. Actually, the Oracle documentation about setting up the topology recommends creating an ODI_TEMP user/schema on any RDBMS and using it to connect.

This time, let’s create only one Data Server with two Physical Schemas under it, and let’s map them to the existing Logical Schemas. Here I will use the name ODI_STAGING instead of ODI_TEMP, because I’m using the excellent ODI Getting Started virtual machine and it’s already in there.

One Data Server with two Physical Schemas under it

As you can see in the Physical Schema definitions, no other password is provided to connect with HR or HR_DW directly. At run-time, our agent will only use one connection, to ODI_STAGING, and execute all the code through it, even when it needs to populate HR_DW tables. This means we need to be sure that ODI_STAGING has all the required privileges to do so.

Physical schemas

Here are the privileges I had to grant to ODI_STAGING:

-- read access to the source tables in HR
GRANT SELECT ON HR.LOCATIONS TO ODI_STAGING;
GRANT SELECT ON HR.COUNTRIES TO ODI_STAGING;
GRANT SELECT ON HR.REGIONS TO ODI_STAGING;

-- full DML on the target dimension, plus the sequence feeding LOCATION_SK
GRANT SELECT, INSERT, UPDATE, DELETE ON HR_DW.DIM_LOCATIONS TO ODI_STAGING;
GRANT SELECT ON HR_DW.DIM_LOCATIONS_SEQ TO ODI_STAGING;

Let’s now open our mapping again and go to the Physical tab. We now have only one Execution Group and there is no LKM involved. The generated code is a simple INSERT AS SELECT (IAS) statement, selecting directly from the HR schema and loading into the HR_DW schema without any database link. Data is loaded faster and our first problem is addressed.

Mapping - Physical tab without LKM
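The generated statement is roughly of the following shape – a simplified sketch with illustrative DIM_LOCATIONS column names, ignoring the SCD2 flag handling done through the I$ table, but showing the single cross-schema statement with no database link:

INSERT INTO HR_DW.DIM_LOCATIONS
  (LOCATION_SK, LOCATION_ID, CITY, COUNTRY_NAME, REGION_NAME)
SELECT HR_DW.DIM_LOCATIONS_SEQ.NEXTVAL,  -- surrogate key from the sequence
       LOC.LOCATION_ID,
       LOC.CITY,
       CTR.COUNTRY_NAME,
       REG.REGION_NAME
FROM   HR.LOCATIONS LOC
LEFT OUTER JOIN HR.COUNTRIES CTR ON LOC.COUNTRY_ID = CTR.COUNTRY_ID
LEFT OUTER JOIN HR.REGIONS   REG ON CTR.REGION_ID  = REG.REGION_ID;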

Now let’s tackle the second issue we had, with temporary objects being created in the HR_DW schema. If you scroll up to the Physical Schema definitions, you can see that I used ODI_STAGING as the Work Schema in all my Physical Schemas for that Data Server. This way, all the temporary objects are created in ODI_STAGING instead of the source or target schema. We can also be sure that we won’t have any issue with missing privileges, because our agent connects directly as ODI_STAGING.

So you can see that using a single Data Server has a lot of advantages when sources come from the same place. We get rid of the LKM, and the schema used to connect can also be used as the Work Schema, so we keep the other schemas clean, without any temporary objects.

The only thing you need to remember is to give the right privileges to ODI_STAGING (or ODI_TEMP) on all the objects it needs to handle. If your IKM has a step to gather statistics, you might also want to grant ANALYZE ANY. If you need to truncate a table before loading it, you have two approaches. You can grant DROP ANY TABLE to ODI_STAGING, but this might be a dangerous privilege to give in production. A safer way is to create a stored procedure ODI_TRUNCATE in each target database schema. This procedure takes a table name as a parameter and truncates that table using an EXECUTE IMMEDIATE statement. Then you can grant execute on that procedure to ODI_STAGING and edit your IKM step to execute that procedure instead of using the truncate syntax.
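A minimal sketch of that procedure for the HR_DW schema could look like this (the name and behaviour follow the description above; DBMS_ASSERT is my addition, to keep the dynamic SQL safe from injection):

CREATE OR REPLACE PROCEDURE HR_DW.ODI_TRUNCATE (p_table_name IN VARCHAR2)
AS
BEGIN
  -- Runs with the definer's (HR_DW) rights, so ODI_STAGING needs
  -- no DROP ANY TABLE privilege to truncate HR_DW tables.
  EXECUTE IMMEDIATE 'TRUNCATE TABLE HR_DW.'
                    || DBMS_ASSERT.SIMPLE_SQL_NAME(p_table_name);
END ODI_TRUNCATE;
/

GRANT EXECUTE ON HR_DW.ODI_TRUNCATE TO ODI_STAGING;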


That’s it for today. I hope this article helps some people to understand the reason for that Oracle recommendation and how to implement it. Stay tuned on this blog and on Twitter (@rittmanmead, @mRainey, @markrittman, @JeromeFr, …) for more tips about Data Integration!

OBIEE and ODI on Hadoop : Next-Generation Initiatives To Improve Hive Performance

The other week I posted a three-part series (part 1, part 2 and part 3) on going beyond MapReduce for Hadoop-based ETL, where I looked at a typical Apache Pig dataflow-style ETL process and showed how Apache Tez and Apache Spark can potentially make these processes run faster and make better use of in-memory processing. I picked Pig as the data processing environment because its multi-step data transformations translate into lots of separate MapReduce jobs in traditional Hadoop ETL environments, but run as a single DAG (directed acyclic graph) under Tez and Spark, which can potentially use memory to pass intermediate results between steps rather than writing all those intermediate datasets to disk.

But tools such as OBIEE and ODI use Apache Hive to interface with the Hadoop world, not Pig, so it’s improvements to Hive that will have the biggest immediate impact on the tools we use today. And what’s interesting is the development work that’s going on around Hive in this area, with four different “next-generation Hive” initiatives that could end up making OBIEE and ODI on Hadoop run faster:

  • Hive-on-Tez (or “Stinger”), principally championed by Hortonworks, along with Stinger.next which will enable ACID transactions in HiveQL
  • Hive-on-Spark, a more limited port of Hive to run on Spark and backed by Cloudera amongst others
  • Spark SQL within Apache Spark, which enables SQL queries against Spark RDDs (and Hive tables), and exposes a HiveServer2-compatible Thrift Server for JDBC access
  • Vendor initiatives that build on Hive but are mainly around integration with their RDBMS engines, for example Oracle Big Data SQL

Vendor initiatives like Oracle’s Big Data SQL and Cloudera Impala have the benefit of working now (and being supported), but usually come with some sort of penalty for not working directly within the Hive framework. Oracle’s Big Data SQL, for example, can read data from Hive (very efficiently, using Exadata SmartScan-type technology) but can’t write back to Hive, and currently pulls all the Hive data into Oracle if you try and join Oracle and Hive data together. Cloudera’s Impala, on the other hand, is lightning-fast and works directly on the Hadoop platform, but doesn’t support the same ecosystem of SerDes and storage handlers that Hive supports, taking away one of the key flexibility benefits of working with Hive.

So what about the attempts to extend and improve Hive, or include Hive interfaces and compatibility in Spark? In most cases an ETL routine written as a series of Hive statements isn’t going to be as fast or resource-efficient as a custom Spark program, but if we can make Hive run faster or have a Spark application masquerade as a Hive database, we could effectively give OBIEE and ODI on Hadoop a “free” platform performance upgrade without having to change the way they access Hadoop data. So what are these initiatives about, and how usable are they now with OBIEE and ODI?

Probably the most ambitious next-generation Hive project is the Stinger initiative, backed by Hortonworks and based on the Apache Tez framework that runs on Hadoop 2.0 and YARN. Stinger aimed first to port Hive to run on Tez (which runs MapReduce jobs but enables them to potentially run as a single DAG), and then to add ACID transaction capabilities so that you can UPDATE and DELETE from a Hive table as well as INSERT and SELECT, using a transaction model that allows you to roll back uncommitted changes (diagram from the Hortonworks Stinger.next page).


Tez is more a set of developer APIs than the full data discovery / data analysis platform that Spark aims to provide, but it’s a technology that’s available now as part of the Hortonworks HDP 2.2 platform and, as I showed in the blog post a few days ago, an existing Pig script run as-is in a Tez environment typically runs twice as fast as when it uses MapReduce to move data around (with the usual testing caveats applying, YMMV etc). Hive should be the same, giving us the ability to take Hive transformation scripts and run them unchanged except for specifying Tez as the execution engine at the start.
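On a Tez-enabled HDP release, that switch is just a session-level Hive setting, for example:

-- run subsequent queries on Tez instead of MapReduce
set hive.execution.engine=tez;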


Hive on Tez is probably the first of these initiatives we’ll see working with ODI and OBIEE, as ODI has just been certified for Hortonworks HDP 2.1, and the new HDP 2.2 release is the one that comes with Tez as an option for Pig and Hive query execution. I’m guessing ODI will need its Hive KMs updated to add a new option for selecting Tez or MapReduce as the underlying Hive execution engine, but otherwise I can see this working “out of the box” once ODI support for HDP 2.2 is announced.

Going back to the last of the three blog posts I wrote on going beyond MapReduce: many in the Hadoop industry back Spark rather than Tez as the successor to MapReduce, as it’s a more mature implementation that goes beyond the developer-level APIs Tez provides to make Pig and Hive scripts run faster. As we’ll see in a moment, Spark comes with its own SQL capabilities and a Hive-compatible JDBC interface, but the other “swap out the execution engine” initiative to improve Hive is Hive on Spark, a port of Hive that allows Spark to be used as Hive’s execution engine instead of just MapReduce.

Hive on Spark is at an earlier stage of development than Hive on Tez, with the first demo given at the recent Strata + Hadoop World New York, and specific builds of Spark and versions of Hive needed to get it running. Interestingly though, a post went up on the Cloudera blog a couple of days ago announcing an Amazon AWS AMI machine image that you can use to test Hive on Spark; though it doesn’t come with a full CDH or HDP installation or features such as a HiveServer JDBC interface, it does come with a small TPC-DS dataset and some sample queries that we can use to get a feel for how it works. I used the AMI image to create an Amazon AWS m3.large instance and gave it a go.

By default, Hive in this demo environment is configured to use Spark as the underlying execution engine. Running a couple of the TPC-DS queries first using this Spark engine, and then switching back to MapReduce by running the command “set hive.execution.engine=mr” within the Hive CLI, I generally found queries using Spark as the execution engine ran 2-3x faster than the MapReduce ones.


You can’t read too much into these timings, as the demo AMI is really only there to show off the functional features (Hive using Spark as the execution engine) and no performance optimisation work has been done, but it’s encouraging that even at this point it’s significantly faster than the MapReduce version.

Long-term, the objective is to have both Tez and Spark available as execution-engine options under Hive, along with MapReduce, as the diagram below from a presentation by Cloudera’s Szehon Ho shows; the advantage of building on Hive like this, rather than creating your own new SQL-on-Hadoop engine, is that you can make use of the library of SerDes, storage handlers and so on that you’d otherwise need to recreate for any new tool.


The third major SQL-on-Hadoop initiative I’ve been looking at is Spark SQL within Apache Spark. Unlike Hive on Spark, which aims to swap out the compiler and execution engine parts of Hive but otherwise leave the rest of the product unchanged, Apache Spark as a whole is a much more freeform, flexible data query and analysis environment that’s aimed more at analysts than at business users looking to query their dataset using SQL. That said, Spark has some cool SQL and Hive integration features that make it an interesting platform for doing data analysis and ETL.

In my Spark ETL example the other day, I loaded log data and some reference data into RDDs and then filtered and transformed them using a mix of Scala functions and Spark SQL queries. Running on top of the set of core Spark APIs, Spark SQL allows you to register temporary tables within Spark that map onto RDDs, giving you the option of querying your data using either familiar SQL relational operators or the more functional, programming-style Scala API.
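As a minimal sketch of that pattern, using the Spark 1.0-era API from the example further down (registerAsTable() was later renamed registerTempTable()) and a hypothetical pipe-delimited log file:

// spark-shell, Spark 1.0.x
case class LogLine(host: String, status: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._   // brings the RDD-to-SchemaRDD implicits and sql() into scope

// build an RDD from the log file and register it as a temporary table
val logs = sc.textFile("/user/root/access_log.psv")
             .map(_.split('|'))
             .map(f => LogLine(f(0), f(1).toInt))
logs.registerAsTable("logs")

// query it with familiar SQL relational operators...
sql("SELECT host, COUNT(*) FROM logs GROUP BY host").collect().foreach(println)

// ...or with the functional Scala API over the same data
logs.filter(_.status >= 400).count()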


You can also create connections to the Hive metastore, though, and create Hive tables within your Spark application for when you want to persist results to a table rather than work with the temporary tables Spark SQL usually creates against RDDs. In the code example below, I create a HiveContext as opposed to the sqlContext I used in the example in the previous blog post, and then use it to create a table in my Hive database, running on a Hortonworks HDP 2.1 VM with Spark 1.0.0 pre-built for Hadoop 2.4.0:

scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.hql("CREATE TABLE posts_hive (post_id int, title string, postdate string, post_type string, author string, post_name string, generated_url string) row format delimited fields terminated by '|' stored as textfile")
scala> hiveContext.hql("LOAD DATA INPATH '/user/root/posts.psv' INTO TABLE posts_hive")

If I then go into the Hive CLI, I can see this new table listed there alongside the other ones:

hive> show tables;
OK
dummy
posts
posts2
posts_hive
sample_07
sample_08
src
testtable2
Time taken: 0.536 seconds, Fetched: 8 row(s)

What’s even more interesting is that Spark also comes with a HiveServer2-compatible Thrift Server, making it possible for tools such as ODI that connect to Hive via JDBC to run Hive queries through Spark, with the Hive metastore providing the metadata but Spark running as the execution engine.
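If you want to try it, the steps are a short sketch along these lines (assuming a Spark build that includes the Thrift Server, i.e. Spark 1.1 or later built with Hive support):

# start the HiveServer2-compatible Thrift Server on top of Spark
./sbin/start-thriftserver.sh --master yarn-client

# then connect with any HiveServer2 client, e.g. the bundled Beeline
./bin/beeline -u jdbc:hive2://localhost:10000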


This is subtly different from Hive on Spark: Hive’s metastore and its support for SerDes and storage handlers still run under the covers, but Spark provides you with a full programmatic environment, making it possible to just expose Hive tables through the Spark layer, or to mix and match data from RDDs, Hive tables and other sources before storing and then exposing the results through the Hive SQL interface. For example, you could use Oracle SQL*Developer 4.1 with the Cloudera Hive JDBC drivers to connect to this Spark SQL Thrift Server and query the tables just like any other Hive source, but crucially the Hive execution would be done by Spark, rather than by MapReduce as would normally happen.


Like Hive on Spark, Spark SQL and the Hive support within Spark SQL are at an early stage, with Spark SQL not yet supported by Cloudera whereas the core Spark API is. From the work I’ve done with it, it’s not yet possible to expose Spark SQL temporary tables through the HiveServer2 Thrift Server interface, and I can’t see a way of creating Hive tables out of RDDs unless you stage the RDD data to a file in between. But it’s clearly a promising technology, and if it becomes possible to seamlessly combine RDD data and Hive data, and expose Spark RDDs registered as tables through the HiveServer2 JDBC interface, it could make Spark a very compelling platform for BI and data analyst-type applications. Oracle’s David Allan, for example, blogged about using Spark and the Spark SQL Thrift Server interface to connect ODI to Hive through Spark, and I’d imagine it’d be possible to use the Cloudera HiveServer2 ODBC drivers along with the Windows version of OBIEE 11.1.1.7 to connect to Spark in this way too – if I get some spare time over the Christmas break I’ll try and get an example working.