Tag Archives: Big Data
Taking a Look at the New Oracle Big Data SQL
Oracle launched their Oracle Big Data SQL product earlier this week, and it’ll be of interest to anyone who saw our series of posts a few weeks ago about the updated Oracle Information Management Reference Architecture, where Hadoop now sits alongside traditional Oracle data warehouses to provide what’s termed a “data reservoir”. In this type of architecture, Hadoop and its underlying technologies HDFS, Hive and schema-on-read databases provide an extension to the more structured relational Oracle data warehouses, making it possible to store and analyse much larger sets of data with much more diverse data types and structures; the issue that customers face when trying to implement this architecture is that Hadoop is a bit of a “wild west” in terms of data access methods, security and metadata, making it difficult for enterprises to come up with a consistent, over-arching data strategy that works for both types of data store.
Oracle Big Data SQL attempts to address this issue by providing a SQL access layer over Hadoop, managed by the Oracle database and integrated in with the regular SQL engine within the database. Where it differs from SQL on Hadoop technologies such as Apache Hive and Cloudera Impala is that there’s a single unified data dictionary, single Oracle SQL dialect and the full management capabilities of the Oracle database over both sources, giving you the ability to define access controls over both sources, use full Oracle SQL (including analytic functions, complex joins and the like) without having to drop down into HiveQL or other Hadoop SQL dialects. Those of you who follow the blog or work with Oracle’s big data connector products probably know of a couple of current technologies that sound like this; Oracle Loader for Hadoop (OLH) is a bulk-unloader for Hadoop that copies Hive or HDFS data into an Oracle database typically faster than a tool like Sqoop, whilst Oracle Direct Connector for HDFS (ODCH) gives the database the ability to define external tables over Hive or HDFS data, and then query that data using regular Oracle SQL.
Where ODCH falls short is that it treats the HDFS and Hive data as a single stream, making it easy to read once but, like regular external tables, slow to access frequently as there’s no ability to define indexes over the Hadoop data; OLH is also good but you can only use it to bulk-load data into Oracle, you can’t use it to query data in-place. Oracle Big Data SQL uses an approach similar to ODCH but crucially, it uses some Exadata concepts to move processing down to the Hadoop cluster, just as Exadata moves processing down to the Exadata storage cells (so much so that the project was called “Project Exadoop” internally within Oracle up to the launch) – but also meaning that it’s Exadata only, and not available for Oracle Databases running on non-Exadata hardware.
As explained by the launch blog post by Oracle’s Dan McClary, Oracle Big Data SQL includes components that install on the Hadoop cluster nodes that provide the same “SmartScan” functionality that Exadata uses to reduce network traffic between storage servers and compute servers. In the case of Big Data SQL, this SmartScan functionality retrieves just the columns of data requested in the query (a process referred to as “column projection”), and also only sends back those rows that are requested by the query predicate.
Combined with Hive’s ability to map unstructured data sources into regular columns and tables, and Big Data SQL’s support for Oracle NoSQL database, the promise of this new technology is the ability to run queries against both relational, Hadoop and NoSQL data sources using a common data dictionary and common set of identity and data access controls.
There’s a couple of potential downsides, though. First-off, Big Data SQL will only be available as part of Oracle Big Data Appliance, which though an impressive bit of hardware and software is a much smaller market than the total set of Oracle customers looking to combine relational and Hadoop-based data; it’s also restricted to Oracle 12c on Exadata meaning you’ll most probably need to do a database upgrade even if you’ve already got the required Exadata servers in-place. Finally, it’s also restricted to the Oracle-specific distribution of Cloudera Hadoop, though if you’re using the BDA you’ll be using this anyway.
My other concern though is that Oracle now focus on SQL as their only access mechanism into Hadoop and big data, in a similar way to how they focused on SQL as their access route into OLAP when they incorporated Oracle Express into the Oracle Database, back in the mid-2000’s. Focusing on SQL over multidimensional languages such as Express 4GL and MDX meant you missed the real point of using a multidimensional, OLAP database – which of course was being able to use a multidimensional query language, and my concern with Big Data SQL is that we’ll end up focusing on that rather than languages such as Spark, Pig and NoSQL query languages which, combined with schema-on-read, is the real differentiator for Hadoop-based systems. As long as Big Data SQL is positioned as a “bonus” – a convenient way of getting data out of Hadoop once it’s been processed and analysed using more Hadoop-native technologies – then Big Data SQL will be a great enabling and acceptance technology for enterprises, rather than one that ends up restricting them.
We’re not aware of any beta program and I don’t think the launch webcast mentioned a specific date or BDA version when Big Data SQL will be out, but with Openworld coming up soon I’d expect to hear more about this over the next few months. We’re involved in a couple of significant Oracle Big Data Appliance implementations at the moment and this product would address a real, pressing need at the moment with our customers, so I’m looking forward to getting more involved in it over the next few months.
This article was updated on 18th July to add the fact that Big Data SQL is only available on Exadata, and is not a generic Oracle Database 12c technology.
Introducing the Updated Oracle / Rittman Mead Information Management Reference Architecture Pt2. – Delivering the Data Factory
In my previous post on our updated Oracle Information Management Reference Architecture, jointly-developed with Oracle’s Enterprise Architecture team, we went through a conceptual and logical view of the information architecture, introducing new concepts like the Raw Data Reservoir, the Data Factory and the Discovery Lab. I said that the Data Factory, Data Reservoir, Enterprise Information Store and Reporting, together with the Discovery Lab, comprised the Information Platform, which could be used to create data applications (raw data plus the data factory), or more wide-ranging Information Solutions, with the Discovery Lab often acting as the place where new ideas and models were prototyped, and algorithms developed that could be productionized in the Execution area of the overall architecture.
And this is typically where most information architectures leave you; with a good starting point and an overall schematic, but without any real guidance on how to deliver it. For example, how do you develop small pieces of functionality that may often be throwaway, and how do you take on larger projects? Is it possible to develop this type of architecture in an agile way, and if so, can the data reservoir part be delivered agile too? In the collaboration with Oracle’s Enterprise Architecture team this is where we made our major contribution, taking the ideas behind our ExtremeBI agile development approach and using them to deliver this updated Information Management architecture. Let’s go back a few steps though and think about why an agile, or at least a different, way of developing BI content is needed, and what part of the overall architecture are we developing for?
Within this overall architecture, the bit that developers such as ourselves typically add value for is the “data factory”; the part of the architecture that acts as the conduit, or interface, between the schema-on-read raw data reservoir and the schema-on-write enterprise information store.
There’s other code to write too; interface and streaming code to get data into the event engine, for example, and more traditional ETL code to load data in from database and application stores. But what we’re most concerned about is the data models and reports that users request in their user stories and epics. If you take a look back at the logical view of our new information architecture, you can see that information can come in at any level in the architecture – raw data reservoir, foundation or access and performance, with different pros and cons for landing and accessing data at each layer, and the ability to take data from one layer and use it to load layers higher up in the layer hierarchy.
And this is where our agile development methodology, “ExtremeBI”, comes in. ExtremeBI, for structured relational sources, typically uses Oracle GoldenGate to trickle-feed data of interest directly into the foundation layer of this information management data architecture, with developers then using that layer plus the access and performance layer, if needed, to close-off user stories as quickly as possible, as shown in the diagram below.
In some cases it’s possible to model against just the foundation layer in OBIEE, where that foundation layer contains all the metadata and history columns that you need. One of the main parts of our ExtremeBI methodology is a technical process to make this modelling possible, and more importantly allow the underlying physical model to evolve in response to user stories about performance and analysis needs, without breaking existing reports or creating an unmanageable system.
We’ll be documenting this process as part of the series of white papers we’ll be co-authoring with Oracle, and we also bring utilities and accelerators to ExtremeBI customer projects, along with continuous integration and regression testing tools, to help get usable software in the hands of end-users as soon as possible.
But what about the parts of the data factory that source data from the raw data store? Can we use similar techniques, and tools like ODI and OBIEE, to deliver working software to users within a short time span? Well in practice it’s not quite so easy; typically, getting information out of schema-on-read database involves writing code, and its easy to fall back into bad habits where everything is scripted, code is brittle and fragile, and no-one feels confident in evolving the model over time.
We think though it’s possible to use ExtremeBI techniques with schema-on-read sources thought, when you combine them with tools like Flume or GoldenGate for trickle-feed extraction and loading, and new capabilities in ODI and OBIEE where Hadoop data can be accessed via Hive. Flume is the industry-standard mechanism for transporting log entries from source to the Hadoop cluster, and GoldenGate has the ability through its Java API to land data in HDFS or Hive, when the source is a relational database (see MOS Doc.IDs 1586188.1 (Hive) and 1586210.1 (HDFS) for example code).
Hive, though slow for ad-hoc query access against Hadoop data, has a range of Serializer/Deserializer utilities that can translate semi-structured and NoSQL sources into more regular columns and tables, with Cloudera Impala removing the response-time issue and rapidly matching Hive in terms of plugins and SerDes. ODI, as I demonstrated in a five-part series on the blog last week, has a number of Hadoop-native knowledge modules and also leverages Oracle Loader for Hadoop to move data out of the raw data reservoir and into the foundation and access + performance layers.
The important thing to understand though when working with data in the updated Information Management architecture, and reporting against schema-on-read and schema-on-write sources, is that the most effective way to deliver real value for end-users is to have two phases to your development work;
- A discovery phase, typically done in the Discovery Lab sandbox or Rapid Development Sandboxes, that focuses on understanding the data and helping the users see where the value is, with less emphasis on governance and corporate standards, and
- A commercial exploitation phase, when our understanding of the data has significantly increased and we start to add structure to the data, and where it’s an appropriate time to add governance and address data quality issues. Typically, this would be done in the Execution part of our architecture, taking our prototype design and adding it to the data factory as one of its interface routines.
We’ll be developing these ideas out further over the summer, with some white papers and a joint presentation at Openworld 2014 to formally launch it. In the meantime, keep an eye on the blog as we work through some of the key concepts.
Trying to understand the Oracle Reference Architecture for Information Management
Introducing the Updated Oracle / Rittman Mead Information Management Reference Architecture Pt1. – Information Architecture and the “Data Factory”
One of the things at Rittman Mead that we’re really interested in, is the architecture of “information management” systems and how these change over time as thinking, and product capabilities, evolve. In fact we often collaborate with the Enterprise Architecture team within Oracle, giving input into the architecture designs they come up with, and more recently working on a full-blown collaboration with them to come up with a next-generation Information Management architecture. I these two posts I wanted to share some of our recent thinking in this area, looking first at our new proposed architecture, and then in the second post talking about how we’d use agile development methods, in-particular our “ExtremeBI” development approach, to deliver it.
But first, some history. Back in 2009 I blogged about a first-generation DW reference architecture which introduced a couple of new concepts, based on new capabilities from tools such as OBIEE plus some thinking we, and the Enterprise Architecture team at Oracle, had been doing over the years. This reference architecture introduced the concept of “Foundation” and “Access and Performance” layers, and recognised the reality that Kimball-type star schemas were great for querying but not so good for long-term, query-neutral storage of data, whilst Inmon-style EDW models were great as a long-term, process-neutral system of record, but not so good for running user queries on. This new architecture included both of these design approaches, with the foundation layer forming the “information management” layer and the access and performance layer being the “information access” layer. Most importantly, tools like OBIEE made it possible for enterprises to create metadata layers that potentially accessed all layers in this model, so users could query the foundation layer if needed as well as the access and performance layer, if the foundation layer was a better source of data for a particular reports.
A second revision to this model, a couple of years later, expanded on the original one and introduced another two interesting concepts, brought upon by the introduction of tools like Endeca Information Discovery, and the rise of unstructured and semi-structured data sources. This new architecture added unstructured and semi-structured sources into the model, and also introduced the concept of “sandboxes”, areas of the information management model that allowed more free-form, exploratory BI applications to be built.
But in-practice, this idea of “unstructured” and “semi-structured” sources wasn’t all that helpful. What really started to make an impact in the past couple of years is the increasing use of “schema-on-read” databases, where we trade-off the performance and defined structure of traditional relational 3NF and star schemas for the flexibility and “time-to-value” provided by key-value store databases. The Endeca Server is a good example of these types of database, where the Endeca Server allows rapid storage of loosely-associated datasets and tools like Endeca Studio then apply a structure to the data, at the time of analysis. Schema-on-read databases are great for fast, flexible access to datasets, but the cost of ETL is then borne by each system that accesses the data.
Probably the most well-known examples of schema-on-read sources though are Hadoop, and NoSQL databases. Coupled with their ability to store lots of detail-level data at relatively low cost, Hadoop and NoSQL databases have significantly affected the overall landscape for BI, data warehousing and business analytics, and we thought it was about time for a new reference architecture that fully-incorporated the capabilities and latest thinking around this area. Back at the start of 2014 myself, Jon Mead and Stewart Bryson met up with Oracle’s Andrew Bond in his team for a series of workshops, and what came out of it was an updated Information Management Architecture *and* a development methodology for delivering it. Let’s start off then by looking at this updated architecture from a conceptual view.
At a conceptual level, we build on this idea of sandbox environment and formally separate things out into the Execution area – business-as-usual, production and development areas – and an Innovation area, where we build on the idea of a sandbox and rename it the “Discovery lab”. The Discovery lab is where, for want of a better word, the “data scientists” work, with fewer constraints on development and whose inputs are events and data, and outputs are the discovery output that can be the prototype and inspiration for designs going into the execution area.
The main “engine” of the Execution area is our enterprise store of data, this time broken down into four areas:
- A “data reservoir” where we store all incoming events and data at detail-level, typically on HDFS. This blog article by Oracle’s Jean-Pierre Dijcks sets out the concept of a data reservoir well, and I like this blog by Scaleabilities’ Jeff Needham where he makes the case for calling it a “data reservoir” that can ingest, process and analyse data rather than a “data lake”, which implies a passive store.
- An Enterprise Data Store, analogous to the enterprise data warehouses we use today, and a reporting component, typically in our case OBIEE
- Most importantly, the new concept of a “data factory”, a conduit between the data reservoir and the enterprise information store
Together, the execution and innovation layers form our “information platform”, with the event engine feeding real-time events into the platform and outputting them into the data reservoir, and traditional ETL routines loading structured data from the enterprise into the enterprise information store.
This conceptual architecture then permits several types of information application. For example, the data reservoir and the data factory together could support what we call “data applications”, applications working on semi-structured, large and low-granularity data sets such as those used for genomic analysis.
Other applications might be more traditional BI and data warehousing applications, but with the addition of data from the data reservoir and the analysis capabilities of Hadoop.
The discovery lab can be a standalone area, or the insights and discovery it outputs can be used as inputs into the main information platform. More event-based data will typically come in via the event engine, with its output going into the data reservoir and supporting “next-best-decision” applications like Oracle Real-Time Decisions.
Another way of looking at this architecture is from a logical perspective, in particular focusing on the data layers and access/loading processes to load them. The diagram below is our latest version of the two diagrams at the start of this article, and as you can see we’ve kept the data sources and BI element much the same, and kept the concept of the sandbox, in this case refined as the “discovery lab sandbox”.
What is different this time though is the middle bit; we’ve lost the staging area and replaced it with the raw data reservoir, added a “Rapid Development Sandbox”, and drawn the main layers as a slanted set of stacked areas. So why?
What we’re trying to show with the slanted data layers is the relative cost of data ingestion (loading), and the relative cost of accessing it (information interpretation). For the raw data reservoir, for example, there’s little cost in ingesting the data – maybe copy some files to HDFS, or use Flume or GoldenGate to capture log or transaction data to HDFS or Hive, but the cost is then borne in accession this typically “schema-on-read” data source. As you go up the stack, there’s a bit more work in landing data into the Foundation layer – ETL routines, GoldenGate routines, some data cleaning and constraint checking, for example – but it’s correspondingly easier to get data out. For the Access and Performance Layer there’s the most cost in getting data in, but then users have very little work to do when getting data out.
Data can move up the stack from Raw Data Reservoir to Foundation, or directly into Access and Performance, or it could be landed at levels above Raw Data Reservoir, for example in our ExtremeBI approach where we use GoldenGate to replicate source system tables directly into Foundation without going through a staging layer. The Rapid Development Sandboxes are there to support agile, iterative development, with the output from them either being the result in itself, or their designs and insights being used to create more formal projects and data structures.
From a more product-centric perspective, you can overlay these types of diagrams with specific schematics for example enterprises. For example, in the diagram below you can see Oracle NoSQL database being see with HDFS and the Oracle Big Data Connectors to capture and store events from Complex Event Processing, and then outputs from CEP being also fed into a more traditional, “high density data” store as well as directly into a decision engine.
So this all sounds great, but how do you build it? Do we have to use the (discredited) step-by-step, waterfall method to build this type of architecture, and in particular the key “data factory” element that provides connectivity between the Raw Data Reservoir and the Enterprise Information Store? And can we apply agile methods to big data sources, as well as regular databases and applications? Check back on Monday for our thoughts on how this should be done.
End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.5 : Bulk Unload to Oracle
All week I’ve been looking at what’s involved in moving data around Hadoop on the Oracle Big Data Appliance, using ODI12c to orchestrate the end-to-end process. Using web log data from the Rittman Mead website, I first landed the log data on HDFS using Apache Flume, then transformed the data in several stages using Hive transformations built using ODI mappings and knowledge modules. The links below give the previous articles in this series, where you can read how I transformed data using Hive, Sqoop and Python streaming:
- End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.1 : Flume to Initial Hive Table
- End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.2 : Hive Table Joins, Aggregation and Loading
- End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.3 : Enhance with Oracle Reference Data via Sqoop, and CKMs
- End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.4 : Transforming Data using Python & Hive Streaming
- End-to-End ODI12c ETL on Oracle Big Data Appliance Pt.5 : Bulk Unload to Oracle
At the end of all these ODI12c transformations, I’m left with a Hive table containing summarised web page accesses, enhanced with reference data on the page title and author, and with the entries geocoded with the name of the country associated with the page access. In many cases, this is where I’d leave the data, but quite often it’d be useful to then export the data out of Hadoop and into an Oracle database, so that I could load it into a data warehouse, or analyse it using a richer SQL dialect than I can get with HiveQL. I could use Sqoop to get the data out of Hive and into Oracle, but for larger exports in particular I’d get better unload performance by using Oracle Loader for Hadoop, one of the Oracle Big Data Connectors that typically come with Oracle Big Data Appliance.
There’s actually two Oracle Big Data Connectors that deal with getting data out of Hadoop and into Oracle; Oracle Direct Connector for HDFS (ODCH) gives you the ability to define an Oracle Database External Table over a HDFS file or Hive table, and is useful if you just want to access the data in Hadoop without actually loading it into an Oracle Database.
The way ODCH works is that it adds a special type of adapter to the external table feature in Oracle Database, that gives the database the ability to stream HDFS files in a similar way to how external tables stream regular file data into the database. This makes ODCH very convenient if you just want access to HDFS or Hive data from an Oracle database, but it suffers from the same performance penalty as regular external flat file access, in that you can’t index the table and so forth; so if you want high-performance access to data held in a Hadoop system, and you need this access more than just occasionally, you’re better off copying the data right into the Oracle database, and this is where Oracle Loader for Hadoop comes in.
Oracle Loader for Hadoop (OLH) is a bulk-unloader for Hadoop that uses MapReduce technology to sort, partition and organise the data extraction at the Hadoop end, and then fast-unload the data into the Oracle Database typically much faster than a Sqoop data transfer.
OLH is, however, a bit fiddly to setup, so what’s useful is that ODI has a knowledge module, IKM File/Hive to Oracle, that can set up both OLH and ODCH jobs. This knowledge module does have its own setup process that itself is a bit involved – mostly around environment variables on your Linux system – but once its done, each OLH data transfer is then very easy to setup, and as long as you don’t then go on to do any more transformation on the Oracle data post-unload, it’s covered by the ODI limited-use license you get when you buy it in conjunction with the BDA.
So the key things to get OLH working with ODI are firstly, to install OLH on the BDA node that’s running the ODI12c agent (or in my case, full ODI Studio), and then set up the Oracle user’s environment variables so OLH works, and ODI knows where to find any additional Hadoop / Hive libraries and configuration files to work with OLH. In my case, this meant adding the following entries to my .bashrc file:
export HIVE_HOME=/usr/lib/hive export HADOOP_CLASSPATH=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/*:/etc/hive/conf:$HIVE_HOME/lib/hive-metastore-0.12.0-cdh5.0.1.jar:$HIVE_HOME/lib/libthrift.jar:$HIVE_HOME/lib/libfb303-0.9.0.jar:$HIVE_HOME/lib/hive-common-0.12.0-cdh5.0.1.jar:$HIVE_HOME/lib/hive-exec-0.12.0-cdh5.0.1.jar export OLH_HOME=/home/oracle/oracle/product/oraloader-3.0.0-h2 export HADOOP_HOME=/usr/lib/hadoop export JAVA_HOME=/usr/java/jdk1.7.0_60 export ODI_HIVE_SESSION_JARS=/usr/lib/hive/lib/hive-contrib.jar export ODI_OLH_JARS=/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/ojdbc6.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-utility.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-mapping.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/orai18n-collation.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/oraclepki.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/osdt_cert.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/osdt_core.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/commons-math-2.2.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/jackson-core-asl-1.8.8.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/jackson-mapper-asl-1.8.8.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/avro-1.7.3.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/avro-mapred-1.7.3-hadoop2.jar,/home/oracle/oracle/product/oraloader-3.0.0-h2/jlib/oraloader.jar,/usr/lib/hive/lib/hive-metastore.jar,/usr/lib/hive/lib/libthrift-0.9.0.cloudera.2.jar,/usr/lib/hive/lib/libfb303-0.9.0.jar,/usr/lib/hive/lib/hive-common-0.12.0-cdh5.0.1.jar,/usr/lib/hive/lib/hive-exec.jar
where OLH was installed to /home/oracle/oracle/product/oraloader-3.0.0-h2, Hadoop was installed to /usr/lib/hadoop and Hive to /usr/lib/hive. ODI_HIVE_SESSION_JARS is optional, but if you do anything fancy with SerDes in your mappings you’re best adding this entry in, to the hive-contrib.jar file (or technically the symbolic link to it). Getting this right was the hardest part about setting this KM up, but once it’s done, it’s done.
Then, to set the ODI parts up, I first create a table in my Oracle 11.2.0.3 database with matching datatypes for the Hive table i’m going to unload; varchar2 for string, number for int, etc, and then create entries in the ODI topology and Designer navigator. Then, I create a new mapping in ODI12c, where I copy the columns from my Hive table into the Oracle table, like this (note that I could map from an HDFS file instead, if I wanted to):
Now it’s a case of configuring the Physical part of the mapping. There’s two steps to this; first, you have to change the default LKM setting for the access table in the physical mapping, the ACCESS_F object in the TARGET_GROUP below:
By default this sets itself to LKM SQL to Oracle (as you’ve got an Oracle target), but we need to change this to a special one, LKM SQL Multi-Connect (below), that delegates the multi-connect capabilities to the downstream node, so you can use a multi-connect IKM such as IKM File/Hive to Oracle.
Then, and only once you’ve selected this LKM, you can select IKM File/Hive to Oracle for the target table KM.
There’s not much you need to enable with this KM to at least get started. Set OLH_OUTPUT_MODE to JDBC (you can enable OCI later on instead, for potentially faster unloading) and set MAPRED_OUTPUT_BASE_DIR to an HDFS directory that the agent’s OS user (typically “oracle”) can read and write to – in my case, I created one specifically for this KM, using Hue’s File System browser tool. Then, if all’s set up correctly you should be able to execute the mapping and view the results in the Operator navigator – note that in my case, it took several attempts before I finally got all the environment variable settings right. Once done, you should then be able to view the Hive table output in your Oracle table, using SQL Developer.
So now, we’ve got each step for our end-to-end process. The only thing left to do is to pull all of these steps, including the ODI Procedure to run the sqoop transfer, into an ODI package or load plan, so that we can run the steps in sequence:
Then, finally, I can run the whole process as a single execution, moving data from Flume capture through Hadoop ETL steps, to the unload into an Oracle table, all from within ODI12c.
So there you have it – Hadoop-native ETL processing, with the metadata-driven design, error handling and heterogenous connectivity provided by ODI12c.