Tag Archives: Big Data
Oracle “BigDataLite” VM Now Available for Download on OTN
Oracle released a new developer VM for download on OTN yesterday called “bigdatalite” – if you’re interested in big data, Hadoop and some of the SQL-on-Hadoop technologies I’ve been looking at recently on the blog, this is something you’ll want to download as soon as possible and play around with. I’ve had access to an earlier version of this VM from back in 2012, because of some development work I did with ODI and these technologies, but up until now there’s not been a publicly downloadable version I could point people to. Now there is, so I just wanted to walk through what’s in it, and how you can start to play around with some of the features.
Once you’ve downloaded the various archive files and imported the VM into Virtualbox, log in as oracle/welcome1 and you’ll see a (strangely militaristic-looking) desktop and some links to start an Oracle database, open a browser and so on:
Give the various services a few seconds to start up, and then click on the “Start Here” link on the desktop to open your browser.
The getting started page lists out the various products that are installed on the VM, which you can group as:
- Hadoop and big data products from Cloudera – Cloudera Manager, their equivalent to Enterprise Manager; Cloudera’s distribution of Hadoop (similar to how Red Hat and SuSE distribute their own versions of Linux); and Cloudera Impala and Search, their add-ons to Hadoop that make querying and searching faster
- Oracle’s Big Data Connectors, a set of technologies that link the Oracle database to Hadoop, allowing you to query Hadoop from Oracle, and load and unload data between the two platforms
- Oracle Data Integrator 12c, with a couple of Hadoop integration examples pre-created
- Oracle Database 12c, to use with the Big Data Connectors and ODI
- Oracle NoSQL database, a key/value database similar to Apache HBase
- A bunch of other related Oracle tools such as JDeveloper, SQL Developer, and Oracle’s R Distribution – with RStudio and additional R packages separately installable
So it’s a great place to start playing around with Hadoop in general, a way to get some experience with Impala and Hive if you’re an OBIEE developer, and also a great way to try out the integration pieces between the Oracle Database and Hadoop, including ODI’s capabilities in this area.
If you click on the Cloudera Manager link (http://localhost:7180/cmf/login) you’ll be taken to Cloudera Manager. This web UI allows you to see the state of the various services managed by Cloudera Manager, including:
- HDFS (the distributed filesystem that holds the datafiles that are then typically analysed using Hive and Impala)
- Hive and Impala (two technologies for issuing SQL-type queries over HDFS files)
- MapReduce (the core data-processing technology within Hadoop that splits operations into mapping, shuffling and reducing (aggregating) data and automatically parallelises it over nodes in the Hadoop cluster)
- Sqoop (for loading data into and out of Hadoop from relational databases)
- Hue (a web UI for all of the above, that we’ll look at in a moment)
Hue is the other main web interface you’ll want to look at, and this is more of a developer-focused web app that allows you to create and view HDFS files, create Hive tables and then query them using Hive and Impala.
I covered Hue and the process of uploading files to create Hive tables in the two blog posts below the other week, and once you’ve done that you can query them from tools such as OBIEE using the 11.1.1.7 release’s Hive connectivity:
- OBIEE 11.1.1.7, Cloudera Hadoop & Hive/Impala Part 1 : Install and Set-up an EC2 Hadoop Cluster
- OBIEE 11.1.1.7, Cloudera Hadoop & Hive/Impala Part 2 : Load Data into Hive Tables, Analyze using Hive & Impala
If you’re more from the database side, there are some tutorials available on the Big Data Connectors and so forth – there don’t appear to be any separate tutorials for ODI though, so you’ll need to “reverse-engineer” the two examples in ODI Studio to work through how they’ve been created. I’ll try and do this soon and post it on the blog, if anyone’s interested.
Anyway, the VM is downloadable now with supporting materials available on OTN here. I’ve added some links below to earlier posts on our blog that might be of interest to you if you’re looking to try OBIEE and ODI with this platform:
- Why ODI, DW and OBIEE Developers Should Be Interested in Hadoop
- Creating a Multi-Node Hadoop/Impala Cluster as a Datasource for OBIEE 11.1.1.7
- Connecting OBIEE 11.1.1.7 to Cloudera Impala
- Accelerating Hadoop/Hive OBIEE Queries Using Exalytics and the Summary Advisor
- OBIEE, ODI and Hadoop Part 4: Hive Data Transformation & Integration via ODI 11g
- OBIEE, ODI and Hadoop Part 3: A Closer Look at Hive, HDFS and Cloudera CDH3
- OBIEE, ODI and Hadoop Part 2: Connecting OBIEE 11.1.1.7 to Hadoop Data Sources
- OBIEE, ODI and Hadoop Part 1: So What Is Hadoop, MapReduce and Hive?
OBIEE 11.1.1.7, Cloudera Hadoop & Hive/Impala Part 2 : Load Data into Hive Tables, Analyze using Hive & Impala
In yesterday’s post on analyzing Hadoop data using Cloudera CDH4, Amazon EC2 and OBIEE 11.1.1.7, I went through the setup process for Cloudera Manager Standard and then used it to set up a four-node Hadoop cluster in Amazon AWS’s EC2 service. In today’s post, we’ll use a tool called “Hue” to upload some flight delays stats from OBIEE’s SampleApp / Exalytics demos, create Hive tables over those files and then analyse them first using Hive, and then using Cloudera Impala. If you’ve not seen yesterday’s post then take a look at that first for the setup instructions, and then we’ll move on to today’s topics.
Using Hue to Create a Database and Upload Data to It
1. Hue is a web-based application that ships with Cloudera’s distribution of Hadoop, and is used to run queries against Hadoop and perform general data activities – think of it as SQL Developer or APEX for Hadoop, as compared to Cloudera Manager which is more like Oracle Enterprise Manager. To navigate to Hue, go back to Cloudera Manager, select Services > hue1, then select Hue Web UI from the tabs over the Hue service details. You’re then taken to a setup screen where you can create a new Hue user for admin purposes – I’ll use the username/password “admin/admin”, and then press the Sign-Up button – which then takes me to Hue’s Quick Start Wizard, like this:
Press Next, and then click on the All button under the Install all the application examples label. This installs demo data for Hive, Impala, Pig (a procedural tool used for PL/SQL-type data transformation) and Oozie (a workflow tool). We won’t use this example data in this exercise, but it’s handy to have around for playing around with later.
On the Step 3: Users page, click on the User Admin button and create a new user called “airlines”, password “airlines” – we’ll use this in a moment when uploading data to the cluster. Then return to the quick start wizard, press Next and then click on the Hue Home button to go into Hue proper.
2. You’re now at the Hue homepage where you can upload and work with the HDFS filesystem, create Hive tables, use Sqoop 2 to connect to and load data from a relational database, and perform other tasks.
Select admin > Sign Out, and then log in again this time as the “airlines” user. We’re going to now use this user to create a new Hive database called “bi_airlines”, and then create tables out of four pipe-delimited files exported earlier from an Oracle database, and that I’ve uploaded to Dropbox in case you want to use them too.
3. To create the new Hive database, click on the Tables link, and then the Databases link at the top of the page, like this:
Then, when the next screen is displayed, click on the Create a new database link and call it “bi_airlines”, accepting the default location (in HDFS) for the files it uses. Press the Create Database button that’s then displayed, and check the log and the output to make sure it’s created OK. At that point then, you should see two databases displayed – the “bi_airlines” one you just created, and the default one.
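If you’d rather do this from the Hive query editor than click through the web UI, the equivalent HiveQL is just a one-liner – this is simply a sketch of what the Hue action does behind the scenes, not an extra step you need to run:
-- create the Hive database that will hold the airlines tables
create database if not exists bi_airlines;
-- confirm it's listed alongside the default database
show databases;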
4. Click on the bi_airlines database to select it; another page will then be displayed that will list the tables within that database (which at this point is of course empty), and links to create a new table from a file, or to create one manually.
Now in the real world, you’d create your Hive tables manually as you’re most likely going to map them onto a directory of files (or set of directories, if you want to use Hive table partitioning), and you’re also likely to have done some processing using MapReduce, Pig, R or another tool before having the data in something resembling table file extracts. In this example though, we’re just going to use four pipe-delimited files and use Hue’s ability to upload a file and create a table from it automatically.
The four files we’ll be using are:
- flight_performance_2008-10.txt : three years of flight delay stats including origin and destination airport, number of flights, distance, and arrival and departure delay in minutes (524MB)
- geog_dest.txt : destination airports, their state and city, as referenced in the flight stats (150KB)
- geog_origin.txt : the same set of data as the destinations file, this time for the origin airports (150KB)
- carriers.txt : carrier (airline) codes and descriptions
Given the size of the cluster and the potential data available, you could easily use a larger dataset if you’ve got access to an Exalytics demo environment; as it stands though, there are about 20m flight legs in the main file, which is enough to give things a bit of a spin.
5. Still logged in as the “airlines” user, click on the Create a new table from a file link. Type in “flight_performance” as the table name, and then use the ellipsis (“…”) button next to the file path area. When the Choose a File dialog is shown, press the Upload a File button and then navigate to the flight_performance_2008-10.txt file on your local machine. Double-click on the file to select it, and the Hue web UI will then upload the file to the Hadoop cluster, storing it on the HDFS distributed file-system (note that on a Mac I had to switch to Firefox to get this uploader to display).
After a while the file will finish uploading; when it has, click on the link for it in the dialog to select it, and then move onto the next page in the wizard.
The Choose Delimiter page is then displayed. This file is pipe (“|”) delimited, so select Other as the Delimiter and key in “|” (no quotes); press Preview to then display the file data, which will look like this:
Press Next, and then define the columns in the table like this:
- year (int)
- uniquecarrier (string)
- origin (string)
- dest (string)
- arrivaldelaymins (int)
- depdelaymins (int)
- flights (int)
- distance (int)
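In HiveQL terms, the table Hue is about to create corresponds roughly to the DDL sketched below – the exact statement Hue generates behind the scenes may differ slightly, but this is the general shape for a pipe-delimited text file:
-- approximate DDL for the pipe-delimited flight performance table
create table flight_performance (
  `year` int,
  uniquecarrier string,
  origin string,
  dest string,
  arrivaldelaymins int,
  depdelaymins int,
  flights int,
  distance int
)
row format delimited
fields terminated by '|'
stored as textfile;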
Then, press the Create Table button, and the table will then be created and the file used to populate it. Once complete, review the Columns tab in the table display and then the Sample tab, which should output something like this:
Then, repeat this process for the other three files, creating the following tables:
origin (based on geog_origin.txt)
- origin (string)
- origin_desc (string)
- origin_city (string)
- origin_state (string)
- origin_airportid (string)
destination (based on geog_dest.txt)
- dest (string)
- dest_desc (string)
- dest_city (string)
- dest_state (string)
- dest_airportid (string)
carriers (based on carriers.txt)
- carrier (string)
- carrier_desc (string)
We’ve now uploaded all the data to the Hadoop cluster; let’s take a look at it now before we move over to OBIEE.
6. When you upload files via Hue into Hadoop’s HDFS filesystem, it normally puts them into the home directory for that user in HDFS (for example /user/airlines/). If you choose to create a table from that file though, Hue and Hive move the file into Hive’s part of the HDFS filesystem, creating a sub-directory first for that new database. You can see where your files have gone by clicking on the File Browser button at the top of the Hue page, then navigating to /user/hive/warehouse/bi_airlines.db – you should see your files there (or more correctly, directories that contain your files). You can also map Hive tables onto files outside of the /user/hive/warehouse directory (these are called “external tables”), but moving the files in is the default behaviour, so we’ll leave them where they are for now.
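For reference, an external table that leaves the files wherever they were uploaded (rather than having them moved under /user/hive/warehouse) would be declared along the lines of the sketch below – the HDFS path is purely an illustrative example, not one used elsewhere in this article:
-- external table mapped onto a directory of pipe-delimited files
-- (the /user/airlines/flight_data location is illustrative only)
create external table flight_performance_ext (
  `year` int, uniquecarrier string, origin string, dest string,
  arrivaldelaymins int, depdelaymins int, flights int, distance int
)
row format delimited fields terminated by '|'
stored as textfile
location '/user/airlines/flight_data';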
So where are these files kept on this Hadoop cluster in the EC2 cloud? To find out, click on the flight_performance entry, and notice that the file (with the .txt extension) is actually contained within it – what we’d clicked on was in fact a directory for that file. Hive tables can just as easily map onto a directory of files, so you could add in other years’ data here, or in fact thousands of files – this is usually how incoming data is received in big data-type applications. With the single flight_performance.txt file displayed, click on the file to view its contents, then click on the View File link and notice the First Block | Previous Block | Next Block | Last Block and bytes areas – HDFS in fact breaks the file into blocks, and stores each block in several (redundant) places on nodes in the cluster, to give us fault-tolerance and make it easy for multiple nodes to process the dataset in block-sized chunks.
7. So let’s start by running a couple of queries in Hive, using Hue again. Click on the Beeswax (Hive UI) icon at the top of the page, ensure bi_airlines is selected as the Database, and then type in the following HiveQL query:
select count(*) from flight_performance;
Execute the query and then watch the log output. You’ll see Hive creating and then submitting for execution the MapReduce jobs to select your columns (the “map” part) and then aggregate the results (the “reduce” bit).
The count should return about 19m rows, and the query should take around a minute and a half to run. Now let’s try something more interesting:
select sum(f.flights)
from flight_performance f join origin o on (f.origin = o.origin)
where o.origin = 'SFO'
This time the query takes a bit longer, and when it completes you can see links for the two MapReduce jobs that it used to sum the flight data, as shown in the screenshot below:
Click on one of the MR Jobs links and you can see a bit more detail about the MapReduce job that provided that part of the dataset – in the example below, there were three mappers that ran initially, then another two to set up and clean up the job, and then one reducer to aggregate the data. Clicking on the other link is a similar story – a single mapper for the main data selection, then a reducer and control mappers to control and aggregate the dataset.
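If you’d like to see this map-and-reduce breakdown before a query runs, Hive’s EXPLAIN command prints the stage plan it intends to use – for example, for the join query above:
-- show the execution plan (map/reduce stages) without running the query
explain
select sum(f.flights)
from flight_performance f join origin o on (f.origin = o.origin)
where o.origin = 'SFO';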
When you think about it, it’s pretty amazing what Hive does, compared to writing the MapReduce code yourself and then running it. And it’s probably fine for ETL-type access where most probably there’s a lot more data to load than just this small fact table, but it wouldn’t really be good for BI-type queries as we’re talking 1, 2 or 3 minutes to return data. And that’s what Impala is for – access to the same data, using the same Hive catalog, but much faster queries that don’t use MapReduce to retrieve the data.
8. So let’s run the same query using Impala. Click on the Cloudera Impala (TM) Query UI icon at the top of the Hue page, and check the drop-down menu under Database – note how the bi_airlines database isn’t showing there. To have it show, go into the query editor area and type in:
invalidate metadata;
This will have Impala re-load the Hive catalog metadata, and the bi_airlines database should then be listed. Select it, and then try the same query as before:
select sum(f.flights)
from flight_performance f join origin o on (f.origin = o.origin)
where o.origin = 'SFO';
This time when you run it, it returns in a couple of seconds. So now we’ve got some data and some options for querying it, let’s move over to OBIEE and try and connect it to the cluster.
Connecting OBIEE 11.1.1.7 to Cloudera Hadoop on Amazon EC2 using Hive and Impala
For simplicity’s sake, we’ll use OBIEE 11.1.1.7 running on Windows (Windows Server 2008 R2 64-bit, in my case), and we’ll use Cloudera’s own ODBC drivers to make the connection. Oracle’s recommendation is that you use Linux for Apache Hadoop / Hive connectivity though, and they provide their own drivers as part of the 11.1.1.7 install and on OTN; however I don’t think those drivers connect to the HiveServer2 service that recent CDH4 installs use, whereas I know the Cloudera ones do. So starting with a standard install of OBIEE 11.1.1.7 on Windows 64-bit, follow these steps to initially connect via Hive.
1. Start by downloading the Impala and Hive ODBC drivers from the Cloudera website.
Install the Hive one first, and we’ll try those before going over to the Impala ones. Run the MSI installer for Hive, and then open the 64-bit ODBC Data Source Administrator utility in Windows, so we can create the ODBC connection through to Hive.
2. Next we need to find the external EC2 DNS name for the virtual server we added the HiveServer2 service to in the previous article. Open up the Cloudera Manager website on the instance you created right back at the start of yesterday’s article to host Cloudera Manager, and navigate to the Hosts > All Hosts page, like this:
In my example, the virtual server that’s running HiveServer2 is displayed with its internal EC2 DNS name, like this:
This name only works when you’re internal to the EC2 network though, so you’ll need to go over to the AWS Management Console, find the entry for that instance using the private DNS name, and then use that to retrieve the public one, like this:
3. Now you’re good to go. Back in the Windows desktop, click on System DSN, and then Add. In the list of ODBC data source drivers, you should see Cloudera ODBC Driver for Apache Hive; select it and press Finish.
Then, when the Cloudera ODBC Driver for Apache Hive DSN Setup dialog is shown, enter the following details, substituting the host name that you just retrieved in the previous step, that’s running HiveServer2:
Data Source Name : hive_demo
Host : <your host name with HiveServer2 running>
Port : 10000
Database : bi_airlines
Hive Server Type : Hive Server 2
Authentication Mechanism : User Name
User Name : airlines
so that the dialog looks like this:
Press Test, and check that the test results are successful (note I’m using a slightly older version of the drivers, so the dialog might look a bit different in the latest version). Then press OK, and OK again to close the dialog and save the system DSN.
Now, create a new RPD or log into an online one that’s also on this OBIEE host server, so that any online access can also use the ODBC drivers you just installed. When the RPD is open for editing, select File > Import Metadata .., and then when prompted, select the DSN you created a moment ago – in my case, “hive_demo”, enter the airlines/airlines username and password, and then press Next to proceed to the table import page.
Make sure “Tables” is still checked, press Next, and then select and bring across the bi_airlines database you created earlier, as shown in the Data source view:
Press Finish to complete the metadata import.
4. Now click on the Hive physical database in the Physical panel in the BI Administration tool, to display the Database properties dialog. Change the Database type: from ODBC Basic to Apache Hadoop, and press No when asked if you’d like to edit the connection pool properties for this database; then press OK to close the dialog, and check in the changes to the RPD. To now check connectivity to the Hadoop cluster via Hive ODBC, right-click on one of the tables and select View Data…
Once all is working OK, create keys on the origin.origin, destination.dest and carriers.carrier tables and then connect the fact table to them, so you’ve got a physical model that looks like this:
Then, finally, pull the rest of the RPD together and create a simple report in answers; the key thing is that you get some data through, as I’ve managed to do in the screenshot below.
But of course – it’s not very fast – queries typically take 2 or 3 minutes to run, and these are just simple ones. As I said earlier – fine for ETL, particularly when the dataset is likely to be a lot bigger, but not great for ad-hoc BI queries. So let’s set up an Impala connection instead, and see how that goes.
5. Before this will work though, we’ll need to add the Impala port – 21050 – to the security group that Cloudera Manager created when it provisioned the Hadoop nodes yesterday. To do this, go back into the AWS Management Console, click on the Security Groups menu item and navigate to the security group set up by Cloudera Manager, in my case called “jclouds#impala-demo-cdh”. Click on it to select it, and use the Inbound tab to add an additional security rule like this:
Port Range : 21050
Source : 0.0.0.0/0
Then, press Add Rule and then Apply Rule Changes to add this additional port to the security group. Finally, check the list of ports now open for that security group to see that 21050 is now listed.
You can now go over to the Windows environment, install the Impala ODBC drivers and use the ODBC Administrator utility in Windows to set up the Impala ODBC connection. In my case, I use the following values:
Data Source Name : impala_demo
Host : ec2-46-137-25-2.eu-west-1.compute.amazonaws.com
Port : 21050
Database : bi_airlines
Mechanism : No Authentication
Press Test to check that it’s all working OK, and then import the bi_airlines tables into the RPD as you did with the Hive import.
Double-click on the new physical database and set the database type to “Apache Hadoop” again. In addition though, click on the Features tab in the Database Properties dialog and uncheck the ORDERBY_SUPPORTED checkbox – Impala SQL requires a LIMIT clause after each ORDER BY but OBIEE doesn’t currently provide this (Impala’s an unsupported source at this point in time, so it’s not unexpected), so by unchecking this property we get the BI Server to do the results ordering, and queries will then run OK.
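To see the restriction for yourself, a query along the lines of the sketch below runs happily in the Impala query editor because the ORDER BY is followed by a LIMIT; strip off the LIMIT and this release of Impala rejects the statement, which is why we hand the ordering back to the BI Server:
-- Impala (at this release) needs a LIMIT whenever ORDER BY is used
select c.carrier_desc, sum(f.flights) as total_flights
from flight_performance f join carriers c on (f.uniquecarrier = c.carrier)
group by c.carrier_desc
order by total_flights desc
limit 10;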
Right-click and select View Data… on one of the imported Impala tables to check that it returns data OK, and then build out the rest of the RPD as you did with the Hive data. Now when results come back, they come back in a matter of seconds (and the results look more correct, too).
So there you have it – a more-or-less step-by-step guide to setting up a Hadoop cluster in Amazon EC2, then analysing data on it using OBIEE and Hive / Impala. Hopefully it was useful – more on this topic over the next few weeks.
OBIEE 11.1.1.7, Cloudera Hadoop & Hive/Impala Part 1 : Install and Set-up an EC2 Hadoop Cluster
I’ve been over in San Francisco this last week for BIWA Summit 2014, and one of the things I demo’d during the week was OBIEE 11.1.1.7 connecting to a Hadoop cluster running on Amazon EC2, and analysing the flight delays dataset that ships with recent SampleApps and Exalytics. There’s quite a few interesting steps and concepts in setting this up, so I thought it’d be interesting to go through them on the blog, so that others can have a try if they’re interested. Don’t take this as a definitive, 100%-complete set of steps you’ll need to work through to set up the example – I’m currently writing this in the BA lounge at SFO trying to get this written before my flight leaves, and I might have inadvertently missed a couple of steps – but this should give you the gist of what’s involved and show what’s possible.
What the example will do is create the following setup:
In this setup, we’ll initially create an Amazon EC2 instance that we’ll then install the free version of Cloudera Manager 4.5 onto; Cloudera are a company that have created a distribution of Hadoop which they then sell alongside their own management tools (similar to how Red Hat took Linux, made it “enterprise” and sold software and services around it), but who also provide a freely-downloadable version of their tools (“Cloudera Standard”) that has special setup routines when run on Amazon EC2.
We’ll then use this install of Cloudera Manager to automatically create and provision four Amazon EC2 instances which we’ll then install Hadoop onto, along with other tools like Impala (for in-memory SQL access over the cluster), Hive, HDFS and so on. Then, in the second part of this two-part series, we’ll upload some data from the Flight Delays dataset into the cluster, connect OBIEE to it via the Cloudera Impala ODBC drivers, and analyse it from Answers. I’m assuming with this that you’ve got some familiarity with Amazon AWS, EC2 and the rest of their cloud platform, and that you’ve got yourself set up with an account, your secret access keys and so on – if not, do that first before you try any of these steps.
Let’s start by setting up the initial EC2 virtual server instance onto which we’ll install Cloudera Manager.
Installing the EC2 Hadoop Cluster
1. What we’re going to do before anything else is create what’s called a “security group”, a collection of firewall settings that we’ll apply to the Cloudera Manager virtual server so that it can then connect out to the nodes it’s going to set up to run Hadoop (and so that we can connect to it to run the web interface). To do this, log into the AWS Management Console, and from the Amazon Web Services menu navigate to EC2 > Network & Security > Security Groups.
Then when the Security Groups page is displayed, press the Create Security Group button, then enter the following details when prompted:
Name : CDH-Manager
Description : Security group for CDH4 Manager instance
VPC : No VPC
Then, with this new security group selected, use the Add Rule button to add the following inbound rules:
SSH : 0.0.0.0/0
7180 : 0.0.0.0/0
7182 : 0.0.0.0/0
7183 : 0.0.0.0/0
7432 : 0.0.0.0/0
Custom ICMP rule : Echo Reply 0.0.0.0/0
Once you’ve done this, the security group area should look like this:
Then, press the Apply Rule Changes button to register the security settings.
2. Next we’ll create an Amazon EC2 virtual server instance to run Cloudera Manager on, using this security group settings to ensure the right ports are open – then we’ll use that instance and install of Cloudera Manager to then set up the Hadoop cluster.
To do this, with the EC2 Dashboard web page still open, click on the Instances menu item on the left-hand side of the page, then press Launch Instance, noting at the same time the EC2 region you’ll be working in (for me, it’s the EU Ireland region).
For this initial virtual server instance, use the Ubuntu Server 12.04 LTS 64-bit image – Cloudera Manager 4.5 Free can install onto either Ubuntu or CentOS, and will adjust what it installs accordingly, so for now let’s select Ubuntu.
Then, when prompted, select the m1.medium instance type, and on the Step 7: Review Instance Launch page, select the security group you created a moment ago for the instance’s security group settings. Once done, press the Launch button, create or select an SSH key pair and then download that key pair to your local laptop or PC so you can connect to the virtual server once it’s spun up.
3. Now you need to SSH into this new EC2 virtual server and download the Cloudera Manager software to it, to then create the Hadoop cluster. To do this, first make a note of the instance name that the EC2 launch instance process gave you, like this:
Click on that link to then show the status of the virtual server, and more importantly, its public DNS address. Once the virtual server shows a status of “running”, you can then SSH into it and download and run the Cloudera Manager software; note that “EC2-cluster.pem” is the name of the keypair I created in the previous steps, and this file will need to be “chmod 400”-protected before EC2 and SSH will let you use it – see this blog article on setting up EC2 command-line access on the Mac for example details.
To SSH into the virtual server and install and run Cloudera Manager, type in the following (using your own SSH key file name and virtual server DNS address):
ssh -i EC2-cluster.pem ubuntu@ec2-54-216-126-144.eu-west-1.compute.amazonaws.com
Then, once you’re connected, download and install Cloudera Manager like this:
wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
chmod +x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin
You’ll then be walked through a wizard that will get you to agree to a couple of licenses, and then download and install the Cloudera Manager software for your instance type. Note that this is something CM does when it detects it’s running on Amazon EC2 – for other types of install it’s a slightly different process.
4. Once the Cloudera Manager software install has completed, give it a couple of minutes and then use your web browser to navigate to the Cloudera Manager website, at machine-name:7180, in my case:
http://ec2-54-216-126-144.eu-west-1.compute.amazonaws.com:7180
Log in as “admin/admin” and when prompted, select the free Cloudera Standard option. Press Continue so that you’re then presented with the Provide instance specification page. Using this page, you can select the EC2 instance size and type, the number of nodes in your cluster, and a group name for your instances. In this example, we’ll create a four-node cluster using the Ubuntu 12.04 LTS 64-bit image, select m1.large as the instance type, and call it “impala-demo-cdh”.
Then, on the Provide Credentials page, paste in your AWS access key ID and Secret Access Key, let Cloudera Manager generate a new key pair for use with the cluster (or upload your own one from before), and then press the Start Installation button on the next page to have Cloudera Manager start provisioning the cluster instances. Once the instances are created, download the additional key file and place it with the other one, “chmod 400”-ing it as before so it’ll work with SSH into EC2.
5. Once the instance provisioning completes, Cloudera Manager will then install the relevant software onto the different nodes. The Installation in Progress page will show you the progress of these installs, with the screenshot below showing it mid-way through the process.
Assuming all the cluster nodes install properly, walk through the rest of the steps to confirm what’s installed where, check all of the services are running OK and complete the process.
Configuring and Setting up Hadoop
So assuming all of the install and service startup steps went OK, what you have now is a four-node Hadoop cluster running on Amazon EC2, with additional management tools and services provided by Cloudera – think of it like a Linux distribution by Red Hat or Suse, where the core is standard open-source software and the vendor provides other complementary tools, and tools they write themselves, to enhance the product. The screenshot below is the overall summary page for your cluster, as provided by Cloudera Manager – don’t worry too much about the warnings, they’re down to log file disk space and can be ignored for this particular exercise.
If you select Services > All Services from the Cloudera Manager menu, you’ll see what’s been installed on your cluster:
Some of the key services are:
- HDFS – the cluster filesystem that Hadoop processes data on, and we’ll use later on to upload text files containing the flight delays data we’re going to analyse. HDFS is unix-like in how you work with it, but it stores data redundantly across all nodes in the cluster, enabling parallel operations and providing fault-tolerance.
- HBase – a NoSQL database that we won’t use here, but that stores data in key/value pairs using the HDFS filesystem
- Hive – a SQL-like access layer over Hadoop, typically used for ETL access, and currently by OBIEE
- Impala – a faster alternative to Hive that runs queries in-memory and bypasses MapReduce job creation, the thing that slows Hive down
- Hue – a web UI that we’ll be using later on to run Hive and Impala queries, and create tables in Hive’s HCatalog
- MapReduce – the framework and server within Hadoop that typically crunches, filters and transforms the data
Before we go into Hue to create some Hive tables, there’s one task we need to do if we’re to access this cluster via Hive – we need to install something called “HiveServer2”, a server process that the Hive ODBC drivers OBIEE uses will need in order to connect to the cluster, but that isn’t installed by default.
To install HiveServer2, from the Cloudera Manager website select Services > hive1, and then click on the Instances tab. Then, scroll across so that you can see the HiveServer2 column, locate the cluster node with the majority of services and the Hive Metastore Server installed on it, and check the checkbox to select that service for install.
Press Continue, and then back on the Role Instances page, select the new hiveserver2 service, and select Actions for Selected > Start to start the service.
Now we’re at the point where we can use Hue to set up a Hive database, upload some files and create some tables for analysis. Check back tomorrow for the second-part in this series where we’ll do just that.
Rittman Mead BI Forum 2014 Call for Papers Now Open!
It’s that time of year again when we start planning out next year’s BI Forum, which like this year’s event will be running in May 2014 in Brighton and Atlanta. This will be our sixth annual event, and as with previous years the most important part is the content – and as such I’m pleased to announce that the Call for Papers for BI Forum 2014 is now open, running through to January 31st 2014.
If you’ve not been to one of our BI Forum events in past years, the Rittman Mead BI Forum is all about Oracle Business Intelligence, and the technologies and techniques that surround it – data warehousing, data analysis, big data, unstructured data analysis, OLAP analysis and this year – in-memory analytics. Each year we select around ten speakers for Brighton, and ten for Atlanta, along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January, and editorial input from myself, Jon Mead and Stewart Bryson.
Last year we had sessions on OBIEE internals and new features, OBIEE visualisations and data analysis, OBIEE and “big data”, along with sessions on Endeca, Exalytics, Exadata, Essbase and anything else that starts with an “E”. This year we’re continuing the theme, but are particularly looking for sessions on what’s hot this year and next – integration with unstructured and big data sources, use of engineered systems and in-memory analysis, advanced and innovative data visualisations, cloud deployment and analytics, and anything that “pushes the envelope” around Oracle BI, data warehousing and analytics.
The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues. We’re also looking for presenters for ten-minute “TED”-style sessions, and if you’ve got any ideas for keynote speakers, send them directly to me at mark.rittman@rittmanmead.com. Other than that – have a think about abstract ideas now, and make sure you get them in by January 31st 2014.
Why ODI, DW and OBIEE Developers Should Be Interested in Hadoop
Over the past few months I’ve been posting a number of articles about Hadoop, and how you can connect to it from ODI and OBIEE. From an ODI perspective, I covered Hadoop as one of a number of new data sources ODI11g could connect to, then looked at how it leveraged Hive to issue SQL-like data extraction commands to Hadoop, and how it used Oracle Hadoop connector tools to transfer Hadoop data into the Oracle Database, and directly work with data in HDFS files. For OBIEE, I went through the background to Hadoop, Hive and the other “big data” technologies, stepped through a typical Hive query session, then showed how OBIEE 11.1.1.7 could connect to Hadoop through its newly-added Hive adaptor, then finally built a proof-of-concept OBIEE connection through to Cloudera Impala, then extended that to a multi-node Hadoop cluster.
But why all this interest in Hadoop – what’s it really got to do with OBIEE and ODI, and why should you as developers be interested in what’s probably yet another niche BI/DW datasource? Well in my opinion, Hadoop is the classic disruptive technology – cheap, and starting off with far less functionality than regular, relational databases – but it’s improving fast, and to BI&DW developers it offers the potential of both massive benefits – significantly lower TCO for basic DW work, and support for lots of modern, internet-scale use-cases – and threats – in that if we don’t understand it and see how it can benefit our customers and end-users, we risk being left behind as technology moves on.
To my mind, there are two main ways in which Hadoop, Hive, HDFS and the other big-data ecosystem technologies are used, in the BI/DW context:
1. Standalone, with their own query tools, database tools, query languages and so forth – your typical “data scientist” use case, originating from customers such as Facebook, LinkedIn etc. In this context, there’s typically no Oracle footprint, users are pretty self-sufficient, any output we see is in the form of “insights”, marketing campaigns etc.
2. Alongside more mainstream, for example Oracle, technologies. In this instance, Hadoop, Hive, HDFS, NoSQL etc are used as complementary, and supporting, technologies to enhance existing Oracle-based data warehouses, capture processes, BI systems. In some cases, Hadoop-type technologies can replace more traditional relational ones, but mostly they’re used to make BI&DW systems more scaleable, cheaper to run, able to work with a wider range of data sources and so forth. This is the context in which Hadoop can be relevant to more traditional Oracle BI, ETL and DW developers.
To understand how this happened, let’s go through a bit of a history lesson. Five years ago or so, your typical DW+BI architecture looked like this:
The data warehouse was typically made up of three layers – staging, foundation/ODS and performance/dimensional – with data stored in relational databases and some use made of OLAP servers, or some of the newer in-memory databases like QlikView. But over the intervening years, the scale and types of data sources have increased, with customers now looking to store data from unstructured and semi-structured sources in their data warehouse, take in feeds from social media and other “streaming” sources, and access data in cloud systems typically via APIs, rather than traditional ETL loading. So now we end up with a data warehouse architecture that looks like this:
But this poses challenges for us. From an ETL perspective, how do we access these non-traditional sources, and once we’ve accessed them – how do we efficiently process them? The scale and “velocity” of some of these sources can be challenging for traditional ETL processes that expect to log every transformation in a database with transactional integrity and multi-version concurrency control, whilst in some cases it doesn’t make sense to try and impose a formal data structure on incoming data as you’re capturing it, instead giving it the structure when we finally need it, or when we choose to access it in a query.
And then came “Hadoop”, and its platform and tool ecosystem. At its core, Hadoop is a framework for processing, in a massively-parallel and low-cost fashion, large amounts of data using simple transformation building blocks – filtering (mapping) and aggregating (reducing). Hadoop and MapReduce came out of the US West Coast Internet scene as a way of processing web and behavioural data in the same massively-distributed way that companies provided web search and other web 2.0 activities, and a core part of it was that it was (a) open-source, like Linux and (b) cheap, both in being open-source but also because it was designed from the outset to run on low-cost, commodity hardware that’s expected to fail. Pretty much the opposite of Oracle’s business model, but also obviously very attractive to anyone looking to lower the TCO of their data warehouse system.
So as I said – the Hadoop pioneers went-out and built their systems without much reference to vendors such as Oracle, IBM, Microsoft and the like, and being blunt, they won’t have much time for traditional Oracle BI&DW developers like ourselves. But those customers who are largely invested in Oracle technology, but see advantages in deploying Hadoop and big data technologies to make their systems more flexible, scaleable and cheaper to run – that’s where ODI and OBIEE’s connectivity to these technologies becomes interesting.
To take the example of customers who are looking to deploy Hadoop technologies to enhance their Oracle data warehouse – a typical architecture going down this route would look like this:
In this example, we’re using HDFS – the Hadoop Distributed File System – as a pre-staging area for the data warehouse, storing incoming files cheaply, and with built-in fault tolerance, to the point where storage is so cheap that you might as well keep stuff you’re not interested in now, but think might be interesting in the future. Using Oracle Direct Connector for HDFS, you can set up Oracle Database external tables that map onto HDFS just like any other file system, so you can extract from and otherwise work with these files without worrying about writing MapReduce jobs; and through the Oracle Data Integration Adaptor for Hadoop, you can connect ODI to these table sources as well, and work with them just like any other topology source – something I’ll be showing in my upcoming UKOUG Tech’13 session on ODI, OBIEE and Hadoop, running in a couple of weeks’ time in Manchester.
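As a rough illustration of what that looks like on the database side, the connector generates external table definitions along the lines of the sketch below – treat it purely as a sketch, since the access parameters, preprocessor script location and directory objects are set up for you by the connector tooling, and the names used here are illustrative only:
-- sketch of an Oracle external table reading HDFS files via the connector's
-- preprocessor; directory objects, file names and paths are illustrative only
create table flight_performance_hdfs (
  year number, uniquecarrier varchar2(10), origin varchar2(10), dest varchar2(10),
  arrivaldelaymins number, depdelaymins number, flights number, distance number
)
organization external (
  type oracle_loader
  default directory ext_data_dir
  access parameters (
    records delimited by newline
    preprocessor hdfs_bin_dir:'hdfs_stream'
    fields terminated by '|'
  )
  location ('flight_performance.loc')
)
reject limit unlimited;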
As well as storing data, you can also do simple filtering and transformation on that data, using the Hadoop framework. Most upfront data processing you do as part of an ETL process involves filtering out data you’re not interested in, joining data sets, grouping and aggregating data, and other large-scale data transformation tasks, before you then load it into the foundation/ODS layer and do more complex work. And this simple filtering and transformation is what Hadoop does best, on cheap hardware or even in the cloud – and if your customer is already invested in ODI and runs the rest of their ETL process using it, it’s relatively simple to add Hadoop capabilities to it, using ODI to orchestrate the data processing steps but using Hadoop to do the heavy lifting, as my slide below shows:
Now some customers, and of course Hadoop vendors, say that in reality you don’t even need the Oracle database if you’re going to build a data warehouse, or more realistically a data mart. Now that’s a bigger question and probably one that depends on the particular customer and circumstances, but a typical architecture that takes this approach might look like this:
In this case, ODI again has capabilities to transform data entirely within Hadoop – with ODI acting as the ETL framework and co-ordinator, but Hadoop doing the heavy-lifting – and there’s always the ability to get the data out of Hadoop and into a main Oracle data warehouse, if the Hadoop system is more of a data mart or department-specific analysis. But whichever way – in most cases the customer is going to want to continue to use their existing BI tool, particularly if their BI strategy involves bringing together data from lots of different systems, as you can do with OBIEE’s federated query capability – giving you an overall architecture that looks like this:
So it’s this context that makes OBIEE’s connectivity to Hadoop so important; I’m not saying that someone creating a Hadoop system from scratch is going to go out and buy OBIEE as their query tool – more typically, they’ll use other open-source tools or create models in tools like R; or they might go out and buy a lightweight data visualisation tool like Tableau and use that to connect solely to their Hadoop source. But the customers we work with have typically got much wider requirements for BI, have a need for an enterprise metadata model, recognise the value of data and report governance, and (at least at present) access most of their data from traditional relational and OLAP sources. But they will still be interested in accessing data from Hadoop sources, and OBIEE’s new capability to connect to this type of data, together with closer integration with Endeca and its unstructured and semi-structured sources, addresses this need.
So there you have it – that’s why I think OBIEE and ODI’s ability to connect to Hadoop is a big deal, and it’s why I think developers using those tools should be interested in how it works, and should try and set up their own Hadoop systems and see how it all works. As I said, I’ll be covering this topic in some detail at the UKOUG Tech’13 Conference in Birmingham in a couple of weeks’ time, so if you’re there on the Sunday come along and I’ll try and explain how I think it all fits together.