Tag Archives: Cloud
OBIEE SampleApp in The Cloud: Importing VirtualBox Machines to AWS EC2
Virtualisation has revolutionised how we work as developers. A decade ago, using new software would mean trying to find room on a real tin server to install it, hoping it worked, and, if it didn’t, picking apart the pieces and probably leaving the server in a worse state than it was to begin with. Nowadays, we can just launch a virtual machine to give us a clean environment, and if it doesn’t work – trash it and start again.
The sting in the tail of virtualisation is that full-blown VMs are heavy – on disk we need several GB just for a blank OS, and dozens of GB if you’re talking about a software stack such as Fusion Middleware (FMW), and the host machine needs to have the RAM and CPU to support it all too. Technologies such as Linux Containers go some way to making things lighter by abstracting out a chunk of the OS, but this isn’t something that’s reached the common desktop yet.
So whilst VMs are awesome, it’s not always practical to maintain a library of all of them on your local laptop (even 1TB drives fill up pretty quickly), nor will your laptop have the grunt to run more than one or two VMs at most. VMs like this are also local to your laptop or server – but wouldn’t it be neat if you could duplicate that VM and make a server based on it instantly available to anyone in the world with an internet connection? And that’s where The Cloud comes in, because it enables us to store as much data as we can eat (and pay for), and provision “hardware” at the click of a button for just as long as we need it, accessible from anywhere.
Here at Rittman Mead we make extensive use of Amazon Web Services (AWS) and their Elastic Computing Cloud (EC2) offering. Our website runs on it, our training servers run on it, and it scales just as we need it to. A class of 3 students is as easy to provision for as a class of 24 – no hunting around for spare servers or laptops, no hardware sat idle in a cupboard as spare capacity “just in case”.
One of the challenges that we’ve faced up until now is that all servers have had to be built from scratch in the cloud. Obviously we work with development VMs on local machines too, so wouldn’t it be nice if we could build VMs locally and then push them to the cloud? Well, now we can. Amazon offer a route to import virtual machines, and in this article I’m going to show how that works. I’ll use the superb SampleApp v406 VM that Oracle provide, because it’s a great real-life example: a VM that is hugely useful, but one that many developers find too memory-intensive to run on their local machines all the time.
This tutorial is based on exporting a Linux guest VM from a Linux host server. A Windows guest probably behaves differently, but a Mac or Windows host should work fine since VirtualBox is supported on both. The specifics are based on SampleApp, but the process should be broadly the same for all VMs.
Obtain the VM
We’re going to use SampleApp, which can be downloaded from Oracle.
- Download the six-part archive from http://www.oracle.com/technetwork/middleware/bi-foundation/obiee-samples–167534.html
- Verify the md5 checksums against those published on the download page:
[oracle@asgard sampleapp406]$ ll
total 30490752
-rw-r--r-- 1 oracle oinstall 5242880000 Sep 9 01:33 SampleAppv406.zip.001
-rw-r--r-- 1 oracle oinstall 5242880000 Sep 9 01:30 SampleAppv406.zip.002
-rw-r--r-- 1 oracle oinstall 5242880000 Sep 9 02:03 SampleAppv406.zip.003
-rw-r--r-- 1 oracle oinstall 5242880000 Sep 9 02:34 SampleAppv406.zip.004
-rw-r--r-- 1 oracle oinstall 5242880000 Sep 9 02:19 SampleAppv406.zip.005
-rw-r--r-- 1 oracle oinstall 4977591522 Sep 9 02:53 SampleAppv406.zip.006
[oracle@asgard sampleapp406]$ md5sum *
2b9e11f69ada5f889088dd74b5229322  SampleAppv406.zip.001
f8a1a5ae6162b20b3e9c6c888698c071  SampleAppv406.zip.002
68438cfea87e8d3a2e2f15ff00dadf12  SampleAppv406.zip.003
b71d9ace4f75951198fc8197da1cfe62  SampleAppv406.zip.004
4f1a5389c9e0addc19dce6bbc759ec20  SampleAppv406.zip.005
2c430f87e22ff9718d5528247eff2da4  SampleAppv406.zip.006
- Unpack the archive using 7zip — the instructions for SampleApp are very clear that you must use 7zip, and not another archive tool such as winzip.
[oracle@asgard sampleapp406]$ time 7za x SampleAppv406.zip.001

7-Zip (A) [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,80 CPUs)

Processing archive: SampleAppv406.zip.001

Extracting  SampleAppv406Appliance
Extracting  SampleAppv406Appliance/SampleAppv406ga-disk1.vmdk
Extracting  SampleAppv406Appliance/SampleAppv406ga.ovf

Everything is Ok

Folders: 1
Files: 2
Size:       31191990916
Compressed: 5242880000

real    1m53.685s
user    0m16.562s
sys     1m15.578s
- Because we need to change a couple of things on the VM first (see below), we’ll have to import the VM into VirtualBox so that we can boot it up and make these changes. You can import using the VirtualBox GUI, or, as I prefer, the VBoxManage command line interface. I like to time all these things (just because, numbers), so stick a time command on the front:
time VBoxManage import --vsys 0 --eula accept SampleAppv406Appliance/SampleAppv406ga.ovf
This took 12 minutes or so, but that was on a high-spec system, so YMMV.
[...]
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Successfully imported the appliance.

real    12m15.434s
user    0m1.674s
sys     0m2.807s
Preparing the VM
Importing Linux VMs to Amazon EC2 will only work if the kernel is supported, which according to an AWS blog post includes Red Hat Enterprise Linux 5.1 – 6.5. Whilst SampleApp v406 is built on Oracle Linux 6.5 (which isn’t listed by AWS as supported), we have the option of telling the VM to use a kernel that is Red Hat Enterprise Linux compatible (instead of the default Unbreakable Enterprise Kernel – UEK). There are some other pre-requisites that you need to check if you’re trying this with your own VM, including a network adaptor configured to use DHCP. The aforementioned blog post has details.
- Boot the VirtualBox VM, which should land you straight in the desktop environment, logged in as the oracle user.
- We need to modify a file as root (superuser). Here’s how to do it graphically, or use vi if you’re a real programmer:
  - Open a Terminal window from the toolbar at the top of the screen
  - Enter sudo gedit /etc/grub.conf – the sudo bit is important, because it tells Linux to run the command as root. (I’m on an xkcd-roll here: 1, 2)
  - In the text editor that opens, you will see a header to the file and then a set of repeating sections beginning with title. These are the available kernels that the machine can run under. The default is 3, which is zero-based, so it’s the fourth title section. Note that the kernel version details include uek, which stands for Unbreakable Enterprise Kernel – and is not going to work on EC2.
  - Change the default to 0, so that we’ll instead boot to a Red Hat Compatible Kernel, which will work on EC2 (if you’d rather script this change, there’s a sketch after these steps)
  - Save the file
- Optional steps:
- Whilst you’ve got the server running, add your SSH key to the image so that you can connect to it easily once it is up on EC2. For more information about SSH keys, see my previous blog post here, and a step-by-step for doing it on SampleApp here.
- Disable non-SSH key logins (in /etc/ssh/sshd_config, set PasswordAuthentication no and PubkeyAuthentication yes), so that your server, once on EC2, is less vulnerable to attack. This is particularly important if you’re using the stock image with Admin123 as the root password.
- Set up screen, and OBIEE and the database as Linux services, both covered in my article here.
- Shutdown the instance by entering this at a Terminal window:
sudo shutdown -h now
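As an aside, if you’d rather script the kernel and SSH edits above instead of making them by hand, the following is a minimal sketch of the same changes, run from a Terminal on the VM before shutting it down. It assumes the stock SampleApp v406 layout (the Red Hat Compatible Kernel as grub entry 0, and /etc/grub.conf as a symlink to /boot/grub/grub.conf), so check the files before you reboot.

# Back up grub.conf, then set the default kernel to entry 0 (the Red Hat Compatible Kernel)
sudo cp /boot/grub/grub.conf /boot/grub/grub.conf.bak
sudo sed -i --follow-symlinks 's/^default=.*/default=0/' /etc/grub.conf
# Disable password logins over SSH and make sure key-based logins stay enabled
sudo sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sudo sed -i 's/^#\?PubkeyAuthentication.*/PubkeyAuthentication yes/' /etc/ssh/sshd_config
sudo service sshd restart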
Export the VirtualBox VM to Amazon EC2
Now we’re ready to really get going. The first step is to export the VirtualBox VM to a format that Amazon EC2 can work with. Whilst they don’t explicitly support VMs from VirtualBox, they do support the VMDK format – which VirtualBox can create. You can do the export from the graphical interface, or as before, from the command line:
time VBoxManage export "OBIEE SampleApp v406" --output OBIEE-SampleApp-v406.ovf
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Successfully exported 1 machine(s).

real    56m51.426s
user    0m6.971s
sys     0m12.162s
If you compare the result of this to what we downloaded from Oracle it looks pretty similar – an OVF file and a VMDK file. The only difference is that the VMDK file is updated with the changes we made above, including the modified kernel settings which are crucial for the success of the next step.
[oracle@asgard sampleapp406]$ ls -lh
total 59G
-rw------- 1 oracle oinstall 30G Sep 9 10:55 OBIEE-SampleApp-v406-disk1.vmdk
-rw------- 1 oracle oinstall 15K Sep 9 09:58 OBIEE-SampleApp-v406.ovf
We’re ready now to get all cloudy. For this, you’ll need:
- An AWS account
- You’ll also need your AWS account’s Access Key and Secret Key
- AWS EC2 commandline tools installed, along with a Java Runtime Environment (JRE) 1.7 or greater:
wget http://s3.amazonaws.com/ec2-downloads/ec2-api-tools.zip
sudo mkdir /usr/local/ec2
sudo unzip ec2-api-tools.zip -d /usr/local/ec2
# You might need to fiddle with the following paths and version numbers:
sudo yum install -y java-1.7.0-openjdk.x86_64
# Quoting 'EOF' stops $PATH and $EC2_HOME being expanded now, so they are written to the profile literally
cat >> ~/.bash_profile <<'EOF'
export JAVA_HOME="/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.65.x86_64/jre"
export EC2_HOME=/usr/local/ec2/ec2-api-tools-1.7.1.1/
export PATH=$PATH:$EC2_HOME/bin
EOF
- Set your credentials as environment variables:
export AWS_ACCESS_KEY=xxxxxxxxxxxxxx
export AWS_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxx
- Ideally a nice fat pipe to upload the VM file over, because at 30GB it is not trivial (not in 2014, anyway)
What’s going to happen now is we use an EC2 command line tool to upload our VMDK (virtual disk) file to Amazon S3 (a storage platform), from where it gets converted into an EBS volume (Elastic Block Store, i.e. an EC2 virtual disk), and from there attached to a new EC2 instance (a “server”/”VM”).
Before we can do the upload we need an S3 “bucket” to put the disk image in that we’re uploading. You can create one from https://console.aws.amazon.com/s3/. In this example, I’ve got one called rmc-vms – but you’ll need your own.
Once the bucket has been created, we build the command line upload statement using ec2-import-instance:
time ec2-import-instance OBIEE-SampleApp-v406-disk1.vmdk --instance-type m3.large --format VMDK --architecture x86_64 --platform Linux --bucket rmc-vms --region eu-west-1 --owner-akid $AWS_ACCESS_KEY --owner-sak $AWS_SECRET_KEY
Points to note:
- m3.large is the spec for the VM. You can see the available list here. In the AWS blog post it suggests only a subset will work with the import method, but I’ve not hit this limitation yet.
- region is the AWS Region in which the EBS volume and EC2 instance will be built. I’m using eu-west-1 (Ireland), and it makes sense to use the one geographically closest to where you or your users are located. Still waiting for uk-yorks-1…
- architecture and platform relate to the type of VM you’re importing.
The upload process took just over 45 minutes for me, and that’s from a data centre with a decent upload:
[oracle@asgard sampleapp406]$ time ec2-import-instance OBIEE-SampleApp-v406-disk1.vmdk --instance-type m3.large --format VMDK --architecture x86_64 --platform Linux --bucket rmc-vms --region eu-west-1 --owner-akid $AWS_ACCESS_KEY --owner-sak $AWS_SECRET_KEY
Requesting volume size: 200 GB
TaskType  IMPORTINSTANCE  TaskId  import-i-fh08xcya  ExpirationTime  2014-09-16T10:07:44Z  Status  active  StatusMessage  Pending  InstanceID  i-b07d3bf0
DISKIMAGE  DiskImageFormat  VMDK  DiskImageSize  31191914496  VolumeSize  200  AvailabilityZone  eu-west-1a  ApproximateBytesConverted  0  Status  active  StatusMessage  Pending : Downloaded 0
Creating new manifest at rmc-vms/d77672aa-0e0b-4555-b368-79d386842112/OBIEE-SampleApp-v406-disk1.vmdkmanifest.xml
Uploading the manifest file
Uploading 31191914496 bytes across 2975 parts
0% |--------------------------------------------------| 100%
   |==================================================|
Done
Average speed was 11.088 MBps
The disk image for import-i-fh08xcya has been uploaded to Amazon S3 where it is being converted into an EC2 instance. You may monitor the progress of this task by running ec2-describe-conversion-tasks. When the task is completed, you may use ec2-delete-disk-image to remove the image from S3.

real    46m59.871s
user    10m31.996s
sys     3m2.560s
Once the upload has finished, Amazon automatically converts the VMDK (now residing on S3) into an EBS volume, and then attaches it to a new EC2 instance (i.e. a VM). You can monitor the status of this task using ec2-describe-conversion-tasks, optionally filtered on the TaskId returned by the import command above:
ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya
TaskType  IMPORTINSTANCE  TaskId  import-i-fh08xcya  ExpirationTime  2014-09-16T10:07:44Z  Status  active  StatusMessage  Pending  InstanceID  i-b07d3bf0
DISKIMAGE  DiskImageFormat  VMDK  DiskImageSize  31191914496  VolumeSize  200  AvailabilityZone  eu-west-1a  ApproximateBytesConverted  3898992128  Status  active  StatusMessage  Pending : Downloaded 31149971456
Now is an ideal time to mention, as a side note, the Linux utility watch, which simply re-issues a command for you every x seconds (2 by default). This way you can leave a window open and keep an eye on the progress of what is going to be a long-running job:
watch ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya

Every 2.0s: ec2-describe-conversion-tasks --region eu-west-1 import-i-fh08xcya          Tue Sep 9 12:03:24 2014

TaskType  IMPORTINSTANCE  TaskId  import-i-fh08xcya  ExpirationTime  2014-09-16T10:07:44Z  Status  active  StatusMessage  Pending  InstanceID  i-b07d3bf0
DISKIMAGE  DiskImageFormat  VMDK  DiskImageSize  31191914496  VolumeSize  200  AvailabilityZone  eu-west-1a  ApproximateBytesConverted  5848511808  Status  active  StatusMessage  Pending : Downloaded 31149971456
And whilst we’re at it, if you’re using a remote server to do this (as I am, to take advantage of the large bandwidth), you will find screen invaluable for keeping tasks running and being able to reconnect at will. You can read more about screen and watch here.
So back to our EC2 import job. To start with, the task will be Pending: (NB unlike lots of CLI tools, you read the output of this one left-to-right, rather than as columns with headings)
$ ec2-describe-conversion-tasks --region eu-west-1
TaskType  IMPORTINSTANCE  TaskId  import-i-ffvx6z86  ExpirationTime  2014-09-12T15:32:01Z  Status  active  StatusMessage  Pending  InstanceID  i-b2245ef2
DISKIMAGE  DiskImageFormat  VMDK  DiskImageSize  5021144064  VolumeSize  60  AvailabilityZone  eu-west-1a  ApproximateBytesConverted  4707330352  Status  active  StatusMessage  Pending : Downloaded 5010658304
After a few moments it gets underway, and you can see a Progress percentage indicator: (scroll right in the code snippet below to see)
TaskType  IMPORTINSTANCE  TaskId  import-i-fgr0djcc  ExpirationTime  2014-09-15T15:39:28Z  Status  active  StatusMessage  Progress: 53%  InstanceID  i-c7692e87
DISKIMAGE  DiskImageFormat  VMDK  DiskImageSize  5582545920  VolumeId  vol-f71368f0  VolumeSize  20  AvailabilityZone  eu-west-1a  ApproximateBytesConverted  5582536640  Status  completed
Note that at this point you’ll also see an Instance in the EC2 list, but it won’t launch (it has no attached disk – because the disk is still being imported!)
If something goes wrong you’ll see the Status as cancelled, such as in this example here where the kernel in the VM was not a supported one (observe it is the UEK kernel, which isn’t supported by Amazon):
TaskType  IMPORTINSTANCE  TaskId  import-i-ffvx6z86  ExpirationTime  2014-09-12T15:32:01Z  Status  cancelled  StatusMessage  ClientError: Unsupported kernel version 2.6.32-300.32.1.el5uek  InstanceID  i-b2245ef2
DISKIMAGE  DiskImageFormat  VMDK  DiskImageSize  5021144064  VolumeId  vol-91b1c896  VolumeSize  60  AvailabilityZone  eu-west-1a  ApproximateBytesConverted  5021128688  Status  completed
After an hour or so, the task should complete:
TaskType  IMPORTINSTANCE  TaskId  import-i-fh08xcya  ExpirationTime  2014-09-16T10:07:44Z  Status  completed  InstanceID  i-b07d3bf0
DISKIMAGE  DiskImageFormat  VMDK  DiskImageSize  31191914496  VolumeId  vol-a383f8a4  VolumeSize  200  AvailabilityZone  eu-west-1a  ApproximateBytesConverted  31191855472  Status  completed
At this point you can remove the VMDK from S3 (and should do, else you’ll continue to be charged for it), following the instructions for ec2-delete-disk-image
Booting the new server on EC2
Go to your EC2 control panel, where you should see an instance (EC2 term for “server”) in Stopped state and with no name.
Select the instance, and click Start on the Actions menu. After a few moments a Public IP will be shown in the details pane. But, we’re not home free quite yet…read on.
Firewalls
So this is where it gets a bit tricky. By default, the instance will have launched with Amazon’s Firewall (known as a Security Group) in place which – unless you have an existing AWS account and have modified the default security group’s configuration – is only open on port 22, which is for ssh traffic.
You need to head over to the Security Group configuration page; this can be accessed in several ways, but the easiest is clicking on the security group name in the instance details pane:
Click on the Inbound tab and then Edit, and add “Custom TCP Rule” for the following ports:
- 7780 (OBIEE front end)
- 7001 (WLS Console / EM)
- 5902 (oracle VNC)
You can make things more secure by allowing access to the WLS admin (7001) and VNC port (5902) to a specific IP address or range only.
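If you prefer to script these rules rather than click through the console, the same EC2 API tools used for the import can add them with ec2-authorize. A rough sketch only, assuming a hypothetical security group called sampleapp-sg and an office IP of 203.0.113.10 – substitute your own group name, region and addresses:

# OBIEE front end open to the world; WLS console and VNC restricted to a single address
ec2-authorize sampleapp-sg -P tcp -p 7780 -s 0.0.0.0/0 --region eu-west-1
ec2-authorize sampleapp-sg -P tcp -p 7001 -s 203.0.113.10/32 --region eu-west-1
ec2-authorize sampleapp-sg -P tcp -p 5902 -s 203.0.113.10/32 --region eu-west-1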
Whilst we’re talking about security, your server is now open to the internet and all the nefarious persons out there, so you’ll be wanting to harden your server not least by resetting all the passwords to ones which aren’t publicly documented in the SampleApp user documentation!
Once you’ve updated your Security Group, you can connect to your server! If you installed the OBIEE and database auto start scripts (and if not, why not??) you should find OBIEE running just nicely on http://[your ip]:7780/analytics – note that the port is 7780, not 9704.
If you didn’t install the script, you will need to start the services manually per the SampleApp documentation. To connect to the server you can ssh (using Terminal, PuTTY, etc) to the server or connect on VNC (Admin123 is the password). For VNC clients try Screen Share on Macs (installed by default), or RealVNC on Windows.
Caveats & Disclaimers
- Running a server on AWS EC2 costs real money, so watch out. Once you’ve put your credit card details in, Amazon will continue to charge your card whilst there are chargeable items on your account (EBS volumes, instances – running or not – and so on). You can get an idea of the scale of charges here.
- As mentioned above, a server on the open internet is a lot more vulnerable than one virtualised on your local machine. You will get poked and probed, usually by automated scripts looking for open ports, weak passwords, and so on. SampleApp is designed to open up the toybox of a pimped-out OBIEE deployment to you; it is not “hardened”, and if you’re not careful you risk learning the hard way why hardening matters.
Cloning
Amazon EC2 supports taking a snapshot of a server, either for backup/rollback purposes or spinning up as a clone, using an Amazon Machine Image (AMI). From the Instances page, simply select “Create an Image” to build your AMI. You can then build another instance (or ten) from this AMI as needed, exact replicas of the server as it was at the point that you created the image.
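The same thing can be scripted with the EC2 API tools; a minimal sketch, using the instance ID from the import task earlier and a name of your own choosing:

# Create an AMI from the running instance (by default the instance is rebooted to get a consistent image)
ec2-create-image i-b07d3bf0 -n "obiee-sampleapp-v406-baseline" --region eu-west-1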
Lather, Rinse, and Repeat
There’s a whole host of VirtualBox “appliances” out there, and some of them, such as the developer-tools-focused ones, only really make sense as local VMs. But there are plenty that would benefit from a bit of “Cloud-isation”, where they’re too big or heavy to keep on your laptop all the time, but are handy to be able to spin up at will. A prime example of this for me is the EBS Vision demo database that we use for our BI Apps training. Oracle used to provide a pre-built Amazon image (known as an AMI) of this, but has since withdrawn it. However, Oracle do publish Oracle VM VirtualBox templates for EBS 12.1.3 and 12.2.3 (related blog), so from this, with a bit of leg-work and a big upload pipe, it’s a simple matter to brew your own AWS version of it – ready to run whenever you need it.
Rittman Mead Announce New Partnerships with Cloudera and Amazon Web Services
To complement our strategic partnership with Oracle Corporation, Rittman Mead are very pleased to announce two new technology partnerships: with Cloudera, the industry-leading Hadoop software vendor, and with Amazon Web Services, the infrastructure-as-a-service provider that’s become the standard for deploying highly-available, secure cloud applications. As business intelligence, analytics and data warehousing systems extend into the cloud and come to encompass big data, these two new partnerships will allow us to offer best-of-breed services in these areas whilst complementing our main technology partnership with Oracle. So why Cloudera, why Amazon Web Services, and why now?
Extending BI from the EDW to the “Enterprise Data Hub”
If you’re a regular reader of this blog you’ll probably have read several articles by myself and other authors about cloud, Hadoop and big data. Whilst the vast majority of BI & DW systems we put together for clients today are based around regular database and application data sources, increasingly customers are asking us to help them bring non-traditional, NoSQL and Hadoop-based data sources into their BI platform, adding what are becoming termed “enterprise data hubs” alongside the more traditional enterprise data warehouses. We’re also seeing more interest in deploying these systems into the cloud – either driven by the economics of the cloud vs. on-premise systems, or because other key line-of-business systems are moving into the cloud and it makes sense to deploy your BI there too. And Oracle, of course, have noticed this trend, with support being added to OBIEE and ODI for reporting against, and loading data out of and into, Hadoop data sources, and with Oracle themselves due to launch Oracle Business Intelligence Cloud Services in the next couple of months.
Oracle of course already have some Hadoop-based products out in the market, including the very impressive Oracle Big Data Appliance and their Oracle Big Data Connectors, and we think these products have the potential to make a massive impact on the types of project and solutions we can offer customers today. But at Rittman Mead we like to get ahead of the technology curve, so we decided, in addition, to partner directly with Cloudera – makers of the Hadoop software within Big Data Appliance, and a significant software vendor in their own right – to ensure our team is at the forefront of delivering cutting-edge big data solutions based around Cloudera CDH, Apache Spark and Cloudera Impala.
Over the past few months we’ve started to deliver projects and PoCs for customers around Cloudera Hadoop technology and Oracle Engineered Systems, Database and BI tools, and what’s emerged as our core competence is the ability to bring together Oracle’s chosen Hadoop distribution and tools with the core Oracle technologies our customers use. We’ve developed systems using Oracle’s Big Data Connectors to, for example, analyse huge datasets across multiple Hadoop nodes using Oracle R Advanced Analytics for Hadoop, and we’re currently working with a couple of customers’ Oracle DBA teams who are in the process of adopting Oracle Big Data Appliance. Most importantly, as official Cloudera Partners we’ve got access to their technical and solutions architecture resources, giving us a similar level of technical backup to that which we have on our core Oracle projects.
BI Moves into the Cloud, for Improved Business Agility and TCO
The other big trend that’s driving a lot of innovation in the BI industry is “cloud”. If you were at one of our BI Forum events in Brighton and Atlanta this year, you’ll know that cloud is front-and-centre in Oracle’s product strategy at the moment, both in terms of making BI available as part of the wider Oracle Public Cloud, and as a way of accelerating innovation and making more of the product self-service. What’s now been officially named “Oracle BI Cloud Service” (BICS) was announced at last year’s Openworld, and we’re about to help beta-test the product for Oracle prior to its launch later in the year; we’re expecting BICS to be particularly attractive to existing OBIEE customers looking to quickly spin up departmental BI environments without the need for IT to get involved.
But as I covered in a couple of blog posts earlier in the year, BICS functionality is likely to be limited in its initial incarnation and many customers are going to be looking to run “full” OBIEE in the cloud, along with a full Oracle database and an ETL infrastructure using tools such as ODI or Informatica, and for these types of customer a more complete cloud solution will be needed – which is why we’ve also partnered with Amazon Web Services, in our view by far the best cloud service provider on the market and the platform behind companies such as Netflix and Dropbox.
We’ve been users of Amazon AWS for around three or four years now, initially running our training servers on the platform but more recently undertaking internal and customer development work on their Virtual Private Cloud platform as part of our new “Extreme BI in the Cloud” initiative. As techies, we appreciate the power and flexibility of the Amazon AWS API, which we use to script and automate much of our “DevOps” work, and we’re finding more and more of our customers are moving to AWS independently, due to their excellent uptime and levels of customer service, and the wide ecosystem of AWS-native and partner products such as Amazon Redshift, Amazon Elastic MapReduce, Amazon EC2 and S3, and Attunity CloudBeam. Again, we’ve partnered officially with Amazon AWS so that we can train up our team and access technical and solutions architecture resources, and as with the Cloudera partnership, our particular specialisation is in deploying Oracle-based technologies onto the Amazon AWS platform.
So does this mean that Rittman Mead are de-emphasising our Oracle Partnership or otherwise taking our attention away from solutions built on OBIEE, ODI, Essbase and the Oracle Database? Far from it; the core of the company will always be around Oracle technology, but by partnering with two best-of-breed complementary organisations – Cloudera, for Oracle-centric Hadoop and big data solutions, and Amazon AWS, for cloud and hybrid deployments – we can continue to offer customers the most innovative and industry-leading solutions as BI evolves from its database and on-premise roots to cover big data and cloud deployments.
Look out for our blog over the coming months as we cover integrating Oracle technology with Cloudera’s Hadoop platform and Oracle Big Data Appliance, and look out for products and services where we combine Oracle and Cloudera technology in innovative ways, creating cutting-edge solutions for our customers on-premise, in the Amazon and Oracle clouds, or as hybrid on-premise/cloud deployments.
Thoughts on Using Amazon Redshift as a Replacement for an Oracle Data Warehouse
Recently, my colleague, Pete Carpenter, described a proof of concept we carried out using Amazon Redshift as the data warehouse storage layer in a system capturing data from Oracle E-Business Suite (EBS) using Attunity CloudBeam in conjunction with Oracle Data Integrator (ODI) for specialised ETL processing and Oracle Business Intelligence (OBI) as the reporting tool.
In this blog I will look at Amazon Redshift and how it compares with a more traditional DW approach using, as my example, Oracle. I am not going to talk performance in absolute terms as your mileage is going to vary.
What is Redshift?
Redshift is the Amazon cloud data warehousing server; it can interact with Amazon EC2 and S3 components but is managed separately using the Redshift tab of the AWS console. As a cloud-based system it is rented by the hour from Amazon, and broadly the more storage you hire the more you pay. Currently there are two families of Redshift node: the traditional hard-disk based family, and the recently introduced SSD family, which has less storage but far more processing power and faster CPUs. For our trials we looked at the traditional disk-based storage on a 2-node cluster to give us 4TB of disk spread across 4 CPU cores. Apart from single-node configurations, Redshift systems consist of a leader node and two or more database nodes; the leader node is supplied free of charge (you only pay for the database nodes) and is responsible for acting as the query parser, coordinating the results from the database nodes, and being a central network address for user access.
The Redshift product has its origins in ParAccel, which in turn has its roots in Postgres, and thus it supports ANSI SQL and the Postgres ODBC and JDBC drivers. In basic terms it is a shared-nothing, parallel-processing, columnar-store database that supports columnar compression.
At the cluster level all sorts of robustness features come into play to handle routine hardware failures such as the loss of a node or disk; regular automatic backups occur, and on-demand backups can be made to S3 storage for DR or replication to other AWS networks. It is possible to dynamically change the number and/or type of Redshift nodes in use; in effect a new cluster is spun up and the data copied from the existing system to the new one before the old system is dropped. The original database remains open for query (but not update) during the scale-out (or scale-down) process. As Pete Carpenter described, creating a new Redshift instance is a simple matter of completing a few web forms and waiting for the cluster to come up. Once up, you can connect to the database using the master credentials you specified at cluster creation and then create databases, users, and schemas as required.
Databases, users, schemas and security
Although it is possible to run a Redshift database using the master user and the default database, good practice suggests that we do a bit more than this. In some ways Redshift is a little like the Oracle 12c database in that we can create additional databases within the master database, much in the style of pluggable databases; a major difference comes with the concept of a USER. In Oracle 12c a user belongs to a pluggable database; in Redshift all users belong to the master (container) database and can see any of the contained databases (subject to grants). Schemas are logical groupings for objects and need not be aligned to database user names. Standard object and role grants allow users to access specific databases, schemas, and tables, or to have role-rights such as administrator. The final aspect of security is outside the database and is in effect a firewall rule to permit nominated AWS users or specified IP addresses to speak to the database listener; by default the rule is no inbound access. The diagram below is a block representation of how databases, users, schemas and the firewall interrelate. Note user names are descriptive and not valid names!
Database Design
A key point of difference between Amazon Redshift and Oracle is in how the data is stored or structured in the database. An understanding of this is vital to designing a performant data warehouse. With Oracle we have shared storage (SAN or local disk) attached to a pool of processors (a single machine or a cluster); Redshift, however, uses a shared-nothing architecture, that is, the storage is tied to the individual processor cores of the nodes. As with Oracle, data is stored in blocks; however the Redshift block size is much larger (1MB) than the usual Oracle block sizes. The real difference is how tables are stored in the database: Redshift stores each column separately and optionally allows one of many forms of data compression. Tables are also distributed across the node slices so that each CPU core has its own section of the table to process. In addition, data in the table can be sorted on a sort column, which can lead to further performance benefits; I will discuss this in the section on tables.
Not all of the database features we come to expect in an Oracle data warehouse are available to us in Redshift. The Redshift Developer Guide has the full rundown on what is available, but for now here is a short list of common DW features that are not going to be available to us.
- Tablespaces
- Indexes
- Partitions
- Constraints
- Check
- Primary, Unique, Foreign Key (all usable by optimizer but not enforced)
- Spatial (Locator) functionality
- Sequences (although there is an IDENTITY column option)
- MERGE – we have to code as UPDATE and INSERT in two steps
- In-database PL/SQL-like language
- Triggers
- User defined functions
- Procedures
- Timestamps (with timezone)
- XML types
- Pseudo columns
- Various SQL functions (not a full list, but functions I often use in ETL processes)
- Regular expressions
- Regression functions
- SUBSTR
- TRANSLATE
- ROW_NUMBER
- TO_TIMESTAMP
In addition data types may not be exactly the same as those used in Oracle; for example DATE in Oracle has a resolution of 1 SECOND, DATE in Redshift has a resolution of 1 DAY.
Tables
The basic Oracle syntax to create a table works (as does CTAS, Create Table As Select), however there are additional items we can, and should, specify at table creation.
By default the data distribution style is EVEN, that is, data is distributed between node-slices in a round-robin fashion. For performance we may wish to specify a distribution key so that a particular column controls how data is distributed; this is a similar concept to Oracle hash partitioning, and has the same sort of performance characteristics. We aim to create an even distribution of rows per slice (else one slice will take longer than the others to process its data), and by applying the same distribution to other tables that are commonly joined we can benefit from improved table-join performance, as the rows being joined are stored in the same node-slice. Sometimes it is more appropriate to replicate the whole table to each slice so that the data is always available to join without the need to move data to the same slice before joining; in such cases we set the distribution style to ALL.
The second thing we can set on a table is the SORTKEY. This specifies one or more columns on the table by which the data is ordered on data load (it can be the same column as the distribution key). Redshift maintains information on the minimum and maximum values of the sort key in each database block, and at query time uses this information to skip blocks that do not contain data of interest.
Finally, we can elect to compress columns in the database. If we do not specify compression, the default of RAW (i.e. uncompressed) is used. For compressed data we can specify the compression algorithm used; different algorithms are better for certain data types and values. Compression may be block-based (DELTA, BYTE-DICTIONARY, RUNLENGTH, TEXT255 and TEXT32K) or value-based (LZO and the MOSTLY encodings). This sounds daunting, but there are two ways we can get compression suggestions from the database: using the ANALYZE COMPRESSION command on a loaded table, or the automatic compression feature of the COPY command (this, however, requires an empty, non-compressed target table); COPY is the Redshift equivalent of SQL*Loader and takes a flat file and inserts it into the database.
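As a rough illustration of both routes (the bucket, credentials and table names here are hypothetical):

-- Route 1: let COPY choose column encodings while loading an empty, uncompressed table
COPY staging.t1
FROM 's3://rmc-dw-stage/t1_data.csv'
CREDENTIALS 'aws_access_key_id=<access key>;aws_secret_access_key=<secret key>'
DELIMITER ','
COMPUPDATE ON;

-- Route 2: ask Redshift to suggest encodings for a table that already contains data
ANALYZE COMPRESSION staging.t1;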
Let’s consider a simple table T1 with three columns, C1, C2 and C3. We can create this using a simple piece of DDL:
CREATE TABLE T1
(
  C1 INTEGER NOT NULL,
  C2 VARCHAR(20) NOT NULL,
  C3 DATE
);
I have not used any of the Redshift nice-to-have features for sorting, distribution, and compression of data. Note, too, that I am using NOT NULL constraints; this is the only constraint type enforced in the database. This simple create statement creates database objects on each slice of the cluster, with one block per column per slice (1 slice = 1 CPU core) – see the following diagram. Note there is no table object stored in the database; it is a collection of columns.
Without specifying a distribution key, data is evenly spread across all slices. When a 1MB block for a column is full, a new block is created for subsequent inserts on that slice. An empty table will therefore occupy block size * number of columns * number of cores; with our 1MB block size this is (columns * cores) megabytes – for example, the three-column T1 above on a 4-core cluster occupies at least 12MB before a single row is loaded.
Using a distribution key effectively hashes the data on the key column across the cores. Adding a sort key declares that the rows in the table are ordered, and potentially allows block elimination to kick in. If our sort key is, say, transaction date, it is likely that our data loads occur in transaction date order; however, if we sorted on product code we might find each data load has data that needs to be inserted between existing rows. This does not happen – the data is still appended to the table, and the table then needs to be reorganised to put the rows back in order. There are two ways to achieve this: the VACUUM command, which does an online reorganisation of the table, or the potentially faster route of creating a copy table, populating it, and then dropping the original and renaming the copy; of course this gives a little downtime when the original table is not available for access.
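A sketch of both approaches, assuming the T1 table above had been created with C3 as its sort key:

-- Option 1: online reorganisation in place
VACUUM t1;

-- Option 2: deep copy into a new table in sort order, then swap (brief downtime while the original is dropped)
CREATE TABLE t1_sorted (LIKE t1);
INSERT INTO t1_sorted SELECT * FROM t1 ORDER BY c3;
DROP TABLE t1;
ALTER TABLE t1_sorted RENAME TO t1;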
Applying compression, sort and distribution we get a DDL statement like:
CREATE TABLE T2
(
  C1 INTEGER NOT NULL,
  C2 VARCHAR(20) NOT NULL SORTKEY DISTKEY,
  C3 DATE ENCODE DELTA
);
This table uses column C2 as both the sort key and the distribution key; column C3 is compressed using delta compression – an efficient encoding when most dates are within ±127 days of the date of the previous row. If we wanted to use a multi-column sort key, the DDL syntax would be like:
CREATE TABLE T1
(
  C1 INTEGER NOT NULL,
  C2 VARCHAR(20) NOT NULL DISTKEY,
  C3 DATE
)
SORTKEY (C3,C2);
Multi-column distribution keys are not supported.
Designing for Performance
Redshift is designed for query and bulk insert operations; we can optimise query performance by structuring data so that less data is transferred between nodes in join operations, or less data is read from disk in a table scan. Choosing the right sort keys and distribution keys is vital in this process. Ideally these key columns should not be compressed. Adding primary and foreign keys to the tables tells the optimizer about the data relationships and thus improves the quality of the query plans generated. Up-to-date table stats are a given too: tables must be ANALYZEd whenever the contents change significantly, and certainly after initial load; I feel that we should collect stats after each data load.
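For example, after a significant load we might refresh statistics and declare the (unenforced) keys on a fact table; a hedged sketch only, with hypothetical table and column names:

-- Refresh optimizer statistics after the load
ANALYZE performance.fact_gl_balances;

-- Informational constraints: not enforced, but available to the query planner
ALTER TABLE performance.fact_gl_balances
  ADD CONSTRAINT fact_gl_pk PRIMARY KEY (gl_balance_id);
ALTER TABLE performance.fact_gl_balances
  ADD CONSTRAINT fact_gl_ledger_fk FOREIGN KEY (ledger_id)
  REFERENCES performance.dim_ledger (ledger_id);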
For a FACT + DIMENSIONS data model (such as in the performance layer of Oracle’s Reference Data Warehouse Architecture) it would be appropriate to distribute data on the dimension key of the largest dimension on both the dimension and the fact tables; this will reduce the amount of data being moved between slices to facilitate joins.
For optimal performance we should always ensure we include both the distribution keys and the sort keys in any query, even if they appear to be redundant. The presence of these keys forces the optimizer to access the tables in an efficient way.
For best data load performance we insert rows in bulk and in sort key order. Amazon claim best performance comes from using the COPY command to load from flat files, and as second best the bulk insert SQL commands such as CTAS and INSERT INTO T1 (select * from T2);. Where Redshift performs less well is when we use certain kinds of ETL steps in our process, particularly those that involve updating rows or single-row activities. In addition, loading data without respecting the sort key leads to performance problems on data query. If data update is essential we have two real options: we move our ETL processes to a conventional database hub server (perhaps using ODI) and just use Redshift to store pre-transformed data; or we revise our ETL processes to minimise update activity on the Redshift platform. There is some scope to optimise updates by distributing data on the update key, but another approach is to use temporary tables to build the results of the update and to replace the table with the results of the merge. This requires a bit of inventiveness with the ETL design, but fortunately many of our required SQL constructs, including analytic functions, are there to help us.
Building an Amazon Redshift Data Warehouse using ODI and Attunity Part 2 : Using ODI and OBIEE with Redshift as a Target/Source
In my post yesterday we stepped through the initial set up and staging load for a data warehouse using Amazon Redshift and Attunity, for eventual use with OBIEE and ODI. Now that we have our source data in Redshift, let’s look at transforming it into a star schema using ODI, by initially looking how we set up the connection to Redshift in ODI’s Topology Navigator.
As I mentioned in yesterday’s post, and in a blog article we wrote on non-Oracle databases a few months ago, Redshift is based on ParAccel technology but uses PostgreSQL ODBC and JDBC drivers to connect to it. Therefore, we can create a Physical Architecture connection to it in ODI using the PostgreSQL technology type, like this:
For the Redshift JDBC URL you’ll need, refer to the AWS Cluster summary page back on the Amazon AWS Console.
We can then add our schemas:
Then, if your connection to the Redshift database is working correctly, you should see a list of schemas in the drop down menus:
Once all the schemas have been added, we then go through the familiar ODI process of logical architecture and context mapping:
Next, we can create a model and reverse-engineer in our data stores:
It’s a good idea at this point to review each datastore and ensure that all attributes have been correctly assigned a datatype. To save a little reading time, a set of dimension and fact tables were created in the performance schema, and these were also reverse engineered in:
So from here on out, the process of creating and using mappings, packages, variables etc to populate your target tables is fairly straightforward, one which ODI Developers will be familiar with. You add your source and target data stores and implement the transformation logic.
For the initial load, the SQL Control Append IKM was used, and for the most part this KM worked well without customisation. However, one of the early issues encountered during execution was with unordered outer joins – this appears to be an issue with the PostgreSQL technology. It was resolved by amending the mapping and setting an order for the joins:
Merging / Incremental loading
One of the things to be aware of with Redshift at the time of writing is the absence of MERGE functionality (we’re so spoilt in Oracle DBMS land ;) ). You can of course achieve this by issuing separate insert and update statements. There is also the SQL Incremental Update IKM in ODI, but you may want to review the KM task steps against your requirements before proceeding, as invariably some customisation will be required to improve performance. The same applies to SCD functionality – you should consider utilising analytic window functions (Redshift supports several aggregate and ranking window functions) to achieve your desired format.
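For reference, the usual Redshift workaround for the missing MERGE is a transaction that deletes the matching rows and then re-inserts them from a staging table; a minimal sketch, with hypothetical table and column names:

-- Upsert pattern: changed rows are assumed to have been landed in staging.gl_balances_delta
BEGIN;

DELETE FROM performance.fact_gl_balances
USING staging.gl_balances_delta d
WHERE performance.fact_gl_balances.gl_balance_id = d.gl_balance_id;

INSERT INTO performance.fact_gl_balances
SELECT * FROM staging.gl_balances_delta;

COMMIT;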
Additionally, as mentioned previously, Attunity supports a variety of replication tasks – you can do a one-off initial load, an initial load plus apply changes, and/or collect deltas into a change tracking table in the Redshift database. This in itself means you have the basic building blocks to create and maintain a Foundation layer as part of your data warehouse design.
So, once we have data in our target fact and dimension tables, we can model these in OBIEE and create some analyses.
OBIEE Configuration
As per the AWS documentation for Redshift, the recommended ODBC driver was downloaded from the Amazon AWS website, and an ODBC data source was created on the OBIEE server:
For the server entry, you can use either the public or private leader node IP, depending on your network/security configuration. Just backtracking a little here – you may recall that the usr_dw account was created after the initial Redshift cluster build. In order for it to be able to access the tables in the performance schema, we need to grant the required select privileges.
Once the ODBC connection is defined, we can create an RPD, define our database connection properties, import our tables and start modelling.
When it came to setting the database features, I used the Query DBMS function and then went with what it set. Bear in mind that because PostgreSQL and Redshift aren’t supported source databases for OBIEE, depending on the analyses you are creating you may or may not hit functionality issues, so you may find yourself tweaking these settings.
So if your connection is defined correctly and there are no other issues, you should now be able to import the tables:
Once you’ve made your selection, you can then define the relationships between the fact and dimension tables (if you have primary and foreign keys defined, these should propagate through), create any aliases etc., and then you’re about ready to start building up the Business Model and Presentation areas.
Once the RPD modelling is completed, we can upload it and create some content.
Admittedly the analyses created for this dashboard were fairly basic, so no functionality issues were encountered during creation and execution – you might hit some with more “edge-cases” such as analytic functions or nested subtotals, or you might end up hitting performance issues when you’ve got lots of concurrent users, as the SQL issued by OBIEE might not be as efficient as it would be for a supported data source.
Performance Considerations for Redshift
Pete Scott will be covering the ins and outs of designing for performance in a blog post tomorrow, for which I’ll add a link to this post once it’s up. However, one thing that should be mentioned is the importance of setting sort and distribution keys appropriately on the fact and dimension tables, as well as Primary and Foreign keys (whilst these constraints are not enforced, they are used by the optimiser when determining execution plans). The sort and distribution keys determine how the data is stored on disk and how it is distributed across the compute nodes, and they can make a big impact on query response times – quite apart from any issues arising from the lack of official support in OBIEE for Redshift as a data source.
So to conclude – whilst not officially supported, it is possible to create and maintain a data warehouse in Redshift and use the Oracle product set to drive ETL and reporting. Some of the features that make Redshift an attractive DB platform are its ease of management, its scalability and sizing options, and the ability to get a cluster up and running in hours rather than days. Combined with ODI and OBIEE – and therefore a transferable skills base – it makes for an intriguing DW solution.
Building an Amazon Redshift Data Warehouse using ODI and Attunity Part 1 : Configuring Redshift, and CDC Using Attunity
One of our longstanding Oracle customers recently asked us to put together a proof-of-concept DW system using Amazon Redshift as the data warehouse database, rather than Oracle Database. The main driver for this was the economics of running Redshift in the cloud vs. Oracle on-premise or on Amazon RDS, and they were also interested in the potential performance benefits of running their data warehouse on a column-store database. They were also interested in trying out Attunity Replicate with CloudBeam as the CDC solution – a product conceptually similar to Oracle GoldenGate, but hosted on the Amazon AWS platform and licensed by data volume and by month, rather than the per-CPU, perpetual licensing of GoldenGate.
Because the client is a big user of Oracle ETL technology though – currently OWB, moving to ODI in the near future – they still wanted to use ODI to do the main data integration work, so an interesting question for us was whether ODI was a suitable tool for use with Redshift, which isn’t an officially supported source or target platform for ODI. It’s important to understand what “not supported” or “not certified” means in this context – it doesn’t mean it won’t work, it just means you’re on your own if it doesn’t. What gave us hope though was that Redshift uses Postgres-compatible ODBC and JDBC drivers, and we got it working with OBIEE a few months ago, so the premise for the PoC seemed reasonable to us. The diagram below shows the architecture we went with at the start of the project.
A little about the non-Oracle components:
- Amazon Redshift – Amazon’s fully-managed cloud data warehouse offering. Based on ParAccel technology (which itself has roots in PostgreSQL), the service utilises MPP, columnar storage and compression principles to enhance query performance.
- Attunity Replicate & CloudBeam – Attunity provide log-based Change Data Capture and replication functionality, which includes Redshift as a target database. Attunity supports ongoing source-to-target replication as well as storing change data in audit / change tracking tables. In order to replicate data to Redshift, you need to set up an account with Attunity and then sign up for a CloudBeam subscription.
- Amazon S3 – Amazon’s cloud storage service, which is used by Attunity to stage replication/change data prior to loading into the Redshift database.
Initial Configuration and Load
Source Database
The source RDBMS used for this proof of concept was an Oracle EBS database running on 11.1.0.7. In particular, we used the GL schema as this housed some sizeable data. The database requirements for Attunity Replicate are:
- Add supplemental logging (for this experiment full logging was added on the GL tables)
- Switch the database into archive log mode
- Optionally create a dedicated database account for Attunity – this requires select privileges on various V$ tables and the GL schema tables, plus execute on the LogMiner packages. It’s at this point you may want to touch base with your resident DBA :) (a rough sketch of the first two changes follows below)
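For the first two items, the commands on the source database look something like the sketch below. This is an outline only (the GL table names are illustrative, and switching to archive log mode needs a short outage), so agree it with your DBA first.

-- Run as SYSDBA on the source EBS database
-- Supplemental logging: database-wide, plus full logging on the GL tables of interest
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER TABLE gl.gl_je_headers ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
ALTER TABLE gl.gl_je_lines ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

-- Switch into archive log mode
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;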
S3 Storage
The Attunity CloudBeam service initially stages files in an S3 bucket before loading into the target database schema. In order to create this, browse to the S3 section of the AWS Console, and click Create Bucket. Once the bucket has been created, you then need to create a folder within this, in this instance the folder was named ORA_STG.
Redshift Cluster
You deploy a Redshift database from the AWS management console. From the Redshift page, select a Region, and then Launch Cluster. From here, fill out the cluster details, for example:
Leaving the database name blank will create a database named ‘dev’.
After clicking Continue, the next page allows you to size your cluster. For our test case, we opted for a two node cluster (which in fact has 3 nodes – 1 Leader node and 2 Compute nodes):
The next configuration screen allows you to set various security and network options:
Finally you are presented with a summary screen confirming your configuration options. The important thing to note here is that as soon as the cluster is launched and running, you will start being charged.
After you click Launch, the cluster build will begin, and after some time the Status will switch from creating to Available. Clicking the Cluster Name hyperlink will show you a summary screen, detailing security config, cluster status, information about backups and URLs for connecting to your cluster.
At the bottom of the screen is a list of public and private IPs for the Leader and Compute nodes of the cluster:
We used the public IP of the Leader node for our inbound client connections.
The final step in terms of cluster configuration was to configure a list of IPs that were authorised to access the cluster. This is handled from within the security screen on the Redshift Management Console.
Once you’ve authorised your machine, you can connect to the Redshift database via the JDBC/ODBC URLs shown on the cluster summary screen previously. Amazon recommend using SQL Workbench, and also link to JDBC and ODBC drivers which you should install. So, the next step was to enter the connection details, connect to the database and create some target users and schemas.
The connection details were defined using the downloaded jdbc driver, and the jdbc url from the AWS summary screen:
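The URL follows the standard PostgreSQL JDBC form, with the Redshift default port of 5439 and the database name at the end; a hypothetical example:

jdbc:postgresql://rmcluster.abc123xyz.eu-west-1.redshift.amazonaws.com:5439/dev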
Once connected, users and schemas were created with the following statements:
create schema performance;
create schema staging;
create schema foundation;

create user usr_stg with password '<password>';
create user usr_fnd with password '<password>';
create user usr_dw with password '<password>';
create user odi_work with password '<password>';
NB: in Redshift, passwords must contain one uppercase character, a number, and be at least 8 characters long.
Once the users and schemas are created, we can then grant privileges to the odi_work account to be able to create objects across the schemas:
grant usage,create on schema staging to odi_work;
grant usage,create on schema performance to odi_work;
grant usage,create on schema foundation to odi_work;
Of course, how you would manage security and database privileges in your own environment is entirely up to you. You can configure object level privileges as your database security guidelines dictate.
Once your users are set up, you can create objects and grant privileges as required.
Attunity
Once the Attunity Replicate software is installed, it’s a case of adding the source and target databases.
Source: Oracle EBS database
Clicking Manage Databases and then Add Database brings up the following dialogue, in which you fill in the details of the database. The Advanced tab contains details around how Attunity implements and executes CDC/log Mining.
Target: Redshift Cluster
You follow the same process as above, but select Amazon Redshift as the Type and specify it as a target. Within this screen, you enter details of the cluster endpoint address, the port (refer back to the AWS cluster summary screen), and a user name and password to access the database. If you have entered the details correctly and you can reach the cluster, you should be able to drop down the Database Name list and see the dev database. The next section on this dialog is to enter details about your Attunity CloudBeam subscription, and then your S3 bucket and folder. It should be noted here that in your environment you may not want to use your master user db account, but instead opt for a slightly less privileged user.
Once you have filled out each section, it is worth clicking the Test button on the bottom left of the dialog to ensure all details are correct.
So, now that we have our source and target databases configured, the next step is to create a Replication Task. The task can be a combination of the following options:
- Full Load – replicate a table from source to target
- Apply Changes – combined with the above, maintains the target copy with changes that occur on source
- Store Changes – CDC deltas stored in a separate table
Once you’ve decided on the type of replication task, you drag and drop the source and targets onto the palette, configure which tables you want to replicate from source, and specify any transformation rules e.g. to move to a different target schema. In our test case, we created an initial load task and a rule was created to map the GL schema in the source to the staging schema in the target.
Within the task settings, you can choose to either create new target tables, or use pre-created ones. This is useful if you want to create the tables with certain options, e.g. specifying the sort and distribution keys.
Once you’ve finished reviewing and tweaking settings, you can run the task. Once kicked off, you can review progress from the Monitor screen:
Any errors will be flagged up and can be reviewed in the Messages section. When the load completes, you can verify the existence of the tables in the Redshift database:
We now need to grant select privileges on the above objects to odi_work (unless you configured Attunity to use the odi_work account) before it can access them: despite residing in a schema odi_work can create objects in, they are owned by a different user, and therefore permissions still need to be granted.
grant select on <schema>.<object> to odi_work;
So the first stage of getting our data from source to Redshift is complete. Tomorrow, we will look at transforming our initial load data into a target star schema using ODI.