Category Archives: Rittman Mead

ETL Offload with Spark and Amazon EMR – Part 3 – Running pySpark on EMR

In the previous articles (here, and here) I gave the background to a project we did for a client, exploring the benefits of Spark-based ETL processing running on Amazon's Elastic Map Reduce (EMR) Hadoop platform. The proof of concept we ran was on a very simple requirement, taking inbound files from a third party, joining to them to some reference data, and then making the result available for analysis.

I showed here how I built up the prototype PySpark code on my local machine, using Docker to quickly and easily make available the full development environment needed.

Now it's time to get it running on a proper Hadoop platform. Since the client we were working with already have a big presence on Amazon Web Services (AWS), using Amazon's Hadoop platform made sense. Amazon's Elastic Map Reduce, commonly known as EMR, is a fully configured Hadoop cluster. You can specify the size of the cluster and vary it as you want (hence, "Elastic"). One of the very powerful features of it is that being a cloud service, you can provision it on demand, run your workload, and then shut it down. Instead of having a rack of physical servers running your Hadoop platform, you can instead spin up EMR whenever you want to do some processing - to a size appropriate to the processing required - and only pay for the processing time that you need.

Moving my locally-developed PySpark code to run on EMR should be easy, since they're both running Spark. Should be easy, right? Well, this is where it gets - as we say in the trade - "interesting". Part of my challenges were down to the learning curve in being new to this set of technology. However, others I would point to more as being examples of where the brave new world of Big Data tooling becomes less an exercise in exciting endless possibilities and more stubbornly Googling errors due to JAR clashes and software version mismatches...

Provisioning EMR

Whilst it's possible to make the entire execution of the PySpark job automated (including the provisioning of the EMR cluster itself), to start with I wanted to run it manually to check each step along the way.

To create an EMR cluster simply login to the EMR console and click Create

I used Amazon's EMR distribution, configured for Spark. You can also deploy a MapR-based hadoop platform, and use the Advanced tab to pick and mix the applications to deploy (such as Spark, Presto, etc).

The number and size of the nodes is configured here (I used the default, 3 machines of m3.xlarge spec), as is the SSH key. The latter is very important to get right, otherwise you won't be able to connect to your cluster over SSH.

Once you click Create cluster Amazon automagically provisions the underlying EC2 servers, and deploys and configures the software and Hadoop clustering across them. Anyone who's set up a Hadoop cluster will know that literally a one-click deploy of a cluster is a big deal!

If you're going to be connecting to the EMR cluster from your local machine you'll want to modify the security group assigned to it once provisioned and enable access to the necessary ports (e.g. for SSH) from your local IP.

Deploying the code

I developed the ETL code in Jupyter Notebooks, from where it's possible to export it to a variety of formats - including .py Python script. All the comment blocks from the Notebook are carried across as inline code comments.

To transfer the Python code to the EMR cluster master node I initially used scp, simply out of habit. But, a much more appropriate solution soon presented itself - S3! Not only is this a handy way of moving data around, but it comes into its own when we look at automating the EMR execution later on.

To upload a file to S3 you can use the S3 web interface, or a tool such as Cyberduck. Better, if you like the command line as I do, is the AWS CLI tools. Once installed, you can run this from your local machine:

aws s3 cp Acme.py s3://foobar-bucket/code/Acme.py

You'll see that the syntax is pretty much the same as the Linux cp comand, specifying source and then destination. You can do a vast amount of AWS work from this command line tool - including provisioning EMR clusters, as we'll see shortly.

So with the code up on S3, I then SSH'd to the EMR master node (as the hadoop user, not ec2-user), and transfered it locally. One of the nice things about EMR is that it comes with your AWS security automagically configred. Whereas on my local machine I need to configure my AWS credentials in order to use any of the aws commands, on EMR the credentials are there already.

aws s3 cp s3://foobar-bucket/code/Acme.py ~

This copied the Python code down into the home folder of the hadoop user.

Running the code - manually

To invoke the code, simply run:

spark-submit Acme.py

A very useful thing to use, if you aren't already, is GNU screen (or tmux, if that's your thing). GNU screen is installed by default on EMR (as it is on many modern Linux distros nowadays). Screen does lots of cool things, but of particular relevance here is it lets you close your SSH connection whilst keeping your session on the server open and running. You can then reconnect at a later time back to it, and pick up where you left off. Whilst you're disconnected, your session is still running and the work still being processed.

From the Spark console you can monitor the execution of the job running, as well as digging into the details of how it undertakes the work. See the EMR cluster home page on AWS for the Spark console URL

Problems encountered

I've worked in IT for 15 years now (gasp). Never has the phrase "The devil's in the detail" been more applicable than in the fast-moving world of big data tools. It's not suprising really given the staggering rate at which code is released that sometimes it's a bit quirky, or lacking what may be thought of as basic functionality (often in areas such as security). Each of these individual points could, I suppose, be explained away with a bit of RTFM - but the nett effect is that what on paper sounds simple took the best part of half a day and a LOT of Googling to resolve.

Bear in mind, this is code that ran just fine previously on my local development environment.

When using SigV4, you must specify a 'host' parameter

boto.s3.connection.HostRequiredError: BotoClientError: When using SigV4, you must specify a 'host' parameter.

To fix, switch

conn_s3 = boto.connect_s3()  

for

conn_s3 = boto.connect_s3(host='s3.amazonaws.com')  

You can see a list of endpoints here.

boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request

Make sure you're specifying the correct hostname (see above) for the bucket's region. Determine the bucket's region from the S3 control panel, and then use the endpoint listed here.

Error: Partition column not found in schema

Strike this one off as bad programming on my part; in the step to write the processed file back to S3, I had partitionBy='', in the save function

duplicates_df.coalesce(1).write.save(full_uri,
                                     format='com.databricks.spark.csv',
                                     header='false',
                                     partitionBy='',
                                     mode='overwrite')

This, along with the coalesce (which combined all the partitions down to a single one) were wrong, and fixed by changing to:

duplicates_df.write.save(full_uri,
                         format='com.databricks.spark.csv',
                         header='false',
                         mode='overwrite')

Exception: Python in worker has different version 2.6 than that in driver 2.7, PySpark cannot run with different minor versions

To get the code to work on my local Docker/Jupyter development environment, I set an environment variable as part of the Python code to specify the Python executable:

os.environ['PYSPARK_PYTHON'] = '/usr/bin/python2'

I removed this (along with all the PYSPARK_SUBMIT_ARGS) and the code then ran fine.

Timestamp woes

In my original pySpark code I was letting it infer the schema from the source, which included it determining (correctly) that one of the columns was a timestamp. When it wrote the resulting processed file, it wrote the timestamp in a standard format (YYYY-MM-DD HH24:MI:SS). Redshift (of which more in the next article) was quite happy to process this as a timestamp, because it was one.
Once I moved the pySpark code to EMR, the Spark engine moved from my local 1.6 version to 2.0.0 - and the behaviour of the CSV writer changed. Instead of the format before, it switched to writing the timestamp in epoch form, and not just that but microseconds since epoch. Whilst Redshift could cope with epoch seconds, or milliseconds, it doesn't support microseconds, and the load job failed

Invalid timestamp format or value [YYYY-MM-DD HH24:MI:SS]

and then

Fails: Epoch time copy out of acceptable range of [-62167219200000, 253402300799999]

Whilst I did RTFM, it turns out that I read the wrong FM, taking the latest (2.0.1) instead of the version that EMR was running (2.0.0). And whilst 2.0.1 includes support for specifying the output timestampFormat, 2.0.0 doesn't.

In the end I changed the Spark job to not infer the schema, and so treat the timestamp as a string, thus writing it out in the same format. This was a successful workaround here, but if I'd needed to do some timestamp-based processing in the Spark job I'd have had to find another option.

Success!

I now had the ETL job running on Spark on EMR, processing multiple files in turn. Timings were approximately five minutes to process five files, half a million rows in total.

One important point to bear in mind through all of this is that I've gone with default settings throughout, and not made any effort to optimise the PySpark code. At this stage, it's simply proving the end-to-end process.

Automating the ETL

Having seen that the Spark job would run successfully manually, I now went to automate it. It's actually very simple to do. When you launch an EMR cluster, or indeed even if it's running, you can add a Step, such as a Spark job. You can also configure EMR to terminate itself once the step is complete.

From the EMR cluster create screen, switch to Advanced. Here you can specify exactly which applications you want deployed - and what steps to run. Remember how we copied the Acme.py code to S3 earlier? Now's when it comes in handy! We simply point EMR at the S3 path and it will run that code for us - no need to do anything else. Once the code's finished executing, the EMR cluster will terminate itself.

After testing out this approach successfully, I took it one step further - command line invocation. AWS make this ridiculously easier, because from the home page of any EMR cluster (running or not) there is a button to click which gives you the full command to run to spin up another cluster with the exact same configuration

This gives us a command like this:

    aws emr create-cluster \
    --termination-protected \
    --applications Name=Hadoop Name=Spark Name=ZooKeeper \
    --tags 'owner=Robin Moffatt' \
    --ec2-attributes '{"KeyName":"Test-Environment","InstanceProfile":"EMR_EC2_DefaultRole","AvailabilityZone":"us-east-1b","EmrManagedSlaveSecurityGroup":"sg-1eccd074","EmrManagedMasterSecurityGroup":"sg-d7cdd1bd"}' \
    --service-role EMR_DefaultRole \
    --enable-debugging \
    --release-label emr-5.0.0 \
    --log-uri 's3n://aws-logs-xxxxxxxxxx-us-east-1/elasticmapreduce/' \
    --steps '[{"Args":["spark-submit","--deploy-mode","cluster","s3://foobar-bucket/code/Acme.py"],"Type":"CUSTOM_JAR","ActionOnFailure":"TERMINATE_CLUSTER","Jar":"command-runner.jar","Properties":"","Name":"Acme"}]' \
    --name 'Rittman Mead Acme PoC' \
    --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"m3.xlarge","Name":"Master instance group - 1"},{"InstanceCount":2,"InstanceGroupType":"CORE","InstanceType":"m3.xlarge","Name":"Core instance group - 2"}]' \
    --region us-east-1 \
    --auto-terminate

This spins up an EMR cluster, runs the Spark job and waits for it to complete, and then terminates the cluster. Logs written by the Spark job get copied to S3, so that even once the cluster has been shutdown, the logs can still be accessed. Seperation of compute from storage - it makes a lot of sense. What's the point having a bunch of idle CPUs sat around just so that I can view the logs at some point if I want to?

The next logical step for this automation would be the automatic invocation of above process based on the presence of a defined number of files in the S3 bucket. Tools such as Lambda, Data Pipeline, and Simple Workflow Service are all ones that can help with this, and the broader management of ETL and data processing on AWS.

Spot Pricing

You can save money further with AWS by using Spot Pricing for EMR requests. Spot Pricing is used on Amazon's EC2 platform (on which EMR runs) as a way of utilising spare capacity. Instead of paying a fixed (higher) rate for some server time, you instead 'bid' at a (lower) rate and when the demand for capacity drops such that the spot price does too and your bid price is met, you get your turn on the hardware. If the spot price goes up again - your server gets killed.

Why spot pricing makes sense on EMR particularly is that Hadoop is designed to be fault-tolerant across distributed nodes. Whilst pulling the plug on an old-school database may end in tears, dropping a node from a Hadoop cluster may simply mean a delay in the processing whilst the particular piece of (distributed) work is restarted on another node.

Summary

We've developed out simple ETL application, and got it running on Amazon's EMR platform. Whilst we used AWS because it's the client's platform of choice, in general there's no reason we couldn't take it and run it on another Hadoop platform. This could be a Hadoop platform such as Oracle's Big Data Cloud Service, Cloudera's CDH running on Oracle's Big Data Appliance, or simply a self-managed Hadoop cluster on commodity hardware.

Processing time was in the region of 30 minutes to process 2M rows across 30 files, and in a separate batch run 3.8 hours to process 283 files of around 25M rows in total.

So far, the data that we've processed is only sat in a S3 bucket up in the cloud.

In the next article we'll look at what the options are for actually analysing the data and running reports against it.

ETL Offload with Spark and Amazon EMR – Part 2 – Code development with Notebooks and Docker

In the previous article I gave the background to a project we did for a client, exploring the benefits of Spark-based ETL processing running on Amazon's Elastic Map Reduce (EMR) Hadoop platform. The proof of concept we ran was on a very simple requirement, taking inbound files from a third party, joining to them to some reference data, and then making the result available for analysis. The primary focus was proving the end-to-end concept, with future iterations focussing on performance and design optimisations.

Here we'll see how I went about building up the ETL process.

Processing Overview

The processing needed to iterate over a set of files in S3, and for each one:

  • Loads the file from S3
  • Determines region from filename, and adds as column to data
  • Deduplicates it
  • Writes duplicates to separate file
  • Loads sites reference data
  • Extracts domain from URL string
  • Joins facts with sites on domain
  • Writes resulting file to S3

Once the data is processed and landed back to S3, we can run analytics on it. See subsequent articles for discussion of Redshift vs in-place querying with tools such as Presto.

Ticking All The Cool-Kid Boxes - Spark AND Notebooks AND Docker!

Whilst others in Rittman Mead have done lots of work with Spark, I myself was new to it, and needed a sandpit in which I could flail around without causing any real trouble. Thus I set up a nice self-contained development environment on my local machine, using Docker to provision and host it, and Jupyter Notebooks as the interface.

Notebooks

In a world in which it seems that there are a dozen cool new tools released every day, Interactive Notebooks are for me one of the most significant of recent times for any developer. They originate in the world of data science, where taking the 'science' bit at its word, data processing and exploration is written in a self-documenting manner. It makes it possible to follow how and why code was written, what the output at each stage was -- and to run it yourself too. By breaking code into chunks it makes it much easier to develop as well, since you can rerun and perfect each piece before moving on.

Notebooks are portable, meaning that you can work with them in one system, and then upload them for others to learn from and even import to run on their own systems. I've shared a simplified version of the notebook that I developed for this project on gist here, and you can see an embedded version of it at the end of this article.

The two most common are Apache Zeppelin, and Jupyter Notebooks (previously known as iPython Notebooks). Jupyter is the one I've used previously, and stuck with again here. To read more about notebooks and see them in action, see my previous blog posts here and here.

Docker

Plenty's been written about Docker. In a nutshell, it is a way to provision and host a set of self-contained software. It takes the idea of a virtual machine (such as VMWare, or VirtualBox), but without having to install an OS, and then the software, and then configure it all yourself. You simply take a "Dockerfile" that someone has prepared, and run it. You can create copies, or throwaway and start again, from a single command. I ran Docker on my Mac through Kitematic, and natively on my home server.

There are prebuilt Docker configuration files for lots of software (including Oracle and OBIEE!), and I found one that includes Spark, PySpark, and Jupyter - perfect!

To launch it, you simply enter:

docker run -d -p 18888:8888 jupyter/all-spark-notebook

This downloads all the necessary Docker files etc - you don't need anything local first, except Docker.

I ran it with an additional flag, -v, configuring it to use a folder on my local machine to store the work that I created. By default all files reside within the Docker image itself - and get deleted when you delete the Docker instance.

docker run -d -p 18888:8888 -v /Users/rmoff/all-spark-notebook:/home/jovyan/work jupyter/all-spark-notebook

You can also run the container with an additional flag, GRANT_SUDO, so that the guest user can run sudo commands within it. To do this include -e GRANT_SUDO=yes --user root:

docker run -d -p 18888:8888 -e GRANT_SUDO=yes --user root -v /Users/rmoff/all-spark-notebook:/home/jovyan/work jupyter/all-spark-notebook

With the docker container running, you can access Jupyter notebooks on the port exposed in the command used to launch it (18888)

Getting Started with Jupyter

From Jupyter's main page you can see the files within the main folder (see above for how you can map this to a local folder on the machine hosting Docker). Using the New menu in the top-right you can create:

  • Folders and edit Text files
  • A terminal
  • A notebook, running under one of several different 'Kernels' (host interpreters and environments)

The ability to run a terminal from Jupyter is very handy - particularly on Docker. Docker by its nature isn't really designed for interaction within the container itself. It's the point of Docker in a way, that it provisions and configures all the software for you. You can use Docker to run a bash shell directly, but it's more fiddly than just using the Jupyer Terminal.

I used a Python 2 notebook for my work; with this Docker image you also have the option of Python 3, Scala, and R.

Developing the Spark job

With my development environment up and running, I set to writing the Spark job. Because I'm already familiar with Python I took advantage of PySpark. Below I describe the steps in the processing and how I achieved them.

Environment Preparation

Define AWS parameters:

access_key='XXXXXXXXXXXXXXXXXX  
secret='YYYYYYYYYYYYYYYYY'  
bucket_name='foobar-bucket'  

Set up the PySpark environment, including necessary JAR files for accessing S3 from Spark:

import os  
os.environ['AWS_ACCESS_KEY_ID'] = access_key  
os.environ['AWS_SECRET_ACCESS_KEY'] = secret  
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python2'  
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.hadoop:hadoop-aws:2.7.1,com.amazonaws:aws-java-sdk-pom:1.10.34,com.databricks:spark-csv_2.11:1.3.0 pyspark-shell'  

Create a spark context:

import pyspark  
sc = pyspark.SparkContext('local[*]')  
sqlContext = pyspark.SQLContext(sc)  

Import Python libraries

from pyspark.sql.functions import udf  
from pyspark.sql.functions import lit  
import boto  
from urlparse import urlsplit  

Note that to install python libraries not present on the Docker image (such as boto, which is used for accessing AWS functionality from within Python) you can run from a Jupyter Terminal:

/opt/conda/envs/python2/bin/pip install boto

On other platforms the path to pip will vary, but the install command is the same

Loading Data from S3

The source data comes from an S3 "bucket", on which there can be multiple "keys". Buckets and keys roughly translate to "disk drive" and "file".

We use the boto library to interact with S3 to find a list of files ('keys') matching the pattern of source files that we want to process.

Connect to the bucket

conn_s3 = boto.connect_s3()  
bucket = conn_s3.get_bucket(bucket_name)  

Iterate over the bucket contents

This bit would drive iterative processing over multiple input files; for now it just picks the last file on the list (acme_file getting set on each iteration and so remaining set after the loop)

contents=bucket.list(prefix='source_files/')  
for f in contents:  
    print f.name
    print f.size
    acme_file = f.name
print "\n\n--\nFile to process: %s" % acme_file  

Read the CSV from S3 into Spark dataframe

The Docker image I was using was running Spark 1.6, so I was using the Databricks CSV reader; in Spark 2 this is now available natively. The CSV file is loaded into a Spark data frame. Note that Spark is reading the CSV file directly from a S3 path.

full_uri = "s3n://{}/{}".format(bucket_name, acme_file)  
print full_uri  
s3n://foobar-bucket/source_files/acme_GB_20160803_100000.csv
acme_df = sqlContext.read.load(full_uri,  
                                  format='com.databricks.spark.csv',
                                  header='true',
                                  inferSchema='true')

acme_df.printSchema()  
root
 |-- product: string (nullable = true)
 |-- product_desc: string (nullable = true)
 |-- product_type: string (nullable = true)
 |-- supplier: string (nullable = true)
 |-- date_launched: timestamp (nullable = true)
 |-- position: string (nullable = true)
 |-- url: string (nullable = true)
 |-- status: string (nullable = true)
 |-- reject_reason: string (nullable = true)

The above shows the schema of the dataframe; Spark has infered this automagically from the column headers (for the column names), and then the data types within (note that it has correctly detected a timestamp in the date_launched column)

Add country column to data frame

The filename of the source data includes a country field as part of it. Here we use this regular expression to extract it:

filename=os.path.split(acme_file)[1]  
import re  
m=re.search('acme_([^_]+)_.+$', filename)  
if m is None:  
    country='NA'
else:  
    country=m.group(1)

print "Country determined from filename '%s' as : %s" % (filename,country)  
Country determined from filename 'acme_GB_20160803_100000.csv' as : GB

With the country stored in a variable, we add it as a column to the data frame:

Note that the withColumn function requires a Column value, which we create here using the PySpark lit function that was imported earlier on.

acme_df=acme_df.withColumn('country',lit(country))

acme_df.printSchema()  
root
 |-- product: string (nullable = true)
 |-- product_desc: string (nullable = true)
 |-- product_type: string (nullable = true)
 |-- supplier: string (nullable = true)
 |-- date_launched: timestamp (nullable = true)
 |-- position: string (nullable = true)
 |-- url: string (nullable = true)
 |-- status: string (nullable = true)
 |-- reject_reason: string (nullable = true)
 |-- country: string (nullable = false)

Note the new column added to the end of the schema.

Deduplication

Now that we've imported the file, we need to deduplicate it to remove entries with the same value for the url field. Here I'm created a second dataframe based on a deduplication of the first, using the PySpark native function dropDuplicates:

acme_deduped_df = acme_df.dropDuplicates(['url'])  

For informational purposes we can see how many records are in the two dataframes, and determine how many duplicates there were:

orig_count = acme_df.count()  
deduped_count = acme_deduped_df.count()  
print "Original count: %d\nDeduplicated count: %d\n\n" % (orig_count,deduped_count)  
print "Number of removed duplicate records: %d" % (orig_count - deduped_count)  
Original count: 97974
Deduplicated count: 96706


Number of removed duplicate records: 1268

Deriving Domain from URL

One of the sets of reference data is information about the site on which the product was viewed. To bring these sets of attributes into the main dataset we join on the domain itself. To perform this join we need to derive the domain from the URL. We can do this using the python urlsplit library, as seen in this example:

sample_url = 'https://www.rittmanmead.com/blog/2016/08/using-apache-drill-with-obiee-12c/'

print sample_url  
print urlsplit(sample_url).netloc  
https://www.rittmanmead.com/blog/2016/08/using-apache-drill-with-obiee-12c/
www.rittmanmead.com

We saw above that to add a column to the dataframe the withColumn function can be used. However, to add a column that's based on another (rather than a literal, which is what the country column added above was) we need to use the udf function. This generates the necessary Column field based on the urlsplit output for the associated url value.

First we define our own function which simply applies urlsplit to the value passed to it

def getDomain(value):  
    return urlsplit(value).netloc

and then a UDF based on it:

udfgetDomain = udf(getDomain)  

Finally, apply this to a third version of the dataframe:

acme_deduped_df_with_netloc = acme_deduped_df.withColumn("netloc", udfgetDomain(acme_deduped_df.url))  

Joining to Second Dataset

Having preparing the primary dataset, we'll now join it to the reference data. The source of this is currently an Oracle database. For simplicity we're working with a CSV dump of the data, but PySpark supports the option to connect to sources with JDBC so we could query it directly if required.

First we import the sites reference data CSV:

sites_file = "s3n://{}/{}".format('foobar-bucket', 'sites.csv')  
sites_df = sqlContext.read.load(sites_file,  
                                  format='com.databricks.spark.csv',
                                  header='true',
                                  inferSchema='true')

Then some light data cleansing with the filter function to remove blank SITE entries, and blank SITE_RETAIL_TYPE entries:

sites_pruned_df = sites_df.filter("NOT (SITE ='' OR SITE_RETAIL_TYPE = '')")  

Now we can do the join itself. This joins the original dataset (acme_deduped_df_with_netloc) with the sites reference data (sites_pruned_df), using a left outer join.

merged_df = acme_deduped_df_with_netloc.join(sites_pruned_df,acme_deduped_df_with_netloc.netloc == sites_pruned_df.SITE, 'left_outer')  

Using the filter function and show we can inspect the dataset for matches, and misses:

First 10 matched:

merged_df.filter(merged_df.ID.isNotNull()).select('date_launched','url','netloc','ID','SITE','SITE_RETAIL_TYPE').show(10)  

First 10 unmatched:

merged_df.filter(merged_df.ID.isNull()).select('date_launched','url','netloc','ID','SITE','SITE_RETAIL_TYPE').show(10)  

Write Back to S3

The finished dataset is written back to S3. As before, we're using the databricks CSV writer here but in Spark 2 would be doing it natively:

acme_enriched_filename='acme_enriched/%s' % filename.replace('.csv','')  
full_uri = "s3n://{}/{}".format(bucket_name, acme_enriched_filename)  
print 'Writing enriched acme data to %s' % full_uri  
merged_df.write.save(path=full_uri,  
                     format='com.databricks.spark.csv',
                     header='false',
                     nullValue='null',
                     mode='overwrite')

Summary

With the above code written, I could process input files in a couple of minutes per 30MB file. Bear in mind two important constraints to this performance:

  1. I was working with data residing up in the Amazon Cloud, with the associated network delay in transferring to and from it

  2. The processing was taking place on a single node Spark deployment (on my laptop, under virtualisation), rather than the multiple-node configuration typically seen.

The next steps, as we'll see in the next article, were to port this code up to Amazon Elastic Map Reduce (EMR). Stay tuned!

Footnote: Notebooks FTW!

(FTW)

Whilst I've taken the code and written it out above more in the form of a blog post, I could have actually just posted the Notebook itself, and it wouldn't have needed much more explanation. Here it is, along with some bonus bits on using S3 from python:

ETL Offload with Spark and Amazon EMR – Part 1

We recently undertook a two-week Proof of Concept exercise for a client, evaluating whether their existing ETL processing could be done faster and more cheaply using Spark. They were also interested in whether something like Redshift would provide a suitable data warehouse platform for them. In this series of blog articles I will look at how we did this, and what we found.

Background

The client has an existing analytics architecture based primarily around Oracle database, Oracle Data Integrator (ODI), Oracle GoldenGate, and Oracle Business Intelligence Enterprise Edition (OBIEE), all running on Amazon EC2. The larger architecture in the organisation is all AWS based too.

Existing ETL processing for the system in question is done using ODI, loading data daily into a partitioned Oracle table, with OBIEE providing the reporting interface.

There were two aspects to the investigation that we did:

  • Primarily, what would an alternative platform for the ETL look like? With lots of coverage recently of the concept of "ETL offloading" and "Apache-based ETL", the client was keen to understand how they might take advantage of this

    Within this, key considerations were:

    • Cost
    • Scalability
    • Maintenance
    • Fit with existing and future architectures
  • The second aspect was to investigate whether the performance of the existing reporting could be improved. Despite having data for multiple years in Oracle, queries were too slow to provide information other than within a period of a few days.

Oracle licenses were a sensitive point for the client, who were keen to reduce - or at least, avoid increased - costs. ODI for Big Data requires additional licence, and so was not in scope for the initial investigation.

Data and Processing

The client uses their data to report on the level of requests for different products, including questions such as:

  • How many requests were there per day?
  • How many requests per product type in a given period?
  • For a given product, how many requests were there, from which country?

Data volumes were approximately 50MB, arriving in batch files every hour. Reporting requirements were previous day and before only. Being able to see data intra-day would be a bonus but was not a requirement.

High Level Approach

Since the client already uses Amazon Web Services (AWS) for all its infrastructure, it made sense to remain in the AWS realm for the first round of investigation. We broke the overall requirement down into pieces, so as to understand (a) the most appropriate tool at each point and (b) the toolset with best overall fit. A very useful reference for an Amazon-based big data design is the presentation Big Data Architectural Patterns and Best Practices on AWS. Even if you're not running on AWS, the presentation has some useful pointers for things like where to be storing your data based on volumes, frequency of access, etc.

Data Ingest

The starting point for the data was Amazon's storage service - S3, in which the data files in CSV format are landed by an external process every hour.

Processing (Compute)

Currently the processing is done by loading the external data into a partitioned Oracle table, and resolving dimension joins and de-duplication at query time.

Taking away any assumptions, other than a focus on 'new' technologies (and a bias towards AWS where appropriate), we considered:

  • Switch out Oracle for Redshift, and resolve the joins and de-duplication there
    • Loading the data to Redshift would be easy, but would be switching one RDBMS-based solution for another. Part of the aim of the exercise was to review a broader solution landscape than this.
  • Use Hadoop-based processing, running on Elastic Map Reduce (EMR):

    • Hive QL to process the data on S3 (or HDFS on EMR)
      • Not investigated, because provides none of the error handling etc that Spark would, and Spark has SparkSQL for any work that needs doing in SQL.
    • Pig
      • Still used, but 'old' technology, somewhat esoteric language, and superseded by Spark
    • Spark
      • Support for several languages including commonly-used ones such as Python
      • Gaining increasing levels of adoption in the industry
      • Opens up rich eco-system of processing possibilities with related projects such as Machine Learning, and Graph.

We opted to use Spark to process the files, joining them to the reference data, and carrying out de-duplication. For a great background and discussion on Spark and its current place in data architectures, have a listen to this podcast.

Storage

The output from Spark was written back to S3.

Analytics

With the processed data in S3, we evaluated two options here:

  • Load it to Redshift for query
  • Query in-place with a SQL-on-Hadoop engine such as Presto or Impala
    • With the data at rest on S3, Amazon's Athena is also of interest here, but was released after we carried out this particular investigation.

The presumption was that OBIEE would continue to provide the front-end to the analytics. Oracle's Data Visualization Desktop tool was also of interest.

In the next post we'll see the development environment that we used for prototyping. Stay tuned!

Source Control and Automated Code Deployment Options for OBIEE

It's Monday morning. I've arrived at a customer site to help them - ironically enough - with automating their OBIEE code management. But, on arrival, I'm told that the OBIEE team can't meet with me because someone did a release on the previous Friday, had now gone on holiday - and the wrong code was released but they didn't know which version. All hands-on-deck, panic-stations!

This actually happened to me, and in recent months too. In this kind of situation hindsight gives us 20:20 vision, and of course there shouldn't be a single point of failure, of course code should be under version control, of course it should be automated to reduce the risk of problems during deployments. But in practice, these things often don't get done - and it's understandable why. In the very early days of a project, it will be a manual process because that's what is necessary as people get used to the tools and technology. As time goes by, project deadlines come up, and tasks like this are seen as "zero sum" - sure we can automate it, but we can also continue doing it manually and things will still get done, code will still get released. After a while, it's just accepted as how things are done. In effect, it is technical debt - and this is your reminder that debt has to be paid, sooner or later :)

I'll not pretend that managing OBIEE code in source control, and automating code deployments, is straightforward. But, it is necessary, so in this post I'll walk through why you should be doing it, and then importantly how.

Why Source Control?

Do we really need source control for OBIEE? After all, what's wrong with the tried-and-tested method of sticking it all in a folder like this?

sdlc01.png

What's wrong with this? What's right with this? Oh lack of source control, let me count the number of ways that I doth hate thee:

  1. No audit trail of who changed something
  2. No audit of what was changed, and when
  3. No enforceable naming standards for versions
  4. No secure way of identifying deployment candidates
  5. No distributed method for sharing code (don't tell me that a network share counts!)
  6. No way of reliably identifying the latest version of code

These range from the immediately practical through to the slightly more abstract but necessary in a mature deployment.

Of immediate impact is the simply ability to identify the latest version of code on which to make new changes. Download the copy from the live server? Really? No. If you're tracking your versions accurately and reliably then you simply pull the latest version of code from there, in the knowledge that it is the version that is live. No monkeying around trying to figure out if it really is (just because it's called "PROD-091216.rpd" how do you know that's actually what got released to Production? And was that on 12th December or 9th September? Who knows!).

Longer term, having a secure and auditable code line simply makes it easier and less risky to manage. It also gives you the ability to work with it in a much more flexible manner, such as genuine concurrent development by multiple developers against the RPD. You can read more about this in my presentation here.

Which Source Control?

I don't care. Not really. So long as you are using source control, I am happy.

For preference, I always advocate using git. It is a modern platform, with strong support from many desktop clients (SourceTree is my favourite, along with the commandline too, natch). Git is decentralised, meaning that you can commit and branch code locally on your own machine without having to be connected to a server. It supports a powerful fork and pull process too, which is part of the reason it has almost universal usage within the open source world. The most well known of git platforms is github, which in effect provides git as a Platform-as-a-service (PaaS), in a similar fashion to Bitbucket too. You can also run git on its own locally, or more pragmatically, with gitlab.

But if you're using Subversion (SVN), Perforce, or whatever - that's fine. The key thing is that you understand how to use it, and that it is supported within your organisation. For simple source control, pretty much all the common platforms work just fine. If you get onto more advanced use, such as feature-branches and concurrent development, you may find it worth ensuring that your chosen platform supports the workflow that you adopt. Even then, whilst I'd chose git for preference, at Rittman Mead we've helped clients develop very powerful concurrent development processes with Subversion providing the underlying source control.

What Goes into Source Control? Part 1

So you've drunk the Source Control koolaid, and accepted that really there is no excuse not to use it. So what do you put into it? The RPD? The OBIEE 12c BAR file? What if you're still on OBIEE 11g? The answer here depends partially on how you are planning to manage code deployment in your environment. For a fully automated solution, you may opt to store code in a more granular fashion than if you are simply migrating full BAR files each time. So, read on to understand about code deployment, and then we'll revisit this question again after that.

How Do You Deploy Code Changes in OBIEE?

The core code artefacts are the same between OBIEE 11g and OBIEE 12c, so I'll cover both in this article, pointing out as we go any differences.

The biggest difference with OBIEE 12c is the concept of the "Service Instance", in which the pieces for the "analytical application" are clearly defined and made portable. These components are:

  • Metadata model (RPD)
  • Presentation Catalog ("WebCat"), holding all analysis and dashboard definitions
  • Security - Application Roles and Policy grants, as well as OBIEE front-end privilege grants

Part of this is laying the foundations for what has been termed "Pluggable BI", in which 'applications' can be deployed with customisations layered on top of them. In the current (December 2016) version of OBIEE 12c we have just the Single Service Instance (ssi). Service Instances can be exported and imported to BI Archive files, known as BAR files.

The documentation for OBIEE environment migrations (known as "T2P" - Test to Production) in 12c is here. Hopefully I won't be thought too rude for saying that there is scope for expanding on it, clarifying a few points - and perhaps making more of the somewhat innocuous remark partway down the page:

PROD Service Instance metadata will be replaced with TEST metadata.

Hands up who reads the manual fully before using a product? Hands up who is going to get a shock when they destroy their Production presentation catalog after importing a service instance?...

Let's take walk through the three main code artefacts, and how to manage each one, starting with the RPD.

The RPD

The complication of deployments of the RPD is that the RPD differs between environments because of different connection pool details, and occassionally repository variable values too.

If you are not changing connection pool passwords between environments, or if you are changing anything else in your RPD (e.g. making actual model changes) between environments, then you probably shouldn't be. It's a security risk to not have different passwords, and it's bad software development practice to make code changes other than in your development environment. Perhaps you've valid reasons for doing it... perhaps not. But bear in mind that many test processes and validations are based on the premise that code will not change after being moved out of dev.

With OBIEE 12c, there are two options for managing deployment of the RPD:

  1. BAR file deploy and then connection pool update
  2. Offline RPD patch with connection pool updates, and then deploy
    • This approach is valid for OBIEE 11g too

RPD Deployment in OBIEE 12c - Option 1

This is based on the service instance / BAR concept. It is therefore only valid for OBIEE 12c.

  1. One-off setup : Using listconnectionpool to create a JSON connection pool configuration file per target environment. Store each of these files in source control.
  2. Once code is ready for promotion from Development, run exportServiceInstance to create a BAR file. Commit this BAR file to source control

    /app/oracle/biee/oracle_common/common/bin/wlst.sh <<EOF
    exportServiceInstance('/app/oracle/biee/user_projects/domains/bi/','ssi','/home/oracle','/home/oracle')
    EOF
    

  3. To deploy the updated code to the target environment:

    1. Checkout the BAR from source control
    2. Deploy it with importServiceInstance, ensuring that the importRPD flag is set.

      /app/oracle/biee/oracle_common/common/bin/wlst.sh <<EOF
      importServiceInstance('/app/oracle/biee/user_projects/domains/bi','ssi','/home/oracle/ssi.bar',true,false,false)
      EOF
      
    3. Run updateConnectionPool using the configuration file from source control for the target environment to set the connection pool credentials

      /app/oracle/biee/user_projects/domains/bi/bitools/bin/datamodel.sh updateconnectionpool -C ~/prod_cp.json -U weblogic -P Admin123 -SI ssi
      

      Note that your OBIEE system will not be able to connect to source databases to retrieve data until you update the connection pools.

    4. The BI Server should pick up the new RPD after a few minutes. You can force this by restarting the BI Server, or using "Reload Metadata" from OBIEE front end.

Whilst you can also create the BAR file with includeCredentials, you wouldn't use this for migration of code between environments - because you don't have the same connection pool database passwords in each environment. If you do have the same passwords then change it now - this is a big security risk.

The above BAR approach works fine, but be aware that if the deployed RPD is activated on the BI Server before you have updated the connection pools (step 3 above) then the BI Server will not be able to connect to the data sources and your end users will see an error. This approach is also based on storing the BAR file as whole in source control, when for preference we'd store the RPD as a standalone binary if we want to be able to do concurrent development with it.

RPD Deployment in OBIEE 12c - Option 2 (also valid for OBIEE 11g)

This approach takes the RPD on its own, and takes advantage of OBIEE's patching capabilities to prepare RPDs for the target environment prior to deployment.

  1. One-off setup: create a XUDML patch file for each target environment.

    Do this by:

    1. Take your development RPD (e.g. "DEV.rpd"), and clone it (e.g. "PROD.rpd")
    2. Open the cloned RPD (e.g. "PROD.rpd") offline in the Administration Tool. Update it only for the target environment - nothing else. This should be all connection pool passwords, and could also include connection pool DSNs and/or users, depending on how your data sources are configured. Save the RPD.
    3. Using comparerpd, create a XUDML patch file for your target environment:

      /app/oracle/biee/user_projects/domains/bi/bitools/bin/comparerpd.sh \
      -P Admin123 \
      -W Admin123 \
      -G ~/DEV.rpd \
      -C ~/PROD.rpd \
      -D ~/prod_cp.xudml
      
    4. Repeat the above process for each target environment

  2. Once code is ready for promotion from Development:

    1. Extract the RPD

      • In OBIEE 12c use downloadrpd to obtain the RPD file

        /app/oracle/biee/user_projects/domains/bi/bitools/bin/datamodel.sh \
        downloadrpd \
        -O /home/oracle/obiee.rpd \
        -W Admin123 \
        -U weblogic \
        -P Admin123 \
        -SI ssi
        
      • In OBIEE 11g copy the file from the server filesystem

    2. Commit the RPD to source control

  3. To deploy the updated code to the target environment:

    1. Checkout the RPD from source control
    2. Prepare it for the target environment by applying the patch created above

      1. Check out the XUDML patch file for the appropriate environment from source control
      2. Apply the patch file using biserverxmlexec:

        /app/oracle/biee/user_projects/domains/bi/bitools/bin/biserverxmlexec.sh \
        -P Admin123 \
        -S Admin123 \
        -I prod_cp.xudml \
        -B obiee.rpd \
        -O /tmp/prod.rpd
        
    3. Deploy the patched RPD file

      • In OBIEE 12c use uploadrpd

        /app/oracle/biee/user_projects/domains/bi/bitools/bin/datamodel.sh \
        uploadrpd \
        -I /tmp/prod.rpd \
        -W Admin123 \
        -U weblogic \
        -P Admin123 \
        -SI ssi \
        -D
        

        The RPD is available straightaway. No BI Server restart is needed.

      • In OBIEE 11g use WLST's uploadRepository to programatically do this, or manually from EM.

        After deploying the RPD in OBIEE 11g, you need to restart the BI Server.

This approach is the best (only) option for OBIEE 11g. For OBIEE 12c I also prefer it as it is 'lighter' than a full BAR, more solid in terms of connection pools (since they're set prior to deployment, not after), and it enables greater flexibility in terms of RPD changes during migration since any RPD change can be encompassed in the patch file.

Note that the OBIEE 12c product manual states that uploadrpd/downloadrpd are for:

"...repository diagnostic and development purposes such as testing, only ... all other repository development and maintenance situations, you should use BAR to utilize BAR's repository upgrade and patching capabilities and benefits.".

Maybe in the future the BAR capabilites will extend beyond what they currently do - but as of now, I've yet to see a definitive reason to use them and not uploadrpd/downloadrpd.

The Presentation Catalog ("WebCat")

The Presentation Catalog stores the definition of all analyses and dashboards in OBIEE, along with supporting objects including Filters, Conditions, and Agents. It differs significantly from the RPD when it comes to environment migrations. The RPD can be seen in more traditional software development lifecycle terms, sine it is built and developed in Development, and when deployed in subsequent environment overwrites in entirety what is currently there. However, the Presentation Catalog is not so simple.

Commonly, content in the Presentation Catalog is created by developers as part of 'pre-canned' reporting and dashboard packs, to be released along with the RPD to end-users. Where things get difficult is that the Presentation Catalog is also written to in Production. This can include:

  • User-developed content saved in one (or both) of:
    • My Folders
    • Shared, e.g. special subfolders per department for sharing common reports outside of "gold standard" ones
  • User's profile data, including timezone and language settings, saved dashboard customisations, preferred delivery devices, and more
  • System configuration data, such as default formatting for specific columns, bookmarks, etc

In your environment you maybe don't permit some of these (for example, disabling access to My Folders is not uncommon). But almost certainly, you'll want your users to be able to persist their environment settings between sessions.

The impact of this is that the Presentation Catalog becomes complex to manage. We can't just overwrite the whole catalog when we come to deployment in Production, because if we do so all of the above listed content will get deleted. And that won't make us popular with users, at all.

So how do we bring any kind of mature software development practice to the Presentation Catalog, assuming that we have report development being done in non-Production environments?

We have two possible approaches:

  1. Deploy the full catalog into Production each time, but backup first existing content that we don't want to lose, and restore it after the deploy
    • Fiddly, but means that we don't have to worry about which bits of the catalog go in source control - all of it does. This has consequences for if we want to do branch-based development with source control, in that we can't. This is because the catalog will exist as a single binary (whether BAR or 7ZIP), so there'll be no merging with the source control tool possible.
    • Risky, if we forget to backup the user content first, or something goes wrong in the process
    • A 'heavier' operation involving the whole catalog and therefore almost certainly requiring the catalog to be in maintenance-mode (read only).
  2. Deploy the whole catalog once, and then do all subsequent deploys as deltas (i.e. only what has changed in the source environment)
    • Less risky, since not overwriting whole target environment catalog
    • More flexible, and more granular so easier to track in source control (and optionally do branch-based development).
    • Requires more complex automated deployment process.

Both methods can be used with OBIEE 11g and 12c.

Presentation Catalog Migration in OBIEE - Option 1

In this option, the entire Catalog is deployed, but content that we want to retain backed up first, and then re-instated after the full catalog deploy.

First we take the entire catalog from the source environment and store it in source control. With OBIEE 12c this is done using the exportServiceInstance WLST command (see the example with the RPD above) to create a BAR file. With OBIEE 11g, you would create an archive of the catalog at its root using 7-zip/tar/gzip (but not winzip).

When ready to deploy to the target environment, we first backup the folders that we want to preserve. Which folders might we want to preserve?

  1. /users - this holds both objects that users have created and saved in My Folders, as well as user profile information (including timezone preferences, delivery profiles, dashboard customisations, and more)
  2. /system - this hold system internal settings, which include things such as authorisations for the OBIEE front end (/system/privs), as well as column formatting defaults (/system/metadata), global variables (/system/globalvariables), and bookmarks (/system/bookmarks).
    • See note below regarding the /system/privs folder
  3. /shared/<…>/<…> - if users are permitted to create content directly in the Shared area of the catalog you will want to preserve this. A valid use of this is for teams to share content developed internally, instead of (or prior to) it being released to the wider user community through a more formal process (the latter being often called 'gold standard' reports).

Regardless of whether we are using OBIEE 11g or 12c we create a backup of the folders identified by using the Archive functionality of OBIEE. This is NOT just creating a .zip file of the file system folders - which is completely unsupported and a bad idea for catalog management, except at the very root level. Instead, the Archive functionality creates a .catalog file which can be stored in source control, and unarchived back into OBIEE to restore content.

You can create OBIEE catalog archives in one of four ways, which are also valid for importing the content back into OBIEE too:

  1. Manually, through OBIEE front-end
  2. Manually, through Catalog Manager GUI
  3. Automatically, through Catalog Manager CLI (runcat.sh)

    • Archive:

      runcat.sh \
      -cmd archive  \
      -online http://demo.us.oracle.com:7780/analytics/saw.dll \
      -credentials /tmp/creds.txt \
      -folder "/shared/HR" \
      -outputFile /home/oracle/hr.catalog
      
    • Unarchive:

      runcat.sh \
      -cmd unarchive \
      -inputFile hr.catalog \
      -folder /shared \
      -online http://demo.us.oracle.com:7780/analytics/saw.dll  \
      -credentials /tmp/creds.txt \
      -overwrite all
      
  4. Automatically, using the WebCatalogService API (copyItem2 / pasteItem2).

Having taken a copy of the necessary folders, we then deploy the entire catalog (with the changes from the development in) taken from source control. Deployment is done in OBIEE 12c using importServiceInstance. In OBIEE 11g it's done by taking the server offline, and replacing the catalog with the filesystem archive to 7zip of the entire catalog.

Finally, we then restore the folders previously saved, using the Unarchive function to import the .catalog files:

Presentation Catalog Migration in OBIEE - Option 2

In this option we take a more granular approach to catalog migration. The entire catalog from development is only deployed once, and after that only .catalog files from development are put into source control and then deployed to the target environment.

As before, the entire catalog is initially taken from the development environment, and stored in source control. With OBIEE 12c this is done using the exportServiceInstance WLST command (see the example with the RPD above) to create a BAR file. With OBIEE 11g, you would create an archive of the catalog at its root using 7zip.

Note that this is only done once, as the initial 'baseline'.

The first time an environment is commissioned, the baseline is used to populate the catalog, using the same process as in option 1 above (in 12c, importServiceInstance/ in 11g unzip of full catalog filesystem copy).

After this, any work that is done in the catalog in the development environment is migrated through by using OBIEE's archive function against just the necessary /shared subfolder to a .catalog file, storing this in source control

This is then imported to target environment with unarchive capability. See above in option 1 for details of using archive/unarchive - just remember that this is archiving with OBIEE, not using 7zip!

You will need to determine at what level you take this folder: -

  • If you archive the whole of /shared each time you'll never be able to do branch-based development with the catalog in which you want to merge branches (because the .catalog file is binary).
  • If you instead work at, say, department level (/shared/HR, /shared/sales, etc) then the highest grain for concurrent catalog development would be the department. The lower down the tree you go the greater the scope for independent concurrent development, but the greater the complexity to manage. This is because you want to be automating the unarchival of these .catalog files to the target environment, so having to deal with multiple levels of folder hierarchy gets hard work.

It's a trade off between the number of developers, breadth of development scope, and how simple you want to make the release process.

The benefit of this approach is that content created in Production remains completely untouched. Users can continue to create their content, save their profile settings, and so on.

Presentation Catalog Migration - OBIEE Privilege Grants

Permissions set in the OBIEE front end are stored in the Presentation Catalog's /system/privs folder.

Therefore, how this folder is treated during migration dictates where you must apply your security grants (or conversely, where you set your security grants dictates how you should treat the folder in migrations). For me the "correct" approach would be to define the full set of privileges in the development environment and the migrate these through along with pre-built objects in /shared through to Production. If you have a less formal approach to environments, or for whatever reason permissions are granted directly in Production, you will need to ensure that the /system/privs folder isn't overwritten during catalog deployments.

When you create a BAR file in OBIEE 12c, it does include /system/privs (and /system/metadata). Therefore, if you are happy for these to be overwritten from the source environment, you would not need to backup/restore these folders. If you set includeCatalogRuntimeInfo in the OBIEE 12c export to BAR, it will also include the complete /system folder as well as /users.

Agents

Regardless of how you move Catalog content between environments, if you have Agents you need to look after them too. When you move Agents between environment, they are not automatically registered with the BI Scheduler in the target environment. You either have to do this manually, or with the web service API : WebCatalogService.readObjects to get the XML for the agent, and then submit it to iBotService.writeIBot which will register it with the BI Scheduler.

Security

  • In terms of the Policy store (Application Roles and Policy grants), these are managed by the Security element of the BAR and migration through the environments is simple. You can deploy the policy store alone in OBIEE 12c using the importJazn flag of importServiceInstance. In OBIEE 11g it's not so simple - you have to use the migrateSecurityStore WLST command.
  • Data/Object security defined in the RPD gets migrated automatically through the RPD, by definition
  • See above for a discussion of OBIEE front-end privilege grants.

What Goes into Source Control? Part 2

So, suddenly this question looks a bit less simple than when orginally posed at the beginning of this article. In essence, you need to store:

  1. RPD
    1. BAR + JSON configuration for each environment's connection pools -- 12c only, simpler, but less flexible and won't support concurrent development easily
    2. RPD (.rpd) + XUDML patch file for each environment's connection pools -- works in 11g too, supports concurrent development
  2. Presentation Catalog
    1. Entire catalog (BAR in 12c / 7zip in 11g) -- simpler, but impossible to manage branch-based concurrent development
    2. Catalog baseline (BAR in 12c / 7zip in 11g) plus delta .catalog files -- More complex, but more flexible, and support concurrent development
  3. Security
    1. BAR file (OBIEE 12c)
    2. system-jazn-data.xml (OBIEE 11g)
  4. Any other files that are changed for your deployment.

    It's important that when you provision a new environment you can set it up the same as the others. It is also invaluable to have previous versions of these files so as to be able to rollback changes if needed, and to track what settings have changed over time.

    This could include:

    • Configuration files (nqsconfig.ini, instanceconfig.xml, tnsnames.ora, etc)
    • Custom skins & styles
    • writeback templates
    • etc

Summary

I never said it was simple ;-)

OBIEE is an extremely powerful product, and just as you have to take care to build your data models correctly, you also need to take care to understand why and how to manage your code correctly. What I've tried to do here is pull together the different options available, and lay them out with their respectively pros and cons. Let me know in the comments below what you think and how you manage OBIEE code at your site.

One of the key messages that it's important to get across is this: there are varying degrees of complexity with which you can embrace source control. All are valid, and in fact an incremental adoption of them rather than big-bang can sometimes be a better idea:

  • At one end of the scale, you simply use source control to hold copies of all your code, and continue to deploy manually
  • Getting a bit smarter, automating code deployments from source control. Code development is still done serially though.
  • At the other end of the scale, you use source control with branch-based feature-driven concurrent development. Completed features are merged automatically with RPD conflicts managed by the OBIEE tooling from the command line. Testing and deployment are both automated.

If you'd like assistance with your OBIEE development and deployment practices, including fully automated source-control driven concurrent development management, please get in touch with us here at Rittman Mead. We would be delighted to use our extensive experience in this field to produce a flexible and customised process for your particular environment and requirements.


You can find the companion slide deck to this article, with further discussion on concurrent development, here.

The Visual Plugin Pack for OBIEE

Last week we announced the Rittman Mead Open Source project, and released into open source:

  • the excellent Insights project, a javascript API/framework for building a new frontend for OBIEE, written by Minesh Patel
  • Enhanced usage tracking for OBIEE, to track click-by-click how your users interact with the application

Today it is the turn of the Visual Plugin Pack.....

What is the Visual Plugin Pack for OBIEE ?

Visual Plugin Pack (VPP) is a means by which users of OBIEE Answers can use custom JavaScript visualisations without having to write any javascript!

It is a framework that enables developers to build Javascript visualisation plugins, that report builders can then utilise and configure through native OBIEE user interface.

I want to point this out from the very start, that despite its name, the Visual Plugin Pack is not a pack of all-singing, all-dancing, super-duper visualisations for OBIEE.

Instead, VPP should be thought of as a framework that allows you to quickly develop and integrate all-singing, all-dancing, super-duper visualisations that will work natively within OBIEE.

Subtle difference, but an important one.

So what does it do ?

Essentially, VPP allows you to accelerate the development and deployment of custom, configurable and reusable OBIEE JavaScript visualisations.

Back in 2013 I wrote this post describing how to embed a D3 Visualisation within OBIEE. The same method will still work today, but it's a cumbersome process and requires heavy use of the narrative form, which let's be honest, is a painful experience when it comes to JavaScript development.

Some drawbacks with this method:

  • Code editing in the Narrative view is not productive.
  • Reusing visualisations in other analyses requires the copying and pasting of code.
  • Basic Visualisation configuration changes, for example, width, height, colours, font etc requires code changes.
  • Remapping Column bindings requires code changes.
  • JavaScript library dependencies and load order can be tricky to manage.

The Visual Plugin Pack attempts to address these issues by abstracting away the complexities of the Narrative form and allowing developers to focus on visualisation code, not OBIEE integration code.
If you choose to use VPP for your visualisations then you will never have to touch the narrative form, all visualisation development can take place outside of OBIEE in your favourite code editor and deployed to Weblogic when you are done.

VPP also allows you to define design-time controls that affect column bindings and visualisation behaviour. The example visualisation below has been written to accept 5 column bindings and 1 configuration component, which controls the visualisation size. You can create as many column bindings and configuration components as you need

scatterplot matrix

How do I get started ?

You can download or fork the repository from here.

Installation and developer guides can be found on the wiki:-

There are several visualisations that come bundled with VPP, some more polished than others, but they should serve as good examples that can be enhanced further.

Summary

If you've got some in-house JavaScript skills and are looking to develop and integrate custom visualisations into OBIEE, then VPP can help alleviate a lot of the frustrations associated with the traditional method. Once you're up and running you'll be able to develop faster, integrate quickly and share your visualisations with all OBIEE report writers.

If you'd like to discuss how Rittman Mead can help with deployment or assist with custom visualisation development feel free to contact us.