Category Archives: Rittman Mead

Using rlwrap with Apache Hive beeline for improved readline functionality

rlwrap is a nice little wrapper in which you can invoke commandline utilities and get them to behave with full readline functionality just like you’d get at the bash prompt. For example, up/down arrow keys to move between commands, but also home/end to go to the start/finish of a line, and even ctrl-R to search through command history to rapidly find a command. It’s one of the standard config changes I’ll make to any system with Oracle’s sqlplus on it, and it works just as nicely with Apache Hive’s commandline interface, beeline.
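For sqlplus that change usually amounts to nothing more than an alias along these lines (a sketch; put it wherever you keep your shell aliases):

# invoke sqlplus through rlwrap to pick up bash-style line editing and history
alias sqlplus='rlwrap sqlplus'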

beeline comes with some of this functionality (up/down arrow) but not all; for me, it was ‘home’ and ‘end’ not working (printing 1~ and 5~ respectively instead) that prompted me to set up rlwrap with it.

Installing rlwrap

To install rlwrap simply add the EPEL yum packages to your repository configuration:

sudo rpm -Uvh http://dl.fedoraproject.org/pub/epel/6/`uname -p`/epel-release-6-8.noarch.rpm

and then install rlwrap from yum:

sudo yum install -y rlwrap

Use

Once rlwrap is installed you can invoke beeline through it manually, specifying all the standard beeline options as you would normally (I’ve used the \ line continuation character here just to keep the example nice and clear):

rlwrap -a beeline \
-u jdbc:hive2://bdanode1:10000 \
-n rmoffatt -p password \
-d org.apache.hive.jdbc.HiveDriver

Now I can connect to beeline, and as before I press up arrow to access commands from when I previously used the tool, but I can also hit ctrl-R to start typing part of a command to recall it, just like I would in bash. Some other useful shortcuts:

  • Ctrl-l clears the screen but with the current line still shown
  • Ctrl-k deletes to the end of the line from the current cursor position
  • Ctrl-u deletes to the beginning of the line from the current cursor position
  • Esc-f moves forward one word
  • Esc-b moves backward one word
    (more here)

And most importantly, Home and End work just fine! (or, ctrl-a/ctrl-e if you prefer).

NB the -a argument for rlwrap is necessary because beeline already does some readline-esque functions, and we want rlwrap to forcibly override them (otherwise neither works very well). Or more formally (from man rlwrap):

Always remain in “readline mode”, regardless of command’s terminal settings. Use this option if you want to use rlwrap with commands that already use readline.

Alias

A useful thing to do is to add an alias directly in your profile so that it is always available to launch beeline under rlwrap, in this case as the rlbeeline command:

# this sets up "rlbeeline" as the command to run beeline
# under rlwrap, you can call it what you want though. 
cat >> ~/.bashrc<<EOF
alias rlbeeline='rlwrap -a beeline'
EOF
# example usage:
# rlbeeline \
# -u jdbc:hive2://bdanode1:10000 \
# -n rmoffatt -p password \
# -d org.apache.hive.jdbc.HiveDriver

If you want this alias available for all users on a machine create the above as a standalone .sh file in /etc/profile.d/.
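One way to do that (a sketch, assuming bash login shells; the filename is arbitrary):

# make the rlbeeline alias available to all users at login
sudo tee /etc/profile.d/rlbeeline.sh <<EOF
alias rlbeeline='rlwrap -a beeline'
EOF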

Autocomplete

One possible downside of using rlwrap with beeline is that you lose the native auto-complete option within beeline for the HiveQL statements. But never fear – we can have the best of both worlds, with the -f argument for rlwrap, specifying a list of custom auto-completes. So this is even a level-up for beeline, because we could populate it with our own schema objects and so on that we want auto-completed.
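For example, you could append your Hive table names to the completion list with something like this (a sketch; the connection details are the same illustrative ones as earlier, and the tr step is a rough-and-ready way of stripping beeline’s table borders from the output):

# append Hive table names to the auto-complete file
beeline -u jdbc:hive2://bdanode1:10000 -n rmoffatt -p password \
  --silent=true -e 'show tables;' | tr -d '| +-' >> beeline_autocomplete.txt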

As a quick-start, run beeline without rlwrap, hit tab twice and then ‘y’ to show all options, and paste the resulting list into a text file (eg beeline_autocomplete.txt). Now call beeline, via rlwrap, passing that file as an argument to rlwrap:

rlwrap -a -f beeline_autocomplete.txt beeline

Once connected, use auto-complete just as you would normally (hit tab after typing a character or two of the word you’re going to match):

Connecting to jdbc:hive2://bdanode1:10000
Connected to: Apache Hive (version 0.12.0-cdh5.0.1)
[...]
Beeline version 0.12.0-cdh5.0.1 by Apache Hive
0: jdbc:hive2://bdanode1:10000> SE
SECOND        SECTION       SELECT        SERIALIZABLE  SERVER_NAME   SESSION       SESSION_USER  SET
0: jdbc:hive2://bdanode1:10000> SELECT

Conclusion

rlwrap is the tool that keeps on giving; just as I was writing this article, I noticed that it also auto-highlights opening parentheses when typing the closing one. Nice!

First-timer tips for Oracle Open World

Last week I had the great pleasure to attend Oracle Open World (OOW) for the first time, presenting No Silver Bullets – OBIEE Performance in the Real World at one of the ODTUG user group sessions on the Sunday. It was a blast, as the saying goes, but the week before OOW I was more nervous about the event itself than my presentation. Despite having been to smaller conferences before, OOW is vast in its scale, and I felt like I did the week before going to university for the first time: full of uncertainty about what lay ahead, and worrying that everyone would know everyone else except me! So during the week I jotted down a few things that I’d have found useful to know ahead of going, which hopefully will help others going to OOW take it all in their stride from the very beginning.

Coming and going

I arrived on the Friday at midday SF time, and it worked perfectly for me. I was jetlagged so walked around like a zombie for the remainder of the day. On the Saturday I had a chance to walk around SF and get my bearings geographically, culturally and climatically. Sunday is “day zero”, when all the user group sessions are held, along with the opening OOW keynote in the evening. I think if I’d arrived Saturday afternoon instead I’d have felt a bit thrust into it all straight away on the Sunday.

In terms of leaving, the last formal day is Thursday, and it’s a full day of sessions too. I left straight after breakfast on Thursday and felt I was leaving too early. But OOW is a long few days & nights, so chances are by Thursday you’ll be beat anyway; check the schedule and plan your escape around it.

Accommodation

Book in advance! Like, at least two months in advance. There are 60,000 people descending on San Francisco, all wanting some place to stay.

Get an Airbnb; you get a lot more for your money than with a hotel. Wifi is generally going to be a lot better, and having a living space in which to exist is nicer than just a hotel room. Don’t fret about the “perfect” location – anywhere walkable to Moscone (where OOW is held) is good because it means you can drop your rucksack off at the end of the day etc, but other than that the events are spread around so you’ll end up walking further to at least some of them. Or, get an Uber like the locals do!

Sessions

Go to Oak Table World (OTW), it’s great, and free. Non-marketing presentations from some of the most respected speakers in the industry. Cuts through the BS. It’s also basically on the same site as the rest of OOW, so easy to switch back and forth between OOW/OTW sessions.

Go and say hi to the speakers. In general they’re going to want to know that you liked it. Ask questions — hopefully they like what they talk about so they’ll love to speak some more about it. You’ll get more out of a five minute chat than two hours of keynote. And on that subject, don’t fret about dropping sessions — people tweet them, the slides are usually available, and in fact you could be sat at your desk instead of OOW and have missed the whole lot so just be grateful for what you do see. Chance encounters and chats aren’t available for download afterwards; most presentations are. Be strict in your selection of “must see” sessions, lest you drop one you really really did want to see.

Use the schedule builder in advance, but download it to your calendar (watch out for line-breaks in the exported file that will break the import) and sync it to your mobile phone so you can see rapidly where you need to head next. Conference mobile apps are rarely that useful and frequently bloated and/or unstable.

Don’t feel you need to book every waking moment of every day to sessions. It’s not slacking if you go to half as many but are twice as effective from not being worn out!

Dress

Dress-wise, jeans and a polo are fine, with a company polo or a shirt for delivering presentations. Day wear is fine for evenings too, no need to dress up. Some people do wear shorts but they’re in the great minority. There are lots of suits around, given it is a customer/sales conference too.

Socialising

The sessions and random conversations with people during the day are only part of OOW — the geek chat over a beer (or soda) is a big part too. Look out for the Pythian blogger meetup, meetups from your country’s user groups, companies you work with, and so on.

Register for the evening events that you get invited to (ODTUG, Pythian, etc) because often if you haven’t pre-registered you can’t get in if you change your mind, whereas if you do register but then don’t go that’s fine as they’ll bank on no-shows. The evening events are great for getting to chat to people (dare I say, networking), as are the other events that are organised like the swim in the bay, run across the bridge, etc.

Sign up for stuff like the swim in the bay, it’s good fun – and I can’t even really swim. The run and the bike ride across the bridge are two other organised events. Hang around on twitter for details; people like Yury Velikanov and Jeff Smith are usually in the know, if not doing the actual organising.

General

When the busy days and long evenings start to take their toll don’t be afraid to duck out and go and decompress. Grab a shower, get a coffee, do some sightseeing. Don’t forget to drink water as well as the copious quantities of coffee and soda.

Get a data package for your mobile phone in advance of going, eg £5 per day for unlimited data. Conference wifi is just about OK at best, and often flaky. Trying to organise short-notice meetups with other people by IM/twitter/email gets frustrating if you only get online half an hour after the time they suggested to meet!

Don’t pack extra clothes ‘just in case’. Pack minimally because (1) you are just around the corner from Market Street with Gap, Old Navy etc so can pick up more clothes cheaply if you need to and (2) you’ll get t-shirts from exhibitors, events (eg swim in the bay) and you’ll need the suitcase space to bring them all home. Bring a suitcase with space in or that expands, don’t arrive with a suitcase that’s already at capacity.

Food

So much good food and beer. Watch out for some of the American beers; they seem to start at about 5% ABV and go upwards, compared to around 3.6% ABV here in the UK. Knocking them back at the same rate as you would at home will get messy.

In terms of food you really are spoilt; some of my favourites were:

  • Lori’s diner (map) : As a Brit, I loved this American diner, and great food - yum yum. 5-10 minutes walk from Moscone.
  • Mel’s drive-in (map) : Just round the corner from Moscone, very busy but lots of seats. Great American breakfast experience! yum
  • Grove (map) : Good place for breakfast if you want somewhere a bit less greasy than a diner (WAT!)


Adding Oracle Big Data SQL to ODI12c to Enhance Hive Data Transformations

An updated version of the Oracle BigDataLite VM came out a couple of weeks ago, and as well as updating the core Cloudera CDH software to the latest release it also included Oracle Big Data SQL, the SQL access layer over Hadoop that I covered on the blog a few months ago (here and here). Big Data SQL takes the SmartScan technology from Exadata and extends it to Hadoop, presenting Hive tables and HDFS files as Oracle external tables and pushing down the filtering and column-selection of data to individual Hadoop nodes. Any table registered in the Hive metastore can be exposed as an external table in Oracle, and a BigDataSQL agent installed on each Hadoop node gives them the ability to understand full Oracle SQL syntax rather than the cut-down SQL dialect that you get with Hive.


There are two immediate use-cases that come to mind when you think about Big Data SQL in the context of BI and data warehousing: you can use Big Data SQL to include Hive tables in regular Oracle set-based ETL transformations, giving you the ability to reference Hive data during part of your data load; and you can also use Big Data SQL as a way to access Hive tables from OBIEE, rather than having to go through Hive or Impala ODBC drivers. Let’s start off in this post by looking at the ETL scenario using ODI12c as the data integration environment, and I’ll come back to the BI example later in the week.

You may recall from a couple of posts earlier in the year on ETL and data integration on Hadoop that I looked at a scenario where I wanted to geo-code web server log transactions using an IP address range lookup file from a company called MaxMind. To determine the country for a given IP address you need to locate the IP address of interest within the ranges listed in the lookup file, something that’s easy to do with a full SQL dialect such as that provided by Oracle.
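A range lookup of that kind boils down to something like this (an illustrative sketch; the table and column names are assumptions):

-- find the country for each log entry by locating its IP address
-- within the ranges held in the lookup table
select l.host, l.request_date, g.country_name
from   access_log l
       join geoip_ranges g
       on   l.ip_integer between g.start_ip_integer and g.end_ip_integer;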


In my case, I’d want to join my Hive table of server log entries with a Hive table containing the IP address ranges, using the BETWEEN operator – except that Hive doesn’t support any type of join other than an equi-join. You can use Impala and a BETWEEN clause there, but in my testing anything other than a relatively small log-file Hive table took massive amounts of memory to join, as Impala works in-memory, which effectively ruled out doing the geo-lookup set-based. I then did the lookup using Pig and a Python API into the geocoding database, but then you’ve got to learn Pig; and I finally came up with my best solution using Hive streaming and a Python script that called that same API – but each of these is fairly involved and requires a bit of skill and experience from the developer.

But this of course is where Big Data SQL could be useful. If I could expose the Hive table containing my log file entries as an Oracle external table and then join that within Oracle to an Oracle-native lookup table, I could do my join using the BETWEEN operator and then output the join results to a temporary Oracle table; once that’s done I could then use ODI12c’s Sqoop functionality to copy the results back down to Hive for the rest of the ETL process. Looking at my Hive database using SQL*Developer 4.0.3’s new ability to work with Hive tables, I can see the table I’m interested in listed there.


and I can also see it listed in the DBA_HIVE_TABLES static view that comes with Big Data SQL on Oracle Database 12c:

SQL> select database_name, table_name, location
  2  from dba_hive_tables
  3  where table_name like 'access_per_post%';

DATABASE_N TABLE_NAME                     LOCATION
---------- ------------------------------ --------------------------------------------------
default    access_per_post                hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/access_per_post
default    access_per_post_categories     hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/access_per_post_categories
default    access_per_post_full           hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/access_per_post_full

There are various ways to create the Oracle external tables over Hive tables in the linked Hadoop cluster, including using the new DBMS_HADOOP package to create the Oracle DDL from the Hive metastore table definitions or using SQL*Developer Data Modeler to generate the DDL from modelled Hive tables, but if you know the Hive table definition and it’s not too complicated, you might as well just write the DDL statement yourself using the new ORACLE_HIVE external table access driver. In my case, to create the corresponding external table for the Hive table I want to geo-code, it looks like this:

CREATE TABLE access_per_post_categories(
  hostname varchar2(100), 
  request_date varchar2(100), 
  post_id varchar2(10), 
  title varchar2(200), 
  author varchar2(100), 
  category varchar2(100),
  ip_integer number)
organization external
(type oracle_hive
 default directory default_dir
 access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));
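Alternatively, the DBMS_HADOOP route mentioned above generates the DDL for you from the Hive metastore definition. A minimal sketch (the cluster name, and the exact parameter list for your release, are assumptions worth checking against the documentation) might look like this:

set serveroutput on
DECLARE
  ddl_text CLOB;
BEGIN
  -- generate (but don't execute) the Oracle external table DDL
  -- for the Hive table; the cluster name is an assumption
  dbms_hadoop.create_extddl_for_hive(
    cluster_id      => 'bigdatalite',
    db_name         => 'default',
    hive_table_name => 'access_per_post_categories',
    hive_partition  => FALSE,
    table_name      => 'ACCESS_PER_POST_CATEGORIES',
    perform_ddl     => FALSE,
    text_of_ddl     => ddl_text
  );
  dbms_output.put_line(ddl_text);
END;
/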

Then it’s just a case of importing the metadata for the external table over Hive, and the tables I’m going to join to and then load the results into, into ODI’s repository and then create a mapping to bring them all together.


Importantly, I can create the join between the tables using the BETWEEN clause, something I just couldn’t do when working with Hive tables on their own.
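In plain SQL terms the join step amounts to something like this (a sketch of the idea rather than ODI’s actual generated code; the staging table and lookup table names are assumptions):

-- stage the geocoded rows in Oracle, ready to be copied back down to Hive
create table tmp_access_geocoded as
select l.*, g.country_name
from   access_per_post_categories l
       join geoip_ranges g
       on   l.ip_integer between g.start_ip_integer and g.end_ip_integer;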


Running the mapping then joins the webserver log table to the geocoding IP address range lookup table through the Oracle SQL engine, removing all the complexity of using Hive streaming, Pig or the other workaround solutions I used before. What I can then do is add a further step to the mapping that takes the output of my join and uses it to load the results back into Hive.


I’ll then use the IKM SQL to Hive-HBase-File (SQOOP) knowledge module to set up the export from Oracle into Hive.
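Under the covers that KM drives a Sqoop import from Oracle into Hive, roughly along these lines (an illustrative sketch; the connection string, credentials and table names are assumptions):

# copy the Oracle staging table into a Hive table via Sqoop
sqoop import \
  --connect jdbc:oracle:thin:@bigdatalite:1521/orcl \
  --username BLOG --password welcome1 \
  --table TMP_ACCESS_GEOCODED \
  --hive-import --hive-table access_per_post_geo \
  -m 1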


Now, when I run the mapping I can see the initial table join taking place between the Oracle native table and the Hive-sourced external table, and the results then being exported back into Hadoop at the end using the Sqoop KM.


Finally, I can view the contents of the downstream Hive table loaded via Sqoop, and see that it does in fact contain the country name for each of the page accesses.


Oracle Big Data SQL isn’t a solution suitable for everyone; it only runs on the BDA and requires Exadata for the database access, and it’s an additional license cost on top of the base BDA software bundle. But if you’ve got it available it’s an excellent way to blend Hive and Oracle data, and a great way around some of the restrictions around HiveQL and the Hive JDBC/ODBC drivers. More on this topic later next week, when I’ll look at using Big Data SQL in conjunction with OBIEE 11g.

News and Updates from Oracle Openworld 2014

It’s the Saturday after Oracle Openworld 2014, and I’m now home from San Francisco and back in the UK. It’s been a great week as usual, with lots of product announcements and updates to the BI, DW and Big Data products we use on current projects. Here’s my take on what was announced this last week.

New Products Announced

From a BI and DW perspective, the most significant product announcements were around Hadoop and Big Data. Up to this point most parts of an analytics-focused big data project required you to code the solution yourself, across the three typical steps in such a project – data ingestion, analysis and sharing the results.


At the moment, all of these steps are typically performed from the command-line using languages such as Python, R, Pig, Hive and so on, with tools like Apache Flume and Apache Sqoop used to bring data into and out of the Hadoop cluster. Under the covers, these tools use technologies such as MapReduce or Spark to do their work, automatically running jobs in parallel across the cluster and making use of the easy scalability of Hadoop and NoSQL databases.

You can also neatly divide the work up on a big data project into two phases: the “discovery” phase typically performed by a data scientist, where data is loaded, analysed, correlated and otherwise “understood” to provide the initial insights, and then an “exploitation” phase where we apply governance, provide the output data in a format usable by BI tools and otherwise share the results with the wider corporate audience. The updated Information Management Reference Architecture we collaborated on with Oracle and launched in June this year had distinct discovery and exploitation phases, and the architecture itself made a clear distinction between the Innovation part that enabled the discovery phase of a project and the Execution part that delivered the insights and data in a more governed, production setting.


This was the theme of the product announcements around analytics, BI, data warehousing and big data during Openworld 2014, with Oracle’s Omri Traub taking us through Oracle’s big data product strategy. What Oracle are doing here is productising and “democratising” big data, putting it clearly in context of their existing database, engineered systems and BI products and linking them all together into an overall information management architecture and delivery process.


Working through from ingestion to data analysis, these steps have typically been performed by data scientists using scripting tools and rudimentary data visualisation engines, making them labour-intensive and reliant on a small set of people conversant with these tools and processes. Oracle Big Data Discovery is aimed squarely at these steps, and combines Apache Spark-based data preparation and transformation capabilities with an analysis and visualisation engine based on Endeca Server.


Key features of Big Data Discovery include:

  • Ability to analyse, parse, explore and “wrangle” data using graphical tools and a Spark-based transformation engine
  • Create a catalog of the data on your Hadoop cluster, and then search that catalog using Endeca Server search technologies
  • Create recommendations of other datasets that might interest you, based on what you’re looking at now
  • Visualize your datasets to help understand what they contain, and discover new insights

Under the covers it comprises two parts: the data loading, transformation and profiling part, which uses Apache Spark to do its work in parallel across all the nodes in the cluster; and the analysis part, which takes data prepared by Apache Spark and loads it into the Endeca Server in-memory engine to perform the analysis, aggregation and data visualisation. Unlike the Spark part, the Endeca Server element runs on just one node, which limits the size of the analysis dataset to what can run in-memory in the Endeca Server engine, but in practice you’re going to work with a sample of the data rather than the entire dataset at that stage (in time the assumption is that the Endeca Server engine will be unbundled and run natively on YARN, giving it the same scalability as the Spark-based data ingestion and transformation part). Initially Big Data Discovery will run on-premise with a cloud version later on, and it’s not dependent on Big Data Appliance – expect to see something later this year / early next year.

Another new product that addresses the discovery phase and discovery lab part of a big data project is Oracle Data Enrichment Cloud Service (ODECS), from the Oracle Data Integration team and designed to complement ODI and Oracle EDQ. Whilst Oracle positioned ODECS as something you’d use as well as Big Data Discovery, and typically upstream from BDD, to me there seemed to be a fair bit of overlap between the products, with both tools doing data profiling and transformation but BDD being more focused on the exploration and discovery part, and ODECS being more focused on early-stage data profiling and transformation.


ODECS is clearly more of an ETL tool complement and runs natively in the cloud, right from the start. It’s most probably aimed at customers with their Hadoop dataset already in the cloud, maybe using Amazon Elastic MapReduce or Oracle’s new Hadoop-as-a-Service, and has more in common with the old Data Quality Option for Oracle Warehouse Builder than Endeca’s search-first analytic interface. It’s got a very nice interface, including a mobile-enabled website and the ability to include and merge in external datasets, including Oracle’s own Data as a Service platform offering. Along with the new Metadata Management tool Oracle also launched at Openworld, it’s a great addition to the Oracle Data Integration product suite, but I can’t help thinking that its initial availability only on Oracle’s public cloud platform is going to limit its use with Oracle’s typical customers – we’ll have to just wait and see.

The other major product that addresses big data projects was Oracle Big Data SQL. Partly addressing the discovery phase of big data projects but mostly (to my mind) addressing the exploitation phase, and the execution part of the information management architecture, Big Data SQL gives Oracle Exadata the ability to return data from Hive and NoSQL on the Big Data Appliance as well as data from its normal relational store. I covered Big Data SQL on the blog a few weeks ago and I’ll be posting some more in-depth articles on it next week, but the other main technical innovation with the product is its bringing of Exadata’s SmartScan feature to Hadoop, projecting and filtering data at the Hadoop storage node level and also giving Hadoop the ability to understand regular Oracle SQL, rather than the cut-down version you get with HiveQL.


Where this then leaves us is with the ability to do most of a big data project using (Oracle) tools, bringing big data analysis within reach of organisations with Oracle-style budgets but without access to rare data scientist-type resources. Going back to the three project steps I described earlier, a post-OOW big data project could now cover most of them using the new products launched in this last week.


Big Data SQL is out now and depends on BDA and Exadata for its use; Big Data Discovery should be out in a few months’ time and runs on-premise but doesn’t require BDA; whilst ODECS is cloud-only and runs on a BDA in the background. Expect more news and more integration/alignment from the products as 2014 ends and 2015 starts, and we’re looking forward to using them on Oracle-centric Hadoop projects in the near future.

Product Updates for BI, Data Integration, Exalytics, BI Applications and OBIEE

Other news announced over the week for products we more commonly use on projects include:

Finally, something that we were particularly pleased to see was the updated Oracle Information Management Architecture I mentioned earlier referenced in most of the analytics sessions, with Oracle’s Balaji Yelamanchili for example introducing it in his big data and business analytics general session mid-way through the week. 


We love the way this brings together the big data components and puts them in the context of the wider data warehouse and analytic processes, and compared to a few years ago, when Hadoop and big data were considered completely separate to data warehousing and BI and done by staff completely different to the core business analytics team, this new reference architecture puts it squarely within the world of BI and analytics we work in. It also emphasises the new abilities Hadoop, NoSQL databases and big data can bring us – support for wider sets of data sources with dynamic schemas, the ability to economically work with and analyse much larger datasets, and support for discovery-type upfront analysis work. Finally, it recognises that to get true value out of analysis you start on Hadoop, you eventually need to add proper data governance, make the results more widely available using full SQL tools, and use the right tools – relational databases, OLAP servers and the like – to analyse the data once it’s in a more structured form.

If you missed our write-up on the updated Information Management Reference Architecture you can read our two-part blog post here and here, read the Oracle white paper, or listen to the podcast with OTN Archbeat’s Bob Rhubart. For now though I’m looking forward to seeing the family after a week and a half away in San Francisco – thanks to OTN and the Oracle ACE Director Program for sponsoring my visit over to SF for Openworld, and we’ll post our conference presentation slides later next week when we’re back in the UK and US offices.

EPM and BI Meetup at Next Week’s Openworld (and details of our Oracle DI Speakeasy)

Just a short note to help publicise the Oracle Openworld 2014 EPM and BI Meetup that’s running next week, organised by Cameron Lackpour and Tim Tow from the ODTUG board.

This is an excellent opportunity for EPM and BI developers and customers to get together and network over drinks and food, and chat with members of the ODTUG board and maybe some of the EPM and BI product management team. It’s running at Piattini, located at 2331 Mission St. (between 19th St & 20th St), San Francisco, CA 94110 from 7pm to late and there’s more details at this blog post by Cameron. The turnout should be pretty good, and if you’re an EPM or BI developer looking to meet up with others in your area this is a great opportunity to do so. Attendance is free and you just need to register using this form.

Similarly, if you’re into data warehousing and data integration you might be interested in our Rittman Mead / Oracle Data Integration’s Speakeasy event, running on the same evening (Tuesday September 30th 2014) from 7pm – 9pm at Local Edition, 691 Market St, San Francisco, CA. Aimed at ODI, OWB and data integration developers and customers and featuring members of the Rittman Mead team and Oracle’s Data Integration product team, again this is a great opportunity to meet with your peers and share stories and experiences. Registration is free and done through this registration form, with spaces still open at the time of posting.