Managing Impala and Other Mixed Workloads on the Oracle Big Data Appliance
One of our current client projects uses Cloudera Impala to provide fast ad-hoc querying over the data we're loading into their Oracle Big Data Appliance Hadoop environment. Impala bypasses MapReduce to deliver faster queries than Hive, but to do so it performs a lot of processing in-memory and runs server processes on each node in the cluster; in some cases this leads to runaway queries blocking other workloads, in much the same way that OBIEE queries on an Oracle Database can sometimes block ETL and application workloads. Several projects share this Big Data Appliance, so to limit the impact Impala could have on other cluster workloads the client had disabled the Impala daemons on nine of the twelve nodes in the appliance. Our concern with this approach was that an Impala query can access data from any datanode in the cluster: HDFS data is typically stored and replicated across three nodes, so running the Impala daemons on just a quarter of the available nodes was likely to lead to data locality issues for Impala, with blocks getting shipped across the network unnecessarily.
Going back to OBIEE and the Oracle Database, Oracle has a resource management feature for the Oracle Database that allows you to put users and queries into separate resource pools and manage the share of overall resources each pool gets; I covered this concept on the blog a few years ago. The version of Cloudera Hadoop (CDH5.3) used on the client's Big Data Appliance has a feature called YARN (Yet Another Resource Negotiator) that splits out the resource management and scheduling parts that were bound into MapReduce in Hadoop 1.0, so that MapReduce just runs as one workload type on Hadoop, making it possible to run other workload types, for example Apache Spark, on the same cluster management framework.
Impala isn't configured to use YARN by default, however, and instead uses an internal scheduler to govern how concurrent queries run and use cluster resources. It can be configured to use YARN in what Cloudera terms "Integrated Resource Management", and our initial response was to recommend this approach; however, YARN is really optimised for longer-running batch jobs rather than the shorter jobs Impala generates, to the point that Cloudera recommends you don't actually use YARN for Impala, and instead control Impala resource usage via service-level process constraints or through a newer Impala feature called Admission Control. Taking a step back though: how do we actually see what resources Impala is using across the cluster when a query runs, and is there a feature similar to the Oracle Database's SQL explain plan to help us understand how an Impala SQL query is executed? Then, using this and the various resource management options available to us, can we understand how YARN and the other options will affect the Impala users on the client's cluster if we enable them? And, given that we were going to test this all out on one of our development Hadoop clusters running back at the office on VMWare, how well could we simulate the multiple concurrent queries and mixed workload we'd then encounter on the real customer Big Data Appliance?
When trying to understand what goes on when a Cloudera Impala SQL query runs, the two main tools in your toolbox are EXPLAIN plans and query profiles. The concept of EXPLAIN plans will be familiar to Oracle developers, and putting “explain” before your Impala SQL query when you’re using the Impala Shell (or pressing the “Explain” button when you’re using the Impala Editor in Hue) will display an output like the one below, showing the steps the optimiser plans to take to return the query results:
[bda7node1.rittmandev.com:21000] > explain select sum(f.flights) as total_flights, d.dest_city from flight_delays f join geog_dest d on f.dest = d.dest join geog_origin o on f.orig = o.orig where d.dest_state = 'California' and o.orig_state in ('Florida','New York','Alaska') group by d.dest_city having total_flights > 3000;
Query: explain select sum(f.flights) as total_flights, d.dest_city from flight_delays f join geog_dest d on f.dest = d.dest join geog_origin o on f.orig = o.orig where d.dest_state = 'California' and o.orig_state in ('Florida','New York','Alaska') group by d.dest_city having total_flights > 3000
+---------------------------------------------------------------------+
| Explain String                                                      |
+---------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=154.01MB VCores=2           |
|                                                                     |
| 10:EXCHANGE [UNPARTITIONED]                                         |
| |                                                                   |
| 09:AGGREGATE [FINALIZE]                                             |
| |  output: sum:merge(f.flights)                                     |
| |  group by: d.dest_city                                            |
| |  having: sum(f.flights) > 3000                                    |
| |                                                                   |
| 08:EXCHANGE [HASH(d.dest_city)]                                     |
| |                                                                   |
| 05:AGGREGATE                                                        |
| |  output: sum(f.flights)                                           |
| |  group by: d.dest_city                                            |
| |                                                                   |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.orig = o.orig                                 |
| |                                                                   |
| |--07:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  02:SCAN HDFS [airlines.geog_origin o]                            |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: o.orig_state IN ('Florida', 'New York', 'Alaska') |
| |                                                                   |
| 03:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.dest = d.dest                                 |
| |                                                                   |
| |--06:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  01:SCAN HDFS [airlines.geog_dest d]                              |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: d.dest_state = 'California'                       |
| |                                                                   |
| 00:SCAN HDFS [airlines.flight_delays f]                             |
|    partitions=1/1 files=1 size=64.00MB                              |
+---------------------------------------------------------------------+
Fetched 35 row(s) in 0.21s
Like an Oracle SQL explain plan, Impala's cost-based optimiser uses table and partition stats, which you should previously have gathered using Impala's "compute stats" command, to determine what it thinks is the optimal execution plan for your query. To see the actual cost and timings for the various plan steps once a query has run, you can use the "summary" statement (or, for more detail, the "profile" statement) to see the actual timings and stats for each step in the query execution.
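As a side note, gathering those stats is a one-off COMPUTE STATS statement per table (re-run after significant data changes); here's a minimal sketch using the tables from this post's examples:

compute stats airlines.flight_delays;
compute stats airlines.geog_dest;
compute stats airlines.geog_origin;

With stats in place, running the query and then issuing the Summary statement gives output like this: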
[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 20.35us  | 20.35us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 142.18ms | 180.81ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 59.86us  | 123.39us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 171.72ms | 208.36ms | 60      | 1.93K      | 22.73 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 89.42ms  | 101.82ms | 540.04K | 131.88M    | 12.79 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 16.32us  | 19.63us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 302.83ms | 302.83ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 936.71ms | 1.10s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 19.02us  | 46.49us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 266.99ms | 266.99ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.07s    | 1.90s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
Output from the Summary statement gives us some useful information for working out the impact of the various resource management options on the Oracle Big Data Appliance, at least in terms of their impact on individual Impala queries – we'll look at the impact on the overall Hadoop cluster and individual nodes later on. From the Summary report above I can see that my query ran on all six nodes in the cluster (queries I ran earlier against a smaller version of the fact table ran on just a single node), and I can see how long each step in the query actually took to run. So what happens if I run the same query again, but use Cloudera Manager to disable the Impala daemon service role on three of the nodes?
Here’s the Summary output after running the query again:
[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 22.01us  | 22.01us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 3      | 111.12ms | 117.24ms | 7       | 193        | 6.27 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 3      | 30.09us  | 39.02us  | 30      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 3      | 161.26ms | 173.57ms | 30      | 1.93K      | 22.84 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 3      | 156.50ms | 238.90ms | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 3      | 20.19us  | 28.93us  | 1.41K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 477.38ms | 477.38ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 3      | 1.48s    | 1.66s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 3      | 12.07us  | 14.89us  | 519     | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 308.83ms | 308.83ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 3      | 3.39s    | 6.85s    | 131.88M | 131.88M    | 74.11 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
What the Summary statement doesn't show you is the overall time the query took to run: the query against three nodes took 9.48s, compared to 3.59s for the earlier run with all six nodes' Impala daemons enabled. In fact I'd expect a query running on the client's BDA with just three out of twelve nodes enabled to run even slower because of the block locality issue – Impala has a feature called block locality tracking, which keeps track of where HDFS data blocks are actually located on the cluster and tries to run impalad tasks on the right nodes, but with only three out of twelve nodes running that job becomes very hard. The other factor we need to consider is how running multiple queries concurrently affects things when only a few nodes are handling all the Impala user queries.
To try and simulate concurrent queries I opened six terminal sessions against nodes actually running Impala daemon service roles and submitted the same query from each session, with a second or two's gap between each query; with all six nodes enabled the average response time rose to about 6s, but with just three enabled it rose fairly consistently to around 27s.
This is of course what you'd expect with everything trying to run on the same three (now resource-starved) server nodes, and again I'd expect the effect to be even more pronounced on the client's twelve-node BDA. What this test didn't cover, of course, was running workloads other than Impala on the same cluster, or running queries against different datasets, but it did at least show how response time increases fairly dramatically (albeit consistently) as more Impala users come onto the system.
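For anyone wanting to reproduce this kind of test, the harness amounts to little more than a shell loop; here's a rough sketch under assumed names – the hostnames, query file and timings log are illustrative placeholders rather than the exact script we used:

#!/bin/bash
# Submit the same query to six Impala daemons a couple of seconds apart,
# recording how long each run takes (hostnames and files are placeholders)
for i in 1 2 3 4 5 6; do
  ( time impala-shell -i bda7node${i}.rittmandev.com:21000 \
         -f flight_delays_query.sql > /dev/null ) 2>> timings.log &
  sleep 2
done
wait   # let all six queries finish before reading timings.log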
So now that we have some baseline benchmarking figures, let's configure Impala to use YARN, using Cloudera Manager on the CDH5.3 setup used on the client's BDA and our development cluster back in the office. There are actually two parts to Impala running on YARN in CDH5.x: YARN itself as the overall cluster resource management layer, and another component called Llama (Low-Latency, or "Long-Lived", Application Master) that sits between YARN and Impala and reduces the time each Impala query takes to obtain YARN resource allocations.
Enabling YARN and Llama (and, if you want, configuring Llama and thereby Impala for high availability) is done through a wizard in CDH5.3 that also offers to set up a Linux feature called cgroups, which YARN can use to limit the "containers" it uses for resource management at the OS level.
Once you've run through the wizard and restarted the cluster, Impala should be configured to use YARN instead of its own scheduler to request resources, which in theory will allow Hadoop and the Big Data Appliance to consider Impala workloads alongside MapReduce, Spark and HBase when scheduling jobs across the cluster. Before we get into the options YARN gives us for managing these workloads, I ran the same Impala queries again, first as a single query and then with six running concurrently, to see what impact YARN on its own had on query response times.
The single query on its own took around the same time to run as without YARN (3-4s), but when I ran six concurrent queries together the response time went up from the 3-4s I saw without YARN enabled to between 5s and 18s depending on the session, with quite a bit of variation between response times compared to the consistent times I saw when YARN wasn't being used. This surprised me, as one of the stated benefits of YARN is making job execution times more predictable and smooth, though this could be more of an overall-cluster benefit, and the Cloudera docs also contain recommendations for making YARN and Llama's resource estimation more efficient for Impala.
[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 41.38us  | 41.38us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 115.28ms | 123.04ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 44.44us  | 67.62us  | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 170.91ms | 201.47ms | 60      | 1.93K      | 22.82 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 82.25ms  | 98.34ms  | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.39us  | 18.99us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 244.40ms | 244.40ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 850.55ms | 942.47ms | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 13.99us  | 19.05us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 222.03ms | 222.03ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.54s    | 2.88s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
It seems clear, though, that users of Impala on the client cluster should expect some overhead from using YARN to manage Impala's resources, with the payoff being a better balance between Impala workloads and the other uses they're putting the BDA cluster to. I think there's more we could do to fine-tune how Llama and YARN allocate memory to Impala queries up-front (allocating a set amount of memory for all queries, rather than making an estimate and then adding more memory mid-query if needed), and of course we've not really tested it on a cluster with a full, mixed workload running. But what about our original scenario, where only a certain percentage of the overall cluster resources or nodes are allocated to Impala query processing? To set up that sort of division of resources we can use another YARN feature called dynamic allocation, with dynamic resource pools that we can again set up through Cloudera Manager.
Dynamic allocation is one of the ways YARN can be configured to manage multiple workloads on a Hadoop cluster (the other is through static service pools, which I'll come to in a moment). Using dynamic allocation I can set up a resource pool for the airline flight delays application my Impala SQL queries are associated with and allocate it 25% of overall cluster resources, with the remainder allocated to other applications. I can keep that weighting simple, as I have done in the screenshot below, or I can allocate resources based on virtual cores and memory, but I found it simpler to just set these overall weightings and let YARN worry about cores and RAM.
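Under the covers these pool definitions amount to a YARN Fair Scheduler allocations configuration; as a hedged sketch (the pool names and weights come from this example, but the exact file Cloudera Manager generates may differ), the equivalent fair-scheduler settings would look something like this:

<allocations>
  <!-- roughly 25% of cluster resources to the airlines application -->
  <queue name="airlines">
    <weight>1.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
  <!-- the remaining ~75% for everything else -->
  <queue name="default">
    <weight>3.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
</allocations>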
Depending on the scheduling policy you select, YARN will prioritise Impala and other jobs in different ways, but the recommended scheduling policy for mixed workloads is Dominant Resource Fairness (DRF), which balances RAM and CPU depending on which resource pool needs them most at a particular time. Note also that Impala can either be managed as part of the overall YARN workload or separately, a choice you make via the "Enable Dynamic Resource Pools" setting in the Impala service configuration in Cloudera Manager (checked in the screenshot below, but unchecked for the screenshot above).
There's also a separate control you can place on Impala queries called Admission Control, which limits the number of queries that can run or be queued for a resource pool at any particular time. The docs are a bit vague on when to use admission control versus YARN, but my take is this: if it's just Impala queries you're worried about, and throttling their use solves the problem, use this feature and leave Impala outside of YARN; if you need to manage overall mixed workloads, do it all through YARN. For my testing, though, I just went with simple resource pool weighting, and you can see from the screenshot below that when multiple queries are running at once for my pool, CPU and RAM resources are constrained as expected.
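Going back to admission control for a moment: outside of YARN it's configured either per resource pool in Cloudera Manager or through impalad startup options for the default pool. The sketch below uses option names I believe are correct for this CDH release, but treat both the names and the values as assumptions to verify against the Impala documentation:

# impalad startup options for default-pool admission control
# (illustrative values; verify option names against your CDH release)
-default_pool_max_requests=20    # queries allowed to run concurrently
-default_pool_max_queued=50      # queries allowed to queue before being rejected
-default_pool_mem_limit=24G      # aggregate memory limit across the pool's queries
-queue_wait_timeout_ms=60000     # how long a queued query waits before timing out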
To make a particular Impala query run within a specific resource pool you can either allocate that user to a named resource pool, or you can specify the resource pool in your Impala shell session like this:
[bda7node1.rittmandev.com:21000] > set request_pool = airlines;
REQUEST_POOL set to airlines
[bda7node1.rittmandev.com:21000] > select sum(f.flights) as total_flights, d.dest_city from airlines.flight_delays_full f join airlines.geog_dest d on f.dest = d.dest join airlines.geog_origin o on f.orig = o.orig where d.dest_state = 'California' and o.orig_state in ('Florida','New York','Alaska') group by d.dest_city having total_flights > 3000;
Looking then at a typical Summary output for a query running with these restrictions (25% of resources overall) and other queries running concurrently, the numbers don't look all that different from before, and results took between 8s and 30s to return. Again I was surprised at the variance, but I think YARN is more about overall cluster throughput than individual queries, and you shouldn't read too much into specific timings on a dev server with an unrepresentative overall workload.
[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 26.78us  | 26.78us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 209.10ms | 262.02ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 63.20us  | 118.89us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 282.56ms | 401.37ms | 60      | 1.93K      | 22.76 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 99.56ms  | 114.14ms | 540.04K | 131.88M    | 12.85 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.49us  | 17.94us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 531.08ms | 531.08ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 1.20s    | 1.54s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 24.29us  | 68.23us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 287.31ms | 287.31ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 2.34s    | 3.13s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
A point to note is that I found it very hard to get Impala queries to run at all once I got down to specifying virtual core and memory limits rather than just overall weightings, so I'd go with the high-level resource pool prioritisations, which seemed to work and didn't unduly affect query response times. For example, the setting below looked clever, but queries always seemed to time out and I never got a satisfactory setup working properly.
Note that for YARN dynamic resource pools to be used, all Linux/CDH users will need to be assigned to resource pools so they don’t run as “unconstrained”; this can also be done from the Dynamic Resource Pools configuration page.
Finally, though, all of this management through resource pools might not be the best way to control Impala's resource usage under YARN. The Cloudera docs say quite clearly, on the Integrated Resource Management page, that:
“When using YARN with Impala, Cloudera recommends using the static partitioning technique (through a static service pool) rather than the combination of YARN and Llama. YARN is a central, synchronous scheduler and thus introduces higher latency and variance which is better suited for batch processing than for interactive workloads like Impala (especially with higher concurrency). Currently, YARN allocates memory throughout the query, making it hard to reason about out-of-memory and timeout conditions.”
What this means in practice is that if you've got a single project using the Big Data Appliance, and you just want to specify at a high level what proportion of resources Impala, HBase, MapReduce and the other services under YARN management use, you can define this as static service pool settings in Cloudera Manager and have the restrictions enforced by Linux cgroups. In the screenshot below I unwound all of the dynamic resource pool settings I created a moment ago and allocated 25% of overall cluster resources to Impala, with the wizard then using those top-level values to set limits for the services across all nodes in the cluster based on their actual RAM and CPU, the services running on them, and so on.
Then, going back to Cloudera Manager and running some queries, you can see these static service pool limits being applied in real-time and their effect in the form of graphs for particular cluster resources.
So, given all of this, what was our recommendation to the client about how best to set up resource management for Impala and other workloads on their Big Data Appliance? Not too much should be read into individual numbers – it's hard to simulate a proper mixed workload on a development server, and of course their BDA has twelve nodes, more memory and faster CPUs. However, it's probably fair to draw the following conclusions:
- Running Impala daemons on just a subset of nodes isn't actually a bad way to constrain the resources Impala uses, but it'll only work on clusters with a small number of nodes, where there's a good chance one of the enabled nodes will hold one of the three copies of a data block. On a system of the scale of our customer's we'd probably hit unacceptable overheads in terms of block locality, so I wouldn't carry on with this approach.
- If the customer's BDA will be running a mixed workload – data loading and long-running Hive/Pig/Spark jobs as well as short-running Impala queries – enabling Impala for YARN and setting overall resource pools per application would be the best approach, though individual Impala queries will probably run slower than now (even allowing for the restriction in resources) due to the overhead YARN imposes when scheduling and running jobs. This is, however, the best way to allocate resources between applications and provide a generally "smoother" experience for users.
- If the BDA needs to be optimised mostly for Impala queries, then don't manage Impala under YARN; leave it outside and just use static service pools to allocate Impala roughly 25% of resources across all nodes. In both this and the previous scenario (Impala on YARN), all nodes should be re-enabled for Impala so as to minimise block locality issues.
- If the only real issue is Impala queries for a particular application taking all the resources or becoming runaway, Impala could be left outside of YARN but enabled for admission control, so as to limit the total number of running/queued queries for that application.
Oracle Big Data Discovery 1.1 now GA, and Available as part of BigDataLite 4.2.1
The new Oracle Big Data Discovery 1.1 release went GA a couple of weeks ago, and came with a bunch of features that addressed show-stoppers in the original 1.0 release; the ability to refresh and reload datasets from Hive, compatibility with Cloudera CDH and Hortonworks HDP Hadoop platforms, Kerberos integration, and the ability to bring in datasets from remote JDBC datasources. If you’re new to Big Data Discovery I covered the initial release in a number of blog posts over the past year or so:
- Introducing Oracle Big Data Discovery Part 1: “The Visual Face of Hadoop”
- Introducing Oracle Big Data Discovery Part 2: Data Transformation, Wrangling and Exploration
- Introducing Oracle Big Data Discovery Part 3: Data Exploration and Visualization
- Combining Oracle Big Data Discovery and Oracle Visual Analyzer on BICS
- Some Oracle Big Data Discovery Tips and Techniques
So let's start by loading some data into Big Data Discovery so that we can explore what's in it, see the range of attributes and their values, and do some basic data clean-up and enrichment. As with Big Data Discovery 1.0 you import (or "sample") data into Big Data Discovery's DGraph engine either via file upload or by using a command-line utility. Data in Hadoop has to be registered in the Hive HCatalog metadata layer, and I'll start by importing a Hive table mapped to some webserver log files via a Hive SerDe:
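The table definition itself is ordinary HiveQL using a regex SerDe; the sketch below is hypothetical – the column names, regex and HDFS path are illustrative rather than the actual table used – but it shows the general shape:

-- Hypothetical SerDe-backed external table over Apache webserver logs;
-- column names, regex and location are illustrative only
CREATE EXTERNAL TABLE apachelog_parsed (
  host     STRING,
  req_time STRING,
  request  STRING,
  status   STRING,
  agent    STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) \\S+ \\S+ \\[([^\\]]*)\\] \"([^\"]*)\" (\\d+) \\S+ \"[^\"]*\" \"([^\"]*)\""
)
LOCATION '/user/oracle/apache_logs/';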
To import or sample this table into BDD's DGraph engine I use the following command to invoke the Big Data Discovery data processing engine, which reads the Hive table metadata, loads the Hive table data into the DGraph engine (either all rows, or a representative sample) and then processes and enriches the data, adding geocoding for example:
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t apachelog_parsed
This then runs as an Apache Spark job under YARN, progress of which you can track either from the console or through Cloudera Manager / Hue.
[2015-09-04T14:20:43.792-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:oracle]
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: root.oracle
     start time: 1441390841404
     final status: UNDEFINED
     tracking URL: http://bigdatalite.localdomain:8088/proxy/application_1441385366282_0001/
     user: oracle
[2015-09-04T14:20:45.794-04:00] [DataProcessing] [INFO] [] [org.apache.spark.Logging$class] [tid:main] [userID:oracle] Application report for application_1441385366282_0001 (state: ACCEPTED)
Going over to Big Data Discovery Studio I can then see the dataset within the catalog, load it into a project, and start exploring and transforming the dataset. In the screenshots below I'm cleaning up the date and time field to turn it into a timestamp, and arranging the automatically-derived country, city, region and state attributes into a geography hierarchy. BDD 1.1 comes with a bunch of other transformation enhancements, including new options for enrichment, the ability to tag values via a whitelist and so on – a full list of new features in BDD 1.1 can be found in MOS Doc.ID 2044712.1.
Crucially, in BDD 1.1 you can now either refresh a dataset with new data from Hive (a re-load), or do an incremental update after you've selected an attribute as the update (identity) column – in the screenshot below I'm doing this for a dataset uploaded from a file, but you can reload and update datasets from the command line too, which opens up the possibility of scripting, scheduling and so on.
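In sketch form – and with the caveat that the exact flag names here are my assumption to check against the BDD 1.1 documentation, as the DP CLI options changed between releases – a scripted refresh and incremental update would look something like:

# Hypothetical DP CLI invocations; verify flag names against the BDD 1.1 docs
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI --refreshData apachelog_parsed
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI --incrementalUpdate apachelog_parsed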
You can also define JDBC data connections in the administration part of BDD Studio, and then type in SQL queries to define data sources that can be added to your project as datasets – loading their data directly into the DGraph engine rather than having to stage it in Hadoop beforehand.
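The SQL you type in is just a regular query against the source schema; a hypothetical example (the table and column names are invented for illustration) for pulling published blog posts from an Oracle source might be:

select p.post_id,
       p.post_title,
       p.post_date,
       p.post_author
from   blog_posts p
where  p.post_status = 'publish'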
Then, as with the initial release of Big Data Discovery, you can define joins between the datasets in a project based on common attributes – in this case I'm joining the page URLs in the webserver logs with the page and article data extracted from our WordPress install, sourced from both Hive and Oracle (via JDBC).
Other new features in BDD1.1 include the ability to define “applications”, projects and datasets that are considered “production quality” and include details on how to refresh and incrementally load their datasets (designed presumably to facilitate migrations from Endeca Information Discovery), and a number of other new features around usability, data exploration and security. You can download BDD1.1 from Oracle’s Edelivery site, or download it pre-installed and configured as part of the new Big Data Lite 4.2.1 Virtualbox virtual machine.
Rittman Mead's First Courses in Spanish in Latin America
Last week, as part of the OTN Tour LA 2015, we delivered our first courses in Spanish, in Quito, Ecuador and Cali, Colombia.
In this case it was a one-day (8-hour) workshop on OBIEE 11g front-end development.
The courses were delivered in person, with all of the theory and hands-on material in Spanish, and each student had a dedicated virtual machine on which to complete the practical exercises, with the instructor's support and assistance available throughout the training.
I've been teaching courses for many years now. The first course I taught was back in 1999 (ahem, yes, in the last century). Since then I've taught many more, covering a range of Oracle products, including more than ten years as an Oracle University instructor. In total I must have taught more than 500 people over the course of my career.
For the past year and a half I've been happily working at Rittman Mead, developing and delivering courses in English, which has been a marvellous challenge. But teaching in Spanish again was, as they say, like coming home – a really lovely experience.
The feedback we received for the course was really very good, and it fills us with joy to have once again exceeded the participants' expectations. Below are some examples of the feedback we received.
We'd like to thank all of the course attendees for their participation, as well as our partners in Ecuador, Bluefrog and Refundation, and the Oracle User Groups of Ecuador and Colombia for organising the events.
If you'd like us to deliver our courses in your own language, whether Spanish or Portuguese, or you'd like to become one of our training partners, send an email to training@rittmanmead.com and we'll get in touch with you straight away.
UKOUG Partner of the Year Awards 2015
It's that time of year again for the UKOUG Partner of the Year Awards. This year we have been nominated for four awards:
- Engineered Systems Partner of the Year Award
- Business Analytics Partner of the Year Award
- Training Partner of the Year Award
- Emerging Partner of the Year Award
The awards are decided by "end users of Oracle-related products or services" – i.e. you – so we would like to ask you to vote for us by going to this link.
I would like to propose four reasons why I think we deserve these awards.
Research, development and sharing
The culture at Rittman Mead has always been to delve into the tools we use, push their boundaries and share what we learn with the community. Internally, we have formalised this by having our own in house R&D department. People like Robin Moffatt, Jordan Meyer and Mark Rittman spend a lot of time and effort looking at the core Oracle data and analytics toolset to determine the optimal way to use it and see which other leading edge tools can be integrated into it.
This has given rise to a huge amount of freely available information ranging from a whole series on OBIEE performance tuning to drinks cabinet optimisation.
We have also worked with Oracle to produce a new version of their reference architecture that was the first one to incorporate the new wave of big data technologies and approaches such as Hadoop and a data reservoir.
Delivery
One of the main drivers for our R&D department is to make us more effective at delivering data and analytics projects.
We are continually investigating common and new approaches and design patterns found in the world of ETL, data warehousing, business intelligence, analytics, data science, big data and agile project delivery, and combining them with our experience to define optimal ways to deliver projects.
Again, we share a lot of these approaches through talks at Oracle and community conferences, blog posts and routines shared on our GitHub repository.
Learning and education
Learning is key to everything we do in life, and as such, we see the provision of independent courses for Oracle business intelligence and data integration tools as key for the industry. We have developed our own training materials based on the different roles people play on projects; for example, we have a Business Enablement Bootcamp aimed at end users and an OBIEE Bootcamp aimed at developers. We know from our feedback forms how effective this training is.
To supplement the training materials we also wrote the official OBIEE Oracle Press book based around the same examples and data sets.
Optimisation
Our key role as an Oracle partner and member of the Oracle community is to optimise the value any organisation gets from investing in Oracle data and analytics related software and hardware.
This is something that requires a long-term commitment, a high level of investment and a deep level of knowledge and experience, which is hopefully demonstrated above. To this end, we are often prepared to go beyond the level of information that Oracle can offer and, in certain cases, challenge their own understanding of the tools.
We were the first UK partner to buy an Exalytics server, for example, and have written a whole host of articles around the subject. Similarly, we are the proud owners of a BICS pod, and we are now evaluating how organisations can effectively use cloud in their strategic business intelligence architectures and, if they do, the best approach to integrating it.
Finally, we are also investing heavily in user engagement, providing the capability to measure then optimise an organisation’s data and analytics systems. We believe user engagement is directly and measurably linked to the return organisations get from their investment in Oracle data and analytics software and hardware.
In Summary
So, in summary, I hope the reasons outlined above explain why we deserve some or all of these awards; they act as a great way to recognise the effort put in by all our staff over the course of the year. The voting link is here.