
Rittman Mead and Oracle Big Data Webcast Series – November 2015

We’re running a set of three webcasts together with Oracle on three popular use-cases for big data in an Oracle context, with the first one running tomorrow, November 3rd 2015, 15:00 – 16:00 GMT / 16:00 – 17:00 CET, on extending the data warehouse using Hadoop and NoSQL technologies.

The sessions run over three weeks this month and look at three ways we’re seeing Rittman Mead customers use big data technologies: to extend the storage and capabilities of their data warehouse, to create analysis sandpits for analysing customer behaviour, and to take data discovery into the Hadoop era using Oracle Big Data Discovery. All events are free to attend, we’re timing them to suit the UK, Europe and the US, and details of each webcast are as follows:


Extending and Enhancing Your Data Warehouse to Address Big Data

Organizations with data warehouses are increasingly looking at big data technologies to extend the capacity of their platform, offload simple ETL and data processing tasks and add new capabilities to store and process unstructured data along with their existing relational datasets. In this presentation we’ll look at what’s involved in adding Hadoop and other big data technologies to your data warehouse platform, see how tools such as Oracle Data Integrator and Oracle Business Intelligence can be used to process and analyze new “big data” data sources, and look at what’s involved in creating a single query and metadata layer over both sources of data.

Audience: DBAs, DW managers, architects
Tuesday 3rd November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register

What is Big Data Discovery and how does it complement traditional Business Analytics?

Data Discovery is an analysis technique that complements traditional business analytics, enabling users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data Discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization. At the same time, Oracle Big Data Discovery reduces the dependency on expensive and often difficult-to-find data scientists, opening up many big data tasks to “citizen” data scientists. In this session we’ll look at Oracle Big Data Discovery, how it provides a “visual face” to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.

Audience: Data analysts, market analysts, & Big Data project team members
Tuesday 10th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register

Adding Big Data to your Organization to create true 360-Degree Customer Insight

Organisations are increasingly looking to “big data” to create a true, 360-degree view of their customer and market activity. Big data technologies such as Hadoop, NoSQL databases and predictive modelling now make it possible to bring highly granular data from all customer touch-points into a single repository and use that information to make better offers, create more relevant products and predict customer behaviour more accurately. In this session we’ll look at what’s involved in creating a customer 360-degree view using big data technologies on the Oracle platform, see how unstructured and social media sources can be added to more traditional transactional and customer attribute data, and how machine learning and predictive modelling techniques can then be used to classify, cluster and predict customer behaviour.

Audience: MI Managers, CX Managers, CIOs, BI / Analytics Managers
Tuesday 24th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register

News on Three Big Data Webcasts with Oracle, and a Customer Case-Study at Cloudera Sessions


Next week I’m presenting along with Liberty Global at the Cloudera Sessions event in Amsterdam on October 15th 2015, on their implementation of Cloudera Enterprise on Oracle Big Data Appliance for a number of big data and advanced analytics initiatives around their cable TV, mobile and internet business.

We’ve been working with Liberty Global for a number of years and helped them get started with their move into big data a year or so ago, so it’s great to see them speaking at this Cloudera event and to hear about the success they’ve had with this joint Oracle+Cloudera platform. Andre Lopes and Roberto Manfredini from Liberty Global will talk about the business drivers and the initial PoC scenario that paid for the first main stage of the project, and I’ll talk about how we worked with their implementation team and senior managers to implement Cloudera’s enterprise Hadoop platform on Oracle engineered systems.

Rittman Mead and Oracle Big Data Webcast Series – November 2015

We’re also running a set of three webcasts together with Oracle on three use-cases for big data in an Oracle context. The sessions will run over three weeks in November 2015 and will look at three ways we’re seeing Rittman Mead big data customers use the platform: to extend the storage and capabilities of their data warehouse, to create repositories and analysis sandpits for customer behaviour analysis, and to take data discovery into the Hadoop era using Oracle Big Data Discovery.

All events are free to attend, we’re timing them to suit the UK, Europe and the US, and details of each webcast are as follows:

Extending and Enhancing Your Data Warehouse to Address Big Data

Organizations with data warehouses are increasingly looking at big data technologies to extend the capacity of their platform, offload simple ETL and data processing tasks and add new capabilities to store and process unstructured data along with their existing relational datasets. In this presentation we’ll look at what’s involved in adding Hadoop and other big data technologies to your data warehouse platform, see how tools such as Oracle Data Integrator and Oracle Business Intelligence can be used to process and analyze new “big data” data sources, and look at what’s involved in creating a single query and metadata layer over both sources of data.

Audience: DBAs, DW managers, architects 
Tuesday 3rd November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register


What is Big Data Discovery and how does it complement traditional Business Analytics?

Data Discovery is an analysis technique that complements traditional business analytics, enabling users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data Discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization. At the same time, Oracle Big Data Discovery reduces the dependency on expensive and often difficult-to-find data scientists, opening up many big data tasks to “citizen” data scientists.

In this session we’ll look at Oracle Big Data Discovery and how it provides a “visual face” to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.

Audience: Data analysts, market analysts, & Big Data project team members
Tuesday 10th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register

Adding Big Data to your Organization to create true 360-Degree Customer Insight

Organisations are increasingly looking to “big data” to create a true, 360-degree view of their customer and market activity. Big data technologies such as Hadoop, NoSQL databases and predictive modelling now make it possible to bring highly granular data from all customer touch-points into a single repository and use that information to make better offers, create more relevant products and predict customer behaviour more accurately.

In this session we’ll look at what’s involved in creating a customer 360-degree view using big data technologies on the Oracle platform, see how unstructured and social media sources can be added to more traditional transactional and customer attribute data, and how machine learning and predictive modelling techniques can then be used to classify, cluster and predict customer behaviour.

Audience: MI Managers, CX Managers, CIOs, BI / Analytics Managers
Tuesday 24th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register

 



Managing Impala and Other Mixed Workloads on the Oracle Big Data Appliance

One of our current client projects uses Cloudera Impala to provide fast ad-hoc querying over the data we’re loading into their Oracle Big Data Appliance Hadoop environment. Impala bypasses MapReduce to provide faster queries than Hive, but to do so it does a lot of processing in-memory and runs server processes on each node in the cluster, which can in some cases lead to runaway queries blocking other workloads, in the same way that OBIEE queries on an Oracle Database can sometimes block ETL and application workloads. Several projects share this Big Data Appliance, so to try and limit the impact Impala could have on other cluster workloads the client had disabled the Impala daemons on nine of the twelve nodes in their Big Data Appliance. Our concern with this approach was that an Impala query can access data from any datanode in the Big Data Appliance cluster; whilst HDFS data is typically stored and replicated across three nodes in the cluster, running the Impala daemons on just a quarter of the available nodes was likely to lead to data locality issues for Impala, with blocks getting shipped across the network unnecessarily.

Going back to OBIEE and the Oracle Database, Oracle have a resource management feature for the Oracle Database that allows you to put users and queries into separate resource pools and manage the share of overall resources each pool gets; I covered this concept on the blog a few years ago. The version of Cloudera Hadoop (CDH5.3) used on the client’s Big Data Appliance has a feature called “YARN” (Yet Another Resource Negotiator) that splits out the resource management and scheduling parts that were bound into MapReduce in Hadoop 1.0, so that MapReduce just runs as one workload type on Hadoop, making it possible to run other workload types, for example Apache Spark, on the same cluster management framework.
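As a quick way of seeing this in action (this is the generic YARN CLI, nothing BDA-specific), you can list the applications YARN is currently managing from any cluster node with the YARN client configured:

yarn application -list -appStates RUNNING

The output lists each running job along with its application type (MAPREDUCE, SPARK and so on), the queue it’s been allocated to and its progress – which is exactly the level at which YARN sees and arbitrates between the different workloads on the cluster.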


Impala isn’t configured to use YARN by default, however, and instead uses an internal scheduler to govern how concurrent queries run and use cluster resources. It can be configured to use YARN in what Cloudera term “Integrated Resource Management”, and our initial response was to recommend this approach; however, YARN is really optimised for longer-running batch jobs rather than the shorter jobs Impala generates, to the extent that Cloudera recommend you don’t actually use YARN, and instead control Impala resource usage via service-level process constraints or through a newer Impala feature called Admission Control. Taking a step back though: how do we actually see what resources Impala uses across the cluster when a query runs, and is there a feature similar to the Oracle Database’s SQL explain plan to help us understand how an Impala SQL query is executed? Then, using this and the various resource management options available to us, can we understand how YARN and the other options will affect the Impala users on the client’s cluster if we enable them? And, given that we were going to test this all out on one of our development Hadoop clusters running back at the office on VMWare, how well could we simulate the multiple concurrent queries and mixed workload we’d then encounter on the real customer Big Data Appliance?

When trying to understand what goes on when a Cloudera Impala SQL query runs, the two main tools in your toolbox are EXPLAIN plans and query profiles. The concept of EXPLAIN plans will be familiar to Oracle developers, and putting “explain” before your Impala SQL query when you’re using the Impala Shell (or pressing the “Explain” button when you’re using the Impala Editor in Hue) will display an output like the one below, showing the steps the optimiser plans to take to return the query results:

[bda7node1.rittmandev.com:21000] > explain
select sum(f.flights) as total_flights, d.dest_city
from flight_delays f join geog_dest d on f.dest = d.dest
join geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and   o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000;
Query: explain select sum(f.flights) as total_flights, d.dest_city
from flight_delays f join geog_dest d on f.dest = d.dest
join geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and   o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000
+---------------------------------------------------------------------+
| Explain String                                                      |
+---------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=154.01MB VCores=2           |
|                                                                     |
| 10:EXCHANGE [UNPARTITIONED]                                         |
| |                                                                   |
| 09:AGGREGATE [FINALIZE]                                             |
| |  output: sum:merge(f.flights)                                     |
| |  group by: d.dest_city                                            |
| |  having: sum(f.flights) > 3000                                    |
| |                                                                   |
| 08:EXCHANGE [HASH(d.dest_city)]                                     |
| |                                                                   |
| 05:AGGREGATE                                                        |
| |  output: sum(f.flights)                                           |
| |  group by: d.dest_city                                            |
| |                                                                   |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.orig = o.orig                                 |
| |                                                                   |
| |--07:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  02:SCAN HDFS [airlines.geog_origin o]                            |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: o.orig_state IN ('Florida', 'New York', 'Alaska') |
| |                                                                   |
| 03:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.dest = d.dest                                 |
| |                                                                   |
| |--06:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  01:SCAN HDFS [airlines.geog_dest d]                              |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: d.dest_state = 'California'                       |
| |                                                                   |
| 00:SCAN HDFS [airlines.flight_delays f]                             |
|    partitions=1/1 files=1 size=64.00MB                              |
+---------------------------------------------------------------------+
Fetched 35 row(s) in 0.21s

Like an Oracle SQL explain plan, Impala’s cost-based optimiser uses table and partition stats, which you should have gathered previously using Impala’s “compute stats” command, to determine what it thinks is the optimal execution plan for your query. To see the actual cost and timings for the various plan steps, you can use the “summary” statement after your query has run (or, for more detail, the “profile” statement) to see the actual timings and stats for each step in the query execution.
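As an aside, gathering those stats is a quick one-off per table; roughly, for the tables used in these examples (output omitted), it’s just:

[bda7node1.rittmandev.com:21000] > compute stats airlines.flight_delays;
[bda7node1.rittmandev.com:21000] > show table stats airlines.flight_delays;

with “show table stats” (and “show column stats”) letting you check what the optimiser currently knows about a table. Anyway, here’s the summary output for the query above: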

[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 20.35us  | 20.35us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 142.18ms | 180.81ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 59.86us  | 123.39us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 171.72ms | 208.36ms | 60      | 1.93K      | 22.73 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 89.42ms  | 101.82ms | 540.04K | 131.88M    | 12.79 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 16.32us  | 19.63us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 302.83ms | 302.83ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 936.71ms | 1.10s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 19.02us  | 46.49us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 266.99ms | 266.99ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.07s    | 1.90s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+

Output from the Summary statement gives us some useful information for working out the effect of the various resource management options on the Oracle Big Data Appliance, at least in terms of individual Impala queries – we’ll look at the impact on the overall Hadoop cluster and individual nodes later on. From the output of the Summary report above I can see that my query ran on all six nodes in the cluster (queries I ran earlier against a smaller version of the fact table ran on just a single node), and I can see how long each step in the query actually took to run. So what happens if I run the same query again on the cluster, but disable the Impala Daemon service role on three of the nodes using Cloudera Manager?

[Screenshot: stopping the Impala Daemon service role on three of the nodes in Cloudera Manager]

Here’s the Summary output after running the query again:

[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 22.01us  | 22.01us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 3      | 111.12ms | 117.24ms | 7       | 193        | 6.27 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 3      | 30.09us  | 39.02us  | 30      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 3      | 161.26ms | 173.57ms | 30      | 1.93K      | 22.84 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 3      | 156.50ms | 238.90ms | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 3      | 20.19us  | 28.93us  | 1.41K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 477.38ms | 477.38ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 3      | 1.48s    | 1.66s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 3      | 12.07us  | 14.89us  | 519     | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 308.83ms | 308.83ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 3      | 3.39s    | 6.85s    | 131.88M | 131.88M    | 74.11 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+

What the Summary statement doesn’t show you is the overall time the query took to run: the query run against three nodes took 9.48s, compared to 3.59s for the earlier run where all six nodes’ Impala daemons were enabled. In fact I’d expect a query running on the client’s BDA with just three out of twelve nodes enabled to run even slower because of the block locality issue – Impala has a feature called block locality tracking that keeps track of where HDFS data blocks are actually located on the cluster and tries to run impalad tasks on the right nodes, but with only three out of twelve nodes running that job becomes really hard. The other factor we need to consider is how running multiple queries concurrently affects things when only a few nodes are handling all the Impala user queries.
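As a side note, if you want to quantify how badly locality is suffering rather than guess, the full query profile I mentioned earlier includes per-scan-node HDFS counters along these lines (counter names as we’ve seen them in CDH5.x-era profiles; values elided):

[bda7node1.rittmandev.com:21000] > profile;
...
HDFS_SCAN_NODE (id=0):
   - BytesReadLocal: ...
   - BytesReadShortCircuit: ...
   - BytesReadRemoteUnexpected: ...
...

A consistently high BytesReadRemoteUnexpected relative to BytesReadLocal would confirm that blocks are being shipped over the network to the nodes still running impalad.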

To try and simulate concurrent queries running I opened six terminal sessions against nodes actually running Impala Daemon service roles and submitted the same query from each session, with a second or two gap between each query; with all six nodes enabled the average response time rose to about 6s, but with just three enabled the response rose fairly consistently to around 27s.
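For reference, the test harness was nothing fancy – you could script the same thing with a bash loop along these lines (a sketch: hostnames are our development cluster’s, and in the three-node test you’d cycle over only the nodes still running an Impala daemon):

for i in 1 2 3 4 5 6; do
  impala-shell -i bda7node$i.rittmandev.com:21000 \
    -q "select sum(f.flights) as total_flights, d.dest_city
        from airlines.flight_delays_full f
        join airlines.geog_dest d on f.dest = d.dest
        join airlines.geog_origin o on f.orig = o.orig
        where d.dest_state = 'California'
        and o.orig_state in ('Florida','New York','Alaska')
        group by d.dest_city
        having total_flights > 3000;" &
  sleep 2    # stagger the submissions by a couple of seconds, as in the manual test
done
wait         # wait for all six queries to complete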


This is of course what you’d expect with everything trying to run on the same three (now resource-starved) server nodes, and again I’d expect the effect to be even more pronounced on the client’s twelve-node BDA. What this test didn’t cover was running workloads other than Impala on the same cluster, or running queries against different datasets, but it did at least show how response time increases fairly dramatically (albeit consistently) as more Impala users come onto the system.

So now we have some baseline benchmark figures, let’s configure Impala to use YARN, using Cloudera Manager on the CDH5.3 setup used on both the client’s BDA and our development cluster back in the office. There are actually two parts to Impala running on YARN in CDH5.x: YARN itself, as the overall cluster resource management layer, and another component called Llama (Low-Latency, or “Long-Lived”, Application Master) that sits between YARN and Impala and reduces the time each Impala query takes to obtain YARN resource allocations.

[Diagram: Llama architecture – Llama sitting between Impala and the YARN resource manager]

Enabling YARN and Llama (and, if you want, configuring Llama and thereby Impala for high availability) is done through a wizard in CDH5.3 that also offers to set up a Linux feature called cgroups, which YARN can use to limit the “containers” it uses for resource management at the OS level.

Once you’ve run through the wizard and restarted the cluster, Impala should be configured to use YARN instead of its own scheduler to request resources, which in theory allows Hadoop and the Big Data Appliance to consider Impala workloads alongside MapReduce, Spark and HBase when scheduling jobs across the cluster. Before we get into the options YARN gives us for managing these workloads, I ran the same Impala queries again, first as a single query and then with six running concurrently, to see what impact YARN on its own had on query response times.

The single query on its own took around the same time to run as without YARN (3-4s), but when I ran six concurrent queries together the response time went up from the 3-4s I saw without YARN enabled to between 5s and 18s depending on the session, with quite a bit of variation between response times compared to the consistent times I saw when YARN wasn’t being used. This surprised me, as one of the stated benefits of YARN is making job execution times more predictable and smooth, though this could be more of an overall-cluster benefit, and there are also recommendations in the Cloudera docs around making YARN and Llama’s resource estimation more efficient for Impala.

[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 41.38us  | 41.38us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 115.28ms | 123.04ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 44.44us  | 67.62us  | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 170.91ms | 201.47ms | 60      | 1.93K      | 22.82 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 82.25ms  | 98.34ms  | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.39us  | 18.99us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 244.40ms | 244.40ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 850.55ms | 942.47ms | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 13.99us  | 19.05us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 222.03ms | 222.03ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.54s    | 2.88s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+

But it seems clear that users of Impala on the client cluster should expect some overhead from using YARN to manage Impala’s resources, with the payoff being a better balance between Impala workloads and the other uses the BDA cluster is being put to – though I think there’s more we could do to fine-tune how Llama and YARN allocate memory to Impala queries up-front (allocating a set amount of memory for all queries, rather than making an estimate and then adding more memory mid-query if it’s needed), and of course we’ve not really tested it on a cluster with a full, mixed workload running. But what about our original scenario, where only a certain percentage of the overall cluster resources or nodes are allocated to Impala query processing? To set up that sort of division of resources we can use another feature of YARN called dynamic allocation, with dynamic resource pools that we can again set up through Cloudera Manager.

Dynamic allocation is one of the ways YARN can be configured to manage multiple workloads on a Hadoop cluster (the other is through static service pools, which I’ll come to in a moment). Using dynamic allocation I can set up a resource pool for the airline flight delays application that my Impala SQL queries are associated with and allocate it 25% of overall cluster resources, with the remainder allocated to other applications. I can keep that weighting simple, as I have done in the screenshot below, or I can allocate resources based on virtual cores and memory, but I found it simpler to set these overall weightings and let YARN worry about cores and RAM.

[Screenshot: dynamic resource pool weightings in Cloudera Manager, with the airlines pool weighted at 25% of cluster resources]
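Under the covers these pool definitions end up in YARN’s fair scheduler allocations file; a minimal sketch of what the equivalent fair-scheduler.xml would look like for the weighting above (element names from the standard YARN Fair Scheduler format, values illustrative; “drf” is the scheduling policy discussed next):

<allocations>
  <queue name="airlines">
    <weight>25.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
  <queue name="default">
    <weight>75.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
</allocations>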

Depending on the scheduling policy you select, YARN will prioritise Impala and other jobs in different ways, but the recommended scheduling policy for mixed workloads is Dominant Resource Fairness (DRF), which balances RAM and CPU depending on which resource pool needs them most at a particular time. Note also that Impala can either be managed as part of the overall YARN workload or separately, a choice you make via the “Enable Dynamic Resource Pools” setting in the Impala service configuration in Cloudera Manager (checked in the screenshot below, but unchecked for the screenshot above).

[Screenshot: the “Enable Dynamic Resource Pools” setting in the Impala service configuration in Cloudera Manager]
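To make dominant resource fairness a bit more concrete, here’s a worked example with purely illustrative numbers: suppose the cluster has 48 vcores and 192GB of RAM, the airlines pool is using 12 vcores and 19.2GB (its dominant resource is therefore CPU, at 12/48 = 25%), and another pool is using 5 vcores and 96GB (dominant resource memory, at 96/192 = 50%). DRF treats the second pool as the “fuller” of the two and would offer the next container to the airlines pool, with the pool weights set earlier then scaling those dominant shares.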

There’s also a separate control you can place on Impala queries called Admission Control, which limits the number of queries that can run or be queued for a resource pool at any particular time. The docs are a bit vague on when to use admission control and when to use YARN, but my take is that if it’s just Impala queries you’re worried about, and throttling their use solves the problem, then use this feature and leave Impala outside of YARN; if you need to manage an overall mixed workload, then do it all through YARN. For my testing, though, I just went with simple resource pool weightings, and you can see from the screenshot below that when multiple queries run at once for my pool, CPU and RAM resources are constrained as expected.

[Screenshot: CPU and RAM usage for the airlines resource pool constrained as multiple queries run concurrently]
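Going back to admission control for a moment: if throttling Impala is all you need, the limits can be set per resource pool in Cloudera Manager or, for the default pool, via impalad startup flags. Roughly, the relevant flags look like this (flag names as per the CDH5.x Impala docs; values purely illustrative):

--default_pool_max_requests=20    (concurrent queries allowed in the pool)
--default_pool_max_queued=50      (queries that can queue beyond that before being rejected)
--default_pool_mem_limit=8G       (total memory for queries admitted to the pool)
--queue_wait_timeout_ms=60000     (how long a queued query waits before timing out)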

To make a particular Impala query run within a specific resource pool you can either allocate that user to a named resource pool, or you can specify the resource pool in your Impala shell session like this:

[bda7node1.rittmandev.com:21000] > set request_pool = airlines; 
REQUEST_POOL set to airlines
[bda7node1.rittmandev.com:21000] > select sum(f.flights) as total_flights, d.dest_city
from airlines.flight_delays_full f join airlines.geog_dest d on f.dest = d.dest
join airlines.geog_origin o on f.orig = o.orig
where d.dest_state = 'California'
and o.orig_state in ('Florida','New York','Alaska')
group by d.dest_city
having total_flights > 3000;

Looking then at a typical summary output for a query running with these restrictions (25% of resources overall) and other queries running concurrently, the numbers don’t look all that different to before, and results took between 8s and 30s to return. Again I was surprised at the variance, but I think YARN is more about overall cluster performance than individual queries, and you shouldn’t read too much into specific times on a dev server with an unrepresentative overall workload.

[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 26.78us  | 26.78us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 209.10ms | 262.02ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 63.20us  | 118.89us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 282.56ms | 401.37ms | 60      | 1.93K      | 22.76 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 99.56ms  | 114.14ms | 540.04K | 131.88M    | 12.85 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.49us  | 17.94us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 531.08ms | 531.08ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 1.20s    | 1.54s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 24.29us  | 68.23us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 287.31ms | 287.31ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 2.34s    | 3.13s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+

A point to note is that I found it very hard to get Impala queries to run at all when I got down to specifying virtual core and memory limits rather than just overall weightings, so I’d stick with these high-level resource pool prioritisations, which seemed to work and didn’t unduly affect query response times. For example, the setting below looked clever, but queries always seemed to time out and I never got a satisfactory setup working properly.

[Screenshot: virtual core and memory limits set on a dynamic resource pool]

Note that for YARN dynamic resource pools to be used, all Linux/CDH users will need to be assigned to resource pools so they don’t run as “unconstrained”; this can also be done from the Dynamic Resource Pools configuration page.

Finally, though, all of this management through YARN and dynamic resource pools might not be the best way to control Impala’s resource usage. The Cloudera docs say quite clearly on the Integrated Resource Management page that:

“When using YARN with Impala, Cloudera recommends using the static partitioning technique (through a static service pool) rather than the combination of YARN and Llama. YARN is a central, synchronous scheduler and thus introduces higher latency and variance which is better suited for batch processing than for interactive workloads like Impala (especially with higher concurrency). Currently, YARN allocates memory throughout the query, making it hard to reason about out-of-memory and timeout conditions.”

What this means in practice is that, if you’ve got a single project using the Big Data Appliance and you just want to specify at a high level what proportion of resources Impala, HBase, MapReduce and the other cluster services use, you can define this through static service pool settings in Cloudera Manager and have the restrictions enforced by Linux cgroups. In the screenshot below I unwound all of the dynamic resource pool settings I created a moment ago and allocated 25% of overall cluster resources to Impala, with the wizard then using those top-level values to set limits for services across all nodes in the cluster based on their actual RAM and CPU, the services running on them, and so on.

[Screenshot: static service pool allocation in Cloudera Manager, with 25% of cluster resources given to Impala]

Then, going back to Cloudera Manager and running some queries, you can see these static service pool limits being applied in real-time, with their effect shown in graphs for particular cluster resources.

[Screenshot: static service pool resource usage graphs in Cloudera Manager]

So given all of this, what was our recommendation to the client on how best to set up resource management for Impala and other workloads on their Big Data Appliance? Not too much should be read into individual numbers – it’s hard to simulate a proper mixed workload on a development server, and of course their BDA has twelve nodes, more memory and faster CPUs. However, it’s probably fair to draw the following conclusions:

  • Running Impala daemons on just a subset of nodes isn’t actually a bad way to constrain the resources Impala uses, but it only really works on clusters with a small number of nodes, where there’s a good chance one of those nodes will hold one of the three copies of any given data block. On a system of the scale of our customer’s we’d probably hit unacceptable overheads in terms of block locality, so I wouldn’t carry on with this approach.
  • If the customer BDA will be running a mixed workload – data loading and long-running Hive/Pig/Spark jobs as well as short-running Impala queries – enabling Impala for YARN and setting overall resource pools for applications would be the best approach, though individual Impala queries will probably run slower than now (even allowing for the restriction in resources) due to the overhead YARN imposes when scheduling and running jobs. This would, however, be the best way to allocate resources between applications and provide a generally “smoother” experience for users.
  • If the BDA needs to be optimised mostly for Impala queries, then don’t manage Impala under YARN; leave it outside and just use static service pools to allocate Impala roughly 25% of resources across all nodes. In both this and the previous case (Impala on YARN), all nodes should be re-enabled for Impala so as to minimize issues over block locality.
  • If the only real issue is Impala queries for a particular application taking all the resources or becoming runaway, Impala could be left outside of YARN but enabled for admission control, so as to limit the total number of running/queued queries for that application.

New Oracle Big Data Quick-Start Packages from Rittman Mead

Many organisations using Oracle’s business intelligence and data warehousing tools are now looking to extend their capabilities using “big data” technologies. Customers running their data warehouses on Oracle Database are looking to use Hadoop to extend their storage capacity whilst offloading initial data loading and ETL to this complementary platform; other customers are using Hadoop and Oracle’s Big Data Appliance to add new capabilities around unstructured and sensor data analysis, all at considerably lower cost than traditional database storage.


In addition, as data and analytics technologies and capabilities have evolved, there has never been a better opportunity to reach further into your data to exploit more value. Big data platforms, data science methods and data discovery technologies make it possible to unlock the power of your data and put it in the hands of your executives and team members – but what is it worth to you? What’s the value to your organisation of exploring deeper into the data you have, and how do you show a return?

Many organisations have begun to explore big data technologies to understand where they can exploit value and extend their existing analytics platforms, but what’s the business case? The good news is that, using current platforms, and following architectures like the Oracle Information Management and Big Data Reference Architecture written in conjunction with Rittman Mead, the foundation is in place to unlock a range of growth opportunities. Finding new value in existing data, predictive analytics, data discovery, reducing the cost of data storage and ETL offloading are all starter business cases proven to return value quickly.


To help you start on the Oracle big data journey, Rittman Mead have put together two quick-start packages focused on the most popular Oracle customer use-cases: Oracle Big Data, and Oracle Big Data Discovery.

If this sounds like something you or your organization might be interested in, take a look at our new Quick Start Oracle Big Data and Big Data Discovery packages on the Rittman Mead home page, or drop me an email at mark.rittman@rittmanmead.com and I’ll let you know how we can help.

RM BI Forum 2015 : Justification Letters for Employers

(Thanks to Christian Berg @Nephentur for the suggestion, and acknowledgements to ODTUG KScope for the original idea – our favourite conference after the BI Forum)

The Rittman Mead BI Forum 2015 promises to be our best BI Forum yet, with fantastic speakers at each event, keynotes and guest speakers from Oracle and John Foreman, author of the bestselling book “Data Smart”, a data visualisation challenge and an optional one-day masterclass on delivering Oracle’s new Information Management and Big Data reference architecture by Rittman Mead’s Mark Rittman and Jordan Meyer. Uniquely amongst Oracle BI events we keep the numbers attending very limited and run just a single stream at each event, so everyone takes part in the same sessions and gets to meet all the attendees and speakers over the three days.

Sometimes, though, management within organizations requires special justification for team members to attend events like these, so to help you put your case together and get across the unique education and networking benefits of the Rittman Mead BI Forum, we’ve prepared justification letters for you to complete with your details, one each for the Brighton and Atlanta events. Click on the links below to download sample justification letters for the Brighton BI Forum running on May 6th-8th 2015, and the Atlanta one running the week after, on May 13th-15th 2015:

Full details on the BI Forum 2015 agenda and how to register can be found on the Rittman Mead BI Forum 2015 home page, with registration open until the weekend before each event – hurry though as attendee numbers are strictly limited.