OBIEE 11g and Essbase – Faking Federation Using the GoURL
This blog is going to address what happens when we can't take advantage of the Admin Tool's powerful vertical federation capabilities when integrating relational stars and Essbase cubes. In the Admin Tool, synonymously referred to as the RPD, vertical federation is the process of integrating an aggregate data source, in this case Essbase, with a detail-level source from a data mart. This technique not only can increase query efficiency and decrease query time, it also has the added benefit of bringing together two powerful and dynamic reporting tools. But like most things, there is a pretty big caveat to this approach. Before I jump into what that is, some housework.
To start, let's make sure things don't get lost in translation when going back and forth between Essbase and OBIEE jargon. In Essbase speak, dimensions can be thought of as tables in a relational structure, Essbase generations as the columns in each table, and members as the values in each column.
Housework done, now the caveat. Often, dimensions in Essbase cubes are built in such a way as to not neatly support federation; that is, they have an uneven number of generations relative to their corresponding relational dimension. It should be noted that while federation is possible with a ragged hierarchical structure, it can get messy, essentially ending up in a final product that an Essbase-centric user community wouldn't readily and eagerly adopt. So what can we do when federation is out of the question?
Let's frame the solution in the form of a not-atypical client scenario. Say a client, a large financial institution, needs to bring together the Essbase cubes they've used thus far for their standardized reporting, i.e. balance sheets, income statements and the like, with their relational source, in order to drill to the account detail behind the numbers they're seeing on said reports. They've got a pretty large user base that's fairly entrenched and happy with Smart View and Excel for getting what they want from their cubes. And why shouldn't they be? OBIEE simply can't support this level of functionality when reporting on an Essbase source, in most cases. So, in addition to some pretty big user adoption barriers to an OBIEE solution, we've now got technology limitations to contend with. What are our options when faced with this dilemma? How can we wow these skeptical users with near-seamless functionality between sources? The secret lies with URL Action Links! And while this solution is great for going from summary-level data in Essbase to its relational counterpart, it's also a great way to simply pass values from one subject area to another. There are definitely some tricks to setting this up, but more on those later. Read on.
The Scenario
In order to best demonstrate this solution, let's set up a dashboard with two pages, one for each report, and a corresponding dashboard prompt. The primary, source report, out of Essbase, will be something that could easily resemble a typical financial report, at least in structure. From this high-level chart, or similar summary-level analysis, we'll be able to drill to a detail report, out of a relational source, to identify the drivers behind any figures present on the analysis. In this example, we're going to use the 11.1.1.9 Sample App, going from the Sample Essbase subject area to its relational equivalent, Sample Sales. Yes, you could federate these two, as they've done in Sample App, however they serve well to demonstrate how the following concept could work for financial reporting against ragged or parent-child structures. Values for Product Type, in the following instance, could just as well be the descendants or children of a specific account, for example. There is also no equivalent relational subject area for Sample App's Essbase GL subject area, which is why it isn't used here. In the example below, we have a summary, month-level pivot table giving us a monthly sales trend. The user can prompt on Year and Customer Segment through a dashboard prompt, but as you'll see, this could easily be any number of prompts for your given scenario.
Monthly Trend Summary:
Solution 1:
In the sales trend example above, we are going to enable our user to click on a value for a revenue figure and then navigate to a detail report that shows products sold for the month by date. Again, this all must be done while passing any chosen parameters from both the dashboard prompt and analysis along to the detail analysis.
Proof of Concept
First, let’s start with the guts of the report example above. As you can see, there is quite a bit more under the hood than meets the eye. Let’s go over the approach piece by piece to help build a more thorough understanding of the method.
Step 1: Include the Columns!
So the idea here is that we want to pass any and all dimensional information associated with the revenue figure that we pick to a detail-level report that will be filtered on the set of parameters at the chosen intersection. We can hide these columns later, so your report won't be a mess. I'll add here that you might want to set any prompted values to be equal to the presentation variable on its respective dashboard prompt, with a default value set, as seen below. This will help to make the report digestible on the compound layout. The following picture shows the prompted values to drive our summary report on Year and Customer Segment. You can do this in the filters pane on the criteria tab with the following syntax:
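As a rough sketch of what's meant, the filter expressions would look something like the lines below; the presentation variable names (pv_year, pv_segment) and their defaults are just placeholders for whatever your dashboard prompt actually sets, and the column names are illustrative:
"Time"."T05 Per Name Year" is equal to / is in @{pv_year}{2012}
"Customers"."Customer Segment" is equal to / is in @{pv_segment}{Active Singles}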
All column values we want to pass need to be represented on the report:
Values that will be passed to detail report (in this case, BizTech, Communication, Active Singles, 2012, and 2012 / 11):
Step 2: More Columns!
In addition to the columns that comprise the report, we need to add a second iteration of every column already on the report. In the pictures above, you can see that these are the columns titled with the 'URL' prefix. In the column editor, wrap the column values in quotes by concatenating the following string on either side (a single quote, followed by a double quote, followed by another single quote, with NO spaces between them):
'"' || "Table"."Column Name" || '"'
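To make the effect concrete, here's what that expression evaluates to for a sample row; the column is just the Year column used later in this example, so treat it as illustrative:
Expression: '"' || "Time"."T05 Per Name Year" || '"'
Result for a 2012 row: "2012"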
While this step may seem superfluous, you'll see a bit later that it is all too necessary to successfully pass our column values through our URL Action Links. After you've created the custom columns, just group them along with their counterparts in the report, as in the pics above.
Step 3: An Approach to Handling Hierarchies
In the previous pictures, you can see the Products hierarchy that comprises the rows of the report. In order to pass any value from the hierarchy, as well as its members, we are going to have to include its respective generations in the rows as well. For our example, we're going to use Brand, LOB, and Product Type. In this way, a user can select any sales value and have all three of these values passed as filter parameters to the detail analysis through a URL. You'll notice that we haven't given these columns a quoted counterpart, as we did previously. This is quite on purpose, as we'll see later. These columns will provide another example of how to pass values without having to implement a second column just to wrap the value in quotes.
When first placing the hierarchy on your analysis and expanding it to where you’d like it for the sake of the report, you can simply select all the column values, right click and then select ‘Keep Only’. This will establish a selection step under the Products Hierarchy to ensure that the report always opens to the specified structure from now on. So, that’s good for now, let’s get to the magic of this approach.
Step 4. Set up the Action Link
In this case, we're going to 'drill' off of the Sales column in our table, but we could really 'drill' off of anything, as you'll see. So, pop open the Interaction tab for the column and select Action Links as the primary interaction, then edit the action as follows (see the URL procedure below; a worked example of the fully assembled URL follows the parameter list). It used to be that we could do this via the 'P' parameters, however that method seems to be mostly deprecated in favor of the col/val method we'll use below.
URL Procedure
http://sampleapp.rittmanmead.com:7780/analytics/saw.dll? – Server URL*
Portal&Path=@{1} – path to dashboard
&Page=@{2} – dashboard page
&Action=@{3} – action to perform, in this case navigate (there are others)
&col1=@{4} – column from target analysis we wish to manipulate (our sales detail analysis)
&val1=@{5} – column from source analysis with which we are going to pass a filter parameter to target
&col2=@{6}
&val2=@{7}
&col3=@{8}
&val3=@{9}
&col4=@{10}
&val4="@{11}" – will discuss these quoted parameters later on
&col5=@{12}
&val5="@{13}"
*Note that this value can be made into a variable in order to be moved to different environments (DEV/TEST, etc…) while maintaining link integrity
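Putting the pieces together, a fully assembled URL, using the parameters exactly as named in the list above, would look roughly like the line below; the catalog path, page name and filter values are made-up placeholders, and in practice any spaces or special characters in the path and values would need to be URL-encoded:
http://sampleapp.rittmanmead.com:7780/analytics/saw.dll?Portal&Path=/shared/Sample/_portal/Sales Detail&Page=Product Detail&Action=Navigate&col1="A - Sample Sales"."Time"."T05 Per Name Year"&val1="2012"&col2="A - Sample Sales"."Time"."T02 Per Name Month"&val2="2012 / 11"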
The picture above details how to set up the URL link as described above. The col1 value is the column from the target analysis we want to filter using the value (val1) from our source. Be sure to qualify this column from the subject area from which it originates, in this case “A – Sample Sales”.
Ex: "A - Sample Sales"."Time"."T05 Per Name Year"
Val1 (these parameters exist in 'sets') is the column from our source analysis we want to use to filter the target analysis. This is where our custom, quoted columns come into play. Instead of using the original column from our analysis, we're going to use its quoted counterpart. This will ensure that any values passed through the URL are enclosed in quotes, as is required by the URL. Note that we're not using a value parameter in this case, but a column instead (the dropdown to the left of the text box).
Ex: '"' || "Time"."T05 Per Name Year" || '"'
You can proceed this way to pass as many values as you’d like to your detail analysis, with this coln, valn method. Again, just be sure that your columns are included in the source analysis or the values won’t get ported over. Once you’ve got all your columns and values set up, go ahead and enter them into the URL field in the Edit Action dialogue box, as above. Make sure you reference your variables using the proper syntax (similar to a presentation variable w/ an @ sign):
Ex: col1=@{4} – ‘4’ being the variable name (note that these can be named most anything)
Quoting Parameters
As an alternative to including an extra iteration of each column for the sake of passing quoted column values, we can instead put quotes around the parameter in our URL, as in the example above. The limitation of this method, however, is that you can only pass a single value, as with Year, for example. In later posts, we'll address how to handle passing multiple values, as you might through a dashboard prompt.
Step 5. Set Up the Detail Analysis
For our detail analysis, we're going to set it up in much the same way as our summary. That is, we need to include the columns we want to filter on in the target report as well. Unfortunately, our target report won't simply pick up the passed parameters as filters, as it would with filters in the filters pane, unless the columns are included on the actual analysis. Again, any columns we don't want visible to a user can be hidden. Below, we simply want to see the Calendar Date, Product, and Revenue, but filtered by all of our source analysis columns.
In the criteria view for our target, detail analysis, we need to make sure that we’re also setting any filtered columns to ‘is prompted’. This will ensure that our target analysis listens to any filter parameters passed through the URL from our source, summary analysis. As a last step, we must again fully qualify our filters, as in the picture below.
This picture shows our Year 'is prompted' filter on our target, detail analysis. Note that this column is also present, albeit hidden, on the report itself. It acts as a filter on the analysis, being 'prompted' not by a dashboard prompt in this instance, but by our source, summary analysis.
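As a rough sketch, the filter list on the detail analysis ends up looking something like the lines below; the column names are illustrative, and the point is simply that each one is set to 'is prompted' while also existing, hidden, on the analysis itself:
"Time"."T05 Per Name Year" is prompted
"Time"."T02 Per Name Month" is prompted
"Products"."P4 Brand" is prompted
"Products"."P3 LOB" is prompted
"Products"."P2 Product Type" is prompted
"Customers"."Customer Segment" is prompted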
Step 6. Testing it All Out
Now that we’ve got all the pieces of the puzzle together, let’s see if it works! To QA this thing, let’s put a filter object on the target, detail analysis to make sure that the report is picking up on any values passed. So if we click on a sales value, we should be taken to the target analysis and see that all the parameters we set up were passed. The picture below confirms this!
Conclusion
Hopefully this can be one more trick to keep in the tool belt when faced with a similar scenario. If you have any hiccups in your implementation of this solution or other questions, please feel free to respond to this post. Stay tuned for additional articles related to this topic that go much more in depth. How do you handle passing multiple column values? How do I keep my report query time low with all those extra columns? How do I pass values using the presentation variable syntax? Can I use the Evaluate function to extract the descendants of a filtered column?
Rittman Mead and Oracle Big Data Webcast Series – November 2015
We’re running a set of three webcasts together with Oracle on three popular use-cases for big data within an Oracle context – with the first one running tomorrow, November 3rd 2015 15:00 – 16:00 GMT / 16:00 – 17:00 CET on extending the data warehouse using Hadoop and NoSQL technologies.
The sessions are running over three weeks this month and look at ways we're seeing Rittman Mead use big data technologies to extend the storage and capabilities of their data warehouse, create analysis sandpits for analysing customer behaviour, and take data discovery into the Hadoop era using Oracle Big Data Discovery. All events are free to attend, we're timing them to suit the UK, Europe and the US, and details of each webcast are as follows:
Extending and Enhancing Your Data Warehouse to Address Big Data
Organizations with data warehouses are increasingly looking at big data technologies to extend the capacity of their platform, offload simple ETL and data processing tasks and add new capabilities to store and process unstructured data along with their existing relational datasets. In this presentation we’ll look at what’s involved in adding Hadoop and other big data technologies to your data warehouse platform, see how tools such as Oracle Data Integrator and Oracle Business Intelligence can be used to process and analyze new “big data” data sources, and look at what’s involved in creating a single query and metadata layer over both sources of data.
Audience: DBAs, DW managers, architects
Tuesday 3rd November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register
What is Big Data Discovery and how does it complement traditional Business Analytics?
Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data Discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization. At the same time, Oracle Big Data Discovery reduces the dependency on expensive and often difficult-to-find Data Scientists, opening up many Big Data tasks to "Citizen" Data Scientists. In this session we'll look at Oracle Big Data Discovery and how it provides a "visual face" to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.
Audience: Data analysts, market analysts, & Big Data project team members
Tuesday 10th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register
Adding Big Data to your Organization to create true 360-Degree Customer Insight
Organisations are increasingly looking to “big data” to create a true, 360-degree view of their customer and market activity. Big data technologies such as Hadoop, NoSQL databases and predictive modelling make it possible now to bring highly granular data from all customer touch-points into a single repository and use that information to make better offers, create more relevant products and predict customer behaviour more accurately. In this session we’ll look at what’s involved in creating a customer 360-degree view using big data technologies on the Oracle platform, see how unstructured and social media sources can be added to more traditional transactional and customer attribute data, and how machine learning and predictive modelling techniques can then be used to classify, cluster and predict customer behaviour.
Audience: MI Managers, CX Managers, CIOs, BI / Analytics Managers
Tuesday 24th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register
News on Three Big Data Webcasts with Oracle, and a Customer Case-Study at Cloudera Sessions
Next week I’m presenting along with Liberty Global at the Cloudera Sessions event in Amsterdam on October 15th 2015, on their implementation of Cloudera Enterprise on Oracle Big Data Appliance for a number of big data and advanced analytics initiatives around their cable TV, mobile and internet business.
We’ve been working with Liberty Global for a number of years and helped them get started with their move into big data a year or so ago, and it’s great to see them speaking at this Cloudera event and the success they’ve had with this joint Oracle+Cloudera platform. Andre Lopes and Roberto Manfredini from Liberty Global will talk about the business drivers and initial PoC scenario that then paid for the first main stage of the project, and I’ll talk about how we worked with their implementation team and senior managers to implement Cloudera’s enterprise Hadoop platform on Oracle engineered systems.
Rittman Mead and Oracle Big Data Webcast Series – November 2015
We’re also running a set of three webcasts together with Oracle on three use-cases for big data in an Oracle context. The sessions will run over three weeks in November 2015 and will look at three ways we’re seeing Rittman Mead big data customers use the platform to extend the storage and capabilities of their data warehouse, creating repositories and analysis sandpits for customer behaviour analysis, and taking data discovery into the Hadoop era using Big Data Discovery.
All events are free to attend, we're timing them to suit the UK, Europe and the US, and details of each webcast are as follows:
Extending and enhancing your Data Warehouse to address Big Data
Organizations with data warehouses are increasingly looking at big data technologies to extend the capacity of their platform, offload simple ETL and data processing tasks and add new capabilities to store and process unstructured data along with their existing relational datasets. In this presentation we’ll look at what’s involved in adding Hadoop and other big data technologies to your data warehouse platform, see how tools such as Oracle Data Integrator and Oracle Business Intelligence can be used to process and analyze new “big data” data sources, and look at what’s involved in creating a single query and metadata layer over both sources of data.
Audience: DBAs, DW managers, architects
Tuesday 3rd November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register
What is Big Data Discovery and how does it complement traditional Business Analytics?
Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data Discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization. At the same time, Oracle Big Data Discovery reduces the dependency on expensive and often difficult-to-find Data Scientists, opening up many Big Data tasks to "Citizen" Data Scientists.
In this session we’ll look at Oracle Big Data Discovery and how it provides a “visual face” to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.
Audience: Data analysts, market analysts, & Big Data project team members
Tuesday 10th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register
Adding Big Data to your Organization to create true 360-Degree Customer Insight
Organisations are increasingly looking to “big data” to create a true, 360-degree view of their customer and market activity. Big data technologies such as Hadoop, NoSQL databases and predictive modelling make it possible now to bring highly granular data from all customer touch-points into a single repository and use that information to make better offers, create more relevant products and predict customer behaviour more accurately.
In this session we’ll look at what’s involved in creating a customer 360-degree view using big data technologies on the Oracle platform, see how unstructured and social media sources can be added to more traditional transactional and customer attribute data, and how machine learning and predictive modelling techniques can then be used to classify, cluster and predict customer behaviour.
Audience: MI Managers, CX Managers, CIOs, BI / Analytics Managers
Tuesday 24th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register
Managing Impala and Other Mixed Workloads on the Oracle Big Data Appliance
One of our current client projects uses Cloudera Impala to provide fast ad-hoc querying over the data we're loading into their Oracle Big Data Appliance Hadoop environment. Impala bypasses MapReduce to provide faster queries than Hive, but to do so it does a lot of processing in-memory and runs server processes on each node in the cluster, leading in some cases to runaway queries blocking other workloads in the same way that OBIEE queries on an Oracle Database can sometimes block ETL and application workloads. Several projects share this same Big Data Appliance, so to try and limit the impact Impala could have on other cluster workloads the client had disabled the Impala daemons on nine of the twelve nodes in their Big Data Appliance. Our concern with this approach was that an Impala query could access data from any datanode in the Big Data Appliance cluster, so whilst HDFS data is typically stored and replicated to three nodes in the cluster, running the Impala daemons on just a quarter of the available nodes was likely to lead to data locality issues for Impala and blocks getting shipped across the network unnecessarily.
Going back to OBIEE and the Oracle Database, Oracle have a resource management feature for the Oracle Database that allows you to put users and queries into separate resource pools and manage the share of overall resources that each pool gets. I covered this concept on the blog a few years ago. The version of Cloudera Hadoop (CDH5.3) used on the client's Big Data Appliance has a feature called "YARN", or Yet Another Resource Negotiator, that splits out the resource management and scheduling parts that were bound into MapReduce in Hadoop 1.0, so that MapReduce just runs as one workload type on Hadoop, making it possible to run other workload types, for example Apache Spark, on that same cluster management framework.
Impala isn't, however, configured to use YARN by default and uses an internal scheduler to govern how concurrent queries run and use cluster resources, but it can be configured to use YARN in what Cloudera term "Integrated Resource Management", and our initial response was to recommend this approach. However, YARN is really optimised for longer-running batch jobs and not the shorter jobs that Impala generates (such that Cloudera recommends you don't actually use YARN, and control Impala resource usage via service-level process constraints or through a new Impala feature called Admission Control instead). Taking a step back though, how do we actually see what resources Impala is using across the cluster when a query runs, and is there a feature similar to the Oracle Database's SQL explain plan to help us understand how an Impala SQL query is executed? Then, using this and the various resource management options available to us, can we understand how YARN and the other options will affect the Impala users on the client's cluster if we enable them? And, given that we were going to test this all out on one of our development Hadoop clusters running back at the office on VMWare, how well could we simulate the multiple concurrent queries and mixed workload we'd then encounter on the real customer Big Data Appliance?
When trying to understand what goes on when a Cloudera Impala SQL query runs, the two main tools in your toolbox are EXPLAIN plans and query profiles. The concept of EXPLAIN plans will be familiar to Oracle developers, and putting “explain” before your Impala SQL query when you’re using the Impala Shell (or pressing the “Explain” button when you’re using the Impala Editor in Hue) will display an output like the one below, showing the steps the optimiser plans to take to return the query results:
[bda7node1.rittmandev.com:21000] > explain select sum(f.flights) as total_flights, d.dest_city from flight_delays f join geog_dest d on f.dest = d.dest join geog_origin o on f.orig = o.orig where d.dest_state = 'California' and o.orig_state in ('Florida','New York','Alaska') group by d.dest_city having total_flights > 3000;
Query: explain select sum(f.flights) as total_flights, d.dest_city from flight_delays f join geog_dest d on f.dest = d.dest join geog_origin o on f.orig = o.orig where d.dest_state = 'California' and o.orig_state in ('Florida','New York','Alaska') group by d.dest_city having total_flights > 3000
+---------------------------------------------------------------------+
| Explain String                                                      |
+---------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=154.01MB VCores=2           |
|                                                                     |
| 10:EXCHANGE [UNPARTITIONED]                                         |
| |                                                                   |
| 09:AGGREGATE [FINALIZE]                                             |
| |  output: sum:merge(f.flights)                                     |
| |  group by: d.dest_city                                            |
| |  having: sum(f.flights) > 3000                                    |
| |                                                                   |
| 08:EXCHANGE [HASH(d.dest_city)]                                     |
| |                                                                   |
| 05:AGGREGATE                                                        |
| |  output: sum(f.flights)                                           |
| |  group by: d.dest_city                                            |
| |                                                                   |
| 04:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.orig = o.orig                                 |
| |                                                                   |
| |--07:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  02:SCAN HDFS [airlines.geog_origin o]                            |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: o.orig_state IN ('Florida', 'New York', 'Alaska') |
| |                                                                   |
| 03:HASH JOIN [INNER JOIN, BROADCAST]                                |
| |  hash predicates: f.dest = d.dest                                 |
| |                                                                   |
| |--06:EXCHANGE [BROADCAST]                                          |
| |  |                                                                |
| |  01:SCAN HDFS [airlines.geog_dest d]                              |
| |     partitions=1/1 files=1 size=147.08KB                          |
| |     predicates: d.dest_state = 'California'                       |
| |                                                                   |
| 00:SCAN HDFS [airlines.flight_delays f]                             |
|    partitions=1/1 files=1 size=64.00MB                              |
+---------------------------------------------------------------------+
Fetched 35 row(s) in 0.21s
Like an Oracle SQL explain plan, Impala's cost-based optimiser uses table and partition stats that you should have gathered previously using Impala's "compute stats" command to determine what it thinks is the optimal execution plan for your query. To see the actual cost and timings for the various plan steps that are run for a query, you can then use the "summary" statement after your query has run (or, for more detail, the "profile" statement) to see the actual timings and stats for each step in the query execution.
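As an aside, the stats the optimiser relies on come from having run "compute stats" against each table involved beforehand; for the demo tables used in the query above, that's simply:
[bda7node1.rittmandev.com:21000] > compute stats airlines.flight_delays;
[bda7node1.rittmandev.com:21000] > compute stats airlines.geog_dest;
[bda7node1.rittmandev.com:21000] > compute stats airlines.geog_origin;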
[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 20.35us  | 20.35us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 142.18ms | 180.81ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 59.86us  | 123.39us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 171.72ms | 208.36ms | 60      | 1.93K      | 22.73 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 89.42ms  | 101.82ms | 540.04K | 131.88M    | 12.79 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 16.32us  | 19.63us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 302.83ms | 302.83ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 936.71ms | 1.10s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 19.02us  | 46.49us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 266.99ms | 266.99ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.07s    | 1.90s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
Output from the Summary statement gives us some useful information in working out the impact of the various resource management options for the Oracle Big Data Appliance, at least in terms of its impact on individual Impala queries – we’ll look at the impact on the overall Hadoop cluster and individual nodes later on. From the output of the above Summary report I can see that my query ran on all six nodes in the cluster (queries I ran earlier on a smaller version of the fact table ran on just a single node), and I can see how long each step in the query actually took to run. So what happens if I run the same query again on the cluster but disable the Impala daemon service role on three of the nodes, using Cloudera Manager?
Here’s the Summary output after running the query again:
[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 22.01us  | 22.01us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 3      | 111.12ms | 117.24ms | 7       | 193        | 6.27 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 3      | 30.09us  | 39.02us  | 30      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 3      | 161.26ms | 173.57ms | 30      | 1.93K      | 22.84 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 3      | 156.50ms | 238.90ms | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 3      | 20.19us  | 28.93us  | 1.41K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 477.38ms | 477.38ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 3      | 1.48s    | 1.66s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 3      | 12.07us  | 14.89us  | 519     | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 308.83ms | 308.83ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 3      | 3.39s    | 6.85s    | 131.88M | 131.88M    | 74.11 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
What the Summary statement doesn't show you is the overall time the query took to run: the query run against three nodes took 9.48s, compared to 3.59s for the earlier run where all six nodes' Impala daemons were enabled. In fact I'd expect a query running on the client's BDA with just three out of twelve nodes enabled to run even slower because of the block locality issue – Impala has a feature called block locality tracking which keeps track of where HDFS data blocks are actually located on the cluster and tries to run impalad tasks on the right nodes, but with only three out of twelve nodes running that job becomes really hard – but the other factor we need to consider is how running multiple queries concurrently affects things when only a few nodes are handling all the Impala user queries.
To try and simulate concurrent queries running, I opened six terminal sessions against nodes actually running Impala daemon service roles and submitted the same query from each session, with a second or two gap between each query; with all six nodes enabled the average response time rose to about 6s, but with just three enabled the response rose fairly consistently to around 27s.
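If you'd rather script this than juggle terminal windows, a simple shell loop does much the same job; the node names and the query file below are placeholders, and the only impala-shell options assumed here are -i (the impalad to connect to) and -f (a file containing the query to run):
# submit the same query to six impalad nodes a couple of seconds apart, then wait for them all to finish
for i in 1 2 3 4 5 6; do
  impala-shell -i bda7node${i}.rittmandev.com:21000 -f flight_delays_query.sql &
  sleep 2
done
wait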
This is of course what you’d expect when everything was trying to run on the same three (now resource-starved) server nodes, and again I’d expect this to be even more pronounced on the client’s twelve-node BDA. What this test of course didn’t cover was running workloads other than Impala on the same cluster, or running queries against different datasets, but it did at least show us how response-time increases fairly dramatically (albeit consistently) as more Impala users come onto the system.
So now we have some baseline benchmarking figures, let's configure Impala to use YARN, using Cloudera Manager on the CDH5.3 setup used on the client's BDA and our development cluster back in the office. There are actually two parts to Impala running on YARN in CDH5.x: YARN itself as the overall cluster resource management layer, and another component called Llama (Low-Latency, or "Long-Lived", Application Master) that sits between YARN and Impala and reduces the time that each Impala query takes to obtain YARN resource allocations.
Enabling YARN and Llama (and, if you want to, configuring Llama and thereby Impala for high availability) is done through a wizard in CDH5.3 that also offers to set up a Linux feature called Cgroups, which YARN can use to limit the "containers" it uses for resource management at the OS level.
Once you've run through the wizard and restarted the cluster, Impala should be configured to use YARN instead of its own scheduler to request resources, which in theory will allow Hadoop and the Big Data Appliance to consider Impala workloads alongside MapReduce, Spark and HBase when scheduling jobs across the cluster. Before we get into the options YARN gives us for managing these workloads, I ran the same Impala queries again, first as a single query and then with six running concurrently, to see what impact YARN on its own had on query response times.
The single query on its own took around the same time to run as without YARN (3-4s), but when I ran six concurrent queries together the response time went up from the 3-4s I saw without YARN enabled to between 5s and 18s depending on the session, with quite a bit of variation between response times compared to the consistent times I saw when YARN wasn't being used. This surprised me, as one of the stated benefits of YARN is making job execution times more predictable and smooth, though this could be more of an overall-cluster thing, and there are also recommendations in the Cloudera docs around configuring YARN and Llama's resource estimation to work more efficiently for Impala.
[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 41.38us  | 41.38us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 115.28ms | 123.04ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 44.44us  | 67.62us  | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 170.91ms | 201.47ms | 60      | 1.93K      | 22.82 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 82.25ms  | 98.34ms  | 540.04K | 131.88M    | 12.81 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.39us  | 18.99us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 244.40ms | 244.40ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 850.55ms | 942.47ms | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 13.99us  | 19.05us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 222.03ms | 222.03ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 1.54s    | 2.88s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
But it seems clear that users of Impala on the client cluster should expect some sort of overhead from using YARN to manage Impala's resources, with the payoff being better balance between Impala workloads and the other uses they're putting the BDA cluster to – however I think there's more we can do to fine-tune how Llama and YARN allocate memory to Impala queries up-front (allocating a set amount of memory for all queries, rather than making an estimate and then adding more memory mid-query if it's needed), and of course we've not really tested it on a cluster with a full, mixed workload running. But what about our original scenario, where only a certain percentage of the overall cluster resources or nodes are allocated to Impala query processing? To set up that sort of division of resources we can use another feature of YARN called dynamic allocation, and dynamic resource pools that we can again set up through Cloudera Manager.
Dynamic allocation is one of the ways that YARN can be configured to manage multiple workloads on a Hadoop cluster (the other way is through static service pools, and I’ll come to those in a moment). Using dynamic allocation I can set up a resource pool for the airline flight delays application that my Impala SQL queries are associated with and allocate it 25% of overall cluster resources, with the remainder of cluster resources allocated to other applications. I can keep that weighting simple as I have done in the screenshot below, or I can allocate resources based on virtual cores and memory, but I found it simpler to just set these overall weightings and let YARN worry about cores and RAM.
Depending on the scheduling policy you select, YARN will prioritise Impala and other jobs in different ways, but the recommended scheduling policy for mixed workloads is dominant resource fairness, which balances RAM and CPU depending on which resource pool needs them most at a particular time. Note also that Impala can either be managed as part of the overall YARN workload or separately, a choice you can make in the Impala service configuration settings in Cloudera Manager (the "Enable Dynamic Resource Pools" setting that's checked below, but was unchecked for the screenshot above).
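Behind the scenes, Cloudera Manager implements dynamic resource pools through the YARN Fair Scheduler, so as a rough sketch the equivalent fair-scheduler.xml allocation for a 25%/75% split with the DRF policy would look something like this; the pool names are just the ones used in this example:
<allocations>
  <queue name="airlines">
    <weight>25.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
  <queue name="default">
    <weight>75.0</weight>
    <schedulingPolicy>drf</schedulingPolicy>
  </queue>
</allocations>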
There's also a separate control you can place on Impala queries called Admission Control, which limits the number of queries that can run or be queued for a resource pool at any particular time. The docs are a bit vague on when to use admission control, when to use YARN or not, and so on, but my take is that if it's just Impala queries you're worried about and throttling their use solves the problem, then use this feature and leave Impala outside of YARN; if you need to manage an overall mixed workload, then do it all through YARN. For my testing example, though, I just went with simple resource pool weighting, and you can see from the screenshot below, where multiple queries are running at once for my pool, that CPU and RAM resources are constrained as expected.
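As a rough illustration of the admission control route (with Impala left outside YARN), the simplest form is a cap on running and queued queries for the default pool, set via impalad startup flags; the numbers below are arbitrary examples rather than recommendations, and per-pool limits can also be set against dynamic resource pools in Cloudera Manager:
-default_pool_max_requests=10
-default_pool_max_queued=50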
To make a particular Impala query run within a specific resource pool you can either allocate that user to a named resource pool, or you can specify the resource pool in your Impala shell session like this:
[bda7node1.rittmandev.com:21000] > set request_pool = airlines;
REQUEST_POOL set to airlines
[bda7node1.rittmandev.com:21000] > select sum(f.flights) as total_flights, d.dest_city from airlines.flight_delays_full f join airlines.geog_dest d on f.dest = d.dest join airlines.geog_origin o on f.orig = o.orig where d.dest_state = 'California' and o.orig_state in ('Florida','New York','Alaska') group by d.dest_city having total_flights > 3000;
Looking then at a typical summary output for a query running with these restrictions (25% of resources overall) and other queries running concurrently, the numbers don't look all that different to before, and results took between 8s and 30s to return – again I was surprised by the variance, but I think YARN is more about overall cluster performance than individual queries, and you shouldn't read too much into specific times on a dev server with an unrepresentative overall workload.
[bda7node1.rittmandev.com:21000] > summary;
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| Operator        | #Hosts | Avg Time | Max Time | #Rows   | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail                        |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
| 10:EXCHANGE     | 1      | 26.78us  | 26.78us  | 7       | 193        | 0 B       | -1 B          | UNPARTITIONED                 |
| 09:AGGREGATE    | 6      | 209.10ms | 262.02ms | 7       | 193        | 6.28 MB   | 10.00 MB      | FINALIZE                      |
| 08:EXCHANGE     | 6      | 63.20us  | 118.89us | 60      | 1.93K      | 0 B       | 0 B           | HASH(d.dest_city)             |
| 05:AGGREGATE    | 6      | 282.56ms | 401.37ms | 60      | 1.93K      | 22.76 MB  | 10.00 MB      |                               |
| 04:HASH JOIN    | 6      | 99.56ms  | 114.14ms | 540.04K | 131.88M    | 12.85 MB  | 5.41 KB       | INNER JOIN, BROADCAST         |
| |--07:EXCHANGE  | 6      | 15.49us  | 17.94us  | 2.81K   | 117        | 0 B       | 0 B           | BROADCAST                     |
| |  02:SCAN HDFS | 1      | 531.08ms | 531.08ms | 469     | 117        | 309.00 KB | 32.00 MB      | airlines.geog_origin o        |
| 03:HASH JOIN    | 6      | 1.20s    | 1.54s    | 15.68M  | 131.88M    | 12.14 MB  | 3.02 KB       | INNER JOIN, BROADCAST         |
| |--06:EXCHANGE  | 6      | 24.29us  | 68.23us  | 1.04K   | 39         | 0 B       | 0 B           | BROADCAST                     |
| |  01:SCAN HDFS | 1      | 287.31ms | 287.31ms | 173     | 39         | 325.00 KB | 32.00 MB      | airlines.geog_dest d          |
| 00:SCAN HDFS    | 6      | 2.34s    | 3.13s    | 131.88M | 131.88M    | 74.03 MB  | 480.00 MB     | airlines.flight_delays_full f |
+-----------------+--------+----------+----------+---------+------------+-----------+---------------+-------------------------------+
A point to note is that I found it very hard to get Impala queries to run when I got down to specifying virtual core and memory limits rather than just overall weightings, so I’d go with these high-level resource pool prioritisations which seemed to work and didn’t unduly affect query response times. For example the setting below looked clever, but queries always seemed to time out and I never really got a satisfactory setup working properly.
Note that for YARN dynamic resource pools to be used, all Linux/CDH users will need to be assigned to resource pools so they don’t run as “unconstrained”; this can also be done from the Dynamic Resource Pools configuration page.
Finally though, all of this management through resource pools might not be the best way to control resource usage by YARN. The Cloudera docs say quite clearly on the Integrated Resource Management page that:
“When using YARN with Impala, Cloudera recommends using the static partitioning technique (through a static service pool) rather than the combination of YARN and Llama. YARN is a central, synchronous scheduler and thus introduces higher latency and variance which is better suited for batch processing than for interactive workloads like Impala (especially with higher concurrency). Currently, YARN allocates memory throughout the query, making it hard to reason about out-of-memory and timeout conditions.”
What this means in practice is that, if you've got a single project using the Big Data Appliance and you just want to specify at a high level what proportion of resources Impala, HBase, MapReduce and the other services under YARN management use, you can define this as static service pool settings in Cloudera Manager and have these restrictions enforced by Linux Cgroups. In the screenshot below I unwound all of the dynamic resource pool settings I created a moment ago and allocated 25% of overall cluster resources to Impala, with the wizard then using those top-level values to set limits for services across all nodes in the cluster based on their actual RAM and CPU, the services running on them, and so on.
Then, going back to Cloudera Manager and running some queries, you can see these static service pool limits being applied in real-time and their effect in the form of graphs for particular cluster resources.
So given all of this, what was our recommendation to the client about how best to set up resource management for Impala and other workloads on their Big Data Appliance? Not too much should be read into individual numbers – it’s hard to simulate a proper mixed workload on a development server, and of course their BDA has 12 nodes, more memory, faster CPUs. However it’s probably fair to say these are the obvious conclusions:
- Running Impala daemons on just a subset of nodes isn't actually a bad way to constrain resources used by Impala, but it'll only work on clusters with a small number of nodes, where there's still a good chance one of the Impala nodes will hold one of the three copies of a given data block. On a system of the scale of our customer's, we'll probably hit unacceptable overheads in terms of block locality. I would not carry on with this approach because of that.
- If the customer BDA will be running a mixed workload, i.e. data loading and long-running Hive/Pig/Spark jobs as well as short-running Impala jobs, enabling Impala for YARN and setting overall resource pools for applications would be the best approach, but individual Impala queries will probably run slower than now (even given the current restriction in resources), due to the overhead YARN imposes when scheduling and running jobs. This will, however, be the best way to allocate resources between applications and provide a generally "smoother" experience for users.
- If the BDA needs to be optimised mostly for Impala queries, then don't manage Impala under YARN; leave it outside of this and just use static service pools to allocate Impala roughly 25% of resources across all nodes. In both this and the previous instance (Impala on YARN), all nodes should be re-enabled for Impala so as to minimize issues over block locality.
- If the only real issue is Impala queries for a particular application taking all resources/becoming runaway, Impala could be left outside of YARN but enabled for admission control so as to limit the total number of running/queued queries for a particular application.
New Oracle Big Data Quick-Start Packages from Rittman Mead
Many organisations using Oracle's business intelligence and data warehousing tools are now looking to extend their capabilities using "big data" technologies. Customers running their data warehouses on Oracle Databases are looking to use Hadoop to extend their storage capacity whilst offloading initial data loading and ETL to this complementary platform; other customers are using Hadoop and Oracle's Big Data Appliance to add new capabilities around unstructured and sensor data analysis, all at considerably lower cost than traditional database storage.
In addition, as data and analytics technologies and capabilities have evolved, there has never been a better opportunity to reach further into your data to exploit more value. Big Data platforms, Data Science methods and data discovery technologies make it possible to unlock the power of your data and put it in the hands of your executives and team members – but what is it worth to you? What's the value to your organisation of exploring deeper into the data you have, and how do you show a return?
Many organisations have begun to explore Big Data technologies to understand where they can exploit value and extend their existing analytics platforms, but what's the business case? The good news is that, using current platforms and following architectures like the Oracle Information Management and Big Data Reference Architecture written in conjunction with Rittman Mead, the foundation is in place to unlock a range of growth opportunities. Finding new value in existing data, predictive analytics, data discovery, reducing the cost of data storage and ETL offloading are all starter business cases proven to return value quickly.
To help you start on the Oracle big data journey, Rittman Mead have put together two quick-start packages focused on the most popular Oracle customer use-cases:
- Extending Oracle data warehouses to use Hadoop technology (including Oracle’s Big Data Appliance and Oracle Data Integrator), and
- Using Hadoop and Oracle’s new Big Data Discovery tool to analyse your structured, semi-structured and social media data
If this sounds like something you or your organization might be interested in, take a look at our new Oracle Big Data and Big Data Discovery Quick Start packages on the Rittman Mead home page, or drop me an email at mark.rittman@rittmanmead.com and I'll let you know how we can help.