Category Archives: Rittman Mead
Presentation Slides and Photos from the Rittman Mead BI Forum 2015, Brighton and Atlanta
It’s now the Saturday after the two Rittman Mead BI Forum 2015 events, last week in Atlanta, GA and the week before in Brighton, UK. Both events were a great success and I’d like to say thanks to the speakers, attendees, our friends at Oracle and my colleagues within Rittman Mead for making the two events so much fun. If you’re interested in taking a look at some photos from the two events, I’ve put together two Flickr photosets that you can access using the links below:
- Flickr Photoset from the Brighton Rittman Mead BI Forum 2015
- Flickr Photoset from the Atlanta Rittman Mead BI Forum 2015
We’ve also uploaded the presentation slides from the two events (where we’ve been given permission to share them) to our website, and you can download them including the Delivering the Oracle Information Management and Big Data Reference Architecture masterclass using the links below:
Delivering the Oracle Information Management & Big Data Reference Architecture (Mark Rittman & Jordan Meyer, Rittman Mead)
- Part 1 : Delivering the Discovery Lab (Jordan Meyer, Head of R&D at Rittman Mead)
- Part 2 : Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Architecture (Mark Rittman, CTO, Rittman Mead)
Brighton, May 7th and 8th 2015
- Steve Devine, Independent : “The Art and Science of Creating Effective Data Visualisations”
- Chris Royles, Oracle Corporation : “Big Data Discovery”
- Christian Screen, Sierra-Cedar : “10 Tenets for Making Your Oracle BI Applications Project Succeed Like a Boss”
- Emiel van Bockel, CB : “Watch and see 12c on Exalytics”
- Daniel Adams, Rittman Mead : “User Experience First: Guided information and attractive dashboard design”
- Robin Moffatt, Rittman Mead : “Data Discovery and Systems Diagnostics with the ELK stack”
- André Lopes / Roberto Manfredini, Liberty Global : “A Journey into Big Data and Analytics”
- Antony Heljula, Peak Indicators : “Predictive BI – Using the Past to Predict the Future”
- Gerd Aiglstorfer, G.A. itbs GmbH : “Driving OBIEE Join Semantics on Multi Star Queries as User”
- Manuel Martin Marquez, CERN – European Laboratory for Particle Physics, “Governed Information Discovery: Data-driven decisions for more efficient operations at CERN”
Atlanta, May 14th and 15th 2015
- Robin Moffatt, Rittman Mead : “Smarter Regression Testing for OBIEE”
- Mark Rittman : “Oracle Big Data Discovery Tips and Techniques from the Field”
- Hasso Schaap, Qualogy : “Developing strategic analytics applications on OBICS PaaS”
- Tim German / Cameron Lackpour, Qubix / CLSolve : “Hybrid Mode – An Essbase Revolution”
- Stewart Bryson, Red Pill Analytics, “Supercharge BI Delivery with Continuous Integration”
- Andy Rocha & Pete Tamisin, Rittman Mead : “OBIEE Can Help You Achieve Your GOOOOOOOOOALS!”
- Christian Screen, Sierra-Cedar : “10 Tenets for Making Your Oracle BI Applications Project Succeed Like a Boss”
- Sumit Sarkar, Progress Software : “Make sense of NoSQL data using OBIEE”
Congratulations also to Emiel van Bockel and Robin Moffatt who jointly won the Best Speaker award at the Brighton event, and to Andy Rocha and Pete Tamisin who won Best Speaker in Atlanta for their joint session. It’s time for a well-earned rest now and then back to work, and hopefully we’ll see some of you at KScope’15, Oracle OpenWorld 2015 or the UKOUG Tech and Apps 2015 conferences later in 2015.
What’s New in OBIEE 11.1.1.9 for Systems Administrators and Developers
After over two years since the last major release of OBIEE, Oracle released version 11.1.1.9 in May 2015. You can find the installers here and documentation here. 11.1.1.9 is termed the “terminal release” of the 11g line, and the 12c version is already out in closed beta. We’d expect to see patchsets for 11g to continue for some time covering bugs and any security issues, but for new functionality in 11g I would hazard a guess that this is pretty much it as Oracle concentrate their development efforts on OBIEE 12c and BICS, particularly Visual Analyzer.
For both the end user and backend administrator/developer, OBIEE 11.1.1.9 has brought with it some nice little touches, none of which are going to revolutionise the OBIEE world but many of which are going to make life with the tool just that little bit smoother. In this article we take a look at what 11.1.1.9 brings for the sysadmin & developer.
BI Server Query Instrumentation and Usage Tracking
There are some notable developments here:
- Millisecond precision when logging events from the BI Server
- Usage Tracking now includes the physical query hash, which is what is also visible in the database, enabling end-to-end tracing
- User sessions can be tracked and summarised more precisely because session ID is now included in Usage Tracking.
- The execution of initialisation blocks is now also recorded, in a new Usage Tracking table called S_NQ_INITBLOCK.
Millisecond precision in BI Server logs
OBIEE 11.1.1.9 writes the nqquery.log with millisecond precision for both the timestamp of each entry, and also the summary timings for a query execution (at last!). It also calls out explicitly “Total time in BI Server” which is a welcome addition from a time profiling/performance analysis point of view:
[2016-07-31T02:11:48.231-04:00 [...] Sending query to database named X0 - Airlines Demo Dbs (ORCL) (id: <<221516>>), connection pool named Aggr Connection, logical request hash 544131ec, physical request hash 5018e5db: [[
[2016-07-31T02:12:04.31-04:00 [...] Query Status: Successful Completion
[2016-07-31T02:12:04.31-04:00 [...] Rows 2, bytes 32 retrieved from database query id: <<221516>>
[2016-07-31T02:12:04.31-04:00 [...] Physical query response time 2.394 (seconds), id <<221516>>
[2016-07-31T02:12:04.31-04:00 [...] Physical Query Summary Stats: Number of physical queries 1, Cumulative time 2.394, DB-connect time 0.002 (seconds)
[2016-07-31T02:12:04.31-04:00 [...] Rows returned to Client 2
[2016-07-31T02:12:04.31-04:00 [...] Logical Query Summary Stats: Elapsed time 16.564, Total time in BI Server 16.555, Response time 16.564, Compilation time 0.768 (seconds), Logical hash 544131ec
One thing to notice here is the subsecond timestamp precision seems to vary between 2 and 3 digits, which may or may not be a bug.
Being able to see this additional level of precision is really important. Previously OBIEE recorded information by the second, which was fine if you were looking at query executions taking dozens of seconds or minutes – but hopefully our aspirations for systems performance are actually closer to the realms of seconds or subsecond. At this scale the level of precision in the timings really matters. On the assumption that OBIEE was rounding values to the nearest whole number, you’d see “0 seconds” for a Logical SQL compile (for example) that was maybe 0.499 seconds. Per query this is not so significant, but if those queries run frequently then cumulatively that time stacks up and would be useful to be properly aware of and target with optimisation if needed.
Usage Tracking changes
Usage Tracking has five new columns for each logical query recorded in S_NQ_ACCT:
- ECID
- TENANT_ID
- SERVICE_NAME
- SESSION_ID
- HASH_ID
The presence of SESSION_ID is very useful, because it means that user behaviour can be more accurately analysed. For example, within a session, how many reports does a user run? What is the median duration of a session? Note that the session here is the session as seen by the BI Server, rather than Presentation Services.
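As a minimal sketch of the kind of session-level analysis this enables (assuming the usual BIEE_BIPLATFORM schema owner for the Usage Tracking tables – adjust to suit your own install):

-- Queries run, rows returned and total query time per user session
SELECT user_name,
       session_id,
       COUNT(*)            AS queries_run,
       SUM(row_count)      AS rows_returned,
       SUM(total_time_sec) AS total_query_secs
FROM   biee_biplatform.s_nq_acct
GROUP  BY user_name, session_id
ORDER  BY total_query_secs DESC;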
ECID is also very useful for being able to link data in Usage Tracking back to more detailed entries in nqquery.log. Note that an ECID is multi-part, concatenated with an RID, so you won’t necessarily get a direct hit on the ECID you find in Usage Tracking against the one in nqquery.log, but rather a substring of it. In the example here the root ECID is 11d1def534ea1be0:20f8da5c:14d4441f7e9:-8000-0000000000001891,0:1:103 and the varying component of the relationship (RID) is 1 and 3 respectively:
Usage Tracking:
select ecid, session_id, start_dt, start_hour_min, saw_src_path from biee_biplatform.s_nq_acct
nqquery.log:
[2015-05-12T08:58:38.704-04:00] [...] [ecid: 11d1def534ea1be0:20f8da5c:14d4441f7e9:-8000-0000000000001891,0:1:103:3] [...] -------------------- SQL Request, logical request hash: 3fabea2b SET VARIABLE QUERY_SRC_CD='Report',SAW_DASHBOARD='/shared/02. Visualizations/_portal/2.11 Table Designs',SAW_DASHBOARD_PG='Conditional Format',SAW_SRC_PATH='/shared/02. Visualizations/Configured Visuals/Conditional Formats/CF based on a hidden column',PREFERRED_CURRENCY='USD';SELECT^M 0 s_0,^M [...]
In the above example note how the absence of a timezone in the Usage Tracking data is an impediment to accurately interpreting the results, compared to nqquery.log which has a fully qualified timezone offset.
Usage Tracking changes – Physical Hash ID
As well as additions to the logical query table, there are two new columns for each physical query logged in S_NQ_DB_ACCT:
- HASH_ID
- PHYSICAL_HASH_ID
The implications of this are important – there is now native support in OBIEE for tracing OBIEE workloads directly down to the database (as discussed for OBIEE < 11.1.1.9 here), because the PHYSICAL_HASH_ID is what OBIEE sets as the ACTION field when it connects to the database, and it is available in Oracle through AWR, the V$ views, and DBMS_MONITOR. For example, in V$SESSION the ACTION field is set to the physical hash:
SQL> select username,program,action from v$session where lower(program) like 'nqs%';

USERNAME PROGRAM                                          ACTION
-------- ------------------------------------------------ ---------
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         5065e891
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         2b6148b2
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         8802f14e
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         206c8d54
BISAMPLE nqsserver@demo.us.oracle.com (TNS V1-V3)         c1c121a7
The ACTION is also available in many EM screens such as this one:
Now with OBIEE 11.1.1.9 the physical hash – which was previously only available in the nqquery.log file – is available in S_NQ_DB_ACCT which can in turn be joined to S_NQ_ACCT to find out the logical request related to the physical query seen on the database. Cool huh!
SELECT PHYSICAL_HASH_ID, USER_NAME, SAW_SRC_PATH, SAW_DASHBOARD, SAW_DASHBOARD_PG
FROM   BIEE_BIPLATFORM.S_NQ_DB_ACCT PHYS
       INNER JOIN BIEE_BIPLATFORM.S_NQ_ACCT LOGL
       ON LOGL.ID = PHYS.LOGICAL_QUERY_ID
WHERE  PHYS.PHYSICAL_HASH_ID = '5065e891'
This can be extended even further to associate AWR workload reports with specific OBIEE requests:
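As a sketch of how that association might be queried (assuming you are licensed for the Diagnostics Pack DBA_HIST_* views, and re-using the same BIEE_BIPLATFORM join as above), you could count ASH samples per OBIEE request via the ACTION column:

-- ASH samples attributed to each OBIEE logical request, via the physical hash in ACTION
SELECT logl.user_name,
       logl.saw_dashboard,
       logl.saw_src_path,
       ash.sql_id,
       COUNT(*) AS ash_samples
FROM   dba_hist_active_sess_history ash
       JOIN biee_biplatform.s_nq_db_acct phys
         ON ash.action = phys.physical_hash_id
       JOIN biee_biplatform.s_nq_acct logl
         ON logl.id = phys.logical_query_id
GROUP  BY logl.user_name, logl.saw_dashboard, logl.saw_src_path, ash.sql_id
ORDER  BY ash_samples DESC;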
One little grumble (no pleasing some people…) – it would have been nice if Usage Tracking also stored:
- Timings at millisecond precision as well
- The number of bytes (rather than just row count)
- A proper TIMESTAMP WITH TIME ZONE (rather than the weird triplet of TS/DT/HOUR_MIN)
- “Total time in BI Server”
Who knows, maybe in 12c?…
Footnote – START_TS in Usage Tracking in 11.1.1.9
As a note for others who may hit this issue, my testing has shown that Usage Tracking in 11.1.1.9 appears to have introduced a bug with START_TS (on both S_NQ_ACCT and S_NQ_DB_ACCT), in that it stores only the date, not date + time as it did in previous versions. For example:
- 11.1.1.7:
SELECT TO_CHAR(START_TS, 'YYYY-MM-DD HH24:MI:SS') AS START_TS,
       TO_CHAR(START_DT, 'YYYY-MM-DD HH24:MI:SS') AS START_DT,
       START_HOUR_MIN
FROM   S_NQ_ACCT
WHERE  ROWNUM < 2

START_TS            START_DT            START_HOUR_MIN
------------------- ------------------- --------------
2015-03-19 15:32:23 2015-03-19 00:00:00 15:32
- 11.1.1.9:
SELECT TO_CHAR(START_TS, 'YYYY-MM-DD HH24:MI:SS') AS START_TS,
       TO_CHAR(START_DT, 'YYYY-MM-DD HH24:MI:SS') AS START_DT,
       START_HOUR_MIN
FROM   S_NQ_ACCT
WHERE  ROWNUM < 2

START_TS            START_DT            START_HOUR_MIN
------------------- ------------------- --------------
2015-01-27 00:00:00 2015-01-27 00:00:00 10:41
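In the meantime, a possible workaround (a sketch only, and accurate to the minute at best) is to rebuild an approximate start time from START_DT and START_HOUR_MIN:

-- Reconstruct an approximate start timestamp from the date plus the HH24:MI string
SELECT start_dt
       + TO_NUMBER(SUBSTR(start_hour_min, 1, 2)) / 24
       + TO_NUMBER(SUBSTR(start_hour_min, 4, 2)) / 1440 AS approx_start_ts,
       user_name,
       saw_src_path
FROM   biee_biplatform.s_nq_acct;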
Initialisation Block information in Usage Tracking
A new table, S_NQ_INITBLOCK, has been added to BIPLATFORM and holds details of when an init block ran, for which user, and importantly, how long it took. From a performance analysis point of view this is really valuable data and it’s good to see it being added to the diagnostic data captured to the database with Usage Tracking.
From a glance at the data it looks like there’s a bit of bonus logging going on, with user sign in/sign out also recorded (“SIGNNING ON/SIGNED ON/SIGNED OFF”).
Note that there is no MBean for Init Block Usage Tracking, so regardless of how you configure the rest of Usage Tracking, you need to go to NQSConfig.ini to enable this one.
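To get a feel for what’s being captured you can query the new table directly; note that the column names in the sketch below (USER_NAME, BLOCK_NAME, DURATION) are assumptions on my part, so DESCRIBE the table in your own BIPLATFORM schema to confirm them first:

-- Column names here are assumptions; check the table definition in your own schema first
SELECT user_name,
       block_name,
       duration
FROM   biee_biplatform.s_nq_initblock
ORDER  BY duration DESC;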
Presentation Services Cursor Cache
Oracle have added some additional Administration functionality for viewing and managing sessions and the cursor cache in Presentation Services. These let you track and trace user sessions more precisely.
From the Administration Page in OBIEE the new options are:
- Set a dynamic log level per session from Manage Sessions
- Filter the cursor cache based on specific user sessions
- Change the sort order of the cursor cache
- Show Presentation Services diagnostics per cursor
- Download the cursor cache list as CSV – I can see this being useful for integration into bespoke monitoring applications, to which I must profess an interest…
Some of these are somewhat low-level and will not be used day-to-day, but the general move towards a more open diagnostics interface with OBIEE is really positive and I hope we see more of it in 12c… :-)
Command Line Aggregate Advisor
Only for use by those with an Exalytics licence, the Summary Advisor was previously available only within the Windows Administration Tool but can now be run from the command line:
[oracle@demo setup]$ nqaggradvisor -h

Usage: nQAggrAdvisor -d <dataSource> -u <userName> -o <outputFile> -c <tupleInQuotes>
                     [-p <password>] [-F <factFilter>] [-z <maxSizeAggr>] [-g <gainThreshold>]
                     [-l <minQueryTime>] [-t <timeoutMinutes>] [-s <startDate>] [-e <endDate>]
                     [-C <on/off>] [-M <on/off>] [-K <on/off>]

Options:
-d : Data source name
-u : User name
-o : Output aggregate persistence script file name
-c : Aggregate persistence target - tuple in quotes: Fully qualified Connection pool, fully qualified schema name, capacity in MB
-p : Password
-F : Fact filter file name
-z : Max size of any single aggregate in MB
-g : Summary advisor will run until performance improvement for new aggregates drops below this value, default = 1
-l : The minimum amount of query time accumulated per LTS in seconds, before it is included for analysis, default = 0
-t : Max run time in minutes - 0 for unlimited, default = 0
-s : Statistics start date
-e : Statistics end date
-C : Prefer optimizer estimates - on/off, default = off
-M : Only include measures used in queries - on/off, default = off
-K : Use surrogate keys - on/off, default = on

Examples:
nQAggrAdvisor -d "AnalyticsWeb" -u "Administrator" -p "ADMIN" -o "C:\temp\aggr_advisor.out.txt" -c "DW_Aggr"."Connection Pool","DW_Aggr".."AGGR",1000
nQAggrAdvisor -d "AnalyticsWeb" -u "Administrator" -p "ADMIN" -o "C:\temp\aggr_advisor.out.txt" -F "C:\temp\fact_filter.txt" -g 10 -c "TimesTen_instance1"."Connection Pool","dbo",2000 -s "2011-05-02 08:00:00" -e "2011-05-07 18:30:00" -C on -M on -K off
Note that in the BIPLATFORM schema S_NQ_SUMMARY_STATISTICS is now called S_NQ_SUMMARY_ADVISOR.
HTML5 images
In previous versions of OBIEE graph images were rendered in Flash by default, and as PNG on mobile devices. You could force it to use PNG for all images but would lose the interactivity (tooltips etc). Now in OBIEE 11.1.1.9 you can change the default from Flash to HTML5. This removes the need for a Flash plugin and is generally the way that a lot of visualisations are done on the web nowadays. To my eye there’s no difference in appearance:
To use HTML5 graphs by default, edit instanceconfig.xml and under the <Charts> section add:
<DefaultWebImageType>html5</DefaultWebImageType>
Note that html5 is case-sensitive. The config file should look something like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<WebConfig xmlns="oracle.bi.presentation.services/config/v1.1">
   <ServerInstance>
      [...]
      <Views>
         [...]
         <Charts>
            <DefaultWebImageType>html5</DefaultWebImageType>
            [...]
         </Charts>
         [...]
      </Views>
      [...]
   </ServerInstance>
</WebConfig>
If Presentation Services doesn’t come back up when you restart it after making this change, check the stdout logfile console~coreapplication_obips1~1.log as well as the standard sawlog.log file, both of which you’ll find in $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/. The reason to check the console log file is that Presentation Services will refuse to start if the configuration supplied is invalid, and you’ll see an error message stating this there.
NQS ODBC functions
One for the Neos amongst you: a quick call of NQSGetSQLProcedures (as seen in SampleApp dashboard 7.90 NQS ODBC Procedures) and a comparison with 11.1.1.7.150120 shows the following new & changed NQS ODBC calls. If this means nothing to you then it probably doesn’t need to, but if you’re interested in exploiting OBIEE functionality from all angles, documented or not, then these might be of interest. It goes without saying, these are entirely undocumented and unsupported, and completely liable to change or be removed at any time by Oracle.
- Added
  - NQSGetUserDefinedFunctions
  - NQSPAFIntegration
  - NQSSearchPresentationObjects
  - NQS_GetAllCacheEntries
  - NQS_GetOverallCacheInfo
  - NQS_GetRepositories
  - NQS_LoadNewBaseRP
  - NQS_LoadNewRPVersion
  - NQS_LockSessionAgainstAutoRPSwitchOver
  - NQS_SetRPDReadOnlyMode
  - NQS_SwitchOverThisSessionToNewRP
  - SAPurgeCacheBySubjectArea
  - SAPurgeCacheEntryByIDVector
  - SAPurgeXSACache
  - SASeedXSACache
- Modified
  - NQSGetQueryLogExcerpt (additional parameter)
  - SAPurgeInternalCache (additional enum)
- Removed
  - NQSChangeSelfPassword
Web Services
Web Services are one of the best ways to integrate with OBIEE programmatically. You don’t need to be building heavy Java apps just to use them – you can create and send the necessary SOAP messages from Python, or even just send them from bash with curl.
There are 2.5 new WSDLs – two new ones (v9, v10) plus v8 which has changed. The new services are:
- KPIAssessmentService
- ScorecardAssessmentService
- ScorecardMetadataService
- UserPersonalizationService
You’ll find documentation for the Web Services in the Integrator’s Guide.
User Image Upload
Users can now upload their own images for use in Title views, conditional formats, etc. From an administration point of view this means you’ll want to keep an eye on /root/shared/custom/images/ in the Presentation Catalog, either on disk or through the OBIEE Catalog view (switch to Admin and enable “Show Hidden Items”):
QUERY_LIMIT_WARNING_INSTEAD_OF_ERROR
This new setting in NQSConfig.ini will warn users when they’re breaching defined query limits, but it won’t abort the query.
Pointless hacks
If you’re a geek like me, part of the fun of a new tool is simply poking around and seeing what’s new – not necessarily what’s useful. There’s plenty of great new stuff in 11.1.1.9, but let’s take a look under the hood too, just Because Geek.
It was John Minkjan who first blogged several years ago about the xsd configuration schema files, and it is from these that we can find all the things that Presentation Services might be able to do – not just what it definitely can do, and not just what Oracle have documented that it can do. I wrote about some of these options a while back, and there are a few new ones in 11.1.1.9.
ALL OF THESE ARE COMPLETELY UNDOCUMENTED AND UNSUPPORTED. DO NOT USE THEM.
- EnableCloudBIEEHome sets the home page of OBIEE to be as it would be on BI Cloud Service (BICS). This is completely pointless since all the interesting stuff (Load Data, Model, Manage) is non-existent, even if it does give us a clue which application deployments are going to be supplying them (bimodeler and biserviceadministration respectively)
- GridViews/ShowDataModels outputs a bunch of debug data in Answers Table or Pivot Views:
- VirusScannerConfiguration – When a user uploads a custom image, this command will be called with it. For example, this simple script writes to a file the time and name of the file passed to it:
echo '---' >> /tmp/log.txt
date >> /tmp/log.txt
echo $1 >> /tmp/log.txt
If I save this as /tmp/test-script.sh and add it to instanceconfig.xml:
<VirusScannerConfiguration>
   <ScannerInvocationCommandLine>/tmp/test-script.sh</ScannerInvocationCommandLine>
</VirusScannerConfiguration>
When I upload an image I get a row written to my log file. That in itself isn’t useful, but it could be a handy hook maybe from an auditing point of view, or indeed, virus scanning:
[oracle@demo tmp]$ cat /tmp/log.txt
---
Wed May 20 16:01:47 EDT 2015
/app/oracle/biee/instances/instance1/tmp/OracleBIPresentationServicesComponent/coreapplication_obips1/defaultpool/sawserver_8673_5553759a_2-1.tmp
Some Oracle Big Data Discovery Tips and Techniques
I’ve been using Oracle Big Data Discovery for a couple of months now, and one of the sessions I’m delivering at this week’s Rittman Mead BI Forum 2015 in Atlanta is on Big Data Discovery Examples from the Field, co-presented with Tim Vlamis from Vlamis Software. Tim is going to concentrate on a data analysis and visualization example using a financial/trading dataset, and I’m going to look at some of the trickier, or less obvious, aspects of the BDD development process that we’ve come across putting together PoCs and demos for customers. I’ll start first with the data ingestion and transformation part of BDD.
There are two basic ways to get data into BDD’s DGraph engine; you can either use the Data Processing CLI command-line utility to sample, ingest and enrich Hive table data into the DGraph engine, or you can use the web-based data uploader to ingest data from a spreadsheet, text file or similar. For example, to load a Hive table called “bdd_test_tweets” into the DGraph engine using the command-line, you’d enter the commands:
[oracle@bigdatalite Middleware]$ cd BDD1.0/dataprocessing/edp_cli
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t bdd_test_tweets
Big Data Discovery would then read the Hive table metastore to get the table and column names, datatypes and file location, then spin-up a Spark job to sample, enrich and then load the data into the DGraph engine. If the Hive table has fewer than 1m rows the whole dataset gets loaded in, or the dataset is sampled if the number of Hive rows is greater than 1m. The diagram below shows the basic load, profile and enrich ingestion process.
There are a couple of things to bear in mind when you’re loading data into BDD in this way:
- You can only load Hive tables, not Hive views, as the Spark loading process only works with full table definitions in the Hive metastore (there’s a workaround sketch after this list)
- If your Hive table uses a SerDe other than the ones that ship with Base CDH5, you’ll need to upload the SerDe into BDD’s EDP JAR file area in HDFS and update some JAR reference files before the import will work, as detailed in Chapter 3 of the Big Data Discovery Data Processing Guide doc
- If you’ve installed BDD on a laptop or a smaller-than-usual Hadoop setup, you’ll need to make sure the SPARK_EXECUTOR_MEMORY value you set in the bdd.conf file when you installed the product can be handled by the Hadoop cluster – by default SPARK_EXECUTOR_MEMORY is set to 48G by the installer, but on my single laptop install the data ingestion process didn’t work until I reinstalled BDD with SPARK_EXECUTOR_MEMORY = 2G
- If you installed an early copy of BDD you might also need to change the OLT_HOME value in the /localdisk/Oracle/Middleware/user_projects/domains/bdd_domain/bin/setDomainEnv.sh file so that OLT_HOME="/opt/bdd/olt" instead reads OLT_HOME="/opt/bdd/edp/olt" – recent updates to the install files and installer correct this problem, but if it’s set wrong then the noun extraction part of the ingestion process won’t work either from the CLI, or from the BDD Studio Transformation screen
- There’s also no current way to refresh or reload a BDD DGraph dataset, apart from deleting it from BDD and then re-importing it. Hopefully this, and the lack of Kerberos support, will be addressed in the next release
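As for the Hive views restriction mentioned in the first point above, one possible workaround (a sketch only, with illustrative table and view names) is simply to materialise the view as a table in Hive and point the Data Processing CLI at that instead:

-- Materialise a Hive view as a table so the Data Processing CLI can ingest it
CREATE TABLE bdd_test_tweets_tbl AS
SELECT * FROM bdd_test_tweets_vw;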
Another thing you might want to consider when providing datasets for use with BDD is whether you leave quotes around the column values, and whether you pre-strip out HTML tags from any text. Take for example the text file below, stored in CSV-delimited format:
The file contains three comma-separated fields per line; one with the IP address of the requesting user, the others with the page title and page content snippet, all three fields having quotes around their values due to the presence of commas in the content text. Loading this data into Hive using the Hue web interface gives us a table with quotes around all of the fields, as Hue (in this release) doesn’t strip quotes from CSV fields.
When I ingest this table into BDD using the Data Processing CLI, I’ve got just these three columns, still with the quotes around the fields. I can easily remove the quotes by going into the Transformation screen and using Groovy transforms to strip the first and last characters from the fields, but this is more work for the user and I don’t benefit from the automatic enrichment that BDD can do when performing the initial ingest.
If, however, I replace the comma separator with a pipe symbol, and remove the quotes, like this:
and therefore use Hue’s ability to use pipe and other separators instead of commas (and quotes), my Hive table looks like this:
Now, when we ingest this table into BDD, we get six more derived attributes as the enrichment part of the ingestion process recognises the fields as containing IP addresses, text and so on. Presumably in the future BDD will have an option to ignore quotes around field values, but for now I tend to strip out the quotes and use pipes instead for my BDD ingestion files.
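For reference, a sketch of the Hive DDL for that pipe-delimited approach might look like the following – the table, column and HDFS path names are illustrative only:

-- External Hive table over the un-quoted, pipe-separated file, so BDD sees clean field values
CREATE EXTERNAL TABLE bdd_test_page_requests (
  host_ip      STRING,
  page_title   STRING,
  page_content STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/oracle/bdd_test_page_requests/';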
Similarly, with Hive tables that contain fields with HTML content you can just load those fields into BDD as-is, and BDD will generally extract nouns and keywords and create derived fields for those. And whilst you can run Groovy transformations to strip out the HTML tags (mostly), you’re then stuck with derived columns that include HTML tag names – img, h2 and so on – in the keywords list. What I tend to do then is re-export the dataset with the content field stripped of the HTML tags, then re-ingest that table so I get a new keyword field with the HTML tags removed. What would be simpler though would be to strip out the HTML tags before you load up the Hive table, so you don’t have to do this round-trip to get rid of the HTML tag names from the noun keyword lists that are automatically generated during the ingest enrichment process.
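A rough sketch of that pre-ingest clean-up in HiveQL is shown below; the regular expression is deliberately crude and the table and column names are illustrative only:

-- Strip HTML tags from the content field before the table is ingested into BDD
CREATE TABLE bdd_page_content_clean AS
SELECT host_ip,
       page_title,
       regexp_replace(page_content, '<[^>]+>', ' ') AS page_content
FROM   bdd_page_content_raw;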
Once you’ve got datasets loaded into BDD, something I didn’t quite get the value of when I first used BDD studio was the “scratchpad” feature. To take an example, in the masterclass session I bring in a table of tweets referencing Rittman Mead, and one of the attributes in the resulting BDD dataset is for the first hashtag mentioned in the tweet. I can select this attribute and click on the “Add to Scratchpad” link to add it into the BDD Studio scratchpad, like this:
The scratchpad then displays above the list of attributes for that dataset, and by default it shows a horizontal bar chart listing the number of times each hashtag in the dataset is referenced.
I could then, should I wish to, use the Refine By button to the left of the chart to filter-down (or “refine” in BDD and Endeca-speak) the chart to include just those tweets by a subset of Twitter users – in this case myself, Robin, Michael, Jerome and Edel.
I can also add other attributes to the scratchpad as well – for example, the Twitter handle for the person tweeting – so that we can turn the bar chart into a stacked bar chart with the Twitter handles used to show the breakdown of use of that hashtag by each of our staff.
You can also use these Scratchpad visualisations as the start of your main BDD Studio “Discover” dashboards, by pressing the Add to Discover page at the bottom right-hand corner of each visualization. In this way rather than creating your Discover dashboards from scratch each time, you can seed them with some starter graphs and data visualizations right from the dataset attribute views.
The last bit I’m going to talk about in the BI Forum session is “dataset views”; by default, each dataset you create within a BDD project has just its own attributes within it, and if you use one of them to create a visualization in the Discovery section, you’ll not be able to use any of the attributes from your other datasets (although the faceted search feature above every BDD Studio page searches all datasets in your project and in the BDD Catalog, just like the Endeca Information Discovery “value searches” that I talked about in this older article). To use attributes from more than one dataset in a BDD Studio visualisation component you need to join them, similar to how you’d join tables in the OBIEE RPD.
To take the Tweets, Page Views and Page Content datasets I use in the BI Forum masterclass, consider a situation where I’d like to list out all of the tweets that reference our website, along with details of the page title, page category and other attributes that I can get from a second dataset that I pull from the website itself. To link these two datasets together I join them in BDD Studio using their common URL attribute (in reality I had to massage the datasets so that both URLs featured a trailing forward-slash (“/”) to make them join properly, but that’s another story).
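If you’d rather do that massaging up-front in Hive than in BDD’s Transform screen, a sketch (again with illustrative column names) would be along these lines:

-- Ensure every URL ends with a trailing slash so the BDD Studio join matches on both sides
SELECT CASE WHEN url LIKE '%/' THEN url
            ELSE concat(url, '/')
       END AS url_normalised,
       tweet_text,
       author_handle
FROM   rm_linked_tweets;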
If I then go to the Data Views tab within the Project Settings BDD Studio screen, I can see that two data views have been set up for this join; one (“rm_linked_tweets – linked”) leads on the RM Linked Tweets dataset (the tweets) and returns the 1547 tweets in that first dataset joined to pages in the Site Content dataset, while the “site_content – linked” view starts from the 2229 records in the Site Content dataset and joins those records to the RM Linked Tweets dataset. You can then choose which one you want to use (or “drive off”) when you add components to the Discover dashboard part.
Where it gets interesting is when you add a third or fourth dataset to the join. The way you join in the third dataset affects the number of rows returned by the join; if I join the web server logs dataset (“access_per_post_cat_authors”) to the Site Contents dataset, the resulting three-way join view returns the 2229 rows driven by the entries in the Site Contents dataset, whereas if I join the tweets dataset to the web server logs dataset directly, so that the tweets dataset joins first to the site contents dataset and then separately to the web server logs dataset, like this:
then the resulting data view joining all three datasets returns a row count equal to the rows in the tweets dataset driving it all.
The best way to work this all out in your head is to do what I did: create a series of datasets with distinct row counts and join characteristics, then just test creating joins and view the resulting row count using the Preview button below the list of data views. To make things even more interesting you can choose, in the Discover page properties section, whether to left-outer join, equi-join or full-outer join a “primary” dataset used for the page with any dataset it’s joined to, in our instance determining whether the full set of tweets is filtered by the list of pages they refer to (removing tweets that reference non-existent RM web pages in this example), or whether all tweets are returned regardless.
It’s actually pretty powerful stuff, and you should resist the temptation to pre-create all your joined datasets in Hive beforehand rather than use BDD Studio joins; once you get your head around the concept it’s a flexible and straightforward way to join your datasets up in whatever way makes sense for your analysis – leading off website hits for one type of analysis, and pages referenced in tweets for another, allowing you to easily swap what would be the fact and dimension tables in a regular relational database report.
That’s it for now though; it’s the day before the Atlanta Rittman Mead BI Forum and I need to get myself to the venue and set up for tomorrow’s masterclass with Jordan Meyer. To those of you that are coming along to Atlanta, I look forward to seeing you; otherwise normal service will be resumed next week.
Final Agenda for the Rittman Mead BI Forum 2015 Atlanta – Running this Week!
It’s the Monday before the Atlanta Rittman Mead BI Forum 2015, and delegates will start to arrive in Atlanta tomorrow and during the rest of the week. The first event in Brighton was excellent, and we’re hoping for something even better in Atlanta this week!
Safe travels for everyone coming to Atlanta, the official Twitter hashtag for the event is #biforum2015, and in the meantime here’s the final agenda for Atlanta’s BI Forum 2015:
Rittman Mead BI Forum 2015
Renaissance Atlanta Midtown Hotel, Atlanta, GA, USA
Wednesday 13th May 2015
- 9.00am – 9.30am Registration for Masterclass attendees
- 9.30am – 12.30pm Masterclass Part 1
- 12.30pm – 1.30pm Lunch
- 1.30pm – 4.30pm Masterclass Part 2
- 5.30pm – 6.30pm Drinks Reception in Renaissance Midtown Hotel Atlanta
- 6.30pm – 7.30pm Oracle Keynote – Chris Lynskey
- 7.30pm – 9.30pm Dinner at Renaissance Midtown Hotel Atlanta
Thursday 14th May 2015
- 8.45am – 9.00am Welcome and Opening Comments
- 9.00am – 9.45am Chris Lynskey (Oracle Corporation) – Looking Ahead to Oracle BI 12c and Visual Analyzer
- 9.45am – 10.30am Robin Moffatt (Rittman Mead) – Smarter Regression Testing for OBIEE
- 10.30am – 11.00am Coffee
- 11.00am – 11.45am Chris Lynskey (Oracle Corporation) – Big Data Discovery
- 11.45am – 12.30pm Mark Rittman (Rittman Mead) and Tim Vlamis (Vlamis Software Solutions) – Big Data Discovery – Examples from the Field
- 12.30pm – 1.30pm Lunch
- 1.30pm – 2.30pm Day 1 Debate – “Self-Service BI – The Answer to Users’ Prayers, or the Path to Madness?”
- 2.30pm – 3.15pm Tim German / Cameron Lackpour – Hybrid Mode – An Essbase Revolution
- 3.15pm – 3.45pm Coffee
- 3.45pm – 4.30pm Kevin McGinley (Red Pill Analytics) – Agile BI Applications: A Case Study
- 6.00pm – 6.45pm Guest Speaker/Keynote – John Foreman – How Mailchimp used qualitative and quantitative analysis to build their next product
- 7.00pm – 7.45pm Depart for dinner at restaurant
- 8.00pm – 10.00pm Dinner at external venue
Friday 15th May 2015
- 09.00am – 09.45am Stewart Bryson (Red Pill Analytics) – Supercharge BI Delivery with Continuous Integration
- 09.45am – 10.30am Gabby Rubin (Oracle Corporation) – Solid Standing for Analytics in the Cloud
- 10.30am – 11.15am Hasso Schaap (Qualogy) – Developing strategic analytics applications on OBICS PaaS
- 11.15am – 11.30am Coffee
- 11.30am – 12.15pm Andy Rocha and Pete Tamisin (Rittman Mead) – OBIEE Can Help You Achieve Your GOOOOOOOOOALS!
- 12.15pm – 1.00pm Christian Screen (Sierra-Cedar) – 10 Tenets for Making Your Oracle BI Applications Project Succeed Like a Boss
- 1.00pm – 1.30pm Short Lunch
- 1.30pm – 2.30pm Data Visualization Bake-off
- 2.30pm – 3.15pm Sumit Sarkar (Progress Software) – NoSQL and OBIEE
- 3.15pm – 3.45pm Closing Remarks, and Best Speaker Award
If you’re interested in coming along to the Rittman Mead BI Forum 2015 in Atlanta, GA, there are still spaces available with details of the event here, and the registration form here – otherwise we’ll look forward to seeing you all at the Renaissance Atlanta Midtown Hotel later this week.
Just Under a Week to go Until the Atlanta BI Forum 2015 – Places Still Available!
The Rittman Mead Brighton BI Forum 2015 is now underway, with presentations from Oracle, Rittman Mead, partners and customers on a range of topics around Oracle BI, DW and Big Data. So far this week we’ve had a one-day masterclass from myself and Jordan Meyer on Delivering the Oracle Information Management & Big Data Reference Architecture, and sessions from Oracle on OBIEE 12c, the new SampleApp for OBIEE 11.1.1.9, Big Data Discovery, BI Cloud Service and Visual Analyzer. We’ve also had sessions from the likes of Emiel van Bockel, Steve Devine, Christian Screen and others on Exalytics, data visualization, Oracle BI Apps and other topics – and a very entertaining debate on self-service BI.
… and we’re doing it all again in Atlanta, GA next week! If you’re interested in coming along to the Rittman Mead BI Forum 2015 in Atlanta, GA, there are still spaces available with details of the event here, and the registration form here. We’re running BI Forum 2015 in the Renaissance Hotel Midtown Atlanta, the masterclass with myself and Jordan Meyer runs on the Wednesday, with the event itself kicking-off with a reception, meal and keynote from Oracle on Wednesday evening, followed by the main event itself starting Thursday morning. Hopefully we’ll see some of you there…!