Tag Archives: Big Data
Using the ELK Stack to Analyse Donor’s Choose Data
Donor’s Choose is an online charity in America through which teachers can post details of projects that need funding and donors can give money towards them. The data from the charity since it began in 2000 is available to download freely here in several CSV datasets. In this article I’m going to show how to use the ELK stack of data discovery tools from Elastic to easily import some data (the donations dataset) and quickly start analysing it to produce results such as this one:
I’m assuming you’ve downloaded and unzipped Elasticsearch, Logstash and Kibana, and made Java available if it isn’t already. I did this on a Mac, but the tools are cross-platform and should work just the same on Windows and Linux. I’d also recommend installing Kopf, which is an excellent plugin for managing Elasticsearch.
CSV Data Ingest with Logstash
First off we’re going to get the data in to Elasticsearch using Logstash, after which we can do some analysis using Kibana.
Importing the data with Logstash requires a configuration file, which in this case is pretty straightforward. We’ll use the file input plugin, process it with the csv filter, set the date of the event to the donation timestamp (rather than now), cast a few fields to numeric, and then output it using the elasticsearch plugin. See the inline comments for an explanation of each step:
input {
    file {
        # This is necessary to ensure that the file is
        # processed in full. Without it logstash will default
        # to only processing new entries to the file (as would
        # be seen with a logfile for a live application, but
        # not static data like we're working with here)
        start_position => beginning

        # This is the full path to the file to process.
        # Wildcards are valid.
        path => ["/hdd/ELK/data/opendata/opendata_donations.csv"]
    }
}

filter {
    # Process the input using the csv filter.
    # The list of column names I took manually from the
    # file itself
    csv {
        separator => ","
        columns => ["_donationid","_projectid","_donor_acctid","_cartid","donor_city","donor_state","donor_zip","is_teacher_acct","donation_timestamp","donation_to_project","donation_optional_support","donation_total","dollar_amount","donation_included_optional_support","payment_method","payment_included_acct_credit","payment_included_campaign_gift_card","payment_included_web_purchased_gift_card","payment_was_promo_matched","via_giving_page","for_honoree","donation_message"]
    }

    # Store the date of the donation (rather than now) as the
    # event's timestamp
    #
    # Note that the data in the file uses formats both with and
    # without the milliseconds, so both formats are supplied
    # here.
    # Additional formats can be specified using the Joda syntax
    # (http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html)
    date {
        match => ["donation_timestamp", "yyyy-MM-dd HH:mm:ss.SSS", "yyyy-MM-dd HH:mm:ss"]
    }

    # Cast the numeric fields to float (not mandatory but makes
    # for additional analysis potential)
    mutate {
        convert => ["donation_optional_support","float"]
        convert => ["donation_to_project","float"]
        convert => ["donation_total","float"]
    }
}

output {
    # Now send it to Elasticsearch which here is running
    # on the same machine.
    elasticsearch {
        host => "localhost"
        index => "opendata"
        index_type => "donations"
    }
}
With the configuration file created, we can now run the import:
./logstash-1.5.0.rc2/bin/logstash agent -f ./logstash-opendata-donations.conf
This will take a few minutes, during which your machine’s CPU will rocket as logstash processes all the records. Since logstash was originally designed for ingesting logfiles as they’re created, it doesn’t actually exit after it finishes processing the file, but you’ll notice your machine’s CPU return to normal, at which point you can hit Ctrl-C to kill logstash.
If you’ve installed Kopf then you can see at a glance how much data has been loaded:
Or alternatively query the index using Elasticsearch’s API directly:
curl -XGET 'http://localhost:9200/opendata/_status?pretty=true'

[...]
  "opendata" : {
    "index" : {
      "primary_size_in_bytes" : 3679712363,
    },
[...]
    "docs" : {
      "num_docs" : 2608803,
Note that Elasticsearch will take more space than the source data (in total the 1.2Gb dataset ends up taking c. 5Gb).
Data Exploration with Kibana
Now we can go to Kibana and start to analyse the data. From the Settings page of Kibana add the opendata index that we’ve just created:
Go to Discover and, if necessary, click the cog icon in the top right to set the index to opendata. The time filter defaults to the last 15 minutes only, and if your logstash has done its job right the events should have the timestamp of the actual donation, so you need to click on the time filter in the very top right of the screen to change the time period to, for example, Previous year. Now you should see a bunch of data:
Click the toggle on one of the events to see the full data for it, including things like the donation amount, the message with the donation, and geographical details of the donor. You can find details of all the fields on the Donor’s Choose website here.
Click on the fields on the left to see a summary of the data within, showing very easily that within that time frame and sample of 500 records:
- two thirds of donations were in the 10-100 dollar range
- four-fifths included the optional donation towards the running costs of Donor’s Choose.
You can add fields into the table itself (which by default just shows the complete row of data) by clicking on add for the fields you want:
Let’s save this view (known as a “Search”), since it can be used on a Dashboard later:
Data Visualisation with Kibana
One of my favourite features of Kibana is its ability to aggregate data at various dimensions and grains with ridiculous ease. Here’s an example: (click to open full size)
Now let’s amend that chart to show the method of donation, or the donation amount range, or both: (click to open full size)
You can also change the aggregation from the default “Count” (in this case, number of donations) to other aggregations including sum, median, min, max, etc. Here we can compare cheque (check) vs PayPal as a payment method in terms of amount given:
Kibana Dashboards
Now let’s bring the visualisations together along with the data table we saw in the Discover tab. Click on Dashboard, and then the + icon:
Select the visualisations that you’ve created, and then switch to the Searches tab and add in the one that you saved earlier. You’ve now got a data table showing all currently selected data, along with various summaries on it.
You can rearrange the dashboard by dragging each box around to suit. Once you’ve got the elements of the dashboard in place you can start to drill into your data further. To zoom in on a time period click and drag a selection over it, and to filter on a particular data item (for example, state in the “Top ten states” visualisation) click on it and accept the prompt at the top of the screen. You can also use the freetext search at the top of the screen (this is valid on the Discover and Visualize pages too) to search across the dataset, or within a given field.
Example Analysis
Let’s look at some actual data analyses now. One of the simplest is the amount given in donations over time, split by amount given to project and also as the optional support amount:
One of the nice things about Kibana is the ability to quickly change resolution in a graph’s time frame. By default a bar chart will use an “Auto” granularity on the time axis, updating as you zoom in and out so that you always see an appropriate level of aggregation. This can be overridden to show, for example, year-on-year changes:
You can also easily switch the layout of the chart, for example to show the percentage of the two aggregations relative to each other. So whilst the above chart shows the optional support amount increasing year by year, it’s actually remaining pretty much the same when taken as a percentage of the donations overall, which, if you look into the definition of the field (“we encourage donors to dedicate 15% of each donation to support the work that we do.”), makes a lot of sense.
Analysis based on text in the data is easy too. Using the Terms sub-aggregation, here we can see the top five states by donation amount, with California consistently at the top of the table.
Since the Terms sub-aggregation shows only the top x values, you can’t necessarily judge the importance of those values in relation to the rest of the data. For more specific analysis you can use the Filters sub-aggregation, which builds buckets from free-form searches; here, for example, to look at how much those from NY and CA donated vs all other states. The syntax is field:value to include a term and -field:value to negate it, and you can string these expressions together using AND and OR.
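As a quick illustration (using the donor_state field from the donations dataset; the exact bucket definitions here are my own), the two filter buckets for that comparison could be written as:

donor_state:NY OR donor_state:CA
-donor_state:NY AND -donor_state:CA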
A lot of the analysis generally sits well in the bar chart visualisation, but the line chart has a role to play too. Donations are grouped according to their value range (<10, between 10 and 100, >100), and these plot out nicely when considering the number of donations made (rather than their total value). Whilst the total donated in a time period is significant, so is the engagement with donors, hence the number of donations made is important to analyse:
As well as splitting lines and bars, you can split charts themselves, which works well when you want to start comparing multiple dimensions without cluttering up a single chart. Here’s the same chart as previously but split out with one line per instance. Arguably it’s clearer to understand, and the relative values of the three items can be better seen here than in the clutter of the previous chart:
Following on from this previous graph, I’m interested in the spike in mid-value ($10-$100) donations at the end of 2011. Let’s pull the graph onto a dashboard and dig into it a bit. I’ve saved the visualisation and brought it in with the saved Search (from the Discover page earlier) and an additional visualisation showing payment methods for the donations:
Now I can click and drag the time frame to isolate the data of interest and we see that the number of donations jumps eight-fold at this point:
Clicking on one of the data points drills into it, and we eventually see that the spike was attributable to the use of campaign gift cards, presumably issued with a value > $10 and < $100.
Limitations
The simplicity described in this article comes at a cost, or rather, has its limits. You may well notice fields in the input data such as “_projectid”, and if you wanted to relate a donation to a given project you’d need to go and look that project code up manually. There’s no (easy) way of doing this in Elasticsearch – whilst you can easily bring in all the project data too and search on projectid, you can’t (easily) display the two (project and donation) alongside each other. That’s because Elasticsearch is a document store, not a relational database. There are some options discussed on the Elasticsearch blog for handling this, none of which to my mind are applicable to this kind of data discovery (but Elasticsearch is used in a variety of applications, not just as a data store for Kibana, so in other cases they are more relevant). Given that, if you wanted to resolve this relationship you’d have to go about it a different way, maybe using the Linux join command to pre-process the files and denormalise them prior to ingest with logstash (see the sketch below). At this point you reach the “right tool/right job” decision – ELK is great, but not for everything :-)
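To sketch what that pre-processing might look like (a hypothetical example: it assumes the projects file is called opendata_projects.csv with _projectid as its first column, and it glosses over the quoted commas that a real CSV parse would need to handle):

# Sort both files on the join key (_projectid is column 2 of
# the donations file, and assumed column 1 of the projects file)
sort -t, -k2 opendata_donations.csv > donations_sorted.csv
sort -t, -k1 opendata_projects.csv > projects_sorted.csv

# Join them into one denormalised file ready for logstash
join -t, -1 2 -2 1 donations_sorted.csv projects_sorted.csv > donations_denormalised.csv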
Reprocessing
If you need to reload the data (for example, when building this I reprocessed the file in order to define the numerics as such, rather than the default string), you need to:
- Drop the Elasticsearch data:
curl -XDELETE 'http://localhost:9200/opendata'
- Remove the “sincedb” file that logstash uses to record where it last read from in a file (useful for tailing changing input files; not so for us with a static input file)
rm ~/.sincedb*
- Rerun logstash as above

Better here would be to define a bespoke sincedb path in the file input parameters, so that we could delete a specific sincedb file without impacting other logstash processing that may be using sincedb in the same path.
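A minimal sketch of that amended file input (sincedb_path is a real option of the logstash file input; the path itself is just an example):

input {
    file {
        start_position => beginning
        path => ["/hdd/ELK/data/opendata/opendata_donations.csv"]
        # Keep this import's bookkeeping in its own file so it
        # can be deleted in isolation when reprocessing
        sincedb_path => "/hdd/ELK/data/opendata/donations.sincedb"
    }
}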
Oracle Data Integrator Enterprise Edition Advanced Big Data Option Part 1- Overview and 12.1.3.0.1 install
Oracle recently announced the Oracle Data Integrator Enterprise Edition Advanced Big Data Option as part of the new 12.1.3.0.1 release of ODI. It brings some great new functionality for working with the Hadoop ecosystem. Let’s have a look at the new features and how to install it on the Big Data Lite 4.1 Virtual Machine.
Note that some of these new features, for example Pig and Spark support and the use of Oozie, require the new ODI EE Advanced Big Data Option license on top of base ODI EE.
Pig and Spark support
So far ODI 12c has allowed us to use Hive for any Hadoop-based transformation. With this new release, we can now use Pig and Spark as well. Depending on the use case, we can choose whichever technology gives better performance, and switch from one to another with very few changes. That’s the beauty of ODI: all you need to do is create the logical dataflow in your mapping and choose your technology. There is no need to be a Pig Latin expert or a PySpark ninja; all of that code will be generated for you! These two technologies are now available in the Topology, along with the Hadoop Data Server to define where the data lies. You can also see some Loading Knowledge Modules for Pig and Spark.
Pig, as Mark wrote before, is a dataflow language, which makes it a natural fit for the new “flow paradigm” introduced in ODI 12c. The idea is to write a data pipeline in Pig Latin; under the covers, that code creates MapReduce jobs which are then executed.
Quoting Mark one more time, Spark is a cluster processing framework that can be used from different programming languages, the two most common being Python and Scala. It supports operations like filters, joins and aggregates, all of which can be done in-memory, which can give far better performance than MapReduce. The ODI team chose Python as the programming language for Spark, so the Knowledge Modules generate PySpark.
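To give a flavour of the sort of code involved, here’s a minimal hand-written PySpark sketch (my own illustration, not what ODI actually generates; it re-uses the donations CSV from the ELK example above, assumes no quoted commas in the fields used, and the HDFS path is made up):

from pyspark import SparkContext

sc = SparkContext(appName="DonationsByState")

# Read the CSV file from HDFS as an RDD of lines
lines = sc.textFile("hdfs:///data/opendata/opendata_donations.csv")

# Naive comma split, then filter, project and aggregate in memory:
# total donation amount per donor state
rows = lines.map(lambda l: l.split(","))
totals = (rows
          .filter(lambda r: r[0] != "_donationid")  # skip the header row, if present
          .filter(lambda r: r[5] != "")              # drop rows with no donor_state
          .map(lambda r: (r[5], float(r[11])))       # (donor_state, donation_total)
          .reduceByKey(lambda a, b: a + b))

for state, total in totals.take(10):
    print(state, total)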
New Hive Driver and LKMs
This release also brings significant improvements to the existing Hive technology. A new driver has been introduced under the name DataDirect Apache Hive JDBC Driver. It is actually the WebLogic Hive JDBC driver, and it aims to improve both performance and stability.
New Knowledge Modules are introduced to take advantage of this new driver, and they are LKMs instead of the multi-connection IKMs used previously. Thanks to that, they can be combined with other LKMs in the same mapping, which was not possible before.
Oozie Agent
Oozie is another Apache project, defined as “a workflow scheduler system to manage Apache Hadoop jobs”. We can create workflows of different jobs in the Hadoop stack, and then schedule them for a certain time or trigger them when data becomes available.
What Oozie does is similar to the role of the ODI agent, and it’s now possible to use an existing Oozie engine directly instead of deploying a standalone agent on the Hadoop cluster.
The Oozie engine will do what your ODI agent usually does (execution, scheduling, monitoring), but it is integrated into the Hadoop ecosystem. So we can schedule and monitor our ODI jobs in the same place as all the other Hadoop jobs we run outside of ODI, and Oozie can also automatically retrieve the Hadoop logs. It also lowers the footprint, because no ODI-specific component needs to be installed on the cluster. However, according to the white paper (link below), it looks like Load Plans are not supported, so the idea would be to execute Load Plans with a standalone or JEE agent that delegates the execution of Big Data-related scenarios to the Oozie engine.
HDFS support in file-related ODI Tools
Most of the ODI tools that handle files can now also work against HDFS. So you can delete, move and copy files and folders; you can also append files and transfer them to HDFS via FTP, and it’s even possible to detect when a file is created on HDFS. All you need to do is indicate your Hadoop Logical Schema for the source, target or both. In the following example I’m copying a file from the Unix filesystem to HDFS.
I think this is a huge step forward. If we want to use ODI 12c for our Hadoop data integration, it must be able to do everything end-to-end; maintenance and administrative tasks such as archiving, deleting or copying should also be done using ODI. Previously it was a bit tedious: we had to create a shell script using hdfs dfs commands and then launch it using the OdiOsCommand tool. Now we can use the file tools directly in a package or a procedure!
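For comparison, the old approach meant wrapping hdfs dfs commands like these in a shell script (the paths here are made up for illustration) and invoking it with OdiOsCommand:

# Archive the landed files, then clean up the landing folder
hdfs dfs -mkdir -p /archive/donations/2015-04-01
hdfs dfs -cp /landing/donations/*.csv /archive/donations/2015-04-01/
hdfs dfs -rm /landing/donations/*.csv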
New mapping components: Jagged and Flatten
These two new components can be used in a Big Data context, but also in your traditional data integration. The first one, Jagged, pivots a set of key-value pairs into columns holding their values.
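As a rough illustration with made-up sample data, Jagged turns key-value rows like those on the left into one row per id with a column per key:

id, key,  value          id  name   city
1,  name, Alice          1   Alice  Boston
1,  city, Boston   =>    2   Bob    (null)
2,  name, Bob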
The Flatten component can be used with complex files that have nested attributes, such as JSON. Using a Flatten component will generate additional rows where needed, to extract the different values of an attribute nested within another attribute.
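Again with made-up data: given a record with a nested repeating attribute, Flatten emits one output row per nested value:

{"order": 1, "items": [{"sku": "A"}, {"sku": "B"}]}

order  sku
1      A
1      B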
You can see the detail of all the new features in the white paper “Advancing Big Data Integration” for ODI 12c.
How to install it?
This patch must be applied on top of an existing Oracle Data Integrator 12.1.3.0.0 installation. It is not a bundled patch and relates only to the Big Data Option, so there is no point installing it if you don’t need its functionality. Also, make sure you are licensed for the ODI EE Advanced Big Data Option if you plan to use the Spark or Pig technologies/KMs or to execute your jobs using the Oozie engine.
To showcase this, I used the excellent (and free!) Big Data Lite 4.1 VM, which already has ODI 12.1.3 and all the Hadoop components we need. So this example will be on an Oracle Enterprise Linux environment.
The first step is to download the patch from OTN or My Oracle Support. Also make sure you close ODI Studio and shut down the agents. The README then recommends updating OPatch and checking the OUI, so let’s do that, set some environment variables, and unzip the ODI patch.
[oracle@bigdatalite ~]$ mkdir /home/oracle/bck
[oracle@bigdatalite ~]$ ORACLE_HOME=/u01/ODI12c/
[oracle@bigdatalite ~]$ cd $ORACLE_HOME
[oracle@bigdatalite ODI12c]$ unzip /home/oracle/Desktop/p6880880_132000_Generic.zip -d $ORACLE_HOME
[oracle@bigdatalite ODI12c]$ OPatch/opatch lsinventory -jre /usr/java/latest/
[oracle@bigdatalite ODI12c]$ export PATH=$PATH:/u01/ODI12c/OPatch/
[oracle@bigdatalite ODI12c]$ unzip -d /home/oracle/bck/ /home/oracle/Desktop/p20042369_121300_Generic.zip
[oracle@bigdatalite ODI12c]$ cd /home/oracle/bck/
This patch is actually composed of three pieces. One of them, the second, is only needed if you have an enterprise installation; if you have a standalone install, you can just skip it. Note that I always specify the JRE to be used by OPatch, to be sure everything works fine.
[oracle@bigdatalite bck]$ unzip p20042369_121300_Generic.zip
[oracle@bigdatalite bck]$ cd 20042369/
[oracle@bigdatalite 20042369]$ opatch apply -jre /usr/java/latest/
[oracle@bigdatalite 20042369]$ cd /home/oracle/bck/

// ONLY FOR ENTERPRISE INSTALL
//[oracle@bigdatalite bck]$ unzip p20674616_121300_Generic.zip
//[oracle@bigdatalite bck]$ cd 20674616/
//[oracle@bigdatalite 20674616]$ opatch apply -jre /usr/java/latest/
//[oracle@bigdatalite 20674616]$ cd /home/oracle/bck/

[oracle@bigdatalite bck]$ unzip p20562777_121300_Generic.zip
[oracle@bigdatalite bck]$ cd 20562777/
[oracle@bigdatalite 20562777]$ opatch apply -jre /usr/java/latest/
Now we need to run the upgrade assistant that will execute some scripts to upgrade our repositories. But in Big Data Lite the tables of the repository have been compressed, so we first need to uncompress them and rebuild the invalid indexes, as David Allan pointed out on Twitter. Here are the SQL queries that will generate the DDL statements you need to run if you are also using the Big Data Lite VM:
select 'alter table '||t.owner||'.'||t.table_name||' move nocompress;' q
from   all_tables t
where  owner = 'DEV_ODI_REPO'
and    table_name <> 'SNP_DATA';

select 'alter index '||owner||'.'||index_name||' rebuild tablespace '||tablespace_name||';'
from   all_indexes
where  owner = 'DEV_ODI_REPO'
and    status = 'UNUSABLE';
Once that’s done we can start the upgrade assistant:
[oracle@bigdatalite 20562777]$ cd /u01/ODI12c/oracle_common/upgrade/bin
[oracle@bigdatalite bin]$ ./ua
The steps are quite straightforward, so I’ll leave them to you. Here I selected Schemas, but if you have a standalone agent you will have to run it again and select “Standalone System Component Configurations” to upgrade the domain as well.
Before opening ODI Studio, we will clear the JDeveloper cache to make sure everything looks right.
[oracle@bigdatalite bin]$ rm -rf /home/oracle/.odi/system12.1.3.0.0/
We can now open ODI Studio. Don’t worry that the version mentioned there and in the upgrade assistant is still 12.1.3.0.0; if you can see the new features, it has been installed properly.
The last step is to go into the Topology and change the driver used for all the Hive Data Servers. As all the new LKMs use the new WebLogic driver, we simply select “DataDirect Apache Hive JDBC Driver” instead of the existing Apache driver, and update the JDBC URL to match.
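For reference, the DataDirect driver class and URL take this general form (the hostname, port and database name are just examples; check the documentation for your environment):

weblogic.jdbc.hive.HiveDriver
jdbc:weblogic:hive://bigdatalite:10000;DatabaseName=default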
And that’s it, we can now enjoy all the new Big Data features in ODI 12c! A big thanks to David Allan and Denis Gray for their technical and licensing help. Stay tuned as I will soon publish a second blog post detailing some features.
Previewing Four Sessions at the Atlanta Rittman Mead BI Forum 2015
In a post earlier this week I previewed three sessions at the upcoming Brighton Rittman Mead BI Forum 2015; in this post I’m going to look at four particularly interesting sessions at the Atlanta Rittman Mead BI Forum 2015 event running the week after Brighton, on May 13th-15th 2015 at the Renaissance Atlanta Midtown Hotel, Atlanta GA. As well as an optional one-day masterclass on big data development by myself and Jordan Meyer on the 13th, the main event itself has keynotes and product update sessions from Oracle’s BI product management team, a data visualisation challenge and a guest talk by John Foreman, author of the book “Data Smart” and Chief Data Scientist at Mailchimp; in terms of the main sessions though there are four that I’m particularly interested in, starting with one by a speaker new to the BI Forum, Qualogy’s Hasso Schaap, who’ll be talking to us about their use of Oracle’s new BI Cloud Service in his session “Developing strategic analytics applications on OBICS PaaS”:
“In this session I’ll tell how we use the Oracle BI Cloud Service in our development plans for a strategic analytics application. Focussing on Strategic HR Planning there’s so much you can do with your data that we decided to put it in a packaged app. I will discuss the important parts of the development process and show how we fixed the issues we came up with. Developing in the BI Cloud is different and expectations are also different.
As an example there’s the part of prediction. How do we predict based on data in the BI Cloud and what are other possibilities. With prediction we were able to tell our customers a different story. A story that was different than before using old-school tools and techniques. In this session I will uncover some of the most appreciated functionality and will happily elaborate on the story behind ‘The present, the future, development and scenario planning’.”
My second featured session is by someone very-well known to previous BI Forum attendees, and to the wider Oracle BI+DW community: Stewart Bryson. Stewart of course used to head-up Rittman Mead in the US and then went-on to become our first Chief Innovation Officer, before leaving to start his own company Red Pill Analytics with Kevin McGinley, another old friend of Rittman Mead and the BI Forum. We’re very pleased to have both Stewart and Kevin delivering sessions at the Atlanta BI Forum, and for Stewart’s session he’s talking about something very close to his heart – “Supercharging BI Delivery with Continuous Integration”:
“One of the things I’ve never understood about the lifecycle features in most BI tools is why the designers feel the need to roll their own source control and DevOps features. Instead of focusing on deeper integration with tools and processes that exist in the other 90% of development paradigms, BI vendors instead start with a clean palette and create something completely siloed and desperately alone.
In this presentation, we’ll take a look at how some of these other development paradigms approach DevOps — paying perhaps the closest attention to the world of Java development and other JVM languages. We’ll see how approaches such as continuous integration and continuous delivery play a part in rapid, iterative delivery, and how we can apply some of those approaches to the world of OBIEE development.”
My third session is by another speaker new to the BI Forum, but someone who’s well-known in the BI and data warehousing world and who I met in person for the first time at last year’s Oracle Openworld: Sumit Sarkar. Sumit works for Progress Software, makers of the DataDirect ODBC drivers that power OBIEE’s connection to Hadoop, for example, as well as connectors to MongoDB, Salesforce, Oracle RightNow and Eloqua. As he’ll explain in his session “Make sense of NoSQL data using OBIEE”:
“NoSQL databases have stormed the top 10 db-engines rankings with MongoDB at #4 and Cassandra at #8. It’s inevitable that these NoSQL databases, storing unstructured data without a standard query language, will have BI requirements for unarmed OBIEE teams. Not even a complete Oracle stack can save you with the release of Oracle NoSQL. This will be the first session of its kind to tackle standards-based NoSQL connectivity.
So join me at BI Forum ’15 to take control of NoSQL data with your RPD and expand big data skills and thought leadership within your organization. Learn how organizations are using SQL access to NoSQL databases for integration across existing business intelligence platforms. We’ll talk about common challenges and gotchas that shops are facing when exposing unstructured NoSQL data to OBIEE. It can get out of hand pretty quickly otherwise …”
My final selection is from CERN, the European Organization for Nuclear Research and home of course of the Large Hadron Collider (and who announced on April 1st the first unequivocal evidence for The Force, almost upstaging our announcement of Oracle E-Business Suite being ported to Hadoop and MongoDB). There are several sessions at both the Brighton and Atlanta BI Forums on Oracle’s new Big Data Discovery tool, and in this session CERN’s Manuel Martin Marquez will be talking about their work in this area, in his session “Governed Information Discovery: Data-driven decisions for more efficient operations at CERN”:
“The European Centre for Nuclear Research, CERN, is running the world’s largest and most powerful particle accelerator complex in order to shed light on how the Universe works and what its main building blocks are. CERN’s particle accelerator and detector infrastructure is comprehensively heterogeneous and complex. A number of critical subsystems, which represent cutting-edge technology in several engineering fields, need to be considered: cryogenics, power converters, magnet protection, etc. The historical monitoring and control data derived from these systems has been persisted mainly using Oracle database technologies, but also in other sorts of data formats such as JSON, XML and plain text files. All of these must be integrated and combined in order to provide a full picture and better understanding of the overall status of the accelerator complex.
Therefore, a key challenge is to facilitate easy access to, flexible interaction with, and dynamic visualization of heterogeneous data from different sources and domains. In our session, we will share our experience with a potential solution for finding insights within our data, Oracle Endeca Data Discovery. In addition, we will feature practical examples relating to future possibilities for improving the control and monitoring of CERN’s accelerator complex, optimization results for accelerator operations, and a demo of the implemented solution.”
Full agenda details on the Atlanta Rittman Mead BI Forum 2015 can be found on the event homepage, along with details of the optional one-day masterclass on Delivering the Oracle Information Management and Big Data Reference Architecture, and our first-ever Data Visualisation Bake-Off, using the DonorsChoose.org dataset. Registration is now open and the event takes place between May 13th and 15th 2015, at the Renaissance Atlanta Midtown Hotel, Atlanta GA.