
Big Data Architecture – Frequently Asked Questions (FAQs)

by Antony Heljula

The aim of this article is to help provide a common understanding of what "Big Data" is all about, especially in relation to Oracle technologies.   A lot of organisations are now talking about big data and we are often asked to clarify what it actually means and, ultimately, why companies would want to consider going down that route.

Please use the form at the bottom to leave your own views on this or if you have some additional FAQs.

1) What is a Big Data Architecture?

2) Why would you want to store and process data in a file system?

3) What is "un-structured" data?

4) How do you report on un-structured data?

5) What is Hadoop and NoSQL?

6) Can Oracle BI (OBIEE) report directly against Hadoop and NoSQL?

7) This all sounds great!  Shall we get rid of our Data Warehouse then and just use Hadoop?

8) How do I get data out of Hadoop and into an Oracle Database?

9) What is Oracle Big Data Discovery?


1) What is a Big Data Architecture?

Generally speaking, a Big Data Architecture is one which involves the storing and processing of data in a file system rather than a relational database.   In most cases, this provides a mechanism to summarise and/or clean data prior to loading it into a database for further analysis.


2) Why would you want to store and process data in a file system?

Historically, companies have adopted relational databases to store and process their data.  However, two issues are creeping up on a number of organisations:

a) Data volumes are growing exponentially, and it is becoming ever more costly to store all your data in a relational database.   Storing and processing data in a file system is relatively cheap in comparison, and is highly scalable.

b) To gain a competitive edge, organisations need to bring a greater variety of data sources together to perform meaningful analysis.   Relational databases have always been good at analysing "structured" data, but they often have limitations when dealing with "un-structured" data sources.


3) What is "un-structured" data?

Relational databases store data in a very structured format - they contain tables with records that have a fixed number of columns with specific data types, i.e. the database conforms to a defined data model.
"Un-structured" data sources are ones where the data-model is not so well defined.   For example, consider the following:

- An employee's CV
- Social media data e.g. a Twitter feed
- A customer's review
- A server log file

It is probably more accurate to say "semi-structured" data, since all data surely has some structure even if the structure is quite vague or complex!   But either way, one purpose of Big Data is to provide a mechanism for making use of your un-structured data sources.  This often means summarising it, making it more structured and then combining it with other structured data sources for further analysis.
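
For instance, a tweet arrives as a JSON document: semi-structured, but with enough structure that a few useful fields can be pulled out into rows and columns. A small sketch of that idea, assuming the raw statuses have been saved one JSON object per line to tweets.json and that the jq utility is installed (both are assumptions for illustration):

# Pull the tweet id, author and text out of the raw Twitter JSON
# and emit them as CSV rows, ready for loading into a database
jq -r '[.id_str, .user.screen_name, .text] | @csv' tweets.json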

4) How do you report on un-structured data?

In a relational database world you can't (or it is quite difficult).  You have to convert your un-structured sources to a more structured format first before you can report on them.    For example:

- Key word extraction:  Picking out the common terms or words mentioned in Twitter feeds or CVs (a small sketch follows the lists below)
- Sentiment Analysis:   Determining whether the sentiment in a phrase or paragraph is "positive" or "negative"
- Log Parsing:          Parsing log files to extract error messages and other useful messages
- Entity Extraction:    Identifying nouns, phone numbers and addresses in textual data

These processes would be useful for the following types of Business Intelligence query:

- How many employees do we have who can speak German?
- How many customers in each country have given us negative feedback in the last week?
....and so on
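
As a deliberately simple sketch of key word extraction (the file path and the keyword list are invented for illustration), a structured count of language skills could be produced from plain-text CVs with nothing more than a shell pipeline, ready to be loaded into a database table:

# Count how often each language keyword is mentioned across all CVs
grep -HioE 'german|french|spanish|mandarin' ./cvs/*.txt \
  | awk -F: '{print tolower($2)}' \
  | sort | uniq -c | sort -rn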

5) What is Hadoop and NoSQL?

Apache Hadoop is widely regarded as the main foundation of a Big Data Architecture.   Hadoop is open source and provides a file system that allows you to store huge volumes of data, with support for distributed processing across clusters of computers.  It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.    It is "highly available", so losing any single Hadoop node will result in zero data loss and processing can continue unaffected.

NoSQL (Not Only SQL) is a type of database that can store data in different formats to the standard "structured" schemas used with relational databases, such as "key-value" (KV) pair format.   Just as a basic example, here is how relational and KV formats can differ when storing a person's Id, Name, Date of Birth and Company:

Relational:

(1234,John Smith,1976-05-14,Oracle Corporation)

Key-Value Pair:

(1234,Name,John Smith)
(1234,DoB,1976-05-14)
(1234,Company,Oracle Corporation)

Key-value pair format is very useful when the number of columns of information is extremely large or not known.   As an example, with Twitter feeds the number of pieces of information that is supplied with each Tweet could vary (some users will allow their Lat/Long geo-locations to be made public whilst others do not).


6) Can Oracle BI (OBIEE) report directly against Hadoop and NoSQL?

Hadoop: yes.  It is possible for Oracle BI to query Hadoop structures using a Hadoop utility such as Hive.   Hive makes it possible to present those structures as relational objects, so you can report against them using standard SQL commands via a JDBC connection.
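
To give a feel for this, the query below is the kind of HiveQL that gets pushed down to Hadoop; here it is run by hand with beeline, Hive's JDBC command-line client (the host name and the tweets table are hypothetical):

# Run a HiveQL query over JDBC, much as OBIEE would
beeline -u jdbc:hive2://hadoopnode1:10000/default \
        -e "SELECT sentiment, COUNT(*) AS tweet_count FROM tweets GROUP BY sentiment;"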

Oracle NoSQL: no.  It is not possible for Oracle BI to report directly against Oracle NoSQL.   You will need to write scripts to extract the data from NoSQL so that it can then be consumed by a relational database.


7) This all sounds great!  Shall we get rid of our Data Warehouse then and just use Hadoop?

No, you don't want to go down this route.   When you run queries against Hadoop they are essentially batch processes that can run massively in parallel.   Whilst this is extremely useful, you won't get the response times and levels of concurrency that are delivered by a relational database.  Perhaps you can think of it this way:

- Hadoop is designed for huge batch queries, but only a small number of them taking place at any one time
- A relational database is designed for mixed workloads with many small/medium/large processes all happening at the same time

8) How do I get data out of Hadoop and into an Oracle Database?

Oracle provide "Big Data Connectors" that enable Oracle Data Integrator (ODI) to extract/load data between Hadoop/NoSQL and an Oracle Database.    These connectors require additional licenses but are relatively low cost (and anyone can use them).

Oracle also provides "Big Data SQL" which enables you to create "external tables" in the Oracle Database that present Hadoop structures as standard Oracle Database tables.   So you can run any type of database SQL query against a table and the processing will all be done on the Hadoop file system.  This facility however is only available for customers who have purchased an Oracle Big Data Appliance (BDA).
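
As a rough sketch of what this looks like (the object names are invented, and the exact ACCESS PARAMETERS syntax should be checked against the Big Data SQL documentation for your release), an external table over a Hive table can be created and then queried with ordinary database SQL:

# Illustrative only - requires the Big Data SQL option and a BDA
sqlplus bi_user/password@orcl <<'SQL'
CREATE TABLE tweets_ext (
  tweet_id   NUMBER,
  country    VARCHAR2(100),
  tweet_text VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY default_dir
  ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.tweets)
)
REJECT LIMIT UNLIMITED;

-- Ordinary SQL against the external table; the scan work is offloaded to Hadoop
SELECT country, COUNT(*) FROM tweets_ext GROUP BY country;
SQL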


9) What is Oracle Big Data Discovery?

Historically, one of the issues with a Big Data Architecture is that you don't know what your data will look like until you've extracted it, loaded it into a relational database and then built some reports.

Oracle Big Data Discovery overcomes this issue by building graphs and other visualisations directly against the structures in Hadoop.   The benefit is that it complements your existing Business Intelligence tools by enabling you to explore your data (summarise, join, transform etc.) at source, to see whether it contains any value and to assist with defining further reporting and processing requirements.

Please see the Oracle web-site for more details:  https://www.oracle.com/big-data/big-data-discovery/index.html

 

Deploy another repository

To deploy another repository, follow these steps.

  • log in to Enterprise Manager
  • click Business Intelligence – coreapplication
  • click Deployment – Repository

deploy rpd 2

  • click Lock and Edit Configuration

deploy rpd 3

  • click Close

deploy rpd 1

  • click Browse…
  • select the repository file (rpd)
  • click OK

deploy rpd 5

  • enter Repository Password
  • enter Confirm Password

deploy rpd 6

  • click Apply

deploy rpd 7

  • click Activate Changes

deploy rpd 8

  • click Close

deploy rpd 9

The new rpd is now deployed.

The post Deploy another repository appeared first on OBIEE 24/7 | Oracle Business Intelligence.

No fact table exists at the requested level of detail

Problem: Display Error in report

Screenshot:

No fact table exists at the requested level of detail

Screenshot text:

State: HY000. Code: 10058. [NQODBC] [SQL_STATE: HY000] [nQSError: 10058] A general error has occurred. [nQSError: 43113] Message returned from OBIS. [nQSError: 43119] Query Failed: [nQSError: 14025] No fact table exists at the requested level of detail: [,,,,,,,[NAME.NAME],,,,]. (HY000)

Possible cause:

The physical tables were originally dragged from the Physical layer to the Business Model layer, so the logical join was created automatically. The physical join was later changed, but the corresponding join in the Business Model was not updated automatically.

Possible solution:

  • open Administration Tool
  • open repository
  • show Business Model diagram
  • remove logical join
  • add logical join again
  • check in
  • refresh report

The error message is no longer displayed.

The post No fact table exists at the requested level of detail appeared first on OBIEE 24/7 | Oracle Business Intelligence.

Using the ELK Stack to Analyse Donor’s Choose Data

Donor’s Choose is an online charity in America through which teachers can post details of projects that need funding and donors can give money towards them. The data from the charity since it began in 2000 is available to download freely here in several CSV datasets. In this article I’m going to show how to use the ELK stack of data discovery tools from Elastic to easily import some data (the donations dataset) and quickly start analysing it to produce results such as this one:

I’m assuming you’ve downloaded and unzipped Elasticsearch, Logstash and Kibana, and made Java available if it isn’t already. I did this on a Mac, but the tools are cross-platform and should work just the same on Windows and Linux. I’d also recommend installing Kopf, which is an excellent plugin for the management of Elasticsearch.

CSV Data Ingest with Logstash

First off we’re going to get the data in to Elasticsearch using Logstash, after which we can do some analysis using Kibana.

To import the data with Logstash requires a configuration file which in this case is pretty straightforward. We’ll use the file input plugin, process it with the csv filter, set the date of the event to the donation timestamp (rather than now), cast a few fields to numeric, and then output it using the elasticsearch plugin. See inline comments for explanation of each step:

input {  
    file {  
        # This is necessary to ensure that the file is  
        # processed in full. Without it logstash will default  
        # to only processing new entries to the file (as would  
        # be seen with a logfile for a live application, but  
        # not static data like we're working with here)  
        start_position  => beginning  
        # This is the full path to the file to process.  
        # Wildcards are valid.  
        path =>  ["/hdd/ELK/data/opendata/opendata_donations.csv"]  
    }
}

filter {  
        # Process the input using the csv filter.  
        # The list of column names I took manually from the  
        # file itself  
        csv {separator => ","  
                columns => ["_donationid","_projectid","_donor_acctid","_cartid","donor_city","donor_state","donor_zip","is_teacher_acct","donation_timestamp","donation_to_project","donation_optional_support","donation_total","dollar_amount","donation_included_optional_support","payment_method","payment_included_acct_credit","payment_included_campaign_gift_card","payment_included_web_purchased_gift_card","payment_was_promo_matched","via_giving_page","for_honoree","donation_message"]}

        # Store the date of the donation (rather than now) as the  
        # event's timestamp  
        # 
        # Note that the data in the file uses formats both with and  
        # without the milliseconds, so both formats are supplied  
        # here.  
        # Additional formats can be specified using the Joda syntax  
        # (http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html)  
        date { match => ["donation_timestamp", "yyyy-MM-dd HH:mm:ss.SSS", "yyyy-MM-dd HH:mm:ss"]}  
        # ------------
        # Cast the numeric fields to float (not mandatory but makes for additional analysis potential)
        mutate {
        convert => ["donation_optional_support","float"]
        convert => ["donation_to_project","float"]
        convert => ["donation_total","float"]
        }
}

output {  
        # Now send it to Elasticsearch which here is running  
        # on the same machine.  
        elasticsearch { host => "localhost" index => "opendata" index_type => "donations"}  
        }

With the configuration file created, we can now run the import:

./logstash-1.5.0.rc2/bin/logstash agent -f ./logstash-opendata-donations.conf

This will take a few minutes, during which your machine CPU will rocket as logstash processes all the records. Since logstash was originally designed for ingesting logfiles as they’re created it doesn’t actually exit after finishing processing the file, but you’ll notice your machine’s CPU return to normal, at which point you can hit Ctrl-C to kill logstash.

If you’ve installed Kopf then you can see at a glance how much data has been loaded:

Or alternatively query the index using Elasticsearch’s API directly:

curl -XGET 'http://localhost:9200/opendata/_status?pretty=true'

[...]  
    "opendata" : {  
      "index" : {  
        "primary_size_in_bytes" : 3679712363,  
      },  
[...]  
      "docs" : {  
        "num_docs" : 2608803,

Note that Elasticsearch will take more space than the source data (in total the 1.2GB dataset ends up taking c. 5GB).

Data Exploration with Kibana

Now we can go to Kibana and start to analyse the data. From the Settings page of Kibana add the opendata index that we’ve just created:

Go to Discover and, if necessary, click the cog icon in the top right to set the index to opendata. The time filter defaults to the last 15 minutes only; if your logstash has done its job right the events will have the timestamp of the actual donation, so click on the time filter in the very top right of the screen to change the time period to, for example, Previous year. Now you should see a bunch of data:

Click the toggle on one of the events to see the full data for it, including things like the donation amount, the message with the donation, and geographical details of the donor. You can find details of all the fields on the Donor’s Choose website here.

Click on the fields on the left to see a summary of the data within, showing very easily that within that time frame and sample of 500 records:

  • two thirds of donations were in the 10-100 dollar range
  • four-fifths included the optional donation towards the running costs of Donor’s Choose.

You can add fields into the table itself (which by default just shows the complete row of data) by clicking on add for the fields you want:

Let’s save this view (known as a “Search”), since it can be used on a Dashboard later:

Data Visualisation with Kibana

One of my favourite features of Kibana is its ability to aggregate data at various dimensions and grains with ridiculous ease. Here’s an example: (click to open full size)

Now let’s amend that chart to show the method of donation, or the donation amount range, or both: (click to open full size)

You can also change the aggregation from the default “Count” (in this case, number of donations) to other aggregations including sum, median, min, max, etc. Here we can compare cheque (check) vs paypal as a payment method in terms of amount given:

Kibana Dashboards

Now let’s bring the visualisations together along with the data table we saw in the Discover tab. Click on Dashboard, and then the + icon:

Select the visualisations that you’ve created, and then switch to the Searches tab and add in the one that you saved earlier. You’ve now got a data table showing all currently selected data, along with various summaries on it.

You can rearrange the dashboard by dragging each box around to suit. Once you’ve got the elements of the dashboard in place you can start to drill into your data further. To zoom in on a time period click and drag a selection over it, and to filter on a particular data item (for example, state in the “Top ten states” visualisation) click on it and accept the prompt at the top of the screen. You can also use the freetext search at the top of the screen (this is valid on the Discover and Visualize pages too) to search across the dataset, or within a given field.

Example Analysis

Let’s look at some actual data analyses now. One of the most simple is the amount given in donations over time, split by amount given to project and also as the optional support amount:

One of the nice things about Kibana is the ability to quickly change resolution in a graph’s time frame. By default a bar chart will use an “Auto” granularity on the time axis, updating as you zoom in and out so that you always see an appropriate level of aggregation. This can be overridden to show, for example, year-on-year changes:

You can also easily switch the layout of the chart, for example to show the percentage of the two aggregations relative to each other. So whilst the above chart shows the optional support amount increasing by the year, it actually remains pretty much the same when taken as a percentage of the overall donations – which, if you look at the definition of the field (“we encourage donors to dedicate 15% of each donation to support the work that we do.“), makes a lot of sense.

Analysis based on text in the data is easy too. Using the Terms sub-aggregation, here we can see the top five states by donation amount, with California consistently at the top of the table.

Since the Terms sub-aggregation shows the Top-x only, you can’t necessarily judge the importance of those values in relation to the rest of the data. To do this more specific analysis you can use the Filters sub-aggregation to use free-form searches to create buckets, such as here to look at how much those from NY and CA donated, vs all other states. The syntax is field:value to include it, and -field:value to negate it. You can string these expressions together using AND and OR.
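
Behind the scenes Kibana is simply sending aggregation queries to Elasticsearch, so the same analysis can be reproduced by hand. As a rough illustration (the exact query Kibana generates will differ), here is a terms aggregation with a sum of donation_total per state, restricted to NY and CA using the same field:value syntax, issued directly with curl:

curl -XPOST 'http://localhost:9200/opendata/_search?pretty' -d '{
  "size": 0,
  "query": { "query_string": { "query": "donor_state:NY OR donor_state:CA" } },
  "aggs": {
    "by_state": {
      "terms": { "field": "donor_state" },
      "aggs": { "total_donated": { "sum": { "field": "donation_total" } } }
    }
  }
}'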

A lot of the analysis sits well in the bar chart visualisation, but the line chart has a role to play too. Donations are grouped according to value range (<10, between 10 and 100, >100), and these plot out nicely when considering the number of donations made (rather than the total value). Whilst the total donated in a time period is significant, so is engagement with donors, hence the number of donations made is also important to analyse:

As well as splitting lines and bars, you can split charts themselves, which works well when you want to start comparing multiple dimensions without cluttering up a single chart. Here’s the same chart as previously but split out with one line per instance. Arguably it’s clearer to understand, and the relative values of the three items can be better seen here than in the clutter of the previous chart:

Following on from this previous graph, I’m interested in the spike in mid-value ($10-$100) donations at the end of 2011. Let’s pull the graph onto a dashboard and dig into it a bit. I’ve saved the visualisation and brought it in with the saved Search (from the Discover page earlier) and an additional visualisation showing payment methods for the donations:

Now I can click and drag the time frame to isolate the data of interest and we see that the number of donations jumps eight-fold at this point:

Clicking on one of the data points drills into it, and we eventually see that the spike was attributable to the use of campaign gift cards, presumably issued with a value > $10 and < $100.


Limitations

The simplicity described in this article comes at a cost, or rather, has its limits. You may well notice fields in the input data such as “_projectid”, and if you wanted to relate a donation to a given project you’d need to go and look that project code up manually. There’s no (easy) way of doing this in Elasticsearch – whilst you can easily bring in all the project data too and search on projectid, you can’t (easily) display the two (project and donation) alongside each other. That’s because Elasticsearch is a document store, not a relational database. There are some options discussed on the Elasticsearch blog for handling this, none of which to my mind are applicable to this kind of data discovery (but Elasticsearch is used in a variety of applications, not just as a data store for Kibana, so in other cases it is more relevant). Given that, if you wanted to resolve this relationship you’d have to go about it a different way, maybe using the Linux join command to pre-process the files and denormalise them prior to ingest with logstash. At this point you reach the “right tool/right job” decision – ELK is great, but not for everything :-)

Reprocessing

If you need to reload the data (for example, when building this I reprocessed the file in order to define the numerics as such, rather than the default string), you need to do the following (a combined script is shown after the list):

  • Drop the Elasticsearch data:
    curl -XDELETE 'http://localhost:9200/opendata'
  • Remove the “sincedb” file that logstash uses to record where it last read from in a file (useful for tailing changing input files; not so for us with a static input file)
    rm ~/.sincedb*

    (better here would be to define a bespoke sincedb path in the file input parameters so we could delete a specific sincedb file without impacting other logstash processing that may be using sincedb in the same path)
  • Rerun the logstash as above
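
For convenience, the three steps can be wrapped up in a single script (paths as used earlier in this article; adjust for your own install):

#!/bin/bash
# Reload the donations data from scratch
curl -XDELETE 'http://localhost:9200/opendata'
# Remove logstash's sincedb bookmark so the whole file is re-read.
# (Better still: set a dedicated sincedb_path on the file input so that
# only that one bookmark needs deleting.)
rm -f ~/.sincedb*
./logstash-1.5.0.rc2/bin/logstash agent -f ./logstash-opendata-donations.conf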

 

BI Forum 2015 Preview — OBIEE Regression Testing, and Data Discovery with the ELK stack

I’m pleased to be presenting at both of the Rittman Mead BI Forums this year; in Brighton it’ll be my fourth time, whilst Atlanta will be my first, and my first trip to the city too. I’ve heard great things about the food, and I’m sure the forum content is going to be awesome too (Ed: get your priorities right).

OBIEE Regression Testing

In Atlanta I’ll be talking about Smarter Regression testing for OBIEE. The topic of Regression Testing in OBIEE is one that is – at last – starting to gain some real momentum. One of the drivers of this is the recognition in the industry that a more Agile approach to delivering BI projects is important, and to do this you need to have a good way of rapidly testing changes made. The other driver that I see is OBIEE 12c and the Baseline Validation Tool that Oracle announced at Oracle OpenWorld last year. Understanding how OBIEE works, and therefore how changes made can be tested most effectively, is key to a successful and efficient testing process.

In this presentation I’ll be diving into the OBIEE stack and explaining where it can be tested and how. I’ll discuss the common approaches and the relative strengths of each.

If you’ve not registered for the Atlanta BI Forum then do so now as places are limited and selling out fast. It runs May 14–15 with an optional masterclass on Wednesday 13th May from Mark Rittman and Jordan Meyer.

Data Discovery with the ELK Stack

My second presentation is at the Brighton forum the week before Atlanta, and I’ll be talking about Data Discovery and Systems Diagnostics with the ELK stack. The ELK stack is a set of tools from a company called Elastic, comprising Elasticsearch, Logstash and Kibana (E – L – K!). Data Discovery is a crucial part of the life cycle of acquiring, understanding, and exploiting data (one could even say, leverage the data). Before you can operationalise your reporting, you need to understand what data you have, how it relates, and what insights it can give you. This idea of a “Discovery Lab” is one of the key components of the Information Management and Big Data Reference Architecture that Oracle and Rittman Mead produced last year:

ELK gives you great flexibility to ingest data with loose data structures and rapidly visualise and analyse it. I wrote about it last year with an example of analysing data from our blog and associated tweets with data originating in Hadoop, and more recently have been analysing Twitter activity using it. The great power of Kibana (the “K” of ELK) is the ability to rapidly filter and aggregate data, as well as see a summary of values within a data field:

The second aspect of my presentation is still on data discovery, but “discovering data” within the logfiles of an application stack such as OBIEE. ELK is perfectly suited to in-depth diagnostics against dense volumes of log data that you simply could not handle within simple log viewers or Enterprise Manager, such as the individual HTTP requests and types of value passed within the interactions of a single user session:

By its nature of log streaming and full text search, ELK also lends itself well to near real time system monitoring dashboards reporting the status of systems including OBIEE and ODI, and I’ll be discussing this in more detail during my talk.

The Brighton BI Forum is on 7–8 May, with an optional masterclass on Wednesday 6th May from Mark Rittman and Jordan Meyer. If you’ve not registered for the Brighton BI Forum then do so now as places are very limited!


Don’t forget, we’re running a Data Visualisation Challenge at each of the forums, and if you need to convince your boss to let you go you can find a pre-written ‘justification’ letter here.