Oracle Data Integrator 12c: Getting Started – What is ODI?
There is something about sharing stories with others about the technology you love that gets the blood flowing. I have written blog posts and articles, presented sessions at conferences, and recorded podcasts and tech tips about Oracle Data Integrator over the years. And it’s been a blast! But one thing I find is that I’m always drawn to sharing the difficult or overly technical aspects of ODI, often writing about integration with GoldenGate or strange quirks in the product that I’ve had to overcome for my clients. In a sense, that’s why you’re reading, right? The Rittman Mead blog has always been about solving the difficult problems with Oracle technologies and sharing that knowledge in a simple and easy to understand format. But what about a true “getting started” type of article?
I think it’s time for a refresher on Oracle Data Integrator. What is ODI? How has it evolved over the years and where is it going? And, of course, how do you get started with Oracle Data Integrator? I plan to share what I love about ODI and hope to get some folks interested in this great product as well. If you want to learn even more, I’d love to have you join me or one of my colleagues at one of our ODI 12c bootcamps. We’ll dig deep into the details and answer all of your questions about data integration. But until then, let’s get started with a little about where Oracle Data Integrator came from.
History of ODI
Oracle Data Integrator was the product of an acquisition that Oracle made in late 2006. Back then, Oracle already had a data integration tool for developing ETL (Extract, Transform, and Load) mappings: Oracle Warehouse Builder (OWB). At the time, OWB was delivered with the Oracle Database license, which made it the go-to data warehouse development tool for Oracle developers. In October 2006, Oracle announced that it had acquired Sunopsis, a French data integration company whose product focused on E-LT (Extract, Load, Transform) rather than traditional ETL.
What did that mean for OWB? The “standard” (and free with the Oracle DB) ETL development platform for Oracle was now in competition with a new product, Oracle Data Integrator. ODI carried a separate license cost and resided in the Fusion Middleware stack, outside of the database. As you might imagine, OWB’s days were numbered. With the opportunity to sell additional software licenses and the rise of Fusion Middleware and Fusion Applications (of which ODI was a big part), Warehouse Builder was eventually to be merged into Data Integrator.
The data integration product from Sunopsis became Oracle Data Integrator 10g after the acquisition. In August 2010, Oracle made its first major update to ODI with the release of Oracle Data Integrator 11g. This version moved ODI into the framework used by other products such as Oracle SQL Developer and JDeveloper, and introduced new features such as the JEE Agent and the ODI Console. While the 11g version was a step up from ODI 10g, it was still not widely regarded as a typical ETL development tool. The concept of Interfaces being a single unit of work for ETL, versus the flow-based mapping approach found in most ETL tools (including OWB), led to a slower adoption rate. It took several years, but Oracle Data Integrator finally went flow-based with the release of ODI 12c, integrating some of the best features from Oracle Warehouse Builder into the current product. The initial version of ODI 12c was made available to the public in late 2013. Besides the switch to flow-based mappings, ODI 12c also added integration with big data sources and targets, lifecycle management capabilities, many performance enhancements, and a migration utility for those moving from OWB.
Now that we’re caught up with that brief history lesson, let’s look at how ODI is able to differentiate itself from competitors.
What Makes ODI Different?
E-LT
I alluded to Oracle Data Integrator’s use of E-LT (Extract, Load, and Transform) earlier, but what does that mean - and why are the letters in the acronym ETL out of order? The main difference is in the architecture.
Oracle Data Integrator is built to push the transformation work down to the source or target data source. This means there is no need for a middle-tier ETL engine to perform transformations, as many traditional ETL tools employ. In fact, in most implementations the ODI agent, which orchestrates the ETL processes, simply sends the code to the target server to be executed. This architecture allows ODI to use the power of the target data source to execute the transformations. Why waste the processing power of your Oracle Database when it’s built for this type of SQL execution?
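To make this concrete, below is the kind of set-based statement the agent might hand to the target database; the tables and columns are purely illustrative, not taken from any real mapping.
-- Illustrative pushdown: the transformation runs as one set-based statement
-- on the target database, rather than row-by-row in a separate ETL engine.
insert into dw.f_sales (order_id, order_date, amount_usd)
select
s.order_id,
trunc(s.order_ts),
s.amount * fx.usd_rate
from
stg.orders s
inner join stg.fx_rates fx on fx.currency_code = s.currency_code;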
Knowledge Modules
The concept behind Knowledge Modules (KMs) is quite simple. In a nutshell, KMs are generic code templates that are applied to an ODI mapping; they use a substitution language to pull in metadata from the mapping and produce executable code at runtime. If you want to change the physical implementation of your mapping, say switching from an insert-only “append” integration method to an incremental update approach, you simply switch the Knowledge Module applied to that particular mapping. KMs can even be customized or created from scratch to suit your specific data integration needs, adding to the overall flexibility of the tool.
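To give a flavour of what a KM contains, here is a simplified sketch of an append-style integration step. It is not verbatim from any shipped KM, but it shows the general idea: odiRef substitution tags embedded in a SQL template, resolved against the mapping's metadata at runtime.
/* Simplified, illustrative IKM step: the odiRef tags are replaced at runtime
   with the target table name and column list defined in the mapping. */
insert into <%=odiRef.getTable("L", "TARG_NAME", "A")%>
(
<%=odiRef.getColList("", "[COL_NAME]", ", ", "", "")%>
)
select
<%=odiRef.getColList("", "[COL_NAME]", ", ", "", "")%>
from <%=odiRef.getTable("L", "INT_NAME", "A")%>
Swap the mapping over to an incremental update KM and a different template is applied, so the generated DML changes without touching the mapping's logical design.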
Customization and Flexibility
The Knowledge Modules are just one aspect of Oracle Data Integrator that can be customized. With KMs, you can change how your mappings are physically implemented, allowing for the ultimate flexibility. But it’s not just mappings: an object called a Procedure in ODI allows any bit of code or command-line call to be made using nearly any type of technology: Groovy, Jython, OS commands, Oracle SQL, MySQL, SQL Server…the list goes on and on. The great part about customizing Oracle Data Integrator is that you can make it adapt to your data warehouse - and not the other way around. Too often, companies adopt a “standard” because it is built into the software they use rather than because it is a good business practice. Thankfully, that’s not the case with ODI.
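For example, a Procedure step with its technology set to Oracle might wrap nothing more than a housekeeping command like the one below; the staging table is made up purely for illustration.
-- A one-off command wrapped in an ODI Procedure step (technology: Oracle);
-- stg.orders_load is a hypothetical staging table used only as an example.
delete from stg.orders_load
where load_date < sysdate - 7;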
Beyond customization within the Oracle Data Integrator objects, you can also access the application backend via the ODI SDK Java API. With this level of access, you can perform almost every task that can be completed within ODI Studio. Imagine you need to create 400 source-to-staging mappings, all one-for-one column mappings. With the ODI SDK, and fewer than 50 lines of code, you can create them all in about 5 seconds! The power of the SDK is generally found when there is a need for batch creation or modification of objects. But those aren’t the only cases where the SDK shines: you can also automate code deployment or even create a development quickstart for your standard mappings, all with custom code building ODI objects.
As you might have noticed, I’m a big fan of Oracle Data Integrator 12c. This is just the first of many posts in the “Oracle Data Integrator 12c: Getting Started” series. Up next, we’ll look at how to actually get started with ODI 12c. As always, please feel free to send me an email or a message on Twitter - or comment below - if you have any questions. And if I missed your favorite feature of Oracle Data Integrator, please share it in the comments!
All You Ever Wanted to Know About OBIEE Performance…but were too afraid to ask
At the Polish Oracle Users Group (POUG) conference this week I'm presenting one of my favourite talks, [Still] No Silver Bullets - OBIEE Performance in the Real World. This presentation is based on the shared experiences of myself and my Rittman Mead colleagues helping our clients diagnose and - most importantly! - fix their OBIEE performance problems. Performance is something that's interested me for a long time, and my very first conference presentation back in 2010 was on how to test OBIEE performance. Since then I've written and collected a lot of material on the subject, which I thought would be useful to present here as a handy one-stop reference.
OBIEE Performance - testing and optimisation
This series of blog posts goes into great detail on the specifics of testing, and diagnosing, OBIEE performance. Topics such as how to define an "OBIEE report" are covered in depth, along with a discussion of the tools available to carry out performance testing of OBIEE and monitor the results. There's also a whole post on optimisation (frequently called 'tuning') and a hefty rant against some of the common anti-patterns that we see adopted when it comes to troubleshooting poor performance.
Caching
If you ever encounter the unqualified advice "To fix your OBIEE performance problems, enable BI Server caching", then run!
BI Server caching can be brilliant for providing the "icing on the cake" of a well-built system, giving that extra little performance boost. Where BI Server caching is a bad idea is when it is used to "paper over the cracks" of a fundamentally poorly-built OBIEE system. If the foundations are rotten, gold-plating the building on top is not going to do anything to improve things in the long run. If your OBIEE performance is bad, you need to get to the bottom of why and then rectify it. Using caching to mask it only "brushes the problem under the carpet", storing up the same problem or worse for the future as your system usage increases.
With that caveat made, here are two articles that explain how to best use BI Server caching:
You can also watch a "Two-Minute Tech Tip" that I recorded on the subject: Oracle Business Intelligence Server Caching 2MTT
[Still] No Silver Bullets - OBIEE Performance in the Real World
As presented previously at OOW, OUGF, UKOUG, OUG Scotland, and POUG. Newly updated for OBIEE 12c.
OBIEE Performance Analytics: Analysing the Impact of Suboptimal Report Design
This article has been published on Oracle Technical Network (OTN), and examines in detail what happens when there's sub-optimal design in the implementation of an OBIEE system. Disks fill up, IO bandwidth is exhausted, the server is overworked - and performance suffers.
Performance Analytics Service
The Rittman Mead Performance Analytics Service is the result of our deep understanding of the OBIEE stack, performance optimisation theory, and numerous client engagements. We understand why performance is so crucial to the success of an OBIEE project. We know how OBIEE works, how to diagnose performance issues -- and how to fix them!
Our Performance Analytics Service offers a free baseline Performance Analytics Report for your current OBIEE system, using data to demonstrate where your performance issues lie and prioritising the areas in which to focus investigation. Building on this baseline report, we also offer a detailed Performance Analytics Report. This is prepared by one of our expert OBIEE performance consultants based on deep analysis and understanding of your specific OBIEE implementation and performance pain-points. Not only do we diagnose the specific issues, but we demonstrate the relative effort involved to rectify each one and the expected performance benefit that each will yield.
To provide powerful monitoring and diagnostics for the performance of your system we built the OBIEE Performance Analytics Dashboards. These give a full-stack view of the metrics that can impact performance of your OBIEE reports. This gives you the information needed to accurately diagnose issues when they arise, as well as perform historical analysis. Have a look at these videos to see exactly how the dashboards can help, and get in touch with us to request a demo!
(If YouTube is blocked, you can also download them here: 1, 2, 3.)
Product Documentation
The documentation for OBIEE includes very useful sections on OBIEE performance that are worth reading:
- OBIEE 11.1.1.9 : Managing Performance Tuning and Query Caching
- OBIEE 12.2.1 : Managing Performance Tuning and Query Caching
Tuning Guide
Oracle has published white papers available through My Oracle Support that detail further options available within the OBIEE stack for tuning performance. In a blog article I do somewhat grumble about these papers and how they can be misused ("oooh here's a knob that I can fiddle with") if taken without proper context and diagnostics. In the right hands, however, they are useful reference for some of the more advanced configuration settings and their impact on performance.
- OBIEE 11g Infrastructure Performance Tuning Guide
- OBIEE12c: Best Practices Guide for Infrastructure Tuning
See also page 13 of this document for good advice on working with exports to Excel from OBIEE: Doc ID 1558070.1
General Performance Resources
OBIEE performance is nothing particularly special. The problems are the same as in any other stack; just different components. The standard principles apply, and that is why I'd strongly recommend reading these articles, as they give a fantastic background to the world of performance diagnostics and optimisation.
- Cary Millsap - Thinking Clearly About Performance
- Brendan Gregg - The USE method
- Brendan Gregg - Linux Performance Tools (also great discussion of performance troubleshooting methodology)
- Greg Rahn - The Core Performance Fundamentals Of Oracle Data Warehousing – Balanced Hardware Configuration
- Oracle documentation - Oracle® Database 2 Day + Data Warehousing Guide - Balanced Hardware Configuration
OBIEE12c – Upgrading to Version 12.2.1.1
INTRODUCTION
The new version of OBIEE 12c, 12.2.1.1 to be exact, is out, so let’s talk about it. It’s my intent that after reading this you’ll have some degree of comfort about possibly doing this thing yourself (should you find yourself in just such a circumstance), but if not, feel free to drop us a line or give us a ring. It should be noted that Oracle documentation explicitly indicates that you need to be on OBIEE version 12.2.1.0 first, which is to say you’re going to have to bring your 11g instance up to 12c before you can proceed with this upgrade. A colleague here at RM and I recently sat down to give the upgrade process (click there for the Oracle doc) a go on one of our hosted Windows servers, and here’s the cut and dry of it. The examples throughout will reference both Linux and Windows, so choose whichever you’d like. Now, if you’ve gone through the 12c install process before, you’ll be plenty familiar with roughly 80% of the steps involved in the upgrade. Just to get this out of the way: no, it’s not a patch (in the sense that you’re actually going through the OBIEE patching process using OPatch). In fact, the process almost exactly mirrors a basic 12c install, with the addition of a few steps that I will make darn sure we cover in their entirety below. Speaking of which, I’m not going to do a play-by-play of the whole thing, but simply highlight those steps that are wholly unfamiliar. To provide some context, let’s go through the bullet points of what we’ll actually be doing during the upgrade.
First, we’ll make sure we have a server-appropriate, supported version of Java installed (8_77 is the lowest supported version) and that it corresponds to the JAVA_HOME you’ve got set up.
Next, we’ll be running the install for the WebLogic server into a NEW oracle home. That’s right, you heard me. A. new. oracle. home.
After that, we’ll be running a readiness check to make sure our OBIEE bits won’t run into any trouble during the actual upgrade process. This checks all OBIEE components, including those schemas you installed during the initial install process. Make sure to have your application database admin credentials on hand (we’ll talk about what you need below in more detail). The end of this step will actually have us upgrade all those pieces the readiness checker deems worthy of an upgrade.
Next, we’ll reconfigure and upgrade our existing domain by running the RECONFIGURATION WIZARD!!!!! and upgrade assistant, respectively.
Lastly, we’ll start up our services, cross our fingers, hold onto our four leaf clovers, etc.. (just kidding, at least about that last part).
Before we begin, however, let’s check off a few boxes on the ‘must have’ list.
Download all the files here, and make sure you get the right versions for whatever kind of server your version of OBIEE hangs out in. The Java version will be 8_101, which will work out just fine even though the minimum needed is 8_77.
Get those database credentials! If you don’t know them, drop everything and figure out how you’re going to access the application database within which the OBIEE 12c schemas were installed. You’ll need the username/password for the SYS user (or a user with SYS privileges), as well as the database connection string, including the service name, host, and port.
Make sure you have enough disk space wherever you’re installing the upgrade. The downloads for the upgrade aren’t small; you should have at least 150GB on, say, a DEV box. You don’t want to have to manage allocating additional space at a time like this, especially if it involves putting in a ticket with IT (wink wink)! Speaking of which, you’ll also need the server credentials for whichever user 12c was installed under. Note that you probably won’t need root if it’s a Linux machine; however, there have been some instances where I’ve needed to have these handy, as there were file permission issues that required root credentials and were causing errors during an install. You’ll also need the WebLogic/OBIEE admin user (if you changed the name for some reason).
Lastly, make sure you’re at least a tad bit familiar with the paths to both the oracle home and the domain home.
SETTING UP JAVA
After downloading the version of Java you need, go ahead and update it via the .rpm or .exe, etc. Make sure to update any environment variables you have set up, and to update both the JAVA_HOME variable AND the PATH to reference the new Java location. As stated above, at the time of this blog the version we used, and that is currently available, is 8_101. During the upgrade process we got a warning (see below) about our version not being 8_77. If this happens to you, just click Next. Everything will be alright, promise.
A NEW HOME FOR ORACLE
Did you click the link to the upgrade doc yet? If not, do so now, as things are about to get kind of crazy. Follow along as we walk through the next steps if you’d like. First, stop services and disable SSL like it tells you to. Then start OBIEE services back up and run the infrastructure jar (java -jar fmw_12.2.1.1.0_infrastructure.jar) for the WebLogic server install. Again, I’m not going to go pic by pic here, as you can assume most everything resembles the initial 12c install process, and this part is no different. The one piece of this puzzle we need to focus on is establishing a new oracle home. After skipping those auto updates, move on to step 3 where we are, in fact, going to designate a new oracle home. You’ll see that, after completing the WebLogic install, we’ll have a bunch of updated feature sets, in addition to some new directories in our 12.2.1.1 oracle home. For example, if your original home is something like:
/u01/app/oracle/fmw
change it to something new, for example:
/u01/app/oracle/fmw2
when it asks you to enter a new one.
Breeze through the rest of the steps here, and remember to save that response file!
UPDATING OBIEE
Unzip both of the fmw_12.2.1.1.0_bi_linux64_Disk#_#of2.zip files, making sure that your OBIEE install files are in the same directory. For Windows, this will be the executable from the first zip file and the zip file from the second part of disk 1. Execute the binary (on Linux) or the .exe (on Windows), going through the usual motions, and then in step 3 enter the NEW oracle home for 12.2.1.1. In the example above, it would be:
/u01/app/oracle/fmw2
for Linux, and likewise the equivalent new home on Windows. Again, there isn’t too much to note or trap you here beyond making sure that you don’t enter your original oracle home, but rather the one you created in the previous section. Proceed through the next steps as usual and remember, save your response file!
UPDATING THE 12C SCHEMAS - USING THE READINESS CHECKER AND UPDATE ASSISTANT
Run the readiness checker from:
NEW_ORACLE_HOME/oracle_common/upgrade/bin/ua -readiness
This next series of steps will take you through all the schemas currently deployed on your application database and confirm that they won’t explode once you take them through the upgrade process. In step 2 of 6, make sure that you’re entering the port for EM/Console (9500 by default). Remember when I said you’re going to need the DB credentials you used to install 12c in the first place? Well, here’s where we’re going to use them. The readiness checker will guide you through a series of screens that essentially confirm the credentials for each schema installed, and then present a report detailing which of these will actually get upgraded. That is to say, there are some that won’t be. I really like this new utility as an extra vote of confidence for a process that can admittedly be troublesome at times.
Once you’ve validated that those schemas are ready for update, go ahead and stop OBI12c services using the EXISTING oracle home.
Pro tip: they’ve made it super easy to do this now; just add the binaries directory in OBIEE’s bitools folder (ORACLE_HOME/user_projects/domains/bi/bitools/bin) to the PATH in your bash_profile. After adding this entry to your profile, you can simply type start.sh or stop.sh to bring everything up or down, not to mention take advantage of the myriad other scripts that are in there. Don't type those paths out every time.
I digress… After the services come down, run the upgrade assistant from within the NEW oracle home, as below:
Citing the previous example:
NEW_ORACLE_HOME/oracle_common/upgrade/bin/ua
After bringing up the install dialogue box, move on to step 2, and select the All Schemas Used by a Domain option (as in the example above), unless of course you’d like to hand select which ones you’d like to upgrade. I suppose if you were thinking about scrapping one you had previously installed, then this would be a good option for you. Make sure the domain directory you specify is from your existing/old 12c instance, as below:
Move through the next series of steps, which are more or less self-explanatory (no tricks here, promise), once again validating connection credentials, until you get to step 12. As always, save the response file, select Upgrade, and then watch the magic happen… hopefully. Congratulations, you’ve just updated your schemas!
WHO INVITED A WIZARD TO THE PARTY? - RECONFIGURING THE BI DOMAIN AND UPDATING THE BI CONFIGURATION
Like I said before, I won’t be covering every single step of this process (e.g. the map viewer portion), which means you’ll still have to consult the…oracle, on some of this stuff. That being said, don’t gloss over backing up the map viewer file… you have to do it. This is simply an attempt to help make the upgrade process a little easier to swallow and hopefully make some of the more confusing steps a bit clearer. Moving on. Guess what? It’s time to run another series of dialogue boxes. Beats the heck out of scripting this stuff though, I guess. Open up the RECONFIGURATION WIZARD!!!!! as instructed in the documentation, from the location within your NEW oracle home. The first step will prompt us for the location of the domain we want to upgrade. We want to upgrade our existing 12c domain (the old one), so type that in or browse for it. Right now.
Validate your java version and location in step 3 and then click your way through the next few screens, ensuring that you’ve at least given your stamp of approval on any pre-filled or manually filled entries in each dialogue box. Leave step 7 alone and click Next to get to the screen where we’re actually going to be starting the reconfiguration process. Click through and exit the RECONFIGURATION WIZARD!!!!!
Don’t forget to restore the map viewer config file at this point, and then launch the upgrade assistant again, this time selecting the All Configurations Used By a Domain option in step 2. Make sure you’ve entered the location of the existing 12c domain in this step as well, and NOT the one created under the new oracle home.
Click through the next steps, again, paying close attention to all prompts and the location for the map viewer xml file. Verify in step 7 that the directory locations referenced for both domain and oracle map viewer are for the existing locations and NOT those created by the install of the update.
WRAPPING UP AND NOTES
You can now re-enable SSL (as below) and then start OBIEE services.
DOMAIN_HOME/bitools/bin/ssl.sh internalssl true
Note: if you have TNS admin files (such as tnsnames.ora) or an ldap.ora, place copies under NEW_ORACLE_HOME/network/admin
You can ignore the new oracle home created at this time, as, in my opinion, we’re going to have to do something similar for any following updates for 12c. What did you think of the upgrade process, and did you run into any issues? Thanks so much for reading, and as always, if you find any inconsistencies or errors, please let us hear about them!
Becky’s BI Apps Corner: OBIA 11.1.1.10.2 New Features Part 1 – Health Check & ETL Diagnostics
I have been working with BI Applications since OBIA switched to ODI in version 11.1.1.7.0. I have installed and worked with all of the 11.x versions using several different sources. This most recent version of OBIA may only be a patch-set release, but it has some new features that make it very compelling: automatic ETL error handling, Health Check, ETL Diagnostics, and built-in email notifications.
If you've been using an earlier version of OBIA 11.x (or 7.x for that matter), now may be the time to upgrade. In an Oracle Advisor's Webcast, Rajesh Lakra, BIApps Global Product Support Lead Engineer, said this version will be the upgrade path to OBIA 12.x, which is slated for release sometime in 2017. Also, it may just be the circumstances I've been exposed to, but this latest version seems more performant as well. Since I've not had an opportunity to do benchmark timing tests against the same source data, I cannot verify that yet. However, I am setting up some environments to do just that. Keep an eye out for a future blog post with performance comparisons.
Check if there are any other running load plans
Check Previous Load Plan Runs is a package that only has the following procedure:
Check State of Previous Load Plan Runs
1) Checks the status of previous Load Plan runs. Throws an exception if any other Load Plan is still in a Running state.
2) Checks the Blackout schedule period from the w_etl_control_schedule table. If there is a Blackout schedule, the Load Plan fails and exits.
I’ve not found any documentation about this package or procedure. However, the procedure itself has three steps.
Check if Prev Load Plans are Running
a. Query: SELECT COUNT(1) FROM SNP_LPI_RUN WHERE STATUS IN ('R','W') AND I_LP_INST!=(?)
b. Message: >>>>>" + ctLPRunning + " Load plan(s) found to be running/waiting apart from the current one. ABORTING this Load Plan as running multiple Load Plans in Parallel can cause DATA CORRUPTION <<<<<<
Check Blackout schedule
a. Query: select -1 from w_etl_control_schedule cs where sysdate between cs.control_start_dt and cs.control_end_dt and cs.control_code = 'NO_ETL'
b. Message: >>>>> Blackout schedule coming up as seen in w_etl_control_schedule table, hence no Load plan(s) will be executed. ABORTING this Load Plan as there will be a down time to the repository due to the Blackout schedule <<<<<<
Check Source Upgrade
a. Query: SELECT 1 FROM W_ETL_SRC_VERSION_HISTORY WHERE SRC_PLV_NEW IS NOT NULL AND SRC_UPGRADE_STATUS IN ('PENDING','ERROR','STARTED')
b. Message: >>>>> Found Incomplete Source Upgrade status from the Source Upgrade tracking table (W_ETL_SRC_VERSION_HISTORY). For more information, please refer to the Warehouse table W_ETL_SRC_VERSION_HISTORY for incomplete Source Upgrade status rows and take necessary action to run Source Upgrade Load plans <<<<<<
I believe this is a good feature to add, because Oracle’s OBIA documentation has always recommended not running more than one load plan at a time. Now, if another load plan is already running, the new load plan stops automatically and logs a message warning about the data corruption that can result from running multiple load plans in parallel.
I think it is interesting to see the Blackout schedule and Source Upgrade checks. I’ve done some looking in the documentation and on support.oracle.com but didn’t come up with any clues as to what exactly these are for. It seems to me that the Blackout schedule is a calendar of sorts for when the repository will be down for maintenance. As for the Source Upgrade, that is more of a mystery to me.
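If you want to experiment with the Blackout schedule, the check query above suggests that a maintenance window is simply a row in w_etl_control_schedule with the control code NO_ETL. The insert below is a hedged example based only on the columns visible in that query; the real table may well have additional required columns.
-- Hypothetical example: register a maintenance window so that any load plan
-- started inside it aborts rather than running. The column list is inferred
-- from the check query shown earlier and may be incomplete.
insert into w_etl_control_schedule (control_code, control_start_dt, control_end_dt)
values ('NO_ETL',
to_date('2016-10-01 22:00', 'YYYY-MM-DD HH24:MI'),
to_date('2016-10-02 02:00', 'YYYY-MM-DD HH24:MI'));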
Next step in the Load Plan is a Health Check.
Health Check and Diagnostics
Oracle's OBIA ETL Guide has a short paragraph on the Health Check.
Health Check is a preliminary ETL phase in which a diagnostic report is generated to identify problematic source data that might cause ETL failure or data loss or corruption in the data warehouse. The report is downloaded from Configuration Manager and includes any problematic data, a description of detected issues, and actions to resolve health check failures.
Referring again to Oracle's OBIA ETL Guide, the Manage Load Plans page is where you can download the Health Check report. Highlight the relevant load plan that has run and/or failed and click on Show Data Problems.
Select the desired DIAGNOSTIC HEALTHCHECK and download the zip file to your computer.
This file gets generated through a number of different steps starting with the Execute Healthcheck package and followed by the Create Report and Summarize Healthcheck package.
Execute Healthcheck is a package that only has the following procedure:
Run Diagnostics
This procedure has 12 steps. The two that are the most interesting are:
Define Execute Diagnostic Function and Run Diagnostics
The bulk of the code for the Diagnostics are in the step Define Execute Diagnostic Function. The code is Jython and it is really too long to paste here, but I highly recommend digging into how it works.
Then the step Run Diagnostics actually kicks off the process.
Once the diagnostics are run, there is a step in the load plan called Create Report and Summarize Healthcheck. This package has two procedures.
The first procedure, Create Diagnostic Report, does exactly what you might guess: it does a lot of formatting and puts the diagnostic report into an HTML page.
The second procedure, Summarize Healthcheck Result, takes the information, inserts it into a table, then organizes it for an email to be sent out.
During the September 21st Oracle Advisor Webcast on BI Applications 11.1.1.10.2 New Release Overview, Rajesh Lakra, BIApps Global Product Support Lead Engineer, said the Health Check is most useful during a domain load. It checks to ensure source domain data is valid, collectively letting you know if these domains will cause issues further down in the ETL process.
In Part 2 of OBIA 11.1.1.10.2 New Features, I will show you what the HTML files look like, how to set up the email process, and how the automatic error handling now works within many of the knowledge modules. Tune in next time for more exciting parts of the new features in this version of BI Applications!
Analysing Social Media Data for the Lightyear Foundation – Part 2
In my last post, I described the basics of connecting to social media data sources using Python. This is in aid of collating information for the Lightyear Foundation, a charity specialising in science education in the UK and Ghana. In this blog, I will demonstrate how to use those integrations to record data in a PostgreSQL database. I've chosen this database so as to keep everything free and open source.
One of the downsides of some of the publicly available APIs is that they only store a certain amount of history, which makes analysing long-term trends more difficult. By recording the data periodically, in an ETL-like fashion, we can circumvent this. Given the vast amount of data available for social media, it would not be appropriate to store all of it in a relational database. Instead, we can store summary statistics for long-term trend analysis.
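To give an idea of what that looks like, here is a minimal sketch of one such daily summary fact table. The column names are assumed from the Facebook measures listed later in this post; the actual DDL used by the scripts may differ.
-- Minimal sketch of a daily summary fact table: one row per day, keyed on an
-- integer date ID that joins to the d_date dimension used in the queries below.
create table f_facebook_daily (
date_id integer primary key,
total_likes integer,
new_likes integer,
engaged_users integer,
reach integer
);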
All of the code used in this blog is in this public Git repository. You are welcome to download it and poke around as you follow the blog. Note that a config.ini configuration file and a keys/google-api.json file are required in order to get these scripts working yourself.
Database Integration
The Python module psycopg2 was used to execute transactions against the PostgreSQL database. The script lyf/psql.sql contains a host of functions I wrote to perform useful database and ETL actions:
- psql.truncate: Truncates a table.
- psql.insert: Inserts a single row into a table.
- psql.upsert: Updates a row in a table if a matching key is found, else inserts the row (see the sketch after this list).
- psql.update: Updates values in a table based on matching keys.
- psql.delete: Deletes rows from a table based on a WHERE clause.
- psql.lookup: Updates columns based on a lookup to another table. Useful for assigning foreign keys from natural keys.
- psql.load_csv: Loads data from a CSV file into a table, mapping header names in the file to columns in the table.
- psql.query: Executes a SQL statement and returns the data.
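As an illustration of the upsert behaviour (this is not the actual code from the repository), PostgreSQL 9.5 and later can do it in a single statement, assuming a unique constraint exists on the key column; the table and column names here are only examples.
-- One way to implement an upsert in PostgreSQL 9.5+: insert the row, or update
-- the existing one if the key is already present.
insert into f_twitter_daily (date_id, followers, tweets)
values (20160909, 42, 3)
on conflict (date_id)
do update set followers = excluded.followers, tweets = excluded.tweets;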
Google Analytics
For Google Analytics (GA), many dimensions and metrics are available, but I have chosen only a useful subset to put into a star schema.
Dimensions
For efficiency, I wrote a method to dynamically load the dimension tables from a TSV file:
table ga_dims columns keys
d_ga_source "ga:sourceMedium,ga:source,ga:medium,ga:socialNetwork" "source_medium,source,medium,social_network" "source_medium"
d_ga_platform "ga:operatingSystem,ga:deviceCategory,ga:operatingSystemVersion" "os,device_category,os_version" "os,device_category,os_version"
d_ga_geo "ga:continent,ga:subContinent,ga:country,ga:countryIsoCode,ga:region,ga:cityId,ga:city" "continent,sub_continent,country,country_code,region,city_id,city" "city_id"
d_ga_page "ga:pageTitle" "page_title" "page_title"
The table column refers to the table name in the PostgreSQL schema; these tables are created manually before executing the script. ga_dims is a comma-separated list of GA dimension codes. Note that the GA API allows a maximum of 7 dimensions per query. columns is another list, of equal length, describing the mapping to the columns in the database table. Finally, keys is a comma-separated list of which columns are the unique natural keys for the database table, so that the script can perform updates as well as inserts.
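Since the dimension tables are created manually, here is an example of what the DDL for one of them could look like. Only the column names and the natural key come from the mapping file above; the data types and the surrogate key definition are assumptions.
-- Example DDL for the geography dimension described by the d_ga_geo row of the
-- TSV file. geo_id is the surrogate key referenced later by the fact table.
create table d_ga_geo (
geo_id serial primary key,
continent text,
sub_continent text,
country text,
country_code text,
region text,
city_id text,
city text,
unique (city_id)
);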
This input file is parsed by load_ga_dims.py, which executes a Google API query for each dimension and upserts the records into the database. Essentially, the ETL for this process is defined quite simply by the TSV file and can be changed with relative ease. It doesn't have the scope, flexibility or scale of ODI or Informatica, but as a quick, free option it will suffice.
Fact
The fact table is much simpler to load, as we just need to execute a single GA query which specifies only the dimension natural keys and the metrics. The script load_ga_fact.py will go and fetch the dimension warehouse keys (always integers) as well as any metrics specified in the file. I've opted for:
- Sessions
- Bounces
- Bounce Rate
- Average Session Duration
- Session Duration
- Page Views
- Time on Page
Finally, there is an integer date ID and some denormalised attributes: longitude and latitude. With Google it's easy to retrieve a "full load", as the API allows querying of history, as well as incremental loads filtered for today's date. Below is an example query returning the total session time and number of sessions by continent for January 2016:
select
dt.year_month,
geo.continent,
sum(session_duration) as total_time,
sum(ga.sessions) as sessions
from
f_ga_daily ga
inner join d_ga_geo geo on geo.geo_id = ga.geo_id
inner join d_date dt on dt.date_id = ga.date_id
where dt.yyyymm = 201601
group by dt.year_month, geo.continent
order by dt.year_month;
Facebook
The extract I wrote for Facebook was much simpler, using date as the sole dimension. If required, dimension tables could be created for posts, videos or other similar data, but in this case we just want to look at trends over time. The measures selected are:
- Total Likes
- Total Posts
- Total Videos
- Total Video Likes
- New Posts
- New Videos
- Impressions
- Reach
- Engaged Users
- New Likes
- New Unlikes
- Video Views
- Page Views
- Post Likes
There's no way of extracting historical data from Facebook's API, so we need to load data regularly using f_facebook_daily.py. Below is a time series query of likes and reach for a few days:
select
dt.date,
sum(total_likes) total_likes,
sum(new_likes) likes,
sum(engaged_users) engaged_users,
sum(reach) reach
from
f_facebook_daily fb
inner join d_date dt on dt.date_id = fb.date_id
group by date
order by 1;
Twitter
The Twitter API is not as rich as the Google or Facebook APIs and is limited to only 15 queries an hour. For my ETL scripts this shouldn't matter, but once again no history is available. The script for this load is f_twitter_daily.py. The measures we've got are:
- Total Followers
- Total Following
- Total Tweets
- Followers
- Following
- Tweets
select
dt.date,
sum(total_followers) total_followers,
sum(total_following) total_following,
sum(total_tweets) total_tweets,
sum(tweets) tweets
from
f_twitter_daily tw
inner join d_date dt on dt.date_id = tw.date_id
group by date
order by 1;
YouTube
The data for videos and views over time are stored in a flattened fact table. Given the small number of attributes for videos and the lack of interconnectivity with the other information, I just denormalised them onto the fact table. For measures we have:
- Total Views
- Views
- Total Likes
- Likes
- Total Dislikes
- Dislikes
Once again this is loaded incrementally, storing records for each video for each day. The incremental loading script is f_youtube_daily.py. The below query shows how many views the videos got by month:
select
dt.year_month,
sum(yt.views) views
from
f_youtube_daily yt
inner join d_date dt on dt.date_id = yt.date_id
group by dt.year_month
order by 1, 2;
MailChimp
The MailChimp API allows extraction of information about subscriber lists and campaigns. f_mc_lists_daily.py will load in today's fact data as well as updating the subscriber list dimension. The measures include:
- Members
- Unsubscribed
- Cleaned
- Total Members
- Total Unsubscribed
- Total Cleaned
- Total Campaigns
- Open Rate
- Average Subscribe Rate
Below is an example query showing a snapshot of the subscriber list data as of 9th September 2016.
select
dt.date,
ml.name,
sum(mc.total_members) total_members,
sum(mc.total_campaigns) total_campaigns,
avg(mc.open_rate) open_rate
from
f_mc_lists_daily mc
inner join d_date dt on dt.date_id = mc.date_id
inner join d_mc_lists ml on ml.list_id = mc.list_id
where date = '2016-09-09'
group by dt.date, ml.name
order by 1, 2;
Summary
Now that we've got all these different facts, we can write BI-style summary queries to compare all of the data, or provide regular reports. This is a query across multiple fact tables, conformed on the date dimension. It shows summary statistics for each fact per month.
with
ga as
(
select dt.year_month, sum(ga.sessions) as sessions, avg(ga.avg_session_duration) as avg_session_duration
from
f_ga_daily ga
inner join d_date dt on dt.date_id = ga.date_id
where dt.year = 2016
group by dt.year_month
order by 1
),
fb as
(
select dt.year_month, last(total_likes) as total_likes, sum(fb.reach) as reach
from
f_facebook_daily fb
inner join d_date dt on dt.date_id = fb.date_id
where dt.year = 2016
group by dt.year_month
order by 1
),
tw as
(
select dt.year_month, last(total_followers) as total_followers
from f_twitter_daily tw
inner join d_date dt on dt.date_id = tw.date_id
where dt.year = 2016
group by dt.year_month
order by 1
),
yt as
(
select dt.year_month, sum(views) as views, last(total_views) as total_views
from
f_youtube_daily yt
inner join d_date dt on dt.date_id = yt.date_id
where dt.year = 2016
group by dt.year_month
order by 1
)
select
ga.year_month, ga.sessions as web_views, ga.avg_session_duration as avg_web_view_dur, fb.reach as fb_reach,
fb.total_likes as fb_likes, tw.total_followers as twitter_followers, yt.total_views as total_yt_views, yt.views as yt_views
from
ga
inner join fb on ga.year_month = fb.year_month
inner join tw on ga.year_month = tw.year_month
inner join yt on ga.year_month = yt.year_month;
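One thing to note about the query above: last() is not a built-in PostgreSQL aggregate, so it has to be created in the database before queries like this will run. A common recipe is shown below; strictly speaking the "last" value depends on input order, so adding an ORDER BY inside the aggregate call is safer.
-- Simple last() aggregate: the transition function just keeps the most recent
-- value it has seen.
create or replace function last_agg(anyelement, anyelement)
returns anyelement
language sql immutable strict
as $$ select $2; $$;

create aggregate last (anyelement) (
sfunc = last_agg,
stype = anyelement
);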
So there we have it: an open-source ETL package for a handful of social media sites that will track history. The final tables can be used as a source for any reporting tool you want, and the data can all be tied together by date. This is particularly useful for showing summary trends over time. However, I don't store even a fraction of the amount of data that we can get in real time from the APIs. The next blog entry will show how to use open source technologies to visualise this data, as well as the real-time information.