Category Archives: Rittman Mead
Data Visualization Desktop 12.2.2.0: Data Flow Component
My previous post contained a brief description of the new features of Data Visualization Desktop (DVD) 12.2.2.0 in terms of sources, visualisations and components. In this post we're going to simulate a typical analyst use case and understand how DVD can support the process.
Data Visualisation Desktop is a tool aimed at departmental analysis, with data coming from different sources and results that need to be delivered quickly. Given its ad-hoc nature, traditional long-term, IT-driven Business Intelligence processes often won't suffice. In this example we'll take a deep look at DVD's Data Flow component and how it can be used to create an ETL flow in order to analyse data coming from a multitude of sources. Data Flow is new functionality introduced in DVD 12.2.2.0.
Preamble: being Italian I can't avoid talking about football, so the example in this post analyses some Serie A data together with some Fantasy Football information in order to understand which players I should choose for my team.
Data Sources
In order to analyse Serie A players I based my research on the following data points:
- Players cost: an Excel file containing Team, Role and Fantasy Football Cost for each Serie A player. This file can change match by match, since the cost of a single player can vary to reflect his performances.
- Players statistics: CSV files containing player statistics such as goals scored, yellow and red cards, assists and the fantasy football mark for every match of the current and past seasons.
For the purpose of the example I'm assuming the Players cost file is an XLSX received manually by the analyst (think of budget data) and the Players statistics data is stored in a Hive table.
Creating Data Sources in DVD
Data Visualization Desktop has a native connector to Hive, so we just need to click on "Data Sources", then Create -> Connection and select "Apache Hive". The setup is pretty simple: we need to specify the host, port, username and password of the Hive Server.
The next step is to create a new Data Source and select the newly created "TestHive" connection as its source. The list of Hive databases is then visible and, after selecting FantasyFootball, so is the list of its tables.
After clicking on the ff_statistics table we can select and import the columns. There is also an option to review or directly enter the SQL if needed. After clicking OK (and checking that no errors arise) we are ready to use the Hive table.
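DVD handles all of this through its UI, but to make the example concrete, here's a minimal sketch of the same connection and import made outside DVD with Python's PyHive package. The host, credentials, database and table names are illustrative assumptions, and password authentication presumes an LDAP-enabled HiveServer2.

```python
import pandas as pd
from pyhive import hive   # pip install 'pyhive[hive]'

# Connection details mirror what DVD asks for; all values here are illustrative.
conn = hive.Connection(
    host="hive-server.example.com",
    port=10000,
    username="analyst",
    password="secret",        # password auth assumes an LDAP-enabled HiveServer2
    auth="LDAP",
    database="fantasyfootball",
)

# Equivalent of importing the ff_statistics columns (or entering the SQL directly)
stats = pd.read_sql("SELECT * FROM ff_statistics", conn)
print(stats.head())
```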
The "Players Cost" Excel file, received manually by the analyst, can be directly updated using the Data Source -> Create -> Data Source -> File option.
DVD automatically detects the column types and provides a preview of the content
Once the data source is saved we are ready to start manipulating the data.
Data Flow
Our initial goal is to remove any data quality issues from the statistics table. These could be down to invalid rows in the CSVs, as well as to players not present in the "Players Cost" file (because they were sold to teams outside Serie A or ended their careers). To do so we can use the Data Flow option included in DVD and accessible from the Data Source page.
The first step is to select ff_statistics from the list of sources, right click and select "Add Step". From the list of options presented we can select Filter and remove all the invalid data by simply including only rows where "Code" is not empty (null).
The Data Flow chart now includes the Filter component. The following step is to bring the "Players cost" file into the flow by selecting the Add Data option. Then it's time to join the two sources, which we can do by selecting both of them and choosing the Join option.
We can specify the columns used in the join condition and the join type (inner or outer) by selecting the desired option in the Keep Rows section (either Matching rows or All rows). For the purpose of our analysis we'll keep only the matching rows of the two datasets (an inner join), since we are interested only in players listed in Players Cost that also have a valid set of statistics in Players Statistics.
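For anyone wanting to reproduce these steps outside DVD, here's a rough pandas equivalent of the flow so far. The file and column names (ff_statistics.csv, players_cost.xlsx, Code, Player) are assumptions for illustration rather than DVD's actual schema.

```python
import pandas as pd

# Players statistics (e.g. an extract of the Hive table) and the cost file;
# file and column names are illustrative.
stats = pd.read_csv("ff_statistics.csv")
costs = pd.read_excel("players_cost.xlsx")

# Filter step: keep only rows where "Code" is not null
stats = stats[stats["Code"].notna()]

# Join step: DVD's "Matching rows" option corresponds to an inner join
merged = stats.merge(costs, on="Player", how="inner")
```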
Now we can enrich the data set further, by adding derived metrics and attributes:
- Count of Matches: The number of valid matches (those with a non-null grade) played so far by each player. This will be used later to filter out all players with fewer than 10 valid games, since those are less likely to play in most of the matches.
- Role Translation: Roles are specified in Italian; a simple CASE WHEN can translate them into English.
The enrichment can be achieved by creating an additional Add Columns step and filling in the formulas appropriately; a rough equivalent is sketched below.
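Continuing the pandas sketch above, the two derived columns could look like this. The Mark and Role column names and the Italian role codes are assumptions for illustration only.

```python
# Count of valid matches (those with a non-null mark) per player
merged["MatchCount"] = (
    merged.groupby("Player")["Mark"].transform(lambda s: s.notna().sum())
)

# Role translation: a pandas stand-in for the CASE WHEN mentioned above.
# The Italian role codes are assumed for illustration.
role_map = {"P": "Goalkeeper", "D": "Defender", "C": "Midfielder", "A": "Forward"}
merged["RoleEN"] = merged["Role"].map(role_map)
```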
After filtering out all players with fewer than 10 valid marks, an Aggregate step can be added to set the aggregation level and methods. The Aggregate step should be included in every Data Flow, since it's the only place where Attribute/Measure and aggregation definitions can be made. A Data Flow without the Aggregate step will fall back to a default column definition that may result in an unusable output data source.
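The threshold filter and the Aggregate step could look like this in the same pandas sketch; the grouping columns and aggregation methods (average mark, total goals) are assumptions for illustration.

```python
# Continuing from the "merged" DataFrame built in the sketches above:
# keep players with at least 10 valid marks, then aggregate to one row per
# player. The aggregation columns and methods are illustrative.
players = (
    merged[merged["MatchCount"] >= 10]
    .groupby(["Player", "Team", "RoleEN", "Cost"], as_index=False)
    .agg(AvgMark=("Mark", "mean"), TotalGoals=("Goals", "sum"))
)
print(players.sort_values("AvgMark", ascending=False).head())
```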
Finally, we can store the resulting dataset locally in order to proceed with the analysis.
We can now execute the data flow and FantasyFootball is automatically added to the list of DVD's Data Sources. The Data Flow can also be stored in DVD in order to be re-executed when necessary.
Keep in mind that Data Flow works locally on the workstation where DVD is installed, so data extraction and manipulation will generate a load on the system based on the data volume and complexity of the steps.
Project
Before creating a project we can review the settings of the resulting FantasyFootball dataset and change the Attribute/Measure definition of the columns as well as the type of aggregation.
As written before, it's better to define Attributes/Measures with an Aggregate step in the Data Flow, since any setting changed directly in the dataset will be overwritten when the Data Flow is re-executed.
With the data preparation work completed, it's now time to create a project using the FantasyFootball dataset. As written in my previous post, a number of new visualisations are available with DVD 12.2.2.0; some of them, like the Chord, Parallel Coordinates and Sankey diagrams, are used in the example below.
Unfortunately I won't share the details of my findings, since those could be used against me in the competition, but hey... that Higuain looks like a good player!
In this post we saw a typical analyst use case, with data coming from multiple sources that needs to be joined together and cleansed. Operations previously done manually via Excel can now be automated, saved and re-executed with DVD's Data Flow.
Data Visualization Desktop 12.2.2.0
Yesterday Data Visualization Desktop (DVD) version 12.2.2.0 was released. Since its first release, DVD has aimed to extend Oracle's Data Visualization portfolio by adding a desktop tool that brings data visualization capabilities directly to end users without the intervention of the IT department, in line with Gartner's bi-modal IT.
The new version adds several capabilities to the existing product, such as new visualization types, data sources and a wrangling option. This post shares the details of the new release's additional features.
Installation
After downloading DVD, the installation is pretty simple: just double click on the Oracle_Data_Visualization_Desktop_V2_12_2_2_0_0.msi file, choose the installation folder and click on "Install".
Once the installation is finished, be aware of the message on the last screen: it says that the correct version of R and a set of packages need to be installed in order to use DVD's Advanced Analytics features. Those can be installed via the "Install Advanced Analytics" file placed in Start Menu -> Programs -> Oracle.
This setup lets you choose the R installation location, installs R and then downloads the relevant packages from cran.us.r-project.org.
New Visualisations
The first set of improvements in the new release concerns the out-of-the-box visualisations; a new set of graphs is now available:
- List: Shows a list of the dimension's values together with a gradient colouring based on the selected measure
- Parallel Coordinates: Shows multiple dimensions on the same chart, enhancing the ability to quickly get an insight into possible connections between them
- Timeline: An effective way of showing time-related facts; each fact is shown along a timeline with one or more distinguishing attributes. The example shows the quantity shipped by day and city.
- Network Diagrams: Chord, Circular, Network and Sankey diagrams are used to show inter-relationships between elements
Other visual enhancements include a multi-canvas layout that can be exported with a single click and a hierarchical or gradient colouring for the charts.
Data Sources
A lot of new data sources have been added to DVD, some of them still in a beta phase. A number of new databases are now supported, such as Netezza, Amazon Aurora and PostgreSQL.
An interesting enhancement is the connection to Dropbox and Google Drive, allowing DVD to source files stored in the cloud. Finally, DVD's exposure to the Big Data world has been enhanced by the addition of connectivity to tools such as Apache Drill, Presto and Cassandra.
Excel Editing
Excel sheets used as a data source can now be edited, and the DVD project refreshed, without the need to manually reload the spreadsheet.
Data Flows
There is a new component in DVD called Data Flow that allows the end user to perform some basic transformations of the data, such as joining two datasets (even if they come from different sources), filtering, aggregating, adding columns based on custom formulas and storing the result on the local file system.
In the example below two datasets coming from Hive (but the sources can also be different) are joined, and a subset of columns is selected and stored locally.
Data Flows can be stored in DVD and re-executed on request. The list of Data Flows is available under Data Sources -> Data Flows. In the next blog post I'll show a typical analyst use case in which Data Flow can help automate a series of data loading, cleansing and enriching steps.
Data Insights
Data Insights provides a way of quickly understanding the available dataset: by default it shows a series of graphs, one for every attribute, with the cardinality of each attribute's values. A drop-down menu allows the same graphs to be shown based on any measure defined in the dataset.
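As a rough illustration of what Data Insights computes by default, the sketch below prints the value cardinality of every attribute in a dataset loaded with pandas; the file name is hypothetical and the "text columns = attributes" assumption is a simplification.

```python
import pandas as pd

df = pd.read_csv("fantasyfootball.csv")              # hypothetical dataset
attributes = df.select_dtypes(include="object").columns

for col in attributes:
    counts = df[col].value_counts()                  # one chart per attribute in DVD
    print(f"{col}: {counts.size} distinct values")
    print(counts.head(), "\n")
```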
BI Ask
The new DVD version also contains BI Ask, which provides the ability to create queries in natural language that are automatically interpreted and presented as suggested visualisations.
As you can read in this post, the new version of Data Visualization Desktop adds a series of really interesting features enabling not only data visualisation but also data exploration and wrangling. In the next blog post we'll see a typical DVD use case and how the new Data Flow option can be used to combine data coming from various sources.
OBIEE, Big Data Discovery, and ODI security updates – October 2016
Oracle release their "Critical Patch Update" (CPU) notices every quarter, bundling together details of vulnerabilities and associated patches across their entire product line. October's was released yesterday, with a few entries of note in the analytics & DI space.
Each vulnerability is given a unique identifier (CVE-xxxx-xxxx) and a score out of ten. The scoring uses a common industry-standard scale based on how easy it is to exploit and what is compromised (availability, data, etc). Ten is the worst, and I would crudely paraphrase it as generally meaning that someone can wander in, steal your data, change your data, and take your system offline. Lower than that and it might be that it requires extensive skills to exploit, or that the impact is much lower.
A final point to note is that the security patches that are released are not available for old versions of the software. For example, if you're on OBIEE 11.1.1.6 or earlier, and it is affected by the vulnerability listed below (which I would assume it is), there is no security patch. So even if you don't want to update your version for the latest functionality, staying within support is an important thing to do and plan for. You can see the dates for OBIEE versions and when they go out of "Error Correction Support" here.
If you want more information on how Rittman Mead can help you plan, test, and carry out patching or upgrades, please do get in touch!
The vulnerabilities listed below are not a comprehensive view of an Oracle-based analytics/DI estate - things like the database itself, along with WebLogic Server, should also be checked. See the CPU itself for full details.
Big Data Discovery (BDD)
- CVE-2015-3253
- Affected versions: 1.1.1, 1.1.3, 1.2.0
- Base score: 9.8
- Action: upgrade to the latest version, 1.3.2. Note that the upgrade packages are on Oracle Software Delivery Cloud (née eDelivery)
OBIEE
- CVE-2016-2107
- Affected versions: 11.1.1.7.0, 11.1.1.9.0, 12.1.1.0.0, 12.2.1.1.0
- Base score: 5.9
- Action: apply bundle patch 161018 for your particular version (see MoS doc 2171485.1 for details)
BI Publisher
- CVE-2016-3473
- Affected versions: 11.1.1.7.0, 11.1.1.9.0, 12.2.1.0.0
- Base score 7.7
- Action: apply patch per MoS doc 2171485.1
ODI
-
- Affected versions: 11.1.1.7.0, 11.1.1.9.0, 12.1.3.0.0, 12.2.1.0.0, 12.2.1.1.0
- Base score: 5.7
- The getInfo() ODI API could be used to expose passwords for data server connections.
- More details in MoS doc 2188855.1
-
- Affected versions: 11.1.1.7.0, 11.1.1.9.0, 12.1.2.0.0, 12.1.3.0.0, 12.2.1.0.0, 12.2.1.1.0
- Base score: 3.1
- This vulnerability documents the potential that a developer could take the master repository schema credentials and use them to grant themselves SUPERVISOR access. Even using the secure wallet, the credentials are deobfuscated on the local machine and therefore a malicious developer could still access the credentials in theory.
- More details in MoS doc 2188871.1
Oracle Data Integrator 12c: Getting Started – Components and Architecture
I’ve decided that it’s time for a refresher on Oracle Data Integrator 12c. I’m writing a Getting Started series to help folks get interested in the product and maybe even teach a few old dogs (including myself) some new tricks. In my last post, I shared the history of ODI and a bit about what sets it apart from other ETL tools on the market. In this article, I’ll walk through different components of Oracle Data Integrator and some of the architecture choices you’ll need to make in order to get started with ODI 12c.
Components
Before diving into the architecture, we need to understand the different components that are part of the Oracle Data Integrator installation.
Repositories
ODI is driven by metadata. This metadata is stored away in two different repositories: the Master repository and the Work repository. The Master repository contains information about security (users, profiles, etc), topology (data connections, contexts, physical/logical schemas), and ODI versioning. Each Master repository can be linked to one or more Work repositories. Work repositories can be of 2 different types: development or execution. In a Development Work repository you’ll find all of the design objects (mappings, packages, procedures, etc) and datastore metadata. The Execution Work repository only stores the execution objects, Scenarios and Load Plans, and there is no development capability. More on all of these objects in a later post.
The Master and Work repositories can reside in the same database schema or as their own schemas in the same database instance. The latter practice was more common in the past, before the Repository Creation Utility (RCU) became the main mechanism for creating the repositories. The RCU doesn't provide an option for separating the repositories into two different schemas, so the standard is now a single schema. But that's not the only reason: it also stems from the best practice of separating your environments in their entirety - development, test, production - so each can be maintained, upgraded, and patched separately. We'll jump into the environment setup further down.
Agents
The Oracle Data Integrator Agent is what orchestrates the execution of processes created in ODI. At runtime, agents will be used to run Load Plans and Scenarios via an ODI schedule, command line call, web service call, or a third-party scheduler. Agents are accessed via http/https requests, regardless of how they are called into action.
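Since an agent is just an HTTP(S) endpoint, anything that can issue a web request - a scheduler, a monitoring script - can reach it. The sketch below is illustrative only: the host, port and context path are placeholders, and the exact URL and invocation payload depend on the agent type, version and configuration.

```python
import requests

# Hypothetical agent URL: host, port and context path are placeholders and
# depend on how the agent was configured at install time.
AGENT_URL = "https://odi-agent.example.com:20910/oraclediagent"

# A simple liveness check: if the agent responds, the endpoint is reachable.
response = requests.get(AGENT_URL, timeout=10)
print(response.status_code)
```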
There are 3 types of ODI 12c agents:
- JEE Agent: Implemented as a deployment in WebLogic Server 12c, the JEE Agent allows you to use the features of WLS, such as clustering for high availability and JDBC connection pooling.
- Standalone Agent: A lightweight Java application that is typically installed to run closest to where most of the transformations will occur. In most data warehouse setups, this is on the data warehouse server.
- Colocated Agent: Essentially a Standalone Agent that is managed via WebLogic Server. If you want to manage all of your agents via WLS, this is the way to go.
A great article from the ODI A-Team, ODI Agents: Standalone, JEE and Colocated, describes the agent types in further detail, including the comparison of agent features chart, found below.
Studio
ODI Studio is a Java based development environment based on the JDeveloper framework. Studio is installed on client machines and used to connect to the master and work repositories to access the ODI metadata and perform object development. Essentially, this is where the magic happens!
Architecture - It Depends.
With any good question comes the answer, “it depends”. Before we can choose an architecture for ODI, the system requirements must be determined, allowing us to work through the “it depends” answer more clearly. Let’s dive right in with some potential requirements that may be necessary for a proper data integration setup.
High Availability
A key decision that drives which components of Oracle Data Integrator will be installed and configured stems from the need for a highly available ETL process. If there is a critical process or reporting that relies on ODI, then HA will be a requirement. Not only that, but you’ll want to look at using something like Oracle RAC for the repository database in order to keep it up and running. Finally, high availability won’t save you from an entire data center going offline, so ensure you have a disaster recovery process in place as well.
Environments Required
How many environments do you need? Let's start with the minimum: Development, Test/QA, and Production. Ok, if you're a small shop you might be able to get away without Test/QA, but that's not recommended. I would also add a 4th environment, Hotfix, which stores the production development objects, allowing your team to fix a production issue quickly without having to restore code from source control. The purpose of understanding the number of environments upfront is to determine how many application servers and database servers will be required for the entire Oracle Data Integrator setup. There's yet another great article from the Oracle A-Team that describes the use of the ODI Master Repository across these many environments.
Lifecycle Management and Deployment Process
As you can see in the environments image above, there are also different arrows showing the deployment process and use of source control. The deployment process is usually the easy part to determine: Migrate ODI execution objects from Dev—>Test—>Prod. But the mechanism for doing so might be a bit different, especially if (or more likely, when) source control is introduced.
Oracle Data Integrator 12c can integrate with Subversion, and soon Git, for full lifecycle management capabilities. ODI also has its own object versioning, but it really is only to be used as a last resort. Often, teams have developed their own process around exporting objects to XML, loading into a source control system, and migrating to the next environment. Whichever process you determine is best for your organization, or if you plan to piggy-back on what’s currently in play for developers at your company, you’ll want to ensure the correct components are introduced into the architecture.
Sources and Targets
This is about the types and location of the data sources that ODI will need to connect to. If you have a set of flat files on a server that is unreachable from the machine where the ODI agent is installed, you’ll need a new Standalone agent placed somewhere that can pull from the file server. Drawing up the entire “planned” data flow will help to sort out these decisions early on, especially if you introduce big data into the mix.
Security
Finally, everyone’s favorite topic: security. There are many aspects to security within ODI, including how developers access ODI Studio and how to secure your ETL processes and the application itself. As mentioned earlier, the ODI Agents are called via a web request. The addition of SSL can further secure transmission of these requests, but may also introduce additional setup. If you have a large team of ETL developers, or maybe just a company policy on how applications are to be accessed, ODI can be integrated with your organization’s LDAP via the external authentication setup. With these and other considerations for ODI security, be sure to sort this out during the requirements and architecture phase.
There are many other questions that will need to be answered in order to properly choose your architecture, but hopefully this will get you started. As always, you can join one of the Rittman Mead ODI bootcamps to learn more from one of our experts on the product. Up next in the Getting Started series, we’ll look at Oracle Data Integrator installation and configuration.