Recently I have been working with Oracle Enterprise Data Quality (EDQ) combined with Oracle Data Integrator (ODI). We will tell you more about exactly what we have been doing in another blog, but for now I would like to revisit our earlier EDQ blogs. I am going to keep to the batch ETL paradigm and not really look at the real-time aspects of Oracle EDQ.
My colleague, Koen Vantomme, blogged about EDQ back in August 2012, and Mark Rittman wrote about EDQ and ODI in December 2012. New versions of ODI (12c) and EDQ (188.8.131.52) have been released and this has introduced a few changes in the way we use the two products together. Before discussing that I will give a brief overview of what EDQ can do and three typical use cases we may wish to use in a data integration / data warehousing project.
The product name says it all. “Oracle Enterprise Data Quality” is Oracle’s Data Quality toolset for Enterprises. It provides a comprehensive set of functionality for data stewardship, data profiling and data correction (both in real time by use of web services and, more traditionally, in batch operations). Additional, extra-cost components can be added to provide functionality such as address verification, CRM address services and watch-list screening.
Out-of-the-box we get a web front end giving access to various applets.
As developers, we will probably use the Director applet most often. Director is a web-launch Oracle ADF applet and is the IDE for creating and managing EDQ processes and jobs. One or more processes can be chained together in an EDQ job and this job can be executed from the Director GUI, the command line or, as I will cover later, from inside an ODI package (and hence from an ODI scenario or loadplan).
In the data warehousing world success is usually measured by whether we give the data users what they want. This usually boils down to being ‘timely’ (are the data recent enough for their purposes?) and ‘correct’. Timely is about ETL process and load design. Correct, however, is mainly about data quality (although ETL logic flaws also play a part). Typically we see three kinds of issue:
- Incomplete: attributes or values that are missing. For example an order in process will not have a shipping date until the order is shipped; and in the case of a website data feed non-mandatory fields on a web form may be left blank.
- Incorrect: values that are present but wrong. I have seen an e-retailer with a disproportionate number of customers from Albania, where Albania just happened to be the first country listed on the web-form country selector on the customer registration page.
- Inconsistent: the same data being presented in different ways so that we don’t always spot multiple records referring to the same entity. This can be simply an inconsistent use of letter case in a description or it can be more complex such as the use of synonyms and variant spellings; for example my father was William Scott, but he could appear as ‘Will Scott’, ‘Bill Scott’, ‘Wm Scott’, ‘W Scott’ or ‘W G Scott’. However we can’t just blindly convert all cases of Billy to William as Billy could be a legitimate data value; we need to build more complex data rules about this. Sometimes inconsistent data is more physical in nature, for example one of our data sources may use a different character set and we see unexpected character conversions in our data.
Fortunately, EDQ gives us some great tools that we as ETL developers can use to investigate and handle these kinds of issue. Just dropping a data source onto the EDQ Director canvas prompts us if we want to auto profile the data; click ‘yes’ and we automatically generate a process that we can use to inspect our data for a host of anomalies. We click on the green arrow above the canvas to execute the profiling and then click on any of the pre-built Profiler processor icons to see the results. A nice touch is the ability to drill down on the blue coloured results to see the underlying data.
Having profiled our data we are in a position to decide whether our ETL processes need to be written to handle anticipated data anomalies (of course we should always code defensively) or whether we go back to the data source owners for improved data feeds. If we adopt the “handle anomalies” route we can use the EDQ tool set to help with this. Notice I say “handle” and not “fix”; I am a strong believer that a data warehouse is the single source of truth and not the “place where truth is created”. If we receive data that needs fixing we should separate it from the data suitable for loading and ideally return it to the data owner for fixing. We may do some automatic fix-ups, but only if the rules can be firmly agreed with the business and we always maintain the original source data alongside so we can go back if we find our ‘rule’ was incorrect.
In the EDQ sample data (available on OTN for download) we find a customer CSV file where amongst other things the LastName column is highly inconsistent; some names in lowercase, some in uppercase, some in proper case (first character of words is uppercase). We can build out a simple process that does a proper case conversion on all of the data. We can be even smarter, and get it to ignore embedded uppercase so we don’t mess up names like McDonnell and DiCampo. In the real world we would possibly add an exceptions list to handle prefix words such as van, de and von which are traditionally in lower case. In my data correction task I have added another processor to add the current date to outward-bound data so we have a trail of when data was processed. The data written out also has both the original last name and the amended version.
Sometimes (I’d say “mostly!”) fixing data in the ETL is the wrong thing to do. In cases like these we can build a process using the Data Match processors and based on defined data criteria elect to process data, reject data or flag for further investigation. We can use these match operators as a trigger for instigating Oracle EDQ case management workflows or just use the inbuilt Director functionality to split good and rejected data and raise an email alert using the email processor. Using my customer data as an example we have several customers that are likely to be duplicates, with either variant versions of company name or address. We can build match rules and assign weights to the rule components so that certain conditions trigger matching, non-matching or review. Oracle EDQ supports some very sophisticated match algorithms, such as “within a given number of character substitutions” (aka typo matching) and “matching a percentage of words in a string”.
We can review our results from the data match tool; in my case I had one record that my rules considered to be an exact match and 52 records that were flagged for human review as likely matches. Note that we can highlight the discrepancies.
As I mentioned above we can call these EDQ jobs from the command line or as part of ODI packages. The simplest way is probably using the EDQ tool that has been available since ODI 184.108.40.206.
However, calling the EDQ command line interface from ODI is also feasible for earlier ODI versions or for more complex use cases that may not be supported by the tool (for example the use of execution parameters, although this restriction can be worked around in other ways). To use the ODI tool we need to know some information about the EDQ instance: the host, the EDQ JMX server port (this is not the same as the browser URL port), the user name and password, the EDQ project and job names, and the EDQ domain. There are a few gotchas that are not clear from the documentation, but setting this up is relatively easy. The biggest of these gotchas is that the EDQ domain name (that is, the base MBean to which you connect) has changed in the current EDQ release from the default name provided by ODI; it is now edq. We can verify this base MBean name by using Java Console on the server and drilling into the EDQ server process. The JMX server port number has also changed (8090 is now the default on the EDQ server). We need to explicitly set these values on the ODI EDQ tool General tab (the ODI defaults no longer work). There are two further restrictions we need to be aware of, but these are probably unchanged from earlier releases: the execution user for EDQ must be a full EDQ Administration user, and that user must not be currently logged into EDQ on another connection. For my testing I created a WebLogic EDQ-USER user and an EDQ_ODI_GROUP group in the WebLogic Security Realm and mapped the WebLogic group to the EDQ Administrators role on the EDQ Administration web page. That is, I have pushed user administration down to the WebLogic server.
Oracle EDQ installs on to an application server such as Apache Tomcat, IBM WebSphere or, of course, Oracle WebLogic. For this blog post I have downloaded the latest available versions of EDQ and ODI. It should be noted that WebLogic 12 is not supported by the current Oracle EDQ release. For my application server I am using a basic WebLogic 10.3.6 install. Download and run the EDQ Repository Creation Assistant to create the required metadata and schemas on the repository database and then run the EDQ-provided Oracle Universal Installer. When the base install is complete, run the WebLogic configuration utility and extend the domain to include Oracle EDQ. Start up the services from the WebLogic console. If the EDQ URL does not bring up the home page it may be that you need to make a fix to the node manager configuration (see MOS note 1617687.1). If you are using the latest Java 1.7 you will also probably need to change some Java security settings to allow the Java Web Start applets to launch. To do this, open the Java Control Panel and add the EDQ server and port to the exception site list.
In a follow-up post we will take a look at some of the Director functionality in detail.
Stewart’s too modest to mention it on the blog himself, but I just wanted to congratulate Stewart Bryson on being awarded Oracle ACE Director status by the Oracle OTN ACE program. Stewart was given the Oracle ACE award a few years ago to recognise past work he’d done for the Oracle BI, DW and ETL community, but this higher ACE Director award recognises the ongoing work he’s since been doing to share his knowledge and experience with the Oracle community. For a taste of what he’s been doing recently, check out this recent article on OTN on migrating OWB to ODI, where he goes through ODI12c installation, and then walks through several scenarios for interoperating with, and migrating to, Oracle Data Integrator 12c.
I also wanted to take this opportunity to share the exciting news that Edelweiss Kammermann, who many of you will know from her work with the Latin American Oracle User Group community, joined us back at the start of this year to work with Robin Moffatt in our training development area.
We’ve known Edel for many years, and she came over to Brighton last year to present on BI and SOA/BPM at our 2013 BI Forum. Edel is initially working with Robin on developing our new ODI12c course, followed later on in 2014 with our new BI Apps 11g course, and she’ll be delivering training in Europe, USA and Latin America later on in the year. We’re incredibly pleased and proud that Edel has joined us, and being an Oracle ACE herself, we’ve now got two Oracle ACE Directors, and two Oracle ACEs, in the Rittman Mead team.
I’m sure you’ll all join me in congratulating Stewart, and welcoming Edel to the Rittman Mead team!
I’m very pleased to announce that the Rittman Mead BI Forum 2014 running in Brighton and Atlanta, May 2014, is now open for registration. Keeping the format as before – a single stream at each event, world-class speakers and expert-level presentations, and a strictly-limited number of attendees – this is the premier Oracle BI tech conference for developers looking for something beyond marketing and beginner-level content.
This year we have a fantastic line-up of speakers and sessions, including:
- Oracle ACE and past BI Forum best speaker winner Kevin McGinley, on adding third-party visualisations to OBIEE
- Tony Heljula, winner of multiple best speaker awards and this year presenting on Exalytics and TimesTen Columnar Storage
- Sessions from TimesTen PMs Chris Jenkins and Susan Cheung on what’s coming with TimesTen
- Edward Roske, author of multiple books on Essbase, on Essbase optimisation
- Oracle’s Andrew Bond, and our own Stewart Bryson (Oracle ACE) with an update to Oracle’s reference BI, DW and Big Data Architecture
- Sessions from Oracle’s Jack Berkowitz, Adam Bloom and Matt Bedin on what’s coming with OBIEE and Oracle BI Applications
- Endeca sessions from Chris Lynskey (PM) and Omri Traub (Development Manager), along with ones from Branchbird’s Patrick Rafferty and Truls Bergersen
- And sessions from Rittman Mead’s Robin Moffatt (OBIEE performance), Gianni Ceresa (Essbase) and Michael Rainey (ODI, with Nick Hurt from IFPI)
We’ve also got some excellent keynote sessions including one in the US from Maria Colgan on the new in-memory database option, and another in Brighton from Matt Bedin and Adam Bloom on BI in the Cloud – along with the opening-night Oracle product development keynote in both Brighton and Atlanta.
We’re also very excited to welcome Lars George from Cloudera to deliver this year’s optional one-day masterclass, this year on Hadoop, big data, and how Oracle BI&DW developers can get started with this technology. Lars is Cloudera’s Chief Architect in EMEA and an HBase committer, and he’ll be covering topics such as:
- What is Hadoop, what’s in the Hadoop ecosystem and how do you design a Hadoop cluster
- Using tools such as Flume and Sqoop to import data into Hadoop, and then analyse it using Hive, Pig, Impala and Cloudera Search
- Introduction to NoSQL and HBase
- Connecting Hadoop to tools such as OBIEE and ODI using JDBC, ODBC, Impala and Hive
If you’ve been meaning to take a look at Hadoop, or if you’ve made a start but would like a chance to discuss techniques with someone who’s out in the field every week designing and building Hadoop systems, this session is aimed at you – it’s on the Wednesday before each event and you can book at the same time as registering for the main BI Forum days.
Attendance is limited to around seventy at each event, and we’re running the Brighton BI Forum back at the Hotel Seattle, whilst the US one is running at the Renaissance Midtown Hotel, Atlanta. We encourage attendees to stay at the hotel as well so as to maximise networking opportunities, and this year you can book US accommodation directly with the hotel so you can collect any Marriott points, corporate discounts etc. As usual, we’ll take good care of you over the two or three days, with meals each night, drinks receptions and lots of opportunities to meet colleagues and friends in the industry.
Full details are on the BI Forum 2014 web page including links to the registration sites. Book now so you don’t miss out – each year we sell out in advance, so don’t leave it to the last minute if you’re thinking of coming. Hopefully see you all in Brighton and Atlanta in May 2014!
Recently, my colleague, Pete Carpenter, described a proof of concept we carried out using Amazon Redshift as the data warehouse storage layer in a system capturing data from Oracle E-Business Suite (EBS) using Attunity CloudBeam in conjunction with Oracle Data Integrator (ODI) for specialised ETL processing and Oracle Business Intelligence (OBI) as the reporting tool.
In this blog I will look at Amazon Redshift and how it compares with a more traditional DW approach using, as my example, Oracle. I am not going to talk performance in absolute terms as your mileage is going to vary.
What is Redshift?
Redshift is the Amazon Cloud Data Warehousing server; it can interact with Amazon EC2 and S3 components but is managed separately using the Redshift tab of the AWS console. As a cloud based system it is rented by the hour from Amazon, and broadly the more storage you hire the more you pay. Currently, there are 2 families of Redshift servers, the traditional hard-disk based, and the recently introduced SSD family, which has less storage but far more processing power and faster CPUs. For our trials we looked at the traditional disk based storage on a 2 node cluster to give us 4TB of disk spread across 4 CPU cores. Apart from single node configurations, Redshift systems consist of a leader node and two or more database nodes; the leader node is supplied free of charge (you only pay for the storage nodes) and is responsible for acting as the query parser, coordinating the results from the database nodes, and being a central network address for user access.
The Redshift product has its origins in ParAccel, and that in turn in Postgres, and thus supports ANSI SQL and the ODBC and JDBC Postgres drivers. In basic terms it is a share-nothing parallel processing columnar store database that supports columnar compression.
At the cluster level all sorts of robustness features come into play to handle routine hardware failures such as the loss of a node or disk; regular automatic backups occur and on-demand backups can be made to S3 storage for DR or replication to other AWS networks. It is possible to dynamically change the number and/or type of Redshift nodes in use; in effect a new cluster is spun up and the data copied from the existing system to the new one before the old system is dropped. The original database remains open for query (but not update) during the scale-out (or scale-down) process. As Pete Carpenter described, creating a new Redshift instance is a simple matter of completing a few web forms and waiting for the cluster to come up. Once up you can connect to the database using the master credentials you specified at cluster creation and then create databases, users, and schemas as required.
Databases, users, schemas and security
Although it is possible to run a Redshift database using the master user and the default database, good practice suggests that we do a bit more than this. In some ways Redshift is a little like the Oracle 12c database in that we can create additional databases within the master database, much in the style of pluggable databases; a major difference comes with the concept of a USER. In Oracle 12c a user belongs to a pluggable database; in Redshift all users belong to the master (container) database and can see any of the contained databases (subject to grants). Schemas are logical groupings for objects and need not be aligned to database user names. Standard object and role grants allow users to access specific databases, schemas, and tables or to have role-rights such as administrator. The final aspect of security is outside the database and is in effect a firewall rule to permit any nominated AWS user or specified IP addresses to speak to the database listener; by default the rule is no inbound access. The diagram below is a block representation of how databases, users, schemas and firewall interrelate. Note user names are descriptive and not valid names!
A key point of difference between Amazon Redshift and Oracle is in how the data is stored or structured in the database. An understanding of this is vital to designing a performant data warehouse. With Oracle we have shared storage (SAN or local disk) attached to a pool of processors (single machine or a cluster); Redshift, however, uses a share-nothing architecture, that is, the storage is tied to the individual processor cores of the nodes. As with Oracle, data is stored in blocks; however, the Redshift block size is much larger (1MB) than the usual Oracle block sizes. The real difference is in how tables are stored in the database: Redshift stores each column separately and optionally allows one of many forms of data compression. Tables are also distributed across the node slices so that each CPU core has its own section of the table to process. In addition, data in the table can be sorted on a sort column, which can lead to further performance benefits; I will discuss this in the section on tables.
Not all of the database features we come to expect in an Oracle data warehouse are available to us in Redshift. The Redshift Developer Guide has the full rundown on what is available, but for now here is a short list of common DW features that are not going to be available to us.
- Primary, Unique, Foreign Key (all usable by optimizer but not enforced)
- Spatial (Locator) functionality
- Sequences (although there is an IDENTITY column attribute for auto-numbering)
- MERGE – we have to code as UPDATE and INSERT in two steps
- In-database PL/SQL-like language
- User defined functions
- Timestamps (with timezone)
- XML types
- Pseudo columns
- Various SQL functions (not a full list, but functions I often use in ETL processes):
  - Regular expressions
  - Regression functions
In addition data types may not be exactly the same as those used in Oracle; for example DATE in Oracle has a resolution of 1 SECOND, DATE in Redshift has a resolution of 1 DAY.
The basic Oracle syntax to create a table works (as does CTAS, Create Table As Select), however there are additional items we can, and should, specify at table creation.
By default the data distribution style is EVEN, that is, data is distributed between node-slices in a round-robin fashion. For performance we may wish to specify a distribution key column to allow a particular column to control how data is distributed; this is a similar concept to Oracle hash partitioning, with the same sort of performance characteristics. We aim to create an even distribution of rows per slice (else one slice will take longer than the others to process its data), and by applying the same distribution to other tables that are commonly joined we can benefit from improved join performance, as all of the rows to be joined are stored in the same node-slice. Sometimes it is more appropriate to replicate the whole table to each slice so that the data is always available to join without the need to move data to the same slice before joining; in such cases we set the distribution style to be ALL.
The second thing we can set on a table is the SORTKEY; this specifies one or more columns on the table by which the data is ordered on data load (it can be the same column as the distribution key). Redshift maintains information on the minimum and maximum values of the sort key in each database block and at query time uses this information to skip blocks that do not contain data of interest.
Finally, we can elect to compress columns in the database. If we do not specify compression, the default of RAW (i.e. uncompressed) is used. For compressed data we can specify the compression algorithm used; different algorithms are better for certain data types and values. Compression may be data block based (DELTA, BYTE-DICTIONARY, RUN LENGTH, TEXT255 and TEXT32K) or value based (LZO and the MOSTLY compressions). This sounds daunting, but there are two ways we can get compression suggestions from the database: using the ANALYZE COMPRESSION command on a loaded table, and the AUTO COMPRESS feature of the COPY command, which however requires an empty non-compressed target table. COPY is the Redshift equivalent of SQL*Loader; it takes a flat file and inserts it into the database.
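As a rough sketch of the three distribution styles described above (the table and column names here are hypothetical and chosen only for illustration), the DDL might look like this:

```sql
-- Hypothetical tables; names and columns are assumptions for illustration.

-- EVEN (the default): rows are spread round-robin across the slices.
CREATE TABLE stg_events (
    event_id   INTEGER NOT NULL,
    event_text VARCHAR(200)
)
DISTSTYLE EVEN;

-- KEY: rows are placed according to the value of the distribution key column.
CREATE TABLE fact_orders (
    order_id    INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    order_date  DATE NOT NULL
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- ALL: the whole table is replicated, so joins to it need no data movement.
CREATE TABLE dim_currency (
    currency_id   INTEGER NOT NULL,
    currency_code CHAR(3) NOT NULL
)
DISTSTYLE ALL;
```

Replication suits small, slowly changing lookup tables; for a large table the extra storage and load time usually outweigh the join benefit.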
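The two routes to compression suggestions could be sketched as follows. The bucket, prefix and IAM role shown are placeholders rather than real resources, and the pipe delimiter is an assumption about the file format:

```sql
-- Route 1: report suggested encodings for a table that already holds data.
-- This only produces a report; it does not change the table.
ANALYZE COMPRESSION t1;

-- Route 2: load an empty, uncompressed table and let COPY pick encodings
-- from a sample of the incoming data.
-- The S3 location and IAM role below are placeholders.
COPY t1
FROM 's3://my-example-bucket/t1/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
DELIMITER '|'
COMPUPDATE ON;
```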
Let’s consider a simple table T1 with three columns, C1, C2 and C3. We can create this using a simple piece of DDL:
CREATE TABLE T1 ( C1 INTEGER NOT NULL, C2 VARCHAR(20) NOT NULL, C3 DATE );
I have not used any of the Redshift nice-to-have features for sorting, distribution, and compression of data. Note too that I am using NOT NULL constraints; this is the only constraint type enforced in the database. This simple create statement creates database objects on each slice of the cluster, with one block per column per slice (1 slice = 1 CPU core); see the following diagram. Note that there is no table object stored in the database; it is a collection of columns.
Without specifying a distribution key data is evenly spread across all slices. When a 1MB block for a column is full a new block is created for subsequent inserts on the slice. An empty table will occupy block size * number of columns * number of cores, and as our block size is 1MB this would be columns * cores megabytes; by this reckoning T1, with its three columns, would occupy 3 * 4 = 12MB on our 4-core trial cluster.
Using a distribution key effectively hashes the data on the key column by the number of cores. Adding a sort key declares that the rows in the table are ordered and potentially allows block elimination to kick in. If our sort key is, say, transaction date, it is likely that our data loads occur in transaction date order; however, if we sorted on product code we might find each data load has data that needs to be inserted between existing rows. Redshift does not do this; the data is still appended to the table, and the table then needs to be reorganised to put the rows in order. There are two ways to achieve this: the VACUUM command, which does an on-line reorg of the table, and the potentially faster route of creating a copy table, populating it, and then dropping the original and renaming the copy. Of course this gives a little downtime when the original table is not available for access.
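The two reorganisation routes might look like the following sketch, assuming a table t1 that was created with a sort key; the name t1_copy is hypothetical:

```sql
-- Option 1: re-sort the table in place; it stays available while this runs.
VACUUM t1;

-- Option 2: copy, drop and rename.
-- LIKE carries over the column definitions together with the distribution
-- and sort keys of the original table.
CREATE TABLE t1_copy (LIKE t1);

-- A bulk insert into the empty copy stores the rows in sort key order.
INSERT INTO t1_copy (SELECT * FROM t1);

-- The table is briefly unavailable between these two statements.
DROP TABLE t1;
ALTER TABLE t1_copy RENAME TO t1;
```

Any grants on the original table would need to be reapplied to the renamed copy.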
Applying compression, sort and distribution we get a DDL statement like:
CREATE TABLE T2 ( C1 INTEGER NOT NULL, C2 VARCHAR(20) NOT NULL SORTKEY DISTKEY, C3 DATE ENCODE DELTA );
This table uses column C2 as both the sort key and the distribution key; column C3 is compressed using delta compression – this is an efficient compression algorithm where most dates are within ±127 days of the date of the previous row. If we wanted to use a multi-column sort key the DDL syntax would be like:
CREATE TABLE T1 ( C1 INTEGER NOT NULL, C2 VARCHAR(20) NOT NULL DISTKEY, C3 DATE ) SORTKEY (C3,C2);
Multi-column distribution keys are not supported.
Designing for Performance
Redshift is designed for query and bulk insert operations; we can optimise query performance by structuring data so that less data is transferred between nodes in join operations or less data is read from disk in a table scan. Choosing the right sortkeys and distkeys is vital in this process. Ideally these key columns should not be compressed. Adding primary and foreign keys to the tables tells the optimizer about the data relationships and thus improves the quality of the query plan being generated. Of course up-to-date table stats are a given too; tables must be ANALYZEd whenever the contents change significantly and certainly after initial load. I feel that we should collect stats after each data load.
For a FACT + DIMENSIONS data model (such as in the performance layer of Oracle’s Reference Data Warehouse Architecture) it would be appropriate to distribute data on the dimension key of the largest dimension on both the dimension and the fact tables, this will reduce the amount of data being moved between slices to facilitate joins.
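A minimal sketch of this layout, using hypothetical table and column names, declares the informational keys, distributes both tables on the dimension key, and refreshes statistics after loading:

```sql
-- Hypothetical star schema fragment; names are assumptions for illustration.
-- The PRIMARY KEY and FOREIGN KEY constraints are not enforced by Redshift,
-- but the optimizer can use them when building query plans.
CREATE TABLE dim_customer (
    customer_key  INTEGER NOT NULL,
    customer_name VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_key)
)
DISTKEY (customer_key)
SORTKEY (customer_key);

-- The fact table shares the distribution key of its largest dimension, so
-- matching rows sit on the same slice and the join needs no redistribution.
CREATE TABLE fact_sales (
    customer_key INTEGER NOT NULL,
    sale_date    DATE NOT NULL,
    amount       DECIMAL(12,2),
    FOREIGN KEY (customer_key) REFERENCES dim_customer (customer_key)
)
DISTKEY (customer_key)
SORTKEY (sale_date);

-- Refresh optimizer statistics once the tables have been loaded.
ANALYZE dim_customer;
ANALYZE fact_sales;
```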
For optimal performance we should always ensure we include both the distribution keys and the sort keys in any query, even if they appear to be redundant. The presence of these keys forces the optimizer to access the tables in an efficient way.
For best data load performance we insert rows in bulk and in sortkey order. Amazon claims best performance comes from using the COPY command to load from flat files and, as second best, the bulk insert SQL commands such as CTAS and INSERT INTO T1 (select * from T2);. Where Redshift performs less well is when we use certain kinds of ETL steps in our process, particularly those that involve updating rows or single-row activities. In addition, loading data without respecting the sort key leads to performance problems on data query. If data update is essential we have two real options: we move our ETL processes to a conventional database hub server (perhaps using ODI) and just use Redshift to store pre-transformed data; or we revise our ETL processes to minimize update activity on the Redshift platform. There is some scope to optimize updates by distributing data on the update key, but another approach is to use temporary tables to build the results of the update and to replace the table with the results of the merge. This requires a bit of inventiveness with the ETL design, but fortunately many of our required SQL constructs, including analytic functions, are there to help us.
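One way the "rebuild and replace" approach could be sketched is shown below. The tables customer and customer_stage and their columns are hypothetical, and the sketch assumes each row carries an updated_at timestamp that identifies the most recent version:

```sql
-- Hypothetical tables: customer (current data) and customer_stage (changes).
-- Build the merged result in bulk, keeping the latest version of each key.
CREATE TABLE customer_new
DISTKEY (customer_id)
SORTKEY (customer_id)
AS
SELECT customer_id, customer_name, updated_at
FROM (
    SELECT customer_id, customer_name, updated_at,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY updated_at DESC) AS rn
    FROM (
        SELECT customer_id, customer_name, updated_at FROM customer
        UNION ALL
        SELECT customer_id, customer_name, updated_at FROM customer_stage
    ) AS combined
) AS ranked
WHERE rn = 1;

-- Swap the rebuilt table in place of the original.
DROP TABLE customer;
ALTER TABLE customer_new RENAME TO customer;
```

If two versions of a row share the same updated_at value the choice between them is arbitrary, so a real design would need a tie-breaking column.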
Building an Amazon Redshift Data Warehouse using ODI and Attunity Part 2 : Using ODI and OBIEE with Redshift as a Target/Source
In my post yesterday we stepped through the initial set up and staging load for a data warehouse using Amazon Redshift and Attunity, for eventual use with OBIEE and ODI. Now that we have our source data in Redshift, let’s look at transforming it into a star schema using ODI, by initially looking at how we set up the connection to Redshift in ODI’s Topology Navigator.
As I mentioned in yesterday’s post and in a blog article we wrote on non-Oracle databases a few months ago, Redshift is based on ParAccel technology, but uses PostgreSQL ODBC and JDBC drivers to connect to it. Therefore, we can create a Physical Architecture connection in ODI through to it using the PostgreSQL technology type, like this:
For the Redshift JDBC URL you’ll need, refer to the AWS Cluster summary page back on the Amazon AWS Console.
We can then add our schemas:
Then, if your connection to the Redshift database is working correctly, you should see a list of schemas in the drop down menus:
Once all the schemas have been added, we then go through the familiar ODI process of logical architecture and context mapping:
Next, we can create a model and reverse engineer in our data stores:
It’s a good idea at this point to review each datastore and ensure that all attributes have been correctly assigned a datatype. To save a little reading time, a set of dimension and fact tables were created in the performance schema, and these were also reverse engineered in:
So from here on out, the process of creating and using mappings, packages, variables etc to populate your target tables is fairly straightforward, one which ODI Developers will be familiar with. You add your source and target data stores and implement the transformation logic.
For the initial load, the SQL Control Append IKM was used and for the most part, this KM worked well without customisation. However, one of the early issues encountered during execution was with unordered outer joins – this appears to be an issue with the PostgreSQL technology. This was resolved by amending the mapping and setting an order for the joins:
Merging / Incremental loading
One of the things to be aware of with Redshift at the time of writing is the absence of Merge functionality (we’re so spoilt in Oracle DBMS land ;) ). You can of course achieve this by issuing separate insert and update statements. There is also the SQL Incremental Update IKM in ODI, but you may want to review the KM Task steps against your requirements before proceeding, as invariably some customisation will be required to improve performance. The same applies to SCD functionality – you should consider utilising analytic window functions (Redshift supports several aggregate and ranking window functions) to achieve your desired format.
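As an illustration of the separate update and insert statements, here is a minimal hand-written sketch. It is not the SQL generated by the ODI knowledge module, and the tables dim_customer (target) and stg_customer (staged changes) are hypothetical:

```sql
-- Hypothetical target and staging tables, for illustration only.
BEGIN;

-- Step 1: update target rows that have a matching row in staging.
UPDATE dim_customer
SET customer_name = s.customer_name
FROM stg_customer s
WHERE dim_customer.customer_id = s.customer_id;

-- Step 2: insert staged rows that are not yet present in the target.
INSERT INTO dim_customer (customer_id, customer_name)
SELECT s.customer_id, s.customer_name
FROM stg_customer s
LEFT JOIN dim_customer d ON d.customer_id = s.customer_id
WHERE d.customer_id IS NULL;

COMMIT;
```

Wrapping both steps in one transaction means readers see either the old state or the fully merged state.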
Additionally, as mentioned previously Attunity supports a variety of replication tasks – you can do one off initial loading, an initial load + apply changes, and/or collect deltas into a change tracking table in the Redshift database. This in itself means you have the basic building blocks to create and maintain a Foundation layer as part of your Data Warehouse Design.
So, once we have data in our target fact and dimension tables, we can model these in OBIEE and create some analyses.
As per the AWS documentation for Redshift, the recommended ODBC driver was downloaded from the Amazon AWS website, and an ODBC data source was created on the OBIEE server:
For the server entry, you can use either the public or private leader node IP, depending on your network/security configuration. Just backtracking a little here – you may recall that the usr_dw account was created after the initial Redshift cluster build. In order for it to be able to access the tables in the performance schema, we need to grant the required select privileges.
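The grants might look like the following, run as the master user and assuming the performance schema is literally named performance (the actual schema name in your cluster may differ):

```sql
-- Allow usr_dw to reference objects in the schema at all.
GRANT USAGE ON SCHEMA performance TO usr_dw;

-- Allow usr_dw to query every table that currently exists in the schema.
GRANT SELECT ON ALL TABLES IN SCHEMA performance TO usr_dw;
```

Tables created later are not covered by the second statement, so the grant would need to be repeated after new tables are added.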
Once the ODBC connection is defined, we can create an RPD, define our database connection properties, import our tables and start modelling.
When it came to setting the features I used the query DBMS function and then went with what it set. Bear in mind that because PostgreSQL and Redshift aren’t supported source databases for OBIEE, depending on the analyses you are creating you may or may not hit functionality issues, so you may find yourself tweaking these settings.
So if your connection is defined correctly and there are no other issues, you should now be able to import the tables:
Once you’ve made your selection, you can then define the relationships between the fact and dimension tables (if you have primary and foreign keys defined these should propagate through), create any aliases etc. and then you’re about ready to start building up the Business Model and Presentation areas.
Once the RPD modelling is completed, we can upload it and create some content.
Admittedly the analyses created for this dashboard were fairly basic, so no functionality issues were encountered during creation and execution – you might hit some with more “edge-cases” such as analytic functions or nested subtotals, or you might end-up hitting performance issues when you’ve got lots of concurrent users, as the SQL issued by OBIEE might not be as efficient as it would be for a supported data source.
Performance Considerations for Redshift
Pete Scott will be covering the ins and outs of designing for performance in a blog post tomorrow, for which I’ll add a link to this post once it’s up. However, one thing that should be mentioned is the importance of setting sort and distribution keys appropriately on the fact and dimension tables, as well as Primary and Foreign keys (whilst these constraints are not enforced, they are used by the optimiser when determining execution plans). The sort and distribution keys determine how the data is stored on disk and how it is distributed across the compute nodes, and can have a big impact on query response times, in addition to any issues arising from the lack of official support in OBIEE for Redshift as a data source.
So to conclude - whilst not officially supported, it is possible to create and maintain a Data Warehouse in Redshift and use the Oracle Product set to drive ETL and reporting. Some of the features that make Redshift an attractive DB platform are its ease of management, its scalability and sizing options, and the ability to get a cluster up and running in hours rather than days. Combined with ODI and OBIEE and therefore a transferable skills base, it makes for an intriguing DW solution.