Tag Archives: Data Warehousing

Oracle 12.1.0.2 and Data Warehouses

If you follow Blogs and Tweets from the Oracle community you won’t have missed hearing about the recent release of the first patch-set for Oracle 12c. With this release there are some significant pieces of new functionality that will be of interest to Data Warehouse DBAs and architects. The headline feature that most Oracle followers will know of is the new in-memory option. In my opinion this is a game-changer for how we design reporting architectures; it gives us an effective way to build operational reporting over the reference data architecture described by Mark Rittman a few weeks ago. Of course, the database team here at Rittman Mead have been rolling up our sleeves and getting into in-memory technology for quite a while now, Mark even featured in the official launch presentation by Larry Ellison with the now famous “so easy it’s boring” quote. Last week Mark published the first of our Rittman Mead in-memory articles, with the promise of more in-memory articles to come including my article for the next edition of UKOUG’s “Oracle Scene”.

However, the in-memory option is not the only new feature that is going to be a benefit to us in the BI/DW world. One of the new features I am going to describe is Exadata only, but the first one I am going to mention is generally available in the 12.1.0.2 database.

Typically, data warehouse queries are different from those seen in the OLTP world – in DW we tend to access a large number of rows and probably aggregate things up to answer some business question. Often we are not using indexes; instead, scanning tables or table partitions is the norm. Usually, the data we need to aggregate is widely scattered across the table or partition. Data warehouse queries often look at records that share a set of common attributes; we look at the sales for the ‘ACME’ widget or the value of items shipped to Arizona. For us there can be great advantage if data we use together is stored together, and this is where Attribute Clustering can play a part.

Attribute Clustering is usually configured on the table at DDL time and in effect controls the ordering of data inserted by DIRECT PATH operations. Oracle does not enforce this ordering for conventional inserts; this may not be an issue in data warehouses, as bulk batch operations typically use APPEND (that is, direct path) inserts or partition operations, but it may be more of an issue with some of the real-time, conventional-path loading paradigms. In addition to direct path load operations, Attribute Clustering also occurs when you do ALTER TABLE MOVE type operations (this also includes operations such as PARTITION SPLIT). On the surface, Attribute Clustering sounds little different from using an ORDER BY on an append insert and hoping that Oracle actually stores the data where you expect it to. However, Attribute Clustering gives us two other possibilities in how we can order the data in the cluster.

Firstly, we can cluster on columns from JOINED dimension tables; for example, in a sales DW we may have a sales fact with a product key at the SKU level, but we often join to the product dimension and report at the Product Category level. In this case we can cluster our sales fact table so that each product category appears in the same cluster. For example, suppose we have just opened a chain of supermarkets with a wide but uninspiring range of brands and products (see the tiny piece of our product dimension table below).

[Image: sample rows from the product dimension table]

As you can see, our Product PK has no relationship at all to the type of product being sold. In our Kimball-style data warehouse we typically store the product key on the fact table and join to the product dimension to obtain all of the other product attributes and hierarchy members. This is essentially what we can do with join Attribute Clustering: in our example we can cluster our fact table on PRODUCT_CATEGORY so that all of the Laundry sales are physically close to each other in the fact table.

CREATE TABLE rm_sales (
  product_id      NUMBER NOT NULL,
  store_id        NUMBER NOT NULL,
  sales_date      DATE NOT NULL,
  loyalty_card_id NUMBER,
  quantity_sold   NUMBER(3) NOT NULL,
  value_sold      NUMBER(10,2) NOT NULL
)
CLUSTERING
  rm_sales JOIN products ON (rm_sales.product_id = products.product_pk)
  BY LINEAR ORDER (sales_date, product_category, store_id);

Notice we are clustering on a join to the product dimension table’s “product_category” column. We are also clustering on sales_date; this is especially important in the case of partitioned fact tables, so that the benefits of clustering align to the partitioning strategy. We are also not restricted in our clustering to just one join: if we wanted to, we could also cluster our sales by store region, e.g. so the Colorado laundry product sales are located in the same area of the sales table. To use join Attribute Clustering we need to define the PK/FK relationships between fact and dimension; however, it is always good practice to have that in place anyway, as it helps the CBO so much with query plan evaluation.
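
As a minimal sketch of what I mean (reusing the rm_sales and products tables from the example above, and using the common data warehouse pattern of RELY constraints so the optimizer knows about the relationship without the cost of enforcing it), the declarations could look something like this:

ALTER TABLE products
  ADD CONSTRAINT products_pk PRIMARY KEY (product_pk) RELY DISABLE NOVALIDATE;

ALTER TABLE rm_sales
  ADD CONSTRAINT rm_sales_product_fk FOREIGN KEY (product_id)
      REFERENCES products (product_pk) RELY DISABLE NOVALIDATE;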

Secondly, notice the BY LINEAR ORDER clause in the table DDL. Of the two ordering options, linear order is the most basic form of clustering: in this case we have our data structured so that all the items for a sales day are clustered together, within that cluster we order by product category, and those categories are in turn ordered by store_id. The other way we can cluster is BY INTERLEAVED ORDER; here, Oracle maps a combination of dimensional values to a single cluster value using a z-order curve fitting approach. This sounds complex, but it ensures that items that are frequently queried together are co-located in the disk blocks in the storage.

Interleaved ordering is probably the best choice for data warehousing as it aligns well with how we access data in our queries. Although we could include all of the dimension keys in our ordering list, it is going to be more beneficial to include just a subset of dimensions; typically for retail I’d go with date (or something that correlates to the time-based partition key of the fact table), the product and the store. Of course we can again join to the dimension tables and cluster at higher hierarchy levels such as product category and store region. The Oracle 12c Data Warehousing Guide gives some good advice, but you can’t go far wrong if you think about clustering items together that will be queried together.
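
As a sketch of what that could look like (assuming, for illustration only, a stores dimension joined on store_pk with a store_region column – names that don’t appear in the earlier example), an interleaved variant of the DDL might be:

CREATE TABLE rm_sales (
  product_id      NUMBER NOT NULL,
  store_id        NUMBER NOT NULL,
  sales_date      DATE NOT NULL,
  loyalty_card_id NUMBER,
  quantity_sold   NUMBER(3) NOT NULL,
  value_sold      NUMBER(10,2) NOT NULL
)
CLUSTERING
  rm_sales JOIN products ON (rm_sales.product_id = products.product_pk)
           JOIN stores   ON (rm_sales.store_id   = stores.store_pk)
  BY INTERLEAVED ORDER ((sales_date), (product_category), (store_region));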

Clustering data can give us some advantages in performance; better data compression and improved index range scans spring to mind, but to get the most benefit we should also look at another new feature, zone maps. Unlike Attribute Clustering, Zone Maps are Engineered Systems only. In a way they are similar to the storage indexes already found on Exadata, but they have some additional advantages; they are also somewhat different from the zone maps encountered in other DB vendors’ products such as Netezza.

In Exadata, a storage index can provide the maximum and minimum values encountered for a column in a storage cell. I say “can” as there is no guarantee that the range for a given column is held in the storage index. Zone maps, on the other hand, will always provide maxima and minima for all of the columns specified at zone map creation. The zone map is organised in terms of contiguous database blocks and is materialized, so it is physically persisted in the database and thus survives database restarts. Like materialized views, materialized zone maps can become stale and need to be maintained.

We can define a zone map on one or more table columns and, just like Attribute Clustering, we may also create zone maps on table joins. As a table can only have one zone map, it is important to include all of the columns you wish to track. Zone maps are designed to work well with attribute clustering; in fact it is just a simple DDL statement to add a zone map to an attribute-clustered table so that the zone map tracks the same attributes as the clustering. This is where we get the major performance boost from attribute clustering: instead of looking at the whole table, the zone map tells us which ranges of database blocks contain data that matches our query predicates.
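
As a sketch only (again reusing the rm_sales table from above, and with the zone map name being purely illustrative), that DDL could look something like this – either piggy-backing on the existing clustering definition, or as a standalone zone map on chosen columns:

-- Add a zone map that tracks the same attributes as the existing clustering
ALTER TABLE rm_sales MODIFY CLUSTERING WITH MATERIALIZED ZONEMAP;

-- Or create a standalone zone map on chosen columns
CREATE MATERIALIZED ZONEMAP rm_sales_zmap ON rm_sales (sales_date, store_id);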

Zone Maps with Attribute Clustering give us another powerful tool to boost DW performance on Exadata – we can do star queries without resorting to bitmap indexes, and we minimise IO when scanning fact tables as we only need to look where we know the data to be. Exciting times!

Introducing the Updated Oracle / Rittman Mead Information Management Reference Architecture Pt1. – Information Architecture and the “Data Factory”

One of the things at Rittman Mead that we’re really interested in is the architecture of “information management” systems and how these change over time as thinking, and product capabilities, evolve. In fact we often collaborate with the Enterprise Architecture team within Oracle, giving input into the architecture designs they come up with, and more recently working on a full-blown collaboration with them to come up with a next-generation Information Management architecture. In these two posts I wanted to share some of our recent thinking in this area, looking first at our new proposed architecture, and then in the second post talking about how we’d use agile development methods, in particular our “ExtremeBI” development approach, to deliver it.

But first, some history. Back in 2009 I blogged about a first-generation DW reference architecture which introduced a couple of new concepts, based on new capabilities from tools such as OBIEE plus some thinking we, and the Enterprise Architecture team at Oracle, had been doing over the years. This reference architecture introduced the concept of “Foundation” and “Access and Performance” layers, and recognised the reality that Kimball-type star schemas were great for querying but not so good for long-term, query-neutral storage of data, whilst Inmon-style EDW models were great as a long-term, process-neutral system of record, but not so good for running user queries on. This new architecture included both of these design approaches, with the foundation layer forming the “information management” layer and the access and performance layer being the “information access” layer. Most importantly, tools like OBIEE made it possible for enterprises to create metadata layers that potentially accessed all layers in this model, so users could query the foundation layer if needed as well as the access and performance layer, if the foundation layer was a better source of data for a particular report.

[Image: first-generation Information Management reference architecture]

A second revision to this model, a couple of years later, expanded on the original one and introduced another two interesting concepts, brought about by the introduction of tools like Endeca Information Discovery and the rise of unstructured and semi-structured data sources. This new architecture added unstructured and semi-structured sources into the model, and also introduced the concept of “sandboxes”, areas of the information management model that allowed more free-form, exploratory BI applications to be built.

[Image: second revision of the reference architecture, adding unstructured and semi-structured sources and sandboxes]

But in practice, this idea of “unstructured” and “semi-structured” sources wasn’t all that helpful. What really started to make an impact in the past couple of years is the increasing use of “schema-on-read” databases, where we trade off the performance and defined structure of traditional relational 3NF and star schemas for the flexibility and “time-to-value” provided by key-value store databases. The Endeca Server is a good example of this type of database: it allows rapid storage of loosely-associated datasets, and tools like Endeca Studio then apply a structure to the data at the time of analysis. Schema-on-read databases are great for fast, flexible access to datasets, but the cost of ETL is then borne by each system that accesses the data.

Probably the most well-known examples of schema-on-read sources, though, are Hadoop and NoSQL databases. Coupled with their ability to store lots of detail-level data at relatively low cost, Hadoop and NoSQL databases have significantly affected the overall landscape for BI, data warehousing and business analytics, and we thought it was about time for a new reference architecture that fully incorporated the capabilities and latest thinking around this area. Back at the start of 2014, Jon Mead, Stewart Bryson and I met up with Oracle’s Andrew Bond and his team for a series of workshops, and what came out of it was an updated Information Management Architecture *and* a development methodology for delivering it. Let’s start off then by looking at this updated architecture from a conceptual view.

[Image: updated Information Management architecture, conceptual view]

At a conceptual level, we build on this idea of a sandbox environment and formally separate things out into an Execution area – business-as-usual, production and development areas – and an Innovation area, where we build on the idea of a sandbox and rename it the “Discovery Lab”. The Discovery Lab is where, for want of a better word, the “data scientists” work, with fewer constraints on development; its inputs are events and data, and its outputs are the discoveries that can be the prototype and inspiration for designs going into the Execution area.

The main “engine” of the Execution area is our enterprise store of data, this time broken down into four areas:

  • A “data reservoir” where we store all incoming events and data at detail-level, typically on HDFS. This blog article by Oracle’s Jean-Pierre Dijcks sets out the concept of a data reservoir well, and I like this blog by Scaleabilities’ Jeff Needham where he makes the case for calling it a “data reservoir” that can ingest, process and analyse data rather than a “data lake”, which implies a passive store.
  • An Enterprise Data Store, analogous to the enterprise data warehouses we use today, and a reporting component, typically in our case OBIEE
  • Most importantly, the new concept of a “data factory”, a conduit between the data reservoir and the enterprise information store

Together, the execution and innovation layers form our “information platform”, with the event engine feeding real-time events into the platform and outputting them into the data reservoir, and traditional ETL routines loading structured data from the enterprise into the enterprise information store.

This conceptual architecture then permits several types of information application. For example, the data reservoir and the data factory together could support what we call “data applications”, applications working on semi-structured, large and low-granularity data sets such as those used for genomic analysis.

[Image: data applications supported by the data reservoir and data factory]

Other applications might be more traditional BI and data warehousing applications, but with the addition of data from the data reservoir and the analysis capabilities of Hadoop.

[Image: traditional BI and data warehousing applications extended with the data reservoir and Hadoop]

The discovery lab can be a standalone area, or the insights and discovery it outputs can be used as inputs into the main information platform. More event-based data will typically come in via the event engine, with its output going into the data reservoir and supporting “next-best-decision” applications like Oracle Real-Time Decisions.

Another way of looking at this architecture is from a logical perspective, in particular focusing on the data layers and access/loading processes to load them. The diagram below is our latest version of the two diagrams at the start of this article, and as you can see we’ve kept the data sources and BI element much the same, and kept the concept of the sandbox, in this case refined as the “discovery lab sandbox”.

[Image: logical view of the updated architecture, showing the data layers and the discovery lab sandbox]

What is different this time though is the middle bit; we’ve lost the staging area and replaced it with the raw data reservoir, added a “Rapid Development Sandbox”, and drawn the main layers as a slanted set of stacked areas. So why?

What we’re trying to show with the slanted data layers is the relative cost of data ingestion (loading), and the relative cost of accessing it (information interpretation). For the raw data reservoir, for example, there’s little cost in ingesting the data – maybe copy some files to HDFS, or use Flume or GoldenGate to capture log or transaction data to HDFS or Hive – but the cost is then borne in accessing this typically “schema-on-read” data source. As you go up the stack, there’s a bit more work in landing data into the Foundation layer – ETL routines, GoldenGate routines, some data cleaning and constraint checking, for example – but it’s correspondingly easier to get data out. For the Access and Performance Layer there’s the most cost in getting data in, but then users have very little work to do when getting data out.

Data can move up the stack from Raw Data Reservoir to Foundation, or directly into Access and Performance, or it could be landed at levels above Raw Data Reservoir, for example in our ExtremeBI approach where we use GoldenGate to replicate source system tables directly into Foundation without going through a staging layer. The Rapid Development Sandboxes are there to support agile, iterative development, with the output from them either being the result in itself, or their designs and insights being used to create more formal projects and data structures.

From a more product-centric perspective, you can overlay these types of diagrams with specific schematics for example enterprises. For instance, in the diagram below you can see Oracle NoSQL Database being used with HDFS and the Oracle Big Data Connectors to capture and store events from Complex Event Processing, and then outputs from CEP also being fed into a more traditional, “high density data” store as well as directly into a decision engine.

[Image: product-centric view with Oracle NoSQL Database, HDFS, the Big Data Connectors and CEP]

So this all sounds great, but how do you build it? Do we have to use the (discredited) step-by-step, waterfall method to build this type of architecture, and in particular the key “data factory” element that provides connectivity between the Raw Data Reservoir and the Enterprise Information Store? And can we apply agile methods to big data sources, as well as regular databases and applications? Check back on Monday for our thoughts on how this should be done.

Rittman Mead Featured in Oracle In-Memory Option Launch

Today saw the official launch of the Oracle Database In-Memory Option, with Larry Ellison going through the product features and then reading out some quotes and testimonials from beta testers. Rittman Mead were part of the beta testing program, with several of our team testing out various scenarios where we ETL’d into it, used it with OBIEE, and worked out what would be involved in “in-memory-enabling” some of our customers’ BI systems.

In fact, as we said in our quote for the launch, enabling Oracle Database for in-memory analysis was almost “boringly simple” – just enable the option, choose your tables, drop any OLTP indexes and you’re ready to go.
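
As a rough sketch of what those steps look like in practice (the memory size, table and index names below are illustrative only, not from our beta testing), the basic enablement is little more than:

-- size the in-memory column store (takes effect after a restart)
ALTER SYSTEM SET inmemory_size = 4G SCOPE = SPFILE;

-- choose the tables to populate into the column store
ALTER TABLE sales INMEMORY MEMCOMPRESS FOR QUERY LOW PRIORITY HIGH;

-- drop analytic indexes that the column store now makes redundant
DROP INDEX sales_cust_ix;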

[Image: Rittman Mead quote from the Oracle Database In-Memory launch presentation]

Of course, in practice you’ll need to think about which tables you’ll put into memory if RAM is limited, in some scenarios TimesTen might be a better option, and you’ll need to test your particular system and carefully consider whether you’ll keep particular indexes or materialised views, but we’re really excited about the In-Memory Option for Oracle Database as it’s got the potential to significantly improve query response times for users – and from what we’ve seen so far, it “just works”.

We’re still in the NDA period whilst beta testing goes on, but you can read more on the In-Memory Option on the Oracle website, and on the blog post I wrote when the feature was announced last Openworld. Once it goes GA look out for some in-depth articles on the blog around how it works, and details on how we’ll be able to help customers take advantage of this significant new Oracle Database feature.

How We Deliver Agile OBIEE Projects – Introducing ExtremeBI

Most OBIEE projects that we see are delivered through some sort of “waterfall” method, where requirements are defined up-front, there are several stages of development, one or more major releases at the end, and any revision to requirements takes the form of a change request. These work well where requirements can be defined upfront, and can be reassuring to customers when they want to agree a fixed price up-front with every subsequent change clearly costed. But, as with the development world in general, some customers are starting to look at “agile” methods for delivering BI projects, where requirements emerge over the course of a project and there isn’t so much of a fixed design or specification at the start; instead, the project adds features or capabilities in response to what are called “user stories”, making it more likely in the end that what gets delivered is in line with what users want – and where changes and additions to requirements are welcomed, rather than treated as extra-cost change requests.

OBIEE naturally lends itself to working in an agile manner, through the three-layer nature of the repository (RPD); by separating the physical representation of the source data from how it is then presented to the end-users, you can start from the off with the dimensional model that’s your end goal, and then over time evolve the back-end physical layer from pointing directly at the source system to instead pointing at a data warehouse or OLAP cube. In fact, I covered this approach back in 2008 in a blog post called “A Future Oracle OBIEE Architecture”, where I positioned OBIEE’s BI Server as a “business logic layer” and speculated that at some point in the future, OBIEE might be able to turn the logical-to-physical mappings in the RPD into actual ODI mappings and transformations.

[Image: proposed future OBIEE architecture, with the BI Server as a business logic layer]

In the end, although OBIEE’s aggregate persistence feature gave us the ability to spin off aggregate tables and cubes from the RPD logical model, full ETL “push-down” never came, although you can see traces of it if you have a good poke around the DLLs and directories under the BI Server component. What did happen though was Exadata; with Exadata, features such as Smart Scan, and its ability to do joins between normalised tables much faster than regular databases, meant that it became possible to report directly against an OLTP schema, or an ODS-like foundation layer, only adding ETL to build a performance star schema layer if it was absolutely necessary. We covered this in a series of posts on Agile Data Warehousing with Exadata, and the focus of this method was performance – by adding Exadata, and the metadata flexibility in OBIEE’s RPD, we could deliver agile projects where Exadata gave us the performance even when we reported directly against a third normal form data source.

[Image: agile data warehousing with Exadata, reporting directly against a 3NF source]

And this approach worked well for our customers; if they’d invested in Exadata, and were open to the idea of agile, iterative development, we could typically deliver a working system in just a few months, and at all times what the users got was what they’d requested in their user story backlog. But there were still ways in which we could improve this method; not everyone has access to Exadata, for example, and reporting directly against a source system makes it tricky to add DW features like history and surrogate keys. So recently we introduced the successor to this approach, in the form of an OBIEE development method we call “ExtremeBI”. Building on our previous agile work, ExtremeBI introduced an integration element, using GoldenGate and ODI to replicate in real time any source systems we were interested in to the DW foundation layer, add the table metadata that DW systems expect, and then provide a means to transform the logical-to-physical RPD mappings into ODI ETL specifications.

[Image: ExtremeBI architecture, with GoldenGate and ODI replicating source systems into the foundation layer]

But in a way, all the technical stuff is by the by; what this means in practice for customers is that we deliver working systems from the first iteration: initially by reporting directly against a replicated copy of their source system (with replication and metadata enhancement by GoldenGate, and optionally ODI), and then over subsequent iterations adding more end-user functionality, or hardened ODI ETL code, all the while driven by end-user stories and not some technical design signed off months ago which no longer reflects what users actually want.

[Image: ExtremeBI iterative delivery approach]

What we’ve found though, from several ExtremeBI customer engagements, is that it’s not just down to the technology and how well ODI, OBIEE and GoldenGate work; the major factors in successful projects are, firstly, having the project properly pre-qualified at the start. Not every project, and not every client, suits agile working, and agile works best if you’re “all in”, as opposed to just agreeing to work in sprints but still having a set-in-stone set of requirements which have to be met at a certain time. The second important success factor is proper project organisation; we’ve grown from just a couple of guys with laptops back in 2007 to a fully-fledged, end-to-end development organisation, with full-time delivery managers, a managed services desk and tools such as JIRA, and you need to have this sort of thing in place, particularly a project management structure that’s agile-friendly and a good relationship with the customer where they’re fully signed up to the agile approach. As such, we’ve found the most success where we’ve used ExtremeBI for fairly technically-savvy customers, for example an MIS department, who’ve been tasked with delivering something for a reasonable price over a short number of months, who understand that not all requirements can be delivered, but really want their system to get adopted, delight their customers and focus its features on what’s important to end-users.

As well as processes and a method, we’ve also developed utilities and accelerators to help speed up the initial setup and ensure the initial foundation and staging layers are built consistently, with GoldenGate mappings already put in place, ready for our developers to start delivering reports against the foundation layer, or to use these foundation-layer tables as the basis of a data mart or warehouse build-out. The screenshot below shows this particular tool, built using Groovy and run from within the ODI Studio user interface, where the developer selects a set of source tables from an ODI model, and then the utility builds out the staging and foundation layers automatically, typically saving days over the manual method.

[Image: ODI Studio utility, built in Groovy, that generates the staging and foundation layers]

We’ve also built custom KMs for ExtremeBI, including one that uses Oracle Database’s flashback query feature to pull historical transactions from the UNDO log, as an alternative to Oracle Streams or Oracle GoldenGate when these aren’t available on the project.
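
For illustration only (the table name and time window here are hypothetical, and sufficient UNDO retention is assumed), the kind of flashback versions query such a knowledge module might issue looks like this:

SELECT versions_startscn,
       versions_starttime,
       versions_operation,
       o.*
FROM   orders
       VERSIONS BETWEEN TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' HOUR)
                    AND SYSTIMESTAMP o
ORDER  BY versions_startscn;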

All together, using Rittman Mead’s ExtremeBI method along with OBIEE, ODI and optionally GoldenGate has meant we’ve been able to deliver working OBIEE systems for customers over just a few months, typically for a budget of less than £50k. Coupled with cloud hosting, where we can get the customer up and running immediately rather than having to wait for their IT department to provision servers, we think this is the best way for most OBIEE 11g projects to be delivered in the future. If you’re interested, we’ve got more details on our “ExtremeBI in the Cloud” web page, or you can contact me via email – mark.rittman@rittmanmead.com – if you’d like to discuss it further.

Thoughts on Using Amazon Redshift as a Replacement for an Oracle Data Warehouse

Recently, my colleague, Pete Carpenter, described a proof of concept we carried out using Amazon Redshift as the data warehouse storage layer in a system capturing data from Oracle E-Business Suite (EBS) using Attunity CloudBeam in conjunction with Oracle Data Integrator (ODI) for specialised ETL processing and Oracle Business Intelligence (OBI) as the reporting tool.

In this blog post I will look at Amazon Redshift and how it compares with a more traditional DW approach using, as my example, Oracle. I am not going to talk about performance in absolute terms as your mileage is going to vary.

What is Redshift?

Redshift is the Amazon cloud data warehousing server; it can interact with Amazon EC2 and S3 components but is managed separately using the Redshift tab of the AWS console. As a cloud-based system it is rented by the hour from Amazon, and broadly the more storage you hire the more you pay. Currently there are two families of Redshift servers: the traditional hard-disk based family, and the recently introduced SSD family, which has less storage but far more processing power and faster CPUs. For our trials we looked at the traditional disk-based storage on a 2-node cluster to give us 4TB of disk spread across 4 CPU cores. Apart from single-node configurations, Redshift systems consist of a leader node and two or more database nodes; the leader node is supplied free of charge (you only pay for the storage nodes) and is responsible for acting as the query parser, coordinating the results from the database nodes, and being a central network address for user access.

The Redshift product has its origins in ParAccel, and that in turn in Postgres, and thus it supports ANSI SQL and the ODBC and JDBC Postgres drivers. In basic terms it is a shared-nothing, parallel-processing columnar store database that supports columnar compression.

At the cluster level all sorts of robustness features come into play to handle routine hardware failures such as the loss of a node or disk; regular automatic backups occur, and on-demand backups can be made to S3 storage for DR or replication to other AWS networks. It is possible to dynamically change the number and/or type of Redshift nodes in use; in effect a new cluster is spun up and the data copied from the existing system to the new one before dropping the old system. The original database remains open for query (but not update) during the scale-out (or scale-down) process. As Pete Carpenter described, creating a new Redshift instance is a simple matter of completing a few web forms and waiting for the cluster to come up. Once up, you can connect to the database using the master credentials you specified at cluster creation and then create databases, users, and schemas as required.

Databases, users, schemas and security

Although it is possible to run a Redshift database using the master user and the default database, good practice suggests that we do a bit more than this. In some ways Redshift is a little like the Oracle 12c database in that we can create additional databases within the master database, much in the style of pluggable databases; a major difference comes with the concept of a USER. In Oracle 12c a user belongs to a pluggable database; in Redshift all users belong to the master (container) database and can see any of the contained databases (subject to grants). Schemas are logical groupings for objects and need not be aligned to database user names. Standard object and role grants allow users to access specific databases, schemas, and tables, or to have role rights such as administrator. The final aspect of security is outside the database and is in effect a firewall rule to permit nominated AWS users or specified IP addresses to speak to the database listener; by default the rule is no inbound access. The diagram below is a block representation of how databases, users, schemas and the firewall interrelate. Note the user names are descriptive and not valid names!

[Image: block diagram of how databases, users, schemas and the firewall interrelate]
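
To make that concrete, here is a minimal sketch of this kind of setup, run as the master user (the database, group, user and schema names are purely illustrative):

CREATE DATABASE dw_sales;

-- then, connected to dw_sales:
CREATE GROUP reporting_users;
CREATE USER report_user PASSWORD 'Str0ngPassw0rd1' IN GROUP reporting_users;
CREATE SCHEMA sales_mart;
GRANT USAGE ON SCHEMA sales_mart TO GROUP reporting_users;
GRANT SELECT ON ALL TABLES IN SCHEMA sales_mart TO GROUP reporting_users;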

Database Design

A key point of difference between Amazon Redshift and Oracle is in how the data is stored or structured in the database. An understanding of this is vital to designing a performant data warehouse. With Oracle we have shared storage (SAN or local disk) attached to a pool of processors (single machine or a cluster); Redshift, however, uses a shared-nothing architecture, that is, the storage is tied to the individual processor cores of the nodes. As with Oracle, data is stored in blocks; however, the Redshift block size is much larger (1MB) than the usual Oracle block sizes. The real difference is how tables are stored in the database: Redshift stores each column separately and optionally allows one of many forms of data compression. Tables are also distributed across the node slices so that each CPU core has its own section of the table to process. In addition, data in the table can be sorted on a sort column, which can lead to further performance benefits; I will discuss this in the section on tables.

Not all of the database features we come to expect in an Oracle data warehouse are available to us in Redshift. The Redshift Developer Guide has the full rundown on what is available, but for now here is a short list of common DW features that are not going to be available to us.

  • Tablespaces
  • Indexes
  • Partitions
  • Constraints
    • Check
    • Primary, Unique, Foreign Key (all usable by optimizer but not enforced)
  • Spatial (Locator) functionality
  • Sequences (although there is an AUTO NUMBER column type)
  • MERGE – we have to code as UPDATE and INSERT in two steps (see the sketch after this list)
  • In-database PL/SQL-like language
    • Triggers
    • User defined functions
    • Procedures
  • Timestamps (with timezone)
  • XML types
  • Pseudo columns
  • Various SQL functions (not a full list, but functions I often use in ETL processes)
    • Regular expressions
    • Regression functions
    • SUBSTR
    • TRANSLATE
    • ROW_NUMBER
    • TO_TIMESTAMP

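On the MERGE point above, the usual workaround is the classic staging-table pattern; the following is a minimal sketch (table and column names are hypothetical) of updating the existing rows and then inserting the new ones inside a single transaction:

BEGIN TRANSACTION;

-- update rows that already exist in the target
UPDATE t1
SET    c2 = s.c2,
       c3 = s.c3
FROM   t1_stage s
WHERE  t1.c1 = s.c1;

-- insert rows that are new
INSERT INTO t1
SELECT s.*
FROM   t1_stage s
LEFT JOIN t1 ON t1.c1 = s.c1
WHERE  t1.c1 IS NULL;

END TRANSACTION;
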
In addition, data types may not be exactly the same as those used in Oracle; for example, DATE in Oracle has a resolution of one second, whereas DATE in Redshift has a resolution of one day.

Tables

The basic Oracle syntax to create a table works (as does CTAS, Create Table As Select); however, there are additional items we can, and should, specify at table creation.

By default the data distribution style is EVEN, that is, data is distributed between node slices in a round-robin fashion. For performance we may wish to specify a distribution key column to control how data is distributed; this is a similar concept to Oracle hash partitioning, and with the same sort of performance characteristics. We aim to create an even distribution of rows per slice (else one slice will take longer than the others to process its data), and by applying the same distribution to other tables that are commonly joined we can benefit from improved table-joining performance, as the joining rows are stored in the same node slice. Sometimes it is more appropriate to replicate the whole table to each node so that the data is always available to join without the need to move data between slices before joining; in such cases we set the distribution style to ALL.
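
For instance, a small dimension that everything joins to can simply be replicated; a sketch (with a hypothetical table) might look like this:

CREATE TABLE products
(
  product_pk       INTEGER     NOT NULL PRIMARY KEY,
  product_name     VARCHAR(50) NOT NULL,
  product_category VARCHAR(30) NOT NULL
)
DISTSTYLE ALL;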

The second thing we can set on a table is the SORTKEY; this specifies one or more columns on the table by which the data is ordered on data load (it can be the same column as the distribution key). Redshift maintains information on the minimum and maximum values of the sort key in each database block and at query time uses this information to skip blocks that do not contain data of interest.

Finally, we can elect to compress columns in the database. If we do not specify compression, the default of RAW (i.e. uncompressed) is used. For compressed data we can specify the compression algorithm used; different algorithms are better for certain data types and values. Compression may be data-block based (DELTA, BYTE-DICTIONARY, RUNLENGTH, TEXT255 and TEXT32K) or value based (LZO and the MOSTLY encodings). This sounds daunting, but there are two ways we can get compression suggestions from the database: using the ANALYZE COMPRESSION command on a loaded table, or the automatic compression feature of the COPY command, which however requires an empty, non-compressed target table; COPY is the Redshift equivalent of SQL*Loader and takes a flat file and inserts it into the database.
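
Both routes are essentially one-liners; here is a sketch (the table name, S3 path and credentials are placeholders):

-- get encoding recommendations for an already-loaded table
ANALYZE COMPRESSION t1;

-- or let COPY choose encodings while loading an empty, uncompressed table
COPY t1
FROM 's3://my-bucket/t1/data.csv'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
CSV
COMPUPDATE ON;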

Let’s consider a simple table T1 with three columns, C1, C2 and C3. We can create this using a simple piece of DDL:

CREATE TABLE T1
(
C1 INTEGER NOT NULL,
C2 VARCHAR(20) NOT NULL,
C3 DATE
);

I have not used any of the Redshift nice-to-have features for sorting, distribution, and compression of data. Note, too, that I am using NOT NULL constraints; this is the only constraint type enforced in the database. This simple create statement creates database objects on each slice of the cluster, with one block per column per slice (1 slice = 1 CPU core); see the following diagram. Note there is no table object stored in the database: it is a collection of columns.

[Image: per-slice storage of table T1, one block per column per slice]

Without specifying a distribution key, data is evenly spread across all slices. When a 1MB block for a column is full, a new block is created for subsequent inserts on the slice. An empty table will occupy block size × number of columns × number of cores; as our block size is 1MB, this works out as (columns × cores) megabytes.

Using a distribution key effectively hashes the data on the key column by the number of cores. Adding a sort key declares that the rows in the table are ordered and potentially allows block elimination to kick in. If our sort key is, say, transaction date, it is likely that our data loads occur in transaction date order; however, if we sorted on product code we might find each data load has data that needs to be inserted between existing rows. This does not happen: the data is still appended to the table, and the table then needs to be reorganised to put the rows in order. There are two ways to achieve this: the VACUUM command, which does an online reorg of the table, or the potentially faster route of creating a copy table, populating it, and then dropping the original and renaming the copy; of course this gives a little downtime when the original table is not available for access.
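
As a sketch of those two options (keeping the illustrative table name T1):

-- option 1: online reorganisation
VACUUM T1;

-- option 2: deep copy and swap
CREATE TABLE T1_new (LIKE T1);
INSERT INTO T1_new SELECT * FROM T1;
DROP TABLE T1;
ALTER TABLE T1_new RENAME TO T1;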

Applying compression, sort and distribution we get a DDL statement like:

CREATE TABLE T2 
(
C1 INTEGER NOT NULL,
C2 VARCHAR(20) NOT NULL SORTKEY DISTKEY,
C3 DATE ENCODE DELTA
);

This table uses column C2 as both the sort key and the distribution key; column C3 is compressed using delta compression – an efficient encoding where most dates are within ±127 days of the date of the previous row. If we wanted to use a multi-column sort key the DDL syntax would be like:

CREATE TABLE T1 
(
C1 INTEGER NOT NULL,
C2 VARCHAR(20) NOT NULL DISTKEY,
C3 DATE 
)
SORTKEY (C3,C2);

Multi-column distribution keys are not supported.

Designing for Performance

Redshift is designed for query and bulk insert operations; we can optimise query performance by structuring data so that less data is transferred between nodes in join operations or less data is read from disk in a table scan. Choosing the right sort keys and distribution keys is vital in this process. Ideally these key columns should not be compressed. Adding primary and foreign keys to the tables tells the optimizer about the data relationships and thus improves the quality of the query plans being generated. Of course, up-to-date table stats are a given too; tables must be ANALYZEd whenever the contents change significantly, and certainly after initial load. I feel that we should collect stats after each data load.
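
A sketch of both of those housekeeping steps (the names are illustrative, and remember Redshift records but does not enforce these keys):

ALTER TABLE sales_fact
  ADD CONSTRAINT sales_fact_pk PRIMARY KEY (sale_id);

ALTER TABLE sales_fact
  ADD CONSTRAINT sales_fact_product_fk FOREIGN KEY (product_id)
  REFERENCES products (product_pk);

-- refresh optimizer statistics after each load
ANALYZE sales_fact;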

For a FACT + DIMENSIONS data model (such as in the performance layer of Oracle’s Reference Data Warehouse Architecture) it would be appropriate to distribute data on the dimension key of the largest dimension on both the dimension and the fact tables; this will reduce the amount of data being moved between slices to facilitate joins.

For optimal performance we should always ensure we include both the distribution keys and the sort keys in any query, even if they appear to be redundant. The presence of these keys forces the optimizer to access the tables in an efficient way.

For best data load performance we insert rows in bulk and in sort key order. Redshift claims the best performance comes from using the COPY command to load from flat files, and second best from bulk-insert SQL commands such as CTAS and INSERT INTO T1 (SELECT * FROM T2). Where Redshift performs less well is when we use certain kinds of ETL steps in our process, particularly those that involve updating rows or single-row activities. In addition, loading data without respecting the sort key leads to performance problems on data query. If data update is essential we have two real options: we move our ETL processes to a conventional database hub server (perhaps using ODI) and just use Redshift to store pre-transformed data; or we revise our ETL processes to minimize update activity on the Redshift platform. There is some scope to optimize updates by distributing data on the update key, but another approach is to use temporary tables to build the results of the update and to replace the table with the results of the merge. This requires a bit of inventiveness with the ETL design, but fortunately many of our required SQL constructs, including analytic functions, are there to help us.