Tag Archives: Data Warehousing

Realtime BI Show with Kevin and Stewart – BI Forum 2015 Special!

Jordan Meyer and I were very pleased to be invited onto the Realtime BI Show podcast last week, run by Kevin McGinley and Stewart Bryson, to talk about the upcoming Rittman Mead BI Forum running in Brighton and Atlanta in May 2015. Stewart and Kevin are of course speaking at the Atlanta BI Forum event on May 13th-15th 2015 at the Renaissance Atlanta Midtown Hotel, Atlanta, and in the podcast we talk about the one-day masterclass that Jordan and I are running, some of the sessions at the event, and the rise of big data and data discovery within the Oracle BI+DW industry.

Full details on the two BI Forum 2015 events can be found on the event homepage, along with details of the optional one-day masterclass on Delivering the Oracle Information Management and Big Data Reference Architecture, the guest speakers and the inaugural Data Visualization Challenge. Registration is now open and can be done online using the two links below.

  • Rittman Mead BI Forum 2015, Brighton – May 6th – 8th 2015
  • Rittman Mead BI Forum 2015, Atlanta – May 13th – 15th 2015

We’ve also set up a special discount code for listeners to the Realtime BI Show, with 10% off both registration and the masterclass fee for the Brighton and Atlanta events – use code RTBI10 on the Eventbrite registration forms to qualify.

Rittman Mead Announces its Course Catalogue in Spanish and Portuguese

We are pleased to announce that Rittman Mead now offers its complete course catalogue in Spanish and Portuguese. This is great news for those living in Latin America, Spain and Portugal, who can now receive the best training, including the theory and hands-on material, in their local language.

Courses are delivered both remotely, as live virtual classes through our web platform, and in person. In both formats, each student receives the course material (theory and hands-on labs) electronically and has access, for the duration of the course, to a dedicated virtual machine for the practical exercises.

We offer a wide variety of Oracle Business Intelligence and Data Warehousing courses: from intensive five-day bootcamps to sets of specific courses oriented to particular job roles. The full catalogue is available in Spanish, Portuguese or English.

Are you interested in our courses, or would you like to become a training partner? Don’t hesitate to contact us at training@rittmanmead.com.

Getting The Users’ Trust – Part 2

Last time I wrote about the performance aspects of a BI system and how they could affect a user’s confidence. I concluded by mentioning that incorrect data might be generated by poorly coded ETL routines causing data loss or duplication. This time I am looking more at the quality of the data we load (or don’t load).

Back in the 1990s I worked with a 4.5 TB DWH that had a single source for fact and reference data; that is, the data loaded was self-consistent. Less and less these days do we find a single-source DWH to be the case; we are adding multiple data sources (both internal and external). Customers can now appear on CRM, ERP, social media, credit referencing, loyalty, and a whole host of other systems. This proliferation of data sources gives rise to a variety of issues we need to be at least aware of and, in reality, should be actively managing.

Some of these issues require us to work out processing rules within our data warehouse, such as what to do with fact data that arrives before its supporting reference data. I once had a system where our customer source could only be extracted once a week, but purchases made by new customers would appear in our fact feed immediately after customer registration. Obviously, it is a business call whether we publish facts that involve yet-to-be-loaded customers straight away or defer those loads until the customer has been processed in the DWH. In the case of my example we needed to auto-create new customers in the data warehouse with just the minimum of data (the surrogate key and the business key) and then do an SCD type 1 update when the full customer data profile is loaded the following week. Technical issues such as these are trivial: we formulate and agree a business rule to define our actions and implement it in our ETL or, possibly, the reporting code. In my opinion the bigger issues to resolve are in Data Governance and Data Quality.
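
As a minimal sketch of that pattern (the table, column and sequence names here are hypothetical, not from the original system), the skeleton customer can be created with a MERGE keyed on the business key, and the following week’s full feed simply overwrites the placeholder attributes in type 1 fashion:

```sql
-- Hypothetical sketch: auto-create skeleton rows for late-arriving customers
-- so that this week's facts can be loaded against a valid surrogate key.
MERGE INTO customer_dim d
USING (SELECT DISTINCT customer_bk FROM sales_stg) src
   ON (d.customer_bk = src.customer_bk)
 WHEN NOT MATCHED THEN
   INSERT (customer_sk, customer_bk, customer_name)
   VALUES (customer_seq.NEXTVAL, src.customer_bk, 'UNKNOWN - PENDING LOAD');

-- The following week, the full customer feed performs a type 1 overwrite of the
-- placeholder attributes; the surrogate key, and hence the fact rows, are untouched.
UPDATE customer_dim d
   SET (customer_name, customer_segment) =
       (SELECT s.customer_name, s.customer_segment
          FROM customer_stg s
         WHERE s.customer_bk = d.customer_bk)
 WHERE EXISTS (SELECT 1 FROM customer_stg s WHERE s.customer_bk = d.customer_bk);
```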

Some people combine Data Quality and Governance together as a single topic and believe that a single solution will put everything right. However, to my mind, they are completely separate issues. Data quality is about the content of the data; governance is about ownership, provenance and business management of the data. Today, Data Governance is increasingly becoming a regulatory requirement, especially in finance.

Governance is much more than the data lineage features we find in ETL tools such as ODI and even OWB. ETL lineage is about source-to-target mappings: our ability to say that ‘bank branch name’ comes from this source attribute, travels through these multiple ODI mappings and finally updates that column in our BANK_BRANCH dimension table. In true Data Governance we probably do some or all of the following:

  • Create a dictionary of approved business terms. This will define every attribute in business terms and also provide translations between geographic and business-unit-centric ways of viewing data. In finance, one division may talk about a “customer”, another division will say “investor”, a third says “borrower”; in all three cases we are really talking about the same kind of object, a person. This dictionary should go down to the level of individual attributes and measures and include the type of data being held, such as text, currency or date-time; these are logical types, not the physical types seen on the actual sources. It is important that this dictionary is shared throughout the organisation and is “the true definition” of what is reported.
  • Define ownership (or stewardship) for each approved business data item.
  • Map business data sources and targets to our approved list of terms (at attribute level). It is quite possible that some attributes will have multiple potential sources; in such cases we must specify which source is the master.
  • Define processes to keep our business data aligned.  
  • Define ownership of the sources so that there is accountability for design changes (and, for static data such as ISO country codes, for content changes). Possibly integrate this with the change notification mechanism of the change process.
  • Define data release processes for approved external reference data.
  • Define data access and redaction rules for compliance purposes.
  • Build in audit and control.
As you can see, we are not, in the main, talking about data content; instead we are improving our description of the business data beyond what is already held in database data dictionaries and XSD files. This is still metadata, and it is almost certainly best managed in some kind of Data Governance application. One tool we might consider for this is Oracle Data Relationship Management from the Hyperion family of products. If we want to go more DIY, it may be possible to leverage some of the data responsibility features of Oracle SQL Developer Data Modeler.
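
Purely as an illustration of the kind of metadata involved (the table and column names below are hypothetical and not tied to any particular product), the core of such a dictionary can be captured in a couple of simple tables holding the approved term, its logical type, its steward and its mapped sources, with one source flagged as master:

```sql
-- Hypothetical sketch of a minimal business glossary / governance structure.
CREATE TABLE business_term (
  term_id       NUMBER         PRIMARY KEY,
  term_name     VARCHAR2(100)  NOT NULL,   -- e.g. 'Customer'
  definition    VARCHAR2(4000),            -- approved business definition
  logical_type  VARCHAR2(30),              -- text, currency, date-time, ...
  data_steward  VARCHAR2(100)              -- named owner / steward
);

CREATE TABLE term_source_mapping (
  term_id        NUMBER        REFERENCES business_term (term_id),
  source_system  VARCHAR2(50),             -- e.g. 'CRM', 'ERP'
  source_column  VARCHAR2(200),            -- physical attribute on that source
  is_master      CHAR(1)       CHECK (is_master IN ('Y', 'N'))
);
```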

Whereas governance is about using the right data and having processes and people to guarantee it is correctly sourced, Data Quality is much finer-grained and looks at the actual content. Here a tool such as Oracle Enterprise Data Quality is invaluable. By the way, I have noticed that OEDQ version 12 has recently been released; I have a blog post on this in the pipeline.

I tend to divide Data Quality into three disciplines:

  • Data Profiling is always going to be our first step. Before we fix things we need to know what to fix! Generally, we profile a sample of the data and assess it column by column, row by row, to build a picture of the actual content. Typically we look at data ranges, nulls, the number of distinct values and, in the case of text data, the character types used (alpha, letter case, numeric, accents, punctuation etc.) and matching regular expressions. From this we develop a plan to tackle quality: for example, on a data-entry web page we may want to tighten validation rules to prevent certain “anticipated” errors; more usually we come up with business rules to apply in our next stage (a minimal profiling query is sketched after this list).
  • Data Assessment. Here we test the full dataset against the developed rules to identify data that conforms or needs remedy. That remedy could be referring the data back to the source system owner for correction, providing a set of data fixes that can be validated and applied to the source as a batch, creating processes to “fix” data on the source at initial data entry, or (and I would strongly advise against this for governance reasons) dynamically fixing the data in an ETL process. The reason I am against fixing data downstream in ETL is that the data we report on in our Data Warehouse is not going to match the source, and this will be problematic when we try to validate whether our data warehouse fits reality.
  • Data De-duplication. This final discipline of our DQ process is the most difficult: identifying data that is potentially duplicated in our data feed. In data quality terms a duplicate is where two or more rows refer to what is probably (statistically) the same item; this is a lot fuzzier than an exact match in database terms. People miskey data, call centre staff mishear names, companies merge and combine data sets, and I have even seen customers register a new email address because they could not be bothered to reset their password on an e-selling website (a candidate-pair scoring query is sketched after this list). De-duplication is important for improving the accuracy of BI in general, and it is nigh-on mandatory for organisations that need to manage risk and prevent fraud.
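
To make the profiling step concrete, a first pass over a single column can be done with plain SQL before reaching for a dedicated tool; the table and column names below are hypothetical:

```sql
-- Hypothetical first-pass profile of one text column in a staging table:
-- row count, nulls, distinct values, length range and a crude character-class check.
SELECT COUNT(*)                                          AS total_rows,
       COUNT(*) - COUNT(customer_name)                   AS null_count,
       COUNT(DISTINCT customer_name)                     AS distinct_values,
       MIN(LENGTH(customer_name))                        AS min_length,
       MAX(LENGTH(customer_name))                        AS max_length,
       SUM(CASE WHEN REGEXP_LIKE(customer_name, '[0-9]')
                THEN 1 ELSE 0 END)                       AS rows_containing_digits
  FROM customer_stg;
```

Similarly, for de-duplication, a candidate-pair query along the following lines scores rows that are probably the same customer even though the keys differ. This is only a sketch: real matching would block on more attributes and tune the threshold, but UTL_MATCH is a standard Oracle package and gives a feel for the fuzziness involved (again, the table and columns are hypothetical):

```sql
-- Hypothetical sketch: candidate duplicate pairs scored with Jaro-Winkler similarity.
SELECT a.customer_sk,
       b.customer_sk,
       UTL_MATCH.JARO_WINKLER_SIMILARITY(a.customer_name, b.customer_name) AS name_score
  FROM customer_dim a
  JOIN customer_dim b
    ON a.postcode    = b.postcode        -- cheap blocking key to limit comparisons
   AND a.customer_sk < b.customer_sk     -- consider each pair only once
 WHERE UTL_MATCH.JARO_WINKLER_SIMILARITY(a.customer_name, b.customer_name) >= 90;
```
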
Data Quality is so important to trusted BI: without it we run the risk that our dimensions do not roll up correctly and that we under-report by splitting measures across duplicates. However, being correct in the data warehouse is only part of the story; these corrections also need to be made on the sources, and to do that we have to implement processes and disciplines throughout the organisation.
 
For BI that users can trust we need to combine both data management disciplines. From governance we need to be sure that we are using the correct business terms for all attributes and that the data displayed in those attributes has made the correct journey from the original source. From quality we gain confidence that we are correctly aggregating data in our reporting.
 
At the end of the day we need to be right to be trusted.

 

 

Getting The Users’ Trust – Part 1

Looking back over some of my truly ancient Rittman Mead blogs (so old, in fact, that they came with me when I joined the company soon after Rittman Mead was launched), I see recurrent themes on why people “do” BI and what makes for successful implementations. After all, why would an organisation wish to invest serious money in a project if it does not give value, either in terms of cost reduction or of increasing profitability through smarter decisions? This requires technology to provide answers and a workforce that is both able to use this technology and has faith that the answers returned allow them to do their jobs better. Giving users this trust in the BI platform generally boils down to resolving three issues: ease of use of the reporting tool, speed of data return, and the “accuracy” or validity of the response. These last two issues are a fundamental part of my work here at Rittman Mead and underpin all that I do in terms of BI architecture, performance and data quality. Even today, as we adapt our BI systems to include Big Data and Advanced Analytics, I follow the same sound approaches to ensure usable, reliable data and the ability to analyse it in a reasonable time.

Storage is cheap, so don’t aggregate away your knowledge. If my raw data feed is sales by item by store by customer by day and I only store it in my data warehouse as sales by month by state, I can’t go back and do any analysis on my customers, my stores or my products. Remember that UNGROUP BY only existed in my April Fools’ post. Where you choose to store your ‘unaggregated’ data may well be different these days, with Hadoop and schema-on-read paradigms often being a sensible approach. Mark Rittman has been looking at architectures where both the traditional DWH and Big Data happily co-exist.

When improving performance I tend to avoid tuning specific queries; instead I aim to make frequent access patterns work well. Tuning individual queries is almost never a sustainable approach in BI: this week’s hot, ‘we need the answer immediately’ query may have no business focus next week. Indexes that we create to make a specific query fly may have no positive effect on other queries; indeed, indexes can degrade other aspects of BI performance, such as increasing data load times, and have subtle effects such as changing query plan costs so that groups of materialized views are no longer candidates for query rewrite (this is especially true when you use nested views and the base view is no longer accessed).

My favoured performance improvement techniques are: correctly placing the data, be it through clustering, partitioning, compression, table pinning, in-memory or whatever; and making sure that the query optimiser knows all about the nature of the data. Again and again, the “right” optimiser information is key to good performance. “Right” is not just about running DBMS_STATS.gather_XXX over tables or schemas every now and then; it is also about telling the optimiser about relationships between data items. Constraints describe the data: for example, which columns allow NULL values and which columns are part of parent-child relationships (foreign keys). Extended table statistics can help describe relationships between columns in a single table: in a product dimension table, for example, the product sub-category and product category columns are interdependent, and without that knowledge cardinality estimates can be very wrong and favour nested-loop-style plans that perform very poorly on large data sets.
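
As a minimal sketch of the sort of “right” optimiser information I mean (the schema, table and column names are hypothetical), a column group tells the optimiser that sub-category and category move together, and a RELY constraint declares the fact-to-dimension relationship without the cost of enforcing it on load:

```sql
-- Hypothetical example: a column group so the optimiser knows PROD_SUBCATEGORY and
-- PROD_CATEGORY are correlated, followed by a statistics gather that picks it up.
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(
         ownname   => 'DWH',
         tabname   => 'PRODUCT_DIM',
         extension => '(PROD_SUBCATEGORY, PROD_CATEGORY)') AS extension_name
  FROM dual;

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'DWH', tabname => 'PRODUCT_DIM');
END;
/

-- Declare the parent-child relationship for the optimiser's benefit without the cost
-- of enforcing or validating it during loads (the parent key must also be RELY).
ALTER TABLE dwh.product_dim MODIFY CONSTRAINT pk_product_dim RELY;

ALTER TABLE dwh.sales_fact
  ADD CONSTRAINT fk_sales_product
      FOREIGN KEY (product_key) REFERENCES dwh.product_dim (product_key)
      RELY DISABLE NOVALIDATE;
```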

Sometimes we will need to create aggregates to answer queries quickly; I tend to build ‘generic’ aggregates, those that can be used by many queries. Often I find that although data is loaded frequently, even near-real-time, many business users wish to look at larger time windows such as week, month or quarter. In practice I see little need for day-level aggregates over the whole data warehouse timespan; however, there will always be specific cases that require day-level summaries. If I build summary tables or use Materialized Views, I aim to make tables that are at least 80% smaller than the base table and to avoid aggregates that partially roll up many dimensional hierarchies; customer category by product category by store region by month would probably not be the ideal aggregate for most real-user queries. That said, Oracle does allow us to use fancy grouping semantics in the building of aggregates (grouping sets, GROUP BY ROLLUP and GROUP BY CUBE). The in-database Oracle OLAP cube functionality is still alive and well (and was given a performance boost in Oracle 12c); it may be more appropriate to aggregate in a cube (or its relational look-alike) rather than in individual summaries.
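
For example, a ‘generic’ month-level aggregate over a hypothetical SALES_FACT star, exposed to the optimiser for query rewrite, might look something like the sketch below (the refresh options would depend on the actual design):

```sql
-- Hypothetical 'generic' aggregate: month x store region x product category,
-- registered for query rewrite so many different reports can benefit from it.
CREATE MATERIALIZED VIEW sales_month_mv
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT t.calendar_month,
       s.store_region,
       p.prod_category,
       SUM(f.sales_amount) AS sales_amount,
       COUNT(*)            AS sales_count
  FROM sales_fact  f
  JOIN time_dim    t ON t.time_key    = f.time_key
  JOIN store_dim   s ON s.store_key   = f.store_key
  JOIN product_dim p ON p.product_key = f.product_key
 GROUP BY t.calendar_month, s.store_region, p.prod_category;
```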

Getting the wrong results quickly is no good; we must be sure that the results we display are correct. As professional developers we test to prove that we are not losing or gaining data through incorrect joins and filters, but ETL coding is often the smallest factor in “incorrect results”, and this brings me to Part 2: Data Quality.
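
As a simple illustration of that kind of test (with hypothetical table names), a routine reconciliation query compares row counts and a control total between the staged source extract and the warehouse fact table for each load date, so lost or duplicated rows are spotted before users see them:

```sql
-- Hypothetical reconciliation check: report any load date where the warehouse
-- row count or control total does not match the staged source extract.
SELECT COALESCE(src.load_date, tgt.load_date) AS load_date,
       src.row_count    AS source_rows,
       tgt.row_count    AS warehouse_rows,
       src.total_amount AS source_amount,
       tgt.total_amount AS warehouse_amount
  FROM (SELECT load_date, COUNT(*) AS row_count, SUM(sales_amount) AS total_amount
          FROM sales_stg GROUP BY load_date) src
  FULL OUTER JOIN
       (SELECT load_date, COUNT(*) AS row_count, SUM(sales_amount) AS total_amount
          FROM sales_fact GROUP BY load_date) tgt
    ON tgt.load_date = src.load_date
 WHERE src.load_date IS NULL
    OR tgt.load_date IS NULL
    OR src.row_count    <> tgt.row_count
    OR src.total_amount <> tgt.total_amount;
```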