Tag Archives: Conferences
Rittman Mead BI Forum Atlanta Special Guest: Cary Millsap
I feel like I’m introducing the Beatles… though I think Kellyn Pot’Vin calls them the “DBA Gods”. Today I’ll be talking about Cary Millsap, and tomorrow I’ll introduce our other special guest: Alex Gorbachev.
As many of you know, I grew up as a DBA (albeit, focusing on data warehouse environments) before transitioning to development… initially as an ETL developer and later as an OBIEE architect. I had three or four “heroes” during that time… and Cary Millsap was certainly one of them. His brilliant white paper “Why a 99%+ Database Buffer Cache Hit Ratio is Not Ok” changed my whole direction with performance tuning: it’s probably the first time I thought about tuning processes instead of systems. Many of you also know about my love of Agile Methodologies… a cause that Cary has championed of late, and is also the subject of an excellent white paper “Measure Once, Cut Twice”. This purposeful inversion of the title helps to remind us that many of the analogies we use for software design don’t compute… it’s relatively simple to modify an API after the fact, so go ahead and “cut”.
A brief bit of history on Cary. He’s an Oracle ACE Director and has been contributing to the Oracle Community since 1989. He is an entrepreneur, software technology advisor, software developer, and Oracle software performance specialist. His technical papers are quoted in many Oracle books, in Wikipedia, in blogs all over the world, and in dozens of conference presentations each month. He has presented at hundreds of public and private events around the world, and he is published in Communications of the ACM. He wrote the book “Optimizing Oracle Performance” (O’Reilly 2003), for which he and co-author Jeff Holt were named Oracle Magazine’s 2004 Authors of the Year. Though many people (Kellyn included) think of Cary as a DBA… Cary considers himself to be a software developer first, but explains what he believes to be the reason for this misconception:
“I think it’s fair to say that I’ve dedicated my entire professional career (27 years so far) to the software performance specialization. Most people who know me in an Oracle context probably think I’m a DBA, because I’ve spent so much time working with DBAs (…It’s still bizarre to me that performance is considered primarily a post-implementation operational topic in the Oracle universe). But my background is software development, and that’s where my heart is. I built the business I own so that I can hang out with extraordinarily talented software researchers and designers and developers and write software that helps people solve difficult problems.”
Cary’s presentation is called “Thinking Clearly about Performance” and will be given at the end of the day on Thursday before we head over to 4th and Swift for the Gala dinner. His message for the BI developers in the audience is an encouraging one:
“My message at the Rittman Mead BI Forum is that, though it’s often counterintuitive, software performance actually makes sense. When you can think clearly about it, you generally make progress more quickly and more permanently than if you just stab at different possible solutions without really understanding what’s going on. That’s what this presentation called “Thinking Clearly about Performance” is all about. It’s the result of more than 25 years of helping people understand performance in just about every imaginable context, from the tiny little law office in east Texas to the Large Hadron Collider at CERN near Geneva. It’s the result of seeing the same kinds of wasteful mistakes: buying hardware upgrades in hopes of reducing response times without understanding what a bottleneck is, adding load to overloaded systems in hopes of increasing throughput.”
My experience is that BI and DW systems suffer more from the “fast=true” disease than do OLTP systems, but that could simply be perspective. I’m excited that a group of BI developers, the majority of whom are reporting against a database of some kind, will get an opportunity to see Cary’s approach to problem solving and performance tuning. As Cary tells us:
“The fundamental problems in an OBIEE implementation are just that: fundamental. The solution begins with understanding what’s really going on, which means engaging in a discussion of what we should be measuring and how (of course, in the OBIEE world, Robin Moffatt’s blog posts come in handy), and it continues through the standard subjects of profiles, and skew, and efficiency, and load, and queueing, and so on.”
If you are interested in seeing Cary and all the other great speakers at this year’s BI Forum, you can go over to the main page to get more information about the Atlanta event, or go directly to the registration page so we can see you in Atlanta in May.
Some Upcoming Events
It’s going to be a busy few weeks leading up to the BI Forum. First, Rittman Mead will be exhibiting at the UKOUG Engineered Systems Summit on Tuesday 16th April; this is a one-day event in London for Exadata, Exalogic, SuperCluster and, not least, Exalytics. Mark will be presenting on “Oracle Exalytics – Tips and Experiences from Rittman Mead”; the full agenda is available here. Mark will then be hopping over to Norway to speak at the Oracle Norway User Group event on “High-Speed, In-Memory Big Data Analysis with Oracle Exalytics”; maybe he’ll be previewing his work getting OBIEE 11.1.1.7 working with Hadoop.
The following week, on Tuesday 23rd April, I am speaking at an Oracle Business Analytics event at Oracle’s City Office in London, where I am giving a presentation about our story so far with Exalytics. Later that week, on Thursday 25th, as part of Big Data Week, I’m speaking in the evening in Brighton about the evolution from Business Intelligence to Analytics and Big Data; the full agenda is here, and please register here.
Tuning Philosophy – Tuning the Right Thing
My second presentation at this year’s RMOUG Training Days was on tuning “realtime data warehouses”; as usual, this paper is now on the Rittman Mead Articles page. Perhaps more accurately, my talk was about my tuning philosophy rather than a cookbook of tuning “rules” to give optimal performance. I don’t think there is a single recipe for guaranteed tuning success; the best we can come up with is a set of principles for getting things as right as possible, keeping in our heads that each system has its own unique combination of hardware, software and data, and that this interaction modifies the steps we should evaluate in our tuning process. The sub-title of my talk came from an early slide: “Making the Arrow as short as possible”
Another name for this arrow is “latency”, the thing that stops “realtime” from actually being “same time”. We will never have no arrow as the act of observing the event at source and reacting to output at target will always add some amount of delay to the data flow. I discussed this in a paper for the Evaluation Centre.
Rather than present my RMOUG talk here I will take a step back and write about how I go about improving performance in a data warehouse ETL.
Firstly, if we set out to improve performance we need to measure it, or else how can we be confident that we have “improved” things? This measurement can be as crude as clock time to run an ETL step or a throughput measure such as rows inserted per minute, or we can really delve into performance and look at fine detail such as the number and type of database waits. I tend to start with coarse measurements. I do this for two simple reasons: execution time or throughput is often the business visible metric that is the basis of the problem perceived by the customer; and, from my experience of many data warehouses, the code as implemented may not be doing what the designers wanted, and there is little merit in tuning a process to do the “wrong thing” more quickly. I therefore take as my starting point what the business wants to achieve in the ETL step and not the query being run. Here I see four kinds of problems:
- The code used does not answer the business requirement.
- The query has a flaw that causes it to process too much data.
- The code uses inappropriate techniques.
- The process has redundant steps.
Fortunately, the first cause is rare, probably because it is usually spotted and resolved long before moving from development to production. Processing too much data should be easy to detect if the ETL process is adequately instrumented: if a business has 4000 customers who each make two transactions a day and we are loading 14 million customer transactions per day, there is something very wrong in the process. Just doing simple calculations of expected data volume and comparing that with actual loads can readily spot this type of thing – we can then dive down to isolate the cause, which is often incorrect joins or a missing source data filter. As I found at one customer site, it is quite possible to load far too much data without affecting the values reported in the user query tools; ETL logic flaws do not always lead to obvious data aggregation problems.
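The coarse measurements and the expected-volume arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`rows_per_minute`, `expected_daily_rows`, `check_load_volume`) and the `tolerance` factor are hypothetical and not part of any ETL tool; the customer and transaction figures are the ones used in the text.

```python
def rows_per_minute(rows_loaded: int, elapsed_seconds: float) -> float:
    """Coarse throughput measure for one ETL step."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return rows_loaded / (elapsed_seconds / 60.0)


def expected_daily_rows(customers: int, transactions_per_customer: int) -> int:
    """Back-of-envelope estimate of the rows a daily load should contain."""
    return customers * transactions_per_customer


def check_load_volume(actual_rows: int, expected_rows: int, tolerance: float = 5.0) -> bool:
    """Return True if the actual load is within `tolerance` times the estimate.

    `tolerance` is a hypothetical safety factor; a real system would choose
    a threshold from its own load history.
    """
    if expected_rows <= 0:
        raise ValueError("expected_rows must be positive")
    ratio = actual_rows / expected_rows
    return (1.0 / tolerance) <= ratio <= tolerance


# Figures from the text: 4000 customers, two transactions each per day,
# but 14 million rows actually loaded.
expected = expected_daily_rows(customers=4000, transactions_per_customer=2)
plausible = check_load_volume(actual_rows=14_000_000, expected_rows=expected)
print(f"expected about {expected:,} rows, loaded 14,000,000, plausible: {plausible}")
```

A check like this does not say where the extra rows come from; it only flags that the load deserves a closer look at the joins and source filters.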
Too many times I have seen code created by developers who have insufficient understanding of what a relational database can do or what a particular vendor’s database can do. I have seen people calculating the first of the month by taking the date, converting it to a string, then concatenating the “month and year” substring on to the literal “01” before converting it back to a date, because they don’t know you can do this with a single function (TRUNC) in Oracle. I have seen developers re-inventing database joins as procedural steps (often they hand code nested loops!) rather than letting the database do it. I have seen others look for change in a row by computing the MD5 hash of the concatenated columns of the new row and comparing it to the previous MD5 hash for the original row. Don’t get me wrong, MD5 hashing can work, but it is so compute intensive that it can starve the database of vital CPU resource. For change detection I much prefer to use the mainstays of set based ETL: MINUS operators or simple outer joins between new and old data, looking for inequality between columns.
Once I am sure I am looking at the right problem I can go about optimising performance. I tend to start with OEM12c and the SQL Monitoring reports or good old fashioned EXPLAIN. Just getting the cardinalities right can help the CBO to do wonders. But accurate CBO costings are not always the solution; it is at this stage that I start to look at the more physical aspects of design such as partitioning, parallel (assuming these are licensed), memory, concurrent activity, indexing and the way the tables are defined. More on this in another blog.
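To make the set based approach to change detection concrete, here is a minimal sketch using Python’s built-in sqlite3 module rather than Oracle, purely so that it runs anywhere. The tables `customer_old` and `customer_new` and their contents are invented for the illustration. SQLite spells the set-difference operator EXCEPT; in Oracle the equivalent operator is MINUS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_old (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE customer_new (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO customer_old VALUES (1, 'Ann', 'Leeds'), (2, 'Bob', 'York'), (3, 'Cy', 'Bath');
    INSERT INTO customer_new VALUES (1, 'Ann', 'Leeds'), (2, 'Bob', 'Hull'), (4, 'Di', 'Kent');
""")

# Set difference: rows in the new data with no identical row in the old data.
# This catches both brand-new rows and rows where any column has changed.
changed_or_new = conn.execute("""
    SELECT id, name, city FROM customer_new
    EXCEPT
    SELECT id, name, city FROM customer_old
    ORDER BY id
""").fetchall()

# Outer join variant: compare columns directly and classify each difference.
# Note: plain <> is not NULL-safe; nullable columns need extra handling.
classified = conn.execute("""
    SELECT n.id,
           CASE WHEN o.id IS NULL THEN 'insert' ELSE 'update' END AS action
    FROM customer_new AS n
    LEFT OUTER JOIN customer_old AS o ON o.id = n.id
    WHERE o.id IS NULL OR n.name <> o.name OR n.city <> o.city
    ORDER BY n.id
""").fetchall()

print(changed_or_new)
print(classified)
conn.close()
```

Neither query hashes anything: the database compares the columns itself, in one set operation, which is the point being made in the text.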
Big Data and the Oracle Reference Architecture.
Last week I travelled from Europe to present at the RMOUG Training Days event in Denver, Colorado. As I blogged a couple of weeks back, this is one of my favourite user group conferences and it never fails to impress me. I expect to be wowed even more next year as they clock up their quarter century!
This year I presented twice: “Realtime Data Warehouse Tuning” and “Extending the Oracle Reference Data Architecture”. More on tuning in another blog posting, but for now some thoughts on Big Data and Architecture. If you are interested in seeing my paper it is available on the Rittman Mead website here. The slides originally published on the RMOUG conference website were revised to incorporate some new graphics and information from Oracle’s white paper “Information Management and Big Data: A Reference Architecture”, which was published just as I flew out to the USA. Fortunately, most of my original paper matched the thinking behind the new white paper. One change of note is the new description for the foundation layer – it is now “Immutable Enterprise Data with Full History” – see the image clipped from Oracle’s new white paper.
I think the new definition is a far clearer description of the Foundation Layer than using terms such as 3NF. After all, architecture is more about the reasons for doing something than about how to do it.
One of my early slides in the Big Data section of my RMOUG talk covers what I think Big Data means. Ask a dozen people what is meant by “Big Data” and you will probably hear at least two dozen answers. What is it that stops data warehouses being “big data”, since they are both big (they can be really big) and contain data?
Some people argue that big data is unstructured data. However, to my mind, for data to be useful information it must have some degree of discernible structure; that is, it must be capable of being analysed or else it is just noise. Obviously, text has structure and meaning both from the ordering of characters to form words and the order of words to make coherent blocks of text. Likewise, digital data from telemetry also has meaning. Harder to analyse are audio and video feeds, but even this is becoming commonplace both in the consumer marketplace and business; I have apps on my Macbook that tag photos based on who the software believes is in the picture and apps on my iPhone that identify and download music from an audio clip recorded on the phone. Business and government users do similar things, be it identifying people in a crowd or transcribing voice.
People often speak of the 4 or 5 “V”s of big data:
- Volume – large amounts of data
- Velocity – fast arriving data
- Variety – it comes in all types of structure (including “no” structure)
- Value - there is “worth” in processing the information
Add to the mix that optional “V”, number 5, Veracity - that the data is trustworthy. However, in reality, Value, Volume, Velocity, and Veracity are all normal requirements in realtime data warehousing and enterprise scale ERP implementations so are not unique to big data. So perhaps Variety might be the key differentiator. In truth there will be little variety in the data being handled; if you develop a process to do sentiment analysis on a Twitter feed you are extremely unlikely to come across the odd digitized X-ray image or smart meter reading lurking in that feed. This is very similar to the “variety” we come across in developing any other ETL process in data warehousing.
As you see, I struggle with some of the usual definitions of big data as opposed to large data sets. For me, the key to Big Data is what we intend to do with it. If it is important to know the exact value of a single item (for example billing from smart metering) then it is not big data, it is instead large volume transactional data. If the exact value of a single data item is less important than deriving a statistical picture of the whole data set we are in the realms of big data; if we lose a single record it is probably not crucial to our analysis, if we fail to send a short validity coupon to someone’s smartphone as they walk past our store (or better yet as they pass the competitor’s store at the other end of the same shopping mall) then it probably does not matter.
Rocky Mountain Oracle User Group Training Days
One of the great things about working here at Rittman Mead is our corporate ethos of sharing our knowledge and experience. Of course, much of this is “paid for” work with our customers where we train, mentor, consult, develop, implement and support all manner of things Oracle BI. However, sharing with the community is also a major feature of our culture; there is, of course, the Rittman Mead Blog, but we are also keen supporters of user groups throughout the world.
I am delighted to say that Rittman Mead will be at the RMOUG conference being held in Denver, CO between February 12 and 13, 2013. I love this conference – it is large enough to allow several parallel streams and attracts many world class speakers, yet the conference also manages to remain a close gathering of friends.
This year we will have four members of our team presenting: Stewart Bryson, Jordan Meyer and Michael Rainey from Rittman Mead North America, and I will be representing the European offices. Between us we will be presenting 7 sessions over the event.
- Reporting against Transactional Schemas with OBIEE 11g – Stewart Bryson, Feb 12 8:30 am.
- Aggregation: The BI Server versus the Oracle Optimizer - Stewart Bryson, Feb 12 2:30 pm.
- GoldenGate and ODI – A Perfect Match for Real-Time Data Warehousing – Michael Rainey, Feb 12 5:15 pm.
- Social Network Analysis with Oracle Tools – Jordan Meyer, Feb 13 9:45 am.
- Data Science for OBI Professionals - Jordan Meyer, Feb 13 11:15 am.
- Extending Oracle’s Data Warehouse Reference Architecture for a Real Time and Big Data World – Peter Scott, Feb 13 1:30 pm.
- Tuning “Real Time” Data Warehouses – A Guide from the Field - Peter Scott, Feb 13 4:00 pm.
In addition, Stewart will be co-presenting with Kent Graziano on “Using OBIEE and Data Vault to Virtualize Your BI Environment: An Agile Approach”.
After the meeting we will be posting our presentations on the articles page of our website.
If you see any of us in Denver then come up and say hello – we would love to meet you.