Tag Archives: User Groups & Conferences

Rittman Mead BI Forum 2014 Call for Papers Now Open!

It’s that time of year again when we start planning next year’s BI Forum, which like this year’s event will be running in May 2014 in Brighton and Atlanta. This will be our sixth annual event, and as with previous years the most important part is the content – and as such I’m pleased to announce that the Call for Papers for BI Forum 2014 is now open, running through to January 31st 2014.

If you’ve not been to one of our BI Forum events in past years, the Rittman Mead BI Forum is all about Oracle Business Intelligence, and the technologies and techniques that surround it – data warehousing, data analysis, big data, unstructured data analysis, OLAP analysis and this year – in-memory analytics. Each year we select around ten speakers for Brighton, and ten for Atlanta, along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January, and editorial input from myself, Jon Mead and Stewart Bryson.


Last year we had sessions on OBIEE internals and new features, OBIEE visualisations and data analysis, OBIEE and “big data”, along with sessions on Endeca, Exalytics, Exadata, Essbase and anything else that starts with an “E”. This year we’re continuing the theme, but are particularly looking for sessions on what’s hot this year and next – integration with unstructured and big data sources, use of engineered systems and in-memory analysis, advanced and innovative data visualisations, cloud deployment and analytics, and anything that “pushes the envelope” around Oracle BI, data warehousing and analytics.


The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues. We’re also looking for presenters for ten-minute “TED”-style sessions, and if you have any ideas for keynote speakers, send them directly to me at mark.rittman@rittmanmead.com. Other than that – have a think about abstract ideas now, and make sure you get them in by January 31st 2014.

Creating a Custom Analytics Dashboard from Scratch the “Blue Peter” way

This year Rittman Mead were the Analytics Sponsor for the UKOUG Tech13 conference in Manchester, and anyone who visited the UKOUG stand during their time there will have noticed the Rittman Mead-sponsored Analytics Dashboard on display. In this blog post I’ll cover how it was put together, the “Blue Peter” way!

(Screenshot: the analytics dashboard on display at Tech13)

For those of you not familiar with Blue Peter, it’s a long-running children’s TV show here in the UK that started way back in 1958; possibly its most infamous moment came in 1969 with Lulu the defecating elephant. As a child, the highlight of the show was the “How to make” section, where they would always say “sticky-backed plastic” instead of “Sellotape” due to a policy against using commercial terms on air. The presenters would create something from scratch with bits and bobs you could find around the house – cereal boxes, egg cartons, washing-up bottles, Sellotape (sorry, “sticky-backed plastic”) and so on. That’s exactly what I’m going to be doing here, but instead of making a sock monster I’m going to show you how the analytics dashboard was created from scratch with bits you can find on the internet. So kids, without further delay, let’s begin…

(remember to ask your parents’ permission before downloading any of the following items)

You will need :

  • A Linux server.
  • A Web server
  • The Redis key/value store
  • DataSift account
  • The Webdis HTTP server
  • The tagcanvas.js jQuery plugin
  • The vTicker jQuery plugin
  • The flot plotting library
  • Some Sticky-backed-Plastic (jQuery)
  • Lots of coffee

The Linux server I’m using is running Red Hat Enterprise Linux Server 6.4, and the very first thing we’ll need to do is install a web server.

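Something along these lines does the trick on RHEL 6 (yum, plus the old chkconfig/service commands):

    sudo yum install -y httpd       # install Apache
    sudo chkconfig httpd on         # start it on server boot
    sudo service httpd start        # and start it now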

Done. The web server is now installed, configured to start on boot, and up and running ready to service our HTTP requests.

Next up is the Redis key/value datastore. I’ll be using Redis to store all the incoming tweets from our datasource (more on that in a bit).

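Building from source is one straightforward way to do it – roughly:

    wget http://download.redis.io/redis-stable.tar.gz
    tar xzf redis-stable.tar.gz
    cd redis-stable
    make && sudo make install
    redis-server &                  # or hook it into an init script so it starts on boot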

Now that we have Redis up and running, let’s perform a couple of tests – first a benchmark, using the redis-benchmark command.

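Run with the -q flag, redis-benchmark prints a line per test:

    redis-benchmark -q
    # ...
    # LRANGE_100 (first 100 elements): ~3593 requests per second
    # ...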

This command throws out a lot more output than this, but it’s the LRANGE command we are particularly interested in as we’ll be using it to retrieve our tweets later on. 3,593 requests per second seems reasonable to me: there were around 1,000 registrations for the UKOUG Tech13 conference, and the likelihood of each of them making three concurrent dashboard requests within the same second is slim to say the least – regardless, I’m willing to live with the risk. Now for a quick SET/GET test using the redis-cli command.

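Nothing fancy – just set a throwaway key and read it back:

    $ redis-cli
    127.0.0.1:6379> SET greeting "hello"
    OK
    127.0.0.1:6379> GET greeting
    "hello"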

OK, so that’s our datastore sorted, but what about our datasource? Clearly we will be using Twitter as our primary data source, but mere mortals like myself don’t have access to Twitter’s entire data stream; for that we need to turn to the specialist companies that do. DataSift, amongst other companies like Gnip and Topsy (which, interestingly, was bought by Apple earlier this week), can offer this service, and will add extra features into the mix such as sentiment analysis and stream subscription services that push data to you. I’ll be using DataSift, as here at Rittman Mead we already have an account with them. The service charges you by DPU (Data Processing Unit), the cost of which depends upon the computational complexity of the stream you’re running; suffice to say that running the stream to feed the analytics dashboard for a few days was pretty cheap.

To set up a stream you simply express what you want included from the Twitter firehose using their own Curated Stream Definition Language, CSDL. The stream used for the dashboard was very simple, and the CSDL looked something like this :-

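(the exact field name is an assumption – twitter.text here, though the broader interaction.content would work just as well)

    twitter.text contains "ukoug"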

This is simply searching for tweets containing the string “ukoug” from the Twitter stream. DataSift supports all kinds of other social data, for a full list head over to their website.

Now that we have our data stream set up, how do we actually populate our Redis data store with it? Simple – get DataSift to push it to us. Using their PUSH API you can set up a subscription in which you can specify output parameters including “output_type”, “host”, “port”, “delivery_frequency” and “list”, amongst many others. The output type was set to “Redis”; the host was, well, our hostname; and the port was set to the port on which Redis is listening, by default 6379. The delivery_frequency was set to 60 seconds, and the list is the name of the Redis list key you want the data pushed to. With all that set up and the subscription active, tweets will automagically start arriving in our Redis datastore in JSON format – happy days!
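On the wire, creating that subscription is a single call to the PUSH API – something like the curl below, though the parameter names and their nesting here are indicative rather than checked against the DataSift docs, and the stream hash, hostname and credentials are placeholders:

    # stream hash, hostname and credentials are illustrative
    curl -X POST https://api.datasift.com/push/create \
         -d name=ukoug_tech13_dashboard \
         -d hash=<your stream hash> \
         -d output_type=redis \
         -d output_params.host=ourserver.example.com \
         -d output_params.port=6379 \
         -d output_params.list=ukoug \
         -d output_params.delivery_frequency=60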

The next step is to get the data from Redis to the browser. I could install PHP, configure it to work with Apache, install one of the several Redis PHP libraries and write some backend code to serve data up to the browser, but time is limited and I don’t want to faff around with all that. Besides, this is Blue Peter, not a BBC4 documentary. Here’s where we use the next cool piece of open source software, Webdis. Webdis acts as an HTTP interface to Redis, which means I can make HTTP calls directly from the browser like this :-

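For example, pointing the browser at a URL like this (Webdis listens on port 7379 by default; the key and value are just for the test):

    http://ourserver:7379/SET/hello/world

    {"SET":[true,"OK"]}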

Yep, you guessed it, we’ve just set a key in Redis directly from the browser. To retrieve it we simply do this.

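The same URL scheme works from the command line too – here via wget:

    wget -qO- http://ourserver:7379/GET/hello

    {"GET":"world"}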

Security can be handled in the Webdis config file by setting IP-based rules to stop any Tom, Dick or Harry from modifying your Redis keys. So to install Webdis we do the following :-

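Roughly this – Webdis needs libevent to build, and at the time of writing lives in nicolasff’s repository on GitHub:

    sudo yum install -y git gcc make libevent-devel
    git clone https://github.com/nicolasff/webdis.git
    cd webdis
    make
    ./webdis webdis.json &          # webdis.json holds the port, the Redis host and the IP-based ACLs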

And that’s it – we now have tweet-to-browser in 60 seconds, without writing a single line of code! Here’s an overview of the architecture we have :-

(Diagram: DataSift pushes tweets into Redis; the browser pulls them back out via Webdis over HTTP)

One advantage of using this architecture on a small data set is that you avoid having to write any backend code at all: the data is pulled directly to the browser, where you can then use Javascript to manipulate it. This meant I could avoid having to test server performance, which was good as I had no idea how many hits the dashboard would get – it got a lot! The server simply acted as the middleman, and I could instead focus on the performance of the client side, something I could test with the various devices I have at home.

Would I do this in a production environment with a larger data set? Probably not – it wouldn’t scale. I’d instead write server-side code to handle the data processing, test performance and push only the required data to the client. The right tools for the right job.

Next we’ll move on to the dashboard itself, where we’ll be using jQuery, Ajax, a couple of jQuery plugins, a plotting library and some Javascript to pull the whole thing together. Before we can do anything else we need to retrieve the data from Redis via Webdis and parse the JSON so we can manipulate it as a Javascript object. The following code snippet demonstrates how this can be done.

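A sketch of that snippet – the hostname and the Redis list name (“ukoug”) are illustrative, and error handling is left out for brevity:

    var tweets = [];

    function getData() {
      $.ajax({
        url: 'http://ourserver:7379/LRANGE/ukoug/0/100000',  // Webdis turns this into a Redis LRANGE
        dataType: 'json',
        success: function (response) {
          tweets = [];
          // Webdis wraps the reply in an object keyed by the command name;
          // each list item is the JSON payload DataSift pushed for one tweet
          $.each(response.LRANGE, function (i, item) {
            tweets.push(JSON.parse(item));
          });
        }
      });
    }

    getData();
    setInterval(getData, 300000);   // refresh every five minutes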

The function getData() is called via setInterval(); the second parameter to setInterval() sets the frequency of the call, in this case every five minutes (300,000 milliseconds). getData() performs an Ajax GET request on a URL that points to our Webdis server listening on port 7379, which in turn performs an LRANGE command on the Redis data store – it’s simply asking Redis to return all list items between 0 and 100000, each item being a single tweet. Once the Ajax request has completed, the “success” callback fires, and within this callback each tweet is pushed into an array, so we end up with an array of tweet objects. We now have all the data in a format we can manipulate to our heart’s content.

Now onto the Graphs and Visualisations.

The Globe

(Screenshot: the spinning tag globe)

The spinning globe was built using the excellent tagcanvas.js jQuery plugin (a separate standalone Javascript library also exists). To create the data for this, a frequency count was performed on all the words in the tweet content, and the total for each word was used as a “weight”; this data was then passed to the jQuery plugin. There are a plethora of options for this plugin, allowing you to produce all kinds of funky tag clouds. Every 60 seconds the globe fades out and is replaced by a vertical tweet ticker – this was done with jQuery and a setInterval timer.
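In outline it looked something like this – count word frequencies across the tweet text, write the words out as weighted links, then hand the lot to TagCanvas (element IDs, option values and the use of DataSift’s interaction.content field are all illustrative):

    // count word frequencies across all of the tweet text
    var counts = {};
    $.each(tweets, function (i, t) {
      $.each(t.interaction.content.toLowerCase().split(/\s+/), function (j, word) {
        counts[word] = (counts[word] || 0) + 1;
      });
    });

    // write each word out as a link, with the count as the weight TagCanvas will read
    $.each(counts, function (word, count) {
      $('#tags ul').append('<li><a href="#" data-weight="' + count + '">' + word + '</a></li>');
    });

    // spin the globe
    $('#globeCanvas').tagcanvas({
      weight: true,
      weightFrom: 'data-weight',
      wheelZoom: false,
      textColour: '#ffffff'
    }, 'tags');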

The Tweet Ticker

(Screenshot: the vertical tweet ticker)

The tweet ticker was built by grabbing the latest 30 tweets from the data array and assigning them to HTML <li> tags; the vTicker jQuery plugin was then applied to the containing <ul> tag. Various plugin options allow you to control things like the delay and scroll speed.
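Something along these lines (IDs and option values are illustrative – vTicker just needs a list to work on):

    // turn the 30 most recent tweets into list items
    var items = $.map(tweets.slice(0, 30), function (t) {
      return '<li>' + t.interaction.content + '</li>';
    });
    $('#ticker ul').html(items.join(''));

    // and let vTicker handle the scrolling
    $('#ticker').vTicker({ speed: 500, pause: 3000, showItems: 3 });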

Tweet Velocity

(Screenshot: the tweet velocity graph)

The flot plotting library was used to create this graph. You simply pass flot some data in the form of an array and set all the display options in an options object. The data for this was created by truncating each tweet’s timestamp to the nearest hour and then aggregating up to get the totals, using Javascript array manipulation.
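The gist of it, assuming DataSift’s interaction.created_at timestamp and flot’s time plugin for the axis:

    // bucket the tweets by hour
    var perHour = {};
    $.each(tweets, function (i, t) {
      var d = new Date(t.interaction.created_at);
      d.setMinutes(0, 0, 0);                       // truncate to the hour
      perHour[d.getTime()] = (perHour[d.getTime()] || 0) + 1;
    });

    // flot wants an array of [x, y] points, x being a millisecond timestamp
    var series = $.map(perHour, function (count, ts) { return [[Number(ts), count]]; });
    series.sort(function (a, b) { return a[0] - b[0]; });

    $.plot($('#velocity'), [series], {
      xaxis:  { mode: 'time' },
      series: { lines: { show: true }, points: { show: true } }
    });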

Top 10 Speakers by Twitter Mention

(Screenshot: top 10 speakers by Twitter mention)

This one proved quite popular – these speaker types are a competitive bunch! Having obtained a list of speakers from the UKOUG Tech13 website, I was able to search all the tweet content for each speaker and aggregate up to get the total Twitter mentions for each one; again the graph was rendered using the flot plotting library. As the graph updated throughout each day, speakers were swapping places, with our own Mark Rittman tweeting out the “scores on the doors” at regular intervals. When the stats can be manipulated, though, there’s always someone willing to take advantage!

(Screenshot of the offending tweet)

tut, tut Mark.

Twitter Avatars

(Screenshot: the scrolling Twitter avatars)

The Twitter avatars used the tagcanvas.js library, but instead of populating it with words from the tweet content, the tweet avatars were used. A few changes to the plugin options were made to display the results as a horizontally scrolling cylinder instead of a globe.

Twitter Sentiment

(Screenshot: the Twitter sentiment graph)

The Twitter sentiment graph again used flot. Tweet sentiment was graphed over time for the duration of the conference, with the sentiment score provided by DataSift as part of the Twitter payload. The scores we received across the 3,500 tweets ranged between -15 and 15, each score reflecting either a positive or a negative tweet. Asking a computer to infer a human emotion from 140 characters of text is a tough ask. Having looked at the data in detail, a fair few of the tweets that received negative scores weren’t negative in nature – for example, tweets containing “RAC attack” and “Dangerous anti patterns” generated a cluster of negative scores. As we know, computers are not as clever as humans: how can they be aware of the context of a tweet, detect sarcasm, or differentiate between banter and plain old insults? Not all the negatively scored tweets were false negatives – some were genuine, and a few regarding the food seemed to ring true.

Perhaps the data you’re analyzing needs to be taken in context as a whole: you’d expect 1,000 techies running around a tech conference to be happy, and the sentiment analysis seemed to do a better job of ranking how positive tweets were than how negative they were. From a visualisation point of view, perhaps a logarithmic scale, along with a multiplier to raise the lowest score, would have worked better in this case to reflect how positive the event was overall. One thing is clear, though: further statistical analysis over a much larger data set would be needed to really gain insight into how positive or negative your data set is.

The remaining graphs were also created using flot. The data was sourced from a spreadsheet provided by the conference organizers; it was aggregated, hardcoded into the web page as Javascript arrays and passed to the various flot instances.

So that’s it kids, I hope you’ve enjoyed this episode of Blue Peter, until next time….

Edit – 6-Dec-2013 19.07:

Mark has asked me the following question over twitter:

“any chance you could let us know why these particular tools / components were chosen?”.

I thought I’d give my answer here.

One of the overriding factors in choosing these tools was time. With only three days to piece the thing together, I decided early on that I’d write all the code on the client side in Javascript. This meant I could write all my code in one location and in one language, with less to test and less that could go wrong. Webdis allowed me to do this because I didn’t need to write any backend code to get the data into the browser.

All the tools are also open source, easy to install and configure, and well documented, and I had used them all previously – again, a time saver. Redis was chosen for two reasons: it was supported as a subscription destination by DataSift (along with many others), and I’m currently using it in another development project, so I was up to speed with it. Although in this solution I’m not really taking advantage of Redis’s access speed, it worked well as somewhere to persist the data.

I’ve used flot several times over the years, and although there are other Javascript charting libraries out there I didn’t have time to test and learn a new one, so flot was a no-brainer. As was jQuery, the de facto Javascript library for Ajax, DOM manipulation and adding sugar to your web pages. I’d not used tagcanvas or vTicker before, but if you can install and get a jQuery plugin working in less than 10 minutes, hassle free, then it’s probably a good one – both met that criterion.

If I was coding a more permanent solution with more development time, I’d add a relational database into the mix and use it to perform all of the data aggregation and analysis; this would make the solution more scalable. I’d then either feed the browser via Ajax calls to backend code sitting over the database, or populate a cache layer from the database and make Ajax calls directly on the cache, similar to what I did in this solution. It would have been nice to use D3 to develop a more elaborate visualisation instead of the canned flot charts, but again this would have taken more time to develop.

UKOUG Tech13 Conference

Next week Rittman Mead is the analytics sponsor for the UKOUG Tech13 conference up in Manchester.

We’ve got a great line-up of talks, covering everything from the BI Apps and Endeca through to Hadoop, OBIEE and mobile – Stewart and Charles are both flying in from the US to present alongside Mark, Adam and myself. The full list is as follows:

On Monday night we are teaming up with our friends at Pythian to host some drinks and nibbles at Taps Bar, which is just around the corner from the conference venue. We’ll be there from 6.30-9.00pm, so come and join us for a free drink.

Analytics Sponsor

As part of being the analytics sponsor for the event we are looking to collate as much real time, social media and demographic information about the event as possible. We will be displaying an analytics dashboard in the conference venue detailing statistics from this data. To help us, could you:

  1. Complete the form here to give us some background information about why you are going; and
  2. Use the official hashtag ukoug_tech13 when tweeting anything about the event.

It’s looking like it will be a great few days, so we look forward to seeing you up there.


Why ODI, DW and OBIEE Developers Should Be Interested in Hadoop

Over the past few months I’ve been posting a number of articles about Hadoop, and how you can connect to it from ODI and OBIEE. From an ODI perspective, I covered Hadoop as one of a number of new data sources ODI11g could connect to, then looked at how it leveraged Hive to issue SQL-like data extraction commands to Hadoop, and how it used Oracle Hadoop connector tools to transfer Hadoop data into the Oracle Database, and directly work with data in HDFS files. For OBIEE, I went through the background to Hadoop, Hive and the other “big data” technologies, stepped through a typical Hive query session, then showed how OBIEE 11.1.1.7 could connect to Hadoop through its newly-added Hive adaptor, then finally built a proof-of-concept OBIEE connection through to Cloudera Impala, then extended that to a multi-node Hadoop cluster.

But why all this interest in Hadoop – what’s it really got to do with OBIEE and ODI, and why should you as developers be interested in what’s probably yet another niche BI/DW datasource? Well, in my opinion Hadoop is the classic disruptive technology – cheap, and starting off with far less functionality than regular relational databases, but improving fast – and for us as BI&DW developers it offers the potential of both massive benefits – significantly lower TCO for basic DW work, and support for lots of modern, internet-scale use-cases – and threats, in that if we don’t understand it and see how it can benefit our customers and end-users, we risk being left behind as technology moves on.

To my mind, there are two main ways in which Hadoop, Hive, HDFS and the other big-data ecosystem technologies are used, in the BI/DW context:

1. Standalone, with their own query tools, database tools, query languages and so forth – your typical “data scientist” use case, originating from customers such as Facebook, LinkedIn etc. In this context, there’s typically no Oracle footprint, users are pretty self-sufficient, any output we see is in the form of “insights”, marketing campaigns etc.

2. Alongside more mainstream technologies, Oracle’s for example. In this instance Hadoop, Hive, HDFS, NoSQL etc are used as complementary, supporting technologies to enhance existing Oracle-based data warehouses, data capture processes and BI systems. In some cases Hadoop-type technologies can replace more traditional relational ones, but mostly they’re used to make BI&DW systems more scalable, cheaper to run, and able to work with a wider range of data sources. This is the context in which Hadoop can be relevant to more traditional Oracle BI, ETL and DW developers.

To understand how this happened, let’s go through a bit of a history lesson. Five years ago or so, your typical DW+BI architecture looked like this:

(Diagram: a traditional three-layer data warehouse and BI architecture)

The data warehouse was typically made up of three layers – staging, foundation/ODS and performance/dimensional – with data stored in relational databases and some use made of OLAP servers, or some of the newer in-memory databases like QlikView. But over the intervening years the scale and types of data sources have increased, with customers now looking to store data from unstructured and semi-structured sources in their data warehouse, take in feeds from social media and other “streaming” sources, and access data in cloud systems, typically via APIs rather than traditional ETL loading. So now we end up with a data warehouse architecture that looks like this:

(Diagram: the same architecture extended with unstructured, streaming and cloud sources)

But this poses challenges for us. From an ETL perspective, how do we access these non-traditional sources, and once we’ve accessed them – how do we efficiently process them? The scale and “velocity” of some of these sources can be challenging for traditional ETL processes that expect to log every transformation in a database with transactional integrity and multi-version concurrency control, whilst in some cases it doesn’t make sense to try and impose a formal data structure on incoming data as you’re capturing it, instead giving it the structure when we finally need it, or when we choose to access it in a query.

And then came “Hadoop”, and its platform and tool ecosystem. At its core, Hadoop is a framework for processing, in a massively-parallel and low-cost fashion, large amounts of data using simple transformation building blocks – filtering (mapping) and aggregating (reducing). Hadoop and MapReduce came out of the US West Coast Internet scene as a way of processing web and behavioural data in the same massively-distributed way that companies provided web search and other web 2.0 activities, and a core part of it was that it was (a) open-source, like Linux and (b) cheap, both in being open-source but also because it was designed from the outset to run on low-cost, commodity hardware that’s expected to fail. Pretty much the opposite of Oracle’s business model, but also obviously very attractive to anyone looking to lower the TCO of their data warehouse system.

So as I said – the Hadoop pioneers went out and built their systems without much reference to vendors such as Oracle, IBM, Microsoft and the like, and, being blunt, they won’t have much time for traditional Oracle BI&DW developers like ourselves. But for those customers who are largely invested in Oracle technology but see advantages in deploying Hadoop and big data technologies to make their systems more flexible, scalable and cheaper to run – that’s where ODI and OBIEE’s connectivity to these technologies becomes interesting.

To take the example of customers who are looking to deploy Hadoop technologies to enhance their Oracle data warehouse – a typical architecture going down this route would look like this:

(Diagram: an Oracle data warehouse with Hadoop/HDFS as a pre-staging layer)

In this example we’re using HDFS – the Hadoop Distributed File System – as a pre-staging area for the data warehouse, storing incoming files cheaply and with built-in fault tolerance, to the point where storage is so cheap that you might as well keep stuff you’re not interested in now but think might be interesting in the future. Using Oracle Direct Connector for HDFS, you can set up Oracle Database external tables that map onto HDFS just like any other file system, so you can extract from and otherwise work with these files without worrying about writing MapReduce jobs; and through the Oracle Data Integration Adapter for Hadoop you can connect ODI to these sources as well, and work with them just like any other topology source, as I show in the slide below from my upcoming UKOUG Tech’13 session on ODI, OBIEE and Hadoop that’s running in a couple of weeks’ time in Manchester:

(Slide: ODI connecting to Hive and HDFS sources via the Oracle Data Integration Adapter for Hadoop)

As well as storing data, you can also do simple filtering and transformation on that data, using the Hadoop framework. Most upfront data processing you do as part of an ETL process involves filtering out data you’re not interested in, joining data sets, grouping and aggregating data, and other large-scale data transformation tasks, before you then load it into the foundation/ODS layer and do more complex work. And this simple filtering and transformation is what Hadoop does best, on cheap hardware or even in the cloud – and if your customer is already invested in ODI and runs the rest of their ETL process using it, it’s relatively simple to add Hadoop capabilities to it, using ODI to orchestrate the data processing steps but Hadoop to do the heavy lifting, as my slide below shows:

(Slide: ODI orchestrating the ETL process, with Hadoop doing the heavy lifting)

Now some customers, and of course Hadoop vendors, say that in reality you don’t even need the Oracle database if you’re going to build a data warehouse, or more realistically a data mart. Now that’s a bigger question and probably one that depends on the particular customer and circumstances, but a typical architecture that takes this approach might look like this:

(Diagram: a Hadoop-based data warehouse / data mart architecture)

In this case, ODI again has capabilities to transform data entirely within Hadoop – with ODI acting as the ETL framework and co-ordinator, but Hadoop doing the heavy lifting – and there’s always the ability to get the data out of Hadoop and into a main Oracle data warehouse, if the Hadoop system is more of a data mart or department-specific analysis. But whichever way – in most cases the customer is going to want to continue to use their existing BI tool, particularly if their BI strategy involves bringing together data from lots of different systems, as you can do with OBIEE’s federated query capability – giving you an overall architecture that looks like this:

(Diagram: OBIEE federating queries across Oracle and Hadoop sources)

So it’s this context that makes OBIEE’s connectivity to Hadoop so important; I’m not saying that someone creating a Hadoop system from scratch is going to go out and buy OBIEE as their query tool – more typically, they’ll use other open-source tools or create models in tools like R; or they might go out and buy a lightweight data visualisation tool like Tableau and use that to connect solely to their Hadoop source. But the customers we work with have typically got much wider requirements for BI, have a need for an enterprise metadata model, recognise the value of data and report governance, and (at least at present) access most of their data from traditional relational and OLAP sources. But they will still be interested in accessing data from Hadoop sources, and OBIEE’s new capability to connect to this type of data, together with closer integration with Endeca and its unstructured and semi-structured sources, addresses this need.

So there you have it – that’s why I think OBIEE and ODI’s ability to connect to Hadoop is a big deal, and it’s why I think developers using those tools should be interested in how it works, and should try and set up their own Hadoop systems to see it in action. As I said, I’ll be covering this topic in some detail at the UKOUG Tech’13 Conference in Manchester in a couple of weeks’ time, so if you’re there on the Sunday, come along and I’ll try and explain how I think it all fits together.

Rittman Mead / ODTUG India BI Masterclass Tour Roundup

Over the past week Venkat, the Rittman Mead India team and I have been running a series of BI Masterclasses at locations in India, in conjunction with ODTUG, the Oracle Development Tools User Group. Starting off in Bangalore, then travelling to Hyderabad and Mumbai, we presented on topics ranging from OBIEE and Exalytics through to EPM Suite and the BI Applications, with networking events at the end of each day.


Around 50 people attended in Bangalore, 30 in Hyderabad and 40 in Mumbai, and at the last event we were joined by Harsh Bhogle from the local Oracle office, who presented on Oracle’s high-level strategy around business analytics. Thanks to everyone who attended, thanks to ODTUG for sponsoring the networking events, and thanks especially to Vijay and Pavan from Rittman Mead India who organised everything behind the scenes. If you’re interested, here’s a Flickr set of photos from all three events (plus a few from the start of the week, when I visited our offices in Bangalore).

For anyone who couldn’t attend the events, or if you were there and you’d like copies of the slides, the links below are for the PDF versions of the sessions we presented at various points over the week.

So I’m writing this in my hotel room in Mumbai on Sunday morning, waiting for the airport transfer and then flying back to the UK around lunchtime. It’s been a great week but my only regret was missing the UKOUG Apps’13 conference last week, where I was also supposed to be speaking but managed to double-book myself with the event in India.

In the end, Mike Vickers from Rittman Mead in the UK gamely took my place and presented my session, which was put together as a joint effort with Minesh Patel, another member of the UK team and one of our BI Apps specialists. Entitled “Oracle BI Apps – Giving the Users the Reports they *Really* Want”, it’s a presentation around the common front-end customisations that we typically carry out for customers who want to move beyond the standard, generic dashboards and reports provided by the BI Apps. Again, if you missed the session or you’d like to see the slides, they’re linked to below:

That’s it for now – and I’ll definitely be at Tech’13 in a few weeks’ time, if only because I’ve just realised I’m delivering the BI Masterclass sessions on the Sunday, including a session on OBIEE/ODI and Hadoop integration. I’ve been saying to myself that I’d like to get these two tools working with Impala as an alternative to Hive, so that gives me something to start looking at on the flight back later today.