Tag Archives: Big Data

OBIEE, ODI and Hadoop Part 3: A Closer Look at Hive, HDFS and Cloudera CDH3

In the first two parts of this series, I looked at the recently-added support for Apache Hadoop as a data source for OBIEE 11.1.1.7 and ODI 11.1.1.6, and explained how that Hadoop support is really enabled through a related technology called Hive. In the second part I showed how OBIEE 11.1.1.7 could report against “big data” sources using Hadoop and this Hive technology, but all of this, of course, presupposes that we have data in Hive in the first place. So what actually is Hive, how do you load data into it, and can ODI help with the process?

To take a few steps back, Apache Hive is a Hadoop-family project that provides a “data warehouse” layer over Hadoop, through a metadata layer not unlike OBIEE’s RPD together with a SQL-like language called HiveQL. Coupled with its ODBC and JDBC database drivers, Hive gives BI tools like OBIEE access to big data sources, as the HiveQL language it uses is very similar to the SQL used to access databases such as Oracle, SQL Server or MySQL. Delving a bit deeper into the Hive product architecture, the diagram below shows that Hive has a number of components, including a “database engine”, a metadata store, APIs for client access, and a link through to Hadoop to actually load, process and retrieve data in HDFS (the Hadoop Distributed File System).

[Image: Hive product architecture diagram]
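
To give a flavour of just how close HiveQL is to regular SQL, here is the sort of query you can issue against a Hive table; the table and column names below are hypothetical and purely for illustration, but the syntax would be at home in any Oracle or MySQL session:

select cust_region, count(*) as order_count
from   order_lines
where  order_status = 'SHIPPED'
group  by cust_region;

Behind the scenes, Hive compiles a query like this into one or more MapReduce jobs and runs them on the Hadoop cluster, rather than executing it against a conventional database engine.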

So what’s HDFS then? HDFS is a fault-tolerant, distributed filesystem that’s a core part of Apache Hadoop, and it stores the data that MapReduce jobs then process via job trackers, task trackers and all the other Hadoop paraphernalia. HDFS is accessed through a URI (URL) rather than through your Linux filesystem browser, but distributions such as Cloudera’s CDH3 and CDH4 ship with tools such as Hue, shown below, that provide a web-based interface into HDFS so that you can browse it like a regular OS-level filesystem.

[Image: browsing HDFS through the Hue web interface]

Notice how there’s a “user” folder, just as we’d get with Linux, and within that folder there’s a home directory for Hive? The data you manage with Hive is generally loaded into a directory structure under this “hive” user, either using data taken from another directory area in HDFS or from external files. Hive’s data is still in file form and accessed via MapReduce and Hadoop, but it sits in a directory area away from everything else. You can, however, tell Hive to create tables using data held elsewhere in HDFS, analogous to Oracle’s external tables feature, which skips the data loading process and just maps table structures onto files held elsewhere in the Hadoop filesystem.

[Image: browsing the /user/hive/warehouse directory in Hue]

In most cases when we’re considering OBIEE accessing Hadoop data via Hive, the data will have been loaded into Hive-managed tables beforehand, though it’s also possible that Hive table metadata has simply been mapped onto other data already in HDFS. Assuming you’ve got Hue installed, along with Beeswax, the table browser for Hive that usually comes with it, you can see where each individual table within your Hive metastore is actually held; in the examples below, the dwh_customer Hive table is a managed table and has its data stored within the /user/hive/warehouse/ HDFS directory, whilst the ratings table has its data stored outside of Hive’s directory structure, but still within the HDFS-managed filesystem.

[Image: Beeswax showing the storage locations of the dwh_customer and ratings Hive tables]
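
As a rough sketch of how a table like that ratings table might have been declared over files already sitting in HDFS (the actual DDL isn’t shown in the screenshots, so the location and column names here are assumptions based on the MovieLens data used later in this article), an external-style definition simply points Hive at an existing directory:

create external table ratings (
  user_id   string,
  movie_id  string,
  rating    float,
  tmstmp    string)
row format delimited fields terminated by '\t'
location '/user/oracle/movielens_src';

Because the table is declared as EXTERNAL with a LOCATION clause, no data is copied into /user/hive/warehouse; Hive just maps the table structure onto the files already held in that HDFS directory.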

So how does one create a Hive table, load data into it and get it ready for OBIEE access, and can ODI help with this, as we asked earlier? Before we get into ODI then, let’s take a look at how a Hive table is created and loaded, and then we’ll see how ODI does the same job.

With thanks to the ODI product development team’s David Allan, who put together some great Hive and ODI examples in this blog post, let’s start by creating a Hive table for the same movie ratings data shown in the ratings table example above, but this time with the data actually loaded into Hive’s directory structure (i.e. a “managed” table). After SSH’ing into the VM running Hive, I type the following commands into the Hive command-shell to create the managed table:

officeimac:~ markrittman$ ssh oracle@bigdatalite
Warning: Permanently added the RSA host key for IP address '192.168.2.35' to the list of known hosts.
oracle@bigdatalite's password:
Last login: Mon Apr 22 10:59:07 2013 from 192.168.2.47
=====================================================
=====================================================
Welcome to BigDataLite
run startx at the command line for X-Windows console
=====================================================
=====================================================

Host: bigdatalite.us.oracle.com [192.168.2.35]

[oracle@bigdatalite ~]$ hive
Hive history file=/tmp/oracle/hive_job_log_oracle_201304250732_1523047910.txt

hive> create table movie_ratings (user_id string
> , movie_id string
> , rating float
> , tmstmp string)
> row format delimited fields terminated by '\t';

OK
Time taken: 3.809 seconds
hive>

At this point the table is created but there’s no data in it; that part comes in a moment. I can see the table structure and its empty state from the Hive command-line:


hive> describe movie_ratings;
OK
user_id string
movie_id string
rating float
tmstmp string
Time taken: 0.168 seconds

hive> select count(*) from movie_ratings;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201303171815_0021, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201303171815_0021
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0021
2013-04-25 07:40:51,581 Stage-1 map = 0%, reduce = 0%
2013-04-25 07:40:56,617 Stage-1 map = 0%, reduce = 100%
2013-04-25 07:40:58,640 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303171815_0021
OK
0
Time taken: 12.931 seconds
hive>

and also from the Beeswax web UI:

[Image: the empty movie_ratings table in the Beeswax web UI]

So how do we get the data into this table, without any tools such as ODI? I can either load data straight from files on my local workstation, or I can upload them, for example using Hue, into the HDFS filesystem first.

[Image: uploading the source file into HDFS using Hue]
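
If you’d rather stay at the command line than use Hue, the same staging step can be done with the hadoop fs utility, or you can skip HDFS staging altogether and load straight from the local filesystem using LOAD DATA LOCAL INPATH; the paths below follow the MovieLens example in this article but are otherwise just an illustrative sketch:

[oracle@bigdatalite ~]$ hadoop fs -mkdir /user/oracle/movielens_src
[oracle@bigdatalite ~]$ hadoop fs -put u.data /user/oracle/movielens_src/

hive> load data local inpath '/home/oracle/u.data'
> overwrite into table movie_ratings;

With the LOCAL keyword Hive copies the file from the local filesystem into its warehouse directory, whereas the plain LOAD DATA INPATH used below moves the file from its existing HDFS location.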

Now I can use the HiveQL LOAD DATA command to load one of these HDFS files into the Hive table, and then count how many rows have been loaded, like this:


hive> load data inpath '/user/oracle/movielens_src/u.data'
> overwrite into table movie_ratings;
Loading data to table default.movie_ratings
Deleted hdfs://localhost.localdomain/user/hive/warehouse/movie_ratings
OK
Time taken: 0.341 seconds

hive> select count(*) from movie_ratings;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201303171815_0022, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201303171815_0022
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0022
2013-04-25 08:14:24,159 Stage-1 map = 0%, reduce = 0%
2013-04-25 08:14:32,340 Stage-1 map = 100%, reduce = 0%
2013-04-25 08:14:42,420 Stage-1 map = 100%, reduce = 33%
2013-04-25 08:14:43,428 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303171815_0022
OK
100000
Time taken: 26.32 seconds
hive>

So how does this process look when using ODI to do the Hive data loading? Let’s start by importing the Hive table metadata for the movie_ratings table I just created from the Hive command-line shell, by going over to the Topology navigator in ODI 11.1.1.6. Note that you’ll need to configure ODI to connect to your Hive, HDFS and Hadoop environment beforehand, using the Oracle Data Integrator for Hadoop documentation as a guide; this adapter is an extra-cost license option on top of base ODI Enterprise Edition.

Hive has its own technology type within the Topology navigator, and you create the connection through to Hive using the HiveJDBC driver, first adding the connection to the Hive server and then specifying the particular Hive database / namespace, in this case selecting the “default” database for my Hive system.

[Image: defining the Hive connection in the ODI Topology navigator]
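
For reference, the sort of values that go into that Hive data server definition, using the HiveServer1-style JDBC driver of the CDH3 era (the hostname and port below are assumptions based on the BigDataLite VM and the default Hive server port, so check your own environment), look like this:

JDBC driver class : org.apache.hadoop.hive.jdbc.HiveDriver
JDBC URL          : jdbc:hive://bigdatalite:10000/default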

Now I can reverse-engineer the Hive table structures into a Designer navigator model, just like any other relational table structure.

[Image: the reverse-engineered Hive table in the ODI Designer navigator model]

Within the ODI Topology navigator you can then create File technology connections, either to files held in HDFS or, more likely with ODI, to files on your workstation or server filesystem, like this:

[Image: creating a File technology data server in the ODI Topology navigator]

and then add the file datastores to the Designer navigator model list, entering the correct delimiter information and reversing the column definitions into the datastore definition.

[Image: the file datastore definition with reversed column definitions]

Now it’s a case of creating an interface to load the Hive table. In this instance, I map each of the source file “columns” into the Hive table’s columns, as the source file is delimited with an easily-usable structure.

[Image: mapping the source file columns to the Hive table columns in the ODI interface]

Then, over in the Flows tab for the interface, I make sure the IKM File to Hive knowledge module is selected, keep the default values for the KM options (more on these in a moment), and then save the interface.

[Image: the IKM File to Hive knowledge module selected in the interface Flows tab]

Now it’s a case of running the interface and checking the results. Notice, in the Operator navigator code panel, the LOAD DATA command that ODI generates dynamically, similar to the one I wrote manually earlier in the article.

[Image: the dynamically-generated LOAD DATA command in the Operator navigator]

Going back to my Hive command-line session, I can see that there are now 100,000 rows in the movie_ratings Hive table.


hive> select count(*) from movie_ratings;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201303171815_0024, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201303171815_0024
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0024
2013-04-25 16:59:12,275 Stage-1 map = 0%, reduce = 0%
2013-04-25 16:59:18,346 Stage-1 map = 100%, reduce = 0%
2013-04-25 16:59:29,467 Stage-1 map = 100%, reduce = 33%
2013-04-25 16:59:30,475 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303171815_0024
OK
100000
Time taken: 27.251 seconds

Now in many cases the data going into a Hive table isn’t neatly arranged into columns within delimited files; it could be, for example, web log data that you’ll need to parse using regular expressions or other APIs or standard parsers. When that’s the case, you can use an option with the IKM File to Hive knowledge module to override the normal column-to-column mappings and instead use an expression, something Oracle have done in their demo environment for parsing these types of log files.

[Image: the IKM File to Hive option overriding the row format with a ROW FORMAT SERDE expression]

“ROW FORMAT SERDE” is a reference to Hive’s “Serializer/Deserializer”, or row-formatting feature, which gives you the ability to use regular expressions and other data manipulation techniques to, in this case, allocate incoming file data to the proper columns in the target Hive table.
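
As a hedged illustration of what such a definition looks like (this isn’t the exact expression from Oracle’s demo environment, just a typical example using the RegexSerDe that ships in the hive-contrib library to parse Apache-style access logs), you’d create the Hive table with a ROW FORMAT SERDE clause along these lines:

create table access_logs (
  host      string,
  req_time  string,
  request   string,
  status    string,
  size      string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties (
  'input.regex' = '([^ ]*) \\S+ \\S+ \\[([^\\]]*)\\] "([^"]*)" (\\d+) (\\d+).*'
)
stored as textfile;

Each bracketed group in the regular expression is mapped, in order, onto a column of the table (the contrib RegexSerDe expects all columns to be strings), and it’s exactly this kind of expression that the IKM File to Hive option lets you supply in place of the normal column-to-column mappings.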

So now we’re at the point where we can use ODI to populate the Hive tables that OBIEE in turn uses to access Hadoop data sources. But what if the data we want to load into Hive isn’t in the format or shape we need, and we need to join, filter or otherwise work with Hive data and tables before we can report on it? And what if we want to get data out of Hive and into regular relational tables, because a relational data store makes more sense than Hadoop for a particular reporting requirement? Check back tomorrow for the final part in this series, where we’ll answer these remaining questions.

Rittman Mead BI Forum Atlanta Special Guest: Alex Gorbachev

A few days back, I introduced our special guests for the Rittman Mead BI Forum in Atlanta, focusing first on Cary Millsap. Today I’d like to talk about our other special guest: Oracle ACE Director Alex Gorbachev. Alex was an inspiration for me back in the Oracle Database 10g and early 11g days when I was administering Oracle RAC for several data warehouse customers, and wondering whether RAC was the right platform for BI. Of course it was… and every time I read one of Alex’s blogs (he was quite a prolific blogger back then… we all were once upon a time) or saw him speak, I felt empowered to go take on Cache Fusion.

Alex joined Pythian in Canada as a DBA team lead in 2006. Just two years later, he moved to Australia to successfully start up Pythian Australia. In 2009, he returned to Canada and took up the mantle of Chief Technology Officer, a title he still holds today. He is a member of the distinguished OakTable Network (as is Cary Millsap… something I forgot to mention yesterday), and sits on the Board of Directors of the Independent Oracle Users Group (IOUG). Alex founded the Battle Against Any Guess Party, a movement promoting scientific troubleshooting techniques, and during his time in Australia he also founded the Sydney Oracle Meetup, a vibrant local community of passionate Oracle professionals.

It’s fortuitous that Mark blogged yesterday on Hadoop… as this is exactly what Alex is speaking on at the BI Forum. His presentation is titled “Hadoop versus the Relational Data Warehouse.” He’ll discuss some of the technical design principles of Hadoop and the reasons for its rise in popularity. We’ll get to see the position that Hadoop currently occupies in the enterprise data center, its possible future trajectory, and how that trajectory compares with the more traditional relational data warehouse. For the BI developers in the crowd who have perhaps never seen Alex speak… you’re definitely in for a treat. He’s set to speak first thing Friday morning to kick off the last day of the Forum. If you know Alex, you’re obviously aware that he’s an excellent technologist, but you also likely know how much fun he is to be around, so it will be good to have him at the social meet-ups in and around the conference.

I’d really like to thank our friend and business partner Pythian for always supporting Rittman Mead and ensuring that Alex would speak at the Forum. And of course… I’d be remiss if I didn’t say: Love Your Data!

OBIEE, ODI and Hadoop Part 1: So What Is Hadoop, MapReduce and Hive?

Recent releases of OBIEE and ODI have included support for Apache Hadoop as a data source, probably the most well-recognised technology within the “big data” movement. Most OBIEE and ODI developers have probably heard of Hadoop and MapReduce, the data-processing programming model that goes hand-in-hand with Hadoop, but haven’t tried them themselves or really found a pressing reason to use them. So over this series of three articles, we’ll take a look at what these two technologies actually are, and then see how OBIEE 11g and ODI 11g connect to them and make use of their features.

Hadoop is actually a family of open-source tools sponsored by the Apache foundation that provides a distributed, reliable, shared storage and analysis system. Designed around clusters of commodity servers (which may actually be virtual and cloud-based), with data stored on the servers themselves rather than on separate storage units, Hadoop came from the world of Silicon Valley social and search companies and has spawned a raft of Apache foundation sub-projects such as Hive (for SQL-like querying of Hadoop clusters), HBase (a distributed, column-store database based on Google’s “BigTable” technology), Pig (a procedural language for writing Hadoop analysis jobs that’s PL/SQL to Hive’s SQL) and HDFS (a distributed, fault-tolerant filesystem). Hadoop, being open-source, can be downloaded for free and run easily on most Unix-based PCs and servers, and also on Windows with a bit of mucking-around to create a Unix-like environment. The Hadoop code has also been extended, and to an extent commercialised, by companies such as Cloudera (who provide the Hadoop infrastructure for Oracle’s Big Data Appliance) and Hortonworks, who can be thought of as the “Red Hat” and “SuSE” of the Hadoop world.

MapReduce, on the other hand, is a programming model, or algorithm, for processing data, typically in parallel. MapReduce jobs can be written, theoretically, in any language as long as they expose two particular methods, steps or functions to the calling program (typically, the Hadoop JobTracker):

  • A “Map” function, that takes input data in the form of key/value pairs and extracts the data that you’re interested in, outputting it again in the form of key/value pairs
  • A “Reduce” function, which typically sorts and groups the “mapped” key/value pairs by key, and then passes the results down the line, often to another MapReduce job for further processing (see the worked example just after this list)
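
To make that a little more concrete, here’s a worked word-count example expressed as the kind of SQL-style query that Hive (covered below) turns into exactly this map-and-reduce pattern; the words table is hypothetical, purely for illustration:

-- map phase   : read each row of the hypothetical words table and emit a
--               (word, 1) key/value pair for every word
-- reduce phase: group the emitted pairs by word and sum the 1s into a count
select word, count(*) as occurrences
from   words
group  by word;

The map step does the extraction (one key/value pair per word), and the reduce step does the sorting, grouping and summing, which is all a word-count MapReduce job really amounts to.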

Joel Spolsky (of Joel on Software fame, one of Jon’s and my inspirations in setting up Rittman Mead) explains MapReduce well in this article back from 2006, where he’s trying to explain the fundamental differences between object-orientated languages like Java and functional languages like Lisp and Haskell. Ironically, most MapReduce functions you see these days are actually written in Java, but it’s MapReduce’s intrinsic simplicity, together with the way that Hadoop abstracts away the process of running individual map and reduce functions across lots of different servers and lets its job co-ordination tools make sense of all the chaos and return a result at the end, that has made it take off so well and allowed data analysis tasks to scale beyond the limits of a single server.

[Image: MapReduce map and reduce processing flow]

I don’t intend to try and explain the full details of Hadoop in this blog post, though, and in reality most OBIEE and ODI developers won’t need to know how Hadoop works under the covers; what they will often want to do is connect to a Hadoop cluster and make use of the data it contains and its data-processing capabilities, either to report against directly or, more likely, to use as an input into a more traditional data warehouse. An organisation might store terabytes or petabytes of web log data, details of user interactions with a web-based service, or other e-commerce-type information in an HDFS-based clustered, distributed, fault-tolerant filesystem, and while they might be more than happy to process and analyse that data entirely using Hadoop-style data analysis tools, they might also want to load some of the nuggets of information derived from it into a more traditional, Oracle-style data warehouse, or indeed make it available to less technical end-users more used to writing queries in SQL or using tools such as OBIEE.

Of course, the obvious disconnect here is that distributed computing, fault-tolerant clusters and MapReduce routines written in Java can get really “technical”, more technical than someone like myself generally gets involved in and certainly more technical than your average web analytics person will want to get. Because of this need to provide big-data-style analytics to non-Java programmers, some developers at Facebook a few years ago came up with the idea of “Hive”, a set of technologies that provided a SQL-type interface over Hadoop and MapReduce, along with supporting technologies such as a metadata layer that’s not unlike the RPD that OBIEE uses, so that non-programmers could indirectly create MapReduce routines that queried data via Hadoop, with Hive actually generating the MapReduce routines for them. And for bonus points, because the HiveQL language that Hive provided was so like SQL, and because Hive also provided ODBC and JDBC drivers conforming to common standards, tools such as OBIEE and ODI can now access Hadoop/MapReduce data sources and analyse their data just like any other data source (more or less…).
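
You can actually watch this translation happen: prefix a HiveQL query with EXPLAIN and Hive prints out the plan of map and reduce stages it would submit to Hadoop, rather than running the job (the movie_ratings table here is the one used in the Part 3 post earlier in this archive, borrowed purely as an illustration):

explain
select movie_id, avg(rating) as avg_rating
from   movie_ratings
group  by movie_id;

The plan that comes back shows a map stage reading the table and a reduce stage performing the grouping and aggregation, which is the MapReduce work that would otherwise have to be hand-written in Java.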

Hive

So where this leaves us is that the 11.1.1.7 release of OBIEE can access Hadoop/MapReduce sources via a HiveODBC driver, whilst ODI 11.1.1.6+ can access the same sources via a HiveJDBC driver. There is of course the additional question of why you might want to do this; in the next two articles in this series we’ll cover how OBIEE, and then ODI, access Hadoop/MapReduce data sources, try to answer that question, and look at what benefits OBIEE and ODI might provide over more “native” or low-level big data query and analysis tools such as Cloudera’s Impala or Google’s Dremel (for data analysis), or Hadoop technologies such as Pig or Sqoop (for data loading and processing). Check back tomorrow for the next instalment in the series.

Some Upcoming Events

It’s going to be a busy few weeks leading up to the BI Forum. First, Rittman Mead will be exhibiting at the UKOUG Engineered Systems Summit on Tuesday 16th April, a one-day event in London covering Exadata, Exalogic, SuperCluster and, not least, Exalytics. Mark will be presenting on “Oracle Exalytics – Tips and Experiences from Rittman Mead”; the full agenda is available here. Mark will then be hopping over to Norway to speak at the Oracle Norway User Group event on “High-Speed, In-Memory Big Data Analysis with Oracle Exalytics”; maybe he’ll be previewing his work getting OBIEE 11.1.1.7 working with Hadoop.

The following week, on Tuesday 23rd April, I am speaking at an Oracle Business Analytics event at Oracle’s City Office in London, giving a presentation about our story so far with Exalytics. Later that week, on Thursday 25th, as part of Big Data Week, I’m speaking in the evening in Brighton about the evolution from Business Intelligence to Analytics and Big Data; the full agenda is here, and you can register here.

Interview with Toby Potter from DataSift – Social Data and Business Intelligence

[Image: Toby Potter]

With the BI Forum fast approaching, Toby Potter from DataSift agreed to an interview for the blog, to discuss social data, its value, and how it can be combined with other business intelligence.  Toby will be one of the guest speakers at the BI Forum in Brighton, sharing more details and experiences in this growing area.  Over to the interview…

[James Knight] “For the readers of the blog that may not have heard of DataSift, tell us a bit about the company and what they can provide.”

[Toby Potter] ”DataSift provides the leading Social Data Platform for enterprise. DataSift collects, structures and enriches unstructured social, news and blog data allowing our customers to filter and collect the data of interest to their business for further, more detailed analysis.

We provide real-time access to the full Twitter firehose, bit.ly (click) data, Facebook, YouTube, millions of blogs, news and others, complemented by our historical archive of data.”

[James Knight] ”Big data is a hot topic and lots of our customers want to do some ‘big data stuff’, but what does the term mean to you?”

[Toby Potter] ”Whilst we consider ourselves to be a Big Data Platform – our core platform is into the Petabytes and we’re growing at several Terabytes per day – our focus is really on the Right Data. Using the power of our filtering capabilities, we’re able to take those huge volumes of data and provide the Right Data to our customers based upon the rules they define.

Even given that we are processing 1 Trillion (1,000,000,000,000) items of social data a week in our historics platform – and doubling every 2 months.”

[James Knight] ”There’s a ton of social data out there, which may make people wary of the effort, timescales and costs involved.  How can people start their journey in the use of social analytics and gain benefit quickly?”

[Toby Potter] ”Social analytics is actually pretty straightforward to get started on – for example, you can simply create a Pay As You Go account on the DataSift platform and be collecting data within a few minutes – but the challenge is often realising the insight that’s available within social data. A typical tweet for example, might have anywhere between 50 and 100 individual data points.

As with any analytics project, be clear on your objectives and you should see whether this is going to be of benefit very quickly. Think about how you would capture the data, what SLA (if any) do you need if this was to become mainstream, how would you analyse it and do you want to merge it with existing, internal sources.

The bottom line however, is to understand the action you would take from any insight gained. No matter how interesting the results, if there’s no action taken from the insight, you have to question the value of any analytics project.”

[James Knight] ”So far we’ve been talking about social media in relation to big data, but how can it also be used to enhance regular BI reports, such as those produced using Oracle BI EE?”

[Toby Potter] ”Typically, BI reports focus on reporting on data that’s available internally; sales data, customer data and effectiveness of a marketing campaign for example.

Social data in and of itself provides insight into the world outside; bringing this into your organisation’s Business Intelligence environment provides the business with a whole new perspective and delivers a new view of how your business is performing.

Both uses provide tremendous value, but it’s the combination of the two that really drives insight. Being able to show that social discussions, or a particular news story, drove product interest and traffic to your web-site, which both increased online sales and drove footfall to your physical stores (which also saw an uplift in sales), would provide the justification for running more similar campaigns. Understanding the social profiles of the customer base may also help you tune your offline messaging too.”

[James Knight] ”Out of those that you are allowed to talk about publicly, what’s the cleverest use of social data that you have seen so far?”

[Toby Potter] ”This is what I love most about my job; there are so many uses for the data that almost every discussion is different!

In the media business, I would point towards SecondSync as a great example of using social data to disrupt and add value to the fairly traditional business of measuring TV audiences. They capture social discussions around particular TV programmes and provide both a dashboard view and deeper analytics services to allow broadcasters to better understand their audience and how to engage with them.

News is an obvious area for analytics on social data and many of our customers are able to understand how stories are shared socially, enabling them to better target news stories, drive up audience and therefore drive increased advertising revenues.

Finance is another interesting area; the ability to bring together all of the discussion around particular investments, measure the sentiment and identify breaking news and the market reaction to it quickly, gives our customers an edge.

Customer services is another area and not just being able to react to complaints. Understanding what is working and what isn’t for your customers, collecting feedback and relating that to churn figures for example, allows you to understand how better to adapt your business. As an example, when the broadband goes down in a particular area, people tend to tweet about it very quickly; whilst your engineers may well be alerted, customer services are often left to pick up a sudden, increased load, without having any ideas as to why or what the problem might be. Integrating social data helps overcome this.

More broadly, by looking into the data itself, I’ve seen our customers collecting all of the geo-located tweets to understand population density to better inform telecoms infrastructure; retailers analyse data to understand future fashions to better inform stock ordering/promotions; competitor benchmarking to understand how your business compares to your peers.

The most innovative I’m seeing at the moment is the ability to build out profiles, much richer than simple demographics, of social customers to better understand their interests and so provide them with more useful promotions.

One of our customers, Local Response, have integrated a real-time social feed into their online advertising platform to drive “Intent Targetting”, the ability to target more appropriate and context sensitive advertising to web-site viewers.”

[James Knight] ”Social data can be quite messy.  How can organisations uncover value?”

[Toby Potter] ”As a gross oversimplification, there are two fundamental elements required: the ability to capture, process and cleanse the data in the first place (this is where DataSift comes in), and the ability to analyse it, which is where the Oracle platforms and Endeca in particular come into play.

Getting the data into a format where it can be analysed initially – for more traditional types of data this would typically be some kind of ETL tool – is key to being able to then store it ready for further analysis.

Endeca is fantastic at then taking this data and providing a flexible “data playground” allowing users to discover what insight might be hidden inside. For more structured analytics the more traditional BI tools provide a great environment to distribute reports around the business to a broader Business As Usual user base.”

[James Knight] ”Technically, how are companies dealing with social data, and do you see a standard approach or a variety of different approaches?”

[Toby Potter] ”As is to be expected in a relatively young space, there are many approaches. The majority of businesses are dipping their toes in and pulling data directly from the various provider APIs out there, but this provides a very distorted view of the potential; most APIs provide very restricted data sets and have limits on the number of requests you can make, so you’re potentially missing out on a lot.

Increasingly, as organisations recognise the importance and value of the data available, they are turning to specialist providers such as DataSift to provide a reliable, enterprise quality feed of the data they need.

We see the same journey often – initial experiment seems to yield value; build infrastructure to handle and incorporate data; deliver valuable insight … and then an API changes, or a major event changes volume levels, or the business interest grows and so on. The result is almost always frustration with the amount of expense, lack of reliability and general hard work that is required to get this working, but the insight is so valuable the need for a working solution over-rides everything else.

It is all of this work that DataSift aims to replace with a single platform to take away this maintenance and reliability nightmare.

If you’re looking at social data coming in to your business, I’d highly recommend looking at the tools that are already out there to save you a lot of time, money and frustration!”

[James Knight] ”At Rittman Mead, we’ve been working with Oracle Endeca Information Discovery (OEID), allowing us to combine unstructured, semi-structured and structured data.  What additional value do you see OEID providing in the analysis of social data?”

[Toby Potter] ”This is exactly where the maximum value from the social data can be gained. Combining the data with existing sources, enhances the potential from the data, but having the right tools to extract this value is fundamental. Endeca provides exactly that kind of creative analytical environment where you can explore the data and determine where the most value lies.

Using Endeca, once the data is better understood, it’s then straightforward to productionise the areas of maximum benefit and focus energy on delivering that insight to where it can have most impact in your business.”

Many thanks to Toby for taking the time out for this interview, which we hope has provided some useful insight.  There’s still time to register for the BI Forum and see Toby’s presentation at the Brighton event.