Thoughts on ETL in the Cloud
In a few weeks’ time I’m presenting at the BIWA Summit 2014 on running OBIEE in the Cloud, with the idea being that I’ll go through the various options for deploying OBIEE in public clouds, including Amazon EC2 and Oracle’s own cloud offering. But the more I thought through the various architecture and product options, the more I began to think that ETL – getting data into the BI system, from wherever it is now – is the key enabling technology. The vast majority of BI systems use data that’s refreshed on a daily, weekly or even real-time basis, and on-premise we tend to use tools such as Oracle Data Integrator, Oracle Warehouse Builder, Informatica PowerCenter and the like to move data from source to target, transforming and cleansing it en route.
When you deploy a BI platform such as OBIEE into the cloud, you’ve got a few options around where you hold its data, and how you do the ETL:
- You can keep the data warehouse database on-premise, along with the ETL tool, and just run the BI part in the cloud
- You can move the database into the cloud as well (using a service such as Amazon RDS), but keep the ETL tool on-premise
- You can try and move the ETL into the cloud too – either hosting a tool like ODI in its own cloud VM, or making use of an ETL-as-a-service offering such as Informatica Cloud.
Things get more interesting when part of your IT infrastructure sits on-premise, and some sits in the cloud. It gets even more interesting when your data sources are also software-as-a-service (SaaS), for example Salesforce.com and Workday, where data access is via APIs rather than direct SQL*Net connections.
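As a concrete illustration of what API-based access means in practice, here’s a minimal sketch of pulling account data out of Salesforce.com over its REST API – the instance URL and access token below are hypothetical placeholders, and a real integration would first go through an OAuth 2.0 login flow to obtain the token:

```python
# Minimal sketch: extracting SaaS data over a REST API rather than SQL*Net.
# The instance URL and token are hypothetical; a real integration would
# obtain the token via an OAuth 2.0 login first.
import requests

INSTANCE = "https://na1.salesforce.com"    # hypothetical Salesforce instance
TOKEN = "00Dxx0000000000!AQ0AQ..."         # hypothetical OAuth access token

resp = requests.get(
    INSTANCE + "/services/data/v29.0/query/",
    params={"q": "SELECT Id, Name, Industry FROM Account"},  # SOQL, not SQL
    headers={"Authorization": "Bearer " + TOKEN},
)
resp.raise_for_status()

# Results come back as JSON, a page at a time -- a very different extraction
# model to the table-to-table SQL access traditional ETL tools assume.
for record in resp.json()["records"]:
    print(record["Id"], record["Name"])
```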
So where do we start with ETL-in-the-cloud? One place to start is Oracle’s own slide-deck on OBIEE in the cloud, where they set out what they see as the key differences between traditional on-premise software and software delivered via SaaS:
In this SaaS world, the expectation is that:
- The infrastructure, software etc is already in place, is patched regularly for you in the background, and new features emerge regularly
- You’re on a shared platform, but you won’t be aware of other tenants on a day-to-day basis
- The “innards” and workings are hidden from you – you just consume a service
- It’s cloud-native – from connectivity to cloud apps through to common social sign-ons, commenting and collaboration
- Everything is thin-client, web-based
- Everything is self-service, self-provisioning, right from creation of your user account
- Sign-up is on-demand, paid by the hour/day etc, and with no long setup process
What this means for data integration (and ODI) then, in my opinion, is:
- It must be available as a service – log in, create data source connections, define mappings, run code
- The internals typically would be hidden – so no need to know about agents, GoldenGate vs. ODI, choices of knowledge modules etc
- Web-based data flow modellers as well as admin tools
- Connectors through to Salesforce.com and other cloud apps
- Connectors through to Hadoop, Hive, JSON compatibility etc
- Instant sign-up and charging by time/use
With bonus points for:
- Low-cost or free option to get you started
- A consistent story that runs from messaging and application integration through to large-scale data movement
- Easy clustering, scale-out, deployment on services such as Amazon EC2
- Options to deploy on-premise or cloud, and run integration tasks on-premise or cloud
Looking at it architecturally, ETL-in-the-cloud would sit between sources and targets, both on-premise and in the cloud, providing the vital data movement capability between source systems and the BI Platform itself – most probably to its own database-in-the-cloud, running the data warehouse.
So what are the options then, if you want to use cloud-based ETL to load a cloud-based data warehouse and BI system? To my mind, there are four main options:
- If you’re using a cloud-based database service such as Amazon’s Redshift, or Oracle’s own public cloud “schema-as-a-service” database, you can use the ETL tools provided with the service
- You can try and put ODI in the cloud, maybe using an Amazon EC2 instance running WebLogic and a JEE agent, with another instance providing the metadata repository database
- or, as an alternative, do the same for one of the open-source ETL tools
- You can use one of the new “cloud-native” ETL-as-a-service products, such as Informatica Cloud or SnapLogic
- Or – you can try and abstract away ETL altogether – more on this later on.
The first option really applies if (a) you’re using a service such as Amazon Web Services’ EC2, (b) your data also most probably sits in AWS cloud storage, (c) you want to move data from your source application or export files into the main data warehouse database, and (d) you don’t really need to integrate data from different sources. Amazon AWS provides a number of options for loading data into EC2 and the various database and analytic services they provide, including Amazon Data Pipeline (shown in the screenshot below), the most “ETL-like” of their loading services, along with a sneakernet service, and the new Amazon Kinesis service for real-time streaming data.
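To make that first option concrete, here’s a minimal sketch of the staple AWS loading pattern – COPYing a file from S3 straight into an Amazon Redshift table – which is essentially the kind of step a service like Data Pipeline orchestrates for you. The cluster endpoint, credentials, bucket and table names are all hypothetical placeholders:

```python
# Minimal sketch: bulk-loading a CSV extract from S3 into Amazon Redshift.
# Endpoint, credentials, bucket and table names are hypothetical.
import psycopg2  # Redshift speaks the Postgres wire protocol

conn = psycopg2.connect(
    host="mydw.abc123xyz.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439, dbname="dev", user="etl_user", password="secret")

copy_sql = """
    COPY stg_sales
    FROM 's3://my-bucket/exports/sales.csv'
    CREDENTIALS 'aws_access_key_id=AKIA...;aws_secret_access_key=...'
    CSV
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift pulls the file from S3 and loads in parallel
conn.close()
```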
Oracle’s Cloud Database service at the moment restricts you to a single schema, and more importantly there’s no SQL*Net access, so uploading data is either through SQL Developer (which has a special, custom Oracle Cloud connector), or through utilities provided via APEX.
Clearly this sort of web-based data loading isn’t designed for data warehouse scenarios, and in this initial iteration Oracle’s cloud database is probably designed to support cloud applications running on Oracle’s cloud Java service, with OBIEE-in-the-cloud mainly designed for reporting against this data, and personal “sandbox” scenarios. What’s not immediately obvious when you use these cloud ETL tools is that each one has its own admin console, its own API, its own scheduler, its own metadata – which is usually the point at which you decide you need an enterprise ETL tool.
So what about the idea of running Oracle Data Integrator in the cloud? There are a few options here that I can think of:
- Creating a cloud server instance that runs ODI, along with all of its metadata repositories, WebLogic instances and agents in a single location
- Run the repository database either on-premise or on another cloud server (or even Amazon’s RDS service), with agents running either in their own cloud servers, or on the cloud servers holding the target database
- Build ODI into the target platform, as Oracle have done with BI Apps 11.1.1.7.1, along with management tools to hide the complexity and workings of ODI
The first option sounds the “neatest” in terms of a wholly-cloud deployment, as all the parts of ODI are held in one place, and you can think of it as an appliance or service. But ODI’s Extract-Load-Transform (ELT) approach ends up complicating things in a cloud-only deployment; the target platform (for example one of the Postgres-compatible cloud databases) might not support all of the target-layer transformations you want, and the integration part can’t really sit in the ODI cloud instance, unless you run it hub-and-spoke style and either use a local database for transformations, or use the in-memory feature within the agents. In fact, where ODI makes most sense is in a hybrid on-premise/cloud setup, where most of your data resides on-premise, as does your ETL process, with the cloud gradually being migrated to and, in the meantime, used alongside on-premise applications.
Oracle’s white paper on cloud data integration majors on this type of scenario, with Oracle GoldenGate also used to replicate data between the two environments. At this year’s Oracle Openworld the BI Apps team also announced cloud adapters for the BI Applications, initially used to extract data from Fusion Apps in the Cloud back to the on-premise BI Apps data warehouse, with other Oracle cloud application data sources following on.
Where things get interesting with Oracle’s approach is when it’s non-Oracle cloud applications that we’re looking to integrate with. “Cloud Integration – A Comprehensive Solution”, another Oracle white paper, describes techniques to integrate with Salesforce.com, Workday and the Fusion Apps, but all of this takes place through Oracle SOA Suite and Oracle JDeveloper, products that are way too technical for the average ETL developer (or customer implementor).
In fact there are two obvious things missing here, that are present in the Cloud Database and Cloud OBIEE offerings that Oracle recently launched:
- It’s not a “service” – you can’t just create an account, design some mappings, load a target – the assumption is that each system is single tenant, perpetual license, self-install
- It’s not consumer-level in terms of simplicity – there’s bits of ODI, bits of GoldenGate, bits of SOA, bits of JDeveloper – all good bits, but not a single integration-platform-as-a-service
But it’s probably fine if you’re running a hybrid on-premise/cloud strategy, and as I’ll talk about later on, the ELT approach does still have some distinct advantages over cloud-native ETL tools.
What about the more recent, “cloud-native” ETL tools such as SnapLogic and Informatica Cloud? Informatica Cloud is probably the easiest product to understand if, like me, you’re from an ODI background. What Informatica have done here in terms of architecture is move the metadata and application-layer parts of Informatica PowerCenter into the cloud, and add a bunch of cloud application adapters (including a third-party marketplace for them), but still do the actual data integration on-premise, using a cut-down version of the main PowerCenter engine.
Some of the nicer features in this setup are its multi-tenant and “integration-as-a-service” nature, the way it deals with firewall issues (by doing the integration on-premise), and the interoperability with traditional Informatica PowerCenter, where you can publish custom mapplets from on-premise Informatica and push them into the cloud version. If Oracle came out with an ODI-in-the-cloud service, I think it’d look a lot like this.
To my mind though, the most impressive of the cloud integration vendors is SnapLogic. Their SnapLogic Integration Cloud product looks like it was designed first and foremost to run “in the cloud”: it’s available as a service, and the data integration designer focuses on data paths and application integration rather than the low-level, database-centric approach traditional ETL tools take.
What’s particularly impressive to me is the way that they’ve taken concepts used in tools like ODI – agents that perform the data integration tasks, running as standalone servers – and built on them to create the concept of “Snaplexes”, collections of JVMs that can sit in-the-cloud or on-premise, elastically scale up to handle larger workloads, and use Amazon’s EC2 and S3 compute and storage clouds to perform any data transformations en route. Data being transformed in SnapLogic streams through the system with low latency and using web protocols, and the whole thing is just neatly designed to run native on the cloud.
Where tools like SnapLogic do fall short though is on the “last mile” into the data warehouse, where tools like ODI come with lots of templates, database integration features and so forth to handle slowly-changing dimensions, dimension lookups, analytic functions and the like, but cloud-native tools really focus on basic transformations and getting the data from source to target. Where this of course gets interesting is when SnapLogic is used to load databases such as Amazon Redshift, for example, in combination with Tableau, as some of the ETL tool features we’re used to using with tools such as ODI and Informatica just aren’t there yet. What this means in practice is that if you’re looking to move OBIEE into the cloud, along with an Oracle database, you’ll probably still want to use a tool like ODI to do your data load, as these tools are just so mature when it comes to loading relational data warehouses.
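To make that “last mile” point concrete, here’s a minimal sketch of the kind of type-2 slowly-changing-dimension logic that ODI’s knowledge module templates generate for you (as set-based SQL on the target), and that you’d currently have to hand-roll with a cloud-native tool – shown here as plain Python over in-memory rows, with hypothetical column names:

```python
# Minimal sketch of type-2 SCD handling -- the sort of "last mile" dimension
# maintenance ODI's knowledge modules generate for you, shown here as plain
# Python over in-memory rows with hypothetical column names.
from datetime import date

def apply_scd2(dim_rows, source_rows, key="customer_id", tracked=("segment",)):
    """Expire changed dimension rows and insert new current versions."""
    current = {r[key]: r for r in dim_rows if r["current_flag"] == "Y"}
    for src in source_rows:
        existing = current.get(src[key])
        if existing and all(existing[c] == src[c] for c in tracked):
            continue                     # no change -- nothing to do
        if existing:                     # tracked attribute changed: close old row
            existing["current_flag"] = "N"
            existing["effective_to"] = date.today()
        dim_rows.append({                # open a new current version
            key: src[key],
            **{c: src[c] for c in tracked},
            "effective_from": date.today(),
            "effective_to": None,
            "current_flag": "Y",
        })
    return dim_rows

dim = [{"customer_id": 1, "segment": "SMB", "effective_from": date(2013, 1, 1),
        "effective_to": None, "current_flag": "Y"}]
src = [{"customer_id": 1, "segment": "Enterprise"}]
print(apply_scd2(dim, src))  # old row expired, new "Enterprise" row now current
```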
So finally, then, onto what I probably think will be the future of ETL – “invisible ETL”, or ETL so abstracted away that you don’t even think of it as a separate product in itself. If you think about it, ETL is a necessary evil when it comes to BI – customers just want their data loaded, they don’t want to worry about how it’s transported, which data loading tool you use and so on. Oracle have had a number of initiatives over the past few years to automate the creation of ETL code from logical table source mappings in the OBIEE RPD, and one of the major drivers in the BI Apps product roadmap is to reduce the amount of manual work propagating application data model changes through the various layers of ETL and BI tool metadata.
A really nice example of this approach of hiding the complexity of ETL can be found with a company called GoodData, who sell a BI-in-the-cloud product that comes with data storage (via Vertica) and ETL all wrapped up in a consumer-like service that’s focused on analytics and visualisations, but supports data loading from a number of database and cloud application sources. The screenshot below from the GoodData CloudConnect LDM Modeler Guide shows GoodData’s logical data model development environment, with this part of the product handling the facts, dimensions, attributes and other user-facing data elements, and a more traditional ETL tool and server (based on Clover ETL, no less) doing the actual data heavy-lifting.
GoodData splits its data models into these logical and physical elements, which of course is exactly what ODI does – and what OBIEE does too. In fact, the more you look at GoodData, the more you think that all of the elements are already in place at Oracle to create a very similar product, with the added benefit of access to tools such as GoldenGate, EDQ and SOA Suite. Even SnapLogic’s Snaplex concept is conceptually very similar to ODI’s agents; but what they, GoodData, Informatica and others have done is wrap the products up into a service, make it all consistent (at least on paper), architect it for cloud-only and hybrid on-premise/cloud environments, and build out all of the third-party adapters. It’ll be interesting to see what Oracle’s long-term response to this will be – honestly I’ve no special insight into this part of the product roadmap, so I’m just as interested as anyone as to how it will turn out.
So – is any approach better, is Oracle’s approach the worst, are vendors like SnapLogic the future, and what should we use if deploying OBIEE in the cloud? Well to my mind there’s no black-and-white answer, and the choice comes down to a number of factors including:
- Where are your sources and targets; the more that still reside on-premise, the more a traditional on-premise tool such as ODI makes sense
- To what extent are you interacting with non-Oracle targets and sources; the more non-Oracle ones you’re using, the more a non-Oracle tool will probably make sense
- How complex will your end data transformations be – if you’re expecting to do lots of SCD2 transformations, analytic queries, clever DW-stuff, a tool like ODI will make sense
- How much upfront investment do you want to make? Services like Informatica Cloud are far easier to get provisioned on, and involve far less up-front license costs, than an on-premise install
- Or are you just loading data from a single source into a managed cloud database – if so, one of the vendor-supplied utilities will probably do the job
Anyway – I’d be interested in others’ opinions, and whether I’ve missed anything out. But for now – those were some thoughts from me on running ETL “in the cloud”.
Rittman Mead BI Forum 2014 Call for Papers Now Open!
It’s that time of year again when we start planning out next year’s BI Forum, which, like this year’s event, will be running in May 2014 in Brighton and Atlanta. This will be our sixth annual event, and as with previous years the most important part is the content – and as such I’m pleased to announce that the Call for Papers for BI Forum 2014 is now open, running through to January 31st 2014.
If you’ve not been to one of our BI Forum events in past years, the Rittman Mead BI Forum is all about Oracle Business Intelligence, and the technologies and techniques that surround it – data warehousing, data analysis, big data, unstructured data analysis, OLAP analysis and this year – in-memory analytics. Each year we select around ten speakers for Brighton, and ten for Atlanta, along with keynote speakers and a masterclass session, with speaker choices driven by attendee votes at the end of January, and editorial input from myself, Jon Mead and Stewart Bryson.
Last year we had sessions on OBIEE internals and new features, OBIEE visualisations and data analysis, OBIEE and “big data”, along with sessions on Endeca, Exalytics, Exadata, Essbase and anything else that starts with an “E”. This year we’re continuing the theme, but are particularly looking for sessions on what’s hot this year and next – integration with unstructured and big data sources, use of engineered systems and in-memory analysis, advanced and innovative data visualisations, cloud deployment and analytics, and anything that “pushes the envelope” around Oracle BI, data warehousing and analytics.
The Call for Papers entry form is here, and we’re looking for speakers for Brighton, Atlanta, or both venues. We’re also looking for presenters for ten-minute “TED”-style sessions, and if you have any ideas for keynote speakers, send them directly to me at mark.rittman@rittmanmead.com. Other than that – have a think about abstract ideas now, and make sure you get them in by January 31st 2014.
Thoughts on Running OBIEE in the Cloud : Part 2 – Data Sources and ETL
In yesterday’s post on running OBIEE in the cloud, I looked at a number of options for hosting the actual OBIEE element: hosting it in a public cloud service such as Amazon EC2, using Oracle’s upcoming BI-as-a-Service offering, or partner offerings such as our upcoming ExtremeBI in the Cloud service. But the more you think about this sort of thing, the more you realise that the OBIEE element is actually the easy part – it’s what you do about data storage, security, LDAP directories and ETL that makes things more complicated.
Take the example I gave yesterday where OBIEE was run in the cloud, with the multi-tenancy option enabled, the main data warehouse in the cloud, and data sourced from cloud and on-premise sources.
In this type of setup, there’s a number of things you need to consider beyond how OBIEE is hosted. For example:
- If your corporate LDAP directory is on-premise, how do we link OBIEE to it? Or does the LDAP server also need to be in the cloud?
- What sort of database do we use if we’re hosting it in the cloud? Oracle? If so, self-hosted in a public cloud, or through one of the Oracle DB-in-the-cloud offerings?
- If not Oracle database, what other options are available?
- And how do we ETL data into the cloud-based data warehouse? Do we continue to use a tool like ODI, or use a cloud-based option – or even a service such as Amazon’s AWS Data Pipeline?
What complicates things at this stage in the development of “cloud” is that most companies won’t move 100% to cloud in one go; more likely, individual applications and systems will migrate to the cloud, but for a long time we’ll be left with a “hybrid” architecture where some infrastructure stays on-premise, some might sit in a public cloud, and other parts might be hosted on third-party private clouds. So again, what are the options?
Well Oracle’s upcoming BI-as-a-service offering works at one extreme end of the spectrum; the only data source it’ll initially work with is Oracle’s own database-as-a-service, which in its initial incarnation provides a single schema, with no SQL*Net access and with data instead uploaded via a web interface (this may well change when Oracle launch their database instance-as-a-service later in 2014). No SQL*Net access means no ETL tool access though, in practice, as they all use SQL*Net or ODBC to connect to the database, so this offer to my mind is aimed at either (a) small BI applications where it’s practical to upload the data via Excel files etc, or (b) wider Oracle Cloud-based systems that might use database-as-a-service to hold their data, Java-as-a-service for the application and so forth. What this service does promise though is new capabilities within OBIEE where users can upload their own data, again via spreadsheets, to the cloud OBIEE system, and have that “mashed-up” with the existing corporate data – the aim being to avoid data being downloaded into Excel to do this type of work, and with user metrics clearly marked in the catalog so they’re distinct from the corporate ones.
But assuming you’re not going for the Oracle cloud offer, what are the other options over data? Well, hosting OBIEE in the cloud is conceptually no different from hosting it anywhere else, in that it can connect to various data sources via the various connection methods, so in principle you’ve got just the same options open to you as if running on-premise. But the driver for moving OBIEE into the cloud might be that your applications, data etc are already in the cloud, and you might also be looking to take advantage of cloud features in your database such as dynamic provisioning and scaling, or indeed use one of the new cloud-native databases such as Amazon Redshift.
I covered alternative databases for use with OBIEE a few months ago in a blog post, and Amazon Redshift at the time looked like an interesting option; based on ParAccel, a mature analytic database offering, column-store and tightly integrated with Amazon’s other offerings, and a few customers have asked us about it as an option. And they’re certainly interesting – in practice, not all that different in pricing to an Oracle database as a source, but with some interesting analytic features – but they all suffer from the common issue that they’re not officially supported as data sources. Amazon Redshift, for example, uses Postgres-derived ODBC drivers to connect to it, but Postgres itself isn’t officially supported as a source, which means you could well get sub-optimal queries, and you certainly won’t get specific support from Oracle for that source. But if it works for you – then this could be an option, along with more left-field data sources such as Hadoop.
But to my mind, it’s the ETL element that’s the most interesting, and most challenging, part of the equation. Going back to Openworld, Oracle made a few mentions of ETL in their general Cloud Analytics talks, including talk about an upcoming data source adapter for the BI Apps that’ll enable loading from Fusion Apps in the cloud, like this:
There were also a number of other deployment options discussed, including hybrid architectures where some sources were in the cloud and some on-premise, but all involved running the ETL elements of the BI Apps – ODI or Informatica – on-premise, the same way as installs today. And to my mind, this is where the Oracle cloud offering is the weakest – around cloud-based and cloud-native ETL and data integration – as the only real option at the moment is to run ODI agents in the cloud and connect them back to an on-premise ODI install, or move the whole thing into the cloud in what could be quite a heavyweight data integration architecture.
Other vendors are, in my opinion, quite a way further forward with their cloud data integration tools strategy than Oracle, who instead seem to be still focused on on-premise (mainly), database-to-database (mostly) ETL. To take two examples: Informatica have an Informatica Cloud service which appears to be a platform-as-a-service play, with customers presumably signing up for the service, designing their ETL flows and then paying for what they use, with a focus on cloud APIs as well as database connectivity, and full services around data quality, MDM and so forth.
Another vendor in this space is SnapLogic, a pure-play cloud ETL vendor selling a component-based product with a big focus on cloud, application and big data sources. What’s interesting about this and other similar vendors’ approaches though is that they appear to be “cloud-first” – written for the cloud, sold as a service, as much focused on APIs as database connectivity – a contrast to Oracle’s current data integration tools strategy, which to my mind still assumes an on-premise architecture. What’s more concerning is the lack of any announcement around ETL-in-the-cloud at the last Openworld – if you look at the platform-as-a-service products announced at the event, whilst database, BI, documents, BPM and so on were all announced as services, there was no mention of data integration:
What I’d like to see added to this platform in terms of data integration would be something like:
- On-demand data integration, sold as a service, available as a package along with database, Java and BI
- Support for Oracle and non-Oracle application APIs – for example Salesforce.com, Workday and SAP – see for example what SnapLogic support in this area.
- No need for an install – it’s already installed and it’s a shared platform, as they’re doing with OBIEE
- Good support for big data, unstructured and social data sources
I think it’s pretty likely this will happen – whilst products such as the BI Apps can have their ETL in the cloud, via ODI in the BI Apps 11g version for example, these are inherently single-tenant, and I’d fully expect Oracle to offer BI Apps-as-a-service at some point, with a corresponding data integration element designed from the ground up to work cloud-native and integrate with the rest of Oracle’s platform-as-a-service offering.
So there we have it – some thoughts on the database and ETL elements of an OBIEE-in-the-cloud offering. Keep an eye on the blog over the next few months as I build out a few examples, and I’ll be presenting on the topic at the upcoming BIWA 2014 event in San Francisco in January – watch this space, as they say.
Thoughts on Running OBIEE in the Cloud : Part 1 – The BI Platform
One of the major announcements at this year’s Oracle Openworld in San Francisco was around “OBIEE in the cloud”. Part of a wide-ranging set of announcements around services and infrastructure in the cloud from Oracle, the idea with this service is that you’ll be able to access an Oracle-hosted future version of OBIEE running in Oracle’s public cloud service, so that you can create dashboards and reports without having to perpetually license lots of software, or stand up lots of infrastructure in your data centre. But of course “cloud” is a hot topic at the moment, and lots of our customers are asking us about options to run OBIEE in the cloud, what’s involved in it, and how we deal with the stuff that surrounds an OBIEE installation – the data, the ETL, the security, and so forth. So I thought I’d put a couple of blog posts together to go through what options are available to OBIEE customers looking to deploy in the cloud, and see what the state of the industry is around this requirement.
I’ll start off then by looking at the most obvious element – the BI platform itself. If you want to run OBIEE “in the cloud”, what are your basic options?
Probably the conceptually simplest option is just to run the OBIEE server applications in a cloud-based, hosted environment, for example Amazon EC2. In this instance, OBIEE runs as it would do normally, but instead of the host server being in your datacenter, it’s in a public cloud. This type of setup has been available for some time in the form of services such as Oracle On-Demand, but the difference here is that you’re using a public cloud service, there are no service contracts to set up, usually no minimum contract size, and everything to do with security, failover, maintenance and so on is down to you. We’ve been doing this sort of thing for a long time, typically for our cloud-based training environments, or for ad-hoc development work where it makes more sense for a server to be centrally available on the internet, rather than sitting on someone’s laptop or on an internal server. One potential complication here is licensing – for products such as the Oracle Database, there are formal schemes such as Amazon RDS where either the license is included, or there are set metrics or arrangements to either bring your own license, or buy them based on virtual CPUs. If you’re prepared to take care of the licensing, and all of the security and maintenance aspects, this is an interesting approach.
In practice though, any setup like this is going to be what’s termed a “hybrid” cloud deployment; at least the administration side of OBIEE (the BI Administration Tool) is still going to be “on-premise”, as there’s no way you can deploy that “thin-client”, unless you create a Windows VM in the cloud too, and run the BI Administration tool from there. Moreover, your data sources are more than likely to still be on-premise, with just a few at this point hosted in the cloud, so most likely your OBIEE-in-the-cloud architecture will look like this:
There are other things to think about too: how do you do your ETL when your target data warehouse might be in the cloud, or your data sources might be cloud-based; how do you connect to your LDAP server; and so forth – I’ll cover these in the next postings in the series. But for now, this is conceptually simple, and its main benefit is avoiding the need to host your own servers, buy your own hardware, and so forth.
So what if you want to use OBIEE, but you don’t want the hassle even of setting up your own servers in a public cloud? This is where Oracle’s new “BI in the Cloud” service comes in – the idea here is that you swipe your credit card, fill in your details, self-provision your system, and then start loading data into it.
Whilst the underlying technology is core OBIEE, it’s designed for departmental, power user-type scenarios where the scope of data is limited, and the focus really is on ease-of-use, self-service and “software-as-a-service”. Looking at Oracle’s slides from Openworld, you can see the focus is quite different to on-premise OBIEE – the competition here for Oracle is the various internet startups, and products like Tableau, that make it easy to get started, provide consumer-class user interfaces, and focus on the single user rather than enterprise-type requirements.
But this focus on simplicity means a limitation in functionality as well. At the start, at least, Oracle’s BI in the Cloud will only offer Answers and Dashboards; no KPIs, BI Publisher, Maps or Delivers. Moreover, at least in its first iteration, it’ll only support Oracle’s own database schema-as-a-service as the sole data source, so no hybrid cloud/on-premise federated queries, no Essbase, and no ETL tools – hence the focus on single-user, departmental use-cases.
What you do get with this setup though is a cloud-native, fully-managed service where patching is taken care of for you, all the new features appear first, and the administration element has been re-thought to be more appropriate for deployment to the cloud – in particular, a new version of the BI Administration tool that’s cloud-native, focused on simple use-cases, and doesn’t require desktop client installs or VPN connections through to the OBIEE server. The diagram below shows the architecture for this option, and you can see it’s all pretty self-contained; great for simplicity, but maybe a bit limiting, at least in the initial iteration.
The third major variant that I can see around OBIEE in the cloud is where partners (such as Rittman Mead) offer shared access to their cloud OBIEE installations, either as part of a development package or through some sort of multi-tenant reporting product. For example, we’re about to launch a service we’re calling “ExtremeBI in the Cloud”, where we combine our “ExtremeBI” agile development method with cloud-based development server hosting, with the cloud element there to make it easier for clients to start on a project *today*. Endeca Information Discovery is another product that could benefit from this approach, as the infrastructure behind an Endeca install can be fairly substantial, but users are typically more focused on the data modelling and report-building element. In our case, the back-end cloud hosting will typically be done on a service like Amazon EC2 or VMware’s Hybrid Cloud product, but the focus is more on the development piece – cloud is there to provide business agility.
The other third-party approach to this centres around multi-tenancy; typically, a partner or company will provide a reporting or analytics element to a wider product set, and use OBIEE as its embedded dashboarding tool as part of this. In the past, setting up multi-tenancy for OBIEE was fairly difficult, to the point where most companies set up separate installs for each end-customer, but 11.1.1.7 brought a number of multi-tenant features that presumably were put there to support Oracle’s own cloud product. The product docs describe the new multi-tenant features well, but the basic premise is that a single RPD and catalog are still set up, but OBIEE can then run in a “multi-tenant” mode where GUIDs demarcate each tenant, a new BIGlobalAdministrator role is created that assumes what were the old BIAdministrator role’s privileges, with other administrator roles then set up with limited privileges and no access to Fusion Middleware Control, for example.
What this does give you though is the framework to set up tenant groupings of users, separate areas of the catalog for each tenant, additional identity store attributes to hold tenant details, and a security framework that enables a more limited form of administrator account, better suited to tenant-type situations. Again, not all OBIEE features are available when running in multi-tenant mode, and presumably we’ll see this feature evolve over time as more requirements come in from Oracle’s own cloud BI product, but it’s a feature you can use now if you’re looking to set up a similar type of environment.
So that’s the three basic options if you want to run OBIEE in the cloud; host it in a public cloud service like Amazon EC2 but then just run it like regular OBIEE; go for Oracle’s upcoming BI-in-the-cloud service, if the use-case suits you and you can live with the limitations, or consider one of the services from partners such as ourselves where we bundle cloud hosting of OBIEE up with a larger service offering like our “ExtremeBI in the Cloud” offer. But of course OBIEE is only one part of the overall platform – where do you store the data, and how do you get the data from wherever it normally is up into the cloud – in other words, how do we do ETL in the cloud? Check back tomorrow for the follow-up post to this one: Part 2 – Data Warehouse and ETL.
Using Non-Oracle, and Cloud-Hosted Databases with OBIEE 11g
The vast majority of our customers use Oracle Database as the underlying data source for OBIEE 11g, sometimes on Exadata but almost always the Enterprise Edition of the database. OBIEE typically sells best into customers that already extensively use Oracle products, and it’s often bundled in with a wider technology or apps deal including database and middleware products. But for some customers, the cost of the Oracle Database as a data source can be prohibitive, and in many cases they’re also looking at purchasing and deployment options that don’t involve upfront perpetual license deals and running the software on-premise. One of our longest-standing customers came to us with just such a scenario a few weeks ago, and so I thought it’d be interesting to take a look at what alternatives there are to licensing Oracle Database Enterprise Edition, and whether running these alternatives in the cloud is a practical option.
To set the context: imagine that you wanted to deploy a small-to-medium sized data warehouse on two servers:
- A “production” server with 64GB RAM, 4 x 4-core Intel x86_64 CPUs, running Oracle Linux 6 (for licensing purposes, 8 processors)
- A “dev/test” server with the same spec, onto which we’d install two separate databases
For now we’ll ignore the ETL element, but if we were licensing Oracle Data Integrator Enterprise Edition, we’d need to license it for these two servers too at around $23k/processor.
So what are the alternatives to Oracle Database Enterprise Edition as a data source? To my mind, there are three main options:
- Trading down to fewer processors but keeping Oracle Database EE, or going down to a lower-specced version such as Standard Edition or Standard Edition One
- Using an alternative RDBMS, probably either open-source or a “community edition”
- The left-field option – using Hadoop, NoSQL, Hive and the other new “big data” technologies.
Let’s take a look at the Oracle options first of all. Based on 8 processors (16 Intel x86_64 cores with a 0.5 multiplication factor), the cost of licensing Oracle DB EE plus commonly-used database options, based on the current Tech Price List, would be as follows (at list price):
- Oracle Database Enterprise Edition ($47.5k x 8 = $380k)
- Partitioning Option ($11.5k x 8 = $92k)
- EM Tuning Pack ($5k x 8 = $40k)
- EM Diagnostics Pack ($5k x 8 = $40k)
Giving us a total of $552k for 8 processors (covering both of our servers). Add onto that the standard 22% support and software updates, and you’re looking at an initial payment of $673k for the perpetual license, and an ongoing $121k/year for annual maintenance. Factor in a DBA and the hardware as well, and you’ll probably not get much change out of $800k in year one, and $250k in the years afterwards.
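For anyone who wants to check the arithmetic, here’s the year-one calculation spelled out, using the list prices itemised above:

```python
# Year-one cost of the Oracle stack at list price, per the figures above.
processors = 8                       # 16 cores x 0.5 multiplication factor
per_processor = {
    "Oracle Database EE":  47_500,
    "Partitioning Option": 11_500,
    "EM Tuning Pack":       5_000,
    "EM Diagnostics Pack":  5_000,
}

licence = sum(per_processor.values()) * processors  # $552,000 perpetual licence
support = licence * 0.22                            # $121,440/year maintenance
print(licence, support, licence + support)          # ~$673k in year one
```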
You could switch down to one of the cheaper editions of Oracle Database, which also don’t come with the add-on packs, which could take you down to $5.8k/processor for SE1 or $17.5k for SE, but you’ll then have to do without most of the high-end data warehousing features in the Oracle database, and work within socket restrictions when it comes to SE1. Now of course you get what you pay for with Oracle Database, including potentially lower ongoing admin costs because much of the tuning and optimisation work is either unnecessary or automated, but if price is the issue and the alternative is a cut-down version of Oracle, what else is on offer – and that works with OBIEE?
If you start to look around at the alternatives to Oracle, and assuming you’re discounting products like IBM DB/2 or Microsoft SQL Server, you’ll find there are two main “conventional” types of RDBMS that you could use with OBIEE:
- General-purpose RDBMSs, such as MySQL and PostgreSQL
- “Analytical” databases, typically in-memory, column-store and shared-nothing, such as Greenplum and Vertica
If you look under the covers with the analytical databases, in most cases you’ll find that the core RDBMS technology they use is in fact PostgreSQL, with the (open-source) code then forked and adapted to work with the particular technology angle each vendor chooses to emphasise. Most developers tend to think that PostgreSQL is a more “enterprise-scale” free RDBMS than MySQL, and so given that it also provides the core DB technology in the other products, how might this work with OBIEE?
As a bit of background, PostgreSQL is an open-source database that’s of a similar vintage to Oracle Database, and you can download it for free or install it as part of most Linux distributions. Many developers swear by PostgreSQL as being all they need for data warehouse scenarios, but a couple of vendors have taken the PostgreSQL code and commercialised it, adding their own management tools and support options around the core open-source code. The most famous of these is EnterpriseDB, who package up PostgreSQL into a number of options, one of which is called Postgres Plus Advanced Server. If you’re looking for an open-source based, low-cost but mature RDBMS engine for your database, this is more or less the flagship option, so I set about creating a new OBIEE 11.1.1.7 VM for testing purposes, with the aim of using Postgres Plus Advanced Server as my data source.
In fact, the first issue you hit when trying to go “non-Oracle” for the database is that you need an Oracle, SQL Server or IBM DB/2 database to install the RCU schemas into. PostgreSQL isn’t an option, but in reality most customers using OBIEE will have a small Oracle database available somewhere, so I cheated here and installed the RCU schemas into a spare Oracle database, and then went on with the install.
Installing Postgres Plus Advanced Server itself was pretty straightforward, with the installer separately downloading the core PostgreSQL files and the Apache HTTP server used for its web-based admin tools. On Windows, once you’ve installed the software there’s a new Start Menu entry with the EnterpriseDB-specific (value-add) tools, including a console-based Enterprise Manager-type tool called pgAdmin.
Opening up pgAdmin gives you an environment that’s similar to the one you’d find in tools such as TOAD, SQL Developer and the old Java thick-client based Enterprise Manager, and using the tool I was able to connect to my PostgreSQL database, create a few tables, and run some test SQL queries.
It’s pretty basic stuff compared to Oracle’s Enterprise Manager product, but it may be all you need if you’re looking for a tactical, or departmental solution. EnterpriseDB price Postgres Plus Advanced Server at just under $5k/processor/year for the core database (one processor = one socket, however many cores), with their Enterprise Manager product costing $1500/year for two monitored hosts, and their “Solution Pack” another $5k/year, so based on the same server sizing as we used for the Oracle costings, the price of an EnterpriseDB solution looks a bit like this:
- Postgres Plus Advanced Server ($5k x 8 = $40k)
- Postgres Enterprise Manager ($1.5k)
- Solution Pack ($5k)
Coming out at around $47k/year for the two servers. Now the Oracle licenses were perpetual not annual, but even so, the total cost of the on-premise PostgreSQL solution is about 1/3rd of just the annual maintenance cost of the Oracle software, let alone the software license cost, so it’s considerably cheaper. So what don’t you get with PostgreSQL – either core, or commercialised via EnterpriseDB, that you get with Oracle?
Now as a disclaimer, I’m mainly an Oracle person and can’t really speak with any great depth on PostgreSQL, but the major things that are missing in PostgreSQL compared to Oracle Database 11g are:
- Proper, in-built support for parallel query – PostgreSQL has an open-source GridSQL project but this looks more like a federated, shared-nothing solution rather than PQ within a single server
- Built-in ETL in the form of Oracle Warehouse Builder
- Anything like Oracle’s built in MOLAP server, Advanced Analytics and the like
- All the DR options, security options and so forth
And no doubt the core DW features – bitmap indexing, cost-based optimiser, support for VLDBs and so on – are not nearly as developed, and easy to manage, as with the Oracle database. But – given that not all customers need these features, and that many customers even on 10g and 11g are still using the database as if it were Oracle 7 or 8i – lack of such features may not be a show-stopper if money is tight.
Now so far we’ve been talking about running PostgreSQL “on-premise”, more or less as a direct substitute for Oracle Database; however, it’s also possible to run Postgres Plus Advanced Server on Amazon’s AWS cloud platform, with no local install of the software, simplified administration, but access to the same core Postgres Plus Advanced Server features, billed at an hourly rate – giving you an architecture like the diagram below:
EnterpriseDB and AWS provide a free 24-hour trial, so I set up another database, this time on the Amazon cloud, provisioned it and created the same set of tables.
Connecting to the cloud-based Postgres Plus database was the same as connecting to a regular, on-premise one, and in fact this is more-or-less just commercial PostgreSQL on Amazon AWS. Instead of charging for software and support up-front as with regular Oracle or on-premise Postgres Plus Advanced Server, you’re instead charged, AWS-style, an hourly rate based on the size of instance – one comparable to one of the Oracle servers mentioned beforehand comes in at $6.48/hour or $56k/year, or $112k for the two servers we’d need, still overall less than even the Oracle software maintenance charge but covering hosting as well (though you’ll need to pay Amazon storage and bandwidth charges on top, too).
Connecting OBIEE 11g to these two databases is also fairly straightforward – the DataDirect Postgres ODBC drivers that come with OBIEE work fine, with the connection process involving creating an ODBC DSN to the Postgres database, either on-premise or on AWS, then importing the tables into the BI Repository and modelling them as normal. Connecting to the AWS cloud as a data source worked fine and wasn’t laggy, but you’d obviously need to test it at greater volume and with a realistic network connection. But – it works and seems to perform OK, at least with three rows…
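If you want to sanity-check connectivity from the OBIEE server before you start importing tables into the repository, a minimal sketch like the one below does the job – the endpoint and credentials are hypothetical placeholders, and because Redshift speaks the Postgres wire protocol, the same test works there too, just on port 5439:

```python
# Quick connectivity sanity-check against the Postgres Plus instance,
# run from the OBIEE server; endpoint and credentials are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="ec2-xx-xx-xx-xx.compute-1.amazonaws.com",  # hypothetical AWS host
    port=5432, dbname="dwh", user="obiee", password="secret")

with conn.cursor() as cur:
    cur.execute("SELECT table_name FROM information_schema.tables "
                "WHERE table_schema = 'public'")
    for (table,) in cur:  # the tables you'd then import into the BI Repository
        print(table)
conn.close()
```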
So something like Postgres Plus Advanced Server in the cloud would probably appeal to a customer with a temporary requirement, a departmental budget, or an infrastructure that’s already in the cloud, most probably on Amazon AWS. But you can also run Oracle Database on Amazon’s AWS cloud as well, through a partnership between Oracle and Amazon’s RDS (“Relational Database Service”). In this case, you go to the Amazon AWS website, sign-up for the service, put in your credit card details and within a few minutes, you’ve got a provisioned, Standard Edition One database ready to connect to via all the usual tools. And what this is, again, is the standard Oracle Database (or with EnterpriseDB’s offering, Postgres Plus Advanced Server) but hosted on AWS, charged hourly, with two main options for pricing:
- A license-included option (Oracle Database Standard Edition One) at $3.14/hour for a reasonably-specced instance, coming out at about $27k/year, or $54k for the two servers we’d need
- A “bring-your-own licence” (BYOL) option, where you provide the Oracle licenses and Amazon AWS hosts it, at around $19k/year for the same-size instance, or $38k/year for the two servers we’d need – the quick sketch below shows how the hourly rates annualise
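As a quick check on how the quoted hourly rates annualise (the ~$56k Postgres Plus figure from earlier, and the ~$27k RDS SE1 figure above):

```python
# Annualising the quoted hourly rates: one server, then the two we'd need.
HOURS_PER_YEAR = 24 * 365

for option, rate in [("Postgres Plus Advanced Server on AWS", 6.48),
                     ("Amazon RDS, SE1 license included",     3.14)]:
    annual = rate * HOURS_PER_YEAR
    # roughly the $56k/$112k and $27k/$54k figures quoted in the text
    print(option, round(annual), round(annual * 2))
```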
So if you work on the assumption that Oracle Database SE1 is roughly equivalent to Postgres Plus Advanced Server in terms of DW features, it’s not a bad option, and probably a better one if you’re already heavily invested in Oracle and have lots of skills there. The BYOL option is really just about hosting, but at $38k/year to host the two servers, back them up and so on, it’s not a bad option compared to buying some servers and putting them somewhere.
So, with Amazon RDS or Postgres Plus Advanced Server in the cloud, if you’re happy to take the hit on features and scalability, it’s possible to bring your database costs down considerably, which may be an option when your project first starts off, with the database platform reviewed in, say, a couple of years’ time to see whether it’s worth trading up to Oracle Database Enterprise Edition, or bringing the whole thing back in-house. But Postgres and Oracle aren’t the only players in the analytical database market – there are other on-premise vendors such as Greenplum and Vertica who’ll sell you either a full-priced, Oracle-comparable database or cut-down, community editions suitable for single-node installs, and there are new players such as Amazon again who are bringing out cloud-native analytic databases, again based on instance sizes and hourly charging – so let’s look at one of them, Amazon’s “Redshift”.
Amazon Redshift is actually based on an analytic database product called ParAccel, a column-store database in the same vein as Vertica, but with major bits cut out in its Amazon Redshift form. That said – it’s a lot more of an interesting starting point than core PostgreSQL or Oracle Database SE1, and given that it’s a core Amazon AWS product, it’s no doubt going to receive a lot of investment, developer evangelism and usage over the next few years. It’s also Postgres-compatible in terms of ODBC drivers, so viable for OBIEE, though like Postgres unsupported, so you’ll not get much in the way of BI Server optimisations or help from Oracle Support if it goes wrong.
To connect OBIEE to Amazon Redshift there’s a set of ODBC and JDBC drivers that you can download, and once you’ve opened up the port to Redshift in the AWS firewall, you can create an ODBC connection on the OBIEE side just like most other PostgreSQL connections, and then import the Redshift metadata into OBIEE’s BI Repository.
Redshift is priced around the same mark as Postgres Plus Advanced Server, but without the Amazon AWS storage and bandwidth charges. It’s a bit higher-risk than Postgres Plus, and it’s not got the Oracle Database compatibility features EnterpriseDB added to core Postgres, but it’s column-store, cloud-native and backed fully by Amazon.
So – with all of this in mind, what are the conclusions? Well it’s fair to say that, based on what turned up today and OBIEE’s pre-existing heterogeneous database support, you can connect non-Oracle databases to OBIEE as well as Oracle ones, but you’ll need to bear in mind that the newer cloud-based ones, and the open-source ones, won’t necessarily come with support from Oracle. If you’re looking to keep as close to an Oracle solution as possible but cut costs dramatically, using Amazon’s RDS service, particularly with the license-included SE1 option, could bring down the cost considerably, from $800k or so to around $50k/year, but you’ll need to work within the product constraints of SE/SE1 – so no materialised views, partitioning option and the like – so it’s probably a matter of preference whether you’d go for Oracle SE1 or a Postgres-based solution such as EnterpriseDB’s Postgres Plus Advanced Server (I’d probably go with Amazon RDS and Oracle Database SE1, if only because I can upgrade in time to Oracle Database Enterprise Edition and my Oracle skills can be re-used).
Going forward – analytic databases-as-a-service certainly sound interesting, with Amazon Redshift looking particularly promising. How much customers take up cloud-based data warehousing databases in general, though, will probably be limited until OBIEE itself goes into the cloud – either through cloud-friendly licensing policies that charge by the hour and instance capacity, or with a cloud-native version of OBIEE designed to interoperate with Oracle’s Cloud Database offering, which up until now doesn’t provide external SQL*Net access and is really designed for powering cloud-based Java apps. Watch this space, as they say – and it’ll be interesting to hear what Oracle announce in this area at next month’s Oracle Openworld.