Tag Archives: Big Data

Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5

New in Big Data Discovery 1.2 is the addition of BDD Shell, an integration point with Python. This exposes the datasets and BDD functionality in a Python and PySpark environment, opening up huge possibilities for advanced data science work on BDD datasets, particularly when used in conjunction with Jupyter Notebooks. With the ability to push data modified in this environment back to Hive, and thus back into BDD, this is important functionality that will make BDD even more useful for navigating and exploring big data.

The Big Data Lite virtual machine is produced by Oracle for demo and development purposes, and hosts all the components that you’d find on the Big Data Appliance, all configured and integrated for use. Version 4.5, which includes BDD 1.2, was released recently. In this article we’ll see how to configure BDD Shell on Big Data Lite 4.5 (along with Jupyter Notebooks), and in a subsequent post dive into how to actually use them.

Setting up BDD Shell on Big Data Lite

You can find the BDD Shell installation document here.

Log in to BigDataLite 4.5 (oracle/welcome1) and open a Terminal window. The first step is to download Anaconda, which is a distribution of Python that also includes “[…] over 100 of the most popular Python, R and Scala packages for data science”, as well as Jupyter Notebook, which we’ll see in a moment.

cd ~/Downloads/
wget http://repo.continuum.io/archive/Anaconda2-4.0.0-Linux-x86_64.sh

Then install it (n.b. bash is part of the command to enter):

bash Anaconda2-4.0.0-Linux-x86_64.sh

Accept the licence when prompted, and then select an install location. I used /u01/anaconda2, where the rest of the BigDataLite software is installed:

Anaconda2 will now be installed into this location:
/home/oracle/anaconda2

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/oracle/anaconda2] >>> /u01/anaconda2

After a few minutes of installation, you’ll be asked whether you want to prepend Anaconda’s location to the PATH environment variable. I opted not to (which is the default), since Python is used elsewhere on the system, and prepending Anaconda would give it priority and possibly break things.

Do you wish the installer to prepend the Anaconda2 install location
to PATH in your /home/oracle/.bashrc ? [yes|no]
[no] >>> no

Now edit the BDD Shell configuration file (/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.conf) in your favourite text editor to add/amend the following lines:

SPARK_EXECUTOR_PYTHON=/u01/anaconda2/bin/python
LOCAL_PYTHON_HOME=/u01/anaconda2

Amend these paths if you didn’t install Anaconda into /u01.

In the same configuration file, add/amend:

SPARK_HOME=/usr/lib/spark/
SPARK_EXTRA_CLASSPATH=/usr/lib/oozie/oozie-sharelib-yarn/lib/spark/spark-avro_2.10-1.1.0-cdh5.7.0.jar
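
Before running setup, it can be worth sanity-checking that the paths you’ve entered actually exist. Below is a minimal, optional sketch (not part of the official install steps) that assumes bdd-shell.conf uses plain KEY=value lines, as the entries above suggest:

import os

CONF = '/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.conf'
# Keys whose values should be paths present on the filesystem
PATH_KEYS = ['SPARK_EXECUTOR_PYTHON', 'LOCAL_PYTHON_HOME',
             'SPARK_HOME', 'SPARK_EXTRA_CLASSPATH']

with open(CONF) as f:
    for line in f:
        line = line.strip()
        if line.startswith('#') or '=' not in line:
            continue
        key, _, value = line.partition('=')
        if key in PATH_KEYS:
            status = 'OK' if os.path.exists(value) else 'MISSING'
            print '%-22s %-8s %s' % (key, status, value)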

Now run the BDD Shell setup:

/u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/setup.sh

This should succeed:

[bigdatalite.localdomain] Validating pre-requisites...
[bigdatalite.localdomain] Validation Success
[bigdatalite.localdomain] Setting up BDD Shell...
[bigdatalite.localdomain] Setup Success
[oracle@bigdatalite Downloads]$

Assuming it does, you can now launch the shell, bdd-shell.sh:

[oracle@bigdatalite Downloads]$ /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell/bdd-shell.sh
WARNING: User-defined SPARK_HOME (/usr/lib/spark) overrides detected (/usr/lib/spark/).
WARNING: Running spark-class from user-defined location.
spark.driver.cores is set but does not apply in client mode.
Welcome to
     ___   ___   ___       __   _     ____  _     _
    | |_) | |  | |      ( (` | |_| | |_  | |   | |
    |_|_) |_|_/ |_|_/     _)_) |_| | |_|__ |_|__ |_|__

SparkContext available as sc, HiveContext available as sqlContext.
BDD Context available as bc.

>>>

From the BDD Shell you can interact with BDD, for example to list out the datasets currently defined in the Catalog:

>>> bc.datasets().count
17
>>> for ds in bc.datasets():
...     print ds
...

media_demo_customer     edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        edp_cli_edp_2c7f41ee-65bf-43ac-8bb4-5b6b59a55d75        Hive    default.media_demo_customer

movie_genre     default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        default_edp_7d4c18a5-6f02-4067-9f63-91f950078b1e        Hive    default.movie_genre

media_demo_customer     default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        default_edp_89c616b6-aa10-4827-aa82-1e9c3fcc419e        Hive    default.media_demo_customer
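
The last column of each entry is the backing Hive table, which means you can also query the data directly through the HiveContext that the shell exposes as sqlContext. A minimal sketch, assuming the default.movie_genre table listed above is readable by your user:

>>> df = sqlContext.sql("SELECT * FROM default.movie_genre LIMIT 5")
>>> df.printSchema()
>>> for row in df.collect():
...     print row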

Whilst BDD Shell is command-line based, there’s also the option to run Jupyter Notebooks (previously known as IPython Notebooks), a web-based interactive notebook environment. This lets you build up scripts exploring and manipulating the data within BDD, using both Python and Spark. The big advantage of this over the command-line interface is that a notebook enables you to modify and re-run commands, and then, once correct, retain them as a fully functioning script for future use.

To launch it, run:

cd /u01/bdd/v1.2.0/BDD-1.2.0.31.813/bdd-shell
/u01/anaconda2/bin/jupyter-notebook --port 18888

Important points to note:

  • It’s important that you run this from the bdd-shell folder, otherwise the BDD shell won’t initialise properly.
  • By default Jupyter uses port 8888, which is already in use on BigDataLite by Hue, so specify a different one with --port (see the sketch after this list for a quick way to check that a port is free).
  • Jupyter by default only listens locally, so you need either to use the BigDataLite desktop to run Firefox, or to set up port-forwarding if you want to access Jupyter from your local web browser.
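
If you want to confirm that your chosen port is actually free before launching, a quick optional check (using nothing but the Python standard library, and not part of the BDD tooling) is:

import socket

def port_free(port, host='localhost'):
    # A port is considered free if nothing accepts a connection on it
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        return s.connect_ex((host, port)) != 0
    finally:
        s.close()

print port_free(8888)     # expect False - Hue is listening here
print port_free(18888)    # expect True, until Jupyter is running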

Go to http://localhost:18888 in your web browser, and you should see the default Jupyter screen with a list of files.
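
From here, create a new Python notebook. Since Jupyter was launched from the bdd-shell folder, the same contexts should be available as on the command line; assuming the BDD shell initialisation has worked, a first cell to confirm everything is wired up might look like this (using only the objects the shell banner lists as predefined):

# Check the Spark and BDD contexts are available in the notebook
print sc.version              # Spark version in use
print bc.datasets().count     # should match the count seen in bdd-shell (17 here)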

In the next article, we’ll see how to use Jupyter Notebooks with Big Data Discovery, and get an idea of just how powerful the combination can be.

The post Running Big Data Discovery Shell and Jupyter Notebook on Big Data Lite VM 4.5 appeared first on Rittman Mead Consulting.

Rittman Mead at Collaborate 16: Data Integration Focus

It’s that time of year again, when Oracle technologists from around the world gather in Las Vegas, NV to teach, learn, and, of course, network with their peers. The Collaborate conference, running for 10 years now, has been a collaboration, if you will, between the Independent Oracle Users Group (IOUG), Oracle Applications Users Group (OAUG) and Quest International Users Group (Quest), making it one of the largest user group conferences in the world. Rittman Mead will once again be in attendance, with two data integration-focused presentations from me over the course of the week.

My first session, “A Walk Through the Kimball ETL Subsystems with Oracle Data Integration”, scheduled for Monday, April 11 at 10:30am, will focus on how we can implement the ETL Subsystems using Oracle Data Integration solutions. As you know, Big Data integration has been the hot topic over the past few years, and it’s an excellent feature of the Oracle Data Integration product suite (Oracle Data Integrator, GoldenGate, & Enterprise Data Quality). But not all analytics – labor cost, revenue, or expense reporting, for example – require big data technologies. Ralph Kimball, dimensional modeling and data warehousing expert and founder of The Kimball Group, spent much of his career building an enterprise data warehouse methodology that can meet these reporting needs. His book, “The Data Warehouse ETL Toolkit”, is a guide for many ETL developers. This session will walk you through his ETL Subsystem categories – Extracting, Cleaning & Conforming, Delivering, and Managing – describing how the Oracle Data Integration products are perfectly suited to the Kimball approach.

I go into further detail on one of the ETL Subsystems in an upcoming IOUG Select Journal article, titled “Implement an Error Event Schema with Oracle Data Integrator”. The Select Journal is a technical magazine published quarterly and available exclusively to IOUG members. My recent post Data Integration Tips: ODI 12c Repository Query – Find the Mapping Target Table shows a bit of the detail behind the research performed for the article.

If you’re not familiar with the Kimball approach to data warehousing, I would definitely recommend reading one (or more) of their published books on the subject. I would also recommend attending one of their training courses, but unfortunately for the data warehousing community the Kimball Group closed shop in December 2015. But hey, the good news is that two of the former Kimball team members have joined forces at Decision Works, and they offer the exact same training they used to deliver under The Kimball Group name.

On Thursday, April 14 at 11am, I will dive into the recently released Oracle GoldenGate for Big Data 12.2 in a session titled “Oracle GoldenGate and Apache Kafka: A Deep Dive into Real-Time Data Streaming”. The challenge for us as data integration professionals is to combine relational data with unstructured, high volume and rapidly changing datasets, known in the industry as Big Data, and transform it all into something useful. Not just that, but we must also do it in near real-time, using a big data target system such as Hadoop. The topic of this session, real-time data streaming, provides a great solution for that challenging task. By combining GoldenGate, Oracle’s premier data replication technology, and Apache Kafka, the latest open-source streaming and messaging system for big data, we can implement a fast, durable, and scalable solution.
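
To give a flavour of the consuming side of such a pipeline, here’s a short illustrative sketch using the open-source kafka-python client. The topic name is purely hypothetical, and the actual payload format (Avro, JSON or delimited text) depends on how the GoldenGate Kafka handler is configured:

from kafka import KafkaConsumer

# 'ORCL.SOE.ORDERS' is a hypothetical topic name; GoldenGate topic naming
# depends on the Kafka handler configuration
consumer = KafkaConsumer('ORCL.SOE.ORDERS',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest')

for message in consumer:
    # Each message holds one change record captured by GoldenGate
    print '%s [%d/%d] %s' % (message.topic, message.partition,
                             message.offset, message.value)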

If you plan to be at Collaborate next week, feel free to drop me a line in the comments, via email at michael.rainey@rittmanmead.com, or on Twitter @mRainey. I’d love to meet up and have a discussion around my presentation topics, data integration, or really anything we’re doing at Rittman Mead. Hope to see you all there!

The post Rittman Mead at Collaborate 16: Data Integration Focus appeared first on Rittman Mead Consulting.

Introduction to Oracle R Training

Rittman Mead is thrilled to announce our new Introduction to Oracle R courses!

With the launch of 12c comes an exciting extension to OBIEE’s analytical capabilities—embedded R execution. Oracle has made several investments in the R language and its related technologies (Oracle R, ROracle, ORAAH, and Oracle R Enterprise). Based on our experience deploying advanced analytics solutions from within business intelligence departments, we view this simplified pairing of OBIEE and R as a great way forward in delivering advanced analytics across organizations.

But the world of R is a big one, with hundreds of open-source R packages and countless training avenues. With so many confusing options on the market, we wanted to simplify things for you. It can be difficult to know where to start, so we’ve designed an R course to help you bridge the gap between existing BI/DW skills and the new skills required to confidently derive insights with R. We want to help you look beyond the dashboard and delve into novel analytical techniques.

The Advanced Analytics course dives deeply into the analytical capabilities of 12c and R. We’ve designed the training so that prior knowledge of R isn’t assumed. The course builds on existing SQL and data visualization skills—covering R use cases from predictive analytics and time series forecasting to natural language processing and many others.

We want to share with you our vision of R-enhanced business intelligence that provides more insights, predictions, and actionable discoveries, all while scaling effortlessly with your data. With our training, you’ll learn to expand your analytic skills and become an invaluable asset to your company’s BI operation.

We are excited about what R brings to the world of BI and can’t wait to share our knowledge.

We will follow up this blog post with one about the business value of our new Introduction to Oracle R course and another post about the more technical aspects of the course.

The post Introduction to Oracle R Training appeared first on Rittman Mead Consulting.

Rittman Mead and Oracle Big Data Webcast Series – November 2015

We’re running a set of three webcasts together with Oracle on three popular use-cases for big data within an Oracle context, with the first one running tomorrow, November 3rd 2015, 15:00 – 16:00 GMT / 16:00 – 17:00 CET, on extending the data warehouse using Hadoop and NoSQL technologies.

The sessions run over three weeks this month, and look at the ways we’re seeing big data technologies used to extend the capacity and capabilities of the data warehouse, create analysis sandpits for analysing customer behaviour, and take data discovery into the Hadoop era using Oracle Big Data Discovery. All events are free to attend, and we’re timing them to suit the UK, Europe and the US. Details of each webcast are as follows:

Extending and Enhancing Your Data Warehouse to Address Big Data

Organizations with data warehouses are increasingly looking at big data technologies to extend the capacity of their platform, offload simple ETL and data processing tasks and add new capabilities to store and process unstructured data along with their existing relational datasets. In this presentation we’ll look at what’s involved in adding Hadoop and other big data technologies to your data warehouse platform, see how tools such as Oracle Data Integrator and Oracle Business Intelligence can be used to process and analyze new “big data” data sources, and look at what’s involved in creating a single query and metadata layer over both sources of data.

Audience: DBAs, DW managers, architects. Tuesday 3rd November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register

What is Big Data Discovery and how does it complement traditional Business Analytics?

Data Discovery is an analysis technique that complements traditional business analytics, and enables users to combine, explore and analyse disparate datasets to spot opportunities and patterns that lie hidden within your data. Oracle Big Data Discovery takes this idea and applies it to your unstructured and big data datasets, giving users a way to catalogue, join and then analyse all types of data across your organization. At the same time, Oracle Big Data Discovery reduces the dependency on expensive and often difficult-to-find Data Scientists, opening up many Big Data tasks to “Citizen” Data Scientists. In this session we’ll look at Oracle Big Data Discovery and how it provides a “visual face” to your big data initiatives, and how it complements and extends the work that you currently do using business analytics tools.

Audience: Data analysts, market analysts, & Big Data project team members. Tuesday 10th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register

Adding Big Data to your Organization to create true 360-Degree Customer Insight

Organisations are increasingly looking to “big data” to create a true, 360-degree view of their customer and market activity. Big data technologies such as Hadoop, NoSQL databases and predictive modelling now make it possible to bring highly granular data from all customer touch-points into a single repository, and to use that information to make better offers, create more relevant products and predict customer behaviour more accurately. In this session we’ll look at what’s involved in creating a customer 360-degree view using big data technologies on the Oracle platform, see how unstructured and social media sources can be added to more traditional transactional and customer attribute data, and how machine learning and predictive modelling techniques can then be used to classify, cluster and predict customer behaviour.

Audience: MI Managers, CX Managers, CIOs, BI / Analytics Managers. Tuesday 24th November, 15:00 – 16:00 GMT / 16:00 – 17:00 CET – Click here to register