Endeca Event in Birmingham
Just a quick note to highlight that we are running an Oracle Endeca Information Discovery (OEID) event at Oracle’s Birmingham office on Wednesday 26th June. It’s a great opportunity to learn how OEID can complement your current BI tools, allowing you to answer previously unanswerable questions through insight from both structured and unstructured data sources (such as social feeds and Word documents).
There will also be experts on hand before, during and after the event, providing a rare opportunity to get your questions answered by people who have been there and done it.
Click here for more information and to register.
OBIEE, ODI and Hadoop Part 2: Connecting OBIEE 11.1.1.7 to Hadoop Data Sources
In yesterday’s post I looked at the key enabling technologies behind OBIEE and ODI’s connectivity to Hadoop, and today I’ll look at how OBIEE 11.1.1.7 can now access Hadoop data sources through two related technologies: Hive and MapReduce.
In my introduction to the topic I said that, whilst writing MapReduce routines in Java and orchestrating them through the other tools in the Apache Hadoop family can be technically quite complex, another tool called “Hive” provides an SQL-like query layer over Hadoop and MapReduce so that tools like OBIEE can access them. Rather than you having to write your own MapReduce routines in Java, Hive writes them for you, returning data to OBIEE and ODI via ODBC and JDBC drivers. The diagram below, also from yesterday’s post, shows the data layers used in such an arrangement.
Under the covers, Hive has its own metadata layer, server engine and data store, with developers “loading” data into Hive “tables” which are then generally stored on the HDFS file system, just like any other data processed through MapReduce. Then, when a query is issued through Hive, the Hive Server dynamically generates MapReduce routines to query the underlying data, returning data to users in a similar way to an interactive database SQL session, like this:
markmacbookpro:~ markrittman$ ssh oracle@bigdatalite
oracle@bigdatalite's password:
Last login: Wed Apr 17 04:02:59 2013 from 192.168.2.200
=====================================================
=====================================================
Welcome to BigDataLite
run startx at the command line for X-Windows console
=====================================================
=====================================================
Host: bigdatalite.us.oracle.com [192.168.2.35]
[oracle@bigdatalite ~]$ hive
Hive history file=/tmp/oracle/hive_job_log_oracle_201304170403_1991392312.txt
hive> show tables;
OK
dwh_customer
dwh_customer_tmp
i_dwh_customer
ratings
src_customer
src_sales_person
weblog
weblog_preprocessed
weblog_sessionized
Time taken: 2.925 seconds
hive> select count(*) from src_customer;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201303171815_0003, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp?jobid=job_201303171815_0003
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003
2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0%
2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0%
2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33%
2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201303171815_0003
OK
25
Time taken: 22.21 seconds
hive>
In the example above, I connected to the Hive environment, listed out the “tables” available to me, and then ran a count of “rows” in the src_customer table, which caused a MapReduce routine to be written and executed in the background by the Hive server. Hive has been described as the “Hadoop Data Warehouse”, but it’s not really a data warehouse as you and I would know it – you wouldn’t typically use Hadoop and Hive to store customer transaction data, for example, but you might use it as a store of Facebook interactions, or of the most popular pages and interaction paths through your website, and someone working in web analytics might want to query that dataset interactively in a more user-friendly way than writing their own Java routines. So how does OBIEE gain access to this data, and what extra software or configuration pieces do you need to put in place to make it happen?
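Before answering that, a quick aside on where these Hive “tables” come from in the first place. The HiveQL below is purely an illustrative sketch (the table name, columns and HDFS location are all made up), but it shows the general pattern of mapping a Hive table over tab-delimited files that already sit in HDFS:
hive> CREATE EXTERNAL TABLE weblog_example (
    >   host         STRING,
    >   request_time STRING,
    >   url          STRING,
    >   status       INT
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    > LOCATION '/user/oracle/weblog_example';
Once a table like this is defined, any file dropped into that HDFS directory becomes queryable through HiveQL, and therefore potentially visible to OBIEE. With that in mind, back to the OBIEE side of things.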
If you want OBIEE 11g to access Hadoop data, you’re best going with the 11.1.1.7+ release, as this is where the feature is most tested and stable. You’ll need to configure drivers at two points: first at the server level (Hadoop access is only supported with Linux server installations of OBIEE 11.1.1.7), and then at the Windows-based Administration tool level. Let’s start with the BI Administration tool, based on the instructions in the 11.1.1.7 Metadata Repository Builder’s Guide.
To have the BI Administration tool connect to a Hadoop/Hive data source, you’ll need to download some ODBC drivers for Hadoop via a My Oracle Support download, DocID 1520733.1. This gives you a set of HiveODBC drivers along with a PDF explaining the installation process; once you’ve installed the drivers, you’ll need to open up the ODBC Data Source Administrator applet and create a new HiveODBC data source. In this instance, I call the data source “bigdatalite” after the server name, and go with the default values for the other settings. Note that “default” is the name of the “database” within Hive, and the port number is the port that the Hive server is running on.
Now I can create a new repository offline, and connect to the Hive server via the HiveODBC connection to start importing table metadata into the RPD. Note that with the current implementation of this connectivity, whilst you can import tables from multiple Hive databases into the RPD, queries you issue can’t span more than a single Hive database (i.e. you can’t specify a schema name prefix for the table name, therefore can’t join across two schemas).
Then, once you’ve imported the Hive table metadata into the RPD, change the physical database type to “Apache Hadoop”, from the default ODBC 3.5 setting that would have been added automatically by the metadata import process. Leave the connection pool call interface at ODBC2.0, put in any old username and password into the shared login details (or a valid username/password if Hive security is enabled), and then save the repository.
You should then be able to use the View Data feature in the BI Administration tool to view data in a particular Hive table, like this:
Now you need to move over to the server part of OBIEE, and configure the ODBC connection to Hive there too. OBIEE 11.1.1.7 comes with DataDirect drivers already installed that will connect to Hive, so it’s just a case of configuring a connection of the same name to the Hive datasource in OBIEE’s odbc.ini file, like this:
[ODBC Data Sources]
AnalyticsWeb=Oracle BI Server
Cluster=Oracle BI Server
SSL_Sample=Oracle BI Server
bigdatalite=Oracle 7.1 Apache Hive Wire Protocol
[bigdatalite]
Driver=/u01/app/Middleware/Oracle_BI1/common/ODBC/Merant/7.0.1/lib/ARhive27.so
Description=Oracle 7.1 Apache Hive Wire Protocol
ArraySize=16384
Database=default
DefaultLongDataBuffLen=1024
EnableLongDataBuffLen=1024
EnableDescribeParam=0
Hostname=bigdatalite
LoginTimeout=30
MaxVarcharSize=2000
PortNumber=10000
RemoveColumnQualifiers=0
StringDescribeType=12
TransactionMode=0
UseCurrentSchema=0
Note that you also need to configure OBIEE’s OPMN feature to use the DataDirect 7.1 drivers rather than the default, older ones – see the docs for full details on this step. Then, as far as the RPD is concerned, you just need to make a business model out of the Hive table sources, and upload it using EM so that it’s running online on your OBIEE server installation; your RPD in the end should look similar to this:
Then finally, you can create an OBIEE analysis using this data, and analyse it just like any other data source – except, of course, that there’s quite a lot of lag and latency at the start of the query, as Hive spins up its Java environment, writes the MapReduce query, and then sends the data back to OBIEE’s BI Server.
So how do we get data into Hive in the first place, to create these tables that, in the background, are accessed through Hadoop and MapReduce? Check back tomorrow, when I’ll look at how Oracle Data Integrator can be used to load data into Hive, as well as perform other data integration tasks using Hadoop and other big data technologies.
OBIEE, OEM12cR2 and the BI Management Pack Part 3: But What Does It Do?
In the previous two posts in this series, I looked at the product architecture for Oracle Enterprise Manager 12cR2 (EM12cR2) Cloud Control and the BI Management Pack, and how you registered OBIEE, TimesTen, Essbase and the DAC as targets for monitoring and managing. But what can you do with EM12cR2 and the BI Management Pack once you’ve set it all up, how well does it handle other related products such as Informatica and Siebel CRM, how customisable is it and what other tasks can it perform?
To start off, one of the questions we’ve been asked is whether, in a similar way to OBIEE and Oracle Portal, you can customise the EM web console display to include just those views that you’re interested in; to create, for example, a dashboard page for monitoring OBIEE that might include views on BI Server query throughput, GoldenGate activity, DAC ETL alerts and so on. The answer is – not quite – but there are some customisations and bookmarks that you can create which at least make it easier to navigate your way around.
When you first log into OEM12cR2, you’re presented with the standard Enterprise Summary view, which summarises a number of metrics across all targets in the EM repository.
You can, however, change this for a more focused view of a particular type of target, by selecting SYSMAN > Select My Home… (or whatever your logged-in user name is), and then selecting from the list of pre-defined target views presented on the Select Enterprise Manager Home page that’s then displayed.
If, for example, your primary responsibility was looking after OBIEE systems, you might choose to have the Middleware page as your homepage, so that all of the WebLogic farms are listed out on your front page.
You can also set individual pages in EM as “favorites”, so that they appear from the Favorites menu for quick access as shown in the screenshot below.
Something else that’s useful when you’ve got a number of similarly-named systems registered within your EM repository is to put them into groups. To create a group to hold my “demo” OBIEE systems, for example, I would select Targets > Groups from the web console menu, and then press the Create > Group button to bring up the Create Group page. Then, using the Search or Search by Criteria buttons I can refine the search to include, for example, just Fusion Middleware Farms, and then select the ones that I’d like to add to the new group.
You can also create “dynamic” groups, for example including all systems that have a “development” status in a group that updates over time, like this:
Once you’ve registered your systems, you can do all of the same things you did with the 10gR4 version of EM and the BI Management Pack, including viewing metrics over time rather than just for the time you’ve got the metric window open (to my mind, one of the most valuable features in EM vs. Fusion Middleware Control).
Metric thresholds can also be defined in a similar fashion to how they were in EM10gR4, with events then triggered when a threshold is exceeded to notify you, for example, when query response times exceed a certain number of seconds, or when the dashboard login page can’t be reached. Unfortunately the dashboard and scheduler reports that are included as part of the BI Management Pack can’t be turned into graphs, but like Fusion Middleware Control any of the standard metrics can be graphed, overlaid on the same server’s metrics for the previous day, or compared to another server’s metrics or a baseline.
Finally, another question we’re often asked is how many other systems EM12cR2 can monitor, either out-of-the-box, through paid-for official plug-ins, or through third-party extensions. The first thing to be aware of is what EM functionality is included “for free” as part of your database or middleware license and what functionality costs more; the definitive place for this information is the Oracle® Enterprise Manager Licensing Information 12c Release 2 (12.1.0.2) doc on OTN. You can also select Setup > Management Packs > Show Management Pack Information from the web console to have EM highlight those menu items that require additional licensing beyond what’s included by default for licensed database or middleware customers. In the example below, for instance, the items annotated with “OBIM” would require an Oracle BI EE customer to purchase the BI Management Pack, whilst the others would be “free” to use by any BI customer.
As for what these management packs and plug-ins cost, again the definitive source is the Oracle Tech Price List, which changes from time to time but can always be found with a Google search for “oracle tech price list”. The price list as of the time of writing listed the BI Management Pack at $11,500/processor (based on the processors licensed for BI EE).
Note also that with management packs you generally – at least in the case of the Oracle Database – need to license the appropriate database option as well, though plug-ins are generally free, or at least provided as part of the main product cost, as is the case with TimesTen and Exadata. In terms of which features come out of the box and which require separate installation, you can check this via the Setup > Extensibility > Self Update and Plug-ins menu items, which show the downloaded and available agent versions, along with the various plug-ins that can be used immediately or downloaded from Oracle’s support site, including ones for Siebel (below) and EMC’s SAN arrays.
There are also plug-ins available for download from third-party sites for targets such as Informatica PowerCenter, VMware vSphere and MySQL, with most of them gathered together at the Enterprise Manager Extensions Exchange, also on the Oracle website.
So there we are with our three-part look at EM12cR2 and the BI Management Pack. I’m over in Norway now for the Oracle User Group Norway conference, but check back soon for some new content on the 11.1.1.7 release of OBIEE 11g.
OBIEE, OEM12cR2 and the BI Management Pack Part 2: Installation and Configuration
In the previous post in this series, we looked at what Oracle Enterprise Manager 12cR2 and the BI Management Pack could do for OBIEE 11g admins, and how it manages a number of Oracle products from the database through to Fusion Middleware and the ERP applications. In today’s post I’m going to look at how an OBIEE system (or “target”) is registered so that we can then use BI Management Pack features, and how you make use of new features in the BI Management Pack such as support for Essbase targets.
I’ll work on the assumption that you’ve already got EM12cR2 installed, either on 64-bit Windows or 64-bit Unix (my preference is 64-bit Oracle Linux 5, though all should work); if you’ve not got EM12cR2 installed, or you’re on an earlier version, the software is available on OTN and you’ll also need a suitable, recent and patched-up database to store the EM repository. Once you’ve got everything installed we can log in and take a look around. Note that there’s no separate BI Management Pack download; all functionality is included, but you need to be aware of what’s extra-cost and what’s not. The licensing guide is your best reference here, but at a high level there are some parts of EM12cR2 that all licensed BI customers can use, whilst other features require the BI Management Pack – we’ll cover this in more detail in tomorrow’s post.
Logging into EM12cR2 presents you with a summary of the health and status of your systems, and you can see from the pie chart on the left that some of my systems are up, some are down and so forth. The Targets menu at the top of the screen lets me view similar information for hosts, middleware installations, databases and so on. My EM12cR2 installation has a number of OBIEE and other systems already registered with it, all of which are on VMs of which only a few are currently powered up.
In this example, I’m going to add a new host to this list, which is actually an Exalytics demo VM containing OBIEE 11.1.1.6 and TimesTen. Later on, we’ll look at adding Essbase to the list of monitored targets, both in terms of Essbase integrated into an OBIEE 11.1.1.7 install, and standalone as a separate install; finally, we’ll see how the BI Apps DAC is registered, so we can view the progress of Informatica ETL runs into the BI Apps data warehouse.
As I mentioned in yesterday’s post, EM12cR2 monitors and manages other servers by installing management agents on them; to do this, it needs to connect to the server via SSH in order to install the agent software there. To enable this, you need to provide the login credentials for a user on that server with “sudo” (act as an administrator) privileges, and a number of other settings have to be enabled for this process to work; to check that all of these are in place, let’s open up a console session on the Exalytics server and see how it looks:
[oracle@exalytics ~]$
[oracle@exalytics ~]$ sudo vi /etc/hosts
[sudo] password for oracle:
oracle is not in the sudoers file. This incident will be reported.
[oracle@exalytics ~]
What happened here is that I tried to run a command as the superuser, and the system asked for my password, but it turns out that this user isn’t in the list of users authorised to act as the superuser. To fix this, I need to now actually log in as root, and then issue the command:
/usr/sbin/visudo
to open up a special version of “vi” used for editing the sudoers file, and then add the line:
oracle ALL=(ALL) ALL
to the end to enable the “oracle” user to use the sudo command. After doing this and trying the previous commands again, I can now use the sudo command. Let’s now move over to the EM12cR2 website and start the process of registering the host, and thereafter registering the various software components on the Exalytics server.
There are various automated and semi-automated ways of discovering candidate servers on your network, but for simplicity I just select Setup > Add Target > Add Targets Manually from the menu in the top right-hand corner of the screen, which allows me to add details of the host directly rather than let EM scan the network to find them.
The Add Targets Manually page is then displayed. I select Add Host Targets from the set of radio button options, and then press the Add Host … button.
I then type in the name of the host (or IP address), and select the platform type. Note that unless your target is Linux x64, you’ll probably need to download the required agent software for that platform before you perform this task, using the Self Update feature.
Then, type in the location on the remote server that you want to install the agent software to, and the login credentials of the user that you’ve just enabled for sudo access on that server.
EM12cR2 will then attempt to install the agent. If you’re lucky, it’ll install first time, but more likely you’ll see a message like the one in the screenshot below saying that there was a problem with the agent installation.
What this is saying is that you need to make some further changes to the “sudoers” file on the remote server before EM can properly use it to install the agent. There are usually two issues, and you hit the second one after fixing the first, so let’s tackle them both now. Going back over to the remote server and logging in as the “oracle” user, let’s use sudo again to fix the issue:
[oracle@exalytics ~]$ sudo /usr/sbin/visudo
The first bit to find in the file is the line that says (assuming Oracle Linux 5, as I’m using):
Defaults requiretty
This setting means that any user trying to use the sudo command needs to be logged in at the server’s own console, not remotely via SSH; it’s the default setting on Oracle Linux 5, so to stop sudo requiring console access, I change it to:
Defaults !requiretty
While you’re there, also add the line:
Defaults visiblepw
to the file as well, as EM will complain about that not being set if you try and deploy again now. Once both of these are set, go back to EM, retry the deployment with the existing settings, and the agent should deploy successfully. Note also that if your OBIEE installation is on a Windows server, you’ll need to install the Cygwin Unix-like environment on the server before you do all this, to enable the SSH service and bash command shell that EM requires – see these notes in the EM12cR2 docs for more details.
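Pulling those Linux-side changes together, the relevant entries in the sudoers file on the target server end up looking something like this (just a sketch of the lines we’ve touched; the real file will contain plenty of other entries):
Defaults    !requiretty
Defaults    visiblepw
oracle      ALL=(ALL)       ALL
Remember to make these edits with visudo rather than editing the file directly, as it checks the syntax before saving.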
So at this point the management agent software will be deployed to the server, but none of the WebLogic, database or BI software will be registered with EM yet. To do this, on the screen that’s displayed again after you’ve registered the host itself, select the Add Non-Host Targets using Guided Process (Also Adds Related Targets) option, then from the Target Type drop-down menu select Oracle Fusion Middleware, and then press the Add Using Guided Discovery… button to start the process by registering the WebLogic installation which in turn hosts OBIEE 11g.
When prompted, select the host that you’ve just installed the agent on as the WebLogic Administration Server Host, put in the WebLogic administration user details to connect to the admin server (not the OS user details you used earlier), leave the rest of the settings at their default values and press Continue.
If all goes to plan EM should then report a number of targets found in the scan – these are the WebLogic Server components, plus the BI components that we’re actually interested in.
On the next page, add any notes that you want to the target registration details, then press the Add Targets button to add these to the EM repository.
On the Middleware targets page that is displayed once the targets are registered, you should see your WebLogic installation now listed, and if you drill into the Farm entry you’ll see the domain and the coreapplication entry that represents your Oracle BI instance. Click on the details for the farm, and you’ll then see something that looks familiar from Fusion Middleware Control – the view of your OBIEE installation, where you can also drill into coreapplication and see details of your instance. We’ll cover more on what this screen can do in tomorrow’s final post on this topic.
At this point our OBIEE system is mostly registered and configured, but we still need to register the repository database, so the dashboard and scheduler reports can work. To do this, select Business Intelligence Instance > Target Setup > Monitoring Credentials from the coreapplication drop-down menu, and then enter the details for that server’s BIPLATFORM schema, like this:
You should then be able to select Business Intelligence Instance > Dashboard Reports from the coreapplication drop-down menu to see details of which dashboards have run, what error messages were logged and so forth.
Note that this is a fairly minimal set of reports against usage tracking data – there’s no ability to graph the results, for example, and no ability to view individual report usage, just dashboards. But at least it’s something.
So that’s taken care of the OBIEE elements of the Exalytics server install. But what about the TimesTen server that provides the in-memory database cache on the Exalytics server? TimesTen support doesn’t come out-of-the-box with EM12cR2, but you can enable it through a plug-in via EM12cR2’s “self-update” feature. To do this, from the Setup menu on the top right-hand side select Setup > Extensibility > Self Update, click on Plug-in in the list of folders that are then displayed, and then locate the Oracle TimesTen In-Memory Database entry in the Plug-in Updates listing that is then displayed. Assuming that it’s not been downloaded by someone else beforehand, click on it and press the Download button to start the download process into EM’s software library.
After a short while the plug-in should be downloaded from Oracle’s support website, and you can then make it available for use with an agent. To do so, locate it in the list of plug-in updates again, click on it to select it, and then press the Apply button that’s next to the (greyed-out) Download button. You’ll then be taken to the Plug-ins page, where you should use the Deploy On button to deploy it first to the Management Server (i.e. the EM12cR2 server) and then to the Management Agent server (in this case, our Exalytics server) – note that you’ll need to know the SYS password for the database that holds your EM repository to do the OMS registration part.
If all goes to plan, EM should then start the process of deploying the TimesTen plug-in to the Management Server first, once it’s checked prerequisites and so forth. On my system, it also deployed the plug-in to the Exalytics server too, even though I don’t think I actually requested it.
The final configuration step is to use the plug-in to register the TimesTen target on the Exalytics server. To do this I return to the main Setup menu in the EM web console and select Setup > Add Target > Add Targets Manually, and then select the Add Non-Host Targets by Specifying Target Monitoring Properties radio button. Then, when the Target Type drop-down menu is displayed, select TimesTen In Memory Database 11g from the list, and select the management agent that’s on the Exalytics server. Once done, press the Add Manually… button to go on to the next stage of the target registration process.
Then, when prompted, enter the connection details to your TimesTen instance, as used on the Exalytics server.
And that should be it – the TimesTen server should be registered and then available as a target to view in EM. It’ll take a while for metrics to start getting collected and displayed in the various graphs, but you can take a look at what’s recorded and what actions you can take from the menu that’ll now appear when you view a TimesTen target.
For Essbase, how you register the Essbase server as a target depends on whether Essbase is installed standalone in just a WebLogic managed server (as it is with Oracle’s SampleApp demo VMs), installed alongside OBIEE 11.1.1.7 or 11.1.1.6.2 BP1 as part of a single BI domain, or installed in its own full WebLogic domain with a WebLogic Server administration server. If it’s installed standalone, the initial registration of the WebLogic domain on the server concerned won’t register the Essbase server, and instead you’ll need to register it manually afterwards in a similar manner to the TimesTen server. If you’ve installed Essbase along with OBIEE as part of the combined 11.1.1.7 install, it’ll get registered along with OBIEE, and be displayed underneath coreapplication as shown in the screenshot below. Finally, if Essbase has its own WebLogic domain, then it gets detected as a target type as part of that domain’s registration, in the same way that OBIEE does when its WebLogic domain is registered as a target.
Finally, the BI Apps Data Warehouse Administration Console (DAC) is registered similarly to Essbase and TimesTen, except that like Essbase the plug-in required for management is already included in most EM12cR2 installations. As the DAC isn’t associated with any particular middleware home (at least, not with BI Apps 7.9.6.3), you’ll need to find it within the general list of targets rather than with the OBIEE installation it’s linked with.
So, with all of this set up, what can you do with it? In the final part of this series tomorrow, we’ll look at some common questions and usage scenarios around EM12cR2 and the BI Management Pack, and try and answer some of these questions.
Introduction to Hadoop HDFS (and writing to it with node.js)
The core Hadoop project solves two problems with big data – fast, reliable storage and batch processing.
For this first post on Hadoop we are going to focus on the default storage engine and how to integrate with it using its REST API. Hadoop is actually quite easy to install, so let’s see what we can do in 15 minutes. I’ve assumed some knowledge of the Unix shell but hopefully it’s not too difficult to follow – the software versions are listed in the previous post.
If you’re completely new to Hadoop three things worth knowing are…
- The default storage engine is HDFS – a distributed file system with directories and files (ls, mkdir, rm, etc)
- Data written to HDFS is immutable – although there is some support for appends
- HDFS is suited for large files – avoid lots of small files
If you think about batch processing billions of records, large and immutable files make sense. You don’t want the disk spending time doing random access and dealing with fragmented data if you can stream the whole lot from beginning to end.
Files are split into blocks so that nodes can process files in parallel using MapReduce. By default a Hadoop cluster will replicate each file block to 3 nodes, and each file block can take up to the configured block size (~64MB).
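Both of those defaults are just Hadoop configuration properties, so you can override them in conf/hdfs-site.xml if you need different values. A sketch, assuming Hadoop 1.x property names (these are the out-of-the-box values, so you’d only add them in order to change them):
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
</property>
On a single-node development setup like the one below, a replication factor of 1 is common, as there’s only one datanode to replicate to.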
Starting up a local Hadoop instance for development is pretty simple and even easier as we’re only going to start half of it. The only setting that’s needed is the host and port where the HDFS master ‘namenode’ will exist but we’ll add a property for the location of the filesystem too.
After downloading and unpacking Hadoop add the following under the <configuration> tags in core-site.xml…
conf/core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/${user.name}/hdfs-filesystem</value>
</property>
Add your Hadoop bin directory to the PATH
export PATH=$PWD/hadoop-1.0.4/bin:$PATH
The only other step before starting Hadoop is to format the filesystem…
hadoop namenode -format
Hadoop normally runs with a master and many slaves. The master ‘namenode’ tracks the location of file blocks and the files they represent and the slave ‘datanodes’ just store file blocks. To start with we’ll run both a master and a slave on the same machine…
# start the master namenode
hadoop-daemon.sh start namenode
# start a slave datanode
hadoop-daemon.sh start datanode
At this point we can write some data to the filesystem…
hadoop dfs -put /etc/hosts /test/hosts-file
hadoop dfs -ls /test
You can check that the Hadoop daemons are running correctly by running jps (Java ps). Shutting down the daemons can be done quickly with a ctrl-c or killall java – do this now.
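For the record, a tidier way to stop them is with the same daemon script; a quick sketch:
hadoop-daemon.sh stop datanode
hadoop-daemon.sh stop namenode
Either way works fine on a disposable development machine.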
To add data we’ll be using the WebHDFS REST API with node.js, as both are simple but still fast.
First we need to enable the WebHDFS and Append features in HDFS. Append has some issues and has been disabled in 1.1.x so make sure you are using 1.0.4. It should be back in 2.x and should be fine for our use – this is what Hadoop development is like! Add the following properties…
conf/hdfs-site.xml
<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
Restart HDFS…
killall java
hadoop-daemon.sh start namenode && hadoop-daemon.sh start datanode
Before loading data we need to create the file that will store the JSON. We’ll append all incoming data to this file…
hadoop dfs -touchz /test/feed.data
hadoop dfs -ls /test
If you see the message “Name node is in safe mode” then just wait for a minute as the namenode is still starting up.
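Incidentally, because WebHDFS is just HTTP you can sanity-check it with curl before writing any JavaScript. A sketch, assuming the namenode web port is the default 50070 and that you pass your login user via the user.name parameter:
# list the contents of /test through the REST API
curl -i "http://localhost:50070/webhdfs/v1/test?op=LISTSTATUS&user.name=$USER"
# appending is a two-step operation: the namenode replies with a 307 redirect to a datanode...
curl -i -X POST "http://localhost:50070/webhdfs/v1/test/feed.data?op=APPEND&user.name=$USER"
# ...and you then re-send the POST, with the data attached, to the URL in the Location header
curl -i -X POST -T some-local-file.json "<Location header from the previous response>"
The node-webhdfs package we’re about to use just wraps these same calls up for us.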
Next download node.js (http://nodejs.org/download/) – if you’re using Unix you can ‘export PATH’ in the same way we did for hadoop.
export PATH=$PWD/node-v0.10.0-linux-x64/bin/:$PATH
Scripting in node.js is very quick thanks to the large number of packages developed by users. Obviously the quality can vary but for quick prototypes there always seems to be a package for anything. All we need to start is an empty directory where the packages and our script will be installed. I’ve picked three packages that will help us…
mkdir hdfs-example
cd hdfs-example
npm install node-webhdfs
npm install ntwitter
npm install syncqueue
The webhdfs and twitter packages are obvious, but I’ve also used the syncqueue package so that only one append command is sent at a time – JavaScript is asynchronous. To use these, create a file named twitter.js and add…
var hdfs = new (require("node-webhdfs")).WebHDFSClient({
  user: process.env.USER,
  namenode_host: "localhost",
  namenode_port: 50070
});
var twitter = require("ntwitter");
var SyncQueue = require("syncqueue");

var hdfsFile = "/test/feed.data";

// make appending synchronous
var queue = new SyncQueue();

// get your developer keys from: https://dev.twitter.com/apps/new
var twit = new twitter({
  consumer_key: "keykeykeykeykeykey",
  consumer_secret: "secretsecretsecretsecret",
  access_token_key: "keykeykeykeykeykey",
  access_token_secret: "secretsecretsecretsecret"
});

// stream tweets matching the filter, appending each one to the HDFS file as JSON
twit.stream("statuses/filter", { "track": "hadoop,big data" }, function (stream) {
  stream.on("data", function (data) {
    queue.push(function (done) {
      console.log(data.text);
      hdfs.append(hdfsFile, JSON.stringify(data), function (err, success) {
        if (err instanceof Error) { console.log(err); }
        done();
      });
    });
  });
});
And run node twitter.js
Now sit back and watch the data flow – here we’re filtering on “hadoop,big data”, but you might want to choose a different query or even a different source, e.g. tail a local log file, call a web service, or run a webserver.
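To check that tweets really are making it into HDFS, you can peek at the file from another terminal; a quick sketch:
hadoop dfs -ls /test
hadoop dfs -tail /test/feed.data
The file size in the ls output should keep growing, and the tail should show the last kilobyte of raw JSON that has been appended.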