Oracle OpenWorld 2015 Roundup Part 1 : OBIEE12c and Data Visualisation Cloud Service
Last week saw Oracle OpenWorld 2015 running in San Francisco, USA, with Rittman Mead delivering a number of sessions around BI, data integration, Big Data and cloud. Several of us took part in Partner Advisory Councils on the Friday before OpenWorld itself, and along with the ACE Director briefings earlier that week we went into the conference with a pretty good idea of what was being announced. As ever, though, there were a few surprises, and some sessions hidden away that were actually very significant in terms of where Oracle might be going. Let’s go through what we thought were the key announcements first, then get onto the more interesting stuff at the end.
The key announcement for us and our customers was of course the general availability of OBIEE12c (12.2.1), which we described in a blog post at the time as being focused primarily on business agility and self-service – the primary drivers of BI license spend today. OBIEE12c came out the Friday before OpenWorld with availability across all supported Unix platforms as well as Linux and Windows. At first glance this initial release doesn’t seem massively different to 11g for developers and end-users: RPD development through the BI Administration tool is largely the same as 11g, at least for now; Answers and Dashboards has had a face-lift and uses a new, flatter UI style called “Oracle Alta” but is otherwise recognisably similar to 11g; and the installer lays down Essbase and BI Publisher alongside OBIEE.
Under the covers, though, there are some key differences and improvements that will only become apparent after a while, or that are really a foundation for much wider changes coming later in the 12c product timeline. The way you upload RPDs gives some hint of what’s to come: with 11g we used Enterprise Manager to upload new RPDs to the BI Server, which then had to be restarted to pick up the new repository, whereas 12c has a separate utility for uploading RPDs and they’re not stored in quite the same way as before (more on this to come…). In addition there’s no longer any need to restart the BI Server (or cluster of BI Servers) to use the new repository, and the back-end has been simplified in lots of different ways, all designed to enable cloning, provisioning and portability between on-premise and cloud based around two new concepts: “service instances” and “BI Modules”. Expect to hear more about these over the next few years; the diagram below outlines 12c’s product architecture at a high level.
Of course there are two very obvious new front-end features in OBIEE12c, Visual Analyzer and data mashups, but they require an extra net-new license on top of BI Foundation Suite to use in production. Visual Analyzer is Oracle’s answer to Tableau and adds data analysis, managed data discovery and data visualisation to OBIEE’s existing capabilities, but crucially uses OBIEE’s RPD as the primary data source for users’ analysis – in other words providing Tableau-like functionality but with a trusted single source of data, managed and curated centrally. Visual Analyzer is all about self-service and exploring datasets, and it’s here that the new data-mashup feature is really aimed: users can upload spreadsheets of additional measures and attributes to the core dataset used in their Visual Analyzer project, and blend or “mash up” their data to create their own unique visualisations, as shown in the screenshot below:
Data mashups are also available for the core Answers product, but they’re primarily aimed at VA. For more casual users, where data visualisation is all they want and cloud is their ideal delivery platform, Oracle also released Data Visualisation Cloud Service (DVCS) – aka Visual-Analyzer-in-the-cloud.
To see DVCS in action, the YouTube video below shows just the business analytics part of Thomas Kurian’s session, where DVCS links to Oracle’s Social Network Cloud Service to provide instant data visualisation and mashup capabilities all from the browser – pretty compelling if you ignore the Oracle Social Network part (is that ever used outside of Oracle?).
Think of DVCS as BICS with Answers, Dashboards and the RPD Model Builder stripped out, with all data instead uploaded from spreadsheets, at half the price of BICS and first in line for new VA features as they become available. This “cloud first” strategy goes across the board for Oracle now – partly an incentive to move to the cloud, mostly a reflection of how much easier it is to ship new features when Oracle controls the installation. DVCS and BICS will see updates on a more or less monthly cycle now (see this MOS document that details new features added to BICS since initial availability, and this blog post from ourselves announcing VA and data mashups on BICS well before they became available on-premise). In fact we’re almost at the point where whole on-premise OBIEE systems can conceivably be moved into Oracle Cloud, with my main OpenWorld session on just this topic – the primary end-user benefit being first access to the usability, self-service and data viz capabilities Oracle are now adding to their BI platform.
Moreover, DVCS is probably just the start of a number of standalone, on-premise and cloud VA derivatives trying to capture the Tableau / Excel / PowerBI market. Pricing is more competitive than BICS, but as Oracle move further downmarket with VA it’ll end up competing more head-to-head with Tableau on features, and PowerBI is just a tenth of the cost of DVCS. I see it more as a “land-and-expand” play, with the aim being to trade the customer up to full BICS, or at least capture the segment of the market who’d otherwise go to Excel or Tableau Desktop – it’ll be interesting to see how this one plays out.
So that’s it for Part 1 of our Oracle Openworld 2015 roundup – tomorrow we’ll look at data integration and big data.
Oracle Business Intelligence 12c Now Available – Improving Agility and Enabling Self-Service for BI Users
Oracle Business Intelligence 12c became available for download last Friday and is being officially launched at Oracle OpenWorld next week. Key new features in 12c include an updated, cleaner look-and-feel; Visual Analyser, which brings Tableau-style reporting to OBIEE users; and another new feature called “data mashups”, which enables users to upload spreadsheets of their own data to combine with their main curated datasets.
Behind the scenes the back-end of OBIEE has been overhauled with simplification aimed at making it easier to clone, provision and backup BI systems, whilst other changes are laying the foundation for future public and private cloud features that we’ll see over the coming years – and expect Oracle BI Cloud Service to be an increasingly important part of Oracle’s overall BI offering over the next few years as innovation comes more rapidly and “cloud-first”.
So what does Oracle Business Intelligence 12c offer customers currently on the 11g release, and why would you want to upgrade? In our view, the new features in 12c come down to two main areas – “agility” and “self-service” – two major trends that have been driving spend and investment in BI over the past few years.
OBIEE 12c for Business Agility – Giving Users the Ability to complete the “Last Mile” in Reporting, and Moving towards “BI-as-a-Service” for IT
A common issue that all BI implementors have had over many years is the time it takes to spin up new environments, create reports for users, and respond to new requirements and new opportunities. New OBIEE12c features such as data mashups make it easier for end-users to complete the “last mile” in reporting by adding particular measures and attribute values to the reports and subject areas provided by IT, avoiding the situation where they instead export all the data to Excel or wait for IT to add the data they need to the centrally managed, curated dataset.
From an IT perspective, simplifications to the back-end of OBIEE – such as bringing all configuration files into one place, deprecating the BI Systems Management API in favour of a return to configuration files, simpler upgrades and faster installation – make it quicker and easier to provision new 12c environments and to move workloads between on-premise and the cloud. The point of these changes is to enable organisations to respond to opportunities faster, and to make sure IT isn’t the thing slowing the reporting process down.
OBIEE 12c for Self-Service – Recognising the Shift in Ownership from IT to the End-Users
One of the biggest trends in BI, and in computing in general over the past few years, is the consumerization of IT and expectations around self-service. Big beneficiaries of that trend have been vendors such as Tableau and QlikView, who’ve delivered BI tools that run on the desktop and make everything point-and-click – the equivalent of the PC vendors back when IT ran mainframes: data and applications became a bit of a free-for-all, but users were able to get things done now rather than having to wait for IT to provide the service. Similar to the data upload feature I mentioned in the context of agility, the new Visual Analyser feature in OBIEE12c brings those same self-service, point-and-click data analysis features to OBIEE users – but crucially with a centrally managed, single-version-of-the-truth business semantic model at the centre of things.
Visual Analyser comes with the same data-mashup features as Answers, and new advanced analytics capabilities in Logical SQL and Answers’ query builder bring statistical functions like trend analysis and clustering into the hands of end-users, avoiding the need to involve DBAs or data scientists to write complex SQL. If you do have a data scientist and you want to re-use their work without learning another tool, OBIEE12c makes it possible to call external R functions from within Answers, separate to the Oracle R Enterprise integration in OBIEE11g.
We’ll be covering more around the OBIEE12c launch over the coming weeks, building on these themes of enabling business agility and putting more self-service tools into the hands of users. We’ll also be launching our new OBIEE12c course over the next couple of days, with the first runs happening in Brighton and Atlanta in January 2016 – watch this space for more details.
Introducing the Rittman Mead OBIEE Performance Analytics Service
Fix Your OBIEE Performance Problems Today
OBIEE is a powerful analytics tool that enables your users to make the most of the data in your organisation. Ensuring that expected response times are met is key to driving user uptake and successful user engagement with OBIEE.
Rittman Mead can help diagnose and resolve performance problems on your OBIEE system. Taking a holistic, full-stack view, we can help you deliver the best service to your users. Fast response times enable your users to do more with OBIEE, driving better engagement, higher satisfaction, and greater return on investment. We enable you to:
- Create a positive user experience
- Ensure OBIEE returns answers quickly
- Empower your BI team to identify and resolve performance bottlenecks in real time
Rittman Mead Are The OBIEE Performance Experts
Rittman Mead have many years of experience in the full life cycle of data warehousing and analytical solutions, especially in the Oracle space. We know what it takes to design a good system, and to troubleshoot a problematic one.
We are firm believers in a practical and logical approach to performance analytics and optimisation. Eschewing the drunk man anti-method of ‘tuning’ configuration settings at random, we advocate making a clear diagnosis and baseline of performance problems before changing anything. Once a clear understanding of the situation is established, steps are taken in a controlled manner to implement and validate one change at a time.
Rittman Mead have spoken at conferences, produced videos, and written many blogs specifically on the subject of OBIEE Performance.
Performance Analytics is not a dark art. It is not the blind application of ‘best practices’ or ‘tuning’ configuration settings. It is the logical analysis of performance behaviour to accurately determine the issue(s) present, and the possible remedies for them.
Diagnose and Resolve OBIEE Performance Problems with Confidence
When you sign up for the Rittman Mead OBIEE Performance Analytics Service you get:
- On-site consultancy from one of our team of Performance experts, including Mark Rittman (Oracle ACE Director), and Robin Moffatt (Oracle ACE).
- A Performance Analysis Report to give you an assessment of the current performance and prioritised list of optimisation suggestions, which we can help you implement.
- Use of the Performance Diagnostics Toolkit to measure and analyse the behaviour of your system and correlate any poor response times with the metrics from the server and OBIEE itself.
- Training, which is vital for enabling your staff to deliver optimal OBIEE performance. We work with your staff to help them understand the good practices to look for in design and diagnostics. Training is based on formal courseware, along with workshops built around examples from your OBIEE system where appropriate.
Let Us Help You, Today!
Get in touch now to find out how we can help improve your OBIEE system’s performance. We offer a free, no-obligation sample of the Performance Analysis Report, built on YOUR data.
Don’t just call us when performance has already become problematic – we can help you assess your OBIEE system for optimal performance at all stages of the build process. Gaining a clear understanding of the performance profile of your system, and any potential issues, gives you the confidence and ability to understand any risks to the success of your project – before it’s too late.
Forays into Kafka – 01 : Logstash transport / centralisation
The holy trinity of Elasticsearch, Logstash, and Kibana (ELK) is a powerful set of tools for data discovery and systems diagnostics. In a nutshell, they enable you to easily search through your log files, slice & dice them visually, drill into problem timeframes, and generally be the boss of knowing where your application’s at.
Getting application logs into ELK in the most basic configuration means doing the processing with Logstash local to the application server, but this has two overheads – the CPU required to do the processing, and (assuming you have more than one application server) the management of multiple configurations and deployments across your servers. A more flexible and maintainable architecture is to ship logs from the application server to a separate ELK server with something like Logstash-forwarder (aka Lumberjack), and do all your heavy ELK-lifting there.
In this article I’m going to demonstrate an alternative way of shipping and centralising your logs for Logstash processing, with Apache Kafka.
Kafka is a “publish-subscribe messaging rethought as a distributed commit log”. What does that mean in plainer English? My over-simplified description would be that it is a tool that:
- Enables one or more components, local or across many machines, to send messages (of any format) to …
- …a centralised store, which may be holding messages from other applications too…
- …from where one or more consumers can independently opt to pull these messages in exact order, either as they arrive, batched, or ‘rewinding’ to a previous point in time on demand.
Kafka has been designed from the outset to be distributed and fault-tolerant, and for very high performance (low latency) too. For a good introduction to Kafka and its concepts, the introduction section of the documentation is a good place to start, as is Gwen Shapira’s Kafka for DBAs presentation.
If you’re interested in reading more about Kafka, the article that really caught my imagination with its possibilities was by Martin Kleppmann in which he likens (broadly) Kafka to the unix Pipe concept, being the joiner between components that never had to be designed to talk to each other specifically.
Kafka gets a lot of press in the context of “Big Data”, Spark, and the like, but it also makes a lot of sense as a “pipe” between slightly more ‘mundane’ systems such as Logstash…
Overview
In this article we’re using Kafka at its very simplest – one Producer, one Topic, one Consumer. But hey, if it works and it’s a good use of the technology, who cares if it’s not a gazillion-message-per-second throughput to give us bragging rights on Hacker News?
We’re going to run Logstash twice: once on the application server, simply to get the logfiles out and into Kafka, and then again to pull the data from Kafka and process it at our leisure:
Once Logstash has processed the data we’ll load it into Elasticsearch, from where we can do some nice analysis against it in Kibana.
Build
This article was written based on three servers:
- Application server (OBIEE)
- Kafka server
- ELK server
In practice, Kafka could run on the ELK server if you needed it to and throughput was low. If things got busier, splitting them out would make sense, as would scaling out Kafka and ELK across multiple nodes each for capacity and resilience. Both Kafka and Elasticsearch are designed to be run distributed, and make it easy to do so.
The steps below show how to get the required software installed and running.
Networking and Host Names
Make sure that each host has a proper hostname (not ‘demo’) and that it can be resolved from all the other hosts being used. Liberal use of /etc/hosts – hardcoding IP/hostname pairs and copying the file to each host – is one way around this in a sandbox environment. In the real world, use DNS CNAMEs to resolve the static IP of each host.
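For example, a sandbox /etc/hosts copied to each machine might look something like the following – the addresses for ubuntu-02 and ubuntu-03 match those seen later in this article, while the one for sampleappv406 is purely hypothetical:
192.168.56.201   sampleappv406   # hypothetical address for the OBIEE application server
192.168.56.202   ubuntu-02       # Kafka host
192.168.56.203   ubuntu-03       # ELK host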
Make sure that the hostname is accessible from all other machines in use. That is, if you type hostname on one machine:
rmoff@ubuntu-03:~$ hostname
ubuntu-03
Make sure that you can ping it from another machine:
rmoff@ubuntu-02:/opt$ ping ubuntu-03
PING ubuntu-03 (192.168.56.203) 56(84) bytes of data.
64 bytes from ubuntu-03 (192.168.56.203): icmp_seq=1 ttl=64 time=0.530 ms
64 bytes from ubuntu-03 (192.168.56.203): icmp_seq=2 ttl=64 time=0.287 ms
[...]
and use netcat to hit a particular port (assuming that something’s listening on that port):
rmoff@ubuntu-02:/opt$ nc -vz ubuntu-03 9200
Connection to ubuntu-03 9200 port [tcp/*] succeeded!
Application Server – log source (“sampleappv406”)
This is going to be the machine from which we’re collecting logs. In my example it’s OBIEE that’s generating the logs, but it could be any application. All we need to install is Logstash, which is going to ship the logs – unprocessed – over to Kafka. Because we’re working with Kafka, it’s also useful to have the console scripts (that ship with the Kafka distribution) available as well, but strictly speaking, we don’t need to install Kafka on this machine.
- Download (Kafka is optional, but it’s useful to have the console scripts there for testing)
wget https://download.elastic.co/logstash/logstash/logstash-1.5.4.zip
wget http://apache.mirror.anlx.net/kafka/0.8.2.0/kafka_2.10-0.8.2.0.tgz
- Install
unzip logstash*.zip
tar -xf kafka*
sudo mv kafka* /opt
sudo mv logstash* /opt
Kafka host (“ubuntu-02”)
This is our kafka server, where Zookeeper and Kafka run. Messages are stored here before being passed to the consumer.
- Download
wget http://apache.mirror.anlx.net/kafka/0.8.2.0/kafka_2.10-0.8.2.0.tgz
- Install
tar -xf kafka*
sudo mv kafka* /opt
- Configure: If there’s any funny business with your networking, such as a hostname on your Kafka server that won’t resolve externally, make sure you set the advertised.host.name value in /opt/kafka*/config/server.properties to a hostname/IP for the Kafka server that can be connected to externally (a minimal example follows this list).
- Run: Use separate sessions, or even better, screen, to run both of these concurrently:
- Zookeeper
cd /opt/kafka*
bin/zookeeper-server-start.sh config/zookeeper.properties
- Kafka Server
cd /opt/kafka*
bin/kafka-server-start.sh config/server.properties
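As a minimal sketch of that configuration change – assuming ubuntu-02 is the name that the other machines can actually resolve – the relevant line in /opt/kafka*/config/server.properties would end up looking like this:
# Advertise a hostname/IP that producers and consumers can reach from outside this box
advertised.host.name=ubuntu-02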
ELK host (“ubuntu-03”)
All the logs from the application server (“sampleappv406” in our example) are destined for here. We’ll do post-processing on them in Logstash to extract lots of lovely data fields, store them in Elasticsearch, and produce some funky interactive dashboards with Kibana. If, for some bizarre reason, you didn’t want to use Elasticsearch and Kibana but had some other target for your logs after Logstash had parsed them, you could use one of the many other output plugins for Logstash.
- Download (Kafka is optional, but it’s useful to have the console scripts there for testing)
wget https://download.elastic.co/kibana/kibana/kibana-4.1.2-linux-x64.tar.gz
wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.2.zip
wget https://download.elastic.co/logstash/logstash/logstash-1.5.4.zip
wget http://apache.mirror.anlx.net/kafka/0.8.2.0/kafka_2.10-0.8.2.0.tgz
- Install
tar -xf kibana*
unzip elastic*.zip
unzip logstash*.zip
tar -xf kafka*
sudo mv kafka* /opt
sudo mv kibana* /opt
sudo mv elastic* /opt
sudo mv logstash* /opt
# Kopf is an optional, but very useful, Elasticsearch admin web GUI
/opt/elastic*/bin/plugin --install lmenezes/elasticsearch-kopf
- Run: Use separate sessions, or even better, screen, to run both of these concurrently:
/opt/elastic*/bin/elasticsearch
/opt/kibana*/bin/kibana
Configuring Kafka
Create the topic. This can be run from any machine with the Kafka console tools available. The important thing is that you specify the --zookeeper option correctly, so that it knows where to find the Zookeeper instance that holds the Kafka cluster’s metadata.
cd /opt/kafka*
bin/kafka-topics.sh --create --zookeeper ubuntu-02:2181 --replication-factor 1 --partitions 1 --topic logstash
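If you want to double-check what’s just been created, the same console tooling will also describe the topic (this step is optional):
# Show the partition count, replication factor and leader broker for the topic
bin/kafka-topics.sh --describe --zookeeper ubuntu-02:2181 --topic logstash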
Smoke test
- Having created the topic, check that the other nodes can connect to zookeeper and see it. The point is less about viewing the topic than about checking that the connectivity between the machines is working.
$ cd /opt/kafka*
$ ./bin/kafka-topics.sh --list --zookeeper ubuntu-02:2181
logstash
If you get an error then check that the host resolves and the port is accessible:
$ nc -vz ubuntu-02 2181
found 0 associations
found 1 connections:
     1: flags=82<CONNECTED,PREFERRED>
        outif vboxnet0
        src 192.168.56.1 port 63919
        dst 192.168.56.202 port 2181
        rank info not available
        TCP aux info available
Connection to ubuntu-02 port 2181 [tcp/eforward] succeeded!
- Set up a simple producer / consumer test
- On the application server node, run a script that will be the producer, sending anything you type to the kafka server:
cd /opt/kafka*
./bin/kafka-console-producer.sh --broker-list ubuntu-02:9092 --topic logstash
(I always get the warning WARN Property topic is not valid (kafka.utils.VerifiableProperties); it seems to be harmless so ignore it…) This will sit waiting for input; you won’t get the command prompt back.
- On the ELK node, run a script that will be the consumer:
cd /opt/kafka*
./bin/kafka-console-consumer.sh --zookeeper ubuntu-02:2181 --topic logstash
- Now go back to the application server node, enter some text and press enter. You should see the same text appear shortly afterwards on the ELK node. This demonstrates Producer -> Kafka -> Consumer.
- Optionally, run kafka-console-consumer.sh on a second machine (either the kafka host itself, or on a Mac where you’ve run brew install kafka). Now when you enter something on the Producer, you should see both Consumers receive it.
If the two above tests work, then you’re good to go. If not, then you’ve got to sort this out first because the later stuff sure isn’t going to.
Configuring Logstash on the Application Server (Kafka Producer)
Logstash has a very simple role on the application server – to track the log files that we want to collect, and pass new content in the log file straight across to Kafka. We’re not doing any fancy parsing of the files this side – we want to be as light-touch as possible. This means that our Logstash configuration is dead simple:
input {
    file {
        path => ["/app/oracle/biee/instances/instance1/diagnostics/logs/*/*/*.log"]
    }
}
output {
    kafka {
        broker_list => 'ubuntu-02:9092'
        topic_id => 'logstash'
    }
}
Notice the wildcards in the path variable – in this example we’re going to pick up everything related to the OBIEE system components here, so in practice you may want to restrict it down a bit at least during development. You can specify multiple path patterns by comma-separating them within the square brackets, and you can use the exclude parameter to (…drum roll…) exclude specific paths from the wildcard match.
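As a rough sketch of that, a more selective input section might look like the following – the component paths and exclude pattern here are purely illustrative, so adjust them to whatever your own log directories contain:
input {
    file {
        # Illustrative: watch only the BI Server and Presentation Services logs
        path => ["/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIServerComponent/*/*.log",
                 "/app/oracle/biee/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/*/*.log"]
        # exclude patterns are matched against the filename, not the full path
        exclude => ["*.out"]
    }
}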
If you now run Logstash with the above configuration (assuming it’s saved as logstash-obi-kafka-producer.conf):
/opt/logstash*/bin/logstash -f logstash-obi-kafka-producer.conf
Logstash will now sit and monitor the file paths that you’ve given it. If they don’t exist, it will keep checking. If they did exist, and got deleted and recreated, or truncated – it’ll still pick up the differences. It’s a whole bunch more smart than your average bear^H^H^H^H tail -f.
If you happen to have left your Kafka console consumer running you might be in for a bit of a shock, depending on how much activity there is on your application server:
Talk about opening the floodgates!
Configuring Logstash on the ELK server (Kafka Consumer)
Let’s give all these lovely log messages somewhere to head. We’re going to use Logstash again, but on the ELK server this time, and with the Kafka input plugin:
input {
    kafka {
        zk_connect => 'ubuntu-02:2181'
        topic_id => 'logstash'
    }
}
output {
    stdout {
        codec => rubydebug
    }
}
Save and run it:
/opt/logstash*/bin/logstash -f logstash-obi-kafka-consumer.conf
and assuming the application server is still writing new log content we’ll get it written out here:
So far we’re doing nothing fancy at all – simply dumping to the console whatever messages we receive from Kafka. In effect, it’s the same as the kafka-console-consumer.sh script that we ran as part of the smoke test earlier. But now that we’ve got the messages coming in to Logstash we can do some serious processing on them with grok and the like (something I discuss and demonstrate in an earlier article) to pull out meaningful data fields from each log message. The console is not the best place to write all this to – Elasticsearch is! So we specify that as the output plugin instead. An extract of our configuration looks something like this now:
input {
    kafka {
        zk_connect => 'ubuntu-02:2181'
        topic_id => 'logstash'
    }
}
filter {
    grok {
        match => ["file", "%{WLSSERVER}"]
        [...]
    }
    geoip { source => "saw_http_RemoteIP" }
    [...]
}
output {
    elasticsearch {
        host => "ubuntu-03"
        protocol => "http"
    }
}
Note the [...] bits in the filter section – this is all the really cool stuff where we wring every last bit of value from the log data and split it into lots of useful data fields…which is why you should get in touch with us so we can help YOU with your OBIEE and ODI monitoring and diagnostics solution!
Advert break over, back to the blog. We’ve set up the new hyper-cool config file, we’ve primed the blasters, we’ve set the “lasers” to 11 … we hit run … and …
…nothing happens. “Logstash startup completed” is the last sign of visible life we see from this console. Checking our kafka-console-consumer.sh we can still see the messages are flowing through:
But Logstash remains silent? Well, no – it’s doing exactly what we told it to, which is to send all output to Elasticsearch (and nowhere else). Don’t believe me? Add the stdout output (the console, in this case) back into the output stanza of the configuration file:
output {
    elasticsearch {
        host => "ubuntu-03"
        protocol => "http"
    }
    stdout {
        codec => rubydebug
    }
}
(Did I mention Logstash is mega-powerful yet? You can combine, split, and filter data streams however you want, from and to multiple sources. Here we’re sending it to both Elasticsearch and stdout, but it could easily be sending it to Elasticsearch and then conditionally to email, or PagerDuty, or enriched data back to Kafka, or … you get the idea.)
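To give a flavour of that, here’s a rough sketch – not part of the configuration built in this article – of conditionally routing anything that looks like an error to a second, hypothetical Kafka topic as well as to Elasticsearch:
output {
    elasticsearch {
        host => "ubuntu-03"
        protocol => "http"
    }
    # Hypothetical: also push error-looking messages onto their own Kafka topic
    if "ERROR" in [message] {
        kafka {
            broker_list => 'ubuntu-02:9092'
            topic_id => 'logstash-errors'
        }
    }
}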
Re-run Logstash with the updated configuration and sure enough, it’s mute no longer:
(this snippet gives you an idea of the kind of data fields that can be extracted from a log file – and this is one of the less interesting ones, difficult to imagine, I know).
Analysing OBIEE Log Data in Elasticsearch with Kibana
The kopf plugin provides a nice web frontend to some of the administrative functions of Elasticsearch, including a quick overview of the state of a cluster and number of documents. Using it we can confirm we’ve got some data that’s been loaded from our Logstash -> Kafka -> Logstash pipeline:
and now in Kibana:
You can read a lot more about Kibana, including the (minimal) setup required to get it to show data from Elasticsearch, in other articles that I’ve written here, here, and here.
Using Kibana we can get a very powerful but simple view over the data we extracted from the log files, showing things like response times, errors, hosts, data models used, and so on:
MOAR Application Servers
Let’s scale this thing out a bit, and add a second application server into the mix. All we need to do is replicate the Logstash install and configuration on the second application server – everything else remains the same. Doing this we start to see the benefit of centralising the log processing, and decoupling it from the application server.
Set the Logstash ‘producer’ running on the second application server, and the data starts passing through, straight into Elasticsearch and Kibana at the other end, no changes needed.
Reprocessing data
One of the appealing features of Kafka is that it stores data for a period of time. This means that consumers can stream or batch as they desire, and that they can also reprocess data. By acting as a durable ‘buffer’ for the data it means that recovering from a client crash, such as a Logstash failure like this:
Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB)
Specify -w for full OutOfMemoryError stack trace
is really simple – you just restart Logstash and it picks up processing from where it left off. Because Kafka tracks the last message that a consumer (Logstash in this case) read, it can scroll back through its log to pass to the consumer just messages that have accumulated since that point.
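You can see this rewind behaviour for yourself with the console consumer from the smoke test – the --from-beginning flag replays whatever is still retained on the topic, rather than just new messages:
cd /opt/kafka*
# Replay everything still held on the topic, not just messages arriving from now on
./bin/kafka-console-consumer.sh --zookeeper ubuntu-02:2181 --topic logstash --from-beginning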
Another benefit of the data being available in Kafka is the ability to reprocess data because the processing itself has changed. A pertinent example of this is with Logstash. The processing that Logstash can do on logs is incredibly powerful, but it may be that a bug is there in the processing, or maybe an additional enrichment (such as geoip) has been added. Instead of having to go back and bother the application server for all its logs (which may have since been housekept away) we can just rerun our Logstash processing as the Kafka consumer and re-pull the data from Kafka. All that needs doing is telling the Logstash consumer to reset its position in the Kafka log from which it reads:
input {
    kafka {
        zk_connect => 'ubuntu-02:2181'
        topic_id => 'logstash'
        # Use the following two if you want to reset processing
        reset_beginning => 'true'
        auto_offset_reset => 'smallest'
    }
}
Kafka will keep data for a length of time, or up to a size of data, as defined in the log.retention.minutes and log.retention.bytes configuration settings respectively. This is set globally by default to 7 days (and no size limit), and can be changed globally or per topic.
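As an illustration only (the values here aren’t recommendations), keeping a day’s worth of messages and capping each partition’s log at roughly 1GB would look like this in the broker’s server.properties:
# Retain messages for 24 hours...
log.retention.minutes=1440
# ...or until a partition's log exceeds ~1GB, whichever limit is hit first
log.retention.bytes=1073741824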
Conclusion
Logstash with Kafka is a powerful and easy way to stream your application log files off the application server with minimal overhead and then process them on a dedicated host. Elasticsearch and Kibana are a great way to visualise, analyse, and diagnose issues within your application’s log files.
Kafka enables you to loosely couple your application server to your monitoring and diagnostics with minimal overhead, whilst adding the benefit of log replay if you want to reprocess them.