Filling your data lake with log messages: the syslog-ng Hadoop (HDFS) destination

Datetime: 2016-08-23 01:48:44          Topic: Hadoop, HDFS

Petabytes of data are now collected into huge data lakes around the world, and Hadoop is the technology enabling this. For quite some time, syslog-ng could write logs to Hadoop only through workarounds, such as mounting HDFS through FUSE. The new Java-based HDFS destination driver provides both better performance and more reliability. Instead of relying on our own implementation, syslog-ng uses the official HDFS libraries from http://hadoop.apache.org/, which makes it a future-proof solution.

Getting started with Hadoop

When I first tested Hadoop, I did it the hard way, following the long and detailed instructions on the Apache Hadoop website and configuring everything manually. It worked, but it was quite a long and tiring process. Fortunately, in the age of virtualization, configuration management and Docker, there are much easier ways to get started. Just download a Docker image, a Puppet manifest or a complete virtual machine, and you can have Hadoop up and running in your environment in a matter of minutes. Of course, while next-next-finish is good for testing, for a production environment you still need to gain some in-depth knowledge about Hadoop.

Hadoop is now an Apache project, as you might have already figured out from the above URL. On the other hand, there is now a huge ecosystem built around Hadoop, with many vendors providing their own Hadoop distribution with support, integration and additional management tools. Some of the largest Hadoop vendors are already certified to work with syslog-ng: we are a MapR Certified Technology Partner and Hortonworks HDP Certified, but Cloudera and others should work as well. On the syslog-ng side we tend to use the HDFS library (JAR files) from the Apache project, except for MapR, which uses a slightly different implementation. For the purposes of this blog I used the Hortonworks Sandbox as the HDFS server, which is a virtual machine for testing purposes with everything pre-installed. But anything I write here should be the same on the syslog-ng side for any other Hadoop implementation (except for MapR, as noted).

Benefits of using syslog-ng with Hadoop

You most likely already use syslog-ng on your Linux machines or, even without knowing it, on your network or storage devices. You can use it in the same way in a Big Data environment as well. The syslog-ng application can write log messages to the Hadoop Distributed File System (HDFS). However, it does more than simply collect syslog messages and write them to HDFS. The syslog-ng application can collect messages from several sources, then process and filter them before storing them on HDFS. This can simplify the architecture, lessen the load on both the storage and the processing side of the Hadoop infrastructure thanks to filtering, and ease the work of processing jobs, as they receive pre-processed messages.

Data collector

Based on its name, most people consider syslog-ng an application for collecting syslog messages. And that is partially right: syslog-ng can collect syslog messages from a large number of platform-specific sources (like /dev/log, journal, sun-streams, and so on). But syslog-ng can also read files, run applications and collect their standard output, read messages from sockets and pipes, or receive messages over the network. There is no need for a separate script or application to accomplish these tasks: syslog-ng can be used as a generic data collector that can greatly simplify the data pipeline.
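
A minimal sketch of what such a generic collector could look like in syslog-ng.conf is shown below. The source name and all paths are hypothetical and only serve as illustration; adjust them to your own environment.

```conf
source s_collect {
    # platform-specific local log sources (/dev/log, journal and so on)
    system();
    # messages generated by syslog-ng itself
    internal();
    # hypothetical application log file
    file("/var/log/myapp/app.log");
    # hypothetical script; syslog-ng runs it and reads its standard output
    program("/usr/local/bin/metrics.sh");
    # hypothetical named pipe
    pipe("/var/run/myapp.pipe");
};
```

All of these drivers feed the same source, so the rest of the configuration can treat them as a single stream of messages.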

There is a considerable number of devices that emit a large number of syslog messages to the network but cannot store them: routers, firewalls, network appliances. The syslog-ng application can collect these messages, even at high message rates, no matter whether they are transmitted using the legacy or the RFC5424 syslog protocol, over TCP, UDP or TLS.
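
As a rough illustration, a network source covering both protocol variants could look like the sketch below. The source name and the port numbers are assumptions chosen for this example. TLS is left out because it also needs a tls() block with key and certificate settings that depend on your environment.

```conf
source s_network {
    # legacy (BSD) syslog protocol over UDP and TCP
    network(transport("udp") port(514));
    network(transport("tcp") port(514));
    # RFC5424 syslog protocol over TCP
    syslog(transport("tcp") port(601));
};
```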

This means that application logs can be enriched with syslog and networking device logs, providing valuable context for operations teams. All of this is provided by a single application: syslog-ng.

Data processor

There are several ways to process data in syslog-ng. First of all, data is parsed. By default, one of the syslog parsers is used (either the legacy or the RFC5424 one), but it can be replaced by another parser, or the message content can be parsed further. Columnar data can be parsed with the CSV parser and free-form messages, like most syslog messages, with the PatternDB parser. There are parsers for JSON data and key-value pairs as well. You can read about the parsers in depth in Chapter 12 (Parsing and segmenting structured messages) and Chapter 13 (Processing message content with a pattern database) of the documentation.
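
The parsers mentioned above are declared as separate objects in the configuration. The sketch below shows one of each; the parser names, column names, prefixes and the pattern database path are all hypothetical.

```conf
# columnar data: split the message into three hypothetical columns
parser p_csv {
    csv-parser(
        columns("APP.USER", "APP.ACTION", "APP.RESULT")
        delimiters(",")
    );
};

# JSON data and key-value pairs; the prefixes are arbitrary choices
parser p_json { json-parser(prefix(".json.")); };
parser p_kv { kv-parser(prefix(".kv.")); };

# free-form messages matched against a pattern database (hypothetical path)
parser p_patterndb { db-parser(file("/var/lib/syslog-ng/patterndb.xml")); };
```

A parser takes effect once it is referenced in a log path, for example with parser(p_json) between the source and the destination.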

Messages can be rewritten, for example by overwriting credit card numbers or user names due to compliance or privacy regulations.
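
For example, a rewrite rule along the following lines could mask card numbers. This is only a sketch: the rule name is hypothetical, and the regular expression assumes that card numbers appear as four groups of four digits separated by dashes, which will not cover every real-world format.

```conf
rewrite r_mask_cards {
    # replace every match in the message body with a fixed mask
    subst(
        "[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4}",
        "XXXX-XXXX-XXXX-XXXX",
        value("MESSAGE"),
        flags("global")
    );
};
```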

Data can also be enriched in multiple ways. The PatternDB parser can create additional name-value pairs based on message content. The GeoIP parser can add geographical location based on the IP address contained in the log message.
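
A GeoIP lookup could be configured roughly as follows. The sketch assumes a syslog-ng version that ships the geoip() parser and a GeoIP database at the path shown; both the path and the choice of the HOST macro as the address to look up are assumptions made for this example.

```conf
parser p_geoip {
    # look up the address in ${HOST} and store the results
    # in name-value pairs starting with "geoip."
    geoip(
        "${HOST}",
        prefix("geoip.")
        database("/usr/share/GeoIP/GeoLiteCity.dat")
    );
};
```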

It is also possible to completely reformat messages using templates based on the requirements or the needs of applications processing the log data. Why send all fields from a web server log if only a third of them are used on the processing end?
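
As a sketch of this idea, the destination below writes only four fields of each message. The destination name and the file name are hypothetical, the connection details are the sample values used elsewhere in this post, and the choice of macros is arbitrary.

```conf
# keep only the timestamp, host, program name and message text
destination d_hdfs_compact {
    hdfs(
        client_lib_dir("/opt/hadoop/libs")
        hdfs_uri("hdfs://10.20.30.40:8020")
        hdfs_file("/user/log/compact.txt")
        template("${ISODATE} ${HOST} ${PROGRAM} ${MESSAGE}\n")
    );
};
```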

All of this can be done close to the message source, anywhere on your syslog-ng clients, relays or servers, which can significantly lessen the load on your Hadoop infrastructure.

Data filtering

Unless you really want to forward all collected data, you will use one or more filters in syslog-ng. Even if you store everything, you will most likely store different logs in different files. There are filter functions for both message content and message parameters, such as application name or message priority. All of these can be freely combined using Boolean operators, making very complex filters possible. Filters serve two major purposes:

  • only relevant messages get through, the rest can be discarded
  • messages can be routed to the right destinations

Either of these can reduce resource usage, and therefore the load, on both storage nodes and data processing jobs.
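
Both uses can be sketched in a few lines of configuration. All names, paths and filter conditions below are hypothetical. With these two log paths, a message that matches neither filter is not stored anywhere.

```conf
source s_local { system(); internal(); };

# message parameters (program name, priority) combined with message content
filter f_ssh_failed { program("sshd") and message("Failed password"); };
filter f_errors { level(err..emerg) and not program("cron"); };

destination d_ssh { file("/var/log/filtered/ssh.log"); };
destination d_errors { file("/var/log/filtered/errors.log"); };

# each log path routes the matching messages to its own destination
log { source(s_local); filter(f_ssh_failed); destination(d_ssh); };
log { source(s_local); filter(f_errors); destination(d_errors); };
```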

How the syslog-ng HDFS driver works

The last step on the syslog-ng side is to store the collected, processed and filtered log messages on HDFS. To do that, you need to configure an HDFS destination in your syslog-ng.conf (or in a new .conf file under /etc/syslog-ng/conf.d if you have the include feature configured). The basic configuration is very simple:

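# Store messages on HDFS:
#   client_lib_dir: where the required Java classes (JAR files) are located
#   hdfs_uri:       address of the HDFS server
#   hdfs_file:      path and name of the log file on HDFS
destination d_hdfs {
    hdfs(
        client_lib_dir("/opt/hadoop/libs")
        hdfs_uri("hdfs://10.20.30.40:8020")
        hdfs_file("/user/log/logfile.txt")
    );
};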

The client_lib_dir option is a list of directories where the required Java classes are located. The hdfs_uri option sets the URI of the HDFS server in hdfs://hostname:port format. For MapR, replace hdfs:// with maprfs://. The last mandatory option is hdfs_file, which sets the path and name of the log file. For additional options, check the documentation.
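
A destination does nothing on its own: it has to be connected to a source in a log path. The following sketch assumes the d_hdfs destination defined above and a hypothetical source named s_local. It also assumes that the rest of syslog-ng.conf (version declaration and includes) is already in place.

```conf
source s_local { system(); internal(); };

# send everything collected by s_local to HDFS
log { source(s_local); destination(d_hdfs); };
```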

Hadoop-specific considerations

There are some limitations when using the HDFS driver, due to how Hadoop and the client libraries work.

The first one is that, while appending to files can be enabled, it is still an experimental feature in HDFS. To work around this on the syslog-ng side, macros cannot be used in hdfs_file. In addition, a UUID is appended to the file name, and a new UUID is generated each time syslog-ng is reloaded or the HDFS client returns an error.
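
Because macros are not available in hdfs_file, one possible way to split logs into several HDFS files is to define a separate destination with a fixed file name for each stream and route messages with filters. The sketch below is only an illustration: the source, filter and destination names and the file names are hypothetical, and the connection details are the sample values from the basic configuration.

```conf
source s_local { system(); internal(); };

filter f_auth { facility(auth, authpriv); };
filter f_not_auth { not filter(f_auth); };

destination d_hdfs_auth {
    hdfs(
        client_lib_dir("/opt/hadoop/libs")
        hdfs_uri("hdfs://10.20.30.40:8020")
        hdfs_file("/user/log/auth.txt")
    );
};

destination d_hdfs_other {
    hdfs(
        client_lib_dir("/opt/hadoop/libs")
        hdfs_uri("hdfs://10.20.30.40:8020")
        hdfs_file("/user/log/other.txt")
    );
};

# fixed file names per stream instead of macros in hdfs_file
log { source(s_local); filter(f_auth); destination(d_hdfs_auth); };
log { source(s_local); filter(f_not_auth); destination(d_hdfs_other); };
```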

The other one is that you cannot define when log messages are flushed. The syslog-ng application cannot influence when the messages are actually written to disk.

Summary

There are endless debates whether it is better to store all of your logs in your data lake (skeptics call it the grave) or keep only those that are relevant for operation or business analytics. In either case there are many benefits of using syslog-ng as a data collection, processing and filtering tool in a Hadoop environment. A single application can collect log and other data from many sources, which complement each other well. Processing of your data can be done close to the source in efficient C code, lessening the load on the processing side of your Hadoop infrastructure. And before storing your messages to HDFS, you can use filters to throw away irrelevant messages or just to route your messages to the right files.

The Hadoop destination is available in both syslog-ng Open Source Edition (OSE) and the commercially supported Premium Edition (PE). Getting started with it takes a bit more configuration work than with other, non-Java-based destinations, but it is worth the effort.




