In this post, we will discuss Flume agent configuration and setup for collecting the output of a command-line tool into a flat file.
We will use the Exec source, the File channel and the File Roll sink in the configuration of our agent. Let's name our agent Agent2. We will focus on agent configuration and deployment first, and then discuss each component and its additional properties in more detail toward the end of the post.
Flume Agent – Exec Source, File Roll Sink and File Channel:
Let's create an agent named Agent2 in the flume.conf properties file under the Flume_Home/conf directory. We can either append the new agent's properties to the existing flume.conf file or create a new file that contains only our agent.
Add the properties below to the flume.conf file.
### Agent2 Configuration - Exec Source, File Roll Sink & File Channel ###
# Name the components on this agent
Agent2.sources = exec-source
Agent2.channels = file-channel
Agent2.sinks = file-sink
# Describe/configure Source
Agent2.sources.exec-source.type = exec
Agent2.sources.exec-source.command = tail -F /var/log/syslog
# Describe the sink
Agent2.sinks.file-sink.type = FILE_ROLL
Agent2.sinks.file-sink.sink.directory = /usr/lib/flume/agent/files/
Agent2.sinks.file-sink.sink.rollInterval = 0
# Use a channel which buffers events in file
Agent2.channels.file-channel.type = file
Agent2.channels.file-channel.checkpointDir = /var/log/flume/checkpoint/
Agent2.channels.file-channel.dataDirs = /var/log/flume/data/
# Bind the source and sink to the channel
Agent2.sources.exec-source.channels = file-channel
Agent2.sinks.file-sink.channel = file-channel
Make sure of the following before starting the agent.
- The directory given in the Agent2.sinks.file-sink.sink.directory property must already exist, and the flume user must have write access to it. Flume will not create this directory on the fly, even if the flume user has permission to create it.
- The flume user must have write access to the Agent2.channels.file-channel.checkpointDir and Agent2.channels.file-channel.dataDirs locations. These directories can be created before starting the agent; if they are missing and the flume user has write access to the given path, the Flume JVM process will create them on the fly.
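The two checks above can be scripted before the first start. The following is a minimal sketch rather than part of the original setup: it should be run as the user that will start the agent, and the PREFIX variable is a hypothetical helper (not a Flume setting) that lets the script be tried in a scratch location without root.

```shell
#!/bin/sh
# Directories taken from the Agent2 configuration above.
SINK_DIR=/usr/lib/flume/agent/files
CHECKPOINT_DIR=/var/log/flume/checkpoint
DATA_DIR=/var/log/flume/data

# PREFIX is a hypothetical dry-run helper: it defaults to a scratch location.
# Export PREFIX="" to operate on the real paths instead.
PREFIX="${PREFIX-/tmp/flume-demo}"

for dir in "$SINK_DIR" "$CHECKPOINT_DIR" "$DATA_DIR"; do
    # Create the directory (and any missing parents) if it does not exist yet.
    mkdir -p "${PREFIX}${dir}"
    # Report whether the current user can write to it.
    if [ -w "${PREFIX}${dir}" ]; then
        echo "ok: ${PREFIX}${dir} is writable"
    else
        echo "warning: ${PREFIX}${dir} is not writable by $(id -un)" >&2
    fi
done
```

On a real host the directories would usually be created with elevated privileges and then handed over to the agent's user (for example with chown); the user name depends on how Flume was installed.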
Start Flume Agent:
Now start the Flume agent with the command below in a terminal:
$ flume-ng agent --conf $FLUME_CONF_DIR --conf-file $FLUME_CONF_DIR/flume.conf --name Agent2
Below is a screenshot of the started agent:
After the agent has been running for some time, stop it by pressing Ctrl+C.
Now open the output directory in another terminal; we can see new files created under the target directory. Below is a screenshot of the new files and their contents.
In the above screen, we can observe the log messages copied from the /var/log/syslog file into the 1411*-1 file. Because rolling is disabled, the Flume agent keeps this file open for writing and closes it only when the agent is stopped with Ctrl+C.
We can also observe the files created under the File channel's checkpoint and data directories.
So we have successfully configured Agent2 with the Exec source, File Roll sink and File channel. Now we can take a closer look at each component used in this agent.
Details of Components:
Exec Source:
The Exec source runs a given Unix command when the agent starts and turns the command's standard output into Flume events, one event per line. It expects the command to keep producing data continuously. If the process exits for any reason, the source also exits and produces no further data (unless the restart property described below is enabled). This makes the source best suited to commands that produce a continuous stream of data.
In our example above, we used the following Unix command.
$ tail -F /var/log/syslog
The tail command displays the contents of a file starting from the end. Below are some examples of its usage.
$ tail -f /var/log/syslog
$ tail -n50 /var/log/mail.log
$ tail -F /var/log/lighttpd/error.log
By default it displays the last 10 lines of a file. It accepts the following arguments:
-f, --follow: output appended data as the file grows
-n, --lines: output the last N lines, instead of the last 10; or use +N to output lines starting with the Nth
--retry: keep trying to open a file even if it is inaccessible when tail starts or if it becomes inaccessible later
-F: same as --follow=name --retry, so tail keeps following the file by name even after it has been rotated
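The line-selection options are easy to try on a scratch file. The sketch below is only an illustration: the file and its contents are made up, and the following options are left out because they never return on their own.

```shell
# Build a small scratch file with 12 numbered lines.
demo_file=$(mktemp)
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    echo "line $i"
done > "$demo_file"

# No options: the last 10 lines, i.e. "line 3" through "line 12".
default_out=$(tail "$demo_file")

# -n 3: only the last 3 lines.
last_three=$(tail -n 3 "$demo_file")

# -n +11: start at the 11th line and print through the end of the file.
from_eleven=$(tail -n +11 "$demo_file")

echo "$last_three"
echo "$from_eleven"

# Remove the scratch file again.
rm -f "$demo_file"
```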
We can also specify additional properties on the Exec source. Below is a list of the properties allowed on the Exec source. Besides the channel binding, the required properties are type and command.
|Property||Default||Description|
|type||–||The component type name, needs to be exec|
|command||–||The command to execute|
|shell||–||A shell invocation used to run the above command, e.g. /bin/bash -c|
|restartThrottle||10000||Amount of time (in millis) to wait before attempting a restart|
|restart||FALSE||Whether the executed command should be restarted if it dies|
|logStdErr||FALSE||Whether the command's stderr should be logged|
|batchSize||20||The max number of lines to read and send to the channel at a time|
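The 'shell' property is used to invoke the command through a command shell (such as Bash or the C shell). The command is passed to the shell as an argument, which lets it rely on shell features such as wildcards, back ticks and pipes.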
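As an illustration of these optional properties, the fragment below extends the Exec source of Agent2. It is a sketch under assumptions: the 5-second throttle and the grep filter are arbitrary examples, not recommendations from this setup.

```properties
# Restart the command if it dies, waiting 5 seconds between attempts
Agent2.sources.exec-source.restart = true
Agent2.sources.exec-source.restartThrottle = 5000
# Also log whatever the command writes to stderr
Agent2.sources.exec-source.logStdErr = true
# Run the command through Bash so that the pipe below is interpreted by a shell
Agent2.sources.exec-source.shell = /bin/bash -c
# Illustrative filter: forward only lines that mention "error"
Agent2.sources.exec-source.command = tail -F /var/log/syslog | grep --line-buffered -i error
```

Without the shell property, the pipe character would not be interpreted by a shell, so a pipeline like this one would not work as intended.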
The Exec source cannot guarantee delivery: if an event cannot be put into the channel, the data is lost. Consider the tail -F [log file] use case, where an application writes to a log file on disk and Flume tails the file, sending each line as an event. If the channel fills up and Flume cannot accept an event, Flume has no way of telling the application writing the log file that it needs to retain the log. There is no guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling Directory Source or direct integration with Flume via the SDK.
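For comparison, a Spooling Directory Source could replace the Exec source in Agent2 roughly as follows. This is a minimal sketch: the source name spool-source and the directory /var/log/flume/spool are hypothetical, and files placed in that directory are expected to be complete and no longer modified.

```properties
# Replace the Exec source with a Spooling Directory source
Agent2.sources = spool-source
Agent2.sources.spool-source.type = spooldir
# Hypothetical directory that finished log files are moved into
Agent2.sources.spool-source.spoolDir = /var/log/flume/spool
# Bind the source to the existing File channel
Agent2.sources.spool-source.channels = file-channel
```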
File Roll Sink:
The File Roll sink writes the agent's output to files on the local file system, in the directory specified in the configuration file. By default, Flume rotates (rolls) to a new file every 30 seconds. In our setup, we have disabled this feature so that all events land in a single file and are easier to follow.
Besides the channel binding, the required properties are type and sink.directory.
|Property||Default||Description|
|type||–||The component type name, needs to be file_roll|
|sink.directory||–||The directory where files will be stored|
|sink.rollInterval||30||Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file|
|sink.serializer||TEXT||Other possible options include avro_event or the FQCN of an implementation of EventSerializer.Builder interface.|
|batchSize||100||The max number of events to take from the channel and write in one batch|
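If one ever-growing file is not wanted, rolling can be switched back on. The fragment below is a sketch that reuses the sink of Agent2; the 5-minute interval is an arbitrary example value.

```properties
Agent2.sinks.file-sink.type = file_roll
Agent2.sinks.file-sink.sink.directory = /usr/lib/flume/agent/files/
# Start a new output file every 300 seconds (the value is in seconds)
Agent2.sinks.file-sink.sink.rollInterval = 300
# Write plain text events, which is also the default serializer
Agent2.sinks.file-sink.sink.serializer = TEXT
Agent2.sinks.file-sink.channel = file-channel
```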
File Channel:
The File channel stores events in files on the local file system. By default, the File channel uses checkpoint and data directory paths inside the user's home directory, as shown in the table below. As a result, if more than one File channel instance is active within the agent, only one of them will be able to lock the directories, and the initialization of the other channels will fail. It is therefore necessary to provide explicit paths for every configured channel, preferably on different disks.
The only required property is type.
|Property||Default||Description|
|type||–||The component type name, needs to be file|
|checkpointDir||~/.flume/file-channel/checkpoint||The directory where checkpoint file will be stored|
|useDualCheckpoints||FALSE||Backup the checkpoint. If this is set to true, backupCheckpointDir must be set|
|backupCheckpointDir||–||The directory where the checkpoint is backed up to.|
|dataDirs||~/.flume/file-channel/data||The directory where log files will be stored|
|transactionCapacity||1000||The maximum size of transaction supported by the channel|
|checkpointInterval||30000||Amount of time (in millis) between checkpoints|
|maxFileSize||2146435071||Max size (in bytes) of a single log file|
|minimumRequiredSpace||524288000||Minimum required free space (in bytes). To avoid data corruption, File Channel stops accepting take/put requests when free space drops below this value|
|capacity||1000000||Maximum capacity of the channel|
|keep-alive||3||Amount of time (in sec) to wait for a put operation|
|write-timeout||3||Amount of time (in sec) to wait for a write operation|
|checkpoint-timeout||600||Amount of time (in sec) to wait for a checkpoint|
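The point about explicit paths can be sketched as follows for an agent with two File channels. The channel names and the /disk1 and /disk2 mount points are hypothetical; what matters is that no two channels share a checkpoint or data directory.

```properties
Agent2.channels = file-channel-1 file-channel-2

# First channel, with its own directories on one disk
Agent2.channels.file-channel-1.type = file
Agent2.channels.file-channel-1.checkpointDir = /disk1/flume/checkpoint
Agent2.channels.file-channel-1.dataDirs = /disk1/flume/data

# Second channel, with separate directories on another disk
Agent2.channels.file-channel-2.type = file
Agent2.channels.file-channel-2.checkpointDir = /disk2/flume/checkpoint
Agent2.channels.file-channel-2.dataDirs = /disk2/flume/data
```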