Flume Agent – Collect Data From Command to a Flat File

In this post, we discuss how to configure and set up a Flume agent that collects the output of a command-line tool into a flat file.

We will use the Exec source, File channel and File Roll sink types in the agent's configuration. Let's name our agent Agent2. We focus on agent configuration and deployment first; each component and its additional properties are discussed in more detail at the bottom of this post.

Flume Agent – Exec Source, File Roll Sink and File Channel:

Let's create an agent, Agent2, in the flume.conf properties file under the Flume_Home/conf directory. We can either append the new agent's properties to the bottom of the existing flume.conf file or create a new file containing only our agent.

Add the properties for Agent2's source, channel and sink to the flume.conf file.
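
A minimal sketch of what such a configuration could look like is given here. The component names file-channel and file-sink come from the property names mentioned in this post; the source name exec-source, the tailed file and every path under /tmp/flume-agent2-example are assumptions for illustration only. The snippet writes the properties to a scratch file so they can be reviewed before being copied into flume.conf.

```shell
#!/bin/sh
# Sketch only: writes an example Agent2 definition to a scratch file.
# BASE_DIR, the source name "exec-source" and the tailed log file are
# illustrative assumptions, not values taken from the original post.
BASE_DIR="${BASE_DIR:-/tmp/flume-agent2-example}"
CONF_FILE="$BASE_DIR/flume.conf"
mkdir -p "$BASE_DIR"

cat > "$CONF_FILE" <<EOF
# Name the components of this agent
Agent2.sources = exec-source
Agent2.channels = file-channel
Agent2.sinks = file-sink

# Exec source: run a command and turn each output line into an event
Agent2.sources.exec-source.type = exec
Agent2.sources.exec-source.command = tail -F /var/log/syslog
Agent2.sources.exec-source.channels = file-channel

# File channel: buffer events on local disk
Agent2.channels.file-channel.type = file
Agent2.channels.file-channel.checkpointDir = $BASE_DIR/channel/checkpoint
Agent2.channels.file-channel.dataDirs = $BASE_DIR/channel/data

# File roll sink: write events to local files; 0 disables rolling
Agent2.sinks.file-sink.type = file_roll
Agent2.sinks.file-sink.sink.directory = $BASE_DIR/sink
Agent2.sinks.file-sink.sink.rollInterval = 0
Agent2.sinks.file-sink.channel = file-channel
EOF

echo "Wrote $CONF_FILE"
```

Note that a source lists its channels with the plural property channels, while a sink is attached to exactly one channel with the singular property channel.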

Before starting the agent, make sure of the following:

  • The parent directory given in the Agent2.sinks.file-sink.sink.directory property must already exist, and the flume user must have write access to it. Flume will not create this directory on the fly, even if the flume user has permission to create it.
  • The flume user needs write access to the Agent2.channels.file-channel.checkpointDir and Agent2.channels.file-channel.dataDirs locations. These can be created before starting the agent; if they do not exist and the flume user has write access to the given path, the Flume JVM process creates them on the fly.
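
The checks above can be sketched as a small script, run as the user that will own the Flume process. The directory paths are placeholders (assumptions), to be replaced with the values used in flume.conf.

```shell
#!/bin/sh
# Sketch only: create the sink and channel directories up front and
# confirm the current user can write to them. The paths are placeholders;
# substitute the values used in flume.conf.
BASE_DIR="${BASE_DIR:-/tmp/flume-agent2-example}"
SINK_DIR="$BASE_DIR/sink"
CHECKPOINT_DIR="$BASE_DIR/channel/checkpoint"
DATA_DIR="$BASE_DIR/channel/data"

for dir in "$SINK_DIR" "$CHECKPOINT_DIR" "$DATA_DIR"; do
  mkdir -p "$dir"
  if [ -w "$dir" ]; then
    echo "ok: $dir is writable"
  else
    echo "error: $dir is not writable by $(id -un)" >&2
  fi
done
```

Creating all three directories in advance is the conservative choice: it is required for the sink directory and harmless for the channel directories.
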
Start Flume Agent:

Now start the Flume agent from a terminal using the flume-ng script.
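
A typical invocation is sketched here, wrapped in a function so that nothing starts until it is called. The install location /opt/flume and the INFO,console logger setting are assumptions; the agent name must match the Agent2 prefix used in flume.conf.

```shell
#!/bin/sh
# Sketch only: FLUME_HOME is an assumed install location.
FLUME_HOME="${FLUME_HOME:-/opt/flume}"

start_agent2() {
  # --conf       directory holding flume-env.sh and the logging settings
  # --conf-file  the properties file that defines Agent2
  # --name       must match the agent name used as the property prefix
  "$FLUME_HOME/bin/flume-ng" agent \
    --conf "$FLUME_HOME/conf" \
    --conf-file "$FLUME_HOME/conf/flume.conf" \
    --name Agent2 \
    -Dflume.root.logger=INFO,console
}

# Call start_agent2 to launch the agent in the foreground.
```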

[Screenshot of the started agent: Flume Agent2]

After the agent has run for some time, stop it by pressing Ctrl+C.

Now open the output directory in another terminal; new files have been created under the target directory.

[Screenshot of the new files and their contents: Flume Agent2 Output]

In the screenshot above, we can see the log messages copied from the /var/log/syslog file into the 1411*-1 file. The Flume agent keeps this file open for writing the whole time; it is closed only when the agent is stopped with Ctrl+C.
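
One way to inspect the sink output from a second terminal is sketched here. The helper name show_latest_file and the example directory are hypothetical; pass the directory configured as sink.directory.

```shell
#!/bin/sh
# Sketch only: print the name and the last lines of the newest file
# in a directory (for example, the file roll sink's output directory).
show_latest_file() {
  dir="$1"
  # ls -t sorts by modification time, newest first
  latest="$(ls -t "$dir" | head -n 1)"
  if [ -z "$latest" ]; then
    echo "no files in $dir"
    return 1
  fi
  echo "latest file: $latest"
  tail -n 5 "$dir/$latest"
}

# Usage (placeholder path): show_latest_file /tmp/flume-agent2-example/sink
```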

We can also see the files created under the File channel's checkpoint directory and data directory locations.

[Screenshot: Flume channel op]

So we have successfully configured Agent2 with an Exec source, a File Roll sink and a File channel. Now we can look at each component used in this agent in more detail.

Details of Components:

Exec Source:

The Exec source runs a given Unix command when it starts and captures the command's standard output as the input to the Flume agent. The process is expected to keep producing data continuously. If the process exits for any reason, the source also exits and produces no further data. This source is best suited to commands that produce a continuous stream of data.

In the example above, the source runs the Unix tail command against the /var/log/syslog file.

The tail command displays the contents of a file starting from the end. By default it displays the last 10 lines of the file; options control how many lines are shown and whether tail keeps following the file as it grows.
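
A few common ways of calling tail are sketched here on a throwaway sample file, so they can be tried without touching real logs. The sample file and its contents are made up for illustration.

```shell
#!/bin/sh
# Build a small sample file (the numbers 1 to 20, one per line).
SAMPLE="$(mktemp)"
seq 1 20 > "$SAMPLE"

# No options: print the last 10 lines (11 through 20 here).
LAST_TEN="$(tail "$SAMPLE")"
echo "$LAST_TEN"

# -n N: print only the last N lines.
LAST_THREE="$(tail -n 3 "$SAMPLE")"
echo "$LAST_THREE"

# -f keeps the file open and prints lines as they are appended;
# -F does the same but also reopens the file by name if it is rotated
# (supported by GNU and BSD tail). Both run until interrupted, so they
# are left commented out:
# tail -f "$SAMPLE"
# tail -F /var/log/syslog

rm -f "$SAMPLE"
```

For log files that are rotated, -F is usually the safer choice for a long-running source, because -f would keep following the old, renamed file.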

We can also specify additional properties on the Exec source. The properties allowed on the Exec source are listed below. Required properties are marked with an asterisk (*).

Property Name     Default   Description
type *            -         The component type name, needs to be exec
command *         -         The command to execute
shell             -         The shell used to run the command, e.g. /bin/bash
restartThrottle   10000     Amount of time (in millis) to wait before attempting a restart
restart           false     Whether the executed command should be restarted if it dies
logStdErr         false     Whether the command's stderr should be logged
batchSize         20        The max number of lines to read and send to the channel at a time

The ‘shell’ config is used to invoke the ‘command’ through a command shell (such as Bash or the C shell). This lets the command use shell features such as pipes and wildcards.
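
To see why a shell matters, the sketch here runs a pipeline through /bin/sh -c, which is roughly what happens when the shell property is set. The sample log lines are made up, and the property values shown in the comments use an assumed source name.

```shell
#!/bin/sh
# A command containing a pipe only works when a shell interprets it.
LOG="$(mktemp)"
printf 'INFO start\nERROR disk full\nINFO done\n' > "$LOG"

# The whole string after -c is handed to the shell, which sets up the pipe.
# In flume.conf this would correspond to something like (assumed names):
#   Agent2.sources.exec-source.shell = /bin/sh -c
#   Agent2.sources.exec-source.command = cat <file> | grep ERROR
ERRORS="$(/bin/sh -c "cat '$LOG' | grep ERROR")"
echo "$ERRORS"

rm -f "$LOG"
```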


Note: the Exec source cannot guarantee delivery. If an event fails to be put into the channel, the data is lost. Take, for instance, the tail -F [log file] use case, where an application writes to a log file on disk and Flume tails the file, sending each line as an event. In this use case, if the channel fills up and Flume can't send an event, Flume has no way of telling the application writing the log file that it needs to retain the log. There is absolutely zero guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling Directory Source or direct integration with Flume via the SDK.

File Roll Sink:

The output of the agent is written to files on the local file system, in the directory specified in the configuration file. By default, Flume rotates (rolls) to a new file every 30 seconds. In our setup, we have disabled this feature so that everything can be tracked in a single file.

Required properties are marked with an asterisk (*).

Property Name       Default   Description
type *              -         The component type name, needs to be file_roll
sink.directory *    -         The directory where files will be stored
sink.rollInterval   30        Roll the file every 30 seconds. Specifying 0 will disable rolling and cause all events to be written to a single file
sink.serializer     TEXT      Other possible options include avro_event or the FQCN of an implementation of the EventSerializer.Builder interface
batchSize           100       The max number of events to take from the channel and write at a time

File Channel:

The File channel stores events in files on the local file system. By default, the File channel uses checkpoint and data directories inside the user's home directory, as shown in the table below. As a result, if more than one File channel instance is active within the agent, only one of them can lock the directories, and initialization of the other channels fails. It is therefore necessary to give every configured channel its own explicit paths, preferably on different disks.

Required properties are marked with an asterisk (*).

Property Name          Default          Description
type *                 -                The component type name, needs to be file
checkpointDir          ~/.flume/file-   The directory where the checkpoint file will be stored
useDualCheckpoints     false            Back up the checkpoint. If this is set to true, backupCheckpointDir must be set
backupCheckpointDir    -                The directory where the checkpoint is backed up to
dataDirs               ~/.flume/file-   The directory where log files will be stored
transactionCapacity    1000             The maximum size of transaction supported by the channel
checkpointInterval     30000            Amount of time (in millis) between checkpoints
maxFileSize            2146435071       Max size (in bytes) of a single log file
minimumRequiredSpace   524288000        Minimum required free space (in bytes). To avoid data corruption, the File channel stops accepting take/put requests when free space drops below this value
capacity               1000000          Maximum capacity of the channel
keep-alive             3                Amount of time (in sec) to wait for a put operation
write-timeout          3                Amount of time (in sec) to wait for a write operation
checkpoint-timeout     600              Amount of time (in sec) to wait for a checkpoint
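
When an agent defines more than one File channel, each one needs its own checkpointDir and dataDirs. The sketch here writes a fragment for two channels and checks that no path is reused. The channel names ch1 and ch2 and the /data1 and /data2 paths are hypothetical, and the check assumes a single directory per property.

```shell
#!/bin/sh
# Sketch only: two file channels with explicit, non-overlapping directories.
FRAGMENT="$(mktemp)"
cat > "$FRAGMENT" <<'EOF'
Agent2.channels = ch1 ch2
Agent2.channels.ch1.type = file
Agent2.channels.ch1.checkpointDir = /data1/flume/ch1/checkpoint
Agent2.channels.ch1.dataDirs = /data1/flume/ch1/data
Agent2.channels.ch2.type = file
Agent2.channels.ch2.checkpointDir = /data2/flume/ch2/checkpoint
Agent2.channels.ch2.dataDirs = /data2/flume/ch2/data
EOF

# List every checkpointDir/dataDirs value and report any that occur twice.
DUPLICATES="$(grep -E '\.(checkpointDir|dataDirs) *=' "$FRAGMENT" \
  | sed 's/^[^=]*= *//' | sort | uniq -d)"

if [ -z "$DUPLICATES" ]; then
  echo "ok: every channel directory is unique"
else
  echo "error: directories shared between channels:" >&2
  echo "$DUPLICATES" >&2
fi
```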

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
