This post describes basics of Apache Flume overview and illustrates its architecture.
Table of Contents
What is Flume ? :
Flume is a highly reliable, distributed and configurable streaming data collection tool. Flume can transport log files across a large number of hosts into HDFS.
Need for Flume:
These days, most of the new data is contained in high-throughput streams like Application logs, social media updates, Web Server logs, Network logs and website click streams create fast-moving streams requiring storage in HDFS. Though there are other tools like Facebook’s Scribe and Apache’s Kafka for collecting these streams, Apache flume is more reliable and efficient for collecting these streams and logs.
- Flume collects data efficiently, aggregates and moves large amounts of log data from many different sources to a centralized data store.
- Simple and flexible architecture, and provides a streaming of data flows and leverages data movement from multiple machines in within an enterprise into Hadoop.
- Flume is not restricted to log data aggregation and it can transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.
- Built-in support for several Sources and destination platforms to integrate with.
- Support for multi-hop flows where events travel through multiple agents before reaching the final destination. fan-in and fan-out flows, contextual routing and backup routes for failed hops are also allowed.
Flume deploys as one or more agents. A Flume agent is a JVM process that hosts the components through which events flow from an external source to the next destination. Each Agent contains three components: Source(s), Channel(s) and Sink.
Below is the high level diagram of flume architecture.
As the flow shown in above diagram and an Event from a Web Server flows through Agent (Source –> Channel –> Sink) to HDFS.
A single packet of data passed through a system (Source –> Channel –> Sink) is called as an event. In log files terminology, an event is a line of text followed by a new line character.
Flume source is configured within an agent and it listens for events from an external source (eg: web server) it reads data, translates events, and handles failure situations. But source doesn’t know how to store the even. So, after receiving enough data to produce a Flume event, it sends events to the channel to which the source is connected.
The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink.
Flume provides support for various source types with its built-in API. A few of the many basic Sources include:
Spooling Directory Source
The source has the responsibility of delivering the event to the configured channel,
and all other aspects of the event processing are invisible to the source.
Channels are communication bridges between sources and sinks within an agent. Once a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it’s consumed by a Flume sink.
Memory channel stores the events from in an in-memory queue and from there events will be accessed by sink. Because of software or hardware failures, if the agent process dies in the middle, then all the events currently in the memory channel are lost forever.
The file channel is another example – it is backed by the local file system. Unlike memory channel, file channel writes the contents to a file on the file system that is deleted only after successful delivery to the sink.
Flume provides support for various channels by default. Below are a few of them.
Spillable Memory Channel
Pseudo Transaction Channel
Note: The memory channel is the fastest but has the risk of data loss. The file channels are typically much slower but effectively provide guaranteed delivery to the sink.
Sink removes the event from the channel and puts it into an external repository like HDFS or forwards it to the Flume source of the next Flume agent in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.
By default, Flume provides support for various sink types and a few of them are:
File Roll Sink
Apart from the built-in components, we can also create custom sources, channels and sinks in Java and we can deploy them into Flume.
- Channel selectors:
- Replicating channel selector (default)
- Multiplexing channel selector
- Custom channel selector
- Sink Processors (Sink groups allow users to group multiple sinks into one entity. Sink processors can be used to provide load balancing capabilities over all sinks inside the group or to achieve fail over from one sink to another in case of temporal failure)
- Sink Processors type
- Default sink processor
- Failover sink processor
- Load balancing sink processor
- Custom sink processor
- Interceptors (modify/drop events in-flight)
- Timestamp Interceptor (inserts a header with key timestampwhose value is the relevant timestamp)
- Host Interceptor (inserts the hostname/IP address of the host that this agent is running on)
- Static Interceptor (append a static header with static value to all events)
- UUID Interceptor
- Morphline Interceptor
- Regex Filtering Interceptor (filters events selectively by interpreting the event body as text and matching the text against a configured regular expression)
- Regex Extractor Interceptor
Log Analysis Flow:
A common use case for Flume is loading the weblog data from several sources into HDFS.
Logs would be created in respective Log Servers and logged in local hard discs. This content will then be pushed to HDFS using FLUME framework. FLUME has agents running on Log servers that collect data intermediately using collectors and finally push that data to HDFS and it will be processed by pig or hive and it is stored in structured format into HBase or any other equivalent database. And from there, data will be pulled by Business intelligence tools to generate reports.
Below is the high level overview of the Log analysis flow.
Web server –> Flume –> HDFS –> ETL (Pig/Hive) –> Database (HBase) —> Reporting
In the next posts in this section, we will discuss about installation of Flume and collecting data from various sources and analyzing it.