Hadoop Output Formats
We discussed the input formats supported by Hadoop in the previous post. In this post, we will have an overview of the Hadoop output formats and their usage.
Hadoop provides an output format corresponding to each input format. All Hadoop output formats must extend the abstract class org.apache.hadoop.mapreduce.OutputFormat.
OutputFormat describes the output specification for a MapReduce job. Based on this specification:
- The MapReduce job checks that the output directory doesn't already exist.
- OutputFormat provides the RecordWriter implementation used to write out the output files of the job.
These two responsibilities of the OutputFormat are accomplished with the following two abstract methods.
public abstract void checkOutputSpecs(JobContext context) throws IOException, InterruptedException
This method checks that the output directory doesn't already exist and throws an exception when it does, so that existing output is not overwritten.
public abstract RecordWriter<K,V> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException
This method returns the RecordWriter for the given task.
Implementations of org.apache.hadoop.mapreduce.RecordWriter<K,V> are used to write the output <key, value> pairs to an output file.
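To make the two responsibilities concrete, below is a minimal sketch of a custom output format. The class name UpperCaseOutputFormat and the "KEY:value" line format are illustrative assumptions, not part of Hadoop; extending FileOutputFormat inherits a checkOutputSpecs() that throws an exception when the output directory already exists.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative custom output format: writes one "KEY:value" line per record,
// with the key upper-cased. checkOutputSpecs() is inherited from FileOutputFormat.
public class UpperCaseOutputFormat extends FileOutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext context)
            throws IOException, InterruptedException {
        // getDefaultWorkFile() yields this task's part file, e.g. part-r-00000
        Path file = getDefaultWorkFile(context, "");
        FSDataOutputStream out =
                file.getFileSystem(context.getConfiguration()).create(file, false);
        return new RecordWriter<Text, Text>() {
            @Override
            public void write(Text key, Text value) throws IOException {
                out.writeBytes(key.toString().toUpperCase() + ":" + value + "\n");
            }

            @Override
            public void close(TaskAttemptContext ctx) throws IOException {
                out.close();
            }
        };
    }
}
```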
Built-In Hadoop Output Formats
Hadoop provides some built-in OutputFormat implementations in the org.apache.hadoop.mapreduce.lib.output package:
FileOutputFormat is the base class for all file-based OutputFormat implementations.
Some of the important sub classes of the FileOutputFormat class are:
The default output format provided by Hadoop is TextOutputFormat, which writes records as lines of text. If the output format is not specified explicitly, text files are created as the output files.
Output key-value pairs can be of any type, because TextOutputFormat converts them into strings with their toString() methods. Output key-value pairs are tab-delimited by default.
For reading these output text files back as input, KeyValueTextInputFormat is best suited, since it breaks input lines into key-value pairs based on a separator character.
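The default tab separator can be changed through the mapreduce.output.textoutputformat.separator property in the driver. A minimal sketch, where the comma is just an example value and the job name is assumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TextOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replace the default tab separator with a comma (example value)
        conf.set("mapreduce.output.textoutputformat.separator", ",");
        Job job = Job.getInstance(conf, "text output example");
        job.setOutputFormatClass(TextOutputFormat.class); // the default anyway
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
    }
}
```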
SequenceFileOutputFormat writes sequence files, which are the best option when the output files need to be fed into other MapReduce jobs as input files, since they are compact and readily compressed.
SequenceFileAsBinaryOutputFormat is a direct subclass of SequenceFileOutputFormat and the counterpart of SequenceFileAsBinaryInputFormat. It writes keys and values to sequence files in binary format.
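A sketch of a driver configuring block-compressed sequence-file output; the codec choice, key/value types, and output path are assumptions for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SequenceOutputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sequence output example");
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Enable block-level compression on the sequence files
        SequenceFileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job,
                SequenceFile.CompressionType.BLOCK);
        FileOutputFormat.setOutputPath(job, new Path(args[0])); // output dir from CLI
    }
}
```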
MapFileOutputFormat is also a direct subclass of FileOutputFormat; it writes the output as MapFiles, whose keys must be added in sorted order (which reducer output keys already are).
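Because MapFiles support lookups by key, a job's MapFile output can be read back with the static helpers on MapFileOutputFormat. A sketch, assuming the output directory comes from the command line and the job used the default HashPartitioner; the key "some-key" is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class MapFileLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Open one MapFile.Reader per partition of the job output
        MapFile.Reader[] readers =
                MapFileOutputFormat.getReaders(new Path(args[0]), conf);
        Text key = new Text("some-key"); // illustrative lookup key
        Text value = new Text();
        // getEntry() picks the right partition using the same partitioner the job used
        MapFileOutputFormat.getEntry(readers, new HashPartitioner<Text, Text>(), key, value);
        System.out.println(key + "\t" + value);
    }
}
```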
The MultipleOutputs class is used to write output data to multiple outputs. Below are the two main use cases of MultipleOutputs.
- Job output can be written to additional outputs other than the default output. Each additional output, or named output, may be configured with its own OutputFormat, its own key class, and its own value class.
- Data can be written to different files whose names are supplied by the user at write time.
MultipleOutputs supports counters that count the number of records written to each named output, but these are disabled by default.
Usage pattern for job submission:
Job job = Job.getInstance(new Configuration());
// Defines additional single text based output 'text' for the job
MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
    Text.class, LongWritable.class); // key/value classes here are example choices
// Defines additional sequence-file based output 'sequence' for the job
MultipleOutputs.addNamedOutput(job, "sequence", SequenceFileOutputFormat.class,
    Text.class, LongWritable.class);
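On the task side, the named outputs 'text' and 'sequence' are written through a MultipleOutputs instance. A sketch of a reducer doing this; the class name and the summing logic are assumptions for illustration, and the key/value types are example choices:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Illustrative reducer writing its results to the named outputs 'text' and 'sequence'
public class MultiOutputReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

    private MultipleOutputs<Text, LongWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        // Route records to the named outputs instead of (or besides) the default output
        mos.write("text", key, new LongWritable(sum));
        mos.write("sequence", key, new LongWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // must be closed in cleanup(), or output may be lost
    }
}
```

The per-named-output counters mentioned above can be switched on in the driver with MultipleOutputs.setCountersEnabled(job, true).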