MapReduce Multiple Outputs Use case 1

Use Case Description:

In this post we will discuss the use of MultipleOutputs in MapReduce jobs through a real-world use case: generating multiple output files from the reducer, with the file names derived from certain input data parameters. That is, we need control over the naming of the output files.

In this scenario, we have sample input data in JSON format, where each record contains several fields (country, state, city, street, zip). We need to create sub-directories up to 5 levels deep in the format country/state/city/street/zip and partition the input records by country, state, city, street and zip.

Input file  –> json_input
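The input file itself is not reproduced here; based on the fields and the example values mentioned later in this post, its records would look something like:

```
{"country":"us","state":"nz","city":"al","street":"lst","zip":"1000"}
{"country":"uk","state":"ny","city":"fr","street":"nyk","zip":"1009"}
```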

In MapReduce, by default, one output file per reducer is created, and the files are named by partition number: part-r-00000, part-r-00001, and so on. In this scenario, however, we need file names of the form us/nz/al/lst/1000-r-0000*, with the input records segregated into the appropriate sub-directories based on the country, state, city, street and zip values.

MapReduce comes with the MultipleOutputs class to help us do this. For details on the concept of MultipleOutputs, please refer to the post Mapreduce Output formats.


We will write a MapReduce program using MultipleOutputs to partition the data by country, state, city, street and zip. MultipleOutputs allows us to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string.

File names are of the form name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is an arbitrary name that can be set by us in the program, and nnnnn is an integer designating the part number, starting from zero.

How to do it:
  • We use JSONObject to parse the input data, build the key with the required directory structure in the mapper itself, and pass the (key, value) pairs to the reducer.
  • In the reducer, we create an instance of MultipleOutputs in the setup() method and assign it to an instance variable. We then use the MultipleOutputs instance in the reduce() method to write the output, in place of the context. Its write() method takes the key and value, as well as a file name.
  • Finally, we close the MultipleOutputs instance in the reducer's cleanup() method.
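The key construction from the first bullet can be sketched in plain Java. The field extraction below is a stdlib-only stand-in for the JSONObject parsing the post actually uses, and it assumes flat records with string-valued fields (an assumption, since the real input file is not shown):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the key the mapper emits: the five partition fields joined
// into a sub-directory path such as us/nz/al/lst/1000.
public class PartitionKey {

    // Extracts a string field like "country":"us" from a flat JSON record.
    // (Stand-in for JSONObject.getString; assumes quoted values.)
    static String field(String json, String name) {
        Matcher m = Pattern.compile("\"" + name + "\"\\s*:\\s*\"([^\"]*)\"").matcher(json);
        return m.find() ? m.group(1) : "";
    }

    // Builds the country/state/city/street/zip key used as the file-name base.
    public static String buildKey(String json) {
        String[] fields = {"country", "state", "city", "street", "zip"};
        StringBuilder sb = new StringBuilder();
        for (String f : fields) {
            if (sb.length() > 0) sb.append('/');
            sb.append(field(json, f));
        }
        return sb.toString();
    }
}
```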

Below is the code that can be used to perform the above activities.
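The post's original code listing does not appear here; the following is a sketch of what a program implementing the steps above might look like, assuming the org.json parser and string-valued fields. Class and field names are illustrative, not the post's actual code.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.json.JSONObject;

public class MultipleOutputsJob {

    public static class JsonMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Build the country/state/city/street/zip key in the mapper itself.
            JSONObject record = new JSONObject(value.toString());
            String dirKey = record.getString("country") + "/" + record.getString("state")
                    + "/" + record.getString("city") + "/" + record.getString("street")
                    + "/" + record.getString("zip");
            context.write(new Text(dirKey), value);
        }
    }

    public static class PartitionReducer extends Reducer<Text, Text, NullWritable, Text> {
        private MultipleOutputs<NullWritable, Text> multipleOutputs;

        @Override
        protected void setup(Context context) {
            // Create the MultipleOutputs instance once, in setup().
            multipleOutputs = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // The third argument is the base output path, e.g.
                // us/nz/al/lst/1000 becomes us/nz/al/lst/1000-r-00000.
                multipleOutputs.write(NullWritable.get(), value, key.toString());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Close MultipleOutputs so the side files are flushed.
            multipleOutputs.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple outputs");
        job.setJarByClass(MultipleOutputsJob.class);
        job.setMapperClass(JsonMapper.class);
        job.setReducerClass(PartitionReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```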

  • Save these lines of code into a file.
  • Compile the program after adding the required JSON jar files to the classpath (or, in Eclipse, add them as external jars to the build path).
  • Create a jar with the class files and copy the json_input.txt file into HDFS.

Below is a screenshot of the terminal performing the above actions.


Run and Validate Output results:

Run the MapReduce job with the command below.
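The exact command depends on the jar and class names used; an invocation of the following shape would be typical (the names here are illustrative, not the post's):

```shell
# Run the job, reading json_input.txt from HDFS and writing under /output/mos.
hadoop jar multipleoutputs.jar MultipleOutputsJob /input/json_input.txt /output/mos
```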


Output results:

Verify the output directory structure in HDFS.
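The directory tree and individual partition files can be inspected with the HDFS shell; the paths below follow this post's example and are illustrative:

```shell
# List the partitioned sub-directories recursively.
hdfs dfs -ls -R /output/mos

# View one partition's records, e.g. the uk/ny/fr/nyk/1009 slice.
hdfs dfs -cat /output/mos/uk/ny/fr/nyk/1009-r-00000
```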


For example, let us view the contents of one file to verify that they match the directory structure.


Here we can clearly see that country = uk, state = ny, city = fr, street = nyk and zip = 1009, and that these records are stored in the /output/mos/uk/ny/fr/nyk/1009-r-00000 file.

So our results are correct.


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
