MapReduce


Hadoop Performance Tuning: There are many ways to improve the performance of Hadoop jobs. In this post, we will cover a few MapReduce properties that can be applied at various MapReduce phases to improve performance. There is no one-size-fits-all technique for tuning Hadoop jobs, because of the architecture of […]

Hadoop Performance Tuning
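As a flavour of what such settings look like, here is a minimal driver sketch. The property names are standard Hadoop 2 settings, but the values shown are illustrative assumptions, not recommendations from the post.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// A minimal sketch of setting a few common Hadoop 2 tuning properties in a
// driver; the values are illustrative starting points, not universal advice.
public class TuningDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 256);              // map-side sort buffer size
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);   // spill threshold
    conf.setBoolean("mapreduce.map.output.compress", true);     // compress shuffle data
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10); // parallel shuffle fetchers
    Job job = Job.getInstance(conf, "tuned-job");
    // ... set mapper/reducer/paths as usual before submitting.
  }
}
```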


In this post we will provide a solution to the famous N-Grams calculator in MapReduce programming: a MapReduce use case for N-Gram statistics. N-Gram: In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, […]

Mapreduce Use Case for N-Gram Statistics
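A hypothetical mapper sketch for the word-level bigram case (n = 2); the class name and whitespace tokenization are assumptions, not the post's code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each word-level bigram with a count of 1; a summing reducer
// (or combiner) then produces the n-gram frequency statistics.
public class NGramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int N = 2;                       // assumed n-gram size
  private static final IntWritable ONE = new IntWritable(1);
  private final Text gram = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().toLowerCase().split("\\s+");
    for (int i = 0; i + N <= words.length; i++) {
      gram.set(words[i] + " " + words[i + 1]);
      context.write(gram, ONE);
    }
  }
}
```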


PageRank is a way of measuring the importance of website pages. PageRank works by counting the number and quality of links to a page to arrive at a rough estimate of how important the page is. The underlying assumption is that more important websites are likely to receive more links from other […]

Mapreduce Use Case to Calculate PageRank
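One common way to express a single PageRank iteration in MapReduce is a map step that distributes each page's current rank evenly over its outgoing links. The sketch below assumes a record layout of "<rank>\t<link1>,<link2>,..." per page; the layout and class name are illustrative, not the post's code.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumed input (e.g. via KeyValueTextInputFormat): key = page,
// value = "<rank>\t<comma-separated outgoing links>". Each link receives an
// equal share of the rank, and the link list is re-emitted so the reducer
// can rebuild the graph for the next iteration.
public class PageRankMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text page, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2);
    double rank = Double.parseDouble(parts[0]);
    String[] links = (parts.length > 1 && !parts[1].isEmpty())
        ? parts[1].split(",") : new String[0];
    for (String link : links) {
      context.write(new Text(link), new Text(Double.toString(rank / links.length)));
    }
    context.write(page, new Text("links:" + (links.length > 0 ? parts[1] : "")));
  }
}
```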




This post is a continuation of the previous post on working with the small files issue. In the previous post we merged a huge number of small files in an HDFS directory into a SequenceFile; in this post we will merge a huge number of small files on the local file system into an Avro file on HDFS […]

Merging Small Files Into Avro File
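A minimal sketch of the local-files-to-Avro direction, assuming a simple (filename, contents) record schema and illustrative paths; the post's own schema and code may differ.

```java
import java.io.File;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads every regular file in a flat local directory and appends one Avro
// record per file to a single Avro container file on HDFS.
public class SmallFilesToAvro {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"SmallFile\",\"fields\":["
      + "{\"name\":\"filename\",\"type\":\"string\"},"
      + "{\"name\":\"contents\",\"type\":\"bytes\"}]}");

    FileSystem fs = FileSystem.get(new Configuration());
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, fs.create(new Path("/merged/smallfiles.avro")));
      for (File f : new File("/local/input").listFiles()) {
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("filename", f.getName());
        rec.put("contents", ByteBuffer.wrap(Files.readAllBytes(f.toPath())));
        writer.append(rec);
      }
    }
  }
}
```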


In this post, we will discuss one of the famous use cases of SequenceFiles, in which we merge a large number of small files into a SequenceFile. This requirement arises mainly from the lack of efficient processing of a large number of small files in Hadoop or MapReduce. Need For […]

Merging Small Files into SequenceFile
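A minimal standalone sketch of the merge, assuming the file name as key and the raw bytes as value; the paths are illustrative.

```java
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Writes every local file under /local/input into one SequenceFile on HDFS,
// one (filename, bytes) record per small file.
public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/merged/smallfiles.seq")),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      for (File f : new File("/local/input").listFiles()) {
        byte[] data = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(data));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
```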



Avro provides support for both the old MapReduce package API (org.apache.hadoop.mapred) and the new MapReduce package API (org.apache.hadoop.mapreduce). Avro data can be used as both input to and output from a MapReduce job, as well as the intermediate format. In this post we will provide an example run of the Avro MapReduce 2 API. This […]

Avro MapReduce 2 API Example
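A hedged sketch of the new-API wiring, assuming the input is an Avro container file of strings: AvroKeyInputFormat delivers each datum to the mapper as AvroKey<T> with a NullWritable value. Class names and paths are illustrative, not the post's exact code.

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Map-only job for brevity: reads Avro string data and writes plain text.
public class AvroMR2Example {

  public static class EchoMapper
      extends Mapper<AvroKey<CharSequence>, NullWritable, Text, IntWritable> {
    @Override
    protected void map(AvroKey<CharSequence> key, NullWritable value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(key.datum().toString()), new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "avro-mr2-example");
    job.setJarByClass(AvroMR2Example.class);
    job.setInputFormatClass(AvroKeyInputFormat.class);
    AvroJob.setInputKeySchema(job, Schema.create(Schema.Type.STRING));
    job.setMapperClass(EchoMapper.class);
    job.setNumReduceTasks(0);                      // map-only for brevity
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```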


In this post, we will discuss the famous word count example through MapReduce and create a sample Avro data file in the Hadoop Distributed File System. Prerequisite: In order to execute the MapReduce word count program given in this post, we need the avro-mapred-1.7.4-hadoop2.jar file to be present in the $HADOOP_HOME/share/hadoop/common/lib directory. This […]

Avro MapReduce Word Count Example
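For the reduce side of such a job, here is a sketch of a reducer that sums the counts for each word and emits an Avro (string, int) pair of the kind AvroKeyValueOutputFormat serializes into the output .avro file. The class name is illustrative, and the driver wiring (output schemas and format) is omitted.

```java
import java.io.IOException;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the per-word counts from the mappers and emits an Avro key/value
// pair; the driver would set STRING/INT output schemas via AvroJob.
public class AvroWordCountReducer
    extends Reducer<Text, IntWritable, AvroKey<CharSequence>, AvroValue<Integer>> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    ctx.write(new AvroKey<CharSequence>(word.toString()), new AvroValue<Integer>(sum));
  }
}
```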


Use Case Description: In this post we will discuss the usage of the MapReduce MultipleOutputs output format in MapReduce jobs by taking one real-world use case. Here, we consider a use case in which multiple output file names are generated from the reducer, and these file names should be […]

MapReduce Multiple Outputs Use case
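A reducer sketch for this pattern, assuming the record key itself supplies the output file name: MultipleOutputs.write(key, value, baseOutputPath) is the real API, while the class names and the key-based naming are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Each distinct key gets its own output file, e.g. "<key>-r-00000".
// Drivers often pair this with LazyOutputFormat to avoid empty part files.
public class MultiOutReducer extends Reducer<Text, Text, NullWritable, Text> {
  private MultipleOutputs<NullWritable, Text> multiOut;

  @Override
  protected void setup(Context context) {
    multiOut = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      multiOut.write(NullWritable.get(), value, key.toString());
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    multiOut.close();                  // flush all the named output files
  }
}
```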



Use Case Description: This post describes an approach to a use case scenario where an input file contains some columns and their corresponding values as records. Some of these columns may have blanks/nulls instead of actual values, i.e. data is missing for some columns, and the developer needs to write a […]

Mapreduce Program to calculate Missing Count
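An illustrative mapper for this pattern, assuming comma-delimited records: it emits (column index, 1) for every blank field, so a standard summing reducer yields the missing-value count per column. The delimiter and record layout are assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one (column, 1) pair per blank field in each record.
public class MissingCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text column = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",", -1); // -1 keeps trailing blanks
    for (int i = 0; i < fields.length; i++) {
      if (fields[i].trim().isEmpty()) {
        column.set("col" + i);
        context.write(column, ONE);
      }
    }
  }
}
```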


Hadoop Output Formats: We discussed the input formats supported by Hadoop in the previous post. In this post, we will have an overview of the Hadoop output formats and their usage. Hadoop provides output formats that correspond to each input format. All Hadoop output formats must implement the interface org.apache.hadoop.mapreduce.OutputFormat. OutputFormat […]

Hadoop Output Formats
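Selecting an output format is a one-line driver change; the sketch below, with illustrative names, swaps the default TextOutputFormat for SequenceFileOutputFormat.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Minimal driver fragment: every OutputFormat implementation is selected
// the same way, via Job#setOutputFormatClass.
public class OutputFormatDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "output-format-demo");
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // ... mapper/reducer and paths as usual.
  }
}
```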


Hadoop Input Formats: In our Mapreduce Job Flow post, we discussed how files are broken into splits as part of job startup and how the data in a split is sent to the mapper implementation. In this post, we will go into a detailed discussion of the input formats supported by […]

Hadoop Input Formats
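As a small illustration (names assumed), the sketch below selects KeyValueTextInputFormat, which splits each input line into key and value at a configurable separator (tab by default).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Minimal driver fragment: the InputFormat decides how input is split into
// records; here each line becomes a (key, value) pair split at a comma.
public class InputFormatDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    Job job = Job.getInstance(conf, "input-format-demo");
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // ... mapper/reducer and paths as usual.
  }
}
```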



If none of the built-in Hadoop Writable data types matches our requirements, we can create a custom Hadoop data type by implementing the Writable interface or the WritableComparable interface. Common Rules for creating a custom Hadoop Writable Data Type: A custom Hadoop Writable data type which needs to be used as a value […]

Creating Custom Hadoop Writable Data Type
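A minimal custom-type sketch: it implements WritableComparable so it can serve as a key, with serialization via write/readFields and ordering via compareTo. The class name and its two fields are illustrative assumptions.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A custom key type: (ip, timestamp), ordered by ip then timestamp.
public class WebLogWritable implements WritableComparable<WebLogWritable> {
  private String ip = "";
  private long timestamp;

  public WebLogWritable() { }              // no-arg constructor is required

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(ip);
    out.writeLong(timestamp);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    ip = in.readUTF();
    timestamp = in.readLong();
  }

  @Override
  public int compareTo(WebLogWritable other) {
    int byIp = ip.compareTo(other.ip);
    return byIp != 0 ? byIp : Long.compare(timestamp, other.timestamp);
  }

  @Override
  public int hashCode() {                  // keeps HashPartitioner consistent
    return ip.hashCode() * 31 + Long.hashCode(timestamp);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof WebLogWritable)) return false;
    WebLogWritable w = (WebLogWritable) o;
    return ip.equals(w.ip) && timestamp == w.timestamp;
  }
}
```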


Hadoop provides Writable-interface-based data types for serialization and deserialization of data stored in HDFS and used in MapReduce computations. Serialization: Serialization is the process of converting object data into byte stream data for transmission over a network across different nodes in a cluster or for persistent data storage. Deserialization: Deserialization […]

Hadoop Data Types
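A tiny demonstration of what Writable serialization means in practice: an IntWritable turns into exactly four bytes on the wire, which is what makes these types compact for shuffle traffic and HDFS storage. The class name is illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

// Serializes an IntWritable into a byte array and prints its length (4).
public class WritableDemo {
  public static void main(String[] args) throws IOException {
    IntWritable value = new IntWritable(163);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    value.write(new DataOutputStream(bytes));   // serialization step
    System.out.println("serialized length = " + bytes.toByteArray().length);
  }
}
```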


Combiners In Mapreduce: A combiner is a semi-reducer in MapReduce. It is an optional class that can be specified in the MapReduce driver class to process the output of map tasks before it is submitted to reducer tasks. Purpose: In the MapReduce framework, the output from the map tasks is usually large, and data […]

Combiner in Mapreduce
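A minimal wiring sketch: because summing is associative and commutative, the reducer class can double as the combiner, pre-aggregating map output before the shuffle. TokenCounterMapper and IntSumReducer are Hadoop's stock word-count classes; the driver class name is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// The combiner runs on map-side output, shrinking the data shuffled to
// the reducers; here the reducer class itself is reused as the combiner.
public class CombinerDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combiner-demo");
    job.setJarByClass(CombinerDemo.class);
    job.setMapperClass(TokenCounterMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // ... input/output paths as usual.
  }
}
```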