Monthly Archives: April 2014

Hadoop Input Formats: In our MapReduce Job Flow post, we discussed how files are broken into splits as part of job startup and how the data in each split is sent to the mapper implementation. In this post, we will discuss in detail the input formats supported by […]

Hadoop Input Formats

Below are a few more Hadoop interview questions and answers for both freshers and experienced Hadoop developers and administrators. Hadoop Interview Questions and Answers 1.  What is a Backup Node? It is an extended checkpoint node that performs checkpointing and also supports online streaming of file system edits. It maintains […]

Hadoop Interview Questions and Answers Part – 3

Below are a few more Hadoop interview questions and answers. Please refer to the previous posts on this topic for additional questions on HDFS. Hadoop Interview Questions and Answers 1.  What is HDFS? HDFS is a distributed file system implemented on the Hadoop framework. It is a block-structured distributed file system designed […]

Hadoop Interview Questions and Answers Part – 2

Below are a few Hadoop interview questions for both Hadoop developers and administrators. Hadoop Interview Questions and Answers 1.  What is Big Data? Big data is a vast amount of data (generally in GBs or TBs of size) that exceeds the regular processing capacity of traditional computing servers and requires special […]

Hadoop Interview Questions Part – 1

A few Hadoop MapReduce interview questions and answers are presented in this post. These are suitable for both beginners and experienced MapReduce developers. Mapreduce Interview Questions and Answers for Freshers: 1.  What is Mapreduce? Mapreduce is a framework for processing big data (huge data sets using a large […]

50 Mapreduce Interview Questions and Answers Part – 1

The main goal of this site is to provide tutorials on Hadoop and Big Data tools so that any IT programmer can easily learn this emerging technology and use it for solving Big Data processing problems. All the concepts covered on the site are presented with simple […]

About Me

If none of the built-in Hadoop Writable data types matches our requirements, we can create a custom Hadoop data type by implementing the Writable interface or the WritableComparable interface. Common Rules for Creating a Custom Hadoop Writable Data Type: A custom Hadoop Writable data type which needs to be used as a value […]

Creating Custom Hadoop Writable Data Type
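As a rough illustration of the Writable contract described in this post, here is a minimal sketch of a custom pair type. In a real job the class would declare `implements org.apache.hadoop.io.WritableComparable<IntPairWritable>`; the class name and fields here are hypothetical, and the Hadoop dependency is left out so the round trip can run standalone.

```java
import java.io.*;

// Minimal sketch of a custom pair type following the Hadoop Writable contract.
// In a real job it would implement org.apache.hadoop.io.WritableComparable.
public class IntPairWritable {
    private int first;
    private int second;

    public IntPairWritable() { }   // Writables need a no-arg constructor
    public IntPairWritable(int first, int second) { this.first = first; this.second = second; }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Deserialize the fields in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Keys must be comparable so the framework can sort map output.
    public int compareTo(IntPairWritable o) {
        int c = Integer.compare(first, o.first);
        return c != 0 ? c : Integer.compare(second, o.second);
    }

    public int getFirst() { return first; }
    public int getSecond() { return second; }

    public static void main(String[] args) throws IOException {
        // Round-trip: write to a byte stream, then read back.
        IntPairWritable before = new IntPairWritable(3, 7);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        before.write(new DataOutputStream(bytes));

        IntPairWritable after = new IntPairWritable();
        after.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(after.getFirst() + "," + after.getSecond());  // prints 3,7
    }
}
```

Note that readFields must consume exactly the bytes that write produced, in the same order; that symmetry is the core rule for any custom Writable.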

Hadoop provides Writable interface based data types for serialization and deserialization of data stored in HDFS and used in MapReduce computations. Serialization: Serialization is the process of converting object data into byte-stream data for transmission over a network across different nodes in a cluster, or for persistent data storage. Deserialization: Deserialization […]

Hadoop Data Types
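The serialize/deserialize round trip described above can be sketched with plain java.io streams, without a Hadoop dependency. An IntWritable, for example, serializes its value with DataOutput.writeInt; the writeUTF call here is only an approximation of how Text stores a length-prefixed string.

```java
import java.io.*;

// Sketch of the byte-level round trip that Hadoop Writable types perform,
// using plain java.io streams so the example runs without Hadoop.
public class SerializationDemo {
    public static void main(String[] args) throws IOException {
        // Serialization: object data -> byte stream.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(42);            // what an IntWritable would write
        out.writeUTF("hadoop");      // a length-prefixed string, similar in spirit to Text

        // Deserialization: byte stream -> object data.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(in.readInt());   // prints 42
        System.out.println(in.readUTF());   // prints hadoop
    }
}
```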

Combiners in MapReduce: A combiner is a semi-reducer in MapReduce. It is an optional class which can be specified in the MapReduce driver class to process the output of map tasks before submitting it to the reducer tasks. Purpose: In the MapReduce framework, the output from the map tasks is usually large, and the data […]

Combiner in Mapreduce
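The effect a combiner has on map output can be shown with a framework-free sketch: the (word, 1) pairs emitted by one map task are summed locally, so far fewer records cross the network to the reducers. All names here are illustrative; in a real word-count driver the reducer class is often reused via job.setCombinerClass(...).

```java
import java.util.*;

// Framework-free sketch of what a combiner does in word count:
// it aggregates one map task's output locally before the shuffle.
public class CombinerSketch {
    // "Map phase" output for one split: one (word, 1) pair per occurrence.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // "Combine phase": sum counts per key within this task's output.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            combined.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = map("to be or not to be");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(mapOutput.size() + " map records -> "
                + combined.size() + " combined records");   // 6 map records -> 4 combined records
        System.out.println(combined);                        // {be=2, not=1, or=1, to=2}
    }
}
```

The reduction from 6 records to 4 is small here, but on real splits with many repeated words the savings in shuffle traffic are substantial.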

Hadoop provides some predefined Mapper and Reducer classes in its Java API, and these are helpful for writing simple or default MapReduce jobs. A few from the entire list of predefined mapper and reducer classes are provided below. Identity Mapper: Identity Mapper is the default Mapper class provided by […]

Predefined Mapper and Reducer Classes
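The behaviour of the Identity Mapper mentioned above can be sketched without the framework: it simply emits each input (key, value) pair unchanged. The class and method names below are illustrative, not Hadoop's own.

```java
import java.util.*;

// Sketch of what Hadoop's Identity Mapper does: each input
// (key, value) pair is emitted unchanged.
public class IdentityMapperSketch {
    // The identity map: output record == input record.
    static <K, V> List<Map.Entry<K, V>> map(K key, V value) {
        return Collections.singletonList(new AbstractMap.SimpleEntry<>(key, value));
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> out = map(0L, "first line of input");
        System.out.println(out);   // prints [0=first line of input]
    }
}
```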

Mapreduce Job Flow Through YARN Implementation This post describes the MapReduce job flow behind the scenes when a job is submitted to Hadoop through the submit() or waitForCompletion() method on the Job object. This MapReduce job flow is explained with the help of the word count MapReduce program described in […]

MapReduce Job Flow

In this post, we are going to review the building blocks and programming model of the example word count MapReduce program run in the previous post in this MapReduce category. We will not go too deep into the code; our focus will mainly be on the structure of the MapReduce program written in Java […]

MapReduce Programming Model

HDFS OEV Tool: Similar to the Offline Image Viewer (oiv), Hadoop also provides a viewer tool for edits log files, since these are also not in a human-readable format. This is called the HDFS Offline Edits Viewer (oev) tool. This tool doesn’t require HDFS to be running and it can run in offline […]

HDFS Offline Edits Viewer Tool – oev
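For reference, typical invocations of the oev tool look like the following; the edits file name and output paths are placeholders, and the commands assume a local Hadoop installation.

```shell
# Convert a binary edits file to readable XML (the default processor).
hdfs oev -i edits_0000000000000000001-0000000000000000019 -o edits.xml

# Summarize how many times each edit opcode appears in the file.
hdfs oev -i edits_0000000000000000001-0000000000000000019 -p stats -o edits.stats
```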

Usually fsimage files, which contain the file system namespace on namenodes, are not human-readable. So Hadoop provided the HDFS Offline Image Viewer in the hadoop-2.0.4 release to view fsimage contents in a readable format. It is completely offline in its functionality and doesn’t require an HDFS cluster to be running. It can easily process very […]

HDFS Offline Image Viewer Tool – oiv
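Typical invocations of the oiv tool of that era look like the following; the fsimage file name is a placeholder, and note that in later Hadoop releases the set of oiv processors changed, so check `hdfs oiv --help` for your version.

```shell
# List the namespace in the fsimage, ls-style (the default processor).
hdfs oiv -i fsimage_0000000000000000024 -o fsimage.ls

# Dump the full contents in an indented, human-readable form.
hdfs oiv -i fsimage_0000000000000000024 -p Indented -o fsimage.txt
```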

HDFS Distributed File Copy: Hadoop provides the HDFS Distributed File Copy (distcp) tool for copying large amounts of HDFS files within or between HDFS clusters. It is implemented on the MapReduce framework and thus submits a map-only MapReduce job to parallelize the copy process. Usually this tool is useful for copying […]

HDFS Distributed File Copy Tool – distcp
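A typical distcp invocation looks like the following; nn1 and nn2 are placeholder namenode addresses and the paths are illustrative.

```shell
# Copy a directory between two clusters; runs as a map-only MapReduce job.
hadoop distcp hdfs://nn1:8020/data/logs hdfs://nn2:8020/data/logs

# -update copies only the files that are missing or differ at the target.
hadoop distcp -update hdfs://nn1:8020/data/logs hdfs://nn2:8020/data/logs
```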