FileStatus: The Java API for the Hadoop Distributed File System provides an important class, org.apache.hadoop.fs.FileStatus, for querying the HDFS file system. This class encapsulates file system metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information. We can get […]

Querying HDFS File System

Java Interface to HDFS File Read Write: This post describes the Java interface for reading and writing HDFS files and is a continuation of the previous post, Java Interface for HDFS I/O. Reading HDFS Files Through FileSystem API: In order to read any file in HDFS, we first need to get an instance of FileSystem underlying the […]

Java Interface to HDFS File Read Write

Java Interface for HDFS File I/O: This post describes the Java interface for the Hadoop Distributed File System. It is recommended to read this post after gaining basic knowledge of the Java Basic Input and Output, Java Binary Input and Output, and Java File Input and Output concepts. To explore more into […]

Java Interface for HDFS File I/O

This post is written under the assumption that the reader already has an idea about installing and configuring Hadoop on a single node cluster. If not, it is better to go through the post Installing Hadoop on single node cluster. In this post we will briefly discuss installing […]

Install Hadoop on Multi Node Cluster

Hadoop Input Formats: As discussed in our Mapreduce Job Flow post, files are broken into splits as part of job startup, and the data in each split is sent to the mapper implementation. In this post, we will go into a detailed discussion of the input formats supported by […]

Hadoop Input Formats

If none of the built-in Hadoop Writable data types matches our requirements, we can create a custom Hadoop data type by implementing the Writable interface or the WritableComparable interface. Common Rules for Creating a Custom Hadoop Writable Data Type: A custom Hadoop writable data type which needs to be used as a value […]

Creating Custom Hadoop Writable Data Type
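
As a rough illustration of the idea in the excerpt above, the following minimal sketch shows what a custom value type built on the Writable contract could look like. To keep it runnable without Hadoop on the classpath, it declares a local stand-in interface with the same two methods as org.apache.hadoop.io.Writable; in a real job the class would implement the Hadoop interface instead. The names PointWritable and WritableDemo are hypothetical and not taken from the original post.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Assumption: local stand-in mirroring the two methods of
// org.apache.hadoop.io.Writable, so the sketch compiles without Hadoop.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical custom value type holding two ints.
class PointWritable implements Writable {
    private int x;
    private int y;

    // No-argument constructor so an empty instance can be created
    // first and then populated by readFields().
    PointWritable() { }

    PointWritable(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int getX() { return x; }
    int getY() { return y; }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order.
        out.writeInt(x);
        out.writeInt(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the fields in the same order they were written.
        x = in.readInt();
        y = in.readInt();
    }
}

class WritableDemo {
    // Turns a Writable into its byte stream form.
    static byte[] serialize(Writable w) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Fills an existing Writable instance from a byte array.
    static void deserialize(Writable w, byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            w.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PointWritable original = new PointWritable(3, -7);
        byte[] bytes = serialize(original);

        PointWritable copy = new PointWritable();
        deserialize(copy, bytes);

        // Prints: 8 bytes -> (3, -7)
        System.out.println(bytes.length + " bytes -> (" + copy.getX() + ", " + copy.getY() + ")");
    }
}
```

The point of the round trip is that write() and readFields() must handle the fields in the same order. A type meant to be used as a key would additionally implement WritableComparable so that it can be sorted.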

Hadoop provides Writable interface based data types for serialization and deserialization of data stored in HDFS and used in MapReduce computations. Serialization: Serialization is the process of converting object data into byte stream data for transmission over a network across different nodes in a cluster or for persistent data storage. Deserialization: Deserialization […]

Hadoop Data Types

HDFS OEV Tool: Similar to the Offline Image Viewer (oiv), Hadoop also provides a viewer tool for edits log files, since these are also not in a human-readable format. This is called the HDFS Offline Edits Viewer (oev) tool. This tool doesn’t require HDFS to be running and it can run in offline […]

HDFS Offline Edits Viewer Tool – oev

Usually fsimage files, which contain the file system namespace on namenodes, are not human-readable. So Hadoop provided the HDFS Offline Image Viewer in the hadoop-2.0.4 release to view the fsimage contents in a readable format. It is completely offline in its functionality and doesn’t require the HDFS cluster to be running. It can easily process very […]

HDFS Offline Image Viewer Tool – oiv

HDFS Distributed File Copy: Hadoop provides the HDFS Distributed File Copy (distcp) tool for copying large amounts of HDFS files within or between HDFS clusters. It is implemented on the MapReduce framework and submits a map-only MapReduce job to parallelize the copy process. Usually this tool is useful for copying […]

HDFS Distributed File Copy Tool – distcp

Whenever a new data node is added to an existing HDFS cluster or a data node is removed from the cluster, some of the data nodes will hold more or fewer blocks than others. In this unbalanced cluster, data read/write requests become very busy on […]

HDFS Rebalance

The syntax for Hadoop commands is $ hadoop [--config confdir] [Command] [Generic_Options] [Command_Options], where the --config parameter overrides the default configuration directory. Commands can be either user commands or administrator commands. Below are the details of the useful administrator command dfsadmin. dfsadmin: The dfsadmin (distributed file system administration) command […]

dfsadmin – HDFS Administration Command