Whenever a new data node is added to an existing HDFS cluster, or a data node is removed from the cluster, some data nodes end up holding more or fewer blocks than others.
In such an unbalanced cluster, read/write requests become very busy on some data nodes while other data nodes are under-utilized.
In these cases, to make the space on all data nodes uniformly utilized for block distribution, an HDFS rebalance is triggered by the Hadoop administrator.
A cluster is in a balanced state when the percentage of space used on each data node is within the limits of the average percentage of space used across all data nodes +/- a threshold:

- Percentage of space used on a data node should not be less than (average % of space used on data nodes – threshold).
- Percentage of space used on a data node should not be greater than (average % of space used on data nodes + threshold).
Here the threshold is a configurable percentage, 10 % by default (it can be changed with the balancer's -threshold option).
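The balance condition above can be sketched in a few lines of Python (an illustrative check, not HDFS source code; the function name and sample values are made up):

```python
# Sketch of the balance check described above: a cluster is balanced when
# every data node's used-space percentage lies within average +/- threshold.
def is_balanced(used_pct, threshold=10.0):
    """used_pct: list of per-datanode used-space percentages."""
    avg = sum(used_pct) / len(used_pct)
    return all(abs(p - avg) <= threshold for p in used_pct)

# Average of [30, 50, 70] is 50 %; the node at 70 % exceeds 50 + 10,
# so the cluster is unbalanced.
print(is_balanced([30, 50, 70]))   # False
print(is_balanced([45, 50, 55]))   # True
```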
The Rebalancer is an administration tool in HDFS that balances the distribution of blocks uniformly across all the data nodes in the cluster.
Rebalancing is done on demand only; it is not triggered automatically.
The HDFS administrator issues the following command on request to balance the cluster.
$ hdfs balancer
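The balancer also accepts a threshold argument, and the bandwidth it may consume per data node can be capped at runtime. The specific values below are illustrative choices, not recommendations:

```shell
# Run the balancer, treating nodes within 5 % of the average as balanced
# (5 is an illustrative value; the default threshold is 10).
hdfs balancer -threshold 5

# Optionally cap the bandwidth each data node may spend on balancing,
# in bytes per second (10485760 = 10 MB/s, an illustrative choice).
hdfs dfsadmin -setBalancerBandwidth 10485760
```

A lower threshold produces a more evenly balanced cluster but makes the balancer run longer and move more data.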
When the Rebalancer is triggered, it scans the entire data node list (using block information from the NameNode), and:

- If an under-utilized data node is found, it moves blocks onto that node from over-utilized (or at least not under-utilized) data nodes.
- If an over-utilized data node is found, it moves blocks from that node to under-utilized (or at least not over-utilized) data nodes.
While moving blocks, the Rebalancer preserves HDFS's replica-placement guarantees:

- Different replicas of a block remain spread across racks, so the cluster can survive the loss of an entire rack.
- One of the replicas stays on the same rack as the node writing the file, so cross-rack network I/O is reduced.
- HDFS data is spread uniformly across the DataNodes in the cluster.
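The selection step in the first two bullets above can be sketched as follows. This is a simplified illustration with hypothetical node names and percentages; the real balancer works on byte counts, storage types, and rack topology, not just percentages:

```python
# Illustrative classification of data nodes for rebalancing (not HDFS source).
# Nodes above average + threshold are over-utilized block sources; nodes
# below average - threshold are under-utilized block targets.
def classify(nodes, threshold=10.0):
    """nodes: dict mapping data node name -> used-space percentage."""
    avg = sum(nodes.values()) / len(nodes)
    over = {n: p for n, p in nodes.items() if p > avg + threshold}
    under = {n: p for n, p in nodes.items() if p < avg - threshold}
    return over, under

# Average here is 50 %: dn1 (85 %) is a source, dn4 (20 %) is a target;
# dn2 and dn3 are within the 10 % threshold and are left alone.
over, under = classify({"dn1": 85, "dn2": 55, "dn3": 40, "dn4": 20})
print(over)    # {'dn1': 85}
print(under)   # {'dn4': 20}
```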