Hadoop Best Practices
- Avoid small files, i.e., files smaller than one HDFS block (typically 128 MB); each small file is processed by its own map task, which wastes task-scheduling overhead (a packed-input sketch appears after this list).
- Maintain an optimal HDFS block size, generally >= 128 MB, so that processing large data sets does not spawn tens of thousands of map tasks.
- Use combiners wherever applicable to reduce the network traffic from the map nodes to the reduce nodes.
- Configure an appropriate number of reducers when processing large data-sets; avoid degenerate cases such as a single reducer (a driver sketch covering combiners and reducer count appears after this list).
- Use the PARALLEL keyword in Pig scripts to control the number of reducers when processing large data-sets.
- Avoid writing multiple small output files from each reducer.
- Use the DistributedCache only for distributing artifacts of modest size, up to a few tens of MBs each (a sketch appears after this list).
- Enable compression for both the intermediate map outputs and the final output of the application, i.e., the output of the reduces (a configuration sketch appears after this list).
- Implementing automated processes that screen-scrape the web UI is strictly prohibited. Some parts of the web UI, such as job-history browsing, are very resource-intensive on the JobTracker and can cause severe performance problems when screen-scraped.
- Applications should not perform any metadata operations on the file-system from the backend; such operations should be confined to the job client during job submission. Likewise, applications should not contact the JobTracker from the backend.
- Structure Oozie workflows as a smaller number of medium-to-large Map-Reduce jobs (in terms of processing) rather than a large number of small Map-Reduce jobs.
- Use counters sparingly; they are very expensive, since the JobTracker has to maintain every counter of every map/reduce task for the entire duration of the application.
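
When a data set is already fragmented into many small files, one mitigation is to pack several files into each input split rather than running one map per file. The following is a minimal sketch, assuming text input on HDFS and Hadoop's CombineTextInputFormat; the 256 MB split cap and the command-line paths are illustrative assumptions, not recommendations from this document.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pack small files");
    job.setJarByClass(SmallFilesDriver.class);

    // Pack many small files into each split instead of one map task per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at roughly two HDFS blocks (assuming 128 MB blocks).
    CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

    // Mapper/reducer configuration omitted; Hadoop's identity defaults apply here.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```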
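For the combiner and reducer-count points, here is a minimal word-count-style driver sketch. It assumes the reduce function is associative and commutative, so Hadoop's stock IntSumReducer can double as the combiner; the reducer count of 32 is an arbitrary placeholder to be sized to the data set and cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(TokenCounterMapper.class);
    // The combiner pre-aggregates map output, shrinking the data shuffled to the reducers.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);

    // Size the reduce phase to the data set; a single reducer is a common bottleneck.
    job.setNumReduceTasks(32);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```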
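For the DistributedCache point, a minimal sketch using Job.addCacheFile (the current entry point to the DistributedCache machinery) is shown below. The HDFS path and the symlink name are assumptions for illustration; the artifact must already exist on HDFS.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache example");
    // Distribute a single, modest-sized side artifact (e.g., a lookup table of a few MBs).
    // The "#lookup" fragment exposes it as a symlink named "lookup"
    // in each task's working directory.
    job.addCacheFile(new URI("hdfs:///apps/shared/lookup.dat#lookup"));
    // ... set mapper/reducer, input/output paths, then submit as usual.
  }
}
```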
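For the compression point, a minimal configuration sketch follows. The property names are the mapreduce.* names used in newer Hadoop releases, and the choice of SnappyCodec is an assumption; substitute whichever codec is actually installed on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map outputs to reduce shuffle traffic and disk I/O.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed job");
    // Compress the final output of the reduces as well.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
    // ... set mapper/reducer, input/output paths, then submit as usual.
  }
}
```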