Real Time Hadoop Interview Questions From Different Readers

Real Time Hadoop Interview Questions from Various interviews

  1. Hive – Where do you use an internal (managed) table? In what scenarios?
  2. In your resume, what do you mean by, “monitoring & managing MapReduce jobs”? Explain?
  3. Interviewer’s project: How do you convert an RDBMS’s nested SQL queries to the Hadoop framework using Pig?
  4. Sqoop: You need to know it very well. Some of the current projects import data from RDBMS sources into HDFS.
  5. Can you join or transform tables/columns when importing using Sqoop?
  6. Can you do the above with different RDBMSs (not clear)?
  7. How do you transfer flat files from Unix systems?
  8. What is your Pig/Hive programming level (1-10)? (Almost all interviewers asked this.)
  9. Learn Scala! – Interviewer repeatedly told me.

Other Interview Questions:

  1. Hive – Internal vs External tables. How do you save your files in Hive?
  2. Sqoop – Incremental (append) vs last-modified imports; relate them to your project.
  3. Sqoop – How do you check whether RDBMS table columns were added or removed, and how do you incorporate these changes into the import job?
  4. What are the challenges you’ve faced in your project? Give 2 examples.
  5. How do you check data integrity? (log files)
  6. How do you improve the performance of your Pig script?
  7. Tell me about your project work.
  8. How do you use Partitioning/Bucketing in your project? (Examples from your project)
  9. Where do you look for answers? (user groups, Apache website, Stack Overflow)
  10. NoSQL – HBase – Unstructured data storage?
  11. How do you debug a production issue? Give an example. (logs, script counters, JVM)
  12. Data Ingestion
  13. What is the file size you’ve used?

Dev. Environment

Production Environment

  1. Does Hive support indexing? (How does this relate to Partition and Bucketing)
  2. Does Pig support conditional loops?
  3. Hive – What type of data is stored?
  4. Recruiter: In your experience, what is the jump from DB developer to Hadoop without Java experience?

More Technical type Interview Questions:

  1. What functions did you use in PIG?
  2. Filter – What did you filter out?
  3. Join – What did you join?
  4. What is your cluster size?
  5. What is the file size for production environment?
  6. How long does it take to run your script in Production cluster?
  7. Are you planning for anything to improve the performance?
  8. What size of file do you use for Development?
  9. What did you work on in HBase?
  10. Why Hadoop? (Compare to RDBMS.)
  11. Hive – What did you do to increase the performance?
  12. Pig – What did you do to increase the performance?
  13. What Java UDF did you write?
  14. What scenario do you think you can use Java for?
  15. You can process log files in RDBMS too. Why Hadoop?
  16. Hive partitioning – your project example? Why?


  1. Hive – What file format do you use in your work? (Avro, Parquet, Sequence file)
  2. Hadoop – What is the challenge or difficulty you’ve faced?
  3. PIG – What is the challenge or difficulty you’ve faced?
  4. Flume – What is the challenge or difficulty you’ve faced?
  5. Sqoop – What is the challenge or difficulty you’ve faced? (he didn’t ask this question)
  6. How experienced are you in Linux?
  7. What shell type do you use?
  8. How about your experience in Cloudera Manager?
  9. Do you use Impala? (I compared it with Hive and explained in more detail.)
  10. How do you select the ecosystem tools for your project?

InfoSys – Interview Questions:

As you can see, questions are mostly based on theory.

  1. Why Hadoop? (Compare to RDBMS)
  2. What would happen if NameNode failed? How do you bring it up?
  3. What details are in the “fsimage” file?
  4. What is SecondaryNameNode?
  5. Explain the MapReduce processing framework? (start to end)
  6. What is Combiner? Where does it fit and give an example? Preferably from your project.
  7. What is Partitioner? Why do you need it and give an example? Preferably from your project.
  8. Oozie – What are the nodes?
  9. What are the actions in Action Node?
  10. Explain your Pig project?
  11. What log file loaders did you use in Pig?
  12. Hive Joining?  What did you join?
  13. Explain Partitioning & Bucketing (based on your project)?
  14. Why do we need bucketing?
  15. Did you write any Hive UDFs?
  16. Filter – What did you filter out?
  17. HBase?
  18. Flume?
  19. Sqoop?
  20. Zookeeper?
  21. Impala? Explain the use of Impala?
  22. Cassandra? What do you know about Cassandra?
  23. ClickStream.
  24. What is your cluster size?
  25. What are the DataNode configurations? (RAM, CPU cores, disk size)
  26. What are the NameNode configurations? (RAM, CPU cores, disk size)
  27. How many map slots & reducer slots are configured in each DataNode? (he didn’t ask this)
  28. How do you copy files from cluster to cluster?
  29. What commands do you use to check system health, jobs, etc.?
  30. Do you use Cloudera Manager to monitor and manage the jobs, cluster, etc.?
  31. What is Speculative execution?
  32. What do you know about Scala? (interviewer asked about the skills that I listed in my resume)

Java Interview Questions:

1) Given an array with the following elements: [29 12 24 18 -11 -5]
Required output after sorting the array: [12 18 24 29 -5 -11]
Required output of even and odd numbers in the array: [12 18 24] and [29 -5 -11]

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/* Sort an array into even numbers and odd numbers */
public class SortNumbers {
    private static int[] array = {29, 12, 24, 18, -11, -5};
    // Declaring the array lists
    private static List<Integer> even = new ArrayList<>();
    private static List<Integer> odd = new ArrayList<>();

    // Put each element into the even list or the odd list
    public static void split(int[] arr, List<Integer> even, List<Integer> odd) {
        for (int i = 0; i < arr.length; i++) {
            if (arr[i] % 2 == 0) {
                even.add(arr[i]);
            } else {
                odd.add(arr[i]);
            }
        }
    }

    // To display the even and odd numbers
    public static void display(List<Integer> list) {
        for (Integer i : list) {
            System.out.print(i + " ");
        }
        System.out.println();
    }

    public static void main(String[] args) {
        split(array, even, odd);
        /* Sorting of the even list using Collections.sort */
        Collections.sort(even);
        /* Sort the odd list in reverse order, which matches the required output */
        Collections.sort(odd, Collections.reverseOrder());
        display(even);
        display(odd);
    }
}

2) How do you make your class compatible with Java HashMaps?
By overriding the hashCode() and equals() methods.
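
As a minimal sketch of that answer, the class below overrides both methods so that two separate but logically equal objects find the same HashMap entry. The EmployeeKey class, its fields, and the sample values are hypothetical and used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class, used only for illustration.
final class EmployeeKey {
    private final int deptId;
    private final String name;

    EmployeeKey(int deptId, String name) {
        this.deptId = deptId;
        this.name = name;
    }

    // Two keys are equal when all of their fields are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof EmployeeKey)) return false;
        EmployeeKey that = (EmployeeKey) other;
        return deptId == that.deptId && Objects.equals(name, that.name);
    }

    // Must be consistent with equals: equal objects return the same hash code.
    @Override
    public int hashCode() {
        return Objects.hash(deptId, name);
    }
}

class HashKeyDemo {
    public static void main(String[] args) {
        Map<EmployeeKey, Integer> salaries = new HashMap<>();
        salaries.put(new EmployeeKey(10, "Ann"), 5000);
        // A different instance with the same field values finds the entry.
        System.out.println(salaries.get(new EmployeeKey(10, "Ann")));
    }
}
```

If only equals() were overridden, the two instances would usually hash to different buckets and the lookup would return null.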

3) You have two tables, Employee and Dept, with the columns below. Select the maximum salary by department.

DEPT: dept_id, dept_name

SELECT d.dept_name, MAX(e.SAL) FROM Employee e, Dept d WHERE d.dept_id = e.dept_id GROUP BY d.dept_name;

On 07/28/2015

1. Tell me some List implementations?



2. For what purposes do you use ArrayList and LinkedList?

ArrayList for fast access by index,

LinkedList for frequent insertions/deletions

3. Between ArrayList and LinkedList, which is faster?

ArrayList is faster for index-based access and iteration, as it is backed by an array

LinkedList is slower for access, as it must traverse its nodes, but adding and removing elements at the ends or through an iterator does not require shifting other elements

4. Tell me some Map implementations?

HashMap (unsorted)

TreeMap (sorted by key)

LinkedHashMap (if you want near-HashMap performance and insertion-order iteration)

5. Which of the Map implementations is faster and why?

HashMap is the fastest in general, as it carries no extra burden of keeping its keys sorted or ordered: get and put take constant time on average, while TreeMap takes O(log n).
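
The ordering difference between the three implementations can be seen with a small sketch. The class name, method name, and sample keys below are hypothetical; the same four keys are inserted into each map and read back in iteration order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapOrderDemo {
    // Inserts the same keys into the given map and returns them in iteration order.
    static List<String> keysAfterInsert(Map<String, Integer> map) {
        map.put("pig", 1);
        map.put("hive", 2);
        map.put("sqoop", 3);
        map.put("flume", 4);
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        // HashMap: no ordering guarantee, so the order may look arbitrary.
        System.out.println("HashMap:       " + keysAfterInsert(new HashMap<>()));
        // TreeMap: keys come back in sorted order.
        System.out.println("TreeMap:       " + keysAfterInsert(new TreeMap<>()));
        // LinkedHashMap: keys come back in insertion order.
        System.out.println("LinkedHashMap: " + keysAfterInsert(new LinkedHashMap<>()));
    }
}
```

TreeMap pays for its sorted iteration with slower inserts and lookups, which is why HashMap is the usual default when order does not matter.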

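6. What happens in the shuffle phase in MapReduce?

The partitioner assigns each map output record to a partition, one per reduce task.

Each reduce task fetches its partition from every map task, so map output is transferred over the network.

The fetched map outputs are merged and sorted by key before the reduce method runs.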

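7. What is the fundamental data structure inside a HashMap?

An array of buckets. The key’s hashCode() (an integer) is used to calculate the bucket index, and the buckets are the storage locations for the entries.

Entries whose keys map to the same bucket are chained together within that bucket.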
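
The bucket idea can be sketched as a very small hash table. This is a simplified illustration and not how java.util.HashMap is actually implemented (the real class also resizes and handles null keys, among other things); the TinyHashMap class and its fixed size of 16 buckets are assumptions made for the example.

```java
// Simplified illustration only; java.util.HashMap is more elaborate.
class TinyHashMap {
    // One entry in a bucket's chain.
    private static final class Node {
        final String key;
        int value;
        Node next;

        Node(String key, int value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // The array of buckets; each slot holds the head of a chain of entries.
    private final Node[] buckets = new Node[16];

    // Maps a key's hash code to a bucket index in [0, buckets.length).
    int indexFor(String key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    void put(String key, int value) {
        int i = indexFor(key);
        for (Node n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                n.value = value; // key already present: overwrite
                return;
            }
        }
        buckets[i] = new Node(key, value, buckets[i]); // prepend to the chain
    }

    Integer get(String key) {
        for (Node n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) return n.value;
        }
        return null; // key not found
    }

    public static void main(String[] args) {
        TinyHashMap map = new TinyHashMap();
        map.put("hive", 1);
        map.put("pig", 2);
        map.put("hive", 3); // overwrites the earlier value
        System.out.println(map.get("hive") + " " + map.get("pig") + " " + map.get("hbase"));
    }
}
```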

8. How do you use the MapReduce methods?

map is the method that parses each input record and emits intermediate key/value pairs

reduce aggregates the values that the map phase emitted for each key

9. What are the parameters of the map method in the Mapper class?

map(key, value, context)

10. What is the interface of the main function of a Mapper?

In the Mapper class you write:

map(key, value, context) (the return type of the map method is void, but it writes its output to the context)


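11. Is it possible to get multiple key/value pairs from the map phase?

Yes. The map method can write any number of key/value pairs to the context for each input record. To carry two or more fields in one key or value, concatenate them into the same field.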
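
The following plain-Java sketch mimics a word-count style map method that emits several pairs for one input record. It uses no Hadoop classes: the Emitter interface is a hypothetical stand-in for the write call on Hadoop’s context, and the class and method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class MultiEmitDemo {
    // Hypothetical stand-in for Hadoop's context.write(key, value).
    interface Emitter {
        void write(String key, int value);
    }

    // Mimics a map method: one input line produces one pair per word.
    // Like the real method, it returns void and writes its output to the context.
    static void map(long offset, String line, Emitter context) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(word, 1);
            }
        }
    }

    // Runs map on a single line and collects the emitted pairs as "key=value".
    static List<String> run(String line) {
        List<String> out = new ArrayList<>();
        map(0L, line, (key, value) -> out.add(key + "=" + value));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run("hive pig hive"));
    }
}
```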

12. Imagine you have a server-class computer with two files of 1 GB each on the hard disk. Each file consists of integers sorted from smallest to largest. How do you merge the files into one file and generate output in sorted order? Tell me the logic.

Read record by record from each file and compare the current record of the first file with the current record of the second file.

If the first record in the 1st file < the first record in the 2nd file, I will emit the first record of the 1st file and move the cursor of the 1st file to its second record, then compare that with the first record in the 2nd file, and so on.

13. What if there are no records in one of the files in the above scenario?

I will copy the records from the remaining file as is, without comparing.
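
The merge logic from the two answers above can be sketched as follows, assuming one integer per line in each input. The class and method names are hypothetical, and the example feeds in-memory strings instead of real files; for files on disk, the readers and writer would wrap FileReader and FileWriter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class SortedMerge {
    // Merges two sources of ascending integers (one per line) into out, ascending.
    // Only one record per input is held in memory at a time.
    static void merge(BufferedReader a, BufferedReader b, Writer out) throws IOException {
        String x = a.readLine();
        String y = b.readLine();
        while (x != null && y != null) {
            if (Long.parseLong(x) <= Long.parseLong(y)) {
                out.write(x + "\n");
                x = a.readLine(); // advance the cursor of the first input
            } else {
                out.write(y + "\n");
                y = b.readLine(); // advance the cursor of the second input
            }
        }
        // One input is exhausted: copy the rest of the other without comparing.
        for (; x != null; x = a.readLine()) out.write(x + "\n");
        for (; y != null; y = b.readLine()) out.write(y + "\n");
    }

    // Convenience wrapper for in-memory input, used by main.
    static String mergeText(String a, String b) {
        StringWriter out = new StringWriter();
        try {
            merge(new BufferedReader(new StringReader(a)),
                  new BufferedReader(new StringReader(b)), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(mergeText("1\n4\n9\n", "2\n3\n10\n11\n"));
    }
}
```

Because each record is read once and written once, the running time grows linearly with the total number of records, and an empty input is handled by the final copy loops.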

14. What is the execution time of the above Program?

1-2 minutes in Hadoop.

15. If you have two files of 1 TB each on two disks and must merge them into one file in sorted order, what will you do?

Write all the above logic in the map method of a MapReduce job, or in the reduce method.

16. How are the records of the two files compared in the MapReduce phase?

If one of the files is small, I can read it into memory through the distributed cache in the setup method of the Mapper class.

17. What problems do you face in the reducer phase?

Out-of-memory problems (to overcome this, increase the heap size).

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, who has been involved with several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
