Monthly Archives: October 2014


In this post we describe the process of creating a custom UDF in Hive. Though Hive provides many generic UDFs (user-defined functions), we sometimes need to write custom UDFs to meet our requirements. In this post, we will discuss one of the […]

Creating Custom UDF in Hive – Auto Increment Column in ...


In this post, we will discuss Hadoop installation in the cloud. Though a number of posts on this topic are available across the internet, we are documenting the procedure for Cloudera Manager installation on Amazon EC2 instances, with some of our practical views on installation and tips and hints to […]

Cloudera Manager Installation on Amazon EC2


Below are a few important Hadoop HBase interview questions and answers suitable for Hadoop freshers and experienced developers. 1. What is HBase? HBase is a column-oriented, open-source, multidimensional, distributed database. It runs on top of HDFS. 2. Why do we use HBase? HBase provides random read and […]

HBase Interview Questions and Answers Part – 1



In this post we will discuss the configuration required for Hive connectivity with Hunk, the Hadoop flavor of Splunk, the well-known visualization tool. Splunk Overview: Splunk captures, indexes, and correlates real-time data in a searchable repository from which it can generate graphs, reports, dashboards, and visualizations. Splunk released a product […]

Hive Connectivity With Hunk (Splunk)


Phoenix HBase Overview: What is Apache Phoenix? Apache Phoenix is another top-level project from the Apache Software Foundation. It provides an SQL interface to HBase, acting as an SQL layer on top of the HBase architecture. It maps the HBase data model to the relational world. Phoenix is developed in Java […]

Apache Phoenix – An SQL Layer on HBase


In this post, we will discuss Hive integration with the Tez framework, that is, enabling Tez for Hive queries. We will also run sample Hive queries on both the MapReduce and Tez frameworks and evaluate the performance difference between them. Tez Advantages: Tez offers a customizable […]

Hive on Tez – Hive Integration with Tez



Apache Tez Overview: What is Apache Tez? Apache Tez is another execution framework project from the Apache Software Foundation, built on top of Hadoop YARN. It is considered a more flexible and powerful successor to the MapReduce framework. Apache Tez Features: Tez provides: performance gain over MapReduce […]

Apache Tez – Successor of Mapreduce Framework


In this post, we will discuss the setup needed for HBase integration with Hive. We will test this integration by creating some test HBase tables from the Hive shell, populating them from another Hive table, and finally verifying the contents in the HBase table. […]

HBase Integration with Hive


This post is a continuation of the previous post on the small files issue. In the previous post we merged a huge number of small files in an HDFS directory into a SequenceFile; in this post we will merge a huge number of small files on the local file system into an Avro file on HDFS […]

Merging Small Files Into Avro File



In this post, we will discuss one of the well-known use cases of SequenceFiles: merging a large number of small files into a SequenceFile. This requirement arises mainly because Hadoop and MapReduce do not process large numbers of small files efficiently. Need For […]

Merging Small Files into SequenceFile


Error Scenario: java.io.IOException: Cannot create an instance of InputFormat class. We get this error message when we try to execute simple hadoop fs commands or run any Hive queries. Root Cause: This error message is received when there are any spaces […]

Cannot create an instance of InputFormat


This post is a continuation of the previous post on Hadoop sequence files. In this post we will discuss reading and writing SequenceFile examples using the Apache Hadoop 2 API. Writing Sequence File Example: As discussed in the previous post, we will use the static method SequenceFile.createWriter(conf, opts) to create a SequenceFile.Writer instance and […]

Reading and Writing SequenceFile Example



In addition to text files, Hadoop also supports binary files. Among these binary file formats, Hadoop Sequence Files are a Hadoop-specific file format that stores serialized key/value pairs. In this post we will discuss the basic details and format of Hadoop sequence files, with examples. Hadoop […]

Hadoop Sequence Files example



Avro supports both the old MapReduce package API (org.apache.hadoop.mapred) and the new MapReduce package API (org.apache.hadoop.mapreduce). Avro data can be used as both input to and output from a MapReduce job, as well as the intermediate format. In this post we will provide an example run of the Avro MapReduce 2 API. This […]

Avro MapReduce 2 API Example