Creating Custom UDF in Hive – Auto Increment Column in Hive 5

In this post we will describe about the process of creating custom UDF in Hive. Though there are many generic UDFs (User defined functions)  provided by Hive we might need to write our custom UDFs sometime to meet our requirements. In this post, we will discuss about one of the general requirement for the clients, those migrating from any traditional RDBMSs to Hive, they will expect Auto Increment Column in a table to have unique ID in the column which is very easy to write in SQL.

As of Hive-0.13.1 release, there is no support for Auto Increment Column value in Hive. To achieve this feature in Hive, we need to write custom UDF.

Creating Custom UDF in Hive:

As the Hive language is written in Java, The UDFs need to be written in Java. There are three types of UDFs in Hive:

  • Regular UDFs (User defined functions)
  • UDAFs (User-defined aggregate functions)
  • UDTFs (User-defined table-generating functions).

In this post we will concentrate on only Regular UDFs. Any custom UDFs that we are going to write must satisfy the following two properties:

  • Must extend class org.apache.hadoop.hive.ql.exec.UDF .
  • Must implement at least one evaluate() method.

Hive inspects the UDF to find the evaluate() method that matches the Hive function
that was invoked. Below is the code for our custom UDF for creating Auto Increment Column in Hive.

Here make sure that addition of annotation @UDFType(stateful = true) is required otherwise counter value will not get increment in the Hive column, it will just return value 1 for all the rows but not the actual row number.

Compile this program by adding hive-exec-*.jar file into classpath and build a jar file named AutoIncrementUDF.jar

To use this UDF in Hive we need to register this Jar file in Hive with ADD JAR command as shown below.

Here we can create either Temporary function or Permanent function depending on our requirement. Temporary function will be available during current session only, once the session is closed temporary functions can’t be accessed. This means that we need to add the JAR file, and define the function at the beginning of each session.

Permanent function can be added for the above with below command.

Note:   Changes in hive-site.xml or file

As an alternative to calling ADD JAR command for each session, we can add the paths of our custom jars to value of hive.aux.jars.path property in hive-site.xml file. Once it is added to hive.aux.jars.path then we do not need to add this jar at each session.

Or we can add the same property into file with name HIVE_AUX_JARS_PATH under HIVE_CONF_DIR location, as shown below.

We can use these UDF functions only in Select statements but not in Create Statements. But to create table with auto increment column, we will make use of CREATE TABLE AS SELECT command in Hive.

Auto Increment Column In Hive:

Lets create a sample Hive table t1 with below data.

Hive Table Creation

Now register the Jar file and create Temporary function and create similar table with one extra column for Auto Increment value.

Hive UDF usage

So we have successfully added one extra column with automatic increment value with row number by creating Custom UDF in hive.

Profile photo of Siva

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecture solutions for the Big Data domain and has been involved with several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.

Leave a comment

Your email address will not be published. Required fields are marked *

5 thoughts on “Creating Custom UDF in Hive – Auto Increment Column in Hive

  • Alind Billore

    Hi Siva,

    Many thanks to you for this fantastic tutorial.

    But i have just one query what is the word inc ? after attribute AS.


    > CREATE TABLE increment_table1 (id INT, c1 STRING, c2 STRING, c3 STRING);
    hive> INSERT OVERWRITE TABLE increment_table1 SELECT incr() AS inc, id, c1, c2 FROM t1;
    hive> SELECT * FROM increment_table1;

  • Nikolay Christov

    Keep in mind that this only works because you are running in a single mapper. Having more mappers will cause two separate counts and you will end up with dupe IDs