Hive on Tez – Hive Integration with Tez

In this post, we will discuss Hive integration with the Tez framework, that is, enabling Tez for Hive queries. We will also run sample Hive queries on both the MapReduce and Tez frameworks and compare the performance of the two.

Tez Advantages:

  • Tez offers a customizable execution architecture that allows us to express complex computations as data flow graphs and allows for dynamic performance optimizations based on real information about the data and the resources required to process it.
  • Compared with the MapReduce framework, Tez improves processing speed and scales from gigabytes to petabytes of data and from tens to thousands of nodes.
  • The Apache Tez library allows developers to create Hadoop applications that integrate with YARN and perform well on Hadoop clusters.

Benefits of Integrating Hive with Tez:

  • Tez can translate complex SQL statements into highly optimized, purpose-built data processing graphs that strike the right balance between performance, throughput, and scalability across a wide range of use cases and data set sizes.
  • Tez helps Hive move from batch mode toward interactive queries.
  • Up to the hive-0.12 release, MapReduce was the only framework available in Hive for converting Hive queries into execution jobs on Hadoop clusters. Starting with the hive-0.13 releases (this post uses hive-0.13.1), the Tez execution engine is also available in Hive to improve the performance of complex Hive queries.

Hive on Tez:

By default, the execution engine in Hive is MapReduce (mr), so we don’t need to specify it explicitly to submit MapReduce jobs from our Hive queries. To set up Hive on Tez, we need at least the following components:

  • A running Hadoop 2 cluster with the YARN framework
  • Hive-0.13.1 installed on the Hadoop cluster
  • Tez installed and configured successfully on Hadoop

For Tez installation and configuration on Hadoop 2, refer to our previous post on the Tez framework.

We assume all three of the above are already installed and running fine.

Hive setup for Tez:

Since Tez is already installed and we are able to run sample Tez DAG jobs on the Hadoop cluster, we can now easily set up Hive for the Tez engine.

We need to perform the following steps in order.

  • The Hive 0.13.1 release ships with Tez support. We need to copy the hive-exec-0.13.1.jar file from the $HIVE_HOME/lib directory into the HDFS directory specified by the tez.lib.uris property in the tez-site.xml file under ${TEZ_CONF_DIR}. In this post, that HDFS directory is /apps/tez-0.4.1. The jar can be copied with the hadoop fs -put command.

  • To run a query on the Tez engine, we either set hive.execution.engine=tez; in each Hive session or change this value permanently in hive-site.xml. In this post, we simply set this Hive variable per session so that we can compare the results of the mr and tez engines.
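The two setup steps above can be sketched in shell as follows. This is a minimal sketch rather than the exact commands from the original post: the function names stage_hive_exec_jar and hive_on_tez are hypothetical helpers, HIVE_HOME is assumed to be set, and /apps/tez-0.4.1 is assumed to match tez.lib.uris in your tez-site.xml.

```shell
# Hypothetical helper: copy the Hive execution jar into the HDFS directory
# named by tez.lib.uris (defaults to the directory used in this post).
stage_hive_exec_jar() {
    tez_lib_dir="${1:-/apps/tez-0.4.1}"
    hadoop fs -put "$HIVE_HOME/lib/hive-exec-0.13.1.jar" "$tez_lib_dir/"
}

# Hypothetical helper: start the Hive CLI with Tez as the execution engine
# for that session only; hive-site.xml is left unchanged.
hive_on_tez() {
    hive --hiveconf hive.execution.engine=tez "$@"
}
```

Calling stage_hive_exec_jar once and then hive_on_tez should give a session equivalent to typing set hive.execution.engine=tez; at the hive> prompt.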

That’s it; the Hive setup for Tez is done.

Sample run of Hive Queries on Tez:

To test the performance improvement of Tez over MapReduce, let’s create a sample Hive table and run some basic queries on it. The sample data used for the examples in this section is available at: SampleUserData

  • Let’s create the USER_TEST table with the schema below. Save the code snippet as createuser.hql.

  • Run this script in Hive.
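One way to run the script non-interactively is sketched below. It assumes createuser.hql is in the current directory and creates a table named user_test. The helper name create_user_test_table is hypothetical.

```shell
# Hypothetical helper: execute the table-creation script, then print the
# resulting table definition to confirm that the table exists.
create_user_test_table() {
    hive -f createuser.hql && hive -e 'DESCRIBE user_test;'
}
```

The same script can also be run from inside the Hive shell if you prefer to stay in one session.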

[Screenshot: Hive managed table creation]

Run the Hive queries below on the MR and Tez frameworks:

MapReduce Framework:

Log in to the Hive shell, set hive.execution.engine=mr to run the queries through MapReduce jobs, and note the execution time for each query.
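As a rough sketch, the engine can also be chosen per invocation from the command line. The helper name run_on_engine is hypothetical, and any query passed to it is a placeholder, since the exact queries used in this post are not reproduced here.

```shell
# Hypothetical helper: run one query on the chosen engine ("mr" or "tez").
# The Hive CLI reports a "Time taken" line after the query finishes.
run_on_engine() {
    engine="$1"
    query="$2"
    hive --hiveconf hive.execution.engine="$engine" -e "$query"
}
```

For example, run_on_engine mr 'SELECT COUNT(*) FROM user_test' would run a placeholder count query through MapReduce jobs, and the same call with tez would run it on Tez.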

Query 1:

[Screenshot: Query 1 on the mr engine]

So it took around 13.5 seconds.

Query 2:

[Screenshot: Query 2 on the mr engine]

This second query took 18.2 seconds.

Tez Framework:

Now we will run the same two queries on the Tez framework after setting hive.execution.engine=tez.

The first query run after setting this property takes some extra time compared with subsequent queries. This is because, during the first query execution, Tez allocates the containers required for the Tez session. Once the Tez session is established, later queries do not pay that startup cost.

That’s why, in the screen below, the first run of the query took around 13 seconds whereas the second run of the same query took just 6 seconds.
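The warm-up effect can be observed by running the same query twice within a single Hive session, as in this minimal sketch. The helper name run_twice_on_tez is hypothetical, and the query is assumed to be passed without a trailing semicolon.

```shell
# Hypothetical helper: run the same query twice in one Hive CLI session on
# Tez, so that the second run can reuse the already established Tez session.
run_twice_on_tez() {
    query="$1"   # assumed to have no trailing semicolon
    hive --hiveconf hive.execution.engine=tez -e "$query; $query;"
}
```

Comparing the two "Time taken" lines in the output should show the first run paying the session startup cost and the second run avoiding it.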

Query 1:

[Screenshot: Query 1 on the tez engine]

So this query took just 6 seconds, which is about 2.25 times as fast as the mr engine’s run of around 13.5 seconds.

Query 2:

[Screenshot: Query 2 on the tez engine]

This query also ran more than twice as fast as on the mr engine (around 18 seconds).

So, for these sample queries, Tez ran about twice as fast as the MR framework.
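To make the comparison concrete, the Query 1 timings reported above (13.5 seconds on mr, 6 seconds on tez) can be turned into a speedup factor and a percentage reduction in execution time. The snippet below is only a worked calculation on those two numbers. The variable names are illustrative.

```shell
# Timings reported in this post for Query 1, in seconds.
mr_seconds=13.5
tez_seconds=6

# Speedup factor: how many times as fast Tez ran (MR time / Tez time).
speedup=$(awk -v mr="$mr_seconds" -v tez="$tez_seconds" 'BEGIN { printf "%.2f", mr / tez }')

# Percentage reduction in execution time relative to MR.
reduction=$(awk -v mr="$mr_seconds" -v tez="$tez_seconds" 'BEGIN { printf "%.1f", (1 - tez / mr) * 100 }')

echo "Tez ran ${speedup}x as fast as MR (${reduction}% less time)"
```

This prints a speedup of 2.25x and a 55.6% reduction in execution time, for a single query on a small sample table.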

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, who has been involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
