Partitioning in Hive 14

In this post, we will discuss about one of the most critical and important concept in Hive, Partitioning in Hive Tables.

Partitioning in Hive

Table partitioning means dividing table data into some parts based on the values of particular columns like date or country, segregate the input records into different files/directories based on date or country.

Partitioning can be done based on more than column which will impose multi-dimensional structure on directory storage. For Example, In addition to partitioning log records by date column, we can also sup divide the single day records into country wise separate files by including country column into partitioning. We will see more about this in the examples.

Partitions are defined at the time of table creation using the PARTITIONED BY clause, with a list of column definitions for partitioning.


As shown in syntax, we can also add comments to partitioned columns.

  • Partitioning is used for distributing execution load horizontally.
  • As the data is stored as slices/parts, query response time is faster to process the small part of the data instead of looking for a search in the entire data set.
  • For example, In a large user table where the table is partitioned by country, then selecting users of country ‘IN’ will just scan one directory ‘country=IN’ instead of all the directories.
  • Having too many partitions in table creates large number of files and directories in HDFS, which is an overhead to NameNode since it must keep all metadata for the file system in memory only.
  • Partitions may optimize some queries based on Where clauses, but may be less responsive for other important queries on grouping clauses.
  • In Mapreduce processing, Huge number of partitions will lead to huge no of tasks (which will run in separate JVM) in each mapreduce job, thus creates lot of overhead in maintaining JVM start up and tear down. For small files, a separate task will be used for each file. In worst scenarios, the overhead of JVM start up and tear down can exceed the actual processing time.
Example Scenarios
  • Partitioning is used in real-time log files analysis to segregate the records based on time stamp or date value to see the results day wise quickly.
  • Another real-time use is that, Customer/user details are partitioned by country/state or department for fast retrieval of subset data pertaining to some category.
  • Sales records by-product type, country, year and month is another commonly used scenario.

In this post we will try examples of use case 2.

Sample Use Case

Lets explore the other features of partitions with the help of sample use case of Loading User records into Hive and performing some queries.

Sample User Records file for testing in this post –> UserRecords

Observation of Input Data

Input data has below fields or columns.

  • First Name
  • Last Name
  • Address
  • Country
  • City
  • State
  • Postal Code
  • Phone Number
  • Alternative Phone Number
  • Email Id
  • Website URL

Easiest part is that, each field is separated by ‘,’ and no field contains the same ‘,’ in its values. Lets Assume we need to create Hive Table partitioned_user partitioned by Country and State and load these input records into table is our requirement.

Creation of Partition Table

Managed Partitioned Table

Below is the HiveQL to create managed partitioned_user table as per the above requirements.

Note that we didn’t include country and state columns in table definition but included in partition definition. If we include them, then we will encounter error scenario 1. We can verify the partition columns of the table with the help of below command.

Partitioned columns country and state can be used in Query statements WHERE clause and can be treated regular column names even though there is actual column inside the input file data.

External Partitioned Tables

We can create external partitioned tables as well, just by using the EXTERNAL keyword in the CREATE statement, but for creation of External Partitioned Tables, we do not need to mention LOCATION clause as we will mention locations of each partitions separately while inserting data into table.                                                  

Inserting Data Into Partitioned Tables

Data insertion into partitioned tables can be done in two modes.

  • Static Partitioning
  • Dynamic Partitioning

Static Partitioning in Hive

In this mode, input data should contain the columns listed only in table definition (for example, firstname, lastname, address, city, post, phone1, phone2, email and web) but not the columns defined in partitioned by clause (country and state).

If our input column layout is according to the expected layout and we already have separate input files for each partitioned key value pairs, like one separate file for each combination of country and state values (country=US and state=CA), then these files can be easily loaded into partitioned tables with below syntax.

Loading Data into Managed Partitioned Table From Local FS


For example, lets take below 3 records, which are not containing partitioned columns and save into staticinput.txt file on home directory. And assume that all these records belongs to country=US and State=CA.

Now this file can be loaded into partitioned table with below syntax by specifying the country and state value at load time itself.

This will create separate directory under the default warehouse directory in HDFS.

Similarly we have to add other partitions, which will create corresponding directories in HDFS. Or else we can load the entire directory into Hive table with single command and can add partitions for each file with ALTER command.

Loading Partition From Other Table

We can load or add partitions with query results from another table as shown below.

Overwriting Existing Partition

We can overwrite an existing partition with help of OVERWRITE INTO TABLE partitioned_user clause.

Loading Data into External Partitioned Table From HDFS

There is alternative for bulk loading of partitions into hive table. As data is already present in HDFS and should be made accessible by Hive, we will just mention the locations of the HDFS files for each partition.

If our files are on Local FS, they can be moved to a directory in HDFS and we can add partition for each file in that directory with commands similar to below.

Similarly we need to repeat the above alter command for all partition files in the directory so that a meta data entry will be created in metastore, mapping the partition and table.

Dynamic Partitioning in Hive

Instead of loading each partition with single SQL statement as shown above, which will result in writing lot of SQL statements for huge no of partitions, Hive supports dynamic partitioning with which we can add any number of partitions with single SQL execution. Hive will automatically splits our data into separate partition files based on the values of partition keys present in the input files.

It gives the advantages of easy coding and no need of manual identification of partitions. This dynamic partition suits well for our example requirement on user records provided above.

For dynamic partition loading we will not provide the values for partition keys, as shown below for previously seen query.

We can also mix dynamic and static partitions by specifying it as PARTITION(country = ‘US’, state). But static partition keys must come before the dynamic partition keys.

But by default, Dynamic Partitioning is disabled in Hive to prevent accidental partition creations. To use dynamic partitioning we need to set below properties either in Hive Shell or in hive-site.xml file.

We can set these through hive shell with below commands,

Here, except dynamic partition mode property, remaining three are self explanatory. By default, dynamic partition mode is set to strict in Hive to specify at least one static partition column (key), which prevents generation huge no of partitions by a badly designed query.  

Note that, this dynamic partition strict mode is different from hive.mapred.mode=strict, but it is right time to discuss about mapreduce strict mode also, because if this property is set to strict, then we cannot certain queries on partitioned tables as well. 

Strict Mode in Hive

In mapreduce strict mode (hive.mapred.mode=strict) , some risky queries are not allowed to run. They include:

  • Cartesian Product.
  • No partition being picked up for a query.
  • Comparing bigints and strings.
  • Comparing bigints and doubles.
  • Orderby without limit.

According to point 2 and 5, we can not use SELECT statements without at least one partition key filter (like WHERE country=’US’) or ORDER BY clause without LIMIT condition on partitioned tables. But by default this property is set to nonstrict.                                                                   

Example Use Case

With the help of above concepts lets create the dynamic partitioned table for the user records provided on first page of this post –> UserRecords

As this input file contains, partitions keys also as fields in each record, we need to create temporary user with the all the columns present in the input file and from that we need to extract the columns needed into partition table by keeping country and state columns as partition keys.

Lets also examine the performance of this temporary table and partitioned table for a SELECT statement with WHERE clause to confirm the performance improve on partitioned table.

We can create this temporary table and partitioned table and also load partitioned table dynamically with the help of below HiveQL. After going through all the above concepts, this HiveQL will be self explanatory, so we are not providing details/description of implementation but just the results.

Output of the above script execution

We can see the partitioned table query resulted in 22 sec whereas temp_user table resulted in 28 sec for the same query. Thus we can confirm the speed of partitioned tables over regular tables.


When inserting data into a partition, it’s necessary to include the partition columns as the last columns in the query. The column names in the source query don’t need to match the partition column names, but they really do need to be last.                         
Below are a few more commands that are supported on Hive partitioned tables.

  • Show Command
  • Describe Command
  • Alter Command

Show Partitions

We can see the partitions of a partitioned table with SHOW command as shown below.

If we have a lot of partitions and want to see partitions for particular partition keys, we can further restrict the command with an optional PARTITION clause that specifies one or more of the partitions with specific values.

Describe partitions

As we already know how to see the descriptions of tables, Now we can see the descriptions of each partition with commands similar to below.

Alter Partitions

We can alter/change partitions (add/change/drop) with the help of below commands.

Adding Partitions

We can add partitions to an existing table with ADD PARTITION clause as shown below.

Changing Partitions

We can change a partition location with commands like below. This command does not move the data from the old location and does not delete the old data but the reference to old data file will be lost.

Drop Partitions

We can drop partitions of a table with DROP IF EXISTS PARTITION clause as shown below.

Other rarely used alter commands on partitions

The ARCHIVE PARTITION clause captures the partition files into a Hadoop archive (HAR) file.This only reduces the number of files in the filesystem, reducing the load on the NameNode, but doesn’t provide any space savings.

we can un archive these with UNARCHIVE PARTITION clause. 

The following statements prevent the partition from being dropped and queried.

To reverse either operation, replace ENABLE with DISABLE.

Error Scenario 1

While creating Partitioned tables based on particular columns, then these columns should not be defined in the column types of the table as they will be included in PARTITION BY clause. If we declare then in the definition as well as shown below,

Then we will receive below error message, as we have included dept STRING table definition, by removing which, the issue will be solved.

Error Scenario 2

When we do not set hive.exec.max.dynamic.partitions.pernode to appropriate value then we will receive error messages like below.

Solution 2

We need to override the values of below two properties to at least 1000 per each value. This can be done in hive-site.xml file or in hive CLI shell with the help of SET commands.


Apache Language Manual –

Profile photo of Siva

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecture solutions for the Big Data domain and has been involved with several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.

Leave a Reply to glim Cancel reply

Your email address will not be published. Required fields are marked *

14 thoughts on “Partitioning in Hive

  • senthil kumar

    Hi Siva,

    Where i can see the error scenario 1.You have mentioned about this one under managed partitions table topic …



    • amin

      from =
      If, when creating a partitioned table, you get this error: “FAILED: Error in semantic analysis: Column repeated in partitioning columns,” it means you are trying to include the partitioned column in the data of the table itself. You probably really do have the column defined. However, the partition you create makes a pseudocolumn on which you can query, so you must rename your table column to something else (that users should not query on!).

  • Arpit Jain


    Thanks for crystal clear explanation of complicated partition topic. I need clarification for below query. can you please help on this.



    In hive partition , Hive create hierarchy of folder structure on basis of partition key inside default hive/warehouse directory.
    If file already present in HDFS on location /user/arpit/hive_hdfs/input_data.txt , And I want to create external table with partition over column 2.
    Will Hive create hierarchy of folders for external table ?
    If it will create then what will be the location for partition folders ?

  • Rachit Shah

    LOADING PARTITION FROM OTHER TABLE command gives error :
    hive> INSERT OVERWRITE TABLE partitioned_user
    PARTITION (country = ‘US’, state = ‘AL’)
    SELECT * FROM another_user au
    WHERE = ‘US’ AND au.state = ‘AL’;

    FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into target table because column number/types are different ”MI”: Table insclause-0 has 9 columns, but query has 11 columns.

    However, I have tried below statement and it is working fine :

    INSERT OVERWRITE TABLE partitioned_user1 PARTITION (country=’US’,state=’MI’) SELECT first_name,
    web FROM partitioned_user7 pu7 WHERE’US’ AND pu7.state=’MI’;



    • Profile photo of Siva
      Siva Post author

      Yes this is how it works, we should not use select * from in most of the partition insert queries but we need to specify the columns and the partition columns coming at the end