Impala Commands Cheat Sheet

This is quick touch on Impala commands and Functions. Impala accepts basic SQL syntax and below is the list of a few operators and commands that can be used inside Impala. This is just a quick cheat sheet.


In Impala, a database is a logical container for a group of tables. Each database defines a separate namespace. Within a database, you can refer to the tables inside it using their unqualified names. Different databases can contain tables with identical names. Creating a database is a lightweight operation.

There are no database-specific properties to configure. Therefore, there is no ALTER DATABASE.

Typically, you create a separate database for each project or application, to avoid naming conflicts between tables and to make clear which tables are related to each other.

Each database is physically represented by a directory in HDFS.

There is a special database, named default, where you begin when you connect to Impala. Tables created in default are physically located one level higher in HDFS than all the user-created databases.


Tables in impala are very similar to hive tables which will hold the actual data.

There are two types of tables

  1. Internal table:

These are managed by Impala, use directories inside the designated Impala work area.

  1. External table

They use arbitrary HDFS directories, where the data files are typically shared between different Hadoop components

To see whether a table is internal or external, and its associated HDFS location, issue the statement DESCRIBE FORMATTED table_name. The Table Type field displays MANAGED_TABLE for internal tables and EXTERNAL_TABLE for external tables. The Location field displays the path of the table directory as an HDFS URI.


Views are lightweight logical constructs that act as aliases for queries. You can specify a view name in a query (a SELECT statement or the SELECT portion of an INSERT statement) where you would usually specify a table name.

Take a complicated reporting query, plug it into a CREATE VIEW statement…

and now you can produce the report with 1 line of * from v1;


Create Database

To get List Databases

Use Database

Create Table

Creating table syntax in Impala is same as hive table creation syntax. Below is example table creation

We can create tables in many ways as shown below,

  • Create external table
  • Create table with Column definitions inferred from data file.

  • Create table using PARTITIONED BY clause
  • Cloning table: To create an empty table with same columns and other properties of already existing table we LIKE

Show Tables

will show the list tables available in the particular database (Here it will show list of tables in “impala_test” database)

Load Data into table

Note: if you get below exception,

Then you have to change the permission of the file which you are loading into table


We can copy data from one table to another table using below syntax.







Exists/ not exists

IN/not IN

Note: A correlated subquery used in EXISTS and IN operators cannot include a LIMIT clause.


The RLIKE operator is a synonym for REGEXP.

LIKE,IS NULL,Logical operators are pretty much same as in SQL.


REFRESH Statement

Refresh statement is to load the latest metastore metadata and block location data for a particular table. This is only for table level


Marks the metadata for one or all tables as stale. Required after a table is created through the Hive shell, before the table is available for Impala queries. The next time the current Impala node performs a query against a table whose metadata is invalidated, Impala reloads the associated metadata before the query proceeds. This is a relatively expensive operation compared to the incremental metadata update done by the REFRESH statement, so in the common scenario of adding new data files to an existing table, prefer REFRESH rather than INVALIDATE METADATA.

By default, the cached metadata for all tables is flushed. If you specify a table name, only the metadata for that one table is flushed. Even for a single table,INVALIDATE METADATA is more expensive than REFRESH, so prefer REFRESH in the common case where you add new data files for an existing table.


It gives the details of table

Gives below output

EXPLAIN Statement

Will give the execution plan of a query

User-Defined Functions (UDFs) in Impala is same as hive UDF and Impala doesnot support UDTF

Starting Impala from Command Line

To start the Impala state store and Impala from the command line or a script, you can either use the service command or you can start the daemons directly through the impalad, statestored, and catalogd executables in cloudera quickstart vm.

Start the statestore service using a command similar to the following:

Start the catalog service using a command similar to the following:

Start the Impala service on each data node using a command similar to the following:

Joins Refer at cloudera link below

Impala Performance Tuning:

  • Use Parquet File format with Snappy Compression
  • Maintain the file block size as around 1 GB
  • Make the tables partitioned by any appropriate column, mostly the columns that are being used in Join conditions or where conditions.
  • In Joins arrange the tables in Largest vs Tiny vs Small vs Medium vs Big order
  • Use SHUFFLE hint before the joins if the tables are big in size otherwise use BROADCAST
  • Try to avoid joins but use IN clauses to replace them
  • Submit COMPUTE STATS whenever there is data refresh into the tables
  • Enable short circuit reads in the Impala cluster to improve the read performance.
  • Use appropriate data types to avoid implicit conversions.

Profile photo of Siva

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecture solutions for the Big Data domain and has been involved with several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.

Leave a comment

Your email address will not be published. Required fields are marked *