Cloudera provides a separate tool called Impala to overcome the slowness of Hive queries. Syntactically, Impala queries are more or less the same as Hive queries, but they run much faster. Impala provides high-performance, low-latency SQL queries. When we are dealing with medium-sized data sets and expect real-time responses from our queries, Impala is the best choice, but it is available only in Cloudera’s Hadoop distribution. This post is an introduction to Impala.
Impala is a distributed massively parallel processing (MPP) database engine on Hadoop, shipped with the Cloudera distribution. It is not built on MapReduce: because MapReduce stores intermediate results in the file system, it is too slow for real-time query processing. Impala has its own execution engine, which keeps intermediate results in memory, so query execution is very fast compared to other tools that rely on MapReduce.
- Impala provides high-performance, low-latency SQL queries.
- Impala integrates very well with the Hive metastore, sharing databases and tables between Impala and Hive.
- Compatible with HiveQL syntax.
- Impala integrates well with the HBase database system and Amazon Simple Storage Service (S3), providing SQL front-end access to both.
- Impala is interactive while Hive is batch oriented: Impala queries generally run in seconds to minutes, whereas Hive queries run for minutes to hours.
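Because Impala shares the Hive metastore and is HiveQL-compatible, a table defined in Hive can usually be queried from Impala without any changes. As a minimal sketch (the `web_logs` table and its columns are hypothetical, assumed to already exist in the shared metastore), the same statement runs in both engines:

```sql
-- Hypothetical table `web_logs` registered in the shared Hive metastore;
-- this statement is valid HiveQL and valid Impala SQL.
SELECT status_code, COUNT(*) AS hits
FROM web_logs
WHERE year = 2016
GROUP BY status_code
ORDER BY hits DESC
LIMIT 10;
```

In Hive this would be compiled into MapReduce jobs; in Impala it is executed directly by the ImpalaD daemons, which is where the latency difference comes from.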
Daemons in Impala:
ImpalaD (impala Daemon):
One ImpalaD runs per node; it is installed on every data node. The ImpalaDs do the real work: they form the core of the Impala execution engine, reading data from HDFS/HBase and aggregating/processing it. All ImpalaDs are equivalent.
ImpalaD uses the Hive metastore, which stores the mapping between tables and files, and the HDFS NameNode, which provides the mapping between files and blocks. Thus Impala relies on the Hive metastore and the NameNode to locate and process the data.
- It accepts queries from the impala-shell command, Hue, JDBC, or ODBC.
- Parallelizes queries and distributes work across the cluster; intermediate query results are transmitted back to the central coordinator node.
- The ImpalaD running on the data node where the user submitted the query acts as the coordinator for that query. The ImpalaDs on the other nodes send their partial results to this coordinator, which aggregates/combines them into the final result.
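Since any ImpalaD can act as the coordinator, we can point a client at whichever daemon is convenient. A sketch using `impala-shell` (the hostname is a placeholder; these commands assume a running Impala cluster):

```shell
# Connect to the ImpalaD on a specific data node (placeholder hostname);
# 21000 is the default impala-shell port. This daemon becomes the
# coordinator for all queries in the session.
impala-shell -i datanode3.example.com:21000

# Or run a single query non-interactively with -q
impala-shell -i datanode3.example.com:21000 -q 'SELECT COUNT(*) FROM web_logs;'
```

Hue, JDBC, and ODBC clients connect to an ImpalaD in the same way, just over different ports.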
StatestoreD (Statestore Daemon):
The statestored daemon implements the Impala statestore service, which monitors the availability of Impala services across the cluster and handles situations such as nodes becoming unavailable or becoming available again.
- The statestore daemon is installed on one node of an N-node cluster.
- The statestore daemon is a name service: it keeps track of which ImpalaDs are up and running, and relays this information to all the ImpalaDs in the cluster so they can take it into account when distributing tasks to other ImpalaDs.
- This is not a single point of failure, since the ImpalaDs continue to function even when statestored dies, albeit with stale name-service data.
CatalogD (Catalog Daemon):
CatalogD is installed on one node of an N-node cluster. It distributes metadata (table names, column names, types, etc.) to the Impala daemons via the statestore (which acts as a pub-sub mechanism). This is not a single point of failure, since the Impala daemons continue to function without it. If the metadata is updated outside Impala, the user has to manually run a command to invalidate it.
The CatalogD service avoids the need to issue REFRESH and INVALIDATE METADATA statements when the metadata changes are performed by statements issued through Impala. But if we create a table, load data, and so on through Hive, then we need to issue REFRESH or INVALIDATE METADATA on an Impala node before executing a query there.
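As a sketch of how this looks in practice (`sales` is a placeholder table name, assumed to have been created or loaded through Hive, so Impala's cached metadata is stale):

```sql
-- After new data files were added to an existing table via Hive
-- (or directly in HDFS), refresh that table's file and block metadata:
REFRESH sales;

-- After the table itself was created, dropped, or its schema changed
-- in Hive, discard and reload the cached metadata for it:
INVALIDATE METADATA sales;
```

REFRESH is the lighter-weight of the two; INVALIDATE METADATA forces a full metadata reload, so it should be reserved for schema-level changes.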
The diagram below shows the Impala architecture.
Step-1: The client sends a query to an ImpalaD, which accepts the request.
Step-2: The ImpalaD fetches metadata from the Hive metastore and the HDFS NameNode.
Step-3: The ImpalaD fetches the state of the other ImpalaDs (those holding the relevant data) from the statestore.
Step-4: All the ImpalaDs then execute the query received from the client in parallel.
Step-5: Each ImpalaD has three components:
Planner is responsible for parsing the query. Planning happens in two parts: first, a single-node plan is made, as if all the data in the cluster resided on just one node; second, this single-node plan is converted into a distributed plan based on the location of the various data sources in the cluster (thereby leveraging data locality).
Coordinator is responsible for coordinating the execution of the entire query. In particular, it sends requests to various executors to read and process data. It then receives the data back from these executors and streams it back to the client via JDBC/ODBC.
Executor is responsible for reading the data from HDFS/HBase and/or doing aggregations on data (which is read locally or if not available locally could be streamed from executors of other Impala daemons).
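The distributed plan the planner produces can be inspected with Impala's EXPLAIN statement, which shows the scan fragments and the exchange/aggregation steps without running the query (`web_logs` is again a placeholder table name):

```sql
-- Prints the distributed plan: HDFS scan fragments, the exchange that
-- redistributes rows between ImpalaDs, and the aggregation steps.
EXPLAIN
SELECT status_code, COUNT(*)
FROM web_logs
GROUP BY status_code;
```

Reading the plan bottom-up mirrors the planner's two phases: the leaf scan nodes reflect the single-node plan, and the exchange nodes show where it was split across the cluster.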
Other important points:
- ImpalaDs cache the metadata for fast access.
- Impala is a best fit for the Parquet format (a column-oriented format).
- Impala makes use of the OS cache.
- Cloudera recommends 128 GB of RAM on each node.
- Scales well up to hundreds of users on a small cluster.
- In Impala we can create tables or use existing Hive tables.
- Under load, the number of threads created by an ImpalaD is roughly 2–3× the number of cores.
- Intermediate results are stored in memory.
- The largest table should be listed first in the FROM clause of a query.
- The default join in Impala is the BROADCAST join (if A and B are two tables and A is big, the data in B is broadcast to the nodes holding A).
- A broadcast join is the best fit for one big table joined with small tables.
- A broadcast join is not good for two large tables, because the broadcast data cannot be held in memory.
- For two large tables we have to use a PARTITIONED join.
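Impala lets us force either strategy with join hints when the planner's choice is not the right one. A sketch (the tables are placeholders: `big_a` and `big_b` are large fact tables, `small_dim` is a small dimension table):

```sql
-- Broadcast join: the smaller table (small_dim) is sent in full to
-- every node that holds a fragment of big_a.
SELECT b.id, d.name
FROM big_a b
JOIN /* +BROADCAST */ small_dim d ON b.dim_id = d.id;

-- Partitioned (shuffle) join: both tables are hash-partitioned on the
-- join key across the cluster, so neither large table has to fit in
-- memory on a single node.
SELECT a.id, b.val
FROM big_a a
JOIN /* +SHUFFLE */ big_b b ON a.id = b.id;
```

In Impala's hint syntax, SHUFFLE is the keyword for a partitioned join; the hint goes immediately after the JOIN keyword.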