In this post, we will give a basic overview of Hive.
What is Hive?
Hive is an important tool in the Hadoop ecosystem and it is a framework for data warehousing on top of Hadoop.
Hive was initially developed at Facebook, but it is now an open source Apache project used by many organizations as a general-purpose, scalable data processing platform. A high-level overview of Hive follows.
Today, most data warehouse applications are implemented using relational databases that use SQL as the query language.
If these data warehouses are moved onto Hadoop, their users (SQL users) must learn new languages and tools to become productive again on Hadoop data.
Instead, Hive provides HiveQL, which is similar to SQL, so SQL users can learn Hive very easily. Without Hive, developers would face difficulties when porting their SQL applications to Hadoop.
Hive provides the following:
- Tools to enable easy data extract/transform/load (ETL).
- A mechanism to project structure onto a variety of data formats.
- Access to files stored either directly in HDFS or in other data storage systems such as HBase.
- Query execution through MapReduce jobs.
- A SQL-like language called HiveQL that facilitates querying and managing large data sets residing in Hadoop.
Some characteristics and limitations of Hive to keep in mind:
- Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
- Hive does not provide record-level update, insert, or delete.
- Hive queries have higher latency than queries on a traditional relational database, because of the start-up overhead of the MapReduce jobs submitted for each Hive query.
- As Hadoop is a batch-oriented system, Hive doesn’t support OLTP (Online Transaction Processing).
- Hive is close to OLAP (Online Analytical Processing) but not ideal, since there is significant latency between issuing a query and receiving a reply, due both to the overhead of MapReduce jobs and to the size of the data sets Hadoop was designed to serve.
- If we need OLTP-style, record-level access to large data sets, we need to use a NoSQL database like HBase, which can be integrated with Hadoop.
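To make two of the points above more concrete (projecting structure onto raw files, and running a query as a MapReduce job), here is a minimal Python sketch. It is only a conceptual illustration and not how Hive is actually implemented. The `employees` data, the schema, and all function names are made up for this example, and the whole thing runs in a single process instead of on a Hadoop cluster.

```python
from collections import defaultdict

# Raw file contents as they might sit in HDFS: plain tab-delimited text.
# The sample rows are invented for this illustration.
RAW_LINES = [
    "1\talice\tsales",
    "2\tbob\tengineering",
    "3\tcarol\tsales",
    "4\tdave\tengineering",
    "5\terin\tsupport",
]

# Hypothetical table schema. It is applied only when the data is read,
# the files themselves carry no type information.
SCHEMA = [("id", int), ("name", str), ("dept", str)]


def project(line):
    """Apply the schema to one raw line, producing a typed record."""
    fields = line.split("\t")
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


def map_phase(lines):
    """Map: emit a (dept, 1) pair for every record."""
    for line in lines:
        record = project(line)
        yield record["dept"], 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Roughly the result of: SELECT dept, COUNT(*) FROM employees GROUP BY dept
counts = reduce_phase(shuffle_phase(map_phase(RAW_LINES)))
print(counts)  # {'sales': 2, 'engineering': 2, 'support': 1}
```

On a real cluster each of these phases is scheduled as part of a job across many machines, which is where the start-up overhead and latency mentioned above come from.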
Differences Between Hive and HBase:
Hive is not a database but a data warehousing framework. Hive doesn’t provide record-level operations on tables.
HBase is a NoSQL database, and it provides record-level updates, inserts and deletes on table data.
HBase doesn’t provide a query language like SQL, but Hive is now integrated with