HBase is Hadoop’s database. Below is a high-level overview of HBase.
What is HBase?
HBase is a scalable, distributed, column-oriented database built on top of Hadoop and HDFS. Apache HBase is an open-source, non-relational database modeled on Google’s Bigtable, as described in the paper “Bigtable: A Distributed Storage System for Structured Data”.
HBase provides random, real-time read/write access to Big Data.
Need For HBase:
Although most organizations store their data in an RDBMS, an RDBMS becomes inefficient when the data is very large. It scales only up to a point, the size of a single database server, and it needs specialized hardware and storage devices for best performance.
RDBMS databases are not built with very large scale and distribution in mind. Joins, complex queries, triggers, views, and foreign-key constraints become very expensive to run on a scaled RDBMS, or sometimes do not work at all.
HBase is designed to overcome these scalability and distribution limitations. HBase is a non-relational database and doesn’t support SQL, but with proper usage it can handle workloads that an RDBMS cannot. HBase presents a key-value, schema-less, column-oriented view of data. Column families are defined when a table is created, and any number of columns can be added to them at run time.
An HBase lookup is a key-value mapping from a row key and column to a column value.
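The key-value view described above can be sketched as a nested map. This is a toy in-memory model, not the HBase client API: the class `ToyTable` and its methods are hypothetical names chosen for illustration, and cell versions (timestamps) are left out to keep the example short.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of an HBase table's logical layout (not the HBase API)."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        # row key -> "family:qualifier" -> value
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        family, _, _qualifier = column.partition(":")
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Any qualifier is accepted at run time; no schema change is needed.
        self.rows[row_key][column] = value

    def get(self, row_key, column):
        # Lookup: (row key, column) -> value, or None if the cell is absent.
        return self.rows.get(row_key, {}).get(column)

    def scan(self, start_row, stop_row):
        # Rows come back in sorted row-key order; start inclusive, stop exclusive.
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, dict(self.rows[key])


table = ToyTable(["info"])
table.put("user#001", "info:name", "Ada")
table.put("user#002", "info:email", "bob@example.com")  # a new column, added on the fly
print(table.get("user#001", "info:name"))
```

Note that the two rows carry different columns, and a missing cell simply returns nothing rather than a NULL stored in a fixed schema.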
HBase is a type of “NoSQL” database. “NoSQL” is a general term meaning that the database isn’t an RDBMS; it is commonly expanded as “Not only SQL”. As HBase doesn’t support SQL, it lacks many of the features that we find in an RDBMS, such as triggers, secondary indexes, typed columns and advanced query languages.
Though HBase lacks some important RDBMS features, it provides several features of its own:
HBase supports linear and modular scalability. New region servers can easily be added to an HBase cluster, and expanding the cluster increases both its storage and its processing capacity.
HBase supports strictly consistent reads and writes.
HBase supports automatic sharding. Tables are divided into regions that are distributed across the cluster, and regions are automatically split and redistributed as data grows.
HBase supports HDFS as its distributed file system and MapReduce as its parallel processing framework.
HBase supports automatic RegionServer failover.
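The automatic sharding listed above can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not HBase's implementation: `Region`, `ToyShardedTable` and `max_rows_per_region` are hypothetical names, and the split is triggered by a row count, whereas HBase decides based on the size of a region's store files.

```python
import bisect


class Region:
    """A contiguous, sorted range of row keys: [start_key, end_key)."""

    def __init__(self, start_key, end_key, keys=None):
        self.start_key = start_key  # inclusive; "" means "from the beginning"
        self.end_key = end_key      # exclusive; None means "to the end"
        self.keys = sorted(keys or [])

    def split(self):
        # Split at the middle row key into two daughter regions.
        mid = self.keys[len(self.keys) // 2]
        left = Region(self.start_key, mid, [k for k in self.keys if k < mid])
        right = Region(mid, self.end_key, [k for k in self.keys if k >= mid])
        return left, right


class ToyShardedTable:
    def __init__(self, max_rows_per_region=4):
        # Hypothetical threshold, counted in rows for simplicity.
        self.max_rows = max_rows_per_region
        # A new table starts as a single region covering every key.
        self.regions = [Region("", None)]

    def _find(self, row_key):
        # Regions stay sorted by start key; pick the last one that
        # starts at or before row_key.
        starts = [r.start_key for r in self.regions]
        return bisect.bisect_right(starts, row_key) - 1

    def put(self, row_key):
        i = self._find(row_key)
        region = self.regions[i]
        if row_key not in region.keys:
            bisect.insort(region.keys, row_key)
        if len(region.keys) > self.max_rows:
            # Replace the full region with its two daughters.
            self.regions[i:i + 1] = region.split()


sharded = ToyShardedTable(max_rows_per_region=4)
for n in range(10):
    sharded.put(f"row{n:02d}")
for r in sharded.regions:
    print(repr(r.start_key), repr(r.end_key), r.keys)
```

In a real cluster the daughter regions would then be assigned to region servers, which is the redistribution step; this sketch only shows the splitting.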
At a high level, the HBase architecture is as follows. Like HDFS and MapReduce, HBase is based on a master/slave architecture, in which HMaster is the master node and region servers are the slave nodes.
HBase depends on ZooKeeper and by default it manages a ZooKeeper instance. HBase can be configured to use an existing ZooKeeper cluster instead. RegionServer slave nodes are listed in the HBase conf/regionservers file. HBase persists data via the Hadoop filesystem API. The client HTable contains the logic of finding the server responsible for a particular region, and communicates with RegionServers directly to write and retrieve key-value pairs.
HBase maintains two catalog tables, -ROOT- and .META., in which it keeps the current list, state, and location of all regions on the cluster. The -ROOT- table holds the list of .META. table regions. The .META. table holds the list of all user-space regions. (In HBase 0.96 and later, -ROOT- was removed and .META. was renamed hbase:meta, with its location tracked in ZooKeeper.)
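The two-level catalog lookup described above can be sketched as two binary searches over sorted start keys. This is an illustration only: the host names and key ranges are made up, `locate` and `find_region_server` are hypothetical helpers, and real catalog rows are keyed by table name, start key and region id rather than by a bare start key.

```python
import bisect


def locate(sorted_entries, key):
    """Return the value of the last entry whose start key is <= key.

    sorted_entries is a list of (start_key, value) pairs sorted by start_key.
    """
    starts = [start for start, _ in sorted_entries]
    index = bisect.bisect_right(starts, key) - 1
    return sorted_entries[index][1]


# Hypothetical cluster state, for illustration only.
# Each .META. region maps user-region start keys to the hosting RegionServer.
meta_region_1 = [("", "rs1.example.com"), ("g", "rs2.example.com")]
meta_region_2 = [("n", "rs3.example.com"), ("t", "rs1.example.com")]

# -ROOT- maps the first key covered by each .META. region to that region.
root = [("", meta_region_1), ("n", meta_region_2)]


def find_region_server(row_key):
    meta_region = locate(root, row_key)   # step 1: -ROOT- -> .META. region
    return locate(meta_region, row_key)   # step 2: .META. -> RegionServer


print(find_region_server("hbase"))
```

Once the client knows which RegionServer hosts the row, it talks to that server directly, which is why the master is not on the read/write path.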
HBase is a good fit only when:
We have hundreds of millions or billions of rows of data. If we have only a few thousand or a few million rows, then a traditional RDBMS is a better choice.
We are ready to sacrifice all the extra features that an RDBMS provides (triggers, secondary indexes, typed columns, advanced query languages, etc.).
The cluster is large enough, with at least five data nodes.
Difference Between HDFS and HBase:
HBase is not HDFS. HDFS is a distributed file system for storing large-scale data files. HDFS does not provide fast individual record lookups in files.
HBase, on the other hand, is built on top of HDFS and provides fast record lookups and updates for large tables.
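The difference in lookup cost can be illustrated with a rough sketch. It does not use HDFS or HBase; it only contrasts reading a flat list of records in order with a binary search over sorted row keys, which is possible in HBase because its store files keep cells sorted by row key. The record data and both function names are made up for the example.

```python
import bisect

# Made-up records, already sorted by row key.
records = [(f"row{i:05d}", f"value-{i}") for i in range(10_000)]
sorted_keys = [k for k, _ in records]


def scan_lookup(records, key):
    """Flat-file style: read records in order until the key is found."""
    examined = 0
    for k, v in records:
        examined += 1
        if k == key:
            return v, examined
    return None, examined


def indexed_lookup(sorted_keys, records, key):
    """Sorted-key style: binary search over the sorted row keys."""
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return records[i][1]
    return None


value, examined = scan_lookup(records, "row09999")
print(value, "found after examining", examined, "records")
print(indexed_lookup(sorted_keys, records, "row09999"))
```

The scan has to touch every record before the one it wants, while the binary search needs only about log2(n) comparisons, which is the kind of gap that matters at billions of rows.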