In this post we will discuss about the basic introduction to Sqoop and Sqoop Installation on Ubuntu machine and we will discuss about example run of Sqoop from MySQL database in the next post.
Table of Contents
What is Sqoop?
Sqoop is open source tool that enables users to transfer bulk data between Hadoop eco system and relational databases. Here Hadoop eco system includes, HDFS, Hive, HBase, HCatalog, etc. And Relational databases supported at this time are MySQL, PostGreSQL, Oracle, SQL Server and DB2.
Apache Sqoop is another top level open source project from Apache software foundation.
Sqoop can be used to both import data from external structured databases into HDFS or related systems like Hive and HBase, as well as export data from Hadoop to external relational databases and enterprise data warehouses.
It is a mechanism for unlocking hadoop for relational database users.
In many ways Sqoop is similar to hadoop’s distcp (Moving data efficiently between clusters using Distributed Copy). Both are built on top of MapReduce and
take advantage of its parallelism and fault tolerance. Both run map only tasks in parallel. But the difference is Instead of moving data between clusters, Sqoop is designed to move data from/to hadoop to/from relational databases.
Sqoop was originally developed by Cloudera but later provided to Apache Open source community. Now Sqoop is top level project at Apache Software foundation. Sqoop has two versions, Sqoop1 and Sqoop2 but at the time of this writing, Sqoop2 is not suitable for production deployment so, we will discuss using Sqoop1 only.
Sqoop is named after Sql + Hadoop (Sqoop) meaning that bridge between Sql databases and Hadoop eco system.
Sqoop Installation on Ubuntu:
In this section we will install Sqoop1.4.5 version (latest stable version) on Ubuntu machine and configure it to run on Hadoop cluster.
Sqoop installation on Ubuntu machine is simple and straight.
- Download gzipped Sqoop binary tar ball from Apache Download mirrors at sqoop site. This file would be of format sqoop-<version>.bin__hadoop-<version>-<alpha>.tar.gz
- Copy this tar ball into preferred location of installation directory (usually into /usr/lib/sqoop) and extract the contents of gzipped the tar ball.
- Set environment variables SQOOP_HOME with sqoop installation directory in .bashrc file and add its bin directory to $PATH.
### Sqoop Home directory ####
- Sqoop requires the JDBC drivers for specific database servers like MySQL, Oracle etc. Due to licensing issues, Sqoop doesn’t include the JDBC drivers for these external databases. Some of the JDBC drivers are available for free of charge from the database vendors’ websites. For example, we can download JDBC drivers for MySQL from MySql Connectors Download page. For Ubuntu, download the platform independent version of this connector. Extract this mysql-connector-java-<version>.tar.gz file and copy the mysql-connector-java-<version>-bin.jar file into SQOOP_HOME/lib directory.
hadoop1@ubuntu-1:~$ source ~/.bashrc
hadoop1@ubuntu-1:~$ cd Downloads/
hadoop1@ubuntu-1:~/Downloads$ tar -xzf mysql-connector-java-5.1.34.tar.gz
hadoop1@ubuntu-1:~/Downloads$ cp mysql-connector-java-5.1.34/mysql-connector-java-5.1.34-bin.jar $SQOOP_HOME/lib/
Verify Sqoop Installation:
Lets verify sqoop installation with the following command.
hadoop1@ubuntu-1:~$ sqoop help
Warning: /usr/lib/sqoop/sqoop-1.4.5.bin__hadoop-2.0.4-alpha/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Warning: /usr/lib/sqoop/sqoop-1.4.5.bin__hadoop-2.0.4-alpha/../zookeeper does not exist! Accumulo imports will fail.
Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.
find: paths must precede expression: avromr.jar
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]
usage: sqoop COMMAND [ARGS]
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
See 'sqoop help COMMAND' for information on a specific command.
If we receive message similar to above, then our sqoop installation on ubuntu is successful. We will discuss example sqoop import and export commands in the next post.