In this section we will cover only the important Import Control arguments.
| Argument | Description & Example |
|---|---|
| --as-textfile | Imports data as plain text files. This is the default option. |
| --as-sequencefile | Imports data to SequenceFiles. |
| --as-avrodatafile | Imports data to Avro data files. |
| --columns <col,col,col...> | Columns to import from the RDBMS table. |
| -e, --query <statement> | Imports the results of a query instead of a table. |
| --table <table-name> | The table to import. |
| --where <clause> | Import only the rows for which clause is true. Example: --where "user_id > 400 AND hidden == 0" |
| --num-mappers <n> or -m <n> | Number of map tasks to run; raising it speeds up the parallel copy. To use this, the table should have a primary key (or a --split-by column must be given). |
| --append | Append data to an existing dataset in HDFS. |
| -z, --compress | Enable compression. Uses the gzip codec by default unless overridden with --compression-codec. |
| --compression-codec <c> | Use the given Hadoop codec class to compress data on HDFS. |
| --null-string <str> | The string to be written for a null value in string columns. |
| --null-non-string <str> | The string to be written for a null value in non-string columns. |
| --target-dir <dir> | Explicit HDFS target directory for the import. |
| --warehouse-dir <dir> | Parent directory on HDFS; each table is imported to <warehouse-dir>/<table-name>/. |
| --split-by <column-name> | Column of the table used to split work units for parallel import. By default Sqoop uses the primary key column. |
| --boundary-query <query> | Query used to select the minimum and maximum values of the --split-by column (by default the primary key column) for creating splits. |
| --direct | Use the direct import fast path (MySQL and PostgreSQL only). |
| --direct-split-size <n> | Split the input stream every n bytes when importing in direct mode. |
| --mapreduce-job-name <name> | Name for the generated MapReduce job. |
| --fetch-size <n> | Number of rows to fetch from the database at a time. |
| --delete-target-dir | Delete the import target directory if it already exists. |
| --inline-lob-limit <n> | Maximum size of a LOB that is stored inline. |
- By default Sqoop runs 4 map tasks.
- By default Sqoop imports over a JDBC channel. With the --direct option, it uses a high-performance direct channel instead, but this is supported only for MySQL and PostgreSQL. In direct mode, the output files on HDFS can be split by maximum size with the --direct-split-size argument.
- When importing the results of a query, we must specify the destination directory with --target-dir.
- To import in parallel, specify an -m value greater than 1 so that multiple mappers run concurrently; for this, we must either specify a --split-by column or the table must have a primary key. The query must include the token $CONDITIONS, which each Sqoop task replaces with a unique range condition, as shown in the example below.
$ sqoop import \
--query 'SELECT emp.*, dep.* FROM emp JOIN dep ON (emp.id = dep.id) WHERE $CONDITIONS' \
--split-by emp.id --target-dir /user/emp/join/
If we write the query statement in double quotes (""), we need to use \$CONDITIONS instead of $CONDITIONS so that the shell does not interpret the $.
Example: if we have an emp table whose primary key column id has a minimum value of 0 and a maximum value of 1000, and Sqoop is directed to use 4 tasks, then the four tasks will handle the id ranges [0, 250), [250, 500), [500, 750), and [750, 1000].
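The split arithmetic above can be sketched in a few lines of Python (a simplified illustration of how an integer --split-by column is partitioned, not Sqoop's actual Java splitter; the function name compute_splits is hypothetical):

```python
def compute_splits(min_val, max_val, num_tasks):
    """Divide [min_val, max_val] into num_tasks contiguous ranges,
    mimicking how a --split-by column is partitioned across mappers."""
    step = (max_val - min_val) / num_tasks
    splits = []
    lo = min_val
    for i in range(1, num_tasks + 1):
        hi = min_val + round(step * i)
        # Each task imports rows where: split-by >= lo AND split-by < hi
        # (the final task's upper bound is inclusive so the max value is kept)
        splits.append((lo, hi))
        lo = hi
    return splits

print(compute_splits(0, 1000, 4))
# [(0, 250), (250, 500), (500, 750), (750, 1000)]
```

Each tuple corresponds to the range condition substituted for $CONDITIONS in one task's query.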
- The --target-dir argument is mutually exclusive with --warehouse-dir. If the destination directory already exists in HDFS, Sqoop will refuse to import. If we use the --append argument, Sqoop imports the data to a temporary directory first and then renames the files into the normal target directory in a way that does not conflict with existing file names in that directory.
- Delimited text is the default import format, but Sqoop also supports sequence files and Avro files. Text files cannot hold binary fields (such as database columns of type VARBINARY) and cannot distinguish between null values and string fields containing the value "null". In these cases, we need to use sequence files or Avro files.
- Sqoop also supports large objects (BLOB and CLOB columns). By default, large objects less than 16 MB in size (configurable with the --inline-lob-limit argument) are stored inline with the rest of the data. Above that size, they are stored in files in the _lobs subdirectory of the import target directory.
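The null-versus-"null" ambiguity in delimited text mentioned above can be seen with a toy example (plain Python with hypothetical records, not Sqoop code):

```python
# Two rows exported as delimited text: in the first, the second column was
# SQL NULL (written out as the literal word "null"); in the second, it
# really was the 4-character string "null".
row_from_null = "1,null"    # value was NULL in the database
row_from_value = "2,null"   # value was the string "null"

parsed_a = row_from_null.split(",")[1]
parsed_b = row_from_value.split(",")[1]

# After parsing, the two cases are indistinguishable:
print(parsed_a == parsed_b)  # True - the null information is lost
```

Binary container formats like SequenceFiles and Avro keep an explicit null marker, which is why they avoid this problem.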
Example query using the --null-string option:
sqoop import \
--connect jdbc:mysql://mysql.server.com/sqoop \
--table emp \
--null-string '\\N'
Example: Handling null values from Netezza and storing as a Parquet file
sqoop import \
--connect jdbc:netezza://hostname:5480/dbname --username "username" --password pwd \
--table tablename \
--null-string '\\N' \
--null-non-string '\\N' \
--target-dir 'hdfs://namenode_hostname/dir/path' \
--as-parquetfile -m 1
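The effect of --null-string and --null-non-string can be mimicked in Python (an illustration of the substitution only, not Sqoop internals; the function name render_field and the sample row are hypothetical):

```python
def render_field(value, is_string_col, null_string="\\N", null_non_string="\\N"):
    """Emit the configured replacement token when a column value is NULL
    (represented by None here), otherwise the value itself."""
    if value is None:
        return null_string if is_string_col else null_non_string
    return str(value)

row = [42, None, "alice", None]
types = [False, False, True, True]  # True marks a string column
line = ",".join(render_field(v, t) for v, t in zip(row, types))
print(line)  # 42,\N,alice,\N
```

Downstream tools such as Hive recognize \N as NULL, which is why it is the usual choice for both options.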