Sqoop Import Command Arguments


Control arguments

In this section we will cover only the important import control arguments.

Argument  Description & Example

--as-textfile  Imports data as plain text files. This is the default option.
--as-sequencefile  Imports data to SequenceFiles.
--as-avrodatafile  Imports data to Avro data files.
--columns <col,col,col…>  Specifies the columns to import from the RDBMS table.
--table <table-name>  The table to import.
-e, --query <statement>  Imports the results of a query instead of a table.
--where <clause>  Imports only the rows for which <clause> is true. Example: --where "user_id > 400 AND hidden = 0"
--num-mappers <n> or -m <n>  Sets the number of map tasks; a higher value speeds up the parallel copy. The RDBMS table should have a primary key (or --split-by must be specified).
--append  Appends data to an existing dataset in HDFS.
-z, --compress  Enables compression. Uses the gzip codec by default unless overridden with --compression-codec.
--compression-codec <c>  Uses the given Hadoop codec class to compress data on HDFS.
--null-string <str>  The string to be written for a null value in string columns.
--null-non-string <str>  The string to be written for a null value in non-string columns.
--target-dir <dir>  Explicit HDFS target directory for the import.
--warehouse-dir <dir>  HDFS parent directory for imports; each table is uploaded to <warehouse-dir>/<table-name>/.
--split-by <column-name>  Column of the table used to split the work for parallel import. By default, Sqoop runs the query select min(<split-by>), max(<split-by>) from <table-name> to find the boundaries for creating splits. If --split-by is not given, Sqoop uses the table's primary key column (if present) as the splitting column.
--boundary-query <statement>  Query used to select the minimal and maximal values of the --split-by (primary key) column for creating splits.
--direct  Uses the direct import fast path (MySQL and PostgreSQL only).
--direct-split-size <n>  Splits the input stream every n bytes when importing in direct mode.
--mapreduce-job-name <name>  Sets the name of the generated MapReduce job.
--fetch-size <n>  Number of rows to fetch from the database at a time.
--delete-target-dir  Deletes the import target directory if it already exists.
--inline-lob-limit <n>  Sets the maximum size (in bytes) of a LOB that is stored inline.
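
For example, a typical import combining several of the above arguments might look like the following sketch (the connection string, credentials, database, table and column names are placeholders, not taken from this article):

    sqoop import \
      --connect jdbc:mysql://dbserver/sales \
      --username dbuser -P \
      --table orders \
      --columns "order_id,user_id,amount" \
      --where "user_id > 400" \
      --num-mappers 4 \
      --as-textfile \
      --target-dir /user/hadoop/orders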

Notes

  • By default, Sqoop runs 4 map tasks.
  • By default, Sqoop imports over a JDBC channel. With the --direct option, Sqoop uses a high-performance direct channel, but this is supported only for MySQL and PostgreSQL. In direct mode, the output files on HDFS can be split by maximum size with the --direct-split-size argument.
  • When importing the results of a query, we must specify the destination directory with --target-dir.
  • To run the import in parallel, we need to specify an -m value greater than 1 so that multiple mappers run at once, and we must either supply a --split-by column or the table must have a primary key. A free-form query must include the token $CONDITIONS, which each Sqoop process replaces with a unique condition expression, as shown in the example below.

If we use double quotes ("") around the query statement, we must write \$CONDITIONS instead of $CONDITIONS so that the shell does not treat $ as a variable reference.

Example: if we have an emp table with a primary key column id whose minimum value is 0 and maximum value is 1000, and Sqoop is directed to use 4 tasks, the four tasks split the id column values into the ranges (0, 250), (250, 500), (500, 750), and (750, 1001).
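
As a sketch of such a free-form query import (the connection string, credentials and column names are assumed for illustration; only the emp table and its id column come from the example above):

    sqoop import \
      --connect jdbc:mysql://dbserver/company \
      --username dbuser -P \
      --query "SELECT id, name, salary FROM emp WHERE \$CONDITIONS" \
      --split-by id \
      --target-dir /user/hadoop/emp \
      -m 4

Because the query is wrapped in double quotes, the token is written as \$CONDITIONS, as noted above.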

  • The --target-dir argument is mutually exclusive with --warehouse-dir. If the destination directory already exists in HDFS, Sqoop will refuse to import. With the --append argument, Sqoop imports data to a temporary directory and then renames the files into the normal target directory in a manner that does not conflict with existing file names in that directory.
  • Delimited text is the default import format, but Sqoop also supports SequenceFiles and Avro data files. Text files cannot hold binary fields (such as database columns of type VARBINARY) and cannot distinguish between null values and string fields containing the value "null". In these cases, we need to use SequenceFiles or Avro data files (see the sketch after this list).
  • Sqoop also supports large objects (BLOB and CLOB columns). By default, large objects less than 16 MB in size (configurable with the --inline-lob-limit argument) are stored inline with the rest of the data. Above that size, they are stored in files in the _lobs subdirectory of the import target directory.
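
As a sketch of such a non-text import (the connection string, credentials and table name are placeholders), a table containing VARBINARY columns could be imported as Avro data files like this:

    sqoop import \
      --connect jdbc:mysql://dbserver/sales \
      --username dbuser -P \
      --table product_images \
      --warehouse-dir /user/hadoop/warehouse \
      --as-avrodatafile \
      -m 4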

Example command using the --null-string options: Handling Null Values from Netezza and Storing as a Parquet File
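
The original command is not reproduced here. A minimal sketch, assuming a Netezza server at nzhost:5480, a salesdb database and customers table (all placeholder names), the Netezza JDBC driver on Sqoop's classpath, and a Sqoop release that supports --as-parquetfile (1.4.6 or later), could look like this:

    sqoop import \
      --connect jdbc:netezza://nzhost:5480/salesdb \
      --username nzuser -P \
      --table customers \
      --null-string '\\N' \
      --null-non-string '\\N' \
      --as-parquetfile \
      --target-dir /user/hadoop/customers_parquet \
      -m 4

Note that Parquet stores nulls natively, so depending on the Sqoop version the --null-string and --null-non-string arguments may apply only to text-format imports.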


