Thursday, January 7, 2016

Apache Sqoop



Instant Apache Sqoop
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Sqoop moves data between Hadoop and an RDBMS. Sqoop has two important features: Import and Export.
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password \
  --table tableName --target-dir /user/abc/tableName \
  --fields-terminated-by ',' --escaped-by \\ --enclosed-by '\"' \
  --columns "student_id,address,name" \
  --where 'student_id<100'

Sqoop first connects to the database server to pull the desired metadata from the input table.
In the second step, Sqoop submits a MapReduce job to the Hadoop cluster; the map tasks use the metadata gathered in step 1 to perform the actual import.

--as-textfile
--as-avrodatafile
--as-sequencefile
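These flags select the file format of the imported data: plain text, Avro data files, or SequenceFiles. As a sketch, reusing the db1 connection from above (the Avro target directory name is just illustrative):
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password \
  --table tableName --target-dir /user/abc/tableName_avro \
  --as-avrodatafile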

--query 'select student_id, name, address from student where student_id < 100 and $CONDITIONS' --split-by student_id
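A free-form query import needs a target directory and a split column, because there is no table for Sqoop to inspect; a full invocation might look like this sketch (the target directory is illustrative):
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password \
  --query 'select student_id, name, address from student where student_id < 100 and $CONDITIONS' \
  --split-by student_id --target-dir /user/abc/student_under_100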

The import-all-tables tool creates a separate directory in HDFS for each RDBMS table:
$ bin/sqoop import-all-tables --connect jdbc:mysql://localhost:3306/db1 --username root --password password


hbase shell
create 'tableName1','columnfamily1'
describe 'tableName1'
put 'tableName1','1','columnfamily1:col1','abc'

--hbase-table hbase_tableName --column-family hbase_table_col1 --hbase-create-table
In Sqoop, the --hbase-row-key argument is used to specify which input column we want to use as the HBase row key. If --hbase-row-key and --split-by are not used, Sqoop itself tries to identify the primary key of the given RDBMS table and uses that column as the HBase row key.
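Putting the flags together, an import into HBase might look like the following sketch (it assumes student_id is the key column of tableName; the HBase table and column family names reuse the fragments above):
$ bin/sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root --password password \
  --table tableName --hbase-row-key student_id \
  --hbase-table hbase_tableName --column-family hbase_table_col1 --hbase-create-table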

hive> create table student (id INT, name STRING);
hive> show tables;
hive> load data local inpath './examples/files/kv1.txt' overwrite into table student;
hive> load data inpath '/user/username/kv1.txt' overwrite into table student;

Sqoop export moves the HDFS data back to the RDBMS table:
$ bin/sqoop export --connect jdbc:mysql://localhost:3306/db1 --table tableName --username root --password password --export-dir /user/abc/tableName
-m and --num-mappers control how many map tasks (and therefore parallel database connections) the job uses.
A failed export can leave the destination table partially updated. We can avoid this problem with the --staging-table argument: the data is first staged into a staging table, and the staged data is then moved to the destination table in a single transaction.

If we are using --staging-table, the staging table should be empty before running the export job, or we should specify the --clear-staging-table argument in the Sqoop statement. With --clear-staging-table, Sqoop first cleans the staging table before running the export job.
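A sketch of an export through a staging table (the staging table name tableName_staging is illustrative; it must already exist with the same structure as the destination table):
$ bin/sqoop export --connect jdbc:mysql://localhost:3306/db1 --username root --password password \
  --table tableName --staging-table tableName_staging --clear-staging-table \
  --export-dir /user/abc/tableName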

The data of a Hive table resides in HDFS at /user/hive/warehouse/tableName. Also, Hive uses '^A' (\001) as the field delimiter and '\n' as the line terminator.
$ bin/sqoop export --connect jdbc:mysql://localhost/test_db --table invoice --export-dir /user/hive/warehouse/invoice --username root --password password -m 1 --input-fields-terminated-by '\001'
Apache Sqoop Cookbook
Incremental Import: the --incremental parameter
When your table is only getting new rows and the existing ones are not changed, use the append mode.

Incremental import also requires two additional parameters: --check-column indicates a column name that should be checked for newly appended data, and --last-value contains the last value that successfully imported into Hadoop.
  --incremental append \
  --check-column id \
  --last-value 1
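These flags are only the incremental part of the command; a complete append-mode import might look like the following sketch (the connection details are illustrative, and the visits table matches the saved-job example later in these notes):
sqoop import \
  --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password \
  --table visits \
  --incremental append \
  --check-column id \
  --last-value 1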
One downside is the need to know the value of the last imported row so that next time Sqoop can start off where it ended. Sqoop, when running in incremental mode, always prints out the value of the last imported row.

Incrementally Importing Mutable Data
Transfer rows whose value in column last_update_date is greater than 2013-05-22 01:01:01:
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2013-05-22 01:01:01"

The incremental mode lastmodified requires a column holding a date value (suitable types are date, time, datetime, and timestamp) containing information as to when each row was last updated. Sqoop will import only those rows that were updated after the last import.
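As with append mode, these flags would be part of a full import command; a sketch (table and connection details are illustrative):
sqoop import \
  --connect jdbc:mysql://localhost:3306/db1 \
  --username root --password password \
  --table visits \
  --incremental lastmodified \
  --check-column last_update_date \
  --last-value "2013-05-22 01:01:01"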

The Sqoop metastore is a powerful part of Sqoop that allows you to retain your job definitions and to easily run them anytime.
sqoop job --list

You can remove the old job definitions that are no longer needed with the --delete parameter, for example:
sqoop job --delete visits

Sqoop will automatically serialize the last imported value back into the metastore after each successful incremental job. This way, users do not need to remember the last imported value after each execution; everything is handled automatically.

sqoop job \
  --create visits \
  -- \
  import \ ..
  --incremental append \
  --check-column id \
  --last-value 0
And start it with the --exec parameter:

sqoop job --exec visits

sqoop-site.xml:
  <property>
    <name>sqoop.metastore.client.record.password</name>
    <value>true</value>
  </property>
Alternatively, the --password-file parameter reads the password from a file instead of storing it in the metastore or passing it on the command line.
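A sketch of using a password file (the path is illustrative; the file should contain only the password and be readable only by its owner):
$ echo -n "password" > /home/abc/sqoop.pwd
$ chmod 400 /home/abc/sqoop.pwd
$ sqoop import --connect jdbc:mysql://localhost:3306/db1 --username root \
  --password-file file:///home/abc/sqoop.pwd --table tableName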

Overriding the Arguments to a Saved Job
sqoop job --exec visits -- --verbose

sqoop metastore
sqoop job \
  --create visits \
  --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
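A job stored in the shared metastore can then be executed from any node by pointing --meta-connect at the same URL, for example:

sqoop job \
  --exec visits \
  --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop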

