Today we will see how to load data from Hive into an RDBMS (MySQL) using the Sqoop export command.
Below are some key points to keep in mind before proceeding with the Sqoop export process —
☛ The target table schema must already exist in the RDBMS; Sqoop export does not create it.
☛ The data exported from HDFS to the RDBMS table must be in sync with the target table schema, i.e., datatypes of fields, sequence of fields, constraints, etc.
☛ If even a single record in the exported file does not match the table schema, the export of the entire file fails.
☛ The default operation is to insert all records from the input files into the database table using INSERT statements. In update mode, Sqoop instead generates UPDATE statements that replace existing records in the database (a syntax sketch follows this list).
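For the update-mode case above, here is a minimal syntax sketch, assuming empid as the update key and reusing the connection details from Step 2 below (this upsert variant is not part of the walkthrough itself):

$ sqoop export \
> --connect jdbc:mysql://localhost/mydb \
> --username root -P \
> --export-dir /tmp/hive/person/part-m-00000 \
> --table person \
> --update-key empid \
> --update-mode allowinsert \
> --fields-terminated-by ','

With --update-key alone, only rows whose key already exists are updated; adding --update-mode allowinsert also inserts rows whose key is not yet present. Note that --staging-table cannot be combined with update mode.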
Step 1 — Create the database and tables in the MySQL database server:
mysql> create database mydb;
mysql> use mydb;
mysql> create table person
    -> (empid varchar(11) NOT NULL PRIMARY KEY,
    -> name varchar(50),
    -> birthdate date,
    -> birthplace varchar(30),
    -> birthcountry varchar(3),
    -> birthstate varchar(3),
    -> country varchar(3),
    -> address1 varchar(50),
    -> address2 varchar(50),
    -> address3 varchar(50),
    -> address4 varchar(50),
    -> city varchar(50),
    -> county varchar(30),
    -> state varchar(3),
    -> postal int,
    -> phone varchar(15),
    -> emailaddress varchar(100)
    -> );

mysql> desc person;
+--------------+--------------+------+-----+---------+-------+
| Field        | Type         | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| empid        | varchar(11)  | NO   | PRI | NULL    |       |
| name         | varchar(50)  | YES  |     | NULL    |       |
| birthdate    | date         | YES  |     | NULL    |       |
| birthplace   | varchar(30)  | YES  |     | NULL    |       |
| birthcountry | varchar(3)   | YES  |     | NULL    |       |
| birthstate   | varchar(3)   | YES  |     | NULL    |       |
| country      | varchar(3)   | YES  |     | NULL    |       |
| address1     | varchar(50)  | YES  |     | NULL    |       |
| address2     | varchar(50)  | YES  |     | NULL    |       |
| address3     | varchar(50)  | YES  |     | NULL    |       |
| address4     | varchar(50)  | YES  |     | NULL    |       |
| city         | varchar(50)  | YES  |     | NULL    |       |
| county       | varchar(30)  | YES  |     | NULL    |       |
| state        | varchar(3)   | YES  |     | NULL    |       |
| postal       | int(11)      | YES  |     | NULL    |       |
| phone        | varchar(15)  | YES  |     | NULL    |       |
| emailaddress | varchar(100) | YES  |     | NULL    |       |
+--------------+--------------+------+-----+---------+-------+
17 rows in set (0.00 sec)

mysql> create table person_stg
    -> (empid varchar(11) NOT NULL PRIMARY KEY,
    -> name varchar(50),
    -> birthdate date,
    -> birthplace varchar(30),
    -> birthcountry varchar(3),
    -> birthstate varchar(3),
    -> country varchar(3),
    -> address1 varchar(50),
    -> address2 varchar(50),
    -> address3 varchar(50),
    -> address4 varchar(50),
    -> city varchar(50),
    -> county varchar(30),
    -> state varchar(3),
    -> postal int,
    -> phone varchar(15),
    -> emailaddress varchar(100)
    -> );

mysql> desc person_stg;
+--------------+--------------+------+-----+---------+-------+
| Field        | Type         | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| empid        | varchar(11)  | NO   | PRI | NULL    |       |
| name         | varchar(50)  | YES  |     | NULL    |       |
| birthdate    | date         | YES  |     | NULL    |       |
| birthplace   | varchar(30)  | YES  |     | NULL    |       |
| birthcountry | varchar(3)   | YES  |     | NULL    |       |
| birthstate   | varchar(3)   | YES  |     | NULL    |       |
| country      | varchar(3)   | YES  |     | NULL    |       |
| address1     | varchar(50)  | YES  |     | NULL    |       |
| address2     | varchar(50)  | YES  |     | NULL    |       |
| address3     | varchar(50)  | YES  |     | NULL    |       |
| address4     | varchar(50)  | YES  |     | NULL    |       |
| city         | varchar(50)  | YES  |     | NULL    |       |
| county       | varchar(30)  | YES  |     | NULL    |       |
| state        | varchar(3)   | YES  |     | NULL    |       |
| postal       | int(11)      | YES  |     | NULL    |       |
| phone        | varchar(15)  | YES  |     | NULL    |       |
| emailaddress | varchar(100) | YES  |     | NULL    |       |
+--------------+--------------+------+-----+---------+-------+
17 rows in set (0.00 sec)
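Before running the export, it is worth confirming that the HDFS source file actually matches this schema: comma-delimited, 17 fields, in the same order as the columns above. A quick way to eyeball the first few records (the path is the one used in Step 2; the head count is arbitrary):

$ hdfs dfs -cat /tmp/hive/person/part-m-00000 | head -5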
Step 2 — Execute the Sqoop export command on the Hadoop cluster:
$ sqoop export \
> --connect jdbc:mysql://localhost/mydb \
> --username root -P \
> --export-dir /tmp/hive/person/part-m-00000 \
> --table person \
> --staging-table person_stg \
> --clear-staging-table \
> --verbose \
> --fields-terminated-by ','

where:
☛ --export-dir: HDFS source path of the data to be exported.
☛ --table: the RDBMS table to populate.
☛ --staging-table: good practice to load the data into a staging table first; avoids leaving the target table inconsistent if the job fails partway.
☛ --clear-staging-table: indicates that any data already present in the staging table can be deleted.
☛ --verbose: print more information while working; useful for debugging.
☛ --fields-terminated-by: the field delimiter of the Hive table / HDFS file.

Console output:

Warning: /usr/hdp/2.3.0.0-2557/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
2016-01-05 15:26:44,870 INFO - [main:] ~ Running Sqoop version: 1.4.6.2.3.0.0-2557 (Sqoop:92)
Enter password:
2016-01-05 15:26:46,317 INFO - [main:] ~ Preparing to use a MySQL streaming resultset. (MySQLManager:69)
2016-01-05 15:26:46,318 INFO - [main:] ~ Beginning code generation (CodeGenTool:92)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2557/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.3.0.0-2557/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2016-01-05 15:26:46,699 INFO - [main:] ~ Executing SQL statement: SELECT t.* FROM `person` AS t LIMIT 1 (SqlManager:757)
2016-01-05 15:26:46,750 INFO - [main:] ~ Executing SQL statement: SELECT t.* FROM `person` AS t LIMIT 1 (SqlManager:757)
2016-01-05 15:26:46,756 INFO - [main:] ~ HADOOP_MAPRED_HOME is /usr/hdp/2.3.0.0-2557/hadoop-mapreduce (CompilationManager:94)
Note: /tmp/sqoop-root/compile/fb0efb7d2a0bf1bac7d97f9e484649fb/person.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2016-01-05 15:26:48,812 INFO - [main:] ~ Writing jar file: /tmp/sqoop-root/compile/fb0efb7d2a0bf1bac7d97f9e484649fb/person.jar (CompilationManager:330)
2016-01-05 15:26:48,828 INFO - [main:] ~ Data will be staged in the table: person_stg (ExportJobBase:357)
2016-01-05 15:26:48,829 INFO - [main:] ~ Beginning export of person (ExportJobBase:378)
2016-01-05 15:26:48,840 INFO - [main:] ~ Deleted 0 records from `person_stg` (SqlManager:1053)
2016-01-05 15:26:49,194 INFO - [main:] ~ mapred.jar is deprecated. Instead, use mapreduce.job.jar (deprecation:1173)
2016-01-05 15:26:50,245 INFO - [main:] ~ mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative (deprecation:1173)
2016-01-05 15:26:50,252 INFO - [main:] ~ mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative (deprecation:1173)
2016-01-05 15:26:50,253 INFO - [main:] ~ mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps (deprecation:1173)
2016-01-05 15:26:50,800 INFO - [main:] ~ Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/ (TimelineClientImpl:296)
2016-01-05 15:26:50,950 INFO - [main:] ~ Connecting to ResourceManager at sandbox.hortonworks.com/192.168.88.129:8050 (RMProxy:98)
2016-01-05 15:26:52,717 INFO - [main:] ~ Total input paths to process : 1 (FileInputFormat:283)
2016-01-05 15:26:52,721 INFO - [main:] ~ Total input paths to process : 1 (FileInputFormat:283)
2016-01-05 15:26:52,811 INFO - [main:] ~ number of splits:4 (JobSubmitter:198)
2016-01-05 15:26:52,831 INFO - [main:] ~ mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative (deprecation:1173)
2016-01-05 15:26:53,037 INFO - [main:] ~ Submitting tokens for job: job_1455596313072_0009 (JobSubmitter:287)
2016-01-05 15:26:53,503 INFO - [main:] ~ Submitted application application_1455596313072_0009 (YarnClientImpl:274)
2016-01-05 15:26:53,596 INFO - [main:] ~ The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1455596313072_0009/ (Job:1294)
2016-01-05 15:26:53,597 INFO - [main:] ~ Running job: job_1455596313072_0009 (Job:1339)
2016-01-05 15:26:59,778 INFO - [main:] ~ Job job_1455596313072_0009 running in uber mode : false (Job:1360)
2016-01-05 15:26:59,779 INFO - [main:] ~ map 0% reduce 0% (Job:1367)
2016-01-05 15:27:11,210 INFO - [main:] ~ map 75% reduce 0% (Job:1367)
2016-01-05 15:27:12,229 INFO - [main:] ~ map 100% reduce 0% (Job:1367)
2016-01-05 15:27:12,251 INFO - [main:] ~ Job job_1455596313072_0009 completed successfully (Job:1378)
2016-01-05 15:27:12,473 INFO - [main:] ~ Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=573776
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=72660
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=16
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=4
                Data-local map tasks=4
                Total time spent by all maps in occupied slots (ms)=34919
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=34919
                Total vcore-seconds taken by all map tasks=34919
                Total megabyte-seconds taken by all map tasks=8729750
        Map-Reduce Framework
                Map input records=223
                Map output records=223
                Input split bytes=608
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=413
                CPU time spent (ms)=4910
                Physical memory (bytes) snapshot=570736640
                Virtual memory (bytes) snapshot=3342573568
                Total committed heap usage (bytes)=526909440
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0 (Job:1385)
2016-01-05 15:27:12,480 INFO - [main:] ~ Transferred 70.957 KB in 22.2127 seconds (3.1944 KB/sec) (ExportJobBase:301)
2016-01-05 15:27:12,485 INFO - [main:] ~ Exported 223 records. (ExportJobBase:303)
2016-01-05 15:27:12,486 INFO - [main:] ~ Starting to migrate data from staging table to destination. (ExportJobBase:453)
2016-01-05 15:27:12,488 INFO - [main:] ~ Migrated 223 records from `person_stg` to `person` (SqlManager:1089)
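Two details in the log are worth noting. First, the staging flow is visible at the end: rows land in person_stg and are only migrated to person after the MapReduce job succeeds. Second, the "Enter password:" prompt comes from the -P flag; for unattended runs, -P can be replaced with --password-file, which reads the password from a permissions-restricted file. A minimal sketch, with a hypothetical HDFS path and a placeholder password:

$ echo -n 'MySecretPassword' | hdfs dfs -put - /user/root/mysql.password
$ hdfs dfs -chmod 400 /user/root/mysql.password
$ sqoop export \
> --connect jdbc:mysql://localhost/mydb \
> --username root \
> --password-file /user/root/mysql.password \
> --export-dir /tmp/hive/person/part-m-00000 \
> --table person \
> --staging-table person_stg \
> --clear-staging-table \
> --fields-terminated-by ','

The echo -n matters: a trailing newline in the password file will cause authentication failures.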
Step 3 — Verify that the data loaded into the MySQL table successfully:
mysql> select count(*) from mydb.person;
+----------+
| count(*) |
+----------+
|      223 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from mydb.person_stg;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)
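person_stg is empty again because its 223 rows were migrated into person at the end of the export. Beyond row counts, a quick spot check confirms the fields landed in the right columns (the column choice here is illustrative):

mysql> select empid, name, birthdate, city from mydb.person limit 5;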