Saturday, 28 December 2013

Exporting data from HDFS into MySQL using Sqoop

Sqoop is an Apache project that is part of the broader Hadoop ecosphere. In many ways Sqoop is similar to distcp
Complete the following steps to transfer data from HDFS to a MySQL table:
1. Create a new database in the MySQL instance:
CREATE DATABASE logs;
2. Create the weblogs_from_hdfs table:
USE logs;
CREATE TABLE weblogs_from_hdfs (
md5 VARCHAR(32),
url VARCHAR(64),
request_date DATE,
request_time TIME,
ip VARCHAR(15)
);
3. Export the weblog_entries.txt file from HDFS to MySQL:
sqoop export -m 1 --connect jdbc:mysql://<HOST>:<PORT>/logs --username hdp_usr --password test1 --table weblogs_from_hdfs --export-dir /data/weblogs/05102012 --input-fields-terminated-by '\t' --mysql-delmiters
The output is as follows:
INFO mapreduce.ExportJobBase: Beginning export of weblogs_from_ hdfs
Sqoop loads the JDBC driver defined in the --connect statement from $SQOOP_HOME/ libs, where $SQOOP_HOME is the full path to the location where Sqoop is installed. The --username and --password options are used to authenticate the user issuing the command against the MySQL instance.
mysql> USE mysql;
mysql> select host, user from user;
+---------------+-----------+
| user | host |
+---------------+-----------+
| hdp_usr | hdp01 |
| hdp_usr | hdp02 |
For example, the export job can be thought of as reading each line of the weblogs_entries.txt file in HDFS and producing the following output:
INSERT INTO weblogs_from_hdfs
VALUES('aabba15edcd0c8042a14bf216c5', '/jcwbtvnkkujo.html', '2012-05- 10', '21:25:44', '148.113.13.214');
INSERT INTO weblogs_from_hdfs
VALUES('e7d3f242f111c1b522137481d8508ab7', '/ckyhatbpxu.html', '2012- 05-10', '21:11:20', '4.175.198.160');
By default, Sqoop export creates INSERT statements. If the --update-key argument is specified, UPDATE statements will be created instead. If the preceding example had used the argument --update-key md5, the generated code would have run like the following:
UPDATE weblogs_from_hdfs SET url='/jcwbtvnkkujo.html', request_ date='2012-05-10'request_time='21:25:44'
ip='148.113.13.214'WHERE md5='aabba15edcd0c8042a14bf216c5'

UPDATE weblogs_from_hdfs SET url='/jcwbtvnkkujo.html', request_ date='2012-05-10'request_time='21:11:20' ip='4.175.198.160' WHERE md5='e7d3f242f111c1b522137481d8508ab7'

No comments:

Post a Comment