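We are using Sqoop to export some data from MySQL, processing it with Apache Pig, and then trying to export the results from HDFS back into a MySQL database. However, the export fails with the following error: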
java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.NumberFormatException: For input string: ".proseries.com"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:449)
at java.lang.Integer.valueOf(Integer.java:554)
at mdm_urls.__loadFromFields(mdm_urls.java:419)
The data in HDFS looks like this (tab-delimited):
id:int url:text tld:text port:int
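Somehow the tld field is being imported into the port column for some rows. Out of roughly 250M rows, fewer than 10 are affected. My initial assumption was that the url field must contain a tab character. However, we already strip all tabs in the Pig script: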
REGISTER target/mystuff.jar;
legacy_urls = LOAD 'url' USING PigStorage(',') AS (id, sha1, url_text);
legacy_urls_norm = FOREACH legacy_urls GENERATE id AS id, sha1 AS sha1, REPLACE(REPLACE(url_text, '\n', ''), '\t', '') AS url_text;
urls = FOREACH legacy_urls_norm GENERATE id, url_text, mystuff.RootDomain(url_text), mystuff.Protocol(url_text), mystuff.Host(url_text), mystuff.Path(url_text), mystuff.EffectiveTld(url_text), mystuff.Port(url_text), sha1;
STORE urls INTO 'mdm_urls';
Here is my Sqoop export command:
sqoop export --connect jdbc:mysql://hostname/db_name --input-fields-terminated-by "\t" --table test --export-dir my_urls
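I am having a hard time debugging this because the Sqoop error gives no indication of which row it was processing (which would let me confirm whether a tab character is still present, and so on). My first question is: how can I troubleshoot this more effectively? My second question is: how do people escape bad input data with Pig?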
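For context on the first question, one possible way to locate the offending rows outside of Sqoop is to scan the exported files and flag any line that would not fit the table. The following is only a rough sketch under stated assumptions: the column list is taken from the four-column layout described above (the STORE statement emits more fields, so it would need adjusting), empty values are assumed to stand for nulls, and the names `find_bad_rows`, `EXPECTED_COLUMNS` and `main` are hypothetical.

```python
import re
import sys

# Assumed layout, copied from the schema described in the question.
# Adjust the list so it matches the real target table.
EXPECTED_COLUMNS = [("id", "int"), ("url", "text"), ("tld", "text"), ("port", "int")]

# Plain decimal integer with an optional leading minus sign.
INT_PATTERN = re.compile(r"-?\d+\Z")


def find_bad_rows(lines, columns=EXPECTED_COLUMNS, delimiter="\t"):
    """Yield (line_number, reason, line) for rows that look unparseable."""
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        fields = line.split(delimiter)
        # An extra or missing delimiter shows up as a wrong field count.
        if len(fields) != len(columns):
            reason = "expected %d fields, got %d" % (len(columns), len(fields))
            yield line_number, reason, line
            continue
        for (name, kind), value in zip(columns, fields):
            # Assumption: empty values represent nulls and are skipped.
            if kind == "int" and value != "" and not INT_PATTERN.match(value):
                reason = "column %r is not an integer: %r" % (name, value)
                yield line_number, reason, line
                break


def main():
    # repr() makes stray tabs and other control characters visible.
    for line_number, reason, line in find_bad_rows(sys.stdin):
        print("%d\t%s\t%r" % (line_number, reason, line))
```

If `main()` is called at the bottom of a script (say, a hypothetical `find_bad_rows.py`), the export directory could be piped through it with something like `hadoop fs -cat my_urls/part-* | python find_bad_rows.py`. The reported line numbers then refer to the concatenated stream rather than to any single part file.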