Apache Sqoop/Pig field escaping

Posted by mm9b1k5b on 2021-06-24 in Pig

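We are using Sqoop to export some data from MySQL, processing it with Apache Pig, and then trying to export the result from HDFS back into a MySQL database. During the export back to MySQL, however, we hit the following error: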

java.io.IOException: Can't export data, please check task tracker logs
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.NumberFormatException: For input string: ".proseries.com"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Integer.parseInt(Integer.java:449)
    at java.lang.Integer.valueOf(Integer.java:554)
    at mdm_urls.__loadFromFields(mdm_urls.java:419)

The HDFS data looks like this (tab-separated):

id:int  url:text  tld:text  port:int

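Somehow the tld field is ending up in the port column for some rows. Out of roughly 250M rows, fewer than 10 are affected. My initial assumption was that the url field must contain a tab character. However, we already strip all tabs in the Pig script: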

REGISTER target/mystuff.jar;

legacy_urls = LOAD 'url' USING PigStorage(',') AS (id, sha1, url_text);
legacy_urls_norm = FOREACH legacy_urls GENERATE id AS id, sha1 AS sha1, REPLACE(REPLACE(url_text, '\n', ''), '\t', '') AS url_text;

urls = FOREACH legacy_urls_norm GENERATE id, url_text, mystuff.RootDomain(url_text), mystuff.Protocol(url_text), mystuff.Host(url_text), mystuff.Path(url_text), mystuff.EffectiveTld(url_text), mystuff.Port(url_text), sha1;

STORE urls INTO 'mdm_urls';
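One thing worth noting about the script above: REPLACE only cleans url_text before the UDFs run, so whatever the mystuff UDFs return is stored unchecked, and carriage returns and other control characters are not stripped at all. Whether any of those UDFs can actually emit a tab (for example after percent-decoding a URL) is an assumption, not something confirmed here. As a rough sketch in Python rather than Pig, this is the kind of per-field cleanup that would have to be applied to every output field, not only to the input URL. The names sanitize_field and to_tsv_line are hypothetical helpers invented for this illustration.

```python
import re

# ASCII control characters (tab, newline, carriage return, NUL, ...) that a
# delimited-text reader could mistake for a field or record separator.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_field(value):
    """Return the string with every ASCII control character removed."""
    return _CONTROL_CHARS.sub("", value)


def to_tsv_line(fields):
    """Join already-computed output fields with tabs, cleaning each one first."""
    return "\t".join(sanitize_field(str(field)) for field in fields)
```

For example, to_tsv_line([1, "x\ty", ".com", 80]) produces a line with exactly four tab-separated fields, because the tab embedded in the second value is dropped before joining.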

Here is my Sqoop export command:

sqoop export --connect jdbc:mysql://hostname/db_name --input-fields-terminated-by "\t" --table test --export-dir my_urls

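I am having trouble debugging this because the Sqoop error gives no indication of which row it was processing (so I cannot confirm whether a tab character is still present, and so on). My first question is: how can I troubleshoot this more effectively? My second question is: how do people escape bad input data with Pig?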
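Since the exception does not name the row, one way to narrow it down is to scan the exported files directly and report every line whose field count is wrong or whose integer columns do not look like integers. Below is a minimal sketch in Python, assuming the four-column layout described above (id, url, tld, port, with 0-based columns 0 and 3 being integers). The function name find_bad_rows and its parameters are made up for this example, and the integer check only approximates Java's Integer parsing (it does not check for overflow).

```python
import re

# Approximation of what java.lang.Integer would accept: optional minus sign,
# then ASCII digits only. Range/overflow is not checked here.
_INT_PATTERN = re.compile(r"-?[0-9]+")


def find_bad_rows(lines, expected_fields, int_columns):
    """Yield (line_number, reason, row) for each tab-separated line that has
    the wrong number of fields or a non-integer value in an integer column."""
    for line_number, raw in enumerate(lines, start=1):
        row = raw.rstrip("\n")
        fields = row.split("\t")
        if len(fields) != expected_fields:
            reason = f"expected {expected_fields} fields, got {len(fields)}"
            yield line_number, reason, row
            continue
        for col in int_columns:
            if not _INT_PATTERN.fullmatch(fields[col]):
                reason = f"column {col} is not an integer: {fields[col]!r}"
                yield line_number, reason, row
                break  # one report per row is enough
```

In practice the lines could come from sys.stdin while piping in the output of hadoop fs -cat on the export directory, and each row could be printed with repr() so that stray tabs or carriage returns become visible. Line numbers then count across the whole concatenated stream, not per part file.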
