Unable to load data from S3

nafvub8i  posted on 2021-06-25  in Pig

I launched two m1.medium nodes on Amazon EC2 to run my Pig script, but it appears to fail on the very first line (before MapReduce even starts):

raw = LOAD 's3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000' USING TextLoader as (line:chararray);

The error message I get is:

2015-02-04 02:15:39,804 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-02-04 02:15:39,821 [JobControl] INFO  org.apache.hadoop.mapred.JobClient - Default number of map tasks: null
2015-02-04 02:15:39,822 [JobControl] INFO  org.apache.hadoop.mapred.JobClient - Setting default number of map tasks based on cluster size to : 20
... (omitted)
2015-02-04 02:18:40,955 [main] WARN  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-02-04 02:18:40,956 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201502040202_0002 has failed! Stop running all dependent jobs
2015-02-04 02:18:40,956 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-02-04 02:18:40,997 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
2015-02-04 02:18:40,997 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-02-04 02:18:40,997 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:

HadoopVersion   PigVersion      UserId  StartedAt               FinishedAt              Features
1.0.3           0.11.1.1-amzn   hadoop  2015-02-04 02:15:32     2015-02-04 02:18:40     GROUP_BY

Failed!

Failed Jobs:
JobId   Alias   Feature Message Outputs
job_201502050202_0002   ngroup,raw,triples,tt   GROUP_BY,COMBINER   Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201502050202_0002_m_000022

Input(s):
Failed to read data from "s3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000"

Output(s):

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

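I think the code should be fine, because I have successfully loaded other data with the same syntax before, and s3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000 looks valid. I suspect this may be related to some of my EC2 settings, but I am not sure how to investigate further or narrow the problem down. Does anyone have a clue?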

e4yzc0pl  #1

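The problem is now solved: I changed my nodes from m1.medium to m3.large. Thanks to @nat for the good hint, which pointed at the Java heap space error message. I will update with more details later.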

y53ybaqx  #2

The "Java heap space" error message gives a clue. Your file appears to be quite large (~2 GB). Make sure each task runner has enough memory to read the data.
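One way to try this without changing instance type is to request a larger heap for the task JVMs from the Pig script itself. The lines below are only a sketch under assumptions: the property name mapred.child.java.opts is the Hadoop 1.x one (the log reports Hadoop 1.0.3), and the 1024m value is purely illustrative, not a setting taken from this thread. Whether the cluster honors the override depends on its configuration, and the heap still has to fit in the node's physical memory, which may be why moving to a larger instance type worked here.

```pig
-- Assumption: Hadoop 1.x property name; the -Xmx value is illustrative only
-- and must fit within the memory actually available on each node.
SET mapred.child.java.opts '-Xmx1024m';

-- Same LOAD statement as in the question, unchanged.
raw = LOAD 's3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000' USING TextLoader AS (line:chararray);
```

If the job still fails with the same heap error after this, the override may not be taking effect or the node may simply be too small, and a larger instance type is the remaining option.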
