I have the following table:
hive> describe tv_counter_stats;
OK
day string
event string
query_id string
userid string
headers string
I want to run the following query:
hive -e 'SELECT
day,
event,
query_id,
COUNT(1) AS count,
COLLECT_SET(userid)
FROM
tv_counter_stats
GROUP BY
day,
event,
query_id;' > counter_stats_data.csv
However, this query fails, while the following query works fine:
hive -e 'SELECT
day,
event,
query_id,
COUNT(1) AS count
FROM
tv_counter_stats
GROUP BY
day,
event,
query_id;' > counter_stats_data.csv
Here I removed the collect_set call. So my question is: does anyone know why collect_set might fail in this case?
Update: added the error message:
Diagnostic Messages for this Task:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 3 Reduce: 1 Cumulative CPU: 10.49 sec HDFS Read: 109136387 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 10 seconds 490 msec
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
Error: GC overhead limit exceeded
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
Error: GC overhead limit exceeded
Update 2: I modified the query so that it looks like this:
hive -e '
SET mapred.child.java.opts="-server -Xmx1g -XX:+UseConcMarkSweepGC";
SELECT
day,
event,
query_id,
COUNT(1) AS count,
COLLECT_SET(userid)
FROM
tv_counter_stats
GROUP BY
day,
event,
query_id;' > counter_stats_data.csv
However, I get the following error:
Diagnostic Messages for this Task:
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 3 Reduce: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
2 Answers

Answer 1
This is probably a memory problem, since collect_set aggregates data in memory. Try increasing the heap size and enabling the concurrent GC by setting the Hadoop option mapred.child.java.opts, e.g. to -Xmx1g -XX:+UseConcMarkSweepGC. This answer has more information on the "GC overhead limit" error.
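The suggestion above can be applied per-session from the Hive CLI before the failing query; a minimal sketch (the 1 GB heap and GC flags are example values only, tune them to the memory available per task slot on your cluster):

```sql
-- Raise the child JVM heap and switch to the concurrent collector
-- for this session only; the values below are examples, not a recommendation.
SET mapred.child.java.opts=-Xmx1g -XX:+UseConcMarkSweepGC;

SELECT day, event, query_id, COUNT(1) AS count, COLLECT_SET(userid)
FROM tv_counter_stats
GROUP BY day, event, query_id;
```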
Answer 2
I ran into the same problem and came across this question, so I thought I'd share the solution I found.

The underlying issue is most likely that Hive tries to do the aggregation on the mapper side, and the heuristic it uses to manage the in-memory hash map for that approach is thrown off by "wide and shallow" data — i.e., in your case, data with very few userid values per day/event/query_id group.

I found a post explaining various ways to address this, but most of them are just optimizations of the nuclear option: disabling map-side aggregation entirely. Using

set hive.map.aggr = false;

should do the trick.
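Putting that together with the original query, a minimal sketch (assuming the same table and columns as above):

```sql
-- Disable map-side aggregation so the mappers do not build an
-- in-memory hash map of groups; aggregation happens in the reducers.
SET hive.map.aggr = false;

SELECT day, event, query_id, COUNT(1) AS count, COLLECT_SET(userid)
FROM tv_counter_stats
GROUP BY day, event, query_id;
```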