How to read an ORC file in Hadoop Streaming?

Posted by 2guxujil on 2021-06-03 in Hadoop

I want to read ORC files in a MapReduce job using Python. I tried running this:

hadoop jar /usr/lib/hadoop/lib/hadoop-streaming-2.6.0.2.2.6.0-2800.jar \
-file /hdfs/price/mymapper.py \
-mapper '/usr/local/anaconda/bin/python mymapper.py' \
-file /hdfs/price/myreducer.py \
-reducer '/usr/local/anaconda/bin/python myreducer.py' \
-input /user/hive/orcfiles/* \
-libjars /usr/hdp/2.2.6.0-2800/hive/lib/hive-exec.jar \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat \
-numReduceTasks 1 \
-output /user/hive/output

But I get this error:

-inputformat : class not found : org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

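I found a similar question, "OrcNewInputformat as a inputformat for hadoop streaming", but the answer there is not clear.
Please give me an example of how to read ORC files correctly in Hadoop Streaming.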

hadoop streaming python orc

Source: https://stackoverflow.com/questions/32307999/how-to-read-orc-file-in-hadoop-streaming

1 Answer

xmd2e60i1#

Here is one of my examples, which uses a partitioned Hive table stored as ORC as the input:

hadoop jar /usr/hdp/2.2.4.12-1/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.4.12-1.jar \
-libjars /usr/hdp/current/hive-client/lib/hive-exec.jar \
-Dmapreduce.task.timeout=0 -Dmapred.reduce.tasks=1 \
-Dmapreduce.job.queuename=default \
-file RStreamMapper.R -file RStreamReducer2.R \
-mapper "Rscript RStreamMapper.R" -reducer "Rscript RStreamReducer2.R" \
-input /hive/warehouse/asv.db/rtd_430304_fnl2 \
-output /user/Abhi/MRExample/Output \
-inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat \
-outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

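Here, /apps/hive/warehouse/asv.db/rtd_430304_fnl2 is the path where the Hive table stores its ORC data in the backend. Apart from that, I only need to supply the appropriate jars for streaming and for Hive.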
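
For the Python side of the question, a mapper could look roughly like the sketch below. It rests on an assumption that is worth checking against a few real records first: that each row reaches the script on stdin as one text line, with an optional key, a tab, and then the row's columns wrapped in braces and separated by commas, for example "(null)<TAB>{1, apple, 2.5}". The function names parse_record and map_lines are made up for illustration, and column values that themselves contain commas would be split incorrectly by this simple parser.

```python
import sys


def parse_record(line):
    """Split one streamed line into a list of column strings.

    ASSUMPTION: the line looks like "<key>\t{col1, col2, ...}" or just
    "{col1, col2, ...}". This layout is not confirmed by the thread.
    """
    line = line.rstrip("\r\n")
    # Drop the key part if a tab is present; otherwise use the whole line.
    key, sep, value = line.partition("\t")
    if not sep:
        value = key
    value = value.strip()
    # Remove the surrounding braces of the struct text, if any.
    if value.startswith("{") and value.endswith("}"):
        value = value[1:-1]
    if not value:
        return []
    return [col.strip() for col in value.split(",")]


def map_lines(lines):
    """Yield "key<TAB>value" strings: first column as key, the rest as value."""
    for line in lines:
        cols = parse_record(line)
        if not cols:
            continue  # skip blank or empty records
        yield "%s\t%s" % (cols[0], ",".join(cols[1:]))
```

To use this as mymapper.py, the script would end with a loop such as `for out in map_lines(sys.stdin): sys.stdout.write(out + "\n")`. Keeping the parsing in a separate function makes it easy to adjust once the actual line layout on the cluster is known.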

Answered on 2021-06-03
