Opening files on HDFS from a Hadoop MapReduce job

Posted by yv5phkfx on 2021-06-03 in Hadoop

Normally, I can open a file like this:

aDict = {}
with open('WordLists/positive_words.txt', 'r') as f:
    aDict['positive'] = {line.strip() for line in f}

with open('WordLists/negative_words.txt', 'r') as f:
    aDict['negative'] = {line.strip() for line in f}

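This opens the two relevant text files in the WordLists folder and adds each line to the dictionary under either 'positive' or 'negative'.
However, I don't think this will work when I run a MapReduce job in Hadoop. I run my program like this: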

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -file hadoop_map.py -mapper hadoop_reduce.py -input /toBeProcessed -output /Completed

I tried changing the code to:

with open('/mapreduce/WordLists/negative_words.txt', 'r')

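where mapreduce is a folder on HDFS and WordLists is a subfolder containing the negative words. But my program can't find the file. Is what I'm doing possible, and if so, what is the correct way to load a file that lives on HDFS?
Edit
I have now tried: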

with open('hdfs://localhost:9000/mapreduce/WordLists/negative_words.txt', 'r')

This seems to do something, but now I get this output:

13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 50%  reduce 0%
13/08/27 21:18:50 INFO streaming.StreamJob:  map 0%  reduce 0%

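Then the job fails. So it's still not right. Any ideas?
Edit 2:
After re-reading the documentation, I noticed that I can use the -files option to specify files. The documentation states:
The -files option creates a symlink in the current working directory of the tasks that points to the local copy of the file.
In this example, Hadoop automatically creates a symlink named testfile.txt in the current working directory of the tasks. This symlink points to the local copy of testfile.txt.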

-files hdfs://host:fs_port/user/testfile.txt

So I ran:

./hadoop/bin/hadoop jar contrib/streaming/hadoop-streaming-1.1.2.jar -D mapred.reduce.tasks=0 -files hdfs://localhost:54310/mapreduce/SentimentWordLists/positive_words.txt#positive_words -files hdfs://localhost:54310/mapreduce/SentimentWordLists/negative_words.txt#negative_words -file hadoop_map.py -mapper hadoop_map.py -input /toBeProcessed -output /Completed

As I understand the documentation, this creates symlinks so that I can use "positive_words" and "negative_words" in my code, like this:

with open('negative_words.txt', 'r')

However, this still doesn't work. Any help would be greatly appreciated, as I can't make progress until I solve this.
Edit 3:
I can use the following option:

-file ~/Twitter/SentimentWordLists/positive_words.txt

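along with the rest of the command for running the Hadoop job. This looks for the file on the local system rather than on HDFS. It doesn't throw any errors, so the file is being accepted somewhere. However, I don't know how to access the file.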

hadoop python hadoop-streaming

Source: https://stackoverflow.com/questions/18474519/opening-files-on-hdfs-from-hadoop-mapreduce-job

2 answers

wkyowqbh #1

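When working with HDFS programmatically, you should look at FileSystem, FileStatus, and Path. These are Hadoop API classes that let you access HDFS from within your program.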
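
Note that FileSystem, FileStatus, and Path are Java classes (in org.apache.hadoop.fs), so a Python streaming script cannot call them directly, and Python's built-in open() only reads the local filesystem. One possible workaround from Python is to shell out to `hadoop fs -cat`. The sketch below assumes the `hadoop` executable is on the PATH of the machine running the script; the helper names `hdfs_cat_command` and `read_hdfs_word_set` are hypothetical and not part of any Hadoop API.

```python
import subprocess


def hdfs_cat_command(hdfs_path, hadoop_bin="hadoop"):
    """Build the argument list for `hadoop fs -cat <path>`."""
    # Assumption: `hadoop_bin` resolves to the Hadoop launcher script.
    return [hadoop_bin, "fs", "-cat", hdfs_path]


def read_hdfs_word_set(hdfs_path, hadoop_bin="hadoop",
                       runner=subprocess.check_output):
    """Return the non-empty, stripped lines of an HDFS text file as a set.

    `runner` is injectable so the function can be tried without a cluster.
    """
    raw = runner(hdfs_cat_command(hdfs_path, hadoop_bin))
    # check_output returns bytes on Python 3; decode before splitting.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
    return {line.strip() for line in text.splitlines() if line.strip()}
```

With the paths from the question, usage might look like `negative = read_hdfs_word_set('/mapreduce/WordLists/negative_words.txt')`. This starts a separate process for every file in every task, so it is probably only reasonable for small lookup files such as word lists.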

2jcobegt #2

The solution, after a lot of comments :)
To read a data file in Python: ship it with -file and add the following to your script:

import sys

Sometimes you also need to add this after the import:

sys.path.append('.')

(Related to @DrDee's comment in "Hadoop Streaming - Unable to find file error")
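
To connect this with Edit 3 of the question: as far as I understand it, a file shipped with -file ends up in the task's current working directory under its base name, so the mapper can open it with a plain relative path. With -files and a `#name` suffix, the name after `#` is the symlink the task sees, so the name passed to open() would have to match it. Below is a minimal sketch of such a mapper, assuming the word lists were shipped as positive_words.txt and negative_words.txt; the `score_line` function is a hypothetical stand-in for whatever the real mapper does with the lists.

```python
import os
import sys


def load_word_set(filename, search_dir="."):
    """Read one word per line from a file in the task's working directory."""
    path = os.path.join(search_dir, filename)
    with open(path, "r") as f:
        return {line.strip() for line in f if line.strip()}


def load_word_lists(search_dir="."):
    # File names follow the question; use whatever names were shipped.
    return {
        "positive": load_word_set("positive_words.txt", search_dir),
        "negative": load_word_set("negative_words.txt", search_dir),
    }


def score_line(line, word_lists):
    """Hypothetical scoring: positive word count minus negative word count."""
    words = line.lower().split()
    pos = sum(1 for w in words if w in word_lists["positive"])
    neg = sum(1 for w in words if w in word_lists["negative"])
    return pos - neg


def main():
    # Load the lists once per task, then stream records from stdin.
    word_lists = load_word_lists()
    for line in sys.stdin:
        line = line.rstrip("\n")
        sys.stdout.write("%s\t%d\n" % (line, score_line(line, word_lists)))

# In the actual mapper script, finish by calling main().
```

If a file still cannot be found, writing `os.listdir('.')` to sys.stderr at the start of main() should show in the task logs which names are actually present in the working directory.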
