hadoop流调用python脚本

0g0grzrc  于 2021-06-03  发布在  Hadoop
关注(0)|答案(2)|浏览(305)

我有两个小的python脚本
countwordoccurrence\u mapper.py


# !/usr/bin/env python

import sys

# print(sys.argv[1])

text = sys.argv[1]

wordCount = text.count(sys.argv[2])

# print (sys.argv[2],wordCount)

print '%s\t%s' % (sys.argv[2], wordCount)

printwordcount\ u reducer.py版


# !/usr/bin/env python

import sys

finalCount = 0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')

    count=int(count)

    finalCount += count

    print(word,finalCount)

我执行了以下命令:

$ ./CountWordOccurence_mapper.py \
    "I am a Honda customer 100%.. 94 Accord ex 96 Accord exV6 98 Accord exv6 cpe 2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower motor blown 2months after the warranty expired. Sad $600 didn't last very long." \
    "Accord" \
     | /home/hadoopranch/omkar/PrintWordCount_reducer.py
('Accord', 4)

如前所述,我的目标是哑巴计数给定文本中所提供单词(在本例中为accord)的出现次数。
现在,我打算使用hadoop流执行同样的操作。hdfs上的文本文件(部分)是:

"message" : "I am a Honda customer 100%.. 94 Accord ex   96 Accord exV6  98 Accord exv6 cpe  2001 S2000 ... 2003 Pilot for me and 2003 Accord for hubby that are still going beautifully...\n\nBUT.... Honda lawnmower   motor blown 2months after the warranty expired. Sad $600 didn't last very long."
"message" : "I am an angry Honda owner! In 2009 I bought a new Honda Civic and have taken great care of it.  Yesterday   I tried to start it unsuccessfully.  After hours at the auto mechanics  it was found that there was a glitch in the electric/computer system.  The news was disappointing enough (and expensive)    but to find out the problem is basically a defect/common problem with the year/make/model I purchased is awful.  When I bought a NEW Honda I thought I bought quality.  I was wrong! Will Honda step up?"

我修改了countwordoccurrence\u mapper.py


# !/usr/bin/env python

import sys

for text in sys.stdin:  

    wordCount = text.count(sys.argv[1])

    print '%s\t%s' % (sys.argv[1], wordCount)

我的第一个困惑是——如何将要计算的单词(例如“accord”、“honda”)作为参数发送给Map器(-cmdenv name=value),这让我很困惑。我仍然继续执行以下命令:

$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.4.jar \
 -input /random_data/honda_service_within_warranty.txt \
 -output /random_op/cnt.txt \
 -file /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py \
 -mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord" \
 -file /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py \
 -reducer /home/hduser/dumphere/codes/python/PrintWordCount_reducer.py

正如所料,作业失败,我得到以下错误:

Traceback (most recent call last):
  File "/tmp/hadoop-hduser/mapred/local/taskTracker/hduser/jobcache/job_201304232210_0007/attempt_201304232210_0007_m_000001_3/work/./CountWordOccurence_mapper.py", line 6, in <module>
    wordCount = text.count(sys.argv[1])
IndexError: list index out of range
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:576)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

请改正我犯的语法和基本错误。
谢谢和问候!

cnwbcb6i

cnwbcb6i1#

我认为您的问题在于命令行调用的以下部分:

-mapper /home/hduser/dumphere/codes/python/CountWordOccurence_mapper.py "Accord"

我认为您在这里的假设是字符串“accord”作为第一个参数传递给Map器。我很肯定不是这样的,事实上字符串“accord”很可能被流驱动程序入口点类(streamjob.java)忽略了。
要解决这个问题,您需要重新使用 -cmdenv 参数,然后在python代码中提取这个键/值对(我不是python程序员,但我相信一个快速的google会指向您需要的代码段)。

7vhp5slm

7vhp5slm2#

实际上,我尝试过使用-cmdenv wordtobecounted=“accord”,但问题是我的python文件-我忘记对其进行更改,以便从环境变量(而不是参数数组)读取值“accord”。我附上countwordoccurrence\u mapper.py的代码,以防万一有人想使用它作为参考:


# !/usr/bin/env python

import sys
import os

wordToBeCounted = os.environ['wordToBeCounted']

for text in sys.stdin:  

    wordCount = text.count(wordToBeCounted)

    print '%s\t%s' % (wordToBeCounted,wordCount)

谢谢和问候!

相关问题