How do I get the number of words on each line in Pig?

Posted by 7nbnzgx9 on 2021-06-25 in Pig

I am trying to find out how many words there are on each line of a file in Pig. I have already done the load and the split:

raw = load file;
words = FOREACH raw GENERATE TOKENIZE(*);

This gives me a bag of tuples, each containing one word. I then try to count the items:

counts = FOREACH words GENERATE COUNT(*);

and I get this error:

org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing count in COUNT
...
Caused by: java.lang.NullPointerException

Is it because some lines produce empty bags? Or am I doing something wrong?

#1 dzjeubhm

If the problem is empty bags, you could try something like this (untested):

raw = load file;

words = FOREACH raw GENERATE TOKENIZE(*) as tokenized_words;

counts = FOREACH words GENERATE ((tokenized_words IS NULL OR IsEmpty(tokenized_words)) ? 0L : COUNT(tokenized_words)) AS total_count;

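Here a conditional expression (bincond) checks whether the bag of tokenized words is null or empty; if it is, the result is zero, otherwise it is the number of words in the bag.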

#2 sycxhyv7

Can you try it this way?
Input:

Hi hello how are you
this is apache pig
works

like a charm

Pig script:

A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line);
C = FOREACH B GENERATE COUNT($0);
DUMP C;

Output:

(5)
(4)
(1)
()
(3)
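
The empty tuple in the output corresponds to the blank input line: the count for that line comes back as null rather than 0. If a zero is preferred, one possible variation is sketched below. It is not part of the original answer and has not been run against the asker's data. It assumes the bag for a blank line is null, which the output above suggests, and the field names `words` and `word_count` are introduced here purely for illustration.

```pig
-- Sketch only: 'input' and the aliases A, B, C follow the script above;
-- the field names `words` and `word_count` are illustrative assumptions.
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE TOKENIZE(line) AS words;
-- Bincond: yield 0 when the bag is null, otherwise count its tuples.
-- 0L keeps both branches of type long, matching COUNT's return type.
C = FOREACH B GENERATE (words IS NULL ? 0L : COUNT(words)) AS word_count;
DUMP C;
```

Passing the bag to COUNT by name, rather than using COUNT(*), also makes it explicit which field is being counted.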
