Replace a comma (,) only when it is inside double quotes ("") in Pig

Posted by fykwrbwg on 2021-06-24 in Pig

I have data like this:

1,234,"john, lee", john@xyz.com

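I want to use a Pig script to remove the comma that appears inside the double quotes (and drop the quotes themselves), so that my data looks like: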

1,234,john lee, john@xyz.com

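I tried loading this data with CSVExcelStorage, but I also need the '-tagFile' option, which CSVExcelStorage does not support. So I plan to use only PigStorage and then replace the comma (,) inside the quotes. I am stuck on this. Any help is much appreciated. Thanks.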


jtw3ybtb  #1

The following commands should help:

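csvFile = LOAD '/path/to/file' USING PigStorage(',');
-- PigStorage(',') splits the quoted name into $2 ("john) and $3 ( lee"); strip the quotes and rejoin the two halves.
result = FOREACH csvFile GENERATE
    $0 AS (field1:chararray),
    $1 AS (field2:chararray),
    CONCAT(REPLACE($2, '\\"', ''), REPLACE($3, '\\"', '')) AS field3,
    $4 AS (field4:chararray);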

Output:
(1,234,john lee, john@xyz.com)
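
To see why this works, the split-and-rejoin logic can be traced outside Pig. The following is a minimal Python sketch, not Pig code: `str.split` stands in for `PigStorage(',')` and `str.replace` for `REPLACE`. It assumes the quoted value is always the third column and contains exactly one comma; with any other layout the column positions would shift.

```python
line = '1,234,"john, lee", john@xyz.com'

# Splitting on every comma also splits inside the quotes,
# so the name ends up in two pieces: '"john' and ' lee"'.
fields = line.split(',')

# Strip the quote from each piece and join them,
# mirroring CONCAT(REPLACE($2, ...), REPLACE($3, ...)).
name = fields[2].replace('"', '') + fields[3].replace('"', '')

result = (fields[0], fields[1], name, fields[4])
print(result)
```

Note that the last field keeps its leading space, because the input has a space after the comma.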


vbopmzt1  #2

Load each line into a single field, then use STRSPLIT and REPLACE:

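A = LOAD 'data.csv' USING TextLoader() AS (line:chararray);
-- Split on the double quote into three parts: text before, quoted text, text after
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\"', 3)) AS (head:chararray, quoted:chararray, tail:chararray);
-- Remove the comma from the quoted part only
C = FOREACH B GENERATE head, REPLACE(quoted, ',', '') AS quoted, tail;
D = FOREACH C GENERATE CONCAT(CONCAT(head, quoted), tail) AS line; -- You can further use STRSPLIT to get individual fields or just CONCAT
E = FOREACH D GENERATE FLATTEN(STRSPLIT(line, ',', 4));
DUMP E;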

Input:

1,234,"john, lee", john@xyz.com

B

(1,234,)(john, lee)(, john@xyz.com)

C

(1,234,)(john lee)(, john@xyz.com)

D

(1,234,john lee, john@xyz.com)

E

(1),(234),(john lee),(john@xyz.com)
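
The same four steps can be checked quickly outside Pig. This is a rough Python sketch of the idea, not Pig code: `str.split` with a split count stands in for `STRSPLIT` with a limit, and the variable names are illustrative. It assumes each line has exactly one quoted field.

```python
line = '1,234,"john, lee", john@xyz.com'

# B: split on the double quote into at most 3 pieces
# (a STRSPLIT limit of 3 corresponds to 2 splits here).
head, quoted, tail = line.split('"', 2)

# C: drop the comma from the quoted piece only.
quoted_clean = quoted.replace(',', '')

# D: glue the three pieces back into one line.
merged = head + quoted_clean + tail

# E: split the cleaned line into its 4 fields.
fields = merged.split(',', 3)
print(fields)
```

A line with two or more quoted fields would need a different approach, since only the first quoted piece is isolated here.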

nkhmeac6  #3

I found a clean way to do this. A fairly generic solution is as follows:

/* Use a delimiter that does not occur in the data (tab here, an assumption about the input) so that each whole line is loaded into record */
data = LOAD 'data.csv' USING PigStorage('\t','-tagFile') AS (filename:chararray, record:chararray);

/* Remove a comma (,) if it appears inside quoted column content */
replaceComma = FOREACH data GENERATE filename, REPLACE(record, ',(?!(([^\\"]*\\"){2})*[^\\"]*$)', '') AS record;

/* Remove the quotes ("") that CSV places around a column when it contains a comma (,) */
replaceQuotes = FOREACH replaceComma GENERATE filename, REPLACE(record, '"', '') AS record;

A detailed use case can be found on my blog.
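
The regular expression in the third answer can be tried outside Pig to see which commas it selects. Below is a minimal Python sketch using the `re` module; the pattern is the same one, written as a raw string so the quotes need no escaping. A comma matches only when the rest of the line does not contain an even number of quotes, which means the comma sits inside a quoted field. The helper name `clean` is hypothetical, and the sketch assumes quotes are balanced and never escaped inside a field.

```python
import re

# Negative lookahead: reject the comma if what follows is
# zero or more PAIRS of quotes and then no further quotes.
COMMA_IN_QUOTES = re.compile(r',(?!(([^"]*"){2})*[^"]*$)')

def clean(record):
    # Step 1: remove commas that are inside quotes.
    no_inner_commas = COMMA_IN_QUOTES.sub('', record)
    # Step 2: remove the quotes themselves.
    return no_inner_commas.replace('"', '')

print(clean('1,234,"john, lee", john@xyz.com'))
print(clean('a,"b, c, d",e'))
```

Unlike the split-based answers, this handles several commas inside one quoted field and does not depend on the column position.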
