hive-manual解析用双引号括起来并用逗号分隔的数据

ttisahbt  于 2021-06-02  发布在  Hadoop
关注(0)|答案(1)|浏览(350)

我见过一些类似的问题,但由于问题不完全相同或解决方案不适用于我的情况,我张贴我的问题在这里。
我正在分析一个表,该表在 csv_line 列。问题是有些列有逗号 , 这也是字段分隔符。这些列被嵌入引号中。
我所做的分析是:

with  
sample as (
select 'field1,field3,"http://another.domain/abc/...eIds=111,222,333,444,...,",CustomerX,end' as csv_line)

select 

 regexp_extract(csv_line,'(,?(".*?"|[^,]*)){1}') as f1
 regexp_extract(csv_line,'(,?(".*?"|[^,]*)){n}') as fn

from raw_sample

我试图替换字符/逗号。
我知道opencsvserde允许在create表中定义分隔符和转义双引号,但是我正在寻找一个可以设置的属性,或者一个可以以正确方式进行拆分的正则表达式。
提前谢谢

sf6xfgos

sf6xfgos1#

with raw_sample as (
select 'field1,field2,fiend3,123,456,"http://some.domain/abc/Player.aspx?playerID=111&BrowseIds=2221,423062611,423870887,424044345,...,",THIS_IS_MY,en,20 294 998 1001,end' as raw_line
)

select  regexp_extract(raw_line,'(,?(".*?"|[^,]*)){01}',2) as c01
       ,regexp_extract(raw_line,'(,?(".*?"|[^,]*)){02}',2) as c02
       ,regexp_extract(raw_line,'(,?(".*?"|[^,]*)){03}',2) as c03
       ,regexp_extract(raw_line,'(,?(".*?"|[^,]*)){04}',2) as c04
       ,regexp_extract(raw_line,'(,?(".*?"|[^,]*)){05}',2) as c05
       ,regexp_extract(raw_line,'(,?(".*?"|[^,]*)){06}',2) as c06
       ,regexp_extract(raw_line,'(,?(".*?"|[^,]*)){07}',2) as c07
       ,regexp_extract(raw_line,'(,?(".*?"|[^,]*)){08}',2) as c08
       ,regexp_extract(raw_line,'(,?(".*?"|[^,]*)){09}',2) as c09
       ,regexp_extract(raw_line,'(,?(".*?"|[^,]*)){10}',2) as c10

from    raw_sample
;
+--------+--------+--------+-----+-----+-----------------------------------------------------------------------------------------------------+------------+-----+-----------------+-----+
|  c01   |  c02   |  c03   | c04 | c05 |                                                 c06                                                 |    c07     | c08 |       c09       | c10 |
+--------+--------+--------+-----+-----+-----------------------------------------------------------------------------------------------------+------------+-----+-----------------+-----+
| field1 | field2 | fiend3 | 123 | 456 | "http://some.domain/abc/Player.aspx?playerID=111&BrowseIds=2221,423062611,423870887,424044345,...," | THIS_IS_MY | en  | 20 294 998 1001 | end |
+--------+--------+--------+-----+-----+-----------------------------------------------------------------------------------------------------+------------+-----+-----------------+-----+

相关问题