Hive对复杂JSON数据获取和数据拆分

x33g5p2x  于2020-11-12 发布在 Hive  
字(1.7k)|赞(0)|评价(0)|浏览(1171)

原始数据:

[{"name":"woods","app_id":"abc123"},{"name":"tiger","app_id":"def456"}]

get_json_object

函数的作用:用来解析json字符串的一个字段:

select get_json_object(flist,'$.filtertype') as filtertype
      ,get_json_object(flist,'$.filtersubtype')as filtersubtype
      ,get_json_object(flist,'$.filtername') as filtername
from aaaaaaa

json_tuple

函数的作用:用来解析json字符串中的多个字段

select a.flist
      ,b.filtertype
      ,b.filtersubtype
      ,b.filtername
 from aaaaaaaaa a 
lateral view json_tuple(flist,'filtertype', 'filtersubtype', 'filtername') b as 

hive用法

没有别的好办法,肯定提取“[]”之间的json字符串之后、后进行分割,逗号分割不行,逗号太多,而我们只想两个大括号之间的逗号做分割“ },{”,所以要把逗号替换成json数组中不存在的其它字符,之后再进行分割,如下,五步走:

1、regexp_extract提取[]内的json字符串

2、regexp_replace替换为逗号为自定义分隔符

3、split根据自定义分隔符分割

4、explode行转列函数

5、get_json_object解析json取字段值

SELECT
	get_json_object(str_json, '$.sku') AS sid
FROM
	(
		SELECT
			event_param_json,
			split(regexp_replace(regexp_extract(event_param_json, '(\\[)(.*?)(\\])', 2), '\\},\\{', '\\}#\\{'), '\\#') AS json_list
		FROM
			gdm.gdm_m14_wireless_exposure_log
		WHERE
			dt = '2020-09-06'
			AND event_id = 'NewManChannel_RecProExpo'
			AND event_param_json LIKE '%uss%'
	)
	a lateral VIEW explode(json_list) list_tab AS str_json

或者如下SQL都可以(两种SQL只是提取中括号[]间json字符串的正则不一样)

SELECT
	GET_JSON_OBJECT(json_str, '$.sku') AS sid
FROM
	(
		SELECT
			json_str
		FROM
			(
				SELECT
					split(regexp_replace(regexp_extract(event_param_json -- 获取data数组,格式[{json},{json}]
					, '^\\[(.+)\\]$', 1) -- 删除字符串前后的[],格式{json},{json}
					, '\\}\\,\\{', '\\}\\|\\|\\{') -- 将josn字符串中的分隔符代换成||,格式{json}||{json}
					, '\\|\\|') AS json_list
				FROM
					gdm.gdm_m14_wireless_exposure_log
				WHERE
					dt = '2020-09-06'
					AND page_id = 'NewManChannel_Home'
					-- and event_id = 'NewManChannel_RecProExpo'
					AND event_param_json LIKE '%uss%'
			)
			a lateral VIEW explode(json_list) list_tab AS json_str
	)
	t

相关文章