How can I split a JSON array in a Hive table into batches of a given batch size?

wpcxdonn  posted on 2023-02-08  in  Hive

For example, suppose the input is a Hive table like this:
| id | entities |
| --- | --- |
| 1 | [{"a":"a1","b":"b1"},{"a":"a2","b":"b2"},{"a":"a3","b":"b3"},{"a":"a4","b":"b4"},{"a":"a5","b":"b5"}] |
| 2 | [{"c":"c1","d":"d1"},{"c":"c2","d":"d2"},{"c":"c3","d":"d3"},{"c":"c4","d":"d4"},{"c":"c5","d":"d5"}] |
With batchSize = 3, each array should contain at most three elements, like this:
| id | entities |
| --- | --- |
| 1 | [{"a":"a1","b":"b1"},{"a":"a2","b":"b2"},{"a":"a3","b":"b3"}] |
| 1 | [{"a":"a4","b":"b4"},{"a":"a5","b":"b5"}] |
| 2 | [{"c":"c1","d":"d1"},{"c":"c2","d":"d2"},{"c":"c3","d":"d3"}] |
| 2 | [{"c":"c4","d":"d4"},{"c":"c5","d":"d5"}] |
With batchSize = 2, each array should contain at most two elements:
| id | entities |
| --- | --- |
| 1 | [{"a":"a1","b":"b1"},{"a":"a2","b":"b2"}] |
| 1 | [{"a":"a3","b":"b3"},{"a":"a4","b":"b4"}] |
| 1 | [{"a":"a5","b":"b5"}] |
| 2 | [{"c":"c1","d":"d1"},{"c":"c2","d":"d2"}] |
| 2 | [{"c":"c3","d":"d3"},{"c":"c4","d":"d4"}] |
| 2 | [{"c":"c5","d":"d5"}] |
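Is there something built into Hive for this? Or could you recommend a Spark UDF?
I tried writing a UDF, but I don't know how to make it return several rows of sub-arrays instead of a single row. A UDF usually operates on a few columns and returns one result per input row, so I don't see how to split an array into sub-arrays spread over several rows.
Thanks!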


qzlgjiam  #1

You can use the slice function to get the sub-arrays and then explode to turn them into multiple rows. No UDF is needed.
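As a rough illustration of the index arithmetic behind this, the sketch below mirrors it on a plain Scala collection rather than on a Spark column. `BatchSketch` and `batches` are hypothetical names used only here. Spark SQL's `sequence(1, n, batchSize)` produces the 1-based start index of each batch, and `slice(arr, s, batchSize)` takes up to `batchSize` elements starting at `s`.

```scala
object BatchSketch {
  // Hypothetical helper mirroring
  //   transform(sequence(1, size(arr), batchSize), s -> slice(arr, s, batchSize))
  // Spark SQL's slice is 1-based, so start index s maps to offset s - 1 here.
  def batches[A](arr: Seq[A], batchSize: Int): Seq[Seq[A]] = {
    require(batchSize > 0, "batchSize must be positive")
    (1 to arr.size by batchSize).map(s => arr.slice(s - 1, s - 1 + batchSize))
  }

  def main(args: Array[String]): Unit = {
    val entities = Seq("a1", "a2", "a3", "a4", "a5")
    println(batches(entities, 2)) // three batches: sizes 2, 2 and 1
    println(batches(entities, 3)) // two batches: sizes 3 and 2
  }
}
```

On an ordinary Scala collection, `arr.grouped(batchSize).toSeq` gives the same grouping more directly. The Spark snippet below applies the same start indices through `sequence` and `slice`.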

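import org.apache.spark.sql.functions.{explode, expr}
import spark.implicits._ // assumes a spark-shell session where `spark` is in scope

// Sample input: id plus an array of structs, matching the output shown below.
val df = Seq(
  (1, Seq(("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4"), ("a5", "b5")))
).toDF("id", "entities")

val batchSize = 2

// sequence(1, size, batchSize) yields the 1-based start index of each batch,
// slice takes up to batchSize elements from each start index,
// and explode emits one output row per batch.
df.select($"id", explode(expr(
    s"transform(sequence(1, size(entities), $batchSize), " +
      s"s -> slice(entities, s, $batchSize))")))
  .show(false)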

+---+--------------------+
|id |col                 |
+---+--------------------+
|1  |[[a1, b1], [a2, b2]]|
|1  |[[a3, b3], [a4, b4]]|
|1  |[[a5, b5]]          |
+---+--------------------+
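The snippet above assumes `entities` is already an array column. If the Hive column instead holds the JSON array as a string, as the question's tables suggest, one possible approach is to parse it with `from_json`, batch it the same way, and serialize each batch back with `to_json`. The following is only a sketch under stated assumptions: `JsonBatches` and `splitIntoBatches` are hypothetical names, the column names `id` and `entities` come from the question, and the schema assumes every array element is a flat JSON object with string values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, expr, from_json, to_json}
import org.apache.spark.sql.types.{ArrayType, MapType, StringType}

object JsonBatches {
  // Assumption: each element is a flat JSON object whose values are strings.
  private val entitiesSchema = ArrayType(MapType(StringType, StringType))

  // Hypothetical helper: parse the JSON string, split it into batches,
  // and serialize every batch back to a JSON array string.
  def splitIntoBatches(df: DataFrame, batchSize: Int): DataFrame =
    df.withColumn("parsed", from_json(col("entities"), entitiesSchema))
      .select(col("id"), explode(expr(
        s"transform(sequence(1, size(parsed), $batchSize), " +
          s"s -> slice(parsed, s, $batchSize))")).as("batch"))
      .select(col("id"), to_json(col("batch")).as("entities"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("json-batches")
      .getOrCreate()
    import spark.implicits._

    // Sample rows taken from the question's input table.
    val df = Seq(
      (1, """[{"a":"a1","b":"b1"},{"a":"a2","b":"b2"},{"a":"a3","b":"b3"},{"a":"a4","b":"b4"},{"a":"a5","b":"b5"}]"""),
      (2, """[{"c":"c1","d":"d1"},{"c":"c2","d":"d2"},{"c":"c3","d":"d3"},{"c":"c4","d":"d4"},{"c":"c5","d":"d5"}]""")
    ).toDF("id", "entities")

    splitIntoBatches(df, 3).show(false)
    spark.stop()
  }
}
```

Rows whose array is empty or fails to parse may need to be filtered out first (for example with `where size(parsed) > 0`), since `sequence` expects the start not to exceed the stop when the step is positive.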
