Exploding nested arrays in PySpark

d6kp6zgx asked on 2021-05-27 in Spark

I'm working with the following DataFrame schema in PySpark (names changed for privacy).

|-- some_data: struct (nullable = true)
|    |-- some_array: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- some_nested_array: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- some_param_1: long (nullable = true)
|    |    |    |    |    |-- some_param_2: string (nullable = true)
|    |    |    |    |    |-- some_param_3: string (nullable = true)
|    |    |    |-- some_param_4: string (nullable = true)
|    |    |    |-- some_param_5: string (nullable = true)
|    |-- some_other_array: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- some_param_6: string (nullable = true)
|    |    |    |-- some_param_7: string (nullable = true)
|    |-- yet_another_array: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- some_param_8: string (nullable = true)
|    |    |    |-- some_param_9: string (nullable = true)

I'm having trouble using the explode function on the doubly nested array. I'd like to somehow get the parameters under some_array into their own columns, so that I can compare some_param_1 through some_param_9, or even just some_param_1 through some_param_5.

dzjeubhm1#

You can explode the nested array in two steps: first explode the outer array, then explode the nested inner array:

from pyspark.sql.functions import explode

# Step 1: explode the outer array into one row per element
df2 = df.withColumn("exploded_some_array",
                    explode("some_data.some_array"))
# Step 2: explode the inner array of each outer element
df2 = df2.withColumn("exploded_some_nested_array",
                     explode("exploded_some_array.some_nested_array"))
df2.select("exploded_some_nested_array.*",
           "exploded_some_array.some_param_4",
           "exploded_some_array.some_param_5").show(truncate=False)

This prints the 5 columns some_param_1 through some_param_5.
Adding the columns from the remaining two arrays is straightforward:

df2 = df2.withColumn("exploded_some_other_array", explode("some_data.some_other_array"))
df2 = df2.withColumn("exploded_yet_another_array", explode("some_data.yet_another_array"))
df2.select("exploded_some_nested_array.*", \
           "exploded_some_array.some_param_4", \
           "exploded_some_array.some_param_5", \
           "exploded_some_other_array.*", \
           "exploded_yet_another_array.*").show(truncate=False)

This prints all columns some_param_1 through some_param_9, like a flat table.
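
For reference, here is a minimal self-contained sketch of the same two-step explode. The sample rows and values are made up purely to mirror the question's schema; only the column names come from the question.

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()

# Toy data mirroring the question's schema (values are invented for illustration)
sample = Row(some_data=Row(
    some_array=[Row(
        some_nested_array=[Row(some_param_1=1, some_param_2="a", some_param_3="b")],
        some_param_4="c",
        some_param_5="d",
    )],
    some_other_array=[Row(some_param_6="e", some_param_7="f")],
    yet_another_array=[Row(some_param_8="g", some_param_9="h")],
))
df = spark.createDataFrame([sample])

# Explode outer array, then the nested array, then the other two top-level arrays
df2 = (df
       .withColumn("exploded_some_array", explode("some_data.some_array"))
       .withColumn("exploded_some_nested_array",
                   explode("exploded_some_array.some_nested_array"))
       .withColumn("exploded_some_other_array", explode("some_data.some_other_array"))
       .withColumn("exploded_yet_another_array", explode("some_data.yet_another_array")))

# One flat row per combination, with some_param_1 .. some_param_9 as plain columns
df2.select("exploded_some_nested_array.*",
           "exploded_some_array.some_param_4",
           "exploded_some_array.some_param_5",
           "exploded_some_other_array.*",
           "exploded_yet_another_array.*").show(truncate=False)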

qco9c6ql2#

You could convert the column to JSON and use a JSON path to get each parameter as its own column. Sample code below -

df.selectExpr("get_json_object(to_json(struct(some_data)),
 '$.some_data.some_array[0].some_nested_array[0].some_param_1') as some_param_1",
 ...<add_others>).show(false)

// compare each param as column here

If you are not familiar with JSON paths, then -
Get the generated JSON with -

df.selectExpr("to_json(struct(some_data))").show(truncate=False)

Copy the JSON from the result cell into https://jsonpathfinder.com/ , where you will see the object tree hierarchy. Now click on the some_param_1 node,
copy the path shown on the same page, replace the leading x with $, pass it as the second argument to get_json_object, and you're done.
Once you have the individual param columns, you can do your processing.
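
For example, a hypothetical follow-up sketch (assuming the same df as above, and picking some_param_2 and some_param_4 just for illustration) that extracts two params this way and compares them as ordinary columns:

from pyspark.sql.functions import col

# Extract two params via JSON paths (paths follow the question's schema)
df_params = df.selectExpr(
    "get_json_object(to_json(struct(some_data)), "
    "'$.some_data.some_array[0].some_nested_array[0].some_param_2') as some_param_2",
    "get_json_object(to_json(struct(some_data)), "
    "'$.some_data.some_array[0].some_param_4') as some_param_4",
)

# Compare the extracted columns like any other columns
df_params.withColumn("param_2_equals_param_4",
                     col("some_param_2") == col("some_param_4")).show(truncate=False)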
