在结构的数组中触发Dataframe结构的反序列化

hfyxw5xn 于 2021-07-14 发布在 Spark

关注(0)|答案(1)|浏览(387)

我有一个dataframe模式（用于以parquet格式存储的数据），如下所示

root
 |-- id: string (nullable = true)
 |-- mid: integer (nullable = true)
 |-- relationship: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- child_id: string (nullable = true)
 |    |    |-- details: struct (nullable = true)
 |    |    |    |-- package_contains: string (nullable = true)
 |    |    |    |-- package_level: string (nullable = true)

如果我有这样的问题

df.select($"id",$"mid").filter(array_contains(col("relationship.child_id"), "id1") && $"mid" === 1).show(false)

有人知道spark是否反序列化了吗 details struct数组中的struct( relationship )仅查询时 child_id 列？如果有办法验证的话？

apache-spark apache-spark-sql parquet

来源：https://stackoverflow.com/questions/67079254/spark-deserialization-of-dataframe-struct-inside-array-of-struct

1条答案

按热度按时间

jaxagkaj1#

斯帕克不会看电视的 details 除非有必要。
我用相同的结构创建了一个新的人工数据集。这个 details 结构由自定义项计算。稍后，udf将被更改，以便在运行时引发异常。如果查询仍然有效，我知道尚未调用udf，并且 details 尚未计算结构。

尝试一：“好”的自定义项

def details(x:Int): Tuple2[String, String] = ("a", "b")
val detailsUdf = udf(details _)

val df = spark.range(start = 0, end = 5, step = 1)
  .withColumn("mid", lit(1))
  .withColumn("relationship",
      array(struct(concat(lit("id"),'id).as("child_id"), detailsUdf('id).as("details"))))
``` `df` 现在有了结构

并包含数据

+---+---+---------------+
| id|mid| relationship|
+---+---+---------------+
| 0| 1|id0, [a, b]|
| 1| 1|id1, [a, b]|
| 2| 1|id2, [a, b]|
| 3| 1|id3, [a, b]|
| 4| 1|id4, [a, b]|
+---+---+---------------+


#### 尝试二：坏的自定义项

现在udf被替换为

def details(x:Int): Tuple2[String, String] = throw new NullPointerException
val detailsUdf = udf(details _)

Dataframe的结构仍然是相同的，但是调用 `show` 现在抛出一个 `NullPointerException` .
如果我运行查询

df.select($"id",$"mid").filter(array_contains(col("relationship.child_id"), "id1") && $"mid" === 1).show(false)

我得到了预期的结果

+---+---+
|id |mid|
+---+---+
|1 |1 |
+---+---+

在更改查询以便spark必须计算 `details` 结构

df.select($"id",$"mid").filter(array_contains(col("relationship.details._1"), "id1") && $"mid" === 1).show(false)

查询失败并引发nullpointerexception。

赞(0）回复(0）举报 2021-07-14

我来回答

在结构的数组中触发Dataframe结构的反序列化

1条答案

尝试一：“好”的自定义项

相关问题

热门标签

最新问答