Scala: Get distinct elements from rows of type ArrayType in Spark dataframe column

Posted by mjqavswn 5 months ago in Scala


I have a DataFrame with the following schema:

root
     |-- e: array (nullable = true)
     |    |-- element: string (containsNull = true)

For example, create a DataFrame:

val df = Seq(Seq("73","73"), null, null, null, Seq("51"), null, null, null, Seq("52", "53", "53", "73", "84"), Seq("73", "72", "51", "73")).toDF("e")

df.show()

+--------------------+
|                   e|
+--------------------+
|            [73, 73]|
|                null|
|                null|
|                null|
|                [51]|
|                null|
|                null|
|                null|
|[52, 53, 53, 73, 84]|
|    [73, 72, 51, 73]|
+--------------------+

I want the output to be:

+--------------------+
|                   e|
+--------------------+
|                [73]|
|                null|
|                null|
|                null|
|                [51]|
|                null|
|                null|
|                null|
|    [52, 53, 73, 84]|
|        [73, 72, 51]|
+--------------------+

I am trying the following UDF:

def distinct(arr: TraversableOnce[String])=arr.toList.distinct
val distinctUDF=udf(distinct(_:Traversable[String]))

But it only works when the rows are not null, i.e.

df.filter($"e".isNotNull).select(distinctUDF($"e"))

gives me

+----------------+
|          UDF(e)|
+----------------+
|            [73]|
|            [51]|
|[52, 53, 73, 84]|
|    [73, 72, 51]|
+----------------+

but

df.select(distinctUDF($"e"))

fails. How can I make the UDF handle null in this case? Alternatively, if there is a simpler way to get the unique values, I would like to try that.

scala

Source: https://stackoverflow.com/questions/52322922/get-distinct-elements-from-rows-of-type-arraytype-in-spark-dataframe-column

2 Answers


rmbxnbpk1#

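You can use when().otherwise() to apply your UDF only when the column value is not null. In this case, .otherwise(null) can also be omitted, because the result defaults to null when no otherwise clause is specified.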

val distinctUDF = udf( (s: Seq[String]) => s.distinct )

df.select(when($"e".isNotNull, distinctUDF($"e")).as("e"))
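As an alternative to guarding the call with when(), the null check could live inside the function itself. The following is only a minimal sketch in plain Scala: `NullSafeDistinct` and `distinctOrNull` are hypothetical names that do not appear in the question, and the Spark wiring is shown only in comments, assuming `org.apache.spark.sql.functions.udf` and `spark.implicits._` are in scope as they are in spark-shell.

```scala
object NullSafeDistinct {
  // Hypothetical helper: returns null for a null input instead of throwing,
  // otherwise removes duplicates while keeping first-occurrence order.
  def distinctOrNull(arr: Seq[String]): Seq[String] =
    Option(arr).map(_.distinct).orNull

  // In a Spark session this could be wrapped like the UDF in the answer
  // (assumption: functions.udf and the $"..." syntax are already imported):
  //   val distinctUDF = udf(distinctOrNull _)
  //   df.select(distinctUDF($"e").as("e"))
}
```

With this variant the null handling travels with the UDF, so callers do not need to remember the when() guard. The when() approach keeps the UDF simpler and avoids invoking it for null rows.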


xe55xuns2#

Two months after you asked this question, Spark 2.4.0 was released and introduced the function array_distinct, which does exactly what you want.
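A minimal sketch of how array_distinct might be applied to the DataFrame from the question, assuming Spark 2.4.0 or later is on the classpath. The object name `ArrayDistinctSketch`, the helper `dedupe`, and the local-mode session settings are illustrative assumptions, not part of the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_distinct, col}

object ArrayDistinctSketch {
  // Hypothetical helper: replaces column "e" with its de-duplicated version.
  // Rows where "e" is null stay null; no UDF is involved.
  def dedupe(df: DataFrame): DataFrame =
    df.select(array_distinct(col("e")).as("e"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-distinct-sketch")
      .getOrCreate()
    import spark.implicits._

    // A shortened version of the sample data from the question.
    val df = Seq(
      Seq("73", "73"),
      null,
      Seq("51"),
      Seq("52", "53", "53", "73", "84"),
      Seq("73", "72", "51", "73")
    ).toDF("e")

    dedupe(df).show()
    spark.stop()
  }
}
```

Because array_distinct is a built-in function, it handles null rows without a when() guard and avoids the serialization overhead of a Scala UDF.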
