PySpark: using the map function to apply to a list of columns

Posted by j8yoct9x on 2021-05-27 in Spark


The list below contains some of the column names in the DataFrame `df`:

```python
stringList = ['A', 'B', 'C']
```

I want to count the distinct values in each of these columns. I came across the code below, but it does not seem to work.

    from pyspark.sql.functions import col, countDistinct

    distinctList = []

    def countDistinctCats(colName):
        # Count the distinct values in one column and record the result
        count = df.agg(countDistinct(colName)).collect()
        distinctList.append(count)

    # Apply function on every column
    map(countDistinctCats, stringList)
    print(distinctList)

However, the following two approaches do seem to work:

    result = map(lambda x: df.agg(countDistinct(col(x))).collect(), stringList)
    print(list(result))

This approach is very slow compared with the following one:

    display(df.agg(*(countDistinct(col(c)).alias(c) for c in stringList)))

Why doesn't the first code block work?

python apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/63394880/pyspark-using-map-function-to-apply-to-list-of-columns

1 Answer


k2fxgqgv1#

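To answer your question: why doesn't the first block run?
According to the docs (https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html#pyspark.sql.functions.countdistinct), countDistinct expects a Column, not a string.
Your code block df.agg(countDistinct(colName)) passes a string to it. Because this is Python, something like this is not caught at compile time, and you get an exception at run time instead.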
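
One further point can be checked without Spark and may be relevant here. In Python 3, `map()` returns a lazy iterator, so a bare `map(countDistinctCats, stringList)` whose result is never consumed does not call the function at all, and `distinctList` stays empty. The documented examples for `countDistinct` also pass plain column-name strings, so the argument type alone may not account for the behavior. The sketch below is a plain-Python stand-in, not the original code: `record` is a hypothetical placeholder for `countDistinctCats`, and it appends a lowercase string instead of running a Spark aggregation.

```python
string_list = ['A', 'B', 'C']
distinct_list = []

def record(col_name):
    # Hypothetical placeholder for df.agg(countDistinct(col_name)).collect()
    distinct_list.append(col_name.lower())

# In Python 3, map() only builds a lazy iterator; record() has not run yet.
lazy = map(record, string_list)
after_map = list(distinct_list)       # snapshot taken before consuming

# Consuming the iterator is what actually calls record() for each element.
for _ in lazy:
    pass
after_consume = list(distinct_list)   # snapshot taken after consuming

print(after_map)       # []
print(after_consume)   # ['a', 'b', 'c']
```

This matches the observation in the question that the `lambda` version works: there the result of `map` is passed to `list(...)`, which forces every call. For a function that is called only for its side effects, a plain `for` loop over `stringList` is usually the clearer choice.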

2021-05-27
