Is there a Spark DataFrame equivalent of applymap?

6ojccjat  posted 2021-05-27 in Spark
Follow (0) | Answers (1) | Views (378)

The code below was written for a pandas DataFrame. Due to memory issues I have to move to PySpark, so I need to convert this code to run against a Spark DataFrame. I tried running it directly, but it raises an error. What is the equivalent of the code below in PySpark?

def units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

sets = df.applymap(units)

Here is the error I get:

AttributeErrorTraceback (most recent call last)
<ipython-input-20-7e54b4e7a7e7> in <module>()
----> 1 sets = pivoted.applymap(units)

/usr/lib/spark/python/pyspark/sql/dataframe.py in __getattr__(self, name)
   1180         if name not in self.columns:
   1181             raise AttributeError(
-> 1182                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
   1183         jc = self._jdf.apply(name)
   1184         return Column(jc)

AttributeError: 'DataFrame' object has no attribute 'applymap'

eit6fx6z 1#

You can wrap the units function in a UDF:

from pyspark.sql.types import LongType
from pyspark.sql.functions import udf, col

def units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

# No need to wrap units in a lambda; pass the function directly.
units_udf = udf(units, LongType())

df = spark.createDataFrame([(-1,), (0,), (1,), (2,)], ['id'])

df.show()
+---+                                                                           
| id|
+---+
| -1|
|  0|
|  1|
|  2|
+---+

sets = df.withColumn("id", units_udf(col("id")))
sets.show()
+---+
| id|
+---+
|  0|
|  0|
|  1|
|  1|
+---+
