Filtering rows in a PySpark DataFrame and creating a new column that contains the result

Posted by l2osamch on 2021-08-01 in Java


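I want to find the crimes that happened on Sundays within the boundary of downtown San Francisco. My idea was to first write a UDF that flags each crime according to whether it occurred inside the area I defined as downtown: the label is "1" if it did and "0" if it did not. After that, I tried to create a new column to store these results. I did my best to write everything I could, but for some reason it does not work. Here is the code I wrote: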

from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf

def filter_dt(x,y):
  if (((x < -122.4213) & (x > -122.4313)) & ((y > 37.7540) & (y < 37.7740))):
    return '1'
  else:
    return '0'

schema = StructType([StructField("isDT", BooleanType(), False)])
filter_dt_boolean = udf(lambda row: filter_dt(row[0], row[1]), schema)

# First, pick out the crime cases that happens on Sunday BooleanType()

q3_sunday = spark.sql("SELECT * FROM sf_crime WHERE DayOfWeek='Sunday'")

# Then, we add a new column for us to filter out(identify) if the crime is in DT

q3_final = q3_result.withColumn("isDT", filter_dt(q3_sunday.select('X'),q3_sunday.select('Y')))

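The error I get is: [picture of the error message]
My guess is that the UDF I am using does not accept whole columns as input for the comparison, but I do not know how to fix it so that it works. Please help! Thank you!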

sql pyspark user-defined-functions

Source: https://stackoverflow.com/questions/62647660/filtering-rows-in-pyspark-dataframe-and-creating-a-new-column-that-contains-the

2 Answers


hi3rlvi21#

Some sample data would help. For now, I assume your data looks like this:

+----+---+---+
|val1|  x|  y|
+----+---+---+
|  10|  7| 14|
|   5|  1|  4|
|   9|  8| 10|
|   2|  6| 90|
|   7|  2| 30|
|   3|  5| 11|
+----+---+---+

In that case no UDF is needed, because the result can be computed with the when() function:

import pyspark.sql.functions as F
tst = spark.createDataFrame([(10,7,14),(5,1,4),(9,8,10),(2,6,90),(7,2,30),(3,5,11)], schema=['val1','x','y'])
# Flag rows where 4 <= x <= 10 and 11 <= y <= 20 (between() is inclusive on both ends)
tst_res = tst.withColumn("isdt", F.when((tst.x.between(4,10)) & (tst.y.between(11,20)), 1).otherwise(0))

This gives the result:

tst_res.show()
+----+---+---+----+
|val1|  x|  y|isdt|
+----+---+---+----+
|  10|  7| 14|   1|
|   5|  1|  4|   0|
|   9|  8| 10|   0|
|   2|  6| 90|   0|
|   7|  2| 30|   0|
|   3|  5| 11|   1|
+----+---+---+----+

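If I have misunderstood your data and you still need to pass multiple values to a UDF, you have to pass them as an array or a struct. I prefer a struct: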

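import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def check_data(row):
    # row is the struct built below; same inclusive bounds as the when() version
    if 4 <= row.x <= 10 and 11 <= row.y <= 20:
        return 1
    else:
        return 0

# Pack x and y into one struct column so the UDF receives a single argument
tst_res1 = tst.withColumn("isdt", check_data(F.struct('x','y')))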

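The result is the same. However, it is better to avoid UDFs and use Spark's built-in functions, because Spark's Catalyst optimizer cannot see the logic inside a UDF and therefore cannot optimize it.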

2021-08-01

wljmcqd82#

Try changing the last line as follows:

from pyspark.sql.functions import col
q3_final = q3_result.withColumn("isDT", filter_dt(col('X'),col('Y')))

2021-08-01
