pyspark: handling nulls in a join

6yoyoihd, posted 2021-05-29 in Hadoop

I am trying to join two DataFrames in pyspark. My problem is that I want the "inner join" to match rows even when the join keys are null. I can see that in Scala there is a <=> operator for this, but <=> does not work in pyspark.

from pyspark.sql import Row

userLeft = sc.parallelize([
    Row(id=u'1',
        first_name=u'Steve',
        last_name=u'Kent',
        email=u's.kent@email.com'),
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace@email.com'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh@email.com')]).toDF()

userRight = sc.parallelize([
    Row(id=u'2',
        first_name=u'Margaret',
        last_name=u'Peace',
        email=u'marge.peace@email.com'),
    Row(id=u'3',
        first_name=None,
        last_name=u'hh',
        email=u'marge.hh@email.com')]).toDF()

Current working version:

userLeft.join(userRight, (userLeft.last_name == userRight.last_name) & (userLeft.first_name == userRight.first_name)).show()

Current result:

+--------------------+----------+---+---------+--------------------+----------+---+---------+
|               email|first_name| id|last_name|               email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
|marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+

Expected result:

+--------------------+----------+---+---------+--------------------+----------+---+---------+
|               email|first_name| id|last_name|               email|first_name| id|last_name|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
|  marge.hh@email.com|      null|  3|       hh|  marge.hh@email.com|      null|  3|       hh|
|marge.peace@email...|  Margaret|  2|    Peace|marge.peace@email...|  Margaret|  2|    Peace|
+--------------------+----------+---+---------+--------------------+----------+---+---------+
yquaqz18 1#

Use another value instead of null:

# Replace nulls with a placeholder so plain equality joins can match them
userLeft = userLeft.na.fill("unknown")
userRight = userRight.na.fill("unknown")

userLeft.join(userRight, ["last_name", "first_name"]).show()

+---------+----------+--------------------+---+--------------------+---+
|last_name|first_name|               email| id|               email| id|
+---------+----------+--------------------+---+--------------------+---+
|    Peace|  Margaret|marge.peace@email...|  2|marge.peace@email...|  2|
|       hh|   unknown|  marge.hh@email.com|  3|  marge.hh@email.com|  3|
+---------+----------+--------------------+---+--------------------+---+
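Note that the placeholder survives in the joined output (the "unknown" above). If the original nulls are needed downstream, one possible cleanup (just a sketch, and it assumes no real row legitimately contains the string "unknown") is:

import pyspark.sql.functions as F

joined = userLeft.join(userRight, ["last_name", "first_name"])

# Turn the placeholder back into a real null after the join
joined = joined.withColumn(
    "first_name",
    F.when(F.col("first_name") == "unknown", F.lit(None)).otherwise(F.col("first_name"))
)
joined.show()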
ygya80vv 2#

For pyspark < 2.3.0, you can still build the <=> operator with an expression column, like this:

import pyspark.sql.functions as F
df1.alias("df1").join(df2.alias("df2"), on = F.expr('df1.column <=> df2.column'))
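Applied to the DataFrames from the question (a sketch, assuming the userLeft / userRight frames defined above), that could look like:

import pyspark.sql.functions as F

# <=> is Spark SQL's null-safe equality: NULL <=> NULL evaluates to true
userLeft.alias("l").join(
    userRight.alias("r"),
    on=F.expr("l.first_name <=> r.first_name AND l.last_name <=> r.last_name")
).show()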

For pyspark >= 2.3.0, you can use Column.eqNullSafe, or IS NOT DISTINCT FROM in SQL, as answered here.
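A minimal sketch of the eqNullSafe variant, again using the question's DataFrames:

# Column.eqNullSafe matches null join keys against each other (available since Spark 2.3.0)
userLeft.join(
    userRight,
    userLeft["first_name"].eqNullSafe(userRight["first_name"])
    & userLeft["last_name"].eqNullSafe(userRight["last_name"])
).show()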
