Combining two columns from two DataFrames in PySpark

olmpazwi · posted 2021-05-27 in Spark

Suppose we have two DataFrames:

from pyspark.sql import Row

df1 = spark.createDataFrame([
    Row(a=107831, f="test1"),
    Row(a=125231, f=None),
])
df2 = spark.createDataFrame([
    Row(a=107831, f=None),
    Row(a=125231, f="test2"),
])

How can we combine these two DataFrames to obtain the following DataFrame?

df = spark.createDataFrame([
    Row(a=107831, f="test1"),
    Row(a=125231, f="test2"),
])

Answer 1 (myss37ts):

Join the two DataFrames on column `a`, then use the `coalesce` function.

from pyspark.sql.functions import coalesce

df1.alias("t1").join(df2.alias("t2"), ["a"], "inner") \
    .select("t1.a", coalesce("t1.f", "t2.f").alias("f")) \
    .show()

# +------+-----+
# |     a|    f|
# +------+-----+
# |107831|test1|
# |125231|test2|
# +------+-----+

Answer 2 (yh2wf1be):

I've been working with PySpark for a while; you can get what you want like this:

from pyspark.sql.functions import col, when

df3 = df1.join(df2, df1.a == df2.a).select(df1.a, df1.f.alias('d1f'), df2.f.alias('d2f'))

# build a new column conditionally select either df1.f or df2.f

df4 = df3.withColumn('f', when(col('d1f').isNull(), df3.d2f).otherwise(df3.d1f))

df4.show()
+------+-----+-----+-----+
|     a|  d1f|  d2f|    f|
+------+-----+-----+-----+
|107831|test1| null|test1|
|125231| null|test2|test2|
+------+-----+-----+-----+

# drop off the 2 temporary columns

df4 = df4.drop('d1f','d2f')

df4.show()
+------+-----+
|     a|    f|
+------+-----+
|107831|test1|
|125231|test2|
+------+-----+
