Filter a DataFrame by values not present in a column of another DataFrame

Posted by vhipe2zx on 2021-05-27 in Spark


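This question already has answers here:

Filter Spark DataFrame based on another DataFrame that specifies denylist criteria (2 answers)
Closed 4 years ago.
I suspect the answer is simple. Given two DataFrames, I want to filter the first one, keeping only the rows whose value in one column does not appear in a column of the other DataFrame.
I would rather not resort to full Spark SQL, so I want to use only DataFrame.filter, Column.contains, the isin keyword, or the join method.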

val df1 = Seq(("Hampstead", "London"), 
              ("Spui", "Amsterdam"), 
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")

val res = df1.filter(df2("cities").contains("city") === false)
// doesn't work, nor do the 20 other variants I have tried

Does anyone have any ideas?

scala apache-spark apache-spark-sql spark-dataframe

Source: https://stackoverflow.com/questions/40244925/filter-dataframe-by-value-not-present-in-column-of-other-dataframe

2 Answers


23c0lvtd1#

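I found a simpler way to solve this: an anti-join can apparently be passed as the join type argument of the join method, although the Spark Scaladoc does not document it: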

import org.apache.spark.sql.functions._

val df1 = Seq(("Hampstead", "London"), 
              ("Spui", "Amsterdam"), 
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")

df1.join(df2, df1("city") === df2("cities"), "leftanti").show

Result:

+----------+-------+ 
|  location|   city| 
+----------+-------+ 
|Chittagong|Chennai| 
+----------+-------+

P.S. Thanks for the pointer to the duplicate; the question is now marked as such.


g6ll5ycj2#

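If you are trying to filter a DataFrame using another DataFrame, you should use join (or any of its variants). If what you need is to filter it using a List, or any data structure that fits in memory on your master and workers, you can broadcast it and then reference it inside the filter or where method.
For example, I would do something like this: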

import org.apache.spark.sql.functions._

val df1 = Seq(("Hampstead", "London"), 
              ("Spui", "Amsterdam"), 
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")

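// A full outer join keeps df1 rows that have no match in df2; for those rows "cities" is null
df2.join(df1, joinExprs=df1("city") === df2("cities"), joinType="full_outer")
   .select("city", "cities")
   .where(isnull($"cities"))   // keep only the cities that are absent from df2
   .drop("cities").show()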
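The list-based alternative mentioned above can be sketched as follows. This is only a minimal illustration, and it assumes the exclusion list in df2 is small enough to collect to the driver. The names `spark`, `excluded` and `res` and the local SparkSession setup are assumptions added for the example; in spark-shell the session and implicits already exist. The sketch uses `Column.isin`, which embeds the collected values in the filter expression, instead of an explicit broadcast variable.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumed local session for a standalone run; spark-shell already provides `spark`
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("filter-not-in")
  .getOrCreate()
import spark.implicits._

val df1 = Seq(("Hampstead", "London"),
              ("Spui", "Amsterdam"),
              ("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq("London", "Amsterdam", "New York").toDF("cities")

// Collect the exclusion values to the driver (assumes the list is small)
val excluded: Array[String] = df2.select("cities").as[String].collect()

// isin takes varargs, so expand the array; negate it to keep the non-matching rows
val res = df1.filter(!col("city").isin(excluded: _*))
res.show()
```

With the sample data, only the Chittagong / Chennai row should remain, and unlike the full outer join above, the location column is kept. For a large exclusion list, the leftanti join is likely the better choice, since it avoids pulling the values onto the driver.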
