Putting two DataFrames into one DataFrame as separate sub-columns in PySpark

wb1gzix0 asked on 2021-05-27 in Spark

I want to put two DataFrames into a single DataFrame so that each one becomes its own sub-column, not a join of the two. I have two DataFrames, stat1_df and stat2_df, which look like this:

root
 |-- max_scenes: integer (nullable = true)
 |-- median_scenes: double (nullable = false)
 |-- avg_scenes: double (nullable = true)

+----------+-------------+------------------+
|max_scenes|median_scenes|avg_scenes        |
+----------+-------------+------------------+
|97        |7.0          |10.806451612903226|
|97        |7.0          |10.806451612903226|
|97        |7.0          |10.806451612903226|
|97        |7.0          |10.806451612903226|
+----------+-------------+------------------+

root
 |-- max: double (nullable = true)
 |-- type: string (nullable = true)

+-----+-----------+
|max  |type       |
+-----+-----------+
|10.0 |small      |
|25.0 |medium     |
|50.0 |large      |
|250.0|extra_large|
+-----+-----------+

The result I would like is:

root
 |-- some_statistics1: struct (nullable = true)
 |    |-- max_scenes: integer (nullable = true)
 |    |-- median_scenes: double (nullable = false)
 |    |-- avg_scenes: double (nullable = true)
 |-- some_statistics2: struct (nullable = true)
 |    |-- max: double (nullable = true)
 |    |-- type: string (nullable = true)

Is there a way to combine the two like this? stat1_df and stat2_df are simple DataFrames, without arrays or nested columns. The final DataFrame is written to MongoDB. If there are any other questions, I'm happy to answer them here.

cidc1ykv

Check the following code.
Add an id column to both DataFrames, move all of each DataFrame's columns into a struct, and then join the two DataFrames on that id (a PySpark version of the same approach is sketched after the Scala transcript).

scala> val dfa = Seq(("10","8.9","7.9")).toDF("max_scenes","median_scenes","avg_scenes")
dfa: org.apache.spark.sql.DataFrame = [max_scenes: string, median_scenes: string ... 1 more field]

scala> dfa.show(false)
+----------+-------------+----------+
|max_scenes|median_scenes|avg_scenes|
+----------+-------------+----------+
|10        |8.9          |7.9       |
+----------+-------------+----------+

scala> dfa.printSchema
root
 |-- max_scenes: string (nullable = true)
 |-- median_scenes: string (nullable = true)
 |-- avg_scenes: string (nullable = true)

scala> val mdfa = dfa.select(struct($"*").as("some_statistics1")).withColumn("id",monotonically_increasing_id)
mdfa: org.apache.spark.sql.DataFrame = [some_statistics1: struct<max_scenes: string, median_scenes: string ... 1 more field>, id: bigint]

scala> mdfa.printSchema
root
 |-- some_statistics1: struct (nullable = false)
 |    |-- max_scenes: string (nullable = true)
 |    |-- median_scenes: string (nullable = true)
 |    |-- avg_scenes: string (nullable = true)
 |-- id: long (nullable = false)

scala> mdfa.show(false)
+----------------+---+
|some_statistics1|id |
+----------------+---+
|[10,8.9,7.9]    |0  |
+----------------+---+

scala> val dfb = Seq(("11.2","sample")).toDF("max","type")
dfb: org.apache.spark.sql.DataFrame = [max: string, type: string]

scala> dfb.printSchema
root
 |-- max: string (nullable = true)
 |-- type: string (nullable = true)

scala> dfb.show(false)
+----+------+
|max |type  |
+----+------+
|11.2|sample|
+----+------+

scala> val mdfb = dfb.select(struct($"*").as("some_statistics2")).withColumn("id",monotonically_increasing_id)
mdfb: org.apache.spark.sql.DataFrame = [some_statistics2: struct<max: string, type: string>, id: bigint]

scala> mdfb.printSchema
root
 |-- some_statistics2: struct (nullable = false)
 |    |-- max: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- id: long (nullable = false)

scala> mdfb.show(false)
+----------------+---+
|some_statistics2|id |
+----------------+---+
|[11.2,sample]   |0  |
+----------------+---+

scala> mdfa.join(mdfb,Seq("id"),"inner").drop("id").printSchema
root
 |-- some_statistics1: struct (nullable = false)
 |    |-- max_scenes: string (nullable = true)
 |    |-- median_scenes: string (nullable = true)
 |    |-- avg_scenes: string (nullable = true)
 |-- some_statistics2: struct (nullable = false)
 |    |-- max: string (nullable = true)
 |    |-- type: string (nullable = true)

scala> mdfa.join(mdfb,Seq("id"),"inner").drop("id").show(false)
+----------------+----------------+
|some_statistics1|some_statistics2|
+----------------+----------------+
|[10,8.9,7.9]    |[11.2,sample]   |
+----------------+----------------+
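Since the question is about PySpark, here is a minimal sketch of the same approach in Python: wrap all columns in a struct, add a synthetic id, and join on it. The DataFrame contents and the app name are hypothetical stand-ins for stat1_df and stat2_df from the question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import struct, monotonically_increasing_id

spark = SparkSession.builder.appName("nested-stats").getOrCreate()

# Hypothetical stand-ins for stat1_df and stat2_df from the question.
stat1_df = spark.createDataFrame(
    [(97, 7.0, 10.806451612903226)],
    ["max_scenes", "median_scenes", "avg_scenes"])
stat2_df = spark.createDataFrame(
    [(10.0, "small")],
    ["max", "type"])

# Wrap all columns of each DataFrame into one struct column,
# then add a synthetic id to pair the rows of the two frames.
mdf1 = (stat1_df
        .select(struct(*stat1_df.columns).alias("some_statistics1"))
        .withColumn("id", monotonically_increasing_id()))
mdf2 = (stat2_df
        .select(struct(*stat2_df.columns).alias("some_statistics2"))
        .withColumn("id", monotonically_increasing_id()))

result = mdf1.join(mdf2, ["id"], "inner").drop("id")
result.printSchema()
result.show(truncate=False)

One caveat: monotonically_increasing_id assigns ids per partition, so joining two independently built DataFrames on it is only reliable when both sides have matching partitioning (such as the single-row case above); for multi-row DataFrames, a row_number over an explicit ordering is a safer join key. The joined result can then be written to MongoDB with the MongoDB Spark connector.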
