Change schema of Spark DataFrame column

Posted by 5hcedyr0 on 2021-05-29 in Spark

I have a PySpark DataFrame with a column named "Student".
One entry of the data looks like this:

{
   "Student" : {
       "m" : {
           "name" : {"s" : "john"},
           "score": {"s" : "165"}
       }
   }
}

I want to change the schema of this column so that an entry looks like this:

{
    "Student" : 
    {
        "m" : 
        {
            "StudentDetails" : 
            {
                "m" : 
                {
                    "name" : {"s" : "john"},
                    "score": {"s" : "165"}
                }
            }
        }
    } 
}

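The problem is that the Student field can also be null in the DataFrame. I want to keep the null values but change the schema of the non-null values. I used a UDF for this, and it works.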

def Helper_ChangeSchema(row):
    # null check
    if row is None:
        return None
    # change schema
    data = row.asDict(True)
    return {"m": {"StudentDetails": data}}

But a UDF is a black box to Spark. Is there a way to do the same thing with built-in Spark functions or a SQL query?

python DataFrame apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/62249074/change-schema-of-spark-dataframe-column

1 Answer

91zkwejq1#

This works exactly as in this answer. You just add another nesting level to the struct.
As a SQL expression:

from pyspark.sql import functions as F

processedDf = df.withColumn("student", F.expr("named_struct('m', named_struct('student_details', student))"))

Or in Python code using the struct function:

processedDf = df.withColumn("student", F.struct(F.struct(F.col("student").alias("student_details")).alias("m")))

Both versions produce the same result:

root
 |-- student: struct (nullable = false)
 |    |-- m: struct (nullable = false)
 |    |    |-- student_details: struct (nullable = true)
 |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |-- name: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)
 |    |    |    |    |-- score: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)

Both approaches also work for null rows. Using this input data

data ='{"student" : {"m" : {"name" : {"s" : "john"},"score": {"s" : "165"}}}}'
data2='{"student": null }'
df = spark.read.json(sc.parallelize([data, data2]))
`processedDf.show(truncate=False)` prints

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|[[]]                 |
+---------------------+
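The non-null result for the null input row can be surprising, so here is a minimal plain-Python analogy of why it happens. It uses ordinary dicts rather than Spark structs, and `wrap_student` is a hypothetical helper for illustration only, not part of the answer's code.

```python
def wrap_student(value):
    """Mimic named_struct('m', named_struct('student_details', student)) on plain values."""
    # The wrapper is built unconditionally, so a None input ends up nested
    # inside it instead of replacing the whole result.
    return {"m": {"student_details": value}}


john = {"m": {"name": {"s": "john"}, "score": {"s": "165"}}}

print(wrap_student(john))
print(wrap_student(None))  # {'m': {'student_details': None}}
```

The second call returns a wrapper whose innermost field is `None`, while the wrapper itself is not `None`. This mirrors the null input row above, where the outer structs exist and only the innermost field is null.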

Edit: if the whole row should be set to null instead of the fields of the struct, you can add a when:

processedDf = df.withColumn("student", F.when(F.col("student").isNull(), F.lit(None)).otherwise(F.struct(F.struct(F.col("student").alias("student_details")).alias("m"))))

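This results in the same schema (apart from the top-level student column becoming nullable), but a different output for the null row: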

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|null                 |
+---------------------+
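Since the question also asks about a SQL query, the null-preserving variant could be sketched as a full query over a temporary view. This is a sketch under assumptions: the view name `students` and the helper `wrap_with_sql` are hypothetical, and the query selects only the `student` column, so any other columns would need to be listed explicitly.

```python
# Hypothetical view name; any unused temp-view name would do.
VIEW_NAME = "students"

# CASE WHEN keeps null rows as null and wraps everything else in two structs.
WRAP_QUERY = f"""
SELECT CASE
         WHEN student IS NULL THEN NULL
         ELSE named_struct('m', named_struct('student_details', student))
       END AS student
FROM {VIEW_NAME}
"""


def wrap_with_sql(spark, df):
    """Register df as a temp view and apply the wrapping as a SQL query."""
    df.createOrReplaceTempView(VIEW_NAME)
    return spark.sql(WRAP_QUERY)
```

With the `spark` session and `df` from above, `processedDf = wrap_with_sql(spark, df)` should behave like the `F.when(...).otherwise(...)` version.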
