PySpark: nested DataFrame fields to a relational table

qyzbxkaa  posted on 2021-05-27  in  Spark

I have a PySpark DataFrame of students with the following schema:

 |-- Id: string
 |-- School: array
 |    |-- element: struct
 |    |    |-- Subject: string
 |    |    |-- Classes: string
 |    |    |-- Score: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- ScoreID: string
 |    |    |    |    |-- Value: string

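I want to extract some fields from the DataFrame and normalize them so they can be loaded into a database. The relational schema I expect consists of the fields Id, School, Subject, ScoreId, Value. How can I do this efficiently?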

qnakjoqk #1

Explode the arrays to flatten the data, then select all the required columns. Example (output is shown as comments):
```
df.show(10, False)
# +---+--------------------------+
# |Id |School                    |
# +---+--------------------------+
# |1  |[[b, [[A, 3], [B, 4]], a]]|
# +---+--------------------------+

df.printSchema()
# root
#  |-- Id: string (nullable = true)
#  |-- School: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- Classes: string (nullable = true)
#  |    |    |-- Score: array (nullable = true)
#  |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |-- ScoreID: string (nullable = true)
#  |    |    |    |    |-- Value: string (nullable = true)
#  |    |    |-- Subject: string (nullable = true)

# 1) explode School: one row per struct, in a column named "col"
# 2) expand that struct with col.* and explode its Score array
# 3) expand the exploded Score struct into ScoreID and Value
df.selectExpr("Id", "explode(School)").\
    selectExpr("Id", "col.*", "explode(col.Score)").\
    selectExpr("Id", "Classes", "Subject", "col.*").\
    show()
# +---+-------+-------+-------+-----+
# | Id|Classes|Subject|ScoreID|Value|
# +---+-------+-------+-------+-----+
# |  1|      b|      a|      A|    3|
# |  1|      b|      a|      B|    4|
# +---+-------+-------+-------+-----+
```
