scala—基于sparkDataframe中字符串列中的json数据的过滤器

zrfyljdw  于 2021-07-09  发布在  Spark
关注(0)|答案(1)|浏览(268)

我有以下格式的sparkDataframe,其中 FamilyDetails 列是字符串字段:

root
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- FamilyDetails: string (nullable = true)

+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails                                                                                                             |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|Emma      |Smith    |{                                                                                                                         |
|          |         | "23214598.31601190":{"gender":"F","Name":"Ms Olivia Smith","relationship":"Daughter"},                                   |
|          |         | "23214598.23214598":{"gender":"F","Name":"Ms Emma Smith","relationship":null}                                            |
|          |         |}                                                                                                                         |
|Joe       |Williams |{                                                                                                                         |
|          |         |  "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null},                                          |
|          |         |  "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}                      |
|          |         |}                                                                                                                         |
|Liam      |Jones    |{                                                                                                                         |
|          |         |  "2321464.12379942":{"gender":"F","Name":"Miss Patricia Jones","relationship":"Sister"},                                 |
|          |         |  "2321464.2321464":{"gender":"M","Name":"Mr Liam Jones","relationship":null,"IsActive":"Y"}                              |
|          |         |}                                                                                                                         |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+

我想做的是:
我在找那些家庭成员不活跃的记录( IsActive='N' ). 请注意 IsActive 是可选字段。
预期产量:

+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails                                                                                                             |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+                                                                                                                   |
|Joe       |Williams |{                                                                                                                         |
|          |         |  "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null},                                          |
|          |         |  "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}                      |
|          |         |}                                                                                                                         |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+

到目前为止,我尝试了:
因为完整的模式是未知的,所以我尝试从 FamilyDetails 列本身。

import org.apache.spark.sql.functions._
import spark.implicits._
val json_schema = spark.read.json(myDF.select("FamilyDetails").as[String]).schema
println(json_schema)

这给了我:

StructType(
    StructField(23214598.31601190,
        StructType(
            StructField(gender,StringType,true), 
            StructField(Name,StringType,true), 
            StructField(relationship,StringType,true), 
            StructField(IsActive,StringType,true)
        )
        ,true
    )
)

如何去掉第一个值(2321463.2321463),只获取json模式中的必需字段?或者有没有更简单的方法来过滤记录 IsActive = 'N' ?

icnyk63a

icnyk63a1#

也许您可以通过简单地查找字符串来避免解析json "IsActive":"N" :

val df2 = df.filter("""FamilyDetails rlike '"IsActive":"N"'""")

要进行更严格的解析,可以使用:

val df2 = df.filter("exists(map_values(from_json(FamilyDetails, 'map<string,map<string,string>>')), x -> x['IsActive'] = 'N')")

相关问题