我有以下格式的sparkDataframe,其中 FamilyDetails
列是字符串字段:
root
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
|-- FamilyDetails: string (nullable = true)
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|Emma |Smith |{ |
| | | "23214598.31601190":{"gender":"F","Name":"Ms Olivia Smith","relationship":"Daughter"}, |
| | | "23214598.23214598":{"gender":"F","Name":"Ms Emma Smith","relationship":null} |
| | |} |
|Joe |Williams |{ |
| | | "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null}, |
| | | "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"} |
| | |} |
|Liam |Jones |{ |
| | | "2321464.12379942":{"gender":"F","Name":"Miss Patricia Jones","relationship":"Sister"}, |
| | | "2321464.2321464":{"gender":"M","Name":"Mr Liam Jones","relationship":null,"IsActive":"Y"} |
| | |} |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
我想做的是:
我在找那些家庭成员不活跃的记录( IsActive='N'
). 请注意 IsActive
是可选字段。
预期产量:
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+ |
|Joe |Williams |{ |
| | | "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null}, |
| | | "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"} |
| | |} |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
到目前为止,我尝试了:
因为完整的模式是未知的,所以我尝试从 FamilyDetails
列本身。
import org.apache.spark.sql.functions._
import spark.implicits._
val json_schema = spark.read.json(myDF.select("FamilyDetails").as[String]).schema
println(json_schema)
这给了我:
StructType(
StructField(23214598.31601190,
StructType(
StructField(gender,StringType,true),
StructField(Name,StringType,true),
StructField(relationship,StringType,true),
StructField(IsActive,StringType,true)
)
,true
)
)
如何去掉第一个值(2321463.2321463),只获取json模式中的必需字段?或者有没有更简单的方法来过滤记录 IsActive = 'N'
?
1条答案
按热度按时间icnyk63a1#
也许您可以通过简单地查找字符串来避免解析json
"IsActive":"N"
:要进行更严格的解析,可以使用: