I have a DataFrame with two columns. I want to remove the first inner array from the nested array in each record. For example, I have a df like this:
+---+------------------------------------------+
|id |arrayField                                |
+---+------------------------------------------+
|1  |[[Akash,Kunal],[Sonu,Monu],[Ravi,Kishan]] |
|2  |[[Kunal, Mrinal],[Priya,Diya]]            |
|3  |[[Adi,Sadi]]                              |
+---+------------------------------------------+
I want my output to look like this:
+---+----------------------------+
|id |arrayField                  |
+---+----------------------------+
|1  |[[Sonu,Monu],[Ravi,Kishan]] |
|2  |[[Priya,Diya]]              |
|3  |null                        |
+---+----------------------------+
1 Answer
From Spark 2.4 onward, use the `slice` function. Example:
```scala
import org.apache.spark.sql.functions._

df.show(10, false)
/*
+------------------------+
|arrayField              |
+------------------------+
|[[A, k], [s, m], [R, k]]|
|[[k, M], [c, z]]        |
|[[A, b]]                |
+------------------------+
*/

// slice is 1-based: start at the 2nd element and take up to size(arrayField) elements.
// A row with a single inner array yields an empty array, which is then mapped to null.
df.withColumn("sliced", expr("slice(arrayField, 2, size(arrayField))")).
  withColumn("arrayField", when(size(col("sliced")) === 0, lit(null)).otherwise(col("sliced"))).
  drop("sliced").
  show()
/*
+----------------+
|      arrayField|
+----------------+
|[[s, m], [R, k]]|
|        [[c, z]]|
|            null|
+----------------+
*/
```
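For anyone who wants to try this end to end, here is a minimal self-contained sketch that builds the sample data from the question and applies the same `slice` idea. The object name `DropFirstInnerArray`, the helper `dropFirstInnerArray`, and the local-mode `SparkSession` settings are assumptions made for illustration and are not part of the original answer. It also uses a slight variation: a single `when` without `otherwise`, which returns null whenever the condition does not hold, so no temporary column is needed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical wrapper object, named here only for illustration.
object DropFirstInnerArray {

  // Keeps everything after the first inner array.
  // Rows with fewer than two inner arrays fail the condition, and
  // `when` without `otherwise` yields null for them.
  def dropFirstInnerArray(df: DataFrame): DataFrame =
    df.withColumn(
      "arrayField",
      when(size(col("arrayField")) > 1, expr("slice(arrayField, 2, size(arrayField))"))
    )

  def main(args: Array[String]): Unit = {
    // Assumed local session settings; adjust for a real cluster.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("drop-first-inner-array")
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the question.
    val df = Seq(
      (1, Seq(Seq("Akash", "Kunal"), Seq("Sonu", "Monu"), Seq("Ravi", "Kishan"))),
      (2, Seq(Seq("Kunal", "Mrinal"), Seq("Priya", "Diya"))),
      (3, Seq(Seq("Adi", "Sadi")))
    ).toDF("id", "arrayField")

    dropFirstInnerArray(df).show(false)
    spark.stop()
  }
}
```

With the question's data, ids 1 and 2 should keep their remaining inner arrays and id 3 should come out as null. Checking `size(...) > 1` up front means the null case is handled in the same expression that does the slicing.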