在pyspark中将最后一位数字分隔到新列

ttvkxqim  于 2021-05-27  发布在  Spark
关注(0)|答案(3)|浏览(372)

我从Pypark开始。。。我有以下问题,我正在尝试将每个id的最后一个数字分隔成一个新列:

|---------------------|------------------|
|         Name        |      Number      |
|---------------------|------------------|
|          Bob        |       34789      |
|---------------------|------------------|

对此:

|---------------------|------------------|---------------------|
|         Name        |      Number_1    |      Number_2       |
|---------------------|------------------|---------------------|
|          Bob        |       3478       |          9          |
|---------------------|------------------|---------------------|

有什么建议吗?

mtb9vblg

mtb9vblg1#

使用 substring 函数提取值并创建新列。 Example: ```
df.show()

+----+------+

|Name|Number|

+----+------+

| Bob| 34789|

+----+------+

from pyspark.sql.functions import *
df.withColumn("number_1",expr("substring(Number,1,length(Number)-1)")).
withColumn("number_2",substring(col("Number"),-1,1)).
drop("Number").
show()

df.withColumn("number_1",expr("substring(Number,1,length(Number)-1)")).
withColumn("number_2",expr("substring(number,-1,1)")).
drop("Number").
show()

+----+--------+--------+

|Name|number_1|number_2|

+----+--------+--------+

| Bob| 3478| 9|

+----+--------+--------+

clj7thdc

clj7thdc2#

检查以下代码。 UDF 分割给定的数字

scala> val splitNumber = udf((d:String) => (d.init,d.last), new StructType().add("number_1","string",true).add("numer_2","string",true))

应用 UDF ```
scala>
df
.withColumn("number",splitNumber($"number"))
.select($"name",$"number.*")
.show(false)

+----+--------+-------+
|name|number_1|numer_2|
+----+--------+-------+
|Bob |3478 |9 |
+----+--------+-------+

bakd9h0s

bakd9h0s3#

看看这个。你可以先用 regexp_extract 获取最后一位以创建新列并使用 regexp_replace 将最后一个数字替换为空字符串“”。

from pyspark.sql import functions as F

    df.show()

    # +----+------+
    # |name|number|
    # +----+------+
    # | bob| 34789|
    # +----+------+

    df.withColumn("number_1",F.regexp_replace(F.col('number'),r'(\d$)','')).withColumn("number_2", F.regexp_extract(F.col('number'), r'(\d$)',1)).withColumn("number_1", F.regexp_replace(F.col('number'), r'(\d$)', '')).drop(F.col('number')).show()

    # +----+--------+--------+
    # |name|number_1|number_2|
    # +----+--------+--------+
    # | bob|    3478|       9|
    # +----+--------+--------+

相关问题