Convert a full file path into multiple rows of parent absolute paths in PySpark

mf98qq94  posted on 2021-07-13  in Spark

In a PySpark DataFrame, I want to convert a string column holding a full file path into multiple rows, one for each parent path.
Input DataFrame value:

ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt

Output: each row should show one absolute path, with a trailing / separator on every directory:

ParentFolder/
ParentFolder/Folder1/
ParentFolder/Folder1/Folder2/
ParentFolder/Folder1/Folder2/Folder3/
ParentFolder/Folder1/Folder2/Folder3/Folder4/
ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt

Answer #1 (l0oc07j2)

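You can split the column value on the / separator to get all parts of the path. Then, on Spark 2.4+, apply the transform function to the resulting array and build each sub-path with the slice and array_join functions: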

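from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumed setup (not part of the original answer): a one-row DataFrame with a "value" column.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt",)], ["value"]
)

# Split the path into its parts, build one (rn, path) struct per prefix, then explode to rows.
df1 = df.withColumn("value", F.split(F.col("value"), "/")) \
    .selectExpr("""
      explode(
           transform(value, 
                     (x, i) -> struct(i+1 as rn, array_join(slice(value, 1, i+1), '/') ||
                                      IF(i+1 < size(value), '/', '') as path)
                     )
       ) as paths
    """).select("paths.*")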

df1.show(truncate=False)

# +---+---------------------------------------------------------+
# |rn |path                                                     |
# +---+---------------------------------------------------------+
# |1  |ParentFolder/                                            |
# |2  |ParentFolder/Folder1/                                    |
# |3  |ParentFolder/Folder1/Folder2/                            |
# |4  |ParentFolder/Folder1/Folder2/Folder3/                    |
# |5  |ParentFolder/Folder1/Folder2/Folder3/Folder4/            |
# |6  |ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt|
# +---+---------------------------------------------------------+
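
To check the lambda's logic without a Spark session, the split / slice / array_join steps can be mirrored in plain Python. This is only an illustrative sketch: prefix_paths is a hypothetical helper name, not part of the answer or of PySpark.

```python
def prefix_paths(full_path: str, sep: str = "/") -> list:
    """Plain-Python mirror of the split / slice / array_join steps (hypothetical helper)."""
    parts = full_path.split(sep)           # like split(value, '/')
    rows = []
    for i in range(len(parts)):            # i plays the role of the 0-based index in transform
        prefix = sep.join(parts[: i + 1])  # like array_join(slice(value, 1, i+1), '/')
        if i + 1 < len(parts):             # every row except the last gets a trailing separator
            prefix += sep
        rows.append((i + 1, prefix))       # (rn, path)
    return rows


for rn, path in prefix_paths("ParentFolder/Folder1/TestFile.txt"):
    print(rn, path)
```

The last element is left without a trailing separator because it is the file itself, which is what the IF(i+1 < size(value), '/', '') expression does in the SQL version.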

For Spark < 2.4, you can use a UDF like this:

import os
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def get_all_paths(path: str):
    paths = [path]
    for _ in range(path.count("/")):
        path, base = os.path.split(path)
        paths.append(path + "/")

    return list(reversed(paths))

decompose_path = F.udf(get_all_paths, ArrayType(StringType()))

df1 = df.select(F.explode(decompose_path(F.col("value"))).alias("paths"))
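
Before wrapping a helper like this in a UDF, it can be exercised as an ordinary Python function. The sketch below is a hypothetical variant (get_all_paths_posix is an assumed name) that uses posixpath instead of os.path, so that / is treated as the separator regardless of the operating system the code runs on; the logic is otherwise the same as in the answer.

```python
import posixpath


def get_all_paths_posix(path: str) -> list:
    """Hypothetical variant of get_all_paths that always splits on '/'."""
    paths = [path]
    for _ in range(path.count("/")):
        # Drop the last component and keep the remaining parent directory.
        path, _base = posixpath.split(path)
        paths.append(path + "/")
    # Parents were collected deepest-first, so reverse to get the shortest prefix first.
    return list(reversed(paths))


for p in get_all_paths_posix("ParentFolder/Folder1/Folder2/TestFile.txt"):
    print(p)
```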

Answer #2 (3xiyfsfu)

You can use substring_index as follows (this example assumes the path column is named col):

df2 = df.selectExpr("""
    explode(
        transform(
            sequence(1, size(split(col, '/'))),
            (x, i) -> case when i = size(split(col, '/')) - 1
                           then col
                           else substring_index(col, '/', x) || '/'
                           end
        )
    ) as col
""")

df2.show(20,0)
+---------------------------------------------------------+
|col                                                      |
+---------------------------------------------------------+
|ParentFolder/                                            |
|ParentFolder/Folder1/                                    |
|ParentFolder/Folder1/Folder2/                            |
|ParentFolder/Folder1/Folder2/Folder3/                    |
|ParentFolder/Folder1/Folder2/Folder3/Folder4/            |
|ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt|
+---------------------------------------------------------+
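
The substring_index approach can be traced in plain Python as well. In this sketch, substring_index is a rough emulation that only covers positive counts, and parent_paths is a hypothetical helper that follows the same case expression as the query.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Rough emulation of Spark's substring_index, positive counts only (assumption)."""
    return delim.join(s.split(delim)[:count])


def parent_paths(col: str) -> list:
    """Hypothetical helper that mirrors the transform lambda in the query."""
    n = len(col.split("/"))                  # size(split(col, '/'))
    out = []
    for i, x in enumerate(range(1, n + 1)):  # sequence(1, n) together with its 0-based index i
        if i == n - 1:
            out.append(col)                  # last element: the full path, unchanged
        else:
            out.append(substring_index(col, "/", x) + "/")
    return out


for p in parent_paths("ParentFolder/Folder1/TestFile.txt"):
    print(p)
```

Note that in this sketch an input that already ends in / produces its final row twice, because the trailing separator yields an empty last part; whether that matters depends on your data.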
