Reading Parquet files with Spark using a wildcard

Posted by x33g5p2x on 2021-05-26 in Spark

I have many Parquet files in an S3 directory. The directory structure can differ by vid, like this:

bucketname/vid=123/year=2020/month=9/date=12/hf1hfw2he.parquet
bucketname/vid=456/year=2020/month=8/date=13/34jbj.parquet
bucketname/vid=876/year=2020/month=9/date=15/ghg76.parquet

I have a list containing all the vids:

vid_list = ['123','456','876']

How can I read all the files for month=9 in one go without running into performance problems?

current_month=9
temp_df = sqlContext.read.option("mergeSchema", "false").parquet('s3a://bucketname' + 'vid={}/year=2020/month={}/*/*.parquet'.format(*vid_list,current_month))

This gives me the error: Path does not exist: file:/Users/home/desktop/test1/vid=123/year=2020/month=456/*/*.parquet; Is there a way to do this efficiently?

apache-spark pyspark python-3.x

Source: https://stackoverflow.com/questions/64022667/reading-parquet-file-by-spark-using-wildcard

1 Answer

dohp0rv51#

Please try the following code:

# Hadoop-style globs use {a,b,c} for alternation; (a|b|c) is matched literally
vid_list = '{' + ','.join(['123','456','876']) + '}'
current_month = 9
temp_df = sqlContext.read.option("mergeSchema", "false").parquet('s3://bucketname/' + 'vid={}/year=2020/month={}/*/*.parquet'.format(vid_list, current_month))
# The path should look like: s3://bucketname/vid={123,456,876}/year=2020/month=9/*/*.parquet
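
If you would rather not depend on any alternation syntax in the glob, another option is to build one wildcard path per vid and hand them all to a single parquet call, which accepts several paths. This is only a sketch: paths_for_month and read_month are hypothetical helper names, spark stands for an existing SparkSession (the sqlContext from the question should work the same way), and the basePath option is included on the assumption that you want vid, year, month and date kept as partition columns.

```python
def paths_for_month(bucket, vids, year, month):
    """Build one wildcard path per vid for the given year and month."""
    template = "s3a://{bucket}/vid={vid}/year={year}/month={month}/*/*.parquet"
    return [
        template.format(bucket=bucket, vid=vid, year=year, month=month)
        for vid in vids
    ]


def read_month(spark, bucket, vids, year, month):
    """Read every matching file in one call; spark is an existing session."""
    paths = paths_for_month(bucket, vids, year, month)
    return (
        spark.read
        .option("mergeSchema", "false")
        # Assumption: basePath keeps vid/year/month/date as partition columns.
        .option("basePath", "s3a://{}/".format(bucket))
        .parquet(*paths)
    )


paths = paths_for_month("bucketname", ["123", "456", "876"], 2020, 9)
for p in paths:
    print(p)
```

Calling read_month(spark, "bucketname", vid_list, 2020, 9) would then issue one read over all three vid directories. A vid that has no data for that month may still raise a "Path does not exist" error, so you might need to filter the list first.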

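The bug in your code: the month value comes out as 456 when it should be 9. Calling format(*vid_list, current_month) unpacks the list, so the two placeholders receive '123' and '456' and the remaining arguments are ignored. That produces the path in your error: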

file:/Users/home/desktop/test1/vid=123/year=2020/month=456/*/*.parquet;
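
To see the formatting problem in isolation, here is a small sketch in plain Python with no Spark involved. The variable names template, named, broken and fixed are only illustrative; the values come from the question.

```python
vid_list = ["123", "456", "876"]
current_month = 9
template = "vid={}/year=2020/month={}/*/*.parquet"

# Unpacking the list passes four positional arguments; the two placeholders
# take the first two ("123" and "456") and the extra ones are ignored.
broken = template.format(*vid_list, current_month)

# Named placeholders make it explicit which value fills which slot.
named = "vid={vid}/year=2020/month={month}/*/*.parquet"
fixed = named.format(vid=vid_list[0], month=current_month)

print(broken)
print(fixed)
```

With named placeholders, a mismatch between the number of values and the number of slots raises a KeyError rather than silently producing the wrong path.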

Answered 2021-05-26
