How can I define a file naming convention for incoming files in Spark?

Posted by envsm3lx on 2021-05-29 in Hadoop

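I receive files in HDFS in real time, and they all follow the same naming convention:
id_name_…_timestamp
Can I define this naming convention in Spark (Scala) so that I can later compare files by their id?
Thank you.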

hadoop apache-spark Naming Convention

Source: https://stackoverflow.com/questions/51059113/how-can-i-define-a-file-naming-convention-of-incoming-files-in-spark

1 Answer

ybzsozfc1#

You can do it like this (Java API).

Register a UDF that strips the directory part of a path:

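import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

spark.udf()
  .register("get_only_file_name", (UDF1<String, String>) fullPath -> {
     // Keep only the text after the last '/' (the file name itself).
     int lastIndex = fullPath.lastIndexOf("/");
     return fullPath.substring(lastIndex + 1);
    }, DataTypes.StringType);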
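The trimming logic in the UDF body can be checked without a Spark session. The following is a minimal sketch in plain Java; the class name `PathNames`, the method name `fileNameOf`, and the sample paths are hypothetical and only serve the illustration.

```java
// Hypothetical helper that mirrors the body of the "get_only_file_name" UDF.
final class PathNames {
    private PathNames() {}

    // Returns everything after the last '/' in fullPath.
    // lastIndexOf returns -1 when there is no '/', so substring(0)
    // then yields the whole input unchanged.
    static String fileNameOf(String fullPath) {
        return fullPath.substring(fullPath.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Made-up sample paths, one with directories and one without.
        System.out.println(fileNameOf("hdfs://namenode:8020/landing/42_orders_20180627"));
        System.out.println(fileNameOf("42_orders_20180627"));
    }
}
```

Both calls print `42_orders_20180627`. Starting the substring at `lastIndex + 1` is what excludes the slash itself and keeps the final character of the name.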

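Import input_file_name and callUDF from org.apache.spark.sql.functions:

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.input_file_name;

Then use the UDF to keep only the last token (the file name) of the full path:

Dataset<Row> initialDs = spark.read()
  .option("dateFormat", conf.dateFormat)
  .schema(conf.schema)
  .csv(conf.path)
  // A UDF registered by name is invoked through callUDF.
  .withColumn("input_file_name", callUDF("get_only_file_name", input_file_name()));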
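Once the bare file name is available, the id still has to be pulled out of it for the later comparison. The sketch below is one possible way to do that in plain Java, not part of the original answer. It assumes the fields of `id_name_…_timestamp` are separated by `_`, that the id is the first field and the timestamp the last, and that the name carries no file extension. The class `IncomingFileName` and the sample names are hypothetical.

```java
import java.util.Optional;

// Hypothetical parser for names shaped like id_name_..._timestamp.
final class IncomingFileName {
    final String id;
    final String timestamp;

    private IncomingFileName(String id, String timestamp) {
        this.id = id;
        this.timestamp = timestamp;
    }

    // Assumption: '_' separates the fields, the first field is the id
    // and the last field is the timestamp. Names that do not have at
    // least these two non-empty fields are rejected with Optional.empty().
    static Optional<IncomingFileName> parse(String fileName) {
        String[] parts = fileName.split("_");
        if (parts.length < 2 || parts[0].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new IncomingFileName(parts[0], parts[parts.length - 1]));
    }

    public static void main(String[] args) {
        // Made-up sample name following the convention.
        parse("42_orders_eu_20180627")
            .ifPresent(f -> System.out.println(f.id + " " + f.timestamp));
        // A name without any separator does not match the convention.
        System.out.println(parse("noseparator").isPresent());
    }
}
```

If the real files carry an extension, it would need to be stripped before splitting. Inside Spark, the same logic could be registered as a second UDF, or the id could be taken with the built-in `split` function from `org.apache.spark.sql.functions` applied to the `input_file_name` column.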

Answered 2021-05-29
