How to read an InputStream using the Hadoop Parquet reader without saving the file

Posted by im9ewurl on 2023-03-29 in Hadoop


In a Spring Boot application, I am using Hadoop to read a Parquet file from an Amazon S3 bucket. After obtaining the target file as an InputStream, I want to read it. Here is my code:

var s3="s3a://bucketX/file.parquet";
Path s3Path = new Path(s3);

Configuration configuration = new Configuration();
configuration.set("fs.s3a.aws.credentials.profileName", "profileX"); // profileX has permission to read the file
configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");

configuration.set("fs.s3a.endpoint", "s3-eu-west-3.amazonaws.com"); 
configuration.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider"); 

var s3fs=new S3AFileSystem();
s3fs.initialize(new URI(s3), configuration);
InputStream s3InputStream = s3fs.open(new Path(s3));

Here is my pom.xml configuration:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.3.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.3.1</version>
</dependency>

<dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.12.3</version>
</dependency>

ParquetFileReader needs a HadoopInputFile as input. How can I convert the InputStream into a HadoopInputFile?

ParquetFileReader reader = ParquetFileReader.open(convertToHadoopInputStream(s3InputStream));


Source: https://stackoverflow.com/questions/75854129/how-to-read-inputstream-using-hadoop-parquet-reader-without-saving-the-file

1 Answer


#1 ygya80vv

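fs.s3a.aws.credentials.profileName is not a valid s3a option. All properties for that connector are lowercase. FWIW, there is no lowercase equivalent either.
fs.s3a.impl is Stack Overflow superstition. Using it means you haven't looked at the Hadoop s3a documentation, which is where you should start when configuring the connector, *not out-of-date SO posts*.
1. There is no need to open the file yourself. Use the ParquetFileReader(Configuration conf, Path file, MetadataFilter filter) constructor: give it the relevant Hadoop configuration, the Hadoop Path, and a filter (probably NO_FILTER), and let it do the work.
2. And in future, use FileSystem.get(URI, Configuration) (or Path.getFileSystem(Configuration)) to create and initialize an s3a instance; it also caches instances.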
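The advice above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name S3ParquetExample and the helpers buildConfiguration and readRecordCount are hypothetical names introduced for illustration, the bucket and endpoint are the ones from the question, and it assumes the configured ProfileCredentialsProvider can resolve credentials without the invalid profileName property. As far as I can tell, the three-argument constructor is marked deprecated in parquet-hadoop 1.12.x; ParquetFileReader.open(HadoopInputFile.fromPath(path, conf)) is the other route, and it yields exactly the HadoopInputFile the question asks for without ever touching an InputStream.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;

// Hypothetical example class; names are illustrative, not from the original post.
public class S3ParquetExample {

    // Same settings as the question, minus fs.s3a.impl and the invalid
    // fs.s3a.aws.credentials.profileName property.
    static Configuration buildConfiguration() {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.endpoint", "s3-eu-west-3.amazonaws.com");
        conf.set("fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.profile.ProfileCredentialsProvider");
        return conf;
    }

    // Lets ParquetFileReader open the file itself from a Path and a Configuration.
    static long readRecordCount(Path path, Configuration conf) throws IOException {
        try (ParquetFileReader reader =
                new ParquetFileReader(conf, path, ParquetMetadataConverter.NO_FILTER)) {
            // Print the schema stored in the file footer.
            System.out.println(reader.getFooter().getFileMetaData().getSchema());
            // Total number of records across all row groups.
            return reader.getRecordCount();
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = new Path("s3a://bucketX/file.parquet"); // bucket and key from the question
        Configuration conf = buildConfiguration();

        // Obtain the (cached) filesystem for this path instead of calling
        // new S3AFileSystem() and initialize() by hand.
        FileSystem fs = path.getFileSystem(conf);
        System.out.println("File length: " + fs.getFileStatus(path).getLen());

        System.out.println("Records: " + readRecordCount(path, conf));
    }
}
```

Running main requires real S3 access; buildConfiguration can be exercised on its own without any network call.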

Answered 2023-03-29
