Read all files residing in an S3 bucket inside a Hadoop reducer program on Amazon EMR

Posted by brc7rcf0 on 2021-06-02 in Hadoop


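I am new to Amazon EMR. I want to get a list of all files in the input directory and read them in a loop inside a Hadoop reducer program. On my local machine, the following code works fine on Hadoop (with the input files in HDFS). But when I deploy the same code on Amazon EMR (with the input files in an S3 bucket), it fails for reasons I cannot determine. I tried setting the path to s3://<bucket_name>/<input_dir> in the valPath variable, but that did not work either.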

Path valPath = new Path("hdfs:/input/" + dirName);
FileSystem fs = FileSystem.get(new Configuration()); // returns the configured default filesystem
FileStatus[] valFilePathList = fs.listStatus(valPath);
for (FileStatus file : valFilePathList) {
    // try-with-resources closes the reader after each file
    try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(file.getPath())))) {
        String line = br.readLine();
        while (line != null) {
            // <do something with line>
            line = br.readLine();
        }
    }
}

What is the correct way to access all files in an S3 bucket from a reducer program on EMR?
Update: I solved the problem by making a few changes to the code snippet above.

Path valPath = new Path("/input/<dirName>"); // Path inside the bucket, without the scheme or bucket name
FileSystem fs = FileSystem.get(URI.create("s3://<bucket-name>"), context.getConfiguration()); // Pass the bucket URI here
FileStatus[] valFilePathList = fs.listStatus(valPath); // Same as above
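
A likely explanation for why the update works: Hadoop picks the filesystem from the scheme and authority of the URI given to FileSystem.get, whereas FileSystem.get(new Configuration()) returns the cluster's default filesystem, which on EMR is typically HDFS rather than S3. The sketch below uses only java.net.URI (no Hadoop classes) to show how a full S3 location splits into those two parts. The class name S3UriParts, its helper methods, and the bucket and directory names are hypothetical and chosen only for illustration.

```java
import java.net.URI;

public class S3UriParts {
    // Returns the filesystem root (scheme + bucket), e.g. "s3://my-bucket".
    // getAuthority() is used rather than getHost() so that bucket names
    // that are not valid hostnames are still returned.
    public static URI bucketRoot(URI full) {
        return URI.create(full.getScheme() + "://" + full.getAuthority());
    }

    // Returns the path inside the bucket, e.g. "/input/validation".
    public static String pathInBucket(URI full) {
        return full.getPath();
    }

    public static void main(String[] args) {
        // Hypothetical bucket and directory names.
        URI full = URI.create("s3://my-bucket/input/validation");
        System.out.println(bucketRoot(full));   // prints s3://my-bucket
        System.out.println(pathInBucket(full)); // prints /input/validation
    }
}
```

In terms of the update above, the value from bucketRoot corresponds to the URI passed as the first argument of FileSystem.get, and the value from pathInBucket corresponds to the string given to new Path.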

hadoop mapreduce amazon-emr amazon-web-services emr

Source: https://stackoverflow.com/questions/46612860/read-all-files-residing-in-s3-bucket-inside-hadoop-reducer-program-from-amazon-e