Hadoop reasoning for the Semantic Web in distributed systems

Posted by ccgok5k5 on 2021-06-04 in Hadoop

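I want to use the Web-scale Parallel Inference Engine (WebPIE) reasoner on the Hadoop platform. I have set up a Hadoop cluster on two Ubuntu virtual machines, and it runs well. When I try to reason over RDF files with WebPIE, the process fails because it requires the sequence file format. The WebPIE tutorial does not mention the sequence file format as a prerequisite for reasoning on Hadoop. To produce sequence files, I wrote the following code: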

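import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class SequenceFileGenerator {

    public static void main(String[] args) {

        Configuration conf = new Configuration();

        File outputDirectory = new File("output");
        File inputDirectory = new File("input");
        File[] files = inputDirectory.listFiles();
        if (files == null) {
            System.err.println("Input directory not found: " + inputDirectory.getAbsolutePath());
            return;
        }

        for (File inputFile : files) {
            if (!inputFile.isFile()) {
                continue; // skip sub-directories
            }
            try {
                // Input: read the whole file. A single InputStream.read(byte[]) call
                // is not guaranteed to fill the buffer, so readAllBytes is used instead.
                byte[] content = Files.readAllBytes(inputFile.toPath());

                Text key = new Text(inputFile.getName());
                BytesWritable value = new BytesWritable(content);

                // Output: one sequence file per input file
                Path outputPath = new Path(outputDirectory.getAbsolutePath() + "/" + inputFile.getName());
                FileSystem hdfs = outputPath.getFileSystem(conf);

                // The stream is closed per file. Previously only the last writer was
                // closed, so earlier output files could be left without their final block.
                try (FSDataOutputStream dos = hdfs.create(outputPath)) {
                    SequenceFile.Writer swriter = SequenceFile.createWriter(conf, dos, Text.class,
                            BytesWritable.class, SequenceFile.CompressionType.BLOCK, new DefaultCodec());
                    try {
                        swriter.append(key, value);
                    } finally {
                        // Flushes buffered blocks into dos; it does not close dos itself
                        swriter.close();
                    }
                }
            } catch (IOException e) {
                System.err.println("Failed to convert " + inputFile + ": " + e.getMessage());
            }
        }
    }
}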

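This code produces correct sequence files for some RDF files, but it does not work 100% reliably and sometimes produces corrupted files. Is there a way to avoid needing this code in the first place? If not, how can I improve it so that it works correctly with any RDF file as input?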

hadoop rdf reasoning

Source: https://stackoverflow.com/questions/14441660/reasoning-of-semantic-web-in-distributed-systems

2 Answers

hrirmatl1#

The input data must consist of gzip-compressed files of triples in N-Triples format (triplePart1.gz, triplePart2.gz, ...). So, for example, we have input_triples.tar.gz, which contains the compressed N-Triples files (triplePart1.gz, triplePart2.gz, ...).
Uncompress the tar file and copy the contents to HDFS:
---/hadoop$ tar zxvf /tmp/input_triples.tar.gz /tmp/input_triples
---/hadoop$ bin/hadoop fs -copyFromLocal /tmp/input-files /input
Compress the input data:
---/hadoop$ bin/hadoop jar webpie.jar jobs.FilesImportTriples /input /tmp /pool --maptasks 4 --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000
Reasoning:
---/hadoop$ bin/hadoop jar webpie.jar jobs.Reasoner /pool --fragment owl --rulesStrategy fixed --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000
To be continued here :-)
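
If the triples are in a single plain N-Triples file, the input layout described above (several gzip-compressed N-Triples files) can be produced with standard Java alone. The following is only a minimal sketch under assumptions: the class name `NTriplesSplitter` and the method `split` are hypothetical and not part of WebPIE, the input is assumed to hold one triple per line, and lines are distributed round-robin so the parts end up with similar sizes.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of WebPIE.
public class NTriplesSplitter {

    // Distributes the non-empty lines of one N-Triples file round-robin over
    // `parts` gzip files named triplePart1.gz, triplePart2.gz, ...
    public static List<Path> split(Path input, Path outputDir, int parts) throws IOException {
        Files.createDirectories(outputDir);
        List<Path> outputs = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        try {
            for (int i = 1; i <= parts; i++) {
                Path out = outputDir.resolve("triplePart" + i + ".gz");
                outputs.add(out);
                writers.add(new BufferedWriter(new OutputStreamWriter(
                        new GZIPOutputStream(Files.newOutputStream(out)), StandardCharsets.UTF_8)));
            }
            try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                long n = 0;
                while ((line = reader.readLine()) != null) {
                    if (line.trim().isEmpty()) {
                        continue; // skip blank lines
                    }
                    BufferedWriter w = writers.get((int) (n++ % parts));
                    w.write(line);
                    w.write('\n');
                }
            }
        } finally {
            // Closing each writer also finishes the gzip stream (writes its trailer).
            for (BufferedWriter w : writers) {
                w.close();
            }
        }
        return outputs;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: NTriplesSplitter <input.nt> <outputDir> <parts>");
            return;
        }
        split(Paths.get(args[0]), Paths.get(args[1]), Integer.parseInt(args[2]));
    }
}
```

The resulting directory can then be copied to HDFS with the `-copyFromLocal` command shown above.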

2021-06-04

irtuqstp2#

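The tutorial is based on running WebPIE on Amazon EC2, so there may be some differences in configuration. Note, however, that according to the tutorial the input is not plain RDF files but "gzipped compressed files of triples in N-Triples format" (emphasis in the original):
Before we launch the reasoner, we need to upload the input data to the HDFS filesystem and compress it in an appropriate format. The input data must consist of gzipped compressed files of triples in N-Triples format. Try to keep files to similar sizes and have more files than CPU cores, since each file will be processed by one machine.
The second section of that tutorial, "2nd step: upload input data on the cluster", describes how to actually get the data into the system, and it looks like it should apply to Amazon EC2 as well as to your own Hadoop installation. I don't want to simply quote that whole section here, but the sequence of commands they give is: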

$ ./cmd-hadoop-cluster login webpie
$ hadoop fs -ls /
$ hadoop fs -mkdir /input
$ ./cmd-hadoop-cluster push webpie input_triples.tar.gz

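That only gets the data into HDFS, though. In "3rd step: compress the input data":
The reasoner works with the data in a compressed format. We compress the data with the following command: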

hadoop jar webpie.jar jobs.FilesImportTriples /input /tmp /pool --maptasks 4 --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000

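… The above command can be read as: launch the compression and split the job between 4 map tasks and 2 reduce tasks, sample the input using 10% of the data, and mark as popular all the resources that appear more than 1000 times in this sample.
After this job is finished, we have the compressed input data in the directory /pool and we can proceed to reasoning.
The remaining sections cover reasoning, getting the data back out, and so on, which I expect won't be a problem once you have the data in.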
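
To make the two sampling options in the quoted explanation more concrete, here is a small in-memory illustration of the idea: sample a percentage of the triples and mark every term that occurs more than a threshold number of times in that sample as popular. This is only a sketch of the concept, not WebPIE's actual implementation; the class `PopularResourceSampler`, the method `findPopular`, and the representation of a triple as a `String[]` are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Hypothetical illustration, not WebPIE code.
public class PopularResourceSampler {

    // Samples roughly `samplingPercentage` percent of the triples and returns every
    // term that occurs more than `samplingThreshold` times within that sample.
    public static Set<String> findPopular(List<String[]> triples, int samplingPercentage,
                                          int samplingThreshold, Random random) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] triple : triples) {
            if (random.nextInt(100) >= samplingPercentage) {
                continue; // this triple is not part of the sample
            }
            for (String term : triple) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        Set<String> popular = new HashSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > samplingThreshold) {
                popular.add(entry.getKey());
            }
        }
        return popular;
    }
}
```

With `--samplingPercentage 10 --samplingThreshold 1000`, the corresponding call in this sketch would be `findPopular(triples, 10, 1000, new Random())`.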

2021-06-04
