How to extract text from PDFs using a Pig UDF and Apache Tika?

Posted by yx2lnoni on 2021-06-25 in Pig


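I am trying to write a Pig eval function (UDF) that uses Apache Tika to extract text from PDF files. However, whenever I run the function, it writes only 0 or 1 bytes to the output. How can I fix the code?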

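import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExtractTextFromPDFs extends EvalFunc<String> {

  @Override
  public String exec(Tuple input) throws IOException {
      // Nothing to parse: return a placeholder value
      if (input == null || input.size() == 0 || input.get(0) == null) {
          return "N/A";
      }

      DataByteArray dba = (DataByteArray) input.get(0);
      // -1 disables BodyContentHandler's default 100,000-character write limit
      ContentHandler contenthandler = new BodyContentHandler(-1);
      Metadata metadata = new Metadata();
      Parser pdfparser = new AutoDetectParser();

      // try-with-resources closes the input stream even if parsing fails
      try (InputStream is = new ByteArrayInputStream(dba.get())) {
        pdfparser.parse(is, contenthandler, metadata, new ParseContext());
      } catch (SAXException | TikaException e) {
        // Report parse failures to Pig instead of silently returning partial text
        throw new IOException("Tika failed to parse the input", e);
      }
      return contenthandler.toString();
  }
}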

I run the code with 'c = FOREACH b GENERATE ExtractTextFromPDFs(content);', where b is the PDF relation and content is a bytearray.
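
One way to narrow the problem down is to check what the UDF actually receives. The sketch below is only a diagnostic and not a confirmed fix: it assumes the near-empty output might come from the bytearray not holding a whole PDF (for example, if the loader splits the file into records). PdfBytesCheck, looksLikePdf and describe are hypothetical names introduced for illustration; the code uses only the Java standard library.

```java
import java.nio.charset.StandardCharsets;

public class PdfBytesCheck {
    // PDF files normally begin with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data starts with the PDF header marker. */
    public static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Builds a one-line summary that can be logged from inside exec(). */
    public static String describe(byte[] data) {
        int length = (data == null) ? 0 : data.length;
        return "bytes=" + length + ", looksLikePdf=" + looksLikePdf(data);
    }

    public static void main(String[] args) {
        // A made-up header fragment, not a real PDF document.
        byte[] sample = "%PDF-1.4\n...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(describe(sample));
        // A single byte, similar to what a record-splitting loader might deliver.
        System.out.println(describe(new byte[] {0x25}));
    }
}
```

Logging PdfBytesCheck.describe(dba.get()) at the top of exec() would show whether each tuple carries a full file. If the reported sizes are tiny or the header check fails, the issue is more likely in how b is loaded than in the Tika call.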

apache-pig apache-tika

Source: https://stackoverflow.com/questions/26965078/how-to-extract-text-from-pdfs-using-a-pig-udf-and-apache-tika