Usage and code examples for the org.apache.tika.Tika.parseToString() method

x33g5p2x, reposted on 2022-01-29 under Other

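This article collects code examples for the Java method org.apache.tika.Tika.parseToString() and shows how Tika.parseToString() is used in practice. The examples come mainly from GitHub, Stack Overflow, and Maven, and were extracted from selected projects, so they should serve as useful references. Details of the method:
Package path: org.apache.tika.Tika
Class name: Tika
Method name: parseToString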

Introduction to Tika.parseToString

Parses the given file and returns the extracted text content.

To avoid unpredictable excess memory use, the returned string contains at most the first getMaxStringLength() characters extracted from the input document. Use the setMaxStringLength(int) method to adjust this limit.
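Because the length cap truncates silently, a caller may want to notice when a result has probably been cut short. The following is a minimal sketch using only the JDK, under two assumptions: a truncated result is exactly as long as the cap, and a negative cap means that no limit applies. `TruncationCheck` and `possiblyTruncated` are hypothetical names and are not part of Tika.

```java
/** Hypothetical helper, not part of Tika. */
final class TruncationCheck {

    private TruncationCheck() {
    }

    /**
     * Returns true when the text is at least as long as the cap,
     * which suggests (but does not prove) that it was truncated.
     */
    static boolean possiblyTruncated(String text, int maxStringLength) {
        // Assumption: a negative cap means "no limit", so nothing was cut off
        if (maxStringLength < 0) {
            return false;
        }
        return text.length() >= maxStringLength;
    }

    public static void main(String[] args) {
        int cap = 10; // stand-in for the value of tika.getMaxStringLength()
        System.out.println(possiblyTruncated("0123456789", cap)); // true
        System.out.println(possiblyTruncated("short", cap));      // false
    }
}
```

With a real Tika instance the call would look like possiblyTruncated(tika.parseToString(file), tika.getMaxStringLength()). A document whose text happens to be exactly as long as the cap is reported as well, so the result is best treated as a hint.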

Code examples

Code example source: apache/tika

import java.io.File;

import org.apache.tika.Tika;

public class SimpleTextExtractor {

  public static void main(String[] args) throws Exception {
    // Create a Tika instance with the default configuration
    Tika tika = new Tika();

    // Parse all given files and print out the extracted
    // text content
    for (String file : args) {
      String text = tika.parseToString(new File(file));
      System.out.print(text);
    }
  }
}

Code example source: apache/tika

public static String parseToStringExample() throws Exception {
  File document = new File("example.doc");
  String content = new Tika().parseToString(document);
  System.out.print(content);
  return content;
}

Code example source: apache/tika

/**
 * Example of how to use Tika's parseToString method to parse the content of a file,
 * and return any text found.
 * <p>
 * Note: Tika.parseToString() will extract content from the outer container
 * document and any embedded/attached documents.
 *
 * @return The content of a file.
 */
public String parseToStringExample() throws IOException, SAXException, TikaException {
  Tika tika = new Tika();
  try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
    return tika.parseToString(stream);
  }
}

Code example source: apache/tika

// `tika` and `writer` are fields of the enclosing class, which is not shown here;
// `writer` is presumably a Lucene IndexWriter.
public void indexDocument(File file) throws Exception {
  Document document = new Document();
  document.add(new TextField("filename", file.getName(), Store.YES));
  document.add(new TextField("fulltext", tika.parseToString(file), Store.NO));
  writer.addDocument(document);
}

Code example source: apache/tika

/**
 * Parses the given document and returns the extracted text content.
 * The given input stream is closed by this method.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 * <p>
 * <strong>NOTE:</strong> Unlike most other Tika methods that take an
 * {@link InputStream}, this method will close the given stream for
 * you as a convenience. With other methods you are still responsible
 * for closing the stream or a wrapper instance returned by Tika.
 *
 * @param stream the document to be parsed
 * @return extracted text content
 * @throws IOException if the document can not be read
 * @throws TikaException if the document can not be parsed
 */
public String parseToString(InputStream stream)
    throws IOException, TikaException {
  return parseToString(stream, new Metadata());
}
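
Since this overload closes the stream it is given, a caller that must keep the underlying stream open (for example while iterating over the entries of an archive) can hand over a wrapper whose close() does nothing. The following is a minimal sketch using only the JDK; `NonClosingInputStream` is a hypothetical name and not a Tika class, and the try-with-resources block merely stands in for a consumer that closes its argument.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper, not part of Tika: forwards reads but ignores close(). */
final class NonClosingInputStream extends FilterInputStream {

    NonClosingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public void close() {
        // Intentionally empty: the owner of the wrapped stream closes it later
    }

    public static void main(String[] args) throws IOException {
        boolean[] closed = {false};
        // Source stream that records whether close() was called on it
        InputStream source = new ByteArrayInputStream(new byte[] {42}) {
            @Override
            public void close() {
                closed[0] = true;
            }
        };
        // Stand-in for a consumer that closes the stream it receives
        try (InputStream shielded = new NonClosingInputStream(source)) {
            shielded.read();
        }
        System.out.println(closed[0]); // false: the source was left open
    }
}
```

With Tika the call would look like tika.parseToString(new NonClosingInputStream(stream)). Apache Commons IO ships a comparable ready-made wrapper, CloseShieldInputStream, which may be preferable to a hand-written class.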

Code example source: apache/tika

/**
 * Parses the file at the given path and returns the extracted text content.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 *
 * @param path the path of the file to be parsed
 * @return extracted text content
 * @throws IOException if the file can not be read
 * @throws TikaException if the file can not be parsed
 */
public String parseToString(Path path) throws IOException, TikaException {
  Metadata metadata = new Metadata();
  InputStream stream = TikaInputStream.get(path, metadata);
  return parseToString(stream, metadata);
}

Code example source: apache/tika

/**
 * Parses the resource at the given URL and returns the extracted
 * text content.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 *
 * @param url the URL of the resource to be parsed
 * @return extracted text content
 * @throws IOException if the resource can not be read
 * @throws TikaException if the resource can not be parsed
 */
public String parseToString(URL url) throws IOException, TikaException {
  Metadata metadata = new Metadata();
  InputStream stream = TikaInputStream.get(url, metadata);
  return parseToString(stream, metadata);
}

Code example source: rnewson/couchdb-lucene

public void parse(final InputStream in, final String contentType, final String fieldName, final Document doc)
    throws IOException {
  final Metadata md = new Metadata();
  md.set(HttpHeaders.CONTENT_TYPE, contentType);
  try {
    // Add body text.
    doc.add(text(fieldName, tika.parseToString(in, md), false));
  } catch (final IOException e) {
    log.warn("Failed to index an attachment.", e);
    return;
  } catch (final TikaException e) {
    log.warn("Failed to parse an attachment.", e);
    return;
  }
  // Add DC attributes.
  addDublinCoreAttributes(md, doc);
}

Code example source: apache/tika

/**
 * Parses the given file and returns the extracted text content.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 *
 * @param file the file to be parsed
 * @return extracted text content
 * @throws IOException if the file can not be read
 * @throws TikaException if the file can not be parsed
 * @see #parseToString(Path)
 */
public String parseToString(File file) throws IOException, TikaException {
  Metadata metadata = new Metadata();
  @SuppressWarnings("deprecation")
  InputStream stream = TikaInputStream.get(file, metadata);
  return parseToString(stream, metadata);
}

Code example source: apache/tika

public TrecDocument summarize(File file) throws FileNotFoundException,
    IOException, TikaException {
  Tika tika = new Tika();
  Metadata met = new Metadata();
  String contents = tika.parseToString(new FileInputStream(file), met);
  return new TrecDocument(met.get(TikaCoreProperties.RESOURCE_NAME_KEY), contents,
      met.getDate(TikaCoreProperties.CREATED));
}

Code example source: stackoverflow.com

private void compareXlsx(File expected, File result) throws IOException, TikaException {
  Tika tika = new Tika();
  String expectedText = tika.parseToString(expected);
  String resultText = tika.parseToString(result);
  assertEquals(expectedText, resultText);
}

The accompanying Maven dependency, scoped to tests:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.13</version>
  <scope>test</scope>
</dependency>

Code example source: org.onehippo.cms7/hippo-cms-api

private String doParse(final InputStream inputStream) {
  try {
    // tika parseToString already closes the inputStream
    return tika.parseToString(inputStream);
  } catch (TikaException e) {
    throw new IllegalStateException("Unexpected TikaException processing failure", e);
  } catch (IOException e) {
    throw new IllegalStateException("Unexpected IOException processing failure", e);
  }
}

Code example source: stackoverflow.com

public String parseToStringExample() throws IOException, SAXException, TikaException {
  Tika tika = new Tika();
  try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
    return tika.parseToString(stream); // Returns the text of the PDF
  }
}

Code example source: stackoverflow.com

File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);

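Code example source: stackoverflow.com

Tika tika = new Tika();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, "myfile.name");
// Pass the metadata along with a stream: parseToString(File) takes no
// Metadata argument, so the object built above would otherwise go unused.
String text = tika.parseToString(new FileInputStream("myfile.name"), metadata);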

Code example source: org.xwiki.platform/xwiki-platform-search-lucene-api

private String getContentAsText(XWikiDocument doc, XWikiContext context) {
  String contentText = null;

  try {
    XWikiAttachment att = doc.getAttachment(this.filename);

    LOGGER.debug("Start parsing attachment [{}] in document [{}]", this.filename, doc.getDocumentReference());

    Tika tika = new Tika();

    // The resource name gives Tika a file-name hint for type detection
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, this.filename);

    contentText = StringUtils.lowerCase(tika.parseToString(att.getContentInputStream(context), metadata));
  } catch (Throwable ex) {
    LOGGER.warn("error getting content of attachment [{}] for document [{}]",
        new Object[] {this.filename, doc.getDocumentReference(), ex});
  }

  return contentText;
}
