Usage and code examples of the org.jsoup.parser.Parser.htmlParser() method

Reposted by x33g5p2x on 2022-01-26, filed under Other

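This article collects code examples of the Java method org.jsoup.parser.Parser.htmlParser() and shows how Parser.htmlParser() is used in practice. The examples come mainly from platforms such as GitHub, Stack Overflow, and Maven, and were extracted from selected projects, so they should serve as a useful reference. Details of the Parser.htmlParser() method are as follows:
Package: org.jsoup.parser
Class name: Parser
Method name: htmlParser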

Introduction to Parser.htmlParser

Create a new HTML parser. This parser treats input as HTML5, and enforces the creation of a normalised document, based on a knowledge of the semantics of the incoming tags.
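To make "normalised document" concrete, here is a minimal sketch, assuming jsoup is on the classpath. The class name `HtmlParserNormalisation` and the helper `parse` are hypothetical names used only for illustration. The input has no `<html>`, `<head>`, or `<body>` tags and leaves both paragraphs unclosed, yet the parsed tree contains the full skeleton.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class HtmlParserNormalisation {

    /** Parses markup with the HTML parser (hypothetical helper for this sketch). */
    static Document parse(String html) {
        // The second argument is the base URI used to resolve relative links; empty here.
        return Parser.htmlParser().parseInput(html, "");
    }

    public static void main(String[] args) {
        Document doc = parse("<p>Hello<p>World");

        // The parser adds the missing html/head/body structure.
        System.out.println(doc.child(0).tagName());

        // Each unclosed <p> is closed implicitly, giving two paragraphs in the body.
        System.out.println(doc.body().select("p").size());

        // Print the normalised markup.
        System.out.println(doc.outerHtml());
    }
}
```

The exact whitespace of `outerHtml()` depends on the jsoup version and output settings, so it is printed here rather than compared against a fixed string.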

Code examples

Code example source: origin: org.jsoup/jsoup

/**
 * Loads a file to a Document.
 * @param in file to load
 * @param charsetName character set of input
 * @param baseUri base URI of document, to resolve relative links against
 * @return Document
 * @throws IOException on IO error
 */
public static Document load(File in, String charsetName, String baseUri) throws IOException {
  return parseInputStream(new FileInputStream(in), charsetName, baseUri, Parser.htmlParser());
}

Code example source: origin: org.jsoup/jsoup

/**
 * Parses a Document from an input stream.
 * @param in input stream to parse. You will need to close it.
 * @param charsetName character set of input
 * @param baseUri base URI of document, to resolve relative links against
 * @return Document
 * @throws IOException on IO error
 */
public static Document load(InputStream in, String charsetName, String baseUri) throws IOException {
  return parseInputStream(in, charsetName, baseUri, Parser.htmlParser());
}
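The two loaders above are internal jsoup helpers. From application code, the public `Jsoup.parse(InputStream, String, String, Parser)` overload takes the same arguments. The sketch below assumes jsoup is on the classpath and uses hypothetical names (`StreamParseDemo`, `parseBytes`, `firstLink`) and a made-up example URL to show what the base URI is for: resolving relative links.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class StreamParseDemo {

    /** Parses UTF-8 bytes with the HTML parser (hypothetical helper for this sketch). */
    static Document parseBytes(byte[] bytes, String baseUri) {
        // jsoup does not close the stream, so try-with-resources does it here.
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            return Jsoup.parse(in, "UTF-8", baseUri, Parser.htmlParser());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the absolute URL of the first link in the markup. */
    static String firstLink(String html, String baseUri) {
        Document doc = parseBytes(html.getBytes(StandardCharsets.UTF_8), baseUri);
        return doc.select("a").first().absUrl("href");
    }

    public static void main(String[] args) {
        // The relative href is resolved against the base URI passed to the parser.
        System.out.println(firstLink("<a href='page.html'>next</a>", "https://example.com/docs/"));
    }
}
```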

Code example source: origin: org.jsoup/jsoup

Request() {
  timeoutMilliseconds = 30000; // 30 seconds
  maxBodySizeBytes = 1024 * 1024; // 1MB
  followRedirects = true;
  data = new ArrayList<>();
  method = Method.GET;
  addHeader("Accept-Encoding", "gzip");
  addHeader(USER_AGENT, DEFAULT_UA);
  parser = Parser.htmlParser();
}

Code example source: origin: com.vaadin/vaadin-server

/**
 * Parses the given input stream into a jsoup document
 *
 * @param html
 *            the stream containing the design
 * @return the parsed jsoup document
 * @throws DesignException if the stream cannot be read or parsed
 */
private static Document parse(InputStream html) {
  try {
    Document doc = Jsoup.parse(html, UTF_8.name(), "",
        Parser.htmlParser());
    return doc;
  } catch (IOException e) {
    throw new DesignException("The html document cannot be parsed.");
  }
}

Code example source: origin: rakam-io/rakam

Document parse = Jsoup.parse(content, "", Parser.htmlParser());

Code example source: origin: fivesmallq/web-data-extractor

/**
 * Changes the parser to the HTML parser.
 *
 * @return
 */
public SelectorExtractor htmlParser() {
  this.parser = Parser.htmlParser();
  return this;
}

Code example source: origin: com.norconex.collectors/norconex-importer

/**
 * Gets the JSoup parser associated with the string representation.
 * The string "xml" (case insensitive) will return the XML parser.  
 * Anything else will return the HTML parser. 
 * @param parser "html" or "xml"
 * @return JSoup parser
 * @since 2.8.0
 */
public static Parser toJSoupParser(String parser) {
  if ("xml".equalsIgnoreCase(parser)) {
    return Parser.xmlParser();
  }
  return Parser.htmlParser();
}
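The practical difference between the two parsers selected above can be seen by feeding both the same markup. This is a minimal sketch, assuming jsoup is on the classpath. `ParserChoiceDemo`, `chooseParser`, and `rootTag` are hypothetical names, and `chooseParser` only mirrors the selection logic of the snippet above rather than calling the Norconex class.

```java
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class ParserChoiceDemo {

    /** Hypothetical stand-in for the selection logic shown above. */
    static Parser chooseParser(String name) {
        return "xml".equalsIgnoreCase(name) ? Parser.xmlParser() : Parser.htmlParser();
    }

    /** Parses the markup with the chosen parser and returns the tag name of the root element. */
    static String rootTag(String parserName, String markup) {
        Document doc = chooseParser(parserName).parseInput(markup, "");
        return doc.child(0).tagName();
    }

    public static void main(String[] args) {
        String markup = "<item>text</item>";

        // The HTML parser wraps the input in an html/head/body skeleton.
        System.out.println(rootTag("html", markup));

        // The XML parser keeps the tree as written, so the root is <item>.
        System.out.println(rootTag("xml", markup));
    }
}
```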

Code example source: origin: abola/CrawlerPack

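/**
 * Converts HTML into a Jsoup Document object.
 *
 * HTML content is parsed with Jsoup's native HTML parser.
 *
 * @param html Html document
 * @return org.jsoup.nodes.Document
 */
public org.jsoup.nodes.Document htmlToJsoupDoc(String html){
  // Convert html (HTML/HTML5) into a jsoup Document object.
  // Note: in Jsoup.parse(String, String, Parser) the second argument is the base URI, not a charset.
  Document jsoupDoc = Jsoup.parse(html, "UTF-8", Parser.htmlParser() );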
  jsoupDoc.charset(StandardCharsets.UTF_8);
  return jsoupDoc;
}

Code example source: origin: addthis/hydra

Parser parser = Parser.htmlParser().setTrackErrors(0);
@Nonnull Document doc = parser.parseInput(html, "");
@Nonnull Elements tags = doc.select(tagName);
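The snippet above passes 0 to `setTrackErrors`, which turns error tracking off. As a contrast, the following minimal sketch, assuming jsoup is on the classpath, enables tracking and reads the collected errors back with `getErrors()`. `ParseErrorDemo` and `countErrors` are hypothetical names, and the malformed input is only an illustration.

```java
import java.util.List;

import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

public class ParseErrorDemo {

    /** Parses the markup and returns how many parse errors were recorded (hypothetical helper). */
    static int countErrors(String html, int maxErrors) {
        // maxErrors caps the number of errors retained; 0 disables tracking.
        Parser parser = Parser.htmlParser().setTrackErrors(maxErrors);
        parser.parseInput(html, "");
        List<ParseError> errors = parser.getErrors();
        for (ParseError error : errors) {
            System.out.println(error);
        }
        return errors.size();
    }

    public static void main(String[] args) {
        // Attributes on an end tag, an unknown entity, and an unterminated tag.
        String malformed = "<p>One</p href='no'>&arrgh;<font /><br /><foo";

        System.out.println("tracked: " + countErrors(malformed, 100));
        System.out.println("untracked: " + countErrors(malformed, 0));
    }
}
```

The exact number and wording of the reported errors varies between jsoup versions, so the sketch prints them instead of assuming a fixed count.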

Code example source: origin: org.apache.any23/apache-any23-core

return Jsoup.parse(input, encoding, documentIRI, Parser.htmlParser());

Code example source: origin: DigitalPebble/storm-crawler

/**
 * Attempt to find a META tag in the HTML that hints at the character set
 * used to write the document.
 */
private static String getCharsetFromMeta(byte buffer[], int maxlength) {
  // convert to UTF-8 String -- which hopefully will not mess up the
  // characters we're interested in...
  int len = buffer.length;
  if (maxlength > 0 && maxlength < len) {
    len = maxlength;
  }
  String html = new String(buffer, 0, len, DEFAULT_CHARSET);
  Document doc = Parser.htmlParser().parseInput(html, "dummy");
  // look for <meta http-equiv="Content-Type"
  // content="text/html;charset=gb2312"> or HTML5 <meta charset="gb2312">
  Elements metaElements = doc
      .select("meta[http-equiv=content-type], meta[charset]");
  String foundCharset = null;
  for (Element meta : metaElements) {
    if (meta.hasAttr("http-equiv"))
      foundCharset = getCharsetFromContentType(meta.attr("content"));
    if (foundCharset == null && meta.hasAttr("charset"))
      foundCharset = meta.attr("charset");
    if (foundCharset != null)
      return foundCharset;
  }
  return foundCharset;
}

Code example source: origin: DigitalPebble/storm-crawler

.decode(ByteBuffer.wrap(content)).toString();
jsoupDoc = Parser.htmlParser().parseInput(html, url);

Code example source: origin: DigitalPebble/storm-crawler

@Test
public void testExclusionCase() throws IOException {
  Config conf = new Config();
  conf.put(TextExtractor.EXCLUDE_PARAM_NAME, "style");
  TextExtractor extractor = new TextExtractor(conf);
  String content = "<html>the<STYLE>main</STYLE>content of the page</html>";
  Document jsoupDoc = Parser.htmlParser().parseInput(content,
      "http://stormcrawler.net");
  String text = extractor.text(jsoupDoc.body());
  assertEquals("the content of the page", text);
}

Code example source: origin: DigitalPebble/storm-crawler

@Test
public void testMainContent() throws IOException {
  Config conf = new Config();
  conf.put(TextExtractor.INCLUDE_PARAM_NAME, "DIV[id=\"maincontent\"]");
  TextExtractor extractor = new TextExtractor(conf);
  String content = "<html>the<div id='maincontent'>main<div>content</div></div>of the page</html>";
  Document jsoupDoc = Parser.htmlParser().parseInput(content,
      "http://stormcrawler.net");
  String text = extractor.text(jsoupDoc.body());
  assertEquals("main content", text);
}

Code example source: origin: DigitalPebble/storm-crawler

@Test
public void testExclusion() throws IOException {
  Config conf = new Config();
  conf.put(TextExtractor.EXCLUDE_PARAM_NAME, "STYLE");
  TextExtractor extractor = new TextExtractor(conf);
  String content = "<html>the<style>main</style>content of the page</html>";
  Document jsoupDoc = Parser.htmlParser().parseInput(content,
      "http://stormcrawler.net");
  String text = extractor.text(jsoupDoc.body());
  assertEquals("the content of the page", text);
}
