Unmarshaller with custom unescaping of some symbols (mixed UTF-8 and UTF-16)

qyzbxkaa  posted on 2021-07-09  in  Java

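I have an XML file in which some symbols are encoded incorrectly because UTF-16 and UTF-8 were mixed.
For example, the 📞 symbol is encoded as &#55357;&#56542; (the UTF-16 surrogate pair U+D83D U+DCDE) instead of &#128222; (the code point U+1F4DE).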
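To make the relationship between the two encodings concrete, here is a minimal sketch of the arithmetic. The class and method names (SurrogatePairDemo, combine) are made up for illustration; only the numeric values come from the example above.

```java
public class SurrogatePairDemo {

    /**
     * Combines the values of two numeric character references that form a
     * UTF-16 surrogate pair into the single code point they represent.
     */
    static int combine(int high, int low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return Character.toCodePoint((char) high, (char) low);
    }

    public static void main(String[] args) {
        int codePoint = combine(55357, 56542);
        System.out.println(codePoint);                        // prints 128222
        System.out.println(String.format("U+%X", codePoint)); // prints U+1F4DE
    }
}
```

So the broken reference pair and the correct single reference name the same character; only the first form is rejected by an XML parser.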
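I want to unmarshal this XML file, but the unmarshaller fails when it encounters these incorrect symbols. If I first unescape them with StringEscapeUtils#unescapeHtml4 (or StringEscapeUtils#unescapeXml), everything works fine.
But I don't want to read the XML into a String, decode it, and only then unmarshal it.
How can I do the same thing during unmarshalling, without first reading the XML file into a String?
I created a simple test to reproduce this: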

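// Imports are assumed (the post does not show them): JAXB 2.x (javax.xml.bind),
// JUnit 4, and StringEscapeUtils from Apache Commons Text.
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import org.apache.commons.text.StringEscapeUtils;
import org.junit.Test;

public class XmlReaderTest {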

    private static final Pattern HTML_UNICODE_REGEX = Pattern.compile("&#[a-zA-Z0-9]+;&#[a-zA-Z0-9]+;");

    @Test
    public void test() throws Exception {
        final Unmarshaller unmarshaller = JAXBContext.newInstance(Value.class).createUnmarshaller();
        final XMLInputFactory factory = createXmlInputFactory();

        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><value><name>&#55357;&#56542; &amp; &#128222; O&#771;</name></value>";

        XMLEventReader xmlReader = factory.createXMLEventReader(new StringReader(decodeHtmlEntities(xml)));
        Value result = (Value)unmarshaller.unmarshal(xmlReader);
        // O&#771; decodes to 'O' followed by the combining tilde U+0303
        assert result.name.equals("\uD83D\uDCDE & \uD83D\uDCDE O\u0303");

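        // Same XML without the pre-processing step: surrogate values such as
        // &#55357; are not legal XML character references, so parsing fails.
        XMLEventReader xmlReader2 = factory.createXMLEventReader(new StringReader(xml));
        Value result2 = (Value)unmarshaller.unmarshal(xmlReader2); // ! exception
        assert result2.name.equals("\uD83D\uDCDE & \uD83D\uDCDE O\u0303");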
    }

    @XmlRootElement(name = "value")
    private static class Value {
        @XmlElement
        public String name;
    }

    private String decodeHtmlEntities(String readerString) {
        StringBuffer unescapedString = new StringBuffer();

        Matcher regexMatcher = HTML_UNICODE_REGEX.matcher(readerString);
        while (regexMatcher.find()) {
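            // quoteReplacement keeps '$' or '\' in the unescaped text from being read as group references
            regexMatcher.appendReplacement(unescapedString,
                    Matcher.quoteReplacement(StringEscapeUtils.unescapeHtml4(regexMatcher.group())));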
        }
        regexMatcher.appendTail(unescapedString);

        return unescapedString.toString();
    }

    private XMLInputFactory createXmlInputFactory() {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        return factory;
    }
}
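
One possible way to apply the same repair while streaming, instead of on a whole String, is to wrap the input in a Reader that rewrites surrogate-pair references on the fly. The following is only a sketch under assumptions, not a tested solution from the post: SurrogateFixingReader and its fix helper are hypothetical names, and the filter does not distinguish CDATA sections or comments, where "&#...;" is literal text and should be left alone.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: rewrites "&#high;&#low;" surrogate pairs as one reference. */
public class SurrogateFixingReader extends Reader {
    private static final int MAX_REF = 12; // longest reference considered, e.g. "&#x10FFFF;"
    private final PushbackReader in;
    private final StringBuilder pending = new StringBuilder(); // output not yet handed out
    private int pos = 0;

    public SurrogateFixingReader(Reader source) {
        this.in = new PushbackReader(source, 16);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            if (pos >= pending.length() && !fill()) break;
            buf[off + n++] = pending.charAt(pos++);
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    /** Loads the next chunk of output into 'pending'; returns false at end of input. */
    private boolean fill() throws IOException {
        pending.setLength(0);
        pos = 0;
        int c = in.read();
        if (c < 0) return false;
        if (c != '&') { pending.append((char) c); return true; }
        String first = readReference();
        int high = numericValue(first);
        if (high < 0xD800 || high > 0xDBFF) { pending.append(first); return true; }
        int next = in.read();
        if (next != '&') {                    // high surrogate not followed by a reference
            if (next >= 0) in.unread(next);
            pending.append(first);
            return true;
        }
        String second = readReference();
        int low = numericValue(second);
        if (low >= 0xDC00 && low <= 0xDFFF) { // valid pair: emit a single code point reference
            pending.append("&#").append(Character.toCodePoint((char) high, (char) low)).append(';');
        } else {                              // not a pair: pass 'first' through, re-scan 'second'
            in.unread(second.toCharArray());
            pending.append(first);
        }
        return true;
    }

    /** Reads the rest of a reference after '&' was consumed (up to ';' or MAX_REF chars). */
    private String readReference() throws IOException {
        StringBuilder sb = new StringBuilder("&");
        while (sb.length() < MAX_REF) {
            int c = in.read();
            if (c < 0) break;
            if (c == '&' || c == '<') { in.unread(c); break; }
            sb.append((char) c);
            if (c == ';') break;
        }
        return sb.toString();
    }

    /** Returns the value of "&#123;" or "&#x7B;", or -1 for anything else. */
    private static int numericValue(String ref) {
        if (!ref.startsWith("&#") || !ref.endsWith(";") || ref.length() < 4) return -1;
        String body = ref.substring(2, ref.length() - 1);
        try {
            return body.startsWith("x") ? Integer.parseInt(body.substring(1), 16)
                                        : Integer.parseInt(body);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Convenience for trying the filter on a String. */
    static String fix(String xml) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new SurrogateFixingReader(new StringReader(xml))) {
            int c;
            while ((c = r.read()) >= 0) out.append((char) c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

In the test above, such a reader could be passed to factory.createXMLEventReader(...) in place of the StringReader. For a real file it would wrap a Reader opened with the file's actual charset, so the document is never held in memory as one String.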
