Usage of the java.nio.charset.Charset class, with code examples

x33g5p2x · reposted 2022-01-17 · category: Other

This article collects code examples for the java.nio.charset.Charset class and shows how the class is used in practice. The examples come from selected projects on GitHub, Stack Overflow, Maven, and similar sources, and should serve as useful references. Details of the Charset class follow:
Package: java.nio.charset
Class: Charset

About Charset

A charset is a named mapping between Unicode characters and byte sequences. Every Charset can decode, converting a byte sequence into a sequence of characters, and some can also encode, converting a sequence of characters into a byte sequence. Use the method #canEncode to find out whether a charset supports both.
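As a quick illustration of the decode/encode distinction, the sketch below (class name is illustrative) probes canEncode() on the installed charsets; on most JDKs a few autodetecting charsets turn out to be decode-only:

```java
import java.nio.charset.Charset;

public class CanEncodeDemo {
    public static void main(String[] args) {
        // All the guaranteed charsets support encoding as well as decoding.
        System.out.println(Charset.forName("UTF-8").canEncode());   // true
        // Some installed charsets may be decode-only; list them.
        Charset.availableCharsets().values().stream()
               .filter(cs -> !cs.canEncode())
               .forEach(cs -> System.out.println(cs.name() + " is decode-only"));
    }
}
```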

Characters

In the context of this class, character always refers to a Java character: a Unicode code point in the range U+0000 to U+FFFF. (Java represents supplementary characters using surrogates.) Not all byte sequences will represent a character, and not all characters can necessarily be represented by a given charset. The method #contains can be used to determine whether every character representable by one charset can also be represented by another (meaning that a lossless transformation is possible from the contained to the container).
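A minimal sketch of #contains, using the standard-charset constants: US-ASCII is a strict subset of UTF-8, so the containment holds in one direction only.

```java
import java.nio.charset.StandardCharsets;

public class ContainsDemo {
    public static void main(String[] args) {
        // Every US-ASCII character is representable in UTF-8,
        // so ASCII -> UTF-8 is a lossless transformation.
        System.out.println(StandardCharsets.UTF_8.contains(StandardCharsets.US_ASCII)); // true
        // The reverse is not true: UTF-8 covers characters ASCII cannot represent.
        System.out.println(StandardCharsets.US_ASCII.contains(StandardCharsets.UTF_8)); // false
    }
}
```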

Encodings

There are many possible ways to represent Unicode characters as byte sequences. See UTR#17: Unicode Character Encoding Model for detailed discussion.

The most important mappings capable of representing every character are the Unicode Transformation Format (UTF) charsets. Of those, UTF-8 and the UTF-16 family are the most common. UTF-8 (described in RFC 3629) encodes a character using 1 to 4 bytes. UTF-16 uses exactly 2 bytes per character (potentially wasting space, but allowing efficient random access into BMP text), and UTF-32 uses exactly 4 bytes per character (trading off even more space for efficient random access into text that includes supplementary characters).

UTF-16 and UTF-32 encode characters directly, using their code point as a two- or four-byte integer. This means that any given UTF-16 or UTF-32 byte sequence is either big- or little-endian. To assist decoders, Unicode includes a special byte order mark (BOM) character U+FEFF used to determine the endianness of a sequence. The corresponding byte-swapped code point U+FFFE is guaranteed never to be assigned. If a UTF-16 decoder sees 0xfe, 0xff, for example, it knows it's reading a big-endian byte sequence, while 0xff, 0xfe, would indicate a little-endian byte sequence.
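The BOM-based endianness detection described above can be sketched as follows; guessUtf16Endianness is a hypothetical helper for illustration, not a JDK API:

```java
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Hypothetical helper: guess the endianness of a UTF-16 byte
    // sequence by inspecting its first two bytes for a BOM.
    static String guessUtf16Endianness(byte[] data) {
        if (data.length >= 2) {
            if ((data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) return "big-endian";
            if ((data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) return "little-endian";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // The endianness-agnostic "UTF-16" charset encodes big-endian
        // and prepends a BE BOM (0xFE 0xFF).
        byte[] encoded = "A".getBytes(StandardCharsets.UTF_16);
        System.out.println(guessUtf16Endianness(encoded)); // big-endian
    }
}
```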

UTF-8 can contain a BOM, but since the UTF-8 encoding of a character always uses the same byte sequence, there is no information about endianness to convey. Seeing the bytes corresponding to the UTF-8 encoding of U+FEFF ( 0xef, 0xbb, 0xbf) would only serve to suggest that you're reading UTF-8. Note that BOMs are decoded as the U+FEFF character, and will appear in the output character sequence. This means that a disadvantage to including a BOM in UTF-8 is that most applications that use UTF-8 do not expect to see a BOM. (This is also a reason to prefer UTF-8: it's one less complication to worry about.)
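The fixed UTF-8 byte sequence for U+FEFF can be verified directly (the class name below is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8BomDemo {
    public static void main(String[] args) {
        // U+FEFF encoded as UTF-8 is always EF BB BF,
        // regardless of platform endianness.
        byte[] bom = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        for (byte b : bom) {
            System.out.printf("%02X ", b);
        }
        // Prints: EF BB BF
    }
}
```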

Because a BOM indicates how the data that follows should be interpreted, a BOM should occur as the first character in a character sequence.

See the Byte Order Mark (BOM) FAQ for more about dealing with BOMs.

Endianness and BOM behavior

The following tables show the endianness and BOM behavior of the UTF-16 variants.

This table shows what the encoder writes. "BE" means that the byte sequence is big-endian, "LE" means little-endian. "BE BOM" means a big-endian BOM (that is, 0xfe, 0xff).

Charset     Encoder writes
UTF-16BE    BE, no BOM
UTF-16LE    LE, no BOM
UTF-16      BE, with BE BOM

The next table shows how each variant's decoder behaves when reading a byte sequence. The exact meaning of "failure" in the table is dependent on the CodingErrorAction supplied to CharsetDecoder#malformedInputAction, so "BE, failure" means "the byte sequence is treated as big-endian, and a little-endian BOM triggers the malformedInputAction".

The phrase "includes BOM" means that the output includes the U+FEFF byte order mark character.

Charset     BE BOM              LE BOM              No BOM
UTF-16BE    BE, includes BOM    BE, failure         BE
UTF-16LE    LE, failure         LE, includes BOM    LE
UTF-16      BE                  LE                  BE
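Two rows of the decoder behavior can be observed directly. The sketch below decodes a big-endian sequence carrying a BE BOM; note how the endianness-agnostic "UTF-16" charset consumes the BOM, while "UTF-16BE" emits it as a U+FEFF character in the output:

```java
import java.nio.charset.StandardCharsets;

public class Utf16DecoderDemo {
    public static void main(String[] args) {
        // BE BOM (0xFE 0xFF) followed by big-endian 'A' (0x0041).
        byte[] beWithBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 'A'};
        // "UTF-16" uses the BOM to pick the byte order, then drops it.
        System.out.println(new String(beWithBom, StandardCharsets.UTF_16));  // A
        // "UTF-16BE" decodes the BOM bytes as the character U+FEFF.
        String s = new String(beWithBom, StandardCharsets.UTF_16BE);
        System.out.println(s.length() + " " + (int) s.charAt(0));            // 2 65279
    }
}
```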

Charset names

A charset has a canonical name, returned by #name. Most charsets will also have one or more aliases, returned by #aliases. A charset can be looked up by canonical name or any of its aliases using #forName.
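A short sketch of canonical names versus aliases: "latin1" is one of the JDK's aliases for ISO-8859-1 (the exact alias set may vary by implementation).

```java
import java.nio.charset.Charset;

public class CharsetNamesDemo {
    public static void main(String[] args) {
        // Look up by a historical alias; name() still returns the canonical name.
        Charset cs = Charset.forName("latin1");
        System.out.println(cs.name());     // ISO-8859-1
        // aliases() returns the full alias set, e.g. [819, ISO8859-1, l1, latin1, ...]
        System.out.println(cs.aliases());
    }
}
```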

Guaranteed-available charsets

The following charsets are available on every Java implementation:

  • ISO-8859-1
  • US-ASCII
  • UTF-16
  • UTF-16BE
  • UTF-16LE
  • UTF-8

All of these charsets support both decoding and encoding. The charsets whose names begin with "UTF" can represent all characters, as mentioned above. The "ISO-8859-1" and "US-ASCII" charsets can only represent small subsets of these characters. Except when required to do otherwise for compatibility, new code should use one of the UTF charsets listed above. On Android, and in Java 18 and later (JEP 400), the platform's default charset is UTF-8; in older Java versions the default charset depended on the user's locale.

Most implementations will support hundreds of charsets. Use #availableCharsets or #isSupported to see what's available. If you intend to use the charset if it's available, just call #forName and catch the exceptions it throws if the charset isn't available.
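The look-up-and-catch pattern above might be sketched like this; charsetOrUtf8 is a hypothetical helper, and falling back to UTF-8 is an assumption of this sketch:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class LookupDemo {
    // Hypothetical helper: return the requested charset if available,
    // otherwise fall back to UTF-8.
    static Charset charsetOrUtf8(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(charsetOrUtf8("US-ASCII").name());        // US-ASCII
        System.out.println(charsetOrUtf8("no-such-charset").name()); // UTF-8
    }
}
```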

Additional charsets can be made available by configuring one or more charset providers through provider configuration files. Such files are always named "java.nio.charset.spi.CharsetProvider" and located in the "META-INF/services" directory on the classpath. The files must be encoded in UTF-8. Each line specifies the class name of a charset provider that extends java.nio.charset.spi.CharsetProvider. A line may end with '\r', '\n' or '\r\n'. Leading and trailing whitespace is trimmed. Blank lines, and lines that (after trimming) start with "#", which are regarded as comments, are both ignored. Duplicates of names already found are also ignored. Both the configuration files and the provider classes are loaded using the thread context class loader.
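As an illustration, such a provider configuration file might look like this (the provider class name is hypothetical):

```
# File: META-INF/services/java.nio.charset.spi.CharsetProvider
# Each non-blank, non-comment line names one provider class.
com.example.charset.MyCharsetProvider
```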

Although the Charset class is thread-safe, the CharsetDecoder and CharsetEncoder instances it returns are inherently stateful.
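A sketch of the resulting usage pattern: share the Charset freely, but give each thread its own decoder (class and variable names are illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class StatefulDecoderDemo {
    public static void main(String[] args) {
        // The Charset itself is safe to share across threads...
        Charset shared = StandardCharsets.UTF_8;
        // ...but a CharsetDecoder carries state between decode() calls,
        // so each thread should create its own via newDecoder().
        Runnable task = () -> {
            CharsetDecoder decoder = shared.newDecoder();
            // use this decoder for this thread's work only
        };
        new Thread(task).start();
        new Thread(task).start();
    }
}
```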

Code examples

Code example source: stackoverflow.com

String line;
try (
  InputStream fis = new FileInputStream("the_file_name");
  InputStreamReader isr = new InputStreamReader(fis, Charset.forName("UTF-8"));
  BufferedReader br = new BufferedReader(isr);
) {
  while ((line = br.readLine()) != null) {
    // Deal with the line
  }
}

Code example source: apache/flink

@Override
public void configure(Configuration parameters) {
  super.configure(parameters);
  if (charsetName == null || !Charset.isSupported(charsetName)) {
    throw new RuntimeException("Unsupported charset: " + charsetName);
  }
  if (charsetName.equalsIgnoreCase(StandardCharsets.US_ASCII.name())) {
    ascii = true;
  }
  this.decoder = Charset.forName(charsetName).newDecoder();
  this.byteWrapper = ByteBuffer.allocate(1);
}

Code example source: org.assertj/assertj-core

private BufferedReader readerFor(InputStream stream) {
 return new BufferedReader(new InputStreamReader(stream, Charset.defaultCharset()));
}

Code example source: apache/kafka

/**
 * Attempt to read a file as a string
 * @throws IOException
 */
public static String readFileAsString(String path, Charset charset) throws IOException {
  if (charset == null) charset = Charset.defaultCharset();
  try (FileChannel fc = FileChannel.open(Paths.get(path))) {
    MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
    return charset.decode(bb).toString();
  }
}

Code example source: stackoverflow.com

OutputStreamWriter char_output = new OutputStreamWriter(
  new FileOutputStream("some_output.utf8"),
  Charset.forName("UTF-8").newEncoder() 
);
InputStreamReader char_input = new InputStreamReader(
  new FileInputStream("some_input.utf8"),
  Charset.forName("UTF-8").newDecoder() 
);

Code example source: loklak/loklak_server

public static JSONArray readJsonFromUrl(String url) throws IOException, JSONException {
  InputStream is = new URL(url).openStream();
  try {
    BufferedReader rd = new BufferedReader(new InputStreamReader(is, Charset.forName("UTF-8")));
    String jsonText = readAll(rd);
    JSONArray json = new JSONArray(jsonText);
    return json;
  } finally {
    is.close();
  }
}

Code example source: stackoverflow.com

URLConnection connection = new URL("https://www.google.com/search?q=" + query).openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
connection.connect();

BufferedReader r  = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));

StringBuilder sb = new StringBuilder();
String line;
while ((line = r.readLine()) != null) {
  sb.append(line);
}
System.out.println(sb.toString());

Code example source: plantuml/plantuml

FileInputStream stream = new FileInputStream(path);
InputStreamReader input = new InputStreamReader(stream, Charset.defaultCharset());
Reader reader = new BufferedReader(input);
int read;
while ((read = reader.read(buffer, 0, buffer.length)) > 0) {
  builder.append(buffer, 0, read);
}
stream.close();

Code example source: commons-io/commons-io

err = proc.getErrorStream();
inr = new BufferedReader(new InputStreamReader(in, Charset.defaultCharset()));
String line = inr.readLine();
while (line != null && lines.size() < max) {
  line = line.toLowerCase(Locale.ENGLISH).trim();
  lines.add(line);
  line = inr.readLine();
}
inr = null;
in.close();
in = null;
err.close();
err = null;

Code example source: commons-io/commons-io

@Test
public void testMultiByteBreak() throws Exception {
  System.out.println("testMultiByteBreak() Default charset: "+Charset.defaultCharset().displayName());
  final long delay = 50;
  final File origin = new File(this.getClass().getResource("/test-file-utf8.bin").toURI());
  final File file = new File(getTestDirectory(), "testMultiByteBreak.txt");
  createFile(file, 0);
  try (Writer out = new OutputStreamWriter(new FileOutputStream(file), charsetUTF8);
     BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(origin), charsetUTF8))) {
    final List<String> lines = new ArrayList<>();
    String line;
    while ((line = reader.readLine()) != null) {
      out.write(line);
      out.write("\n");
    }
  }

Code example source: stackoverflow.com

public static String readFile(String file, String csName)
      throws IOException {
  Charset cs = Charset.forName(csName);
  return readFile(file, cs);
}

public static String readFile(String file, Charset cs)
      throws IOException {
  // No real need to close the BufferedReader/InputStreamReader
  // as they're only wrapping the stream
  FileInputStream stream = new FileInputStream(file);
  try {
    Reader reader = new BufferedReader(new InputStreamReader(stream, cs));
    StringBuilder builder = new StringBuilder();
    char[] buffer = new char[8192];
    int read;
    while ((read = reader.read(buffer, 0, buffer.length)) > 0) {
      builder.append(buffer, 0, read);
    }
    return builder.toString();
  } finally {
    // Potential issue here: if this throws an IOException,
    // it will mask any others. Normally I'd use a utility
    // method which would log exceptions and swallow them
    stream.close();
  }        
}

Code example source: hibernate/hibernate-orm

reader = new BufferedReader(
    new InputStreamReader( is, Charset.forName( "UTF-8" ) )
);
BufferedWriter writer = new BufferedWriter( sw );
for ( int c = reader.read(); c != -1; c = reader.read() ) {
  writer.write( c );
}
reader.close();
is.close();

Code example source: ehcache/ehcache3

public static String urlToText(URL url, String encoding) throws IOException {
 Charset charset = encoding == null ? StandardCharsets.UTF_8 : Charset.forName(encoding);
 try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), charset))) {
  return reader.lines().collect(joining(System.lineSeparator()));
 }
}

Code example source: commons-io/commons-io

try (Reader input1 = charsetName == null
           ? new InputStreamReader(new FileInputStream(file1), Charset.defaultCharset())
           : new InputStreamReader(new FileInputStream(file1), charsetName);
     Reader input2 = charsetName == null
           ? new InputStreamReader(new FileInputStream(file2), Charset.defaultCharset())
           : new InputStreamReader(new FileInputStream(file2), charsetName)) {
  return IOUtils.contentEqualsIgnoreEOL(input1, input2);
}

Code example source: javamelody/javamelody

final InputStream inputStream = zipFile.getInputStream(entry);
try {
  final Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-8"));
  try {
    final char[] chars = new char[1024];
    int read = reader.read(chars);
    while (read != -1) {
      writer.write(chars, 0, read);
      read = reader.read(chars);
    }
  } finally {
    reader.close();
  }
} finally {
  inputStream.close();
}

Code example source: stackoverflow.com

BufferedReader br = new BufferedReader(new InputStreamReader(System.in, Charset.forName("ISO-8859-1")),1024);
 // ...
    // inside some iteration / processing logic:
    if (br.ready()) {
      int readCount = br.read(inputData, bufferOffset, inputData.length-bufferOffset);
    }

Code example source: stanfordnlp/CoreNLP

/**
 * Creates a new scanner.
 * There is also a java.io.Reader version of this constructor.
 *
 * @param   in  the java.io.InputStream to read input from.
 */
NegraPennLexer(java.io.InputStream in) {
 this(new java.io.InputStreamReader
      (in, java.nio.charset.Charset.forName("UTF-8")));
}

Code example source: opentripplanner/OpenTripPlanner

@Override
public boolean update() {
  try {
    InputStream stream = HttpUtils.getData(url);
    if (stream == null) {
      log.warn("Failed to get data from url " + url);
      return false;
    }
    Reader reader = new BufferedReader(new InputStreamReader(stream,
        Charset.forName("UTF-8")));
    StringBuilder builder = new StringBuilder();
    char[] buffer = new char[4096];
    int charactersRead;
    while ((charactersRead = reader.read(buffer, 0, buffer.length)) > 0) {
      builder.append(buffer, 0, charactersRead);
    }
    String data = builder.toString();
    parseJson(data);
  } catch (IOException e) {
    log.warn("Error reading bike rental feed from " + url, e);
    return false;
  } catch (ParserConfigurationException e) {
    throw new RuntimeException(e);
  } catch (SAXException e) {
    log.warn("Error parsing bike rental feed from " + url + "(bad XML of some sort)", e);
    return false;
  }
  return true;
}

Code example source: knowm/XChange

public static <T> T readValue(URL src, Class<T> valueType) throws IOException {
 try (InputStream inputStream = src.openStream()) {
  Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-8"));
  return objectMapperWithoutIndentation.readValue(reader, valueType);
 }
}

Code example source: apache/groovy

public Writer writeTo(final Writer out) throws IOException {
  try (Reader reader = (this.encoding == null)
      ? new InputStreamReader(Files.newInputStream(this))
      : new InputStreamReader(Files.newInputStream(this), Charset.forName(this.encoding))) {
    int c = reader.read();
    while (c != -1) {
      out.write(c);
      c = reader.read();
    }
  }
  return out;
}
