Usage and code examples of the com.ibm.icu.text.UTF16 class

x33g5p2x, reposted on 2022-02-01 in Other

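This article collects code examples for the Java class com.ibm.icu.text.UTF16 and shows how the class is used in practice. The examples come mainly from GitHub, Stack Overflow, Maven, and similar platforms and were extracted from selected projects, so they should be useful as a reference. Details of the UTF16 class are as follows:
Package path: com.ibm.icu.text.UTF16
Class name: UTF16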

About UTF16

Standalone utility class providing UTF16 character conversions and indexing conversions.

Code that uses strings alone rarely needs modification. By design, UTF-16 does not allow overlap, so searching for strings is a safe operation. Similarly, concatenation is always safe. Substringing is safe if the start and end are both on UTF-32 boundaries. In normal code, the values for start and end are on those boundaries, since they arose from operations like searching. If not, the nearest UTF-32 boundaries can be determined using bounds().
Examples:

The following examples illustrate use of some of these methods.

// iteration forwards: Original
for (int i = 0; i < s.length(); ++i) {
  char ch = s.charAt(i);
  doSomethingWith(ch);
}
// iteration forwards: Changes for UTF-32
int ch;
for (int i = 0; i < s.length(); i += UTF16.getCharCount(ch)) {
  ch = UTF16.charAt(s, i);
  doSomethingWith(ch);
}
// iteration backwards: Original
for (int i = s.length() - 1; i >= 0; --i) {
  char ch = s.charAt(i);
  doSomethingWith(ch);
}
// iteration backwards: Changes for UTF-32
// (the condition is i >= 0 so that the code point at index 0 is visited too)
int ch;
for (int i = s.length() - 1; i >= 0; i -= UTF16.getCharCount(ch)) {
  ch = UTF16.charAt(s, i);
  doSomethingWith(ch);
}
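
The loops above depend on ICU4J. As a rough sketch that runs with the JDK alone, the same two traversals can be written with String.codePointAt and String.codePointBefore. These are JDK methods, not part of UTF16, and the class and method names below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {
    // Forward traversal: read the code point starting at i, then step over 1 or 2 chars.
    static List<Integer> forward(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp);
        }
        return out;
    }

    // Backward traversal: read the code point ending just before i, then step back over it.
    static List<Integer> backward(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i);
            out.add(cp);
            i -= Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(forward(s));  // [97, 128512, 98]
        System.out.println(backward(s)); // [98, 128512, 97]
    }
}
```

An unmatched surrogate is returned as a single value by both JDK methods, which is in line with the "Unmatched Surrogates" note in the class description.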

Notes:

  • Naming: For clarity, High and Low surrogates are called Lead and Trail in the API, which gives a better sense of their ordering in a string. offset16 and offset32 are used to distinguish offsets to UTF-16 boundaries vs offsets to UTF-32 boundaries. int char32 is used to contain UTF-32 characters, as opposed to char16, which is a UTF-16 code unit.
  • Roundtripping Offsets: You can always roundtrip from a UTF-32 offset to a UTF-16 offset and back. Because of the difference in structure, you can roundtrip from a UTF-16 offset to a UTF-32 offset and back if and only if bounds(string, offset16) != TRAIL.
  • Exceptions: The error checking will throw an exception if indices are out of bounds. Other than that, all methods will behave reasonably, even if unmatched surrogates or out-of-bounds UTF-32 values are present. UCharacter.isLegal() can be used to check for validity if desired.
  • Unmatched Surrogates: If the string contains unmatched surrogates, then these are counted as one UTF-32 value. This matches their iteration behavior, which is vital. It also matches common display practice as missing glyphs (see the Unicode Standard Section 5.4, 5.5).
  • Optimization: The method implementations may need optimization if the compiler doesn't fold static final methods. Since surrogate pairs will form an exceedingly small percentage of all the text in the world, the singleton case should always be optimized for.
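
The roundtripping rule in the notes can be tried out with JDK methods alone. The sketch below converts a UTF-16 offset to a code point offset and back using String.codePointCount and String.offsetByCodePoints. These stand in for the ICU4J offset conversions, and the intermediate values ICU4J returns may differ. The class and method names are hypothetical.

```java
final class OffsetRoundTrip {
    // UTF-16 offset -> code point offset -> UTF-16 offset, using JDK methods only.
    static int roundTrip(String s, int offset16) {
        int offset32 = s.codePointCount(0, offset16);
        return s.offsetByCodePoints(0, offset32);
    }

    // True when offset16 points at the trail half of a well-formed surrogate pair.
    static boolean isAtTrailOfPair(String s, int offset16) {
        return offset16 > 0 && offset16 < s.length()
                && Character.isLowSurrogate(s.charAt(offset16))
                && Character.isHighSurrogate(s.charAt(offset16 - 1));
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        for (int i = 0; i <= s.length(); i++) {
            System.out.println(i + " -> " + roundTrip(s, i)
                    + " (at trail: " + isAtTrailOfPair(s, i) + ")");
        }
    }
}
```

In this sample string only offset 2, the trail half of the pair, fails to come back unchanged.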

Code examples

Code example source: origin: io.virtdata/virtdata-lib-realer

/**
 * Skips over a run of zero or more Pattern_White_Space characters at pos in text.
 */
private static int skipPatternWhiteSpace(String text, int pos) {
  while (pos < text.length()) {
    int c = UTF16.charAt(text, pos);
    if (!PatternProps.isWhiteSpace(c)) {
      break;
    }
    pos += UTF16.getCharCount(c);
  }
  return pos;
}

Code example source: origin: io.virtdata/virtdata-lib-realer

if (isSurrogate(ch)) {
  if (isLeadSurrogate(ch)) {
    ++offset16;
    if (offset16 < limit && isTrailSurrogate(source[offset16])) {
      return LEAD_SURROGATE_BOUNDARY;
    }
  } else {
    // ch is a trail surrogate, so step back to the preceding code unit
    --offset16;
    if (offset16 >= start && isLeadSurrogate(source[offset16])) {
      return TRAIL_SURROGATE_BOUNDARY;
    }
  }
}

Code example source: origin: io.virtdata/virtdata-lib-realer

int nextCharLL() {
  int ch;
  if (fNextIndex >= fRB.fRules.length()) {
    return -1;
  }
  ch = UTF16.charAt(fRB.fRules, fNextIndex);
  fNextIndex = UTF16.moveCodePointOffset(fRB.fRules, fNextIndex, 1);
  if (ch == '\r' ||
    ch == chNEL ||
    ch == chLS ||
    ch == '\n' && fLastChar != '\r') {
    // Character is starting a new line.  Bump up the line number, and
    //  reset the column to 0.
    fLineNum++;
    fCharNum = 0;
    if (fQuoteMode) {
      error(RBBIRuleBuilder.U_BRK_NEW_LINE_IN_QUOTED_STRING);
      fQuoteMode = false;
    }
  } else {
    // Character is not starting a new line.  Except in the case of a
    //   LF following a CR, increment the column position.
    if (ch != '\n') {
      fCharNum++;
    }
  }
  fLastChar = ch;
  return ch;
}

Code example source: origin: io.virtdata/virtdata-lib-realer

/**
 * Set a code point into a UTF16 position. Adjusts target accordingly if we are replacing a
 * non-supplementary code point with a supplementary one and vice versa.
 *
 * @param target StringBuffer
 * @param offset16 UTF16 position to insert into
 * @param char32 Code point
 * @stable ICU 2.1
 */
public static void setCharAt(StringBuffer target, int offset16, int char32) {
  int count = 1;
  char single = target.charAt(offset16);
  if (isSurrogate(single)) {
    // pairs of the surrogate with offset16 at the lead char found
    if (isLeadSurrogate(single) && (target.length() > offset16 + 1)
        && isTrailSurrogate(target.charAt(offset16 + 1))) {
      count++;
    } else {
      // pairs of the surrogate with offset16 at the trail char
      // found
      if (isTrailSurrogate(single) && (offset16 > 0)
          && isLeadSurrogate(target.charAt(offset16 - 1))) {
        offset16--;
        count++;
      }
    }
  }
  target.replace(offset16, offset16 + count, valueOf(char32));
}
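
To see the length adjustment in action without ICU4J, the following is a minimal JDK-only sketch of the same idea on a StringBuilder. It is not the library's code. The class and method names are hypothetical, and Character.toChars is used in place of UTF16.valueOf.

```java
final class CodePointSetter {
    // Replaces the whole code point that covers offset16, whether offset16
    // points at a single char, a lead surrogate, or a trail surrogate.
    static void setCodePointAt(StringBuilder target, int offset16, int codePoint) {
        int start = offset16;
        int count = 1;
        char single = target.charAt(offset16);
        if (Character.isHighSurrogate(single) && offset16 + 1 < target.length()
                && Character.isLowSurrogate(target.charAt(offset16 + 1))) {
            count = 2; // offset16 is on the lead of a pair
        } else if (Character.isLowSurrogate(single) && offset16 > 0
                && Character.isHighSurrogate(target.charAt(offset16 - 1))) {
            start = offset16 - 1; // offset16 is on the trail, back up to the lead
            count = 2;
        }
        target.replace(start, start + count, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("abc");
        setCodePointAt(sb, 1, 0x1F600); // BMP char replaced by a supplementary one
        System.out.println(sb.length()); // 4
    }
}
```

The buffer grows or shrinks by one char whenever the old and new code points differ in width, which is why callers should not cache offsets across the call.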

Code example source: origin: io.virtdata/virtdata-lib-realer

int cp;
main:
  for (int i = start; i < limit; i += UTF16.getCharCount(cp)) {
    cp = UTF16.charAt(pattern, i);
      default:
        if (usingSlash) {
          UTF16.append(buffer, cp);
          quoteStatus = NONE;
          continue main;
      if (hexCount == 0) {
        quoteStatus = NONE;
        UTF16.append(buffer, hexValue);
        UTF16.append(buffer, cp);
        quoteStatus = NORMAL_QUOTE;
        continue main;
        UTF16.append(buffer, cp);
      UTF16.append(buffer, cp);
      quoteStatus = NORMAL_QUOTE;
      continue main;
      UTF16.append(buffer, cp);
      continue main;
        UTF16.append(buffer, cp);

Code example source: origin: org.apache.lucene/lucene-analyzers-icu

@Override
public int char32At(int pos) {
 return UTF16.charAt(buffer, 0, length, pos);
}

Code example source: origin: io.virtdata/virtdata-lib-realer

/**
 * Performs character mirroring without reordering. When this method is
 * called, <code>{@link #text}</code> should be in a Logical form.
 */
private void mirror() {
  if ((reorderingOptions & Bidi.DO_MIRRORING) == 0) {
    return;
  }
  StringBuffer sb = new StringBuffer(text);
  byte[] levels = bidi.getLevels();
  for (int i = 0, n = levels.length; i < n;) {
    int ch = UTF16.charAt(sb, i);
    if ((levels[i] & 1) != 0) {
      UTF16.setCharAt(sb, i, UCharacter.getMirror(ch));
    }
    i += UTF16.getCharCount(ch);
  }
  text = sb.toString();
  reorderingOptions &= ~Bidi.DO_MIRRORING;
}

Code example source: origin: io.virtdata/virtdata-lib-realer

if (source.length() <= 2 && UTF16.countCodePoint(source) <= 1) {
  output.add(source);
  return;
for (int i = 0; i < source.length(); i += UTF16.getCharCount(cp)) {
  cp = UTF16.charAt(source, i);
    + source.substring(i + UTF16.getCharCount(cp)), skipZeros, subpermute);
  String chStr = UTF16.valueOf(source, i);
  for (String s : subpermute) {
    String piece = chStr + s;

Code example source: origin: io.virtdata/virtdata-lib-realer

if (PROGRESS) System.out.println(" extract: " + Utility.hex(UTF16.valueOf(comp))
  + ", " + Utility.hex(segment.substring(segmentPos)));
  decomp = UTF16.valueOf(comp);
int cp;
int decompPos = 0;
int decompCp = UTF16.charAt(decomp,0);
decompPos += UTF16.getCharCount(decompCp); // adjust position to skip first char
for (int i = segmentPos; i < segment.length(); i += UTF16.getCharCount(cp)) {
  cp = UTF16.charAt(segment, i);
  if (cp == decompCp) { // if equal, eat another cp from decomp
    if (PROGRESS) System.out.println("  matches: " + Utility.hex(UTF16.valueOf(cp)));
    if (decompPos == decomp.length()) { // done, have all decomp characters!
      buf.append(segment.substring(i + UTF16.getCharCount(cp))); // add remaining segment chars
      ok = true;
      break;
    decompCp = UTF16.charAt(decomp, decompPos);
    decompPos += UTF16.getCharCount(decompCp);
    if (PROGRESS) System.out.println("  buffer: " + Utility.hex(UTF16.valueOf(cp)));
    UTF16.append(buf, cp);
if (0!=Normalizer.compare(UTF16.valueOf(comp) + remainder, segment.substring(segmentPos), 0)) return null;

Code example source: origin: io.virtdata/virtdata-lib-realer

if (isLeadSurrogate(ch) && ((result + 1) < limit)
    && isTrailSurrogate(source[result + 1])) {
  result++;
}

Code example source: origin: com.ibm.icu/icu4j-charset

/*public*/int fromUCountPending() {
  if (preFromULength > 0) {
    return UTF16.getCharCount(preFromUFirstCP) + preFromULength;
  } else if (preFromULength < 0) {
    return -preFromULength;
  } else if (fromUChar32 > 0) {
    return 1;
  } else if (preFromUFirstCP > 0) {
    return UTF16.getCharCount(preFromUFirstCP);
  }
  return 0;
}

Code example source: origin: com.ibm.icu/icu4j-charset

} else if (!UTF16.isSurrogate((char) c)) {
} else if (UTF16.isLeadSurrogate((char) c)) {
length = UTF16.getCharCount(c);

Code example source: origin: com.ibm.icu/icu4j-charset

protected final CoderResult encodeMalformedOrUnmappable(CharBuffer source, int ch, boolean flush) {
  /*
   * if the character is a lead surrogate, we need to call encodeTrail to attempt to match
   * it up with a trail surrogate. if not, the character is unmappable.
   */
  return (UTF16.isSurrogate((char) ch))
      ? encodeTrail(source, (char) ch, flush)
      : CoderResult.unmappableForLength(1);
}

Code example source: origin: io.virtdata/virtdata-lib-realer

String str = UTF16.valueOf(c);
      text.replace(openPos, cursor, str);
    UTF16.append(name, c);
cursor += UTF16.getCharCount(c);

Code example source: origin: com.ibm.icu/icu4j-charset

private int getTrail(CharBuffer source, ByteBuffer target, IntBuffer offsets){
  if(source.hasRemaining()){
    /*test the following code unit*/
    char trail = source.get(source.position());
    if(UTF16.isTrailSurrogate(trail)){
      source.position(source.position()+1);
      ++nextSourceIndex;
      c=UCharacter.getCodePoint((char)c, trail);
    }
  } else {
    /*no more input*/
    c = -c; /*negative lead surrogate as "incomplete" indicator to avoid c=0 everywhere else*/
    checkNegative = true;
  }
  LoopAfterTrail = true;
  return regularLoop;
}

Code example source: origin: com.ibm.icu/icu4j-charset

private CoderResult toUWriteCodePoint(int c, CharBuffer target, IntBuffer offsets, int sourceIndex) {
  CoderResult cr = CoderResult.UNDERFLOW;
  int tBeginIndex = target.position();
  if (target.hasRemaining()) {
    if (c <= 0xffff) {
      target.put((char) c);
      c = UConverterConstants.U_SENTINEL;
    } else /* c is a supplementary code point */{
      target.put(UTF16.getLeadSurrogate(c));
      c = UTF16.getTrailSurrogate(c);
      if (target.hasRemaining()) {
        target.put((char) c);
        c = UConverterConstants.U_SENTINEL;
      }
    }
    /* write offsets */
    if (offsets != null) {
      offsets.put(sourceIndex);
      if ((tBeginIndex + 1) < target.position()) {
        offsets.put(sourceIndex);
      }
    }
  }
  /* write overflow from c */
  if (c >= 0) {
    charErrorBufferLength = UTF16.append(charErrorBufferArray, 0, c);
    cr = CoderResult.OVERFLOW;
  }
  return cr;
}
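
The lead/trail split used above follows the standard UTF-16 arithmetic. As a small worked sketch (hypothetical class and method names, JDK only), a supplementary code point can be split by hand and checked against Character.highSurrogate and Character.lowSurrogate:

```java
final class SurrogateSplit {
    // Lead surrogate: top 10 bits of (cp - 0x10000), offset from 0xD800.
    static char lead(int cp) {
        return (char) (0xD800 + ((cp - 0x10000) >> 10));
    }

    // Trail surrogate: low 10 bits of (cp - 0x10000), offset from 0xDC00.
    static char trail(int cp) {
        return (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        System.out.printf("%04X %04X%n", (int) lead(cp), (int) trail(cp)); // D83D DE00
        // Cross-check against the JDK's own helpers.
        System.out.println(lead(cp) == Character.highSurrogate(cp)
                && trail(cp) == Character.lowSurrogate(cp)); // true
    }
}
```

For U+1F600, subtracting 0x10000 gives 0xF600. Its top 10 bits are 0x3D and its low 10 bits are 0x200, which yields the pair D83D DE00.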

Code example source: origin: com.ibm.icu/icu4j-charset

private final CoderResult encodeChar(CharBuffer source, ByteBuffer target, IntBuffer offsets, char ch) {
  int sourceIndex = source.position() - 1;
  CoderResult cr;
  if (UTF16.isSurrogate(ch)) {
    cr = handleSurrogates(source, ch);
    if (cr != null)
      return cr;
    char trail = UTF16.getTrailSurrogate(fromUChar32);
    fromUChar32 = 0;
    // 4 bytes
    temp[0 ^ endianXOR] = (byte) (ch >>> 8);
    temp[1 ^ endianXOR] = (byte) (ch);
    temp[2 ^ endianXOR] = (byte) (trail >>> 8);
    temp[3 ^ endianXOR] = (byte) (trail);
    cr = fromUWriteBytes(this, temp, 0, 4, target, offsets, sourceIndex);
  } else {
    // 2 bytes
    temp[0 ^ endianXOR] = (byte) (ch >>> 8);
    temp[1 ^ endianXOR] = (byte) (ch);
    cr = fromUWriteBytes(this, temp, 0, 2, target, offsets, sourceIndex);
  }
  return (cr.isUnderflow() ? null : cr);
}
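
The index ^ endianXOR pattern in this method selects the byte order without branching. The sketch below isolates that trick under the assumption that the XOR value is 0 for big-endian and 1 for little-endian output. The excerpt does not show how endianXOR is set, and the class and method names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf16ByteOrder {
    // Assumption: endianXor is 0 for big-endian and 1 for little-endian.
    // XOR with 1 swaps each even index with the odd index that follows it.
    static byte[] encode(String s, int endianXor) {
        byte[] out = new byte[s.length() * 2];
        for (int k = 0; k < s.length(); k++) {
            char ch = s.charAt(k);
            out[(2 * k) ^ endianXor] = (byte) (ch >>> 8); // high byte
            out[(2 * k + 1) ^ endianXor] = (byte) ch;     // low byte
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";
        // Compare with the JDK encoders for both byte orders.
        System.out.println(Arrays.equals(encode(s, 0), s.getBytes(StandardCharsets.UTF_16BE))); // true
        System.out.println(Arrays.equals(encode(s, 1), s.getBytes(StandardCharsets.UTF_16LE))); // true
    }
}
```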

Code example source: origin: com.ibm.icu/icu4j-charset

boolean doread = true;
if (c != 0 && target.hasRemaining()) {
  if (UTF16.isLeadSurrogate((char) c)) {
    SideEffectsDouble x = new SideEffectsDouble(c, sourceArrayIndex, sourceIndex, nextSourceIndex);
    doloop = getTrailDouble(source, target, uniMask, x, flush, cr);
        c = source.get(sourceArrayIndex++);
        ++nextSourceIndex;
        if (UTF16.isSurrogate((char) c)) {
          if (UTF16.isLeadSurrogate((char) c)) {

Code example source: origin: io.virtdata/virtdata-lib-realer

/**
 * Cover JDK 1.5 APIs. Append the code point to the buffer and return the buffer as a
 * convenience.
 *
 * @param target The buffer to append to
 * @param cp The code point to append
 * @return the updated StringBuffer
 * @throws IllegalArgumentException If cp is not a valid code point
 * @stable ICU 3.0
 */
public static StringBuffer appendCodePoint(StringBuffer target, int cp) {
  return append(target, cp);
}

Code example source: origin: org.openehealth.ipf.eclipse.ocl/ipf-eclipse-ocl

@Override
int shiftCodePointOffsetBy0(String text, int offset, int shift) {
  return UTF16.moveCodePointOffset(text, offset, shift);
}
