Crate cesu8

A simple library implementing the CESU-8 compatibility encoding scheme. This is a non-standard variant of UTF-8 that is used internally by some systems that need to represent UTF-16 data as 8-bit characters. Yes, this is ugly.

Use of this encoding is discouraged by the Unicode Consortium. It’s OK for working with existing internal APIs, but it should not be used for transmitting or storing data.

use std::borrow::Cow;
use cesu8::{from_cesu8, to_cesu8};

// 16-bit Unicode characters are the same in UTF-8 and CESU-8.
assert_eq!(Cow::Borrowed("aé日".as_bytes()),
           to_cesu8("aé日"));
assert_eq!(Cow::Borrowed("aé日"),
           from_cesu8("aé日".as_bytes()).unwrap());

// This string is CESU-8 data containing a 6-byte surrogate pair,
// which decodes to a 4-byte UTF-8 string.
let data = &[0xED, 0xA0, 0x81, 0xED, 0xB0, 0x81];
assert_eq!(Cow::Borrowed("\u{10401}"),
           from_cesu8(data).unwrap());

A note about security

As a general rule, this library is intended to fail on malformed or unexpected input. CESU-8 is supposed to be an internal-only format, and if we’re seeing malformed data, we assume that it’s either a bug in somebody’s code, or an attacker is trying to improperly encode data to evade security checks.

If you have a use case for lossy conversion to UTF-8, or conversion from mixed UTF-8/CESU-8 data, please feel free to submit a pull request for from_cesu8_lossy_permissive with appropriate behavior.

Java and U+0000, and other variants

Java uses the CESU-8 encoding as described above, with one difference: the null character U+0000 is represented as the overlong UTF-8 sequence C0 80. This variant is supported by the from_java_cesu8 and to_java_cesu8 functions.
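To make the U+0000 rule concrete, here is a minimal standard-library sketch. The name encode_nul_java_style is hypothetical and is not part of this crate. It shows only the NUL substitution and assumes the input contains no supplementary characters, which would also need to be rewritten as surrogate pairs.

```rust
/// Rewrite every U+0000 in `text` as the overlong two-byte sequence C0 80,
/// copying all other bytes unchanged.
///
/// Hypothetical helper for illustration only: it does not re-encode
/// supplementary characters as surrogate pairs.
fn encode_nul_java_style(text: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(text.len());
    for &byte in text.as_bytes() {
        if byte == 0x00 {
            // In UTF-8 the byte 0x00 can only be the code point U+0000.
            out.extend_from_slice(&[0xC0, 0x80]);
        } else {
            out.push(byte);
        }
    }
    out
}
```

With this substitution the encoded bytes never contain 0x00, so they can pass through APIs that treat a zero byte as a terminator.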

Surrogate pairs and UTF-8

The UTF-16 encoding uses “surrogate pairs” to represent Unicode code points in the range from U+10000 to U+10FFFF. Each half of a pair is a 16-bit number in the range 0xD800 to 0xDFFF.

0xD800 to 0xDBFF: First half of surrogate pair. When encoded as CESU-8, these become 11101101 10100000 10000000 to 11101101 10101111 10111111.
0xDC00 to 0xDFFF: Second half of surrogate pair. These become 11101101 10110000 10000000 to 11101101 10111111 10111111.
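The bit patterns above come from applying the ordinary three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) to each 16-bit surrogate. A minimal sketch follows. The name encode_surrogate is a hypothetical helper, not one of this crate's functions.

```rust
/// Pack one UTF-16 surrogate (0xD800..=0xDFFF) into the three-byte
/// UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx.
///
/// Hypothetical helper for illustration; panics on non-surrogate input.
fn encode_surrogate(surrogate: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&surrogate), "not a surrogate");
    [
        0b1110_0000 | (surrogate >> 12) as u8,         // top 4 bits
        0b1000_0000 | ((surrogate >> 6) & 0x3F) as u8, // middle 6 bits
        0b1000_0000 | (surrogate & 0x3F) as u8,        // low 6 bits
    ]
}
```

The first byte is always 0xED (11101101), and the second byte distinguishes the halves: 0xA0 to 0xAF for a first half, 0xB0 to 0xBF for a second half.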

Wikipedia explains the code point to UTF-16 conversion process:

Consider the encoding of U+10437 (𐐷):

Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.

Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.

Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.

Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
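The steps above can be sketched in a few lines of standard-library Rust. The names split_into_surrogates and supplementary_to_cesu8 are hypothetical and are not part of this crate. The second function packs each surrogate into three bytes to show how one supplementary code point becomes six CESU-8 bytes.

```rust
/// Split a supplementary code point (U+10000..=U+10FFFF) into its
/// (high, low) UTF-16 surrogates. Hypothetical helper for illustration.
fn split_into_surrogates(code_point: u32) -> (u16, u16) {
    assert!(
        (0x10000..=0x10FFFF).contains(&code_point),
        "not a supplementary code point"
    );
    let offset = code_point - 0x10000;          // 20-bit value
    let high = 0xD800 + (offset >> 10) as u16;  // top 10 bits
    let low = 0xDC00 + (offset & 0x3FF) as u16; // low 10 bits
    (high, low)
}

/// Encode a supplementary code point as six CESU-8 bytes by packing each
/// surrogate into the three-byte layout 1110xxxx 10xxxxxx 10xxxxxx.
fn supplementary_to_cesu8(code_point: u32) -> [u8; 6] {
    let (high, low) = split_into_surrogates(code_point);
    let mut out = [0u8; 6];
    for (i, &unit) in [high, low].iter().enumerate() {
        out[i * 3] = 0xE0 | (unit >> 12) as u8;
        out[i * 3 + 1] = 0x80 | ((unit >> 6) & 0x3F) as u8;
        out[i * 3 + 2] = 0x80 | (unit & 0x3F) as u8;
    }
    out
}
```

For U+10437 this gives the surrogates 0xD801 and 0xDC37, matching the walkthrough. For U+10401 it gives the bytes ED A0 81 ED B0 81.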

Structs

Cesu8DecodingError

The CESU-8 data could not be decoded as valid UTF-8 data.

Functions

from_cesu8

Convert CESU-8 data to a Rust string, re-encoding only if necessary. Returns an error if the data cannot be represented as valid UTF-8.

from_java_cesu8

Convert Java’s modified UTF-8 data to a Rust string, re-encoding only if necessary. Returns an error if the data cannot be represented as valid UTF-8.

is_valid_cesu8

Check whether a Rust string contains valid CESU-8 data.

is_valid_java_cesu8

Check whether a Rust string contains valid Java modified UTF-8 data.

to_cesu8

Convert a Rust &str to CESU-8 bytes.

to_java_cesu8

Convert a Rust &str to Java’s modified UTF-8 bytes.