CESU-8

Also known as Compatibility Encoding Scheme for UTF-16: 8-Bit, CESU8

The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (three bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four. Though not spec
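The two-step scheme described above (split a supplementary character into a surrogate pair, then encode each surrogate as a three-byte UTF-8 sequence) can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name `cesu8_encode` is hypothetical, and the sketch assumes the input contains no unpaired surrogates (Python raises `UnicodeEncodeError` for those in the BMP branch).

```python
def cesu8_encode(text: str) -> bytes:
    """Encode a string as CESU-8 (illustrative sketch, hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code point: identical to UTF-8.
            # Assumption: no unpaired surrogates in the input.
            out += ch.encode("utf-8")
        else:
            # Supplementary character: build the UTF-16 surrogate pair.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # high (lead) surrogate
            low = 0xDC00 | (v & 0x3FF)   # low (trail) surrogate
            for surrogate in (high, low):
                # "surrogatepass" lets Python emit the 3-byte UTF-8 form
                # of a surrogate code point instead of raising an error.
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


if __name__ == "__main__":
    ch = "\U00010400"  # a supplementary character
    print(cesu8_encode(ch).hex(" "))     # six bytes in CESU-8
    print(ch.encode("utf-8").hex(" "))   # four bytes in UTF-8
```

For U+10400 the sketch yields the six bytes ED A0 81 ED B0 80, whereas UTF-8 encodes the same character as the four bytes F0 90 90 80.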

Described at: Stabilized Technical Report (unicode.org)

UTR 26, "Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)" has been stabilized: There are no plans to ever publish another update for it. The last version can be found at . CESU-8 documents an obsolete internal-use encoding scheme for Unicode identical to UTF-8 except for its representation of supplementary characters. In CESU-8, supplementary characters are represented as six-byte sequences rather than four-byte sequences. CESU-8 is not intended nor recommended as an encoding used for open information exchange. Therefore, there is no need to develop this report any further.


Wikidata facts

Official name: CESU-8


described at URL: www.unicode.org/reports/tr26
time of discovery or invention: 2001


via Wikidata · CC0


Article


The encoding of Unicode non-BMP characters works out to 11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx (yyyy represents the top five bits of the character minus one; the x bits are the remaining 16 bits of the code point, in order). The byte values 0xF0 to 0xF4 will not appear in CESU-8, as they start the 4-byte encodings used by UTF-8.
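The bit layout above can also be filled in directly from the code point, without building the surrogate pair first. The following Python sketch does that; the function name `cesu8_supplementary` is hypothetical, and the code is an illustration of the pattern rather than an authoritative encoder.

```python
def cesu8_supplementary(cp: int) -> bytes:
    """Build the six CESU-8 bytes for a non-BMP code point (illustrative)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a supplementary code point")
    yyyy = (cp >> 16) - 1  # top five bits of the code point, minus one
    return bytes([
        0xED,                        # 11101101
        0xA0 | yyyy,                 # 1010yyyy
        0x80 | ((cp >> 10) & 0x3F),  # 10xxxxxx (bits 15..10)
        0xED,                        # 11101101
        0xB0 | ((cp >> 6) & 0x0F),   # 1011xxxx (bits 9..6)
        0x80 | (cp & 0x3F),          # 10xxxxxx (bits 5..0)
    ])


if __name__ == "__main__":
    encoded = cesu8_supplementary(0x10FFFF)
    print(encoded.hex(" "))
    # None of the six bytes is a UTF-8 four-byte lead byte (0xF0 to 0xF4).
    print(all(b < 0xF0 for b in encoded))
```

Because the first and fourth bytes are always 0xED and the others are continuation bytes in the range 0x80 to 0xBF, no byte of such a sequence can reach 0xF0.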