Category

Character encoding

Unicode (also known as The Unicode Standard and TUS) is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 17.0 defines 159,801 characters and 172 scripts used in various ordinary, literary, academic and technical contexts.

ASCII, an acronym for American Standard Code for Information Interchange, is a character encoding standard for representing a particular set of 95 (English-language-focused) printable and 33 control characters, a total of 128 code points. The set of available punctuation had a significant impact on the syntax of computer languages and text markup. ASCII hugely influenced the design of character sets used by modern computers; for example, the first 128 code points of Unicode are the same as ASCII.
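The counts above can be checked with a short Python sketch. The names `PRINTABLE`, `CONTROL`, and `ascii_matches_unicode` are illustrative, not part of any standard API; the split assumes the usual convention that the printable range runs from space (0x20) through tilde (0x7E).

```python
# Split the 128 ASCII code points into printable and control characters.
PRINTABLE = [cp for cp in range(128) if 0x20 <= cp <= 0x7E]     # space through tilde
CONTROL = [cp for cp in range(128) if cp < 0x20 or cp == 0x7F]  # C0 controls plus DEL


def ascii_matches_unicode() -> bool:
    """Return True if every ASCII byte decodes to the Unicode code point of the same value."""
    return all(ord(bytes([cp]).decode("ascii")) == cp for cp in range(128))


print(len(PRINTABLE), len(CONTROL), ascii_matches_unicode())
```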

character encoding

system using a prescribed set of digital values to represent textual characters

UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format 8-bit. As of 2026, almost every webpage (99%) is transmitted as UTF-8.
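As a rough illustration of the "8-bit" in the name, the sketch below uses Python's built-in `utf-8` codec to show that a code point is stored as one to four 8-bit code units. The helper name `utf8_length` and the sample characters are chosen here for illustration only.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the built-in UTF-8 codec uses for a single character."""
    return len(ch.encode("utf-8"))


# One sample from each length class: ASCII, Latin-1 supplement, BMP, supplementary plane.
for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({utf8_length(ch)} bytes)")
```

ASCII characters keep their single-byte values, which is why plain ASCII text is already valid UTF-8.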

data type representing a finite sequence of encoded characters

primitive data type

[Figure: The UTF-8-encoded Japanese Wikipedia article for Mojibake displayed as if interpreted as Windows-1252]
[Figure: The UTF-8-encoded Russian Wikipedia article on Church Slavonic displayed as if interpreted as KOI8-R]
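The effect in these captions can be reproduced in a few lines of Python: encode text as UTF-8, then decode the bytes with the wrong codec. This is a minimal sketch; `garble` and `repair` are hypothetical helper names, and the sample strings are short stand-ins for the articles pictured.

```python
def garble(text: str, wrong_encoding: str) -> str:
    """Encode as UTF-8, then misread the bytes using a different encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


def repair(garbled: str, wrong_encoding: str) -> str:
    """Undo garble() by reversing the mistaken decode step."""
    return garbled.encode(wrong_encoding).decode("utf-8")


print(garble("café", "cp1252"))    # UTF-8 bytes read as Windows-1252
print(garble("Привет", "koi8_r"))  # UTF-8 bytes read as KOI8-R
print(repair(garble("Привет", "koi8_r"), "koi8_r"))
```

The repair step only works when the wrong codec maps every byte involved. Python's `cp1252` codec leaves a few byte values unassigned, so arbitrary UTF-8 input may raise `UnicodeDecodeError` there instead of producing garbled text.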

UTF-16 (16-bit Unicode Transformation Format) is a character encoding that supports all 1,112,064 valid code points of Unicode. The encoding is variable-length, as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier, now obsolete fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed, including most emoji and important CJK characters such as those used in personal and place names.
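The one-or-two-code-unit rule can be sketched as follows. The function name `utf16_code_units` is illustrative; the arithmetic is the standard surrogate-pair mapping, and the result is compared with Python's built-in `utf-16-be` codec as a sanity check.

```python
def utf16_code_units(cp: int) -> list:
    """Return the 16-bit code units that encode one Unicode code point in UTF-16."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp < 0x10000:
        return [cp]                       # BMP: a single code unit
    offset = cp - 0x10000                 # 20 bits remain after removing the BMP
    high = 0xD800 + (offset >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> low surrogate
    return [high, low]


units = utf16_code_units(0x1F600)         # an emoji outside the BMP
print([f"{u:04X}" for u in units])
print("\U0001F600".encode("utf-16-be").hex())

# All code points minus the 2,048 surrogates gives the number of valid code points.
print(0x110000 - (0xDFFF - 0xD800 + 1))
```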

pioneering five-bit character encodings

numerical value representing a character in a coded character set

UTF-32 (32-bit Unicode Transformation Format), sometimes called UCS-4, is a fixed-length encoding of Unicode code points that uses exactly 32 bits (four bytes) per code point. Because there are far fewer than 2^32 Unicode code points, only 21 bits are actually needed, so a number of leading bits must be zero. In contrast, all other Unicode transformation formats are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.
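A small sketch makes the "value equals code point" property concrete. It uses Python's built-in `utf-32-be` codec (big-endian, no byte order mark); `utf32_values` is a hypothetical helper name.

```python
import struct


def utf32_values(text: str) -> list:
    """Decode the UTF-32BE bytes of text into a list of unsigned 32-bit integers."""
    data = text.encode("utf-32-be")
    count = len(data) // 4                       # exactly four bytes per code point
    return list(struct.unpack(f">{count}I", data))


text = "A\U0001F600"
print(utf32_values(text))
print([ord(ch) for ch in text])                  # same numbers as the line above
print((0x10FFFF).bit_length())                   # bits needed for the largest code point
```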

type of character encoding

Unicode block (U+13A0-13FF)

separator for character strings (e.g. between words or expressions)

UTF-7 (7-bit Unicode Transformation Format) is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.
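Python still ships a `utf-7` codec, which is enough to see the idea of carrying Unicode text in a stream of ASCII characters. The sketch below is illustrative only (`to_utf7` is a hypothetical helper name), and the format should not be used for new work.

```python
def to_utf7(text: str) -> bytes:
    """Encode text with Python's built-in UTF-7 codec."""
    return text.encode("utf-7")


sample = "Hi, café"
encoded = to_utf7(sample)
print(encoded)                                   # every byte is in the 7-bit ASCII range
print(all(b < 0x80 for b in encoded))
print(encoded.decode("utf-7") == sample)         # decoding restores the original text
```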

method for entering non-ASCII characters into a computer

bi-directional text

text that contains both LTR and RTL text

character encoding in which characters are encoded in one or two bytes

script directionality

Bush hid the facts

Bug in Microsoft Windows Applications

Han unification

effort by Unicode/ISO 10646 to map Han characters into a single set, ignoring regional variations

graphic character

encoded character that is associated with one or more glyphs

binary-to-text encoding

scheme for encoding arbitrary binary data as plain text
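Base64 is one widely used scheme of this kind. As a minimal sketch, using Python's standard `base64` module (the helper names `to_text` and `from_text` are illustrative):

```python
import base64


def to_text(data: bytes) -> str:
    """Represent arbitrary bytes as ASCII-only Base64 text."""
    return base64.b64encode(data).decode("ascii")


def from_text(text: str) -> bytes:
    """Recover the original bytes from Base64 text."""
    return base64.b64decode(text)


raw = b"\x00\xff\x10"                            # bytes that are not printable text
print(to_text(raw))
print(from_text(to_text(raw)) == raw)
```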

Windows Glyph List 4

pan-European character set specified by Microsoft

ghost characters

erroneous kanji included in the Japanese JIS X 0208 standard and later in Unicode

Braille alphabet for the Hebrew language

UTF-EBCDIC is a character encoding capable of encoding all 1,112,064 valid character code points in Unicode using 1 to 5 bytes (in contrast to a maximum of 4 for UTF-8). It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.

BCD character encoding

family of 6-bit character encodings

ITU-T Recommendation

The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (three bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four. Though not spec
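The two-step rule for supplementary characters can be sketched in Python. The standard library has no CESU-8 codec, so `cesu8_encode` below is a hypothetical helper written for illustration; it relies on the `surrogatepass` error handler to let the UTF-8 codec encode individual surrogate code points.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: BMP as in UTF-8, supplementary via surrogate pairs."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")            # BMP: identical to UTF-8
            continue
        offset = cp - 0x10000
        high = 0xD800 + (offset >> 10)           # same split as UTF-16
        low = 0xDC00 + (offset & 0x3FF)
        for surrogate in (high, low):
            # Each surrogate becomes its own three-byte sequence.
            out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)


emoji = "\U0001F600"
print(len(cesu8_encode(emoji)), len(emoji.encode("utf-8")))
print(cesu8_encode(emoji).hex())
```

For text that stays within the BMP, the output is byte-for-byte identical to UTF-8.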

digraphs and trigraphs

in computer programming

Vietnamese Quoted-Readable

convention for writing Vietnamese using ASCII characters

Character encoding using one byte per character

Chinese Character Code for Information Interchange

character encoding standard

ITU-T recommendation

International Ideographs Core

subset of Unicode CJK Unified Ideographs characters intended for use on less capable devices

computer system that correctly handles 8-bit character encodings

Perso-Arabic Script Code for Information Interchange

Indian government standard for encoding Kashmiri, Persian, Sindhi, and Urdu