Category

Character encoding

Unicode
Unicode (also known as The Unicode Standard and TUS) is a character encoding standard maintained by the Unicode Consortium designed to support the use of text in all of the world's writing systems that can be digitized. Version 17.0 defines 159,801 characters and 172 scripts used in various ordinary, literary, academic and technical contexts.
ASCII
ASCII, an acronym for American Standard Code for Information Interchange, is a character encoding standard that represents a particular set of 95 printable characters (focused on the English language) and 33 control characters, for a total of 128 code points. The set of available punctuation had a significant impact on the syntax of computer languages and text markup. ASCII hugely influenced the design of character sets used by modern computers; for example, the first 128 code points of Unicode are the same as ASCII.
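The counts in the ASCII entry can be checked with a short Python sketch. The variable names here are illustrative only, and the split into control and printable ranges follows the usual reading of the standard (code points 0 to 31 plus 127 as control characters, 32 to 126 as printable).

```python
# Control characters: code points 0-31 and 127 (DEL).
control = [cp for cp in range(128) if cp < 32 or cp == 127]

# Printable characters: code points 32 (space) through 126 (~).
printable = [cp for cp in range(128) if 32 <= cp <= 126]

# Decode every 7-bit byte value as ASCII and compare with Unicode code points.
ascii_bytes = bytes(range(128))
decoded = ascii_bytes.decode("ascii")
matches_unicode = all(ord(ch) == b for ch, b in zip(decoded, ascii_bytes))

print(len(control), len(printable), matches_unicode)
```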
character encoding
system using a prescribed set of digital values to represent textual characters
UTF-8
UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format 8-bit. As of 2026, almost every webpage (99%) is transmitted as UTF-8.
string
data type representing a finite sequence of encoded characters
character
primitive data type
mojibake
garbled text that results from decoding text with an unintended character encoding; for example, the UTF-8-encoded Japanese Wikipedia article for Mojibake displayed as if interpreted as Windows-1252, or the UTF-8-encoded Russian Wikipedia article on Church Slavonic displayed as if interpreted as KOI8-R
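Mojibake of this kind is easy to reproduce. The following is a minimal sketch, assuming Python's standard codecs `cp1252` and `koi8_r`; the helper name `mojibake` is hypothetical and exists only for this illustration.

```python
def mojibake(text: str, wrong_encoding: str) -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong encoding."""
    # errors="replace" substitutes U+FFFD for bytes the wrong encoding
    # leaves undefined, instead of raising UnicodeDecodeError.
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


# "é" is the two UTF-8 bytes C3 A9, which Windows-1252 shows as "Ã©".
garbled_latin = mojibake("é", "cp1252")

# UTF-8 Russian text misread as KOI8-R: each letter becomes two wrong characters.
garbled_russian = mojibake("Привет", "koi8_r")

# When no bytes were lost, reversing the two steps recovers the original.
repaired = garbled_latin.encode("cp1252").decode("utf-8")

print(garbled_latin, garbled_russian, repaired)
```

The repair step only works when the wrong decoding was lossless; once replacement characters appear, the original bytes cannot be recovered.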
UTF-16
UTF-16 (16-bit Unicode Transformation Format) is a character encoding that supports all 1,112,064 valid code points of Unicode. The encoding is variable-length, as code points are encoded with one or two code units. UTF-16 arose from an earlier, now obsolete fixed-width 16-bit encoding known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 2^16 (65,536) code points were needed, including most emoji and important CJK characters such as those used in personal and place names.
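The one-or-two code unit rule can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation; the function name `to_utf16_code_units` is hypothetical, and the arithmetic follows the standard surrogate-pair scheme.

```python
def to_utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code units (one or two) for a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid scalar values")
    if code_point < 0x10000:
        return [code_point]                  # BMP: a single 16-bit code unit
    offset = code_point - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return [high, low]


# 0x110000 code points in total, minus the 2,048 reserved surrogates.
valid_code_points = 0x110000 - (0xDFFF - 0xD800 + 1)

print([hex(u) for u in to_utf16_code_units(0x1F600)], valid_code_points)
```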
Baudot code
pioneering five-bit character encodings
code point
numerical value representing a character in a coded character set
UTF-32
UTF-32 (32-bit Unicode Transformation Format), sometimes called UCS-4, is a fixed-length encoding of Unicode code points that uses exactly 32 bits (four bytes) per code point. A number of leading bits must be zero, because there are far fewer than 2^32 Unicode code points; only 21 bits are actually needed. In contrast, all other Unicode transformation formats are variable-length encodings. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.
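A small Python sketch can confirm these properties, assuming the standard `utf-32-be` codec (big-endian, no byte order mark); the sample string is arbitrary.

```python
import struct

text = "A€😀"
encoded = text.encode("utf-32-be")

# Every code point occupies exactly four bytes.
bytes_per_code_point = len(encoded) // len(text)

# Unpack the bytes as big-endian unsigned 32-bit integers.
values = list(struct.unpack(f">{len(encoded) // 4}I", encoded))

# Each 32-bit value equals the code point's numerical value.
code_points = [ord(ch) for ch in text]

# The largest code point, U+10FFFF, fits in 21 bits.
bits_needed = (0x10FFFF).bit_length()

print(bytes_per_code_point, values == code_points, bits_needed)
```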
code page
type of character encoding
Cherokee
Unicode block (U+13A0-13FF)
whitespace
separator for character strings (e.g. between words or expressions)
UTF-7
UTF-7 (7-bit Unicode Transformation Format) is an obsolete variable-length character encoding for representing Unicode text using a stream of ASCII characters. It was originally intended to provide a means of encoding Unicode text for use in Internet E-mail messages that was more efficient than the combination of UTF-8 with quoted-printable.
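For illustration, Python still ships a `utf-7` codec, so the ASCII-only property can be observed directly. This is a sketch of the behaviour, not a recommendation to use the encoding; the sample string is arbitrary.

```python
text = "Hi €"

# Non-ASCII characters are emitted as a "+...-" run of modified Base64.
encoded = text.encode("utf-7")

# Every byte in the result is a 7-bit ASCII value.
is_ascii_only = all(b < 0x80 for b in encoded)

# Decoding restores the original text.
decoded = encoded.decode("utf-7")

print(encoded, is_ascii_only, decoded)
```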
Alt code
method for entering non-ASCII characters into a computer
bi-directional text
text that contains both LTR and RTL text
DBCS
character encoding in which characters are encoded in one or two bytes
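As an example of a double-byte character set, Shift JIS (Python codec `shift_jis`) encodes ASCII letters in one byte and kana or kanji in two. The sketch below only measures the encoded lengths; the sample characters are arbitrary.

```python
# ASCII letter: one byte in Shift JIS.
one_byte = "A".encode("shift_jis")

# Hiragana "あ": two bytes in Shift JIS.
two_bytes = "あ".encode("shift_jis")

print(len(one_byte), len(two_bytes))
```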
right-to-left
script directionality
Bush hid the facts
bug in Microsoft Windows applications
Han unification
effort by Unicode/ISO 10646 to map Han characters into a single set, ignoring regional variations
graphic character
encoded character that is associated with one or more glyphs
binary-to-text encoding
scheme for encoding arbitrary binary data as plain text
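Base64 and hexadecimal are two common binary-to-text encodings. A minimal sketch using Python's standard `base64` module and `bytes.hex` follows; the three sample bytes are arbitrary.

```python
import base64

raw = b"\x00\xff\x10"             # arbitrary bytes, not valid text on their own

as_base64 = base64.b64encode(raw)  # 3 bytes become 4 ASCII characters
as_hex = raw.hex()                 # each byte becomes 2 hex digits

# Both encodings are reversible.
round_trip = base64.b64decode(as_base64)

print(as_base64, as_hex, round_trip == raw)
```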
Windows Glyph List 4
pan-European character set specified by Microsoft
ghost characters
erroneous kanji included in the Japanese JIS X 0208 standard and later in Unicode
Hebrew Braille
Braille alphabet for the Hebrew language
UTF-EBCDIC
UTF-EBCDIC is a character encoding capable of encoding all 1,112,064 valid character code points in Unicode using 1 to 5 bytes (in contrast to a maximum of 4 for UTF-8). It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.
BCD character encoding
family of 6-bit character encodings
ISO/IEC 6937
ITU-T Recommendation
CESU-8
The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (3 bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four. Though not spec
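The two-step rule for supplementary characters can be sketched in Python. Python has no built-in CESU-8 codec, so the function `encode_cesu8` below is a hypothetical helper written for this illustration; it relies on the `surrogatepass` error handler to let the UTF-8 encoder emit surrogate code points.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative sketch, not a validated codec)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", errors="surrogatepass")
        else:
            # Supplementary character: split into a UTF-16 surrogate pair.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            # Encode each surrogate as a 3-byte UTF-8 sequence.
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", errors="surrogatepass")
    return bytes(out)


emoji = "\U0001F600"
cesu8 = encode_cesu8(emoji)   # six bytes
utf8 = emoji.encode("utf-8")  # four bytes

print(cesu8.hex(), utf8.hex())
```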
digraphs and trigraphs
in computer programming
Vietnamese Quoted-Readable
convention for writing Vietnamese using ASCII characters
SBCS
character encoding using one byte per character
Chinese Character Code for Information Interchange
character encoding standard
Wide character
data type
T.50
ITU-T recommendation
International Ideographs Core
subset of Unicode CJK Unified Ideographs characters intended for use on less capable devices
8-bit clean
computer system that correctly handles 8-bit character encodings
Perso-Arabic Script Code for Information Interchange
Indian government standard for encoding Kashmiri, Persian, Sindhi, and Urdu