UTF-1 is an obsolete method of transforming ISO/IEC 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses (for instance Unix filenames cannot contain the byte value used for forward slash). UTF-1 is also slow to encode or decode due to its use of division and multiplication by a number which is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by UTF-8.
UTF-1 is an obsolete method of transforming ISO/IEC 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes searching for substrings and error recovery difficult. It reuses the ASCII printing characters for multi-byte encodings, making it unsuited for some uses (for instance Unix filenames cannot contain the byte value used for forward slash). UTF-1 is also slow to encode or decode due to its use of division and multiplication by a number which is not a power of 2. Due to these issues, it did not gain acceptance and was quickly replaced by UTF-8.
== Design == Similar to UTF-8, UTF-1 is a variable-width encoding that is backwards-compatible with ASCII. Every Unicode code point is represented by either a single byte, or a sequence of two, three, or five bytes. All ASCII code points are a single byte (the code points through are also single bytes).
Discovered by embedding cosine similarity (sentence-transformers MiniLM, 384-dim).