UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format 8-bit. As of 2026, almost every webpage (99%) is transmitted as UTF-8.
via Wikipedia infobox
UTF-8 is a character encoding standard used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode Transformation Format 8-bit. As of 2026, almost every webpage (99%) is transmitted as UTF-8.
UTF-8 supports all 1,112,064 valid Unicode code points using a variable-width encoding of one to four one-byte (8-bit) code units.
Discovered by embedding cosine similarity (sentence-transformers MiniLM, 384-dim).