Also known as BPE
data compression in which the most common pair of consecutive bytes is replaced with a byte that doesn't occur within the data
字节对编码 是一种简单的数据压缩形式,这种方法用数据中不存的一个字节表示最常出现的连续字节数据。这样的替换需要重建全部原始数据。
Abstract from DBpedia / Wikipedia · CC BY-SA
Discovered by embedding cosine similarity (sentence-transformers MiniLM, 384-dim).