F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses chann

Paper · arXiv

Authors: Dinghao Zhou, Xingchen Song, Di Wu, Pengyu Cheng, Shengfan Shen + 1 more
Published: 2026-06-04
Categories: cs.SD, cs.AI, eess.AS

via arXiv · 2606.06357