Tokenizer design significantly impacts language model performance, yet evaluating tokenizer quality remains challenging. While text compression has emerged as a common intrinsic metric, recent work questions its reliability as a quality indicator. We investigate whether evaluating tokenizers on smaller models (350M parameters) reliably predicts their impact at larger scales (2.7B parameters). Through experiments with established tokenizers from widely-adopted language models, we find that tokenizer choice minimally affects English tasks but yields significant, scale-consistent differences in machine translation performance. Based on these findings, we propose additional intrinsic metrics that correlate more strongly with downstream performance than text compression. We combine these metrics into an evaluation framework that enables more reliable intrinsic tokenizer comparisons.
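To make the text-compression metric discussed above concrete, here is a minimal sketch (not the paper's implementation) of how such an intrinsic measure is commonly computed: average UTF-8 bytes per token over a sample corpus, where higher values indicate stronger compression. The tokenizer names and the sample corpus are illustrative assumptions, not choices taken from the paper.

```python
# Illustrative sketch: text compression as an intrinsic tokenizer metric,
# measured as average UTF-8 bytes per token. Tokenizer names and corpus
# below are assumptions for demonstration only.
from transformers import AutoTokenizer

def bytes_per_token(tokenizer, texts):
    """Return average bytes per token; higher means stronger compression."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(
        len(tokenizer.encode(t, add_special_tokens=False)) for t in texts
    )
    return total_bytes / total_tokens

corpus = [
    "Tokenizer design significantly impacts language model performance.",
    "La conception du tokenizer influence la traduction automatique.",
]

for name in ["gpt2", "bert-base-multilingual-cased"]:  # illustrative tokenizers
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {bytes_per_token(tok, corpus):.2f} bytes/token")
```

As the abstract notes, compression alone can be an unreliable quality indicator, which motivates combining it with additional intrinsic metrics that correlate better with downstream performance.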
