juniorLLMOps
What is tokenization in LLMs?
Updated May 16, 2026
Short answer
Tokenization is the process of converting text into smaller units (tokens) that LLMs can process.
Deep explanation
LLMs do not process raw text directly. Instead, text is split into tokens such as subwords or characters using tokenizers like BPE. Each token maps to an integer ID. The model processes sequences of token IDs to generate predictions.
Real-world example
API billing is based on number of tokens processed per request.
Common mistakes
- Assuming tokens equal words.
Follow-up questions
- Why are tokens important for cost?
- What is BPE?