What is tokenization in LLMs?

Updated May 16, 2026

Short answer

Tokenization is the process of converting text into smaller units called tokens for model processing.

Deep explanation

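LLMs operate on numbers, not raw text. A tokenizer first splits the text into units (whole words, subwords, characters, or bytes, depending on the tokenizer), then maps each unit to an integer ID from a fixed vocabulary. The model consumes this sequence of IDs. Most modern LLMs use a subword algorithm such as Byte Pair Encoding (BPE), often through a library such as SentencePiece, which implements both BPE and a unigram language-model tokenizer.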
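To make the BPE idea concrete, here is a minimal sketch of how merge rules could be learned from a tiny corpus. It is a teaching simplification, not how any particular library implements it: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are hypothetical, words are split on whitespace, and ties between equally frequent pairs go to whichever pair was seen first.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # max() keeps the first pair seen when counts are tied.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single token, while the rarer "lower" and "lowest" are still split into "low" plus leftover characters. Production tokenizers apply the same idea to far larger corpora, usually starting from bytes rather than characters.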

Real-world example

The sentence 'Hello world' might be split into ['Hello', ' world'], with the leading space kept as part of the second token, or into smaller subword pieces, depending on the tokenizer and its vocabulary.
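The example above can be sketched with a toy tokenizer. Everything here is an assumption for illustration: the vocabularies and IDs are made up, and the greedy longest-match strategy is a simplification (real BPE tokenizers apply learned merge rules instead).

```python
# Hypothetical toy vocabularies; real ones hold tens of thousands of entries.
LARGE_VOCAB = {"Hello": 0, " world": 1, "Hel": 2, "lo": 3, " wor": 4, "ld": 5, "<unk>": 6}
SMALL_VOCAB = {"Hel": 2, "lo": 3, " wor": 4, "ld": 5, "<unk>": 6}


def tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry matches at this position: emit an unknown token.
            tokens.append("<unk>")
            i += 1
    return tokens


def encode(text, vocab):
    """Map text to the integer IDs the model would actually receive."""
    return [vocab[token] for token in tokenize(text, vocab)]


print(tokenize("Hello world", LARGE_VOCAB))  # ['Hello', ' world']
print(encode("Hello world", LARGE_VOCAB))    # [0, 1]
print(tokenize("Hello world", SMALL_VOCAB))  # ['Hel', 'lo', ' wor', 'ld']
print(encode("Hello world", SMALL_VOCAB))    # [2, 3, 4, 5]
```

The same sentence yields two tokens with one vocabulary and four with the other, which is why token counts differ from model to model.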

Common mistakes

  • Assuming tokens are always words.

Follow-up questions

  • What is BPE tokenization?
  • Why is tokenization important?
