juniorNLP
What is tokenization in NLP?
Updated May 17, 2026
Short answer
Tokenization is the process of breaking text into smaller units like words, subwords, or sentences.
Deep explanation
Tokenization is a fundamental preprocessing step in NLP where raw text is split into meaningful units called tokens. These tokens are then used for downstream tasks such as classification, translation, and sentiment analysis. Different tokenization strategies include whitespace-based, rule-based, and subword tokenization (like BPE or WordPiece used in transformers).
Real-world example
Search engines tokenize queries to match relevant documents more accurately.
Common mistakes
- Assuming tokenization is language-agnostic
- ignoring punctuation handling.
Follow-up questions
- What is subword tokenization?
- Why is tokenization language-dependent?