What is hierarchical token merging (ToMe) in Vision Transformers?
Updated May 15, 2026
Short answer
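Token Merging (ToMe) speeds up a Vision Transformer (ViT) by merging similar tokens as they pass through the network, so later layers process fewer tokens. It can be applied at inference time or during training.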
Deep explanation
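ToMe identifies similar tokens using a similarity metric (often cosine similarity) and merges each matched group into a single representative token, typically by averaging. Applied across successive layers, this progressively shortens the token sequence while retaining most of the information the tokens carried. Because self-attention cost grows as O(N²) in the number of tokens N, removing tokens cuts attention compute more than proportionally: halving N reduces that cost to roughly a quarter. Accuracy tends to hold up because many tokens, especially those covering redundant regions such as backgrounds, are nearly identical.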
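To make the merging step concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not the reference implementation: the function names `cosine` and `merge_tokens` are hypothetical, similarity is computed on the raw token vectors, and merged tokens are combined with a plain mean. It assumes a matching scheme in which tokens are split alternately into two sets and each token in the first set is paired with its most similar token in the second.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0

def merge_tokens(tokens, r):
    """Merge up to r tokens into their most similar partner (hypothetical helper).

    tokens: list of equal-length float vectors. Returns a shorter list.
    """
    a, b = tokens[0::2], tokens[1::2]  # split alternately into two sets
    if not b:
        return [list(t) for t in tokens]

    # For each token in set a, find its most similar token in set b.
    best = []
    for i, t in enumerate(a):
        sims = [cosine(t, u) for u in b]
        j = max(range(len(b)), key=sims.__getitem__)
        best.append((sims[j], i, j))

    # The r most similar a-tokens are merged; the rest are kept unchanged.
    best.sort(key=lambda e: e[0], reverse=True)
    merged = best[:r]
    kept = sorted(i for _, i, _ in best[r:])

    # Average each merged a-token into its partner in b.
    sums = [list(u) for u in b]
    counts = [1] * len(b)
    for _, i, j in merged:
        sums[j] = [s + x for s, x in zip(sums[j], a[i])]
        counts[j] += 1
    b_out = [[s / c for s in row] for row, c in zip(sums, counts)]

    return [list(a[i]) for i in kept] + b_out

tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
shorter = merge_tokens(tokens, r=1)  # the two identical tokens collapse into one
```

Calling this once per transformer block with a small `r` shortens the sequence a little at each layer. A fuller version would also track how many original tokens each merged token represents, so that later averages can be weighted accordingly.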