What is hierarchical token merging (ToMe) in Vision Transformers?

Updated May 15, 2026

Short answer

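Token Merging (ToMe) speeds up a Vision Transformer by combining similar tokens inside the network, so that later layers process a shorter sequence. It can be applied at inference to an already trained model or used during training.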

Deep explanation

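Token Merging (ToMe) measures how similar tokens are (often with cosine similarity) and merges the most similar ones into a single representative token, typically by averaging them. Because tokens are merged rather than dropped, their information is folded into the surviving token instead of being discarded. Repeating the merge in successive layers shortens the sequence gradually rather than all at once, which is where the "hierarchical" in the question comes from. Since self-attention cost grows as O(N²) in the number of tokens N, even a moderate reduction in N cuts computation noticeably. Accuracy tends to hold up because the merged tokens are mostly redundant ones, such as patches of uniform background.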
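The merge step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the names `cosine` and `merge_tokens` and the parameter `r` (tokens merged per call) are hypothetical. It assumes tokens are split alternately into a source set and a destination set, and that similarity is computed on the token features themselves. A real implementation would use batched tensor operations and would usually protect special tokens such as the class token.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, sizes, r):
    """Merge up to r tokens into their most similar partner.

    tokens: list of feature vectors (lists of floats).
    sizes:  number of original patches each token already represents.
    Returns (new_tokens, new_sizes), at most r entries shorter than the input.
    """
    if len(tokens) < 2 or r <= 0:
        return [list(t) for t in tokens], list(sizes)

    # Assumption: alternate split into sources (A) and destinations (B).
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))

    # For every source token, find its most similar destination token.
    edges = []
    for i in a_idx:
        best = max(b_idx, key=lambda j: cosine(tokens[i], tokens[j]))
        edges.append((cosine(tokens[i], tokens[best]), i, best))

    # Keep only the r most similar (source, destination) pairs.
    edges.sort(key=lambda e: e[0], reverse=True)
    chosen = edges[:r]

    # Size-weighted sums, so a token that already stands for several
    # patches counts proportionally in the average.
    sums = {j: [x * sizes[j] for x in tokens[j]] for j in b_idx}
    new_size = {j: sizes[j] for j in b_idx}
    for _, i, j in chosen:
        sums[j] = [s + x * sizes[i] for s, x in zip(sums[j], tokens[i])]
        new_size[j] += sizes[i]

    merged_sources = {i for _, i, _ in chosen}
    out_tokens, out_sizes = [], []
    for k in range(len(tokens)):
        if k in merged_sources:
            continue  # absorbed into its destination token
        if k in sums:
            out_tokens.append([s / new_size[k] for s in sums[k]])
            out_sizes.append(new_size[k])
        else:
            out_tokens.append(list(tokens[k]))
            out_sizes.append(sizes[k])
    return out_tokens, out_sizes


# Two identical tokens, plus two that are less alike.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
merged, merged_sizes = merge_tokens(tokens, [1, 1, 1, 1], r=1)
print(len(merged), merged_sizes)  # 3 [2, 1, 1]
```

Tracking `sizes` matters once merging is repeated across layers: without it, a token that already represents many patches would be averaged as if it were a single patch. Calling a function like this once per transformer block with a fixed `r` shrinks the sequence by `r` tokens per layer. For example, a 12-block model that starts with 197 tokens and uses `r = 8` would end with 197 - 96 = 101 tokens.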
