What is a backbone-neck-head architecture in object detection?

Updated May 15, 2026

Short answer

It is a modular design in which the backbone extracts features from the image, the neck fuses them across scales, and the head makes the final predictions.

Deep explanation

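The backbone (for example, ResNet) extracts feature maps from the input image at several resolutions. The neck (for example, FPN) aggregates these multi-scale features so that each level combines fine spatial detail with coarser semantic context. The head predicts bounding boxes and class scores from the fused features. Because the three parts communicate only through feature maps, each one can be swapped or tuned independently, which gives the design its flexibility, and the multi-scale fusion in the neck helps the detector handle objects of very different sizes.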
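To make the data flow concrete, here is a minimal sketch that tracks only the shapes of the feature maps as they pass through the three stages. It is not a working detector. The strides, channel counts, anchor count, and YOLO-style output layout are illustrative assumptions, and every name in it (FeatureMap, backbone, neck, head, count_candidate_boxes) is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeatureMap:
    """Shape of one feature map (no pixel data is stored)."""
    channels: int
    height: int
    width: int
    stride: int  # downsampling factor relative to the input image


def backbone(image_h: int, image_w: int) -> List[FeatureMap]:
    """Backbone: turn the image into feature maps at several scales.

    The (stride, channels) pairs are assumptions, modelled on the last
    three stages of a ResNet-50-style network.
    """
    stages = [(8, 512), (16, 1024), (32, 2048)]
    return [FeatureMap(c, image_h // s, image_w // s, s) for s, c in stages]


def neck(features: List[FeatureMap], out_channels: int = 256) -> List[FeatureMap]:
    """Neck: fuse the scales, FPN-style.

    A real FPN projects each level with a 1x1 convolution and adds an
    upsampled copy of the coarser level. Only the resulting shapes are
    tracked here: resolution is kept, channels become uniform.
    """
    return [FeatureMap(out_channels, f.height, f.width, f.stride) for f in features]


def head(features: List[FeatureMap], num_classes: int,
         anchors_per_cell: int = 3) -> List[FeatureMap]:
    """Head: predict boxes and classes at every cell of every level.

    Assumed layout per anchor: 4 box offsets, 1 objectness score, and
    num_classes class scores.
    """
    per_anchor = 4 + 1 + num_classes
    return [FeatureMap(anchors_per_cell * per_anchor, f.height, f.width, f.stride)
            for f in features]


def count_candidate_boxes(outputs: List[FeatureMap], anchors_per_cell: int = 3) -> int:
    """Total number of boxes the head proposes across all levels."""
    return sum(o.height * o.width * anchors_per_cell for o in outputs)


if __name__ == "__main__":
    outputs = head(neck(backbone(640, 640)), num_classes=80)
    for o in outputs:
        print(f"stride {o.stride}: {o.channels} x {o.height} x {o.width}")
    print("candidate boxes:", count_candidate_boxes(outputs))
```

Under these assumptions, a 640 x 640 input with 80 classes gives three output levels of 80 x 80, 40 x 40, and 20 x 20 cells with 255 channels each, for 25,200 candidate boxes in total. Replacing backbone with a different feature extractor leaves neck and head untouched as long as it still returns a list of feature maps, which is the practical benefit of the split.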

Real-world example

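This pattern appears in YOLO-family detectors and in Faster R-CNN when it is combined with FPN. YOLOv4, for example, is described as a CSPDarknet53 backbone, an SPP and PANet neck, and a YOLOv3 head.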

Common mistakes

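  • Confusing the neck's responsibilities with the backbone's: the backbone extracts features from the image, while the neck only fuses the features it receives.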

Follow-up questions

  • Why separate these components?
  • What does the prediction head output?
