What is Reinforcement Learning from Human Feedback (RLHF)?

Updated May 16, 2026

Short answer

RLHF aligns AI model behavior with human preferences by combining reinforcement learning with human-generated feedback signals.

Deep explanation

Large language models trained only through next-token prediction often produce outputs that are unsafe, irrelevant, misleading, or misaligned with human expectations.

RLHF improves alignment by fine-tuning the model against a reward signal learned from human preference judgments.

Pipeline:

  1. Pretraining:
  • Train a language model using self-supervised learning.
  2. Supervised Fine-Tuning (SFT):
  • Train on curated instruction-response datasets.
  3. Human Preference Collection:
  • Humans rank multiple model outputs.
  4. Reward Model Training:
  • Train a reward model to predict human preferences.

5.…
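Steps 3 and 4 above can be sketched in a few lines of Python. This is a minimal illustration, assuming the commonly used Bradley-Terry pairwise objective for the reward model; the function names and the toy reward scores are hypothetical and not taken from any particular library.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst human ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the output the annotator ranked higher.
    return list(combinations(ranked_outputs, 2))


def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_preferred - r_rejected))."""
    margin = reward_preferred - reward_rejected
    # softplus(-margin) is a numerically stable form of -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical example: an annotator ranked three responses, best first.
pairs = ranking_to_pairs(["response_a", "response_b", "response_c"])

# Hypothetical scalar scores a reward model might assign to each response.
scores = {"response_a": 1.2, "response_b": 0.3, "response_c": -0.5}

# Average loss over all pairs; training would adjust the reward model
# to push this value down.
mean_loss = sum(
    pairwise_preference_loss(scores[good], scores[bad]) for good, bad in pairs
) / len(pairs)
```

The loss is small when the reward model scores the preferred output well above the rejected one, and grows when the order is reversed, which is what pushes the reward model toward agreeing with the human rankings.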

Real-world example

No real-world example available yet.

Common mistakes

No common mistakes listed yet.

Follow-up questions

No follow-up questions available yet.
