← AI Notes

Direct Preference Optimization (DPO)

31 Oct 2024

🚧 Work in progress…

This article will cover Direct Preference Optimization (DPO), a simpler alternative to RLHF that optimizes a language model directly on preference data, without training a separate reward model or running reinforcement learning.
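For reference, the objective the article will build toward is the DPO loss from Rafailov et al. (2023). Given a prompt $x$ with a preferred response $y_w$ and a rejected response $y_l$, a trainable policy $\pi_\theta$, and a frozen reference model $\pi_\text{ref}$:

$$
\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} \right) \right]
$$

Here $\sigma$ is the logistic sigmoid and $\beta$ controls how strongly the policy is penalized for drifting from the reference model.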

Topics to cover:

  • Limitations of RLHF and PPO
  • DPO formulation and theoretical foundation
  • Comparison with PPO-based RLHF
  • Implementation and practical considerations
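As a preview of the implementation section, here is a minimal sketch of the per-example DPO loss. It assumes the four sequence log-probabilities (policy and reference, each evaluated on the chosen and rejected responses) have already been summed over response tokens; the function name and signature are illustrative, not from any particular library.

```python
import math

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probs.

    beta scales the implicit KL penalty against the frozen
    reference model; each argument is log p(y | x) summed over
    the tokens of the response.
    """
    # Log-ratios of policy vs. reference for each response.
    chosen = pi_chosen_logp - ref_chosen_logp
    rejected = pi_rejected_logp - ref_rejected_logp
    logits = beta * (chosen - rejected)
    # -log(sigmoid(logits)) == softplus(-logits), computed stably.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and reference agree, the loss is exactly log 2; it shrinks as the policy assigns relatively more probability to the chosen response than the reference does, which is the gradient signal DPO trains on.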