Similar Tracks
Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
Umar Jamil
[GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Yannic Kilcher
LLM Training & Reinforcement Learning from Google Engineer | SFT + RLHF | PPO vs GRPO vs DPO
Martin Is A Dad
Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.
Umar Jamil
Stanford CS224N | 2023 | Lecture 10 - Prompting, Reinforcement Learning from Human Feedback
Stanford Online