Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.
