Similar Tracks
Attention is all you need (Transformer) - Model explanation (including math), Inference and Training
Umar Jamil
LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU
Umar Jamil
Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
Umar Jamil