Similar Tracks
Which transformer architecture is best? Encoder-only vs Encoder-decoder vs Decoder-only models (Efficient NLP)
RoPE (Rotary positional embeddings) explained: The positional workhorse of modern LLMs (DeepLearning Hero)
LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU (Umar Jamil)
ALiBi - Train Short, Test Long: Attention with linear biases enables input length extrapolation (Yannic Kilcher)
Self-Attention with Relative Position Representations – Paper explained (AI Coffee Break with Letitia)