DeBERTa: Decoding-enhanced BERT with Disentangled Attention (Machine Learning Paper Explained)
