LLM inference optimization: Architecture, KV cache and Flash attention
