Deep Dive: Optimizing LLM Inference