Deep Dive: Optimizing LLM Inference