Question 1

how does mla work

Accepted Answer

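Multi-head Latent Attention (MLA) applies low-rank joint compression to the keys and values so that the Key-Value (KV) cache stays small. Instead of caching separate keys and values for every head, MLA down-projects each token's hidden state into one small latent vector and caches only that vector, plus a small decoupled key that carries rotary position information. Per-head keys and values are recovered from the latent by up-projection matrices, and at inference time those matrices can be absorbed into the query and output projections, so the full keys and values never need to be stored. The DeepSeek-V2 paper, which introduced MLA, reports a 93.3% smaller KV cache than its dense predecessor DeepSeek 67B.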
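A toy sketch of why caching only the latent can be enough: scoring a query against a key rebuilt from the latent gives the same number as folding the up-projection into the query and scoring the latent directly. All matrices and sizes here are hypothetical, and the decoupled RoPE key is left out of the scoring part for brevity.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy sizes: hidden size 4, latent size 2, head size 3.
W_DOWN = [[0.5, -1.0, 0.0, 2.0],
          [1.0, 0.0, -0.5, 1.0]]   # hidden -> latent (2 x 4)
W_UP_K = [[1.0, 0.0],
          [0.0, 1.0],
          [2.0, -1.0]]             # latent -> key (3 x 2)

hidden = [1.0, 2.0, 3.0, 4.0]      # hidden state of an earlier token
query = [0.5, -1.0, 2.0]           # query of the current token

latent = matvec(W_DOWN, hidden)    # the only per-token vector that is cached

# Route 1: rebuild the key from the latent, then score it.
score_explicit = dot(query, matvec(W_UP_K, latent))

# Route 2: fold the up-projection into the query and score the latent directly.
W_UP_K_T = [list(col) for col in zip(*W_UP_K)]
score_absorbed = dot(matvec(W_UP_K_T, query), latent)


def cached_elements_mha(n_heads, head_dim):
    """Per-token, per-layer cache entries when full keys and values are stored."""
    return 2 * n_heads * head_dim


def cached_elements_mla(latent_dim, rope_dim):
    """Per-token, per-layer cache entries when only latent + RoPE key are stored."""
    return latent_dim + rope_dim
```

As a rough illustration, `cached_elements_mha(128, 128)` versus `cached_elements_mla(512, 64)` compares a plain multi-head cache with an MLA cache. Those sizes are assumed to match the configuration described in the DeepSeek-V3 report and should be checked against it before being relied on.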

Question 2

what is dualpipe

Accepted Answer

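DualPipe is a pipeline-parallel training schedule introduced with DeepSeek-V3. In conventional pipeline parallelism, accelerators sit idle (pipeline "bubbles") while they wait on other stages, and cross-node communication, such as the all-to-all traffic of expert parallelism, adds further stalls. DualPipe feeds micro-batches from both ends of the pipeline and pairs the forward chunk of one micro-batch with the backward chunk of another, so each GPU keeps computing while the communication for the paired chunk runs in the background. This hides most communication behind computation and shrinks the bubbles.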
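A deliberately simplified timing model of the overlap idea, not the actual DualPipe schedule: each chunk has a compute time and a communication time, and in the overlapped case the communication of one chunk runs alongside the compute of the next. The function names and the numbers are made up for illustration.

```python
def serialized_time(chunks):
    """Total time when every chunk computes, then communicates, with no overlap."""
    return sum(compute + comm for compute, comm in chunks)


def overlapped_time(chunks):
    """Total time when chunk i's communication overlaps chunk i+1's compute.

    Simplifying assumption: the next step starts only once both the previous
    communication and the current compute have finished.
    """
    total = chunks[0][0]                      # first compute has nothing to hide behind
    for (_, comm_prev), (compute_next, _) in zip(chunks, chunks[1:]):
        total += max(comm_prev, compute_next)
    total += chunks[-1][1]                    # last communication is exposed
    return total


# Four hypothetical chunks: 4 time units of compute, 3 of communication each.
example_chunks = [(4, 3)] * 4
```

With these numbers the serialized total is 28 units and the overlapped total is 19, because only the final communication remains exposed.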

Question 3

why is v3 cheap

Accepted Answer

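MLA is only part of the story, and it mainly reduces inference memory. Training cost is kept low by a sparse Mixture-of-Experts design (671B total parameters, about 37B activated per token), FP8 mixed-precision training, and the DualPipe schedule. DeepSeek-V3 was pre-trained on 14.8 trillion tokens, and the full training run took 2.788 million H800 GPU hours, roughly $5.6M USD at the report's assumed rental price of $2 per GPU hour. That figure covers the official training run only and excludes earlier research and ablation experiments.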
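The dollar figure is plain arithmetic on GPU hours. The sketch below reproduces it, assuming the per-stage breakdown and the $2 per GPU hour rental price given in the DeepSeek-V3 report. The variable names are made up for illustration.

```python
# GPU hours in thousands, per training stage (assumed from the report's cost table).
GPU_HOURS_THOUSANDS = {
    "pre-training": 2664,
    "context extension": 119,
    "post-training": 5,
}

# Assumed rental price of one H800 GPU hour in USD.
PRICE_PER_GPU_HOUR_USD = 2.0

total_gpu_hours = sum(GPU_HOURS_THOUSANDS.values()) * 1000
total_cost_usd = total_gpu_hours * PRICE_PER_GPU_HOUR_USD
```

The total is sensitive to the assumed hourly price: at a different rental rate the same GPU hours give a proportionally different cost.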

Question 4

why skip rnn

Accepted Answer

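Recurrent Neural Networks (RNNs) process a sequence one token at a time, and each hidden state depends on the previous one, so the time steps cannot be computed in parallel during training. Transformers compute attention over all positions of a sequence at once during training, which maps well onto modern parallel hardware and makes large-scale training practical.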
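A scalar toy that shows the dependency: the recurrent loop needs the previous state before it can take the next step, while a position-wise computation has no such chain and could run for all positions at once. The weights are arbitrary illustrative values.

```python
import math


def rnn_states(inputs, w_x=0.5, w_h=0.9):
    """Scalar RNN: each state depends on the state before it."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)   # cannot start until the previous h exists
        states.append(h)
    return states


def positionwise(inputs, w_x=0.5):
    """No cross-step dependency: every position could be computed in parallel."""
    return [math.tanh(w_x * x) for x in inputs]
```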

Question 5

explain multihead

Accepted Answer

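Multi-Head Attention projects the queries, keys, and values into several lower-dimensional subspaces (heads), runs attention in each head independently, then concatenates the head outputs and projects the result back to the model dimension. This lets the model attend to different kinds of relationships (e.g. grammatical agreement vs. subject-action dependencies) simultaneously.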
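A minimal sketch of the head bookkeeping only, assuming the projected vector is simply cut into equal slices. The learned projections and the per-head attention are omitted, and the function names are made up for illustration.

```python
def split_heads(vector, n_heads):
    """Cut one projected vector into n_heads equal-sized slices."""
    if len(vector) % n_heads != 0:
        raise ValueError("vector length must be divisible by n_heads")
    head_dim = len(vector) // n_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(n_heads)]


def merge_heads(heads):
    """Concatenate per-head outputs back into one vector."""
    return [value for head in heads for value in head]
```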

Question 6

what is scaled dot product

Accepted Answer

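It computes similarity scores as dot products between queries and keys, divides them by the square root of the key dimension d_k, applies a softmax to turn the scores into weights, and uses those weights to average the values. Without the scaling, dot products grow with d_k and push the softmax into saturated regions where gradients become extremely small.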
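A small pure-Python sketch of the computation for a single query, without masking or batching. It is meant to show the order of operations, not to be an efficient implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights
```

When all keys are identical the weights come out uniform and the output is the plain average of the values, which is a quick sanity check on the implementation.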

Question 7

transformer vs mamba

Accepted Answer

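Self-attention in Transformers scales quadratically with sequence length: a context twice as long needs roughly four times the attention compute, and the KV cache grows with every generated token. Mamba scales linearly in sequence length and carries a fixed-size state during generation, so the per-token cost of generation does not grow with context length.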
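The scaling claim can be checked with simple counting. This sketch counts query-key pairs for full attention and sequential steps for a scan, ignoring constant factors such as hidden size and causal masking (which roughly halves the pair count but keeps it quadratic).

```python
def attention_score_count(seq_len):
    """Number of query-key pairs scored by full self-attention."""
    return seq_len * seq_len


def scan_step_count(seq_len):
    """Number of state updates in a linear-time scan."""
    return seq_len
```

Doubling the context from 1000 to 2000 tokens multiplies the first count by four and the second by two.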

Question 8

selective parameterization

Accepted Answer

Standard State Space Models are linear time-invariant: every token is processed with the same fixed matrices. Mamba makes the step size Δ and the input and output matrices B and C functions of the current token, so the discretized state transition changes from token to token and the model can choose what to remember and what to discard.
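A scalar sketch of a selective recurrence, assuming the simplified discretization a_bar = exp(delta * a) and b_bar = delta * b. In a real model delta, b, and c would be produced from each token by learned projections; here they are passed in directly so the effect is easy to see.

```python
import math


def selective_scan(xs, deltas, bs, cs, a=-1.0):
    """Scalar selective SSM: per-token delta, b, c control what the state keeps."""
    h = 0.0
    ys = []
    for x, delta, b, c in zip(xs, deltas, bs, cs):
        a_bar = math.exp(delta * a)   # small delta -> a_bar near 1 -> state is kept
        h = a_bar * h + delta * b * x # small delta -> the new input is mostly ignored
        ys.append(c * h)
    return ys
```

A step with delta equal to zero leaves the state untouched, while a large delta mostly overwrites it with the new input. That is the "remember or discard" choice in miniature.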

Question 9

fast sram scan

Accepted Answer

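To make selective scans fast, Gu & Dao implemented a hardware-aware fused CUDA kernel. It loads the SSM parameters from high-bandwidth memory (HBM) into fast on-chip SRAM, performs the discretization and the scan there, and writes only the final outputs back to HBM, so the large expanded state is never materialized in HBM. Intermediate states are recomputed in the backward pass rather than stored.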
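A back-of-the-envelope sketch of why keeping the expanded state out of HBM matters: the state has an extra dimension of size N compared with the inputs and outputs. The shapes below are illustrative assumptions, not measurements.

```python
def expanded_state_elements(batch, seq_len, d_model, d_state):
    """Elements in the full per-token state, shape (batch, seq_len, d_model, d_state)."""
    return batch * seq_len * d_model * d_state


def output_elements(batch, seq_len, d_model):
    """Elements in the scan output, shape (batch, seq_len, d_model)."""
    return batch * seq_len * d_model


# Hypothetical shapes: batch 8, 2048 tokens, 1024 channels, state size 16.
state = expanded_state_elements(8, 2048, 1024, 16)
output = output_elements(8, 2048, 1024)
ratio = state // output
```

The ratio equals the state size N, so a kernel that writes only the outputs to HBM moves roughly N times less data than one that writes the full state.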

DeepSeek-V3 Technical Report: Efficient Training & Latent Attention
