3.1 MLA: The best of both worlds
Before DeepSeek, attention mechanisms presented a difficult choice. We were forced to navigate a trade-off curve: on one end, the full performance of Multi-Head Attention with its massive KV cache; on the other, the memory efficiency of Multi-Query Attention with its compromised expressiveness.
This led the DeepSeek team to ask a more profound question: Can we break the trade-off itself?
Is it possible to design an attention mechanism that delivers:
- Low Cache Size: A memory footprint comparable to the highly efficient MQA or GQA (a rough per-token comparison follows this list).
- High Performance: The full expressive power of MHA, where every attention head has a unique, specialized perspective.
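
To make the first item concrete, here is a rough count of the numbers cached per token, per layer. The dimensions ($n_h = 32$ heads, $d_h = 128$ per head, $g = 4$ key-value groups for GQA) are illustrative assumptions, not any particular model's configuration:

$$
\text{MHA: } 2\, n_h d_h = 8192 \qquad \text{GQA: } 2\, g\, d_h = 1024 \qquad \text{MQA: } 2\, d_h = 256
$$

The goal is a footprint near the MQA end of this range while still giving all $n_h$ heads distinct Keys and Values.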
It seems impossible. To reduce the cache size, it feels like we must reduce the amount of unique information we store. How could every head possibly have its own distinct Keys and Values without our caching them all?
The answer lies in a beautiful trick, a new way of thinking about the problem. The core innovation of MLA is to shift the focus from reducing the number of heads to compressing the information within them.
The insight is this: what if we don’t have to cache the Key and Value matrices separately? What if we could first project our input into a single, combined, and much smaller matrix, a latent matrix, and cache only that?
This is the central idea of Multi-Head Latent Attention. Instead of caching two large, full-dimensional matrices ($K$ and $V$), we cache one smaller, lower-dimensional matrix ($c^{KV}$). This single latent matrix becomes our new, highly efficient cache. Then, when we need the full Key and Value matrices for the attention calculation, we can reconstruct them on the fly from this compressed latent representation.
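
In symbols, for each token position $t$, a sketch of this idea in notation close to the DeepSeek-V2 paper ($h_t$ is the token's hidden state, $W^{DKV}$ the down-projection, $W^{UK}$ and $W^{UV}$ the up-projections):

$$
c^{KV}_t = W^{DKV} h_t, \qquad k_t = W^{UK} c^{KV}_t, \qquad v_t = W^{UV} c^{KV}_t
$$

Only $c^{KV}_t$ is kept in the cache; its dimension is chosen to be far smaller than the combined dimension of all heads' Keys and Values.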
This is the magic of MLA:
- For Caching: We store a single, small, compressed matrix.
- For Calculation: We decompress it to get full-sized, unique Key and Value matrices for every head.
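
To make the caching story concrete, here is a minimal PyTorch sketch of the compress-then-reconstruct idea. It is an illustration, not DeepSeek's implementation; all names and dimensions (`d_model`, `d_latent`, and so on) are assumptions chosen for readability:

```python
import torch
import torch.nn as nn

# Minimal sketch of MLA's compress-then-reconstruct idea.
# All names and dimensions are illustrative assumptions, not DeepSeek's
# actual configuration or implementation.
d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512

# Down-projection: compress the hidden state into a small latent that we cache.
W_down_kv = nn.Linear(d_model, d_latent, bias=False)
# Up-projections: reconstruct per-head Keys and Values from the cached latent.
W_up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
W_up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

h = torch.randn(1, 1024, d_model)                  # (batch, seq_len, d_model)

c_kv = W_down_kv(h)                                # (1, 1024, 512): the only tensor we cache
k = W_up_k(c_kv).view(1, 1024, n_heads, d_head)    # unique Keys for every head
v = W_up_v(c_kv).view(1, 1024, n_heads, d_head)    # unique Values for every head

# Per token and per layer, the cache now holds d_latent numbers
# instead of 2 * n_heads * d_head.
print(d_latent, "vs", 2 * n_heads * d_head)        # 512 vs 8192 -> 16x smaller here
```

In this sketch, only `c_kv` would be appended to the cache during decoding; the full-sized Keys and Values are recomputed from it whenever attention runs.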
This approach promises to deliver the best of both worlds. It achieves this by factorizing the Key and Value projections into a two-step process: a down-projection into a smaller latent space for efficient caching, followed by an up-projection back to the full-dimensional space for the attention calculation. To understand how it works, we must first look at the new architecture that makes this compression and decompression possible.