6.1 DSA Prototype: Lightning Indexer and Fine-Grained Token Selection
DeepSeek Sparse Attention (DSA) is the key architectural innovation introduced in DeepSeek-V3.2. The model shares its exact architecture with DeepSeek-V3.2-Exp; relative to DeepSeek-V3.1-Terminus, the only architectural change is the introduction of DSA, incorporated via continued training.
The core insight is simple but powerful: standard dense attention has O(L²) complexity in the sequence length L, which becomes a severe bottleneck at long context lengths. DSA reduces this to O(L·k), where k ≪ L is the number of tokens selected per query.
Lightning Indexer
The indexer computes an index score I_{t,s} between a query token h_t and a preceding token h_s:
I_{t,s} = Σ_{j=1}^{H^I} w_{t,j}^I · ReLU(q_{t,j}^I · k_s^I)
Where:
- H^I is the number of indexer heads (kept small for efficiency)
- q_{t,j}^I and w_{t,j}^I are derived from query token h_t
- k_s^I is derived from the preceding token h_s
- ReLU is chosen as the activation function for throughput optimization
The lightning indexer has a small number of heads and can be implemented in FP8, making its computational overhead remarkably low.
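The index-score computation above can be sketched in NumPy. The shapes and variable names here are illustrative assumptions, not the model's actual dimensions: `q` holds the per-head indexer queries, `k` a single indexer key per token, and `w` the per-head weights, all presumed to be derived from the hidden states by small learned projections.

```python
import numpy as np

def index_scores(q, k, w):
    """I[t, s] = sum_j w[t, j] * ReLU(q[t, j] . k[s]) for s <= t.

    q: (L, H_I, d_I)  indexer queries, H_I heads per query token
    k: (L, d_I)       one indexer key per token (illustrative shape)
    w: (L, H_I)       per-head weights derived from the query token
    """
    dots = np.einsum('thd,sd->ths', q, k)   # q_{t,j} . k_s for all (t, s) pairs
    act = np.maximum(dots, 0.0)             # ReLU activation
    I = np.einsum('th,ths->ts', w, act)     # weighted sum over indexer heads
    # causal mask: a token only indexes itself and preceding tokens
    return np.where(np.tril(np.ones_like(I, dtype=bool)), I, -np.inf)
```

Because H_I is small and the matmuls can run in FP8 in the real kernel, this score pass is cheap relative to full attention even though it still touches all (t, s) pairs.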
Fine-Grained Token Selection
Given the index scores {I_{t,s}} for each query token, the selection mechanism retrieves only the key-value entries corresponding to the top-k index scores:
u_t = Attn(h_t, {c_s | I_{t,s} ∈ Top-k(I_{t,:})})
Where c_s represents the key-value entry (latent vector in MLA) for token s. The attention output u_t is computed by applying the standard attention mechanism between the query token and the sparsely selected entries only.
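A minimal sketch of the selection step for one query token, assuming (as the MQA discussion below suggests) that the latent c_s serves as both key and value; `top_k` is the Top-k budget and all names are illustrative.

```python
import numpy as np

def sparse_attend(h_t, c, I_t, top_k):
    """Attend from query h_t to only the top_k entries ranked by I_t.

    h_t: (d,)   query token representation
    c:   (L, d) key-value entries (MLA latents), one per preceding token
    I_t: (L,)   index scores for this query token
    """
    # fine-grained token selection: keep the top_k highest index scores
    idx = np.argpartition(-I_t, top_k - 1)[:top_k]
    c_sel = c[idx]
    # standard scaled dot-product attention over the selected entries only
    logits = c_sel @ h_t / np.sqrt(h_t.shape[0])
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p @ c_sel
```

The key cost saving is that the softmax and value aggregation run over top_k entries instead of all L, giving the O(L·k) total mentioned above.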
Instantiation Under MLA
For continued training from DeepSeek-V3.1-Terminus, DSA is instantiated on top of MLA (Multi-head Latent Attention). At the kernel level, each key-value entry must be shared across multiple queries for computational efficiency, so DSA is implemented under the MQA (Multi-Query Attention) mode of MLA, in which each latent vector (the key-value entry of MLA) is shared across all query heads of the query token.