
KV cache

The KV cache holds per-layer K and V tensors so subsequent decode steps don’t re-compute attention over the entire prefix. FFAI ships two cache implementations today (raw fp16 / bf16 and 8/4-bit affine-quantized); the remaining compressed variants land later in Phase 5.

| Algorithm | When to use | Memory ratio | Status |
| --- | --- | --- | --- |
| Raw fp16 / bf16 (KVCache, default) | All current models. | 1× (baseline) | ✅ Shipped (Phase 2). |
| affine8 (AffineQuantizedKVCache, 8-bit) | Memory-constrained; ~7% decode-tok/s tax. | ~0.55× (45% smaller), measured on Qwen3 1.7B | ✅ Shipped (Phase 5c). |
| affine4 (AffineQuantizedKVCache, 4-bit) | Tight memory; same speed as affine8. | ~0.31× (69% smaller, group_size=32) | ✅ Shipped (Phase 5c). |
| TurboQuant | Best memory ratio at minimal quality loss. | ~6–8× reduction at turbo4v2 | ⏳ Planned (Phase 5d). |
| SSM / Hybrid | Mamba / GatedDeltaNet (Qwen 3.5, NemotronH) | n/a (stores recurrent + conv state) | ⏳ Planned (Phase 5e). |
| Batched | Multi-stream decode (speculative, B>1 serving) | Linear in B | ⏳ Planned (Phase 8+). |

The raw cache is the default and is what every demo / test exercises today. The remaining compressed variants are deliberately deferred: the goal of Phase 4 (perf) was to nail the dispatch path; the goal of Phase 5 is to add cache compression on top.

Each layer holds its own KVCache instance. During the forward pass:

  1. Q, K, V are projected from the post-RMSNorm hidden state.
  2. RoPE is applied to Q and K.
  3. kv_cache_update kernel appends the new K/V rows into the per-layer cache buffer on the GPU. No CPU↔GPU sync — the append enqueues onto the same MTLCommandBuffer as the rest of the layer.
  4. sdpa_decode kernel scores the single query row against the full cached K/V slice up to the current position.

The cache buffer is allocated once per layer at the configured max context length; appends bump an offset rather than reallocating. This is the same shape MLX uses, minus the Metal compile latency.
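The allocate-once, append-by-offset scheme can be sketched on the CPU. RawKVCacheSketch below is a hypothetical stand-in for illustration only: the real cache lives in GPU buffers, stores fp16/bf16 rather than Float, and is appended by the kv_cache_update kernel.

```swift
// CPU sketch of append-by-offset (hypothetical type; the real cache is a GPU
// buffer appended by kv_cache_update, and stores fp16/bf16 rather than Float).
struct RawKVCacheSketch {
    let maxSeq: Int, nKVHeads: Int, headDim: Int
    private(set) var offset = 0              // rows appended so far; never shrinks
    private(set) var k: [Float], v: [Float]  // allocated once at max context length

    init(maxSeq: Int, nKVHeads: Int, headDim: Int) {
        self.maxSeq = maxSeq; self.nKVHeads = nKVHeads; self.headDim = headDim
        k = Array(repeating: 0, count: nKVHeads * maxSeq * headDim)
        v = Array(repeating: 0, count: nKVHeads * maxSeq * headDim)
    }

    // Append one token's K/V rows (laid out [nKVHeads, headDim]); bumps the
    // offset instead of reallocating.
    mutating func append(kRow: [Float], vRow: [Float]) {
        precondition(offset < maxSeq, "cache is full")
        for h in 0..<nKVHeads {
            let dst = (h * maxSeq + offset) * headDim
            let src = h * headDim
            k.replaceSubrange(dst..<dst + headDim, with: kRow[src..<src + headDim])
            v.replaceSubrange(dst..<dst + headDim, with: vRow[src..<src + headDim])
        }
        offset += 1
    }
}
```

SDPA then only scores against the first `offset` positions of each head's slice.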

```swift
let caches = model.engine.makeLayerCaches() // [any LayerCacheProtocol], one per layer
```

makeLayerCaches() is on the LanguageModel protocol — LlamaModel, Qwen3Model, and Mamba2Model all implement it. The user owns the cache lifetime; keep it across forward(...) / forwardSample(...) calls for multi-turn or streaming.

Two schemes ship today, selectable via LoadOptions.kvCache:

```swift
public enum KVCacheKind: Sendable, Equatable {
    case raw                                                  // default
    case affineQuantized(bits: Int = 8, groupSize: Int = 64)  // Phase 5c
    // .turbo — Phase 5d
}
```

Activating the 8-bit affine cache:

```swift
let model = try await Model.load(
    "mlx-community/Qwen3-1.7B-4bit",
    options: LoadOptions(kvCache: .affineQuantized(bits: 8, groupSize: 64))
)
```

Or via the CLI:

```sh
ffai --model mlx-community/Qwen3-1.7B-4bit --prompt "..." --kv-cache affine8
```

Per attention layer the cache holds three packed buffers for K, and the same three for V:

  • kWeights — u32 words, 4 int8 values per word
  • kScales — fp16/bf16, one per group
  • kBiases — fp16/bf16, one per group

All layers in one makeLayerCaches(...) call share a single pair of working buffers sized [nKVHeads, maxSeq, headDim] in the model dtype. On appendOnGPU(...) the quantize_kv_int8 kernel writes the new row into the layer’s compressed storage. On prepareForAttention(...) (called before SDPA) the bulk_dequant_kv_int8 kernel materialises the live slice into the shared working buffer, which SDPA then reads. Metal’s default hazard tracking serialises the working-buffer reuse across layers within a single command buffer.
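The per-group affine codec itself is simple; here is a CPU sketch of one group. This is illustrative only: the real quantize_kv_int8 / bulk_dequant_kv_int8 kernels run on the GPU and pack four int8 values per u32 word.

```swift
// CPU sketch of per-group affine int8 quantization: each group stores uint8
// codes plus one scale and one bias (fp16/bf16 in the real kernels).
func quantizeGroup(_ x: [Float]) -> (codes: [UInt8], scale: Float, bias: Float) {
    let lo = x.min()!, hi = x.max()!
    let scale = max(hi - lo, 1e-8) / 255       // map [lo, hi] onto [0, 255]
    let codes = x.map { UInt8((($0 - lo) / scale).rounded()) }
    return (codes, scale, lo)
}

func dequantizeGroup(_ codes: [UInt8], scale: Float, bias: Float) -> [Float] {
    codes.map { Float($0) * scale + bias }     // reconstruction error ≤ scale/2
}
```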

Measured on Qwen3 1.7B 4-bit at maxSeq=40960

| Metric | Raw | affine8 | affine4 | Δ vs raw |
| --- | --- | --- | --- | --- |
| KV cache (alloc) | 4.38 GB | 2.32 GB | 1.37 GB | −47% / −69% |
| Peak GPU | 5.28 GB | 3.38 GB | 2.44 GB | −36% / −54% |
| Decode tok/s | 46.7 | 43.6 | 45.4 | −7% / −3% |
| Output quality | reference | first ~13 tokens match raw, then minor drift | coherent, simpler answers | both stay on-topic |
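The 4.38 GB raw-cache allocation can be sanity-checked from first principles, assuming Qwen3 1.7B’s published shape (28 layers, 8 KV heads, head dim 128; these numbers come from the public model config, not this document):

```swift
// Back-of-envelope check of the raw-cache allocation, assuming Qwen3 1.7B's
// shape (28 layers, 8 KV heads, head dim 128) and 2 bytes per fp16/bf16 element.
let layers = 28, nKVHeads = 8, headDim = 128
let maxSeq = 40960, bytesPerElem = 2
let kvBytes = layers * 2 * nKVHeads * maxSeq * headDim * bytesPerElem  // × 2 for K and V
let kvGB = Double(kvBytes) / 1_073_741_824
// kvGB is exactly 4.375, matching the 4.38 GB measured allocation
```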
| Scheme | Default groupSize | Why |
| --- | --- | --- |
| affine8 | 64 | Plenty of precision per group; matches the mlx-format weight-quant convention. |
| affine4 | 32 | 4 bits per element across a wider group loses too much discriminative power on K/V; decode degenerates into repetition at group_size=64. TurboQuant-style rotation (Phase 5d) would let larger groups work. |
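The memory ratios above fall out of a bits-per-element count: the payload bits plus two fp16 values (scale and bias) amortized over each group. The ideal figures land slightly below the measured ones because allocation rounding adds a little overhead:

```swift
// Effective size of an affine-quantized cache relative to raw fp16:
// payload bits per element, plus an fp16 scale and bias shared by each group.
func affineMemoryRatio(bits: Int, groupSize: Int) -> Double {
    let bitsPerElem = Double(bits) + (16.0 + 16.0) / Double(groupSize)
    return bitsPerElem / 16.0
}
// affineMemoryRatio(bits: 8, groupSize: 64) == 0.53125  (~0.55× measured)
// affineMemoryRatio(bits: 4, groupSize: 32) == 0.3125   (~0.31× measured)
```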
Planned follow-ups:

  • affine6 variant — byte-packed sub-byte storage (mirroring the existing dequant_gather_int6 pattern). Memory footprint between affine4 and affine8.
  • Fused dequant-into-SDPA — today each attention step pays one extra dequant kernel dispatch. A fused bulk_dequant + sdpa_decode kernel removes the working-buffer materialisation entirely.
  • TurboQuant (Phase 5d) — block-wise MSE codec with asymmetric K/V bits + dense rotation; will recover full quality at 4-bit group_size=64.

For multi-turn or streaming UIs, drive the loop yourself and reuse the cache across calls (see quickstart.md § Lower-level API):

```swift
let caches = model.engine.makeLayerCaches()

func respond(_ prompt: String, position: inout Int) -> String {
    var pos = position
    var nextToken = 0
    // Prefill: feed the prompt tokens, appending their K/V rows to the cache.
    for t in model.tokenizer.encode(text: prompt) {
        nextToken = model.engine.forwardSample(tokenId: t, position: pos, caches: caches)
        pos += 1
    }
    // Decode: sample until a stop token, reusing the same caches.
    var generated: [Int] = []
    while !isStop(nextToken) {
        generated.append(nextToken)
        nextToken = model.engine.forwardSample(tokenId: nextToken, position: pos, caches: caches)
        pos += 1
    }
    position = pos
    return model.tokenizer.decode(tokens: generated)
}
```

pos keeps advancing across calls; the cache holds every K / V row appended so far.

From planning/plan.md:

  • Affine quantized KV cache — 4 / 6 / 8-bit affine group-quant for K and V. Self-transitions raw → quantized at startOffset so prefill stays fast. ~3.5× memory reduction at 4-bit; modest decode-tok/s tax.
  • TurboQuant cache — block-wise MSE codec with asymmetric K/V bits (e.g. 4-bit K, 2-bit V — turbo4v2). Two attention paths: the TurboFlash compressed-domain Metal kernel (default) or bulk-dequant → MLXFast SDPA (opt-in). ~6–8× memory reduction.
  • SSMStateCache — for Mamba / GatedDeltaNet families (Qwen 3.5 / NemotronH / Jamba). Stores conv + recurrent state instead of K/V; composes with attention layers via CacheList.
  • Batched cache — slot-based admission for fixed-size batches; enables speculative decoding and multi-stream serving.

Each lands in its own commit with the corresponding kernels in metaltile. Affine and TurboQuant are the highest-priority Phase 5 deliverables; SSM/GDN follow.

See also:

  • Architecture — where the cache sits in the per-token dispatch loop.
  • Performance — current tok/s numbers, including what kv_cache_update (Phase 4 wave 1) bought us vs the original CPU-memcpy append.
  • Quantization — weight quantization (a different axis from KV cache compression).