Quantization

FFAI supports the mlx-format affine group-quantized weight layout at every bit width MLX itself ships: 3 / 4 / 5 / 6 / 8.

This is weight-only quantization — it shrinks the on-disk + on-GPU footprint of the model and (usually) speeds up decode by lowering memory-bandwidth pressure. It’s a different axis from KV cache quantization, which compresses the attention K/V tensors at runtime and lands in Phase 5.

| Bit width | Group sizes | Status | Notes |
|---|---|---|---|
| bf16 / fp16 | n/a | Reference | unsloth/Llama-3.2-1B, mlx-community/Qwen3-4B-bf16, etc. |
| 8-bit | 32 / 64 / 128 | Supported | One byte per weight, packed 4 per uint32. |
| 6-bit | 32 / 64 / 128 | Supported | Byte-level pack: 4 weights per 3 bytes. |
| 5-bit | 32 / 64 / 128 | Supported | Byte-level pack: 8 weights per 5 bytes. |
| 4-bit | 32 / 64 / 128 | Supported | Eight weights per uint32. The mlx-community standard. |
| 3-bit | 32 / 64 / 128 | Supported | Byte-level pack: 8 weights per 3 bytes. |

mlx-community ships *-3bit, *-4bit, *-5bit, *-6bit, *-8bit variants of most models; see models.md for the curated set we regression-sweep.

mlx-format is the same *.safetensors file as a plain HF checkpoint, just with different contents. Each quantized linear layer ships as a triplet of tensors:

```
model.layers.0.self_attn.q_proj.weight   uint32   (out, in/pack_factor)
model.layers.0.self_attn.q_proj.scales   fp16     (out, n_groups)
model.layers.0.self_attn.q_proj.biases   fp16     (out, n_groups)
```

pack_factor = 32 / bits for the 4-bit and 8-bit “uint32-aligned” case. For the byte-packed widths (3 / 5 / 6) the storage shape is chosen so the row stride lands on a byte boundary — kernels handle the unpack.
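
To make the shapes concrete, here is a small sketch of the stored tensor shapes for the uint32-aligned case. The layer dimensions below are made up for the example, not read from any real checkpoint.

```swift
// Illustrative shape math for the uint32-aligned widths (4-bit / 8-bit).
// Dimensions are hypothetical; they are not from an actual model.
let outDim = 2048
let inDim = 2048
let bits = 4
let groupSize = 64

let packFactor = 32 / bits                      // 8 weights per uint32 at 4-bit
let weightShape = (outDim, inDim / packFactor)  // (2048, 256)  uint32
let nGroups = inDim / groupSize                 // 32 groups per row
let scalesShape = (outDim, nGroups)             // (2048, 32)   fp16
let biasesShape = (outDim, nGroups)             // (2048, 32)   fp16

print(weightShape, scalesShape, biasesShape)
```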

config.json carries:

```json
{
  "quantization": {
    "bits": 4,
    "group_size": 64
  }
}
```

The loader (SafeTensors.swift) detects this block and routes the matching layers through the dequant-gemv kernels. Plain *.bin / non-quantized .safetensors keep the bf16/fp16 path.
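
A minimal sketch of what that detection could look like, assuming a plain Codable view of config.json; the type and function names here are illustrative, not FFAI's actual loader types.

```swift
import Foundation

// Illustrative Codable view of the config.json fields the loader cares about;
// the real types in SafeTensors.swift may look different.
struct QuantizationBlock: Codable {
    let bits: Int        // 3, 4, 5, 6, or 8
    let groupSize: Int   // 32, 64, or 128

    enum CodingKeys: String, CodingKey {
        case bits
        case groupSize = "group_size"
    }
}

struct ModelConfig: Codable {
    let quantization: QuantizationBlock?   // absent on bf16/fp16 checkpoints
}

// Block present → route linear layers to dequant_gemv_<bits>;
// block absent → keep the bf16/fp16 path.
func gemvKernelName(for config: ModelConfig) -> String? {
    guard let q = config.quantization else { return nil }
    return "dequant_gemv_\(q.bits)"
}
```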

A single decode-time matvec on a quantized linear is one dequant_gemv_<bits> kernel call:

  • input row (1 × in) → the layer’s current hidden state
  • weight (uint32 buffer) → packed quantized values
  • scales (fp16 buffer) → per-group scale
  • biases (fp16 buffer) → per-group bias (zero point)
  • per-(out_row, group) partial sums → reduce → output row (1 × out)

On 4-bit and 8-bit, the uint32-aligned layout lets each thread fetch one packed word and emit pack_factor partial products inside the inner loop; the kernel parallelizes over both output rows and packed words. On 3 / 5 / 6-bit, the unpack happens byte by byte; the sub-group split trick (one SIMD subgroup per pack within a row) was the Phase 4 wave-2 optimization that closed the gap to the uint32-aligned widths.
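
For reference, here is a CPU sketch of the same arithmetic for the 4-bit, uint32-aligned case, computing one output element. It is illustrative only (the real path is the Metal kernel), and the nibble ordering inside the packed word is an assumption of the sketch.

```swift
// Reference (non-Metal) version of one output row of a 4-bit affine
// group-quantized matvec: unpack 8 nibbles per uint32, dequantize each
// weight as q * scale + bias for its group, and accumulate against the input.
func dequantGemv4Row(
    input: [Float],        // (in)
    packedRow: [UInt32],   // (in / 8) packed 4-bit weights for one output row
    scales: [Float],       // (in / groupSize) per-group scales
    biases: [Float],       // (in / groupSize) per-group biases (zero points)
    groupSize: Int = 64
) -> Float {
    var acc: Float = 0
    for (wordIndex, word) in packedRow.enumerated() {
        for lane in 0..<8 {
            let col = wordIndex * 8 + lane
            let q = Float((word >> (4 * lane)) & 0xF)   // low-nibble-first assumption
            let group = col / groupSize
            acc += (q * scales[group] + biases[group]) * input[col]
        }
    }
    return acc
}
```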

Phase 4 perf table for Qwen 3 4B (M1 Max, single-stream decode, batch 1, ~32-token prompts):

| Width | Phase 3 baseline | Phase 4 (wave 1) | Phase 4 (wave 2) | Speedup |
|---|---|---|---|---|
| bf16 | 5.0 tok/s | ~24 tok/s | 28.0 tok/s | 5.6× |
| 8-bit | 4.7 tok/s | ~21 tok/s | 27.5 tok/s | 5.9× |
| 6-bit | 4.2 tok/s | ~19 tok/s | 26.1 tok/s | 6.2× |
| 5-bit | 4.0 tok/s | ~18 tok/s | 25.4 tok/s | 6.4× |
| 4-bit | 5.0 tok/s | ~22 tok/s | 29.8 tok/s | 6.0× |
| 3-bit | 3.6 tok/s | ~17 tok/s | 24.1 tok/s | 6.7× |

See performance.md for the full picture (Llama 3.2 1B + Qwen 3 4B across phases) and what each wave changed.

Rule of thumb (subject to per-model variation):

  • 8-bit — closest to bf16 quality; modest 2× memory savings vs bf16. Use when quality is the priority.
  • 6-bit — middle ground: slightly worse PPL than 8-bit, with weights roughly 25% smaller than the 8-bit variant.
  • 4-bit — the mlx-community standard. Best mainstream speed/memory/quality balance for most chat tasks.
  • 5-bit — compromise between 4-bit and 6-bit. Less commonly shipped by mlx-community.
  • 3-bit — most aggressive. Notable PPL hit; useful when memory is the binding constraint.

For a given model, pick the highest bit width that fits your memory budget. If you can fit 8-bit, bf16 is rarely worth the extra space on Apple Silicon.
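
As a rough sizing aid (an approximation, not a measurement): the packed values cost bits/8 bytes per weight, and every group of group_size weights adds one fp16 scale plus one fp16 bias. The sketch below ignores non-quantized tensors such as embeddings and norms.

```swift
// Approximate bytes per weight for the affine layout: bits/8 for the packed
// values, plus 4 bytes (fp16 scale + fp16 bias) shared by every groupSize
// weights. Non-quantized tensors (embeddings, norms) are not counted.
func approxBytesPerWeight(bits: Int, groupSize: Int) -> Double {
    Double(bits) / 8.0 + 4.0 / Double(groupSize)
}

// e.g. a 4B-parameter model at 4-bit, group_size 64:
// 4e9 × (0.5 + 0.0625) ≈ 2.25 GB of quantized weights.
let approxGB = 4e9 * approxBytesPerWeight(bits: 4, groupSize: 64) / 1e9
print(approxGB)
```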

Loading a quantized checkpoint is the same call as bf16:

```swift
let model = try await Model.load("mlx-community/Qwen3-4B-4bit")
```

The loader reads config.json, sees quantization.bits = 4, and routes the linear layers through dequant_gemv_4. No flag, no extra field on LoadOptions.

Formats not yet supported:

  • mxfp4 / nvfp4 — Microscaling FP4 with FP8-block scales. mlx-lm ships these for Qwen 3.5 / Gemma 4 / GPT-OSS; FFAI doesn’t yet. Different scale layout means a different kernel; planned alongside Qwen 3.5 in Phase 5.
  • gguf quantizations (Q4_K_M, Q5_K_M, Q8_0, …) — different binary layout, different per-block scales, different tensor naming. Planned for Phase 8+ if community demand justifies a per-arch name mapper.
  • QAT 4-bit (Gemma 3 in particular) — works mechanically (it’s affine 4-bit), but no Gemma family file exists yet.
  • Mixed-bit per-layer — config-driven per-layer bit budgets (quantization_config block). Planned alongside Phase 7 autotuner.

Related pages:

  • Models — checkpoints regression-swept per family.
  • Performance — full perf numbers and the Phase 4 wave-by-wave breakdown.
  • KV cache — runtime quantization of the attention K/V (a different axis).
  • Architecture — where dequant-gemv sits in the per-token dispatch loop.