Models

FFAI ships two architecture families today; both run real HuggingFace checkpoints end-to-end through Model.load("org/repo").

This page is the canonical landing page for:

  • which architectures are in tree
  • which sizes / quantizations have been exercised
  • known gaps per family

For porting a new architecture, see developing/adding-a-model.md.

| Family | File | model_type | architectures | Variants |
| --- | --- | --- | --- | --- |
| Llama 3.x | Models/Llama.swift | llama | LlamaForCausalLM | LlamaDense |
| Qwen 3 | Models/Qwen3.swift | qwen3 | Qwen3ForCausalLM | Qwen3Dense |

Both variants share the same Llama-shaped core: GQA attention with RoPE, RMSNorm, SwiGLU MLP. Qwen 3 adds per-head q_norm / k_norm RMSNorms applied to queries/keys before RoPE — the only structural difference vs Llama. No new kernels were needed for Qwen 3; just an extra RMSNorm site.
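For orientation, here is a minimal sketch of where that extra RMSNorm site sits in the attention path. The per-head array layout, the rmsNorm helper, and the function names are illustrative assumptions, not FFAI's actual kernel code:

```swift
// Minimal sketch (not FFAI's real kernels): Qwen 3 runs an RMSNorm over each
// query/key head before RoPE; Llama feeds q/k to RoPE directly.
func rmsNorm(_ x: [Float], weight: [Float], eps: Float = 1e-6) -> [Float] {
    let meanSquare = x.reduce(0) { $0 + $1 * $1 } / Float(x.count)
    let scale = 1.0 / (meanSquare + eps).squareRoot()
    return zip(x, weight).map { $0 * scale * $1 }
}

// q, k: per-head projection outputs, shape [nHeads][headDim].
func qwen3PreRoPE(q: [[Float]], k: [[Float]],
                  qNormWeight: [Float], kNormWeight: [Float]) -> ([[Float]], [[Float]]) {
    let qNormed = q.map { rmsNorm($0, weight: qNormWeight) }  // the extra site vs. Llama
    let kNormed = k.map { rmsNorm($0, weight: kNormWeight) }
    return (qNormed, kNormed)  // RoPE is applied to these afterwards, as in Llama
}
```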

The checkpoints below are regression-swept in the test suite and used for the performance numbers. Any other Llama 3.x or Qwen 3 size from HuggingFace should work; the family files don't hard-code size-specific paths.

| Repo | Size | Quant | Notes |
| --- | --- | --- | --- |
| unsloth/Llama-3.2-1B | 1B | bf16 | Phase 2 reference. |
| unsloth/Llama-3.2-3B | 3B | bf16 | |
| mlx-community/Llama-3.2-1B-4bit | 1B | mlx 4-bit | |
| mlx-community/Llama-3.2-3B-Instruct-4bit | 3B | mlx 4-bit | |
| mlx-community/Meta-Llama-3.1-8B-Instruct-4bit | 8B | mlx 4-bit | |

Llama 3 RoPE scaling (factor, low_freq_factor, high_freq_factor, original_max_position) is honored; checkpoints with rope_type: "llama3" in config.json route through the scaled variant automatically.
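As a reference for what that scaling does, here is a sketch of the standard "llama3" frequency-scaling rule those config fields describe; FFAI's internal routing and types may look different:

```swift
// Illustrative sketch of the "llama3" RoPE scaling rule; field names mirror the
// rope_scaling block in config.json. Not FFAI's actual implementation.
struct Llama3RoPEScaling {
    var factor: Float               // e.g. 8.0
    var lowFreqFactor: Float        // e.g. 1.0
    var highFreqFactor: Float       // e.g. 4.0
    var originalMaxPosition: Float  // e.g. 8192
}

func scaledInverseFrequencies(_ invFreqs: [Float], _ s: Llama3RoPEScaling) -> [Float] {
    let lowFreqWavelen = s.originalMaxPosition / s.lowFreqFactor
    let highFreqWavelen = s.originalMaxPosition / s.highFreqFactor
    return invFreqs.map { freq in
        let wavelen = 2 * Float.pi / freq
        if wavelen < highFreqWavelen { return freq }            // high-frequency band: untouched
        if wavelen > lowFreqWavelen { return freq / s.factor }  // low-frequency band: fully scaled
        // Between the bands: interpolate smoothly between the two regimes.
        let smooth = (s.originalMaxPosition / wavelen - s.lowFreqFactor)
                   / (s.highFreqFactor - s.lowFreqFactor)
        return (1 - smooth) * freq / s.factor + smooth * freq
    }
}
```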

| Repo | Size | Quant | Notes |
| --- | --- | --- | --- |
| mlx-community/Qwen3-0.6B-bf16 | 0.6B | bf16 | |
| mlx-community/Qwen3-1.7B-bf16 | 1.7B | bf16 | Integration-test baseline. |
| mlx-community/Qwen3-1.7B-3bit | 1.7B | mlx 3-bit | Integration-tested. |
| mlx-community/Qwen3-1.7B-4bit | 1.7B | mlx 4-bit | Integration-tested. |
| mlx-community/Qwen3-1.7B-5bit | 1.7B | mlx 5-bit | Integration-tested. Published by us; mlx-community didn't ship a plain text-only 5-bit Qwen3-1.7B (only TTS / ASR variants existed). |
| mlx-community/Qwen3-1.7B-6bit | 1.7B | mlx 6-bit | Integration-tested. |
| mlx-community/Qwen3-1.7B-8bit | 1.7B | mlx 8-bit | Integration-tested. |
| mlx-community/Qwen3-4B-bf16 | 4B | bf16 | |
| mlx-community/Qwen3-4B-4bit | 4B | mlx 4-bit | |
| mlx-community/Qwen3-4B-8bit | 4B | mlx 8-bit | |
| mlx-community/Qwen3-8B-4bit | 8B | mlx 4-bit | |
| mlx-community/Qwen3-14B-4bit | 14B | mlx 4-bit | |

The integration-test suite uses Qwen 3 1.7B across every supported bit width to keep CI fast — the per-bit-width quantization paths don’t depend on model size, so 1.7B (~3.5 GB bf16) covers the same codepaths as 4B / 8B / 14B at a fraction of the download cost.

mlx-community didn’t have a plain text-only 5-bit Qwen3-1.7B (only TTS / ASR variants), so we converted + uploaded one ourselves:

```sh
mlx_lm.convert --hf-path Qwen/Qwen3-1.7B \
  --mlx-path ./Qwen3-1.7B-5bit \
  -q --q-bits 5 --q-group-size 64 \
  --upload-repo mlx-community/Qwen3-1.7B-5bit
```

All in-tree families support every bit width FFAI implements:

| Bit width | Format | Status |
| --- | --- | --- |
| bf16 / fp16 | Plain safetensors | Supported |
| 8-bit | mlx-format affine | Supported |
| 6-bit | mlx-format affine (byte-packed) | Supported |
| 5-bit | mlx-format affine (byte-packed) | Supported |
| 4-bit | mlx-format affine | Supported |
| 3-bit | mlx-format affine (byte-packed) | Supported |

See quantization.md for the packing layout, the sub-group split dispatch trick that closed the perf gap on 4-bit Qwen3 4B, and how the loader auto-detects mlx-format weights.
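For a rough feel of what "affine" means here, each group of weights carries one scale and one bias, and dequantization is a single multiply-add per code. A toy sketch of that arithmetic, assuming the integer codes are already unpacked; the real packing, group size, and GPU kernels are covered in quantization.md:

```swift
// Toy dequantization of one already-unpacked group of codes
// (e.g. 64 values when converted with --q-group-size 64).
// Real kernels work on packed bytes on the GPU; this shows only the arithmetic.
func dequantize(codes: [UInt8], scale: Float, bias: Float) -> [Float] {
    codes.map { Float($0) * scale + bias }
}
```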

Pass any HuggingFace repo ID with one of the in-tree model_type / architectures strings to Model.load:

```swift
let model = try await Model.load("mlx-community/Qwen3-14B-4bit")
```

The loader resolves the snapshot, parses config.json, picks the right family via ModelRegistry.dispatchAndLoad, and builds the variant. If the architecture isn’t in the registry yet, you get a ModelError.unsupportedArchitecture(...).
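A hedged sketch of handling that failure path; the associated value on ModelError.unsupportedArchitecture (the offending architecture string) is an assumption here, not the confirmed signature:

```swift
// Illustrative only: load a repo and surface the unsupported-architecture case.
func loadOrExplain(_ repo: String) async {
    do {
        let model = try await Model.load(repo)
        print("Loaded \(repo): \(model)")
    } catch ModelError.unsupportedArchitecture(let architecture) {
        print("\(repo) declares \(architecture), which has no family file yet.")
        print("See developing/adding-a-model.md to port it.")
    } catch {
        print("Load failed: \(error)")
    }
}
```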

| Item | Status |
| --- | --- |
| Multi-modal (vision, audio) | Capability infrastructure in place from Phase 2; first real exercise lands in Phase 6 (Qwen 2.5/3.5-VL). |
| Chat templates | The tokenizer's chat template is not auto-applied by generate(...) yet; pass the templated prompt yourself (see the sketch after this table). Auto-apply lands alongside the first instruct-tuned VL model. |
| Sampling | Greedy argmax only on the GPU path. Top-k / top-p / temperature exist as CPU helpers in Sampling.swift; GPU kernels for these land in Phase 5. |
| Quantized KV cache | Raw fp16/bf16 only. Affine + TurboQuant land in Phase 5; see kv-cache.md. |
| Hybrid models | Qwen 3.5 (GDN + attention) and Mamba/Mamba 2 families need new SSM kernels; Phase 5. |
| MoE | Qwen 3.5 MoE and similar need fused-expert kernels; Phase 5. |
| MoE / vision-tied checkpoints | Detected as unsupportedArchitecture until their family files land. |
| Prompt caching across requests | Not yet; the cache lives for one generate(...) call. Multi-turn cache reuse is straightforward via the lower-level API (see quickstart.md § Lower-level API). |
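Until chat templates are auto-applied, the prompt you hand to generation must already be templated. A minimal sketch for a Qwen 3 instruct-style checkpoint; the ChatML-style markers follow Qwen's convention, but check the checkpoint's tokenizer_config.json for the authoritative template, and note that model.generate(_:) below is a hypothetical stand-in, not a documented FFAI signature:

```swift
// Sketch only: hand-assembled chat-style prompt for a Qwen 3 checkpoint.
// `model` comes from Model.load(...) as shown above; generate(_:) is a placeholder call.
let userMessage = "Explain RMSNorm in one sentence."
let prompt = """
<|im_start|>user
\(userMessage)<|im_end|>
<|im_start|>assistant

"""
let reply = try await model.generate(prompt)
```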

Per planning/plan.md:

  • Phase 5 — TurboQuant KV cache + GDN + SSM. Unlocks Qwen 3.5 hybrid (GDN + attention), Qwen 3.5 MoE, NemotronH, Mamba families.
  • Phase 6 — first multi-modal model: Qwen 2.5-VL or Qwen 3.5-VL, exercising Capability.visionIn end-to-end.
  • Phase 7 — autotuner over kernel parameters (tile_dims, threads, unroll, simd_matrix, async_copy).
  • Phase 8+ — audio, additional families (Mistral, Phi, Gemma, GPT-OSS), gguf format support, dispatch-mode upgrades (.argumentBuffers, .icb).