
# Quick start

Generate text in four lines:

```swift
import FFAI
let model = try await Model.load("unsloth/Llama-3.2-1B")
let result = try await model.generate(prompt: "Once upon a time")
print(result.text)
```

The first call resolves and downloads the checkpoint on demand (cached under ~/.cache/huggingface/hub/), parses config.json, loads weights into per-tensor MTLBuffers, attaches the tokenizer, and prewarms the PSO cache. generate(...) defaults to the model’s family-declared defaultGenerationParameters — Llama and Qwen 3 each carry their own. The returned GenerationResult carries the prompt + generated tokens, the decoded text, and prefill / decode timings.

print("\(result.promptTokens.count) prompt + \(result.generatedTokens.count) generated tokens")
print(String(format: "%.2f tok/s", result.tokensPerSecond))

The same call works for any mlx-format checkpoint — 3 / 4 / 5 / 6 / 8-bit:

let model = try await Model.load("mlx-community/Qwen3-4B-4bit")
let result = try await model.generate(prompt: "What is the capital of France?")
print(result.text)

The same surface ships as the ffai executable. See using-the-cli.md for how to build the binary and get it on PATH; once that’s done:

```bash
ffai --model unsloth/Llama-3.2-1B --prompt "Once upon a time"
ffai --model mlx-community/Qwen3-4B-4bit --prompt "Hello" --max-tokens 128
```

Tokens are streamed to stdout as they’re generated. Pass --no-streaming to print the full text once at the end (matches the buffered API exactly). Pass --stats for the post-run [STATS] block (per-phase memory, TTFT, KV cache, wired ticket — see observability.md). Pass --verbose to print the top-5 next-token distribution from a single prefill instead of generating.
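For example, combining the flags above (assuming they compose as described):

```bash
# Buffered output plus the post-run [STATS] block:
ffai --model mlx-community/Qwen3-4B-4bit --prompt "Hello" --no-streaming --stats

# Top-5 next-token distribution from a single prefill, no generation:
ffai --model unsloth/Llama-3.2-1B --prompt "Once upon a time" --verbose
```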

The second argument to generate is a GenerationParameters. Omit it (or pass nil) to use the family default — overriding a single field with the with(_:) copy-mutator preserves the family-tuned baseline:

```swift
let result = try await model.generate(
    prompt: "Once upon a time",
    parameters: model.defaultGenerationParameters.with { $0.maxTokens = 64 }
)
```

For the full field table (sampling temp, top-p / top-k, repetition penalty, prefill chunk size, …) and which fields are honored today vs staged for Phase 5, see generation-parameters.md.

Streaming is the primitive — buffered generate(...) collects from the same stream:

```swift
for try await chunk in model.generateStream(prompt: "Why is the sky blue?") {
    print(chunk.text, terminator: "")
}
```

The final chunk carries the full GenerationStats (peak GPU, KV cache size, TTFT, …) on its stats property. Cancel the consuming task to stop generation early — the producer notices at the next token boundary. See streaming.md.
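A minimal sketch of both — reading the final stats and cancelling early. It assumes stats is nil on every chunk except the last; see streaming.md for the real contract:

```swift
// Sketch: consume the stream in a cancellable task.
let consumer = Task {
    for try await chunk in model.generateStream(prompt: "Why is the sky blue?") {
        print(chunk.text, terminator: "")
        if let stats = chunk.stats {          // assumed nil until the final chunk
            print("\nfinal stats: \(stats)")
        }
    }
}
// Stop early — the producer notices at the next token boundary.
consumer.cancel()
```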

For instruct / chat models, pass [ChatMessage] and FFAI applies the tokenizer’s chat template:

```swift
let messages: [ChatMessage] = [
    .init(role: .system, content: "You are concise."),
    .init(role: .user, content: "Why is the sky blue?"),
]
let result = try await model.generate(messages: messages)
print(result.text)
```

For reasoning-tuned models (Qwen 3, DeepSeek-R1, GPT-OSS), opt into the model’s thinking / reasoning hooks:

```swift
let result = try await model.generate(
    messages: messages,
    templateOptions: ChatTemplateOptions(enableThinking: true)
)
```

See chat-templates.md for the full options surface and per-family quirks.

Model.load(_:options:) takes a LoadOptions:

```swift
let model = try await Model.load(
    "unsloth/Llama-3.2-1B",
    options: LoadOptions(
        capabilities: [.textIn, .textOut],
        kvCache: .raw,
        prewarm: true,
        revision: "main"
    )
)
```

| Field | Default | Notes |
| --- | --- | --- |
| capabilities | [.textIn, .textOut] | Which capabilities to load. Disabled modalities skip weight allocation entirely (relevant for VLMs in Phase 6). |
| kvCache | .raw | Raw fp16 / bf16 today. .affineQuantized and .turbo land in Phase 5. |
| dispatchMode | .eager | Standard MTLComputeCommandEncoder per kernel. .argumentBuffers / .icb land in Phase 8+ if profiles justify. |
| prewarm | true | Run one no-op forward to compile the PSOs before the first user-visible decode. |
| lazyCapabilities | true | Allow runtime enable(_:) / disable(_:) after load. |
| revision | "main" | HF branch / tag / commit. |
| cacheDirectory | nil | Override the HF cache root for this load. nil honors HF_HOME, then ~/.cache/huggingface/hub/. See § Custom model cache path. |
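The lazyCapabilities hook is worth a sketch. The enable(_:) / disable(_:) names come from the table above, but their exact signatures aren't shown in this guide, so treat the call shape here as an assumption:

```swift
// Sketch only: assumes enable(_:)/disable(_:) accept the same capability
// values as LoadOptions.capabilities and can throw.
let model = try await Model.load(
    "unsloth/Llama-3.2-1B",
    options: LoadOptions(lazyCapabilities: true)
)
try model.disable(.textOut)   // hypothetical call shape
try model.enable(.textOut)
```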

## Custom model cache path

By default FFAI shares a snapshot cache with Python’s huggingface_hub. The standard discovery order is:

  1. HF_HOME env var — if set, the cache lives under $HF_HOME/hub/ (or $HF_HOME if it’s already a hub dir).
  2. ~/.cache/huggingface/hub/ — the default fallback.

Three ways to point FFAI somewhere else, easiest first:

### 1. HF_HOME (environment variable)

Cleanest for ad-hoc relocation — works for both the ffai CLI and any Swift code calling Model.load(...). It’s the same env var Python’s huggingface_hub honors, so the cache stays shared with mlx-lm, huggingface-cli, etc.

```bash
export HF_HOME=/Volumes/Big/hf-cache
ffai --model unsloth/Llama-3.2-1B --prompt "Once upon a time"
```

### 2. LoadOptions.cacheDirectory (programmatic)


Override per Model.load(...) call without touching the process env:

```swift
let model = try await Model.load(
    "unsloth/Llama-3.2-1B",
    options: LoadOptions(
        cacheDirectory: URL(fileURLWithPath: "/Volumes/Big/hf-cache")
    )
)
```

nil (the default) keeps the standard HF_HOME → ~/.cache/huggingface/hub/ discovery order. The override is useful when one process needs to read from multiple cache roots, or when you want to keep the user’s normal cache untouched while a background pipeline downloads to its own location.
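For instance, a staging load that leaves the default cache alone (the path is illustrative):

```swift
// Sketch: download into a dedicated cache root; the user's normal HF
// cache is never touched.
let stagingCache = URL(fileURLWithPath: "/Volumes/Big/pipeline-cache")
let staged = try await Model.load(
    "mlx-community/Qwen3-4B-4bit",
    options: LoadOptions(cacheDirectory: stagingCache)
)
```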

### 3. Local snapshot path

Skip HF entirely — Model.load(...) accepts a local directory containing the snapshot files (config.json, tokenizer.json, *.safetensors, etc.):

let model = try await Model.load("/Volumes/Big/models/llama-3.2-1B-snapshot")
Terminal window
ffai --model /Volumes/Big/models/llama-3.2-1B-snapshot --prompt "Once upon a time"

ModelLocator.isLocalPath(_:) decides this — anything that starts with /, ./, ../, or ~ (or just exists on disk) routes to the local-path branch and never hits the network. The directory needs at minimum:

  • config.json
  • tokenizer.json (or the multi-file tokenizer the model uses)
  • *.safetensors (one or more shard files)
  • tokenizer_config.json if you’ll be using chat templates

This is also how you’d point at a snapshot you’ve already downloaded with huggingface-cli download or mlx-lm.
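A typical single-shard layout looks like this (file names are illustrative — shard names vary by model):

```bash
ls /Volumes/Big/models/llama-3.2-1B-snapshot
# config.json
# model.safetensors
# tokenizer.json
# tokenizer_config.json
```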

Model.events is an AsyncStream<ModelLifecycleEvent> that emits idle → downloading → loading → loaded → ready, plus failed(Error) from any state. Useful for UI progress bars:

let model = try await Model.load("unsloth/Llama-3.2-1B")
Task {
for await event in model.events {
print("model state: \(event.state)")
}
}

Model.generate(...) is a thin wrapper. To drive the loop yourself (e.g. custom sampling, streaming hooks, multi-turn cache reuse) drop to the LanguageModel protocol:

```swift
let caches = model.engine.makeLayerCaches()

// Prefill: feed each prompt token through the same forward path.
// Only the sample from the last prompt position is kept.
let promptTokens = model.tokenizer.encode(text: "Once upon a time")
var nextToken = 0
for (i, t) in promptTokens.enumerated() {
    nextToken = model.engine.forwardSample(tokenId: t, position: i, caches: caches)
}

// Decode loop. forwardSample returns the next token id (GPU argmax) —
// no logits readback to CPU.
var pos = promptTokens.count
for _ in 0..<64 {
    if nextToken == model.config.eosTokenId { break }
    print(model.tokenizer.decode(tokens: [nextToken]), terminator: "")
    nextToken = model.engine.forwardSample(tokenId: nextToken, position: pos, caches: caches)
    pos += 1
}
```

forward(tokenId:position:caches:) returns the logits Tensor if you need them on CPU; forwardSample keeps them on the GPU and only returns the sampled token id (4 bytes across CPU↔GPU per token).
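If you do need the distribution on CPU, swap forwardSample for forward at the same call site — a sketch, where the scalars() readback accessor is hypothetical (check the Tensor API for the real name):

```swift
// Sketch: same loop position, but read the logits back to the CPU.
let logits = model.engine.forward(tokenId: nextToken, position: pos, caches: caches)
let values = logits.scalars()   // hypothetical CPU-readback accessor
// CPU argmax over the vocabulary — what forwardSample does on the GPU.
let next = values.indices.max { values[$0] < values[$1] }!
```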

| Want to … | Read |
| --- | --- |
| Add FFAI to your project | installation.md |
| See which models are supported | models.md |
| Understand the three-layer stack | architecture.md |
| Stream tokens to a UI | streaming.md |
| Use chat / instruct models | chat-templates.md |
| Tune sampling / prefill / maxTokens | generation-parameters.md |
| See per-phase memory + tok/s in --stats | observability.md |
| Run benchmarks (ffai bench) | benchmarking.md |
| Pick a KV cache strategy | kv-cache.md |
| Use 3 / 4 / 5 / 6 / 8-bit quantized weights | quantization.md |
| Check current tok/s numbers | performance.md |
| Port a new architecture | developing/adding-a-model.md |