# Observability
Three CLI flags introspect what FFAI is doing per token. The
underlying types live in `Sources/FFAI/Stats/`
and are usable from your own code without going through the CLI.
`--stats` is implemented today; `--debug` and `--profiling` land in
follow-up commits and are documented below.
## --stats

Prints a `[STATS]` block after generation completes. Always-on
instrumentation: a PhaseMemoryTracker samples
MTLDevice.currentAllocatedSize at every token boundary (cost: one
property read per token, ~sub-µs). The fields and shape are
deliberately portable to / from mlx-swift-lm’s bench-row schema so
analysis tooling stays compatible.
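
The sampling itself is a single device property read per token plus a running max. A minimal sketch of the idea (illustrative only; the shipping type is `PhaseMemoryTracker` in `Sources/FFAI/Stats/`):

```swift
import Metal

// Illustrative: one currentAllocatedSize read per token boundary, plus a running peak.
struct MemorySampler {
    let device: MTLDevice
    private(set) var peakBytes: Int = 0

    mutating func sample() -> Int {
        let now = device.currentAllocatedSize   // a single property read, no GPU work
        peakBytes = max(peakBytes, now)
        return now
    }
}
```

Phase deltas (post-prefill, post-decode) are then just differences between samples taken at the phase boundaries.
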
```
ffai --model unsloth/Llama-3.2-1B --prompt "Once upon a time" --stats
```

```
[STATS]
  prompt:            5 tokens
  generated:         64 tokens
  context:           131072 tokens
  ttft:              82.34 ms
  prefill:           0.08s (60.71 tok/s)
  decode:            1.00s (64.06 tok/s)
  decode (steady):   65.20 tok/s (tokens 11+)
  baseline GPU:      1240.0 MB
  post-prefill GPU:  1252.0 MB (+ 12.0 MB)
  post-decode GPU:   1268.0 MB (+ 16.0 MB)
  prefill peak:      1255.5 MB
  decode peak:       1269.0 MB
  weights:           1180.0 MB
  KV cache (alloc):  64.0 MB
  KV cache (used):   4.5 MB
  wired ticket:      16.00 GB
```

### What each field means

| Field | Source | Notes |
|---|---|---|
| `prompt` / `generated` / `context` | counts | `context` is the model’s max sequence length. |
| `ttft` | wall clock | Time-to-first-token (ms). On the slow per-token prefill, equal to total prefill time. |
| `prefill`, `decode` | wall clock | Phase totals + tok/s. |
| `decode (steady)` | wall clock | Mean per-token decode rate from token 11 onward. Drops PSO compile + autorelease pool warm-up. `nil` when fewer than 11 generated tokens. |
| `baseline GPU` | `MTLDevice.currentAllocatedSize` at the start of `generate(...)` | Resident weights + warm caches. |
| `post-prefill GPU`, `post-decode GPU` | same source | Includes a `+` delta column attributing the growth to the corresponding phase — answers “where did the GB come from?”. |
| `prefill peak`, `decode peak` | per-token sample max | The actual high-water mark inside each phase, separately reported. |
| `weights` | sum of `Tensor.byteCount` over `engine.parameters()` | Resident model weights only. |
| `KV cache (alloc)` | sum of `KVCache.bytesAllocated` | Capacity allocated up-front at the model’s `maxSeq`. |
| `KV cache (used)` | sum of `KVCache.bytesInUse` | The live length / `maxSeq` slice — the part you’re actually paying for in attention math. |
| `wired ticket` | `MTLDevice.recommendedMaxWorkingSetSize` | The OS’s wired-memory budget for this device. |

### Programmatic access

Stats are always populated on `GenerationResult.stats`. The CLI flag
only controls the printout.
```swift
let result = try await model.generate(prompt: "...")

print(result.stats.peakGPUBytes)
print(result.stats.prefillGrowthBytes)       // post-prefill – baseline
print(result.stats.decodeGrowthBytes)        // post-decode – post-prefill
print(result.stats.kvCacheUsedBytes)
print(result.stats.steadyTokensPerSecond ?? "<warm-up only>")
```

For a custom report, call `result.stats.formatted()` — that’s what
the CLI prints.
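
If you want a different shape, say a one-line CSV row, the same fields compose directly. A sketch that continues from the snippet above (assuming integer byte counts and an optional `Double` for the steady rate):

```swift
// Hypothetical one-row report built from the same fields the [STATS] block prints.
let s = result.stats
let row = [
    "\(s.peakGPUBytes)",
    "\(s.prefillGrowthBytes)",
    "\(s.decodeGrowthBytes)",
    "\(s.kvCacheUsedBytes)",
    s.steadyTokensPerSecond.map { String(format: "%.2f", $0) } ?? ""
].joined(separator: ",")
print(row)
```
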

### Perplexity (opt-in)

Perplexity computation requires a per-step logits readback (extra
cost on top of greedy decode), so it’s not folded into
generate(...). Use the standalone helper:
```swift
let r = try await model.generate(prompt: "Once upon a time")
let ppl = Perplexity.compute(model: model, tokens: r.promptTokens + r.generatedTokens)
print(ppl.perplexity)   // exp(-mean log p(token | prefix))
```

All perplexity math runs in fp32 — bf16’s 7 mantissa bits can’t
represent the partial Σ exp(x_i − max_x) over a 128–152K vocab
without losing precision in the third decimal of perplexity, which
is exactly the resolution we need for distinguishing quantization
tiers. fp32 costs ~1 µs per token, a cost that only matters for
offline evaluation. The header comment in `Stats/Perplexity.swift`
has the longer rationale.
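
For intuition, the per-token term is the usual max-shifted log-softmax evaluated in fp32. A self-contained sketch with plain Swift arrays (not the shipping Tensor path):

```swift
import Darwin   // expf / logf on Float

// Stable log p(token | prefix) from one row of fp32 logits.
// The Σ exp(x_i − max_x) accumulation is the part bf16 cannot hold precisely.
func logProb(of token: Int, logits: [Float]) -> Float {
    let maxLogit = logits.max() ?? 0
    let sumExp = logits.reduce(Float(0)) { $0 + expf($1 - maxLogit) }
    return (logits[token] - maxLogit) - logf(sumExp)
}
// perplexity = exp(-mean of logProb over the evaluated tokens)
```
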
The bench harness’s `--method wikitext2` runs this over the
WikiText-2 corpus and writes the resulting `genPerplexity` into the
report row.

### KL divergence vs a reference model

`Perplexity.klDivergence(reference:candidate:tokens:)` runs both
models forward in lockstep over a fixed token sequence and returns
the mean per-position KL(p_ref || q_cand) in nats. Both models
must share the same tokenizer / vocab.
```swift
let reference = try await Model.load("mlx-community/Qwen3-4B-bf16")
let candidate = try await Model.load("mlx-community/Qwen3-4B-4bit")
let kld = Perplexity.klDivergence(
    reference: reference,
    candidate: candidate,
    tokens: tokenizer.encode(text: "Sample evaluation text…")
)
print(kld.meanKLDivergence)
```

**Pick the right reference.** The number is only meaningful when the reference is a higher-fidelity sibling of the candidate — same architecture, same tokenizer. Use the bf16 unquantized variant if it fits in device memory. A smaller / different-family reference turns the KL into a measure of family closeness rather than quantization fidelity.
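
For reference, the per-position quantity being averaged is the standard discrete KL over the shared vocabulary. A sketch of that single term with plain fp32 arrays (not the shipping API):

```swift
import Darwin

// KL(p_ref || q_cand) in nats for one position; p and q are log-probability vectors.
func klAtPosition(refLogProbs p: [Float], candLogProbs q: [Float]) -> Float {
    zip(p, q).reduce(Float(0)) { acc, pq in
        acc + expf(pq.0) * (pq.0 - pq.1)   // p_i * (log p_i - log q_i)
    }
}
```
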
Memory: both models live resident simultaneously, plus their KV
caches. For Qwen 3 4B, that’s roughly 8GB (bf16) + 2.5GB (4-bit) +
a few hundred MB of caches — fits on any 16GB+ Mac. For Qwen 3 14B
the bf16 variant alone is ~28GB; pair with Qwen3-14B-8bit as the
candidate against Qwen3-14B-bf16 as the reference if you have the
RAM, otherwise drop down to the next size class.
CLI `--ref-model <repo>` plumbing ships with the bench subcommand
in the next chunk.

### Think vs gen split

Models that emit reasoning segments (Qwen 3 / DeepSeek-R1 ChatML,
GPT-OSS Harmony, Gemma 3/4 channels) get a separate
`think_tokens` / `gen_tokens` count line. Format auto-detection runs
on every `generate(...)` call via `ThinkingSplit.detectFormat(model:)`:
| Detected from | Format | Implementation |
|---|---|---|
| `model_type` contains `gpt-oss` | `.harmony` | TODO — lands with the GPT-OSS family file (Phase 8+). |
| `model_type` contains `gemma3` / `gemma4` | `.gemmaChannel` | TODO — lands with the Gemma family file (Phase 5+). |
| Tokenizer has `<think>` / `</think>` ids | `.chatML` | Implemented today. |
| Otherwise | `.none` | Whole generation counts as gen. |
For models that don’t emit thinking markers, the split is silently
omitted. See `Stats/ThinkingSplit.swift`
for the per-format scanners.
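
For the ChatML case, the scan boils down to cutting the generated tokens at the `</think>` id. Roughly (an illustrative helper, not the shipping scanner):

```swift
// Illustrative: everything before the </think> id counts as think, the rest as gen.
func splitChatML(tokens: [Int], thinkCloseID: Int) -> (think: [Int], gen: [Int])? {
    guard let close = tokens.firstIndex(of: thinkCloseID) else { return nil }
    return (Array(tokens[..<close]), Array(tokens[(close + 1)...]))
}
```
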
Per-segment perplexity is nil from generate(...) alone — the
bench harness will run perplexity over each segment separately
when it lands. For now you can do it yourself:
```swift
let split = ThinkingSplit.split(tokens: result.generatedTokens, model: model)
if let s = split {
    let thinkPPL = Perplexity.compute(model: model, tokens: result.promptTokens + Array(s.thinkTokens))
    let genPPL = Perplexity.compute(model: model, tokens: result.promptTokens + Array(s.thinkTokens) + [open] + Array(s.genTokens))
    print(thinkPPL.perplexity, genPPL.perplexity)
}
```

### Scaffolded fields (commented today)

The `GenerationStats` struct reserves space — currently commented —
for capabilities that haven’t shipped yet:

- Batch decoding (Phase 8+): `batchSize`, `perSequenceDecodeTokensPerSecond`.
- Speculative decoding (Phase 8+): `acceptanceRate`, `draftTokensPerSecond`, `draftAcceptedTokens`.

When those modes land, uncomment the fields, populate them in
Generate.swift, and the formatted output + bench writer pick them
up automatically.
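
As a sketch of what that reserved block looks like in spirit (the field names come from the list above; the types and struct name here are assumptions):

```swift
// Hypothetical shape of the reserved, commented-out block.
struct GenerationStatsSketch {
    // Batch decoding (Phase 8+):
    // var batchSize: Int
    // var perSequenceDecodeTokensPerSecond: [Double]

    // Speculative decoding (Phase 8+):
    // var acceptanceRate: Double
    // var draftTokensPerSecond: Double
    // var draftAcceptedTokens: Int
}
```
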
## --debug

Gates verbose log output to stderr, tagged by subsystem. Off by
default. The CLI flag flips the global gate; the same gate is
controllable via env vars, so other entry points (Xcode test runs,
your own Swift code) can opt in without editing source.
```
ffai --debug --model unsloth/Llama-3.2-1B --prompt "Hi" 2>debug.log
```

Sample output (heavily abbreviated):

```
[ffai:load] Model.load id-or-path=unsloth/Llama-3.2-1B
[ffai:loader] resolved snapshot dir: /Users/.../snapshots/abcd…
[ffai:load] config: arch=LlamaForCausalLM model_type=llama hidden=2048 layers=16
[ffai:generate] begin prefill: 5 tokens, maxTokens=64
[ffai:generate] prefill done in 0.082s (60.7 tok/s)
[ffai:generate] decode done: 64 tokens in 1.000s (64.0 tok/s)
```

### Per-subsystem gates

`--debug` enables every subsystem. To enable just one, set the
matching env var instead of passing `--debug`:
| Env var | Subsystem | What it logs |
|---|---|---|
| `FFAI_DEBUG=1` | (all) | Global gate; every subsystem on. |
| `FFAI_DEBUG_LOADER=1` | loader | `ModelLocator` / `ModelDownloader` snapshot resolution + cache hits. |
| `FFAI_DEBUG_LOAD=1` | load | `Model.load` + family loaders, config decode. |
| `FFAI_DEBUG_GENERATE=1` | generate | Prefill + decode loop boundaries, per-call timing summary. |
| `FFAI_DEBUG_SAMPLING=1` | sampling | Sampling decisions (when GPU sampling kernels land in Phase 5). |
| `FFAI_DEBUG_KVCACHE=1` | kvcache | KV cache append + slice events. |
| `FFAI_DEBUG_KERNELS=1` | kernels | Per-kernel dispatch chatter — very loud, opt-in only. |
| `FFAI_DEBUG_DISPATCH=1` | dispatch | Per-`MTLCommandBuffer` commit / wait. |
| `FFAI_DEBUG_BENCH=1` | bench | `BenchRunner` method dispatch + sub-phase timing. |
The closure passed to Debug.log(...) is @autoclosure — when the
subsystem is off, the message string isn’t built and the call is
near-free. Safe to leave instrumentation in hot paths.
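
The gate follows the usual shape, roughly this (a sketch, not the shipping `Debug` type):

```swift
import Foundation

// Illustrative gate: the @autoclosure message is never evaluated while the
// subsystem is off, so string interpolation in hot paths costs nothing.
enum DebugSketch {
    enum Subsystem: String { case load, generate, kvcache, kernels }
    static var enabled: Set<Subsystem> = []

    static func log(_ subsystem: Subsystem, _ message: @autoclosure () -> String) {
        guard enabled.contains(subsystem) else { return }
        FileHandle.standardError.write(Data("[ffai:\(subsystem.rawValue)] \(message())\n".utf8))
    }
}
```
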
```swift
Debug.log(.kvcache, "append k+v at pos \(pos), live bytes=\(caches.totalBytesInUse)")
```

## --profiling N

Three levels. 0 is the default (off, zero overhead anywhere).
```
ffai --profiling 1 --model unsloth/Llama-3.2-1B --prompt "Hi"
```

### Level 1 — wallclock breakdown

Captures wallclock durations at every phase boundary
(`model_load`, `prewarm`, `prefill`, `ttft`, `decode`,
`generation_total`) and prints a `[PROFILE]` block at the end of
the run. Cost is a few `Date()` calls per generation — negligible.
```
[PROFILE]
  model_load        2.34 s
  prewarm           480.12 ms
  prefill           82.30 ms
  ttft              82.30 ms
  decode            1.00 s
  generation_total  1.08 s
```

### Level 2 — level 1 + os_signpost intervals

Wraps each phase + the inner decode-step loop in `OSSignposter`
intervals under subsystem `ai.ffai`, capturable by Instruments
(Profile → Logging → os_signpost) or `xctrace record`.
Zero overhead when no tracer is attached — OSSignposter checks
a flag at the start of each beginInterval(...) and bails before
constructing any state. ~40 ns per call when nobody’s listening.
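
At each call site this reduces to the standard `OSSignposter` begin/end pattern. Roughly what `Profile.signpost` wraps at level 2 (a sketch; the category name is illustrative):

```swift
import OSLog

let signposter = OSSignposter(subsystem: "ai.ffai", category: "generation")

// Wrap one phase in an interval; beginInterval is ~free with no tracer attached.
func prefillWithSignpost(_ body: () throws -> Void) rethrows {
    let state = signposter.beginInterval("prefill")
    defer { signposter.endInterval("prefill", state) }
    try body()
}
```
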
```
xctrace record --template 'os_signpost' --launch -- \
  .build/release/ffai --profiling 2 --model unsloth/Llama-3.2-1B --prompt "Hi"
```

The recorded trace shows FFAI’s `prefill`, `decode_step`, `prewarm`, and
`model_load` spans on the same timeline as Apple’s Metal subsystem
spans (`com.apple.Metal`). Each Metal kernel dispatch shows up
automatically there — no need to wrap individual kernels on our
side. Use Instruments’ Metal System Trace template to drill into
GPU vs CPU time per phase.

### Pattern for new call sites

Wrapping a call site in a signpost is a one-liner — it passes through
the body’s value and adds no signpost overhead at level 0/1:
```swift
Profile.signpost("MyOp.compute") {
    // existing implementation
}

// Async variant
await Profile.signpostAsync("loader.fetch") {
    try await fetchSnapshot(...)
}

// Wallclock-only timing (level 1)
let result = Profile.time("custom_phase") {
    expensiveSyncWork()
}

// Point-in-time event (no duration)
Profile.event("first_decode_token")
```

The top-level CLI sets `Profile.shared.level` once before any FFAI
work runs; the same global is used by the bench subcommand so trace
recordings line up across `ffai generate` and `ffai bench`.

### Programmatic access

```swift
Profile.shared.level = .signposts
Profile.shared.resetPhases()

let result = try await model.generate(prompt: "...")

print(Profile.shared.phases.formatted())   // [PROFILE] block
```

`PhaseTimings` is a Sendable struct — safe to grab a snapshot from
any context.

## Comparison with mlx-swift-lm

FFAI’s flags follow mlx-swift-lm’s conventions, so trace recipes and
analysis tooling carry over:
| mlx-swift-lm | FFAI |
|---|---|
| `MLX_BENCH_DEBUG=1` | `--debug` / `FFAI_DEBUG=1` |
| `MLX_BENCH_PROFILE=1` | `--profiling 1` |
| `MLX_BENCH_PROFILE=2` | `--profiling 2` |
| signposts under `ai.mlx.metal` | signposts under `ai.ffai`, kernel signposts via Apple’s `com.apple.Metal` |

## See also

- `generation-parameters.md` — the knobs that control what gets generated; stats describes how it ran.
- `performance.md` — current `tok/s` numbers per model + the Phase 4 wave-by-wave perf history.
- `kv-cache.md` — what the `KV cache (alloc)` and `(used)` fields actually account for.
- mlx-swift-lm’s `benchmarks/README.md` — the upstream schema FFAI’s stats fields mirror.