Observability

Three CLI flags to introspect what FFAI is doing per token. The underlying types live in Sources/FFAI/Stats/ and are usable from your own code without going through the CLI.

This page covers --stats today; --debug and --profiling land in follow-up commits and are described under Coming next.

--stats prints a [STATS] block after generation completes. The instrumentation behind it is always on: a PhaseMemoryTracker samples MTLDevice.currentAllocatedSize at every token boundary (cost: one property read per token, sub-µs). The fields and shape are deliberately kept portable to and from mlx-swift-lm’s bench-row schema so analysis tooling stays compatible.
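The sampler itself is tiny. A minimal sketch of the pattern, with illustrative names (the real PhaseMemoryTracker lives in Sources/FFAI/Stats/):

import Metal

// Illustrative sketch: tracks a running per-phase high-water mark with one
// property read per token boundary.
struct PhaseMemoryTrackerSketch {
    let device: MTLDevice
    private(set) var peakBytes: Int = 0

    mutating func sampleTokenBoundary() {
        // currentAllocatedSize is a plain property read, well under a µs
        peakBytes = max(peakBytes, device.currentAllocatedSize)
    }
}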

ffai --model unsloth/Llama-3.2-1B --prompt "Once upon a time" --stats
[STATS]
prompt: 5 tokens
generated: 64 tokens
context: 131072 tokens
ttft: 82.34 ms
prefill: 0.08s (60.71 tok/s)
decode: 1.00s (64.06 tok/s)
decode (steady): 65.20 tok/s (tokens 11+)
baseline GPU: 1240.0 MB
post-prefill GPU: 1252.0 MB (+ 12.0 MB)
post-decode GPU: 1268.0 MB (+ 16.0 MB)
prefill peak: 1255.5 MB
decode peak: 1269.0 MB
weights: 1180.0 MB
KV cache (alloc): 64.0 MB
KV cache (used): 4.5 MB
wired ticket: 16.00 GB
| Field | Source | Notes |
| --- | --- | --- |
| prompt / generated / context | counts | context is the model’s max sequence length. |
| ttft | wall clock | Time-to-first-token (ms). On the slow per-token prefill, equal to total prefill time. |
| prefill, decode | wall clock | Phase totals + tok/s. |
| decode (steady) | wall clock | Mean per-token decode rate from token 11 onward. Drops PSO compile + autorelease pool warm-up. nil when fewer than 11 generated tokens. |
| baseline GPU | MTLDevice.currentAllocatedSize at the start of generate(...) | Resident weights + warm caches. |
| post-prefill GPU, post-decode GPU | same source | Includes a + delta column attributing the growth to the corresponding phase — answers “where did the GB come from?”. |
| prefill peak, decode peak | per-token sample max | The actual high-water mark inside each phase, reported separately. |
| weights | sum of Tensor.byteCount over engine.parameters() | Resident model weights only. |
| KV cache (alloc) | sum of KVCache.bytesAllocated | Capacity allocated up-front at the model’s maxSeq. |
| KV cache (used) | sum of KVCache.bytesInUse | The live length / maxSeq slice — the part you’re actually paying for in attention math. |
| wired ticket | MTLDevice.recommendedMaxWorkingSetSize | The OS’s wired-memory budget for this device. |
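The decode (steady) number is easy to reproduce from per-token timings. A sketch, assuming an array of per-token wall-clock durations (FFAI’s internal bookkeeping may differ):

// durations[i] = seconds spent producing generated token i+1
func steadyTokensPerSecond(durations: [Double]) -> Double? {
    guard durations.count >= 11 else { return nil }  // warm-up only
    let steady = durations.dropFirst(10)             // tokens 11 onward
    return Double(steady.count) / steady.reduce(0, +)
}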

Stats are always populated on GenerationResult.stats. The CLI flag only controls the printout.

let result = try await model.generate(prompt: "...")
print(result.stats.peakGPUBytes)
print(result.stats.prefillGrowthBytes) // post-prefill – baseline
print(result.stats.decodeGrowthBytes) // post-decode – post-prefill
print(result.stats.kvCacheUsedBytes)
print(result.stats.steadyTokensPerSecond.map { "\($0)" } ?? "<warm-up only>")

For a custom report, call result.stats.formatted() — that’s what the CLI prints.

Perplexity computation requires a per-step logits readback (extra cost on top of greedy decode), so it’s not folded into generate(...). Use the standalone helper:

let r = try await model.generate(prompt: "Once upon a time")
let ppl = Perplexity.compute(model: model,
                             tokens: r.promptTokens + r.generatedTokens)
print(ppl.perplexity) // exp(-mean log p(token | prefix))

All perplexity math runs in fp32 — bf16’s 7 mantissa bits can’t represent the partial Σ exp(x_i − max_x) over a 128–152K vocab without losing precision in the third decimal place of perplexity, which is exactly the resolution needed to distinguish quantization tiers. The fp32 path costs ~1 µs per token, which only matters for offline evaluation anyway. The header comment in Stats/Perplexity.swift has the longer rationale.
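The stable form computed per position is the standard log-sum-exp. A minimal fp32 sketch (the helper name and plain [Float] input are assumptions, not FFAI’s API):

import Darwin

// log p(target | prefix) from one fp32 logits row, numerically stable
func logProb(of target: Int, logits: [Float]) -> Float {
    let maxX = logits.max()!
    let sumExp = logits.reduce(Float(0)) { $0 + expf($1 - maxX) }
    return logits[target] - maxX - logf(sumExp)
}
// perplexity = exp(-mean of logProb over all scored positions)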

The bench harness’s --method wikitext2 runs this over the WikiText-2 corpus and writes the resulting genPerplexity into the report row.

Perplexity.klDivergence(reference:candidate:tokens:) runs both models forward in lockstep over a fixed token sequence and returns the mean per-position KL(p_ref || q_cand) in nats. Both models must share the same tokenizer / vocab.
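Concretely, at each scored position the two forward passes yield softmax distributions p (reference) and q (candidate) over the shared vocab, and the reported number is the mean over positions of KL(p ‖ q) = Σ_i p_i (ln p_i − ln q_i).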

let reference = try await Model.load("mlx-community/Qwen3-4B-bf16")
let candidate = try await Model.load("mlx-community/Qwen3-4B-4bit")
let kld = Perplexity.klDivergence(
    reference: reference, candidate: candidate,
    tokens: tokenizer.encode(text: "Sample evaluation text…")
)
print(kld.meanKLDivergence)

Pick the right reference. The number is only meaningful when the reference is a higher-fidelity sibling of the candidate — same architecture, same tokenizer. Use the bf16 unquantized variant if it fits in device memory. A smaller / different-family reference turns the KL into a measure of family closeness rather than quantization fidelity.

Memory: both models live resident simultaneously, plus their KV caches. For Qwen 3 4B, that’s roughly 8GB (bf16) + 2.5GB (4-bit) + a few hundred MB of caches — fits on any 16GB+ Mac. For Qwen 3 14B the bf16 variant alone is ~28GB; pair with Qwen3-14B-8bit as the candidate against Qwen3-14B-bf16 as the reference if you have the RAM, otherwise drop down to the next size class.

CLI --ref-model <repo> plumbing ships with the bench subcommand in the next chunk.

Models that emit reasoning segments (Qwen 3 / DeepSeek-R1 ChatML, GPT-OSS Harmony, Gemma 3/4 channels) get a separate think_tokens / gen_tokens count line. Format auto-detection runs on every generate(...) call via ThinkingSplit.detectFormat(model:):

| Detected from | Format | Implementation |
| --- | --- | --- |
| model_type contains gpt-oss | .harmony | TODO — lands with the GPT-OSS family file (Phase 8+). |
| model_type contains gemma3 / gemma4 | .gemmaChannel | TODO — lands with the Gemma family file (Phase 5+). |
| Tokenizer has <think> / </think> ids | .chatML | Implemented today. |
| Otherwise | .none | Whole generation counts as gen. |
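The detection order in the table maps onto a simple cascade. A hedged sketch (the enum and signature are illustrative; see ThinkingSplit.detectFormat(model:) for the real thing):

enum ThinkingFormatSketch { case harmony, gemmaChannel, chatML, none }

func detectFormatSketch(modelType: String, hasThinkTokenIds: Bool) -> ThinkingFormatSketch {
    if modelType.contains("gpt-oss") { return .harmony }        // TODO (Phase 8+)
    if modelType.contains("gemma3") || modelType.contains("gemma4") {
        return .gemmaChannel                                    // TODO (Phase 5+)
    }
    if hasThinkTokenIds { return .chatML }                      // implemented today
    return .none                                                // whole generation counts as gen
}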

For models that don’t emit thinking markers, the split is silently omitted. See Stats/ThinkingSplit.swift for the per-format scanners.

Per-segment perplexity is nil from generate(...) alone — the bench harness will run perplexity over each segment separately when it lands. For now you can do it yourself:

let split = ThinkingSplit.split(tokens: result.generatedTokens, model: model)
if let s = split {
    let thinkPPL = Perplexity.compute(model: model,
                                      tokens: result.promptTokens + Array(s.thinkTokens))
    // `open` stands for the id of the marker token separating the two segments
    // (e.g. </think>); substitute your tokenizer's actual id.
    let genPPL = Perplexity.compute(model: model,
                                    tokens: result.promptTokens + Array(s.thinkTokens)
                                        + [open] + Array(s.genTokens))
    print(thinkPPL.perplexity, genPPL.perplexity)
}

The GenerationStats struct reserves space — currently commented — for capabilities that haven’t shipped yet:

  • Batch decoding (Phase 8+): batchSize, perSequenceDecodeTokensPerSecond.
  • Speculative decoding (Phase 8+): acceptanceRate, draftTokensPerSecond, draftAcceptedTokens.

When those modes land, uncomment the fields, populate them in Generate.swift, and the formatted output + bench writer pick them up automatically.
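As a shape sketch only (field names from the list above; the optional types are an assumption):

// Reserved in GenerationStats, currently commented out:
// var batchSize: Int?                              // batch decoding
// var perSequenceDecodeTokensPerSecond: [Double]?
// var acceptanceRate: Double?                      // speculative decoding
// var draftTokensPerSecond: Double?
// var draftAcceptedTokens: Int?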

--debug gates verbose log output to stderr, tagged by subsystem. It is off by default. The CLI flag flips the global gate; the same gate is controllable via env vars, so other entry points (Xcode test runs, your own Swift code) can opt in without editing source.

ffai --debug --model unsloth/Llama-3.2-1B --prompt "Hi" 2>debug.log

Sample output (heavily abbreviated):

[ffai:load] Model.load id-or-path=unsloth/Llama-3.2-1B
[ffai:loader] resolved snapshot dir: /Users/.../snapshots/abcd…
[ffai:load] config: arch=LlamaForCausalLM model_type=llama hidden=2048 layers=16
[ffai:generate] begin prefill: 5 tokens, maxTokens=64
[ffai:generate] prefill done in 0.082s (60.7 tok/s)
[ffai:generate] decode done: 64 tokens in 1.000s (64.0 tok/s)

--debug enables every subsystem. To enable just one, set the matching env var instead of passing --debug:

| Env var | Subsystem | What it logs |
| --- | --- | --- |
| FFAI_DEBUG=1 | (all) | Global gate; every subsystem on. |
| FFAI_DEBUG_LOADER=1 | loader | ModelLocator / ModelDownloader snapshot resolution + cache hits. |
| FFAI_DEBUG_LOAD=1 | load | Model.load + family loaders, config decode. |
| FFAI_DEBUG_GENERATE=1 | generate | Prefill + decode loop boundaries, per-call timing summary. |
| FFAI_DEBUG_SAMPLING=1 | sampling | Sampling decisions (when GPU sampling kernels land in Phase 5). |
| FFAI_DEBUG_KVCACHE=1 | kvcache | KV cache append + slice events. |
| FFAI_DEBUG_KERNELS=1 | kernels | Per-kernel dispatch chatter — very loud, opt-in only. |
| FFAI_DEBUG_DISPATCH=1 | dispatch | Per-MTLCommandBuffer commit / wait. |
| FFAI_DEBUG_BENCH=1 | bench | BenchRunner method dispatch + sub-phase timing. |
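For example, to log only KV cache events from a shell (the same variable works in an Xcode scheme’s environment):

FFAI_DEBUG_KVCACHE=1 ffai --model unsloth/Llama-3.2-1B --prompt "Hi" 2>kv.log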

The closure passed to Debug.log(...) is @autoclosure — when the subsystem is off, the message string isn’t built and the call is near-free. Safe to leave instrumentation in hot paths.

Debug.log(.kvcache, "append k+v at pos \(pos), live bytes=\(caches.totalBytesInUse)")
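The gating pattern is roughly the following (a minimal sketch with illustrative names, not FFAI’s exact implementation):

import Foundation

enum SubsystemSketch: String { case loader, load, generate, kvcache }

enum DebugSketch {
    static var enabled: Set<SubsystemSketch> = []

    static func log(_ sub: SubsystemSketch, _ message: @autoclosure () -> String) {
        guard enabled.contains(sub) else { return }  // message closure never built when off
        FileHandle.standardError.write(Data("[ffai:\(sub.rawValue)] \(message())\n".utf8))
    }
}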

--profiling takes three levels. Level 0 is the default (off, zero overhead anywhere).

ffai --profiling 1 --model unsloth/Llama-3.2-1B --prompt "Hi"

Level 1 captures wallclock durations at every phase boundary (model_load, prewarm, prefill, ttft, decode, generation_total) and prints a [PROFILE] block at the end of the run. The cost is a few Date() calls per generation — negligible.

[PROFILE]
model_load 2.34 s
prewarm 480.12 ms
prefill 82.30 ms
ttft 82.30 ms
decode 1.00 s
generation_total 1.08 s

Level 2 — level 1 + os_signpost intervals


Wraps each phase + the inner decode-step loop in OSSignposter intervals under subsystem ai.ffai, capturable by Instruments (Profile → Logging → os_signpost) or xctrace record.

Zero overhead when no tracer is attached: OSSignposter checks a flag at the start of each beginInterval(...) and bails before constructing any state, ~40 ns per call when nobody’s listening.
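The underlying pattern is plain OSSignposter usage. A sketch of what Profile presumably wraps (the subsystem name is from this page; the category and function are assumptions):

import os

let signposter = OSSignposter(subsystem: "ai.ffai", category: "generation")

func timedDecodeStep(_ body: () -> Void) {
    // beginInterval is a cheap no-op when no tracer is attached
    let state = signposter.beginInterval("decode_step")
    body()
    signposter.endInterval("decode_step", state)
}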

xctrace record --template 'os_signpost' --launch -- \
.build/release/ffai --profiling 2 --model unsloth/Llama-3.2-1B --prompt "Hi"

The recorded trace shows FFAI’s prefill, decode_step, prewarm, model_load spans on the same timeline as Apple’s Metal subsystem spans (com.apple.Metal). Each Metal kernel dispatch shows up automatically there — no need to wrap individual kernels on our side. Use Instruments’ Metal System Trace template to drill into GPU vs CPU time per phase.

Wrapping a call site in a signpost is a one-liner — passes through the body’s value, no signpost overhead at level 0/1:

Profile.signpost("MyOp.compute") {
    // existing implementation
}

// Async variant
await Profile.signpostAsync("loader.fetch") {
    try await fetchSnapshot(...)
}

// Wallclock-only timing (level 1)
let result = Profile.time("custom_phase") {
    expensiveSyncWork()
}

// Point-in-time event (no duration)
Profile.event("first_decode_token")

The top-level CLI sets Profile.shared.level once before any FFAI work runs; the same global is used by the bench subcommand so trace recordings line up across ffai generate and ffai bench.

Profile.shared.level = .signposts
Profile.shared.resetPhases()
let result = try await model.generate(prompt: "...")
print(Profile.shared.phases.formatted()) // [PROFILE] block

PhaseTimings is a Sendable struct — safe to grab a snapshot from any context.

FFAI’s flags follow mlx-swift-lm’s conventions so trace recipes and analysis tooling carry over:

| mlx-swift-lm | FFAI |
| --- | --- |
| MLX_BENCH_DEBUG=1 | --debug / FFAI_DEBUG=1 |
| MLX_BENCH_PROFILE=1 | --profiling 1 |
| MLX_BENCH_PROFILE=2 | --profiling 2 |
| signposts under ai.mlx.metal | signposts under ai.ffai; kernel signposts via Apple’s com.apple.Metal |
See also:

  • generation-parameters.md — the knobs that control what gets generated; stats describes how it ran.
  • performance.md — current tok/s numbers per model + the Phase 4 wave-by-wave perf history.
  • kv-cache.md — what the KV cache (alloc) and (used) fields actually account for.
  • mlx-swift-lm’s benchmarks/README.md — the upstream schema FFAI’s stats fields mirror.