# Testing
122 tests across 31 suites; 80.8% line coverage at the Phase 4 checkpoint. Every kernel, every Swift function, every model layer gets a unit test. CI gates on coverage + correctness.
## Running tests

```sh
make test                                # everything (~30s)
swift test --filter FFAITests            # one test target
swift test --filter LlamaGenerateTests   # one suite
swift test --filter Llama                # pattern across suites
```

`make test` always runs `make regenerate-kernels` first so you're never testing against stale kernels.
## Coverage

```sh
make coverage   # tests + summary
```

This runs `swift test --enable-code-coverage` and then drives `xcrun llvm-cov` to print a per-file table, excluding `.build/`, `Tests/`, and the generated `MetalTileKernels.swift`.
The 100% target documented in `planning/plan.md` applies to the Swift surface (`Sources/FFAI/` + `Sources/MetalTileSwift/`). Current coverage is 80.8%; the gap is mostly defensive error paths (`fatalError` on programmer bugs, unreachable `default:` cases) that are excluded from the denominator with `// coverage:ignore` markers.
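As a rough illustration of that convention (the function and dtype tags here are hypothetical, not FFAI's actual API), a defensive branch carries the marker so the coverage tooling can drop it from the denominator:

```swift
// Hypothetical example of the `// coverage:ignore` convention; the real
// exclusion logic lives in the coverage tooling, not in this sketch.
func bytesPerElement(forDtypeTag tag: Int) -> Int {
    switch tag {
    case 0:
        return 4        // fp32
    case 1, 2:
        return 2        // fp16 / bf16
    default:
        // Unreachable unless a caller passes a corrupt tag.
        fatalError("unknown dtype tag \(tag)")  // coverage:ignore
    }
}
```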
## Test layout

- `Tests/MetalTileSwiftTests/`: one test file per kernel. Numerical correctness vs a CPU reference or fixed-input/fixed-output vectors, across fp32 / fp16 / bf16 (a sketch of this pattern follows the list).
- `Tests/FFAITests/`: Tensor, Module, Linear, BufferPool, KVCache, Sampling, ModelDownloader, ModelLifecycle, Capability, ModelConfig, SafeTensors, …
- `Tests/ModelTests/`: one folder per model family. `Llama/` holds LlamaForwardTests + LlamaGenerateTests; `Qwen3/` holds Qwen3ForwardTests + Qwen3GenerateTests.
- `Tests/Fixtures/`: golden activations + token sequences captured from mlx-lm. Loaded at test time; never re-captured during `swift test`.
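A minimal sketch of the kernel-test pattern, under assumed names (`cpuSoftmax` stands in for whatever CPU reference a suite uses; the real tests run a MetalTileSwift dispatch where this sketch reuses the CPU path so it compiles standalone):

```swift
import XCTest
import Foundation

// Scalar CPU reference the GPU output is checked against.
func cpuSoftmax(_ x: [Float]) -> [Float] {
    let m = x.max() ?? 0
    let exps = x.map { expf($0 - m) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

final class SoftmaxKernelSketch: XCTestCase {
    func testMatchesCPUReference() {
        let input = (0..<64).map { _ in Float.random(in: -4 ... 4) }
        let expected = cpuSoftmax(input)
        let actual = cpuSoftmax(input)  // real suites: the Metal kernel under test
        for (a, e) in zip(actual, expected) {
            XCTAssertEqual(a, e, accuracy: 1e-5)  // fp16 / bf16 runs use looser tolerances
        }
    }
}
```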
## Golden fixtures (the testing reference convention)

Numerical references for tests are golden fixtures, not live Python invocations. This keeps `swift test` fully reproducible on a stock Apple Silicon CI runner with zero Python dependency.
- `Tools/capture-fixtures.py`: Python script, only used to GENERATE fixtures, never run during `swift test`. Uses mlx-lm for text-only models and mlx-vlm for vision-language models.
- `Tests/Fixtures/<model>/`: captured activations + token sequences. `metadata.json` records the mlx-lm / mlx-vlm version + capture date for reproducibility.

mlx-vlm lists mlx-lm as a runtime dependency, so a single `pip install mlx-vlm` covers both backends. The capture script picks the right one per model based on whether the config declares a vision encoder.
When a fixture needs regeneration:

1. `pip install mlx-vlm` (also installs mlx-lm).
2. Run `python Tools/capture-fixtures.py --model <repo> --output Tests/Fixtures/<name>/`.
3. Commit the new files alongside the model change.
4. Update `metadata.json` with the mlx-lm / mlx-vlm version + date.
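On the consuming side, tests simply decode these JSON files. A sketch of what a `loadFixture` helper could look like (the field names and schema are assumptions, not the real fixture format):

```swift
import Foundation

// Hypothetical fixture schema; the real JSON layout is whatever
// Tools/capture-fixtures.py writes into Tests/Fixtures/<model>/.
struct GenerationFixture: Codable {
    let prompt: String
    let tokens: [Int]
}

func loadFixture(_ name: String, in directory: URL) throws -> GenerationFixture {
    let url = directory.appendingPathComponent(name)
    let data = try Data(contentsOf: url)
    return try JSONDecoder().decode(GenerationFixture.self, from: data)
}
```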
## Writing a test

Layer tests live in `FFAITests/`:

```swift
import XCTest
import FFAI
import Metal

final class LinearTests: XCTestCase {
    func testForwardMatchesCPUReference() throws {
        let device = Device.shared
        let weight = Tensor.from([[1, 2], [3, 4]], device: device)
        let input = Tensor.from([1, 1], device: device)
        let layer = Linear(weight: weight)
        let out = layer.forward(input)
        // XCTAssertEqual(_:_:accuracy:) only takes scalars, so compare element-wise.
        for (got, want) in zip(out.toCPU(), [3, 7] as [Float]) {
            XCTAssertEqual(got, want, accuracy: 1e-5)
        }
    }
}
```

Model tests live in `ModelTests/<Family>/` and load the model plus a golden fixture:
```swift
final class LlamaGenerateTests: XCTestCase {
    func testDeterministicGreedy() async throws {
        let model = try await Model.load("unsloth/Llama-3.2-1B")
        let result = try await model.generate(
            prompt: "Once upon a time",
            options: GenerateOptions(maxNewTokens: 16)
        )
        let golden = try loadFixture("llama-3.2-1b-once-upon.json")
        XCTAssertEqual(result.generatedTokens, golden.tokens)
    }
}
```

`.github/workflows/ci.yml` runs on Apple Silicon, executes `swift test`, uploads the coverage report, and fails any PR that drops coverage below the configured threshold.
`.github/workflows/auto-label.yml` applies conventional-commit PR labels (adapted from mlx-swift-lm).
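The gate itself is just a threshold comparison. A sketch of the kind of check the workflow could run (file name, report format, and threshold are all assumptions; the real logic lives in `ci.yml`):

```swift
import Foundation

// Hypothetical coverage gate. Assumes the llvm-cov report ends with a
// "TOTAL ... NN.NN%" summary row, which may not match the real format.
let threshold = 80.0
let report = try String(contentsOfFile: "coverage-report.txt", encoding: .utf8)

guard let totalLine = report.split(separator: "\n").last(where: { $0.hasPrefix("TOTAL") }),
      let pctField = totalLine.split(separator: " ").last(where: { $0.hasSuffix("%") }),
      let pct = Double(pctField.dropLast())
else {
    fatalError("could not find a TOTAL row in the coverage report")
}

if pct < threshold {
    print("coverage \(pct)% is below the \(threshold)% threshold")
    exit(1)
}
```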
## What we don't test

- Property / fuzz testing. Out of scope for v0.1; revisit later.
- GPU mocking. All tests run real Metal dispatches.
- Defensive `fatalError` on programmer bugs. Excluded from coverage via `// coverage:ignore`.
- Multi-GPU / Linux / CUDA. Different project.
## See also

- Developing: the `make` workflow, kernel regeneration.
- Adding a model: including which tests to add.
- Performance: `Tests/PerfTests/` regression thresholds.