
GenerationParameters

Every `Model.generate(...)` call is configured by a `GenerationParameters` value. Each family declares its own defaults via its `Variant` protocol; callers can use those defaults as-is, mutate individual fields, or construct parameters from scratch.

| Field | Default | Honored today | Notes |
| --- | --- | --- | --- |
| `maxTokens: Int` | 256 | ✅ | Hard cap on generated tokens. |
| `stopOnEOS: Bool` | `true` | ✅ | Stop at the model's `eosTokenId`. |
| `extraStopTokens: Set<Int>` | `[]` | ✅ | Additional stop ids beyond EOS. |
| `prefillStepSize: Int` | 1024 | 🚧 Phase 5+ | Honored once chunked prefill ships; today's per-token prefill ignores it. |
| `temperature: Float` | 0.6 | ✅ | 0 → greedy (GPU argmax fast path, no logits readback). > 0 with no filters → GPU `softmax_categorical_sample` kernel, no logits readback. > 0 with any filter (top-K / top-P / min-P / rep-penalty) → CPU sample (one logits readback per token, ~30% decode-tok/s tax). |
| `topP: Float` | 1.0 | ✅ | Nucleus cutoff. 1.0 = disabled. Forces the CPU sample path when set. |
| `topK: Int` | 0 | ✅ | 0 = disabled. Forces the CPU sample path when set. |
| `minP: Float` | 0.0 | ✅ | Qwen-style min-P cutoff: keep tokens with prob ≥ `minP × maxProb`. Forces the CPU sample path. |
| `repetitionPenalty: Float` | 1.0 | ✅ | Hugging Face convention: for already-seen tokens, divide the logit by the penalty when it is > 0, multiply when it is < 0. 1.0 = disabled. Forces the CPU sample path. |
| `presencePenalty: Float` | 0.0 | 🚧 Phase 5+ | Additive. 0 = disabled. |
| `seed: UInt64?` | `nil` | ✅ | When set, the CPU sample path is reproducible run-to-run (SplitMix64 PRNG seeded by this). Ignored on the greedy path (no RNG draw). |
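To make the seed row concrete, here is a minimal SplitMix64 sketch; SplitMix64 is the PRNG the table names for the CPU sample path, but this standalone type and its `RandomNumberGenerator` conformance are illustrative, not the library's API:

```swift
// Illustrative SplitMix64 generator (Vigna's mixing constants): a tiny
// splittable PRNG whose entire state is one UInt64, so seeding it with the
// same value reproduces the same draw sequence run-to-run.
struct SplitMix64: RandomNumberGenerator {
    var state: UInt64
    init(seed: UInt64) { state = seed }

    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15          // golden-ratio increment
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)                   // final avalanche
    }
}

// Same seed → identical draws on every run.
var a = SplitMix64(seed: 42)
var b = SplitMix64(seed: 42)
assert(a.next() == b.next())
```

Because the greedy path never draws from the RNG, a seed changes nothing there, which is why the table calls it out as ignored.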

`generate(...)` picks the cheapest path that produces correct output for the supplied parameters:

| Path | Triggered by | Cost per token |
| --- | --- | --- |
| greedy-GPU | `temperature == 0`, no filters | GPU argmax + 4-byte readback (fastest). |
| gpu-categorical | `temperature > 0`, no filters | Forward + `softmax_categorical_sample` GPU kernel + 4-byte readback. Two cmdbufs today (forward + sample); per-family fusion is a follow-up. |
| cpu-sample | Any of `topK > 0`, `topP < 1`, `minP > 0`, `repetitionPenalty != 1` | Forward + full vocab readback (~300 KB at Qwen 3 fp16, trivial on unified memory) + `Sampling.sample(...)` pipeline. |
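The dispatch rule in the table can be sketched as follows; the enum, the `Params` stand-in, and `pickPath` are all hypothetical names, not the library's internals:

```swift
// Hypothetical sketch of the path-selection rule described above.
enum SamplePath { case greedyGPU, gpuCategorical, cpuSample }

// Stand-in for the GenerationParameters fields that matter for dispatch.
struct Params {
    var temperature: Float = 0.6
    var topP: Float = 1.0
    var topK: Int = 0
    var minP: Float = 0.0
    var repetitionPenalty: Float = 1.0
}

func pickPath(_ p: Params) -> SamplePath {
    // Any active filter forces the full-vocab logits readback.
    let hasFilters = p.topK > 0 || p.topP < 1 || p.minP > 0 || p.repetitionPenalty != 1
    if hasFilters { return .cpuSample }
    // No filters: stay on the GPU, 4-byte readback either way.
    if p.temperature == 0 { return .greedyGPU }
    return .gpuCategorical
}
```

Note that the Qwen 3 family defaults (top-p 0.95, top-k 20) count as filters, so running with untouched defaults lands on the cpu-sample path.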

The GPU categorical kernel itself (softmax_categorical_sample) lives in metaltile on the ek/sampling-kernels branch — cooperative 256-thread reduction for max + sum-exp, then a single-threaded inverse-CDF walk. The single-thread walk is the ~150µs bottleneck at vocab=152K; a parallel prefix-scan version is the natural follow-up alongside per-family forwardSampleCategorical fusion.
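For reference, the math the kernel computes (stable softmax reductions followed by an inverse-CDF walk) looks like this on the CPU; this is an illustrative Swift sketch, not the Metal source:

```swift
import Foundation

// CPU reference for what softmax_categorical_sample computes: a numerically
// stable softmax (max + sum-exp reductions) and then a sequential
// inverse-CDF walk driven by one uniform draw in [0, 1).
func categoricalSample(logits: [Float], uniform: Float) -> Int {
    let maxLogit = logits.max()!                   // cooperative max reduction on GPU
    let exps = logits.map { expf($0 - maxLogit) }  // stable exponentials
    let sum = exps.reduce(0, +)                    // sum-exp reduction
    let target = uniform * sum                     // scale target instead of normalizing
    var cdf: Float = 0
    for (i, e) in exps.enumerated() {              // the single-threaded walk
        cdf += e
        if cdf >= target { return i }
    }
    return logits.count - 1                        // guard against FP round-off
}
```

The walk is O(vocab) with one thread doing all the work, which is the ~150 µs bottleneck the paragraph above describes; a parallel prefix scan over `exps` would remove it.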

GPU top-K / top-P / min-P / rep-penalty kernels are deferred — they need a sort or radix-select, which is a substantial follow-up. Until those land, setting any filter falls back to the CPU sample path.
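As a sketch of what the CPU fallback has to do, here are two of the filter conventions from the parameter table written as plain Swift; the function names are illustrative and the real `Sampling.sample(...)` pipeline may be structured differently:

```swift
import Foundation

// Hugging Face repetition-penalty convention: for token ids already seen,
// divide positive logits by the penalty and multiply negative ones by it,
// pushing both toward "less likely". penalty == 1.0 is a no-op.
func applyRepetitionPenalty(_ logits: inout [Float], seen: Set<Int>, penalty: Float) {
    guard penalty != 1.0 else { return }
    for id in seen {
        logits[id] = logits[id] > 0 ? logits[id] / penalty : logits[id] * penalty
    }
}

// min-P cutoff: keep tokens with prob >= minP * maxProb, mask the rest.
// Working with exp(logit - maxLogit) makes maxProb exactly 1, so the
// normalization constant cancels out of the comparison.
func applyMinP(_ logits: inout [Float], minP: Float) {
    guard minP > 0 else { return }
    let maxLogit = logits.max()!
    let scaled = logits.map { expf($0 - maxLogit) }   // in (0, 1], max is 1
    for i in logits.indices where scaled[i] < minP {
        logits[i] = -.infinity                         // masked out of sampling
    }
}
```

Top-K and top-P need a partial sort of the vocab, which is exactly the piece that makes GPU versions non-trivial.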

Each family's `Variant` protocol declares a `static defaultGenerationParameters: GenerationParameters` that captures the values that family ships with. The `Model` instance carries the resolved value as `model.defaultGenerationParameters`.

```swift
let model = try await Model.load("mlx-community/Qwen3-4B-4bit")
print(model.defaultGenerationParameters.topP)             // 0.95 (Qwen 3)
print(model.defaultGenerationParameters.topK)             // 20 (Qwen 3)
print(model.defaultGenerationParameters.prefillStepSize)  // 1024
```

Current values:

| Family | temperature | topP | topK | minP | repPenalty | prefillStepSize | maxTokens |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LlamaDense | 0.6 | 1.0 | 0 | 0.0 | 1.0 | 1024 | 256 |
| Qwen3Dense | 0.6 | 0.95 | 20 | 0.0 | 1.0 | 1024 | 256 |

These match mlx-swift-lm’s per-family GenerationParameters baseline and defaultPrefillStepSize for the same architectures. As new families land (Qwen 3.5 hybrid, Qwen 3.5 MoE, Mistral, Phi, Gemma, etc.) they declare their own defaults — see developing/adding-a-model.md § Step 4: family defaults.

```swift
let result = try await model.generate(prompt: "Once upon a time")
```

`parameters` defaults to `nil`, which falls back to `model.defaultGenerationParameters`.

The `with(_:)` copy-mutator keeps the family-tuned baseline and edits a single knob:

```swift
let result = try await model.generate(
    prompt: "Once upon a time",
    parameters: model.defaultGenerationParameters.with { $0.maxTokens = 64 }
)
```

This is the recommended call shape — you don’t lose the family-tuned sampling values just because you wanted a shorter generation.
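A copy-mutator like `with(_:)` is straightforward to write on a value type; this is a hedged sketch using a stand-in struct, not the library's implementation:

```swift
// Stand-in struct illustrating one way a with(_:) copy-mutator can work:
// copy the value, let the closure mutate the copy, return the copy.
struct GenParams {
    var maxTokens = 256
    var temperature: Float = 0.6

    func with(_ mutate: (inout GenParams) -> Void) -> GenParams {
        var copy = self        // value semantics: the original is untouched
        mutate(&copy)
        return copy
    }
}

let base = GenParams()
let short = base.with { $0.maxTokens = 64 }
// base.maxTokens is still 256; short.maxTokens is 64, temperature unchanged.
```

Because the receiver is copied, the family baseline on the model is never mutated, no matter how many call sites tweak it.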

```swift
let params = GenerationParameters(
    maxTokens: 1024,
    temperature: 0.0,        // greedy
    topP: 1.0,
    repetitionPenalty: 1.05
)
let result = try await model.generate(prompt: "...", parameters: params)
```

Any field you don't pass picks up the `GenerationParameters.init` default, not the family default. Use `with(_:)` on `model.defaultGenerationParameters`, as shown above, when you want the family-tuned baseline.

Each CLI flag overrides only its corresponding GenerationParameters field; every other knob still picks up the family default. Omit all sampling flags to use the family value:

```sh
# Family defaults — Qwen 3 ships temperature=0.6 / top-p=0.95 / top-k=20.
# Routes through the CPU sample path (non-greedy):
ffai --model mlx-community/Qwen3-1.7B-4bit --prompt "Hello"

# Greedy fast path (GPU argmax, 4-byte readback per token, no sampling):
ffai --model mlx-community/Qwen3-1.7B-4bit --prompt "Hello" \
  --temperature 0 --max-tokens 64

# Seeded non-greedy — reproducible run-to-run:
ffai --model mlx-community/Qwen3-1.7B-4bit --prompt "Hello" \
  --temperature 0.7 --top-p 0.9 --seed 42 --max-tokens 64

# Full sampling pipeline:
ffai --model mlx-community/Qwen3-1.7B-4bit --prompt "Hello" \
  --temperature 0.8 --top-k 40 --top-p 0.95 --min-p 0.05 \
  --repetition-penalty 1.05 --seed 12345
```

Available sampling flags: `--temperature`, `--top-k`, `--top-p`, `--min-p`, `--repetition-penalty`, and `--seed`; `--max-tokens` caps generation length.