# Chat templates
Most modern chat / instruct models ship with a Jinja chat template
in their `tokenizer_config.json`. FFAI calls into
swift-transformers’ `Tokenizer.applyChatTemplate(...)` to render
those templates: you pass typed `ChatMessage` values plus a typed
`ChatTemplateOptions`, and FFAI threads the right variables into the
Jinja context.
The Phase-2/2.5 plain `Model.generate(prompt:...)` API takes a raw
string — no chat template applied — which means you are
responsible for rendering the template yourself. Use the `messages:`
overloads instead when working with chat / instruct models.
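For example (the ChatML-style markers below are illustrative; the real
markers come from the model’s `tokenizer_config.json`):

```swift
// Raw-prompt path: you render the template yourself.
// <|im_start|>/<|im_end|> are ChatML-style placeholders, not FFAI output.
let rawPrompt = """
<|im_start|>user
Why is the sky blue?<|im_end|>
<|im_start|>assistant

"""
let manual = try await model.generate(prompt: rawPrompt)

// Chat path: FFAI renders the model's own chat template.
let templated = try await model.generate(
    messages: [.init(role: .user, content: "Why is the sky blue?")]
)
```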
## Buffered

```swift
let messages: [ChatMessage] = [
    .init(role: .system, content: "You are concise."),
    .init(role: .user, content: "Why is the sky blue?"),
]

let result = try await model.generate(messages: messages)
print(result.text)
```

## Streaming
```swift
let stream = try model.generateStream(messages: messages)
for try await chunk in stream {
    print(chunk.text, terminator: "")
}
```

Same chunk shape as the `prompt:` streaming variant — see
streaming.md.
## ChatMessage

```swift
public struct ChatMessage: Sendable, Equatable {
    public enum Role: String {
        case system, user, assistant, tool
    }

    public var role: Role
    public var content: String
    public var thinking: String? // re-emit reasoning trace in multi-turn
}
```

The `thinking` field is for multi-turn conversations where the
prior assistant turn included a thinking segment that the template
wants to re-emit (Qwen 3 / DeepSeek-R1 do this).
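A minimal multi-turn sketch (the message contents are illustrative,
and `thinking` is set via the public property since a
thinking-accepting initializer isn’t shown above):

```swift
var prior = ChatMessage(role: .assistant, content: "Yes, 97 is prime.")
prior.thinking = "97 is not divisible by 2, 3, 5, or 7..." // prior turn's trace

let conversation: [ChatMessage] = [
    .init(role: .user, content: "Is 97 prime?"),
    prior,
    .init(role: .user, content: "And 91?"),
]
// The template decides whether the thinking text is re-emitted.
let followUp = try await model.generate(messages: conversation)
print(followUp.text)
```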
## ChatTemplateOptions

```swift
public struct ChatTemplateOptions: Sendable {
    public var addGenerationPrompt: Bool         // = true
    public var enableThinking: Bool              // = false
    public var reasoningEffort: ReasoningEffort? // = nil (.low | .medium | .high)
    public var maxLength: Int?                   // = nil
    public var truncation: Bool                  // = false
    public var extraContext: [String: any Sendable]
}
```

| Field | Maps to | When to set |
|---|---|---|
| `addGenerationPrompt` | template’s “now generate the assistant reply” suffix | `true` (default) for the typical chat-completion case; `false` when scoring an existing assistant turn (e.g. perplexity over a fixed conversation). |
| `enableThinking` | `enable_thinking` Jinja variable | `true` to turn on the model’s reasoning mode (Qwen 3 emits `<think>...</think>` blocks). Harmless on templates that don’t reference the variable. |
| `reasoningEffort` | `reasoning_effort` Jinja variable | GPT-OSS Harmony reasoning levels (low / medium / high). |
| `maxLength` / `truncation` | swift-transformers’ template-side truncation | Hard cap on the templated token count. `truncation: false` (default) throws on overflow; `true` truncates leading turns. |
| `extraContext` | additional Jinja variables | Anything else the template reads. |
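Putting a few of these together, as a sketch that assumes a memberwise
initializer with defaults (the `date_string` variable is hypothetical;
whether any template reads it depends on the model):

```swift
let options = ChatTemplateOptions(
    addGenerationPrompt: true,
    enableThinking: true,       // Qwen 3-style <think> blocks
    maxLength: 4096,            // hard cap on templated token count
    truncation: true,           // drop leading turns instead of throwing
    extraContext: ["date_string": "2025-01-01"] // hypothetical extra variable
)
// Render only, to inspect what the template produced.
let ids = try model.renderChatTemplate(messages: messages, options: options)
```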
## Format quirks

The template does the per-family rendering. We pass typed inputs
through the well-known variable names; the rest is in the model’s
`tokenizer_config.json`. Specific behaviours:
| Family | Notes |
|---|---|
| Qwen 3 | `enable_thinking: true` → `<think>...</think>` block before the answer. Pair with `ThinkingSplit` for per-segment stats (a manual split sketch follows this table). |
| DeepSeek-R1 | Same `<think>...</think>` convention as Qwen 3. |
| GPT-OSS (Harmony) | `reasoning_effort: "high"` (etc.) → analysis + final channel structure. The `ThinkingSplit.harmony` scanner that partitions those stays a TODO until the GPT-OSS family ships (Phase 8+). |
| Gemma 3 / 4 | `<channel\|reasoning\|>` markers when reasoning is enabled. `ThinkingSplit` scanner stub lands with the Gemma family file (Phase 5+). |
| Llama 3 instruct | Standard chat template, no reasoning hooks. |
| Tools / function calling | Not yet wired — `tools:` is plumbed through swift-transformers but FFAI’s `ChatMessage` doesn’t carry `toolCalls` yet. Lands when `.toolCalling` capability ships. |
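As noted in the Qwen 3 row, here is a manual split sketch. It is plain
string handling, not the `ThinkingSplit` API, and it assumes the model
emits a single `<think>...</think>` block at the start of the reply:

```swift
import Foundation

/// Separate a Qwen 3-style <think>...</think> trace from the answer text.
func splitThinking(_ text: String) -> (thinking: String?, answer: String) {
    guard let open = text.range(of: "<think>"),
          let close = text.range(of: "</think>", range: open.upperBound..<text.endIndex)
    else {
        return (nil, text) // no thinking block found
    }
    let thinking = String(text[open.upperBound..<close.lowerBound])
    let answer = String(text[close.upperBound...])
        .trimmingCharacters(in: .whitespacesAndNewlines)
    return (thinking, answer)
}
```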
## Errors

```swift
public enum ChatTemplateError: Error {
    case noTemplateOnTokenizer   // tokenizer_config.json had no chat_template
    case renderFailed(any Error) // wraps the underlying Jinja error
}
```

`noTemplateOnTokenizer` typically means you’ve loaded a base
(non-chat) checkpoint and should either pass a raw prompt via
`generate(prompt:)`, or use a different checkpoint (e.g. `*-Instruct`).
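A handling sketch, assuming the chat overload surfaces
`ChatTemplateError` directly:

```swift
do {
    let result = try await model.generate(messages: messages)
    print(result.text)
} catch ChatTemplateError.noTemplateOnTokenizer {
    // Base checkpoint: fall back to a raw prompt (no template rendering).
    let result = try await model.generate(prompt: "Why is the sky blue?")
    print(result.text)
} catch ChatTemplateError.renderFailed(let underlying) {
    print("Jinja render failed:", underlying)
}
```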
## Rendering without generating

For testing / debugging the templated input, render to token ids without running the model:

```swift
let ids = try model.renderChatTemplate(
    messages: messages,
    options: ChatTemplateOptions(enableThinking: true)
)
print(ids)
print(model.tokenizer.decode(tokens: ids, skipSpecialTokens: false))
```

## See also
- Quickstart — `prompt:` vs `messages:` decision.
- Streaming — both overloads support streaming.
- Observability § think vs gen split — what `enable_thinking: true` enables on the stats side.