# Streaming
Streaming is the primitive in FFAI's generation API; the buffered
`Model.generate(...)` is a thin collector over the same stream, so
there's one source of truth for the prefill + decode loop.
```swift
let stream = model.generateStream(prompt: "Why is the sky blue?")
for try await chunk in stream {
    print(chunk.text, terminator: "")
}
```

Each chunk carries the decoded delta text, the new token id(s)
since the last yield, and the absolute sequence position. The
final chunk has empty text and tokens and a non-nil
`stats: GenerationStats` with the full memory and timing numbers,
the same shape that `--stats` prints.
## GenerationChunk shape

```swift
public struct GenerationChunk: Sendable {
    public let text: String             // decoded delta since last chunk
    public let tokens: [Int]            // new token id(s) in this chunk
    public let position: Int            // absolute sequence position after this chunk
    public let stats: GenerationStats?  // populated only on the final chunk

    public var isFinal: Bool { stats != nil }
}
```
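Putting the shape to work, a minimal consumer that accumulates the text and picks the stats off the final chunk might look like this (a sketch; `model` is a loaded `Model` as in the examples above):

```swift
var transcript = ""
var finalStats: GenerationStats?

for try await chunk in model.generateStream(prompt: "Why is the sky blue?") {
    transcript += chunk.text
    if chunk.isFinal {
        // The final chunk carries empty text/tokens and the full stats.
        finalStats = chunk.stats
    }
}
```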
## Cancellation

The stream-producing task honors Swift's structured cancellation.
Drop the consuming task (or call `task.cancel()`) and the producer
notices at the next token boundary, flushes the stream cleanly, and
finishes: no zombie command buffers, no leaked KV cache.
```swift
let task = Task {
    for try await chunk in model.generateStream(prompt: "...") {
        if userInterrupted { break }
        print(chunk.text, terminator: "")
    }
}

// elsewhere:
task.cancel()
```
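Because cancellation is structured, a deadline is just another task that calls `cancel()`. A sketch along those lines (the two-second budget is an arbitrary choice for illustration):

```swift
let consumer = Task {
    for try await chunk in model.generateStream(prompt: "...") {
        print(chunk.text, terminator: "")
    }
}

// Cancel the consumer if generation runs past a 2-second budget;
// the producer flushes at the next token boundary.
Task {
    try? await Task.sleep(nanoseconds: 2_000_000_000)
    consumer.cancel()
}
```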
## Chat / multi-turn streaming

`generateStream` has a chat-templated overload that takes
`[ChatMessage]` and renders through the tokenizer's chat template.
Same chunk shape on the output side:
```swift
let messages: [ChatMessage] = [
    .init(role: .system, content: "You are concise."),
    .init(role: .user, content: "Why is the sky blue?"),
]
let stream = try model.generateStream(messages: messages)
for try await chunk in stream {
    print(chunk.text, terminator: "")
}
```

See chat-templates.md for how `ChatMessage` and
`ChatTemplateOptions` map onto the tokenizer's Jinja template.
## Buffered collection

`Model.generate(prompt:parameters:)` is implemented roughly as:
```swift
public func generate(
    prompt: String,
    parameters: GenerationParameters? = nil
) async throws -> GenerationResult {
    var generated: [Int] = []
    var text = ""
    var stats: GenerationStats?
    for try await chunk in generateStream(prompt: prompt, parameters: parameters) {
        generated.append(contentsOf: chunk.tokens)
        text += chunk.text
        if let s = chunk.stats { stats = s }
    }
    return GenerationResult(
        promptTokens: tokenizer.encode(text: prompt),
        generatedTokens: generated,
        text: text,
        stats: stats!
    )
}
```

So you get the same `GenerationResult` shape, including all the
stats, whether you call the buffered API or build your own
collector around the stream.
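For instance, a custom collector that caps output at a token budget is just the same loop with an early `break` (a sketch; the cap itself is not part of the buffered API, and `Model` is assumed to be the type shown above):

```swift
// Collect at most `maxTokens` generated tokens, then stop.
// Breaking out of the loop cancels the producer at the next token boundary.
func generateCapped(model: Model, prompt: String, maxTokens: Int) async throws -> String {
    var text = ""
    var emitted = 0
    for try await chunk in model.generateStream(prompt: prompt) {
        text += chunk.text
        emitted += chunk.tokens.count
        if emitted >= maxTokens { break }
    }
    return text
}
```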
## Why streaming is the primitive

Three reasons:

- The most common UI pattern. Token-as-it-arrives chat and completion demos can't wait for the full response.
- Fits batch decoding naturally. When Phase 8+ batching lands, `generateStream(...)` for a single sequence becomes one of N sequences sharing the same kernel dispatch loop; the consumer side stays unchanged.
- One loop. Buffered and streaming sharing the same producer means the prefill / decode / stats path is exercised by every call shape; there are no parallel "almost-the-same" implementations to keep in sync.
## See also

- Quick start: basic usage, including the streaming example.
- Chat templates: how the `messages:` overloads render through the tokenizer.
- generation-parameters.md: the knobs that control generation; the same parameters apply to streaming and buffered calls.
- Observability: the `[STATS]` block that streaming yields on its final chunk.