Overview
FoundationModels is Apple's framework for accessing the on-device large language model that powers Apple Intelligence. Introduced at WWDC 2025, it gives apps direct access to the same model behind Writing Tools, Smart Replies, and Mail Summaries — running entirely on-device, with no network requests and no data leaving the device.
Key characteristics:
- On-device only — no cloud fallback, no API key, no latency from network round-trips
- Privacy-first — all inference happens locally; Apple never sees your prompts or responses
- Availability-gated — requires Apple Intelligence to be enabled; not all devices qualify
- iOS 26+ only — requires iPhone 15 Pro / iPhone 15 Pro Max or later (or equivalent iPad)
- Shared resource — the model serves all apps; system may rate-limit under load
What it excels at:
- Text correction, normalisation, and reformatting
- Entity extraction and classification
- Summarisation of short-to-medium content
- Structured output generation (via guided generation)
- Context-aware suggestions and completions
What it is not:
- A replacement for frontier models (GPT-4, Claude, Gemini) for complex reasoning
- A cloud API — if the model is unavailable, there is no fallback infrastructure
- A general-purpose search or retrieval system
Minimum requirements:
- iOS 26.0+, iPadOS 26.0+, macOS Tahoe 26.0+
- Xcode 26.0+
- Device must support Apple Intelligence (iPhone 15 Pro or later)
- Apple Intelligence must be enabled in Settings
Contents
- Availability & Setup: SystemLanguageModel, availability cases, AnyObject? pattern
- Sessions & Basic Prompting: LanguageModelSession, Instructions, Prompt, .content gotcha
- Prompt Engineering for On-Device Models: what works, what doesn't, #Playground
- Guided Generation (@Generable): @Guide, constraints, PartiallyGenerated
- Streaming: streamResponse(), ResponseStream, .collect()
- Generation Options: temperature, SamplingMode, maximumResponseTokens
- Tool Calling: Tool protocol, pre-fetch vs inject, context cost
- Token Budget: tokenUsage(for:), contextSize, overflow strategies
- The Transcript: Transcript.Entry, saving/resuming sessions
- Failure Modes & Graceful Degradation: GenerationError, never-throws pattern
- Testing: four test categories, .disabled() on-device tests
- Example Use Cases: 10 concrete patterns across app domains
- Quick Reference & Anti-Patterns: cheatsheet + 10 things not to do
- Context Engineering: select/inject/compress/pre-summarise
- Advanced Patterns: actor isolation, @Generable enums with associated values, Observable monitoring, PromptRepresentable chaining, bounded domain injection
Part 1: Availability & Setup
SystemLanguageModel.default
SystemLanguageModel.default is the singleton entry point for the on-device language model. You do not initialise it — it is a static property you reference directly. Everything in FoundationModels starts here.
```swift
let model = SystemLanguageModel.default

switch model.availability {
case .available:
    // model is ready — create a session and run prompts
    break
case .unavailable(let reason):
    // handle the specific reason
    break
@unknown default:
    break
}
```
SystemLanguageModel is an Observable final class, so you can observe .availability changes in SwiftUI via @State or inside .task {} blocks without any special wiring.
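A minimal sketch of that wiring, assuming a view that only needs to branch on availability (the view name and label text here are illustrative, not from the framework):

```swift
import SwiftUI
import FoundationModels

// Illustrative view: reading .availability in body registers observation,
// so SwiftUI re-renders whenever the Observable model's availability changes.
struct AssistantGateView: View {
    private let model = SystemLanguageModel.default

    var body: some View {
        switch model.availability {
        case .available:
            Text("AI features ready")
        case .unavailable(.modelNotReady):
            Text("Model downloading, check back later")
        case .unavailable:
            Text("AI features unavailable on this device")
        @unknown default:
            Text("AI features unavailable")
        }
    }
}
```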
Availability Cases
SystemLanguageModel.Availability is an enum with two top-level cases: .available and .unavailable(UnavailableReason). Always handle @unknown default regardless — Apple may add cases in future OS versions.
.available
The model is downloaded, Apple Intelligence is enabled, and the device is eligible. Create a LanguageModelSession and proceed.
.unavailable(.deviceNotEligible)
The hardware does not support Apple Intelligence. This applies to iPhone 14 and earlier, and equivalent iPad/Mac models. This is permanent for the lifetime of the device — no amount of waiting or retrying will change it. When you see this case, remove the AI code path from your UI entirely and show a permanent alternative experience.
.unavailable(.appleIntelligenceNotEnabled)
The device is eligible but the user has not turned on Apple Intelligence in Settings > Apple Intelligence & Siri. This is a user choice, not a hardware limitation. You can optionally prompt the user to enable it:
```swift
// Optionally deep-link to Settings
if let url = URL(string: UIApplication.openSettingsURLString) {
    UIApplication.shared.open(url)
}
```
Respect the user's decision. If they choose not to enable it, show the non-AI path without nagging.
.unavailable(.modelNotReady)
This is the most misunderstood case. It does not mean the model is permanently unavailable — it means the model weights are currently downloading. There is no programmatic download API. You cannot trigger the download, request it, or track its progress. The OS manages download timing based on network conditions, battery level, device temperature, and system load. Download can take minutes to hours.
Treat .modelNotReady as a transient state. Do not show a permanent "not supported" message. Instead, show a softer "not available right now — check back later" state and retry on the next app launch or session.
```swift
func checkAvailability() -> String {
    switch SystemLanguageModel.default.availability {
    case .available:
        return "Ready"
    case .unavailable(let reason):
        switch reason {
        case .deviceNotEligible:
            return "Device not supported"
        case .appleIntelligenceNotEnabled:
            return "Enable Apple Intelligence in Settings"
        case .modelNotReady:
            return "Downloading... check back soon"
        @unknown default:
            return "Unavailable"
        }
    @unknown default:
        return "Unknown"
    }
}
```
isAvailable Convenience Property
SystemLanguageModel.default.isAvailable is a Bool shorthand. Use it when you only need to gate a code path and don't need to distinguish between unavailability reasons:
```swift
guard SystemLanguageModel.default.isAvailable else { return }
// proceed with AI code path
```
If you need to communicate why the model is unavailable to the user, use the full .availability switch instead.
UseCase — .general vs .contentTagging
SystemLanguageModel.UseCase selects a specialised version of the model. There are two options:
SystemLanguageModel.UseCase.general (the default for SystemLanguageModel.default) — a general-purpose model for writing assistance, analysis, correction, extraction, and summarisation. This is what you get when you use SystemLanguageModel.default.
SystemLanguageModel.UseCase.contentTagging — specialised for classification and extraction tasks. When you use this model, it always responds with tags — it is tuned to identify topics, emotions, actions, and objects. Use this when you want to categorise or label content rather than transform or generate it.
```swift
// General model (default — used for most tasks)
let model = SystemLanguageModel.default

// Content tagging model — for classification/extraction
let taggingModel = SystemLanguageModel(useCase: .contentTagging)
let session = LanguageModelSession(model: taggingModel)
```
Do not use .contentTagging for text correction or generation tasks. The model will produce tags rather than prose, regardless of your instructions.
Guardrails
SystemLanguageModel.Guardrails controls content safety filtering on model inputs and outputs. There are two presets:
SystemLanguageModel.Guardrails.default — the standard setting. Blocks unsafe content in both prompts and responses. When triggered, throws LanguageModelSession.GenerationError.guardrailViolation(_:).
SystemLanguageModel.Guardrails.permissiveContentTransformations — allows potentially sensitive source material to pass through for string generation tasks. Use this when your app legitimately processes user-generated content that might incidentally contain sensitive words (e.g., a chat moderation tool, a study app covering difficult topics). This mode only applies to string output — guided generation (@Generable) always uses default guardrails.
```swift
// Default guardrails (most apps)
let model = SystemLanguageModel.default

// Permissive mode — for apps that must process sensitive source text
let permissiveModel = SystemLanguageModel(guardrails: .permissiveContentTransformations)
```
Even in permissive mode, the model may still refuse certain content — it retains its own layer of safety separate from the guardrail system.
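As a sketch of what handling a guardrail rejection looks like at a call site (the fallback behaviour here, returning nil, is an app-level choice, not framework behaviour):

```swift
import FoundationModels

// Illustrative helper: on a guardrail violation, fall back to the non-AI path
// rather than retrying, since the same input will be rejected again.
func summariseSafely(_ text: String) async -> String? {
    let session = LanguageModelSession {
        "Summarise the user's note in two sentences."
    }
    do {
        return try await session.respond(to: Prompt { text }).content
    } catch LanguageModelSession.GenerationError.guardrailViolation {
        // Content was flagged by the safety layer
        return nil
    } catch {
        // Any other failure also degrades to the non-AI path
        return nil
    }
}
```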
The AnyObject? Pattern for SwiftUI
iOS 26 types require @available(iOS 26, *) annotations. Annotating a @State property with @available propagates that constraint to the entire containing view struct — meaning the whole view requires iOS 26, which is likely not what you want.
The solution is to store iOS 26-only service instances as AnyObject? and cast them back inside #available guards:
```swift
// DON'T do this — @available propagates to the whole view
@available(iOS 26, *)
@State private var service: MyAI26Service?   // ❌ forces the view to require iOS 26

// DO this instead — no @available constraint on the view struct
@State private var service: AnyObject?        // ✅ clean — AnyObject has no availability

// Store the service (inside an #available guard)
if #available(iOS 26, *) {
    self.service = MyAI26Service()
}

// Use the service (inside an #available guard)
if #available(iOS 26, *), let s = self.service as? MyAI26Service {
    let result = try await s.process(text)
}
```
This pattern lets you write a single view struct that gracefully degrades on older OS versions without any @available annotation on the view itself.
Part 2: Sessions & Basic Prompting
LanguageModelSession — Init Variants
LanguageModelSession is the object you interact with to send prompts and receive responses. The two most common init patterns are:
Fresh session (most common):
```swift
// With builder-style instructions
let session = LanguageModelSession {
    "You are a BJJ terminology corrector."
    "Fix misrecognised terms to their canonical spellings."
}

// With a specific model
let session = LanguageModelSession(model: SystemLanguageModel.default) {
    "You are a motivational coach."
}

// With string instructions — also valid
let session = LanguageModelSession(
    instructions: "You are a code review assistant."
)
```
Resume from transcript (multi-turn):
```swift
// Rehydrate a session from a saved transcript to continue a conversation
let session = LanguageModelSession(
    model: SystemLanguageModel.default,
    tools: [],
    transcript: savedTranscript
)
```
The session is an Observable final class. It is also Sendable, so you can safely hold a reference from a @MainActor context and call its methods from async tasks.
Instructions
Instructions defines the model's persona, rules, and domain — what the model is and how it behaves. Set it once at session creation. Instructions apply to every prompt in that session.
Use @InstructionsBuilder (result builder syntax) to compose instructions from multiple strings:
```swift
let instructions = Instructions {
    "You are a BJJ terminology corrector."
    "Fix misrecognised BJJ terms to their canonical spellings."
    "Common corrections: kimora→Kimura, half card→Half Guard, darce→D'Arce"
}
```
Or pass a plain String directly:
```swift
let session = LanguageModelSession(
    instructions: "You are a concise summariser. Respond in three sentences maximum."
)
```
Instructions are not the user's question — that is the Prompt. Instructions define the container; the prompt fills it.
The framework injects instructions as the system-level context for the model. The model follows instructions at higher priority than prompt content, so put your constraints and rules in instructions, not in the prompt itself.
Prompt
Prompt is the user's input — the actual question, text, or content you want the model to process. Use @PromptBuilder for dynamic construction:
```swift
// Builder style — for dynamic prompts
let prompt = Prompt {
    "Correct this transcript: \(rawText)"
}

// String literal — also valid
let response = try await session.respond(to: "Summarise the following: \(article)")
```
Prompt strings accept string interpolation. Keep prompts concise — every token in a prompt consumes context budget that competes with the response.
The Critical .content Gotcha
respond(to:options:) returns LanguageModelSession.Response<T>, not T directly. Response<T> is a wrapper struct. The actual generated value is at .content.
This is the single most common mistake when first using the framework:
```swift
// WRONG — response is Response<String>, not String
let text = try await session.respond(to: prompt)
print(text.uppercased())   // compile error: Response<String> has no uppercased()

// RIGHT
let response = try await session.respond(to: prompt)
let text = response.content   // String
print(text.uppercased())

// With typed guided generation
let response = try await session.respond(
    to: prompt,
    generating: MyOutputType.self
)
let value = response.content   // MyOutputType
```
Internalise this: respond() always returns Response<T>. Always unwrap .content before using the value.
Response.rawContent
response.rawContent gives you the unprocessed GeneratedContent before guided generation parsing. This is the raw structured output the model produced, before it was decoded into your @Generable type. Use it for debugging when a response fails to parse or produces unexpected values — it shows you exactly what the model generated.
Session-Per-Call vs Persistent Sessions
This is a key architectural decision. Get it right at design time.
Session-per-call — create a new LanguageModelSession for each request. No conversation history accumulates. This is the correct pattern for the vast majority of use cases: text correction, extraction, summarisation, classification, entity detection. Each request is independent.
```swift
// Session-per-call — correct for stateless tasks
func normalise(_ text: String) async throws -> String {
    let session = LanguageModelSession {
        "Fix speech-to-text errors in BJJ transcripts."
        "Corrections: kimora→Kimura, half card→Half Guard, darce→D'Arce"
    }
    let response = try await session.respond(to: Prompt { text })
    return response.content
}
```
Persistent session — keep the LanguageModelSession alive across multiple respond() calls. The session accumulates its Transcript as you go, so the model remembers previous exchanges. Use this only when the model needs that history to answer correctly — for example, a coaching chatbot where the user refers to something they said three turns ago.
```swift
// Persistent session — for multi-turn conversation
@Observable
class ChatAssistant {
    private let session = LanguageModelSession {
        "You are a BJJ coach assistant."
        "Help the user analyse and improve their game based on their training logs."
    }

    func chat(_ message: String) async throws -> String {
        let response = try await session.respond(to: Prompt { message })
        return response.content   // transcript accumulates automatically
    }
}
```
The risk with persistent sessions: the transcript grows with each exchange and eventually hits the context window limit, throwing LanguageModelSession.GenerationError.exceededContextWindowSize. For long-running conversations, you need a strategy for trimming or summarising history. Session-per-call has no such risk.
Default to session-per-call. Only reach for persistent sessions when you have a concrete requirement for cross-turn memory.
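One possible recovery sketch for the persistent case, assuming a hypothetical summariseHistory() helper that condenses prior turns (for example via a separate one-shot summarisation call):

```swift
import FoundationModels

// Hypothetical names: `session` is a stored var so it can be replaced,
// and summariseHistory() is your own condensation step.
var session = LanguageModelSession {
    "You are a BJJ coach assistant."
}

func send(_ message: String) async throws -> String {
    do {
        return try await session.respond(to: Prompt { message }).content
    } catch LanguageModelSession.GenerationError.exceededContextWindowSize {
        // Transcript overflowed: restart with a condensed seed and retry once
        let summary = try await summariseHistory()
        session = LanguageModelSession {
            "You are a BJJ coach assistant."
            "Conversation so far, summarised: \(summary)"
        }
        return try await session.respond(to: Prompt { message }).content
    }
}
```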
Part 3: Prompt Engineering for On-Device Models
The On-Device Model is Smaller — This Changes Everything
The model powering FoundationModels is Apple’s private on-device LLM — not GPT-4, not Claude, not Gemini. It is significantly smaller (estimated ~3B parameters) than frontier cloud models. This is a feature, not a bug — it runs entirely on your device with sub-second latency — but it fundamentally changes how you should write prompts.
Techniques that work reliably on frontier models can actively degrade performance on the on-device model. Treat every prompt engineering heuristic you have learned from cloud models as a starting point to validate, not a rule to apply.
Principle 1: Short, Direct Instructions
Keep instructions under approximately 200 words total. Longer instructions dilute the signal — the model struggles to prioritise which parts matter most and may partially ignore sections buried deep in a long system prompt.
Every sentence in your instructions should earn its place. If you can remove a sentence without changing the model’s behaviour, remove it.
```swift
// WEAK — verbose, repetitive
let session = LanguageModelSession {
    "You are a helpful assistant specialising in Brazilian Jiu-Jitsu."
    "Your primary purpose is to help users with BJJ-related queries."
    "When you see text from speech recognition, carefully examine it."
    "Your goal is to correct any speech recognition errors in the text."
    "Please make sure to handle common BJJ terminology correctly."
}

// STRONG — dense, direct
let session = LanguageModelSession {
    "Fix speech-to-text errors in BJJ transcripts."
    "Correct misrecognised terms. Return only the corrected text."
}
```
Principle 2: Explicit Corrections Beat Implied Inference
If you have known domain-specific misrecognitions or corrections, list them explicitly. Do not rely on the model inferring what “fix BJJ terms” means — it may not know the canonical spellings for niche vocabulary.
```swift
// WEAK — relies on the model knowing BJJ terminology
"Fix any incorrectly transcribed Brazilian Jiu-Jitsu terminology."

// STRONG — explicit correction table
let session = LanguageModelSession {
    "Fix speech-to-text errors in BJJ transcripts."
    "Common misrecognitions: kimora/kimura -> Kimura, half card/half god -> Half Guard,"
    "darce/dart -> D'Arce, rnc/arnc -> Rear Naked Choke, omoa plata -> Omoplata."
}
```
The on-device model does not have the deep BJJ domain knowledge that a frontier model trained on vast internet corpora might have. Make your domain knowledge explicit in the prompt rather than hoping the model already knows it.
Principle 3: Include a Domain Vocabulary in Instructions
For niche domains — BJJ, medicine, legal, finance, specialised engineering — include a vocabulary list or canonical term glossary in your instructions. This gives the model the reference it needs to make correct corrections or use correct terminology in its output.
```swift
let session = LanguageModelSession {
    "You are a BJJ transcript corrector."
    "Canonical terms: Guard, Half Guard, Mount, Back Mount, Side Control,"
    "North-South, Turtle, Closed Guard, Open Guard, De La Riva, X-Guard,"
    "Kimura, Armbar, Triangle, Rear Naked Choke, D'Arce, Anaconda,"
    "Omoplata, Heel Hook, Kneebar, Toe Hold."
    "Correct misrecognised terms to their canonical forms."
}
```
This is more token-efficient than hoping for inference, and significantly more reliable.
Principle 4: One Task Per Session
Do not ask the model to perform multiple distinct tasks in one session. Correction AND summarisation AND extraction in a single prompt will produce worse results on the on-device model than running them as separate sessions.
```swift
// WEAK — three tasks in one call
let response = try await session.respond(to: Prompt {
    "Correct BJJ terms, summarise the session, and extract techniques used."
    rawText
})

// STRONG — one focused task per session
let corrected = try await correctSession.respond(to: Prompt { rawText })
let summary = try await summarySession.respond(to: Prompt { corrected.content })
let techniques = try await extractSession.respond(to: Prompt { corrected.content })
```
The overhead of running multiple sessions is minimal compared to the reliability gain from focused, single-task prompts.
Principle 5: Avoid Chain-of-Thought Prompting
"Think step by step", "Let’s reason through this", and similar chain-of-thought prompts improve performance on large models but add noise on smaller on-device models. The model produces reasoning tokens that consume context budget without materially improving the final answer — and can sometimes cause the model to talk itself into a worse answer.
Do not use CoT prompting for on-device tasks. Give direct instructions and ask for direct output.
```swift
// WEAK — chain-of-thought on a small model
"Think step by step about what BJJ terms might have been misrecognised, then correct them."

// STRONG — direct instruction
"Correct misrecognised BJJ terms. Return only the corrected text."
```
Frontier Model vs On-Device: Comparison
| Technique | Frontier Model | On-Device Model |
|---|---|---|
| Chain-of-thought prompting | Works well ✅ | Degrades performance ❌ |
| Long, elaborate instructions | Fine ✅ | Unreliable ⚠️ |
| Implicit domain inference | Often works ✅ | Unreliable for niche domains ⚠️ |
| Explicit correction lists | Helpful ✅ | Critical ✅✅ |
| Multi-task instructions | Usually works ✅ | Fails ❌ |
| Short, direct instructions | Works ✅ | Works best ✅✅ |
| CoT / "think step by step" | Major boost ✅ | Noise and overhead ❌ |
| Few-shot examples in prompt | Works ✅ | Works, watch token budget ⚠️ |
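The last row deserves a sketch: a small number of short input/output pairs can anchor the expected format without blowing the token budget. The example pairs below are illustrative, and rawText stands for your actual input:

```swift
import FoundationModels

// Few-shot sketch: two short input/output pairs, then the real input.
// Keep examples minimal: each pair spends context budget.
let session = LanguageModelSession {
    "Fix speech-to-text errors in BJJ transcripts. Return only the corrected text."
    "Example input: worked kimora from half card"
    "Example output: Worked Kimura from Half Guard"
    "Example input: hit a darce off the scramble"
    "Example output: Hit a D'Arce off the scramble"
}
let response = try await session.respond(to: Prompt { rawText })
```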
The #Playground Macro — Fast Prompt Iteration
Available from iOS 26.4+ (February 2026 Foundation Models update), the #Playground macro lets you iterate on prompts directly in Xcode without building and running the full app. Write a #Playground block in a Swift file, run it from the Xcode canvas, and see the response inline.
When you run the canvas, the output shows Input Token Count and Response Token Count separately — useful for understanding your prompt’s cost against the ~4,096 token context window estimate shown in canvas.
```swift
import FoundationModels

#Playground {
    let session = LanguageModelSession {
        "Fix BJJ transcript errors."
        "kimora -> Kimura, half card -> Half Guard, darce -> D'Arce"
    }
    let response = try await session.respond(
        to: "worked kimora from half card today, finished with darce"
    )
    response.content   // displayed in Xcode canvas
}
```
This is the fastest feedback loop for prompt engineering. Iterate on your instructions in the playground before wiring them into the app. Test with the exact on-device model, not a frontier proxy — behaviour differs significantly, and a prompt that works on GPT-4 may not work well on the Apple on-device model.
Part 4: Guided Generation (@Generable)
What @Generable Does
@Generable is an attached macro that synthesises Generable protocol conformance on a struct or enum. At compile time it does three things:
- Generates a PartiallyGenerated associated type — a mirror of the struct where every stored property is Optional. This is the type you receive when iterating a stream mid-generation.
- Infers a JSON schema from the struct's property types and any @Guide annotations. That schema drives constrained sampling, which guarantees the output is always structurally valid — no parsing, no runtime crashes from malformed responses.
- Synthesises ConvertibleFromGeneratedContent and ConvertibleToGeneratedContent conformances, which handle encoding and decoding between the model's internal representation and your Swift type.
The model generates properties in the order they are declared, so put properties that should influence later ones first.
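For example, a sketch that exploits this ordering by declaring a short justification before the score it should inform (the type and property names here are illustrative):

```swift
import FoundationModels

// Declaration order is generation order: the model writes `reasoning` first,
// which can ground the `rating` that follows it.
@Generable
struct GradedAnswer {
    @Guide(description: "One-sentence justification for the grade")
    var reasoning: String

    @Guide(description: "Grade from 1 (poor) to 5 (excellent)", .range(1...5))
    var rating: Int
}
```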
Basic Usage
```swift
@Generable
struct BookReview {
    var title: String
    var rating: Int
    var summary: String
}

let session = LanguageModelSession()
let response = try await session.respond(
    to: "Review this book: \(bookTitle)",
    generating: BookReview.self
)
let review = response.content   // BookReview — fully populated, no parsing needed
```
@Guide — Descriptions
@Guide(description:) tells the model what a property means. Include descriptions for any property where the name alone is ambiguous. Keep them concise — long descriptions consume context and add latency.
```swift
@Generable
struct NormalisedTranscript {
    @Guide(description: "The full transcript with BJJ terms corrected and properly cased")
    var normalisedText: String

    @Guide(description: "BJJ terms found in the transcript, each in canonical form e.g. 'Kimura', 'Half Guard'")
    var extractedTerms: [String]
}
```
You can also annotate the struct itself via @Generable(description:):
```swift
@Generable(description: "A classified support ticket with priority and routing metadata")
struct TicketClassification {
    @Guide(description: "Urgency level for routing decisions")
    var priority: Int
}
```
@Guide — Constraints with GenerationGuide
@Guide also accepts one or more GenerationGuide<T> values to enforce numeric bounds and array sizes. All bounds are inclusive.
```swift
@Generable
struct ProductReview {
    @Guide(description: "Star rating", .range(1...5))
    var rating: Int

    @Guide(description: "Key selling points, at most three", .maximumCount(3))
    var keyPoints: [String]

    @Guide(description: "Topics addressed, at least one", .minimumCount(1))
    var topics: [String]

    @Guide(description: "Quality score", .minimum(0), .maximum(100))
    var qualityScore: Double
}
```
Available GenerationGuide constraints:
| Constraint | Applies To | Behaviour |
|---|---|---|
| .range(n...m) | Numeric types | Value must fall within the closed range (inclusive both ends) |
| .minimum(n) | Numeric types | Value must be ≥ n |
| .maximum(n) | Numeric types | Value must be ≤ n |
| .minimumCount(n) | [T] arrays | Array must contain ≥ n elements |
| .maximumCount(n) | [T] arrays | Array must contain ≤ n elements |
Multiple guides can be combined on a single property as variadic arguments — .minimum(0), .maximum(100) is valid.
Enums as @Generable Types
Mark enums with @Generable to use them as property types inside other @Generable structs. The constrained sampler restricts output to valid case names only:
```swift
@Generable
enum Sentiment {
    case positive
    case neutral
    case negative
}

@Generable
struct MessageClassification {
    @Guide(description: "Overall tone of the message")
    var sentiment: Sentiment

    @Guide(description: "Urgency, 1 = routine, 5 = escalate immediately", .range(1...5))
    var urgency: Int
}
```
Enums with associated values are also supported — the @Generable macro ensures all associated and nested values are themselves generable.
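A sketch of what that can look like (the cases and payloads here are illustrative, not from the framework):

```swift
import FoundationModels

// Each associated value type must itself be generable
// (String and Int conform automatically).
@Generable
enum FollowUp {
    case none
    case reminder(date: String, note: String)
    case escalate(priority: Int)
}

@Generable
struct TriageResult {
    @Guide(description: "Recommended follow-up action for this item")
    var followUp: FollowUp
}
```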
PartiallyGenerated — Streaming Snapshots
Every @Generable type gets a synthesised PartiallyGenerated associated type. It is a version of the struct where all stored properties are Optional, representing work-in-progress output during streaming:
```swift
for try await snapshot in session.streamResponse(
    to: "Review: \(bookTitle)",
    generating: BookReview.self
) {
    let partial = snapshot.content   // BookReview.PartiallyGenerated
    // partial.title might be "The G..." while still generating
    // partial.rating is nil until the model has written that property
    if let title = partial.title {
        titleLabel.text = title
    }
}
// After the loop completes, collect() gives a Response<BookReview> with all properties set
```
PartiallyGenerated is a streaming-only concern. When you call respond() (non-streaming), the response's .content is the completed Content type — no optionals, no partial states to handle.
GeneratedContent — Untyped Escape Hatch
GeneratedContent is the framework's internal structured representation of model output. You normally never interact with it — @Generable handles encoding and decoding automatically.
When you need raw access, every Response exposes:
```swift
let response = try await session.respond(to: prompt, generating: BookReview.self)
response.content     // BookReview — your typed result
response.rawContent  // GeneratedContent — the underlying parsed value
```
rawContent is useful for debugging when model output does not match your type. You can inspect it to see exactly what the model produced before your ConvertibleFromGeneratedContent init ran.
For fully dynamic schemas (where the type is not known at compile time), use respond(schema:) with a GenerationSchema built from DynamicGenerationSchema. The response will have Content == GeneratedContent, and you decode manually via value(_:forProperty:):
```swift
let response = try await session.respond(to: prompt, schema: schema)
let soup: String = try response.content.value(forProperty: "dailySoup")
```
Independent Constructability — Critical for Testing
@Generable types must be constructable via their memberwise initialiser without running the model. This is the property that makes them unit-testable:
```swift
import Testing

// Your output type
@Generable
struct NormalisedTranscript {
    @Guide(description: "Corrected transcript text")
    var normalisedText: String

    @Guide(description: "Extracted BJJ terms in canonical form")
    var extractedTerms: [String]
}

// Tests run on any machine — no Apple Intelligence required
@Test func normalisationOutputType() {
    let result = NormalisedTranscript(
        normalisedText: "Worked Kimura from Half Guard",
        extractedTerms: ["Kimura", "Half Guard"]
    )
    #expect(result.normalisedText.contains("Kimura"))
    #expect(result.extractedTerms.count == 2)
}
```
If your @Generable type has custom initialisers that depend on model output, or computed properties with side effects, you have broken this contract. Keep output types as plain data containers — structs with stored properties and no embedded behaviour.
Protocol Hierarchy
You rarely interact with these directly — @Generable wires everything up — but understanding the hierarchy helps when debugging conformance errors or writing manual implementations:
| Protocol | Role |
|---|---|
| Generable | Synthesised by @Generable. Requires PartiallyGenerated associated type, generationSchema, and ConvertibleFromGeneratedContent init. Inherits from both Convertible* protocols. |
| ConvertibleFromGeneratedContent | Types constructable from model output. Int, String, Bool, Float, Double, Decimal, Array, enums, and @Generable structs all conform automatically. |
| ConvertibleToGeneratedContent | Types that can be serialised back to GeneratedContent. Used for tool output and prompt injection. Inherits from PromptRepresentable. |
| PromptRepresentable | Types that can appear inside a @PromptBuilder closure. @Generable types conform, so you can pass model output directly back as prompt input in a subsequent call. |
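That last conformance enables chaining, sketched below with the NormalisedTranscript type from earlier (the session names are illustrative):

```swift
import FoundationModels

// Pass one call's typed output straight into the next prompt:
// @Generable types conform to PromptRepresentable.
let extraction = try await extractSession.respond(
    to: Prompt { rawText },
    generating: NormalisedTranscript.self
)

let summary = try await summarySession.respond(to: Prompt {
    "Summarise this training session in two sentences:"
    extraction.content   // a @Generable value used directly as prompt input
})
```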
Part 5: Streaming
The Core Decision: Stream or Not?
| Use Case | Method | Reason |
|---|---|---|
| Live text appearing for the user (typing effect) | streamResponse() | User sees progress, engagement increases |
| Processing output programmatically | respond() | Simpler — no partial state handling |
| Background pipeline (normalisation, extraction) | respond() | No UI benefit; streaming increases rate-limit risk in background |
| Long-form generation the user is watching | streamResponse() | Progress feedback reduces perceived latency |
| Structured @Generable output | respond() preferred | Partial structs with all-Optional properties add complexity for no gain |
Apple's own docs note that background tasks should use the non-streaming respond() to reduce the likelihood of encountering GenerationError.rateLimited errors.
String Streaming
```swift
let stream = session.streamResponse(to: "Summarise: \(text)")

for try await snapshot in stream {
    let partial: String = snapshot.content   // String grows with each chunk
    await MainActor.run { self.displayText = partial }
}

// Or skip the loop entirely and just collect the final result
let fullResponse = try await stream.collect()
let finalText = fullResponse.content   // String — complete
```
Typed (@Generable) Streaming
```swift
let stream = session.streamResponse(
    to: "Review: \(text)",
    generating: BookReview.self
)

for try await snapshot in stream {
    let partial = snapshot.content   // BookReview.PartiallyGenerated
    // All properties are Optional — may be nil while the model generates earlier properties
    if let title = partial.title { titleLabel.text = title }
    if let rating = partial.rating { updateStars(rating) }
}

// Collect to receive the complete, fully-typed result
let response = try await stream.collect()
let review = response.content   // BookReview — all properties non-nil
```
ResponseStream<Content>
streamResponse() returns a ResponseStream<Content>, which is an AsyncSequence of ResponseStream.Snapshot<Content> values. The type parameter matches what you would get from the equivalent respond() call.
```swift
// Type relationships
session.streamResponse(to: prompt)
// → ResponseStream<String>

session.streamResponse(to: prompt, generating: BookReview.self)
// → ResponseStream<BookReview>

// Each snapshot during the stream:
snapshot.content
// → String (for a string stream)
// → BookReview.PartiallyGenerated (for a typed stream — all properties Optional)

// After .collect():
response.content
// → String (complete)
// → BookReview (complete, all properties set)
```
ResponseStream<Content> conforms to AsyncSequence, so you get the full suite of async sequence operators — map, filter, prefix, etc.
Progressive UI Update Pattern
The natural pattern for SwiftUI is to assign each snapshot's content to an observable property that the view reads:
```swift
@Observable
final class SummaryViewModel {
    var generatedText = ""

    func generate(prompt: Prompt) async throws {
        let stream = session.streamResponse(to: prompt)
        for try await snapshot in stream {
            generatedText = snapshot.content  // @Observable triggers view update per chunk
        }
    }
}

// In the view:
Text(viewModel.generatedText)
    .animation(.default, value: viewModel.generatedText)
```
For @Generable types, update individual UI elements as their backing properties become available:
```swift
for try await snapshot in stream {
    let partial = snapshot.content  // BookReview.PartiallyGenerated
    titleLabel.text = partial.title ?? titleLabel.text        // retain last known value
    summaryLabel.text = partial.summary ?? summaryLabel.text
}
```
collect() — From Stream to Full Response
collect() is an async method on ResponseStream that waits for the stream to finish and returns a complete Response<Content>:
```swift
let stream = session.streamResponse(to: prompt)

// Option A: observe snapshots AND get the final result
for try await snapshot in stream {
    updateProgressUI(snapshot.content)
}
// Stream is exhausted — collect() returns immediately since the stream is done
let finalResponse = try await stream.collect()

// Option B: skip observation, just get the final result
let finalResponse = try await stream.collect()
```
If the stream finished with an error before collect() is called, collect() propagates that error. If the stream completed successfully, collect() returns immediately with the cached result.
Error Handling in Streams
Errors are thrown during iteration, not at stream creation (the stream object itself is always returned, even if the model will fail):
```swift
do {
    for try await snapshot in stream {
        // process snapshot
    }
} catch LanguageModelSession.GenerationError.rateLimited(let retryAfter) {
    // system under load — retry after the given delay
} catch LanguageModelSession.GenerationError.exceededContextWindowSize {
    // prompt + history too long — trim the input
} catch LanguageModelSession.GenerationError.guardrailViolation {
    // content flagged — show alternative UX
} catch {
    // unexpected error
}
```
The same error types apply to streamResponse() as to respond() — the difference is only in when they surface during your async call.
Part 6: Generation Options
GenerationOptions is a struct you pass to respond() or streamResponse() to control how the model generates output. All properties are optional — omitting them leaves the model at its defaults, which are usually correct.
```swift
let options = GenerationOptions(
    temperature: 0.1,
    maximumResponseTokens: 200
)
let response = try await session.respond(to: prompt, options: options)
```
temperature
Temperature controls how "creative" or "random" the model's output is, on a scale from 0.0 to 1.0. nil (the default) lets the model use its own calibrated default, which is appropriate for most tasks.
| Temperature | Behaviour | Best For |
|---|---|---|
| `nil` | Model default (typically ~0.7) | General use — let the model decide |
| 0.0–0.2 | Near-deterministic, consistent | Corrections, extraction, classification |
| 0.3–0.6 | Balanced | Summarisation, analysis |
| 0.7–1.0 | Creative, varied | Brainstorming, dialogue, story generation |
The most common mistake: setting a high temperature for a correction or extraction task. If you are normalising speech-to-text errors using @Generable, you want the model to produce the same correct answer every time — not a creatively varied interpretation. Use nil or set it low.
```swift
// ❌ High temperature for a structured correction task — produces inconsistent output
let options = GenerationOptions(temperature: 0.9)
let response = try await session.respond(
    to: Prompt { rawTranscript },
    generating: NormalisedTranscript.self,
    options: options
)

// ✅ Low temperature — deterministic, reliable corrections
let options = GenerationOptions(temperature: 0.1)
// or just omit options entirely — the schema constraints already reduce variance
```
For @Generable output, the constrained sampler that enforces your schema also reduces variance regardless of temperature. But setting temperature low is still good practice to signal your intent and produce maximally consistent output.
GenerationOptions.SamplingMode
SamplingMode gives you control over the underlying sampling algorithm. The two modes are:
.greedy — always selects the single most probable token at each step. Maximally deterministic. Best for tasks with one correct answer (grammar correction, structured extraction).
.random(temperature:) — samples from the probability distribution, with temperature scaling how broadly. This is the mode behind the temperature parameter.
```swift
// Explicit greedy sampling — maximum determinism
let options = GenerationOptions(
    sampling: .greedy
)

// Random sampling at a specific temperature
let options = GenerationOptions(
    sampling: .random(temperature: 0.7)
)
```
The temperature property on GenerationOptions is a convenience shorthand for .random(temperature:). Setting temperature: 0.0 is equivalent to .greedy.
maximumResponseTokens
Sets an upper bound on how many tokens the model can generate in its response. Useful for:
- Capping costs (on-device, this is latency rather than money) when you know responses should be short
- Preventing runaway generation in summary tasks where you want concise output
- Enforcing length constraints the instructions alone can't reliably enforce
```swift
// Limit to a short summary (~100 tokens ≈ ~75 words)
let options = GenerationOptions(maximumResponseTokens: 100)
let response = try await session.respond(
    to: "Summarise this training session in one paragraph: \(notes)",
    options: options
)
```
Be careful not to set maximumResponseTokens too low for @Generable types — if the model runs out of tokens before completing your struct, it will throw GenerationError.exceededContextWindowSize.
Part 7: Tool Calling
Tools let the model call back into your Swift code to fetch data or perform actions during generation. The model autonomously decides whether and when to call a tool — you provide the definitions; it decides whether they are relevant to the current prompt.
The Tool Protocol
Conform to Tool to define a callable function the model can invoke:
```swift
@available(iOS 26, *)
struct CurrentDateTool: Tool {
    let name = "getCurrentDate"
    let description = "Returns today's date in ISO 8601 format (YYYY-MM-DD)."

    // Arguments the model will pass — a @Generable struct
    @Generable
    struct Arguments {
        @Guide(description: "Optional timezone identifier, e.g. 'Europe/Dublin'")
        var timezone: String?
    }

    // Return type — any PromptRepresentable (String is simplest)
    func call(arguments: Arguments) async -> String {
        let formatter = ISO8601DateFormatter()
        if let tz = arguments.timezone, let zone = TimeZone(identifier: tz) {
            formatter.timeZone = zone
        }
        return formatter.string(from: Date())
    }
}
```
Key constraints:
- `Arguments` must conform to `ConvertibleFromGeneratedContent`. A `@Generable` struct is the standard approach — the macro handles conformance automatically.
- `Output` (the return type) must conform to `PromptRepresentable`. `String` always works. `@Generable` types also work.
- `call(arguments:)` is implicitly `@concurrent` — it runs off the main actor. Make it `async` if you need to do async work.
Registering Tools With a Session
Pass tools in the tools parameter when creating a session:
```swift
// (Inside an iOS 26-available context)
let session = LanguageModelSession(
    tools: [CurrentDateTool(), UserProfileTool()]
) {
    "You are a task scheduling assistant."
    "Use getCurrentDate to determine today's date before scheduling."
}

let response = try await session.respond(
    to: "Schedule a reminder for two weeks from today"
)
let text = response.content  // model called getCurrentDate internally
```
The model receives each tool's name, description, and the JSON schema derived from Arguments. It uses the name and description to decide whether calling the tool is relevant to the prompt. Name and description are the primary signals — write them as short, specific phrases.
How the Model Decides to Call Tools
You cannot force the model to call a specific tool. It decides autonomously based on:
- Whether the tool's name and description match the intent of the prompt
- Whether it already has the information it needs without a tool call
- Whether the prompt semantically requires external data
The model may call zero tools (if it can answer from its knowledge), call one tool, or call multiple tools before producing its final response.
Critical Performance Insight: Pre-Fetch vs Tool
This is Apple's own guidance from the documentation, and it matters for performance:
If you ALWAYS need data from a source, inject it directly into instructions rather than defining a tool.
```swift
// ❌ Tool for data you always need — adds latency on every call
struct UserPreferencesTool: Tool { ... }

// ✅ Pre-fetch and inject — one fetch, zero tool overhead
let preferences = await loadUserPreferences()
let session = LanguageModelSession {
    "User preferences: \(preferences.serialised)"
    "Use these preferences when making recommendations."
}
```
Tools have two costs:
- Token cost — each tool definition (name + description + arguments schema) consumes context budget. A tool with a complex `Arguments` struct can cost 50–100 tokens just for its definition.
- Latency cost — each tool call is a model inference round-trip: the model generates a call, your code runs, the result is injected back, the model continues. This adds meaningful latency.
Reserve tools for data that is conditionally needed — data you might need depending on what the user asks.
Context Window Cost
Define tools concisely. The model sees name + description + arguments schema for every tool, every call, whether it uses them or not.
```swift
// ❌ Verbose tool definition — each call consumes more context
struct FetchUserTrainingHistoryForTheLastSixMonthsTool: Tool {
    let name = "fetchUserTrainingHistoryForTheLastSixMonths"
    let description = "This tool fetches the complete training history of the current user for the past six calendar months, including all session notes, techniques practised, and time spent..."
    // ...
}

// ✅ Concise — same capability, fraction of the tokens
struct TrainingHistoryTool: Tool {
    let name = "getTrainingHistory"
    let description = "Returns recent training sessions with notes and techniques."
    // ...
}
```
A practical limit is 3–5 tools per session. Beyond that, the definitions alone consume a significant portion of context, leaving less room for the actual conversation.
Tool Calls in the Transcript
When the model calls a tool, it appears in the session's Transcript as two entries:
- `Transcript.Entry.toolCalls` — the model's request(s) to call tools
- `Transcript.Entry.toolOutput` — the results that were injected back
This is useful when debugging why the model produced a particular response — you can inspect the transcript to see exactly what tool calls were made and what data the model received. See Part 9 (The Transcript) for full Transcript coverage.
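A minimal inspection sketch using those two entry cases (assumes a `session` whose transcript already contains tool activity; the print format is illustrative):

```swift
// Walk the transcript and surface only tool activity.
for entry in session.transcript.entries {
    switch entry {
    case .toolCalls(let calls):
        print("Model requested tools: \(calls)")
    case .toolOutput(let output):
        print("Tool output injected: \(output)")
    default:
        break  // instructions, prompts, and responses skipped here
    }
}
```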
Part 8: Token Budget
The on-device model has a fixed context window shared by all inputs and outputs for a session. Understanding how that budget is consumed is essential for building reliable features — especially multi-turn conversations and tool-using sessions.
The Budget Breakdown
Every token in a session competes for the same fixed window:
```
Total Context Window
├── Instructions (system prompt)
├── Tool definitions (name + description + args schema × number of tools)
├── Transcript history (all previous turns)
├── Current prompt
└── Response (tokens generated)
```
Response tokens are not free — they come out of the same pool as input. A long system prompt and a long conversation history leave less room for both the current prompt and its response.
Measuring Token Usage
SystemLanguageModel exposes three tokenUsage(for:) overloads (added February 2026):
```swift
let model = SystemLanguageModel.default

// 1. Cost of Instructions + tool definitions
let instrUsage = try await model.tokenUsage(
    for: instructions,
    tools: [MyTool()]
)
print(instrUsage.tokenCount)  // e.g. 180

// 2. Cost of a single Prompt
let promptUsage = try await model.tokenUsage(for: prompt)
print(promptUsage.tokenCount)  // e.g. 45

// 3. Cost of a saved Transcript (conversation history)
let historyUsage = try await model.tokenUsage(for: transcript.entries)
print(historyUsage.tokenCount)  // e.g. 620
```
All three return SystemLanguageModel.TokenUsage, with a single tokenCount: Int property. Use these to profile your sessions during development rather than guessing.
The contextSize Property
SystemLanguageModel.contextSize is an async property that returns the total context window size in tokens. It is back-deployed to earlier OS versions via @backDeployed:
```swift
let totalWindow = await SystemLanguageModel.default.contextSize  // e.g. 4096
let available = totalWindow - instrUsage.tokenCount - historyUsage.tokenCount
print("Available for prompt + response: \(available) tokens")
```
Use contextSize to compute headroom before sending a prompt, particularly in multi-turn sessions where history accumulates.
GenerationError.exceededContextWindowSize
This error is thrown when the combined input (instructions + tools + history + prompt) exceeds the context window. Handle it gracefully:
```swift
do {
    let response = try await session.respond(to: prompt)
} catch LanguageModelSession.GenerationError.exceededContextWindowSize {
    // Strategies:
    // 1. Summarise the conversation history and start a new session
    // 2. Trim the oldest transcript entries
    // 3. Remove tool definitions you don't strictly need
    // 4. Shorten the prompt
}
```
For multi-turn sessions, the most robust strategy is to detect when history is growing long and summarise it before continuing:
```swift
// When history exceeds a threshold, compress it
if historyTokenCount > contextSize / 2 {
    let summary = try await summariseHistory(session.transcript)
    // Start fresh session with summary in instructions
    session = LanguageModelSession {
        "Previous conversation summary: \(summary)"
    }
}
```
The #Playground Macro for Budget Profiling
The #Playground macro in Xcode (26.4+) shows Input Token Count and Response Token Count separately in the canvas after each run. This is the fastest way to profile token usage during development — no logging, no instrumentation, just iterate on the prompt and watch the counts update in real time.
Rules of Thumb
| Content | Approximate Token Cost |
|---|---|
| 1 word | ~1.3 tokens |
| 100 words | ~130 tokens |
| 1 page (250 words) | ~325 tokens |
| Simple `@Generable` struct (2 props) | ~50 tokens overhead |
| Tool definition (name + description + args) | ~50–100 tokens |
| Default context window | ~4,096 tokens |
A 4k window sounds large but fills up quickly in multi-turn sessions with tool-heavy prompts.
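For quick back-of-envelope checks before calling tokenUsage(for:), the word-based heuristic above can be folded into a small helper. The 1.3 multiplier is the approximation from the table, not an API value:

```swift
// Rough pre-flight estimate using the ~1.3 tokens-per-word rule of thumb.
// For real numbers, prefer SystemLanguageModel.tokenUsage(for:).
func estimatedTokenCount(for text: String) -> Int {
    let words = text.split(whereSeparator: \.isWhitespace).count
    return Int((Double(words) * 1.3).rounded(.up))
}
```

This is only useful for coarse decisions, such as "is this transcript obviously too long to send?" — profile with the real APIs before shipping thresholds.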
Part 9: The Transcript
Transcript is the linear record of everything that has happened in a LanguageModelSession. Every turn adds entries. The transcript is how the model "remembers" previous exchanges in a multi-turn conversation.
Transcript.Entry
The transcript is an array of Transcript.Entry values. Each entry is one of five cases:
| Entry | When It Appears |
|---|---|
| `.instructions(Transcript.Instructions)` | Session creation — the system prompt |
| `.prompt(Transcript.Prompt)` | Each time you call `respond()` or `streamResponse()` |
| `.response(Transcript.Response)` | Each model reply |
| `.toolCalls(Transcript.ToolCalls)` | When the model decides to invoke one or more tools |
| `.toolOutput(Transcript.ToolOutput)` | The result(s) returned from your tool's `call()` |
A simple two-turn conversation produces this entry sequence:
```
.instructions   ← session setup
.prompt         ← "What's the best sweep from Half Guard?"
.response       ← "The Hip Bump Sweep is..."
.prompt         ← "How do I set it up?"
.response       ← "Start by flattening your opponent..."
```
A tool-calling exchange adds two extra entries per tool call:
```
.prompt         ← "What techniques did I drill last Tuesday?"
.toolCalls      ← [getTrainingHistory(date: "2026-02-24")]
.toolOutput     ← [{ sessions: [...] }]
.response       ← "Last Tuesday you drilled..."
```
Reading the Transcript
Access the current session transcript via session.transcript:
```swift
let session = LanguageModelSession {
    "You are a BJJ coach."
}
_ = try await session.respond(to: "What is the Kimura?")
_ = try await session.respond(to: "How do I finish it from Guard?")

// Inspect the transcript
for entry in session.transcript.entries {
    switch entry {
    case .prompt(let p):
        print("User: \(p.segments.map(\.description).joined())")
    case .response(let r):
        print("Model: \(r.segments.map(\.description).joined())")
    default:
        break
    }
}
```
Saving and Resuming Sessions
Save the transcript to persist a conversation and resume it later — useful for a coaching assistant where the user expects the model to remember what they discussed in previous sessions:
```swift
// Save
let savedTranscript = session.transcript
// Persist to SwiftData, UserDefaults, or disk...

// Resume — new session with full history
let resumedSession = LanguageModelSession(
    model: SystemLanguageModel.default,
    tools: [],
    transcript: savedTranscript
)

// Model now has full context of the previous conversation
let response = try await resumedSession.respond(to: "Where were we?")
```
The resumed session is identical in behaviour to a session that never stopped — the model sees the full entry history.
When to Use the Transcript
Use transcript accumulation when:
- The model needs to refer back to something the user said earlier ("as I mentioned before...")
- You are building a multi-turn chatbot or coaching assistant
- Continuity across app sessions is a user-facing feature
Do NOT accumulate transcripts when:
- Each call is independent (normalisation, extraction, summarisation, classification)
- You are using session-per-call — there is no transcript to worry about
- The task is stateless — the model does not need to "remember" anything
Unnecessary transcript accumulation wastes context budget and eventually causes GenerationError.exceededContextWindowSize. Most FoundationModels use cases do not need cross-turn memory — use session-per-call by default (see Part 2).
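As a reminder of what session-per-call looks like in practice, here is a hedged sketch (the classification task and prompt wording are illustrative, not from the source):

```swift
// Each call builds a throwaway session: no transcript accumulates,
// so no context budget is ever spent on history.
@available(iOS 26, *)
func sentiment(of text: String) async throws -> String {
    let session = LanguageModelSession {
        "Classify the sentiment of the text as positive, neutral, or negative."
        "Respond with the single word only."
    }
    return try await session.respond(to: Prompt { text }).content
}
```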
Part 10: Failure Modes & Graceful Degradation
FoundationModels can fail in ways that are different from a typical network API. Most failures are environmental (device eligibility, model state, system load) rather than logic errors. The right response in almost every case is graceful degradation, not throwing errors up to the UI.
GenerationError Cases
LanguageModelSession.GenerationError is thrown from respond() and streamResponse():
.exceededContextWindowSize
The combined input (instructions + tools + history + prompt) exceeded the context window. Solutions in order of preference:
- Reduce the prompt — summarise or truncate the input text
- Trim the oldest transcript entries in a multi-turn session
- Remove tool definitions that aren't needed for this call
- Split into multiple sessions
.rateLimited
The system is under load. The on-device model is a shared resource — all apps use the same model, and the OS rate-limits when demand is high. Handle with simple exponential backoff:
```swift
func generateWithRetry(session: LanguageModelSession, prompt: Prompt) async throws -> String {
    var delay: UInt64 = 1_000_000_000  // 1 second
    for attempt in 1...3 {
        do {
            return try await session.respond(to: prompt).content
        } catch LanguageModelSession.GenerationError.rateLimited {
            if attempt < 3 {
                try await Task.sleep(nanoseconds: delay)
                delay *= 2
            }
        }
    }
    throw LanguageModelSession.GenerationError.rateLimited  // re-throw after 3 attempts
}
```
.guardrailViolation
The content triggered safety filtering. This can happen on the prompt (the input was flagged) or on the response (the model started generating something that triggered the filter). The error contains context on what was flagged.
.unsupportedGuide
A @Guide constraint on a @Generable type is not supported for the current model or OS version. This should not occur in production if your deployment target is correct, but handle it defensively.
LanguageModelSession.GenerationError.Refusal
When the model declines to answer a prompt, it throws a Refusal error. Refusal is special because it includes an explanation:
```swift
do {
    let response = try await session.respond(to: prompt)
} catch let refusal as LanguageModelSession.GenerationError.Refusal {
    // Get the explanation as a complete Response<String>
    let explanation = try await refusal.explanation
    print(explanation.content)  // "I can't help with that because..."

    // Or stream it
    for try await snapshot in refusal.explanationStream {
        print(snapshot.content)
    }
}
```
Production Pattern: The Never-Throws Service
The cleanest production pattern is a service method that never throws — it returns the raw input unchanged on any failure. Callers have zero error handling burden, and worst case equals current pre-AI behaviour:
```swift
@available(iOS 26, *)
final class TranscriptNormalisationService {
    func normalise(_ rawTranscript: String) async -> String {
        guard !rawTranscript.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty else {
            return rawTranscript
        }
        do {
            let session = LanguageModelSession {
                "Fix speech-to-text errors in BJJ transcripts."
                "Return only the corrected text."
            }
            let response = try await session.respond(
                to: Prompt { rawTranscript },
                generating: NormalisedTranscript.self
            )
            return response.content.normalisedText
        } catch {
            // Log the error, return the raw transcript unchanged
            GraplaLogger.data.error("Normalisation failed: \(error)")
            return rawTranscript
        }
    }
}
```
This pattern means:
- The caller always gets a `String` back — no try/catch required
- If AI is unavailable, the app works exactly as before
- Errors are logged for debugging without surfacing to the user
Additional Production Patterns
Cache availability at setup, not per-call. SystemLanguageModel.default.availability has non-trivial overhead. Check it once when the view or service initialises and store the result. Availability doesn't change mid-session.
```swift
// ❌ Checking availability on every call
func normalise(_ text: String) async -> String {
    guard SystemLanguageModel.default.isAvailable else { return text }  // overhead each time
    ...
}

// ✅ Check once, cache
final class NormalisationService {
    private let isAvailable = SystemLanguageModel.default.isAvailable

    func normalise(_ text: String) async -> String {
        guard isAvailable else { return text }
        ...
    }
}
```
The fallback path is production code. On the majority of devices in 2026, Apple Intelligence will not be available (older hardware, non-supported regions, disabled in settings). Your non-AI code path is not a fallback — it is the primary path for most users. Test it as thoroughly as the AI path.
Use AnyObject? for iOS 26 services in SwiftUI views. Covered in Part 1, but worth repeating: avoid @available(iOS 26, *) on @State properties. Use AnyObject? and cast inside #available guards to prevent the constraint propagating to the whole view.
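A minimal sketch of that pattern, reusing the TranscriptNormalisationService from the never-throws example above (the view name and body are illustrative):

```swift
import SwiftUI

struct NotesView: View {
    // Untyped storage — no @available constraint leaks onto the view
    @State private var normaliser: AnyObject?

    var body: some View {
        Text("Notes")
            .task {
                if #available(iOS 26, *) {
                    normaliser = TranscriptNormalisationService()
                }
            }
    }

    private func normalise(_ text: String) async -> String {
        if #available(iOS 26, *),
           let service = normaliser as? TranscriptNormalisationService {
            return await service.normalise(text)
        }
        return text  // primary path on older OS versions
    }
}
```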
Part 11: Testing
The most important insight for testing FoundationModels code: most of your test suite should never touch the model. Well-structured FoundationModels code is testable at every layer without Apple Intelligence.
The Four Test Categories
1. Output Type Tests (No Model Required)
@Generable structs are plain data containers with memberwise initialisers. You can construct them directly in tests, verify Equatable conformance, and test edge cases without the model ever running:
```swift
@Suite("NormalisedTranscript")
struct NormalisedTranscriptTests {
    @Test func construction() {
        let result = NormalisedTranscript(
            normalisedText: "Worked Kimura from Half Guard",
            extractedTerms: ["Kimura", "Half Guard"]
        )
        #expect(result.normalisedText == "Worked Kimura from Half Guard")
        #expect(result.extractedTerms.count == 2)
        #expect(result.extractedTerms.contains("Kimura"))
    }

    @Test func equatable() {
        let a = NormalisedTranscript(normalisedText: "Test", extractedTerms: [])
        let b = NormalisedTranscript(normalisedText: "Test", extractedTerms: [])
        #expect(a == b)
    }

    @Test func emptyTerms() {
        let result = NormalisedTranscript(normalisedText: "Some text", extractedTerms: [])
        #expect(result.extractedTerms.isEmpty)
    }
}
```
These tests run in CI on any machine. No simulator required.
2. Service Fallback Tests (Works on All Simulators)
Test that your service returns the raw input unchanged when the model is unavailable. The simulator never has Apple Intelligence, so this path is always exercised:
```swift
@MainActor
@Suite("TranscriptNormalisationService")
struct TranscriptNormalisationServiceTests {
    @Test func emptyTranscriptReturnsEmpty() async {
        guard #available(iOS 26, *) else { return }
        let service = TranscriptNormalisationService()
        let result = await service.normalise("")
        #expect(result.isEmpty)
    }

    @Test func whitespaceOnlyReturnsUnchanged() async {
        guard #available(iOS 26, *) else { return }
        let service = TranscriptNormalisationService()
        let result = await service.normalise(" \n ")
        #expect(result.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty)
    }

    @Test func unavailableModelReturnsFallback() async {
        guard #available(iOS 26, *) else { return }
        // On simulator, model is unavailable — service must return raw transcript
        let service = TranscriptNormalisationService()
        let raw = "worked on kimora from half card today"
        let result = await service.normalise(raw)
        // On device: could be corrected. On simulator: must equal raw input.
        #expect(!result.isEmpty)  // just verify it doesn't crash
    }
}
```
3. Availability Tests (Works Everywhere)
Verify your availability checking code runs without crashing. Do not assert the specific availability state — it varies by machine, OS, and whether Apple Intelligence is enabled:
```swift
@Test func availabilityCheckDoesNotCrash() {
    guard #available(iOS 26, *) else { return }
    let availability = SystemLanguageModel.default.availability

    // Just verify we get a valid state — don't assert which state
    switch availability {
    case .available:
        break  // fine
    case .unavailable:
        break  // also fine — expected on simulator
    @unknown default:
        break
    }
}
```
4. On-Device Tests (Manual, .disabled() by Default)
Mark tests that require a real device with Apple Intelligence as .disabled(). They are skipped in CI but can be run manually on a real device:
```swift
@Test("Normalises BJJ terms on-device", .disabled("Requires device with Apple Intelligence"))
func normalisesTermsOnDevice() async throws {
    guard #available(iOS 26, *) else { return }
    let service = TranscriptNormalisationService()
    let raw = "rolled today, worked on my kimora from half card"
    let result = await service.normalise(raw)

    // On a real device with AI, these should be corrected
    #expect(result.contains("Kimura") || result.contains("kimura"))
    #expect(!result.contains("kimora"))
}
```
To run these locally: open the test plan in Xcode, filter by the test name, and run on a connected iPhone 15 Pro or later with Apple Intelligence enabled.
Testing Checklist
| Test | Runs in CI | Requires Apple Intelligence |
|---|---|---|
| `@Generable` type construction | ✅ | ❌ |
| `@Generable` equatable | ✅ | ❌ |
| Service empty input handling | ✅ | ❌ |
| Service fallback (model unavailable) | ✅ | ❌ |
| Availability check no-crash | ✅ | ❌ |
| End-to-end normalisation | ❌ (manual) | ✅ |
Aim for 100% automated coverage of everything above the model boundary. The on-device generation itself is integration-tested manually.
Part 12: Example Use Cases
These examples cover the range of tasks FoundationModels handles well. Each follows the same pattern: on-device, private, structured output, graceful fallback.
1. Sports / BJJ App — Domain-Specific Transcript Normalisation
Use case: Correct speech-to-text misrecognitions of BJJ terms before feeding into entity extraction.
Why FoundationModels: A regex can't handle "kimora" → "Kimura" contextually; a cloud API sends private training notes offsite. On-device gets both right.
```swift
@Generable
struct NormalisedTranscript {
    @Guide(description: "The transcript with BJJ terms corrected")
    var normalisedText: String

    @Guide(description: "Canonical BJJ terms extracted, e.g. ['Kimura', 'Half Guard']")
    var extractedTerms: [String]
}
```
Tools needed: None — pure text transformation. Session-per-call.
2. Recipe App — Ingredient Extraction From Voice
Use case: "I need some eggs, any kind of cheese, and that Italian herb" → structured shopping list.
Why FoundationModels: The model handles colloquial descriptions ("that Italian herb" → "basil"), vague quantities ("some"), and variety descriptions ("any kind of cheese") — none of which a regex can parse.
```swift
@Generable
struct Ingredient {
    @Guide(description: "Canonical ingredient name, e.g. 'basil', 'eggs'")
    var name: String

    @Guide(description: "Quantity as spoken, e.g. '2', 'some', 'a handful'")
    var quantity: String
}

@Generable
struct IngredientList {
    @Guide(description: "All ingredients mentioned", .minimumCount(1))
    var ingredients: [Ingredient]
}
```
Tools needed: None. Session-per-call.
3. Journaling App — Private Mood Tagging
Use case: Classify a journal entry's emotional tone without sending text to a cloud service.
Why FoundationModels: Journal entries are deeply personal. On-device is the only acceptable processing option — not a preference, a product requirement.
```swift
@Generable
enum PrimaryMood {
    case joyful, content, neutral, anxious, sad, angry, reflective
}

@Generable
struct MoodAnalysis {
    @Guide(description: "The dominant emotion in the entry")
    var primaryMood: PrimaryMood

    @Guide(description: "Intensity, 1 = mild, 5 = intense", .range(1...5))
    var intensity: Int

    @Guide(description: "Key themes, up to three", .maximumCount(3))
    var themes: [String]
}
```
Tools needed: None. Session-per-call.
4. Task Manager — Natural Language Task Parsing
Use case: "Remind me to call Mum next Tuesday afternoon" → structured task with date components and priority.
Why FoundationModels: Natural language date parsing ("next Tuesday"), intent extraction, and priority inference in a single call.
```swift
@Generable
struct ParsedTask {
    @Guide(description: "Clean task title, e.g. 'Call Mum'")
    var title: String

    @Guide(description: "Relative date reference as spoken, e.g. 'next Tuesday afternoon'")
    var dateReference: String

    @Guide(description: "Priority 1 (low) to 3 (high)", .range(1...3))
    var priority: Int
}
```
Tools needed: CurrentDateTool to anchor relative dates ("next Tuesday" needs to know what today is).
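Wiring the two together might look like this sketch (combining the CurrentDateTool from Part 7 with the ParsedTask type above; the function name and prompt wording are illustrative):

```swift
// Session-per-call parse: the tool anchors relative dates,
// guided generation guarantees a typed ParsedTask back.
@available(iOS 26, *)
func parseTask(_ spokenText: String) async throws -> ParsedTask {
    let session = LanguageModelSession(tools: [CurrentDateTool()]) {
        "Parse the user's request into a task."
        "Use getCurrentDate to anchor relative dates like 'next Tuesday'."
    }
    let response = try await session.respond(
        to: Prompt { spokenText },
        generating: ParsedTask.self
    )
    return response.content
}
```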
5. Fitness App — Workout Log Summarisation
Use case: After a training session, summarise a structured workout log into a human-readable weekly review.
Why FoundationModels: Summary generation from structured data into natural prose. Streaming makes it feel responsive.
```swift
// No @Generable needed — plain text output, streamed
let stream = session.streamResponse(
    to: "Summarise this week's training in 2 paragraphs: \(workoutLogJSON)"
)
for try await snapshot in stream {
    summaryView.text = snapshot.content  // live update as text generates
}
```
Tools needed: None. Session-per-call. Use streamResponse() for the typing effect.
6. Developer Tool — Conventional Commit Message Generation
Use case: Given a summary of changed files and diff, generate a conventional commit message.
Why FoundationModels: Requires understanding intent from code changes — beyond simple pattern matching, but doesn't need frontier reasoning. On-device keeps source code private.
```swift
@Generable
enum CommitType {
    case feat, fix, chore, docs, refactor, test, perf
}

@Generable
struct CommitMessage {
    @Guide(description: "Conventional commit type")
    var type: CommitType

    @Guide(description: "Affected scope, e.g. 'auth', 'ui', nil if unclear")
    var scope: String?

    @Guide(description: "Imperative subject line, 72 chars max")
    var subject: String

    @Guide(description: "Optional body with context on why this change was made")
    var body: String?
}
```
Tools needed: None. Session-per-call.
7. Language Learning App — Sentence Correction
Use case: Correct a learner's written sentence while preserving their intended meaning.
Why FoundationModels: Grammar correction requires semantic understanding — the model must know what the learner was trying to say. On-device matters here too: learners write embarrassing mistakes they would prefer not to send to a cloud API.
```swift
@Generable
struct CorrectedSentence {
    @Guide(description: "The corrected sentence with natural grammar")
    var correctedText: String

    @Guide(description: "Explanations of corrections made, e.g. ['Changed tense from past to present perfect']")
    var explanations: [String]

    @Guide(description: "Confidence the original meaning was preserved, 1-5", .range(1...5))
    var meaningPreservedConfidence: Int
}
```
Tools needed: None. Session-per-call.
8. E-Commerce — Product Attribute Extraction
Use case: Extract structured attributes (colour, size, material, style) from free-text product descriptions for catalogue indexing.
Why FoundationModels: Product descriptions are unstructured prose. Structured extraction via @Generable is more robust than regex for the variety of descriptions sellers write.
```swift
@Generable
struct ProductAttributes {
    @Guide(description: "Primary colour(s), e.g. ['navy', 'white']")
    var colours: [String]
    @Guide(description: "Material, e.g. 'cotton', 'polyester blend'")
    var material: String?
    @Guide(description: "Style keywords, e.g. ['casual', 'slim-fit']", .maximumCount(5))
    var styleKeywords: [String]
}
```
Tools needed: Optional ProductCatalogTool to canonicalise values against your taxonomy.
9. Health App — Symptom Log Structuring
Use case: User dictates how they're feeling → structured symptom entry for a health log.
Why FoundationModels: Privacy is non-negotiable. Health data is the most sensitive category — on-device is not a preference, it's a product and ethical requirement.
```swift
@Generable
enum BodyArea {
    case head, chest, abdomen, back, leftArm, rightArm, leftLeg, rightLeg, general
}

@Generable
struct SymptomEntry {
    @Guide(description: "Primary affected body area")
    var bodyArea: BodyArea
    @Guide(description: "Symptom description in normalised clinical language")
    var description: String
    @Guide(description: "Severity 1 (mild) to 10 (severe)", .range(1...10))
    var severity: Int
    @Guide(description: "Duration as spoken, e.g. 'since this morning', 'two days'")
    var duration: String
}
```
Tools needed: None. Session-per-call.
10. Customer Support — Ticket Triage
Use case: Classify incoming support tickets by category, urgency, and sentiment to route them to the right team.
Why FoundationModels: Classification with semantic understanding. A keyword-based classifier misroutes tickets with indirect language; the model understands context.
```swift
@Generable
enum TicketCategory {
    case billing, technicalSupport, accountAccess, featureRequest, complaint, other
}

@Generable
enum CustomerSentiment {
    case positive, neutral, frustrated, angry
}

@Generable
struct TicketClassification {
    @Guide(description: "Primary support category")
    var category: TicketCategory
    @Guide(description: "Urgency 1 (low) to 5 (escalate immediately)", .range(1...5))
    var urgency: Int
    @Guide(description: "Customer emotional tone")
    var sentiment: CustomerSentiment
    @Guide(description: "One-sentence routing note for the support agent")
    var routingNote: String
}
```
Tools needed: Optional KnowledgeBaseTool to check if similar issues have documented resolutions before routing.
Part 13: Quick Reference & Anti-Patterns
Quick Reference
Key Types
| Type | One-liner |
|---|---|
| `SystemLanguageModel` | Entry point — `SystemLanguageModel.default` |
| `SystemLanguageModel.Availability` | `.available` / `.unavailable(reason)` |
| `LanguageModelSession` | Manages one conversation thread; stateful |
| `Instructions` | System prompt — set once at session creation |
| `Prompt` | User input for a single turn |
| `Response<Content>` | Wrapper — always access `.content` |
| `ResponseStream<Content>` | `AsyncSequence` of `Snapshot<Content>` |
| `GenerationOptions` | `temperature`, `maximumResponseTokens`, sampling |
| `GenerationGuide<T>` | Constraints on `@Guide` properties |
| `Transcript` | Linear history of all session entries |
| `Tool` | Protocol for functions the model can call |
| `SystemLanguageModel.TokenUsage` | `.tokenCount` — cost of instructions/prompt/history |
Session Init Cheatsheet
```swift
// Fresh session, no tools
LanguageModelSession { "Instructions here" }

// With specific model
LanguageModelSession(model: SystemLanguageModel.default) { "..." }

// With tools
LanguageModelSession(tools: [MyTool()]) { "..." }

// Resume from saved transcript
LanguageModelSession(model: .default, tools: [], transcript: savedTranscript)
```
respond() vs streamResponse()
| | `respond()` | `streamResponse()` |
|---|---|---|
| Returns | `Response<Content>` | `ResponseStream<Content>` |
| Best for | Background processing, pipelines | Live UI with typing effect |
| Partial results | No | Yes (via `Snapshot<Content>`) |
| Rate limit risk | Lower | Higher in background tasks |
| Collect to full response | N/A | `.collect()` |
@Generable vs Raw String
Use @Generable when:
- You need structured, typed output (multiple fields)
- You want compile-time guarantees on output shape
- The response must be parsed/processed programmatically
- You need constraints (`@Guide`) on values
Use raw String when:
- Output is prose for display to the user
- You're summarising or generating a paragraph
- Streaming the output for a typing effect
Token Budget Formula
Total = instructions + tool definitions + transcript history + prompt + response
All compete for the same fixed window (~4,096 tokens). Response tokens come out of the same pool as input.
Tool vs Pre-Fetch vs Inject
| If you... | Do this |
|---|---|
| Always need the data | Pre-fetch, inject into instructions |
| Sometimes need the data | Define as Tool |
| Need data only when asked about it | Define as Tool |
| Have more than 5 tools | Split into multiple focused sessions |
Anti-Patterns
1. Accessing response instead of response.content
respond() returns Response<T>, not T. Always unwrap .content.
```swift
let text = try await session.respond(to: prompt)  // Response<String>, not String
text.uppercased()  // ❌ compile error

let text = try await session.respond(to: prompt).content  // ✅ String
```
2. Storing LanguageModelSession persistently when you don't need history
For stateless tasks (normalisation, extraction, classification), create a new session per call. Persistent sessions accumulate transcript and eventually hit the context limit.
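A minimal sketch of the session-per-call shape (the service name and instruction text are illustrative; the API usage follows the patterns shown earlier in this guide):

```swift
import FoundationModels

@available(iOS 26, *)
struct TermNormaliser {
    // No stored session: a long-lived session would accumulate transcript
    // across unrelated calls and eventually exceed the context window.
    func normalise(_ text: String) async throws -> String {
        // Fresh session per call; instructions are re-sent each time,
        // but for stateless tasks that cost is small and bounded.
        let session = LanguageModelSession { "Fix BJJ terminology in this transcript." }
        return try await session.respond(to: text).content
    }
}
```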
3. Defining too many tools
Each tool definition consumes ~50–100 tokens of context budget, whether used or not. Keep it to 3–5 tools per session. If you have 10 tools, split them across multiple focused sessions.
4. Calling isAvailable or checkAvailability() per-call
Availability checking has overhead and doesn't change mid-session. Check once at service/view init and cache the result.
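A sketch of check-once caching. The `AIGate` name is illustrative; `isAvailable` is the convenience used elsewhere in this guide:

```swift
import FoundationModels

@MainActor
final class AIGate {
    // Checked once at init and cached; availability does not change mid-session,
    // so re-checking before every call is wasted overhead.
    let aiEnabled: Bool

    init() {
        if #available(iOS 26, *) {
            aiEnabled = SystemLanguageModel.default.isAvailable
        } else {
            aiEnabled = false
        }
    }
}
```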
5. High temperature for structured/correction tasks
For @Generable types that correct or extract, leave temperature unset (the default) or use an explicit temperature of 0.0–0.2. High temperature produces creatively varied — but wrong — corrections.
6. Long, elaborate instructions modelled on frontier model prompts
On a ~3B parameter model, shorter is better. Instructions over ~200 words dilute signal. Explicit rules outperform discursive descriptions.
7. Not testing the fallback path
On most devices today, Apple Intelligence is unavailable. Your non-AI code path is the primary experience for the majority of users. Test it as thoroughly as the AI path.
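One way to keep the fallback honest is to put both paths behind a single protocol, so the same call sites and tests exercise either implementation. A sketch, where the protocol name and the tiny correction table are illustrative:

```swift
import Foundation

protocol TranscriptNormalising {
    func normalise(_ text: String) async -> String
}

// Deterministic fallback: on devices without Apple Intelligence,
// this is the primary experience, not an afterthought.
struct DictionaryNormaliser: TranscriptNormalising {
    let corrections = ["kimora": "Kimura", "arm bar": "armbar"]

    func normalise(_ text: String) async -> String {
        corrections.reduce(text) {
            $0.replacingOccurrences(of: $1.key, with: $1.value)
        }
    }
}
```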
8. Using FoundationModels where a regex or simple function would do
If the task is a known, fixed pattern (extract a UUID, validate an email, format a date), use a deterministic function. LLM overhead — latency, availability, complexity — is waste for these cases.
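For example, UUID extraction is a fixed pattern that Foundation handles deterministically, with zero latency and no availability check. A self-contained sketch:

```swift
import Foundation

// Deterministic, instant, always available: no model needed.
func extractUUIDs(from text: String) -> [UUID] {
    let pattern = #"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"#
    let regex = try! NSRegularExpression(pattern: pattern)
    let range = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: range).compactMap { match in
        Range(match.range, in: text).flatMap { UUID(uuidString: String(text[$0])) }
    }
}
```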
9. Propagating @available(iOS 26, *) to SwiftUI views
Adding @available to a @State property forces the whole view to require iOS 26. Use the AnyObject? pattern instead and cast inside #available guards.
10. Treating .modelNotReady as permanent
.modelNotReady means the model is downloading. It's transient. Show "not available right now" UI and retry later. Do not show a permanent "unsupported" state for this case.
Part 14: Context Engineering for On-Device AI
The context window is the most important constraint in FoundationModels. Everything else — prompt engineering, temperature, tool design — happens within it. Understanding how to engineer what goes into that window is the difference between a feature that works reliably and one that fails silently on complex inputs.
The Fundamental Constraint
The on-device model has a fixed context window of approximately 4,096 tokens shared across:
instructions + tool definitions + transcript history + current prompt + response
This is roughly 3,000 words (about 12 pages) of total input and output. That sounds like a lot until you try to inject meaningful app data.
A BJJ training app with 116 positions, each with a 200-word description: ~30,000 tokens — 7x the entire context window. Injecting "all your app data" into instructions is not a strategy; it's a crash waiting to happen.
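A rough pre-flight estimate makes the arithmetic concrete. The ~1.3 tokens-per-word ratio is this guide's own rule of thumb, not the model's actual tokeniser (which is not public), so treat the numbers as order-of-magnitude:

```swift
// Heuristic only: ~1.3 tokens per English word.
func estimatedTokens(wordCount: Int) -> Int {
    Int(Double(wordCount) * 1.3)
}

let perPosition = estimatedTokens(wordCount: 200)  // 260 tokens per description
let total = perPosition * 116                      // 30,160 — the ~30,000 above
let windowMultiple = Double(total) / 4_096         // ≈ 7.4× the window
```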
What Breaks First
When you over-fill the context window you get GenerationError.exceededContextWindowSize. But the model also silently degrades before it throws — a model given 3,500 tokens of input in a 4,096 window has only 596 tokens for its response. For most tasks that's enough. For others it's not — and the failure mode is truncation, not an error.
Common over-injection mistakes:
| Data | Tokens (approx) | Problem |
|---|---|---|
| All SwiftData records (100+ items) | 10,000–50,000 | Massively exceeds window |
| Full JSON blob of one complex entity | 500–2,000 | May leave little room for response |
| Entire app configuration/preferences | 200–800 | Unnecessary; most not relevant |
| Complete conversation history (100 turns) | 2,000–5,000 | Pushes out current prompt |
Pattern 1: Select, Don't Dump
The simplest and most impactful change: fetch only what's relevant to the current request.
```swift
// ❌ Dumps all 116 positions into context — will throw
let allPositions = try await queryService.fetchAllPositions()
let session = LanguageModelSession {
    "Here are all BJJ positions: \(allPositions.map(\.description).joined(separator: "\n"))"
}

// ✅ Fetches only positions relevant to the current question
let relevantPositions = try await queryService.fetchPositions(
    matching: userQuery,
    limit: 5  // 5 positions × ~200 tokens = ~1,000 tokens — fits comfortably
)
let session = LanguageModelSession {
    "Relevant positions: \(relevantPositions.map(\.summary).joined(separator: "\n"))"
}
```
Use SwiftData predicates and fetchLimit to constrain what you load before it reaches the context.
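A sketch of the fetch-side constraint, assuming a `Position` @Model with a `name` property and a keyword already derived from the user query (both names are illustrative):

```swift
import SwiftData

func fetchRelevantPositions(matching keyword: String,
                            in context: ModelContext) throws -> [Position] {
    var descriptor = FetchDescriptor<Position>(
        predicate: #Predicate { $0.name.localizedStandardContains(keyword) }
    )
    descriptor.fetchLimit = 5  // bound the data before it ever reaches the context window
    return try context.fetch(descriptor)
}
```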
Pattern 2: Layered Injection
Inject summaries at the top level, with detail available on-demand via tools. The model sees the overview by default and only loads detail when it actually needs it:
```swift
// Layer 1 — always injected: position names only (~50 tokens for 116 positions)
let positionNames = positions.map(\.name).joined(separator: ", ")

// Layer 2 — injected only when needed via tool
struct PositionDetailTool: Tool {
    let name = "getPositionDetail"
    let description = "Returns full description and transitions for a named BJJ position."

    @Generable
    struct Arguments {
        var positionName: String
    }

    func call(arguments: Arguments) async -> String {
        // Fetch the full detail only when the model asks for it
        return await loadPositionDetail(arguments.positionName)
    }
}

let session = LanguageModelSession(tools: [PositionDetailTool()]) {
    "Available positions: \(positionNames)"
    "Use getPositionDetail to look up full information about any position."
}
```
This keeps the base context lean (~50 tokens for names vs 30,000 for all descriptions) while still giving the model access to full detail on demand.
Pattern 3: The Two-Step Compression Pipeline
For tasks that require reasoning over large datasets, compress first, then reason. This only makes sense on-device — with a cloud API you pay per token on both calls and gain nothing. On-device both calls are free and private:
```swift
// Step 1: Summarise the large dataset (fresh session, large input is fine)
func summariseTrainingHistory(_ sessions: [TrainingSession]) async throws -> String {
    let session = LanguageModelSession {
        "Summarise this training history in 150 words, highlighting patterns and progress."
    }
    let fullHistory = sessions.map(\.description).joined(separator: "\n\n")
    // fullHistory might be 5,000 tokens — fills most of the window, but that's fine
    // The output is ~150 tokens
    return try await session.respond(to: fullHistory).content
}

// Step 2: Reason with the summary (fresh context, compact input)
func answerWithHistory(question: String, summary: String) async throws -> String {
    let session = LanguageModelSession {
        "Training history summary: \(summary)"  // ~150 tokens
        "Answer questions about training progress based on this summary."
    }
    // Plenty of context headroom for question + answer
    return try await session.respond(to: question).content
}

// Usage
let summary = try await summariseTrainingHistory(recentSessions)
let answer = try await answerWithHistory(question: userQuestion, summary: summary)
```
The summary call uses most of its window for the raw data and produces a compact output. The reasoning call has clean context with just the summary. Each call is focused on a single task.
Pattern 4: Pre-Summarise at Write Time
For persistent app data (SwiftData entities), generate summaries when the data is saved and store them alongside the entity. The summary is computed once and reused for every future AI interaction:
```swift
@Model
final class TrainingSession {
    var rawNotes: String = ""
    var date: Date = Date()
    var techniques: [String] = []

    // Pre-generated — computed at save time, reused in every AI call
    var aiSummary: String = ""
}

// When saving a session
func saveSession(_ session: TrainingSession) async {
    // Generate summary once at write time
    if #available(iOS 26, *) {
        let model = LanguageModelSession {
            "Summarise this BJJ training session in 50 words."
        }
        let summary = try? await model.respond(
            to: session.rawNotes + "\nTechniques: \(session.techniques.joined(separator: ", "))"
        ).content
        session.aiSummary = summary ?? ""
    }
    modelContext.insert(session)
    try? modelContext.save()
}

// At query time — inject pre-built summaries, not raw notes
func buildSessionContext(recentSessions: [TrainingSession]) -> String {
    // Each summary: ~50 tokens × 10 sessions = 500 tokens — fits comfortably
    recentSessions
        .map { "[\($0.date.formatted())]: \($0.aiSummary)" }
        .joined(separator: "\n")
}
```
Pre-summarisation at write time means:
- Zero AI cost at query time — the summary is already there
- The context load is predictable and bounded by summary length
- The summary can be updated when the entity changes
Dataset Size Reference
| Content | Volume | Approx. Tokens | Fits in Context? |
|---|---|---|---|
| Single entity description | 1 | 200–500 | ✅ Yes |
| Entity names list | 100 | ~150 | ✅ Yes |
| Short entity summaries | 10 | ~500 | ✅ Yes |
| Short entity summaries | 50 | ~2,500 | ⚠️ Tight |
| Full entity descriptions | 10 | ~2,000 | ⚠️ Tight |
| Full entity descriptions | 50+ | 10,000+ | ❌ No |
| Full entity descriptions | 100+ | 20,000+ | ❌ No |
| Conversation (10 turns) | — | ~1,000 | ✅ Yes |
| Conversation (50 turns) | — | ~5,000 | ❌ No |
Decision Tree
```
Do you need to inject app data into context?
│
├── Yes → How much data?
│   │
│   ├── 1–5 entities, full detail
│   │   └── Inject directly into instructions
│   │
│   ├── 5–20 entities
│   │   ├── Always need all of them? → Inject summaries (pre-generated at write time)
│   │   └── Only need some? → Names in instructions + detail via Tool
│   │
│   └── 20+ entities
│       ├── Need to reason across all of them? → Two-step: summarise first, then reason
│       └── Need specific ones? → Select with predicate, inject summaries for matched
│
└── No → Standard session-per-call, no data injection needed
```
On-Device vs Cloud: Why This Pattern Is Different
With cloud APIs (OpenAI, Anthropic, Google), the two-step pattern is often not worth it: you pay per token on both calls, and the total cost may be similar to one call with the full data — especially if the summarisation model is also expensive.
On-device, the economics flip:
- No per-token cost — both calls are free
- No network latency — both calls run locally, typically in under a second each
- No privacy concern — data never leaves the device regardless of call count
- Shared resource — each call consumes system resources and may be rate-limited, so compact contexts are still preferred
This makes on-device AI uniquely suited to multi-step pipelines where cloud would be prohibitively expensive or slow.
Part 15: Advanced Patterns
This section covers patterns that don't fit neatly into any earlier part — actor isolation details, the non-obvious syntax for @Generable enums with associated values, reactive availability monitoring in SwiftUI, chaining model output back as prompt input via PromptRepresentable, and the bounded domain injection pattern for apps with curated entity datasets.
Actor Isolation and call(arguments:) — What Actor Does Your Code Run On?
Understanding actor isolation in FoundationModels matters when your tool or service touches @MainActor-bound state.
Tool.call(arguments:) Is @concurrent
The call(arguments:) method on the Tool protocol is implicitly @concurrent, which means it runs off the main actor — in a generic concurrent executor, not @MainActor. This is deliberate: the model calls your tool during inference, which itself is off the main actor. Calling back to the main actor mid-inference would require a hop, adding latency.
```swift
@available(iOS 26, *)
struct TrainingHistoryTool: Tool {
    let name = "getTrainingHistory"
    let description = "Returns recent training sessions."

    @Generable
    struct Arguments {
        var limit: Int
    }

    // This runs @concurrent — NOT on @MainActor
    func call(arguments: Arguments) async -> String {
        // ✅ Pure computation or actor-independent async work is fine here
        let sessions = await fetchSessions(limit: arguments.limit)
        return sessions.map(\.summary).joined(separator: "\n")

        // ❌ Accessing @MainActor-bound state directly will cause a data race warning
        // return self.someMainActorProperty  // won't compile
    }
}
```
If your tool genuinely needs main-actor state (e.g., reading from a @MainActor service), hop explicitly:
```swift
func call(arguments: Arguments) async -> String {
    // Hop to MainActor to read the value, then hop back
    let data = await MainActor.run { myMainActorService.currentData }
    return process(data)
}
```
What Actor Does respond() Run On?
LanguageModelSession.respond() is async but has no actor isolation requirement — it is safe to call from any actor context, including @MainActor. Internally, the framework dispatches inference to a background executor automatically.
```swift
// ✅ Calling respond() from @MainActor is fine — the framework handles the dispatch
@MainActor
final class NormalisationService {
    func normalise(_ text: String) async -> String {
        let session = LanguageModelSession { "Fix BJJ terms." }
        // respond() is async but not @MainActor — call is fine from here
        let response = try? await session.respond(to: Prompt { text })
        return response?.content ?? text
    }
}
```
You do not need to manually Task.detach or use Task { @concurrent in ... } before calling respond(). The framework does the right thing automatically.
@MainActor Services Calling Tools — The Safe Pattern
When a @MainActor service needs tools that access non-@MainActor data, the cleanest pattern is to make the tool capture any main-actor dependencies at session creation time (before inference begins), rather than accessing them from within call():
```swift
@MainActor
final class CoachingService {
    private let userProfile: UserProfile  // @MainActor bound

    func answer(_ question: String) async -> String {
        // Capture the profile value NOW, on MainActor, before the session runs
        let profileSummary = userProfile.summary  // safe — we're on MainActor

        // The tool closes over the already-captured value — no actor hop needed in call()
        struct ProfileContextTool: Tool {
            let name = "getUserProfile"
            let description = "Returns the user's training profile."

            @Generable
            struct Arguments {}

            let summary: String  // captured at creation time

            func call(arguments: Arguments) async -> String { summary }
        }

        let session = LanguageModelSession(tools: [ProfileContextTool(summary: profileSummary)]) {
            "Answer BJJ coaching questions using the user's profile."
        }
        return (try? await session.respond(to: question).content) ?? ""
    }
}
```
This is simpler than hopping to MainActor inside call() and avoids any potential race conditions.
@Generable Enums With Associated Values
The earlier enum examples in Part 4 showed simple case enums (.positive, .neutral, .negative). @Generable also supports enums with associated values — but the syntax has a specific constraint: all associated values must themselves conform to Generable (or be types that @Generable already knows how to handle: String, Int, Double, Bool, arrays of generable types).
Basic Associated Value Enum
```swift
@available(iOS 26, *)
@Generable
enum TranscriptCorrection {
    case termCorrection(original: String, corrected: String)
    case spellingFix(original: String, corrected: String)
    case noChange
}

@Generable
struct AnnotatedTranscript {
    @Guide(description: "The corrected transcript text")
    var correctedText: String
    @Guide(description: "Each correction made, with original and corrected forms")
    var corrections: [TranscriptCorrection]
}
```
The model generates each corrections element as a tagged union — it chooses the case name and then generates the associated values. This is significantly richer than a flat string array for corrections, because the output is fully typed.
Nested @Generable Structs as Associated Values
Associated values can also be @Generable structs:
```swift
@available(iOS 26, *)
@Generable
struct DateRange {
    @Guide(description: "Start date in YYYY-MM-DD format")
    var start: String
    @Guide(description: "End date in YYYY-MM-DD format")
    var end: String
}

@Generable
enum ScheduleIntent {
    case singleDay(date: String)
    case dateRange(range: DateRange)
    case recurring(dayOfWeek: String, startTime: String)
    case unspecified
}

@Generable
struct ParsedScheduleRequest {
    @Guide(description: "What the user wants to schedule")
    var activity: String
    @Guide(description: "When the user wants to schedule it")
    var timing: ScheduleIntent
}
```
When to Use Associated Value Enums vs Flat Structs
Use associated value enums when the output shape is fundamentally discriminated — the presence of one field makes others meaningless. In the ScheduleIntent example above, if the user said "every Monday at 9am", the .recurring case makes date and range meaningless, and a flat struct would leave those fields awkwardly nil.
Use flat @Generable structs with optional properties when most combinations of values are valid. The associated value enum excels when the cases are truly mutually exclusive and each has distinct associated data.
The Constraint: All Associated Values Must Be Generable
If you include a type that is not Generable-conformant as an associated value, the @Generable macro will emit a compile-time error. The fix is always one of:
- Add `@Generable` to the associated type
- Change the associated type to a primitive (`String`, `Int`, etc.)
- Represent it as a separate `@Generable` struct with its own properties
Observable Availability Monitoring — Reactive SwiftUI Pattern
SystemLanguageModel is an Observable final class. This means SwiftUI views can react to .availability changes without any additional wiring — the view re-renders automatically when availability changes.
This is useful when you want to show/hide AI features reactively, for example when the model finishes downloading (.modelNotReady → .available) while the user is already in the app.
Basic Reactive Availability View
```swift
@available(iOS 26, *)
struct AIFeatureBadge: View {
    var body: some View {
        // SwiftUI observes SystemLanguageModel.default automatically
        // because it's @Observable — no @StateObject, no manual subscription
        let model = SystemLanguageModel.default

        switch model.availability {
        case .available:
            Label("AI Ready", systemImage: "sparkles")
                .foregroundStyle(.green)
        case .unavailable(.modelNotReady):
            Label("AI Downloading...", systemImage: "arrow.down.circle")
                .foregroundStyle(.yellow)
        case .unavailable(.appleIntelligenceNotEnabled):
            Label("Enable Apple Intelligence", systemImage: "exclamationmark.circle")
                .foregroundStyle(.secondary)
        case .unavailable(.deviceNotEligible):
            EmptyView()  // Don't surface this — it's permanent
        @unknown default:
            EmptyView()
        }
    }
}
```
Because SystemLanguageModel is @Observable, SwiftUI tracks which properties the body reads and re-renders when they change. No .onReceive, no Combine, no explicit observation setup.
Watching for the Model Becoming Ready
The .task {} modifier is the right tool for reacting to an availability change and triggering a one-time action — for example, kicking off an initial data enrichment pass once the model becomes available:
```swift
@available(iOS 26, *)
struct TrainingDashboardView: View {
    @State private var hasRunInitialEnrichment = false

    var body: some View {
        // ... view content ...
        .task {
            // This task runs when the view appears and re-runs if availability changes
            for await _ in SystemLanguageModel.default.availabilityUpdates {
                guard !hasRunInitialEnrichment else { break }
                if SystemLanguageModel.default.isAvailable {
                    await runInitialEnrichment()
                    hasRunInitialEnrichment = true
                }
            }
        }
    }

    private func runInitialEnrichment() async {
        // Generate AI summaries for any entities that don't have them yet
    }
}
```
Note: If `availabilityUpdates` is not available on your OS target, use `.task(id: SystemLanguageModel.default.availability)` as an alternative — the task re-runs when `availability` changes since `Availability` is `Equatable`:
```swift
.task(id: SystemLanguageModel.default.availability) {
    guard SystemLanguageModel.default.isAvailable else { return }
    guard !hasRunInitialEnrichment else { return }
    await runInitialEnrichment()
    hasRunInitialEnrichment = true
}
```
Avoiding the Per-View @available Constraint
The reactive pattern works cleanly with the AnyObject? wrapping approach from Part 1. Keep the Observable observation inside a #available check, or confine it to a view that is itself conditionally shown:
```swift
// In the parent view (no iOS 26 requirement):
var body: some View {
    VStack {
        mainContent

        if #available(iOS 26, *) {
            AIFeatureBadge()  // only this view requires iOS 26
        }
    }
}
```
This way the availability-reactive logic is isolated to a specific subview, and the containing view has no version constraint.
PromptRepresentable — Chaining Model Output Back as Input
One of the cleaner architectural patterns enabled by the protocol hierarchy is output-as-input chaining: taking a @Generable type from one call and passing it directly as prompt input to the next call, without any serialisation step.
This works because @Generable types conform to PromptRepresentable (via ConvertibleToGeneratedContent), which means they can appear directly in a @PromptBuilder closure.
Basic Chaining Example
```swift
@available(iOS 26, *)
@Generable
struct NormalisedTranscript {
    @Guide(description: "Corrected transcript text")
    var normalisedText: String
    @Guide(description: "BJJ terms found, in canonical form")
    var extractedTerms: [String]
}

@Generable
struct SessionSummary {
    @Guide(description: "One-paragraph summary of the training session")
    var summary: String
    @Guide(description: "Techniques practiced, from the corrected terms")
    var techniquesWorked: [String]
}

// Two-step pipeline: correct → summarise
func processTranscript(_ raw: String) async throws -> SessionSummary {
    // Step 1: Correct BJJ terminology
    let correctionSession = LanguageModelSession {
        "Fix speech-to-text errors in BJJ transcripts. Return corrected text and term list."
    }
    let corrected = try await correctionSession.respond(
        to: Prompt { raw },
        generating: NormalisedTranscript.self
    )

    // Step 2: Summarise — pass the @Generable output directly as prompt input
    // No JSON encoding, no manual string building needed
    let summarySession = LanguageModelSession {
        "Summarise a BJJ training session given a corrected transcript."
    }
    let summary = try await summarySession.respond(
        to: Prompt {
            "Transcript: \(corrected.content)"  // NormalisedTranscript directly in @PromptBuilder
        },
        generating: SessionSummary.self
    )
    return summary.content
}
```
The \(corrected.content) interpolation works because NormalisedTranscript (a @Generable struct) conforms to PromptRepresentable. The framework serialises it appropriately for the model — you never touch the intermediate representation.
When Chaining Is Worth It
The chain pattern is most valuable when:
- Output type 1 contains richer structure than a plain string — passing the full `NormalisedTranscript` (with both `normalisedText` and `extractedTerms`) to the next session gives the model more signal than a plain corrected string
- Each step is a focused, single-task session — staying true to the "one task per session" principle (Part 3) while getting compound results
- You want typed output at every step — rather than a single sprawling `@Generable` struct trying to do everything, each step produces its own clean type
Avoid chaining when the first step's output is a plain String — in that case, just use string interpolation normally. The PromptRepresentable chaining is most valuable for multi-property structured output.
Bounded Domain Injection — The Names-Only Pattern
This is a specialised context engineering pattern for apps that have a fixed, curated, known domain — a set of entities whose names are meaningful and bounded. The insight is that entity names alone are remarkably compact while still giving the model strong domain grounding.
The Core Insight
In Grapla, there are 116 BJJ positions, 150 techniques, 118 submissions, and 141 movements — 525 total entities. Injecting all the descriptions for all 525 entities would require tens of thousands of tokens and overflow the context window many times over.
But injecting just the names is cheap:
```
Mount, Half Guard, Side Control, Back Mount, Turtle, North-South, Closed Guard,
Open Guard, De La Riva, X-Guard, Butterfly Guard, Single Leg X, ...
Kimura, Armbar, Triangle, Rear Naked Choke, D'Arce, Anaconda, Omoplata, ...
Hip Bump Sweep, Flower Sweep, Scissor Sweep, Pendulum Sweep, ...
```
A full list of ~525 entity names in CSV format uses approximately 700–900 tokens — well within a 4,096-token window, leaving ample room for instructions, prompt, and response.
Why Names Alone Are Sufficient for Correction Tasks
For a transcript correction service, the model's job is:
- Recognise that "kimora" is a garbled version of a known entity
- Replace it with the canonical form "Kimura"
The model doesn't need the description of a Kimura to know that "kimora" should be "Kimura". The name list acts as a canonical term index — the model can fuzzy-match against it and apply corrections.
```swift
@available(iOS 26, *)
struct BJJEntityNames {
    // Pre-built at app startup from the SwiftData store — reused for every normalisation call
    static let positions = [
        "Mount", "Half Guard", "Side Control", "Back Mount", "Turtle", "North-South",
        "Closed Guard", "Open Guard", "De La Riva", "X-Guard", "Butterfly Guard",
        "Single Leg X", "Full Guard", "Rubber Guard",
        // ... all 116 positions
    ]
    static let techniques: [String] = []   // all 150
    static let submissions: [String] = []  // all 118
    static let movements: [String] = []    // all 141

    static var allAsCSV: String {
        (positions + techniques + submissions + movements).joined(separator: ", ")
    }
}

@available(iOS 26, *)
final class TranscriptNormalisationService {
    func normalise(_ rawTranscript: String) async -> String {
        let entityNames = BJJEntityNames.allAsCSV  // ~700 tokens

        let session = LanguageModelSession {
            "Fix speech-to-text errors in BJJ training transcripts."
            "Canonical entity names: \(entityNames)"
            "Correct misrecognised terms to their canonical forms. Return only the corrected text."
        }
        // Total instructions: ~750 tokens — leaves ~3,300 tokens for prompt + response
        let response = try? await session.respond(to: Prompt { rawTranscript })
        return response?.content ?? rawTranscript
    }
}
```
Generalising the Pattern
The bounded domain pattern works whenever your app has a finite, knowable set of canonical terms. Some examples:
| App | Bounded Domain | Names-Only Size |
|---|---|---|
| BJJ app | 525 positions/techniques/submissions/movements | ~700 tokens |
| Recipe app | 500 common ingredients | ~600 tokens |
| Medical notes | 300 ICD-10 conditions (common subset) | ~400 tokens |
| Developer tool | 200 API method names | ~250 tokens |
| Music app | 400 instruments + musical terms | ~500 tokens |
The test for whether this pattern applies: Can you enumerate all the canonical terms your app cares about? If yes, inject the names list. The model will use it as a correction index without needing any descriptions.
Names-Only vs Names + Detail
Combine with the Layered Injection pattern (Part 14) when you sometimes need both correction and reasoning about entities:
```swift
let session = LanguageModelSession(tools: [PositionDetailTool()]) {
    // Layer 1: names always present (~700 tokens) — enables correction
    "Canonical BJJ entities: \(BJJEntityNames.allAsCSV)"
    // Layer 2: detail available on demand via tool — enables reasoning
    "Use getPositionDetail to look up descriptions, transitions, and techniques for any position."
}
```
This gives the model correction capability (names) plus on-demand depth (tool) while keeping the base context compact.
Experimental Directions
These patterns are worth exploring but untested at scale. They use only FoundationModels — no additional frameworks required.
Sharded parallel sessions. When your vocabulary corpus is too large for a single context but you need full coverage, split it across multiple sessions running concurrently. Each session holds a different shard of the names list. After all sessions return, merge results — prefer any correction over "unchanged", break ties by confidence or frequency. The on-device model's free-per-call economics make this viable in a way that would be expensive with a cloud API.
```swift
async let positions = normalise(rawText, vocabulary: BJJEntityNames.positions)
async let techniques = normalise(rawText, vocabulary: BJJEntityNames.techniques)
async let submissions = normalise(rawText, vocabulary: BJJEntityNames.submissions)

let (p, t, s) = try await (positions, techniques, submissions)
let merged = merge(p, t, s) // your logic for combining corrections
```
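One way to sketch the merge step described above — prefer any shard's correction over the unchanged raw word. This is a hypothetical `merge` implementation; it assumes each session returns the transcript with word count preserved, which the model does not guarantee, so a production version would need alignment logic:

```swift
// Word-level merge: for each position in the raw transcript, take the first
// shard output that changed the word; otherwise keep the original.
func merge(raw: String, variants: [String]) -> String {
    let rawWords = raw.split(separator: " ")
    let variantWords = variants.map { $0.split(separator: " ") }
    var out: [Substring] = []
    for (i, word) in rawWords.enumerated() {
        let corrected = variantWords
            .compactMap { $0.indices.contains(i) ? $0[i] : nil }
            .first { $0 != word }
        out.append(corrected ?? word)
    }
    return out.joined(separator: " ")
}
```

Frequency- or confidence-based tie-breaking would slot in where the first differing variant currently wins.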
Adaptive context budgeting. Before injecting data, measure how much headroom you have with tokenUsage(for:), then fill to a target percentage (e.g. 60% of the window, reserving 40% for prompt + response). Rank your entities by relevance and inject greedily until you hit the budget. This turns context injection from a static decision into a runtime one.
```swift
let instrTokens = try await model.tokenUsage(for: instructions).tokenCount
let window = await model.contextSize
let budget = Int(Double(window) * 0.6) - instrTokens // 60% target, minus instructions

var injected: [String] = []
var used = 0
for entity in rankedEntities {
    let cost = estimateTokens(entity.name) // ~1.3 tokens per word
    guard used + cost <= budget else { break }
    injected.append(entity.name)
    used += cost
}
```
Transcript as structured cache. Rather than rehydrating a conversation, use a saved Transcript as a compressed knowledge cache — pre-generate a transcript that contains a curated Q&A exchange about your domain (e.g. "what is a Kimura?" → model's answer), then resume from that transcript for every live session. The model starts with pre-baked domain knowledge already in its context, without spending live call tokens to establish it.
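A minimal sketch of the transcript-cache idea, assuming `LanguageModelSession` exposes its `transcript` and can be initialised from one (the curated questions here are illustrative):

```swift
import FoundationModels

// One-time priming step: run a curated Q&A once and keep the transcript.
@available(iOS 26, *)
func buildPrimedTranscript() async throws -> Transcript {
    let session = LanguageModelSession {
        "You are a BJJ training assistant."
    }
    // Curated exchange that bakes domain knowledge into the context.
    _ = try await session.respond(to: "What is a Kimura?")
    _ = try await session.respond(to: "Which positions transition into Mount?")
    return session.transcript
}

// Every live session resumes from the cached transcript, so the model
// starts with the Q&A already in context — no live tokens spent on it.
@available(iOS 26, *)
func makeLiveSession(resuming transcript: Transcript) -> LanguageModelSession {
    LanguageModelSession(transcript: transcript)
}
```

The priming step could run at first launch, or offline in a build-time tool, with the transcript persisted and shipped with the app.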
All three patterns are speculative — they depend on how the model handles parallel resource contention, whether adaptive sizing materially improves output quality, and whether transcript rehydration preserves semantic coherence. The #Playground macro is the fastest way to validate any of them before committing to an implementation.
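A validation harness for any of these patterns can be as small as a single `#Playground` block (assuming the macro's `import Playgrounds` form; the sample prompt is illustrative):

```swift
import FoundationModels
import Playgrounds

#Playground {
    // Quick check: does a names-only shard correct misrecognised terms?
    let session = LanguageModelSession {
        "Canonical BJJ positions: \(BJJEntityNames.positions.joined(separator: ", "))"
        "Correct misrecognised terms to their canonical forms. Return only the corrected text."
    }
    let response = try await session.respond(
        to: "I got swept from half gard into side controll"
    )
    print(response.content)
}
```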
Resources
Official Apple Documentation
WWDC 2025 Sessions:
- Session 286: Meet the Foundation Models framework
- Session 301: Deep dive into the Foundation Models framework
- Session 259: Code-along: Bring on-device AI to your app using the Foundation Models framework
Framework Updates:
- February 2026: Improved instruction-following, `tokenUsage(for:)`, `contextSize`, `#Playground` macro
Key Types at a Glance
| Type | Purpose |
|---|---|
| `SystemLanguageModel` | Entry point — access the model, check availability |
| `LanguageModelSession` | Manages a single conversation thread with the model |
| `Instructions` | System-level behaviour definition for a session |
| `Prompt` | User input to the model |
| `Response<Content>` | Wrapper around typed model output — use `.content` |
| `ResponseStream<Content>` | Async sequence of partial responses for streaming |
| `GenerationOptions` | Controls temperature, sampling, max tokens |
| `GenerationGuide<T>` | Constraint on `@Guide` properties (min/max/regex) |
| `GeneratedContent` | Untyped structured output — escape hatch |
| `Transcript` | Linear history of a multi-turn session |
| `Tool` | Protocol for functions the model can call during generation |
| `SystemLanguageModel.TokenUsage` | Token count for a prompt, instructions, or transcript |