# Fixing domain-specific speech recognition with FoundationModels

Speech-to-text fails predictably on niche vocabulary. On-device AI makes a clean fix — and the extractedTerms output is more useful than it first appears.

category: Engineering
date: 2026-03-02
reading-time: 8 min read
excerpt: Any app with voice input and niche vocabulary has this problem. Here's a clean pattern using Apple's on-device FoundationModels to silently correct domain terms before they reach your data layer — and how to pipe the extracted terms into entity matching downstream.

---

## The problem

Speech-to-text is impressively good — until you step outside everyday vocabulary. If your app lives in a niche domain, you've probably already seen it. Medical apps mishear procedure names. Climbing apps mishear route grades. Legal apps mishear case terminology. The speech model simply hasn't seen these words enough to transcribe them reliably, and the transcription errors are consistent — the same wrong words come out every time.

In [Grapla](https://grapla.app), a Brazilian Jiu-Jitsu training app, the vocabulary is dense with borrowed Portuguese, compound phrases, and proper nouns that the speech model mangles predictably: "kimura" becomes "kimora", "half guard" becomes "half card", "omoplata" becomes "omma plata". Stored verbatim, these transcripts break search, entity extraction, and anything downstream that expects canonical terms.

The naive fix is a substitution dictionary. Map "kimora" to "Kimura", "half card" to "Half Guard". This works until your dictionary has 200 entries and still misses half the variants the speech model produces.

A better fix: run the raw transcript through the on-device language model and let it understand and correct domain terminology in context.
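For reference, the dictionary approach is only a few lines. This is a sketch with illustrative mappings; `naiveCorrect` is a hypothetical helper, not code from the app:

```swift
import Foundation

// A substitution dictionary: every misrecognition variant needs its own entry.
let corrections: [String: String] = [
    "kimora": "Kimura",
    "half card": "Half Guard",
    "omma plata": "Omoplata"
]

func naiveCorrect(_ transcript: String) -> String {
    corrections.reduce(transcript) { text, pair in
        text.replacingOccurrences(of: pair.key, with: pair.value, options: .caseInsensitive)
    }
}
```

It handles exactly the variants it knows about; anything unseen ("kim mora", "kimorah") passes through silently, which is why the entry count keeps growing.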
---

## The pipeline

Three stages, left to right:

```
mic → SpeechAnalyzer + vocabulary hints
    → raw transcript ("worked my kimora from half card")
    → LanguageModelSession + system prompt with corrections
    → NormalisedTranscript { normalisedText, extractedTerms }
    → entity matching / storage
```

The middle stage is the interesting one. The system prompt is short — under 200 words, as recommended for on-device models — and contains two things: explicit correction pairs for the most common misrecognitions, and a vocabulary list of canonical domain terms. Everything else is inferred by the model.

The structured output carries two fields. Most examples focus on `normalisedText`. Don't discard `extractedTerms` — it's what makes the downstream pipeline precise.

---

## Four patterns worth keeping

### 1. `@Generable` for structured output

The output of a correction task isn't free-form prose — it's a fixed shape. `@Generable` makes this a compile-time guarantee rather than a parsing problem:

```swift
@available(iOS 26, *)
@Generable(description: "A normalised training transcript with corrected terminology")
struct NormalisedTranscript: Sendable, Equatable {
    @Guide(description: "Full transcript with domain terms corrected and properly cased")
    var normalisedText: String

    @Guide(description: "Domain terms found in the transcript, each in canonical form")
    var extractedTerms: [String]
}
```

The session then generates exactly this shape — no JSON parsing, no regex on the response, no prompt engineering for output format:

```swift
let session = LanguageModelSession(instructions: systemPrompt)
let response = try await session.respond(
    to: "Correct this transcript:\n\n\(rawText)",
    generating: NormalisedTranscript.self
)
let corrected = response.content.normalisedText
let terms = response.content.extractedTerms  // ["Kimura", "Half Guard"]
```

`extractedTerms` is produced as a side effect of the correction task. Don't throw it away — the next section explains why.

---

### 2. Design the degraded path first

Apple Intelligence isn't available on most devices today. The model may be downloading, disabled in Settings, or the hardware simply doesn't support it. A correction service that throws on unavailability pushes error handling into every caller.

The better design: `normalise()` always returns something usable. On unavailability, it returns the raw transcript unchanged. Callers never handle errors — they always get a result back, either corrected or not.

```swift
func normalise(
    _ rawTranscript: String,
    entityNames: [String] = []
) async -> NormalisedTranscript {
    let trimmed = rawTranscript.trimmingCharacters(in: .whitespacesAndNewlines)
    guard !trimmed.isEmpty else {
        return NormalisedTranscript(normalisedText: "", extractedTerms: [])
    }
    guard case .available = checkAvailability() else {
        // Raw transcript returned unchanged — callers are unaffected
        return NormalisedTranscript(normalisedText: trimmed, extractedTerms: [])
    }
    do {
        let session = LanguageModelSession(
            instructions: Self.sessionInstructions(entityNames: entityNames)
        )
        let result = try await session.respond(to: trimmed, generating: NormalisedTranscript.self)
        return result.content
    } catch {
        return NormalisedTranscript(normalisedText: trimmed, extractedTerms: [])
    }
}
```

The uncorrected path is the primary experience for most users right now. Worth testing it as carefully as the AI path.

---

### 3. Vocabulary hints at two layers

Domain correction can happen at two points in this pipeline — and using both makes them complementary.

**At the speech recogniser:** iOS 26's `SpeechAnalyzer` accepts `contextualStrings`, a list of terms to bias the STT model toward. Passing your domain vocabulary here makes the raw transcript cleaner before it reaches the language model.
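The same hook exists on the long-standing `SFSpeechRecognizer` stack, which matters for the pre-iOS 26 fallback path. A minimal sketch, assuming `BJJVocabularyHints.all` (the article's vocabulary type) as the source list:

```swift
import Speech

// Bias the legacy recogniser toward domain vocabulary.
// contextualStrings is an SFSpeechRecognitionRequest property (iOS 10+).
let request = SFSpeechAudioBufferRecognitionRequest()
request.contextualStrings = BJJVocabularyHints.all
request.requiresOnDeviceRecognition = true  // keep audio on device (iOS 13+)
```

Either way, the hint is just the string list you already maintain for the system prompt.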
**At the language model:** The system prompt includes the same vocabulary list, plus explicit correction pairs for the most common failures:

```
Known terms: [your domain vocabulary list]
Common corrections: kimora→Kimura, half card→Half Guard, arm bar→Armbar...
```

Two cheap injections. The first pass reduces the noise; the second corrects what slipped through. Neither is expensive — both are strings you already have.

---

### 4. Entity injection: precision over recall

The `extractedTerms` from Pattern 1 are useful but imprecise by default. The model knows the domain vocabulary and might return "Kimura", "Armbar", or "Half Guard" — but "Kimura" in your database has a specific UUID and canonical casing you care about.

The fix is to inject your known entity names into the prompt and instruct the model to constrain its output to exact matches:

```swift
private static func sessionInstructions(entityNames: [String] = []) -> String {
    var base = """
        You are a BJJ transcript corrector. Fix misrecognised terms...
        Known BJJ terms: \(BJJVocabularyHints.all.joined(separator: ", "))
        """
    if !entityNames.isEmpty {
        base += """
            Entity extraction: in extractedTerms, return ONLY names that exactly \
            match this list (preserve exact capitalisation): \
            \(entityNames.joined(separator: ", ")). \
            Return empty extractedTerms if no matches found.
            """
    }
    return base
}
```

Then on the coordinator, set the entity names before recording:

```swift
normalisationCoordinator.entityNames =
    positions.map(\.name) + submissions.map(\.name) + people.map(\.name)
```

When the transcript is normalised, `extractedTerms` comes back as a filtered list — only names that are in your database, with exact capitalisation preserved. This turns extraction from a fuzzy matching problem into a direct lookup.

**The tradeoff:** Higher precision, lower recall. If an entity isn't in the injected list, it won't appear in `extractedTerms`. For entity contexts where you've fetched the full set, this is fine.
For open-ended contexts where you want the model to surprise you, omit the list and fall back to Levenshtein matching downstream.

**Using extractedTerms downstream:**

```swift
let analysisText = normalised.extractedTerms.isEmpty
    ? normalised.normalisedText                          // fallback: full text, fuzzy match
    : normalised.extractedTerms.joined(separator: " ")   // precise: known entities only

let matches = entityAnalyzer.analyze(notes: analysisText, inputs: analysisInputs)
```

When the model is unavailable, `extractedTerms` will be empty and the fallback path kicks in automatically — Levenshtein matching over the full transcript still finds most entities.

---

## The `AnyObject?` availability gating pattern

One practical iOS 26 concern: adding `@State private var session: LanguageModelSession?` to a view forces `@available(iOS 26, *)` onto the whole view struct. A service class stored as `AnyObject?` and cast inside `#available` blocks avoids this.

The coordinator also needs a version-agnostic result type — `NormalisedTranscript` is `@Generable` and thus iOS 26-only. Define a plain mirror struct that carries the same fields without the availability constraint:

```swift
// No @available — works on all iOS versions
struct NormalisedTranscriptResult {
    let normalisedText: String
    let extractedTerms: [String]
}

@Observable @MainActor
final class VoiceNormalisationCoordinator {
    private var service: AnyObject?  // TranscriptNormalisationService on iOS 26+
    var entityNames: [String] = []   // injected before recording

    func setup() {
        if #available(iOS 26, *) {
            service = TranscriptNormalisationService()
        }
    }

    func normalise(_ rawText: String) async -> NormalisedTranscriptResult {
        guard #available(iOS 26, *),
              let s = service as? TranscriptNormalisationService else {
            return NormalisedTranscriptResult(normalisedText: rawText, extractedTerms: [])
        }
        let result = await s.normalise(rawText, entityNames: entityNames)
        return NormalisedTranscriptResult(
            normalisedText: result.normalisedText,
            extractedTerms: result.extractedTerms
        )
    }
}
```

Two things to notice: the coordinator now returns the full result (not just `normalisedText`), and `entityNames` is set externally before recording starts. The gating and availability handling are contained in one place. Callers work with a plain struct that compiles on any iOS version.

---

## When it's worth it

This pattern adds an on-device LLM call and a soft device requirement. It's worth it when:

- The vocabulary is large enough that a substitution dictionary becomes unmanageable
- Transcripts feed a downstream pipeline where bad terms cause actual failures
- You have a database of known entities and want precise extraction, not just correction
- Privacy is a concern — on-device means nothing leaves the phone

If you have a small fixed set of known corrections, a dictionary is simpler, faster, and works everywhere. Use the minimum viable tool.

---

Full FoundationModels API detail — `@Generable`, sessions, token budgets, streaming, availability cases — is in the [iOS 26 FoundationModels reference](/writing/foundation-models-reference).