0022 — Voice configurability: model, pronunciation dictionary, per-line pronunciation

Status: DRAFT
Date: 2026-06-29
Source: REPORT-keryx-in-anger-afmpeg-reel.md (requests A, B, H, I), plus the companion FEATURE-REQUEST-voice-model-selection.md and FEATURE-REQUEST-pronunciation-dictionary.md.
Touches contracts: extends R-GEN-* (0002 §3.1) and corrects the R-GEN-28 note ("pronunciation dictionary is a gated Studio feature (no API)"): the TTS endpoint does accept pronunciation_dictionary_locators.

1. Goal

The ElevenLabs voice surface was the launch reel's dominant time sink: every pronunciation experiment dropped out of keryx into curl. Three configurability gaps caused it, and they share one request/config surface, so they are specced together:

  • B — the TTS model is hardcoded (eleven_multilingual_v2); no way to try eleven_v3 (native IPA) or eleven_flash_v2 (phoneme tags) without a recompile.
  • A — keryx emits no pronunciation_dictionary_locators, so a curated alias dictionary (already created: phpboyscout-reels, id SQnFpW9tuADX8ugtpo00) cannot be attached.
  • H/I — per-line pronunciation must be hand-mangled into the vo text; there is no structured per-card respelling / IPA field (per-line numeric overrides already exist — see §2).

Outcome: the voice request is configurable along model + dictionary + per-line pronunciation, defaulting to today's behaviour so nothing changes unless asked.

2. What already exists (do not rebuild)

Verified in tree 2026-06-29 (the field report was written against an older installed binary):

  • provider.VoiceRequest already carries Stability / Similarity / Style / Speed / SpeakerBoost (R-GEN-28); theme.Voice mirrors them + ID.
  • storyboard.VoiceOverride already has Speaker / Stability / Similarity / Style / Speed per card (R-GEN-25). So H's per-line numeric overrides are DONE — only per-line pronunciation/IPA is new.
  • The vo text passes through verbatim including control tags + phonetic spellings (R-GEN-24) — inline respelling is the current pronunciation lever.

3. Design

One shared surface: new fields on provider.VoiceRequest, resolved in rising precedence (theme → workspace → per-card → CLI) and mapped onto the wire request by the active adapter.

3.1 B — selectable voice model

  • Add Model string to provider.VoiceRequest. The ElevenLabs adapter uses req.Model (falling back to its defaultModel constant) instead of the hardcoded v.model.
  • Config: voice.model on the reel theme, the workspace, and per card; --model on keryx voice gen.
  • Precedence (highest first): CLI --model → per-card voice.model → workspace voice.model → theme voice.model → built-in default eleven_multilingual_v2.
  • Keep v2 the default — the report's A/B test found eleven_v3 nails IPA but loses Matt's clone (drifts American). This knob is for experiments + future-proofing.
  • Validation (resolved Q4): a small allowlist of known-good ids (eleven_multilingual_v2, eleven_v3, eleven_flash_v2, eleven_turbo_v2, …) gives a clear keryx error on a typo; an unknown id is accepted with a warning behind an opt-in flag so a brand-new model isn't blocked. The allowlist is a documented constant to bump as ElevenLabs ships models.
  • Model-appropriate settings (resolved Q1): the adapter starts by omitting fields a model rejects (rather than a full per-model translation); a per-model voice_settings mapping (e.g. v3's discrete Creative/Natural/Robust stability) is added only if a model misbehaves.

3.2 A — pronunciation dictionary

  • Add PronunciationDictionaries []DictLocator to VoiceRequest (DictLocator{DictionaryID, VersionID string}); the ElevenLabs adapter emits pronunciation_dictionary_locators in ttsRequest. Empty ⇒ field omitted — the request is byte-identical to today (back-compat).
  • Config: voice.pronunciation: <dictionary_id> on theme and/or workspace (resolved Q5; see §6, D5); the latest version is resolved at run time unless a version is pinned.
  • Caveat (document, don't oversell): on multilingual_v2 only alias rules apply, and an alias rule is a respelling — so a dictionary is centralisation, not new power (one curated, reviewable source of truth, not better pronunciation). Phoneme rules need flash/v3 (which lose the clone). Multi-token graphemes (ffmpeg.wasm) tokenise awkwardly.
  • Optional keryx voice lexicon subcommand to list/add rules and sync from a local pronunciation.yaml / PRONUNCIATION.md (deferred as a later MAY; see §6, D2).
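
A sketch of how the back-compat rule could fall out of Go's JSON encoding. DictLocator follows the shape given above; ttsRequest here is a trimmed stand-in for the adapter's real struct, and wireLocator and buildRequest are hypothetical. The JSON keys pronunciation_dictionary_id and version_id are assumed from the ElevenLabs TTS API and should be checked against its reference. The sketch also assumes an unpinned VersionID may simply be left off the wire; if the API requires it, keryx would need to look up the latest version first.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DictLocator follows the shape given in §3.2.
type DictLocator struct {
	DictionaryID string
	VersionID    string
}

// wireLocator is a hypothetical wire form; the JSON keys are assumptions.
type wireLocator struct {
	DictionaryID string `json:"pronunciation_dictionary_id"`
	VersionID    string `json:"version_id,omitempty"`
}

// ttsRequest is a trimmed stand-in for the adapter's request body.
// omitempty drops the locators key entirely when the slice is empty.
type ttsRequest struct {
	Text     string        `json:"text"`
	ModelID  string        `json:"model_id"`
	Locators []wireLocator `json:"pronunciation_dictionary_locators,omitempty"`
}

// buildRequest marshals the request body for the given text, model and
// dictionaries. With no dictionaries the output has no locators key.
func buildRequest(text, model string, dicts []DictLocator) ([]byte, error) {
	req := ttsRequest{Text: text, ModelID: model}
	for _, d := range dicts {
		req.Locators = append(req.Locators, wireLocator{
			DictionaryID: d.DictionaryID,
			VersionID:    d.VersionID,
		})
	}
	return json.Marshal(req)
}

func main() {
	plain, _ := buildRequest("hello", "eleven_multilingual_v2", nil)
	fmt.Println(string(plain))

	withDict, _ := buildRequest("hello", "eleven_multilingual_v2",
		[]DictLocator{{DictionaryID: "SQnFpW9tuADX8ugtpo00"}})
	fmt.Println(string(withDict))
}
```

Because the empty case omits the key rather than emitting an empty array or null, a request with no dictionary configured serialises exactly as it would without the new field, which is what keeps existing cache keys stable.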

3.3 H/I — per-line pronunciation + IPA passthrough

  • Extend storyboard.VoiceOverride with pronunciation:
      • pronounce: a respelling applied to that line's vo text (any model), and/or
      • ipa: IPA for that line, gated on a phoneme-capable model (B).
  • I (IPA passthrough) is then just: allow ipa (per-card, or inline in vo) and pass it through on flash/v3. On multilingual_v2, ipa is ignored with a warning (phoneme rules don't apply). Document the v3 trade-off (great IPA, lost clone/accent).
  • Numeric per-line overrides are unchanged (already R-GEN-25).
  • Both pronounce and ipa are specified (resolved Q3): pronounce is kept even though a global alias dictionary (A) overlaps it — it lets a single line be respelled without adding a rule the dictionary would apply to every occurrence; ipa is the part A cannot express on a capable model.
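
The model gate on ipa can be sketched as below. LinePronunciation, gatePronunciation and the phonemeCapable set are hypothetical; the set is taken from this spec's statement that phoneme rules need flash/v3. How pronounce and ipa are actually spliced into the vo text is left open by the spec and is not modelled here, and the sample respelling and IPA strings are illustrative only.

```go
package main

import "fmt"

// phonemeCapable lists models assumed to honour IPA, per §3.2's caveat.
var phonemeCapable = map[string]bool{
	"eleven_v3":       true,
	"eleven_flash_v2": true,
}

// LinePronunciation is a hypothetical holder for the two per-card fields.
type LinePronunciation struct {
	Pronounce string // respelling for the line; valid on any model
	IPA       string // honoured only on phoneme-capable models
}

// gatePronunciation returns the fields the adapter should honour for the
// given model, plus any warnings. IPA on a non-capable model is dropped
// with a warning; Pronounce always passes through.
func gatePronunciation(model string, p LinePronunciation) (LinePronunciation, []string) {
	var warnings []string
	if p.IPA != "" && !phonemeCapable[model] {
		warnings = append(warnings,
			fmt.Sprintf("ipa ignored: model %q does not apply phoneme rules", model))
		p.IPA = ""
	}
	return p, warnings
}

func main() {
	in := LinePronunciation{Pronounce: "eff-eff-empeg", IPA: "ˌɛf ɛf ˈɛmpɛɡ"}

	out, warns := gatePronunciation("eleven_multilingual_v2", in)
	fmt.Printf("%+v %v\n", out, warns)

	out, warns = gatePronunciation("eleven_v3", in)
	fmt.Printf("%+v %v\n", out, warns)
}
```

Returning warnings instead of logging inside the function keeps the gate pure, which suits the network-free table tests described in §5.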

3.4 Relationship to J (locked VO)

Pronunciation control reduces drift in the input text; it does not remove take variance in the output audio. The durable fix on v2 remains human-reviewed, locked vo/NN.mp3 clips — spec 0025 (J), separate.

4. Requirements

  • R-GEN-29 (MUST) The TTS model is selectable via VoiceRequest.Model, resolved CLI → per-card → workspace → theme → default. An id outside the known allowlist is a clear keryx error, unless the opt-in accept-unknown flag is set (then a warning).
  • R-GEN-30 (MUST) The built-in default model stays eleven_multilingual_v2 (clone fidelity); nothing changes the model unless configured.
  • R-GEN-31 (MUST) VoiceRequest carries pronunciation-dictionary locators emitted as pronunciation_dictionary_locators; empty ⇒ omitted, leaving the request byte-identical to today (back-compat, cache-stable per R-GEN-8).
  • R-GEN-32 (SHOULD) A theme/workspace voice.pronunciation references a dictionary id; the latest version resolves at run time unless pinned.
  • R-GEN-33 (SHOULD) Per-card voice.pronounce (respelling, any model) and voice.ipa (phoneme-capable models only; ignored + warned on v2).
  • R-GEN-34 (SHOULD) The adapter sends model-appropriate voice_settings and does not send fields a model rejects.
  • R-GEN-35 (MAY) keryx voice lexicon manages the dictionary (list/add/sync from pronunciation.yaml).
  • Corrects R-GEN-28 — the "pronunciation dictionary is a gated Studio feature (no API)" note is wrong; the TTS API accepts locators.

5. Testing (TDD)

Deterministic core, all network-free behind the existing fakes:

  • ttsRequest.model_id reflects req.Model (and falls back to the default when unset).
  • pronunciation_dictionary_locators present when configured, absent when empty (byte-identical request).
  • Precedence resolution (CLI > per-card > workspace > theme > default) — table test.
  • Unknown model id → error by default; accepted with a warning when the opt-in flag is set (allowlist per Q4).
  • Per-card pronounce applied to vo; ipa passed through on a capable model and ignored-with-warning on v2.
  • Integration (INT_TEST): one live v2 synth with the phpboyscout-reels dictionary attached, asserting a 200 + non-empty mp3.

6. Resolved decisions

Reviewed with Matt 2026-06-29:

  • D1 (was Q3) — per-line overrides: ship both pronounce and ipa per card (§3.3). pronounce is line-scoped respelling the global dictionary shouldn't apply everywhere; ipa is phoneme control A can't express.
  • D2 (was Q2) — lexicon scope: reference-first. MUST attach an existing dictionary id (R-GEN-31/32); keryx voice lexicon management is a later MAY (R-GEN-35).
  • D3 (was Q4) — model validation: small allowlist + escape hatch — clear error on an unknown id, with an opt-in flag to accept-with-warning so new models aren't blocked (R-GEN-29).
  • D4 (was Q1) — settings depth: omit model-rejected fields first; add a per-model voice_settings mapping only if a model misbehaves (R-GEN-34).
  • D5 (was Q5) — dictionary attach level: both theme and workspace; workspace locators are appended after the theme's (per-reel terms extend house terms).

7. Implementation phasing (suggested)

Independently shippable, each test-first:

  1. B (model selection) — smallest, unblocks A's phoneme rules + I; thread Model through, allowlist, default unchanged.
  2. A (dictionary reference): pronunciation_dictionary_locators + voice.pronunciation config; corrects R-GEN-28.
  3. H/I (per-line pronounce/ipa) — extend VoiceOverride; ipa gated on B's model.
  4. voice lexicon (MAY) — only if dictionary management is wanted in keryx.