0022 - Voice configurability: model, pronunciation dictionary, per-line pronunciation
Status: DRAFT
Date: 2026-06-29
Source: REPORT-keryx-in-anger-afmpeg-reel.md (requests A, B, H, I) + the
companion FEATURE-REQUEST-voice-model-selection.md and
FEATURE-REQUEST-pronunciation-dictionary.md.
Touches contracts: extends R-GEN-* (0002 §3.1) and corrects the R-GEN-28 note
("pronunciation dictionary is a gated Studio feature (no API)") — the TTS endpoint
does accept pronunciation_dictionary_locators.
1. Goal
The ElevenLabs voice surface was the launch reel's dominant time sink: every
pronunciation experiment dropped out of keryx into curl. Three configurability gaps
caused it, and they share one request/config surface, so they are specced together:
- B: the TTS model is hardcoded (eleven_multilingual_v2); there is no way to try eleven_v3 (native IPA) or eleven_flash_v2 (phoneme tags) without a recompile.
- A: keryx emits no pronunciation_dictionary_locators, so a curated alias dictionary (already created: phpboyscout-reels, id SQnFpW9tuADX8ugtpo00) cannot be attached.
- H/I: per-line pronunciation must be hand-mangled into the vo text; there is no structured per-card respelling / IPA field (per-line numeric overrides already exist; see §2).
Outcome: the voice request is configurable along model + dictionary + per-line pronunciation, defaulting to today's behaviour so nothing changes unless asked.
2. What already exists (do not rebuild)
Verified in tree 2026-06-29 (the field report was written against an older installed binary):
- provider.VoiceRequest already carries Stability/Similarity/Style/Speed/SpeakerBoost (R-GEN-28); theme.Voice mirrors them + ID.
- storyboard.VoiceOverride already has Speaker/Stability/Similarity/Style/Speed per card (R-GEN-25). So H's per-line numeric overrides are DONE; only per-line pronunciation/IPA is new.
- The vo text passes through verbatim, including control tags + phonetic spellings (R-GEN-24); inline respelling is the current pronunciation lever.
3. Design
One shared surface: a field on provider.VoiceRequest, resolved from theme → workspace →
per-card → CLI, mapped by the active adapter.
3.1 B: selectable voice model
- Add a Model string field to provider.VoiceRequest. The ElevenLabs adapter uses req.Model (falling back to its defaultModel constant) instead of the hardcoded v.model.
- Config: voice.model on the reel theme and workspace; --model on keryx voice gen.
- Precedence (highest first): CLI --model → per-card voice.model → workspace voice.model → theme voice.model → built-in default eleven_multilingual_v2.
- Keep v2 the default: the report's A/B test found eleven_v3 nails IPA but loses Matt's clone (drifts American). This knob is for experiments + future-proofing.
- Validation (resolved Q4): a small allowlist of known-good ids (eleven_multilingual_v2, eleven_v3, eleven_flash_v2, eleven_turbo_v2, …) gives a clear keryx error on a typo; an unknown id is accepted with a warning behind an opt-in flag so a brand-new model isn't blocked. The allowlist is a documented constant to bump as ElevenLabs ships models.
- Model-appropriate settings (resolved Q1): the adapter starts by omitting fields a model rejects (rather than a full per-model translation); a per-model voice_settings mapping (e.g. v3's discrete Creative/Natural/Robust stability) is added only if a model misbehaves.
3.2 A: pronunciation dictionary
- Add PronunciationDictionaries []DictLocator to VoiceRequest (DictLocator{DictionaryID, VersionID string}); the ElevenLabs adapter emits pronunciation_dictionary_locators in ttsRequest. Empty ⇒ field omitted, so the request is byte-identical to today (back-compat).
- Config: voice.pronunciation: <dictionary_id> on theme and/or workspace (resolved Q5; see §6, D5); the latest version is resolved at run time unless a version is pinned.
- Caveat (document, don't oversell): on multilingual_v2 only alias rules apply, and an alias rule is a respelling, so a dictionary is centralisation, not new power (one curated, reviewable source of truth, not better pronunciation). Phoneme rules need flash/v3 (which lose the clone). Multi-token graphemes (ffmpeg.wasm) tokenise awkwardly.
- Optional keryx voice lexicon subcommand to list/add rules and sync from a local pronunciation.yaml / PRONUNCIATION.md (resolved Q2; see §6, D2).
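The "empty ⇒ omitted" rule maps naturally onto encoding/json's omitempty. The sketch below assumes a reduced ttsRequest and a hypothetical wireLocator type; the per-locator JSON keys (pronunciation_dictionary_id, version_id) are my understanding of the ElevenLabs request schema and should be checked against the API reference before use.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DictLocator follows the shape given in the spec.
type DictLocator struct {
	DictionaryID string
	VersionID    string
}

// wireLocator is the assumed JSON shape of one locator (hypothetical type;
// key names are an assumption about the ElevenLabs schema).
type wireLocator struct {
	DictionaryID string `json:"pronunciation_dictionary_id"`
	VersionID    string `json:"version_id,omitempty"`
}

// ttsRequest is a reduced stand-in for the adapter's request body.
// omitempty drops the locators key entirely when the slice is nil or empty.
type ttsRequest struct {
	Text     string        `json:"text"`
	ModelID  string        `json:"model_id"`
	Locators []wireLocator `json:"pronunciation_dictionary_locators,omitempty"`
}

// buildRequest marshals a request, attaching locators only when present.
func buildRequest(text, model string, dicts []DictLocator) ([]byte, error) {
	req := ttsRequest{Text: text, ModelID: model}
	for _, d := range dicts {
		req.Locators = append(req.Locators, wireLocator{
			DictionaryID: d.DictionaryID,
			VersionID:    d.VersionID,
		})
	}
	return json.Marshal(req)
}

func main() {
	// No dictionaries: the locators key is absent from the body.
	plain, _ := buildRequest("hello", "eleven_multilingual_v2", nil)
	fmt.Println(string(plain))

	// One dictionary, no pinned version: version_id is left out.
	withDict, _ := buildRequest("hello", "eleven_multilingual_v2",
		[]DictLocator{{DictionaryID: "SQnFpW9tuADX8ugtpo00"}})
	fmt.Println(string(withDict))
}
```

Because the key disappears rather than serialising as an empty array, any cache key derived from the request bytes stays stable for users who never configure a dictionary.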
3.3 H/I: per-line pronunciation + IPA passthrough
- Extend storyboard.VoiceOverride with pronunciation:
  - pronounce: a respelling applied to that line's vo text (any model), and/or
  - ipa: IPA for that line, gated on a phoneme-capable model (B).
- I (IPA passthrough) is then just: allow ipa (per-card, or inline in vo) and pass it through on flash/v3. On multilingual_v2, ipa is ignored with a warning (phoneme rules don't apply). Document the v3 trade-off (great IPA, lost clone/accent).
- Numeric per-line overrides are unchanged (already R-GEN-25).
- Both pronounce and ipa are specified (resolved Q3): pronounce is kept even though a global alias dictionary (A) overlaps it, because it lets a single line be respelled without adding a rule the dictionary would apply to every occurrence; ipa is the part A cannot express on a capable model.
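One way the per-line gating could look is sketched below. The spec does not fix the shape of pronounce or ipa, so everything here is an assumption: Respelling, LinePronunciation, ApplyLine and the phonemeCapable set are hypothetical, pronounce is modelled as ordered term/respelling pairs, and the sketch only decides whether ipa is passed on; it deliberately does not guess how IPA is encoded on the wire.

```go
package main

import (
	"fmt"
	"strings"
)

// phonemeCapable is an assumed capability set, limited to the two models
// the spec names as IPA/phoneme-capable.
var phonemeCapable = map[string]bool{
	"eleven_v3":       true,
	"eleven_flash_v2": true,
}

// Respelling is a hypothetical term -> respelling pair.
type Respelling struct{ Term, As string }

// LinePronunciation is a hypothetical shape for the two new per-card fields.
type LinePronunciation struct {
	Pronounce []Respelling // applied in order, on any model
	IPA       string       // passed through only on capable models
}

// ApplyLine returns the respelled text, the IPA to pass through (empty if
// gated off), and any warnings for the caller to surface.
func ApplyLine(vo, model string, p LinePronunciation) (text, ipa string, warnings []string) {
	text = vo
	for _, r := range p.Pronounce {
		text = strings.ReplaceAll(text, r.Term, r.As)
	}
	if p.IPA == "" {
		return text, "", nil
	}
	if !phonemeCapable[model] {
		warnings = append(warnings,
			fmt.Sprintf("ipa ignored: model %q does not apply phoneme rules", model))
		return text, "", warnings
	}
	return text, p.IPA, nil
}

func main() {
	p := LinePronunciation{
		Pronounce: []Respelling{{Term: "ffmpeg", As: "eff eff em peg"}},
		IPA:       "placeholder-ipa",
	}
	text, ipa, warns := ApplyLine("Run ffmpeg now", "eleven_multilingual_v2", p)
	fmt.Println(text)
	fmt.Println(ipa == "", len(warns))
}
```

Returning warnings instead of logging inside the function keeps the core deterministic and easy to cover with the network-free tests described in §5.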
3.4 Relationship to J (locked VO)
Pronunciation control reduces drift in the input text; it does not remove take variance
in the output audio. The durable fix on v2 remains human-reviewed, locked vo/NN.mp3
clips — spec 0025 (J), separate.
4. Requirements
- R-GEN-29 (MUST) The TTS model is selectable via VoiceRequest.Model, resolved CLI → per-card → workspace → theme → default. An id outside the known allowlist is a clear keryx error, unless the opt-in accept-unknown flag is set (then a warning).
- R-GEN-30 (MUST) The built-in default model stays eleven_multilingual_v2 (clone fidelity); nothing changes the model unless configured.
- R-GEN-31 (MUST) VoiceRequest carries pronunciation-dictionary locators emitted as pronunciation_dictionary_locators; empty ⇒ omitted, leaving the request byte-identical to today (back-compat, cache-stable per R-GEN-8).
- R-GEN-32 (SHOULD) A theme/workspace voice.pronunciation references a dictionary id; the latest version resolves at run time unless pinned.
- R-GEN-33 (SHOULD) Per-card voice.pronounce (respelling, any model) and voice.ipa (phoneme-capable models only; ignored + warned on v2) are supported.
- R-GEN-34 (SHOULD) The adapter sends model-appropriate voice_settings and does not send fields a model rejects.
- R-GEN-35 (MAY) keryx voice lexicon manages the dictionary (list/add/sync from pronunciation.yaml).
- Corrects R-GEN-28: the "pronunciation dictionary is a gated Studio feature (no API)" note is wrong; the TTS API accepts locators.
5. Testing (TDD)
Deterministic core, all network-free behind the existing fakes:
- ttsRequest.model_id reflects req.Model (and falls back to the default when unset).
- pronunciation_dictionary_locators present when configured, absent when empty (byte-identical request).
- Precedence resolution (CLI > per-card > workspace > theme > default): table test.
- Unknown model id → error (allowlist per Q4).
- Per-card pronounce applied to vo; ipa passed through on a capable model and ignored-with-warning on v2.
- Integration (INT_TEST): one live v2 synth with the phpboyscout-reels dictionary attached, asserting a 200 + non-empty mp3.
6. Resolved decisions
Reviewed with Matt 2026-06-29:
- D1 (was Q3), per-line overrides: ship both pronounce and ipa per card (§3.3). pronounce is line-scoped respelling the global dictionary shouldn't apply everywhere; ipa is phoneme control A can't express.
- D2 (was Q2), lexicon scope: reference-first. MUST attach an existing dictionary id (R-GEN-31/32); keryx voice lexicon management is a later MAY (R-GEN-35).
- D3 (was Q4), model validation: small allowlist + escape hatch. Clear error on an unknown id, with an opt-in flag to accept-with-warning so new models aren't blocked (R-GEN-29).
- D4 (was Q1), settings depth: omit model-rejected fields first; add a per-model voice_settings mapping only if a model misbehaves (R-GEN-34).
- D5 (was Q5), dictionary attach level: both theme and workspace; workspace locators are appended after the theme's (per-reel terms extend house terms).
7. Implementation phasing (suggested)
Independently shippable, each test-first:
- B (model selection): smallest, unblocks A's phoneme rules + I; thread Model through, allowlist, default unchanged.
- A (dictionary reference): pronunciation_dictionary_locators + voice.pronunciation config; corrects R-GEN-28.
- H/I (per-line pronounce/ipa): extend VoiceOverride; ipa gated on B's model.
- voice lexicon (MAY): only if dictionary management is wanted in keryx.