0023 — Voice take workflow: subset re-roll + take screening
Status: DRAFT
Date: 2026-06-29
Source: REPORT-keryx-in-anger-afmpeg-reel.md (request F, pinch point #6).
Related: 0022 (voice configurability), 0025 (locked VO).
1. Goal
Killing a stilted VO read meant a curl loop generating N takes per line and an
ffmpeg silencedetect script to screen them — none of it expressible in keryx. Make the
"re-roll these few lines and let me choose the best" loop first-class.
2. What already exists (do not rebuild)
Verified 2026-06-29 (the report ran an older binary):
- `keryx voice gen --workspace <slug> --line N --takes M` generates M candidates into `vo/takes/NN-T.mp3` (R-GEN-6/7).
- `keryx voice gen --text "…" --out <path>` is the standalone single-line synth the report praised and asked to "first-class"; it already exists and honours the theme voice.
- `keryx voice pick <line> <take>` promotes a candidate to `vo/NN.mp3` (R-GEN-9).
- `reel make` auto-generates + auto-picks take 1 and skips lines that already have a selection (resume-idempotent).
So the missing pieces are (a) re-rolling a subset of lines in one invocation, and (b) screening takes for stiltedness — not the single-line or per-line basics.
3. Design
3.1 Subset re-roll
Let one invocation target several lines:
- `keryx voice gen --workspace <slug> --lines 3,5,7 --takes M` (and a range form `--lines 3-6`). Each listed line gets M fresh candidates in `vo/takes/`. `--line N` stays as the single-line alias.
- Pairs with `voice pick <line> <take>` per line to choose. (Locking a chosen take is spec 0025.)
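The two `--lines` forms could be expanded into line numbers along these lines. This is a minimal sketch, not keryx's parser: the function name `parse_lines_spec` is hypothetical, and accepting a mix of list and range entries, de-duplicating, and treating line numbers as 1-based are all assumptions the spec does not state.

```python
def parse_lines_spec(spec: str) -> list[int]:
    """Expand a --lines value such as "3,5,7" or "3-6" into sorted, unique line numbers."""
    lines: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in --lines value: {spec!r}")
        if "-" in part:
            # Range form: both ends are inclusive.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            lines.update(range(start, end + 1))
        else:
            lines.add(int(part))
    # Assumption: VO lines are numbered from 1.
    if any(n < 1 for n in lines):
        raise ValueError("line numbers start at 1")
    return sorted(lines)
```

Returning a sorted, de-duplicated list keeps the re-roll deterministic and makes it easy to assert that only the named lines' take slots change.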
3.2 Take inspection
To choose without leaving the CLI you need to see what's there:
- `keryx voice takes --workspace <slug> [--line N]` lists candidate takes per line with duration and (when screening is on) a stilt score + the selected marker. Text + JSON.
- Actual listening stays out-of-band (or in the studio); keryx surfaces the metadata that ranks them.
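As a rough illustration of the listing step, candidate files can be grouped by line from their names. The sketch below assumes that `NN` in `vo/takes/NN-T.mp3` is the (zero-padded) line number and `T` the take number; the helper name `group_takes` is hypothetical, and durations, scores, and the selected marker would come from a probe and the selection state, which are not shown.

```python
import re
from collections import defaultdict

# Assumed naming: "<line>-<take>.mp3", e.g. "03-2.mp3" is take 2 of line 3.
TAKE_NAME = re.compile(r"^(\d+)-(\d+)\.mp3$")


def group_takes(filenames: list[str]) -> dict[int, list[int]]:
    """Map each line number to its sorted take numbers, ignoring unrelated files."""
    by_line: dict[int, list[int]] = defaultdict(list)
    for name in filenames:
        match = TAKE_NAME.match(name)
        if match is None:
            continue  # not a take file
        by_line[int(match.group(1))].append(int(match.group(2)))
    return {line: sorted(takes) for line, takes in sorted(by_line.items())}
```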
3.3 Take screening (stilt detection)
The reads that fail are over-paused ("ef… ef… um-peg"). Screen takes by counting
internal silences (a silencedetect-style pass over each take) and comparing to the
intended pause count — the number of ellipses (…) / <break> tags in that line's
vo text. A take with materially more internal pauses than intended is flagged/ranked
lower.
- Exposed as a score in `voice takes` and an optional `voice pick --best <line>` that promotes the top-ranked take.
- The silence analysis reuses the render backend (afmpeg/ffmpeg); no new dependency.
- Screening is a heuristic aid, never an auto-decision unless `--best` is asked for (preserves the taste-gate; the human still chooses).
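The comparison described above could look roughly like the following. This is a sketch under stated assumptions, not the spec's scoring rule: it parses the `silence_start:` / `silence_end:` lines that ffmpeg's `silencedetect` filter logs, treats silences touching the start or end of the take as non-internal, counts ASCII `...` as an ellipsis, and scores a take as the number of detected pauses in excess of the intended ones. All function names and the `edge` tolerance are hypothetical.

```python
import re

SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")
# Intended pauses: a Unicode ellipsis, three ASCII dots (assumption), or a <break> tag.
INTENDED_PAUSE = re.compile(r"…|\.{3}|<break\b[^>]*>")


def count_internal_silences(log: str, duration: float, edge: float = 0.05) -> int:
    """Count silences that neither start at the head nor end at the tail of the take."""
    starts = [float(x) for x in SILENCE_START.findall(log)]
    ends = [float(x) for x in SILENCE_END.findall(log)]
    count = 0
    # zip drops a trailing silence_start with no silence_end (silence runs to end of file).
    for start, end in zip(starts, ends):
        if start > edge and end < duration - edge:
            count += 1
    return count


def count_intended_pauses(vo_text: str) -> int:
    """Count the pauses the script asks for in one line of VO text."""
    return len(INTENDED_PAUSE.findall(vo_text))


def stilt_score(detected: int, intended: int) -> int:
    """Excess pauses over what the line intends; 0 means no more pauses than written."""
    return max(0, detected - intended)
```

A lower score ranks higher. What counts as "materially more" pauses is left to the caller, since the spec does not fix a threshold.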
4. Requirements
- R-GEN-36 (MUST) `voice gen --lines <list|range>` re-rolls exactly the named lines (M takes each), leaving other lines' takes and selections untouched (`--line N` is the single-line alias).
- R-GEN-37 (SHOULD) `voice takes [--line N]` lists candidates with duration + selected marker (text + JSON).
- R-GEN-38 (SHOULD) Take screening scores each candidate by internal-pause count vs the line's intended pauses (ellipses/`<break>`); surfaced in `voice takes`.
- R-GEN-39 (MAY) `voice pick --best <line>` promotes the top-screened take; screening never auto-promotes otherwise.
5. Testing (TDD)
- `--lines 3,5,7` / `--lines 3-6` parsing + that only those lines' take slots change (fake TTS).
- `voice takes` JSON lists the right candidates + durations (fake renderer probe).
- Stilt scoring: a synthetic take with N detected silences vs a line with K intended pauses yields the expected score; `--best` picks the highest-ranked. Silence detection is faked at the seam (no real audio).
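One way to fake silence detection at the seam is to pass the detector in as a callable, so a test can supply canned counts instead of analysing audio. The sketch below is illustrative only: `pick_best`, its signature, and the tie-break toward the lowest take number are assumptions, not keryx's API.

```python
from typing import Callable


def pick_best(takes: list[int], intended: int,
              count_silences: Callable[[int], int]) -> int:
    """Return the take with the fewest excess pauses; ties go to the lowest take number."""
    if not takes:
        raise ValueError("no candidate takes")
    return min(takes, key=lambda take: (max(0, count_silences(take) - intended), take))


def test_best_prefers_fewest_excess_pauses() -> None:
    # Fake seam: take number -> detected internal silences, no real audio involved.
    fake_counts = {1: 4, 2: 1, 3: 2}
    assert pick_best([1, 2, 3], intended=1, count_silences=fake_counts.__getitem__) == 2
```

Because the detector is injected, the same function can run against the real backend in production and against a dictionary lookup in tests.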
6. Resolved decisions
Reviewed with Matt 2026-06-29:
- D1 (was Q1), subset surface: `--lines <list|range>` on the existing `voice gen` (one mental model), `--line N` kept as the single-line alias.
- D2 (was Q2), phasing: subset re-roll + `voice takes` listing first (R-GEN-36/37, MUST); stilt screening is a fast-follow (R-GEN-38/39) once the silence-detection seam exists.
- D3 (was Q3), the studio already covers the "choose" half: Spec 0016 (studio audio takes) already provides generate / audition / pick over the worktree, and notes no `ListVOTakes` core exists yet. So this spec's CLI adds subset re-roll + a headless take listing (the new `ListVOTakes` core benefits both CLI and studio) + screening; it does not build a CLI audition UI (that's the studio's job).