0023 — Voice take workflow: subset re-roll + take screening

Status: DRAFT
Date: 2026-06-29
Source: REPORT-keryx-in-anger-afmpeg-reel.md (request F, pinch point #6).
Related: 0022 (voice configurability), 0025 (locked VO).

1. Goal

In the source report, replacing a stilted VO read meant a hand-rolled curl loop generating N takes per line plus an ffmpeg silencedetect script to screen them; none of it was expressible in keryx. Make the "re-roll these few lines and let me choose the best" loop first-class.

2. What already exists (do not rebuild)

Verified 2026-06-29 (the report ran an older binary):

  • keryx voice gen --workspace <slug> --line N --takes M generates M candidates into vo/takes/NN-T.mp3 (R-GEN-6/7).
  • keryx voice gen --text "…" --out <path> is the standalone single-line synth the report praised and asked to "first-class" — it already exists and honours the theme voice.
  • keryx voice pick <line> <take> promotes a candidate to vo/NN.mp3 (R-GEN-9).
  • reel make auto-generates + auto-picks take 1 and skips lines that already have a selection (resume-idempotent).

So the missing pieces are (a) re-rolling a subset of lines in one invocation, and (b) screening takes for stiltedness — not the single-line or per-line basics.

3. Design

3.1 Subset re-roll

Let one invocation target several lines:

  • keryx voice gen --workspace <slug> --lines 3,5,7 --takes M (and a range form --lines 3-6). Each listed line gets M fresh candidates in vo/takes/. --line N stays as the single-line alias.
  • Pairs with voice pick <line> <take> per line to choose. (Locking a chosen take is spec 0025.)
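
The --lines value could be parsed along these lines. This is a minimal sketch, not keryx's implementation: the function name parse_lines is hypothetical, and it assumes that list and range forms may be mixed in one value, that duplicates collapse, and that line numbers start at 1. The spec itself only names the list form (3,5,7) and the range form (3-6).

```python
def parse_lines(spec: str) -> list[int]:
    """Parse a --lines value such as "3,5,7" or "3-6" into sorted, unique line numbers.

    Hypothetical helper: mixing lists and ranges ("7,3-4") is an assumption.
    """
    lines: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in --lines value: {spec!r}")
        if "-" in part:
            lo_text, hi_text = part.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)  # int() raises ValueError on junk
            if lo > hi:
                raise ValueError(f"descending range: {part!r}")
            lines.update(range(lo, hi + 1))  # ranges are inclusive on both ends
        else:
            lines.add(int(part))
    if any(n < 1 for n in lines):
        # Assumption: VO lines are numbered from 1.
        raise ValueError("line numbers start at 1")
    return sorted(lines)
```

Returning a sorted, de-duplicated list keeps the re-roll order deterministic, which makes the "only those lines change" test in section 5 easier to write.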

3.2 Take inspection

To choose without leaving the CLI you need to see what's there:

  • keryx voice takes --workspace <slug> [--line N] lists candidate takes per line with duration and (when screening is on) a stilt score + the selected marker. Text + JSON.
  • Actual listening stays out-of-band (or the studio) — keryx surfaces the metadata that ranks them.
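
As a rough illustration of the listing step, candidate files can be grouped per line from their names. The sketch assumes that in vo/takes/NN-T.mp3 the NN part is the line number and T the take number, both decimal; group_takes and TAKE_NAME are hypothetical names, and duration probing and the selected marker are left out.

```python
import re
from collections import defaultdict

# Assumption: take files are named NN-T.mp3 (line number, dash, take number).
TAKE_NAME = re.compile(r"^(\d+)-(\d+)\.mp3$")


def group_takes(filenames):
    """Group take filenames into {line: [take, ...]}, ignoring unrelated files."""
    by_line = defaultdict(list)
    for name in filenames:
        match = TAKE_NAME.match(name)
        if match is None:
            continue  # not a take file; skip rather than fail
        by_line[int(match.group(1))].append(int(match.group(2)))
    # Sort lines and takes so text and JSON output are stable.
    return {line: sorted(takes) for line, takes in sorted(by_line.items())}
```

For example, group_takes(["03-2.mp3", "03-1.mp3", "05-1.mp3", "notes.txt"]) yields takes 1 and 2 for line 3 and take 1 for line 5.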

3.3 Take screening (stilt detection)

The reads that fail are over-paused ("ef… ef… um-peg"). Screen takes by counting internal silences (a silencedetect-style pass over each take) and comparing that count to the intended pause count: the number of ellipses (…) and <break> tags in that line's vo text. A take with materially more internal pauses than intended is flagged and ranked lower.

  • Exposed as a score in voice takes and an optional voice pick --best <line> that promotes the top-ranked take.
  • The silence analysis reuses the render backend (afmpeg/ffmpeg) — no new dependency.
  • Screening is a heuristic aid, never an auto-decision unless --best is asked for (preserves the taste-gate; the human still chooses).
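
The scoring idea can be sketched as follows. Everything here is an assumption for illustration: the helper names, treating an ASCII "..." the same as "…", the 0.05 s edge tolerance for ignoring leading and trailing silence, and defining the score as the number of excess pauses (lower is better, ties broken by take number). The spec does not fix a formula.

```python
import re

BREAK_TAG = re.compile(r"<break\b[^>]*>")  # matches <break>, <break time="300ms"/>, ...
ELLIPSIS = re.compile(r"…|\.{3,}")          # assumption: "..." counts like "…"


def intended_pauses(vo_text: str) -> int:
    """Count the pauses the author asked for in one line of VO text."""
    return len(ELLIPSIS.findall(vo_text)) + len(BREAK_TAG.findall(vo_text))


def internal_silences(silences, duration, edge=0.05):
    """Count (start, end) silences that touch neither the start nor the end of the take."""
    return sum(1 for start, end in silences if start > edge and end < duration - edge)


def stilt_score(detected: int, intended: int) -> int:
    """Excess internal pauses; 0 means no more pauses than intended."""
    return max(0, detected - intended)


def best_take(scores: dict[int, int]) -> int:
    """Pick the take with the lowest score, preferring the lower take number on ties."""
    return min(scores, key=lambda take: (scores[take], take))
```

With the text 'Meet afmpeg… the renderer <break time="300ms"/> you already have.' the intended count is 2. A 5.0 s take with silences at (0.0, 0.2), (1.1, 1.5), (2.0, 2.4), (3.0, 3.3) and (4.8, 5.0) has 3 internal silences, so its score is 1.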

4. Requirements

  • R-GEN-36 (MUST) voice gen --lines <list|range> re-rolls exactly the named lines (M takes each), leaving other lines' takes and selections untouched (--line N is the single-line alias).
  • R-GEN-37 (SHOULD) voice takes [--line N] lists candidates with duration + selected marker (text + JSON).
  • R-GEN-38 (SHOULD) Take screening scores each candidate by internal-pause count vs the line's intended pauses (ellipses/<break>); surfaced in voice takes.
  • R-GEN-39 (MAY) voice pick --best <line> promotes the top-screened take; screening never auto-promotes otherwise.

5. Testing (TDD)

  • --lines 3,5,7 / --lines 3-6 parsing + that only those lines' take slots change (fake TTS).
  • voice takes JSON lists the right candidates + durations (fake renderer probe).
  • Stilt scoring: a synthetic take with N detected silences vs a line with K intended pauses yields the expected score; --best picks the highest-ranked. Silence detection is faked at the seam (no real audio).
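
A test for the subset re-roll with a fake TTS might look like the sketch below. The in-memory store, reroll, and fake_tts are hypothetical stand-ins for whatever seam keryx exposes, and the sketch assumes a re-roll replaces a line's existing candidates rather than appending to them.

```python
def reroll(store, lines, takes, synth):
    """Return a copy of `store` (line -> list of takes) with fresh takes for `lines` only."""
    updated = dict(store)
    for line in lines:
        # Assumption: fresh candidates replace the old ones for that line.
        updated[line] = [synth(line, take) for take in range(1, takes + 1)]
    return updated


def fake_tts(line, take):
    """Deterministic stand-in for the real synth; no audio is produced."""
    return f"fake-audio-{line}-{take}".encode()


def test_reroll_touches_only_named_lines():
    before = {n: [b"old"] for n in range(1, 8)}
    after = reroll(before, [3, 5, 7], takes=2, synth=fake_tts)
    for n in (1, 2, 4, 6):
        assert after[n] == before[n]  # untouched lines keep their takes
    for n in (3, 5, 7):
        assert after[n] == [fake_tts(n, 1), fake_tts(n, 2)]  # M fresh takes each
```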

6. Resolved decisions

Reviewed with Matt 2026-06-29:

  • D1 (was Q1), subset surface: --lines <list|range> on the existing voice gen (one mental model), with --line N kept as the single-line alias.
  • D2 (was Q2), phasing: subset re-roll + voice takes listing come first (R-GEN-36 MUST, R-GEN-37 SHOULD); stilt screening is a fast-follow (R-GEN-38/39) once the silence-detection seam exists.
  • D3 (was Q3), studio overlap: the studio already covers the "choose" half. Spec 0016 (studio audio takes) already provides generate / audition / pick over the worktree, and notes that no ListVOTakes core exists yet. So this spec's CLI adds subset re-roll + a headless take listing (the new ListVOTakes core benefits both CLI and studio) + screening; it does not build a CLI audition UI (that's the studio's job).