0016 - studio audio takes: VO + music audition & select (R-UI-9)

Status: IMPLEMENTED (studio Phase 2A, 2026-06-26). The §7 decisions are resolved, all per the recommendations: (1) extend gencmd/GenerateInto with audio targets, one seam; (2) generalized Cost.Unit + per-kind price keys, per-take flat estimate for v1; (3) generalize MediaGen with a per-kind take tile; (4) VO endpoints under the card, line ≡ card index; (5) keep the CLI candidate-bed model; (6) reuse the 0013-D7 in-memory/TTL policy.

Date: 2026-06-26

Depends on: the Phase-1 media infra (spec 0013: the async job/cost/gallery/pick harness, the Generator seam, gencmd.GenerateInto), per-project config (0014), and the audio cores (gencmd.VOTakes/MusicTakes, internal/takes, the voice/music provider seams).

Parent: spec 0015 (§5 build order: 2A first; §7-D4 already resolved: FS-explicit audio gen + an audio cost axis).

1. Goal & scope

Surface R-UI-9 in the editor: for each card, generate VO takes, audition them, and pick one; per reel, generate music-bed takes, audition them, and pick one. This completes "generate every media kind in the studio" (images in 0013, now audio), and a faithful render (2B) needs the selected VO (it drives card timing) plus the bed.

The generation cores exist (gencmd.VOTakes/MusicTakes, ElevenLabs voice/music providers); the studio surfaces none. 2A is new HTTP surface + an audio audition UI + two core refactors (FS-explicit audio gen, an audio cost axis). It reuses the 0013 async harness almost wholesale; the deltas are the contention (C4) 0015 named: audio gen is per-line/per-bed and not FS-explicit, and paid on a different cost axis.

In scope: VO + music generate/list/pick/serve over the active worktree (in-memory capable); the audition UI; audio cost disclosure. Out of scope: render (2B), the contact-sheet (folds in later), TTS/music provider changes.

2. What already exists (reuse map)

Need	Existing core	Gap
VO takes	`gencmd.VOTakes(p, slug, line, n, theme)` → `vo/takes/NN-T.mp3`	bound to `p.FS` via `slug`; not FS-explicit
Music takes	`gencmd.MusicTakes(p, slug, n, theme)` → `music/takes/T.mp3`	same; single bed per workspace
Voice/music providers	`provider.VoiceProvider.Synthesize`, `MusicProvider.Compose`; elevenlabs adapters	—
Voice resolution	`gencmd.voiceRequest` (theme voice / named speaker + per-card overrides)	—
Take select	`takes.SelectVO(fs,ws,line,take)`, `takes.SelectMusic(fs,ws,take)`; `VOSelected`/`MusicSelected`/`Next*`/`HasVOSelected`	no `ListVOTakes`/`ListMusicTakes`
Async harness	`startGeneration`/`runJob`/`getJob`/`serveFile`/`pickSingleSlot` (0013)	cost + ETA are image-shaped

3. The studio surface (proposed)

VO is per storyboard line = the card index, so VO endpoints live under the card (line = n). Music is per-workspace.

Per-card VO:
- POST /api/v1/workspace/{slug}/cards/{n}/vo/generate {takes?} → 202 + job + cost (generate N VO takes for card n's narration).
- GET /api/v1/workspace/{slug}/cards/{n}/vo/takes → take paths for the audition list.
- POST /api/v1/workspace/{slug}/cards/{n}/vo/pick {take} → SelectVO → vo/NN.mp3.

Per-reel music:
- POST /api/v1/workspace/{slug}/music/generate {takes?} → 202 + job + cost.
- GET /api/v1/workspace/{slug}/music/takes → take paths.
- POST /api/v1/workspace/{slug}/music/pick {take} → SelectMusic → music.mp3.

Bytes are served by the existing GET …/file/{path...} (works for mp3 as for png; the 0015-D3 ReadSeeker streaming covers audio scrubbing too). No new serve route.

4. Core refactors (the C4 deltas)

4a. FS-explicit audio gen (0015-D4). Refactor VOTakes/MusicTakes to the GenerateInto(ctx, p, fs, req) shape so they write to the active worktree fs (in-memory capable, consistent with image gen). Concretely: extend gencmd.Request with audio targets and route them through GenerateInto:
- Target: add TargetVO, TargetMusic (alongside TargetCard/Cover/Portrait).
- Request: add Line int (VO, 1-based) (Count already exists).
- GenerateInto's resolve dispatches audio targets to an audio writer (the storeTakes-style loop: N calls to Synthesize/Compose, each written via the VO/music take slot) rather than genTakesFS (which is the image 1-call→N-images fan-out).

This keeps ONE studio Generator.Generate(ctx, fs, req) seam + ONE gencmd entry for every media kind (the established "one mechanism" altitude); the differing generation loop (image fan-out vs audio per-call) stays hidden inside gencmd.

The legacy VOTakes/MusicTakes (CLI) become thin wrappers over the FS-explicit core (over p.FS), exactly as CardTakes wraps the card path — so CLI behaviour is unchanged + parity-tested.

4b. takes listing. Add takes.ListVOTakes(fs, ws, line) and takes.ListMusicTakes(fs, ws) (reusing the package's listTakes helper), mirroring ListCardTakes.

4c. VO pick is (line, take). pickSingleSlot (0013) is single-slot (cover/portrait/music fit it). VO needs a (line, take) variant: a small pickCardVO handler that parses NN-T.mp3 and calls SelectVO(fs, dir, line, take). Music uses pickSingleSlot with SelectMusic.
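The parsing step in the VO pick can be sketched as follows. The function name `parseVOTake` is hypothetical, and the sketch assumes both numbers are positive decimal integers separated by a single hyphen, which is how the `NN-T.mp3` naming reads; the real handler may validate differently.

```go
package main

import (
	"fmt"
	"path"
	"strconv"
	"strings"
)

// parseVOTake extracts (line, take) from a take path such as "vo/takes/03-2.mp3".
func parseVOTake(p string) (line, take int, err error) {
	name, ok := strings.CutSuffix(path.Base(p), ".mp3")
	if !ok {
		return 0, 0, fmt.Errorf("%q: not an .mp3 take", p)
	}
	left, right, ok := strings.Cut(name, "-")
	if !ok {
		return 0, 0, fmt.Errorf("%q: want NN-T.mp3", p)
	}
	if line, err = strconv.Atoi(left); err != nil || line < 1 {
		return 0, 0, fmt.Errorf("%q: bad line %q", p, left)
	}
	if take, err = strconv.Atoi(right); err != nil || take < 1 {
		return 0, 0, fmt.Errorf("%q: bad take %q", p, right)
	}
	return line, take, nil
}

func main() {
	fmt.Println(parseVOTake("vo/takes/03-2.mp3"))
	fmt.Println(parseVOTake("music/takes/1.mp3"))
}
```

A handler built on this would also want to check that the parsed line matches the `{n}` in the request path before calling SelectVO, so a pick for card 3 cannot select a take that belongs to another line.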

5. Cost (the audio axis, 0015-D4)

The 0013 estimator is per-image (PerImage × Images). Audio is paid on a different axis (ElevenLabs: VO ≈ per-character, music ≈ per-second/length). Generalize the cost model to carry the unit:
- Cost gains a Unit ("image" | "vo" | "music") and the count field is generalized (or a parallel estimateAudioCost).
- Config keys: providers.voice.price_per_take (or per-1k-chars) + providers.music.price_per_take (or per-second), with conservative defaults. These are a configurable estimate cue, never a billed figure (same posture as the image price).
- Disclosed up front + actual on the job (0013-D2).

The job store's image-shaped ETA (Images × etaPerImage) is generalized to a per-unit eta (VO/music clips take seconds each, looped — so eta = Count × etaPerUnit).

6. The audition UI

Audio audition differs from image: takes are played, not thumbnailed. Two paths:
- Generalize MediaGen with a renderTake snippet so a take tile is an <img> (image) or an <audio controls> (audio): one engine, per-kind tile. Keeps the generate→poll→gallery→pick + cost cue + localStorage re-attach machinery shared.
- The card editor (CardMedia area) gains a VO take block per card (generate / audition / pick), beside the illustration block. Associated (workspace panel) gains a music block (generate / audition / pick the bed).

Rec: generalize MediaGen (the gallery is the only image-specific bit; the rest is kind-neutral). Selected-VO/music presence already surfaces in the card-list asset indicators (R-UI-16, the ♪ dot) — those light from HasVOSelected / the selected bed.

7. Decisions to confirm before building

1. Generator seam shape (4a). Extend gencmd.Request/GenerateInto with audio targets (one seam, one entry; rec) vs a parallel AudioGenerator seam. The generation loops differ (image fan-out vs audio per-call); confirm hiding that inside gencmd behind the single Generate seam is the right altitude.
2. Audio cost axis (5). A generalized Cost.Unit + per-kind price keys (rec), vs a separate audio estimator. And the pricing unit: per-take (simplest, a flat estimate cue) vs per-character/per-second (truer but needs char/length counting). Rec: per-take flat estimate for v1 (a cue, configurable), refine later.
3. Audition UI (6). Generalize MediaGen with a per-kind take tile (rec) vs a separate AudioGen component.
4. VO endpoint placement (3). Under the card (cards/{n}/vo/*, line = card index; rec) vs a standalone vo/{line}/*. Confirm line ≡ card index in the studio.
5. Music take length. Candidate beds at the CLI's defaultBedLen (~35s) then the render requests the VO-driven total (as today), vs generating at a guessed total. Rec: keep the CLI's candidate-bed model (parity).
6. In-memory + TTL prune (0013-D7 parity). Audio takes land in vo/takes / music/takes on the active worktree; for in-memory projects the same RAM-bloat + grace-TTL-prune story as image takes applies. Rec: reuse the same policy.

8. Testing / DoD

Failing test → code → green just ci. Fake voice/music providers (no spend, no network) for the studio handler tests + a godog scenario (VO generate→pick + music generate→pick via the fakes; the paid path stays env-gated integration). Parity test that the FS-explicit VO/music core matches the CLI VOTakes/MusicTakes output. -race over the job paths. Docs: the studio component page (audio takes section). /simplify + /code-review. The real ElevenLabs path stays INT_TEST=1-gated.