0016 - studio audio takes: VO + music audition & select (R-UI-9)
Status: IMPLEMENTED (studio Phase 2A, 2026-06-26). §7 decisions resolved — all per
the recommendations: (1) extend gencmd/GenerateInto with audio targets, one seam;
(2) generalized Cost.Unit + per-kind price keys, per-take flat estimate v1; (3)
generalize MediaGen with a per-kind take tile; (4) VO endpoints under the card,
line ≡ card index; (5) keep the CLI candidate-bed model; (6) reuse the 0013-D7
in-memory/TTL policy.
Date: 2026-06-26
Depends on: the Phase-1 media infra (spec 0013 — the async job/cost/gallery/pick
harness, the Generator seam, gencmd.GenerateInto), per-project config (0014), and
the audio cores (gencmd.VOTakes/MusicTakes, internal/takes, the voice/music
provider seams). Parent: spec 0015 (§5 build order: 2A first; §7-D4 already
resolved — FS-explicit audio gen + an audio cost axis).
1. Goal & scope
Surface R-UI-9 in the editor: for each card, generate VO takes, audition them, and pick one; and per reel, generate music-bed takes, audition, pick. This completes "generate every media kind in the studio" — images (0013) + now audio — and a faithful render (2B) needs selected VO (it drives card timing) + the bed.
The generation cores exist (gencmd.VOTakes/MusicTakes, ElevenLabs voice/music
providers); the studio surfaces none. 2A is new HTTP surface + an audio audition UI +
two core refactors (FS-explicit audio gen, an audio cost axis). It reuses the 0013
async harness almost wholesale; the deltas are the contention (C4) that 0015 named:
audio gen is per-line/per-bed, is not FS-explicit, and is paid on a different cost axis.
In scope: VO + music generate/list/pick/serve over the active worktree (in-memory capable); the audition UI; audio cost disclosure. Out of scope: render (2B), the contact-sheet (folds in later), TTS/music provider changes.
2. What already exists (reuse map)
| Need | Existing core | Gap |
|---|---|---|
| VO takes | gencmd.VOTakes(p, slug, line, n, theme) → vo/takes/NN-T.mp3 | bound to p.FS via slug; not FS-explicit |
| Music takes | gencmd.MusicTakes(p, slug, n, theme) → music/takes/T.mp3 | same; single bed per workspace |
| Voice/music providers | provider.VoiceProvider.Synthesize, MusicProvider.Compose; elevenlabs adapters | none |
| Voice resolution | gencmd.voiceRequest (theme voice / named speaker + per-card overrides) | none |
| Take select | takes.SelectVO(fs,ws,line,take), takes.SelectMusic(fs,ws,take); VOSelected/MusicSelected/Next*/HasVOSelected | no ListVOTakes/ListMusicTakes |
| Async harness | startGeneration/runJob/getJob/serveFile/pickSingleSlot (0013) | cost + ETA are image-shaped |
3. The studio surface (proposed)
VO is per storyboard line = the card index, so VO endpoints live under the card
(line = n). Music is per-workspace.
Per-card VO:
- POST /api/v1/workspace/{slug}/cards/{n}/vo/generate {takes?} → 202 + job + cost
(generate N VO takes for card n's narration).
- GET /api/v1/workspace/{slug}/cards/{n}/vo/takes → take paths for the audition list.
- POST /api/v1/workspace/{slug}/cards/{n}/vo/pick {take} → SelectVO → vo/NN.mp3.
Per-reel music:
- POST /api/v1/workspace/{slug}/music/generate {takes?} → 202 + job + cost.
- GET /api/v1/workspace/{slug}/music/takes → take paths.
- POST /api/v1/workspace/{slug}/music/pick {take} → SelectMusic → music.mp3.
Bytes are served by the existing GET …/file/{path...} (works for mp3 as for png;
the 0015-D3 ReadSeeker streaming covers audio scrubbing too). No new serve route.
4. Core refactors (the C4 deltas)
4a. FS-explicit audio gen (0015-D4). Refactor VOTakes/MusicTakes to the
GenerateInto(ctx, p, fs, req) shape so they write to the active worktree fs
(in-memory capable, consistent with image gen). Concretely: extend gencmd.Request
with audio targets and route them through GenerateInto:
- Target: add TargetVO, TargetMusic (alongside TargetCard/Cover/Portrait).
- Request: add Line int (VO, 1-based); Count already exists.
- GenerateInto's resolve dispatches audio targets to an audio writer (the
storeTakes-style loop: N calls to Synthesize/Compose, each written via the VO/
music take slot) rather than genTakesFS (which is the image 1-call→N-images fan-out).
Keeps ONE studio Generator.Generate(ctx, fs, req) seam + ONE gencmd entry for
every media kind (the established "one mechanism" altitude); the differing generation
loop (image fan-out vs audio per-call) stays hidden inside gencmd.
The legacy VOTakes/MusicTakes (CLI) become thin wrappers over the FS-explicit core
(over p.FS), exactly as CardTakes wraps the card path — so CLI behaviour is
unchanged + parity-tested.
4b. takes listing. Add takes.ListVOTakes(fs, ws, line) and
takes.ListMusicTakes(fs, ws) (reusing the package's listTakes helper), mirroring
ListCardTakes.
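As an illustration of the listing helpers, the sketch below globs the take directories named in the reuse map. It assumes the standard io/fs interface in place of whatever filesystem abstraction the takes package actually uses, and it does not reproduce the package's listTakes helper. listVOTakes, listMusicTakes, and demoTree are hypothetical names.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// listVOTakes returns the take files for one line under <ws>/vo/takes,
// following the NN-T.mp3 naming (for example 03-1.mp3 for line 3, take 1).
// Results come back in lexical order, so 03-10.mp3 sorts before 03-2.mp3.
// Assumption: ws contains no glob metacharacters.
func listVOTakes(fsys fs.FS, ws string, line int) ([]string, error) {
	return fs.Glob(fsys, path.Join(ws, "vo", "takes", fmt.Sprintf("%02d-*.mp3", line)))
}

// listMusicTakes returns the candidate beds under <ws>/music/takes.
func listMusicTakes(fsys fs.FS, ws string) ([]string, error) {
	return fs.Glob(fsys, path.Join(ws, "music", "takes", "*.mp3"))
}

// demoTree builds a small in-memory worktree for illustration only.
func demoTree() fstest.MapFS {
	return fstest.MapFS{
		"demo/vo/takes/03-1.mp3":  {Data: []byte("a")},
		"demo/vo/takes/03-2.mp3":  {Data: []byte("b")},
		"demo/vo/takes/04-1.mp3":  {Data: []byte("c")},
		"demo/music/takes/1.mp3":  {Data: []byte("d")},
		"demo/vo/03.mp3":          {Data: []byte("selected")},
	}
}
```

Because the helpers only read the filesystem passed in, the same code serves an on-disk worktree and an in-memory project.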
4c. VO pick is (line, take). pickSingleSlot (0013) is single-slot (cover/
portrait/music fit it). VO needs a (line, take) variant — a small pickCardVO
handler parsing NN-T.mp3 → SelectVO(fs, dir, line, take). Music uses
pickSingleSlot with SelectMusic.
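The NN-T.mp3 parsing step that pickCardVO needs could be sketched as below. parseVOTake and voTakeRe are hypothetical names, and the sketch assumes the pick body carries a take file name or path. The exact shape of the {take} field is not specified here.

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strconv"
)

// voTakeRe matches a VO take file name such as 03-2.mp3 (line 3, take 2).
var voTakeRe = regexp.MustCompile(`^(\d+)-(\d+)\.mp3$`)

// parseVOTake extracts (line, take) from a take name or path.
// Both numbers must be positive, matching the 1-based line convention.
func parseVOTake(name string) (line, take int, err error) {
	m := voTakeRe.FindStringSubmatch(path.Base(name))
	if m == nil {
		return 0, 0, fmt.Errorf("not a VO take name: %q", name)
	}
	if line, err = strconv.Atoi(m[1]); err != nil {
		return 0, 0, err
	}
	if take, err = strconv.Atoi(m[2]); err != nil {
		return 0, 0, err
	}
	if line < 1 || take < 1 {
		return 0, 0, fmt.Errorf("line and take must be >= 1 in %q", name)
	}
	return line, take, nil
}
```

A handler built on this would likely also check that the parsed line equals the card index from the URL before calling SelectVO, so a take from another card cannot be picked by mistake.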
5. Cost (the audio axis, 0015-D4)
The 0013 estimator is per-image (PerImage × Images). Audio is paid on a different
axis (ElevenLabs: VO ≈ per-character, music ≈ per-second/length). Generalize the cost
model to carry the unit:
- Cost gains a Unit ("image" | "vo" | "music") and the count field is generalized
(or a parallel estimateAudioCost). Config keys: providers.voice.price_per_take
(or per-1k-chars) + providers.music.price_per_take (or per-second), with
conservative defaults — a configurable estimate cue, never a billed figure
(same posture as the image price). Disclosed up front + actual on the job (0013-D2).
The job store's image-shaped ETA (Images × etaPerImage) is generalized to a per-unit
eta (VO/music clips take seconds each, looped — so eta = Count × etaPerUnit).
6. The audition UI
Audio audition differs from image: takes are played, not thumbnailed. Two parts:
- Generalize MediaGen with a renderTake snippet so a take tile is an <img>
(image) or an <audio controls> (audio) — one engine, per-kind tile. Keeps the
generate→poll→gallery→pick + cost cue + localStorage re-attach machinery shared.
- The card editor (CardMedia area) gains a VO take block per card (generate /
audition / pick), beside the illustration block. Associated (workspace panel)
gains a music block (generate / audition / pick the bed).
Rec: generalize MediaGen (the gallery is the only image-specific bit; the rest is
kind-neutral). Selected-VO/music presence already surfaces in the card-list asset
indicators (R-UI-16, the ♪ dot) — those light from HasVOSelected / the selected bed.
7. Decisions to confirm before building
- Generator seam shape (4a). Extend gencmd.Request/GenerateInto with audio targets (one seam, one entry; rec) vs a parallel AudioGenerator seam. The generation loops differ (image fan-out vs audio per-call); confirm hiding that inside gencmd behind the single Generate seam is the right altitude.
- Audio cost axis (5). A generalized Cost.Unit + per-kind price keys (rec), vs a separate audio estimator. And the pricing unit: per-take (simplest, a flat estimate cue) vs per-character/per-second (truer but needs char/length counting). Rec: per-take flat estimate for v1 (a cue, configurable), refine later.
- Audition UI (6). Generalize MediaGen with a per-kind take tile (rec) vs a separate AudioGen component.
- VO endpoint placement (3). Under the card (cards/{n}/vo/*, line = card index; rec) vs a standalone vo/{line}/*. Confirm line ≡ card index in the studio.
- Music take length. Candidate beds at the CLI's defaultBedLen (~35s) then the render requests the VO-driven total (as today), vs generating at a guessed total. Rec: keep the CLI's candidate-bed model (parity).
- In-memory + TTL prune (0013-D7 parity). Audio takes land in vo/takes and music/takes on the active worktree; for in-memory projects the same RAM-bloat + grace-TTL-prune story as image takes applies. Rec: reuse the same policy.
8. Testing / DoD
Failing test → code → green just ci. Fake voice/music providers (no spend, no
network) for the studio handler tests + a godog scenario (VO generate→pick + music
generate→pick via the fakes; the paid path stays env-gated integration). Parity test
that the FS-explicit VO/music core matches the CLI VOTakes/MusicTakes output.
-race over the job paths. Docs: the studio component page (audio takes section).
/simplify + /code-review. The real ElevenLabs path stays INT_TEST=1-gated.