keryx: long-form video & in-browser capture (future directions)

Status: future / intent (beyond Phase 5; this leaves an architectural gap, it is not a build-ready design)
Owner: Matt Cockayne
Last updated: 2026-06-15

0. How to read this

This captures future capabilities so we leave room for them now:

Longer-form pieces — the reel grown in length: longer panels, deeper narrative, richer panel content (code-interaction clips, the occasional piece-to-camera). An extension of the reel pipeline, not a new editor.
In-browser capture — record webcam video and microphone audio directly in the studio, as an input alongside generate/upload.
AI-driven coarse edits — natural-language trims as the hard ceiling of editing keryx will ever do (§3).

It is intent + hooks, not a full spec. The core design (0001–0003) should not make these expensive to add later. The concrete decision they already forced is S3-first large-file storage (0001 §3.5) — captures and code clips routinely exceed GitLab's ~100 MB package cap.

Requirement IDs: R-LF-* (deferred).

1. Capability A: longer-form pieces (reels grown in length)

The primary intent stays reels. This expands them predominantly in length (longer panels with deeper narrative) and in panel content, not into a different tool. New panel content sources, all still composed (not edited):

- Code-interaction clips: short MP4s of terminal/code sessions, generated from a script via VHS (charmbracelet/vhs). This is a deterministic, re-runnable "generated" source, ideal for how-to material (no screen-record flakiness).
- Piece-to-camera: an occasional talking-head clip of Matt, captured in-browser (§2).
- Plus the existing generated stills/video and uploaded media.

R-LF-1 (intent) Longer-form is the same storyboard-driven composition pipeline extended (more/longer panels, more content sources), not a new timeline-edited artefact. Keep the model from hardwiring "short 9:16 reel" so length/aspect/panel-mix can grow; it remains composition only.
R-LF-2 (intent) It reuses the existing seams — VideoProvider, VoiceProvider, Renderer, themes, media store — plus a VHS code-clip source; new content + length, not new infrastructure.
R-LF-3 (intent) Posting reuses the Publisher interface; the longer pieces' natural target is YouTube (regular, not Shorts) — the existing adapter generalises to non-Short uploads.
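
As a rough illustration of the VHS code-clip source in R-LF-2, the sketch below renders a `.tape` script from a small step list. The `TapeStep` and `CodeClipSpec` shapes and both function names are hypothetical, not part of keryx or VHS; the emitted commands (`Output`, `Set`, `Type`, `Enter`, `Sleep`) are standard VHS tape commands. Where the script lives and how its output is cached remain open (§5).

```typescript
// Hypothetical step model for a code clip; these names are not from keryx or VHS.
type TapeStep =
  | { kind: "type"; text: string }
  | { kind: "enter" }
  | { kind: "sleep"; ms: number };

interface CodeClipSpec {
  output: string; // e.g. "clip.mp4"; VHS infers the format from the extension
  width: number;
  height: number;
  fontSize: number;
  steps: TapeStep[];
}

// Wrap text in a quote character it does not contain, rather than relying on escapes.
function quoteForTape(text: string): string {
  if (/[\r\n]/.test(text)) {
    throw new Error("one Type step per line; use an Enter step for newlines");
  }
  for (const q of ['"', "'", "`"]) {
    if (!text.includes(q)) return `${q}${text}${q}`;
  }
  throw new Error("text uses all three quote styles; split it into several steps");
}

// The same spec always yields the same tape, which is what makes the clip re-runnable.
function renderTape(spec: CodeClipSpec): string {
  const lines: string[] = [
    `Output ${spec.output}`,
    `Set FontSize ${spec.fontSize}`,
    `Set Width ${spec.width}`,
    `Set Height ${spec.height}`,
  ];
  for (const step of spec.steps) {
    switch (step.kind) {
      case "type":
        lines.push(`Type ${quoteForTape(step.text)}`);
        break;
      case "enter":
        lines.push("Enter");
        break;
      case "sleep":
        lines.push(`Sleep ${step.ms}ms`);
        break;
    }
  }
  return lines.join("\n") + "\n";
}
```

The rendered text would be written to a file and run with `vhs <file>.tape`. Because the tape is a pure function of the spec, the tape text itself could serve as a cache key for the resulting MP4.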

2. Capability B: in-browser capture (webcam + microphone)

The studio gains media capture (getUserMedia/MediaRecorder): record a talking-head clip or a voice take in the browser, on desktop or phone.

R-LF-4 (intent) Capture is a third media source beside generate and upload (R-UI-29): a card/segment's media, or the cover, can be captured. Captured media is treated like an upload (used as-is, no AI, no text-leak screen).
R-LF-5 (intent) Voice has a capture source too — a card's vo can be the real recorded voice instead of the ElevenLabs clone (VoiceProvider gains a "captured/recorded" source, or capture writes the selected vo/NN take directly). This is the natural answer to clone-pronunciation limits (0002 §3.2): just say the line.
R-LF-6 (intent) Capture happens client-side; the file uploads to the workspace via the existing assets API (R-API) and lands in the S3 store (captures are large — the storage rationale, 0001 §3.5).
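
A minimal sketch of the client-side half of R-LF-4 to R-LF-6, assuming the studio picks a recording format before starting `MediaRecorder` and then describes the finished file for upload. The candidate list, `pickCaptureMimeType`, and the `CaptureDescriptor` shape are illustrative assumptions; the support check is injected so the logic can run outside a browser.

```typescript
// Candidate recording formats, most preferred first. This ordering is an assumption.
const CAPTURE_MIME_CANDIDATES: readonly string[] = [
  "video/webm;codecs=vp9,opus",
  "video/webm;codecs=vp8,opus",
  "video/webm",
  "video/mp4",
];

// Returns the first candidate the browser reports as recordable, if any.
function pickCaptureMimeType(
  isSupported: (mime: string) => boolean,
  candidates: readonly string[] = CAPTURE_MIME_CANDIDATES,
): string | undefined {
  return candidates.find((mime) => isSupported(mime));
}

// Hypothetical metadata sent alongside the file to the assets API.
interface CaptureDescriptor {
  source: "captured";
  kind: "video" | "audio";
  mimeType: string;
  extension: string;
  sizeBytes: number;
  needsTextLeakScreen: false; // captures are used as-is, like uploads (R-LF-4)
}

function describeCapture(
  kind: "video" | "audio",
  mimeType: string,
  sizeBytes: number,
): CaptureDescriptor {
  // "video/webm;codecs=vp8,opus" -> "video/webm" -> "webm"
  const container = mimeType.split(";", 1)[0] ?? mimeType;
  const extension = container.split("/")[1] ?? "bin";
  return {
    source: "captured",
    kind,
    mimeType,
    extension,
    sizeBytes,
    needsTextLeakScreen: false,
  };
}
```

In the browser, the predicate would be `(mime) => MediaRecorder.isTypeSupported(mime)`, and the chosen value would be passed as the `mimeType` option when constructing the recorder. For a voice take, an audio candidate list (for example `audio/webm;codecs=opus`) could be passed as the second argument.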

3. The editing ceiling: AI-driven coarse edits only

keryx is not a video editor (0001 §2) and will not grow a timeline, keyframes, or clip-level editing UI — the overhead is huge and far better tools exist (e.g. kiru.app, a modern Rust video editor, the recommended companion for real editing). The only editing keryx permits is conversational and coarse, and this is the deliberate hard ceiling:

R-LF-12 (intent, hard limit) the chat AI may perform coarse, describable clip operations on an uploaded/captured clip — trim head/tail, cut a span, set in/out points — from natural language ("trim the first 5 seconds of clip 1 to drop the awkward pause before I speak"), executed as a single deterministic ffmpeg operation. That is the full extent.
R-LF-13 (intent, boundary) no frame-accurate timeline, multi-track editing, transitions/effects authoring, colour, or compositing UI. Anything beyond a coarse trim/cut is out of scope — point the user to a real editor. Edits are non-destructive (operate into a new take; the source is kept).
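
To make "a single deterministic ffmpeg operation" concrete, here is one possible sketch of how a parsed trim request could become an ffmpeg argument list. `CoarseTrim` and `buildTrimArgs` are hypothetical names. The sketch covers only in/out trimming (cutting a span from the middle is left out) and assumes stream copy is acceptable, which means cut points land on nearby keyframes rather than exact frames.

```typescript
// Hypothetical result of parsing a request such as
// "trim the first 5 seconds of clip 1".
interface CoarseTrim {
  inSeconds?: number;  // drop everything before this point
  outSeconds?: number; // drop everything after this point
}

function buildTrimArgs(
  sourcePath: string,
  takePath: string,
  trim: CoarseTrim,
): string[] {
  const { inSeconds, outSeconds } = trim;
  if (inSeconds === undefined && outSeconds === undefined) {
    throw new Error("nothing to trim");
  }
  if (inSeconds !== undefined && inSeconds < 0) {
    throw new Error("in point must be >= 0");
  }
  if (outSeconds !== undefined && outSeconds <= (inSeconds ?? 0)) {
    throw new Error("out point must come after the in point");
  }
  if (takePath === sourcePath) {
    // Non-destructive per R-LF-13: the edit goes into a new take.
    throw new Error("edits write a new take; the source is kept");
  }

  const args: string[] = ["-n"]; // never overwrite an existing file
  // -ss / -to placed before -i are input options, so both are source positions.
  if (inSeconds !== undefined) args.push("-ss", String(inSeconds));
  if (outSeconds !== undefined) args.push("-to", String(outSeconds));
  args.push("-i", sourcePath, "-c", "copy", takePath);
  return args;
}
```

For the example request above, `buildTrimArgs("clip1.mp4", "clip1.take2.mp4", { inSeconds: 5 })` yields the arguments for `ffmpeg -n -ss 5 -i clip1.mp4 -c copy clip1.take2.mp4`. Keeping the builder pure means the same request always produces the same command, and the chat layer never touches the source file.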

4. Architectural hooks to keep open now (the gaps)

So that these are additive later, not a rewrite, the core design should hold to:

R-LF-7 (MUST, now) S3-first storage (0001 §3.5): done; the enabler for both capabilities (longer-form content and in-browser capture).
R-LF-8 (SHOULD, now) the media-source enum is open (generated / uploaded / … ) — adding captured is a new value, not a model change.
R-LF-9 (SHOULD, now) the voice source is not assumed to be TTS-only — the vo/voice model allows a recorded take to occupy the selected slot.
R-LF-10 (SHOULD, now) artefact-type neutrality — avoid baking "reel" into shared concepts (workspace, library, project, render timeline) so a second artefact type (long-form) is additive.
R-LF-11 (SHOULD, now) the Renderer interface is timeline-shaped, not reel-shaped, so a longer/chaptered composition is a different timeline through the same seam.
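
One way to picture R-LF-8 to R-LF-11 together is the type sketch below. Every name in it is hypothetical and is not taken from the interface contracts in 0002; it only shows that a captured source, a recorded voice take, and a longer chaptered piece can all be expressed as new values flowing through the same shapes.

```typescript
// Open unions: "captured" and "recorded" are added values, not model changes.
type PanelMediaOrigin = "generated" | "uploaded" | "captured";
type VoiceOrigin = "tts" | "recorded";

interface TimelineSegment {
  mediaOrigin: PanelMediaOrigin;
  voiceOrigin?: VoiceOrigin; // absent for segments with no voice-over
  durationSeconds: number;
  chapter?: string;          // only longer pieces are expected to set this
}

// Nothing here says "reel": aspect and length are data, not assumptions.
interface CompositionTimeline {
  aspect: { width: number; height: number };
  segments: TimelineSegment[];
}

// Timeline-shaped seam: a reel and a long-form piece differ only in the input.
interface TimelineRenderer {
  render(timeline: CompositionTimeline): Promise<string>;
}

function totalSeconds(timeline: CompositionTimeline): number {
  return timeline.segments.reduce((sum, s) => sum + s.durationSeconds, 0);
}

// Start offset of each chapter, in seconds from the beginning of the piece.
function chapterStarts(
  timeline: CompositionTimeline,
): { title: string; atSeconds: number }[] {
  const starts: { title: string; atSeconds: number }[] = [];
  let elapsed = 0;
  let current: string | undefined;
  for (const segment of timeline.segments) {
    if (segment.chapter !== undefined && segment.chapter !== current) {
      starts.push({ title: segment.chapter, atSeconds: elapsed });
      current = segment.chapter;
    }
    elapsed += segment.durationSeconds;
  }
  return starts;
}
```

A short 9:16 reel would be a timeline with a few segments and no chapters; a longer piece would pass a 16:9 aspect and chaptered segments to the same `render` call.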

5. Open questions (for when this is picked up)

Composition model for longer pieces — the minimal panel/segment model for length + mixed content (VHS clip, piece-to-camera, generated panels) that stays composition, never crossing into the §3 editing ceiling.
VHS integration — drive charmbracelet/vhs as a code-clip source: where the .tape script lives (storyboard? bundle?), and caching its output.
Capture UX on mobile vs desktop; permissions; the coarse-trim affordance (the only edit, R-LF-12).
Recorded vs cloned voice as the default per project, and mixing both.
YouTube long-form posting specifics (chapters, thumbnails, longer processing) vs the Shorts path.

6. References

Storage decision that this drove: 0001-keryx.md §3.5 (S3-first).
Video generation seam + panels: 0003-video-panels.md, 0001 §3.4.
Media source + voice contracts: 0002-interface-contracts.md §3.2, R-UI-29.