keryx: long-form video & in-browser capture (future directions)
Status: future / intent (beyond Phase 5; this records architectural gaps to keep open and is not a build-ready design)
Owner: Matt Cockayne
Last updated: 2026-06-15
0. How to read this
This captures future capabilities so we leave room for them now:
- Longer-form pieces — the reel grown in length: longer panels, deeper narrative, richer panel content (code-interaction clips, the occasional piece-to-camera). An extension of the reel pipeline, not a new editor.
- In-browser capture — record webcam video and microphone audio directly in the studio, as an input alongside generate/upload.
- AI-driven coarse edits — natural-language trims as the hard ceiling of editing keryx will ever do (§3).
It is intent + hooks, not a full spec. The core design (0001–0003) should not make these expensive to add later. The concrete decision they already forced is S3-first large-file storage (0001 §3.5) — captures and code clips routinely exceed GitLab's ~100 MB package cap.
Requirement IDs: R-LF-* (deferred).
1. Capability A: longer-form pieces (reels grown in length)
The primary intent stays reels. This expands them predominantly in length —
longer panels with deeper narrative — and in panel content, not into a
different tool. New panel content sources, all still composed (not edited):
- Code-interaction clips — short MP4s of terminal/code sessions, generated
from a script via VHS (charmbracelet/vhs) — a deterministic,
re-runnable "generated" source ideal for how-to material (no screen-record
flakiness).
- Piece-to-camera — an occasional talking-head clip of Matt, captured
in-browser (§2).
- plus the existing generated stills/video and uploaded media.
- R-LF-1 (intent) Longer-form is the same storyboard-driven composition pipeline extended (more/longer panels, more content sources), not a new timeline-edited artefact. Keep the model from hardwiring "short 9:16 reel" so length/aspect/panel-mix can grow; it remains composition only.
- R-LF-2 (intent) It reuses the existing seams (VideoProvider, VoiceProvider, Renderer, themes, media store) plus a VHS code-clip source: new content and length, not new infrastructure.
- R-LF-3 (intent) Posting reuses the Publisher interface; the longer pieces' natural target is YouTube (regular, not Shorts), and the existing adapter generalises to non-Short uploads.
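As a rough illustration of the VHS code-clip source, the sketch below turns a list of scripted terminal steps into the text of a .tape file. The CodeClipStep and CodeClipOptions shapes and the buildTape helper are hypothetical names invented for this example, not part of the keryx design; only the tape commands themselves (Output, Set, Type, Enter, Sleep) come from VHS.

```typescript
// Hypothetical description of one scripted terminal step.
interface CodeClipStep {
  command: string; // text typed into the terminal
  pauseMs: number; // how long to hold on the output before the next step
}

// Hypothetical render settings for the clip.
interface CodeClipOptions {
  output: string; // e.g. "install.mp4"; VHS picks the format from the extension
  width: number;
  height: number;
  fontSize: number;
}

// VHS strings may be wrapped in double quotes, single quotes, or backticks,
// so pick a delimiter that does not occur in the text.
function quoteForTape(text: string): string {
  if (text.indexOf("\n") !== -1) {
    throw new Error("a step must be a single line");
  }
  for (const quote of ['"', "'", "`"]) {
    if (text.indexOf(quote) === -1) {
      return quote + text + quote;
    }
  }
  throw new Error("cannot quote text containing all three quote characters");
}

// Build the full tape: settings first, then Type/Enter/Sleep per step.
function buildTape(steps: CodeClipStep[], opts: CodeClipOptions): string {
  const lines: string[] = [
    `Output ${opts.output}`,
    `Set FontSize ${opts.fontSize}`,
    `Set Width ${opts.width}`,
    `Set Height ${opts.height}`,
  ];
  for (const step of steps) {
    lines.push(`Type ${quoteForTape(step.command)}`);
    lines.push("Enter");
    lines.push(`Sleep ${step.pauseMs}ms`);
  }
  return lines.join("\n") + "\n";
}
```

Writing the result to a file such as install.tape and running `vhs install.tape` would then produce the clip. Because the tape is plain text, the same steps always yield the same script, which is what makes this source re-runnable.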
2. Capability B: in-browser capture (webcam + microphone)
The studio gains media capture (getUserMedia/MediaRecorder): record a
talking-head clip or a voice take in the browser, on desktop or phone.
- R-LF-4 (intent) Capture is a third media source beside generate and upload (R-UI-29): a card/segment's media, or the cover, can be captured. Captured media is treated like an upload (used as-is, no AI, no text-leak screen).
- R-LF-5 (intent) Voice has a capture source too: a card's vo can be the real recorded voice instead of the ElevenLabs clone (VoiceProvider gains a "captured/recorded" source, or capture writes the selected vo/NN take directly). This is the natural answer to clone-pronunciation limits (0002 §3.2): just say the line.
- R-LF-6 (intent) Capture happens client-side; the file uploads to the workspace via the existing assets API (R-API) and lands in the S3 store (captures are large; see the storage rationale, 0001 §3.5).
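A minimal sketch of the client-side setup for capture, assuming the studio probes the browser for a recording format before starting MediaRecorder. The CaptureKind type, the candidate MIME list, and both helper names are assumptions made for this example; the support check is injected so the logic can be exercised outside a browser.

```typescript
// Hypothetical: the two capture flavours described above.
type CaptureKind = "video" | "voice";

// Assumed preference order, most preferred first. Support differs between
// browsers, so each candidate is probed rather than taken for granted.
const MIME_CANDIDATES: Record<CaptureKind, string[]> = {
  video: ["video/webm;codecs=vp9,opus", "video/webm", "video/mp4"],
  voice: ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"],
};

// Return the first candidate the environment accepts, or undefined to let
// the browser fall back to its own default.
function pickMimeType(
  kind: CaptureKind,
  isSupported: (mimeType: string) => boolean
): string | undefined {
  for (const candidate of MIME_CANDIDATES[kind]) {
    if (isSupported(candidate)) {
      return candidate;
    }
  }
  return undefined;
}

// Constraints object in the shape getUserMedia expects.
// facingMode "user" asks for the front camera on phones.
function captureConstraints(kind: CaptureKind): {
  audio: boolean;
  video: boolean | { facingMode: string };
} {
  return kind === "video"
    ? { audio: true, video: { facingMode: "user" } }
    : { audio: true, video: false };
}

// Browser wiring (not executed here; `upload` is a hypothetical function
// that would post the blob to the assets API):
//   const stream = await navigator.mediaDevices.getUserMedia(captureConstraints("video"));
//   const mimeType = pickMimeType("video", (t) => MediaRecorder.isTypeSupported(t));
//   const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
//   const chunks: Blob[] = [];
//   recorder.ondataavailable = (e) => chunks.push(e.data);
//   recorder.onstop = () => upload(new Blob(chunks, { type: recorder.mimeType }));
//   recorder.start();
```

Recording stays entirely in the browser until the recorder stops, at which point a single blob is handed to the upload path, matching the "treated like an upload" intent.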
3. The editing ceiling: AI-driven coarse edits only
keryx is not a video editor (0001 §2) and will not grow a timeline, keyframes, or clip-level editing UI — the overhead is huge and far better tools exist (e.g. kiru.app, a modern Rust video editor, the recommended companion for real editing). The only editing keryx permits is conversational and coarse, and this is the deliberate hard ceiling:
- R-LF-12 (intent, hard limit) the chat AI may perform coarse, describable clip operations on an uploaded/captured clip (trim head/tail, cut a span, set in/out points) from natural language ("trim the first 5 seconds of clip 1 to drop the awkward pause before I speak"), executed as a single deterministic ffmpeg operation. That is the full extent.
- R-LF-13 (intent, boundary) no frame-accurate timeline, multi-track editing, transitions/effects authoring, colour, or compositing UI. Anything beyond a coarse trim/cut is out of scope; point the user to a real editor. Edits are non-destructive (operate into a new take; the source is kept).
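To make the R-LF-12 ceiling concrete, here is one possible shape for the deterministic step: the chat AI resolves the request to in/out points, and a small function turns those into a single ffmpeg argument list. TrimRequest and buildTrimArgs are hypothetical names for this sketch. It covers head/tail trims and in/out points only; cutting a span out of the middle is not shown. Stream copy (`-c copy`) is an assumption chosen because it cuts near keyframes, which suits a coarse edit.

```typescript
// Hypothetical resolved form of a natural-language trim request.
interface TrimRequest {
  source: string;  // existing take; never modified
  target: string;  // new take to write
  startSec: number; // in point, seconds from the start of the source
  endSec?: number;  // out point; omit to keep everything to the end
}

// Build the argument list for one ffmpeg invocation.
function buildTrimArgs(req: TrimRequest): string[] {
  if (req.target === req.source) {
    throw new Error("edits are non-destructive: target must differ from source");
  }
  if (req.startSec < 0) {
    throw new Error("startSec must not be negative");
  }
  if (req.endSec !== undefined && req.endSec <= req.startSec) {
    throw new Error("endSec must be greater than startSec");
  }
  // -n: never overwrite an existing output file.
  // -ss before -i: seek in the input to the in point.
  const args: string[] = ["-n", "-ss", String(req.startSec), "-i", req.source];
  if (req.endSec !== undefined) {
    // -t as an output option: keep this many seconds after the in point.
    args.push("-t", String(req.endSec - req.startSec));
  }
  // -c copy: copy streams without re-encoding (cuts land near keyframes).
  args.push("-c", "copy", req.target);
  return args;
}
```

Under these assumptions, "trim the first 5 seconds of clip 1" would resolve to a request with startSec 5 and no endSec, and the resulting arguments are passed to ffmpeg as a single process call. The source file is left untouched and the output becomes a new take.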
4. Architectural hooks to keep open now (the gaps)
So that these are additive later, not a rewrite, the core design should hold to:
- R-LF-7 (MUST, now) S3-first storage (0001 §3.5). Done; it is the enabler for both capabilities.
- R-LF-8 (SHOULD, now) the media-source enum is open (generated/uploaded/ … ): adding captured is a new value, not a model change.
- R-LF-9 (SHOULD, now) the voice source is not assumed to be TTS-only: the vo/voice model allows a recorded take to occupy the selected slot.
- R-LF-10 (SHOULD, now) artefact-type neutrality: avoid baking "reel" into shared concepts (workspace, library, project, render timeline) so a second artefact type (long-form) is additive.
- R-LF-11 (SHOULD, now) the Renderer interface is timeline-shaped, not reel-shaped, so a longer/chaptered composition is a different timeline through the same seam.
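One way to picture hooks R-LF-8 to R-LF-11 together is the type sketch below. Every name in it (PanelMediaSource, VoiceTake, TimelineSegment, RenderTimeline, TimelineRenderer) is an assumption made for illustration, not the actual keryx model; the point is only that a captured source, a recorded voice take, and a longer artefact all fit without changing the shapes.

```typescript
// R-LF-8: adding "captured" extends the union; nothing else changes.
type PanelMediaSource = "generated" | "uploaded" | "captured";

// R-LF-9: the selected voice slot can hold a TTS take or a recorded one.
interface VoiceTake {
  source: "tts" | "recorded";
  path: string;
  durationSec: number;
}

// One panel/segment of a composition.
interface TimelineSegment {
  mediaSource: PanelMediaSource;
  mediaPath: string;
  durationSec: number;
  voice?: VoiceTake;
}

// R-LF-10 / R-LF-11: the timeline names its artefact type and dimensions
// instead of assuming a short 9:16 reel.
interface RenderTimeline {
  artefactType: string; // "reel" today; "long-form" would be another value
  width: number;
  height: number;
  segments: TimelineSegment[];
  chapters?: { title: string; startSec: number }[];
}

// The renderer seam takes any timeline and returns the output path.
interface TimelineRenderer {
  render(timeline: RenderTimeline): Promise<string>;
}

// Works for a reel and a long-form piece alike.
function totalDurationSec(timeline: RenderTimeline): number {
  return timeline.segments.reduce((sum, seg) => sum + seg.durationSec, 0);
}

// Assumption: only generated media passes through AI-side screening;
// uploaded and captured media are used as-is.
function needsScreening(source: PanelMediaSource): boolean {
  return source === "generated";
}
```

With shapes like these, a long-form piece is simply a RenderTimeline with more segments, different dimensions, and optional chapters, passed through the same renderer seam as a reel.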
5. Open questions (for when this is picked up)
- Composition model for longer pieces — the minimal panel/segment model for length + mixed content (VHS clip, piece-to-camera, generated panels) that stays composition, never crossing into the §3 editing ceiling.
- VHS integration: drive charmbracelet/vhs as a code-clip source; where the .tape script lives (storyboard? bundle?), and caching its output.
- Capture UX on mobile vs desktop; permissions; the coarse-trim affordance (the only edit, R-LF-12).
- Recorded vs cloned voice as the default per project, and mixing both.
- YouTube long-form posting specifics (chapters, thumbnails, longer processing) vs the Shorts path.
6. References
- Storage decision that this drove: 0001-keryx.md §3.5 (S3-first).
- Video generation seam + panels: 0003-video-panels.md, 0001 §3.4.
- Media source + voice contracts: 0002-interface-contracts.md §3.2, R-UI-29.