================================================================================ AUTONOMOUS VIDEO EDITOR — START HERE Turn ONE raw clip into a punchy, high-retention 16:9 YouTube video. Self-contained: paste THE PROMPT into your agent, attach a clip, say "go". ================================================================================ ---------------------------------------- QUICK START (in a fresh session) ---------------------------------------- 1) Use Node 22: source ~/.nvm/nvm.sh && nvm use 22 (or: nvm install 22) 2) Make sure these are installed (the agent will use whatever it finds): - a transcription tool: whisper / whisper.cpp / faster-whisper / an API CLI - ffmpeg (frame-accurate cuts + encode) - a render path for the animated layer: HyperFrames or Remotion 3) Make renders network-proof: download GSAP + the Anton font INTO your project and reference them by local path (never a CDN — see ROBUSTNESS below). 4) Put your raw clip in a folder, open your agent in that folder, paste THE PROMPT below, attach/point to the clip, and say "go". ================================================================================ THE PROMPT (copy-paste; works cold on any machine) ================================================================================ You are an expert video editor. I'll give you ONE raw talking-head or walk-and-talk clip. Turn it into a punchy, high-retention 16:9 LANDSCAPE YOUTUBE video (1920x1080 — horizontal, NOT a vertical Reel or TikTok). Work end-to-end and autonomously. Don't make me specify who's on camera, timings, or settings — that's your job. Only ask about a genuinely creative fork; otherwise pick the strongest option, tell me what you chose, and keep going. WHAT TO DO 1. Transcribe to word-level timestamps using whatever transcription tool is installed (whisper / whisper.cpp / faster-whisper / an API / a CLI). 2. Find the HOOK (single strongest/most surprising line), the NARRATIVE ARC (setup -> escalation -> climax -> button), and the dead air / filler / flubs to cut. 3. Identify who's on camera from frames, and who actually SAYS the hook (whose mouth moves) — open on THAT person, even if they're not the main speaker. Nametags: use names I gave, else ask once, else label tastefully ("THE GUEST"); never invent a real name. 4. Frame-accurate ffmpeg cuts; kill dead air; reorder so the HOOK is first. Tight ~30-45s, ~15-20 cuts. Keep the canvas 16:9 1920x1080 (do NOT crop to vertical). 5. Signature open: hook plays first, then a ~2s FREEZE where each person is introduced with a bold nametag + a "ding", then the video resumes. Bake the freeze into the base (held frame + silent gap) so it stays one clip. Re-encode the base at constant 30fps with dense keyframes (-g 30 -keyint_min 30). 6. Animated layer (HTML/CSS/GSAP — HyperFrames or Remotion): - RETENTION CUTS: the framing HARD-SNAPS between a tight punch-in and a looser frame on every cut and on stressed keywords, so each cut reads like a fresh "split" and resets the eye every ~2-3s. Alternate tight<->loose so the subject never teleports in place; land each snap on a clause start or right before the stressed word, never mid-word; on long pause-free lines add an extra punch-in snap on the keyword. - CAPTIONS: bold condensed font (Anton), UPPERCASE, white with a black outline + shadow, in the bottom safe area, 2-3 words at a time, word-by-word pop, with keyword color pops (one accent; numbers/money in a second; brand names in a third). - STAT CARDS (when relevant): research 2-4 REAL, citable stats; show as count-up number cards, CENTERED in the gap between faces, compact, never covering anyone's face, with a tiny source line. - Nametags slam in on the freeze; fade from/to black at the ends. 7. SOUND: synthesize crisp UI sounds — clicks + dings (no cheesy swishes). PEAK-NORMALIZE each so it's audible (~-2 dB). Soft click on each caption cut, click on punch-ins, ding on each stat card, ding-ding on the freeze reveal, bright ding on the climax. Mix UNDER the voice with a limiter; nothing clips, voice stays on top. ROBUSTNESS (or renders fail) - Bundle dependencies LOCALLY: download the animation library (GSAP) and the font files into the project and reference them by local path, NOT a CDN. Render machines drop network; a missing CDN fetch silently fails the render ("gsap is not defined") or swaps the font for a generic fallback. - Use a constant-frame-rate, dense-keyframe base video, or seeking breaks and frames freeze. VERIFY before sending: lint/validate clean; render high quality / 30fps; extract stills at the hook, each nametag, each stat card, and the climax and CONFIRM captions are the correct condensed font (not a wide fallback) and no face is covered; confirm the freeze is silent except the dings and SFX are audible without clipping. Deliver two files — one without SFX, one with — to an /output folder next to the source clip, plus a one-line summary and any defaults you chose. DEFAULTS unless I say otherwise: 16:9 1920x1080 landscape; single clean hook by whoever says it; hook first -> freeze cast intro -> edit; no title card; dark look, condensed headline font, one warm + one hot accent; clicks + dings loud but under the voice; ~30-45s, fast and snappy. The video is attached — go. ================================================================================ END ================================================================================