One copy-paste prompt that turns a single raw clip into a punchy, high-retention 16:9 YouTube video — fully autonomously. It finds the hook, opens with a freeze-frame cast intro, snap-cuts for retention, burns in word-by-word captions, adds real stat cards, and mixes crisp clicks + dings under the voice. Works cold on any machine.
Use Node 22 and make sure ffmpeg + a transcription tool (whisper / faster-whisper) are installed.
Put your raw clip in a folder and open your agent (Claude Code, or any agent) there.
Paste the prompt below, attach the clip, and say “go.”
Get two files back — one with SFX, one without — in an /output folder next to the clip.
The prompt
You are an expert video editor. I'll give you ONE raw talking-head or walk-and-talk clip. Turn it into a punchy, high-retention 16:9 LANDSCAPE YOUTUBE video (1920x1080 — horizontal, NOT a vertical Reel or TikTok). Work end-to-end and autonomously. Don't make me specify who's on camera, timings, or settings — that's your job. Only ask about a genuinely creative fork; otherwise pick the strongest option, tell me what you chose, and keep going.
WHAT TO DO
1. Transcribe to word-level timestamps using whatever transcription tool is installed (whisper / whisper.cpp / faster-whisper / an API / a CLI).
2. Find the HOOK (single strongest/most surprising line), the NARRATIVE ARC (setup -> escalation -> climax -> button), and the dead air / filler / flubs to cut.
3. Identify who's on camera from frames, and who actually SAYS the hook (whose mouth moves) — open on THAT person, even if they're not the main speaker. Nametags: use names I gave, else ask once, else label tastefully ("THE GUEST"); never invent a real name.
4. Frame-accurate ffmpeg cuts; kill dead air; reorder so the HOOK is first. Tight ~30-45s, ~15-20 cuts. Keep the canvas 16:9 1920x1080 (do NOT crop to vertical).
5. Signature open: hook plays first, then a ~2s FREEZE where each person is introduced with a bold nametag + a "ding", then the video resumes. Bake the freeze into the base (held frame + silent gap) so it stays one clip. Re-encode the base at constant 30fps with dense keyframes (-g 30 -keyint_min 30).
6. Animated layer (HTML/CSS/GSAP — HyperFrames or Remotion):
- RETENTION CUTS: the framing HARD-SNAPS between a tight punch-in and a looser frame on every cut and on stressed keywords, so each cut reads like a fresh "split" and resets the eye every ~2-3s. Alternate tight<->loose so the subject never teleports in place; land each snap on a clause start or right before the stressed word, never mid-word; on long pause-free lines add an extra punch-in snap on the keyword.
- CAPTIONS: bold condensed font (Anton), UPPERCASE, white with a black outline + shadow, in the bottom safe area, 2-3 words at a time, word-by-word pop, with keyword color pops (one accent; numbers/money in a second; brand names in a third).
- STAT CARDS (when relevant): research 2-4 REAL, citable stats; show as count-up number cards, CENTERED in the gap between faces, compact, never covering anyone's face, with a tiny source line.
- Nametags slam in on the freeze; fade from/to black at the ends.
7. SOUND: synthesize crisp UI sounds — clicks + dings (no cheesy swishes). PEAK-NORMALIZE each so it's audible (~-2 dB). Soft click on each caption cut, click on punch-ins, ding on each stat card, ding-ding on the freeze reveal, bright ding on the climax. Mix UNDER the voice with a limiter; nothing clips, voice stays on top.
ROBUSTNESS (or renders fail)
- Bundle dependencies LOCALLY: download the animation library (GSAP) and the font files into the project and reference them by local path, NOT a CDN. Render machines drop network; a missing CDN fetch silently fails the render ("gsap is not defined") or swaps the font for a generic fallback.
- Use a constant-frame-rate, dense-keyframe base video, or seeking breaks and frames freeze.
VERIFY before sending: lint/validate clean; render high quality / 30fps; extract stills at the hook, each nametag, each stat card, and the climax and CONFIRM captions are the correct condensed font (not a wide fallback) and no face is covered; confirm the freeze is silent except the dings and SFX are audible without clipping. Deliver two files — one without SFX, one with — to an /output folder next to the source clip, plus a one-line summary and any defaults you chose.
DEFAULTS unless I say otherwise: 16:9 1920x1080 landscape; single clean hook by whoever says it; hook first -> freeze cast intro -> edit; no title card; dark look, condensed headline font, one warm + one hot accent; clicks + dings loud but under the voice; ~30-45s, fast and snappy. The video is attached — go.
A free kit from TheVibeFounder. Use it for anything — no attribution required.