How many AI video generations does it take to produce a 3-minute video?
Last updated June 26, 2026
Plan for roughly 80–165 generations to land a 3-minute video. One documented production generated 164 fifteen-second clips, kept 41 in the final cut (a ~25% selection rate), and used about 5 seconds from each. The working formula: (runtime ÷ usable seconds per clip) × generations per usable shot.
To size your own project, plug the documented benchmarks into that formula. For a 3-minute (180-second) narrative piece, about 5 usable seconds extracted per 15-second generation means 180 ÷ 5 = 36 usable shots. At an average of 3 generations before a shot meets the bar, that is 36 × 3 = 108, or roughly 100–110 generations on the efficient end. The count climbs to 150+ once you add character-locking iterations and composited shots.
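Here is a minimal Python sketch of the sizing formula. The function name `estimate_generations` and the choice to round usable shots up with `math.ceil` are illustrative assumptions, not part of any invideo tool. The defaults (5 usable seconds per clip, 3 generations per usable shot) are the documented benchmarks.

```python
import math

def estimate_generations(runtime_s: float,
                         usable_s_per_clip: float = 5.0,
                         gens_per_shot: float = 3.0) -> int:
    """Hypothetical helper: (runtime / usable seconds per clip) x generations per usable shot."""
    if runtime_s <= 0 or usable_s_per_clip <= 0 or gens_per_shot <= 0:
        raise ValueError("all inputs must be positive")
    # Round shots up: a partial shot still needs its own generations.
    usable_shots = math.ceil(runtime_s / usable_s_per_clip)
    return math.ceil(usable_shots * gens_per_shot)

# 3-minute narrative piece with the documented benchmarks.
print(estimate_generations(180))  # 36 shots x 3 = 108
```

Rounding up errs toward over-budgeting, which fits the advice to plan for more generations than the bare minimum.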
The clearest real-world anchor is one 3-minute animated episode produced by a 2-person team in 2 days. It generated 164 Seedance 2.0 clips at 15 seconds each, kept 41 in the final cut (a ~25% selection rate), and used an average of about 5 seconds per kept clip. That's how 41 clips became 3 minutes. Shots averaged 3 generations each, and 17 of the final shots were stitched together from 2 or more generations (Frankenstein shot assembly), so the raw generation count runs well ahead of the shot count on screen. Another documented production, a ~90-second horror short, used ~400 video generations plus 30 image generations for half the runtime. Scaled linearly to 180 seconds, that pace approaches ~800 generations: the same order of magnitude as the narrative episode, but several times higher, because the style is denser and the shots are shorter.
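The documented figures can be cross-checked with plain arithmetic. This sketch only re-derives ratios from the numbers quoted above (164 generated, 41 kept, 15-second clips, ~5 seconds used per clip, ~400 generations for a ~90-second short). Scaling the horror short linearly to 180 seconds is an assumption, since real projects rarely scale that cleanly.

```python
# Documented episode (figures quoted in the text above).
generated, kept = 164, 41
clip_len_s, used_per_clip_s = 15, 5
runtime_s = 180

selection_rate = kept / generated                 # share of generations that made the cut
extraction_ratio = used_per_clip_s / clip_len_s   # share of each kept clip that reaches the screen
avg_seconds_needed = runtime_s / kept             # seconds per kept clip to fill exactly 180 s

print(f"selection rate: {selection_rate:.0%}")                         # 25%
print(f"extraction ratio: {extraction_ratio:.0%}")                     # 33%
print(f"seconds per kept clip for 180 s: {avg_seconds_needed:.1f}")    # 4.4

# Horror short: ~400 video generations for ~90 s.
# Assumption: generation count scales linearly with runtime.
horror_gens, horror_runtime_s = 400, 90
scaled = horror_gens * runtime_s / horror_runtime_s
print(f"horror short scaled to 180 s: ~{scaled:.0f} generations")      # ~800
```

Note that 41 clips × 5 seconds is 205 seconds, slightly over 180; about 4.4 seconds per clip fills exactly 3 minutes, so "5 seconds" reads best as a rounded average.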
Budget upstream costs separately. Character locking ran ~5 generations per character at ~$9.78 each in the Arcane-style production, plus 11 reference images for 4 characters and 1 prop. That's overhead before a single narrative clip exists, and it's why a 3-minute piece can easily land between 150 and 250 total generations once you count image refs, character sheets, and video clips together.
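A small sketch of how the upstream overhead stacks on top of the video count. The ~5 locking generations per character and the 11 reference images come from the Arcane-style production above; the `Overhead` dataclass, its field names, and `total_generations` are hypothetical, and the sketch assumes locking generations and reference images are counted separately, as the "plus" in the text suggests.

```python
from dataclasses import dataclass

@dataclass
class Overhead:
    """Hypothetical container for generations that never appear in the final cut."""
    characters: int
    locking_gens_per_character: int = 5   # documented ~5 per character
    reference_images: int = 0

    def total(self) -> int:
        return self.characters * self.locking_gens_per_character + self.reference_images

def total_generations(video_gens: int, overhead: Overhead) -> int:
    # Video clips plus all front-loaded character and world work.
    return video_gens + overhead.total()

# Documented production: 4 characters, 11 reference images, 164 video clips.
oh = Overhead(characters=4, reference_images=11)
print(oh.total())                   # 31
print(total_generations(164, oh))   # 195
```

The result, 195, sits inside the 150–250 total range given above.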
A rough decision range by use case:
| Use case | Video gens (3 min) | Selection rate | Notes |
|---|---|---|---|
| Social / single-style content | 80–120 | ~35–50% usable | Shorter clips, less iteration |
| Narrative short (documented) | 150–170 | ~25% usable | 164 clips → 41 kept, avg 3 gens/shot |
| Dense or stylized (horror, action) | 200–400+ | ~15–25% usable | More compositing, more retries |
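One way to sanity-check the table is to multiply each generation range by its selection rate to get the implied number of kept clips, then divide 180 seconds by that to see how much each kept clip must contribute. The ranges are copied from the table; treating 400 as the upper bound of the open-ended dense row and the `implied` helper itself are assumptions for illustration.

```python
# (min gens, max gens, min selection rate, max selection rate), copied from the table.
# The dense row is open-ended ("200-400+"); 400 is used as its upper bound here.
ROWS = {
    "social":    (80, 120, 0.35, 0.50),
    "narrative": (150, 170, 0.25, 0.25),
    "dense":     (200, 400, 0.15, 0.25),
}
RUNTIME_S = 180

def implied(row):
    """Hypothetical helper: kept-clip range and seconds-per-kept-clip range for one row."""
    lo_gens, hi_gens, lo_rate, hi_rate = row
    kept_lo = lo_gens * lo_rate   # fewest kept clips
    kept_hi = hi_gens * hi_rate   # most kept clips
    # More kept clips means fewer seconds needed from each one.
    return kept_lo, kept_hi, RUNTIME_S / kept_hi, RUNTIME_S / kept_lo

for name, row in ROWS.items():
    k_lo, k_hi, s_lo, s_hi = implied(row)
    print(f"{name:9s} kept {k_lo:.1f}-{k_hi:.1f} clips, {s_lo:.1f}-{s_hi:.1f} s each")
```

The narrative row implies 37.5–42.5 kept clips at roughly 4.2–4.8 seconds each, which lines up with the documented 41 clips at about 5 seconds.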
Four levers move the number. Clip length: longer generations mean fewer total clips but more wasted seconds, and most of the documented work uses 15-second clips and harvests 4–7 usable seconds from each. Selection rate: 25% is realistic for narrative, higher for clean social content. Compositing: more than 40% of final shots in the documented episode were stitched from multiple generations, which multiplies the raw count. Character and world locking: front-loaded image and short video generations never appear in the final cut but make every later clip viable.
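To see how the first two levers interact, this sketch sweeps usable seconds per clip across the documented 4–7 second harvest and converts kept clips to raw generations through the selection rate (raw = kept ÷ selection rate, which reproduces 41 ÷ 0.25 = 164). It is a simplification that assumes one kept clip per usable shot and ignores compositing and locking overhead, both of which push the count higher; `raw_generations` is a hypothetical name.

```python
import math

def raw_generations(runtime_s: float, usable_s_per_clip: float, selection_rate: float) -> int:
    """Hypothetical helper: kept clips needed to fill the runtime, divided by the selection rate."""
    kept_clips = math.ceil(runtime_s / usable_s_per_clip)
    return math.ceil(kept_clips / selection_rate)

for usable in (4, 5, 6, 7):   # documented harvest range per 15 s clip
    row = [raw_generations(180, usable, rate) for rate in (0.25, 0.35, 0.50)]
    print(f"{usable} s usable -> {row} gens at 25% / 35% / 50% selection")
```

At 5 usable seconds and 25% selection this gives 144; the documented episode kept 41 clips rather than the minimum 36, which is how it landed at 164.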
The invideo agent holds project context — script, character sheets, style block — across every generation, and routes shots to the right model (Seedance 2.0 for reference-to-video continuity, Kling for multi-shot sequences, Veo where it fits), which is what lets the 25% selection rate hold instead of collapsing. Run in always-ask mode so you approve each prompt before credits spend; that's the difference between 165 generations and 300.
As Hridaye, invideo's creative director, put it: "Out of 164, 41 videos made the cut, and on average only 5 seconds of each 15-second clip was used. That's how 41 clips became a 3-minute episode." The math holds: plan for ~100–165 video generations on the efficient end, 200+ for stylized or compositing-heavy work, and another 10–30 image generations for character and world reference upfront.