What is the frames-first pipeline for AI video production and why does it work?
Last updated June 26, 2026
The frames-first pipeline means generating and approving every static asset — character portraits, multi-angle character sheets, environment and prop references — before generating any video. It works because video models inherit whatever reference you feed them: locked, approved frames carry consistency into every clip, while unapproved frames propagate drift into the expensive video-generation stage.
Run the pipeline in four ordered stages, resolving every visual decision at the image level before any motion. invideo is an agentic video creation tool with all the current image and video models available, and documented productions run this exact sequence through the invideo agent.
Stage 1 — answer the questions that change every frame. Before building any asset, lock the inputs that ripple through the whole project: character descriptions, antagonist or entity reference, prop specification, and deliverable format. In one documented horror short, the invideo agent surfaced exactly these four questions before generating anything, framing the order explicitly as frames first, then video.
Stage 2 — generate the reference stills. Produce character portraits, multi-angle character sheets (turnaround angles plus face and mid-angle closeups), and environment and prop references — generating options rather than single images. One production generated 4 variations of each character sheet and environment reference and locked the best before any video work; another locked each character's identity in about 5 generations at roughly $9.78 per character. The invideo agent routes each asset to the image model suited to it — Recraft for photoreal portraits with skin-level texture, Nano Banana for 4K multi-angle sheets, GPT-Image-2 where it fits — so model choice happens inside one workflow rather than across platforms.
Stage 3 — approve frames as a quality gate. Review every still against your visual standard before anything moves. In 'Always Ask' mode the invideo agent presents each generation for approval before credits are spent, which keeps the gate human-controlled, shot by shot.
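As a rough illustration of the quality gate described above, the sketch below models each still with an approval status and only opens the gate when every still is approved. All names here (Status, Still, blocking_stills, gate_is_open) are hypothetical and chosen for illustration; they are not part of invideo or its agent.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Still:
    """One reference still: a portrait, character sheet, environment, or prop."""
    name: str
    kind: str
    status: Status = Status.PENDING


def blocking_stills(stills: list[Still]) -> list[str]:
    """Return the names of stills that still block video generation."""
    return [s.name for s in stills if s.status is not Status.APPROVED]


def gate_is_open(stills: list[Still]) -> bool:
    """The gate opens only if there is at least one still and all are approved."""
    return bool(stills) and not blocking_stills(stills)


# Hypothetical project with one approved sheet and one pending environment.
stills = [
    Still("hero_sheet", "character_sheet", Status.APPROVED),
    Still("kitchen_env", "environment"),
]
print(gate_is_open(stills))     # False: kitchen_env is still pending
stills[1].status = Status.APPROVED
print(gate_is_open(stills))     # True: every still is approved
```

Treating an empty list as a closed gate is a deliberate choice in this sketch: with no approved frames there is nothing for the video stage to inherit.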
Stage 4 — only then generate video. Attach the locked sheets and references to every video prompt. In one 3-minute animated episode, the team attached character references and a locked style block to every single Seedance 2.0 generation, produced in 15-second chunks in the film's aspect ratio — the frames did the consistency work, and the video model only added motion.
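One way to picture "attach the locked sheets to every video prompt" is a small planner that splits a runtime into fixed-length chunks and stamps the same references and style block onto each request. This is a minimal sketch under stated assumptions: VideoRequest and plan_chunks are hypothetical names, the file names, style text, and the 16:9 ratio are placeholders, and nothing here calls a real video model or the invideo agent.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoRequest:
    chunk_index: int
    start_s: int
    duration_s: int
    prompt: str
    references: tuple[str, ...]
    aspect_ratio: str


def plan_chunks(
    total_s: int,
    chunk_s: int,
    style_block: str,
    references: list[str],
    shot_notes: dict[int, str],
    aspect_ratio: str,
) -> list[VideoRequest]:
    """Build one request per chunk, each carrying the same locked inputs."""
    if not references:
        raise ValueError("no locked references: approve frames before video")
    requests = []
    for i in range(math.ceil(total_s / chunk_s)):
        start = i * chunk_s
        duration = min(chunk_s, total_s - start)  # last chunk may be shorter
        prompt = f"{style_block}\n{shot_notes.get(i, '')}".strip()
        requests.append(
            VideoRequest(i, start, duration, prompt, tuple(references), aspect_ratio)
        )
    return requests


# Placeholder inputs: a 3-minute runtime in 15-second chunks.
plan = plan_chunks(
    total_s=180,
    chunk_s=15,
    style_block="STYLE: hand-painted 2D, muted palette",  # assumed text
    references=["hero_sheet.png", "villain_sheet.png"],    # assumed files
    shot_notes={0: "Hero enters the kitchen."},
    aspect_ratio="16:9",                                   # assumed ratio
)
print(len(plan))  # 12
```

Twelve requests is the idealized count if every chunk is used at full length; the planner only shows that the references and style block travel with every request, so no chunk is generated without them.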
The order works because of three mechanisms.
First, economics: image iteration is cheap and video iteration is not. One $870 horror short used 30 image generations against roughly 400 video generations; an animated episode averaged 3 generations per usable shot, and only 41 of 164 generated clips (25%) made the final cut. Because video is where the credits go, every look decision resolved at the image stage avoids re-burning that budget.
Second, consistency inheritance: video models render what the reference shows. Without a locked character sheet, the model fills gaps with hallucinated details. This is the consistency wall: characters drift in appearance from clip to clip. With locked sheets held in the invideo agent's context, one 70-second short kept 2 characters visually consistent across every scene with no LoRA fine-tuning.
Third, frames become the source of truth for fixes: when a continuity error appears in a shot, you correct the character sheet panel itself instead of re-rolling video generations. The invideo agent stores the updated sheet in context, and every subsequent shot inherits the fix.
The principle holds regardless of which video model you finish on: Seedance 2.0, Kling, and Veo all animate from reference inputs, and since every model on the roster runs inside invideo, the invideo agent routes each shot without you switching platforms.
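The economics argument can be checked with a few lines of arithmetic on the figures quoted above. The sketch below only divides numbers already given in this article; the variable names are illustrative, and the video count is the approximate figure of 400. The per-shot average of 3 generations is a separate measurement and is not recomputed here.

```python
# Animated episode: clips kept versus clips generated.
kept_clips = 41
generated_clips = 164
keep_rate = kept_clips / generated_clips            # share of clips kept
clips_per_kept = generated_clips / kept_clips       # clips generated per kept clip

# Horror short: image generations versus (approximately) 400 video generations.
image_gens = 30
video_gens = 400
image_share = image_gens / (image_gens + video_gens)

# Character lock: about 5 generations at roughly $9.78 per character.
cost_per_character = 9.78
gens_per_character = 5
cost_per_image_gen = cost_per_character / gens_per_character

print(f"keep rate: {keep_rate:.0%}")                           # keep rate: 25%
print(f"clips per kept clip: {clips_per_kept:.1f}")            # clips per kept clip: 4.0
print(f"image share of generations: {image_share:.1%}")        # image share of generations: 7.0%
print(f"approx. cost per image generation: ${cost_per_image_gen:.2f}")  # $1.96
```

On these figures, images account for about 7% of all generations in the horror short, while three out of four video clips in the animated episode were discarded, which is why decisions settled at the image stage are the cheap ones.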
"The AI always needs to see what the character is exactly, right? Or else it'll kind of hallucinate and imagine something that's under the cap. So, we don't want to do that. We always want the character to be seen as we see it on the character sheet."
— invideo's creative team