What is a first-last frame (frames-first) pipeline in AI video generation?
Last updated June 26, 2026
A frames-first pipeline locks the key static images of a shot — typically a start frame and an end frame — before any video is generated, then asks the model to interpolate motion between those anchors. You direct the stills to approved quality first, so the video stage inherits locked composition, character, and lighting instead of guessing them.
The order is the whole point: frames first, then video. You design and approve the static images that define a shot — a start frame, often an end frame, sometimes mid-beat references — and only then trigger generation, which fills in the motion between them. Because the anchors are fixed, the model interpolates inside a constrained visual envelope rather than improvising the entire shot from a text prompt.
A typical run looks like this: (1) generate the start frame and iterate until composition, character, and lighting are right; (2) generate the end frame so the shot has a defined destination; (3) write a short motion prompt describing what happens between them (camera move, action, beat); (4) pick a clip duration; (5) generate and review, then re-anchor if the interpolation drifts.
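The five steps above can be sketched as a small data structure plus a review loop. This is a minimal illustration, not invideo's implementation: `ShotSpec`, `validate`, `run_shot`, and the `generate` and `approve` callables are all hypothetical names standing in for whatever image and video tools you actually use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ShotSpec:
    """One shot in a frames-first run (all field names are hypothetical)."""
    start_frame: str            # path or ID of the approved start still (step 1)
    end_frame: Optional[str]    # approved end still, or None if not used (step 2)
    motion_prompt: str          # what happens between the anchors (step 3)
    duration_s: float           # requested clip length in seconds (step 4)
    mid_refs: List[str] = field(default_factory=list)  # optional mid-beat references


def validate(shot: ShotSpec) -> List[str]:
    """Return the problems that should block video generation."""
    problems = []
    if not shot.start_frame:
        problems.append("start frame not locked")
    if not shot.motion_prompt.strip():
        problems.append("motion prompt is empty")
    if shot.duration_s <= 0:
        problems.append("duration must be positive")
    return problems


def run_shot(
    shot: ShotSpec,
    generate: Callable[[ShotSpec], str],
    approve: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Step 5: generate and review, retrying up to max_attempts times.

    `generate` and `approve` are caller-supplied stand-ins for a video
    model call and a human (or automated) review. Returning None signals
    that the interpolation kept drifting and the shot needs re-anchoring.
    """
    problems = validate(shot)
    if problems:
        raise ValueError("; ".join(problems))
    for _ in range(max_attempts):
        clip = generate(shot)
        if approve(clip):
            return clip
    return None
```

The point of the structure is that video generation cannot start until the anchors exist: `run_shot` refuses a shot whose start frame or motion prompt is missing, which mirrors the "frames first, then video" order.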
The contrast with text-to-video is control vs freedom. Text-to-video takes a sentence and invents every frame, which is fast but unstable across cuts — characters shift, lighting wanders, framing won't match the next shot. Frames-first trades that openness for shot-by-shot intent: the stills are the contract, the video stage executes against it. That's why it pairs naturally with storyboarding — your storyboard panels effectively become the anchor frames.
Inside invideo, the invideo agent runs this as a structured pipeline rather than a single button. You create portraits and character sheets with image models like Recraft and Nano Banana, generating four options per asset and locking the best one, then hand those locked frames to a video model (Seedance 2.0, Kling, or Veo), with the agent routing each shot to the model that fits it. Newer video models extend the same idea: Seedance 2.0 reference-to-video accepts character and location references alongside the start frame, carrying more context across the interpolation than a start/end frame pair alone, and Kling generates multi-shot sequences natively so a single anchor pass can drive several beats. invideo has all of these models available, so the choice happens inside one project instead of across separate tools.
Why filmmakers run it this way: locking visuals before motion is the step that prevents consistency problems for the rest of the film. In one documented 70-second short, four options were generated per character sheet and environment reference, the best was locked, and the entire film was produced for $750 over 2 days with two consistent characters and no LoRA. A 3-minute animated episode followed the same order: 11 reference images locked first, then 164 video clips generated against them, of which 41 (25%) made the final cut, at $315 per finished minute. The $750 figure is a total for the 70-second short, which works out to roughly $643 per finished minute, so across these two productions the cost per finished minute runs from about $315 to about $643. The common factor is that frames were approved before video ran.
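To compare the two productions on the same basis, it helps to convert the figures to cost per finished minute and to a clip keep rate. The sketch below uses only the numbers quoted above; the function names are illustrative, and the episode total assumes a runtime of exactly 3 minutes.

```python
def cost_per_finished_minute(total_cost_usd: float, runtime_seconds: float) -> float:
    """Total production cost divided by finished runtime in minutes."""
    return total_cost_usd * 60 / runtime_seconds


def keep_rate(clips_in_final_cut: int, clips_generated: int) -> float:
    """Fraction of generated clips that survive into the final cut."""
    return clips_in_final_cut / clips_generated


# 70-second short produced for $750 in total.
short_per_minute = cost_per_finished_minute(750, 70)   # about 642.86

# 3-minute episode quoted at $315 per finished minute
# (total assumes the runtime is exactly 3 minutes).
episode_total = 315 * 3                                 # 945

# 41 of 164 generated clips made the final cut.
episode_keep_rate = keep_rate(41, 164)                  # 0.25

print(f"Short: ${short_per_minute:.2f} per finished minute")
print(f"Episode: ${episode_total} total, keep rate {episode_keep_rate:.0%}")
```

On a per-minute basis the short comes in at about $643, which is lower than its $750 total because its runtime is longer than one minute. The episode's keep rate means roughly three clips were discarded for every one used, which is worth budgeting for when estimating generation cost.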
As Hridaye, invideo's creative director, puts it: "Before I build assets, four things will change every frame: The Girl: What does she look like? What era? The Entity: Closer to Bathsheba? The Toy: Doll, ball, something else? The Deliverable: The frames first, then video? These four answers unlock everything."
One nuance worth knowing: newer multi-shot video models reduce how many anchor frames you actually need — a single storyboard frame can drive a 15-second sequence with internal cuts — so frames-first today is less about literally setting every start-and-end pair, and more about the discipline of locking visual identity (characters, world, look) before any motion is generated.