How does frames-first differ from text-to-video generation?

Text-to-video invents every frame from a prompt, which is fast but produces inconsistent characters and shifting lighting across cuts. Frames-first trades that freedom for shot-by-shot control by using approved stills as fixed anchors the video model executes against.

Do I need to set a start and end frame for every single shot?

Not necessarily. Newer multi-shot video models like Seedance 2.0 can drive a 15-second sequence with internal cuts from a single storyboard frame. The discipline is locking visual identity before generating motion, not literally anchoring every start-and-end pair.

What does a frames-first workflow cost in practice?

Documented productions using frames-first pipelines have landed at roughly $315–$750 per finished minute. A 70-second short was produced for $750 over two days, and a 3-minute animated episode cost $315 per finished minute with 11 reference images locked before any video ran.

What Is a Frames-First Pipeline in AI Video?

Q: What is a frames-first pipeline in AI video generation?

A frames-first pipeline locks a start frame and end frame before any video is generated, then asks the model to interpolate motion between those anchors. This ensures composition, character, and lighting are approved before the video stage begins.

Q: Which video models support a frames-first approach inside invideo?

invideo routes shots to Seedance 2.0, Kling, or Veo depending on the shot type. Seedance 2.0 accepts character and location references alongside the start frame, and Kling generates multi-shot sequences natively, so a single anchor pass can drive several beats.

A frames-first pipeline locks the key static images of a shot — typically a start frame and an end frame — before any video is generated, then asks the model to interpolate motion between those anchors. You direct the stills to approved quality first, so the video stage inherits locked composition, character, and lighting instead of guessing them.

The order is the whole point: frames first, then video. You design and approve the static images that define a shot — a start frame, often an end frame, sometimes mid-beat references — and only then trigger generation, which fills in the motion between them. Because the anchors are fixed, the model interpolates inside a constrained visual envelope rather than improvising the entire shot from a text prompt.

A typical run looks like this: (1) generate the start frame and iterate until composition, character, and lighting are right; (2) generate the end frame so the shot has a defined destination; (3) write a short motion prompt describing what happens between them (camera move, action, beat); (4) pick a clip duration; (5) generate and review, then re-anchor if the interpolation drifts.
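If it helps to see that run as a checklist, here is a minimal Python sketch. It is not an invideo API: `ShotSpec`, its fields, and the file name are hypothetical names made up for illustration. It only records what has been approved for one shot and reports which steps are still open before video generation should run. It models the typical start-plus-end case described above, so a workflow that drives a shot from a single anchor would relax the end-frame check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShotSpec:
    """Hypothetical record for one shot; not an invideo API object."""
    start_frame: Optional[str] = None   # path or ID of the start still
    end_frame: Optional[str] = None     # path or ID of the end still
    motion_prompt: str = ""             # what happens between the anchors
    duration_s: float = 0.0             # clip length in seconds
    start_approved: bool = False        # set once the start still is signed off
    end_approved: bool = False          # set once the end still is signed off

    def missing_steps(self) -> list[str]:
        """Return the steps still open before video generation should run."""
        missing = []
        if not (self.start_frame and self.start_approved):
            missing.append("approve start frame")
        if not (self.end_frame and self.end_approved):
            missing.append("approve end frame")
        if not self.motion_prompt.strip():
            missing.append("write motion prompt")
        if self.duration_s <= 0:
            missing.append("pick clip duration")
        return missing

    def ready_for_video(self) -> bool:
        """True only when every anchor and setting is locked."""
        return not self.missing_steps()


# Only the start frame is locked so far.
shot = ShotSpec(start_frame="shot01_start.png", start_approved=True)
print(shot.missing_steps())
# ['approve end frame', 'write motion prompt', 'pick clip duration']
```

The point of the check is the ordering: nothing reaches the video stage until the stills it depends on are approved.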

The contrast with text-to-video is control vs freedom. Text-to-video takes a sentence and invents every frame, which is fast but unstable across cuts — characters shift, lighting wanders, framing won't match the next shot. Frames-first trades that openness for shot-by-shot intent: the stills are the contract, the video stage executes against it. That's why it pairs naturally with storyboarding — your storyboard panels effectively become the anchor frames.

Inside invideo, the invideo agent runs this as a structured pipeline rather than a single button. You create portraits and character sheets with image models like Recraft and Nano Banana, generate four options per asset and lock the best one, then hand those locked frames to a video model (Seedance 2.0, Kling, or Veo), with the agent routing each shot to the model that fits it. Newer video models extend the same idea: Seedance 2.0 reference-to-video accepts character and location references alongside the start frame, carrying more context across the interpolation than a start/end frame pair alone, and Kling generates multi-shot sequences natively, so a single anchor pass can drive several beats. All of these models are available in invideo, so the choice happens inside one project instead of across separate tools.

Why filmmakers run it this way: locking visuals before motion is the step that prevents consistency problems for the rest of the film. In one documented 70-second short, four options were generated per character sheet and environment reference, the best was locked, and the entire film was produced for $750 over 2 days with two consistent characters and no LoRA. A 3-minute animated episode followed the same order — 11 reference images locked first, then 164 video clips generated against them, of which 41 made the final cut at $315 per finished minute. Across documented productions the cost-per-finished-minute range sits at roughly $315–$750, and the common factor is that frames were approved before video ran.
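The figures above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article; the function names are illustrative, and the episode total assumes the runtime is exactly 3 minutes. Note that the $750 for the short is a total for 70 seconds of finished film, so its per-minute figure works out lower than $750.

```python
def cost_per_finished_minute(total_cost_usd: float, runtime_seconds: float) -> float:
    """Total spend divided by finished runtime, expressed per minute."""
    return total_cost_usd / (runtime_seconds / 60.0)


def keep_rate(clips_used: int, clips_generated: int) -> float:
    """Share of generated clips that reach the final cut."""
    return clips_used / clips_generated


# 70-second short produced for $750 in total.
print(round(cost_per_finished_minute(750, 70), 2))  # 642.86

# 3-minute episode at $315 per finished minute implies this total spend
# (assuming the runtime is exactly 3 minutes).
print(315 * 3)  # 945

# 41 of 164 generated clips made the final cut.
print(keep_rate(41, 164))  # 0.25
```

A keep rate of one clip in four is a useful planning number: budget for the clips you generate, not only the ones that survive the edit.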

As Hridaye, invideo's creative director, puts it: "Before I build assets, four things will change every frame: The Girl: What does she look like? What era? The Entity: Closer to Bathsheba? The Toy: Doll, ball, something else? The Deliverable: The frames first, then video? These four answers unlock everything."

One nuance worth knowing: newer multi-shot video models reduce how many anchor frames you actually need. A single storyboard frame can drive a 15-second sequence with internal cuts, so frames-first today is less about literally setting every start-and-end pair and more about the discipline of locking visual identity (characters, world, look) before any motion is generated.

Watch some of these to see what works for you:

Full walkthrough: lock frames first, then generate every shot

Unedited session: generate stills, lock them, then make video

Animated short: lock character and location references before any video

