
What is reference-to-video chaining in AI filmmaking and how does it work?

Last updated June 26, 2026

Reference-to-video chaining is the technique of building one continuous AI shot across multiple generated segments by clipping the end of each segment and feeding it back as the reference for the next — alongside character and location references — so camera movement, framing, lighting, and identity carry across segment boundaries seamlessly.

Here is how the loop runs end to end. You generate your first segment in a reference-to-video model (Seedance 2.0 reference-to-video is the current workhorse for this), attaching your character sheet and a location plate so identity and world are locked. When that segment comes back, you clip its final beat (the closing seconds of motion, not just the last frame) and re-upload it. That clipped tail becomes the new reference for the next segment, bundled again with the same character and location references plus the prompt for what happens next. Repeat the loop until the take is complete. Because the model receives a moving clip of the prior segment instead of a single frame, it picks up the camera's trajectory, the lighting at that moment, and the actor's pose, so the next segment continues the shot rather than restarting it.
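The loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not invideo's or Seedance's actual API: `ChainStep`, `run_chain`, `generate`, and `clip_tail` are hypothetical names, the model call is a stand-in stub, and the 3-second tail length is an arbitrary choice (the only guidance here is "the closing seconds of motion"). The one real tool referenced is ffmpeg, whose `-sseof` input option seeks relative to the end of a file.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ChainStep:
    prompt: str           # text prompt for this beat
    character_sheet: str  # character reference to attach for this beat


def tail_clip_command(src: str, dst: str, tail_seconds: float = 3.0) -> List[str]:
    """Build an ffmpeg command that keeps only the closing seconds of src."""
    # -sseof seeks relative to the end of the input, so a negative offset
    # keeps a short moving clip rather than a single last frame.
    return ["ffmpeg", "-y", "-sseof", f"-{tail_seconds}", "-i", src, dst]


def run_chain(
    steps: List[ChainStep],
    location_plate: str,
    generate: Callable[[str, Dict[str, str]], str],  # hypothetical model call
    clip_tail: Callable[[str], str],                 # hypothetical tail clipper
) -> List[str]:
    """Generate segments in order, feeding each clipped tail into the next call."""
    segments: List[str] = []
    tail: Optional[str] = None
    for step in steps:
        refs = {"character": step.character_sheet, "location": location_plate}
        if tail is not None:
            refs["prior_tail"] = tail  # carries motion and camera continuity
        segment = generate(step.prompt, refs)
        segments.append(segment)
        tail = clip_tail(segment)
    return segments


# Stand-ins so the sketch runs without a model or ffmpeg installed.
calls: List[Dict[str, str]] = []


def fake_generate(prompt: str, refs: Dict[str, str]) -> str:
    calls.append(dict(refs))
    return f"segment_{len(calls)}.mp4"


def fake_clip_tail(segment: str) -> str:
    return segment.replace(".mp4", "_tail.mp4")


steps = [
    ChainStep("vampire walks into frame", "sheet.png"),
    ChainStep("camera follows him across the plain", "sheet.png"),
]
segments = run_chain(steps, "location.png", fake_generate, fake_clip_tail)
```

In a real pipeline, `clip_tail` would run the list returned by `tail_clip_command` (for example through `subprocess.run`) and `generate` would call whichever reference-to-video service you use. Each step carries its own character sheet path, so the reference can change between beats if the character does.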

invideo is an agentic video creation platform that routes every current generation model (Runway, Veo, Kling, Seedance 2.0) through one agent, and that agent orchestrates the chain for you. It holds the character sheets, scouts and stores the location references (it can pull real-world landmark plates from the internet directly into the project), attaches the lighting, color, and character context to each call, and re-attaches the clipped tail to Seedance 2.0 reference-to-video for the next segment. invideo's creative team describes the loop this way: "Once that's done, you clip it, and now you re-upload that to Agent 1. And Agent 1 then attaches that to Seed Dance reference to video, and continues the next whole sequence in one seamless continuous take."

Why chaining beats the older one-take methods: start-frame/end-frame extension only sees a single still, so it has no information about the camera's motion vector, the actor's micro-movement, or the scene's lighting state at the cut point — identity and movement drift within a couple of segments. The extend feature improves on that but still won't accept character and location references simultaneously, so a vampire carrying a buddy across a dead earth in segment 3 won't necessarily look like the same vampire from segment 1. Reference-to-video chaining accepts all three inputs in every pass — prior video tail (for motion and camera continuity), character sheet (for identity), location plate (for world) — plus the text prompt for the new beat. That is the unified bit: appearance + motion + context locked in the same call, then chained.

A few practical rules that make chaining hold up:

  • Per-beat character sheets when the character changes. If your character picks up a trinket, changes costume, or accumulates damage across the take, generate a separate character sheet for each beat and swap the reference at the corresponding chain step. One documented continuous take had the character adding a new trinket in every city, which required a fresh sheet per sequence — "Juicebox keeps adding a trinket onto himself in every different city. So we needed different character sheets for every single sequence."
  • Clip the tail with motion still alive in it. Re-upload a closing chunk of video where the camera is still moving and the subject is mid-action — that motion is what the next segment inherits. A frozen final frame collapses the chain back to a start-frame workflow.
  • Keep the location and lighting references constant across the chain. The character sheet locks identity; the location plate and color/light context lock the world. Swap those only when the scene legitimately changes.
  • Expect ~3 generations per usable segment. Across one documented production, the average was three generations per usable shot, and roughly 25% of clips made the final cut — so plan budget around overgeneration as a deliberate line item, not waste.
  • Where chaining breaks: motion conflict. If your text prompt asks for a camera move that fights the motion baked into the clipped tail (e.g. tail is a slow dolly-in, prompt says hard whip-pan), the model gets contradictory signals and the segment drifts. Either align the prompt with the tail's trajectory or cut and restart the chain at that beat.
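The overgeneration figures above translate into a quick budget estimate. The sketch below is a rough planning aid, not a production formula: the function names and the 8-segment take are hypothetical, and treating the two reported figures (three generations per usable shot, roughly 25% of clips kept) as a lower and an upper estimate is an assumption made here for illustration.

```python
def generations_from_ratio(segments: int, gens_per_usable: int = 3) -> int:
    """Estimate total generations from the generations-per-usable-shot average."""
    return segments * gens_per_usable


def generations_from_keep_rate(segments: int, keep_percent: int = 25) -> int:
    """Estimate total generations from the share of clips that make the cut."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-segments * 100 // keep_percent)


segments_needed = 8  # hypothetical take length, in chained segments
low = generations_from_ratio(segments_needed)       # 8 * 3 = 24
high = generations_from_keep_rate(segments_needed)  # 8 / 0.25 = 32
print(f"Plan for roughly {low} to {high} generations")
```

Multiplying either figure by your per-generation cost gives the line item to budget before the chain starts.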

This is the technique that finally makes long continuous AI takes feasible. "Earlier, we could technically still do one-long takes with AI, with your start frame, end frame. But it had no context of anything apart from the frame that you are uploading," says Hridaye, invideo's creative director. With the prior segment's closing motion in context, the camera language, framing, and atmosphere carry over, and that is what "seamless" actually means in practice.


