Why don't start/end frame methods work for keeping multiple AI video characters consistent?

Last updated June 26, 2026

Start/end frame methods fail for multi-character continuity because diffusion models sample fresh from a probability distribution every generation — your reference frame is a hint, not a binding identity contract. They're also structurally serial (one visual thread, one clip at a time), so two characters in contact get blended, and parallel agents working different threads produce divergent states no frame-bridge can reconcile.

Use reference-to-video with locked character sheets and an environment plate instead — that's the workflow that actually holds identity across a continuous take. Here's why the older approach breaks and what to do about it.

Why start/end frame methods break for multi-character work

Each generation is an independent sample. The model treats your start and end images as guidance toward a distribution, not as a memory of who the character is, so across clips faces shift, outfits blend, and small details (a scar, an earring) drift. When two characters share a frame in physical contact (ropes, props, a carry), the model has even less to anchor on and morphs identities between them. Multi-character contact shots break models faster than almost any other scenario.

Start/end frame also assumes one continuous visual thread. The instant your start and end images differ in lighting, angle, or subject position, you get morphing artifacts in the middle. There's no character state registry being passed between calls — only two images and a prompt.

Why this kills parallel multi-agent production

Start/end frame is serial by construction: one character, one segment, one clip. The moment you put a creative producer agent on the script, a DOP agent on coverage, and parallel agents on different character threads, each thread produces its own visual state. Frame-bridging cannot reconcile them: the vampire your DOP agent generated at 2pm doesn't match the vampire another agent generated at 4pm, and stitching their start/end frames just averages the difference. One director, after running six agents in parallel, put it this way: "All of us are working with the invideo agent, so it doesn't really matter where we are." That only holds because every agent inherits the same locked character sheets from a shared creative producer agent, not because frame-bridging stitched them together.

What to do instead

  1. Lock character sheets before any video generation. Build a multi-angle reference per character (front, 3/4, side, back, face close-up, and close-ups of small details). One production locked two characters across a 70-second short film with no LoRA, using character sheets and persistent agent context alone. Budget roughly 5 generations to lock one character — about $9.78 per character lock at current rates.

  2. Generate a fresh character sheet per beat when the character changes. If a character picks up a trinket, changes outfit, or accumulates damage across a sequence, the old sheet stops matching — one production needed a new sheet for every city in a continuous take because the character collected a new trinket each stop.

  3. Switch to reference-to-video for continuous takes. Reference-to-video accepts the prior clip plus character references plus a location plate simultaneously, so camera movement, framing, and identity carry across segment boundaries. Clip the end of each segment, re-upload it, and the invideo agent attaches it to Seedance 2.0 reference-to-video alongside the locked character sheets to generate the next seamless segment. The extend feature can stretch duration but won't accept character or location references — that's the structural gap.

  4. Pick the model for the shot. Where multi-character coverage matters, Kling generates multi-shot sequences natively; Seedance 2.0 reference-to-video carries character context across clips; Veo handles ingredient-style compositions. Every one of these runs inside invideo — the invideo agent routes each shot to the right model, so you're not switching platforms per character.

  5. Plan for the yield. Even with this stack, expect about 3 generations per usable shot and a selection rate of roughly 25% of all clips generated. One Arcane-style episode shipped 41 of 164 generated clips (25%) and stitched 17 final shots from two or more generations each. Overgeneration is a budget line, not a failure.
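
The budget figures in steps 1 and 5 can be turned into a quick planning calculation. The sketch below is only an illustration: the function names are hypothetical, the dollar figure and clip counts are the ones quoted above, and it assumes your own selection rate will resemble the quoted episode's, which may not hold for your project.

```python
import math

# Figures quoted in the article; everything else is illustrative.
COST_PER_CHARACTER_LOCK_USD = 9.78  # roughly 5 generations per locked character
SHIPPED_CLIPS = 41
GENERATED_CLIPS = 164


def selection_rate(shipped: int, generated: int) -> float:
    """Fraction of generated clips that made the final cut."""
    if generated <= 0 or not 0 <= shipped <= generated:
        raise ValueError("need 0 <= shipped <= generated and generated > 0")
    return shipped / generated


def clips_to_generate(clips_needed: int, rate: float) -> int:
    """Clips to budget for so that `clips_needed` survive selection at `rate`."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    return math.ceil(clips_needed / rate)


def character_lock_budget(characters: int,
                          cost_per_lock: float = COST_PER_CHARACTER_LOCK_USD) -> float:
    """Approximate spend, in USD, to lock a cast of `characters` sheets."""
    return round(characters * cost_per_lock, 2)


if __name__ == "__main__":
    rate = selection_rate(SHIPPED_CLIPS, GENERATED_CLIPS)
    print(f"selection rate: {rate:.0%}")
    print(f"clips to generate for 12 keepers: {clips_to_generate(12, rate)}")
    print(f"lock budget for 2 characters: ${character_lock_budget(2):.2f}")
```

Treat the output as a floor rather than a quote: step 2 means a character who changes across beats needs more than one lock, so multiply the per-character figure by the number of sheets you expect to build.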

The one-take constraint, specifically

A one-take requires every character to remain visually identical along a single continuous timeline. Start/end frame can't guarantee that across agent boundaries because each agent's generation is an independent sample with no shared identity state. The fix is a shared context layer — locked character sheets, a locked environment plate, and reference-to-video chaining — held by one creative producer agent that grounds every downstream agent in the same assets.
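
One way to picture the shared context layer is as a small registry that every agent reads from before it requests a segment. The sketch below is a hypothetical data model, not invideo's or Seedance's API: the class names, field names, asset IDs, and the shape of the request dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Views listed in step 1; the identifiers themselves are made up.
REQUIRED_VIEWS = ("front", "three_quarter", "side", "back",
                  "face_closeup", "detail_closeup")


@dataclass(frozen=True)
class CharacterSheet:
    character: str
    beat: str        # story beat this sheet is valid for (step 2)
    asset_id: str    # hypothetical handle for the uploaded sheet
    views: Tuple[str, ...] = REQUIRED_VIEWS

    def missing_views(self) -> List[str]:
        return [v for v in REQUIRED_VIEWS if v not in self.views]


@dataclass
class SharedContext:
    """Assets held by the producer and handed to every downstream agent."""
    environment_plate: str
    sheets: Dict[Tuple[str, str], CharacterSheet] = field(default_factory=dict)

    def lock(self, sheet: CharacterSheet) -> None:
        missing = sheet.missing_views()
        if missing:
            raise ValueError(f"{sheet.character} sheet is missing views: {missing}")
        self.sheets[(sheet.character, sheet.beat)] = sheet

    def segment_request(self, characters: List[str], beat: str,
                        prior_clip_tail: Optional[str] = None) -> dict:
        # A KeyError here means no sheet was locked for that character and beat.
        refs = [self.sheets[(name, beat)].asset_id for name in characters]
        return {
            "character_refs": refs,
            "environment_plate": self.environment_plate,
            "prior_clip_tail": prior_clip_tail,  # None for the first segment
        }


ctx = SharedContext(environment_plate="plate_castle_v1")
ctx.lock(CharacterSheet("vampire", "beat_1", "sheet_vampire_b1"))
ctx.lock(CharacterSheet("hunter", "beat_1", "sheet_hunter_b1"))

first = ctx.segment_request(["vampire", "hunter"], "beat_1")
second = ctx.segment_request(["vampire", "hunter"], "beat_1",
                             prior_clip_tail="segment_01_tail.mp4")
```

Because both requests are built from the same registry, two agents working at different times get identical character and environment references, and only the prior-clip tail changes from one segment to the next. A start/end frame call, by contrast, carries just two images and a prompt.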

These are the structural reasons start/end methods fall short for multi-character continuity — your exact stack will depend on whether you're doing scene-to-scene coverage or a single continuous take.

Watch some of these to see what works for you:

Why Reference-to-Video beats start/end frame for one-take shots
How multi-character contact shots break AI and what fixes them

Earlier, we could technically still do one-long takes with AI, with your start frame, end frame. But it had no context of anything apart from the frame that you are uploading.

— Hridaye, invideo's creative director
