AI Filmmaking

Why do AI video models fail when characters physically touch or interact?

Last updated June 26, 2026

AI video models fail on physical-contact shots because they reconstruct characters frame by frame from latent representations, with no persistent identity carried between frames. When two bodies overlap, the model has to guess which pixels belong to which character: limbs blend, textures swap, props pass through hands, and identities drift. Occlusion, stochastic sampling, and multi-subject ambiguity compound one another, and contact is where all three peak.

The root cause is how diffusion video models actually work: every frame is a fresh sample from a latent space, with no explicit record of which pixels belonged to which character a frame ago. The model re-derives the scene at each step from the prompt, the reference, and the previous frame's appearance, but appearance is all it carries forward: there is no stored "this is character A's left hand" token traveling through the clip. Add a second body in contact and the model has to resolve ambiguous pixel attribution at the contact boundary: whose arm is in front, whose texture wins where they overlap, which silhouette owns which limb when they merge into one blob of pixels. That is when you see the classic failures: limb misattribution (an arm grafted onto the wrong torso), texture bleed (one character's costume creeping onto the other), prop pass-through (ropes, juice boxes, weapons phasing through hands), and identity swap mid-shot.

Three factors compound at the moment of contact.

Occlusion ambiguity: when one character is partially behind another, the model has to invent the hidden geometry every frame, and the inventions don't agree across frames, so the hidden shoulder pops in and out of existence.

Stochastic per-frame sampling: each frame is a new roll of the dice constrained by the last one. Small attribution errors at frame N get baked in as "truth" at frame N+1, and drift accumulates fastest exactly where the signal is weakest, in the contact zone.

Multi-subject degradation: error rate rises with subject count even without contact. Contact effectively fuses two subjects into one ambiguous region, so the model treats it as a single chaotic subject and resolves it inconsistently.

invideo's creative director put it bluntly: "Multi-character consistency (ropes, props, bodies in contact) breaks models faster than anything else." This isn't a one-model problem: Runway, Veo, Kling, and Seedance 2.0 all degrade on the same axis, just at different rates. Seedance 2.0's reference-to-video gives you more grip than start/end-frame methods because it carries character and location context across segments, but contact still strains it.
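The drift described under stochastic per-frame sampling can be illustrated with a toy calculation. This is a minimal sketch, not a model of any real video generator: it assumes each frame independently keeps a region's attribution with a fixed probability and that an error, once made, is never corrected. The function name, the frame count, and the per-frame rates are all made up for illustration.

```python
def surviving_attribution(per_frame_keep: float, frames: int) -> float:
    """Chance that a region is still correctly attributed after `frames` steps.

    Toy assumption: every frame keeps the attribution with probability
    `per_frame_keep`, independently, and errors are never undone.
    """
    if not 0.0 <= per_frame_keep <= 1.0:
        raise ValueError("per_frame_keep must be between 0 and 1")
    if frames < 0:
        raise ValueError("frames must be non-negative")
    return per_frame_keep ** frames


FRAMES = 120  # hypothetical clip length, e.g. 5 seconds at 24 fps

# Hypothetical per-frame reliability: high in a clear region,
# slightly lower where two bodies overlap.
clear_region = surviving_attribution(0.999, FRAMES)
contact_zone = surviving_attribution(0.98, FRAMES)

print(f"clear region: {clear_region:.2f}")
print(f"contact zone: {contact_zone:.2f}")
```

Under these invented numbers, a per-frame gap of about two percentage points turns into roughly 89% versus roughly 9% survival by the end of the clip, which is the sense in which small errors compound fastest in the contact zone.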

There are working mitigations, and they all reduce the ambiguity the model has to resolve rather than asking it to do better at the contact frame itself. Lock each character on their own multi-angle character sheet before any video generation, so the model has a clean identity reference per body. Where the configuration is unusual (one character carrying another, ropes, intertwined props), pre-build a single fused reference image showing the exact physical arrangement, then feed that to video generation as the visual anchor: the model now has the contact already resolved as a static fact, not something it has to solve mid-sample. Generate in shorter chunks at the contact moments and keep only the seconds where attribution held. Across documented productions, only about 25% of generated clips made the final cut, and 17 of one episode's final shots were stitched from two or more generations. If a shot still won't resolve, break coverage: shoot the contact as two single-subject shots and cut between them, which is the same trick live-action directors use when an effects shot won't hold.
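To get a feel for what a keep rate of about 25% means when budgeting generations, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each generation is an independent attempt with the same chance of being usable; real attempts are not independent, and the source figure is a clip-level keep rate rather than a measured per-attempt probability. The function names are hypothetical.

```python
import math


def expected_generations(keep_rate: float) -> float:
    """Average number of attempts per usable clip (geometric distribution)."""
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return 1.0 / keep_rate


def prob_at_least_one(keep_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` generations is usable."""
    if not 0.0 <= keep_rate <= 1.0 or attempts < 0:
        raise ValueError("keep_rate must be in [0, 1] and attempts >= 0")
    return 1.0 - (1.0 - keep_rate) ** attempts


def attempts_for_confidence(keep_rate: float, target: float) -> int:
    """Smallest number of attempts whose success chance reaches `target`."""
    if not 0.0 < keep_rate <= 1.0 or not 0.0 < target < 1.0:
        raise ValueError("keep_rate must be in (0, 1] and target in (0, 1)")
    if keep_rate == 1.0:
        return 1
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - keep_rate))


KEEP_RATE = 0.25  # "about 25%" from the text, treated as a per-attempt rate

print(expected_generations(KEEP_RATE))          # 4.0
print(f"{prob_at_least_one(KEEP_RATE, 3):.2f}")  # 0.58
print(attempts_for_confidence(KEEP_RATE, 0.9))  # 9
```

Under these assumptions a shot averages four generations per keeper, and reaching a 90% chance of at least one usable take needs nine attempts, which is part of why shorter chunks and stitching are attractive.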

invideo is an agentic video tool with every current video and image model behind one agent, so the routing (Seedance 2.0 reference-to-video for the contact attempt, a different model for coverage cutaways, the image stack for fused references) happens inside one conversation instead of across platforms. Across documented productions where contact shots dominated (one short film ran multi-character contact in roughly 75% of frames), the pattern that worked was: resolve the contact in a reference image first, generate against that reference, accept ~3 generations per usable shot, and, when no single take holds end-to-end, Frankenstein the keepers by stitching the best seconds from multiple generations into one shot.
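The stitching step can be sketched as a small planning problem. This is an illustrative sketch only, not invideo's method: it assumes you have already rated each second of each generation with a quality score (by eye or otherwise; the scores below are invented) and that every cut between takes carries a fixed penalty, since each cut is a chance for a visible seam. `plan_stitch` and `cut_penalty` are hypothetical names.

```python
from typing import List, Sequence


def plan_stitch(scores: Sequence[Sequence[float]],
                cut_penalty: float = 0.5) -> List[int]:
    """Pick which take to use for each second.

    scores[t][s] is the quality of take t at second s. The plan maximizes
    total score minus `cut_penalty` for every switch between takes.
    """
    if cut_penalty < 0:
        raise ValueError("cut_penalty must be non-negative")
    if not scores or not scores[0]:
        return []
    n_takes, n_secs = len(scores), len(scores[0])
    if any(len(row) != n_secs for row in scores):
        raise ValueError("all takes must cover the same number of seconds")

    # best[t] = best total so far for a plan that ends on take t
    best = [scores[t][0] for t in range(n_takes)]
    back: List[List[int]] = []  # back[s-1][t] = take used at second s-1

    for s in range(1, n_secs):
        leader = max(range(n_takes), key=lambda t: best[t])
        new_best, pointers = [], []
        for t in range(n_takes):
            stay = best[t]
            switch = best[leader] - cut_penalty  # cut in from the best take
            if stay >= switch:
                pointers.append(t)
                new_best.append(stay + scores[t][s])
            else:
                pointers.append(leader)
                new_best.append(switch + scores[t][s])
        best = new_best
        back.append(pointers)

    # Walk the pointers backwards from the best final take.
    take = max(range(n_takes), key=lambda t: best[t])
    plan = [take]
    for pointers in reversed(back):
        take = pointers[take]
        plan.append(take)
    plan.reverse()
    return plan


# Invented per-second scores for three generations of one 4-second shot.
takes = [
    [0.9, 0.9, 0.2, 0.1],  # holds early, falls apart at contact
    [0.3, 0.4, 0.8, 0.9],  # weak start, clean contact
    [0.5, 0.5, 0.5, 0.5],  # mediocre throughout
]
print(plan_stitch(takes, cut_penalty=0.3))  # [0, 0, 1, 1]
```

Raising `cut_penalty` makes the plan prefer a single take even if it is weaker in places; with a very large penalty the example falls back to using take 1 throughout.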

Watch some of these to see what works for you:

How a hand-drawn sketch of two characters in contact unblocked AI video generation
Chaining AI video segments with Seedance Reference-to-Video to hold character identity across shots

Multi-character consistency (ropes, props, bodies in contact) breaks models faster than anything else.

— Hridaye, invideo's creative director
