Why do AI video models fail when two characters interact or touch in the same scene?
Last updated June 26, 2026
Multi-character contact shots break AI video models because the model has to do two things at once that it is not equipped for: hold two distinct character identity sheets simultaneously without averaging them into one face, and resolve the 3D spatial logic of bodies touching, carrying, or gripping. It has to do both with no persistent identity representation, no 3D spatial prior, and image-centric training that treats each frame independently.
The failure stacks in three layers.
Identity-averaging: when two characters share a frame, models blend their features (face shape, hair, build) into a single hybrid because there's no persistent identity representation across tokens. Differentiate the two characters hard at the character-sheet stage (height, palette, silhouette, body shape) so the model has fewer features to confuse.
No 3D spatial prior: contact geometry (a hand gripping a prop, one body carrying another, bodies pressed together) requires the model to reason about volume and occlusion, which image-centric training does not give it. You get merged limbs, vanishing hands, and the nonsensical body outputs creators have started calling "jabberwockies".
Element-count explosion: every added character, prop, or contact point multiplies failure points, so the more bodies and props in contact, the faster the shot collapses. In one production, the dominant setup (a vampire carrying another character) ran through ~75% of the film, and even prompting with strong image models could not produce a clean fused character sheet from text alone.
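To make the element-count explosion concrete, here is a toy calculation. It is only a sketch under a strong assumption that is not from the source: each character, prop, or contact point is treated as an independent element that renders cleanly with the same probability. The function name and the 0.9 figure are hypothetical, chosen purely for illustration; real models do not fail independently or uniformly.

```python
def clean_shot_probability(per_element_success: float, element_count: int) -> float:
    """Chance that every element renders cleanly, assuming independent failures.

    Toy model only: per_element_success is a hypothetical number, not a
    measured property of any video model.
    """
    if not 0.0 <= per_element_success <= 1.0:
        raise ValueError("per_element_success must be between 0 and 1")
    if element_count < 0:
        raise ValueError("element_count must not be negative")
    # Independent events multiply, so each added element shrinks the odds.
    return per_element_success ** element_count


if __name__ == "__main__":
    # Hypothetical 90% per-element success rate, increasing element counts.
    for count in (1, 2, 4, 6):
        print(count, round(clean_shot_probability(0.9, count), 3))
```

Under this simplification, the odds of a clean shot shrink geometrically as elements are added, which matches the pattern described above: the more bodies and props in contact, the faster the shot collapses.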
Here is what actually works, in the order you should try each approach:
Differentiate the two characters at the sheet stage. Before any video generation, lock four-angle turnaround sheets per character (front, side, back, face closeup) at 4K in Nano Banana Pro, and make the two characters visually unmistakable from each other — distinct height, palette, body shape, costume silhouette. Identity-averaging needs ambiguous inputs; remove the ambiguity.
Feed multiple reference images into a multi-subject generation pass. Kling's multi-reference input (up to four images per shot) and Seedance 2.0 reference-to-video both accept several character references plus a location plate in the same call, which holds identity across the contact moment better than a single prompt. Inside invideo, every roster model — Runway, Veo, Kling, Seedance 2.0 — is available, and the invideo agent routes each shot to whichever model handles your specific contact geometry best, so you never platform-hop to find the right one.
Hand-sketch the contact configuration and upload the drawing. When text prompts fail to produce a correct fused two-character sheet — even with the strongest image models — sketch how the two characters are physically arranged (who is carrying whom, where hands grip, what touches what), upload the sketch to the invideo agent, and have it pass that drawing as an image reference into Nano Banana alongside both character sheets. The drawing supplies the spatial logic the text prompt cannot. As Hridaye, invideo's creative director, put it: "He hand sketched how we want juice box character attached to our vampire character. We took that drawing and we uploaded that to our agent one who then in turn took that and then attached that to Nano Banana and prompted his way to finally get us the perfect character sheet."
Act the shot out on your phone and upload it as a reference video. For contact shots that also involve a complex camera move or POV, film a 5-second mock on your phone — bodies in the actual contact position, camera moving the way you want — and feed that as a reference video into Seedance 2.0 reference-to-video. The mock anchors both the spatial geometry and the camera language.
Composite from multiple generations. Across one 3-minute episode, 17 of the final shots were stitched from 2+ generations. When no single generation lands the contact cleanly, take the seconds where Character A is right from one generation and the seconds where Character B is right from another, and cut them together. Plan a budget for it: an average of 3 generations per usable shot, with only ~25% of all generations making the final cut.
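The budget figures in the compositing step can be turned into a rough planning calculation. The sketch below is a minimal illustration, not a tool from the production described: the function names and the 20-shot example are hypothetical, and the two defaults are simply the averages quoted above, which may not transfer to your project. The two estimates are computed independently because the source does not say how the figures relate to each other.

```python
import math


def generations_for_shots(usable_shots: int, gens_per_usable_shot: float = 3.0) -> int:
    """Estimate total generations from the per-shot average, rounded up."""
    if usable_shots < 0 or gens_per_usable_shot <= 0:
        raise ValueError("usable_shots must be >= 0 and gens_per_usable_shot > 0")
    return math.ceil(usable_shots * gens_per_usable_shot)


def generations_for_keepers(kept_generations: int, keep_rate: float = 0.25) -> int:
    """Estimate total generations needed to end up with a number of kept ones."""
    if kept_generations < 0 or not 0.0 < keep_rate <= 1.0:
        raise ValueError("kept_generations must be >= 0 and keep_rate in (0, 1]")
    return math.ceil(kept_generations / keep_rate)


if __name__ == "__main__":
    # Hypothetical episode: 20 usable shots, or 20 generations kept in the cut.
    print(generations_for_shots(20))
    print(generations_for_keepers(20))
```

Treat the larger of the two numbers as the safer budget, and revise both defaults once you have your own counts from a first batch of shots.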
These are some of the ways to problem-solve a contact shot — which one lands depends on how many bodies are touching, whether a prop is in play, and the camera move you need.
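The first two approaches (differentiating the characters, then bundling references for a multi-subject pass) can be sketched as a small pre-flight check. Everything here is hypothetical scaffolding: `CharacterSheet`, `shared_traits`, `build_reference_bundle`, the file names, and the trait values are invented for illustration and are not part of any Kling, Seedance, or invideo API. The four-image cap is the per-shot limit cited above for Kling's multi-reference input; counting the location plate toward that cap is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

# Per-shot image limit cited for Kling's multi-reference input.
# Assumption: the location plate counts toward this limit.
MAX_REFERENCE_IMAGES = 4

# Traits to differentiate at the character-sheet stage.
DIFFERENTIATING_TRAITS = ("height_cm", "palette", "body_shape", "silhouette")


@dataclass(frozen=True)
class CharacterSheet:
    """Hypothetical record describing one locked character sheet."""
    name: str
    image_path: str
    height_cm: int
    palette: str
    body_shape: str
    silhouette: str


def shared_traits(a: CharacterSheet, b: CharacterSheet) -> List[str]:
    """Return the traits on which two characters are identical."""
    return [t for t in DIFFERENTIATING_TRAITS if getattr(a, t) == getattr(b, t)]


def build_reference_bundle(
    sheets: List[CharacterSheet], location_plate: Optional[str] = None
) -> List[str]:
    """Collect the image paths for one shot, enforcing the per-shot cap."""
    refs = [sheet.image_path for sheet in sheets]
    if location_plate is not None:
        refs.append(location_plate)
    if len(refs) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"{len(refs)} reference images exceeds the cap of {MAX_REFERENCE_IMAGES}"
        )
    return refs


# Illustrative values only; the characters echo the production described above.
vampire = CharacterSheet(
    "vampire", "vampire_sheet.png", 190, "black and crimson",
    "tall and lean", "high-collared cape",
)
juice_box = CharacterSheet(
    "juice box", "juice_box_sheet.png", 40, "orange and white",
    "small and boxy", "carton with straw",
)

overlap = shared_traits(vampire, juice_box)  # any entry here is a blending risk
refs = build_reference_bundle([vampire, juice_box], location_plate="location_plate.png")
print(overlap)
print(refs)
```

If `shared_traits` returns anything, revise the sheets before generating video; the resulting list of paths is what you would hand to whichever multi-reference tool you use.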