Why do AI video models merge or blend two characters in the same scene?

AI video models have no persistent identity representation across tokens, so when two characters share a frame, their features (face shape, hair, build) get averaged into a hybrid. Making the two characters visually distinct at the character-sheet stage reduces the ambiguity that drives this averaging.

What causes merged limbs and distorted bodies in AI-generated contact shots?

Image-centric training means models have no 3D spatial prior, so they cannot reason about volume and occlusion. Contact geometry like a hand gripping a prop or one character carrying another produces merged limbs, vanishing hands, and nonsensical body outputs.

How does adding more characters or props affect AI video generation quality?

Every additional character, prop, or contact point multiplies the number of failure points. The more bodies and props in contact, the faster the shot collapses, so keeping scenes as simple as possible improves output quality.

Can uploading a hand sketch fix two-character contact problems in AI video?

Yes. Sketching the exact physical arrangement of the two characters and uploading it as an image reference gives the model spatial logic that a text prompt alone cannot supply, especially when even strong image models fail to produce a clean fused character sheet.

What is compositing from multiple generations and when should I use it?

When no single generation captures the full contact shot cleanly, you take the best seconds of Character A from one generation and the best seconds of Character B from another and cut them together. Plan for roughly three generations per usable shot, with only about 25% of all generations making the final cut.
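As a rough planning aid, the two figures above can be turned into a small calculation. This is only a sketch: the function names are invented for illustration, the constants are the averages quoted in the answer, and your own ratios may differ from that one production's.

```python
import math

# Averages quoted in the text for one production; treat them as starting assumptions.
GENERATIONS_PER_USABLE_SHOT = 3
FINAL_CUT_RATE = 0.25  # share of all generations that made the final cut


def generation_budget(planned_shots: int,
                      generations_per_shot: float = GENERATIONS_PER_USABLE_SHOT) -> int:
    """Estimate how many generations to budget for a number of planned shots."""
    if planned_shots < 0:
        raise ValueError("planned_shots must be non-negative")
    return math.ceil(planned_shots * generations_per_shot)


def expected_kept(total_generations: int, keep_rate: float = FINAL_CUT_RATE) -> int:
    """Estimate how many of the generations are likely to survive the edit."""
    if total_generations < 0:
        raise ValueError("total_generations must be non-negative")
    return math.floor(total_generations * keep_rate)


if __name__ == "__main__":
    budget = generation_budget(40)
    print(f"40 planned shots: budget about {budget} generations, "
          f"expect about {expected_kept(budget)} to reach the final cut")
```

The two numbers are reported separately in the source, so the sketch keeps them as independent knobs rather than trying to derive one from the other.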

Why AI Video Fails with Two Characters Touching

Multi-character contact shots break AI video models because the model has to do two very hard things at once: hold two distinct character identities without averaging them into one face, and resolve the 3D spatial logic of bodies touching, carrying, or gripping. It has to do both with no persistent identity representation, no 3D spatial prior, and image-centric training that treats each frame independently.

The failure stacks in three layers.

Identity-averaging. When two characters share a frame, models blend their features (face shape, hair, build) into a single hybrid because there's no persistent identity representation across tokens. Differentiate the two characters hard at the character-sheet stage (height, palette, silhouette, body shape) so the model has fewer features to confuse.

No 3D spatial prior. Contact geometry (a hand gripping a prop, one body carrying another, bodies pressed together) requires the model to reason about volume and occlusion, which image-centric training does not give it. You get merged limbs, vanishing hands, and the nonsensical body outputs creators have started calling "jabberwockies".

Element-count explosion. Every added character, prop, or contact point multiplies failure points: the more bodies and props in contact, the faster the shot collapses. In one production, the dominant setup (a vampire carrying another character) ran through ~75% of the film, and even prompting with strong image models could not produce a clean fused character sheet from text alone.
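The compounding in the third layer can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that every character, prop, and contact point resolves correctly with the same independent probability. The source gives no such number, and `per_element_success` is a made-up parameter, not a measured property of any model. Under that assumption the chance of a clean shot is the product of the per-element chances, so it drops quickly as elements are added.

```python
def clean_shot_probability(num_elements: int, per_element_success: float) -> float:
    """Toy model: a shot is clean only if every element resolves correctly.

    Assumes each element succeeds independently with the same probability,
    which is an illustrative simplification, not a measured behaviour.
    """
    if num_elements < 0:
        raise ValueError("num_elements must be non-negative")
    if not 0.0 <= per_element_success <= 1.0:
        raise ValueError("per_element_success must be between 0 and 1")
    return per_element_success ** num_elements


if __name__ == "__main__":
    # Compare a simple shot with progressively busier contact shots.
    for n in (1, 2, 4, 6):
        print(f"{n} elements: {clean_shot_probability(n, 0.8):.3f}")
```

The exact numbers mean little; the shape of the curve is the point, and it is one way to see why cutting a prop or a contact point from the shot often helps more than rewording the prompt.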

Here are the fixes that actually work, in the order you should try them:

Differentiate the two characters at the sheet stage. Before any video generation, lock four-angle turnaround sheets per character (front, side, back, face closeup) at 4K in Nano Banana Pro, and make the two characters visually unmistakable from each other — distinct height, palette, body shape, costume silhouette. Identity-averaging needs ambiguous inputs; remove the ambiguity.

Feed multiple reference images into a multi-subject generation pass. Kling's multi-reference input (up to four images per shot) and Seedance 2.0 reference-to-video both accept several character references plus a location plate in the same call, which holds identity across the contact moment better than a single prompt. Inside invideo, every roster model — Runway, Veo, Kling, Seedance 2.0 — is available, and the invideo agent routes each shot to whichever model handles your specific contact geometry best, so you never platform-hop to find the right one.

Hand-sketch the contact configuration and upload the drawing. When text prompts fail to produce a correct fused two-character sheet — even with the strongest image models — sketch how the two characters are physically arranged (who is carrying whom, where hands grip, what touches what), upload the sketch to the invideo agent, and have it pass that drawing as an image reference into Nano Banana alongside both character sheets. The drawing supplies the spatial logic the text prompt cannot. As Hridaye, invideo's creative director, put it: "He hand sketched how we want juice box character attached to our vampire character. We took that drawing and we uploaded that to our agent one who then in turn took that and then attached that to Nano Banana and prompted his way to finally get us the perfect character sheet."

Act the shot out on your phone and upload it as a reference video. For contact shots that also involve a complex camera move or POV, film a 5-second mock on your phone — bodies in the actual contact position, camera moving the way you want — and feed that as a reference video into Seedance 2.0 reference-to-video. The mock anchors both the spatial geometry and the camera language.

Composite from multiple generations. Across one 3-minute episode, 17 of the final shots were stitched from two or more generations. When no single generation lands the contact cleanly, take the seconds where Character A is right from one generation and the seconds where Character B is right from another and cut them together. Plan a budget for it: an average of 3 generations per usable shot, and only ~25% of all generations made the final cut.
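If you do the stitching outside an editor, the cut can be described as a list of time ranges, one per generation, and handed to ffmpeg. The following is a minimal sketch, assuming ffmpeg is installed and that the generations share resolution and frame rate. The file names, the `Segment` type, and the helper functions are hypothetical. The code only builds the commands and the concat list, it does not run them.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    source: str   # path to one generation's video file (hypothetical name)
    start: float  # first good second in that generation
    end: float    # last good second in that generation


def trim_command(seg: Segment, out_path: str) -> List[str]:
    """Build an ffmpeg call that re-encodes one time range into its own file."""
    if seg.end <= seg.start:
        raise ValueError("segment end must be after its start")
    return [
        "ffmpeg", "-y", "-i", seg.source,
        "-ss", f"{seg.start:.3f}", "-to", f"{seg.end:.3f}",
        "-c:v", "libx264", "-an",  # re-encode video, drop audio for simplicity
        out_path,
    ]


def concat_list(part_paths: List[str]) -> str:
    """Text of the list file read by ffmpeg's concat demuxer."""
    return "".join(f"file '{p}'\n" for p in part_paths)


def concat_command(list_path: str, out_path: str) -> List[str]:
    """Build the ffmpeg call that joins the trimmed parts without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", out_path]


if __name__ == "__main__":
    # Character A is right in the first generation, Character B in the second.
    plan = [Segment("gen_a.mp4", 0.0, 2.5), Segment("gen_b.mp4", 2.5, 5.0)]
    parts = [f"part{i}.mp4" for i in range(len(plan))]
    for seg, part in zip(plan, parts):
        print(" ".join(trim_command(seg, part)))
    print(concat_list(parts), end="")
    print(" ".join(concat_command("parts.txt", "shot_final.mp4")))
```

Each command list can be passed to `subprocess.run` once the plan looks right. Trimming with a re-encode keeps the cuts frame-accurate, at the cost of one extra encode per part.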

These are some of the ways to problem-solve a contact shot — which one lands depends on how many bodies are touching, whether a prop is in play, and the camera move you need.

Watch some of these to see what works for you:

Hand-sketching contact geometry to fix AI multi-character failures

When AI can't nail the shot — use images to guide it instead

