Why do AI video models fail at multi-character physical contact shots?

Current models suffer identity bleed and occlusion failures when two characters touch. No text-to-video model, including reference-to-video tools, reliably solves this from prompts alone.

What is a fused character sheet and why does it help contact shots?

A fused character sheet is a single image showing both characters in the exact contact pose, created by sketching the arrangement and feeding it to an image model. It gives the video model a pre-resolved visual anchor instead of relying on text description.

How does a phone mock help generate contact shots in AI video?

Acting out the contact on your phone and uploading the clip as a reference video locks blocking, contact points, and camera path. Reference-to-video models like Seedance 2.0 then dress the motion in your chosen style.

What yield should I expect when generating AI contact shots?

Documented productions average about 3 generations per usable shot, with roughly 25% of clips making the final cut. Plan for overgeneration and stitch the strongest seconds from multiple takes into one shot.

What are Frankenstein shots and how common are they in contact-heavy productions?

Frankenstein shots are assembled from the best seconds of multiple generated takes. In one documented 3-minute episode, they accounted for over 40% of final shots, making them a core technique for difficult contact sequences.

AI Video Multi-Character Contact Shots: What Works

AI video tools cannot yet generate multi-character physical contact reliably from text prompts alone. Contact shots (carries, grips, ropes, bodies touching) break current AI video models faster than almost any other shot type, with identity bleed and occlusion failures the dominant problems. Reliable contact shots come from feeding the model a visual anchor of the contact itself (a hand-sketch, a phone-shot mock, or a fused character sheet), then overgenerating and stitching the keepers.

Treat a contact shot as an asset problem, not a prompting problem. The invideo agent routes shots across Runway, Veo, Kling, and Seedance 2.0, but no current model — including the strongest reference-to-video ones — solves two-character physical contact reliably from text. Hridaye, invideo's creative director, put it bluntly: "Multi-character consistency (ropes, props, bodies in contact) breaks models faster than anything else." The workable path is to give the model a pre-resolved picture of the contact and then generate against it.

Hand-sketch the contact, then build a fused character sheet. When the configuration is unusual — one character carrying another, a grip pose, intertwined bodies — text prompts and even Nano Banana alone fail. Sketch the arrangement on paper, upload the drawing as a reference, and have the invideo agent feed it to the image model to produce a fused two-character sheet. In one documented production where a vampire carries a juice-box character through ~75% of the film, the team only cracked the character sheet after hand-sketching the carry pose and uploading it as the visual anchor. Lock that fused sheet before any video generation.

Shoot a phone mock of the contact and use it as reference video. For motion — a lift, a hand-off, a pull — act it out on a phone in your room and upload the clip as a reference into a reference-to-video model (Seedance 2.0 is the strongest here because it carries character and location context across clips simultaneously). The mock locks blocking, contact points, and camera path; the model dresses it in your style. This is how the same production unblocked POV and contact shots that pure prompting could not deliver.

Generate five variations per beat and stitch the keepers. Plan for overgeneration. Average yield in documented productions runs about 3 generations per usable shot, and roughly 25% of clips make the final cut (41 of 164 in one 3-minute episode). Stitch the strongest seconds from multiple takes into one shot — a Frankenstein shot — which accounted for more than 40% of final shots (17 of 41) in that production. For evolving contact (a character picking things up across beats), generate a separate character sheet per beat so the contact updates without identity drift.
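The yield figures above can be turned into a rough generation budget. The sketch below only reproduces the quoted arithmetic (41 of 164 clips kept, 17 of 41 final shots stitched); the helper `generations_to_budget` and the 20-shot target are hypothetical, and it assumes every shot keeps clips at the same average rate, which contact shots probably will not.

```python
import math

# Figures quoted above for one documented 3-minute episode.
clips_generated = 164
clips_kept = 41          # clips that made the final cut
frankenstein_shots = 17  # final shots stitched from multiple takes

keep_rate = clips_kept / clips_generated              # share of clips kept
frankenstein_share = frankenstein_shots / clips_kept  # share of stitched shots


def generations_to_budget(final_shots: int, rate: float) -> int:
    """Estimate clips to generate for a target number of final shots.

    Hypothetical helper: assumes a uniform keep rate across all shots.
    """
    return math.ceil(final_shots / rate)


print(f"keep rate: {keep_rate:.0%}")
print(f"Frankenstein share: {frankenstein_share:.1%}")
print(f"clips to budget for 20 shots: {generations_to_budget(20, keep_rate)}")
```

Treat the result as a floor for planning. Contact-heavy beats are the ones most likely to fall below the average keep rate.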

Stage around the model's weak spots when contact isn't structural. Where the contact isn't load-bearing for the story, design around it: cutaways at the moment of contact; over-the-shoulder (OTS) framings, a known Nano Banana weakness that is better solved with reference imagery than with prompts; tighter coverage that hides the hand-off; or matched opposite-angle pairs generated back-to-back in the same session. Minimize occlusion in frame whenever the script allows it.

These are the levers that actually move reliability on contact shots — which combination works depends on the specific shot. Documented productions consistently land contact-heavy films (the carry film above ran ~$950 for 3 minutes; a 70-second short with two characters ran $750) by combining all four: sketch, mock, overgenerate, stitch.
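To compare the two quoted budgets on the same footing, one option is to normalize them to cost per finished minute. This is an illustrative calculation, not a figure from the productions themselves: the $950 total is approximate, the source does not say what each total includes, and the function name `cost_per_minute` is made up for this sketch.

```python
# (label, total cost in USD, runtime in seconds), as quoted above.
# The carry film's cost is approximate (~$950).
productions = [
    ("carry film", 950, 180),
    ("two-character short", 750, 70),
]


def cost_per_minute(total_usd: float, runtime_s: float) -> float:
    """Normalize a production's total cost to USD per finished minute."""
    return total_usd / runtime_s * 60


for label, usd, seconds in productions:
    print(f"{label}: about ${cost_per_minute(usd, seconds):.0f} per finished minute")
```

On these numbers the shorter film costs roughly twice as much per minute, so total runtime alone is a weak predictor of budget.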

Watch some of these to see what works for you:

See how hand-sketching a carry pose unblocked two-character contact shots

Real production numbers: 164 clips generated, 41 made the cut

Chain AI video segments with Seedance to hold contact and blocking across cuts


Can AI video tools reliably generate multi-character shots where characters are physically touching?
