Can AI video tools reliably generate multi-character shots where characters are physically touching?
Last updated June 26, 2026
No — not reliably from text prompts alone. Multi-character physical contact (carries, grips, ropes, bodies touching) breaks current AI video models faster than almost any other shot type, with identity bleed and occlusion failures the dominant problems. Reliable contact shots come from feeding the model a visual anchor of the contact itself — a hand-sketch, a phone-shot mock, or a fused character sheet — and stitching the keepers.
Treat a contact shot as an asset problem, not a prompting problem. The invideo agent routes shots across Runway, Veo, Kling, and Seedance 2.0, but no current model — including the strongest reference-to-video ones — solves two-character physical contact reliably from text. Hridaye, invideo's creative director, put it bluntly: "Multi-character consistency (ropes, props, bodies in contact) breaks models faster than anything else." The workable path is to give the model a pre-resolved picture of the contact and then generate against it.
Hand-sketch the contact, then build a fused character sheet. When the configuration is unusual — one character carrying another, a grip pose, intertwined bodies — text prompts and even Nano Banana alone fail. Sketch the arrangement on paper, upload the drawing as a reference, and have the invideo agent feed it to the image model to produce a fused two-character sheet. In one documented production where a vampire carries a juice-box character through ~75% of the film, the team only cracked the character sheet after hand-sketching the carry pose and uploading it as the visual anchor. Lock that fused sheet before any video generation.
Shoot a phone mock of the contact and use it as reference video. For motion — a lift, a hand-off, a pull — act it out on a phone in your room and upload the clip as a reference into a reference-to-video model (Seedance 2.0 is the strongest here because it carries character and location context across clips simultaneously). The mock locks blocking, contact points, and camera path; the model dresses it in your style. This is how the same production unblocked POV and contact shots that pure prompting could not deliver.
Generate five variations per beat and stitch the keepers. Plan for overgeneration. Average yield in documented productions runs about 3 generations per usable shot, and roughly 25% of clips make the final cut (41 of 164 in one 3-minute episode). Stitch the strongest seconds from multiple takes into one shot — a Frankenstein shot — which accounted for more than 40% of final shots (17 of 41) in that production. For evolving contact (a character picking things up across beats), generate a separate character sheet per beat so the contact updates without identity drift.
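To make the overgeneration math concrete, here is a minimal planning sketch that uses only the figures quoted above (164 clips generated, 41 kept, 17 stitched shots). The helper names `keep_rate` and `generations_needed`, and the 12-shot target, are hypothetical and chosen for illustration; the keep rate from one episode is a rough guide, not a guarantee for other productions.

```python
import math

# Figures quoted above from one 3-minute episode.
CLIPS_GENERATED = 164
CLIPS_IN_FINAL_CUT = 41
STITCHED_SHOTS = 17  # "Frankenstein" shots assembled from multiple takes


def keep_rate(kept: int, generated: int) -> float:
    """Fraction of generated clips that survive to the final cut."""
    return kept / generated


def generations_needed(final_shots: int, rate: float) -> int:
    """Hypothetical planning helper: clips to generate for a target shot count,
    assuming the same keep rate holds (an assumption, not a documented rule)."""
    return math.ceil(final_shots / rate)


rate = keep_rate(CLIPS_IN_FINAL_CUT, CLIPS_GENERATED)
stitched_share = STITCHED_SHOTS / CLIPS_IN_FINAL_CUT

print(f"keep rate: {rate:.0%}")
print(f"stitched share of final shots: {stitched_share:.1%}")
print(f"clips to budget for 12 final shots: {generations_needed(12, rate)}")
```

At that keep rate, a 12-shot sequence would call for budgeting roughly 48 generations, which is why planning several variations per beat up front is cheaper than regenerating after the edit.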
Stage around the model's weak spots when contact isn't structural. Where the contact isn't load-bearing for the story, design around it: cutaways at the moment of contact, over-the-shoulder (OTS) framings (a known Nano Banana weakness, so solve them with reference imagery rather than prompts), tighter coverage that hides the hand-off, or matched opposite-angle pairs generated back-to-back in the same session. Minimize occlusion in frame whenever the script allows it.
These are the levers that actually move reliability on contact shots — which combination works depends on the specific shot. Documented productions consistently land contact-heavy films (the carry film above ran ~$950 for 3 minutes; a 70-second short with two characters ran $750) by combining all four: sketch, mock, overgenerate, stitch.
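For a rough sense of scale, the two budgets quoted above can be normalized to cost per finished second. This is a back-of-the-envelope sketch: the source gives the carry film's cost and runtime only approximately, and the function name `cost_per_second` and the dictionary labels are illustrative.

```python
# Budget figures quoted above; the carry film's numbers are approximate in the source.
PRODUCTIONS = {
    "carry film": {"cost_usd": 950, "runtime_s": 180},          # ~$950, 3 minutes
    "two-character short": {"cost_usd": 750, "runtime_s": 70},  # $750, 70 seconds
}


def cost_per_second(cost_usd: float, runtime_s: float) -> float:
    """Total spend divided by finished runtime in seconds."""
    return cost_usd / runtime_s


for name, figures in PRODUCTIONS.items():
    print(f"{name}: ${cost_per_second(**figures):.2f} per finished second")
```

The per-second figures differ by about a factor of two between the two productions, which is consistent with the point that the right combination of levers depends on the specific shots involved.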