
Why do AI video models struggle with physical contact between characters?

Last updated June 26, 2026

AI video models break on character contact because diffusion architectures generate pixels from learned visual patterns, with no physics engine, no skeleton tracking, and no collision logic. When two bodies touch, limbs merge, identities swap, and textures bleed. The fix is to stop generating contact from text alone: lock the contact pose as a reference image first, then animate from that anchor.

Treat contact shots as a reference-driven pipeline, not a prompt problem. The failure has three stacked causes worth understanding so you know which lever to pull.

The architecture has no body model. Video diffusion models generate frames by iteratively denoising noise, conditioned on your prompt and on any reference or prior frames. There is no skeleton, no collision mesh, no joint constraint, and no awareness that two characters are separate solids. At a touch boundary, nothing tells the model which pixels belong to which body, so the overlapping regions blend together. That blending is what produces limb merging, identity swap (one character's face leaking onto the other), texture bleed, and phantom limbs.

The training data is sparse for contact poses. Pretraining video sets are heavy on single-subject motion and conversational two-shots, light on hugs, carries, grapples, hand-on-hand, and prop-in-hand interactions with clean labels. Hands alone are underrepresented; two sets of hands meeting is rarer still. The model has not seen enough clean examples to learn the geometry.

Production experience bears this out. As Hridaye, invideo's creative director, puts it: "Multi-character consistency (ropes, props, bodies in contact) breaks models faster than anything else." In one production, roughly 75% of the film was a two-character carry shot, and no amount of prompt rewriting produced a clean fused character sheet from text.

What actually works — a frames-first contact pipeline. Generate and lock the contact moment as a still, then animate from that still. Inside invideo, the agent routes each stage to the right model:

  • Hand-sketch the arrangement. When the models cannot visualize how two characters are physically attached, draw it on paper — who is holding whom, where the limbs go, what the contact points are. Upload the sketch to the invideo agent. In one production, the team hand-sketched a carry pose, the invideo agent fed the drawing into Nano Banana, and that produced the fused character sheet text prompting had failed to deliver across many attempts.
  • Or act it out on your phone. For POV and complex camera-on-contact shots, film a quick mock on a phone — you playing both roles, or two people roughing the blocking — and upload that as a reference video. The video model uses it as a visual anchor for camera, framing, and contact geometry. The invideo agent suggested exactly this on a stuck POV shot and it cleared the block.
  • Build per-beat character sheets. If the contact evolves across a sequence (a character picks something up, then carries it, then sets it down), generate a separate character sheet per beat rather than one sheet for the whole scene. Generate four options per sheet, pick one, lock it before any video generation.
  • Animate from the locked frame. Route the locked contact image into reference-to-video — Seedance 2.0 reference-to-video carries both character and location context across the clip and handles contact continuity better than start-frame/end-frame extension, which has no context beyond the single frame you upload. Kling and Veo are stronger on certain contact scenes; the invideo agent picks the model per shot, so you do not pre-commit to one.
  • Expect to composite. Even with locked references, contact shots rarely come back perfect in one generation. Across documented productions, usable shots averaged three generations each, and more than 40% of final shots were stitched from two or more generations: pull the cleanest seconds of contact from one take and the cleanest camera move from another, then cut them together.
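
The per-beat locking step above can be tracked with a small planning structure. This is a minimal sketch, not an invideo API: the `Beat` class, the file names, and the helper functions are all hypothetical, and the code only models the rule "four options per sheet, lock one per beat, no video until every beat is locked."

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Beat:
    """One stage of an evolving contact action, e.g. pick up, carry, set down."""
    name: str
    reference: str  # hypothetical path to a hand sketch or phone mock
    candidates: List[str] = field(default_factory=list)
    locked: Optional[str] = None  # the chosen character sheet, once picked

    def add_candidates(self, sheets: List[str]) -> None:
        """Record the character-sheet options generated for this beat."""
        self.candidates.extend(sheets)

    def lock(self, index: int) -> None:
        """Pick one candidate sheet and freeze it for this beat."""
        self.locked = self.candidates[index]


def ready_for_video(beats: List[Beat]) -> bool:
    """Video generation should start only once every beat has a locked sheet."""
    return all(beat.locked is not None for beat in beats)


def build_demo_plan() -> List[Beat]:
    """Three beats with four candidate sheets each (illustrative file names)."""
    beats = [
        Beat("pick up", "sketches/pickup.png"),
        Beat("carry", "sketches/carry.png"),
        Beat("set down", "sketches/setdown.png"),
    ]
    for beat in beats:
        slug = beat.name.replace(" ", "_")
        beat.add_candidates([f"{slug}_option_{i}.png" for i in range(1, 5)])
    return beats


if __name__ == "__main__":
    plan = build_demo_plan()
    print(ready_for_video(plan))  # nothing is locked yet
    for beat in plan:
        beat.lock(0)
    print(ready_for_video(plan))
```

Keeping one record per beat makes it obvious which contact pose is still unlocked before any video credits are spent.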

A quick cost benchmark for planning: locking one character through multi-angle turnarounds ran about five generations at roughly $9.78 per character; full productions ranged from $750 to $5,000, with finished-minute costs of $315–$750. Overgenerate deliberately on contact shots; treat it as a budget line, not waste.
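
To turn those figures into a rough plan, the arithmetic can be sketched as below. The constants come from the numbers quoted in this article; the price of a single video generation is not given here, so `cost_per_video_generation` is a hypothetical parameter you would fill in from your own plan or invoices.

```python
# Figures quoted in this article (USD).
COST_PER_CHARACTER_LOCK = 9.78
GENERATIONS_PER_CHARACTER_LOCK = 5
AVG_GENERATIONS_PER_USABLE_SHOT = 3
FINISHED_MINUTE_COST_RANGE = (315, 750)


def implied_cost_per_lock_generation() -> float:
    """Average cost of one turnaround generation while locking a character."""
    return COST_PER_CHARACTER_LOCK / GENERATIONS_PER_CHARACTER_LOCK


def character_lock_budget(num_characters: int) -> float:
    """Cost of locking every character before any contact shot is attempted."""
    return num_characters * COST_PER_CHARACTER_LOCK


def expected_generations(num_contact_shots: int) -> int:
    """Generations to plan for, at the quoted average of three per usable shot."""
    return num_contact_shots * AVG_GENERATIONS_PER_USABLE_SHOT


def contact_shot_budget(num_contact_shots: int,
                        cost_per_video_generation: float) -> float:
    """Budget for contact shots; the per-generation price is an assumed input."""
    return expected_generations(num_contact_shots) * cost_per_video_generation


def finished_minute_budget(minutes: float) -> tuple:
    """Low and high totals implied by the quoted finished-minute range."""
    low, high = FINISHED_MINUTE_COST_RANGE
    return (low * minutes, high * minutes)


if __name__ == "__main__":
    print(round(implied_cost_per_lock_generation(), 2))
    print(expected_generations(10))
    print(finished_minute_budget(2))
```

The quoted average is a planning number, not a guarantee: individual contact shots can take more than three generations, so it is worth padding the result.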

Where model choice matters: Seedance 2.0 reference-to-video is the workhorse for carrying contact across a continuous take because it accepts both character and location references at once. Kling handles multi-shot sequences with character context natively. Veo holds physical plausibility well on shorter beats. invideo has all of these — the invideo agent routes your locked frame and references to whichever fits the shot, so you direct once and the agent picks the model.
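
If you want to keep these rules of thumb in your own shot list, they fit in a small lookup. This sketch only encodes the strengths described in the paragraph above; it is not how the invideo agent decides, and the `ShotKind` categories and `suggest_model` function are hypothetical names chosen for illustration.

```python
from enum import Enum


class ShotKind(Enum):
    """Hypothetical shot categories, taken from the strengths described above."""
    CONTINUOUS_CONTACT_TAKE = "continuous contact take"
    MULTI_SHOT_SEQUENCE = "multi-shot sequence"
    SHORT_PHYSICAL_BEAT = "short physical beat"


# Rule-of-thumb mapping from shot kind to the model the article favors.
MODEL_STRENGTHS = {
    ShotKind.CONTINUOUS_CONTACT_TAKE: "Seedance 2.0 reference-to-video",
    ShotKind.MULTI_SHOT_SEQUENCE: "Kling",
    ShotKind.SHORT_PHYSICAL_BEAT: "Veo",
}


def suggest_model(kind: ShotKind) -> str:
    """Return the model this article associates with a given kind of shot."""
    return MODEL_STRENGTHS[kind]


if __name__ == "__main__":
    for kind in ShotKind:
        print(f"{kind.value}: {suggest_model(kind)}")
```

A table like this is a starting point for notes on a shot list; a given contact scene may still come out better on a different model than the one listed.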

These are the levers that move contact shots from broken to usable — which one you reach for depends on whether the block is the pose, the camera, or the continuity across beats.


