Why don't text prompts reliably produce over-the-shoulder shots in AI video?

OTS shots require precise camera-to-subject geometry — foreground placement, shoulder position, and depth of field — that words don't reliably encode. Giving the model a visual reference instead of a description solves this.

How do you use a phone mock-up to generate an OTS shot in invideo AI?

Stand in for the foreground character, frame the shot over your own shoulder, and record 5–10 seconds on your phone. Upload the clip to the invideo agent as a visual reference, then prompt it to match the camera position while replacing the people with your real characters.

What is a still master plate and how does it help with OTS shots?

A still master plate is a wide establishing image of your scene with both characters in frame. You crop A-side and B-side anchor frames from it and feed each into image-to-video with a spatial-anchor prompt to lock the geometry for both OTS angles.

When should you use a hand sketch to generate an OTS shot?

Use a hand sketch when the framing is unusual, such as a height difference or a multi-character contact configuration, and the image model cannot infer it from text. A simple stick-figure diagram uploaded to the invideo agent is enough.

How many generations should you expect per usable OTS shot?

Expect roughly three generations per usable OTS shot. Plan to composite the strongest seconds from two or three takes into the final cut rather than relying on a single generation.

Generate Over-the-Shoulder Shots in AI Video

When prompting alone won't land an over-the-shoulder shot, give the video model spatial geometry it can't infer from text. The reliable path: shoot a phone mock of the OTS, generate a still master plate and crop A-side/B-side anchor frames, or hand-sketch the framing — then upload that visual reference and let the invideo agent route it into image-to-video with a spatial-anchor prompt.

OTS is a documented weak point of current video models — it needs camera-to-subject geometry (who's in foreground, which shoulder, how shallow the depth) that words don't reliably encode. Give the model a picture of the geometry, not a description of it. invideo is an agentic video tool with the current video and image models routed through one agent, so the same reference can flow from image generation into a video model in one pass.

Act the shot out on your phone and upload it as reference. Stand in for the foreground character, frame the camera over your own shoulder onto whatever is standing in for the B-side, and shoot 5–10 seconds on your phone. Upload that clip to the invideo agent as a visual reference for the shot, then prompt the agent with your real characters, lighting, and lens — "match this camera position and framing, replace the people with [character refs], hold the shallow depth." The phone clip locks the geometry; the agent handles the dressing. This is the same move productions use for tricky POVs when prompting stalls: "shoot like a mock video of it on our phone inside the office," then hand it back to the agent to take it over the line.

Generate a still master plate, then crop A-side and B-side anchors for image-to-video. Ask the invideo agent for a wide establishing still of the scene with both characters in frame — this is your master plate and it locks the spatial geometry of the room. From that master, crop two reference frames: one over the A character's shoulder onto B, one over B's shoulder onto A. Feed each crop into image-to-video on the invideo agent with a spatial-anchor prompt — name the lens, the shoulder, where the subject sits in the frame, and the depth of field. A working template:

Over-the-shoulder, camera behind [character]'s left shoulder, [character]'s head and shoulder soft in foreground right third, [other character] sharp in mid-ground left, 50mm equivalent, shallow depth of field, [lighting from your scene], hold the camera position.
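If you reuse this template across several shots, it can help to fill it from named fields so the shoulder and frame sides never contradict each other. The sketch below is one way to do that in Python. The class name, field names, and default lighting text are illustrative assumptions, not part of invideo, and mirroring the frame sides for a right-shoulder shot is an assumption extrapolated from the left-shoulder template above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OTSPrompt:
    """Fields of the spatial-anchor template. All names here are hypothetical."""

    foreground: str               # character whose shoulder the camera sits behind
    subject: str                  # character the camera is looking at
    shoulder: str = "left"        # which shoulder the camera is behind
    lens: str = "50mm equivalent"
    lighting: str = "soft window light from camera left"  # placeholder lighting

    def render(self) -> str:
        if self.shoulder not in ("left", "right"):
            raise ValueError("shoulder must be 'left' or 'right'")
        # Per the template, a camera behind the left shoulder puts the foreground
        # figure in the right third and the subject on the left. The right-shoulder
        # case mirrors that (an assumption, not stated in the template).
        fg_side = "right" if self.shoulder == "left" else "left"
        subject_side = "left" if self.shoulder == "left" else "right"
        return (
            f"Over-the-shoulder, camera behind {self.foreground}'s "
            f"{self.shoulder} shoulder, {self.foreground}'s head and shoulder "
            f"soft in foreground {fg_side} third, {self.subject} sharp in "
            f"mid-ground {subject_side}, {self.lens}, shallow depth of field, "
            f"{self.lighting}, hold the camera position."
        )


if __name__ == "__main__":
    # Paste the printed text into the agent alongside the anchor frame.
    print(OTSPrompt(foreground="Maya", subject="Jon").render())
```

Swapping `foreground` and `subject` and flipping `shoulder` gives a starting prompt for the reverse angle; adjust the sides by eye against your actual anchor frame.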

Keep one camera action per prompt — if you ask for an OTS plus a push-in plus a rack focus in one shot, the model loses the geometry. Generate the static OTS first, then chain a second generation for any movement.
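To make the one-action rule hard to break, you can plan the chain up front as a list of prompts, each carrying exactly one camera instruction. This is a minimal sketch, assuming you keep your prompts as plain strings; the function name and the wording of the follow-up prompts are placeholders, not invideo syntax.

```python
def plan_generations(base_prompt: str, camera_moves: list[str]) -> list[str]:
    """Return one prompt per generation: the static OTS first, then one move each."""
    # First generation: lock the geometry with no camera movement at all.
    static = f"{base_prompt.rstrip('. ')}, hold the camera position."
    plan = [static]
    # Each later generation adds a single move on top of the established framing.
    # The "Continue from..." phrasing is a hypothetical placeholder.
    for move in camera_moves:
        plan.append(
            f"Continue from the previous clip, same framing and characters, {move}."
        )
    return plan


if __name__ == "__main__":
    steps = plan_generations(
        "Over-the-shoulder on B, 50mm equivalent.",
        ["slow push-in", "rack focus to the foreground shoulder"],
    )
    for number, prompt in enumerate(steps, start=1):
        print(number, prompt)
```

A push-in plus a rack focus becomes three generations instead of one overloaded prompt, which matches the static-first order described above.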

Hand-sketch the framing when the geometry is unusual. If the OTS involves an awkward height difference, a character carrying another, or a non-standard shoulder-over-shoulder arrangement, sketch the framing on paper — a stick-figure diagram is enough — and upload that to the invideo agent. One production used exactly this when the image model couldn't visualise a multi-character contact configuration from text: a hand sketch went up, the agent fed it into the image model, and the character sheet came back correct. The same logic applies to OTS framing the model can't infer.

Route the OTS to the right model. Reference-to-video on Seedance 2.0 carries character and location context into the generated clip, which is useful when the OTS has to match the shots on either side of it. Kling holds multi-shot framing well for OTS continuity within a sequence. Veo handles the lens and depth language in spatial-anchor prompts cleanly. You don't pick a platform per model: the invideo agent has all of them and routes each shot based on what the reference and prompt need. Across documented productions, expect roughly three generations per usable OTS shot, and plan to composite the strongest seconds from two or three takes into the final cut if needed.
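The three-per-shot figure is a planning number, so it is worth turning into a rough budget before you start. A small sketch, assuming the ratio holds for your project; the function name and the default are illustrative, and your own ratio may differ.

```python
import math


def generation_budget(ots_shots: int, generations_per_usable: float = 3.0) -> int:
    """Estimate total generations to plan for a given number of OTS shots."""
    if ots_shots < 0 or generations_per_usable <= 0:
        raise ValueError("shot count must be >= 0 and the ratio must be > 0")
    # Round up: a partial generation still costs a full one.
    return math.ceil(ots_shots * generations_per_usable)


if __name__ == "__main__":
    # A dialogue scene with four OTS pairs is eight OTS shots.
    print(generation_budget(8))
```

Treat the result as a floor for scheduling and credits, since compositing two or three takes per shot assumes you already have those takes.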

Build the reverse in the same session. Once the A-side OTS is good, ask the invideo agent for the compositionally opposite angle in the same conversation, anchored to the same master plate. The agent will treat undecided design elements (what's on the wall behind the second character) as production-design questions and surface options rather than guess — "Reverse on [character] — what's behind them? That near wall doesn't exist yet. What should it be?" That keeps the OTS pair geometrically matched for editing.

These are the moves that unblock OTS when prompting alone won't — what works depends on how unusual your framing is and how tightly the shot has to match its neighbours.

Watch some of these to see what works for you:

See how phone reference and hand sketches crack shots AI can't prompt

Watch the invideo agent solve an OTS shot a video model couldn't crack

The lesson for the day truly is that when the models get stuck you draw, you shoot, you bring your hands in and you get it done. And that's when agent one meets you there and takes it over the line.

— invideo's creative team, on production lessons from AI short film work

