How do you describe camera movements and shot types in AI video prompts as a filmmaker?
Last updated June 26, 2026
Translate set language to AI prompts across five axes: shot size (ECU, CU, MS, WS), camera movement (dolly in, tracking, handheld, rack focus), lens feel (anamorphic, 85mm compression, shallow DOF), lighting (Rembrandt, motivated practicals, golden hour), and atmosphere. Stack them in a fixed assembly order, route through the invideo agent, and direct in plain on-set language rather than typing parameters per shot.
Shot size — say it the way you'd call it on set. ECU, CU, MCU, MS, WS, EWS all land directly with Runway, Kling, and Seedance 2.0 when paired with subject and framing: "MCU on the girl, eyes just above frame center, shallow DOF" beats "close-up of girl." Be specific about what's in the frame and where the eyeline sits — the model has no concept of cutting between sizes, so each shot is a self-contained brief.
Camera movement — use one verb plus the geography. "Slow dolly in," "tracking shot left to right keeping subject centered," "handheld follow, slight sway," "rack focus from foreground prop to character's face," "crane up revealing the courtyard." Pans and tilts work; whip pans are unreliable. For continuous moves across multiple clips, use Seedance 2.0 reference-to-video — re-upload the end of each clip plus your character and location references so camera momentum carries across the cut. Reference-to-video preserves camera context across segments; the older extend method strips it.
Lens feel — name the optic, not the vibe. "Anamorphic, 2.40:1, oval bokeh, horizontal flares," "spherical 35mm, circular bokeh, no flares," "85mm portrait compression, shallow DOF," "24mm wide with mild edge distortion." Challenge the invideo agent when it picks the wrong one — in one production it had logged anamorphic for a director who shoots spherical, and corrected on being questioned: "Wan shoots spherical, not anamorphic. The Conjuring: 35mm, 2.40:1 hard matte. Widescreen by extraction, not optics."
Lighting — motivated, named, sourced. "Rembrandt key from screen-left lamp, no fill, 85:15 dark-to-light," "motivated practicals only, warm yellow from the lamps like all the refs," "golden hour back-light, long shadows on the floor," "available daylight through window-right, soft bounce camera-left." Generic "warm lighting" drifts; naming the source and the ratio holds.
Performance and subtext — describe behavior, not intent. AI models can't act "the want, not the action," so convert subtext into observable behavior: instead of "more longing," write "holds the look two beats longer, mouth slightly parts, hand stops mid-reach." Micro-expressions, breath, posture, eye direction — those translate. Verbs of feeling don't.
Stack them in a fixed order so nothing drops. Across documented productions the assembly order that holds is: camera spec → lens and aspect ratio → lighting source → palette → composition → atmosphere → mood register → film or DP attribution → negative prompt. Build that string once per shot and the output stays coherent. Across three short films the invideo agent enforced this 9-element order on every frame.
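The fixed assembly order above can be sketched as a small prompt builder. This is a minimal illustration, not how the invideo agent works internally: the `ShotPrompt` class, its field names, the period separator, and the palette, atmosphere, mood, and negative-prompt values are all assumptions made for the example. The other values are borrowed from the phrasing earlier in this guide.

```python
from dataclasses import dataclass, fields


@dataclass
class ShotPrompt:
    """One shot brief. Field order mirrors the nine-element assembly order."""

    camera_spec: str
    lens_and_aspect: str
    lighting_source: str
    palette: str
    composition: str
    atmosphere: str
    mood_register: str
    attribution: str
    negative_prompt: str

    def assemble(self) -> str:
        """Join all nine elements in declaration order, failing on blanks."""
        parts = []
        for field in fields(self):
            value = getattr(self, field.name).strip().rstrip(".")
            if not value:
                # A dropped element is the failure the fixed order guards against.
                raise ValueError(f"missing prompt element: {field.name}")
            parts.append(value)
        return ". ".join(parts) + "."


# Hypothetical shot; palette, atmosphere, mood, and negative values are invented.
shot = ShotPrompt(
    camera_spec="MCU on the girl, slow dolly in",
    lens_and_aspect="85mm portrait compression, shallow DOF, 2.40:1",
    lighting_source="Rembrandt key from screen-left lamp, no fill",
    palette="muted amber and deep shadow",
    composition="eyes just above frame center",
    atmosphere="light haze in the room",
    mood_register="quiet, restrained",
    attribution="reference: The Conjuring",
    negative_prompt="Negative: no lens flares, no text overlays",
)
prompt = shot.assemble()
print(prompt)
```

Making every field required means a shot cannot be assembled with an element silently missing, which is the point of holding the order fixed.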
Direct the invideo agent, don't prompt it per shot. Load a treatment doc once — your visual language across all the axes above — and the invideo agent holds it across every frame. "I want to stay on the feral guy when we run this scene. No back and forth cutting. We hold on him right up till he lunges" gets a correctly framed, correctly held shot back. Spin up a DOP agent per scene (different scenes want different eyes), a storyboard agent to visualize before you direct, a creative producer agent holding script and shot breakdown as the central context. Every model in the roster — Runway, Veo, Kling, Seedance 2.0 for video; Recraft, Nano Banana, GPT-Image-2 for frames — runs inside invideo, and the invideo agent routes each shot to the model that handles that language best (Seedance 2.0 for multi-segment continuity, Kling for held single takes, Recraft for portrait skin texture).
What does NOT translate — design around it. Shot-reverse-shot continuity, eyeline matching across cuts, motivated cuts, and 180-degree rule logic do not exist for the model — each generation is shot-local. For reverse angles, ask the invideo agent to apply art-director logic and surface what's behind the subject before generating ("that near wall doesn't exist yet — what should it be?"). Multi-character physical contact (ropes, props, bodies touching) and clean POV shots break models faster than anything else; when prompting fails, act the shot out on your phone, upload that footage as a reference video, and let the invideo agent route it through reference-to-video. For complex physical arrangements, hand-sketch the configuration and upload the drawing as a reference image for Nano Banana to build the character sheet from.
Cost reality at this directorial level. Documented productions ran $750–$5,000 all-in: a 70-second short at $750 (3,000 credits, 2 days), a 3-minute animated episode at $950 (about $315 per finished minute), a 90-second horror short at $870, and a 2-minute brand promo at $1,500. Generation-to-final ratios sit around 3 generations per usable shot, with a clip selection rate of about 25%. Budget overgeneration as a line item, not a leak.
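The arithmetic behind these figures can be sketched in a few lines. The function names and the 20-shot example are assumptions made for illustration; the dollar amounts, runtimes, credit count, and the 3-generations-per-usable-shot ratio come from the figures above. Note that $950 over exactly 180 seconds works out to roughly $317 per minute, so the quoted $315 presumably reflects rounding or a slightly longer final runtime.

```python
# (total cost in USD, finished runtime in seconds), taken from the figures above
productions = {
    "70-second short": (750, 70),
    "3-minute animated episode": (950, 180),
    "90-second horror short": (870, 90),
    "2-minute brand promo": (1500, 120),
}


def cost_per_finished_minute(total_usd: float, runtime_seconds: float) -> float:
    """All-in cost divided by finished runtime, expressed per minute."""
    return total_usd / (runtime_seconds / 60)


def generations_needed(usable_shots: int, generations_per_usable: int = 3) -> int:
    """Generations to budget for, given the overgeneration ratio."""
    return usable_shots * generations_per_usable


for name, (usd, seconds) in productions.items():
    print(f"{name}: ${cost_per_finished_minute(usd, seconds):,.2f} per finished minute")

# The 70-second short used 3,000 credits for $750.
usd_per_credit = 750 / 3000
print(f"Implied cost per credit: ${usd_per_credit:.2f}")

# Hypothetical 20-shot film at 3 generations per usable shot.
print(f"Generations to budget: {generations_needed(20)}")
```

Treating the generation count as an input to the budget, rather than discovering it afterwards, is what makes overgeneration a line item.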
Your on-set vocabulary is the advantage here. The skill is directing in the language you already speak; the invideo agent is the layer that converts it into the right model call.
"'I want to stay on the feral guy when we run this scene. No back and forth cutting. We hold on him right up till he lunges.' You can see how the agent 1 has responded, totally understanding exactly what I meant, right? This would have just not been possible in the manual prompting method."
— Hridaye, invideo's creative director