How do you structure AI video prompts for consistent cinematography across shots?
Last updated June 26, 2026
Structure every shot prompt as the same fixed, ordered string of elements — camera spec, lens and aspect ratio, lighting source, palette, composition, atmosphere, mood register, film/DP attribution, negative prompt — and load a one-time visual-language document into a persistent agent so that order, and the language inside each slot, repeats identically across every generation.
Start with one orientation point: invideo is an agentic video tool that reads a treatment document once and holds those directives across every shot, so the prompt structure below lives in context — you're not retyping it per generation.
The 9-element prompt assembly order (use this same order on every shot). Most public frameworks stop at 5–7 slots (subject, action, environment, camera, lighting, style); for cross-shot cinematography consistency, that's not enough. Assemble each prompt in this fixed sequence:
- Camera spec (body + sensor language)
- Lens and aspect ratio (focal length, spherical vs anamorphic, your film's delivery ratio)
- Lighting source (named practicals and key direction, e.g. "warm yellow from the lamps only")
- Palette (named tonal mode with hex values where you have them)
- Composition (blocking, framing, eye-line)
- Atmosphere (haze, particulate, weather)
- Mood register (the emotional stage this shot sits in)
- Film / DP attribution (the directorial grammar you're citing)
- Negative prompt (what this look explicitly is NOT — e.g. "not live-action, not photorealistic")
The order matters because it mirrors how a DP builds a shot — camera and lens before light, light before palette, palette before mood — and because identical order across shots is what produces identical visual DNA. Reorder the slots between shots and the model re-weights them differently; hold the order and the look holds.
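As a minimal sketch of what "same order on every shot" can look like in practice, the helper below joins nine slot values in one fixed sequence regardless of how the shot was written down. The slot names, the `assemble_prompt` function, and the angle-bracket placeholder values are illustrative assumptions, not part of any invideo API; only the lens, lighting, palette, and negative phrases are taken from the text above.

```python
# Hypothetical labels for the nine elements, in the article's assembly order.
SLOT_ORDER = (
    "camera_spec",
    "lens_and_aspect_ratio",
    "lighting_source",
    "palette",
    "composition",
    "atmosphere",
    "mood_register",
    "film_dp_attribution",
    "negative_prompt",
)


def assemble_prompt(slots: dict[str, str]) -> str:
    """Join the slot values in SLOT_ORDER, whatever order the dict was built in."""
    missing = [name for name in SLOT_ORDER if not slots.get(name, "").strip()]
    unknown = [name for name in slots if name not in SLOT_ORDER]
    if missing or unknown:
        raise ValueError(f"missing slots: {missing}; unknown slots: {unknown}")
    return ". ".join(slots[name].strip() for name in SLOT_ORDER) + "."


# Angle-bracket values are placeholders to be filled from your own treatment.
EXAMPLE_SHOT = {
    "camera_spec": "<camera body and sensor language>",
    "lens_and_aspect_ratio": "35mm spherical, 2.40:1",
    "lighting_source": "warm yellow from the lamps only",
    "palette": "split-toned amber and emerald",
    "composition": "<blocking, framing, eye-line>",
    "atmosphere": "<haze, particulate, weather>",
    "mood_register": "<emotional stage of this shot>",
    "film_dp_attribution": "<directorial grammar being cited>",
    "negative_prompt": "not live-action, not photorealistic",
}

print(assemble_prompt(EXAMPLE_SHOT))
```

Raising on a missing or unknown slot is a deliberate choice in this sketch: a shot that silently drops a slot is one way the order can drift without anyone noticing.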
Lock the language inside each slot once, in a visual-language document. Encode the look as a 14-section document (the sections include camera, angles, colour tone, atmosphere, mood, lighting, composition, movement, film palettes, prompt templates, negative prompts, and a quick-reference card) and upload it to a creative producer agent at project start. From there, every shot prompt is a thin override against that loaded grammar, not a full re-specification. One documented production wrote a 25-page treatment in this shape and ran 12 key parameters per shot against it; another encoded an 85:15 dark-to-light ratio and a 2.40:1 hard matte directly into the slot vocabulary so every generation inherited them.
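The "thin override" idea can be pictured as a dictionary merge: locked defaults come from the treatment, and a shot supplies only what differs. This is a sketch under assumptions. `LOCKED_GRAMMAR` and `resolve_shot` are hypothetical names, the agent's real internal representation is not documented here, and only the 2.40:1 hard matte, the 85:15 ratio, and the lamp key are values from the text.

```python
# Stand-in for the loaded visual-language document: one locked phrase per slot.
# Angle-bracket values are placeholders, not values from the article.
LOCKED_GRAMMAR = {
    "lens_and_aspect_ratio": "35mm spherical, 2.40:1 hard matte",
    "lighting_source": "warm yellow from the lamps only",
    "palette": "85:15 dark-to-light",
    "composition": "<default blocking and framing>",
}


def resolve_shot(overrides: dict[str, str]) -> dict[str, str]:
    """Return the locked grammar with this shot's overrides applied on top."""
    unknown = sorted(set(overrides) - set(LOCKED_GRAMMAR))
    if unknown:
        raise KeyError(f"override names a slot the grammar does not define: {unknown}")
    # A new dict is built, so the locked grammar itself is never mutated.
    return {**LOCKED_GRAMMAR, **overrides}


shot = resolve_shot({"composition": "low two-shot, eye-line left to right"})
print(shot["palette"])      # inherited from the locked grammar
print(shot["composition"])  # overridden for this shot only
```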
Write a one-line continuity brief at the top of every shot prompt. Before the 9 slots, prefix one line that names the scene, the prior shot it must cut against, and what carries over ("continues from prior wide; same lamp key, same 35mm spherical, same amber/emerald split-tone, characters from sheets v3"). This is what stops drift between adjacent shots — the model gets told what to inherit, not just what to render.
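One way to keep the continuity brief uniform is to build it from three fields instead of typing it freehand. The sketch below assumes a hypothetical `ContinuityBrief` structure and a made-up scene label ("Scene 4"); the carried-over items are the ones quoted in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContinuityBrief:
    scene: str                     # scene label (hypothetical naming scheme)
    prior_shot: str                # the shot this one must cut against
    carries_over: tuple[str, ...]  # what the new shot inherits

    def line(self) -> str:
        """Render the brief as a single line."""
        return f"{self.scene}: continues from {self.prior_shot}; {', '.join(self.carries_over)}"


def with_brief(brief: ContinuityBrief, shot_prompt: str) -> str:
    """Put the one-line brief above the slot prompt."""
    return brief.line() + "\n" + shot_prompt


brief = ContinuityBrief(
    scene="Scene 4",
    prior_shot="prior wide",
    carries_over=(
        "same lamp key",
        "same 35mm spherical",
        "same amber/emerald split-tone",
        "characters from sheets v3",
    ),
)
print(with_brief(brief, "<nine-slot prompt goes here>"))
```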
Use named tonal modes instead of free-form color words. "Warm lighting" produces a different warm every generation; "Mode A: split-toned amber and emerald, #E8A33D / #2F6B5B, lamp-only key, 85:15 dark-to-light" produces the same one. The same discipline applies to lens ("35mm spherical, circular bokeh, no horizontal flares") and to stock ("Kodak 5219 grain, halated highlights"). Physical parameters are what make the structure reproducible: a 5-element prompt with "cinematic lighting" in slot 5 will never lock.
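A named tonal mode is easiest to keep identical when it is stored once and rendered from a lookup, never retyped. The sketch below is an assumption about how you might hold that table yourself; `TONAL_MODES` and `palette_slot` are hypothetical names, and the single entry uses the Mode A values given above.

```python
# One entry per named mode; the values for "Mode A" are the ones quoted in the text.
TONAL_MODES = {
    "Mode A": {
        "description": "split-toned amber and emerald",
        "hex": ("#E8A33D", "#2F6B5B"),
        "key": "lamp-only key",
        "ratio": "85:15 dark-to-light",
    },
}


def palette_slot(mode_name: str) -> str:
    """Render a named mode as the exact palette phrase used in every prompt."""
    mode = TONAL_MODES[mode_name]  # KeyError on a mode that was never defined
    swatches = " / ".join(mode["hex"])
    return f"{mode_name}: {mode['description']}, {swatches}, {mode['key']}, {mode['ratio']}"


print(palette_slot("Mode A"))
# Mode A: split-toned amber and emerald, #E8A33D / #2F6B5B, lamp-only key, 85:15 dark-to-light
```

Because an undefined mode raises instead of falling back to a free-form word, a typo surfaces before the generation is spent.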
Hold a negative-prompt block on every shot. The most common drift mode is the model sliding toward photoreal or live-action when you wanted painterly, or toward generic "cinematic" when you wanted a specific grammar. One production locked an animated style with the explicit line "This MUST look and feel like [the target style] — not live action, not photorealistic. Every surface has hand-painted brushstroke texture" — and prefixed every subsequent prompt with that block. As Hridaye, invideo's creative director, put it: "camera continuity carries from the treatment doc forward. you're not telling the agent how to move the camera every time. you set it once. it holds."
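If you keep your shot prompts in a list before sending them, a small check can confirm that the negative block really is on every one. This is a sketch only: `NEGATIVE_BLOCK` is a shortened stand-in for the style-lock line quoted above, and `shots_missing_block` is a hypothetical helper, not an invideo feature.

```python
# Shortened stand-in for the style-lock block; substitute your own exact wording.
NEGATIVE_BLOCK = (
    "Not live action, not photorealistic. "
    "Every surface has hand-painted brushstroke texture."
)


def shots_missing_block(prompts: list[str], block: str = NEGATIVE_BLOCK) -> list[int]:
    """Return the indexes of prompts that do not contain the block verbatim."""
    return [index for index, prompt in enumerate(prompts) if block not in prompt]


prompts = [
    NEGATIVE_BLOCK + " <shot 1 slots>",
    "<shot 2 slots, block forgotten>",
    NEGATIVE_BLOCK + " <shot 3 slots>",
]
print(shots_missing_block(prompts))  # [1]
```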
Route shots to the right model, but keep the prompt structure constant. The invideo agent holds all the current generation models — Runway, Veo, Kling, Seedance 2.0 — and routes each shot to whichever handles your specific need (Seedance 2.0 reference-to-video for shots that must carry character and location context across a cut; Kling for native multi-shot sequences). The 9-element prompt structure does not change between models — same order, same slot language — only the routing changes. That's how you get cross-model consistency without re-writing the grammar per engine.
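The separation between routing and prompt can be pictured as a function that chooses a model name and returns the prompt untouched. In the product the agent makes this choice itself; the table and `route` function below are hypothetical illustrations, the need labels are invented for the sketch, and only the two pairings stated above are included.

```python
from typing import Optional

# Only these two need-to-model pairings are stated in the article.
ROUTES = {
    "carry_context_across_cut": "Seedance 2.0",  # reference-to-video
    "native_multi_shot": "Kling",
}


def route(need: str, prompt: str) -> tuple[Optional[str], str]:
    """Pick a model for the need; the prompt passes through unchanged.

    None means this sketch has no rule for the need and the choice is left
    to the agent.
    """
    return ROUTES.get(need), prompt


prompt = "<same nine-slot prompt for every engine>"
print(route("native_multi_shot", prompt))
print(route("carry_context_across_cut", prompt))
```

The point of returning the prompt as-is is that nothing in the routing step has any way to rewrite slot order or slot language.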
Validate the structure with a stress test before you generate the film. Ask the agent to apply your loaded grammar to a scene type your reference director never shot. If it returns clarifying questions in the slot vocabulary ("era? threat type? lens — staying spherical?") and produces a stylistically coherent frame, the structure has been internalized as grammar. If it returns generic "cinematic" output, the slot language isn't specific enough yet — go back and tighten the named modes.
Across documented productions running this discipline, the math holds: ~3 generations per usable shot, 25% editorial yield, and finished cinematography that reads as one film. Examples include a 70-second short for $750, a 3-minute episodic piece at $315 per finished minute, and a ~90-second piece at ~$580 per minute. The prompt structure is what makes that yield rate possible; without it, you're re-discovering the look every shot.
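The three budget figures mix totals and per-minute rates, so it can help to convert them to a common basis. The arithmetic below is a sketch that takes the quoted numbers at face value; the 90-second and $580 figures are approximate in the source, so the last result is approximate too. The function names are illustrative.

```python
def cost_per_finished_minute(total_cost: float, runtime_seconds: float) -> float:
    """Convert a total budget and a runtime into dollars per finished minute."""
    return total_cost / runtime_seconds * 60


def total_cost(per_minute: float, runtime_seconds: float) -> float:
    """Convert a per-minute rate and a runtime into a total budget."""
    return per_minute * runtime_seconds / 60


# 70-second short for $750
print(round(cost_per_finished_minute(750, 70), 2))  # 642.86
# 3-minute piece at $315 per finished minute
print(total_cost(315, 180))  # 945.0
# ~90-second piece at ~$580 per minute (approximate inputs)
print(total_cost(580, 90))  # 870.0
```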
camera continuity carries from the treatment doc forward. you're not telling the agent how to move the camera every time. you set it once. it holds. that's the flow state.
— Hridaye, invideo's creative director