What are the elements of a structured AI video prompt — and in what order should you assemble them?
Last updated June 26, 2026
A structured AI video prompt assembles nine elements in a fixed order: camera spec, lens & aspect ratio, lighting source, palette, composition, atmosphere, mood register, film/DP attribution, and negative prompt. Write them in that sequence for every shot — it locks the look and prevents drift across scenes when you iterate or generate variations.
Write each shot prompt in this order, every time:
1. Camera spec — the camera move and framing decision: "locked-off medium two-shot," "slow push-in handheld," "overhead static." This is the shot's geometry; everything downstream sits on top of it.
2. Lens & aspect ratio — focal length plus delivery format: "35mm spherical, 2.40:1 hard matte," or "50mm, in your film's aspect ratio." Spherical vs anamorphic matters — spherical produces circular bokeh and no horizontal flares, and getting this wrong propagates across every frame.
3. Lighting source — name the source and direction, not a generic mood word: "warm yellow from the practical lamps only, 85:15 dark-to-light ratio," not "warm lighting." Source-specific phrasing produces more accurate results than descriptors.
4. Palette — named tonal modes with exact values where possible: "Mode A — split-toned amber and emerald, #C68A3E / #2E5D4A." Hex values make palettes reproducible across shots and across models.
5. Composition — blocking and frame logic: where subjects sit in frame, foreground/background layering, negative space, eyelines.
6. Atmosphere — the physical layer in the air: haze, rain, dust, practical smoke, window light through curtains. This is what gives the frame depth before grading.
7. Mood register — the emotional stage of the moment, ideally pinned to a named stage in your treatment (e.g. "Stage C — dread before reveal"). Mood register tells the model what the shot is for, not just what it looks like.
8. Film/DP attribution — the visual grammar reference: "Wong Kar-wai / Christopher Doyle grammar," "James Wan horror grammar." Attribution invokes a whole language system the model can lean on, not a single image.
9. Negative prompt — explicit prohibitions: "not live-action, not photorealistic, no lens flares, no over-sharpening." Negative constraints are what keep style locked when the model wants to drift toward its defaults.
Why this order. Camera and lens decide the physics of the frame, so they come first; lighting, palette, composition, and atmosphere layer the look onto that geometry; mood and attribution carry the intent; the negative prompt fences off everything you don't want. Built this way, the prompt reads top-down from hardest constraint to softest, which fits the common observation that generation models tend to weight earlier tokens more heavily.
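The fixed order can also be enforced mechanically rather than by memory. The following is a minimal Python sketch, not part of any tool's API: the class name ShotPrompt, its field names, and the render method are all assumptions made for illustration. It relies on dataclass fields keeping their declaration order, so the rendered prompt always follows the nine-slot sequence and an empty slot fails loudly.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ShotPrompt:
    """Hypothetical container: one field per slot, declared in assembly order."""
    camera_spec: str
    lens_and_aspect: str
    lighting_source: str
    palette: str
    composition: str
    atmosphere: str
    mood_register: str
    attribution: str
    negative_prompt: str

    def render(self) -> str:
        """Join the nine slots into one prompt string, in declaration order."""
        parts = []
        for slot in fields(self):
            value = getattr(self, slot.name).strip()
            if not value:
                # Refuse to emit a prompt with a missing slot.
                raise ValueError(f"empty slot: {slot.name}")
            # Normalise so every slot ends with exactly one period.
            parts.append(value.rstrip(".") + ".")
        return " ".join(parts)


shot = ShotPrompt(
    camera_spec="Overhead static",
    lens_and_aspect="35mm spherical, 2.40:1 hard matte",
    lighting_source="Warm yellow from the practical lamps only, 85:15 dark-to-light ratio",
    palette="Split-toned amber and emerald, #C68A3E / #2E5D4A",
    composition="Subject center-right, deep negative space camera-left",
    atmosphere="Heavy haze, dust motes",
    mood_register="Mood: dread before reveal",
    attribution="James Wan horror grammar",
    negative_prompt="Negative: no lens flares, no over-sharpening",
)
print(shot.render())
```

Because the order lives in the class definition, reordering slots mid-project would mean editing the class itself, which makes an accidental reshuffle hard to miss.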
One worked example (a horror beat, in your film's aspect ratio): "Slow 12-frame push-in, locked horizon. 35mm spherical lens, 2.40:1 hard matte. Single practical desk lamp camera-left, 85:15 dark-to-light, no fill. Palette: desaturated cyan-shadow, sodium-amber highlight (#1E2A33 / #C68A3E). Subject framed center-right, deep negative space camera-left, soft foreground bokeh. Heavy haze, dust motes. Mood: Stage C — dread before reveal. James Wan grammar, spherical bokeh, what you hear before what you see. Negative: no anamorphic flares, no over-sharpening, no plasticky skin, not photorealistic stock-photo lighting."
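Because the palette in the example is pinned to hex values, it can be sanity-checked before the prompt is sent to any model. This is a small sketch, assuming six-digit hex codes written with a leading #; the function name extract_palette is hypothetical and the check is purely illustrative.

```python
import re

# Six hex digits after '#', not followed by a further hex digit.
HEX_COLOR = re.compile(r"#([0-9A-Fa-f]{6})(?![0-9A-Fa-f])")


def extract_palette(prompt: str) -> list:
    """Return every six-digit hex color in the prompt as an (R, G, B) tuple."""
    colors = []
    for match in HEX_COLOR.finditer(prompt):
        digits = match.group(1)
        # Split 'RRGGBB' into three byte values.
        colors.append(tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4)))
    return colors


palette_line = (
    "Palette: desaturated cyan-shadow, sodium-amber highlight (#1E2A33 / #C68A3E)."
)
print(extract_palette(palette_line))  # [(30, 42, 51), (198, 138, 62)]
```

A shot whose palette slot yields an empty list, or different values from the previous shot, is a quick signal that the palette has drifted or was described only in words.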
Where to keep the elements. invideo is an agentic video creation tool with all the current video and image models available, so the assembly order doesn't have to live in your head shot-by-shot. Load your nine-element template (or a fuller treatment that encodes it) into the invideo agent once at project start; the agent then holds those directives across every prompt and routes each shot to the right model (Seedance 2.0, Kling, Veo, Runway, Nano Banana, Recraft, GPT-Image-2). In one documented production, a 25-page treatment that encoded this nine-element structure delivered 12 key parameters per shot and held character, lighting, and lens grammar consistent across a 70-second short film built in 2 days for $750.
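The idea of loading directives once and reusing them on every shot can be pictured as project-level defaults merged with per-shot values. The sketch below is plain Python and does not represent invideo's interface, which is not described here; SLOT_ORDER, PROJECT_DEFAULTS, and build_shot are hypothetical names chosen for illustration.

```python
from typing import Mapping

# The nine slots, in assembly order.
SLOT_ORDER = (
    "camera_spec", "lens_and_aspect", "lighting_source", "palette",
    "composition", "atmosphere", "mood_register", "attribution",
    "negative_prompt",
)

# Directives that stay fixed for the whole project (illustrative values).
PROJECT_DEFAULTS = {
    "lens_and_aspect": "35mm spherical, 2.40:1 hard matte",
    "palette": "Mode A, split-toned amber and emerald, #C68A3E / #2E5D4A",
    "attribution": "James Wan horror grammar",
    "negative_prompt": "Negative: no anamorphic flares, no over-sharpening",
}


def build_shot(defaults: Mapping[str, str], shot: Mapping[str, str]) -> str:
    """Merge project defaults with per-shot slots and render in fixed order."""
    unknown = (set(defaults) | set(shot)) - set(SLOT_ORDER)
    if unknown:
        raise KeyError(f"unknown slots: {sorted(unknown)}")
    merged = {**defaults, **shot}  # per-shot values override defaults
    missing = [s for s in SLOT_ORDER if not merged.get(s, "").strip()]
    if missing:
        raise ValueError(f"missing slots: {missing}")
    return " ".join(merged[s].strip().rstrip(".") + "." for s in SLOT_ORDER)


shot_12 = {
    "camera_spec": "Slow push-in handheld",
    "lighting_source": "Single practical desk lamp camera-left, no fill",
    "composition": "Subject framed center-right, deep negative space camera-left",
    "atmosphere": "Heavy haze, dust motes",
    "mood_register": "Mood: dread before reveal",
}
print(build_shot(PROJECT_DEFAULTS, shot_12))
```

Keeping lens, palette, attribution, and negative prompt in the defaults means each shot only supplies what actually changes, while the output order never varies.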
As Hridaye, invideo's creative director, put it: "One agent that reads your treatment once and holds every directive across every shot, every scene. No re-prompting. No drift." Two adjacent notes: write the negative prompt every time (it's the cheapest insurance against style drift), and resist mid-project reordering — the value of the fixed sequence is that you and the model both know where each piece lives.
These are the nine slots and the order that works across the productions we've shipped — adapt the contents of each slot to your film, but keep the slots.