How do you write a treatment document that an AI agent can use to direct a video autonomously?
Last updated June 26, 2026
A treatment document an AI agent can direct from is a standing directive set the invideo agent reads once and applies to every shot: codified visual language (camera, lighting, palette, composition), emotional-stage rules with explicit never-do constraints, an audio module, exception cases, and a defined per-shot output format. One documented production loaded a 25-page version into the invideo agent and shot a 70-second film for $750 in 2 days.
Write the document as the briefing you would give a human crew on day one — everything in your head, organized, so the invideo agent never has to guess. invideo is an agentic video creation tool, and its agent holds an uploaded treatment in persistent context across the whole production, so the document functions as a system prompt rather than a per-shot prompt. You don't need it perfect upfront: start with the assets and context you have and iterate.
1. Codify the visual language into discrete sections. Treat style as a rule system, not adjectives. One documented director document ran 14 sections, including camera, angles, colour tone, atmosphere, mood, lighting, composition, movement, film palettes, prompt templates, negative prompts, and a quick-reference card. Be exact where exactness is reproducible: encode colour philosophy as named tonal modes with hex values ('Mode A: split-toned amber and emerald'), and specify lighting as ratios and sources (an 85:15 dark-to-light ratio; 'warm yellow from the lamps only') rather than generic descriptors. The goal is grammar, not surface style: rules the invideo agent can apply to scenes the source filmmaker never shot.
2. Add an emotional architecture with per-stage rules. Structure the film's arc into stages — one documented horror treatment used five escalating emotional stages — and lock camera, lighting, and sound rules to each stage. Crucially, include a 'what never to do' section per stage: negative-space constraints make autonomous decisions far easier for the invideo agent than positive rules alone. This paid off in production: the invideo agent caught shadows drifting blue-green against a Stage A rule without being asked, and later flagged that a reveal shot was running at the wrong stage register — an error the director had missed.
3. Include an audio architecture module. Sound direction belongs in the treatment, not just the edit — in the documented James Wan-style production, the document carried a full audio module on the logic of what you hear before what you see, and prop briefs encoded diegetic sound ('hard material, so it makes a horrible sound when it falls').
4. Separate exceptions into their own directive. If the style you're encoding has outlier works or special cases, isolate them in a dedicated section so the invideo agent doesn't misapply generalized rules to the wrong context. One documented director-protocol document did exactly this to keep atypical films from contaminating the default grammar.
5. Define the invideo agent's per-shot output format. Tell the agent holding the document what every scene request must return. One documented production specified a fixed set of per-shot parameters: film reference, shot design, length, style interpretation, emotional register, lens, lighting plan, colour script, atmosphere layers, blocking, final prompt, negative prompt, and revision prompt. Pair this with prompt templates and a fixed assembly order: one production enforced a 9-element sequence (camera spec, lens and aspect ratio, lighting source, palette, composition, atmosphere, mood register, film attribution, negative prompt) across every frame. Write negative constraints explicitly: a style block that must stay animated should literally prohibit live-action and photorealistic output. Because the invideo agent routes each shot to whichever video model fits (Veo, Kling, and Seedance 2.0 are all available inside invideo), your prompt templates carry the style into every generation regardless of model.
6. Upload once, then validate before generating anything. Load the document at project start as the invideo agent's permanent instruction set. Stress-test it on a subject the encoded director never touched: one creator requested a courtroom thriller through a horror director's lens, and the invideo agent asked clarifying questions (era, nature of threat) before returning stylistically coherent output, which confirmed the grammar was internalized rather than pattern-matched. Also challenge the invideo agent's technical claims: when questioned, it corrected its own lens analysis from anamorphic to spherical, citing the actual 2.40:1 hard-matte format. Catching that before generation prevents the error from propagating across every asset.
7. Manage context on long projects. A loaded treatment removes per-shot re-prompting — once context is held, a three-word continuation prompt ('Everything should match') maintains character, lighting, lens grammar, and spatial continuity across a sequence. For long-form work, divide the project into acts and fully complete one before starting the next, so the invideo agent never loses context mid-production. If you're running a crew of sub-agents, initialize a creative producer agent with the treatment, script, and character details first — it becomes the vision-holder that grounds every specialized sub-agent in the same document.
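The rules in steps 1, 2, and 5 can also be drafted as structured data before they are written up as prose. The sketch below is one possible way to do that, not invideo's format or API: it keeps the 9-element assembly order from step 5, attaches never-do terms to an emotional stage as in step 2, and checks a shot against both. The class, function, and key names are hypothetical, and every example value other than the quoted lighting rule is invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# The 9-element assembly order from step 5. The key names are hypothetical.
ASSEMBLY_ORDER = (
    "camera_spec",
    "lens_and_aspect_ratio",
    "lighting_source",
    "palette",
    "composition",
    "atmosphere",
    "mood_register",
    "film_attribution",
    "negative_prompt",
)


@dataclass(frozen=True)
class StageRules:
    """Rules locked to one emotional stage, including explicit never-do terms."""

    name: str
    lighting: str
    never: tuple[str, ...] = ()


def assemble_prompt(elements: dict[str, str]) -> str:
    """Join the nine elements in the fixed order, failing loudly on any gap."""
    missing = [key for key in ASSEMBLY_ORDER if not elements.get(key)]
    if missing:
        raise ValueError(f"missing prompt elements: {missing}")
    return ". ".join(elements[key] for key in ASSEMBLY_ORDER)


def violations(stage: StageRules, elements: dict[str, str]) -> list[str]:
    """Return never-do terms that appear anywhere outside the negative prompt."""
    positive = " ".join(
        value.lower()
        for key, value in elements.items()
        if key != "negative_prompt"
    )
    return [term for term in stage.never if term.lower() in positive]


# Illustrative stage; the never-do terms are assumptions based on the examples above.
stage_a = StageRules(
    name="Stage A",
    lighting="85:15 dark-to-light ratio, warm yellow from the lamps only",
    never=("blue-green shadows", "live-action"),
)

# Illustrative shot; every value except the lighting rule is invented.
shot = {
    "camera_spec": "slow push-in on a locked tripod",
    "lens_and_aspect_ratio": "spherical lens, 2.40:1",
    "lighting_source": stage_a.lighting,
    "palette": "Mode A, split-toned amber and emerald",
    "composition": "subject low in frame, heavy negative space",
    "atmosphere": "thin haze, dust in the lamp light",
    "mood_register": "Stage A, quiet dread",
    "film_attribution": "in the style described in the treatment",
    "negative_prompt": "no blue-green shadows, no live-action, no photorealistic output",
}

print(assemble_prompt(shot))
print(violations(stage_a, shot))
```

Keeping the order in a single tuple means a template cannot silently drop or reorder an element, and excluding the negative prompt from the check lets the never-do terms appear exactly where they belong.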
The payoff is documented: an agent holding a deeply written treatment applied a slow-shutter motion-smear effect from page 17 without being prompted, pulled a named principle from page 12 and applied it to a scene type the document never addressed, and sequenced a six-shot ending when the creator couldn't write one. Across treatment-driven productions, finished films cost between $750 (70 seconds, 2 days, 3,000 credits) and $870 (90 seconds, 2 days, ~400 video generations and 30 image generations), a spread that reflects differences in team and approach.
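To compare the two productions on equal footing, the quoted totals can be normalized to cost per finished second. This is a small worked calculation using only the dollar amounts and runtimes stated above; the function name is hypothetical.

```python
# Figures quoted above: (total cost in USD, runtime in seconds).
productions = {
    "70-second film": (750, 70),
    "90-second film": (870, 90),
}


def cost_per_second(total_usd: float, runtime_s: float) -> float:
    """Total production cost divided by finished runtime."""
    return total_usd / runtime_s


for name, (usd, secs) in productions.items():
    print(f"{name}: ${cost_per_second(usd, secs):.2f} per finished second")
# 70-second film: $10.71 per finished second
# 90-second film: $9.67 per finished second
```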
Just think about it as all the information you want your crew to have as you start building with them. So if you want them to have all the thoughts that are in your head, just put them down in an organized fashion and upload them onto the agent and watch the magic after that.
— invideo's creative team