Do you need a treatment document before generating AI video, or can you just start prompting?
Last updated June 26, 2026
No, you don't strictly need one — but skipping it is the main reason AI films drift visually scene to scene. The fix is loading a treatment document (your visual language: camera, palette, lighting, composition, mood) into the invideo agent once at project start, so it holds every directive across every shot without re-prompting.
If you're making anything longer than a single shot, write the treatment first and load it into the invideo agent before generating a frame. The failure mode it prevents is prompting scene-by-scene and watching the look drift — palette shifts, lens grammar wanders, character details slip. A loaded treatment turns the agent into a context-holder: you direct in natural language, it cross-checks each generation against the document, and what comes back is a decision, not a draft.
What goes in the treatment. Encode the director's visual language as discrete, teachable directives, not vibes. The Wong Kar-wai document used for a 70-second short film ran 25 pages across 14 sections (including camera, angles, color tone, atmosphere, mood, lighting, composition, movement, film palettes, prompt templates, negative prompts, and a quick-reference card) and enforced a 9-element prompt assembly order on every frame: camera spec, lens and aspect ratio, lighting source, palette, composition, atmosphere, mood register, film/DP attribution, negative prompt. A James Wan horror treatment for a ~90-second film added a 9-step shot design process, an 8-step color grading process, an 85:15 dark-to-light ratio, a five-stage emotional architecture with locked rules per stage, and a full audio architecture section, because half of what makes that grammar land is what you hear before what you see. Isolate exceptions in their own directive (for example, a director's outlier film gets its own section) so the agent doesn't misapply the general rule.
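The 9-element assembly order can be pictured as a fixed-order record. The sketch below is illustrative only: the class name, field names, and sample values are assumptions made for this example, not text from either treatment and not an invideo API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FramePrompt:
    # Field order mirrors the 9-element assembly order described above.
    camera_spec: str
    lens_and_aspect_ratio: str
    lighting_source: str
    palette: str
    composition: str
    atmosphere: str
    mood_register: str
    film_dp_attribution: str
    negative_prompt: str

    def assemble(self) -> str:
        """Join the nine elements in declaration order; reject empty ones."""
        parts = []
        for f in fields(self):  # fields() preserves declaration order
            value = getattr(self, f.name).strip()
            if not value:
                raise ValueError(f"missing treatment element: {f.name}")
            parts.append(value)
        return "; ".join(parts)


# Placeholder values for illustration (hypothetical, not from the treatment).
example = FramePrompt(
    camera_spec="handheld camera",
    lens_and_aspect_ratio="35mm lens at 1.85:1",
    lighting_source="single neon practical",
    palette="saturated greens and reds",
    composition="subject framed through a doorway",
    atmosphere="humid night air",
    mood_register="restrained longing",
    film_dp_attribution="reference film and DP named in the treatment",
    negative_prompt="no text overlays",
)
prompt = example.assemble()
print(prompt)
```

Keeping the order in one structure means a missing element fails loudly before a frame is generated, instead of showing up later as drift.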
When a lighter setup is enough. For a single shot, a short social cut, or a one-scene test, a few anchor images and a 5-line style note are fine — you don't need 25 pages. Treatments earn their cost the moment a project has multiple scenes, multiple characters, or any continuity requirement. The rule of thumb: if you'd brief a human crew on the look, write that brief down and upload it.
What to load alongside the treatment. Drop the full script in once so the agent has character arcs, themes, and motifs as narrative context. Lock character sheets (multi-angle turnarounds plus close-ups) and environment references before any video generation — four options per asset, pick one, lock it. Before generating, force the four pre-production answers that change every frame: who is the character, who/what is the antagonist, what's the prop, and what's the deliverable format. With the document loaded, a three-word continuation prompt — "Everything should match" — is enough to carry character, lighting, lens grammar, and spatial continuity across a multi-shot sequence.
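As a rough sketch, the pre-production gate described above can be modeled as a checklist that blocks generation until the four answers exist and at least one asset is locked. All names here (`PreProduction`, `lock`, `ready`) are hypothetical and only illustrate the workflow; they do not describe how the invideo agent is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class PreProduction:
    # The four answers that change every frame.
    character: str = ""
    antagonist: str = ""
    prop: str = ""
    deliverable_format: str = ""
    # Asset name -> the single option that was picked and locked.
    locked_assets: dict = field(default_factory=dict)

    def lock(self, asset: str, options: list, choice: str) -> None:
        """Four options per asset, pick one, lock it."""
        if len(options) != 4:
            raise ValueError("expected exactly four options per asset")
        if choice not in options:
            raise ValueError("choice must be one of the options")
        self.locked_assets[asset] = choice

    def missing(self) -> list:
        """Names of the pre-production questions still unanswered."""
        answers = {
            "character": self.character,
            "antagonist": self.antagonist,
            "prop": self.prop,
            "deliverable_format": self.deliverable_format,
        }
        return [name for name, value in answers.items() if not value.strip()]

    def ready(self) -> bool:
        return not self.missing() and bool(self.locked_assets)


pre = PreProduction(character="lead", antagonist="the house", prop="music box")
pre.lock("lead_character_sheet", ["A", "B", "C", "D"], "C")
print(pre.missing())  # deliverable format has not been answered yet
print(pre.ready())
```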
What you get back in time and cost. Documented productions that loaded a treatment first ran $750 (3,000 credits) for a 70-second short over 2 days, $870 (4,100 credits) for a ~90-second horror short over 2 days, $950 for a 3-minute animated episode at roughly $315 per finished minute, and $1,500 (6,000–6,500 credits) for a 2-minute brand promo in 3 days — versus a 1-week manual-prompting equivalent and a ~2-month traditional shoot at $100,000–$500,000. Range across these productions: $315–$750 per finished minute, 2–5 production days, 1–4 people. The treatment is what makes those numbers reproducible — not luck.
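The per-finished-minute range can be checked with simple arithmetic. This sketch uses only the totals and runtimes quoted above and treats "~90 seconds" as exactly 90, which is an assumption; the animated episode works out to about $317, consistent with the "roughly $315" figure.

```python
# (label, total cost in USD, runtime in seconds), as quoted in the text.
productions = [
    ("70-second short", 750, 70),
    ("~90-second horror short", 870, 90),  # assumes exactly 90 seconds
    ("3-minute animated episode", 950, 180),
    ("2-minute brand promo", 1500, 120),
]


def cost_per_finished_minute(total_usd: float, runtime_seconds: float) -> float:
    """Total cost divided by runtime expressed in minutes."""
    return total_usd / (runtime_seconds / 60)


per_minute = {
    label: round(cost_per_finished_minute(total, seconds))
    for label, total, seconds in productions
}
for label, value in per_minute.items():
    print(f"{label}: ${value} per finished minute")
```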
Tool-agnostic, but the agent is the routing layer. The treatment-first approach works regardless of which video model you generate on: Runway, Veo, Kling, and Seedance 2.0 all benefit from a stable style context. invideo has all of them, and the invideo agent routes each shot to the right one (Seedance 2.0 reference-to-video for continuous takes; Recraft for photoreal portraits with skin imperfections; Nano Banana Pro for character sheets), so you don't have to choose a platform for each model. The treatment lives once, in one place, and every model inherits it.
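One way to picture the routing layer is a lookup table that attaches the same treatment to every job. This is a sketch under stated assumptions: the shot-type keys, the `build_job` function, and the job dictionary are hypothetical, and the agent's real routing logic is not documented here. Only the three model pairings come from the text above.

```python
# Shot type -> model, using the three pairings named in the text.
# The dictionary keys are hypothetical labels chosen for this example.
ROUTES = {
    "continuous_take": "Seedance 2.0 (reference-to-video)",
    "photoreal_portrait": "Recraft",
    "character_sheet": "Nano Banana Pro",
}


def build_job(shot_type: str, direction: str, treatment: str) -> dict:
    """Pick a model for the shot and attach the shared treatment."""
    if shot_type not in ROUTES:
        raise KeyError(f"no route defined for shot type: {shot_type}")
    return {
        "model": ROUTES[shot_type],
        "style_context": treatment,  # the same document for every model
        "direction": direction,
    }


treatment = "full treatment text loaded once at project start"
jobs = [
    build_job("character_sheet", "lead, multi-angle turnaround", treatment),
    build_job("continuous_take", "Everything should match", treatment),
]
for job in jobs:
    print(job["model"], "<-", job["direction"])
```

The point of the design is that the style context is passed by the routing layer, not retyped per model, so switching models does not change the look.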
The crew-of-agents setup. Start with a creative producer agent holding the full script, shot breakdown, and character details; that is the vision anchor every other agent inherits from. Then spin up specialists: a storyboard agent to visualize before you direct, a DOP agent per scene (different scenes want different eyes), a costume designer agent you can brief on feel when you don't have an exact spec, a production designer agent, and a director's-assistant agent to sequence shots. For one production, six to eight agents ran in parallel.
As Hridaye, invideo's creative director, puts it: "One agent that reads your treatment once and holds every directive across every shot, every scene. No re-prompting. No drift. So now, you direct, and the Agent remembers." The skill the treatment surfaces isn't prompting — it's directing.