Directing matters more than prompt engineering. What moves AI filmmaking forward is a written visual language the agent holds across every shot, a locked production context (script, characters, world, references), a multi-agent crew structured by role, and the editorial judgment to choose what's usable — prompt wording is the smallest variable in that chain.
Treat prompts as fragile hypotheses and put your time into the layers above them. Single-prompt thinking breaks the moment your film needs continuity across scenes; what holds a film together is project-level context — a treatment document, a character bible, a shot breakdown, locked references — loaded once and carried by an agent across every frame. The invideo agent is built to hold that context: you load the film's directorial framework up front, then direct shot by shot in plain language rather than re-engineering prompts.
Write a visual language document, not better prompts. Codify the film's camera grammar, lens, lighting, palette (with hex values), composition, atmosphere, and negative prompts in a structured document, and load it once. One documented production used a 25-page Wong Kar-wai-style guide split across 14 sections; another encoded a James Wan horror grammar with an 85:15 dark-to-light ratio and a five-stage emotional architecture. As Hridaye, invideo's creative director, puts it: "One agent that reads your treatment once and holds every directive across every shot, every scene. No re-prompting. No drift. So now, you direct, and the Agent remembers."
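To make "a structured document, loaded once" concrete, here is a minimal sketch of a visual language document as data. The `VisualLanguage` class, its field names, and every sample value except the 85:15 ratio are illustrative assumptions, not an invideo format; the point is only that the rules live in one validated place instead of being retyped into each prompt.

```python
import re
from dataclasses import dataclass
from typing import Dict, Tuple

HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


@dataclass(frozen=True)
class VisualLanguage:
    """One film's visual rules, written once and reused for every shot (hypothetical schema)."""
    camera_grammar: str
    lens: str
    lighting: str
    palette: Dict[str, str]          # swatch name -> hex value
    composition: str
    atmosphere: str
    negative_prompts: Tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Reject vague colour names: every swatch must be a 6-digit hex value.
        for name, value in self.palette.items():
            if not HEX_COLOR.match(value):
                raise ValueError(f"palette entry {name!r} is not a hex colour: {value!r}")

    def render(self) -> str:
        """Flatten the rules into plain text that can be loaded as context."""
        swatches = ", ".join(f"{name} {value}" for name, value in self.palette.items())
        lines = [
            f"CAMERA: {self.camera_grammar}",
            f"LENS: {self.lens}",
            f"LIGHTING: {self.lighting}",
            f"PALETTE: {swatches}",
            f"COMPOSITION: {self.composition}",
            f"ATMOSPHERE: {self.atmosphere}",
            f"AVOID: {'; '.join(self.negative_prompts)}",
        ]
        return "\n".join(lines)


# Sample values are placeholders; only the 85:15 ratio comes from the text above.
guide = VisualLanguage(
    camera_grammar="slow push-ins, no handheld",
    lens="35mm prime",
    lighting="85:15 dark-to-light ratio",
    palette={"shadow": "#0B0C10", "practical": "#C3073F"},
    composition="subject off-centre, deep negative space",
    atmosphere="damp, still air",
    negative_prompts=("lens flare", "bright daylight"),
)
print(guide.render())
```

A real guide would run to many more sections; the design choice worth keeping is that the palette is validated as hex, so "moody red" can never slip in where a precise value belongs.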
Lock the project context before generating anything. Upload the full script, answer the four questions that change every frame (character, antagonist, prop, deliverable format), and generate four options for each character sheet and environment reference, locking the best of each before video generation begins. This is the step that prevents drift downstream: the documented horror short generated 11 reference images for 4 characters and 1 prop before a single video clip ran, and the Wong Kar-wai short generated four variations per asset and locked the best.
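The lock step can be read as a simple gate: no video until the script, the four answers, and every reference are in place. The sketch below is one way to express that checklist; `ProjectContext`, `ReferenceAsset`, and the file names are hypothetical and do not describe how invideo stores a project.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The four questions named in the text above.
REQUIRED_ANSWERS = ("character", "antagonist", "prop", "deliverable_format")


@dataclass
class ReferenceAsset:
    """A character sheet or environment reference with its candidate images."""
    name: str
    variations: List[str]
    locked: Optional[str] = None

    def lock(self, choice: str) -> None:
        # Only a variation that was actually generated can be locked.
        if choice not in self.variations:
            raise ValueError(f"{choice!r} is not a variation of {self.name!r}")
        self.locked = choice


@dataclass
class ProjectContext:
    script: str = ""
    answers: Dict[str, str] = field(default_factory=dict)
    assets: List[ReferenceAsset] = field(default_factory=list)

    def missing(self) -> List[str]:
        """List everything that still blocks video generation."""
        problems = []
        if not self.script.strip():
            problems.append("script")
        for key in REQUIRED_ANSWERS:
            if not self.answers.get(key):
                problems.append(f"answer:{key}")
        for asset in self.assets:
            if asset.locked is None:
                problems.append(f"reference:{asset.name}")
        return problems

    def ready_for_video(self) -> bool:
        return not self.missing()


# Placeholder project: script and answers are in, one reference is still unlocked.
hero = ReferenceAsset("hero", ["hero_v1.png", "hero_v2.png", "hero_v3.png", "hero_v4.png"])
context = ProjectContext(
    script="INT. HALLWAY - NIGHT ...",
    answers={key: "decided" for key in REQUIRED_ANSWERS},
    assets=[hero],
)
print(context.missing())          # the hero reference still blocks generation
hero.lock("hero_v3.png")
print(context.ready_for_video())
```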
Direct a crew of agents, don't prompt a model. Set up a creative producer agent to hold the script and shot breakdown, then assign a storyboard agent, a DOP agent per scene (different scenes want different eyes), a costume agent you brief on mood when specs aren't fixed, and a production designer agent. Documented productions ran 6–8 specialist agents simultaneously. Speak to each one the way you'd speak to that crew member on set — "hold on him right up till he lunges, no back and forth cutting" — not as a prompt template.
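As a rough illustration of the crew structure described above, the sketch below builds a role list with one DOP per scene. `CrewAgent` and `build_crew` are hypothetical names and the briefs are placeholder wording; this models the org chart only, not any invideo API for creating agents.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CrewAgent:
    role: str
    brief: str
    scene: Optional[int] = None   # set only for scene-specific roles


def build_crew(scene_count: int) -> List[CrewAgent]:
    """Return one agent per role, plus a separate DOP for each scene."""
    if scene_count < 1:
        raise ValueError("a film needs at least one scene")
    crew = [
        CrewAgent("creative producer", "Hold the script and the shot breakdown."),
        CrewAgent("storyboard", "Board every shot in the breakdown."),
    ]
    # Different scenes want different eyes, so each scene gets its own DOP.
    crew += [
        CrewAgent("DOP", f"Light and frame scene {n}.", scene=n)
        for n in range(1, scene_count + 1)
    ]
    crew += [
        CrewAgent("costume", "Work from the mood brief where specs are not fixed."),
        CrewAgent("production designer", "Keep the world consistent across sets."),
    ]
    return crew


for agent in build_crew(scene_count=3):
    label = agent.role if agent.scene is None else f"{agent.role} (scene {agent.scene})"
    print(f"{label}: {agent.brief}")
```

With three scenes this yields seven agents, which sits inside the 6–8 range the documented productions ran.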
Develop editorial judgment, because most generations don't make the cut. Across documented productions, roughly 3 generations are needed per usable shot, only about 25% of clips reach the final cut (41 of 164 in one episode), and on average only 5 seconds of each 15-second clip is used. The skill that compresses raw output into a film is choosing, not prompting, with the invideo agent acting as a maker-checker on the rough cut to flag pacing, sound, and emotional-register errors.
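The figures above can be turned into a back-of-envelope planning estimate. This sketch assumes the 41-of-164 keep rate and the 5-of-15-seconds usage hold as averages for your own film, which they may not; `estimate_generation_load` is a hypothetical helper, and the separate "3 generations per usable shot" figure is deliberately left out rather than guessed at.

```python
from fractions import Fraction
from math import ceil

KEEP_RATE = Fraction(41, 164)   # clips that reached the final cut in one episode
USED_SECONDS = 5                # average seconds used from each kept clip
CLIP_SECONDS = 15               # length of each generated clip


def estimate_generation_load(runtime_seconds: int) -> dict:
    """Estimate how much raw footage a target runtime implies."""
    clips_in_cut = ceil(Fraction(runtime_seconds, USED_SECONDS))
    clips_generated = ceil(clips_in_cut / KEEP_RATE)
    raw_seconds = clips_generated * CLIP_SECONDS
    return {
        "clips_in_cut": clips_in_cut,
        "clips_generated": clips_generated,
        "raw_seconds": raw_seconds,
        "footage_yield": Fraction(runtime_seconds, raw_seconds),
    }


plan = estimate_generation_load(60)
print(plan["clips_in_cut"])      # 12 clips on screen for a 60-second film
print(plan["clips_generated"])   # 48 clips generated to get them
print(plan["raw_seconds"])       # 720 seconds of raw footage
print(plan["footage_yield"])     # 1/12 of generated footage ends up on screen
```

Under these assumptions, a one-minute film means reviewing twelve minutes of generated footage, which is why the choosing takes more of the work than the prompting.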
Bring on-set experience, not tutorial knowledge. Lens grammar, blocking, coverage logic, when to hold and when to cut — these translate directly to directing agents. As Hridaye puts it: "The real unlock isn't the tech. It's that the skill that makes this work isn't prompting — it's directing. And that doesn't come from a tutorial. It comes from being on set." Working act-by-act (finish 25% before moving on) and questioning the agent's technical claims (it self-corrects on lens type and aspect ratio when challenged) matter more than any prompt phrasing.
Choose the right model for the shot — but don't make that your job alone. Different shots want different models: Seedance 2.0 reference-to-video for continuity across segments, Kling for multi-shot sequences, Veo for naturalistic motion, Recraft and Nano Banana for image work. invideo holds all current models and upscalers, and the invideo agent routes each shot to the right one based on the context you've loaded — so model selection becomes a directorial choice, not a platform-hopping chore.
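The shot-to-model pairings named above amount to a small lookup table. The sketch below writes them down as one; the `ROUTING` keys and the `candidate_models` function are assumptions for illustration, and this is not how the invideo agent actually routes shots, which the text does not describe.

```python
from typing import Dict, List

# Pairings taken from the paragraph above; the key names are made up for this sketch.
ROUTING: Dict[str, List[str]] = {
    "continuity_across_segments": ["Seedance 2.0 (reference-to-video)"],
    "multi_shot_sequence": ["Kling"],
    "naturalistic_motion": ["Veo"],
    "image_work": ["Recraft", "Nano Banana"],
}


def candidate_models(shot_need: str) -> List[str]:
    """Return the models suggested for a given kind of shot."""
    try:
        return list(ROUTING[shot_need])
    except KeyError:
        known = ", ".join(sorted(ROUTING))
        raise ValueError(f"unknown shot need {shot_need!r}; expected one of: {known}") from None


print(candidate_models("multi_shot_sequence"))
print(candidate_models("image_work"))
```

Keeping the table explicit makes the directorial choice visible: changing a model for a kind of shot is a one-line edit rather than a habit spread across prompts.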
Prompt wording is the last 5% of the work. The first 95% is the document, the context, the crew structure, and the cut.