How do you use multiple reference images for AI world-building?
Last updated June 26, 2026
Don't dump all your references in one prompt. Organize them into thematic batches — spatial logic, screen/architecture, color and lighting mood, biome — and feed each batch to the invideo agent with explicit instructions on what to adopt and what to ignore. Then generate image grids (not single shots), pick the best panels, and lock those as your world anchors for every downstream shot.
The invideo agent is an agentic video tool that holds project context across shots, so the references you batch once stay loaded for the entire film. Start by sorting your references into 3–5 thematic clusters of roughly 3–5 images each — for example: spatial logic (how rooms and exteriors are laid out), screen function (how a dome or window behaves visually), color theory (palette and grade), and architectural or biome style. Keeping batches small and named lets you point the agent at them by theme inside a prompt.
For each batch, tell the agent what to extract AND what to ignore. One documented production batched stills as references for a dome-as-screen world and instructed the agent to take only the screen idea and drop the small-room scale of the source. Exclusion-prompting is as load-bearing as inclusion-prompting. Hridaye, invideo's creative director, puts it directly: "I told it what to take and just as importantly, what to leave out." Without that instruction, the agent can inherit unwanted scale, props, or genre cues from the references.
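If it helps to plan batches before you start prompting, here is a minimal Python sketch of the adopt/ignore idea. It only formats a plain-text instruction you could paste next to a batch; it does not call any invideo API. The `ReferenceBatch` class, the `batch_instruction` function, and the file names are hypothetical, made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceBatch:
    """One thematic cluster of reference images (hypothetical structure)."""
    theme: str
    images: list[str]
    adopt: list[str]
    ignore: list[str] = field(default_factory=list)


def batch_instruction(batch: ReferenceBatch) -> str:
    """Render a plain-text instruction that names what to adopt and what to ignore."""
    lines = [
        f"Reference batch: {batch.theme} ({len(batch.images)} images)",
        "Adopt: " + "; ".join(batch.adopt),
    ]
    if batch.ignore:
        lines.append("Ignore: " + "; ".join(batch.ignore))
    return "\n".join(lines)


# Example modeled on the dome-as-screen batch described above (file names are placeholders).
dome = ReferenceBatch(
    theme="screen function",
    images=["dome_01.png", "dome_02.png", "dome_03.png"],
    adopt=["the dome behaving as a screen"],
    ignore=["the small-room scale of the source stills"],
)
print(batch_instruction(dome))
```

Writing the ignore list down per batch makes it harder to forget the exclusion half of the instruction when you move on to the next theme.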
For illustrated or animated references, don't drop them in raw. Instruct the agent to read the colors and textures and prompt for those qualities in your target style — that's how you keep the mood without copying the medium. The same production reported: "The gens came back hyper-realistic with the exact colour temperature I was looking for."
Generate grids, not singles. Ask for three grid options per round (one production ran exactly this), iterate on the grid you like, then extract the best individual panels. Image generation inside invideo is cheap enough that grid exploration is the right default, and the agent will autonomously attach the relevant references from context based on which grid it's building. For the locking pass on an individual asset, four options per asset is the standard across documented productions before anything moves to video.
Once you've extracted panels you like, those panels REPLACE your original references as the world anchors. Every subsequent scene generation pulls from the locked panels — not the messy reference batch — which is what produces continuity shot to shot. Lock one world element this way and the invideo agent will autonomously generate every angle off it (wide, close, side) without you asking shot by shot.
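The replace-not-append step can be pictured with a small Python sketch. It is bookkeeping you might keep in your own notes, not how the invideo agent stores anchors internally; the `lock_anchors` function and all file names are hypothetical.

```python
def lock_anchors(
    references: dict[str, list[str]],
    picks: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Return the anchor set: for each theme with picked panels,
    the panels replace the raw references; other themes are unchanged."""
    anchors = dict(references)  # shallow copy so the original mapping is left intact
    for theme, panels in picks.items():
        if not panels:
            raise ValueError(f"no panels picked for theme {theme!r}")
        anchors[theme] = list(panels)  # replace, do not append
    return anchors


# Placeholder file names for illustration only.
references = {
    "screen function": ["dome_01.png", "dome_02.png", "dome_03.png"],
    "color theory": ["grade_01.png", "grade_02.png", "grade_03.png"],
}
picks = {"screen function": ["grid2_panel3.png"]}

anchors = lock_anchors(references, picks)
print(anchors["screen function"])
```

The point of replacing rather than appending is that later shots draw on one clean, already-approved look instead of averaging across the original batch.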
For model routing: image generation for world-building runs on the current image stack — Recraft for photoreal textures, Nano Banana for fused multi-element compositions, GPT-Image-2 where you need strong prompt adherence on complex layouts — and the invideo agent picks per batch. When you move to video, Seedance 2.0 reference-to-video carries world and character context across clips, so the locked panels keep working as inputs into motion. invideo holds all of these models, so you batch and lock once and the agent routes everything from there.
Beyond the batching itself: tell the agent in your first few messages how you want to work — what batches are coming, what each one is for, what to ask you before generating. That framing is what turns a pile of references into a coherent world.