How should I organize reference images for AI world-building?

Sort your references into 3–5 thematic clusters of roughly 3–5 images each, such as spatial logic, screen function, color palette, and architectural style. Keeping batches small and named lets you direct the agent precisely by theme inside a prompt.

Why is exclusion prompting important when feeding references to the AI agent?

Exclusion prompting prevents the agent from inheriting unwanted scale, props, or genre cues from your source material. You should explicitly tell the agent what to take from a reference and what to leave out, making omissions just as deliberate as inclusions.

Should I generate single images or grids when building a world?

Always generate grids rather than singles. Ask for three grid options per round, iterate on the best one, then extract the strongest individual panels. Four options per asset is the standard locking pass before moving anything to video.

What happens once I lock reference panels as world anchors?

Locked panels replace your original reference batch and become the source for every subsequent scene. The invideo agent will autonomously generate multiple angles — wide, close, side — off each locked element without requiring shot-by-shot instructions.

Using Multiple Reference Images for AI World-Building

How do I use illustrated or animated references without copying the art style?

Instead of dropping illustrated references in raw, instruct the agent to extract colors and textures and prompt for those qualities in your target style. This preserves mood and color temperature while keeping the output in your intended medium.

Don't dump all your references in one prompt. Organize them into thematic batches — spatial logic, screen/architecture, color and lighting mood, biome — and feed each batch to the invideo agent with explicit instructions on what to adopt and what to ignore. Then generate image grids (not single shots), pick the best panels, and lock those as your world anchors for every downstream shot.

The invideo agent is an agentic video tool that holds project context across shots, so the references you batch once stay loaded for the entire film. Start by sorting your references into 3–5 thematic clusters of roughly 3–5 images each — for example: spatial logic (how rooms and exteriors are laid out), screen function (how a dome or window behaves visually), color theory (palette and grade), and architectural or biome style. Keeping batches small and named lets you point the agent at them by theme inside a prompt.
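One way to keep batches small and named is to write them down as plain data before anything goes into a prompt. The sketch below is a hypothetical helper, not part of invideo: the `ReferenceBatch` class, the `check_batches` function, and the file names are illustrative assumptions, and the 3–5 limits simply encode the guideline above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReferenceBatch:
    """A named thematic cluster of reference images (hypothetical structure)."""
    name: str
    purpose: str
    images: List[str] = field(default_factory=list)


def check_batches(batches, min_size=3, max_size=5):
    """Return warnings for batch sets that stray from the 3-5 guideline."""
    warnings = []
    if not min_size <= len(batches) <= max_size:
        warnings.append(
            f"expected {min_size}-{max_size} batches, got {len(batches)}"
        )
    for batch in batches:
        if not min_size <= len(batch.images) <= max_size:
            warnings.append(
                f"batch '{batch.name}' has {len(batch.images)} images, "
                f"expected {min_size}-{max_size}"
            )
    return warnings


# Placeholder file names; substitute your own references.
batches = [
    ReferenceBatch("spatial logic", "how rooms and exteriors are laid out",
                   ["layout_01.jpg", "layout_02.jpg", "layout_03.jpg"]),
    ReferenceBatch("screen function", "how a dome or window behaves visually",
                   ["screen_01.jpg", "screen_02.jpg", "screen_03.jpg"]),
    ReferenceBatch("color theory", "palette and grade",
                   ["color_01.jpg", "color_02.jpg", "color_03.jpg", "color_04.jpg"]),
    ReferenceBatch("architectural style", "building forms and biome",
                   ["arch_01.jpg", "arch_02.jpg", "arch_03.jpg"]),
]

for warning in check_batches(batches):
    print("warning:", warning)
```

The batch names double as the labels you use when pointing the agent at a theme, so it helps to keep them short and consistent from message to message.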

For each batch, tell the agent what to extract and what to ignore. One documented production batched stills as references for a dome-as-screen world and instructed the agent to take only the screen idea and drop the small-room scale of the source. Exclusion prompting matters as much as inclusion prompting. Hridaye, invideo's creative director, puts it directly: "I told it what to take and just as importantly, what to leave out." Without that instruction, the agent inherits unwanted scale, props, or genre cues from the references.
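The take/leave pairing can be made hard to forget by composing each batch prompt from two explicit lists. This is a minimal sketch under stated assumptions: `build_batch_prompt` is a hypothetical helper that only assembles text to paste into the agent, it does not call any invideo API, and the wording of the prompt is one possible phrasing.

```python
def build_batch_prompt(batch_name, take, leave):
    """Compose prompt text stating what to extract and what to ignore.

    Hypothetical helper: it only builds a string and refuses to run
    if either list is empty, so exclusions are never skipped by accident.
    """
    if not take:
        raise ValueError("list at least one quality to take from the batch")
    if not leave:
        raise ValueError("list at least one quality to leave out")
    lines = [f"Reference batch: {batch_name}", "Take from these references:"]
    lines += [f"- {item}" for item in take]
    lines.append("Leave out:")
    lines += [f"- {item}" for item in leave]
    return "\n".join(lines)


# Mirrors the dome-as-screen production described above.
prompt = build_batch_prompt(
    "screen function",
    take=["the dome-as-screen idea"],
    leave=["the small-room scale of the source stills"],
)
print(prompt)
```

Requiring a non-empty `leave` list is a deliberate design choice here: it forces you to name at least one thing the agent should not inherit from each batch.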

For illustrated or animated references, don't drop them in raw. Instruct the agent to read the colors and textures and prompt for those qualities in your target style — that's how you keep the mood without copying the medium. The same production reported: "The gens came back hyper-realistic with the exact colour temperature I was looking for."

Generate grids, not singles. Ask for 3 grid options per round (one production ran exactly this), iterate on the grid you like, then extract the best individual panels. Image generation inside invideo is cheap enough that grid exploration is the right default, and the agent will autonomously attach the relevant references from context based on which grid it's building. Across documented productions, four options per asset is the standard locking pass before anything moves to video.

Once you've extracted panels you like, those panels REPLACE your original references as the world anchors. Every subsequent scene generation pulls from the locked panels — not the messy reference batch — which is what produces continuity shot to shot. Lock one world element this way and the invideo agent will autonomously generate every angle off it (wide, close, side) without you asking shot by shot.
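The replace-on-lock behavior is easier to reason about as a small state model. The sketch below is only a mental model for your own asset tracking, assuming a hypothetical `WorldElement` class and placeholder file names; it does not describe how the invideo agent stores anything internally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorldElement:
    """Tracks which images a world element should be generated from."""
    name: str
    reference_batch: List[str]
    locked_panels: List[str] = field(default_factory=list)

    def lock(self, panels):
        """Record the extracted panels as this element's anchors."""
        if not panels:
            raise ValueError("lock at least one panel")
        self.locked_panels = list(panels)

    def sources(self):
        """Locked panels take over from the original references once set."""
        return self.locked_panels or self.reference_batch


dome = WorldElement("dome", ["ref_01.jpg", "ref_02.jpg", "ref_03.jpg"])
print(dome.sources())  # still the raw reference batch

dome.lock(["grid2_panel_a.png", "grid2_panel_c.png"])
print(dome.sources())  # now only the locked panels
```

Keeping a record like this outside the tool makes it clear, for any later scene, which images count as the anchors and which were only exploratory references.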

For model routing: image generation for world-building runs on the current image stack (Recraft for photoreal textures, Nano Banana for fused multi-element compositions, GPT-Image-2 where you need strong prompt adherence on complex layouts), and the invideo agent picks a model per batch. When you move to video, Seedance 2.0 reference-to-video carries world and character context across clips, so the locked panels keep working as inputs into motion. All of these models are available inside invideo, so you batch and lock once and the agent routes everything from there.

Beyond the batching itself: tell the agent in your first few messages how you want to work — what batches are coming, what each one is for, what to ask you before generating. That framing is what turns a pile of references into a coherent world.

Watch some of these to see what works for you:

See how batching references by theme builds a coherent AI world

Full masterclass: how the invideo agent uses reference batches for world-building

Extract color temperature and texture from mood boards without copying style


How do you use multiple reference images for AI world-building?

More on AI Filmmaking
