How do you use reference images in AI video generation prompts?
Last updated June 26, 2026
Use reference images to lock WHAT the shot looks like — character, style, composition, environment, props — and use the text prompt to drive WHAT HAPPENS — action, camera move, lighting beat. Feed the agent multiple labeled references per category, tell it explicitly what to take from each and what to ignore, and lock those refs before any video generation begins.
Split the job cleanly: reference images carry visual identity, the prompt carries motion and intent. invideo is an agentic video creation tool with every current video and image model available, so the invideo agent accepts your references once and routes them — alongside your prompt — to the right model (Seedance 2.0, Veo, Kling, Recraft, Nano Banana, GPT-Image-2) for each shot.
Use a taxonomy of reference types, not one mood board. Feed the agent a separate batch per job: character refs (front, side, profile, back, plus a face close-up so small details like scars or accessories stay consistent), style refs (the look — palette, texture, brushstroke, grain), composition refs (framing and blocking you want echoed), environment refs (location plates — real-world landmarks pulled off the internet work well), and prop refs (each prop gets its own pass; lifeless props break narrative believability faster than weak characters). One documented production batched references by theme — spatial logic, screen function, color theory — and explicitly told the agent what to extract from each batch and what to ignore. That exclusion instruction matters as much as the inclusion one.
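One way to keep the batches separate is to write them down as data before uploading anything. The sketch below is illustrative only: `RefKind`, `RefBatch`, and `instruction()` are hypothetical names, not part of invideo or any model API. It shows each batch carrying its own "take" and "ignore" lists, so the exclusion instruction travels with the images.

```python
from dataclasses import dataclass, field
from enum import Enum


class RefKind(Enum):
    # The five reference categories named in the text above.
    CHARACTER = "character"
    STYLE = "style"
    COMPOSITION = "composition"
    ENVIRONMENT = "environment"
    PROP = "prop"


@dataclass
class RefBatch:
    # Hypothetical container: one batch of images per job.
    kind: RefKind
    images: list[str]
    take: list[str]                       # what the agent should extract
    ignore: list[str] = field(default_factory=list)  # what it should skip

    def instruction(self) -> str:
        """Render the batch as one plain-language instruction line."""
        text = f"{self.kind.value} refs ({len(self.images)}): take {', '.join(self.take)}"
        if self.ignore:
            text += f"; ignore {', '.join(self.ignore)}"
        return text + "."


style = RefBatch(
    RefKind.STYLE,
    ["frame_01.png", "frame_02.png"],
    take=["palette", "grain"],
    ignore=["character designs"],
)
print(style.instruction())
```

The printed line can be pasted next to the upload as the batch's instruction; the file names here are placeholders.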
Prep the images themselves. Use high-resolution images with clean lighting and a neutral background where you can; 4K portraits and 4K character sheets with four angles plus face and mid-angle close-ups are the working spec across documented productions. Remove objects from characters' hands before generating turnarounds, because anything held will drift across angles. For style references, volume helps: one production uploaded 64 frames from a target animated series in a single message with the instruction "deeply understand this art style and save it into context for further generations", and that single ingestion locked the visual style for the whole project.
Label and number when the tool supports multiple refs. Tag what each image is for — "character sheet: Echo", "style frame 12 — palette only", "location plate: rooftop". The invideo agent picks the right reference automatically per shot once the labels are clear, but ambiguous uploads cause it to pull the wrong frame. If you only need a color and texture cue from an illustrated or animated reference, don't drop the image in raw — instruct the agent to read the colours and textures and prompt for those instead. Dropping illustrated refs in directly produces stylistic bleed; colour-and-texture extraction returns photoreal output with the exact temperature you wanted.
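A labeling pass like the one described above can be checked mechanically before upload. This is a minimal sketch under stated assumptions: `label_refs` is a hypothetical helper, and the numbered `[n] file: purpose` format is just one plausible convention, not something the invideo agent requires.

```python
def label_refs(refs):
    """Turn (filename, purpose) pairs into numbered label lines.

    Raises ValueError for any image without a purpose, since an
    unlabeled upload is the ambiguous case the text warns about.
    """
    lines = []
    for number, (filename, purpose) in enumerate(refs, start=1):
        purpose = purpose.strip()
        if not purpose:
            raise ValueError(f"{filename} has no label")
        lines.append(f"[{number}] {filename}: {purpose}")
    return lines


labels = label_refs([
    ("echo_sheet.png", "character sheet: Echo"),
    ("frame_12.png", "style frame 12, palette only"),
    ("rooftop.png", "location plate: rooftop"),
])
print("\n".join(labels))
```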
Lock the references before any video generation. Generate four options per asset — character sheet, environment plate, prop — pick the strongest, store it in the agent's context, and only then start moving to video. Locking up front is what prevents character drift across scenes; one 70-second short held two characters consistent across every scene this way with no LoRA, using only character sheets in the agent's context. If a continuity error appears later, fix it at the source: ask the agent which panel of the character sheet contains the error, correct that panel, and every downstream shot inherits the fix.
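The lock-first rule can be pictured as a simple gate: no video step starts until every required asset has a chosen reference. The sketch below is a hypothetical bookkeeping class (`ReferenceLock` and its methods are invented for illustration); in practice the locked references live in the agent's context, not in your own code.

```python
class ReferenceLock:
    """Tracks which required assets have a locked reference."""

    def __init__(self, required_assets):
        self.required = set(required_assets)
        self.locked = {}

    def lock(self, asset, options, pick):
        """Store the chosen option (by index) as the locked reference."""
        if asset not in self.required:
            raise KeyError(f"unknown asset: {asset}")
        self.locked[asset] = options[pick]

    def missing(self):
        """Assets that still need a locked reference, sorted by name."""
        return sorted(self.required - set(self.locked))

    def ready_for_video(self):
        """True only when every required asset has been locked."""
        return not self.missing()


lock = ReferenceLock(["character sheet", "environment plate", "prop"])
# Four generated options per asset; the strongest one is picked by index.
lock.lock("character sheet", ["cs_a", "cs_b", "cs_c", "cs_d"], pick=2)
print(lock.ready_for_video(), lock.missing())
```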
Anchor identity in the prompt before motion. Structure the prompt in two blocks. First the identity anchor — name each reference the shot uses ("Echo from character sheet v3, rooftop plate v2, vampire-and-juicebox pose ref") and state the locked visual rules (lens, palette, lighting source). Then the motion block — what happens, how the camera moves, what the emotional beat is. Splitting it this way stops the model from re-imagining the character to fit the action. Be explicit on lighting cues too: "warm yellow from the lamps only, like all the refs" produces more accurate output than a generic "warm lighting".
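The two-block structure can be sketched as a small template function. `build_shot_prompt` and the `IDENTITY:` / `MOTION:` headings are assumptions made for illustration; the source only prescribes the order (identity anchor first, motion second), not this exact wording. The motion sentence in the example is invented.

```python
def build_shot_prompt(refs, locked_rules, motion):
    """Assemble a prompt with the identity anchor ahead of the motion block."""
    if not refs:
        raise ValueError("name at least one reference in the identity anchor")
    identity = (
        "IDENTITY: " + ", ".join(refs)
        + ". Locked rules: " + "; ".join(locked_rules) + "."
    )
    return identity + "\nMOTION: " + motion.strip()


prompt = build_shot_prompt(
    refs=["Echo from character sheet v3", "rooftop plate v2"],
    locked_rules=["warm yellow from the lamps only, like all the refs"],
    motion="Echo walks to the ledge; slow push-in; quiet resolve.",
)
print(prompt)
```

Keeping the blocks in separate arguments makes it harder to let an action description leak into the identity anchor when a shot is revised.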
Generate stronger reference images by prompting in grids. Image generation is cheap, so ask for 3-4 grid options per round rather than single shots, iterate on a grid, extract the best panel, and use THAT extracted panel as your new reference for downstream scenes. One production ran three grids per round across world-building, then swapped the original references out for the extracted panels — shot-to-shot continuity improved noticeably because every scene was now anchored on something the agent itself had generated in-context.
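If you extract the best panel yourself rather than asking the agent to do it, the crop is simple arithmetic. The sketch below assumes an evenly divided grid with no gutters or borders, which generated grids do not always honor; `panel_box` is a hypothetical helper. The returned tuple is in (left, upper, right, lower) order, the same order Pillow's `Image.crop` accepts.

```python
def panel_box(width, height, rows, cols, index):
    """Return the (left, upper, right, lower) box of one grid panel.

    Panels are numbered left to right, top to bottom, starting at 0.
    Assumes equal-size panels with no gutters.
    """
    if not 0 <= index < rows * cols:
        raise IndexError("panel index out of range")
    panel_w, panel_h = width // cols, height // rows
    row, col = divmod(index, cols)
    return (col * panel_w, row * panel_h, (col + 1) * panel_w, (row + 1) * panel_h)


# Bottom-right panel of a 2x2 grid rendered at 2048x2048.
print(panel_box(2048, 2048, rows=2, cols=2, index=3))
```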
Model choice matters once references are loaded. Seedance 2.0 reference-to-video carries character and location refs simultaneously and holds context across segments, so it's strong when identity AND environment have to stay locked. Veo and Kling read style references well for stylized aesthetics. Nano Banana / GPT-Image-2 / Recraft handle the reference IMAGE generation step (Recraft for photoreal portraits with skin imperfections; Nano Banana for character sheets and fused multi-character arrangements). The invideo agent routes the right shot to the right model, so you don't pick — you direct.
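As a mental model only, the pairings above can be written as a lookup table. This is not how the invideo agent routes shots internally (the source does not describe that); `ROUTES` and `candidates` are hypothetical, and the table contains only the pairings stated in the paragraph above.

```python
# (step, need) -> models the text above associates with that job.
ROUTES = {
    ("video", "identity+environment"): ["Seedance 2.0"],
    ("video", "stylized"): ["Veo", "Kling"],
    ("image", "photoreal portrait"): ["Recraft"],
    ("image", "character sheet"): ["Nano Banana"],
}


def candidates(step, need):
    """Return the models listed for a job, or an empty list if none is listed."""
    return ROUTES.get((step, need), [])


print(candidates("video", "stylized"))
```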
These are the levers that matter; what mix works depends on your shot and how stylized vs photoreal your film is.
The better move was to have Agent 1 read the colours and textures of them and prompt for that instead.
— invideo's creative team