Wong Kar-wai's style in AI video comes from encoding his visual grammar — step-printed motion, neon-on-wet-asphalt palette, handheld wide-angle close-ups, sodium-yellow streetlight, heavy grain, and 'big city with no answering voice' mood — into a treatment document the invideo agent holds across every shot, then routing each generation to Seedance 2.0, Kling, or Veo depending on what the shot needs.
Start by writing a treatment document that names the grammar, not just the look. One 70-second short film built this way used a 25-page Wong Kar-wai style guide loaded once into the invideo agent — 14 sections covering camera, angles, colour tone, atmosphere, mood, lighting, composition, movement, film palettes, prompt templates, negative prompts, and a quick-reference card. The agent then evaluated every shot against 12 parameters before generating: film reference, shot design, length, style interpretation, emotional register, lens, lighting plan, color script, atmosphere layers, blocking, final prompt, negative prompt. That's the spine — everything below feeds into it.
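The 12-parameter check can be kept as a small checklist outside the agent as well. The sketch below is a minimal illustration, not the invideo agent's internal format: the `ShotSheet` class and `missing_parameters` helper are hypothetical names, and the only thing taken from the workflow is the list of 12 parameters.

```python
from dataclasses import dataclass, fields


@dataclass
class ShotSheet:
    """One field per parameter in the 12-parameter shot evaluation (hypothetical container)."""
    film_reference: str = ""
    shot_design: str = ""
    length: str = ""
    style_interpretation: str = ""
    emotional_register: str = ""
    lens: str = ""
    lighting_plan: str = ""
    color_script: str = ""
    atmosphere_layers: str = ""
    blocking: str = ""
    final_prompt: str = ""
    negative_prompt: str = ""


def missing_parameters(sheet: ShotSheet) -> list[str]:
    """Return the names of parameters that are still empty, in sheet order."""
    return [f.name for f in fields(sheet) if not getattr(sheet, f.name).strip()]


# A partially filled sheet: three parameters decided, nine still open.
sheet = ShotSheet(
    film_reference="Chungking Express register",
    lens="28mm spherical, handheld",
    lighting_plan="sodium-yellow streetlight as key from camera-left",
)
print(missing_parameters(sheet))
```

A shot goes to generation only once the list comes back empty, which is one simple way to keep the final prompt from being written before the lens, lighting, and colour decisions exist.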
Lock the 9-element prompt assembly order. Every shot prompt should be built in the same sequence so style holds across scenes: camera spec, lens & aspect ratio, lighting source, palette, composition, atmosphere, mood register, film/DP attribution (Wong Kar-wai, Christopher Doyle), negative prompt. Skipping the order is where drift starts.
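One way to make the order impossible to skip is to assemble the prompt from named elements rather than typing it freehand. This is a sketch under assumptions: the key names in `ASSEMBLY_ORDER` and the `assemble_prompt` function are illustrative, and the output is plain text to paste into the agent, not a call to any invideo API.

```python
# The nine elements, in the fixed order the prompt should always follow.
ASSEMBLY_ORDER = [
    "camera",
    "lens_and_aspect",
    "lighting_source",
    "palette",
    "composition",
    "atmosphere",
    "mood_register",
    "attribution",
    "negative_prompt",
]


def assemble_prompt(elements: dict[str, str]) -> str:
    """Join the elements in ASSEMBLY_ORDER, refusing incomplete or unknown input."""
    missing = [key for key in ASSEMBLY_ORDER if not elements.get(key, "").strip()]
    if missing:
        raise ValueError(f"missing prompt elements: {missing}")
    unknown = [key for key in elements if key not in ASSEMBLY_ORDER]
    if unknown:
        raise ValueError(f"unknown prompt elements: {unknown}")
    positive = ". ".join(elements[key].strip().rstrip(".") for key in ASSEMBLY_ORDER[:-1])
    negative = elements["negative_prompt"].strip().rstrip(".")
    return f"{positive}. Negative prompt: {negative}."


# Elements can be supplied in any order; the output order never changes.
shot = {
    "negative_prompt": "clean digital sharpness, even daylight, anamorphic flares, over-stabilized camera",
    "palette": "Palette: deep red, electric green, sodium amber, cyan haze",
    "camera": "Handheld wide-angle close-up",
    "attribution": "In the visual language of Wong Kar-wai, shot by Christopher Doyle",
    "lens_and_aspect": "28mm spherical, shallow depth of field",
    "mood_register": "Loneliness, the big city with no answering voice",
    "lighting_source": "Sodium-yellow streetlight as key from camera-left",
    "composition": "A figure leans against a fogged taxi window, pressed to one edge of frame",
    "atmosphere": "Heavy 16mm film grain, lifted blacks, soft halation around highlights",
}
prompt = assemble_prompt(shot)
print(prompt)
```

Raising on a missing element is a deliberate choice here: a prompt that silently drops its lighting source or negative prompt is exactly the kind of gap that lets style drift between scenes.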
Name the palette as tonal modes with hex values, not adjectives. Instead of "moody neon," encode modes like "Mode A: split-toned amber and emerald, sodium-yellow key from streetlamps, electric green spill from signage, deep red practicals, cyan haze in mid-ground," and pin each named colour to a hex value in the treatment. Map modes per film: cramped interior reds and greens for the Chungking Express register, smoky amber and jade for the In the Mood for Love register, denser saturated reds and electric blues for the 2046 register. The hex values are what make the palette reproducible.
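A tonal mode can be stored as data so the same values reach every shot. The sketch below is illustrative only: `TonalMode` is a hypothetical structure, and the hex values are placeholders chosen to suggest the named colours, not values sampled from the films or taken from any documented treatment. Replace them with swatches picked from your own references.

```python
import re
from dataclasses import dataclass

HEX_PATTERN = re.compile(r"#[0-9A-Fa-f]{6}")


@dataclass(frozen=True)
class TonalMode:
    """A named palette mode: each lighting role is pinned to one hex colour."""
    name: str
    swatches: dict[str, str]  # role -> "#RRGGBB"

    def __post_init__(self) -> None:
        bad = [value for value in self.swatches.values() if not HEX_PATTERN.fullmatch(value)]
        if bad:
            raise ValueError(f"not 6-digit hex colours: {bad}")

    def to_prompt(self) -> str:
        """Render the mode as a palette clause for a shot prompt."""
        colours = ", ".join(f"{role} {value}" for role, value in self.swatches.items())
        return f"Palette ({self.name}): {colours}"


# Placeholder hex values (assumptions, for illustration only).
MODE_A = TonalMode(
    name="Mode A",
    swatches={
        "sodium-yellow key": "#E0A526",
        "electric green spill": "#2BD97C",
        "deep red practicals": "#8C1C13",
        "cyan haze": "#6FC7C7",
    },
)
print(MODE_A.to_prompt())
```

Validating the hex format at construction time means an adjective such as "amber" cannot slip back into a field that is supposed to hold a reproducible value.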
Encode motion explicitly; this is where most WKW attempts fail. Write motion tokens into the prompt: step-printed motion (a low-frame-rate stutter on the subject while the background blurs continuously), slow-shutter smear, speed-ramping from real-time into slow motion, handheld micro-sway on a wide-angle close-up, a slight Dutch tilt on the rack focus. One production had the invideo agent autonomously apply a slow-shutter motion smear effect pulled from page 17 of the treatment without being prompted. That only happens when the doc names the technique precisely.
Write composition and lens as Christopher Doyle would. Handheld wide-angle (24-32mm equivalent) close-ups with the subject pressed to one edge of frame, foreground obstructions (bottles, lamps, mesh, rain-streaked glass), reflections doubling the subject through fogged windows, shallow depth of field with circular bokeh from spherical lenses (not anamorphic — that's a different grammar). Lighting source must be named to the practical: "warm yellow from the lamps only," "neon sign as key from camera-right," never generic "warm lighting."
A copy-paste prompt template that holds the grammar: "Handheld wide-angle close-up, 28mm spherical, shallow depth of field. A figure leans against a fogged taxi window, rain streaking outside, sodium-yellow streetlight as key from camera-left, red and emerald neon signage spilling across the face as the cab moves. Step-printed motion on the figure, continuous motion blur on the background lights smearing into hazy streaks. Palette: deep red, electric green, sodium amber, cyan haze. Heavy 16mm film grain, lifted blacks, soft halation around highlights. Atmosphere: loneliness, the big city with no answering voice. In the visual language of Wong Kar-wai, shot by Christopher Doyle. Negative prompt: clean digital sharpness, even daylight, wide establishing shot, generic cinematic look, anamorphic flares, plastic skin, over-stabilized camera."
Write the negative prompt with the same care as the positive; it is the part most WKW attempts leave thin. Explicitly exclude: clean digital sharpness, daylight exteriors, even three-point lighting, wide establishing shots, generic cinematic LUTs, horizontal anamorphic flares, plastic AI skin, over-stabilized camera, symmetrical composition.
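Keeping the exclusions as a list makes them easy to reuse and to check against the positive prompt. This is a minimal sketch: `build_negative_prompt` and `conflicts` are hypothetical helpers, and the conflict check is a plain substring match, so it only catches terms that appear word for word.

```python
# The exclusion list from the paragraph above.
NEGATIVE_TERMS = [
    "clean digital sharpness",
    "daylight exteriors",
    "even three-point lighting",
    "wide establishing shots",
    "generic cinematic LUTs",
    "horizontal anamorphic flares",
    "plastic AI skin",
    "over-stabilized camera",
    "symmetrical composition",
]


def build_negative_prompt(terms: list[str], extra: tuple[str, ...] = ()) -> str:
    """Join terms into one negative prompt, dropping case-insensitive duplicates."""
    seen: set[str] = set()
    ordered: list[str] = []
    for term in [*terms, *extra]:
        key = term.strip().lower()
        if key and key not in seen:
            seen.add(key)
            ordered.append(term.strip())
    return "Negative prompt: " + ", ".join(ordered)


def conflicts(positive_prompt: str, terms: list[str]) -> list[str]:
    """Return excluded terms that also appear verbatim in the positive prompt."""
    lowered = positive_prompt.lower()
    return [term for term in terms if term.lower() in lowered]


negative = build_negative_prompt(NEGATIVE_TERMS, extra=("Plastic AI skin ",))
print(negative)
print(conflicts("Wide establishing shots of the harbour at night", NEGATIVE_TERMS))
```

The conflict check is there because a positive prompt that asks for something the negative prompt forbids sends the model mixed signals; catching it before generation is cheaper than diagnosing it from the output.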
Route by model. Every roster model runs inside invideo, so you don't pick a platform per model; the agent routes each shot to the model it needs:
- Seedance 2.0 reference-to-video for continuity shots and one-take movement through neon-lit interiors — it carries character, location, and camera context across segments better than start/end-frame methods, which is what WKW's drifting handheld needs.
- Kling when you want native multi-shot sequences inside one generation — useful for the rapid step-printed bursts.
- Veo when the shot leans on photoreal grain texture and subtle slow-motion ramps.
- Runway for shorter motion bursts where you need precise camera-move tokens.
For stills (character sheets, mood frames, reference plates), generate with GPT-Image-2, with Recraft for portraits where you want skin imperfection (pores, lines, stubble) and Nano Banana for fused character sheets and 4-angle turnarounds.
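The video routing rules in the list above can be written down as a lookup table, which is handy when annotating a shot list by hand. This is only a restatement of those rules as data: how the invideo agent actually decides is not documented here, and `ShotNeed`, `ROUTES`, and `route` are hypothetical names.

```python
from enum import Enum


class ShotNeed(Enum):
    """What a shot most needs, using the four cases from the routing list."""
    CONTINUITY_ONE_TAKE = "continuity or one-take movement"
    MULTI_SHOT_BURST = "native multi-shot sequence in one generation"
    GRAIN_AND_SLOW_RAMP = "photoreal grain texture and subtle slow-motion ramps"
    PRECISE_CAMERA_MOVE = "short motion burst with precise camera-move tokens"


# Model names are taken from the routing list; the mapping itself is a sketch.
ROUTES = {
    ShotNeed.CONTINUITY_ONE_TAKE: "Seedance 2.0 (reference-to-video)",
    ShotNeed.MULTI_SHOT_BURST: "Kling",
    ShotNeed.GRAIN_AND_SLOW_RAMP: "Veo",
    ShotNeed.PRECISE_CAMERA_MOVE: "Runway",
}


def route(need: ShotNeed) -> str:
    """Return the model suggested for a given shot need."""
    return ROUTES[need]


# Annotate a small shot list with the suggested model for each shot.
shot_list = [
    ("Drift through the neon-lit noodle bar", ShotNeed.CONTINUITY_ONE_TAKE),
    ("Step-printed chase burst", ShotNeed.MULTI_SHOT_BURST),
    ("Slow ramp on the taxi window", ShotNeed.GRAIN_AND_SLOW_RAMP),
]
for description, need in shot_list:
    print(f"{description}: {route(need)}")
```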
Build reference inputs before you write a single video prompt. Generate four options per character sheet and environment plate, pick one, lock it. For the WKW look specifically, batch references thematically — one batch for spatial logic (cramped interior geometry), one for color theory (the red-green-amber triad), one for surface (wet asphalt, fogged glass, rain-streaked windows) — and tell the agent what to take from each batch and what to leave out. When a reference is illustrated rather than photographic, instruct the agent to read the colours and textures and prompt for those, not to copy the image.
For the viral portrait look circulating on TikTok and Threads — figure pressed to a taxi window, red-green neon passing, rain-fogged glass, thick grain, motion blur halos — that prompt structure works as a still in GPT-Image-2 or Nano Banana. To turn it into video, feed the still as a reference image to Seedance 2.0 reference-to-video with an added motion block: "slow push-in over 4 seconds, step-printed motion on the figure, continuous smear on passing lights, micro-sway handheld."
Test that the grammar is internalized, not surface-pasted. Ask the invideo agent to apply Wong Kar-wai's language to a scene type he never shot — a courtroom, a kitchen at noon. If it asks clarifying questions and returns something that still reads as his grammar (mood over narrative, the substitution rule, doorway static holds for endings), the doc is solid. If it just paints everything red and green, rewrite the doc.
Across documented productions using this treatment-loaded workflow, full short films land at $750 to $5,000 total — $315 to $750 per finished minute — over 2 to 5 days, with 2 to 4 people. As Hridaye, invideo's creative director, puts it: "IT ISN'T A LOOK. IT'S A LANGUAGE. Color as diagnosis. Subliminal dollies. Dread before dialogue." That's the bar — encode the language, then direct.