VIDEO MODEL · by Alibaba · Wan family

Wan 2.6

Wan 2.6 is Alibaba’s reference-to-video model. It inserts a character’s appearance and voice from a reference input, generates multi-shot narratives with synchronized audio, and produces clips of up to 15 seconds at 1080p with precise lip-sync. It is built for character-centric, multilingual, and audio-synchronized production.

Duration

Up to 15 seconds

Resolution

720p–1080p

Audio

SFX + Music + Lip-sync

Frame rate

24 FPS

Reference-to-video: put real characters in any scene

Wan 2.6’s headline capability is its R2V (Reference-to-Video) mode: upload a reference image of a character, and Wan 2.6 inserts that character’s appearance consistently into a generated scene. Add a voice reference and both the character’s face and voice can be preserved in the generated video. This suits creator-centric workflows in which a personal or brand character needs to appear in generated content. The model also upgrades text-to-video, image-to-video, and audio-to-video generation with one-pass A/V synchronization and precise lip-sync.

Capabilities

Reference-to-video (R2V)

Insert a character’s appearance from a reference image — and optionally their voice — into any generated scene with consistent identity preservation.

One-pass A/V synchronization

Audio and video are generated in a single pass: sound effects, music, and voice come out synchronized with the video, with no separate audio post-production step.

Precise lip-sync

Character lip movements synchronized accurately with generated or reference audio — suitable for dialogue-driven content.

Multi-shot storytelling

Generates coherent multi-shot sequences from simple prompts — scene transitions, character continuity, and narrative flow maintained automatically.

Up to 15 seconds

The longest generation window among the models compared on this page. Clips run 5, 10, or 15 seconds, which leaves room for more developed narrative sequences.

Multiple generation modes

Text-to-video, image-to-video, audio-to-video, and reference-to-video all supported in a single model.

Generation modes

Mode	Description
Text-to-video	Generate video from text prompt with A/V sync
Image-to-video	Animate a reference image with motion and audio
Reference-to-video (R2V)	Insert a character’s appearance and voice from reference inputs
Audio-to-video	Generate matching visuals from an audio reference
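
The table above implies which inputs each mode needs before a generation can start. A minimal sketch of that check follows. The key names (`prompt`, `image`, `audio`) and the function `missing_inputs` are hypothetical labels for illustration, not identifiers from ImagineArt or Wan 2.6, and the assumption that R2V needs a prompt comes from the "Describe the scene" step in the how-to.

```python
# Required inputs per mode, read off the generation modes table.
# Key names are hypothetical labels for this sketch, not a real API.
REQUIRED_INPUTS = {
    "text-to-video": {"prompt"},
    "image-to-video": {"image"},
    # Assumption: R2V takes a character image plus a scene prompt;
    # the voice reference is optional, so it is not listed here.
    "reference-to-video": {"image", "prompt"},
    "audio-to-video": {"audio"},
}


def missing_inputs(mode: str, provided: set[str]) -> set[str]:
    """Return the inputs still needed for `mode`, given what was provided."""
    if mode not in REQUIRED_INPUTS:
        raise ValueError(f"unknown generation mode: {mode!r}")
    return REQUIRED_INPUTS[mode] - provided


if __name__ == "__main__":
    # A prompt alone is not enough for R2V: the character image is missing.
    print(missing_inputs("reference-to-video", {"prompt"}))
```

Running a check like this before submitting catches the most common R2V mistake, which is writing the scene prompt but forgetting the reference image.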

Specifications

Feature	Details
Developer	Alibaba (Wan Video)
Resolution	720p, 1080p
Duration	5, 10, or 15 seconds
Frame rate	24 FPS
Aspect ratios	16:9, 9:16, 1:1, 4:3, 3:4
Audio	SFX, music, lip-sync (generated in sync with the video)
R2V	Character appearance + voice insertion
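
The specification values can be turned into a quick settings check. The sketch below validates a choice of resolution, duration, and aspect ratio against the table and derives the frame count from the 24 FPS frame rate. `ClipSettings` and its methods are hypothetical names for this example, not part of any published interface.

```python
from dataclasses import dataclass

# Values copied from the specifications table.
FPS = 24
RESOLUTIONS = ("720p", "1080p")
DURATIONS_S = (5, 10, 15)
ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")


@dataclass(frozen=True)
class ClipSettings:
    """Hypothetical container for one generation's settings."""
    resolution: str
    duration_s: int
    aspect_ratio: str

    def problems(self) -> list[str]:
        """List every setting that falls outside the specification table."""
        found = []
        if self.resolution not in RESOLUTIONS:
            found.append(f"resolution must be one of {RESOLUTIONS}")
        if self.duration_s not in DURATIONS_S:
            found.append(f"duration must be one of {DURATIONS_S} seconds")
        if self.aspect_ratio not in ASPECT_RATIOS:
            found.append(f"aspect ratio must be one of {ASPECT_RATIOS}")
        return found

    def frame_count(self) -> int:
        """Frames in the clip: duration in seconds times 24 FPS."""
        return self.duration_s * FPS


if __name__ == "__main__":
    clip = ClipSettings("1080p", 15, "16:9")
    print(clip.problems(), clip.frame_count())
```

At 24 FPS, the three supported durations correspond to 120, 240, and 360 frames.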

How to use

Reference-to-video

Open the AI Video Generator

Log into ImagineArt and go to the AI Video Generator.

Select Wan 2.6

Choose Wan 2.6 from the model dropdown.

Select R2V mode

Choose the Reference-to-Video generation mode.

Upload character reference

Upload a reference image of the character to use. Optionally, upload a voice reference audio clip.

Describe the scene

Write a prompt describing the scene, environment, action, and audio atmosphere around your character.

Generate

Click Generate. Wan 2.6 places your referenced character into the generated scene with synchronized audio.

Prompting tips

R2V: describe the scene, not the character — The reference image provides the character; your prompt should focus on the setting, action, camera, and audio environment.
Include audio cues for one-pass sync — “A jazz trio plays softly in the background” or “footsteps echo on the marble floor” integrate directly into the audio generation.
Multi-shot: use transition language — “THEN CUT TO:” or “The camera pulls back to reveal…” cues structured multi-shot generation.
15-second clips for narratives — Use the full 15-second window for storylines that need a beginning, middle, and resolution within one generation.
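
The tips above can be combined mechanically: describe each shot, join the shots with transition language, and append audio cues. The helper below is a sketch of that pattern. `build_prompt` is a hypothetical function for illustration, and "THEN CUT TO:" is the transition phrase quoted in the tips, not a guaranteed control token.

```python
from typing import Sequence

# Transition phrase taken from the multi-shot tip.
CUT = "THEN CUT TO:"


def _sentence(text: str) -> str:
    """Trim the text and make sure it ends with sentence punctuation."""
    text = text.strip()
    return text if text.endswith((".", "!", "?")) else text + "."


def build_prompt(shots: Sequence[str], audio_cues: Sequence[str] = ()) -> str:
    """Join shot descriptions with a cut cue, then append audio cues."""
    if not shots:
        raise ValueError("at least one shot description is required")
    body = f" {CUT} ".join(_sentence(s) for s in shots)
    audio = " ".join(_sentence(a) for a in audio_cues)
    return f"{body} {audio}".strip()


if __name__ == "__main__":
    print(build_prompt(
        ["The chef plates a dish", "Close-up of the finished plate"],
        ["Sizzling in the background"],
    ))
```

For R2V, keep the shot descriptions about setting, action, and camera, since the reference image already supplies the character.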

Example prompts

[R2V mode] Reference character appears as a chef in a busy restaurant kitchen. The chef plates a dish confidently, a soft smile as they look at the camera. Warm kitchen sounds, sizzling in background. 10 seconds.

A multilingual brand video: a young woman introduces a product in front of a clean white background. She speaks naturally, hands gesturing. Confident, friendly. 10 seconds, 1080p.

Compare models

Model	R2V	Audio	Lip-sync	Duration	Best for
Wan 2.6	Yes	Yes	Yes	15s	Character reference, A/V
Wan 2.5	No	Yes	Yes	10s	General A/V production
Wan 2.2	No	No	No	5s	Camera control, style LoRA
Seedance 1.5 Pro	No	Yes	Multilingual	12s	Multilingual precision

Wan 2.6 is the strongest choice among these models when a specific character needs to appear consistently in generated video. It is the only model in the table with R2V, which preserves character identity from a single image reference.
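
The comparison can also be read as a filter over requirements. The sketch below encodes the R2V, Audio, and Duration columns of the table and returns the models that meet a given need. It assumes the Duration column is a maximum clip length, and `Model` and `candidates` are hypothetical names for this example.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    r2v: bool
    audio: bool
    max_seconds: int  # Assumption: the Duration column is a maximum.


# Rows copied from the comparison table.
MODELS = [
    Model("Wan 2.6", True, True, 15),
    Model("Wan 2.5", False, True, 10),
    Model("Wan 2.2", False, False, 5),
    Model("Seedance 1.5 Pro", False, True, 12),
]


def candidates(need_r2v: bool = False, need_audio: bool = False,
               min_seconds: int = 0) -> list[str]:
    """Names of models that satisfy every stated requirement."""
    return [
        m.name
        for m in MODELS
        if (m.r2v or not need_r2v)
        and (m.audio or not need_audio)
        and m.max_seconds >= min_seconds
    ]


if __name__ == "__main__":
    print(candidates(need_r2v=True))
    print(candidates(need_audio=True, min_seconds=12))
```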

