VIDEO MODEL | by Alibaba | Wan family

Wan 2.6

Alibaba’s reference-to-video model — insert a character’s appearance and voice from a reference input, generate multi-shot narratives with synchronized audio, and produce up to 15 seconds at 1080p with precise lip-sync. Built for character-centric, multilingual, and audio-synchronized production.

Duration: Up to 15 seconds
Resolution: 720p–1080p
Audio: SFX + Music + Lip-sync
Frame rate: 24 FPS

Reference-to-video: put real characters in any scene

Wan 2.6’s headline capability is its R2V (Reference-to-Video) mode — upload a reference image of a character and Wan 2.6 inserts that character’s appearance consistently into a generated scene. Combined with voice reference input, both the character’s face and voice can be preserved in the generated video, making Wan 2.6 uniquely capable for creator-centric workflows where personal or brand character identity needs to appear in generated content. The model also introduces comprehensive upgrades across text-to-video, image-to-video, and audio-to-video generation with one-pass A/V synchronization and precise lip-sync.

Capabilities

Reference-to-video (R2V)

Insert a character’s appearance from a reference image — and optionally their voice — into any generated scene with consistent identity preservation.

One-pass A/V synchronization

Audio and video generated in a single pass — synchronized sound effects, music, and voice generated with the video without post-production.

Precise lip-sync

Character lip movements synchronized accurately with generated or reference audio — suitable for dialogue-driven content.

Multi-shot storytelling

Generates coherent multi-shot sequences from simple prompts — scene transitions, character continuity, and narrative flow maintained automatically.

Up to 15 seconds

One of the longer generation windows in the lineup. Clips run 5, 10, or 15 seconds, which leaves room for more developed narrative sequences.

Multiple generation modes

Text-to-video, image-to-video, audio-to-video, and reference-to-video all supported in a single model.

Generation modes

| Mode | Description |
| --- | --- |
| Text-to-video | Generate video from text prompt with A/V sync |
| Image-to-video | Animate a reference image with motion and audio |
| Reference-to-video (R2V) | Insert a character’s appearance and voice from reference inputs |
| Audio-to-video | Generate matching visuals from an audio reference |

Specifications

| Feature | Details |
| --- | --- |
| Developer | Alibaba (Wan Video) |
| Resolution | 720p, 1080p |
| Duration | 5, 10, or 15 seconds |
| Frame rate | 24 FPS |
| Aspect ratios | 16:9, 9:16, 1:1, 4:3, 3:4 |
| Audio | SFX, music, lip-sync (synchronized) |
| R2V | Character appearance + voice insertion |
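
The limits in the table above can be checked before a generation is submitted. The following is a minimal sketch, not an official client: `Wan26Settings`, its field names, and its defaults are hypothetical, and only the allowed values come from the table.

```python
from dataclasses import dataclass
from typing import List

# Allowed values copied from the Specifications table.
RESOLUTIONS = {"720p", "1080p"}
DURATIONS_S = {5, 10, 15}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
FRAME_RATE = 24  # frames per second


@dataclass(frozen=True)
class Wan26Settings:
    """Hypothetical settings container (not an ImagineArt or Alibaba type)."""
    resolution: str = "1080p"
    duration_s: int = 10
    aspect_ratio: str = "16:9"

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the settings fit the table."""
        problems = []
        if self.resolution not in RESOLUTIONS:
            problems.append(f"unsupported resolution: {self.resolution}")
        if self.duration_s not in DURATIONS_S:
            problems.append(f"unsupported duration: {self.duration_s}s")
        if self.aspect_ratio not in ASPECT_RATIOS:
            problems.append(f"unsupported aspect ratio: {self.aspect_ratio}")
        return problems

    def frame_count(self) -> int:
        """Frames in the clip at the documented 24 FPS."""
        return FRAME_RATE * self.duration_s
```

For example, `Wan26Settings(duration_s=15).frame_count()` gives 360 frames, and a request for a 7-second clip would be reported by `validate()` rather than silently accepted.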

How to use

1. Open the AI Video Generator

Log in to ImagineArt and go to the AI Video Generator.

2. Select Wan 2.6

Choose Wan 2.6 from the model dropdown.

3. Select R2V mode

Choose the Reference-to-Video generation mode.

4. Upload character reference

Upload a reference image of the character to use. Optionally, upload a voice reference audio clip.

5. Describe the scene

Write a prompt describing the scene, environment, action, and audio atmosphere around your character.

6. Generate

Click Generate. Wan 2.6 places your referenced character into the generated scene with synchronized audio.

Prompting tips

  • R2V: describe the scene, not the character — The reference image provides the character; your prompt should focus on the setting, action, camera, and audio environment.
  • Include audio cues for one-pass sync — “A jazz trio plays softly in the background” or “footsteps echo on the marble floor” integrate directly into the audio generation.
  • Multi-shot: use transition language — “THEN CUT TO:” or “The camera pulls back to reveal…” cues structured multi-shot generation.
  • 15-second clips for narratives — Use the full 15-second window for storylines that need a beginning, middle, and resolution within one generation.
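
The tips above can be combined into a small prompt-assembly helper. This is an illustrative sketch only: `build_prompt` and its parameters are hypothetical names. It assumes the "THEN CUT TO:" cue from the tips as the shot separator and a trailing duration phrase like the one in the example prompts.

```python
def _sentence(text: str) -> str:
    """Strip whitespace and make sure the text ends with sentence punctuation."""
    text = text.strip()
    return text if text.endswith((".", "!", "?")) else text + "."


def build_prompt(shots, audio_cues=(), duration_s=None):
    """Assemble a prompt from shot descriptions, audio cues, and an optional duration.

    shots: scene/action/camera descriptions, one per shot (describe the scene,
           not the character, when a reference image supplies the character).
    audio_cues: short sound descriptions appended after the shots.
    duration_s: if given, appended as "<n> seconds." at the end of the prompt.
    """
    if not shots:
        raise ValueError("at least one shot description is required")
    # Multi-shot prompts are joined with the transition cue from the tips.
    parts = [" THEN CUT TO: ".join(_sentence(s) for s in shots)]
    parts.extend(_sentence(c) for c in audio_cues)
    if duration_s is not None:
        parts.append(f"{duration_s} seconds.")
    return " ".join(parts)


prompt = build_prompt(
    ["The chef plates a dish in a busy kitchen",
     "The camera pulls back to reveal the dining room"],
    audio_cues=["Sizzling pans in the background"],
    duration_s=10,
)
print(prompt)
```

Keeping shots and audio cues as separate arguments makes it easy to vary one while holding the other fixed when iterating on a generation.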

Example prompts

[R2V mode] Reference character appears as a chef in a busy restaurant kitchen. The chef plates a dish confidently, a soft smile as they look at the camera. Warm kitchen sounds, sizzling in background. 10 seconds.
A multilingual brand video: a young woman introduces a product in front of a clean white background. She speaks naturally, hands gesturing. Confident, friendly. 10 seconds, 1080p.

Compare models

| Model | R2V | Audio | Lip-sync | Duration | Best for |
| --- | --- | --- | --- | --- | --- |
| Wan 2.6 | Yes | Yes | Yes | 15s | Character reference, A/V |
| Wan 2.5 | No | Yes | Yes | 10s | General A/V production |
| Wan 2.2 | No | No | No | 5s | Camera control, style LoRA |
| Seedance 1.5 Pro | No | Yes | Multilingual | 12s | Multilingual precision |

Wan 2.6 is the best choice when a specific character needs to appear consistently in generated video: the R2V system provides character identity preservation that other models can’t match from a simple image reference alone.
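
The comparison can also be read as a simple filter over requirements. The sketch below encodes the table rows as data and returns the models that satisfy a set of needs. The `candidates` function and its field names are hypothetical, and treating Seedance 1.5 Pro's "Multilingual" lip-sync entry as plain lip-sync support is an assumption.

```python
# Rows transcribed from the comparison table; max_s is the listed duration in seconds.
MODELS = [
    {"name": "Wan 2.6", "r2v": True, "audio": True, "lip_sync": True, "max_s": 15},
    {"name": "Wan 2.5", "r2v": False, "audio": True, "lip_sync": True, "max_s": 10},
    {"name": "Wan 2.2", "r2v": False, "audio": False, "lip_sync": False, "max_s": 5},
    # "Multilingual" lip-sync is recorded as True here (assumption).
    {"name": "Seedance 1.5 Pro", "r2v": False, "audio": True, "lip_sync": True, "max_s": 12},
]


def candidates(need_r2v=False, need_audio=False, need_lip_sync=False, min_duration_s=0):
    """Return the names of models that meet every stated requirement, in table order."""
    return [
        m["name"]
        for m in MODELS
        if (m["r2v"] or not need_r2v)
        and (m["audio"] or not need_audio)
        and (m["lip_sync"] or not need_lip_sync)
        and m["max_s"] >= min_duration_s
    ]


print(candidates(need_r2v=True))
print(candidates(need_audio=True, min_duration_s=12))
```

With the table as given, requiring R2V leaves only Wan 2.6, which matches the recommendation above.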