Reference-to-video: put real characters in any scene
Wan 2.6’s headline capability is its R2V (Reference-to-Video) mode: upload a reference image of a character, and Wan 2.6 inserts that character’s appearance consistently into a generated scene. Combined with a voice reference input, both the character’s face and voice can be preserved in the generated video, which makes Wan 2.6 well suited to creator-centric workflows where a personal or brand character identity needs to appear in generated content. The model also upgrades text-to-video, image-to-video, and audio-to-video generation with one-pass A/V synchronization and precise lip-sync.
Capabilities
Reference-to-video (R2V)
Insert a character’s appearance from a reference image — and optionally their voice — into any generated scene with consistent identity preservation.
One-pass A/V synchronization
Audio and video generated in a single pass — synchronized sound effects, music, and voice generated with the video without post-production.
Precise lip-sync
Character lip movements synchronized accurately with generated or reference audio — suitable for dialogue-driven content.
Multi-shot storytelling
Generates coherent multi-shot sequences from simple prompts — scene transitions, character continuity, and narrative flow maintained automatically.
Up to 15 seconds
One of the longer generation windows in the lineup, supporting more developed narrative sequences in 5-, 10-, or 15-second clips.
Multiple generation modes
Text-to-video, image-to-video, audio-to-video, and reference-to-video all supported in a single model.
Generation modes
| Mode | Description |
|---|---|
| Text-to-video | Generate video from text prompt with A/V sync |
| Image-to-video | Animate a reference image with motion and audio |
| Reference-to-video (R2V) | Insert a character’s appearance and voice from reference inputs |
| Audio-to-video | Generate matching visuals from an audio reference |
Specifications
| Feature | Details |
|---|---|
| Developer | Alibaba (Wan Video) |
| Resolution | 720p, 1080p |
| Duration | 5, 10, or 15 seconds |
| Frame rate | 24 FPS |
| Aspect ratios | 16:9, 9:16, 1:1, 4:3, 3:4 |
| Audio | Sound effects and music, synchronized with video; lip-sync |
| R2V | Character appearance + voice insertion |
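
The limits in the table above can be checked before a generation is submitted. The following is a minimal sketch, assuming a hypothetical `GenerationSettings` container; the class, its field names, and its defaults are illustrative and not part of any official Wan 2.6 schema. Only the allowed values themselves come from the table.

```python
from dataclasses import dataclass

# Allowed values copied from the specifications table.
RESOLUTIONS = {"720p", "1080p"}
DURATIONS_S = {5, 10, 15}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
FRAME_RATE_FPS = 24


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical settings container (illustrative, not an official schema)."""

    resolution: str = "720p"
    duration_s: int = 5
    aspect_ratio: str = "16:9"

    def problems(self) -> list[str]:
        """Return a human-readable message for each value outside the table."""
        found = []
        if self.resolution not in RESOLUTIONS:
            found.append(f"unsupported resolution: {self.resolution}")
        if self.duration_s not in DURATIONS_S:
            found.append(f"unsupported duration: {self.duration_s}s")
        if self.aspect_ratio not in ASPECT_RATIOS:
            found.append(f"unsupported aspect ratio: {self.aspect_ratio}")
        return found

    def nominal_frame_count(self) -> int:
        """Duration times 24 FPS; the real output may differ slightly."""
        return self.duration_s * FRAME_RATE_FPS


settings = GenerationSettings(resolution="1080p", duration_s=10, aspect_ratio="9:16")
print(settings.problems())
print(settings.nominal_frame_count())
```

Returning a list of problems rather than raising on the first one lets a caller report every invalid setting at once.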
How to use
- Reference-to-video
- Text-to-video
Upload character reference
Upload a reference image of the character to use. Optionally, upload a voice reference audio clip.
Describe the scene
Write a prompt describing the scene, environment, action, and audio atmosphere around your character.
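The two steps above can be sketched as a small helper that collects the inputs for an R2V generation. This page does not document an API, so everything here is an assumption for illustration: the function name `build_r2v_request`, the dictionary keys, and the file names are hypothetical, and the sketch only assembles a plain dictionary without contacting any service.

```python
from typing import Optional


def build_r2v_request(
    reference_image: str,
    prompt: str,
    voice_reference: Optional[str] = None,
    duration_s: int = 10,
) -> dict:
    """Collect R2V inputs into a dict (hypothetical shape, not an official API)."""
    # Step 1: a character reference image is required for R2V.
    if not reference_image:
        raise ValueError("R2V needs a reference image of the character")
    # Step 2: the prompt describes the scene around the character.
    if not prompt.strip():
        raise ValueError("the scene prompt must not be empty")

    request = {
        "mode": "reference-to-video",
        "reference_image": reference_image,
        "prompt": prompt.strip(),
        "duration_s": duration_s,
    }
    # The voice reference is optional, so include it only when provided.
    if voice_reference:
        request["voice_reference"] = voice_reference
    return request


request = build_r2v_request(
    reference_image="chef.png",      # hypothetical file name
    voice_reference="chef_voice.wav",  # hypothetical file name
    prompt="A busy restaurant kitchen. Warm kitchen sounds, sizzling in background.",
)
print(request)
```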
Prompting tips
- R2V: describe the scene, not the character. The reference image provides the character, so your prompt should focus on the setting, action, camera, and audio environment.
- Include audio cues for one-pass sync. Cues such as “A jazz trio plays softly in the background” or “footsteps echo on the marble floor” feed directly into the audio generation.
- Multi-shot: use transition language. Phrases such as “THEN CUT TO:” or “The camera pulls back to reveal…” mark shot boundaries and help structure multi-shot generation.
- Use 15-second clips for narratives. The full 15-second window suits storylines that need a beginning, middle, and resolution within one generation.
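These tips can be applied mechanically when prompts are assembled in code. The sketch below is one possible approach, not a required format: `build_prompt` is a hypothetical helper that joins shot descriptions with the “THEN CUT TO:” cue, appends audio cues, and ends with the duration in the same style as the example prompts.

```python
from typing import Iterable, Optional


def build_prompt(
    shots: Iterable[str],
    audio_cues: Iterable[str] = (),
    duration_s: Optional[int] = None,
    transition: str = "THEN CUT TO:",
) -> str:
    """Compose a multi-shot prompt with audio cues (hypothetical helper)."""
    cleaned = [s.strip() for s in shots if s.strip()]
    if not cleaned:
        raise ValueError("at least one shot description is required")

    # Transition language between shots signals the shot boundaries.
    parts = [f" {transition} ".join(cleaned)]
    # Audio cues are plain sentences describing the sound of the scene.
    parts.extend(c.strip() for c in audio_cues if c.strip())
    if duration_s is not None:
        parts.append(f"{duration_s} seconds.")
    return " ".join(parts)


prompt = build_prompt(
    shots=[
        "The chef plates a dish in a busy restaurant kitchen.",
        "Close-up of the finished plate.",
    ],
    audio_cues=["Warm kitchen sounds, sizzling in background."],
    duration_s=15,
)
print(prompt)
```

In line with the first tip, the shot descriptions in an R2V prompt cover the setting and action only, since the reference image supplies the character's appearance.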
Example prompts
[R2V mode] Reference character appears as a chef in a busy restaurant kitchen. The chef plates a dish confidently, a soft smile as they look at the camera. Warm kitchen sounds, sizzling in background. 10 seconds.
A multilingual brand video: a young woman introduces a product in front of a clean white background. She speaks naturally, hands gesturing. Confident, friendly. 10 seconds, 1080p.
Compare models
| Model | R2V | Audio | Lip-sync | Duration | Best for |
|---|---|---|---|---|---|
| Wan 2.6 | Yes | Yes | Yes | 15s | Character reference, A/V |
| Wan 2.5 | No | Yes | Yes | 10s | General A/V production |
| Wan 2.2 | No | No | No | 5s | Camera control, style LoRA |
| Seedance 1.5 Pro | No | Yes | Multilingual | 12s | Multilingual precision |
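
The comparison table can also be expressed as data, which makes it easy to filter models by requirement. This is a minimal sketch under two assumptions: the Duration column is read as each model's maximum clip length, and Seedance 1.5 Pro's “Multilingual” lip-sync entry is treated as lip-sync support. The `Model` type and `candidates` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    r2v: bool
    audio: bool
    lip_sync: bool
    max_duration_s: int  # assumption: the Duration column is a maximum


# Rows transcribed from the comparison table.
MODELS = [
    Model("Wan 2.6", True, True, True, 15),
    Model("Wan 2.5", False, True, True, 10),
    Model("Wan 2.2", False, False, False, 5),
    Model("Seedance 1.5 Pro", False, True, True, 12),  # "Multilingual" lip-sync
]


def candidates(
    need_r2v: bool = False,
    need_audio: bool = False,
    need_lip_sync: bool = False,
    min_duration_s: int = 5,
) -> list[str]:
    """Return the names of models that meet every stated requirement."""
    return [
        m.name
        for m in MODELS
        if (m.r2v or not need_r2v)
        and (m.audio or not need_audio)
        and (m.lip_sync or not need_lip_sync)
        and m.max_duration_s >= min_duration_s
    ]


print(candidates(need_r2v=True))
print(candidates(need_audio=True, min_duration_s=12))
```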

