VIDEO MODEL by Alibaba Wan family

# Wan 2.6

Alibaba's reference-to-video model — insert a character's appearance and voice from a reference input, generate multi-shot narratives with synchronized audio, and produce up to 15 seconds at 1080p with precise lip-sync. Built for character-centric, multilingual, and audio-synchronized production.

* **Duration:** Up to 15 seconds
* **Resolution:** 720p–1080p
* **Audio:** SFX + Music + Lip-sync
* **Frame rate:** 24 FPS

## Reference-to-video: put real characters in any scene

Wan 2.6's headline capability is its R2V (Reference-to-Video) mode: upload a reference image of a character, and Wan 2.6 inserts that character's appearance consistently into a generated scene. Combined with a voice reference input, both the character's face and voice can be preserved in the generated video, which suits creator-centric workflows where a personal or brand character identity needs to appear in generated content. The model also introduces upgrades across text-to-video, image-to-video, and audio-to-video generation, with one-pass A/V synchronization and precise lip-sync.

## Capabilities

* **Reference-to-video.** Insert a character's appearance from a reference image, and optionally their voice, into any generated scene with consistent identity preservation.
* **One-pass audio and video.** Sound effects, music, and voice are generated in the same pass as the video and stay synchronized with it, with no post-production step.
* **Lip-sync.** Character lip movements are synchronized with generated or reference audio, which makes the model suitable for dialogue-driven content.
* **Multi-shot narratives.** Coherent multi-shot sequences are generated from simple prompts, with scene transitions, character continuity, and narrative flow maintained automatically.
* **Up to 15 seconds.** One of the longer generation windows in the lineup; clips of 5, 10, or 15 seconds leave room for more developed narrative sequences.
* **Four generation modes.** Text-to-video, image-to-video, audio-to-video, and reference-to-video are all supported in a single model.

## Generation modes

| Mode                         | Description                                                     |
| ---------------------------- | --------------------------------------------------------------- |
| **Text-to-video**            | Generate video from text prompt with A/V sync                   |
| **Image-to-video**           | Animate a reference image with motion and audio                 |
| **Reference-to-video (R2V)** | Insert a character's appearance and voice from reference inputs |
| **Audio-to-video**           | Generate matching visuals from an audio reference               |

## Specifications

| Feature           | Details                                         |
| ----------------- | ----------------------------------------------- |
| **Developer**     | Alibaba (Wan Video)                             |
| **Resolution**    | 720p, 1080p                                     |
| **Duration**      | 5, 10, or 15 seconds                            |
| **Frame rate**    | 24 FPS                                          |
| **Aspect ratios** | 16:9, 9:16, 1:1, 4:3, 3:4                       |
| **Audio**         | SFX, music, lip-sync (synchronized with video)  |
| **R2V**           | Character appearance + voice insertion          |

## How to use

To place a reference character in a scene:

1. Log into ImagineArt and go to the **AI Video Generator**.
2. Choose **Wan 2.6** from the model dropdown.
3. Choose the **Reference-to-Video** generation mode.
4. Upload a reference image of the character to use.
5. Optionally, upload a voice reference audio clip.
6. Write a prompt describing the scene, environment, action, and audio atmosphere around your character.
7. Click **Generate**. Wan 2.6 places your referenced character into the generated scene with synchronized audio.

To generate audio-synced video from a prompt:

1. Go to the **AI Video Generator** and select **Wan 2.6**.
2. Describe the scene, subjects, motion, and audio cues.
3. Include any multi-shot structure with transition cues.
4. Choose 5, 10, or 15 seconds at your target resolution.
5. Click **Generate** to produce audio-synced video.

## Prompting tips

* **R2V: describe the scene, not the character.** The reference image provides the character; your prompt should focus on the setting, action, camera, and audio environment.
* **Include audio cues for one-pass sync.** Phrases such as "A jazz trio plays softly in the background" or "footsteps echo on the marble floor" feed directly into the audio generation.
* **Multi-shot: use transition language.** Cues such as "THEN CUT TO:" or "The camera pulls back to reveal..." signal shot changes for structured multi-shot generation.
* **15-second clips for narratives.** Use the full 15-second window for storylines that need a beginning, middle, and resolution within one generation.

### Example prompts

> \[R2V mode] Reference character appears as a chef in a busy restaurant kitchen. The chef plates a dish confidently, a soft smile as they look at the camera. Warm kitchen sounds, sizzling in background. 10 seconds.

> A multilingual brand video: a young woman introduces a product in front of a clean white background. She speaks naturally, hands gesturing. Confident, friendly. 10 seconds, 1080p.

## Compare models

| Model                                                 | R2V | Audio | Lip-sync     | Duration | Best for                   |
| ----------------------------------------------------- | --- | ----- | ------------ | -------- | -------------------------- |
| **Wan 2.6**                                           | Yes | Yes   | Yes          | 15s      | Character reference, A/V   |
| [Wan 2.5](/ai-models/video/wan-2-5)                   | No  | Yes   | Yes          | 10s      | General A/V production     |
| [Wan 2.2](/ai-models/video/wan-2-2)                   | No  | No    | No           | 5s       | Camera control, style LoRA |
| [Seedance 1.5 Pro](/ai-models/video/seedance-1-5-pro) | No  | Yes   | Multilingual | 12s      | Multilingual precision     |

Choose Wan 2.6 when a specific character needs to appear consistently in generated video: of the models compared here, it is the only one with R2V, which preserves character identity from a simple image reference.
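
If you prepare prompts and settings in a script before entering them in the AI Video Generator, the documented limits and the multi-shot tip can be sketched as a small pre-flight check. This is only an illustration: `GenerationSettings`, `problems`, `frame_count`, and `build_multishot_prompt` are hypothetical names, not part of any ImagineArt or Wan API, and the frame count assumes a clip runs exactly the selected duration at 24 FPS. Only the allowed values and the "THEN CUT TO:" cue come from this page.

```python
from __future__ import annotations

from dataclasses import dataclass

# Limits copied from the Specifications table on this page.
ALLOWED_DURATIONS_S = (5, 10, 15)
ALLOWED_RESOLUTIONS = ("720p", "1080p")
ALLOWED_ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")
FRAME_RATE_FPS = 24


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for the options you would pick in the UI."""

    duration_s: int
    resolution: str
    aspect_ratio: str

    def problems(self) -> list[str]:
        """Return readable problems; an empty list means the settings fit the spec."""
        found = []
        if self.duration_s not in ALLOWED_DURATIONS_S:
            found.append(f"duration must be one of {ALLOWED_DURATIONS_S}")
        if self.resolution not in ALLOWED_RESOLUTIONS:
            found.append(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
        if self.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            found.append(f"aspect ratio must be one of {ALLOWED_ASPECT_RATIOS}")
        return found

    def frame_count(self) -> int:
        """Frames in the clip, assuming exactly duration x 24 FPS."""
        return self.duration_s * FRAME_RATE_FPS


def build_multishot_prompt(shots: list[str], audio_cue: str = "") -> str:
    """Join shot descriptions with the 'THEN CUT TO:' cue, then append an audio cue."""
    cleaned = [shot.strip() for shot in shots if shot.strip()]
    if not cleaned:
        raise ValueError("at least one shot description is required")
    prompt = " THEN CUT TO: ".join(cleaned)
    if audio_cue.strip():
        prompt = f"{prompt} {audio_cue.strip()}"
    return prompt
```

For instance, `GenerationSettings(15, "1080p", "16:9").frame_count()` evaluates to 360 under the stated assumption, and `problems()` returns an empty list for that combination.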