Audio-visual synchronization in a single pass
Wan 2.5 is Alibaba’s audio-visual synchronization model in the Wan family. Its primary strength is one-pass A/V generation: ambient sounds, sound effects, and voice are generated together with the video and synchronized at the frame level, with no post-production step. Lip-sync support makes it well suited to content where characters speak, sing, or react expressively to audio. For reference-to-video with character insertion and voice reference support, see Wan 2.6, the successor model with expanded capabilities. Within the family, Wan 2.5 is the general-purpose, audio-capable option for standard A/V production.
Capabilities
One-pass A/V synchronization
Ambient sounds, sound effects, and voice generated simultaneously with the video — no separate audio editing or syncing required.
Precise lip-sync
Character lip movements synchronized accurately with generated audio — suitable for dialogue, narration, and character-driven clips.
Smooth motion flow
Consistent subject movement, natural transitions, and fluid camera behavior across the full clip duration.
Flexible resolution
480p, 720p, or 1080p — select based on quality requirements and credit budget.
Text and image input
Supports text prompts, uploaded reference images, or a combination of both for broader creative control.
Multiple aspect ratios
16:9, 9:16, 1:1, 4:3, and 3:4 — flexible framing for social, cinematic, and standard formats.
Specifications
| Feature | Details |
|---|---|
| Developer | Alibaba (Wan Video) |
| Resolution | 480p, 720p, 1080p |
| Duration | 5 or 10 seconds |
| Aspect ratios | 16:9, 9:16, 1:1, 4:3, 3:4 |
| Audio | Ambient, SFX, voice (native, one-pass) |
| Lip-sync | Yes |
| Input modes | Text-to-video, image-to-video |
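The limits in the table can be checked before a generation is submitted. The following is a minimal sketch, assuming a hypothetical `WanSettings` container: the class, its field names, its default values, and the `validate` helper are illustrative and do not correspond to any published API. Only the allowed values are taken from the table above.

```python
from dataclasses import dataclass
from typing import List

# Allowed values copied from the specifications table.
RESOLUTIONS = {"480p", "720p", "1080p"}
DURATIONS_S = {5, 10}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}


@dataclass(frozen=True)
class WanSettings:
    """Hypothetical settings container; the defaults are assumptions."""
    resolution: str = "720p"
    duration_s: int = 5
    aspect_ratio: str = "16:9"


def validate(settings: WanSettings) -> List[str]:
    """Return a list of problems; an empty list means the settings fit the table."""
    problems = []
    if settings.resolution not in RESOLUTIONS:
        problems.append(f"unsupported resolution: {settings.resolution}")
    if settings.duration_s not in DURATIONS_S:
        problems.append(f"unsupported duration: {settings.duration_s}s")
    if settings.aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {settings.aspect_ratio}")
    return problems
```

Calling `validate(WanSettings(resolution="1080p", duration_s=10))` returns an empty list, while a value outside the table, such as a 15-second duration, is reported as a problem rather than raising an exception, so all issues can be shown at once.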
How to use
Enter your prompt
Write a text prompt or upload a reference image. Include explicit motion, mood, and audio cues for best results.
Set duration and resolution
Choose 5 or 10 seconds and your preferred resolution (480p, 720p, or 1080p).
Prompting tips
- Include audio cues explicitly — “Rain in the background,” “distant city traffic,” or “soft piano music” feed directly into the audio generation alongside the visual.
- Describe motion and mood — Be specific about how subjects move and the atmosphere you want. “Slow pan,” “bustling city energy,” or “tense stillness” all guide the model.
- Use camera terminology — “Overhead shot,” “wide establishing shot,” and “slow zoom in” give clear directional cues.
- Specify lighting — “Golden hour,” “low-key studio lighting,” or “overcast afternoon” guide the visual output alongside the audio.
- For lip-sync — Describe your character’s speech or emotional reaction explicitly to anchor the lip movement generation.
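The tips above amount to a checklist of cue types, which can be assembled programmatically when prompts are generated in bulk. This is a rough sketch only: `build_prompt`, its parameter names, and the `Audio:` label are assumptions made for illustration, not a format the model is documented to require.

```python
def build_prompt(subject, motion="", camera="", lighting="", speech="", audio=()):
    """Join the non-empty cues into one prompt string, visual cues first.

    Hypothetical helper: the cue order and the "Audio:" label are
    illustrative choices, not a documented prompt format.
    """
    parts = [subject, motion, camera, lighting, speech]
    if audio:
        # Audio cues are listed explicitly, as the first tip recommends.
        parts.append("Audio: " + ", ".join(audio))
    # Drop empty cues and normalize each remaining cue to end with one period.
    cleaned = [p.strip().rstrip(".") for p in parts if p and p.strip()]
    return " ".join(p + "." for p in cleaned)


prompt = build_prompt(
    "A young man unpacks headphones in a modern apartment",
    camera="Smooth dolly shot, slow zoom in",
    lighting="Overcast afternoon",
    audio=["distant city traffic", "soft piano music"],
)
print(prompt)
```

Keeping each cue type as a separate argument makes it easy to spot a prompt that is missing audio or lighting direction before it is submitted.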
Example prompts
Close-up shot: a woman in a vintage suit sits pensively at a café table. The camera slowly zooms in on her thoughtful expression as she speaks softly. Warm, ambient café sounds — quiet chatter, distant music. 10 seconds, 16:9.
A young man carefully unpacks a pair of headphones in a modern apartment. Smooth dolly shot, slow zoom in on his focused expression. City ambient sounds through open windows in the background. 10 seconds, 1080p.
Compare models
| Model | Audio | Lip-sync | Max duration | Reference-to-video (R2V) | Best for |
|---|---|---|---|---|---|
| Wan 2.5 | Yes | Yes | 10s | No | General A/V, lip-sync |
| Wan 2.6 | Yes | Yes | 15s | Yes | Character reference, voice insertion |
| Wan 2.2 | No | No | 5s | No | Camera control, LoRA style |
| Seedance 1.5 Pro | Yes | Multilingual | 12s | No | Multilingual precision lip-sync |
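The comparison can also be read as a simple filter over requirements. The sketch below encodes the four rows as data and returns the models that satisfy a request. It assumes the duration column gives each model's maximum clip length and treats Seedance's "Multilingual" lip-sync entry as supported; `Model` and `candidates` are hypothetical names used only for this example.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    audio: bool
    lip_sync: bool
    max_duration_s: int  # assumption: the duration column is a maximum
    r2v: bool            # reference-to-video support


# Rows transcribed from the comparison table.
MODELS = [
    Model("Wan 2.5", True, True, 10, False),
    Model("Wan 2.6", True, True, 15, True),
    Model("Wan 2.2", False, False, 5, False),
    Model("Seedance 1.5 Pro", True, True, 12, False),  # lip-sync listed as "Multilingual"
]


def candidates(duration_s, audio=False, lip_sync=False, r2v=False) -> List[str]:
    """Return names of models that meet every requested requirement."""
    return [
        m.name
        for m in MODELS
        if m.max_duration_s >= duration_s
        and (m.audio or not audio)
        and (m.lip_sync or not lip_sync)
        and (m.r2v or not r2v)
    ]


print(candidates(10, audio=True, lip_sync=True))
print(candidates(12, r2v=True))
```

A requirement left at `False` means "not needed", so it never excludes a model; only features that are explicitly requested narrow the list.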

