Video model by Alibaba (Wan family)

# Wan 2.5

Alibaba's audio-visual sync model generates ambient sounds, sound effects, and voice with precise lip-sync alongside the video in a single pass. It supports 480p to 1080p output at 5 or 10 seconds, with flexible aspect ratios and text or image input.

* **Resolution:** 480p, 720p, or 1080p
* **Duration:** 5 or 10 seconds
* **Audio:** Ambient + SFX + voice
* **Lip-sync:** Yes

## Audio-visual synchronization in a single pass

Wan 2.5 is Alibaba's dedicated audio-visual synchronization model in the Wan family. Its primary strength is one-pass A/V generation: ambient sounds, sound effects, and voice are generated together with the video and synchronized at the frame level, with no post-production step. Lip-sync support makes it particularly well suited to content where characters speak, sing, or react expressively to audio.

For reference-to-video with character insertion and voice reference support, see [Wan 2.6](/ai-models/video/wan-2-6), the successor model with expanded capabilities. Wan 2.5 is the audio-capable, general-purpose member of the Wan family for standard A/V production.

## Capabilities

* Ambient sounds, sound effects, and voice are generated simultaneously with the video, so no separate audio editing or syncing is required.
* Character lip movements are synchronized accurately with the generated audio, which suits dialogue, narration, and character-driven clips.
* Subject movement stays consistent, with natural transitions and fluid camera behavior across the full clip duration.
* Output is available at 480p, 720p, or 1080p; select based on quality requirements and credit budget.
* Input can be a text prompt, an uploaded reference image, or a combination of both for broader creative control.
* Aspect ratios of 16:9, 9:16, 1:1, 4:3, and 3:4 give flexible framing for social, cinematic, and standard formats.

## Specifications

| Feature           | Details                                |
| ----------------- | -------------------------------------- |
| **Developer**     | Alibaba (Wan Video)                    |
| **Resolution**    | 480p, 720p, 1080p                      |
| **Duration**      | 5 or 10 seconds                        |
| **Aspect ratios** | 16:9, 9:16, 1:1, 4:3, 3:4              |
| **Audio**         | Ambient, SFX, voice (native, one-pass) |
| **Lip-sync**      | Yes                                    |
| **Input modes**   | Text-to-video, image-to-video          |

## How to use

1. Log into ImagineArt and go to the **AI Video Generator**.
2. Choose **Wan 2.5** from the model dropdown.
3. Write a text prompt or upload a reference image. Include explicit motion, mood, and audio cues for best results.
4. Choose **5 or 10 seconds** and your preferred resolution (480p, 720p, or 1080p).
5. Click **Generate** to produce the video with synchronized audio.
6. Preview the clip, adjust your prompt or settings as needed, and download.

## Prompting tips

* **Include audio cues explicitly:** "Rain in the background," "distant city traffic," or "soft piano music" feed directly into the audio generation alongside the visual.
* **Describe motion and mood:** Be specific about how subjects move and the atmosphere you want. "Slow pan," "bustling city energy," or "tense stillness" all guide the model.
* **Use camera terminology:** "Overhead shot," "wide establishing shot," and "slow zoom in" give clear directional cues.
* **Specify lighting:** "Golden hour," "low-key studio lighting," or "overcast afternoon" guide the visual output alongside the audio.
* **For lip-sync:** Describe your character's speech or emotional reaction explicitly to anchor the lip movement generation.

### Example prompts

> Close-up shot: a woman in a vintage suit sits pensively at a café table. The camera slowly zooms in on her thoughtful expression as she speaks softly. Warm, ambient café sounds: quiet chatter, distant music. 10 seconds, 16:9.

> A young man carefully unpacks a pair of headphones in a modern apartment. Smooth dolly shot, slow zoom in on his focused expression. City ambient sounds through open windows in the background. 10 seconds, 1080p.
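The prompting tips and the documented settings can be combined into a small local helper. The sketch below is only an illustration: `ClipRequest`, its field names, and `to_prompt` are hypothetical and are not part of ImagineArt or Wan 2.5. It checks the duration, resolution, and aspect ratio against the values listed under Specifications and joins the subject, camera, lighting, speech, and audio cues into one prompt string that you could paste into the generator.

```python
from dataclasses import dataclass

# Allowed values copied from the Specifications table on this page.
RESOLUTIONS = ("480p", "720p", "1080p")
DURATIONS_S = (5, 10)
ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")


@dataclass
class ClipRequest:
    """Hypothetical container for one clip; not an ImagineArt type."""

    subject: str            # who or what is on screen, and what they do
    camera: str = ""        # e.g. "Slow zoom in"
    lighting: str = ""      # e.g. "Golden hour"
    speech: str = ""        # speech or reaction, to anchor lip-sync
    audio: str = ""         # ambient sound, SFX, or music cues
    duration_s: int = 5
    resolution: str = "720p"
    aspect_ratio: str = "16:9"

    def validate(self) -> None:
        """Raise ValueError if a setting is outside the documented options."""
        if self.duration_s not in DURATIONS_S:
            raise ValueError(f"duration must be one of {DURATIONS_S} seconds")
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"resolution must be one of {RESOLUTIONS}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"aspect ratio must be one of {ASPECT_RATIOS}")

    def to_prompt(self) -> str:
        """Join the non-empty cues into sentences, then append the settings."""
        self.validate()
        cues = [self.subject, self.camera, self.lighting, self.speech, self.audio]
        sentences = [c.strip().rstrip(".") + "." for c in cues if c.strip()]
        sentences.append(f"{self.duration_s} seconds, {self.aspect_ratio}.")
        return " ".join(sentences)


request = ClipRequest(
    subject="Close-up shot of a woman at a cafe table",
    camera="Slow zoom in",
    audio="Warm ambient cafe sounds",
    duration_s=10,
)
print(request.to_prompt())
# prints: Close-up shot of a woman at a cafe table. Slow zoom in.
#         Warm ambient cafe sounds. 10 seconds, 16:9.   (on a single line)
```

Keeping the audio cue as its own field is a deliberate choice in this sketch: it makes it harder to forget the explicit audio description that the tips above recommend.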
## Compare models

| Model                                                 | Audio | Lip-sync     | Max duration | R2V | Best for                             |
| ----------------------------------------------------- | ----- | ------------ | ------------ | --- | ------------------------------------ |
| **Wan 2.5**                                           | Yes   | Yes          | 10s          | No  | General A/V, lip-sync                |
| [Wan 2.6](/ai-models/video/wan-2-6)                   | Yes   | Yes          | 15s          | Yes | Character reference, voice insertion |
| [Wan 2.2](/ai-models/video/wan-2-2)                   | No    | No           | 5s           | No  | Camera control, LoRA style           |
| [Seedance 1.5 Pro](/ai-models/video/seedance-1-5-pro) | Yes   | Multilingual | 12s          | No  | Multilingual precision lip-sync      |

Use Wan 2.5 when your project needs both visual impact and audio coherence in a single generation. For character identity preservation with voice reference input, upgrade to [Wan 2.6](/ai-models/video/wan-2-6), the reference-to-video (R2V) successor model.
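The comparison can also be read as a simple decision rule. The sketch below transcribes the four rows into a list and returns the first model, in table order, that meets a set of requirements. It rests on two assumptions that the table does not state outright: the duration column is treated as each model's maximum clip length, and Seedance's "Multilingual" lip-sync entry is treated as lip-sync being available. `ModelRow` and `pick_model` are hypothetical names used only for this illustration.

```python
from typing import NamedTuple, Optional


class ModelRow(NamedTuple):
    name: str
    audio: bool
    lip_sync: bool
    max_duration_s: int  # assumption: the table's duration is a maximum
    r2v: bool            # reference-to-video support


# Rows transcribed from the comparison table, in the same order.
MODELS = [
    ModelRow("Wan 2.5", True, True, 10, False),
    ModelRow("Wan 2.6", True, True, 15, True),
    ModelRow("Wan 2.2", False, False, 5, False),
    ModelRow("Seedance 1.5 Pro", True, True, 12, False),
]


def pick_model(
    duration_s: int,
    need_audio: bool = False,
    need_lip_sync: bool = False,
    need_r2v: bool = False,
) -> Optional[str]:
    """Return the first model in table order that meets every requirement."""
    for row in MODELS:
        if row.max_duration_s < duration_s:
            continue
        if need_audio and not row.audio:
            continue
        if need_lip_sync and not row.lip_sync:
            continue
        if need_r2v and not row.r2v:
            continue
        return row.name
    return None  # nothing in the table fits


print(pick_model(10, need_audio=True, need_lip_sync=True))  # prints: Wan 2.5
print(pick_model(10, need_r2v=True))                         # prints: Wan 2.6
```

Returning the first match favors Wan 2.5 whenever it qualifies, which mirrors the guidance above; a different ordering or a cost-based ranking would be an equally valid policy.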