Video model by Alibaba, Wan family

Wan 2.5

Alibaba’s audio-visual sync model — generates ambient sounds, sound effects, and voice with precise lip-sync alongside the video in a single pass. Supports 480p to 1080p at 5 or 10 seconds with flexible aspect ratios and text or image input.

Resolution: 480p to 1080p
Duration: 5 or 10 seconds
Audio: Ambient + SFX + Voice
Lip-sync: Yes

Audio-visual synchronization in a single pass

Wan 2.5 is the Wan family's audio-capable, general-purpose model for standard A/V production. Its primary strength is one-pass A/V generation: ambient sounds, sound effects, and voice are generated together with the video and synchronized at the frame level, with no post-production step. Lip-sync support makes it well suited to content where characters speak, sing, or react expressively to audio. For reference-to-video (R2V) with character insertion and voice reference support, see Wan 2.6, the successor model with expanded capabilities.

Capabilities

One-pass A/V synchronization

Ambient sounds, sound effects, and voice generated simultaneously with the video — no separate audio editing or syncing required.

Precise lip-sync

Character lip movements synchronized accurately with generated audio — suitable for dialogue, narration, and character-driven clips.

Smooth motion flow

Consistent subject movement, natural transitions, and fluid camera behavior across the full clip duration.

Flexible resolution

480p, 720p, or 1080p — select based on quality requirements and credit budget.

Text and image input

Supports text prompts, uploaded reference images, or a combination of both for broader creative control.

Multiple aspect ratios

16:9, 9:16, 1:1, 4:3, and 3:4 — flexible framing for social, cinematic, and standard formats.

Specifications

| Feature | Details |
| --- | --- |
| Developer | Alibaba (Wan Video) |
| Resolution | 480p, 720p, 1080p |
| Duration | 5 or 10 seconds |
| Aspect ratios | 16:9, 9:16, 1:1, 4:3, 3:4 |
| Audio | Ambient, SFX, voice (native, one-pass) |
| Lip-sync | Yes |
| Input modes | Text-to-video, image-to-video |
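
The limits in the specification table can be checked before a generation is started. The following is a minimal sketch in Python, not an official ImagineArt or Wan API: the class name `Wan25Settings`, its field names, and the default values are assumptions made for illustration. Only the allowed values come from the table.

```python
from dataclasses import dataclass

# Allowed values copied from the specification table.
RESOLUTIONS = ("480p", "720p", "1080p")
DURATIONS_SECONDS = (5, 10)
ASPECT_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")


@dataclass(frozen=True)
class Wan25Settings:
    """Hypothetical settings container; defaults are illustrative assumptions."""

    resolution: str = "720p"
    duration_seconds: int = 5
    aspect_ratio: str = "16:9"

    def __post_init__(self) -> None:
        # Reject any value that the specification table does not list.
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"resolution must be one of {RESOLUTIONS}")
        if self.duration_seconds not in DURATIONS_SECONDS:
            raise ValueError(f"duration_seconds must be one of {DURATIONS_SECONDS}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"aspect_ratio must be one of {ASPECT_RATIOS}")


# A vertical 10-second clip at full resolution passes validation.
vertical_clip = Wan25Settings(resolution="1080p", duration_seconds=10, aspect_ratio="9:16")
```

A value outside the table, such as `duration_seconds=7`, raises `ValueError` at construction time, so an unsupported combination is caught before any credits are spent.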

How to use

1. Open the AI Video Generator
   Log in to ImagineArt and go to the AI Video Generator.

2. Select Wan 2.5
   Choose Wan 2.5 from the model dropdown.

3. Enter your prompt
   Write a text prompt or upload a reference image. Include explicit motion, mood, and audio cues for best results.

4. Set duration and resolution
   Choose 5 or 10 seconds and your preferred resolution (480p, 720p, or 1080p).

5. Generate
   Click Generate to produce the video with synchronized audio.

6. Review and iterate
   Preview the clip, adjust your prompt or settings as needed, and download.

Prompting tips

  • Include audio cues explicitly — “Rain in the background,” “distant city traffic,” or “soft piano music” feed directly into the audio generation alongside the visual.
  • Describe motion and mood — Be specific about how subjects move and the atmosphere you want. “Slow pan,” “bustling city energy,” or “tense stillness” all guide the model.
  • Use camera terminology — “Overhead shot,” “wide establishing shot,” and “slow zoom in” give clear directional cues.
  • Specify lighting — “Golden hour,” “low-key studio lighting,” or “overcast afternoon” guide the visual output alongside the audio.
  • For lip-sync — Describe your character’s speech or emotional reaction explicitly to anchor the lip movement generation.
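
The tips above amount to assembling a prompt from a few labeled parts: camera shot, subject, motion, lighting, speech, and audio cues. The sketch below shows one way to do that in Python. The function `build_prompt`, its parameter names, and the "Audio:" label are hypothetical conventions chosen for illustration; Wan 2.5 accepts free-form text and does not require this layout.

```python
from typing import Iterable


def _sentence(text: str) -> str:
    """Trim a fragment and end it with exactly one period (empty stays empty)."""
    text = text.strip().rstrip(".")
    return f"{text}." if text else ""


def build_prompt(
    shot: str,
    subject: str,
    motion: str = "",
    lighting: str = "",
    speech: str = "",
    audio_cues: Iterable[str] = (),
) -> str:
    """Join the prompt parts into one text prompt, skipping any left empty."""
    parts = [
        _sentence(f"{shot.strip()}: {subject}"),
        _sentence(motion),
        _sentence(lighting),
        _sentence(speech),
    ]
    # Audio cues are listed explicitly so they are not lost among visual details.
    cues = [cue.strip() for cue in audio_cues if cue.strip()]
    if cues:
        parts.append(_sentence("Audio: " + ", ".join(cues)))
    return " ".join(part for part in parts if part)


prompt = build_prompt(
    shot="Close-up shot",
    subject="a woman in a vintage suit sits pensively at a café table",
    motion="The camera slowly zooms in on her thoughtful expression",
    lighting="Warm golden hour light",
    speech="She speaks softly",
    audio_cues=["quiet chatter", "distant music"],
)
print(prompt)
```

Keeping each part as a separate argument makes it easy to vary one element, such as the audio cues, while holding the rest of the prompt fixed between iterations.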

Example prompts

Close-up shot: a woman in a vintage suit sits pensively at a café table. The camera slowly zooms in on her thoughtful expression as she speaks softly. Warm, ambient café sounds — quiet chatter, distant music. 10 seconds, 16:9.
A young man carefully unpacks a pair of headphones in a modern apartment. Smooth dolly shot, slow zoom in on his focused expression. City ambient sounds through open windows in the background. 10 seconds, 1080p.

Compare models

| Model | Audio | Lip-sync | Duration | R2V (reference-to-video) | Best for |
| --- | --- | --- | --- | --- | --- |
| Wan 2.5 | Yes | Yes | 10s | No | General A/V, lip-sync |
| Wan 2.6 | Yes | Yes | 15s | Yes | Character reference, voice insertion |
| Wan 2.2 | No | No | 5s | No | Camera control, LoRA style |
| Seedance 1.5 Pro | Yes | Multilingual | 12s | No | Multilingual precision lip-sync |

Use Wan 2.5 when your project needs both visual impact and audio coherence in a single generation. For character identity preservation with voice reference input, upgrade to Wan 2.6, the R2V successor model.
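
The comparison can also be read as a simple filter over requirements. The sketch below encodes the comparison rows in Python and returns the models that satisfy a set of needs. It rests on two assumptions that are not stated in the table: the Duration column is treated as the longest supported clip, and Seedance 1.5 Pro's "Multilingual" lip-sync entry is treated as "yes". The names `ModelRow` and `pick_models` are hypothetical.

```python
from typing import List, NamedTuple


class ModelRow(NamedTuple):
    name: str
    audio: bool
    lip_sync: bool
    duration_s: int  # assumed to be the longest supported clip, in seconds
    r2v: bool


# Rows transcribed from the comparison table.
MODELS = [
    ModelRow("Wan 2.5", True, True, 10, False),
    ModelRow("Wan 2.6", True, True, 15, True),
    ModelRow("Wan 2.2", False, False, 5, False),
    ModelRow("Seedance 1.5 Pro", True, True, 12, False),  # "Multilingual" read as yes
]


def pick_models(
    need_audio: bool = False,
    need_lip_sync: bool = False,
    need_r2v: bool = False,
    min_duration_s: int = 0,
) -> List[str]:
    """Return the names of models that meet every stated requirement, in table order."""
    return [
        m.name
        for m in MODELS
        if (m.audio or not need_audio)
        and (m.lip_sync or not need_lip_sync)
        and (m.r2v or not need_r2v)
        and m.duration_s >= min_duration_s
    ]


# A talking-character clip of 10 seconds rules out only Wan 2.2.
candidates = pick_models(need_audio=True, need_lip_sync=True, min_duration_s=10)
```

Requiring reference-to-video narrows the list to Wan 2.6 alone, which matches the guidance above.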