Built for multilingual dialogue and lip-sync
Seedance 1.5 Pro is ByteDance’s purpose-built model for dialogue-heavy and multilingual video content. Its 4.5-billion-parameter Dual-Branch Diffusion Transformer (DB-DiT) architecture achieves millisecond-precision lip-sync: character mouth movements align exactly with the audio across 8+ languages and regional dialects, including English, Mandarin, Japanese, Korean, Spanish, Portuguese, Indonesian, and Cantonese. At 10× faster inference than its predecessor, Seedance 1.5 Pro is viable for production workflows that require consistent talking-head or dialogue-scene generation at scale.

## Capabilities
### Millisecond-precision lip-sync

Character lip movements align precisely with generated audio at the millisecond level, across 8+ languages and regional dialects.

### 8+ language support

Native dialogue generation in English, Mandarin, Japanese, Korean, Spanish, Portuguese, Indonesian, Cantonese, and Sichuanese.

### 4.5B parameters

A 4.5-billion-parameter Dual-Branch Diffusion Transformer, capable of nuanced character expressions, complex scene compositions, and consistent identity.

### Up to 1080p resolution

Full HD output for production-ready talking-head videos, interviews, and dialogue-driven scenes.

### 10× faster inference

Runs 10× faster than the previous generation, making it practical for batch content creation and localized video production pipelines.

### Character consistency

Maintains subject appearance, expression nuance, and visual identity across scenes within a generation.
## Specifications
| Feature | Details |
|---|---|
| Developer | ByteDance |
| Parameters | 4.5 billion |
| Architecture | Dual-Branch Diffusion Transformer (DB-DiT) |
| Resolution | Up to 1080p |
| Duration | 4–12 seconds |
| Languages | English, Mandarin, Japanese, Korean, Spanish, Portuguese, Indonesian, Cantonese, Sichuanese |
| Lip-sync | Millisecond-precision |
| Audio | Native dialogue with lip-sync |
| Inference speed | 10× faster than predecessor |
## How to use

1. **Upload a reference image.** For talking-head or character dialogue scenes, upload a reference image of the character whose lips you want to animate.
2. **Write your prompt.** Describe the dialogue scene, specify the language if relevant, and include any visual context: setting, lighting, emotion.
3. **Set duration and resolution.** Choose your clip length (up to 12 seconds) and resolution (up to 1080p).
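Programmatically, these steps map onto a single generation request. The sketch below assumes a hypothetical REST-style endpoint and parameter names (`reference_image`, `prompt`, `duration_seconds`, `resolution`) — none of these identifiers come from this page, so check your provider’s actual API reference. Only the 4–12 second and 1080p limits are taken from the specifications above.

```python
import base64

# Hypothetical endpoint -- not an official ByteDance URL.
API_URL = "https://api.example.com/v1/seedance-1.5-pro/generate"

def build_generation_request(image_path: str, prompt: str,
                             duration_seconds: int = 8,
                             resolution: str = "1080p") -> dict:
    """Assemble a JSON-ready payload for a dialogue-scene generation.

    Clamps duration to the model's documented 4-12 second range and
    base64-encodes the reference image, a common pattern for JSON APIs.
    """
    duration_seconds = max(4, min(12, duration_seconds))
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": "seedance-1.5-pro",
        "reference_image": image_b64,   # front-facing image, mouth visible
        "prompt": prompt,
        "duration_seconds": duration_seconds,
        "resolution": resolution,       # up to "1080p"
    }
```

The payload could then be sent with something like `requests.post(API_URL, json=payload)`; again, the endpoint and field names here are placeholders, not a documented interface.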
## Prompting tips
- Specify the language explicitly — “A character speaking in formal Japanese” or “conversational Cantonese dialogue” helps the model produce accurate phoneme-to-mouth mapping.
- Describe emotional tone — “Excited,” “calm and measured,” “whispering urgently” all influence both the audio generation and facial expressions.
- Use a clear reference image — For best lip-sync accuracy, use a front-facing or slightly angled reference image where the character’s mouth is clearly visible.
- Keep dialogue clips concise — For maximum coherence, target 5–8 second clips per generation and stitch together longer sequences.
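The tips above are mechanical enough to encode in a small helper: explicit language, explicit emotional tone, visual context, and a concise clip length. This is only a sketch of one way to compose such prompts; the function name and prompt template are illustrative choices, not part of the model.

```python
def compose_dialogue_prompt(subject: str, language: str, tone: str,
                            setting: str, seconds: int = 6) -> str:
    """Build a dialogue prompt that follows the prompting tips:
    explicit language, emotional tone, visual context, short clip."""
    # The guard mirrors the "keep dialogue clips concise" tip above.
    if not 5 <= seconds <= 8:
        raise ValueError("target 5-8 second clips; stitch longer sequences")
    return f"{subject} speaks in {language}, {tone}. {setting}. {seconds} seconds."

print(compose_dialogue_prompt(
    "A news anchor addressing the camera",
    language="formal English",
    tone="calm and measured",
    setting="Well-lit studio background, professional broadcast style",
    seconds=8,
))
# -> A news anchor addressing the camera speaks in formal English,
#    calm and measured. Well-lit studio background, professional
#    broadcast style. 8 seconds.
```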
## Example prompts

> A news anchor speaks directly to camera in formal English. Well-lit studio background, professional broadcast style, neutral expression. 8 seconds, 1080p.

> A young woman laughs and responds excitedly in Mandarin during a casual conversation. Warm indoor lighting, natural expressions, slight camera movement. 6 seconds.
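The prompting tips recommend generating 5–8 second clips and stitching longer sequences together. One common way to join already-generated clips is ffmpeg’s concat demuxer; the sketch below only prepares the list file and command, and assumes the clips share codec, resolution, and frame rate.

```python
from pathlib import Path

def stitch_clips(clip_paths, output="scene.mp4"):
    """Write an ffmpeg concat list file and return the command that
    joins the clips without re-encoding (`-c copy`)."""
    list_file = Path("clips.txt")
    # The concat demuxer reads one "file '<path>'" line per clip.
    list_file.write_text("".join(f"file '{p}'\n" for p in clip_paths))
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(list_file), "-c", "copy", output]

cmd = stitch_clips(["anchor_part1.mp4", "anchor_part2.mp4"])
print(" ".join(cmd))
```

Run the returned command with `subprocess.run(cmd, check=True)` once the generated clips exist; because `-c copy` skips re-encoding, mismatched codecs or resolutions will need a re-encoding pass instead.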
## Compare models
| Model | Lip-sync | Languages | Resolution | Best for |
|---|---|---|---|---|
| Seedance 1.5 Pro | Millisecond precision | 8+ | 1080p | Multilingual dialogue, talking-head |
| Seedance 2 | Native | — | 720p | Multi-reference, full multimodal |
| Wan 2.5 | Yes | Limited | 1080p | Audio-synced general content |
| Kling 2.6 Pro | Yes | EN + Chinese | 1080p | EN/Chinese audio-synced production |

