A standard Sora 2 variant is also available for rapid iteration and exploration. Sora 2 Pro delivers higher final quality, more stable rendering in complex scenes, and better adherence to nuanced prompts — use it for final production output.
OpenAI’s final-production video model
Sora 2 Pro is built on OpenAI’s Multimodal Diffusion Transformer (MM-DiT) architecture and generates 4–20 second videos at up to 1080p. Audio (dialogue, sound effects, ambient) is generated in a single pass alongside the video and synchronized at the frame level, with no post-production step. The Pro tier delivers higher quality than standard Sora 2 in the scenarios where it counts most: complex multi-element scenes with accurate physics, nuanced prompt instructions, and long-form narratives where rendering must stay stable across the full clip duration.
Capabilities
Integrated audio-video generation
Dialogue, sound effects, and ambient audio generated in a single pass — precisely synchronized with the visual output without post-editing.
Physics-aware motion
Understands gravity, collisions, and spatial relationships naturally — better object stability, realistic material behavior, and fewer visual artifacts in complex scenes.
4–20 seconds
Clips run from 4 to 20 seconds, long enough for narrative sequences, commercial spots, and multi-beat storytelling.
Strong prompt fidelity
Responds accurately to instructions for camera movements, emotional tone, lighting, pacing, and scene transitions — including nuanced multi-part instructions.
Text and image input
Accepts text prompts alone, an uploaded image as a starting frame, or a combination of both for greater control over visual consistency.
MM-DiT architecture
Multimodal Diffusion Transformer processes visual and audio branches with joint attention — coherent audio-visual output from a single generation pass.
Specifications
| Feature | Details |
|---|---|
| Developer | OpenAI |
| Architecture | Multimodal Diffusion Transformer (MM-DiT) |
| Resolution | 1080p |
| Duration | 4–20 seconds |
| Aspect ratios | 16:9 (1280×720), 9:16 (720×1280) |
| Audio | Dialogue, SFX, ambient (native) |
| Input modes | Text-to-video, image-to-video |
How to use
Provide your input
Write a text prompt, upload an image as a starting frame, or combine both. Include explicit audio cues in your prompt for synchronized sound.
Configure settings
Set the video duration (4–20 seconds) and aspect ratio based on your project needs.
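If you prepare settings programmatically before entering them, the limits above can be checked up front. The following is a minimal sketch, not an official client: `VideoSettings`, its field names, and the constants are hypothetical, and the allowed values are simply copied from the Specifications table.

```python
from dataclasses import dataclass

# Limits copied from the Specifications table.
# All names here are illustrative assumptions, not part of any official API.
MIN_DURATION_S = 4
MAX_DURATION_S = 20
ASPECT_RATIOS = {"16:9": (1280, 720), "9:16": (720, 1280)}


@dataclass(frozen=True)
class VideoSettings:
    duration_s: int
    aspect_ratio: str

    def __post_init__(self):
        # Reject durations outside the documented 4-20 second window.
        if not MIN_DURATION_S <= self.duration_s <= MAX_DURATION_S:
            raise ValueError(
                f"duration_s must be between {MIN_DURATION_S} and "
                f"{MAX_DURATION_S}, got {self.duration_s}"
            )
        # Reject aspect ratios that the table does not list.
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(
                f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}, "
                f"got {self.aspect_ratio!r}"
            )

    @property
    def frame_size(self):
        # (width, height) in pixels as listed for this aspect ratio.
        return ASPECT_RATIOS[self.aspect_ratio]
```

For example, `VideoSettings(10, "9:16").frame_size` gives the portrait frame size listed in the table, while a duration of 25 raises `ValueError` before any generation is attempted.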
Prompting tips
- Include explicit audio cues — “With the sound of rain on glass” or “soft jazz playing in the background” directly influences the audio generation alongside the visual.
- Use the full duration for narratives — Describe a beginning, middle, and resolution. Sora 2 Pro maintains rendering stability and character consistency across the full duration.
- Specify camera behavior precisely — “The camera slowly orbits around the subject” or “cut to a close-up on the hands” gives Sora 2 Pro clear direction for camera motion.
- Describe physics interactions explicitly — “A glass tips over and water spills across the table” or “leaves scatter in a gust of wind” benefit from the physics-aware rendering.
- For image-to-video — Make sure the reference image style matches the aesthetic in your text prompt to avoid visual inconsistency in the generation.
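The tips above can be treated as a checklist when assembling a prompt. The sketch below shows one way to do that, assuming you build prompts as plain strings; `build_prompt` and its parameter names are hypothetical helpers for illustration, and the model itself only receives the final text.

```python
def build_prompt(scene, camera="", physics="", audio="", duration_s=None):
    """Join the scene with optional camera, physics, and audio cues.

    Hypothetical helper: it only concatenates text and does not call
    any generation service.
    """
    parts = [scene.strip()]
    # Add each optional cue only when it was actually provided.
    for cue in (camera, physics, audio):
        if cue.strip():
            parts.append(cue.strip())
    # Mirror the example prompts, which end with the target duration.
    if duration_s is not None:
        parts.append(f"{duration_s} seconds")
    # Make sure every part ends with a period before joining.
    return " ".join(p if p.endswith(".") else p + "." for p in parts)


prompt = build_prompt(
    "A glass tips over and water spills across the table",
    camera="The camera slowly orbits around the subject",
    audio="With the sound of rain on glass",
    duration_s=10,
)
print(prompt)
```

Keeping camera, physics, and audio as separate arguments makes it easy to spot when a prompt is missing one of the cues the tips recommend.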
Example prompts
Wide shot: two figures stand in the foreground, gazing at a majestic waterfall cascading into a river below. The camera slowly pans left to reveal the full expanse of the waterfall, capturing the lush greenery and dramatic sky. The roar of the water fills the audio. 15 seconds.
A barista carefully prepares a latte, steaming the milk with practiced precision. Soft café ambient sounds, quiet chatter in the background. Close-up on the hands, slow rack focus to the finished drink. 10 seconds.
POV shot: a mountain biker navigates a muddy trail in a dense forest during a rainstorm. The camera tracks forward, capturing mud splashes and rain. The sound of the storm and bike tires on wet ground. 20 seconds.
Compare models
| Model | Duration | Audio | Physics | Best for |
|---|---|---|---|---|
| Sora 2 Pro | Up to 20s | Yes | Yes | Final production, long-form, physics-accurate |
| Sora 2 | Up to 25s | Yes | Yes | Rapid iteration, exploration |
| Google Veo 3.1 | Up to 60s | Yes | — | Longest clips, broadcast quality |
| Kling 3.0 Pro | Up to 15s | Yes | — | 4K, multilingual audio, multi-shot |
| Seedance 2 | Up to 15s | Yes | — | Max references, multimodal |

