Unified creation and editing
Kling O1 is the first video model to unify generation and editing in a single system: you can create a new video from scratch and then edit specific sections, restyle footage, extend shots, or swap elements within the same model, without exporting to a separate editing tool. The Multi-modal Visual Language (MVL) architecture accepts six input types simultaneously: text, images, keyframes, reference videos, motion references, and video editing instructions. This makes Kling O1 well suited to production pipelines that need a single model to handle multiple stages.
Capabilities
Unified generation and editing
The first model to handle both video creation and video editing in one system — generate footage and edit it within the same generation pipeline.
6 input types
Accepts text, images, keyframes, reference videos, motion references, and editing instructions as simultaneous inputs.
Up to 7 reference images
Anchor character appearance, visual style, and scene composition with up to 7 reference images in a single generation.
Up to 6 camera cuts
Generates up to 6 distinct shots per generation — structured multi-shot output from a single model invocation.
Video restyling
Transform the visual style of existing footage — apply new aesthetics, change time of day, or retheme content while preserving the underlying motion.
Shot extension
Extend existing shots seamlessly — continue the motion and scene from the end of an existing clip.
Input types supported
| Input | Use |
|---|---|
| Text | Scene description, style direction |
| Images (up to 7) | Subject appearance, visual style, composition anchoring |
| Keyframes | Define start, middle, or end frames for transition control |
| Reference videos | Motion and style reference from existing footage |
| Motion references | Camera trajectory and subject movement patterns |
| Editing instructions | Targeted edits to specific elements in existing video |
Specifications
| Feature | Details |
|---|---|
| Developer | Kling AI (Kuaishou) |
| Architecture | Multi-modal Visual Language (MVL) |
| Resolution | Up to 1080p |
| Duration | 5–10 seconds |
| Reference images | Up to 7 |
| Camera cuts | Up to 6 per generation |
| Audio | No native audio |
| Input modes | 6 (text, image, keyframe, ref video, motion ref, editing) |
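
The limits in the table above can be checked locally before a job is submitted. The following is a minimal sketch in Python, not an official client: the `GenerationRequest` class, its field names, and `validate_request` are hypothetical and exist only for illustration. Only the numeric limits come from the specifications table.

```python
from dataclasses import dataclass, field

# Limits copied from the specifications table.
MAX_REFERENCE_IMAGES = 7
MAX_CAMERA_CUTS = 6
MIN_DURATION_S = 5
MAX_DURATION_S = 10


@dataclass
class GenerationRequest:
    """Hypothetical container for one generation; not a real Kling API type."""
    prompt: str
    duration_s: int = 5
    reference_images: list[str] = field(default_factory=list)
    camera_cuts: int = 1


def validate_request(req: GenerationRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if not MIN_DURATION_S <= req.duration_s <= MAX_DURATION_S:
        problems.append(
            f"duration must be {MIN_DURATION_S}-{MAX_DURATION_S} seconds"
        )
    if len(req.reference_images) > MAX_REFERENCE_IMAGES:
        problems.append(
            f"at most {MAX_REFERENCE_IMAGES} reference images are allowed"
        )
    if not 1 <= req.camera_cuts <= MAX_CAMERA_CUTS:
        problems.append(f"camera cuts must be between 1 and {MAX_CAMERA_CUTS}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every limit that was exceeded in a single pass.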
How to use
Choose your input combination
Select the combination of input types that fits your use case — text only, text + images, keyframes + motion reference, or video editing mode.
Upload references
Upload up to 7 reference images, a reference video, or a motion reference as needed.
Describe your multi-shot structure
For multi-shot output, structure your prompt with explicit shot descriptions — up to 6 shots per generation.
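
One way to keep shot descriptions explicit is to assemble the prompt from structured data. The sketch below is a hypothetical helper, not part of any Kling tooling: `Shot` and `build_multishot_prompt` are illustrative names, and the `SHOT n (framing, Ns): ...` layout simply mirrors the example prompt on this page. It assumes the 5–10 second duration applies to the whole generation rather than to each shot, which the page does not state explicitly.

```python
from dataclasses import dataclass

MAX_SHOTS = 6      # "up to 6 shots per generation"
MIN_TOTAL_S = 5    # assumption: the duration range covers the whole clip
MAX_TOTAL_S = 10


@dataclass
class Shot:
    framing: str       # e.g. "wide", "close-up", "medium"
    seconds: int       # requested length of this shot
    description: str   # what happens in the shot


def build_multishot_prompt(shots: list[Shot], note: str = "") -> str:
    """Format shots as 'SHOT n (framing, Ns): description' and join them."""
    if not 1 <= len(shots) <= MAX_SHOTS:
        raise ValueError(f"expected 1 to {MAX_SHOTS} shots, got {len(shots)}")
    total = sum(shot.seconds for shot in shots)
    if not MIN_TOTAL_S <= total <= MAX_TOTAL_S:
        raise ValueError(
            f"total duration {total}s is outside {MIN_TOTAL_S}-{MAX_TOTAL_S}s"
        )
    parts = [
        f"SHOT {i} ({shot.framing}, {shot.seconds}s): {shot.description}"
        for i, shot in enumerate(shots, start=1)
    ]
    if note:
        parts.append(note)  # e.g. a sentence about attached reference images
    return " ".join(parts)
```

Calling `build_multishot_prompt` with three shots of 3, 2, and 3 seconds yields a single prompt string in the same shape as the detective example; a seventh shot or a total outside the assumed range raises `ValueError` before anything is sent to the model.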
Prompting tips
- Describe edit targets precisely: in editing mode, “Change the background from day to night while keeping the subject unchanged” gives the model a clearer target than “make it darker.”
- Use keyframes for transitions: define your start and end keyframes and let Kling O1 fill in the motion between them consistently.
- Combine input types: “Based on this reference image [image], in this visual style [image 2], with this camera movement [motion ref]…” The MVL architecture processes all of the inputs together.
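
The first tip can be turned into a small template that forces every edit instruction to name both what changes and what must stay the same. This is an illustrative sketch only; `build_edit_instruction` is a hypothetical helper and the sentence pattern is taken from the example in the tip, not from any documented Kling syntax.

```python
def build_edit_instruction(target: str, old: str, new: str, keep: list[str]) -> str:
    """Compose an edit instruction that states the change and what to preserve."""
    if not keep:
        # A precise edit names at least one element that must not change.
        raise ValueError("name at least one element to preserve")
    kept = " and ".join(keep)
    return f"Change {target} from {old} to {new} while keeping {kept} unchanged."
```

For example, `build_edit_instruction("the background", "day", "night", ["the subject"])` reproduces the instruction quoted in the first tip.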
Example prompts
SHOT 1 (wide, 3s): A detective walks into a rain-soaked alley at night. SHOT 2 (close-up, 2s): Detective looks at a clue on the ground, rain drops visible. SHOT 3 (medium, 3s): Detective turns and exits the alley. Reference image for detective character appearance attached.
Restyle the provided footage to a vintage 1970s Super 8 film look. Keep all motion and subjects identical; change only the visual aesthetic.
Compare models
| Model | Edit support | Input types | Camera cuts | Audio | Best for |
|---|---|---|---|---|---|
| Kling O1 | Yes (unified) | 6 | Up to 6 | No | Create + edit workflows |
| Kling O3 | Partial | 6 | Up to 6 | Yes | Max capability + audio |
| Kling 3.0 Pro | No | 2 | Up to 6 | Yes | 4K cinematic, multi-shot |
| Pika 2.2 | Partial (swaps, scenes) | 2 | No | No | Creative effects + keyframes |

