Video model by Kling AI · Kling O series

# Kling O1

Kling O1 is Kling AI's unified video creation and editing model, billed as the world's first multimodal video model to unify generation and editing in a single system. It accepts text, images, keyframes, reference videos, and motion inputs, with up to 7 reference images and 6 camera cuts per generation.

* **Resolution:** Up to 1080p
* **Reference images:** Up to 7
* **Camera cuts:** Up to 6
* **Architecture:** MVL unified

## Unified creation and editing

Kling O1 is the first video model to unify generation and editing in a single system: you can create a new video from scratch and then edit specific sections, restyle footage, extend shots, or swap elements within the same model, without exporting to a separate editing tool. The Multi-modal Visual Language (MVL) architecture accepts six input types simultaneously: text, images, keyframes, reference videos, motion references, and video editing instructions. This makes Kling O1 well suited to production pipelines that need a single model to handle multiple stages.

## Capabilities

* The first model to handle both video creation and video editing in one system: generate footage and edit it within the same pipeline.
* Accepts text, images, keyframes, reference videos, motion references, and editing instructions as simultaneous inputs.
* Anchors character appearance, visual style, and scene composition with up to 7 reference images in a single generation.
* Generates up to 6 distinct shots per generation, giving structured multi-shot output from a single model invocation.
* Transforms the visual style of existing footage: apply new aesthetics, change the time of day, or retheme content while preserving the underlying motion.
* Extends existing shots seamlessly, continuing the motion and scene from the end of an existing clip.

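
The limits described above (up to 7 reference images and up to 6 shots per generation) can be checked locally before a job is submitted. The following is a minimal Python sketch of such a pre-flight check. The `GenerationRequest` class, its field names, and the `validate` function are hypothetical illustrations and are not part of any Kling AI or ImagineArt API; only the two numeric limits come from this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Limits stated on this page. Everything else in this sketch is illustrative.
MAX_REFERENCE_IMAGES = 7
MAX_SHOTS = 6


@dataclass
class GenerationRequest:
    """Hypothetical container for one Kling O1 generation (not a real API type)."""
    prompt: str = ""
    reference_images: List[str] = field(default_factory=list)
    keyframes: List[str] = field(default_factory=list)
    reference_video: Optional[str] = None
    motion_reference: Optional[str] = None
    edit_instruction: Optional[str] = None
    shots: List[str] = field(default_factory=list)


def validate(request: GenerationRequest) -> List[str]:
    """Return a list of problems; an empty list means the request fits the stated limits."""
    problems = []
    if len(request.reference_images) > MAX_REFERENCE_IMAGES:
        problems.append(
            f"{len(request.reference_images)} reference images given, "
            f"maximum is {MAX_REFERENCE_IMAGES}"
        )
    if len(request.shots) > MAX_SHOTS:
        problems.append(
            f"{len(request.shots)} shots given, maximum is {MAX_SHOTS}"
        )
    return problems


# Usage: eight reference images and seven shots both exceed the limits.
too_big = GenerationRequest(
    prompt="A detective walks into a rain-soaked alley at night.",
    reference_images=[f"ref_{i}.png" for i in range(8)],
    shots=[f"shot {i}" for i in range(7)],
)
for problem in validate(too_big):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.
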
## Input types supported

| Input                    | Use                                                        |
| ------------------------ | ---------------------------------------------------------- |
| **Text**                 | Scene description, style direction                         |
| **Images (up to 7)**     | Subject appearance, visual style, composition anchoring    |
| **Keyframes**            | Define start, middle, or end frames for transition control |
| **Reference videos**     | Motion and style reference from existing footage           |
| **Motion references**    | Camera trajectory and subject movement patterns            |
| **Editing instructions** | Targeted edits to specific elements in existing video      |

## Specifications

| Feature              | Details                                                   |
| -------------------- | --------------------------------------------------------- |
| **Developer**        | Kling AI (Kuaishou)                                       |
| **Architecture**     | Multi-modal Visual Language (MVL)                         |
| **Resolution**       | Up to 1080p                                               |
| **Duration**         | 5–10 seconds                                              |
| **Reference images** | Up to 7                                                   |
| **Camera cuts**      | Up to 6 per generation                                    |
| **Audio**            | No native audio                                           |
| **Input modes**      | 6 (text, image, keyframe, ref video, motion ref, editing) |

## How to use

1. Log into ImagineArt and go to the **AI Video Generator**.
2. Choose **Kling O1** from the model dropdown.
3. Select the combination of input types that fits your use case: text only, text + images, keyframes + motion reference, or video editing mode.
4. Upload up to 7 reference images, a reference video, or a motion reference as needed.
5. For multi-shot output, structure your prompt with explicit shot descriptions, up to 6 shots per generation.
6. Click **Generate**. Generation typically completes in 1–2 minutes for complex multi-input requests.

## Prompting tips

* **Describe edit targets precisely:** In editing mode, "Change the background from day to night while keeping the subject unchanged" gives more accurate results than "make it darker."
* **Use keyframes for transitions:** Define your start and end keyframes, and let Kling O1 fill in the motion between them consistently.
* **Combine input types:** "Based on this reference image \[image], in this visual style \[image 2], with this camera movement \[motion ref]..." The MVL architecture processes all inputs cohesively.

### Example prompts

> SHOT 1 (wide, 3s): A detective walks into a rain-soaked alley at night. SHOT 2 (close-up, 2s): Detective looks at a clue on the ground, raindrops visible. SHOT 3 (medium, 3s): Detective turns and exits the alley. Reference image for detective character appearance attached.

> Restyle the provided footage to a vintage 1970s Super 8 film look. Keep all motion and subjects identical; change only the visual aesthetic.

## Compare models

| Model                                           | Edit support            | Input types | Camera cuts | Audio | Best for                     |
| ----------------------------------------------- | ----------------------- | ----------- | ----------- | ----- | ---------------------------- |
| **Kling O1**                                    | Yes (unified)           | 6           | Up to 6     | No    | Create + edit workflows      |
| [Kling O3](/ai-models/video/kling-o3)           | Partial                 | 6           | Up to 6     | Yes   | Max capability + audio       |
| [Kling 3.0 Pro](/ai-models/video/kling-3-0-pro) | No                      | 2           | Up to 6     | Yes   | 4K cinematic, multi-shot     |
| [Pika 2.2](/ai-models/video/pika-2-2)           | Partial (swaps, scenes) | 2           | No          | No    | Creative effects + keyframes |

Among the models compared here, Kling O1 is the strongest choice when your workflow requires both creating new footage and editing or transforming existing video within the same pipeline. For maximum capability with audio, consider [Kling O3](/ai-models/video/kling-o3).
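
The multi-shot example prompt above follows a regular pattern: `SHOT n (framing, duration): description`. A small helper can assemble prompts in that shape and check them against the limits listed in the specifications. This is a sketch only: the `Shot` class and `build_multishot_prompt` function are hypothetical names, the layout mirrors the example prompt rather than a documented prompt grammar, and treating the 5–10 second duration as a bound on the summed shot lengths is an assumption.

```python
from dataclasses import dataclass
from typing import List

MAX_SHOTS = 6           # "Up to 6 per generation" on this page
MIN_TOTAL_SECONDS = 5   # assumption: the 5-10 s duration applies to the sum of shots
MAX_TOTAL_SECONDS = 10


@dataclass
class Shot:
    """One shot in a multi-shot prompt (hypothetical helper type)."""
    framing: str       # e.g. "wide", "close-up", "medium"
    seconds: int
    description: str


def build_multishot_prompt(shots: List[Shot], note: str = "") -> str:
    """Join shots into a single prompt string shaped like the example prompt."""
    if not 1 <= len(shots) <= MAX_SHOTS:
        raise ValueError(f"expected 1 to {MAX_SHOTS} shots, got {len(shots)}")
    total = sum(shot.seconds for shot in shots)
    if not MIN_TOTAL_SECONDS <= total <= MAX_TOTAL_SECONDS:
        raise ValueError(
            f"total duration {total}s is outside "
            f"{MIN_TOTAL_SECONDS}-{MAX_TOTAL_SECONDS}s"
        )
    parts = [
        f"SHOT {number} ({shot.framing}, {shot.seconds}s): {shot.description}"
        for number, shot in enumerate(shots, start=1)
    ]
    if note:
        parts.append(note)
    return " ".join(parts)


prompt = build_multishot_prompt(
    [
        Shot("wide", 3, "A detective walks into a rain-soaked alley at night."),
        Shot("close-up", 2, "Detective looks at a clue on the ground, raindrops visible."),
        Shot("medium", 3, "Detective turns and exits the alley."),
    ],
    note="Reference image for detective character appearance attached.",
)
print(prompt)
```

The resulting string can be pasted into the prompt field as-is. Keeping shots as structured data makes it easier to reorder or retime them without rewriting the prompt by hand.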