If you have tried “one sentence, one video” and keep getting drifting characters, weak motion, slideshow-like shots, or chaos when you stack several assets, multimodal models with explicit references such as Seedance 2.0 work differently: they let you split the job into asset roles plus language control. This guide walks through capabilities, limits, modes, writing style, and iteration. Quotas, entry points, and UI are always subject to the official Jimeng (ByteDance) product; the numbers below are common community and product references, so confirm them inside the app before you plan a production.

What Seedance 2.0 is, and who it is for

Seedance 2.0 is ByteDance Seed’s next-generation AI video generation stack, aimed at coordinating visual and audio expression in one generation pass (whether output is always “native AV in one shot” depends on the current product build). Unlike early text-only tools, it expects you to bring text, images, reference clips, and reference audio, and to say in natural language what each asset should do: who defines the face, who defines motion, who defines camera rhythm, and who defines music mood.

It fits short ads, story beats, character pieces, product showcases, dance or motion copy, and any brief that needs a stable identity. It rewards people who treat the prompt like a shooting plan plus storyboard notes. If you want pure randomness and never think about lenses or assets, it still works, but you will underuse exactly what makes it controllable.

Mental model: from “generate” to “direct”

Split your work into three layers (sketched as data after the list):

  1. Assets: images, reference videos, reference tracks, keywords.
  2. Roles: what each asset does in the final clip (first frame, identity lock, motion ref, camera ref, mood/rhythm ref, etc.).
  3. Language: prompt that states time, space, action, camera, light, style, and what must not happen.
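
A minimal way to keep the layers separate is to write them down as data before writing a single prompt word. The sketch below assumes nothing about the product itself; every class, field, and handle name is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """A layer-1 item together with its layer-2 role."""
    handle: str  # how you will @-mention it, e.g. "@image1"
    kind: str    # "image" | "video" | "audio"
    role: str    # the one job it owns, e.g. "identity lock: face and wardrobe"

@dataclass
class ShotPlan:
    """Layer 3: the language that ties the assets together."""
    assets: list[Asset] = field(default_factory=list)
    action: str = ""   # observable body beats
    camera: str = ""   # explicit camera verbs
    style: str = ""    # light, color, materials, era
    constraints: list[str] = field(default_factory=list)  # what must NOT happen
```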

Seedance 2.0’s main lever for layer 2 is the @ mention: name an asset in the sentence and state its job. That removes a whole class of guessing (“which image did they mean?”).

Input limits and planning (always verify in-product)

Typical reference ceilings from docs and tutorials, useful for scheduling but not permanent law (a pre-flight sketch follows the list):

  • Clip length: often around 15 seconds max (varies by plan/mode).
  • Images: on the order of 9 (identity, scene, first frame, detail locks).
  • Reference videos: about 3 clips; total ref duration often tied to the same ballpark as clip length (e.g. ~15s combined).
  • Audio: about 3 MP3s; duration capped similarly.
  • Total referenced files: some docs cite ~12 items.
  • Resolution: commonly 1080p or higher, depending on account/region.
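
If you batch uploads, a small pre-flight check can catch an over-stuffed plan before submission fails. A hedged sketch: the numbers mirror the community figures above and must be swapped for whatever limits your plan actually shows in-app:

```python
# Hypothetical ceilings mirroring the community figures above;
# replace with the limits your plan actually shows in-app.
LIMITS = {"image": 9, "video": 3, "audio": 3, "total": 12}

def preflight(files: list[tuple[str, str]]) -> list[str]:
    """files is a list of (filename, kind) pairs; returns warnings."""
    warnings = []
    counts: dict[str, int] = {}
    for _, kind in files:
        counts[kind] = counts.get(kind, 0) + 1
    for kind, cap in LIMITS.items():
        if kind != "total" and counts.get(kind, 0) > cap:
            warnings.append(f"{counts[kind]} {kind} refs exceed the cap of {cap}")
    if len(files) > LIMITS["total"]:
        warnings.append(f"{len(files)} total refs exceed the cap of {LIMITS['total']}")
    return warnings
```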

Planning tip: don’t fill every slot on day one. The steadier path is one hero image plus one clear prompt until subject and light read well; then a reference video for motion or camera; then audio for the emotional lock. Each new asset type deserves an extra sentence about who owns what, or you risk style clashes, unclear motion sources, and vague camera intent.

Two pipelines: light to heavy

1. First/last frame / image-to-video

Fastest path: upload a key frame or first/last pair, then describe who does what, where, and how the camera moves. Good for animating stills, a single-shot performance beat, or learning the model’s taste.

Keep subject obvious, actions observable (body motion, not piles of abstract adjectives), camera verbs explicit (push in, pull out, pan, track, follow, high angle, orbit, etc.), and light/material anchors (backlight, soft overcast, neon bounce, film grain).
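
As one invented example that applies all four rules in a single prompt:

```python
# Invented image-to-video prompt, one clause per rule above:
prompt = (
    "A street violinist in a worn denim jacket "            # obvious subject
    "lifts the bow, closes her eyes, and begins to play; "  # observable action
    "slow push-in from medium shot to close-up; "           # explicit camera verb
    "warm backlight, soft overcast fill, light film grain"  # light/material anchors
)
```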

2. Full reference / multimodal

When you need one character to stay the same across shots or to transplant choreography or camera grammar from a reference into a new world, use multi-asset + @. Think of it as a minimal call sheet: “Image A owns face and wardrobe,” “Clip B owns motion path,” “Track C owns rhythm and mood.”

The hard part is not clicking buttons; it is consistency. Does the reference motion conflict with the written scene? Does the audio mood match the story? With several people, are interactions and their order explicit?

@ syntax: put roles in the sentence

Where supported, typing @ opens the list of uploaded assets; many builds also expose an @ toolbar. Guidelines:

  • Name + role: not just @image1, but “@image1 as first-frame composition and look reference.”
  • One main job per asset: avoid contradictory asks on the same file (e.g. strict identity lock + “be a totally different performer”).
  • Align with the main prompt: if the text says “indoor,” don’t force-bind a purely outdoor ref clip without reconciling it.

Example patterns (adjust indices to your uploads; an assembly sketch follows the list):

  • “@image1 as first frame; character turns to camera and smiles; slow push-in; warm side light; cinematic.”
  • “@image1 is the only lead; motion and timing follow @video1; overall style: urban night.”
  • “Picture mood tracks @audio1; do not change hairstyle or outfit from @image1.”
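
When variants pile up, assembling the prompt from an explicit handle-to-role table guarantees that no @ mention ships without a job. A hypothetical helper; the handles, roles, and scene text are all invented:

```python
import re

def build_prompt(scene: str, roles: dict[str, str]) -> str:
    """Prepend one role sentence per @ handle, then append the scene.

    Raises if the scene mentions a handle with no assigned role,
    which is exactly the "@ without roles" failure mode.
    """
    mentioned = set(re.findall(r"@\w+", scene))
    unassigned = mentioned - roles.keys()
    if unassigned:
        raise ValueError(f"@ mentions without a role: {sorted(unassigned)}")
    role_sentences = [f"{handle} {job}." for handle, job in roles.items()]
    return " ".join(role_sentences + [scene])

print(build_prompt(
    "Character from @image1 dances through a night market; "
    "motion and timing follow @video1.",
    {"@image1": "is the only lead; lock face and wardrobe",
     "@video1": "owns motion path and timing"},
))
```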

Prompts: formula and structured template

Base formula

What keeps working in practice:

Subject (who) + Action (what they do) + Camera (how we see it) + Style & environment (light, color, materials, era)

Treat it like a news line plus a DP note. Prefer visual adjectives: instead of “beautiful,” say “soft foreground bokeh, rim light, shallow DOF, natural skin.”

S-A-C-S-C (keeps long prompts on rails; a builder sketch follows the list)

  • S — Subject: who is on screen; lock look with @ if you have a key visual.
  • A — Action: concrete body beats and tempo (walk, turn, raise hand, eye contact, named dance steps).
  • C — Camera: height, movement, speed (slow / fast).
  • S — Style: set, light, materials, genre read (ad, doc, stylized realism).
  • C — Constraint: negatives—no wardrobe swap, no face swap, no jump cuts, no extra characters, etc.
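
One way to keep all five slots honest is to fill them as named fields and only then flatten them into a single line. A sketch, not a product API; the slot contents are invented:

```python
def sacsc(subject: str, action: str, camera: str,
          style: str, constraints: list[str]) -> str:
    """Flatten the five S-A-C-S-C slots into one prompt line."""
    for name, value in [("subject", subject), ("action", action),
                        ("camera", camera), ("style", style)]:
        if not value.strip():
            raise ValueError(f"S-A-C-S-C slot left empty: {name}")
    parts = [subject, action, camera, style]
    if constraints:  # the constraint slot becomes explicit "no ..." clauses
        parts.append("; ".join(f"no {c}" for c in constraints))
    return "; ".join(parts)

print(sacsc(
    subject="the dancer from @image1",
    action="three slow spins, then holds eye contact with camera",
    camera="low angle, slow half-orbit",
    style="neon-lit rooftop, wet concrete, stylized realism",
    constraints=["wardrobe swap", "extra characters", "jump cuts"],
))
```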

Shot vocabulary cheat sheet

  • Size: ELS, LS, MS, MCU, CU.
  • Height: eye level, high angle, low angle, OTS.
  • Move: static, pan, tilt, dolly, track, handheld feel, orbit, crane.
  • Special reads: dolly zoom, long-take feel, “minimize cutting” (state in constraints if you want fewer cuts).

Iteration: change one thing, short before long

  1. One variable per pass: keep the same assets and try camera, then light, then motion intensity; rewriting everything hides cause and effect (a logging sketch follows this list).
  2. Short first: 4–6 seconds to stabilize identity + main action, then stretch duration or complexity.
  3. Bucket failures: random noise vs prompt conflict vs asset conflict vs wrong mode—fix the bucket, not the whole prompt blindly.
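
Rule 1 is easier to honor when every pass records exactly what changed. A minimal, tool-agnostic logging sketch; prompts and verdicts are placeholders:

```python
from dataclasses import dataclass

@dataclass
class Pass:
    prompt: str   # full prompt text submitted this pass
    changed: str  # the ONE variable touched versus the previous pass
    verdict: str  # what the output actually showed

history = [
    Pass("baseline prompt ...", "baseline", "face stable, motion weak"),
    Pass("baseline prompt ... faster tempo", "motion intensity",
         "motion reads, light unchanged"),
]
# If a pass changes two things at once, the verdict can no longer be attributed.
```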

Common mistakes (quick audit; a lint sketch follows)

  • Contradictory style: “ultra minimal” + “maximal ornament,” “neon cyberpunk” + “muted vérité” in one breath.
  • Missing camera: model guesses “wobble or stand still”; reads like a collage, not a shoot.
  • @ without roles: assets stack; priority collapses.
  • Non-executable action: “she feels hope for life” vs “she inhales, slight smile, looks out the window.”
  • Vague multi-person blocking: no eyelines, positions, or sequence → intersections and sticky bodies.
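
Two of these mistakes are mechanical enough to catch before submitting. A heuristic lint, purely illustrative and deliberately crude:

```python
import re

CAMERA_VERBS = ("push in", "pull out", "pan", "tilt", "dolly",
                "track", "orbit", "crane", "static", "handheld")

def lint_prompt(prompt: str) -> list[str]:
    """Flag 'missing camera' and '@ without roles' heuristically."""
    issues = []
    if not any(verb in prompt.lower() for verb in CAMERA_VERBS):
        issues.append("no camera verb: the model will guess the move")
    for handle in set(re.findall(r"@\w+", prompt)):
        # crude: an @ handle followed directly by punctuation has no stated job
        if re.search(re.escape(handle) + r"\s*[,.;]", prompt):
            issues.append(f"{handle} is mentioned without a stated role")
    return issues

print(lint_prompt("Use @image1, make it beautiful."))  # both checks fire
```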

Scenario notes

Identity lock: key art + @; constraints say “same person”; avoid ultra-strong style refs that steal facial cues.

Two+ performers: spell out who does what to whom, eyelines, order, and spatial relation (left/right, depth, distance).
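
An invented example of blocking spelled out to that level, written as one prompt clause:

```python
# Invented two-performer blocking: positions, order, eyelines, camera
blocking = (
    "A stands frame-left in the foreground; B is frame-right, two steps deeper. "
    "A turns to B first; B meets her eyes, steps forward, and offers the box. "
    "Camera tracks right, holding both in a medium two-shot."
)
```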

Motion copy (dance, martial arts, gestures): ref video + @ as motion source; add “follow ref timing and key poses” + “do not change character look.”

Music and mood: align audio ref with keywords (tense / calm / epic); finer AV controls depend on the product UI.

Extend / continue: when the previous ending is clear, describe bridging action and continuous camera; whether true “extend mode” exists depends on version.

Local edits (hair or background only): state keep vs change explicitly; expect several passes; hard local control may still hit model limits.

E-commerce / product: hero packshot for logo and materials; constrain legible type and no warped trademark; avoid whip zooms that kill readability.

Closing

Seedance 2.0 is not here to replace your NLE; it helps you turn “what I want to see” into stable role assignments: @ mentions for who owns what among the assets, structured prompts for motion, lensing, and guardrails. Treat the caps like a production budget and iteration like a lighting rehearsal, and you move faster from gacha-style luck to a repeatable short-form workflow. Before shipping, re-check the latest rules and entry points on Jimeng or official channels, then scale assets and complexity to the brief.