Wan 2.1 is an advanced large AI video model developed and open-sourced by Alibaba Group. The model was trained on a massive dataset and comes in 14B (14 billion parameters) and 1.3B (1.3 billion parameters) versions. It not only outperforms existing open-source models, but its 1.3B version can also run with only 8GB of VRAM. This powerful model is also split into separate text-to-video and image-to-video versions, each excelling at its own generation method. This guide highlights the capabilities of the Wan model and includes usage tips to help users.
Versions of Wan 2.1 Video Model
Wan 2.1 has 4 different versions that excel at different generation methods and resolutions, with different system requirements.
Wan 2.1 T2V 14B: This version excels at text-to-video generation and supports both 480P and 720P resolutions.
Wan 2.1 T2V 1.3B: This is a reduced-requirement version (fewer parameters) of the standard Wan 2.1 text-to-video model. It supports 480P and can run with 8GB of VRAM.
Wan 2.1 I2V 14B 720P: This is the image-to-video version of the model and supports 720P resolution.
Wan 2.1 I2V 14B 480P: This version is the same kind of image-to-video model but supports 480P instead of 720P.
Note 1: There is no 1.3B model for I2V generation.
Note 2: The 1.3B version is capable of generating 720P videos, but its results are not stable at that resolution. It's recommended to use this model at 480P.
Capabilities of Wan 2.1 Video Model
- Generates video at 16 FPS, which is the best choice for the Wan 2.1 model. There are tools that can interpolate the output into 32 FPS video.
- The model is better with short videos. It's recommended to use 5 seconds per clip. Longer clips can cause deformations, and the model starts to lose consistency. If you need longer videos, take the last frame of a clip, use it as the first frame of the next clip (img2vid), and join the clips in a video editor.
- Capable of creating 480P and 720P resolution videos.
- It can create high-quality photorealistic video renders with minimal deformation.
- It's also capable of creating videos in different styles such as anime, cartoon, etc.
- Despite being open-source, the model can compete with most paid video generation models.
- One of the most stable and smooth video models.
- The model can render text inside videos.
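The clip-chaining advice above comes down to simple arithmetic. The following is a minimal sketch, assuming 16 FPS and 5-second clips as recommended in this guide. The helper name `plan_clips` is hypothetical and is not part of any Wan or SeaArt API, and real tools may round frame counts differently. It only plans the clips; extracting the last frame and joining the clips is still done in your own tools.

```python
import math

FPS = 16           # frame rate recommended in this guide
CLIP_SECONDS = 5   # clip length recommended in this guide


def plan_clips(total_seconds, clip_seconds=CLIP_SECONDS, fps=FPS):
    """Split a target duration into short clips (hypothetical helper).

    The first clip is generated normally; every later clip is an
    image-to-video run seeded with the last frame of the previous clip.
    """
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    count = math.ceil(total_seconds / clip_seconds)
    plan = []
    for i in range(count):
        # The final clip may be shorter than the others.
        seconds = min(clip_seconds, total_seconds - i * clip_seconds)
        plan.append({
            "clip": i + 1,
            "mode": "first clip" if i == 0 else "img2vid from previous last frame",
            "seconds": seconds,
            "frames": seconds * fps,
        })
    return plan


if __name__ == "__main__":
    for clip in plan_clips(12):
        print(clip)
```

For a 12-second target this plans three clips of 5, 5 and 2 seconds, which keeps every generation inside the short range where the model stays consistent.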
Recommended Settings
Steps: 20-30 is usually enough, but this can be increased up to 50. More than 50 steps makes generation extremely slow and consumes many more credits, while the quality gain is very small.
Guidance Scale: 5-7 (6 is recommended as a starting point)
- If the video changes suddenly from frame to frame, CFG is too high. Try reducing it.
- If the video is off-topic (not following the prompt) and blurry, CFG is too low. Try increasing it.
FPS: 16
Resolution: Use a correct resolution with a known aspect ratio for the model you selected.
- 480P (640x480)
- 720P (1280x720)
Shift: 5-6 (5 is recommended)
Scheduler: UniPC
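The settings above can be collected into a small sanity check before a generation is queued. This is a minimal sketch: the dictionary keys and the `check_settings` function are hypothetical names chosen for illustration and do not correspond to any SeaArt or ComfyUI node parameters. Accepting portrait orientation (for example 720x1280) is also an assumption, based on the portrait example later in this guide.

```python
# Ranges taken from the recommendations in this guide.
RECOMMENDED_RANGES = {
    "steps": (20, 50),          # 20-30 is usually enough; above 50 gains little
    "guidance_scale": (5, 7),   # 6 is the suggested starting point
    "shift": (5, 6),            # 5 is the suggested value
}
KNOWN_RESOLUTIONS = {(640, 480), (1280, 720)}  # 480P and 720P
RECOMMENDED_FPS = 16


def check_settings(settings):
    """Return a list of warnings for values outside the guide's ranges."""
    warnings = []
    for key, (low, high) in RECOMMENDED_RANGES.items():
        value = settings[key]
        if not low <= value <= high:
            warnings.append(f"{key}={value} is outside the recommended {low}-{high}")
    if settings["fps"] != RECOMMENDED_FPS:
        warnings.append(f"fps={settings['fps']} differs from the recommended {RECOMMENDED_FPS}")
    size = (settings["width"], settings["height"])
    # Assumption: a portrait version of a known resolution is also acceptable.
    if size not in KNOWN_RESOLUTIONS and size[::-1] not in KNOWN_RESOLUTIONS:
        warnings.append(f"{size[0]}x{size[1]} is not a known 480P/720P resolution")
    return warnings


if __name__ == "__main__":
    starting_point = {"steps": 25, "guidance_scale": 6, "shift": 5,
                      "fps": 16, "width": 640, "height": 480}
    print(check_settings(starting_point))  # no warnings expected
```

The scheduler is left out of the check because the guide names a single choice (UniPC) rather than a range.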
Usage Tips For Wan 2.1
- The model supports Chinese and English prompts and uses its own optimized T5 encoder. That means descriptive natural-language prompts are the best way to prompt this model. Prompting is similar to Flux and SD3.5 models.
- You can get help from DeepSeek or ChatGPT to create prompts. Just ask "Can you create a {subject, composition} prompt for an AI video model that has a T5 encoder?"
- Prompt order should be "Subject + Scene + Action" for best results in most cases, but there are no hard restrictions since prompting is natural-language based.
- The model is good at following negative prompts. This is the recommended negative prompt for general use:
Base Negative Prompt Chinese Version:
`色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走`
Base Negative Prompt English Version:
`Vivid tones, overexposure, static, blurred details, subtitles, style, artwork, paintings, pictures, still, overall gray, worst quality, low quality, JPEG compression residue, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, deformed limbs, fused fingers, motionless frame, cluttered background, three legs, a lot of people in the background, walking backwards`
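The "Subject + Scene + Action" order and the advice to start from the base negative prompt can be sketched as two small string helpers. This is only an illustration: `build_prompt` and `extend_negative` are hypothetical names, the short negative string in the demo is a shortened stand-in for the full base negative prompt above, and the splitting assumes plain ASCII commas.

```python
def build_prompt(subject, scene, action, style=""):
    """Join the parts in Subject + Scene + Action order, skipping empty ones."""
    parts = [subject, scene, action, style]
    return " ".join(p.strip() for p in parts if p.strip())


def extend_negative(base_negative, extra_terms):
    """Append extra terms to a comma-separated negative prompt without duplicates."""
    terms = [t.strip() for t in base_negative.split(",") if t.strip()]
    seen = {t.lower() for t in terms}
    for term in extra_terms:
        term = term.strip()
        if term and term.lower() not in seen:
            terms.append(term)
            seen.add(term.lower())
    return ", ".join(terms)


if __name__ == "__main__":
    prompt = build_prompt(
        "A large pirate ship.",
        "Stormy ocean with dark clouds.",
        "Waves crash against the hull.",
        "Somber atmosphere, tracking shot.",
    )
    print(prompt)
    # Shortened stand-in for the base negative prompt, for illustration only.
    print(extend_negative("overexposure, static, blurred details", ["static", "watermark"]))
```

Keeping the base negative intact and only appending to it makes it easy to compare runs where a single extra term was added.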
Recommendation for Image to Video
If an image does not have the correct resolution and aspect ratio, it will be cropped to fit the selected resolution. Unfortunately, these models do not work well with arbitrary resolutions, so images should be cropped for the model. Selecting a resolution based on the input image is not a wise decision unless it is one of the recommended resolutions.
- Cropping images manually to get the correct resolution and aspect ratio is the best choice for image-to-video work.
- Prompting is optional but can change results significantly.
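The manual cropping step above can be worked out ahead of time. The following is a minimal sketch, assuming a simple centered crop is acceptable for your image; `center_crop_box` is a hypothetical helper, not part of any SeaArt tool. It returns a (left, top, right, bottom) box that you could apply in any image editor, or pass to Pillow's `Image.crop` if you use that library, before resizing to the target resolution.

```python
def center_crop_box(width, height, target_w, target_h):
    """Return (left, top, right, bottom) of the largest centered crop
    that has the same aspect ratio as target_w x target_h."""
    # Compare aspect ratios by cross-multiplication to stay in integers.
    if width * target_h > height * target_w:
        # Image is wider than the target: keep full height, trim the sides.
        crop_h = height
        crop_w = height * target_w // target_h
    else:
        # Image is taller than (or equal to) the target: keep full width.
        crop_w = width
        crop_h = width * target_h // target_w
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


if __name__ == "__main__":
    # A 2728x4096 portrait image cropped for a 720x1280 (9:16) video.
    print(center_crop_box(2728, 4096, 720, 1280))
```

For the 2728x4096 image the crop keeps the full height and a 2304-pixel-wide band in the middle, which then scales down cleanly to 720x1280. A centered crop is only a default; if the subject is off-center, shift the box by hand.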
Time Limitation Warning
SeaArt AI tools and workflows are based on ComfyUI, like all the other services. However, online services apply a time limit to each job so that the server does not crash. The limits are:
10 Minutes for Free Users
30 Minutes for VIP Users (the Free Generation option still has a 10-minute time limit)
A standard 5-second, 16 FPS, 720P Wan 2.1 video takes at least 15 minutes to generate on a 24GB RTX 4090. Selecting higher settings therefore not only risks crashes and deformations, but can also end with a "Canceled by System" message. This is not an error; it is intended behavior that keeps the server workload manageable.
Overall, it's better to use 480P resolutions, both for faster generation and to avoid time-limit cancellations.
Prompt Examples And Detailed Comparison
Pirate Ship
https://www.seaart.ai/artWorkDetail/cvbopd5e878c739ngu20
Positive Prompt
`A large pirate ship moving on the sea, ocean is wavy and has deep blue color that makes impossible to see what is under the water. Waves crashing to ship and weather is stormy, scene is slowly shaking while thunders hit to ocean, sky is filled with dark clouds and it's hard to see horizon. Somber atmosphere with volumetric lighting, tracking shot. `
- Generated with Official SeaArt Wan 2.1 text-to-video tool.
- Cost 200 credits and generated within 5 minutes.
- Video is 720P, resolution was 1280x720.
- There are no advanced settings and no negative prompt option. Only positive prompt and resolution.
- Prompt was written by the user (me).
Fantasy Scene
https://www.seaart.ai/artWorkDetail/cvbopqle878c73cf315g
Positive Prompt
`Create a vibrant fantasy adventure scene set in a mystical forest. The video should show lush, enchanted foliage with shimmering lights, magical creatures roaming, and a winding, ancient stone pathway leading to a hidden ruin. Emphasize cinematic depth, dynamic lighting, and rich, saturated colors.`
- Generated with Official SeaArt Wan 2.1 text-to-video tool.
- Cost 200 credits and generated within 5 minutes.
- Video is 720P, resolution was 1280x720.
- There are no advanced settings and no negative prompt option. Only positive prompt and resolution.
- Prompt was written by an LLM (ChatGPT).
Fantasy Scene (With Negative Recommended By LLM)
https://www.seaart.ai/artWorkDetail/cvbos4te878c7386ring
Positive Prompt
`Create a vibrant fantasy adventure scene set in a mystical forest. The video should show lush, enchanted foliage with shimmering lights, magical creatures roaming, and a winding, ancient stone pathway leading to a hidden ruin. Emphasize cinematic depth, dynamic lighting, and rich, saturated colors.`
Negative Prompt
`Exclude cartoonish styles, avoid exaggerated or unrealistic physics, and do not use overly dark or gloomy tones.`
- Generated with ComfyUI workflow.
- Cost 535 credits.
- Video is 480P, resolution was 640x480.
- A 480P, 16 FPS, 5-second, 30-step generation exceeded the 30-minute time limit and was canceled.
- A 480P, 16 FPS, 5-second, 20-step generation finished after 25 minutes.
- Both the positive and negative prompts were written by an LLM (ChatGPT).
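The two ComfyUI runs above allow a rough back-of-the-envelope check against the time limits. This is a sketch under a strong assumption: generation time is treated as scaling linearly with step count, based only on the single observed run of 20 steps in 25 minutes. Fixed overhead such as model loading and server queueing is ignored, so treat the numbers as rough guidance. The function names are hypothetical.

```python
FREE_LIMIT_MINUTES = 10
VIP_LIMIT_MINUTES = 30

# Observed in this guide: 480P, 16 FPS, 5 seconds, 20 steps took 25 minutes.
OBSERVED_STEPS = 20
OBSERVED_MINUTES = 25


def estimate_minutes(steps):
    """Linear estimate (assumption): time scales with the number of steps."""
    return steps * OBSERVED_MINUTES / OBSERVED_STEPS


def fits_limit(steps, limit_minutes):
    """True if the estimated time stays within the given time limit."""
    return estimate_minutes(steps) <= limit_minutes


if __name__ == "__main__":
    for steps in (20, 24, 30):
        minutes = estimate_minutes(steps)
        print(steps, "steps:", minutes, "min, fits VIP limit:",
              fits_limit(steps, VIP_LIMIT_MINUTES))
```

Under this assumption 30 steps would need about 37.5 minutes, which is consistent with the canceled 30-step run, and roughly 24 steps is the most that fits into the 30-minute VIP limit at these settings.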
I2V With High Resolution Image
https://www.seaart.ai/artWorkDetail/cvboqdte878c7386j2kg
Positive Prompt
`A pink haired girl with a white bikini sitting in a blue swimming tube, sea is wavy and she is partially submerged in water, she is slowly tilting her head and enjoying with her time while tube slowly moves on water. Sky is covered with clouds and sunshine reflects on water from horizon. Euphoric atmosphere with hard lighting, fine contrast and vivid colors. Camera is moving with dolly out effect and let us see full scenery.`
- Generated with Official SeaArt Wan 2.1 image-to-video tool.
- Cost 200 credits and generated within 5 minutes.
- Original image was 2728x4096 and was resized to 768x1168. Maybe it was a coincidence, but the results were less stable.
- Video is 720P, resolution was 784x1168.
- There are no advanced settings and no negative prompt option. Only positive prompt and image selection.
- Prompt was written by the user (me).
I2V With 720P Resolution Image
https://www.seaart.ai/artWorkDetail/cvboqsle878c739nofe0
Positive Prompt
`A pink haired girl with a white bikini sitting in a blue swimming tube, sea is wavy and she is partially submerged in water, she is slowly tilting her head and enjoying with her time while tube slowly moves on water. Sky is covered with clouds and sunshine reflects on water from horizon. Euphoric atmosphere with hard lighting, fine contrast and vivid colors. Camera is moving with dolly out effect and let us see full scenery.`
- Generated with Official SeaArt Wan 2.1 image-to-video tool.
- Cost 200 credits and generated within 5 minutes.
- Original image was 720x1280 and was not resized. Maybe it was a coincidence, but the results were more stable.
- Video is 720P, resolution was 720x1280.
- There are no advanced settings and no negative prompt option. Only positive prompt and image selection.
- Prompt was written by the user (me).
Final Thoughts And Comparison
User Prompts vs. LLM Prompts
Both have acceptable quality and accuracy. Prompting is not hard, and users who know Flux models should find it familiar. In my opinion, writing prompts yourself is better because you can express your ideas exactly as they are. LLMs can't read your mind, so the result will differ from what you imagined. It may be better or worse, but that comes down to luck, and I prefer not to leave it to luck.
- The Base Negative Prompt is far better than LLM recommendations. Use the base negative and add the things you want to remove.
- The prompt written by the LLM was random; I didn't ask for anything specific.
ComfyUI Workflow vs. Official AI Tool
- The AI Tool is much cheaper than ComfyUI.
- There is a significant difference in generation time between these two options.
- There is a high chance of getting a "time limit cancel" on ComfyUI, even with lower settings.
- ComfyUI gives more control over settings, but using that control tends to push generations past the time limit.
Comparing resolution, generation speed, overall quality and cost, the Official AI Tool is the much better option.
Image Selection For I2V
- You should use a prompt to control the elements of the image. Otherwise, extra objects can appear out of nowhere. For example, if you want a sword-swinging animation, the subject must be in a position ready to swing the sword, and the sword must be visible in the image. Otherwise it will come from an absurd place or appear suddenly, and the result will be nonsensical.
- Maybe it was a coincidence, but using the correct resolution instead of letting the AI resize the image gave more stable results.
Wan 2.1 Wrapper Workflow: https://www.seaart.ai/workFlowDetail/cv4q655e878c73fjakag
Official T2V AI Tool: https://www.seaart.ai/workFlowAppDetail/cuvigule878c73c85e6g
Official I2V AI Tool: https://www.seaart.ai/workFlowAppDetail/cuveqode878c73f4tfqg
Wrapper Workflow Explanation Guide: https://www.seaart.ai/articleDetail/cvbp5gle878c739pf5v0