
Quickstart Guide to Stable Video Diffusion

Last Updated | Changes
11/24/2023 | First version published
11/28/2023 | New ComfyUI Workflows
2/6/2024 | SVD 1.1 Released

What is Stable Video Diffusion (SVD)?

Stable Video Diffusion (SVD), from Stability AI, is an extremely powerful image-to-video model: it accepts an image input, “injects” motion into it, and produces some fantastic scenes.

SVD is a latent diffusion model trained to generate short video clips from image inputs. There are two models. The first, img2vid, was trained to generate 14 frames of motion at a resolution of 576×1024, and the second, img2vid-xt, is a finetune of the first, trained to generate 25 frames of motion at the same resolution.

The newly released (2/2024) SVD 1.1 is a further finetune that produces excellent, high-quality outputs, but it requires a specific set of settings, detailed below.

The official Stability AI SVD release page can be found here.

Why should I be excited by SVD?

SVD creates beautifully consistent video movement from our static images!

How can I use SVD?

ComfyUI is leading the pack when it comes to SVD video generation, with official SVD support! Generating 25 frames of 1024×576 video uses less than 10 GB of VRAM.

It’s entirely possible to run the img2vid and img2vid-xt models on a GTX 1080 with 8GB of VRAM!

There’s still no word (as of 11/28) on official SVD support in Automatic1111.

If you’d like to try SVD on Google Colab, this notebook works on the Free Tier: https://github.com/sagiodev/stable-video-diffusion-img2vid/. Generation time varies, but is generally around 2 minutes on a V100 GPU.
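If you’d rather script SVD directly in Python instead of using ComfyUI or the notebook, here’s a minimal sketch using the Hugging Face diffusers library’s StableVideoDiffusionPipeline. This isn’t part of the official workflow described in this guide; it assumes a recent diffusers release with SVD support, and the input image path and seed are just placeholders.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the img2vid-xt weights in fp16; with CPU offload and chunked decoding
# this fits in roughly 10 GB of VRAM.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # trade speed for lower VRAM usage

# Any input image works; it is resized to the model's native 1024x576.
image = load_image("input.png").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(
    image,
    decode_chunk_size=4,       # decode a few frames at a time to save VRAM
    motion_bucket_id=127,      # higher = more motion
    noise_aug_strength=0.02,   # augmentation level (noise added to the input)
    generator=generator,
).frames[0]

export_to_video(frames, "output.mp4", fps=7)
```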

You’ll need to download one of the SVD models from the links below and place it in the ComfyUI/models/checkpoints directory.

Model | Civitai Link | Original Author Link
img2vid | Link | HF Link
img2vid-xt | Link | HF Link
img2vid-xt-1.1 (latest, see below for settings) | Link | HF Link
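As an alternative to downloading through a browser, you can pull the checkpoints straight into the ComfyUI folder with huggingface_hub. The repo IDs and file names below are assumptions based on the official Hugging Face repos, so adjust them if they differ; the 1.1 repo may also require accepting a license before download.

```python
from huggingface_hub import hf_hub_download

# Assumed repo IDs and single-file checkpoint names; verify against the
# Hugging Face pages linked in the table above.
CHECKPOINTS = {
    "stabilityai/stable-video-diffusion-img2vid": "svd.safetensors",
    "stabilityai/stable-video-diffusion-img2vid-xt": "svd_xt.safetensors",
}

for repo_id, filename in CHECKPOINTS.items():
    hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_dir="ComfyUI/models/checkpoints",
    )
```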

After updating your ComfyUI installation, you’ll see new nodes for VideoLinearCFGGuidance and SVD_img2vid_Conditioning. The Conditioning node takes the following inputs:

Setting | Description
video_frames | The number of frames of motion to generate.
motion_bucket_id | The higher the number, the more motion will be in the output.
fps | Higher FPS results in less choppy video output.
augmentation_level | The amount of noise added to the input image. Higher noise will decrease the video’s resemblance to the input image, but will result in greater motion.
VideoLinearCFGGuidance | Improves sampling for video by scaling the CFG across the frames: frames farther away from the initial image frame receive a gradually higher CFG value (see the sketch below).
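To make the VideoLinearCFGGuidance behaviour concrete, here’s a rough sketch of the linear scaling idea (not ComfyUI’s actual implementation): frame 0 keeps min_cfg, the last frame gets the sampler’s full CFG, and frames in between are interpolated evenly.

```python
def per_frame_cfg(min_cfg: float, cfg: float, num_frames: int) -> list[float]:
    """Linearly interpolate the CFG scale across the video frames.

    Frame 0 (closest to the input image) gets min_cfg; the final frame
    gets the full sampler CFG; frames in between are spaced evenly.
    """
    if num_frames == 1:
        return [cfg]
    step = (cfg - min_cfg) / (num_frames - 1)
    return [min_cfg + i * step for i in range(num_frames)]

# Example: min_cfg=1.0, CFG=2.9, 25 frames -> 1.0, 1.08, ..., 2.9
print(per_frame_cfg(1.0, 2.9, 25))
```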

You can download ComfyUI workflows for img2video and txt2video below, but keep in mind you’ll need an updated ComfyUI, and you may also be missing additional video nodes. I recommend using the ComfyUI Manager to identify and download any missing nodes!

Suggested Settings

The settings below are suggested values for each SVD component (node), which I’ve found produce the most consistently usable outputs with the img2vid and img2vid-xt models.

Node | Setting | Value
VideoLinearCFGGuidance | min_cfg | 1
KSampler | Steps | 25
KSampler | CFG | 2.9
SVD_img2vid_Conditioning | Width | 576
SVD_img2vid_Conditioning | Height | 1024
SVD_img2vid_Conditioning | Video Frames | 25
SVD_img2vid_Conditioning | Motion Bucket ID | 60
SVD_img2vid_Conditioning | FPS | 8
SVD_img2vid_Conditioning | Augmentation Level | 0.07
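If you’re using the diffusers sketch from earlier instead of ComfyUI, these node settings map roughly onto the pipeline’s call arguments. The mapping below is my own assumption (min_guidance_scale ≈ min_cfg, max_guidance_scale ≈ the KSampler CFG), not an official equivalence, and the resolution arguments are left at the pipeline defaults.

```python
# Rough diffusers equivalent of the suggested node settings above
# (pipe and image come from the earlier StableVideoDiffusionPipeline sketch).
frames = pipe(
    image,
    num_frames=25,            # SVD_img2vid_Conditioning: Video Frames
    num_inference_steps=25,   # KSampler: Steps
    min_guidance_scale=1.0,   # VideoLinearCFGGuidance: min_cfg
    max_guidance_scale=2.9,   # KSampler: CFG
    motion_bucket_id=60,      # SVD_img2vid_Conditioning: Motion Bucket ID
    fps=8,                    # SVD_img2vid_Conditioning: FPS
    noise_aug_strength=0.07,  # SVD_img2vid_Conditioning: Augmentation Level
).frames[0]
```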

Settings – Img2vid-xt-1.1

February 2024 saw the release of a finetuned SVD model, version 1.1. This version only works well with a very specific set of parameters, which improve the consistency of its outputs. If you’re using the img2vid-xt-1.1 model, the following settings must be applied to produce the best results:

Node | Setting | Value
SVD_img2vid_Conditioning | Width | 1024
SVD_img2vid_Conditioning | Height | 576
SVD_img2vid_Conditioning | Video Frames | 25
SVD_img2vid_Conditioning | Motion Bucket ID | 127
SVD_img2vid_Conditioning | FPS | 6
SVD_img2vid_Conditioning | Augmentation Level | 0.00
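For completeness, the same hedged diffusers mapping with the 1.1-mandated values would look like the snippet below, assuming pipe was loaded with the 1.1 weights (access to those weights on Hugging Face may require accepting the model license).

```python
# Same diffusers call, with the settings the 1.1 model expects.
frames = pipe(
    image,                   # 1024x576 input (the pipeline's default size)
    num_frames=25,
    motion_bucket_id=127,
    fps=6,
    noise_aug_strength=0.0,
).frames[0]
```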

Output Examples (img2vid-xt-1.1)

Output Examples (img2vid-xt – v1.0)

Limitations

It’s not perfect! Currently, there are a few issues with the implementation, including:

  • Generations are short! Only generations of 4 seconds or less are possible at present.
  • Sometimes there’s no motion in the outputs. We can tweak the conditioning parameters, but sometimes the images just refuse to move.
  • The models cannot be controlled through text.
  • Faces, and bodies in general, often aren’t the best!

The Future

We’ll continue to expand this quickstart guide with more information as it becomes available, and we’ll create a full, advanced usage guide soon!