
Quickstart Guide to Stable Video Diffusion

Last Updated | Changes
11/24/2023 | First version published
11/28/2023 | New ComfyUI Workflows
2/6/2024 | SVD 1.1 Released

What is Stable Video Diffusion (SVD)?

Stable Video Diffusion (SVD), from Stability AI, is an extremely powerful image-to-video model: it accepts an image input, “injects” motion into it, and produces some fantastic scenes.

SVD is a latent diffusion model trained to generate short video clips from image inputs. There are two models. The first, img2vid, was trained to generate 14 frames of motion at a resolution of 576×1024, and the second, img2vid-xt, is a finetune of the first, trained to generate 25 frames of motion at the same resolution.

The newly released (2/2024) SVD 1.1 is a further finetune that produces excellent, high-quality outputs, but it requires a specific set of settings, detailed below.

The official Stability AI SVD release page can be found here.

Why should I be excited by SVD?

SVD creates beautifully consistent video movement from our static images!

How can I use SVD?

ComfyUI is leading the pack when it comes to SVD video generation, with official SVD support! Generating 25 frames of 1024×576 video uses less than 10 GB of VRAM.

It’s entirely possible to run the img2vid and img2vid-xt models on a GTX 1080 with 8GB of VRAM!

There’s still no word (as of 11/28) on official SVD support in Automatic1111.

If you’d like to try SVD on Google Colab, this notebook works on the Free Tier: https://github.com/sagiodev/stable-video-diffusion-img2vid/. Generation time varies, but is generally around 2 minutes on a V100 GPU.
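If you’d rather script SVD directly in Python instead of using ComfyUI or the notebook, here’s a minimal sketch using the Hugging Face diffusers library’s StableVideoDiffusionPipeline. This isn’t part of the official workflow described in this guide; it assumes a recent diffusers release with SVD support, and the input image path and seed are just placeholders.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the img2vid-xt weights in fp16; with CPU offload and chunked decoding
# this fits in roughly 10 GB of VRAM.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # trade speed for lower VRAM usage

# Any input image works; it is resized to the model's native 1024x576.
image = load_image("input.png").resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(
    image,
    decode_chunk_size=4,       # decode a few frames at a time to save VRAM
    motion_bucket_id=127,      # higher = more motion
    noise_aug_strength=0.02,   # augmentation level (noise added to the input)
    generator=generator,
).frames[0]

export_to_video(frames, "output.mp4", fps=7)
```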

You’ll need to download one of the SVD models from the links below and place it in the ComfyUI/models/checkpoints directory.

Model | Civitai Link | Original Author Link
img2vid | Link | HF Link
img2vid-xt | Link | HF Link
img2vid-xt-1.1 (latest, see below for settings) | Link | HF Link
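As an alternative to downloading through a browser, you can pull the checkpoints straight into the ComfyUI folder with huggingface_hub. The repo IDs and file names below are assumptions based on the official Hugging Face repos, so adjust them if they differ; the 1.1 repo may also require accepting a license before download.

```python
from huggingface_hub import hf_hub_download

# Assumed repo IDs and single-file checkpoint names; verify against the
# Hugging Face pages linked in the table above.
CHECKPOINTS = {
    "stabilityai/stable-video-diffusion-img2vid": "svd.safetensors",
    "stabilityai/stable-video-diffusion-img2vid-xt": "svd_xt.safetensors",
}

for repo_id, filename in CHECKPOINTS.items():
    hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_dir="ComfyUI/models/checkpoints",
    )
```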

After updating your ComfyUI installation, you’ll see new nodes for VideoLinearCFGGuidance and SVD_img2vid_Conditioning. The Conditioning node takes the following inputs:

Setting | Description
video_frames | The number of frames of motion to generate.
motion_bucket_id | The higher the number, the more motion will be in the output.
fps | Higher FPS results in less choppy video output.
augmentation_level | The amount of noise added to the input image. Higher noise will decrease the video’s resemblance to the input image, but will result in greater motion.
VideoLinearCFGGuidance | Improves sampling for video by scaling the CFG across the frames: frames farther away from the initial image frame receive a gradually higher CFG value (see the sketch below).
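To make the VideoLinearCFGGuidance behaviour concrete, here’s a rough sketch of the linear scaling idea (not ComfyUI’s actual implementation): frame 0 keeps min_cfg, the last frame gets the sampler’s full CFG, and frames in between are interpolated evenly.

```python
def per_frame_cfg(min_cfg: float, cfg: float, num_frames: int) -> list[float]:
    """Linearly interpolate the CFG scale across the video frames.

    Frame 0 (closest to the input image) gets min_cfg; the final frame
    gets the full sampler CFG; frames in between are spaced evenly.
    """
    if num_frames == 1:
        return [cfg]
    step = (cfg - min_cfg) / (num_frames - 1)
    return [min_cfg + i * step for i in range(num_frames)]

# Example: min_cfg=1.0, CFG=2.9, 25 frames -> 1.0, 1.08, ..., 2.9
print(per_frame_cfg(1.0, 2.9, 25))
```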

You can download ComfyUI workflows for img2video and txt2video below, but keep in mind you’ll need an updated ComfyUI, and you may also be missing additional video nodes. I recommend using the ComfyUI Manager to identify and download any missing nodes!

Suggested Settings

The settings below are suggested values for each SVD component (node), which I’ve found produce the most consistently usable outputs with the img2vid and img2vid-xt models.

Node | Setting | Value
VideoLinearCFGGuidance | min_cfg | 1
KSampler | Steps | 25
KSampler | CFG | 2.9
SVD_img2vid_Conditioning | Width | 576
SVD_img2vid_Conditioning | Height | 1024
SVD_img2vid_Conditioning | Video Frames | 25
SVD_img2vid_Conditioning | Motion Bucket ID | 60
SVD_img2vid_Conditioning | FPS | 8
SVD_img2vid_Conditioning | Augmentation Level | 0.07
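If you’re using the diffusers sketch from earlier instead of ComfyUI, these node settings map roughly onto the pipeline’s call arguments. The mapping below is my own assumption (min_guidance_scale ≈ min_cfg, max_guidance_scale ≈ the KSampler CFG), not an official equivalence, and the resolution arguments are left at the pipeline defaults.

```python
# Rough diffusers equivalent of the suggested node settings above
# (pipe and image come from the earlier StableVideoDiffusionPipeline sketch).
frames = pipe(
    image,
    num_frames=25,            # SVD_img2vid_Conditioning: Video Frames
    num_inference_steps=25,   # KSampler: Steps
    min_guidance_scale=1.0,   # VideoLinearCFGGuidance: min_cfg
    max_guidance_scale=2.9,   # KSampler: CFG
    motion_bucket_id=60,      # SVD_img2vid_Conditioning: Motion Bucket ID
    fps=8,                    # SVD_img2vid_Conditioning: FPS
    noise_aug_strength=0.07,  # SVD_img2vid_Conditioning: Augmentation Level
).frames[0]
```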

Settings – Img2vid-xt-1.1

February 2024 saw the release of a finetuned SVD model, version 1.1. This version only works well with a very specific set of parameters, which improve the consistency of its outputs. If you’re using the img2vid-xt-1.1 model, the following settings must be applied to produce the best results:

Node | Setting | Value
SVD_img2vid_Conditioning | Width | 1024
SVD_img2vid_Conditioning | Height | 576
SVD_img2vid_Conditioning | Video Frames | 25
SVD_img2vid_Conditioning | Motion Bucket ID | 127
SVD_img2vid_Conditioning | FPS | 6
SVD_img2vid_Conditioning | Augmentation Level | 0.00
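For completeness, the same hedged diffusers mapping with the 1.1-mandated values would look like the snippet below, assuming pipe was loaded with the 1.1 weights (access to those weights on Hugging Face may require accepting the model license).

```python
# Same diffusers call, with the settings the 1.1 model expects.
frames = pipe(
    image,                   # 1024x576 input (the pipeline's default size)
    num_frames=25,
    motion_bucket_id=127,
    fps=6,
    noise_aug_strength=0.0,
).frames[0]
```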

Output Examples (img2vid-xt-1.1)

Output Examples (img2vid-xt – v1.0)

Limitations

It’s not perfect! Currently, there are a few issues with the implementation, including:

  • Generations are short! Only generations of 4 seconds or less are possible at present.
  • Sometimes there’s no motion in the outputs. We can tweak the conditioning parameters, but sometimes the images just refuse to move.
  • The models cannot be controlled through text.
  • Faces, and bodies in general, often aren’t the best!

The Future

We’ll continue to expand this quickstart guide with more information as it becomes available, and we’ll create a full, advanced usage guide soon!