Are you sure you want to delete this access key?
This model is based on a multi-stage text-to-video generation diffusion model, which inputs a description text and returns a video that matches the text description. Only English input is supported.
The text-to-video generation diffusion model consists of three sub-networks: text feature extraction model, text feature-to-video latent space diffusion model, and video latent space to video visual space model. The overall model parameters are about 1.7 billion. Currently, it only supports English input. The diffusion model adopts a UNet3D structure, and implements video generation through the iterative denoising process from the pure Gaussian noise video.
This model has a wide range of applications and can reason and generate videos based on arbitrary English text descriptions.
Let's first install the libraries required:
$ pip install diffusers transformers accelerate torch
Now, generate a video:
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)
Here are some results:
An astronaut riding a horse.
![]() |
Darth vader surfing in waves.
![]() |
You can optimize for memory usage by enabling attention and VAE slicing and using Torch 2.0. This should allow you to generate videos up to 25 seconds on less than 16GB of GPU VRAM.
$ pip install git+https://github.com/huggingface/diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
# load pipeline
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
# optimize for GPU memory
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()
# generate
prompt = "Spiderman is surfing. Darth Vader is also surfing and following Spiderman"
video_frames = pipe(prompt, num_inference_steps=25, num_frames=200).frames
# convent to video
video_path = export_to_video(video_frames)
The above code will display the save path of the output video, and the current encoding format can be played with VLC player.
The output mp4 file can be viewed by VLC media player. Some other media players may not view it normally.
The training data includes LAION5B, ImageNet, Webvid and other public datasets. Image and video filtering is performed after pre-training such as aesthetic score, watermark score, and deduplication.
(Part of this model card has been taken from here)
This model is available on HuggingFace under the CC by NC 4.0 License.
@InProceedings{VideoFusion,
author = {Luo, Zhengxiong and Chen, Dayou and Zhang, Yingya and Huang, Yan and Wang, Liang and Shen, Yujun and Zhao, Deli and Zhou, Jingren and Tan, Tieniu},
title = {VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023}
}
Press p or to see the previous file or, n or to see the next file
Are you sure you want to delete this access key?
Are you sure you want to delete this access key?
Are you sure you want to delete this access key?
Are you sure you want to delete this access key?