Try Hugging Face’s Open Source Text-to-Video Model (modelscope-text-to-video-synthesis)

Renee LIN
3 min readMar 24, 2023
https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis

This Monday, Victor M tweeted a video clip made with Hugging Face's open-source text-to-video tool (released the same day). Wow, I just tried it. It's fun. Here are the related resources, including links to the web app and the model.

  1. Introduction and demo
  2. Try it yourself
  3. Limitations
https://twitter.com/victormustar/status/1637461621541949441

1. Introduction and demo

This model is built on a multi-stage text-to-video generation diffusion model consisting of three sub-networks: text feature extraction, a text-feature-to-video latent-space diffusion model, and a video-latent-space-to-video-visual-space decoder. The model uses a Unet3D structure and generates videos through an iterative denoising process starting from a pure Gaussian-noise video. (They didn't list any related paper on the homepage, and I don't fully understand the terms in bold.)

Check some demos on their website:


Written by Renee LIN

Passionate about web dev and data analysis. Huge FFXIV fan. Interested in health data now.
