Stable Diffusion: Latent Text-to-Image Diffusion Model
Stable Diffusion is an open-source latent text-to-image diffusion model developed by the CompVis team in collaboration with Stability AI and Runway. It generates high-resolution images from textual descriptions and is based on the paper "High-Resolution Image Synthesis with Latent Diffusion Models" by Robin Rombach et al., presented at CVPR 2022. The model uses a frozen CLIP ViT-L/14 text encoder to condition an 860M-parameter UNet that denoises in latent space, making it relatively lightweight and able to run on GPUs with at least 10GB of VRAM.
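To make that architecture concrete, here is a minimal sketch, assuming the diffusers and transformers libraries and the publicly hosted CompVis/stable-diffusion-v1-4 checkpoint (neither is prescribed by the text above), that loads the model's main components individually and checks the UNet's size:

```python
# Sketch: load Stable Diffusion's components separately to inspect the
# architecture described above. Assumes `pip install diffusers transformers`
# and access to the CompVis/stable-diffusion-v1-4 checkpoint on Hugging Face.
from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPTextModel

repo = "CompVis/stable-diffusion-v1-4"

# Frozen CLIP ViT-L/14 text encoder that conditions generation on the prompt.
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

# UNet (roughly 860M parameters) that performs denoising in latent space.
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")

# VAE that maps between pixel space and the compressed latent space.
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")

print(f"UNet parameters: {sum(p.numel() for p in unet.parameters()) / 1e6:.0f}M")
```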
The project repository provides detailed instructions for setting up the environment, training the model, and running text-to-image generation. It also links to the model weights and checkpoints, which are released under the CreativeML Open RAIL-M license; the license permits broad use but imposes use-based restrictions to prevent misuse. The repository encourages responsible use and points to ongoing research on the safe and ethical deployment of generative models.
For those interested in experimenting with Stable Diffusion, the repository offers a reference sampling script that includes a Safety Checker module to reduce explicit outputs and an invisible watermark to help identify machine-generated images. Integration with the diffusers library further simplifies downloading and sampling the model, as sketched below.
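The following is a minimal text-to-image sampling sketch via diffusers, assuming a CUDA GPU and the CompVis/stable-diffusion-v1-4 checkpoint; the prompt is purely illustrative, and the pipeline loads the safety checker by default:

```python
# Sketch: text-to-image sampling with the diffusers StableDiffusionPipeline.
# Assumes a CUDA GPU with >= 10GB VRAM; fp16 weights roughly halve memory use.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a photograph of an astronaut riding a horse"  # illustrative prompt
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("astronaut_rides_horse.png")
```

Here guidance_scale trades prompt fidelity against diversity; values around 7 to 8 are a common default for this model.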
The CompVis team has also provided image-modification capabilities built on the same model, enabling tasks such as text-guided image-to-image translation and upscaling (see the sketch below).
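As a hedged sketch of text-guided image-to-image translation, the snippet below uses diffusers' StableDiffusionImg2ImgPipeline rather than the repository's own scripts; the input file sketch.png and the prompt are hypothetical stand-ins:

```python
# Sketch: text-guided image-to-image translation via diffusers.
# "sketch.png" is a hypothetical input; any RGB image works.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))

# strength in [0, 1]: higher values deviate further from the input image.
result = pipe(
    prompt="a fantasy landscape, detailed oil painting",
    image=init_image,
    strength=0.75,
    guidance_scale=7.5,
).images[0]
result.save("fantasy_landscape.png")
```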
To explore Stable Diffusion further, visit the GitHub repository at https://github.com/CompVis/stable-diffusion.