Introduction
Sana is a text-to-image framework for high-resolution image generation. It combines speed, quality, and versatility, which makes it a practical choice for content creators, designers, and AI enthusiasts alike.
One of Sana's standout features is its ability to generate images up to 4096×4096 resolution with strong text-image alignment. This is achieved through several core designs. For instance, the Deep Compression Autoencoder compresses images by a factor of 32×, significantly reducing the number of latent tokens. The Linear DiT replaces vanilla attention with linear attention, which is more efficient at high resolutions.
In terms of performance, Sana-0.6B is 20 times smaller than models like Flux-12B yet more than 100 times faster in measured throughput. It can generate a 1024×1024 image in less than 1 second on a 16GB laptop GPU, enabling cost-effective content creation.
The Sana source code is available at https://github.com/NVlabs/Sana.
Workflow Overview
SANA Workflow and Key Node Settings
Text Input Node:
This is where users enter their prompts. Sana supports a wide range of input, including English, Chinese, and emojis. For example, users can input a Chinese poem like “念去去千里烟波,暮霭沉沉楚天阔” or a fun prompt with emojis such as “A cute 🐶 playing with a 🏀 on the grass”.
Gemma Encoding Node:
Here, Gemma takes the input text and encodes it into text embeddings that condition the image generation. Its strong text comprehension helps ensure that the meaning of the prompt is accurately captured.
Automatic Labeling and Caption Selection Node:
Multiple VLMs generate diverse re-captions for each image, and a CLIPScore-based strategy then selects the most suitable captions. This step enriches the training data and improves the overall quality of the generated images.
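The selection step can be illustrated with a small sketch. The text above only says that the strategy is CLIPScore-based; the temperature softmax used here is an assumption, chosen as one common way to favor well-aligned captions while still keeping some variety. The function names, the example captions, and the scores are all hypothetical, and the scores are assumed to have been computed beforehand by a CLIP model.

```python
import math
import random

def caption_probabilities(clip_scores, temperature=0.1):
    """Turn CLIPScores into sampling probabilities with a temperature softmax."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(clip_scores)
    weights = [math.exp((s - top) / temperature) for s in clip_scores]
    total = sum(weights)
    return [w / total for w in weights]

def pick_caption(captions, clip_scores, temperature=0.1, rng=random):
    """Sample one caption, favoring those with a higher CLIPScore."""
    probs = caption_probabilities(clip_scores, temperature)
    return rng.choices(captions, weights=probs, k=1)[0]

# Hypothetical re-captions and precomputed scores for one training image.
captions = [
    "a dog",
    "a small dog playing with a basketball on grass",
    "an animal outdoors",
]
scores = [0.24, 0.31, 0.21]
probs = caption_probabilities(scores)
chosen = pick_caption(captions, scores)
```

A lower temperature makes the choice closer to always taking the top-scoring caption; a higher one spreads the probability more evenly across the candidates.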
Latent Token Generation Node:
The Deep Compression Autoencoder comes into play at this node. It compresses the encoded information into latent tokens, with a compression factor of 32×.
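To make the effect of the 32× factor concrete, the sketch below counts latent tokens for a few image sizes. It assumes the 32× autoencoder is paired with a patch size of 1, and it compares against a baseline of 8× compression with a patch size of 2; both pairings are assumptions made for illustration, and latent_tokens is a hypothetical helper.

```python
def latent_tokens(height, width, compression, patch_size=1):
    """Number of tokens the transformer sees for one image."""
    side = compression * patch_size
    if height % side or width % side:
        raise ValueError("image size must be divisible by compression * patch_size")
    return (height // side) * (width // side)

for size in (512, 1024, 4096):
    # Assumed setup: 32x compression, patch size 1.
    compact = latent_tokens(size, size, compression=32)
    # Assumed baseline: 8x compression, patch size 2.
    baseline = latent_tokens(size, size, compression=8, patch_size=2)
    print(f"{size}x{size}: {compact} tokens at 32x, {baseline} tokens at 8x with patch size 2")
```

Under these assumptions the token count drops by a factor of 4 at every resolution, for example from 4096 to 1024 tokens for a 1024×1024 image.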
Linear DiT and Flow-DPM-Solver Node:
The latent tokens are passed through the Linear DiT, where the linear attention mechanism and Mix-FFN work together to generate the image. The Flow-DPM-Solver reduces the number of inference steps, accelerating the image generation process.
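The efficiency gain from linear attention can be shown with a minimal, dependency-free sketch. It assumes a ReLU feature map on queries and keys and a single attention head, and it leaves out Mix-FFN, projections, and the solver; it is an illustration of the general technique, not Sana's actual implementation. The key idea is that the key/value summary is computed once and reused for every query, so the cost grows linearly with the number of tokens rather than quadratically.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def linear_attention(queries, keys, values, eps=1e-6):
    """ReLU linear attention over lists of equal-length vectors."""
    d_k, d_v = len(keys[0]), len(values[0])
    # kv[a][b] = sum_j relu(k_j[a]) * v_j[b];  k_sum[a] = sum_j relu(k_j[a])
    kv = [[0.0] * d_v for _ in range(d_k)]
    k_sum = [0.0] * d_k
    for k, v in zip(keys, values):
        for a in range(d_k):
            ka = relu(k[a])
            k_sum[a] += ka
            for b in range(d_v):
                kv[a][b] += ka * v[b]
    outputs = []
    for q in queries:
        phi_q = [relu(x) for x in q]
        # Normalizer shared by all output channels of this query.
        norm = sum(phi_q[a] * k_sum[a] for a in range(d_k)) + eps
        outputs.append([
            sum(phi_q[a] * kv[a][b] for a in range(d_k)) / norm
            for b in range(d_v)
        ])
    return outputs

def quadratic_reference(queries, keys, values, eps=1e-6):
    """Same result computed the slow way, with one weight per token pair."""
    outputs = []
    for q in queries:
        weights = [sum(relu(a) * relu(b) for a, b in zip(q, k)) for k in keys]
        norm = sum(weights) + eps
        outputs.append([
            sum(w * v[b] for w, v in zip(weights, values)) / norm
            for b in range(len(values[0]))
        ])
    return outputs

# Three toy tokens with 2-dimensional queries/keys and 3-dimensional values.
tokens_q = [[1.0, -0.5], [0.2, 0.7], [-1.0, 0.3]]
tokens_k = [[0.5, 0.1], [-0.3, 0.9], [0.8, -0.2]]
tokens_v = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 1.0, 0.5]]
fast = linear_attention(tokens_q, tokens_k, tokens_v)
slow = quadratic_reference(tokens_q, tokens_k, tokens_v)
```

Both functions return the same values up to floating-point rounding; the difference is that the first never builds a token-by-token weight matrix, which is what matters when a 4096×4096 image produces many thousands of latent tokens.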
Image Output Node:
This is the final destination where the generated image is presented to the user. Whether it's a 512×512, 1024×1024, or even a 4096×4096 resolution image, Sana ensures high-quality output.
For 4K image generation, users can download the relevant model from the provided Hugging Face link: https://huggingface.co/Efficient-Large-Model/Sana_1600M_4Kpx_BF16/tree/main/checkpoints.
In addition, Sana has built-in safety features. When inappropriate vocabulary is entered, the system automatically replaces it with a heart symbol ❤️, ensuring a safe and pleasant user experience.
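The replacement behavior described above can be mimicked with a simple word filter. This is only a sketch of the visible effect: it does not describe how Sana itself detects inappropriate input, and sanitize_prompt, the BLOCKED list, and the placeholder words in it are all hypothetical.

```python
import re

HEART = "❤️"

def sanitize_prompt(prompt, blocked_words):
    """Replace each blocked word (whole word, case-insensitive) with a heart."""
    if not blocked_words:
        return prompt
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in blocked_words) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(HEART, prompt)

# Placeholder blocklist; a real system would use its own detection method.
BLOCKED = ["forbidden", "banned"]
cleaned = sanitize_prompt("A Forbidden castle at dusk", BLOCKED)
print(cleaned)
```

Prompts without any blocked word pass through unchanged, so the filter can sit in front of the text input node without affecting normal use.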