Sana 2K Text2image FastTrack

MimicPC

04/27/2025

ComfyUI

Image Generation

Introduction

Sana is a text-to-image framework built for fast, high-resolution image generation. It combines speed, quality, and versatility, which makes it a go-to choice for content creators, designers, and AI enthusiasts alike.

One of Sana's standout features is its ability to generate images at resolutions up to 4096×4096 with strong text-image alignment. This is achieved through several core designs. The Deep Compression Autoencoder compresses images by a factor of 32×, significantly reducing the number of latent tokens. The Linear DiT replaces vanilla attention with linear attention, which is more efficient at high resolutions.

In terms of performance, Sana-0.6B is 20 times smaller than models like Flux-12B yet 100+ times faster in measured throughput. It can generate a 1024×1024 image in less than 1 second on a 16GB laptop GPU, enabling cost-effective content creation.

The Sana source code is available at: https://github.com/NVlabs/Sana

Workflow Overview

Sana Workflow and Key Node Settings

Text Input Node:

This is where users enter their prompts. Sana accepts a wide range of inputs, including English, Chinese, and emojis. For example, users can enter a Chinese poem such as “念去去千里烟波，暮霭沉沉楚天阔” or a playful prompt with emojis such as “A cute 🐶 playing with a 🏀 on the grass”.

Gemma Encoding Node:

Here, Gemma takes the input text and encodes it into a format that can be processed further. Its superior text comprehension capabilities ensure that the essence of the text is accurately captured.

Automatic Labeling and Caption Selection Node:

Multiple VLMs generate diverse re-captions for each training image. A CLIPScore-based strategy then selects the most suitable captions. This step enriches the training data and improves the overall quality of the generated images.
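
The caption selection step can be sketched as follows. This is a minimal illustration, assuming that "CLIPScore-based" means sampling a caption with probability proportional to a temperature-scaled softmax of its CLIPScore. The function names, the scores, and the temperature value are all hypothetical and are not taken from the Sana code.

```python
import math
import random


def caption_probabilities(clip_scores, temperature=1.0):
    """Turn CLIPScores into sampling probabilities with a softmax.

    Assumption: a lower temperature concentrates the probability on the
    best-scoring caption; a higher one keeps more caption diversity.
    """
    top = max(clip_scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in clip_scores]
    total = sum(weights)
    return [w / total for w in weights]


def pick_caption(captions, clip_scores, temperature=1.0, rng=random):
    """Sample one caption according to its CLIPScore-derived probability."""
    probs = caption_probabilities(clip_scores, temperature)
    return rng.choices(captions, weights=probs, k=1)[0]


# Hypothetical re-captions of one image, each from a different VLM,
# with made-up CLIPScores.
captions = [
    "a dog on grass",
    "a small puppy playing with a basketball on a lawn",
    "a puppy and a ball outdoors",
]
scores = [0.25, 0.31, 0.28]

chosen = pick_caption(captions, scores, temperature=0.1, rng=random.Random(0))
```

Sampling instead of always taking the top-scoring caption keeps some variety in the captions a model sees across epochs, while still favoring the ones that match the image best.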

Latent Token Generation Node:

The Deep Compression Autoencoder comes into play at this node. It compresses the encoded information into latent tokens, with a compression factor of 32×.
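
To see what the 32× factor means for the token count, here is a small arithmetic sketch. It assumes the factor applies to each spatial dimension and that every latent position becomes one token (patch size 1). The `latent_tokens` helper is hypothetical, and the 8× autoencoder with 2×2 patches is included only as a commonly used point of comparison.

```python
def latent_tokens(width, height, ae_factor=32, patch_size=1):
    """Number of latent tokens the diffusion transformer has to process.

    ae_factor:  spatial downsampling of the autoencoder per dimension
    patch_size: how many latent pixels per side are merged into one token
    """
    stride = ae_factor * patch_size
    if width % stride or height % stride:
        raise ValueError("resolution must be a multiple of ae_factor * patch_size")
    return (width // stride) * (height // stride)


for side in (1024, 2048, 4096):
    sana_like = latent_tokens(side, side)                            # 32x AE, patch 1
    baseline = latent_tokens(side, side, ae_factor=8, patch_size=2)  # 8x AE, patch 2
    print(f"{side}x{side}: {sana_like} tokens vs {baseline} tokens")
```

Under these assumptions a 1024×1024 image maps to 1,024 tokens instead of 4,096, and a 4096×4096 image to 16,384 instead of 65,536, a 4× reduction at every resolution.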

Linear DiT and Flow-DPM-Solver Node:

The latent tokens are passed through the Linear DiT, where the linear attention mechanism and Mix-FFN work together to generate the image. The Flow-DPM-Solver reduces the number of inference steps, accelerating the image generation process.
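
The efficiency gain of linear attention comes from reordering the computation. The sketch below is a single-head, pure-Python illustration that assumes a ReLU feature map; it is not the Sana implementation, and both function names are hypothetical. The first function builds the full token-to-token similarity (cost grows with N²), the second computes two shared summaries once and reuses them for every query (cost grows with N).

```python
def relu(x):
    return x if x > 0.0 else 0.0


def quadratic_relu_attention(Q, K, V, eps=1e-6):
    """Reference form: every query is compared with every key (O(N^2))."""
    out = []
    for q in Q:
        weights = [sum(relu(a) * relu(b) for a, b in zip(q, k)) for k in K]
        norm = sum(weights) + eps
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / norm
                    for j in range(len(V[0]))])
    return out


def linear_relu_attention(Q, K, V, eps=1e-6):
    """Same result, but keys and values are summarized once (O(N))."""
    d, dv = len(K[0]), len(V[0])
    kv = [[0.0] * dv for _ in range(d)]  # kv[a][j] = sum_n relu(K[n][a]) * V[n][j]
    k_sum = [0.0] * d                    # k_sum[a] = sum_n relu(K[n][a])
    for k, v in zip(K, V):
        for a in range(d):
            ka = relu(k[a])
            k_sum[a] += ka
            for j in range(dv):
                kv[a][j] += ka * v[j]
    out = []
    for q in Q:
        fq = [relu(x) for x in q]
        norm = sum(fq[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(fq[a] * kv[a][j] for a in range(d)) / norm
                    for j in range(dv)])
    return out
```

Both functions return the same values up to floating-point rounding. The difference is that the size of `kv` and `k_sum` depends only on the feature dimension, not on the number of tokens, which is why this form scales better at 2K and 4K resolutions.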

Image Output Node:

This is where the generated image is presented to the user. Whether the output is 512×512, 1024×1024, or even 4096×4096, Sana ensures high-quality results.

For 4K image generation, users can download the relevant model from Hugging Face: https://huggingface.co/Efficient-Large-Model/Sana_1600M_4Kpx_BF16/tree/main/checkpoints

In addition, Sana has built-in safety features. When inappropriate vocabulary is entered, the system automatically replaces it with a heart symbol ❤️, ensuring a safe and pleasant user experience.

Details

App	ComfyUI (v0.3.10)
Update Time	04/27/2025
File Space	21.8 GB
Models	0
Extensions	4