So you've already seen the myriad posts on Reddit where excited fans of neural networks share all sorts of precious sirens and other anime characters with outstanding forms, created on their car-sized home PCs (also serving as electric heaters)? It looks stunning: a computer draws a picture out of its imagination (if we can call it that), following an artist's idea expressed in text. The time has come to figure out how it's done!
However, we are not going to blindly copy the prompts and workflows, just imitating the masters. Our goal is to get to the bottom of it and outpace them, so stay tuned. Good news: even with potato hardware we can run Stable Diffusion (an extremely popular open source diffusion model). And ComfyUI will become our virtual laboratory and artist's workshop. The bad news? There's a learning curve. But have patience and remember - practice makes perfect.
ComfyUI is a node-based interface for image generation. Instead of typing a prompt and hitting "Generate," you connect building blocks (nodes) that each do one thing. It's like visual programming: data flows through wires from one node to the next until you get an image.
Every image generation workflow needs five essential nodes:
1. Load Checkpoint — This loads your AI model. Think of it as plugging in the brain. It outputs three things:
- MODEL — the actual neural network that predicts noise
- CLIP — the text encoder that understands your prompts
- VAE — the encoder/decoder that converts between pixel space and latent space

2. CLIP Text Encode — You need two of these: one for what you want (positive prompt), one for what you don't (negative prompt). They convert your text into numerical embeddings the model understands.
3. Empty Latent Image — This creates a blank canvas in latent space. Here you set your output resolution. Important: different models have different sweet spots: SD 1.5 works best at 512×512, while SDXL, Flux, and SD 3.5 are built around 1024×1024.
4. KSampler — The engine. This is where the actual image generation happens through iterative denoising. Key settings:
- seed — which starting noise you get (same seed plus same settings gives the same image)
- steps — how many denoising iterations to run (20-30 is typical for most models)
- cfg — how strongly the prompt steers the result
- sampler_name — euler_a or dpmpp_2m are reliable choices

5. VAE Decode — Converts the latent image back into actual pixels you can see and save.
The connection pattern: Checkpoint → CLIP nodes → KSampler → VAE Decode → Preview Image. That's it. That's a working text-to-image workflow.
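If you'd rather see the graph as data, the same five-node chain can be written in ComfyUI's API (JSON) format and queued over the local HTTP endpoint. Below is a minimal sketch assuming a default local install on port 8188; the checkpoint file name, node ids, prompt text, and sampler settings are placeholders, and Save Image stands in for Preview Image so the result lands on disk.

```python
import json
import urllib.request

# Minimal text-to-image workflow in ComfyUI's API (JSON) format.
# The checkpoint name and prompts are placeholders -- adjust for your setup.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",            # positive prompt
          "inputs": {"text": "a lighthouse at sunset, oil painting",
                     "clip": ["1", 1]}},
    "3": {"class_type": "CLIPTextEncode",            # negative prompt
          "inputs": {"text": "blurry, low quality", "clip": ["1", 1]}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0],
                     "negative": ["3", 0], "latent_image": ["4", 0],
                     "seed": 42, "steps": 25, "cfg": 7.0,
                     "sampler_name": "dpmpp_2m", "scheduler": "karras",
                     "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "comfy_test"}},
}

# Queue the workflow on a locally running ComfyUI instance (default port 8188).
req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))
```

The wire references are just pairs of node id and output index: `["1", 1]` means "take output 1 (CLIP) of node 1", which is exactly what dragging a noodle between sockets does in the UI.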
Stable Diffusion 1.5 — The grandfather. Fast, lightweight, runs on modest hardware. Outputs at 512×512. Doesn't follow complex prompts well, but there's a massive ecosystem of fine-tunes. Good for: experimentation, older GPUs, rapid iteration.
SDXL — The reliable workhorse. Better detail, anatomy, and prompt following than SD 1.5. Native 1024×1024 resolution. Needs 6-8GB VRAM minimum. Uses two CLIP encoders (hence the "Dual CLIP Loader" node you'll see in workflows). Good for: general use, when you want quality without bleeding-edge requirements.
Flux — The new hotness from Black Forest Labs. Exceptional prompt adherence — it actually listens to what you ask for. Great photorealism. Uses a different architecture (Rectified Flow Transformers, a type of Diffusion Transformer, or DiT). Catch: no negative prompts, needs more VRAM, and you should use CFG 1.0 for GGUF versions (more on this format later). Good for: best quality results, precise prompt following.
Stable Diffusion 3.5 — Stability's latest family. Comes in three flavours: Large (8.1B params, 18GB+ VRAM, best quality), Large Turbo (same size but only 4 steps needed), and Medium (2.5B params, ~10GB VRAM, runs on consumer GPUs). Native 1024×1024, can stretch to 2 megapixels. Much better text rendering than predecessors — finally readable letters in images! Uses MMDiT-X architecture (different from SDXL's U-Net). Quirk: requires three text encoders (CLIP L, CLIP G, and T5-XXL), which adds memory overhead. Good for: text in images, when you want Stability's latest without Flux's VRAM appetite.
There are also specialised models like PixArt Sigma (supports up to 4K), Kolors/Klein 4B (good multilingual support), and various anime-focused fine-tunes.
Under the hood of these models you'll find three major brain designs:
U-Net — The original backbone for many diffusion models (SD 1.5, SDXL). Shaped like a "U" — it compresses the image down, processes it, then expands back up. Uses convolutions (local feature detection) and skip connections that preserve detail. Fast, efficient, battle-tested. The downside: it has built-in assumptions about how images work (local patterns matter more than global ones), which can limit flexibility.
DiT (Diffusion Transformer) — The modern approach (Flux, SD 3.5). Throws out convolutions entirely and treats the image as a sequence of patches — like how language models treat text as a sequence of words. No spatial assumptions baked in; the model learns spatial relationships from data rather than having them built in. Scales beautifully with more parameters and compute. The trade-off: hungrier for VRAM, but produces better results when you feed it enough data.
MMDiT (Multimodal DiT) — SD 3.5's flavour. A DiT variant that processes text and image tokens together in the same attention blocks, rather than keeping them separate. Better at understanding how words relate to image regions.
In practice: U-Net models run faster on modest hardware; DiT models produce better quality but want beefier GPUs.
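To make the "sequence of patches" idea concrete, here is a tiny illustrative sketch (not code from Flux or SD 3.5) of the patchify step a DiT performs before its transformer blocks see the latent. The 4-channel latent and the 2×2 patch size are assumptions for the example.

```python
import torch

def patchify(latent: torch.Tensor, patch_size: int = 2) -> torch.Tensor:
    """Turn a latent image (B, C, H, W) into a sequence of flattened
    patches (B, num_patches, C * patch_size * patch_size) -- the token
    sequence a DiT-style transformer operates on."""
    b, c, h, w = latent.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Cut the spatial grid into non-overlapping patch_size x patch_size tiles.
    x = latent.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # (B, C, H/p, W/p, p, p) -> (B, H/p * W/p, C * p * p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)
    return x

# A 1024x1024 image becomes a 4-channel 128x128 latent after an 8x VAE;
# with 2x2 patches that's a sequence of 4096 tokens of 16 values each.
latent = torch.randn(1, 4, 128, 128)
tokens = patchify(latent)
print(tokens.shape)  # torch.Size([1, 4096, 16])
```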
Qwen-Image — Alibaba's 20B parameter beast with exceptional text rendering, especially for Chinese characters. The catch? It's hungry: full BF16 needs 48GB+ VRAM, though quantized versions (Q4_K_M at ~13GB) make it runnable on 24GB cards. If you're doing anything with text in images, this is currently the best open model for it.
Z-Image Turbo — Also from Alibaba (Tongyi Lab), but designed for efficiency. At 6B parameters, it fits in 12-16GB VRAM at BF16, or as low as 6GB with GGUF quantization. Generates in 8 steps (vs. 20-30 for older models), offers sub-second latency on good hardware, and handles bilingual text (English/Chinese) well. Currently the top-ranked open-source model on the Artificial Analysis leaderboard.
KOALA-Lightning — The budget champion. A distilled version of SDXL that compresses the U-Net to 700M-1B parameters (vs. SDXL's 2.6B). Runs on 8GB VRAM, generates 1024×1024 images in under a second on a 4090. Quality isn't quite SDXL level, but it's remarkably close for a model that runs on a 3060 Ti.
Models come in different file formats. Here's what you'll encounter:
.safetensors — The standard. Safe to load (no arbitrary code execution), efficient, widely supported. This is what you want.
.ckpt — Legacy format. Can contain executable code, so only use from trusted sources.
.gguf — Quantized format from the LLM world, now used for Flux and other large models. Enables running big models on smaller GPUs through compression.
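The safety difference is easy to see in code. A minimal sketch, assuming PyTorch and the safetensors package are installed; the file names are placeholders.

```python
import torch
from safetensors.torch import load_file

# .safetensors stores only tensors -- loading it cannot run arbitrary code.
weights = load_file("models/checkpoints/my_model.safetensors")

# .ckpt is a pickle under the hood; unpickling can execute code embedded in
# the file, so weights_only=True (and a trusted source) is essential.
legacy = torch.load("models/checkpoints/old_model.ckpt",
                    map_location="cpu", weights_only=True)

print(len(weights), "tensors in the safetensors file")
```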
Hugging Face will be our best friend. It's a priceless source of paint and brushes for our virtual workshop. Just find the model you need, choose the largest *.safetensors file in the folder and start downloading. Depending on your connection, the download will probably take long enough for you to finish reading this article.
Once downloaded, make ComfyUI aware of it by moving the file to models/checkpoints/ inside the root ComfyUI folder. In the app, open the Nodes section on the left and hit the Refresh node definitions button so the new model appears without restarting.
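If you prefer scripting the download, the huggingface_hub library can drop the file straight into that folder. A sketch with a placeholder repository and file name, assuming ComfyUI sits in the current directory.

```python
from huggingface_hub import hf_hub_download

# Download a checkpoint directly into ComfyUI's model folder.
# repo_id and filename are placeholders -- substitute the model you
# actually picked on Hugging Face.
path = hf_hub_download(
    repo_id="some-org/some-model",
    filename="model.safetensors",
    local_dir="ComfyUI/models/checkpoints",
)
print("Saved to", path)
```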
Here's where file sizes come from: the model's weights (the numbers that make up the neural network) can be stored at different precisions:
FP16 (16-bit float) — standard precision, no quality loss. Flux Dev at FP16 is about 23GB and needs ~24GB VRAM. This is the gold standard.
BF16 (Brain Float 16) — Alternative 16-bit format with different trade-offs. Similar size to FP16.
FP8 (8-bit float) — Half the size of FP16 (~11GB for Flux). Minimal quality loss, works great on RTX 4000 series cards which have native FP8 support. Sweet spot for most people with modern GPUs. One caveat: Apple's MPS backend doesn't support FP8.
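Those sizes are roughly just parameter count times bytes per weight. A quick back-of-the-envelope check in Python, using Flux Dev's roughly 12 billion parameters as the example:

```python
# Rough file-size estimate: parameters x bytes per weight.
params = 12e9  # Flux Dev has roughly 12 billion parameters

for name, bytes_per_weight in [("FP16/BF16", 2), ("FP8", 1), ("Q4 GGUF", 0.5)]:
    gib = params * bytes_per_weight / 1024**3
    print(f"{name:>9}: ~{gib:.0f} GiB")

# FP16/BF16: ~22 GiB   FP8: ~11 GiB   Q4 GGUF: ~6 GiB
# Real files add some overhead for metadata, and Q4 variants like Q4_K_M
# keep a few layers at higher precision, so they come out a bit larger.
```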
MPS (Metal Performance Shaders) — Apple's secret weapon for Mac users. It's a framework that lets PyTorch tap into Apple Silicon GPUs instead of crawling on CPU. Before MPS support (introduced in PyTorch 1.12), running models on Mac meant CPU-only — painfully slow. Now the M1/M2/M3 chips actually earn their keep.
The magic trick: Apple Silicon has unified memory — GPU and CPU share the same RAM pool. No copying data back and forth between separate memory banks like on discrete GPUs. This means you can load larger models than the "GPU memory" spec would suggest on traditional hardware. An M3 Max with 64GB can theoretically handle models that would need a 48GB+ NVIDIA card.
The catch? MPS doesn't support everything CUDA does. FP8 precision? Nope. Some exotic operations? Hit or miss. You'll occasionally see errors about unsupported dtypes — that's MPS being picky. Stick to FP16/BF16 models and you'll be fine.
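Picking the right device in PyTorch is a short fallback chain, and it's exactly where MPS plugs in. A minimal sketch:

```python
import torch

# Prefer CUDA, then Apple's MPS backend, then plain CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# MPS is pickier about dtypes than CUDA: FP16/BF16 are fine, FP8 is not.
x = torch.randn(4, 4, dtype=torch.float16, device=device)
print(device, x.dtype)
```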
That's our foundation. The beauty of ComfyUI is that once you understand these basics, you can build increasingly complex workflows — img2img, inpainting, ControlNet, video generation — just by adding more nodes. But that's another story.
Now check if the model file has finally landed on your drive, and go make some pictures!