Large diffusion models like Flux (a flow-based text-to-image generation model) can create stunning images, but their size can be a hurdle, demanding significant memory and compute resources. Quantization offers a powerful solution, shrinking these models to make them more accessible without drastically compromising performance. But the big question always is: can you actually tell the difference in the final image?
Before we dive into the technical details of how various quantization backends in Hugging Face Diffusers work, why not test your own perception?
We created a setup where you can provide a prompt, and we generate results using both the original, high-precision model (e.g., Flux-dev in BF16) and several quantized versions (BnB 4-bit, BnB 8-bit). The generated images are then presented to you and your challenge is to identify which ones came from the quantized models.
Try it out here or below!
Often, especially with 8-bit quantization, the differences are subtle and may not be noticeable without close inspection. More aggressive quantization like 4-bit or lower might be more noticeable, but the results can still be good, especially considering the massive memory savings. NF4 often gives the best trade-off though.
Now, let's dive deeper.
Building on our previous post, "Memory-efficient Diffusion Transformers with Quanto and Diffusers", this post explores the diverse quantization backends integrated directly into Hugging Face Diffusers. We'll examine how bitsandbytes, GGUF, torchao, Quanto and native FP8 support make large and powerful models more accessible, demonstrating their use with Flux.
Before diving into the quantization backends, let's introduce the FluxPipeline (using the black-forest-labs/FLUX.1-dev checkpoint) and its components, which we'll be quantizing. Loading the full FLUX.1-dev model in BF16 precision requires approximately 31.447 GB of memory. The main components are:
Text Encoders (CLIP and T5):
Function: Process input text prompts. FLUX-dev uses CLIP for initial understanding and a larger T5 for nuanced comprehension and better text rendering.
Memory: T5 - 9.52 GB; CLIP - 246 MB (in BF16)
Transformer (Main Model - MMDiT):
Function: Core generative part (Multimodal Diffusion Transformer). Generates images in latent space from text embeddings.
Memory: 23.8 GB (in BF16)
Variational Auto-Encoder (VAE):
Function: Translates images between pixel and latent space. Decodes generated latent representation to a pixel-based image.
Memory: 168 MB (in BF16)
Focus of Quantization: Examples will primarily focus on the transformer and text_encoder_2 (T5) for the most substantial memory savings.
prompts = [
"Baroque style, a lavish palace interior with ornate gilded ceilings, intricate tapestries, and dramatic lighting over a grand staircase.",
"Futurist style, a dynamic spaceport with sleek silver starships docked at angular platforms, surrounded by distant planets and glowing energy lines.",
"Noir style, a shadowy alleyway with flickering street lamps and a solitary trench-coated figure, framed by rain-soaked cobblestones and darkened storefronts.",
]bitsandbytes is a popular and user-friendly library for 8-bit and 4-bit quantization, widely used for LLMs and QLoRA fine-tuning. We can use it for transformer-based diffusion and flow models, too.
BF16
BnB 4-bit
BnB 8-bit
Visual comparison of Flux-dev model outputs using BF16 (left), BnB 4-bit (center), and BnB 8-bit (right) quantization. (Click on an image to zoom)
Precision
Memory after loading
Peak memory
Inference time
BF16
~31.447 GB
36.166 GB
12 seconds
4-bit
12.584 GB
17.281 GB
12 seconds
8-bit
19.273 GB
24.432 GB
27 seconds
All benchmarks performed on 1x NVIDIA H100 80GB GPU
Example (Flux-dev with BnB 4-bit):
import torch
from diffusers import FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
model_id = "black-forest-labs/FLUX.1-dev"
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
"transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
"text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
}
)
pipe = FluxPipeline.from_pretrained(
model_id,
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16
)
pipe.to("cuda")
prompt = "Baroque style, a lavish palace interior with ornate gilded ceilings, intricate tapestries, and dramatic lighting over a grand staircase."
pipe_kwargs = {
"prompt": prompt,
"height": 1024,
"width": 1024,
"guidance_scale": 3.5,
"num_inference_steps": 50,
"max_sequence_length": 512,
}
print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")
image = pipe(
**pipe_kwargs, generator=torch.manual_seed(0),
).images[0]
print(f"Pipeline memory usage: {torch.cuda.max_memory_reserved() / 1024**3:.3f} GB")
image.save("flux-dev_bnb_4bit.png")Note: When using
PipelineQuantizationConfigwithbitsandbytes, you need to importDiffusersBitsAndBytesConfigfromdiffusersandTransformersBitsAndBytesConfigfromtransformersseparately. This is because these components originate from different libraries. If you prefer a simpler setup without managing these distinct imports, you can use an alternative approach for pipeline-level quantization, an example of this method is in the Diffusers documentation on Pipeline-level quantization.
For more information check out the bitsandbytes docs.
torchao is a PyTorch-native library for architecture optimization, offering quantization, sparsity, and custom data types, designed for compatibility with torch.compile and FSDP. Diffusers supports a wide range of torchao's exotic data types, enabling fine-grained control over model optimization.
int4_weight_only
int8_weight_only
float8_weight_only
Visual comparison of Flux-dev model outputs using torchao int4_weight_only (left), int8_weight_only (center), and float8_weight_only (right) quantization. (Click on an image to zoom)
torchao Precision
Memory after loading
Peak memory
Inference time
int4_weight_only
10.635 GB
14.654 GB
109 seconds
int8_weight_only
17.020 GB
21.482 GB
15 seconds
float8_weight_only
17.016 GB
21.488 GB
15 seconds
Example (Flux-dev with torchao INT8 weight-only):
@@
- from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+ from diffusers import TorchAoConfig as DiffusersTorchAoConfig
- from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
+ from transformers import TorchAoConfig as TransformersTorchAoConfig
@@
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
- "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
- "text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
+ "transformer": DiffusersTorchAoConfig("int8_weight_only"),
+ "text_encoder_2": TransformersTorchAoConfig("int8_weight_only"),
}
)Example (Flux-dev with torchao INT4 weight-only):
@@
- from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+ from diffusers import TorchAoConfig as DiffusersTorchAoConfig
- from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
+ from transformers import TorchAoConfig as TransformersTorchAoConfig
@@
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
- "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
- "text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
+ "transformer": DiffusersTorchAoConfig("int4_weight_only"),
+ "text_encoder_2": TransformersTorchAoConfig("int4_weight_only"),
}
)
pipe = FluxPipeline.from_pretrained(
model_id,
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16,
+ device_map="balanced"
)
- pipe.to("cuda")For more information check out the torchao docs.
Quanto is a quantization library integrated with the Hugging Face ecosystem via the optimum library.
INT4
INT8
FP8
Visual comparison of Flux-dev model outputs using Quanto INT4 (left), INT8 (center), and FP8 (right) quantization. (Click on an image to zoom)
quanto Precision
Memory after loading
Peak memory
Inference time
INT4
12.254 GB
16.139 GB
109 seconds
INT8
17.330 GB
21.814 GB
15 seconds
FP8
16.395 GB
20.898 GB
16 seconds
Example (Flux-dev with quanto INT8 weight-only):
@@
- from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
+ from diffusers import QuantoConfig as DiffusersQuantoConfig
- from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
+ from transformers import QuantoConfig as TransformersQuantoConfig
@@
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
- "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
- "text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
+ "transformer": DiffusersQuantoConfig(weights_dtype="int8"),
+ "text_encoder_2": TransformersQuantoConfig(weights_dtype="int8"),
}
)Note: At the time of writing, for float8 support with Quanto, you'll need
optimum-quanto<0.2.5and use quanto directly. We will be working on fixing this.
Example (Flux-dev with quanto FP8 weight-only)
import torch
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel
from optimum.quanto import freeze, qfloat8, quantize
model_id = "black-forest-labs/FLUX.1-dev"
text_encoder_2 = T5EncoderModel.from_pretrained(
model_id,
subfolder="text_encoder_2",
torch_dtype=torch.bfloat16,
)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)
transformer = AutoModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
)
quantize(transformer, weights=qfloat8)
freeze(transformer)
pipe = FluxPipeline.from_pretrained(
model_id,
transformer=transformer,
text_encoder_2=text_encoder_2,
torch_dtype=torch.bfloat16
).to("cuda")For more information check out the Quanto docs.
GGUF is a file format popular in the llama.cpp community for storing quantized models.
Q2_k
Q4_1
Q8_0
Visual comparison of Flux-dev model outputs using GGUF Q2_k (left), Q4_1 (center), and Q8_0 (right) quantization. (Click on an image to zoom)
GGUF Precision
Memory after loading
Peak memory
Inference time
Q2_k
13.264 GB
17.752 GB
26 seconds
Q4_1
16.838 GB
21.326 GB
23 seconds
Q8_0
21.502 GB
25.973 GB
15 seconds
Example (Flux-dev with GGUF Q4_1)
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig
model_id = "black-forest-labs/FLUX.1-dev"
# Path to a pre-quantized GGUF file
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/resolve/main/flux1-dev-Q4_1.gguf"
transformer = FluxTransformer2DModel.from_single_file(
ckpt_path,
quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
model_id,
transformer=transformer,
torch_dtype=torch.bfloat16,
)
pipe.to("cuda")For more information check out the GGUF docs.
enable_layerwise_casting)FP8 Layerwise Casting is a memory optimization technique. It works by storing the model's weights in the compact FP8 (8-bit floating point) format, which uses roughly half the memory of standard FP16 or BF16 precision. Just before a layer performs its calculations, its weights are dynamically cast up to a higher compute precision (like FP16/BF16). Immediately afterward, the weights are cast back down to FP8 for efficient storage. This approach works because the core computations retain high precision, and layers particularly sensitive to quantization (like normalization) are typically skipped. This technique can also be combined with group offloading for further memory savings.
FP8 (e4m3)
Visual output of Flux-dev model using FP8 Layerwise Casting (e4m3) quantization.
precision
Memory after loading
Peak memory
Inference time
FP8 (e4m3)
23.682 GB
28.451 GB
13 seconds
import torch
from diffusers import AutoModel, FluxPipeline
model_id = "black-forest-labs/FLUX.1-dev"
transformer = AutoModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16
)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")For more information check out the Layerwise casting docs.
Most of these quantization backends can be combined with the memory optimization techniques offered in Diffusers. Let's explore CPU offloading, group offloading, and torch.compile. You can learn more about these techniques in the Diffusers documentation.
Note: At the time of writing, bnb +
torch.compilealso works if bnb is installed from source and using pytorch nightly or with fullgraph=False.
Example (Flux-dev with BnB 4-bit + enable_model_cpu_offload):
import torch
from diffusers import FluxPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
model_id = "black-forest-labs/FLUX.1-dev"
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
"transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
"text_encoder_2": TransformersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
}
)
pipe = FluxPipeline.from_pretrained(
model_id,
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16
)
- pipe.to("cuda")
+ pipe.enable_model_cpu_offload()Model CPU Offloading (enable_model_cpu_offload): This method moves entire model components (like the UNet, text encoders, or VAE) between the CPU and GPU during the inference pipeline. It offers substantial VRAM savings and is generally faster than more granular offloading because it involves fewer, larger data transfers.
bnb + enable_model_cpu_offload:
Precision
Memory after loading
Peak memory
Inference time
4-bit
12.383 GB
12.383 GB
17 seconds
8-bit
19.182 GB
23.428 GB
27 seconds
Example (Flux-dev with fp8 layerwise casting + group offloading):
import torch
from diffusers import FluxPipeline, AutoModel
model_id = "black-forest-labs/FLUX.1-dev"
transformer = AutoModel.from_pretrained(
model_id,
subfolder="transformer",
torch_dtype=torch.bfloat16,
# device_map="cuda"
)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
+ transformer.enable_group_offload(onload_device=torch.device("cuda"), offload_device=torch.device("cpu"), offload_type="leaf_level", use_stream=True)
pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
- pipe.to("cuda")Group offloading (enable_group_offload for diffusers components or apply_group_offloading for generic torch.nn.Modules): It moves groups of internal model layers (like torch.nn.ModuleList or torch.nn.Sequential instances) to the CPU. This approach is typically more memory-efficient than full model offloading and faster than sequential offloading.
FP8 layerwise casting + group offloading:
precision
Memory after loading
Peak memory
Inference time
FP8 (e4m3)
9.264 GB
14.232 GB
58 seconds
Example (Flux-dev with torchao 4-bit + torch.compile):
import torch
from diffusers import FluxPipeline
from diffusers import TorchAoConfig as DiffusersTorchAoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import TorchAoConfig as TransformersTorchAoConfig
from torchao.quantization import Float8WeightOnlyConfig
model_id = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16
pipeline_quant_config = PipelineQuantizationConfig(
quant_mapping={
"transformer":DiffusersTorchAoConfig("int4_weight_only"),
"text_encoder_2": TransformersTorchAoConfig("int4_weight_only"),
}
)
pipe = FluxPipeline.from_pretrained(
model_id,
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16,
device_map="balanced"
)
+ pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)Note:
torch.compilecan introduce subtle numerical differences, leading to changes in image output
torch.compile: Another complementary approach is to accelerate the execution of your model with PyTorch 2.x’s torch.compile() feature. Compiling the model doesn’t directly lower memory, but it can significantly speed up inference. PyTorch 2.0’s compile (Torch Dynamo) works by tracing and optimizing the model graph ahead-of-time.
torchao + torch.compile:
torchao Precision
Memory after loading
Peak memory
Inference time
Compile Time
int4_weight_only
10.635 GB
15.238 GB
6 seconds
~285 seconds
int8_weight_only
17.020 GB
22.473 GB
8 seconds
~851 seconds
float8_weight_only
17.016 GB
22.115 GB
8 seconds
~545 seconds
Explore some benchmarking results here:
You can find bitsandbytes and torchao quantized models from this blog post in our Hugging Face collection: link to collection.
Here's a quick guide to choosing a quantization backend:
Easiest Memory Savings (NVIDIA): Start with bitsandbytes 4/8-bit. This can also be combined with torch.compile() for faster inference.
Prioritize Inference Speed: torchao, GGUF, and bitsandbytes can all be used with torch.compile() to potentially boost inference speed.
For Hardware Flexibility (CPU/MPS), FP8 Precision: Quanto can be a good option.
Simplicity (Hopper/Ada): Explore FP8 Layerwise Casting (enable_layerwise_casting).
For Using Existing GGUF Models: Use GGUF loading (from_single_file).
Curious about training with quantization? Stay tuned for a follow-up blog post on that topic! Update (June 19, 2025): it's here!
Quantization significantly lowers the barrier to entry for using large diffusion models. Experiment with these backends to find the best balance of memory, speed, and quality for your needs.
Acknowledgements: Thanks to Chunte for providing the thumbnail for this post.