
The Art of Balance: Magnitude Preservation in Diffusion Transformers

We extend magnitude-preserving techniques from the EDM2 architecture to Diffusion Transformers (DiT), ensuring stable training by maintaining activation magnitudes and controlling weight growth throughout the architecture. Additionally, we incorporate power function-based exponential moving averages, enabling flexible post-training reconstruction with adjustable decay parameters. Experiments on DiT-XS/2 and DiT-S/4 show significant improvements in FID-10K, highlighting the effectiveness of our approach. Despite increased computational overhead, our methods offer a scalable and modular solution for transformer-based diffusion models.
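
As a rough illustration of the power-function EMA referred to above, the sketch below shows one update step in PyTorch. It is a simplified reading of Karras et al. (2024), not the code used in this repository; the function name and the default value of gamma are placeholders chosen for illustration.

import torch

def power_ema_update(ema_model, model, step, gamma=6.94):
    # One step of the power-function EMA (illustrative sketch after Karras et al., 2024).
    # Unlike a constant-decay EMA, the decay grows with the 0-based step count, so early,
    # noisy weights are forgotten and the averaging window widens with training length.
    # gamma controls the relative width of the averaging profile (value chosen for illustration).
    beta = (1.0 - 1.0 / (step + 1)) ** (gamma + 1)
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.lerp_(p, 1.0 - beta)  # p_ema = beta * p_ema + (1 - beta) * p

Roughly speaking, keeping such EMAs at a few different gamma values during training is what allows other decay profiles to be reconstructed after training, as mentioned above.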

Fig 1. DiT-S/4 samples without (left) and with (right) magnitude-preserving layers.

This project builds upon key concepts from the following research papers:

  • Peebles & Xie (2023) explore the application of transformer architectures to diffusion models, achieving state-of-the-art performance on various generation tasks;
  • Karras et al. (2024) introduce the idea of preserving the magnitude of features during the diffusion process, enhancing the stability and quality of generated outputs.

Training

python train.py --data-path /path/to/data --results-dir /path/to/results --model DiT-S/2 --num-steps 400_000 <map feature flags>

Magnitude Preservation Flags

Customize the training process by enabling any combination of the following flags; a short illustrative sketch of these components and an example invocation follow the list:

  • --use-cosine-attention - Controls weight growth in attention layers.
  • --use-weight-normalization - Applies magnitude preservation in linear layers.
  • --use-forced-weight-normalization - Controls weight growth in linear layers.
  • --use-mp-residual - Enables magnitude preservation in residual connections.
  • --use-mp-silu - Uses a magnitude-preserving version of SiLU nonlinearity.
  • --use-no-layernorm - Disables transformer layer normalization.
  • --use-mp-pos-enc - Activates magnitude-preserving positional encoding.
  • --use-mp-embedding - Uses magnitude-preserving embeddings.
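
The sketch below illustrates, in simplified form, the kind of building blocks some of these flags refer to: a magnitude-preserving SiLU, a magnitude-preserving sum for residual connections, and a linear layer with (forced) weight normalization. It follows Karras et al. (2024); the names and details are illustrative and may differ from the code in this repository.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def mp_silu(x):
    # SiLU rescaled so that a unit-variance Gaussian input yields an approximately
    # unit-variance output (0.596 is the empirical std of silu(x) for x ~ N(0, 1)).
    return F.silu(x) / 0.596

def mp_sum(a, b, t=0.5):
    # Magnitude-preserving residual connection: blend a and b and rescale so the
    # result keeps unit variance when a and b are uncorrelated and unit variance.
    return a.lerp(b, t) / math.sqrt((1 - t) ** 2 + t ** 2)

class MPLinear(nn.Module):
    # Linear layer whose weight rows are normalized to unit L2 norm in every forward
    # pass, so unit-variance inputs stay approximately unit variance. During training
    # the stored weight is re-normalized under no_grad ("forced weight normalization"),
    # preventing unbounded weight growth.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x):
        if self.training:
            with torch.no_grad():
                self.weight.copy_(self._normalize(self.weight))
        return F.linear(x, self._normalize(self.weight))

    @staticmethod
    def _normalize(w, eps=1e-4):
        # Unit L2 norm along the fan-in dimension, computed in float32 for stability.
        norm = w.float().norm(dim=1, keepdim=True).clamp(min=eps)
        return (w.float() / norm).to(w.dtype)

For example, a training run with all magnitude-preserving components enabled could look like:

python train.py --data-path /path/to/data --results-dir /path/to/results --model DiT-S/2 --num-steps 400_000 --use-cosine-attention --use-weight-normalization --use-forced-weight-normalization --use-mp-residual --use-mp-silu --use-no-layernorm --use-mp-pos-enc --use-mp-embedding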

Sampling

python sample.py --result-dir /path/to/results/<dir> --class-label <class label>
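
For example, with a hypothetical results directory and class index (both placeholders, substitute your own):

python sample.py --result-dir /path/to/results/my_run --class-label 0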

Citation

@misc{bill_jensen_2025,
    title={The Art of Balance: Magnitude Preservation in Diffusion Transformers},
    author={Bill, Eric Tillmann and Jensen, Cristian Perez},
    howpublished={\url{https://github.com/ericbill21/map-dit}},
    year={2025}
}
