Recently, numerous AGI applications have caught the eye of almost everyone on the internet. This post lists some influential papers and elucidates their key principles and technologies.
DiT
The authors explore a new class of diffusion models built on the transformer architecture, Diffusion Transformers (DiTs)1. Before their work, a U-Net backbone was the prevalent choice for generating the target image rather than a transformer architecture. The authors experiment with variants of the standard transformer block that incorporate conditioning via adaptive layer norm, cross-attention, and extra input tokens.

Figure 1. DiT architecture. The gray area of the diagram shows the block designs that turned out to be suboptimal. (Image source: Scalable Diffusion Models with Transformers)
In general, DiTs introduce three main components: patchify, DiT blocks, and a decoder. The entire diffusion process takes place in the latent space produced by a VAE encoder. DiTs are based on the Vision Transformer (ViT) architecture, which operates on sequences of patches.

Figure 2. Given a latent-space input, DiTs patchify it into a sequence of tokens. The sequence length depends on the patch size p. (Image source: Scalable Diffusion Models with Transformers)
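To make the token count concrete, here is a minimal patchify sketch in PyTorch (not the authors' code; the module name, channel count, and hidden size are illustrative assumptions): a strided convolution cuts the latent into non-overlapping $p\times p$ patches, so a latent of spatial size $H\times W$ yields $T = (H/p)\cdot(W/p)$ tokens.

```python
import torch
import torch.nn as nn

class Patchify(nn.Module):
    """Turn a latent (B, C, H, W) into a token sequence (B, T, d), with T = (H/p)*(W/p)."""
    def __init__(self, in_channels=4, patch_size=2, hidden_dim=768):
        super().__init__()
        # A strided convolution extracts non-overlapping p x p patches
        # and linearly embeds each one into a hidden_dim-dimensional token.
        self.proj = nn.Conv2d(in_channels, hidden_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, z):
        x = self.proj(z)                     # (B, d, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, T, d)

# Example: a 32x32x4 latent with patch size p=2 yields T = 16*16 = 256 tokens.
tokens = Patchify()(torch.randn(1, 4, 32, 32))
print(tokens.shape)  # torch.Size([1, 256, 768])
```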
As for the DiT block, the authors note that zero-initializing the final batch-norm scale factor $\gamma$ in each block accelerates large-scale training in the supervised learning setting, and that zero-initializing the final convolutional layer in each block prior to the residual connection is beneficial in diffusion U-Nets. Motivated by this, they design the adaLN-Zero block, as illustrated in Figure 1. A standard linear decoder is then applied to decode each token in the sequence into a tensor, from which the predicted noise and predicted covariance are obtained. DiTs are the first transformer-based backbone for diffusion models to outperform prior U-Net models, and they show a promising future when scaled to larger models and higher token counts.
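As a rough illustration of the adaLN-Zero idea (a simplified sketch, not the official DiT implementation; the dimensions and the conditioning MLP shape are assumptions), the conditioning vector regresses per-block shift, scale, and gate parameters, and the zero-initialized gates make every residual branch start as the identity:

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Simplified DiT block with adaLN-Zero conditioning (sketch only)."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        # The conditioning MLP regresses shift, scale, and gate for both sub-blocks.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[-1].weight)  # zero-init => each block starts as identity
        nn.init.zeros_(self.ada[-1].bias)

    def forward(self, x, c):
        # x: (B, T, d) token sequence; c: (B, d) conditioning (e.g. timestep + class embedding)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.mlp(h)
        return x

# Usage sketch: y = AdaLNZeroBlock()(torch.randn(2, 256, 768), torch.randn(2, 768))
```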
VDT
VDT (Lu et al.2) features transformer blocks with modularized temporal and spatial attention modules to exploit the rich spatio-temporal representations inherent in transformers, and it introduces a unified spatial-temporal mask modeling mechanism.
Figure 3. Main components and pipeline in VDT. (Image source: VDT: General-purpose Video Diffusion Transformers via Mask Modeling)

Figure 4. Incorporating conditional frame features into the layer normalization of transformer blocks to predict the next frame. (Image source: VDT: General-purpose Video Diffusion Transformers via Mask Modeling)
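The modularized attention can be pictured with the following sketch (an assumption-laden simplification, not the released VDT code): temporal attention lets each spatial location attend across frames, and spatial attention lets tokens within the same frame attend to each other.

```python
import torch
import torch.nn as nn

class SpatialTemporalBlock(nn.Module):
    """Sketch of a block with separate temporal and spatial attention modules."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, F, T, d) -- batch, frames, spatial tokens per frame, hidden dim
        B, F, T, d = x.shape
        # Temporal attention: each spatial location attends across the F frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * T, F, d)
        h = self.norm_t(xt)
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, T, F, d).permute(0, 2, 1, 3)
        # Spatial attention: tokens within the same frame attend to each other.
        xs = x.reshape(B * F, T, d)
        h = self.norm_s(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        return xs.reshape(B, F, T, d)
```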
Latte
In this work, the authors present a novel latent diffusion transformer (Latte3), which adopts a video Transformer as the backbone. Latte uses a pre-trained variational autoencoder to encode the input video into latent-space features, from which tokens are extracted. A series of Transformer blocks is then applied to encode these tokens. Because there are inherent disparities between spatial and temporal information, and because a large number of tokens is extracted from the input video, the authors design four Transformer-based model variants from the perspective of decomposing the spatial and temporal dimensions of the input video.

Figure 5. Four model variants are designed to capture spatio-temporal information in videos. Each block depicted in light orange represents a Transformer block; the standard Transformer block is employed in (a) and (b). (Image source: Latte: Latent Diffusion Transformer for Video Generation)
Suppose there is a video clip in the latent space $V_{L} \in \mathbb{R}^{F\times{H}\times{W}\times{C}}$, where $F, H, W, C$ denote the number of frames, the height, the width, and the number of channels of the video frames in the latent space, respectively. $V_L$ is then translated into a sequence of tokens, denoted $\hat{z}\in \mathbb{R}^{n_f\times{n_h}\times{n_w}\times{d}}$. The input to the model is $z = \hat{z} + p$, where $p$ is the spatio-temporal position embedding. For the spatial Transformer block, the authors reshape $z$ into $z_s \in \mathbb{R}^{n_f\times{t}\times{d}}$ (where $t = n_h \times{n_w}$), and for the temporal Transformer block they reshape $z_s$ into $z_t \in \mathbb{R}^{t\times{n_f}\times{d}}$ as the input.
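A minimal shape-level sketch of these reshapes (the sizes are illustrative, not values from the paper):

```python
import torch

n_f, n_h, n_w, d = 16, 16, 16, 768   # illustrative token-grid sizes
t = n_h * n_w
z = torch.randn(n_f, n_h, n_w, d)    # token grid z (position embeddings already added)

# Input to a spatial Transformer block: n_f sequences of t spatial tokens each.
z_s = z.reshape(n_f, t, d)           # (n_f, t, d)

# Input to a temporal Transformer block: t sequences of n_f temporal tokens each.
z_t = z_s.transpose(0, 1)            # (t, n_f, d)
```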
To embed a video clip, the authors also explore two methods: uniform frame patch embedding and compression frame patch embedding. In the first method, $n_f$, $n_h$, $n_w$ correspond to $F$, $\frac{H}{h}$, and $\frac{W}{w}$, since non-overlapping image patches are extracted from every video frame. In the second method, $n_f$ equals $\frac{F}{s}$ instead; in short, 'compression' means that several frames are compressed together, with each patch spanning $s$ consecutive frames along the temporal dimension. A sketch of both embeddings follows Figure 6.

Figure 6. (a) Uniform frame patch embedding. (b) Compression frame patch embedding. (Image source: Latte: Latent Diffusion Transformer for Video Generation)
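The two embeddings can be sketched with strided convolutions (a hypothetical implementation consistent with the description above, not Latte's code; the compression variant is assumed to use patches that span $s$ frames, which yields $n_f = F/s$):

```python
import torch
import torch.nn as nn

F_, H, W, C, d = 16, 32, 32, 4, 768  # latent video: F frames of H x W x C (illustrative)
h = w = 2                            # spatial patch size
s = 2                                # temporal compression factor
video = torch.randn(1, C, F_, H, W)  # (B, C, F, H, W)

# Uniform frame patch embedding: 2D patches from every frame, so n_f = F.
uniform = nn.Conv3d(C, d, kernel_size=(1, h, w), stride=(1, h, w))
tok_u = uniform(video)               # (1, d, F, H/h, W/w)
print(tok_u.shape[2:])               # n_f = 16, n_h = 16, n_w = 16

# Compression frame patch embedding: each patch also spans s frames, so n_f = F / s.
compress = nn.Conv3d(C, d, kernel_size=(s, h, w), stride=(s, h, w))
tok_c = compress(video)              # (1, d, F/s, H/h, W/w)
print(tok_c.shape[2:])               # n_f = 8, n_h = 16, n_w = 16
```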

Figure 7. (Image source: Latte: Latent Diffusion Transformer for Video Generation)
Text2Video-Zero
Text2Video-Zero (Khachatryan et al.4) is entirely training-free and does not require massive computational power or dozens of GPUs.
Some Open Source Projects
- CogVideoX
- Mochi 1
- LTX Video
- Pyramid Flow
Model Acceleration
1. Peebles & Xie, Scalable Diffusion Models with Transformers, ICCV 2023. ↩︎
2. Lu et al., VDT: General-purpose Video Diffusion Transformers via Mask Modeling, ICLR 2024. ↩︎
3. Ma et al., Latte: Latent Diffusion Transformer for Video Generation, arXiv preprint, 2024. ↩︎
4. Khachatryan et al., Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators, 2023. ↩︎