Emerging Trends and Systems Implications of Multi-Modal AI Models

Diffusion and Transformer text-to-image models
Source: https://arxiv.org/abs/2312.14385

Introduction

As generative AI continues to advance, models are evolving beyond text generation to include image and video synthesis capabilities. However, these multi-modal models come with unique systems-level challenges compared to traditional language models.

A new paper from researchers at Meta and Harvard University provides the first in-depth analysis characterizing the system performance and implications of text-to-image (TTI) and text-to-video (TTV) generative AI models. Their analysis compares two main model architectures – Diffusion-based and Transformer-based – across eight representative models on dimensions like latency, computational intensity, and component breakdown.

The researchers make several key observations about the distinct properties of TTI/TTV models:

  • High arithmetic intensity: Diffusion TTI models exhibit up to 100x higher intensity than language models due to parameter reuse in the iterative denoising process. This makes them more compute-bound.
  • Convolution as a bottleneck: After optimizations like Flash Attention, convolution accounts for up to 44% of execution time in Diffusion TTI models, shifting the bottleneck from attention to convolution.
  • Variable sequence lengths: Unlike LLMs, sequence lengths vary up to 4x over Diffusion model inference, impacting computational intensity.
  • Scaling constraints: Memory scales quadratically with image size, posing challenges for developing efficient systems for high-resolution image generation.
  • Temporal attention bottleneck: Temporal attention in TTV models runs 2x slower than spatial attention despite requiring far fewer FLOPs.

This initial performance characterization demonstrates the need to tailor system optimizations and hardware for emerging multi-modal AI workloads based on their distinct properties.
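To build intuition for the first observation above, here is a back-of-the-envelope sketch (my own illustration, not a calculation from the paper) of why parameter reuse raises arithmetic intensity: a dense layer's FLOPs grow with the number of positions it processes, while the bytes of weights it must read stay fixed.

```python
# Back-of-the-envelope arithmetic intensity for a dense layer: FLOPs per byte
# of weights read from memory. All numbers below are illustrative assumptions.

def weight_reuse_intensity(positions: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs ~= 2 * params * positions, weight traffic ~= params * bytes_per_param,
    so intensity ~= 2 * positions / bytes_per_param (independent of layer size)."""
    return 2 * positions / bytes_per_param

# LLM decode step: weights are read to produce a single new token.
print("LLM decode:      ~%.0f FLOPs/byte" % weight_reuse_intensity(positions=1))

# Diffusion UNet step: the same weights are applied across every spatial
# position of the latent (e.g. a 64x64 latent) and reused again on each of
# the iterative denoising steps, keeping the kernels compute-bound.
print("Diffusion layer: ~%.0f FLOPs/byte" % weight_reuse_intensity(positions=64 * 64))
```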

Taxonomy of Text-to-Image and Text-to-Video Models

“Detail on Diffusion and Transformer models. Note that Diffusion models consist of Resnet blocks, Self-Attention blocks, and Cross-Attention blocks while Transformer-based models contain SelfAttention, Cross-Attention, and FeedForward.”

The researchers characterize and compare TTI and TTV models along two main architecture categories: diffusion models and transformer models.

Text-to-Image Diffusion Models
These include pixel-based and latent-based diffusion models. As seen in the figure above, both contain a UNet structure with downsampling and upsampling blocks, Resnet blocks, self-attention blocks attending to the image itself, and cross-attention blocks attending to the text embedding.

As seen in the main image at the top of this post, the key difference between pixel-based and latent diffusion models is that pixel-based models operate directly at the pixel level, while latent models work on a compressed embedding representation that is more efficient to compute but requires an additional decoding step.

Latent models spend less time on convolution than pixel-based models since they do not require additional super-resolution networks. But this comes at the cost of a VAE or GAN decoder to convert the latent representation back to pixel space.
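For concreteness, here is a minimal PyTorch-style sketch of one such UNet block, with a ResNet sub-block, self-attention over the image's own positions, and cross-attention to the text embedding. The layer sizes are arbitrary assumptions; real models such as Stable Diffusion add time-step conditioning and many more details.

```python
import torch
import torch.nn as nn

class DiffusionUNetBlock(nn.Module):
    """Illustrative only: ResNet block + self-attention + cross-attention."""

    def __init__(self, channels: int = 64, text_dim: int = 128, heads: int = 8):
        super().__init__()
        # ResNet sub-block: GroupNorm + 3x3 convolutions with a residual path
        self.norm1 = nn.GroupNorm(8, channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(8, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # Self-attention over the flattened spatial positions of the image
        self.self_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Cross-attention: image queries attend to the text embedding
        self.cross_attn = nn.MultiheadAttention(
            channels, heads, kdim=text_dim, vdim=text_dim, batch_first=True
        )

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) latent or pixel features; text: (B, T, text_dim)
        h = self.conv1(torch.relu(self.norm1(x)))
        h = self.conv2(torch.relu(self.norm2(h)))
        x = x + h                                        # residual connection
        b, c, hgt, wid = x.shape
        seq = x.flatten(2).transpose(1, 2)               # (B, H*W, C)
        seq = seq + self.self_attn(seq, seq, seq, need_weights=False)[0]
        seq = seq + self.cross_attn(seq, text, text, need_weights=False)[0]
        return seq.transpose(1, 2).reshape(b, c, hgt, wid)

block = DiffusionUNetBlock()
out = block(torch.randn(1, 64, 32, 32), torch.randn(1, 77, 128))
print(out.shape)   # (1, 64, 32, 32)
```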

Text-to-Image Transformer Models
These generate images sequentially by conditioning the next pixel/patch prediction on previous ones. They contain a typical Transformer architecture with self-attention and feedforward layers.

The generated image tokens are then decoded into pixels via a separate decoder. Some models use parallel decoding for faster inference.

Transformer TTI models generally require less compute but more memory than Diffusion models, and they also tend to have lower latencies.
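A minimal sketch of this sequential generation loop follows, using a toy transformer and made-up sizes purely for illustration; real systems use far larger models and hand the tokens to a learned image decoder (e.g. a VQ-style decoder).

```python
import torch
import torch.nn as nn

class ToyImageTokenTransformer(nn.Module):
    """Stand-in for a transformer that predicts the next image token (illustrative)."""

    def __init__(self, vocab: int = 1024, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Prepend the text conditioning, then run causal self-attention.
        x = torch.cat([text, self.embed(tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.blocks(x, mask=mask))

@torch.no_grad()
def generate_image_tokens(model, text, num_tokens: int = 16) -> torch.Tensor:
    tokens = torch.empty(1, 0, dtype=torch.long)
    for _ in range(num_tokens):
        logits = model(tokens, text)                     # (1, seq_len, vocab)
        next_token = logits[:, -1].softmax(-1).multinomial(1)
        tokens = torch.cat([tokens, next_token], dim=1)  # autoregressive: one token at a time
    return tokens                                        # handed to a separate image decoder

model = ToyImageTokenTransformer()
tokens = generate_image_tokens(model, text=torch.randn(1, 8, 128))
print(tokens.shape)   # (1, 16) image tokens; a real model would emit e.g. a 32x32 grid of them
```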

Text-to-Video Models
Text-to-video (TTV) models build on TTI model architectures, often using a pretrained TTI model to generate individual frames. These frames are then connected via additional temporal attention or convolutional layers to ensure temporal coherence.

For example, temporal attention layers may be inserted after spatial attention layers in a diffusion TTV model. Or convolutional layers may substitute some attention layers to reduce memory costs.

Generating coherent video introduces unique system bottlenecks compared to TTI models, especially as the number of frames and the resolution increase. The temporal dimension exacerbates issues like the attention bottleneck.

The table below nicely summarizes the taxonomy of TTI/TTV models:

Taxonomy of Text-to-Image Models

Analyzing System Bottlenecks and Optimization Opportunities

The researchers conduct detailed performance characterization across the TTI/TTV model suite to reveal system bottlenecks and optimization potential.

Operator Time Breakdown
This key figure from the paper shows the operator time breakdown for different models, before and after optimizing for attention (i.e., applying Flash Attention):

"Operator Breakdown Across TTI and TTV Models With Baseline Attention. First bar of each model shows model execution time with Baseline Attention, while second bar shows corresponding normalized execution time with Flash Attention"

Key observations:

  • Diffusion models have more diverse operators like GroupNorm
  • Attention accounts for ~41% of baseline execution time on average
  • Convolution is up to 36% of execution time in Diffusion models

Applying the Flash Attention V2 optimization alleviates the attention bottleneck. But the results demonstrate how it shifts the bottleneck in diffusion models to other operators like convolution, which grows to 44% of execution time.

So while Flash Attention provides a 1.1-2.5x kernel speedup for Diffusion models, further optimizations need to target the convolution bottleneck. Additionally, Transformer-based TTI models still spend 37-45% of their time in attention even after Flash Attention.
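The paper's breakdown comes from profiling real models, but the general recipe can be sketched with PyTorch's profiler. Here, PyTorch's scaled_dot_product_attention serves as a stand-in for a FlashAttention-style fused kernel (it can dispatch to one on supported GPUs); the tensor sizes are made up.

```python
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

def baseline_attention(q, k, v):
    # Baseline attention: materializes the full (seq_len x seq_len) score matrix.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

q = k = v = torch.randn(4, 8, 1024, 64)      # (batch, heads, seq_len, head_dim)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    baseline_attention(q, k, v)              # baseline attention
    F.scaled_dot_product_attention(q, k, v)  # fused, FlashAttention-style kernel
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```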

Prefill/Decode Phase Analysis
The researchers also analyze how Diffusion and Transformer TTI models, and the speedups they see, map onto the two distinct phases of traditional LLM inference: prefill and decode.

Prefill Phase: This refers to the initial processing of the text prompt that is input to the LLM. It allows for greater parallelization because the model is processing a large input text sequence all at once. This creates large intermediate matrices in the self-attention calculation, which benefits more from optimizations like Flash Attention that reduce memory access.

Decode Phase: This phase generates the output tokens one-by-one in an autoregressive manner. So only a single output token is processed at a time based on the previously generated tokens. This results in smaller intermediate matrices during the attention calculation, reducing the benefit of memory access optimizations.
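The difference is easiest to see in the shapes of the intermediate attention matrices. The sketch below uses made-up dimensions and PyTorch's fused attention call purely for illustration.

```python
import torch
import torch.nn.functional as F

heads, head_dim, prompt_len = 16, 64, 512    # illustrative sizes

# Prefill: the whole prompt is processed at once, so the score matrix is
# (prompt_len x prompt_len) -- a large intermediate with lots of data reuse.
q = k = v = torch.randn(1, heads, prompt_len, head_dim)
F.scaled_dot_product_attention(q, k, v, is_causal=True)
print("prefill score matrix per head:", (prompt_len, prompt_len))

# Decode: one new query attends to the cached keys/values, so the score
# matrix is only (1 x prompt_len) -- a small intermediate with little reuse.
q_step = torch.randn(1, heads, 1, head_dim)
F.scaled_dot_product_attention(q_step, k, v)
print("decode score matrix per head: ", (1, prompt_len))
```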

Now in text-to-image models:

  • Diffusion TTI models generate the entire image in parallel rather than sequentially. So all the pixel values are produced at the same time, conditioned on the text embedding. This resembles the prefill phase in LLMs – large intermediate activations that can be optimized for reduced memory access. (This explains the greater speedup in execution time observed after applying Flash Attention.)
  • Transformer TTI models generate an image sequentially, one pixel/patch at a time, in an autoregressive process, so each pixel/patch conditions on the previous ones. This resembles the decode phase in LLMs. The intermediate matrices are smaller, reducing the opportunity for memory-access optimizations. (This explains the limited speedup after applying Flash Attention.)

In essence, Diffusion TTI models benefit more from things like Flash Attention because their parallel generation creates potential for larger speedups. The sequential transformer TTI models resemble LLM decoding, which has less room for optimization.

This suggests diffusion and transformer models may need customized optimizations rather than a one-size-fits-all approach.

Dealing with Variable Sequence Lengths

Since traditional LLM paradigms like prefill and decode do not map cleanly onto Diffusion-based TTI and TTV systems, the researchers investigate how other LLM concepts, such as sequence length, translate into the context of TTI/TTV models, in the hope of identifying more efficient system designs.

Unlike language models, which have a fixed context length, the researchers find that sequence length varies significantly over the course of diffusion model inference.

This sequence length depends on the image size and on the UNet's downsampling/upsampling stages traversed during each diffusion step. As this figure shows, sequence length during inference varies by up to 4x in diffusion models:

"Sequence length profiling across various models in model suite. Shown as sequence length over course of time. The variation in sequence length over time for diffusion models pose unique system constraints for these UNet-based models. Note sequence length of Stable Diffusion model actually goes up to 4096, but not shown here for plotting purposes."
And the distribution changes with higher resolutions:
"Frequency distribution of sequence lengths over the course of inference for Stable Diffusion model. Note the significance of the overlapping bars, where the value distribution shifts right for increasing image size and corresponds to the input/output size."

The authors also found that after techniques such as Flash Attention are applied, convolution actually has a larger scaling dependence on image size than attention. That is, with Flash Attention in place, convolution execution time grows at a faster rate than attention execution time as sequence length/image size increases.

"Illustration of how time spent on Attention versus Convolution scales as image size increases for Stable Diffusion. Note that before Flash Attention, Attention execution time scales at a faster rate than Convolution execution time with increasing sequence length. However, after Flash Attention is applied, Convolution becomes the limiting factor."

Key implications:

  • Memory scales quadratically with sequence length (seq_len²), i.e., roughly with the fourth power of image height/width – so higher resolutions see much higher requirements (a rough estimate follows below)
  • Computational intensity changes over inference – opportunities to optimize staging
  • Could stagger diffusion steps to maximize memory bandwidth

Overall, the variability poses challenges for system optimization. But being aware of common sequence lengths can help tailor hardware and schedules accordingly.
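To make the memory-scaling point above concrete, here is a rough estimate (my own, not from the paper) of the score-matrix footprint of baseline attention, assuming fp16 activations and 8 heads; it grows quadratically in seq_len and therefore roughly with the fourth power of resolution.

```python
# Baseline attention materializes a (seq_len x seq_len) score matrix per head.
bytes_per_el, heads = 2, 8                  # assumed fp16 activations, 8 heads
for side in (32, 64, 128):                  # feature-map resolution
    seq_len = side * side                   # seq_len = H * W
    score_bytes = heads * seq_len ** 2 * bytes_per_el
    print(f"{side}x{side}: seq_len={seq_len:5d} -> ~{score_bytes / 2**20:6.0f} MiB of scores")
```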

Analyzing the Temporal Attention Bottleneck

Text-to-video (TTV) models introduce a temporal dimension to connect generated frames over time. This adds a new performance bottleneck centered around temporal attention.

Temporal vs Spatial Attention
As illustrated in the figure below, the sequence length is determined by different factors in spatial versus temporal attention:

“Tensor dimensions are rearranged to perform Spatial versus Temporal Attention. As shown, sequence length is proportional to image size in Spatial Attention and number of frames in Temporal Attention.”
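A minimal sketch of that rearrangement for a video activation of shape (B, C, F, H, W) follows; the exact layout used in Make-A-Video may differ, and the sizes here are assumptions.

```python
import torch
import torch.nn.functional as F

B, C, Frames, H, W = 1, 64, 16, 32, 32
heads, head_dim = 8, C // 8
x = torch.randn(B, C, Frames, H, W)

# Spatial attention: each frame attends over its own H*W positions.
spatial = x.permute(0, 2, 3, 4, 1).reshape(B * Frames, H * W, heads, head_dim).transpose(1, 2)
print("spatial attention seq_len: ", spatial.shape[2])    # H * W = 1024

# Temporal attention: each spatial position attends over the F frames.
temporal = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, Frames, heads, head_dim).transpose(1, 2)
print("temporal attention seq_len:", temporal.shape[2])   # F = 16

# Both layouts can then go through the same fused attention kernel:
F.scaled_dot_product_attention(spatial, spatial, spatial)
F.scaled_dot_product_attention(temporal, temporal, temporal)
```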

Key observations from analyzing the Make-A-Video TTV model:

  • Temporal attention takes 2x longer than spatial attention per layer, despite requiring 9x fewer FLOPs
  • Cache hit rates are 10x lower, likely causing higher memory latency

Implications of More Frames and Resolution
This benchmark shows how temporal attention FLOPs grow quadratically with the number of frames, while spatial attention FLOPs scale only linearly:

“Benchmark illustrating how Temporal Attention FLOPs scale quadratically with number of frames as opposed to Spatial Attention, which scales linearly.”

And increasing the resolution pushes the crossover point between the two out to higher frame counts. So as TTV models produce longer and higher-resolution video, temporal attention will become the dominant performance bottleneck without further optimization.
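A rough FLOP count (illustrative constants, not the paper's numbers) makes the scaling difference and the resolution-dependent crossover explicit.

```python
H = W = 32          # assumed spatial size per frame
C = 64              # assumed channel/feature dimension
for frames in (8, 16, 32, 64):
    spatial_flops = frames * 2 * (H * W) ** 2 * C     # per frame: (H*W)^2 attention
    temporal_flops = (H * W) * 2 * frames ** 2 * C    # per position: F^2 attention
    print(f"{frames:3d} frames: spatial ~{spatial_flops / 1e9:6.1f} GFLOPs, "
          f"temporal ~{temporal_flops / 1e9:6.1f} GFLOPs")
# Doubling the frame count doubles spatial-attention FLOPs but quadruples
# temporal-attention FLOPs; raising H and W pushes the crossover point out
# to larger frame counts.
```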

Thus, optimizing the temporal dimension is critical for advancing text-to-video generation models.

Conclusion

This paper provides the first in-depth performance characterization of multi-modal text-to-image and text-to-video models. The analysis reveals several unique properties and bottlenecks compared to traditional language models:

  • High arithmetic intensity for Diffusion TTI models
  • Shifting system bottlenecks after optimizations like Flash Attention
  • Different prefill/decode characteristics and optimization potential
  • Variable sequence lengths during diffusion model inference
  • Scaling constraints around image size and resolutions
  • Temporal dimension exacerbating attention bottleneck

These observations demonstrate the need to customize system hardware and optimizations for emerging workloads like text-to-image and text-to-video generation rather than relying on techniques tailored for language models.

As models advance to higher dimensional representations from text to image, video, and potentially 3D simulations, new performance challenges will arise from capturing spatial, temporal, and interactive dynamics. Hardware and algorithms will need to coevolve with larger model architectures and data representations. The initial analysis from this paper lays the foundation for future work around efficient and scalable multi-modal AI.

Thank you for reading! 

If you enjoyed this post and would like to stay up to date then please consider following me on Medium

You can also find me on LinkedIn, Twitter, Instagram (@musicalchemist), GitHub, or YouTube.
