This AI Paper from Google Presents a Set of Optimizations that Collectively Attain Groundbreaking Latency Figures for Executing Large Diffusion Models on Various Devices


As large diffusion models for image generation have become commonplace, model size and inference workloads have grown dramatically. Optimizing on-device ML inference in mobile environments is a delicate balancing act because of resource limitations. Running inference of large diffusion models (LDMs) on device poses even larger hurdles, given these models’ considerable memory requirements and computational demands, especially in light of the need for cost-effectiveness and user privacy.

The rapid development and widespread adoption of foundation models have transformed artificial intelligence. Large diffusion models have attracted particular attention for their versatility and capacity to produce photorealistic images. Deploying these models locally on the user’s device brings reduced server costs, offline capability, and enhanced user privacy, among other advantages. However, typical large diffusion models have over 1 billion parameters, which poses difficulties given the limited computational and memory resources on devices. Researchers from Google present a series of implementation optimizations for large diffusion models that achieve the fastest inference latency reported to date on GPU-equipped mobile devices. These updates improve the overall user experience across various devices and widen the scope of usage for generative AI.

On-device model inference acceleration has recently attracted much interest because of its many benefits over server-based methods, such as lower latency, increased privacy, and greater scalability. The complexity of the softmax operation, used frequently in deep learning, has motivated several distinct acceleration strategies. Winograd convolution was developed to improve the efficiency of convolutional computation by minimizing the number of multiplications required, which is especially helpful for graphics processing units (GPUs).
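To make the Winograd idea concrete, the classic F(2,3) tile computes two outputs of a 3-tap 1-D convolution using 4 multiplications instead of the 6 a direct sliding dot product needs. The numpy sketch below is a textbook illustration of the transform, not code from the paper:

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 1-D convolution of a 4-element
    input tile d with a 3-tap filter g, using only 4 multiplications."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

# Direct sliding dot product (correlation, as in conv layers) for comparison.
d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(np.allclose(winograd_f23(d, g), direct))  # True
```

In a 2-D convolution layer the same trick is applied per tile, trading multiplications for cheap additions, which is why it suits ALU-limited GPUs.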

The widespread success and adoption of the Transformer architecture have sparked research into speeding up the attention mechanism. Reformer uses a sparse approximation to reduce computational cost, while other works use low-rank or combined approximation techniques. FlashAttention, by contrast, is an exact attention algorithm that takes the hardware configuration into account to achieve better performance.
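For reference, the operation all of these works accelerate is standard scaled dot-product attention, softmax(QKᵀ/√d)V. A plain numpy sketch of the baseline (single head, no masking):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (8, 16)
```

The N×N `scores` matrix is what makes naive attention memory-hungry; the approximation and I/O-aware methods above all target it in different ways.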

The primary focus is the challenge of generating images from text descriptions using massive diffusion models. Although this explanation centers on how the proposed improvements apply to the Stable Diffusion architecture, the optimizations transfer readily to other large diffusion models. Text-to-image inference requires additional conditioning on the desired textual description to steer the reverse diffusion process.
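One common way that text conditioning steers the reverse diffusion process is classifier-free guidance, as typically used with Stable Diffusion. The sketch below is illustrative, not from the paper: `denoiser` is a hypothetical stand-in for the LDM’s denoising U-Net, and the guidance scale of 7.5 is a conventional Stable Diffusion default:

```python
import numpy as np

def guided_noise_prediction(denoiser, x_t, t, text_emb, null_emb, scale=7.5):
    """Classifier-free guidance: blend the conditional and unconditional
    noise predictions to push each reverse-diffusion step toward the
    text prompt. `denoiser` stands in for the LDM's U-Net."""
    eps_uncond = denoiser(x_t, t, null_emb)   # prediction without the prompt
    eps_cond = denoiser(x_t, t, text_emb)     # prediction with the prompt
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy denoiser: pretends the conditioning vector shifts the prediction.
toy = lambda x, t, c: x * 0.1 + c.mean()
x_t = np.ones((4, 4))
pred = guided_noise_prediction(toy, x_t, t=10,
                               text_emb=np.array([1.0]),
                               null_emb=np.array([0.0]))
print(pred[0, 0])  # 7.6
```

At each denoising step this guided prediction replaces the plain noise estimate, which is how the written description shapes the generated image.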

The attention blocks used extensively by the denoiser model in the LDM present a prime target for optimization. Attention blocks let the model zero in on relevant information by assigning higher weights to the relevant parts of the input. The attention modules can be optimized in several ways; the researchers apply only one of the two optimizations detailed below, whichever yields the best results.

The first optimization, dubbed partially fused softmax, reduces the amount of memory read and written during the attention module’s softmax by merging it with the preceding matrix multiplication. The second uses FlashAttention, an I/O-aware exact attention algorithm. This approach reduces the number of high-bandwidth memory accesses from the GPU, making it an excellent choice for applications with restricted memory bandwidth. However, it requires a large number of registers, and the researchers found that it works only for attention matrices of certain sizes given the available SRAM. They therefore use this method only on a subset of GPUs, for attention matrices of a particular size.
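The idea underlying both optimizations can be illustrated with an online (streaming) softmax: by carrying only a running maximum and running sum across fixed-size blocks, the normalization can be fused with the score computation instead of materializing and re-reading the full row of attention scores. The numpy sketch below shows the numerics only, not the actual GPU kernels:

```python
import numpy as np

def online_softmax(x, block=4):
    """Softmax over a vector in one streaming pass over fixed-size blocks,
    keeping only a running max `m` and running sum `s` -- the rescaling
    trick that FlashAttention-style kernels use to avoid storing the
    whole row of scores in high-bandwidth memory."""
    m = -np.inf   # running max seen so far
    s = 0.0       # running sum of exp(x - m)
    for i in range(0, len(x), block):
        b = x[i:i + block]
        m_new = max(m, b.max())
        # Rescale the old partial sum to the new max, then add this block.
        s = s * np.exp(m - m_new) + np.exp(b - m_new).sum()
        m = m_new
    return np.exp(x - m) / s

x = np.random.default_rng(1).standard_normal(10)
ref = np.exp(x - x.max()); ref /= ref.sum()
print(np.allclose(online_softmax(x), ref))  # True
```

A real kernel applies the same rescaling to partial output accumulators as well, so each block of scores is read from fast on-chip memory exactly once.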

Moreover, the team found that the fusion windows for layers and units commonly used in LDMs need to be substantially larger on a mobile GPU than what commercially available GPU-accelerated ML inference engines currently provide. Given the limitations of standard fusion rules, they devised custom implementations capable of running a wider variety of neural operators, focusing on two in particular: the Gaussian Error Linear Unit (GELU) and the group normalization layer.
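Both operators are simple to state even though their fused mobile-GPU kernels are not shown in this article. The numpy sketch below gives their reference semantics; the tanh form of GELU is one commonly used, fusion-friendly approximation:

```python
import numpy as np

def gelu_tanh(x):
    """Tanh approximation of GELU, a common fusable form of the unit."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x**3)))

def group_norm(x, num_groups, eps=1e-5):
    """Group normalization of an (N, C, H, W) tensor: each group of
    C // num_groups channels is normalized by its own mean and variance."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

x = np.random.default_rng(2).standard_normal((2, 8, 4, 4))
y = group_norm(x, num_groups=4)
print(y.shape)  # (2, 8, 4, 4)
```

Expressed this way, each operator is a chain of cheap elementwise and reduction steps, which is exactly what makes a single fused kernel attractive on bandwidth-limited mobile GPUs.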

Model file size, massive runtime memory needs, and prolonged inference latency have all proven to be significant obstacles to running ML inference of big models on the device itself. The researchers identified memory bandwidth usage as the main constraint, and therefore focused on improving memory bandwidth utilization while maintaining a healthy ALU/memory efficiency ratio. Together, the optimizations they demonstrated allowed large diffusion models to execute on a wide range of devices with record-breaking latency. These enhancements broaden the model’s applicability and improve the user experience across a wide range of devices.

Check out the Paper and Google AI article.


The post This AI Paper from Google Presents a Set of Optimizations that Collectively Attain Groundbreaking Latency Figures for Executing Large Diffusion Models on Various Devices appeared first on MarkTechPost.
