Google's latest DiffusionGemma open AI model comes with a 4x speed boost

Google DeepMind released DiffusionGemma, a new open AI model that generates text in parallel rather than sequentially, on March 13, 2024. The approach, similar to the one used by image generation models, lets DiffusionGemma produce an entire block of text at once, which improves speed and efficiency on local hardware. Unlike autoregressive models, which generate text one token at a time, DiffusionGemma starts from a field of placeholder tokens and refines it into the final output through a denoising process. The model is a Mixture of Experts (MoE) with 26 billion parameters, but only 3.8 billion are active during inference, making it suitable for GPUs with 18GB of memory. In benchmarks, DiffusionGemma achieved approximately 700 tokens per second on an Nvidia RTX 5090 and over 1,000 tokens per second on a single Nvidia H100 AI accelerator. That is a fourfold speed increase over similarly sized autoregressive Gemma models.
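
The difference between the two decoding styles can be illustrated with a toy sketch. This is not DiffusionGemma's actual algorithm or API: `toy_denoiser`, `TARGET`, `CONFIDENCE`, and the "fill the most confident positions first" schedule are all hypothetical stand-ins chosen for illustration. The sketch only shows why refining a block of placeholder tokens can need fewer model passes than emitting one token per pass.

```python
import math

MASK = "<mask>"

# Hypothetical stand-in for a trained denoiser. A real model would predict
# every position from the partially filled block; this toy version looks up
# a fixed target and a made-up confidence score for each position.
TARGET = ["diffusion", "models", "refine", "a", "whole", "block", "in", "parallel"]
CONFIDENCE = [0.9, 0.4, 0.7, 0.95, 0.3, 0.6, 0.85, 0.5]


def toy_denoiser(block):
    """Return a (token, confidence) guess for every position in one pass."""
    return list(zip(TARGET, CONFIDENCE))


def diffusion_decode(length, num_steps):
    """Start from all placeholders and fill several positions per pass."""
    block = [MASK] * length
    per_step = math.ceil(length / num_steps)
    passes = 0
    while MASK in block:
        guesses = toy_denoiser(block)  # one pass scores the whole block
        passes += 1
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        # Commit the most confident guesses first; the rest stay masked.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = guesses[i][0]
    return block, passes


def autoregressive_decode(length):
    """Emit one token per pass, left to right."""
    out = []
    passes = 0
    for i in range(length):
        out.append(TARGET[i])  # one pass yields exactly one token
        passes += 1
    return out, passes


if __name__ == "__main__":
    print(diffusion_decode(len(TARGET), num_steps=2))
    print(autoregressive_decode(len(TARGET)))
```

In this toy setup both decoders produce the same eight tokens, but the block-wise version needs two passes and the token-by-token version needs eight. Pass counts in a toy do not translate directly into wall-clock throughput, so the ratio here should not be read as an explanation of the reported fourfold figure.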