NVIDIA Releases Open-Weight Diffusion Language Model Nemotron-Labs-TwoTower
NVIDIA released Nemotron-Labs-TwoTower, an open-weight diffusion language model built on a frozen autoregressive backbone, on March 20, 2024. The model targets the throughput limits of text generation by using discrete diffusion: instead of generating tokens one at a time, as traditional autoregressive models do, it generates tokens in parallel and refines them iteratively. Generation is split across two distinct "towers": an autoregressive context tower that represents clean tokens and a denoiser tower that denoises corrupted ones.

In the reported configuration (γ=0.8, S=16, 2×H100), this architecture retains 98.7% of the autoregressive baseline's benchmark quality while achieving 2.42 times higher wall-clock generation throughput. The denoiser tower was trained on approximately 2.1 trillion tokens, a subset of the 25 trillion tokens used for the backbone's pretraining.

The model is instantiated on Nemotron-3-Nano-30B-A3B, an open-weight hybrid backbone that combines Mamba-2, self-attention, and mixture-of-experts (MoE) layers. Each tower has 52 layers: 23 Mamba-2, 6 self-attention, and 23 MoE. The MoE layers use 128 routable experts, of which 6 are activated per token, plus 2 shared experts. The released checkpoint contains both towers, for roughly 60 billion parameters in total, with approximately 3 billion active parameters per tower per token.

Both towers are initialized from the same backbone checkpoint, but only the denoiser tower is trained; the autoregressive context tower stays frozen. The context tower processes the prompt and committed tokens causally, producing per-layer KV caches and final Mamba-2 states, and so preserves the backbone's autoregressive capabilities.
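The reported figures can be cross-checked with a little arithmetic. The Python sketch below is only an illustration: `TowerConfig` and its field names are hypothetical and are not taken from the released checkpoint or any NVIDIA API. It simply restates the numbers quoted in the article and derives a few consequences, assuming the roughly 60 billion parameters are split evenly between the two towers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TowerConfig:
    """Per-tower figures quoted in the article (class and field names are hypothetical)."""
    mamba2_layers: int = 23
    attention_layers: int = 6
    moe_layers: int = 23
    routed_experts: int = 128      # routable experts in the MoE configuration
    experts_per_token: int = 6     # routed experts activated per token
    shared_experts: int = 2        # experts applied to every token
    active_params_b: float = 3.0   # approximate active parameters per token, in billions

    @property
    def num_layers(self) -> int:
        return self.mamba2_layers + self.attention_layers + self.moe_layers

    @property
    def experts_run_per_token(self) -> int:
        # Routed experts chosen for the token plus the always-on shared experts.
        return self.experts_per_token + self.shared_experts

    @property
    def routed_fraction(self) -> float:
        # Share of the routable experts that a single token activates.
        return self.experts_per_token / self.routed_experts


tower = TowerConfig()
NUM_TOWERS = 2
CHECKPOINT_PARAMS_B = 60.0  # "roughly 60 billion" for both towers together

# Assumption: the checkpoint is split evenly between the two towers.
params_per_tower_b = CHECKPOINT_PARAMS_B / NUM_TOWERS

# Denoiser training tokens relative to the backbone's pretraining tokens.
denoiser_token_share = 2.1 / 25.0

print(f"layers per tower:        {tower.num_layers}")
print(f"experts run per token:   {tower.experts_run_per_token}")
print(f"routed experts used:     {tower.routed_fraction:.2%}")
print(f"parameters per tower:    ~{params_per_tower_b:.0f}B")
print(f"active share per tower:  {tower.active_params_b / params_per_tower_b:.0%}")
print(f"denoiser token share:    {denoiser_token_share:.1%}")
```

The layer counts sum to the stated 52, each token activates 6 of 128 routable experts (under 5%) plus the 2 shared experts, and the denoiser's 2.1 trillion training tokens amount to about 8.4% of the backbone's 25 trillion.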
Original source — read the full reporting at the publisher:
Read on MarkTechPost