MarkTechPost, 2 min read

Interfaze Ships Open-Source Diffusion ASR Model for Six Languages

Interfaze released diffusion-gemma-asr-small this week, an open-source automatic speech recognition (ASR) model that transcribes audio with a diffusion decoder. The Interfaze team describes it as the first multilingual audio diffusion ASR model, handling six languages with a single adapter. The team trained only the adapter, approximately 42 million parameters, on top of a frozen 26 billion parameter backbone, so the trained weights amount to about 0.16% of the total model weights.
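The 0.16% figure follows directly from the two parameter counts. A quick check, using the rounded numbers reported above (the function name is illustrative, and whether the adapter is counted in the total is an assumption that does not change the rounded result):

```python
# Rounded figures as reported in the article.
ADAPTER_PARAMS = 42_000_000          # trained adapter parameters
BACKBONE_PARAMS = 26_000_000_000     # frozen backbone parameters


def trainable_fraction(trainable: int, frozen: int, count_adapter_in_total: bool = True) -> float:
    """Share of all weights that were actually trained.

    Assumption: the article does not say whether the adapter is included
    in the denominator, so both options are supported.
    """
    total = frozen + trainable if count_adapter_in_total else frozen
    return trainable / total


if __name__ == "__main__":
    for include in (True, False):
        share = trainable_fraction(ADAPTER_PARAMS, BACKBONE_PARAMS, include)
        print(f"adapter counted in total={include}: {share:.2%}")
```

Both ways of counting round to the same percentage, which matches the figure quoted by the team.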

Unlike autoregressive models, which generate text one token at a time, diffusion models refine all tokens in parallel, and diffusion-gemma-asr-small applies this approach to speech-to-text conversion. The model uses DiffusionGemma's discrete diffusion decoder. Typical diffusion LLMs use an absorbing mask scheme, starting from a canvas of mask tokens. DiffusionGemma instead uses uniform, random-token diffusion: it fills a fixed-length canvas with random vocabulary tokens and refines them over several steps. As a result, transcription cost scales with the number of denoising steps rather than with the transcript length.
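The control flow of that decoding scheme can be sketched in a few lines. This is a toy illustration only, not Interfaze's or Google's implementation: the vocabulary, canvas length, step count, and `toy_denoiser` are all hypothetical, and the denoiser cheats by looking at the reference text where a real model would predict every position from the audio.

```python
import random

# Hypothetical character-level vocabulary plus a padding token.
VOCAB = list("abcdefghijklmnopqrstuvwxyz ") + ["<pad>"]


def toy_denoiser(canvas, reference, step, num_steps, rng):
    """Stand-in for ONE parallel model call (assumption, not the real network).

    Every position is updated in the same call. Each slot is corrected with
    a probability that grows to 1.0 on the final step.
    """
    p_fix = (step + 1) / num_steps
    return [ref if rng.random() < p_fix else tok for tok, ref in zip(canvas, reference)]


def decode(reference_text, canvas_len=32, num_steps=8, seed=0):
    rng = random.Random(seed)
    # The target is padded to the fixed canvas length.
    reference = list(reference_text)[:canvas_len]
    reference += ["<pad>"] * (canvas_len - len(reference))
    # Uniform diffusion start: random vocabulary tokens, no mask token.
    canvas = [rng.choice(VOCAB) for _ in range(canvas_len)]
    model_calls = 0
    for step in range(num_steps):
        canvas = toy_denoiser(canvas, reference, step, num_steps, rng)
        model_calls += 1
    text = "".join(tok for tok in canvas if tok != "<pad>")
    return text, model_calls


if __name__ == "__main__":
    for sentence in ("hi", "a much longer sentence here"):
        text, calls = decode(sentence)
        print(f"{calls} model calls -> {text!r}")
```

The number of model calls equals `num_steps` for both the short and the long input, which is the scaling property described above. An autoregressive decoder would instead need roughly one call per output token.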

On LibriSpeech, diffusion-gemma-asr-small achieved a 6.6% word error rate (WER), ahead of Whisfusion's 8.3%, but it still trails the autoregressive Whisper model on this benchmark. The adapter for diffusion-gemma-asr-small is distributed under the Apache-2.0 license. The DiffusionGemma model itself remains under Gemma's terms, and the whisper-small model is available under the MIT license.
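For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. A minimal sketch follows; it assumes plain whitespace tokenization, whereas published benchmark numbers usually apply text normalization first, which the article does not detail.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A 6.6% WER therefore means roughly 6 to 7 word-level errors per 100 reference words.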

DiffusionGemma, originally developed by Google, is a 26 billion parameter mixture-of-experts model that activates about 4 billion parameters per token, routing each token to the top 8 of its 128 experts. It is designed to process text, images, and video. Interfaze adapted the model to accept audio input as well, which is what enables speech-to-text transcription.
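Top-8 routing over 128 experts can be illustrated with a generic mixture-of-experts router. This is a sketch of the common technique, not DiffusionGemma's actual code: the random logits are placeholders, and renormalizing the weights with a softmax over only the selected experts is an assumption.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article


def route(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and return (expert_index, weight) pairs.

    Assumption: weights are a softmax over the selected logits only.
    """
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    max_logit = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder router scores for a single token.
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    for expert, weight in route(logits):
        print(f"expert {expert:3d}  weight {weight:.3f}")
```

Only the selected experts run for a given token, which is why the active parameter count is far below the total.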
