# Delayed Streams Modeling

This repo contains instructions and examples of how to run Kyutai Speech-To-Text models.
These models are powered by delayed streams modeling (DSM),
a flexible formulation for streaming, multimodal sequence-to-sequence learning.
Text-to-speech models based on DSM are coming soon!

## Kyutai Speech-To-Text

Kyutai STT models are optimized for real-time usage, can be batched for efficiency,
and return word-level timestamps.
We provide two models:
- `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
- `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.

**More details can be found on the [project page](https://kyutai.org/next/stt).**

You can retrieve the sample files used in the following snippets via:
```bash
wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
wget https://github.com/kyutai-labs/moshi/raw/refs/heads/main/data/sample_fr_hibiki_crepes.mp3
```

### PyTorch implementation

This requires the [moshi package](https://pypi.org/project/moshi/)
with version 0.2.5 or later, which can be installed via pip.

```bash
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```

### Rust server

The Rust implementation provides a server that can process multiple streaming
queries in parallel. Depending on the amount of memory on your GPU, you may
have to adjust the batch size in the config file. For an L40S GPU, a batch size
of 64 works well and requests can be processed at 3x real-time speed.

In order to run the server, install the [moshi-server
crate](https://crates.io/crates/moshi-server) via the following command. The
server code can be found in the
[kyutai-labs/moshi](https://github.com/kyutai-labs/moshi/tree/main/rust/moshi-server)
repository.
```bash
cargo install --features cuda moshi-server
```

Then the server can be started via the following command using the config file
from this repository.
For `kyutai/stt-1b-en_fr`, use `configs/config-stt-en_fr-hf.toml`,
and for `kyutai/stt-2.6b-en`, use `configs/config-stt-en-hf.toml`:

```bash
moshi-server worker --config configs/config-stt-en_fr-hf.toml
```

Once the server has started, you can run a streaming inference with the following
script.

```bash
uv run scripts/asr-streaming-query.py bria.mp3
```

The script simulates real-time processing of the audio. Faster processing
can be triggered by setting the real-time factor, e.g. `--rtf 500` will process
the data as fast as possible.

### Rust standalone

A standalone Rust example script is provided in the `stt-rs` directory in this repo.
This can be used as follows:

```bash
cd stt-rs
cargo run --features cuda -r -- bria.mp3
```

### MLX implementation

[MLX](https://ml-explore.github.io/mlx/build/html/index.html) is Apple's ML framework that allows you to use
hardware acceleration on Apple silicon.

This requires the [moshi-mlx package](https://pypi.org/project/moshi-mlx/)
with version 0.2.5 or later, which can be installed via pip.

```bash
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```

## Text-to-Speech

We're in the process of open-sourcing our TTS models. Check back for updates!

## License

The present code is provided under the MIT license for the Python parts, and the Apache license for the
Rust backend. The web client code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
the MIT license.

The weights for the speech-to-text models are released under the CC-BY 4.0 license.