# delayed-streams-modeling
Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimodal sequence-to-sequence learning.
## Speech To Text
DSM can be used to build streaming speech-to-text models. These models can be batched for efficiency, return word-level timestamps, and are great for interactive applications. We provide two such models, characterized by their size as well as by the delay before incoming audio is transcribed into text:
- An English-only model with ~2.6b parameters and a 2.5-second delay, `kyutai/stt-2.6b-en`.
- An English and French model with ~1b parameters and a 0.5-second delay, `kyutai/stt-1b-en_fr`.
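
Both checkpoints are hosted on Hugging Face. The inference commands below will typically pull the weights on first run, but you can also fetch them ahead of time, e.g. with the `huggingface-cli` tool from the `huggingface_hub` package (an optional step, assuming that CLI is installed):

```bash
# Optional: pre-download model weights from the Hugging Face hub.
huggingface-cli download kyutai/stt-2.6b-en
```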
### PyTorch implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en)
<a target="_blank" href="https://colab.research.google.com/drive/1mc0Q-FoHxU2pEvId8rTdS4q1r1zorJhS?usp=sharing">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
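
The command below assumes the `moshi` package is installed; it is published on PyPI:

```bash
# Install the PyTorch implementation (assumes a working Python environment).
pip install moshi
```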
```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```
### MLX implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-mlx)
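
As with the PyTorch version, this assumes the corresponding package is installed; `moshi_mlx` is published on PyPI:

```bash
# Install the MLX implementation (for Apple silicon).
pip install moshi_mlx
```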
```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```
### Rust implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-candle)

The Rust implementation provides a server that can process multiple streaming queries in parallel. Depending on the amount of memory on your GPU, you may have to adjust the batch size in the config file. For an L40S GPU, a batch size of 64 works well.
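
To see how much memory your GPU has available before picking a batch size, you can query it with `nvidia-smi`:

```bash
# Report per-GPU memory; pick a batch size that fits comfortably.
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
```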
To run the server, install the `moshi-server` crate via the following command. The server code can be found in the [kyutai-labs/moshi](https://github.com/kyutai-labs/moshi/tree/main/rust/moshi-server) repository.
```bash
cargo install --features cuda moshi-server
```
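
This requires a Rust toolchain; if `cargo` is not already available on your machine, rustup is the usual way to install one:

```bash
# Install a Rust toolchain (see https://rustup.rs for details).
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```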
Then the server can be started via the following command, using the config file from this repository.
```bash
moshi-server worker --config configs/config-stt-hf.toml
```
Once the server has started, you can run streaming inference with the following script.
```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
uv run scripts/asr-streaming-query.py bria.mp3
```
The script simulates real-time processing of the audio. Faster processing can be triggered by setting the real-time factor, e.g. `--rtf 500` will process the data as fast as possible.
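
For example, to transcribe the same file as fast as the server will accept it:

```bash
uv run scripts/asr-streaming-query.py bria.mp3 --rtf 500
```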
## Text To Speech
We're in the process of open-sourcing our TTS models. Check back for updates!
## License
The present code is provided under the MIT license for the Python parts, and under the Apache license for the Rust backend. The web client code is provided under the MIT license. Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under the MIT license.

The weights for the models are released under the CC-BY 4.0 license.