# delayed-streams-modeling
Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimodal sequence-to-sequence learning.
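
As a rough illustration of the name: the input and output streams can be pictured as sharing one time grid, with the output stream shifted by a fixed number of steps so that each output is produced after some extra input has been seen. The toy sketch below shows only that shifting idea. The function `delay_streams`, the `PAD` marker, and the frame and word values are hypothetical and are not taken from this repository.

```python
from typing import List, Optional, Sequence, Tuple

PAD = None  # hypothetical marker for "nothing on this stream at this step"


def delay_streams(
    inputs: Sequence[str],
    outputs: Sequence[str],
    delay_steps: int,
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Pair two time-aligned streams after shifting the output by `delay_steps`."""
    if len(inputs) != len(outputs):
        raise ValueError("streams must be aligned on the same time grid")
    if delay_steps < 0:
        raise ValueError("delay_steps must be non-negative")
    # Pad the input at the end and the output at the start so both streams
    # keep the same length after the shift.
    padded_inputs = list(inputs) + [PAD] * delay_steps
    padded_outputs = [PAD] * delay_steps + list(outputs)
    return list(zip(padded_inputs, padded_outputs))


# Toy data: four audio frames, each aligned with at most one word.
frames = ["a0", "a1", "a2", "a3"]
words = ["hello", "", "world", ""]
steps = delay_streams(frames, words, delay_steps=2)
for t, (frame, word) in enumerate(steps):
    print(t, frame, word)
```

With `delay_steps=2`, the word aligned with frame `a0` only appears at step 2, once frames `a1` and `a2` have also been consumed.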
## Speech-to-text
DSM can be used to build streaming speech-to-text models. We provide two such models,
each with a different delay between the audio input and the text output.

- An English and French model with ~1B parameters and a 0.5 second delay,
`kyutai/stt-1b-en_fr`.
- An English-only model with ~2.6B parameters and a 2.5 second delay,
`kyutai/stt-2.6b-en`.

These speech-to-text models have several advantages:
- Easy batching for maximum efficiency: an H100 can process 400 streams in
real time.
- Streaming inference: the models can process audio in chunks, which allows
for real-time transcription, and is great for interactive applications.
- Word-level timestamps: the models return a timestamp for each transcribed word.
- Some models have a semantic Voice Activity Detection (VAD) component that
can be used to detect when the user is speaking. This is especially useful
for building voice agents.

More details can be found on the [project page](https://kyutai.org/next/stt).

You can retrieve the sample files used in the following snippets via:
```bash
wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
wget https://github.com/kyutai-labs/moshi/raw/refs/heads/main/data/sample_fr_hibiki_crepes.mp3
```
### PyTorch implementation
<a href="https://huggingface.co/kyutai/stt-2.6b-en" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
</a>
<a target="_blank" href="https://colab.research.google.com/drive/1mc0Q-FoHxU2pEvId8rTdS4q1r1zorJhS?usp=sharing">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This requires version 0.2.5 or later of the [moshi package](https://pypi.org/project/moshi/),
which can be installed via pip (`pip install "moshi>=0.2.5"`).
```bash
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```
If you have `uv` installed, you can skip the installation step and run directly:
```bash
uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```
This installs the moshi package in a temporary environment and runs the speech-to-text model.
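
To transcribe several files, the same command can be driven from Python. The following is a minimal sketch, assuming the `moshi` package is installed in the current interpreter and that the command writes its transcript to standard output (an assumption, not something this README states). The helper names `build_command` and `transcribe_files` are hypothetical, and only the flags shown above are used.

```python
import subprocess
import sys
from pathlib import Path
from typing import Dict, Iterable, List, Union

PathLike = Union[str, Path]


def build_command(audio_path: PathLike, hf_repo: str = "kyutai/stt-2.6b-en") -> List[str]:
    """Return the argument list for one `python -m moshi.run_inference` call."""
    return [
        sys.executable, "-m", "moshi.run_inference",
        "--hf-repo", hf_repo,
        str(audio_path),
    ]


def transcribe_files(
    audio_paths: Iterable[PathLike],
    hf_repo: str = "kyutai/stt-2.6b-en",
) -> Dict[str, str]:
    """Run the command once per file and collect whatever it prints to stdout."""
    outputs: Dict[str, str] = {}
    for path in audio_paths:
        completed = subprocess.run(
            build_command(path, hf_repo),
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError if the command fails
        )
        outputs[str(path)] = completed.stdout
    return outputs


# Example call (needs the moshi package and the sample file):
# print(transcribe_files(["bria.mp3"])["bria.mp3"])
```

Each call starts a new process, so the model is loaded again for every file. This keeps the sketch simple but is not the most efficient way to handle many files.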
### MLX implementation
<a href="https://huggingface.co/kyutai/stt-2.6b-en-mlx" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

This requires version 0.2.5 or later of the [moshi-mlx package](https://pypi.org/project/moshi-mlx/),
which can be installed via pip (`pip install "moshi-mlx>=0.2.5"`).
```bash
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```
If you have `uv` installed, you can skip the installation step and run directly:
```bash
uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```
This installs the moshi-mlx package in a temporary environment and runs the speech-to-text model.
### Rust implementation
<a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

A standalone Rust example is provided in the `stt-rs` directory in this repo.
This can be used as follows:
```bash
cd stt-rs
# Assumes bria.mp3 was downloaded to the repository root, one level above stt-rs.
cargo run --features cuda -r -- ../bria.mp3
```
### Rust server
<a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

The Rust implementation provides a server that can process multiple streaming
queries in parallel. Depending on the amount of memory on your GPU, you may
have to adjust the batch size in the config file. For an L40S GPU, a batch size
of 64 works well and requests can be processed at 3x real-time speed.

To run the server, install the [moshi-server
crate](https://crates.io/crates/moshi-server) via the following command. The
server code can be found in the
[kyutai-labs/moshi](https://github.com/kyutai-labs/moshi/tree/main/rust/moshi-server)
repository.
```bash
cargo install --features cuda moshi-server
```
Then the server can be started via the following command using the config file
from this repository.
```bash
moshi-server worker --config configs/config-stt-hf.toml
```
Once the server has started you can run a streaming inference with the following
script.
```bash
uv run scripts/asr-streaming-query.py bria.mp3
```
The script limits the decoding speed to simulate real-time processing of the audio.
Faster processing can be requested by setting the real-time factor; for example,
`--rtf 500` processes the data as fast as possible.
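
The effect of a real-time factor can be sketched as a wait inserted between audio chunks. The code below is an illustration only and is not the code of `scripts/asr-streaming-query.py`. The names `pacing_delay` and `paced` and the chunk duration are hypothetical, and the sketch assumes that a real-time factor of `r` means audio is sent `r` times faster than it would play.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def pacing_delay(chunk_seconds: float, rtf: float) -> float:
    """Wall-clock seconds to wait per chunk so audio flows at `rtf` times real time."""
    if chunk_seconds < 0:
        raise ValueError("chunk_seconds must be non-negative")
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return chunk_seconds / rtf


def paced(
    chunks: Iterable[T],
    chunk_seconds: float,
    rtf: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield chunks one at a time, waiting after each one to respect `rtf`."""
    delay = pacing_delay(chunk_seconds, rtf)
    for chunk in chunks:
        yield chunk
        sleep(delay)  # rtf=1 waits a full chunk duration; a large rtf barely waits
```

With `rtf=1.0` a chunk of 0.5 seconds is followed by a 0.5 second wait. With `rtf=500` the wait shrinks to one millisecond, so processing speed rather than the pacing becomes the limit.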
## Text-to-Speech
We're in the process of open-sourcing our TTS models. Check back for updates!
## License
The present code is provided under the MIT license for the Python parts and the Apache license for the Rust backend.
The web client code is provided under the MIT license.
Note that parts of this code are based on [AudioCraft](https://github.com/facebookresearch/audiocraft), released under
the MIT license.

The weights for the speech-to-text models are released under the CC-BY 4.0 license.