Compare commits

...

2 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Václav Volhejn | a1a5fa9803 | Merge branch 'main' of github.com:kyutai-labs/delayed-streams-modeling into vv/readme | 2025-06-19 09:36:12 +02:00 |
| Václav Volhejn | 9f388c6a70 | Make a pass over the Readme | 2025-06-19 09:17:48 +02:00 |
3 changed files with 54 additions and 46 deletions

README.md (100 lines changed)

@@ -1,27 +1,29 @@
-# delayed-streams-modeling
-Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimodal sequence-to-sequence learning.
-## Speech-to-text
-DSM can be used to build streaming speech-to-text models. We provide two such models
-with a different delay between the audio input and the text output.
-- An English and French model with ~1b parameters using a 0.5 second delay,
-  `kyutai/stt-1b-en_fr`.
-- An English only model with ~2.6b parameters using a 2.5 second delay,
-  `kyutai/stt-2.6b-en`.
+# Delayed Streams Modeling
+This repo contains instructions and examples of how to run Kyutai Speech-To-Text models.
+These models are powered by delayed streams modeling (DSM),
+a flexible formulation for streaming, multimodal sequence-to-sequence learning.
+Text-to-speech models based on DSM coming soon!
+## Kyutai Speech-To-Text
+**More details can be found on the [project page](https://kyutai.org/next/stt).**
+Kyutai STT models are optimized for real-time usage, can be batched for efficiency, and return word level timestamps.
+We provide two models:
+- `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
+- `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.
 These speech-to-text models have several advantages:
-- Easy batching for maximum efficiency: a H100 can process 400 streams in
-  real-time.
 - Streaming inference: the models can process audio in chunks, which allows
   for real-time transcription, and is great for interactive applications.
-- Return word-level timestamps.
-- Some models have a semantic Voice Activity Detection (VAD) component that
+- Easy batching for maximum efficiency: a H100 can process 400 streams in
+  real-time.
+- They return word-level timestamps.
+- The 1B model has a semantic Voice Activity Detection (VAD) component that
   can be used to detect when the user is speaking. This is especially useful
   for building voice agents.
-More details can be found on the [project page](https://kyutai.org/next/stt).
 You can retrieve the sample files used in the following snippets via:
 ```bash
 wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
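Later hunks in this diff run the 2.6B English checkpoint on this sample. As a hedged companion example, here is the same documented CLI pattern pointed at the 1B English/French checkpoint listed above; it assumes `moshi.run_inference` accepts `kyutai/stt-1b-en_fr` exactly the way it accepts `kyutai/stt-2.6b-en`.

```bash
# Sketch: transcribe the downloaded sample with the 1B English/French model.
# Assumes the CLI flags are identical to the kyutai/stt-2.6b-en command shown
# further down in this README diff.
wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr bria.mp3
```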
@@ -49,36 +51,6 @@ uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria
 ```
 It will install the moshi package in a temporary environment and run the speech-to-text.
-### MLX implementation
-<a href="https://huggingface.co/kyutai/stt-2.6b-en-mlx" target="_blank" style="margin: 2px;">
-<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
-</a>
-This requires the [moshi-mlx package](https://pypi.org/project/moshi-mlx/)
-with version 0.2.5 or later, which can be installed via pip.
-```bash
-python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
-```
-If you have `uv` installed, you can skip the installation step and run directly:
-```bash
-uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
-```
-It will install the moshi package in a temporary environment and run the speech-to-text.
-### Rust implementation
-<a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
-<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
-</a>
-A standalone Rust example is provided in the `stt-rs` directory in this repo.
-This can be used as follows:
-```bash
-cd stt-rs
-cargo run --features cuda -r -- bria.mp3
-```
 ### Rust server
 <a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
 <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
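The `uvx --with moshi` command in the hunk above installs the `moshi` package into a temporary environment on every run. As a hedged aside, a persistent install can avoid that repetition; this sketch assumes the package is published on PyPI under the name `moshi` and that the documented CLI is unchanged.

```bash
# Sketch: install the PyTorch backend once instead of using uvx's temporary env.
pip install moshi
# The documented CLI then runs directly on the sample file:
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```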
@@ -100,8 +72,11 @@ cargo install --features cuda moshi-server
 Then the server can be started via the following command using the config file
 from this repository.
+For `kyutai/stt-1b-en_fr`, use `configs/config-stt-en_fr.hf.toml`,
+and for `kyutai/stt-2.6b-en`, use `configs/config-stt-en-hf.toml`,
 ```bash
-moshi-server worker --config configs/config-stt-hf.toml
+moshi-server worker --config configs/config-stt-en_fr-hf.toml
 ```
 Once the server has started you can run a streaming inference with the following
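The command in the hunk above starts the server with the English/French config. A hedged companion example for the English-only model, using the other config file named in the added lines and assuming the server is otherwise launched the same way:

```bash
# Sketch: serve kyutai/stt-2.6b-en instead, with the config file referenced
# in the added README lines.
moshi-server worker --config configs/config-stt-en-hf.toml
```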
@@ -115,6 +90,39 @@ Faster processing can be triggered by setting
 the real-time factor, e.g. `--rtf 500` will process
 the data as fast as possible.
+### Rust standalone
+<a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
+<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
+</a>
+A standalone Rust example script is provided in the `stt-rs` directory in this repo.
+This can be used as follows:
+```bash
+cd stt-rs
+cargo run --features cuda -r -- bria.mp3
+```
+### MLX implementation
+<a href="https://huggingface.co/kyutai/stt-2.6b-en-mlx" target="_blank" style="margin: 2px;">
+<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
+</a>
+[MLX](https://ml-explore.github.io/mlx/build/html/index.html) is Apple's ML framework that allows you to use
+hardware acceleration on Apple silicon.
+This requires the [moshi-mlx package](https://pypi.org/project/moshi-mlx/)
+with version 0.2.5 or later, which can be installed via pip.
+```bash
+python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
+```
+If you have `uv` installed, you can skip the installation step and run directly:
+```bash
+uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
+```
+It will install the moshi package in a temporary environment and run the speech-to-text.
 ## Text-to-Speech
 We're in the process of open-sourcing our TTS models. Check back for updates!
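The MLX section added above says moshi-mlx version 0.2.5 or later "can be installed via pip" but only shows the run command. A hedged sketch of that install step followed by the documented invocation (the version bound is taken from the README text; everything else is verbatim from the added lines):

```bash
# Sketch: explicit pip install for the MLX backend, then the run command
# from the MLX section added in the hunk above.
pip install "moshi-mlx>=0.2.5"
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```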