Merge branch 'main' of github.com:kyutai-labs/delayed-streams-modeling into vv/readme

Václav Volhejn 2025-06-19 09:36:12 +02:00
commit a1a5fa9803
2 changed files with 30 additions and 6 deletions

.gitignore

@@ -191,4 +191,6 @@ cython_debug/
# exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
# refer to https://docs.cursor.com/context/ignore-files
.cursorignore
.cursorindexingignore
bria.mp3
sample_fr_hibiki_crepes.mp3

README.md

@@ -1,19 +1,28 @@
# Delayed Streams Modeling
This repo contains instructions and examples of how to run Kyutai Speech-To-Text models.
These models are powered by delayed streams modeling (DSM),
a flexible formulation for streaming, multimodal sequence-to-sequence learning.
Text-to-speech models based on DSM coming soon!
## Kyutai Speech-To-Text
**More details can be found on the [project page](https://kyutai.org/next/stt).**
Kyutai STT models are optimized for real-time usage, can be batched for efficiency, and return word-level timestamps.
We provide two models:
- `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
- `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.
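To make the delay figures above concrete, here is a minimal sketch (not part of the released code) of how the stream delay relates to transcript latency: a word that ends at time t in the audio can only appear in the delayed text stream roughly `delay` seconds later. The dictionary and function names below are illustrative only.

```python
# Illustrative only: how the per-model stream delay relates to transcript latency.
# The delay values come from the model list above; everything else is hypothetical.
MODEL_DELAYS_S = {
    "kyutai/stt-1b-en_fr": 0.5,   # ~1B params, English + French, semantic VAD
    "kyutai/stt-2.6b-en": 2.5,    # ~2.6B params, English only
}

def earliest_transcript_time(word_end_s: float, model: str) -> float:
    """Roughly when a word ending at `word_end_s` (seconds into the audio)
    can show up in the delayed text stream."""
    return word_end_s + MODEL_DELAYS_S[model]

if __name__ == "__main__":
    for model, delay in MODEL_DELAYS_S.items():
        print(f"{model}: a word ending at 10.0s is available around "
              f"{earliest_transcript_time(10.0, model):.1f}s ({delay}s delay)")
```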
These speech-to-text models have several advantages:
- Streaming inference: the models can process audio in chunks, which allows
  for real-time transcription and is great for interactive applications.
- Easy batching for maximum efficiency: an H100 can process 400 streams in
  real time.
- They return word-level timestamps.
- The 1B model has a semantic Voice Activity Detection (VAD) component that
  can be used to detect when the user is speaking. This is especially useful
  for building voice agents.
You can retrieve the sample files used in the following snippets via:
```bash
@@ -36,6 +45,12 @@ with version 0.2.5 or later, which can be installed via pip.
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```
If you have `uv` installed, you can skip the installation step and run directly:
```bash
uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```
This will install the moshi package in a temporary environment and run the speech-to-text model.
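If you want to transcribe several files with the same documented command, a small wrapper along these lines can help. This is only a sketch, not part of the repo: it reuses the `python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en <file>` invocation shown above, and any file name other than `bria.mp3` is a placeholder.

```python
# Sketch: transcribe a list of audio files sequentially by shelling out to the
# documented CLI (python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en <file>).
import subprocess
import sys

AUDIO_FILES = [
    "bria.mp3",              # sample file from the snippet above
    "second_recording.mp3",  # placeholder: any other local audio file
]

for path in AUDIO_FILES:
    print(f"--- {path} ---")
    subprocess.run(
        [sys.executable, "-m", "moshi.run_inference",
         "--hf-repo", "kyutai/stt-2.6b-en", path],
        check=True,
    )
```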
### Rust server
<a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
@@ -70,8 +85,9 @@ script.
uv run scripts/asr-streaming-query.py bria.mp3
```
The script limits the decoding speed to simulate real-time processing of the audio.
Faster processing can be triggered by setting
the real-time factor, e.g. `--rtf 500` will process
the data as fast as possible.
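As a rough illustration of what the real-time factor means, assuming `--rtf N` simply makes the client feed audio N times faster than real time (a simplification; actual throughput is still bounded by the server):

```python
# Sketch: expected wall-clock time to stream a file at a given real-time factor (RTF).
# rtf=1 simulates real-time streaming; large values are in practice limited by
# how fast the server can decode.
def expected_streaming_seconds(audio_duration_s: float, rtf: float) -> float:
    return audio_duration_s / rtf

for rtf in (1, 10, 500):
    print(f"rtf={rtf:>3}: ~{expected_streaming_seconds(120.0, rtf):.2f}s "
          f"to stream a 2-minute file (ignoring server limits)")
```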
### Rust standalone
@@ -101,6 +117,12 @@ with version 0.2.5 or later, which can be installed via pip.
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```
If you have `uv` installed, you can skip the installation step and run directly:
```bash
uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```
This will install the moshi-mlx package in a temporary environment and run the speech-to-text model.
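If you want a single entry point that picks a backend automatically, a sketch like the following reuses only the two documented commands (PyTorch `moshi` and MLX `moshi-mlx`); the platform check and the script itself are not part of this repo.

```python
# Sketch: pick the MLX backend on Apple Silicon, the PyTorch backend elsewhere,
# and run the corresponding documented CLI on a single audio file.
import platform
import subprocess
import sys

def transcribe(path: str) -> None:
    if sys.platform == "darwin" and platform.machine() == "arm64":
        # MLX backend (documented above): moshi_mlx with the -mlx checkpoint.
        cmd = [sys.executable, "-m", "moshi_mlx.run_inference",
               "--hf-repo", "kyutai/stt-2.6b-en-mlx", path, "--temp", "0"]
    else:
        # PyTorch backend (documented earlier in this README).
        cmd = [sys.executable, "-m", "moshi.run_inference",
               "--hf-repo", "kyutai/stt-2.6b-en", path]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    transcribe("bria.mp3")
```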
## Text-to-Speech
We're in the process of open-sourcing our TTS models. Check back for updates!