Merge branch 'main' of github.com:kyutai-labs/delayed-streams-modeling into vv/readme

commit a1a5fa9803

.gitignore (vendored): 4 changes

@@ -191,4 +191,6 @@ cython_debug/
 # exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
 # refer to https://docs.cursor.com/context/ignore-files
 .cursorignore
 .cursorindexingignore
+bria.mp3
+sample_fr_hibiki_crepes.mp3

README.md: 32 changes

@@ -1,19 +1,28 @@
 # Delayed Streams Modeling
 
-This repo contains instructions and examples of how to run
-Kyutai Speech-To-Text models.
+This repo contains instructions and examples of how to run Kyutai Speech-To-Text models.
 These models are powered by delayed streams modeling (DSM),
 a flexible formulation for streaming, multimodal sequence-to-sequence learning.
 Text-to-speech models based on DSM coming soon!
 
 ## Kyutai Speech-To-Text
 
+**More details can be found on the [project page](https://kyutai.org/next/stt).**
+
 Kyutai STT models are optimized for real-time usage, can be batched for efficiency, and return word-level timestamps.
 We provide two models:
 - `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5-second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
 - `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5-second delay.
 
-**More details can be found on the [project page](https://kyutai.org/next/stt).**
+These speech-to-text models have several advantages:
+- Streaming inference: the models can process audio in chunks, which allows
+  for real-time transcription and is great for interactive applications.
+- Easy batching for maximum efficiency: an H100 can process 400 streams in
+  real time.
+- They return word-level timestamps.
+- The 1B model has a semantic Voice Activity Detection (VAD) component that
+  can be used to detect when the user is speaking. This is especially useful
+  for building voice agents.
 
 You can retrieve the sample files used in the following snippets via:
 ```bash
@@ -36,6 +45,12 @@ with version 0.2.5 or later, which can be installed via pip.
 python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
 ```
 
+If you have `uv` installed, you can skip the installation step and run directly:
+```bash
+uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
+```
+It will install the moshi package in a temporary environment and run the speech-to-text.
+
 ### Rust server
 <a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
 <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>

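The same one-shot `uvx` pattern should extend to the bilingual model listed above. A minimal sketch, assuming `kyutai/stt-1b-en_fr` loads through the same `--hf-repo` flag (the model name and the French sample file both appear elsewhere in this diff, but this exact invocation is untested):

```bash
# Sketch: assumes the 1B bilingual checkpoint works with the same PyTorch runner
# as kyutai/stt-2.6b-en; sample_fr_hibiki_crepes.mp3 is the French sample above.
uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr sample_fr_hibiki_crepes.mp3
```
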
@@ -70,8 +85,9 @@ script.
 uv run scripts/asr-streaming-query.py bria.mp3
 ```
 
-The script simulates some real-time processing of the audio. Faster processing
-can be triggered by setting the real-time factor, e.g. `--rtf 500` will process
+The script limits the decoding speed to simulate real-time processing of the audio.
+Faster processing can be triggered by setting
+the real-time factor, e.g. `--rtf 500` will process
 the data as fast as possible.
 
 ### Rust standalone

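For instance, to let the decoder run as fast as possible instead of at real-time speed (a sketch: `--rtf 500` is taken from the paragraph above, but the flag's position relative to the audio path is an assumption):

```bash
# Sketch: --rtf comes from the text above; the argument order is assumed.
uv run scripts/asr-streaming-query.py --rtf 500 bria.mp3
```
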
@@ -101,6 +117,12 @@ with version 0.2.5 or later, which can be installed via pip.
 python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
 ```
 
+If you have `uv` installed, you can skip the installation step and run directly:
+```bash
+uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
+```
+It will install the moshi-mlx package in a temporary environment and run the speech-to-text.
+
 ## Text-to-Speech
 
 We're in the process of open-sourcing our TTS models. Check back for updates!

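The MLX path could presumably serve the 1B bilingual model as well. Purely a sketch: an MLX export named `kyutai/stt-1b-en_fr-mlx` is an assumption about the `-mlx` naming scheme, not something this commit confirms:

```bash
# Hypothetical repo name: assumes the 1B model also ships an -mlx export.
uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-1b-en_fr-mlx sample_fr_hibiki_crepes.mp3 --temp 0
```
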