Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimodal sequence-to-sequence learning.

## Speech To Text

DSM can be used to build streaming speech-to-text models. These models can be
batched for efficiency, return word-level timestamps, and are well suited to
interactive applications. We provide two such models, characterized by their
size and by the delay between incoming audio and the transcribed text:

- An English-only model with ~2.6b parameters and a 2.5 second delay,
  `kyutai/stt-2.6b-en`.
- An English and French model with ~1b parameters and a 0.5 second delay,
  `kyutai/stt-1b-en_fr`.
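
Both checkpoints are invoked the same way from the command line (see the
sections below). As a minimal scripted sketch, the hypothetical `transcribe`
helper here simply wraps that CLI; it is not part of this repo:

```python
# Minimal sketch: wrap the `moshi.run_inference` CLI shown below.
# The `transcribe` helper is hypothetical, not part of this repo.
import subprocess

MODELS = {
    "en": "kyutai/stt-2.6b-en",      # ~2.6b params, 2.5 s delay
    "en_fr": "kyutai/stt-1b-en_fr",  # ~1b params, 0.5 s delay
}

def transcribe(audio_path: str, model: str = "en") -> str:
    """Run the PyTorch CLI on one audio file and return its text output."""
    result = subprocess.run(
        ["python", "-m", "moshi.run_inference",
         "--hf-repo", MODELS[model], audio_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(transcribe("bria.mp3"))
```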
### English only model

#### PyTorch implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en)
<a target="_blank" href="https://colab.research.google.com/drive/1mc0Q-FoHxU2pEvId8rTdS4q1r1zorJhS?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```
#### MLX implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-mlx)
```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```

The `--temp 0` flag selects greedy, deterministic decoding.
#### Rust implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-candle)
The Rust implementation provides a server that can process multiple streaming
queries in parallel. The server is launched via `moshi-server worker` with the
config file matching the model (see the `configs/` directory), and a client
script then streams audio to it, as sketched below.

The script simulates some real-time processing of the audio. Faster processing
can be triggered by setting the real-time factor, e.g. `--rtf 500` will process
the data as fast as possible.
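
For illustration only, here is a rough Python sketch of such a client. The
endpoint URL, the msgpack message schema, and the mono 24 kHz / 80 ms framing
are all assumptions, not documented in this README; the scripts shipped with
the repo define the real protocol.

```python
# Rough client sketch. ASSUMPTIONS (not from this README): endpoint URL,
# msgpack message schema, mono 24 kHz input, 1920-sample (80 ms) frames.
import asyncio

import msgpack
import soundfile as sf
import websockets

URL = "ws://127.0.0.1:8080/api/asr-streaming"  # assumed endpoint

async def stream_file(path: str) -> None:
    pcm, _sr = sf.read(path, dtype="float32")  # assumed mono, 24 kHz

    async with websockets.connect(URL) as ws:
        async def send() -> None:
            # Pace the audio roughly in real time (rtf = 1).
            for off in range(0, len(pcm), 1920):
                frame = pcm[off:off + 1920].tolist()
                await ws.send(msgpack.packb({"type": "Audio", "pcm": frame}))
                await asyncio.sleep(0.08)
            await asyncio.sleep(2.0)  # leave time for trailing words
            await ws.close()

        async def recv() -> None:
            async for raw in ws:
                msg = msgpack.unpackb(raw)
                if msg.get("type") == "Word":  # assumed schema
                    print(msg.get("start_time"), msg.get("text"))

        await asyncio.gather(send(), recv())

asyncio.run(stream_file("bria.mp3"))
```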
### English + French model
This model has ~1b parameters and supports both English and French.
#### PyTorch implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr)
```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr bria.mp3
```
#### MLX implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr-mlx)
```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-1b-en_fr-mlx bria.mp3 --temp 0
```
#### Rust implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr-candle)
The only difference from the English-only model is the config file used when
launching the server:
```bash
moshi-server worker --config configs/config-stt-enfr-hf.toml
```
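
If you prefer to supervise the server from Python rather than a shell, a
minimal sketch follows; the `start_server` helper is hypothetical, and only
the enfr config path above appears in this README.

```python
# Hypothetical helper: run the STT server as a child process.
import subprocess

def start_server(config: str = "configs/config-stt-enfr-hf.toml") -> subprocess.Popen:
    """Launch `moshi-server worker` with the given config; caller owns the handle."""
    return subprocess.Popen(["moshi-server", "worker", "--config", config])

server = start_server()
try:
    server.wait()        # block until the server exits (Ctrl-C to stop it)
finally:
    server.terminate()
```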
## Text To Speech

We're in the process of open-sourcing our TTS models. Check back for updates!