Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimodal sequence-to-sequence learning.

## Speech To Text

DSM can be used to build streaming speech-to-text models. These models can be
batched for efficiency, return word-level timestamps, and are well suited to
interactive applications. We provide two such models, characterized by their
size and by the delay between incoming audio and the transcribed text:

- An English-only model with ~2.6b parameters and a 2.5 second delay,
  `kyutai/stt-2.6b-en`.
- An English and French model with ~1b parameters and a 0.5 second delay,
  `kyutai/stt-1b-en_fr`.
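
Both checkpoints are invoked the same way from the command line (see the
sections below). As a minimal scripted sketch, the hypothetical `transcribe`
helper here simply wraps that CLI; it is not part of this repo:

```python
# Minimal sketch: wrap the `moshi.run_inference` CLI shown below.
# The `transcribe` helper is hypothetical, not part of this repo.
import subprocess

MODELS = {
    "en": "kyutai/stt-2.6b-en",      # ~2.6b params, 2.5 s delay
    "en_fr": "kyutai/stt-1b-en_fr",  # ~1b params, 0.5 s delay
}

def transcribe(audio_path: str, model: str = "en") -> str:
    """Run the PyTorch CLI on one audio file and return its text output."""
    result = subprocess.run(
        ["python", "-m", "moshi.run_inference",
         "--hf-repo", MODELS[model], audio_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(transcribe("bria.mp3"))
```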
### English only model

#### PyTorch implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en)
<a target="_blank" href="https://colab.research.google.com/drive/1mc0Q-FoHxU2pEvId8rTdS4q1r1zorJhS?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```
#### MLX implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-mlx)
```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```

The `--temp 0` flag selects greedy, deterministic decoding.
#### Rust implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-candle)
The Rust implementation provides a server that can process multiple streaming
queries in parallel. The server is launched via `moshi-server worker` with the
config file matching the model (see the `configs/` directory), and a client
script then streams audio to it, as sketched below.

The script simulates some real-time processing of the audio. Faster processing
can be triggered by setting the real-time factor, e.g. `--rtf 500` will process
the data as fast as possible.
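
For illustration only, here is a rough Python sketch of such a client. The
endpoint URL, the msgpack message schema, and the mono 24 kHz / 80 ms framing
are all assumptions, not documented in this README; the scripts shipped with
the repo define the real protocol.

```python
# Rough client sketch. ASSUMPTIONS (not from this README): endpoint URL,
# msgpack message schema, mono 24 kHz input, 1920-sample (80 ms) frames.
import asyncio

import msgpack
import soundfile as sf
import websockets

URL = "ws://127.0.0.1:8080/api/asr-streaming"  # assumed endpoint

async def stream_file(path: str) -> None:
    pcm, _sr = sf.read(path, dtype="float32")  # assumed mono, 24 kHz

    async with websockets.connect(URL) as ws:
        async def send() -> None:
            # Pace the audio roughly in real time (rtf = 1).
            for off in range(0, len(pcm), 1920):
                frame = pcm[off:off + 1920].tolist()
                await ws.send(msgpack.packb({"type": "Audio", "pcm": frame}))
                await asyncio.sleep(0.08)
            await asyncio.sleep(2.0)  # leave time for trailing words
            await ws.close()

        async def recv() -> None:
            async for raw in ws:
                msg = msgpack.unpackb(raw)
                if msg.get("type") == "Word":  # assumed schema
                    print(msg.get("start_time"), msg.get("text"))

        await asyncio.gather(send(), recv())

asyncio.run(stream_file("bria.mp3"))
```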
### English + French model
This model has ~1b parameters and supports both English and French.
#### PyTorch implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr)
```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr bria.mp3
```
#### MLX implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr-mlx)
```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-1b-en_fr-mlx bria.mp3 --temp 0
```
#### Rust implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr-candle)
The only difference from the English-only model is the config file used when
launching the server:
```bash
moshi-server worker --config configs/config-stt-enfr-hf.toml
```
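
If you prefer to supervise the server from Python rather than a shell, a
minimal sketch follows; the `start_server` helper is hypothetical, and only
the enfr config path above appears in this README.

```python
# Hypothetical helper: run the STT server as a child process.
import subprocess

def start_server(config: str = "configs/config-stt-enfr-hf.toml") -> subprocess.Popen:
    """Launch `moshi-server worker` with the given config; caller owns the handle."""
    return subprocess.Popen(["moshi-server", "worker", "--config", config])

server = start_server()
try:
    server.wait()        # block until the server exits (Ctrl-C to stop it)
finally:
    server.terminate()
```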
## Text To Speech

We're in the process of open-sourcing our TTS models. Check back for updates!