From de30f2be23ebd376eb92dd25f3aa4ad97242e8f3 Mon Sep 17 00:00:00 2001
From: laurent
Date: Wed, 18 Jun 2025 07:38:33 +0200
Subject: [PATCH] More tweaks.

---
 README.md | 46 ++++++++++++----------------------------------
 1 file changed, 12 insertions(+), 34 deletions(-)

diff --git a/README.md b/README.md
index 9d4b9fd..1ea4d9b 100644
--- a/README.md
+++ b/README.md
@@ -3,10 +3,17 @@ Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimod
 
 ## Speech To Text
 
-### English only model
-The main model handles english only, it has ~2.6b parameters.
+DSM can be used to build streaming speech to text models. These models can
+be batched for efficiency, return word-level timestamps, and are great for
+interactive applications. We provide two such models, characterized by
+their size as well as the delay after which audio gets transcribed into
+text:
+- An English only model with ~2.6b parameters and a 2.5 second delay,
+  `kyutai/stt-2.6b-en`.
+- An English and French model with ~1b parameters and a 0.5 second delay,
+  `kyutai/stt-1b-en_fr`.
 
-#### PyTorch implementation
+### PyTorch implementation
 [[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en)
 Open In Colab
 
@@ -18,7 +25,7 @@
 python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
 ```
 
-#### MLX implementation
+### MLX implementation
 [[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-mlx)
 
 ```bash
@@ -26,7 +33,7 @@
 python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
 ```
 
-#### Rust implementation
+### Rust implementation
 [[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-candle)
 
 The Rust implementation provides a server that can process multiple streaming
@@ -59,35 +66,6 @@
 The script simulates some real-time processing of the audio. Faster processing
 can be triggered by setting the real-time factor, e.g. `--rtf 500` will process
 the data as fast as possible.
-### English + French model
-This model has ~1b parameters and supports both English and French.
-
-#### PyTorch implementation
-[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr)
-
-```bash
-# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
-python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr bria.mp3
-```
-
-#### MLX implementation
-[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr-mlx)
-
-```bash
-# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
-python -m moshi_mlx.run_inference --hf-repo kyutai/stt-1b-en_fr-mlx bria.mp3 --temp 0
-```
-
-#### Rust implementation
-[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr-candle)
-
-The only difference with the en only model is the config file used when
-launching the server.
-```bash
-moshi-server worker --config configs/config-stt-enfr-hf.toml
-```
-
-
 ## Text To Speech
 
 We're in the process of open-sourcing our TTS models. Check back for updates!
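
With this patch, the rewritten intro lists both checkpoints while the per-model sections below it only keep the `kyutai/stt-2.6b-en` commands. As a quick sketch of how each checkpoint from the new bullet list is invoked, the commands below are collected verbatim from this revision's own examples (the `stt-1b-en_fr` invocation comes from the section the patch removes); no flags or paths beyond those already shown in the README are assumed.

```bash
# Fetch the sample audio file used throughout the README.
wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3

# English only model (~2.6b parameters, 2.5 second delay), PyTorch:
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3

# English + French model (~1b parameters, 0.5 second delay), PyTorch:
python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr bria.mp3
```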