Make a pass over the Readme

Commit 9f388c6a70 (parent de8202bddc), changing README.md.
# Delayed Streams Modeling

This repo contains instructions and examples of how to run
Kyutai Speech-To-Text models.
These models are powered by delayed streams modeling (DSM),
a flexible formulation for streaming, multimodal sequence-to-sequence learning.
Text-to-speech models based on DSM coming soon!
## Kyutai Speech-To-Text

Kyutai STT models are optimized for real-time usage, can be batched for efficiency, and return word-level timestamps.

We provide two models:

- `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
- `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.

**More details can be found on the [project page](https://kyutai.org/next/stt).**
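If you need to pick between the two checkpoints programmatically, the trade-offs in the list above can be encoded in a small lookup. This is a toy sketch; `MODELS` and `pick_model` are our own illustrative names, not part of any Kyutai package:

```python
# Published Kyutai STT checkpoints and their headline properties,
# as listed in this README.
MODELS = {
    "kyutai/stt-1b-en_fr": {"params_b": 1.0, "delay_s": 0.5, "languages": {"en", "fr"}},
    "kyutai/stt-2.6b-en": {"params_b": 2.6, "delay_s": 2.5, "languages": {"en"}},
}

def pick_model(language: str) -> str:
    """Return the lowest-delay checkpoint that supports `language`."""
    candidates = [m for m, info in MODELS.items() if language in info["languages"]]
    if not candidates:
        raise ValueError(f"no published model supports {language!r}")
    return min(candidates, key=lambda m: MODELS[m]["delay_s"])
```

Note that for English both models qualify, so this heuristic favors latency over model size; invert the key if accuracy matters more.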
You can retrieve the sample files used in the following snippets via:

```bash
# ...
```

### PyTorch implementation

This requires the [moshi package](https://pypi.org/project/moshi/)
with version 0.2.5 or later, which can be installed via pip.

```bash
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```
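To transcribe several files by scripting the CLI above, a minimal wrapper could look like the following. This is a sketch around the documented command only; `stt_command` and `transcribe` are hypothetical helpers, not part of the moshi package:

```python
import subprocess

def stt_command(audio_path, hf_repo="kyutai/stt-2.6b-en"):
    """Build the CLI invocation shown above for one audio file."""
    return ["python", "-m", "moshi.run_inference", "--hf-repo", hf_repo, audio_path]

def transcribe(paths, hf_repo="kyutai/stt-2.6b-en"):
    """Run the CLI once per file and collect stdout (the transcript)."""
    # Each call reloads the model, so for many files the batched
    # Rust server below is the better option.
    return [
        subprocess.run(stt_command(p, hf_repo), capture_output=True, text=True).stdout
        for p in paths
    ]
```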
### Rust server

<a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

The server can be installed with:

```bash
cargo install --features cuda moshi-server
```

Then the server can be started via the following command, using the config file
from this repository:
for `kyutai/stt-1b-en_fr`, use `configs/config-stt-en_fr-hf.toml`,
and for `kyutai/stt-2.6b-en`, use `configs/config-stt-en-hf.toml`.

```bash
moshi-server worker --config configs/config-stt-en_fr-hf.toml
```
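When launching the server from a script, the model-to-config pairing above can be kept in one place. A small sketch; `STT_CONFIGS` and `server_command` are our own names, not part of moshi-server:

```python
# Config files shipped in this repository, keyed by model id.
STT_CONFIGS = {
    "kyutai/stt-1b-en_fr": "configs/config-stt-en_fr-hf.toml",
    "kyutai/stt-2.6b-en": "configs/config-stt-en-hf.toml",
}

def server_command(model_id):
    """Build the moshi-server invocation for a given model id."""
    return ["moshi-server", "worker", "--config", STT_CONFIGS[model_id]]
```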
Once the server has started you can run a streaming inference with the following
script. The script simulates some real-time processing of the audio. Faster processing
can be triggered by setting the real-time factor, e.g. `--rtf 500` will process
the data as fast as possible.
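As a back-of-the-envelope check: the real-time factor is the ratio of audio duration to simulated wall-clock time, so the expected streaming time for a given `--rtf` value is just duration divided by the factor (illustrative only; actual throughput is bounded by model compute):

```python
def expected_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """With real-time factor `rtf`, the stream is fed `rtf` times faster
    than the audio plays, so simulated time is duration / rtf."""
    if rtf <= 0:
        raise ValueError("real-time factor must be positive")
    return audio_seconds / rtf

# A 10-minute file at --rtf 500 streams in about 600 / 500 = 1.2 seconds.
```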
### Rust standalone

<a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

A standalone Rust example script is provided in the `stt-rs` directory in this repo.
This can be used as follows:

```bash
cd stt-rs
cargo run --features cuda -r -- bria.mp3
```
### MLX implementation

<a href="https://huggingface.co/kyutai/stt-2.6b-en-mlx" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

[MLX](https://ml-explore.github.io/mlx/build/html/index.html) is Apple's ML framework that allows you to use
hardware acceleration on Apple silicon.

This requires the [moshi-mlx package](https://pypi.org/project/moshi-mlx/)
with version 0.2.5 or later, which can be installed via pip.

```bash
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```
## Text-to-Speech

We're in the process of open-sourcing our TTS models. Check back for updates!