From 6f4ef1eae85fee9c4987e247564e7c8eb7e59b46 Mon Sep 17 00:00:00 2001
From: Gabriel de Marmiesse
Date: Wed, 18 Jun 2025 12:45:33 +0200
Subject: [PATCH 1/2] Add uv instructions and ignore the sample audio files
 (#1)

* Add uv instructions and ignore the sample audio file

* Add french sample

* Clarify real-time

* Remove empty space
---
 .gitignore |  4 +++-
 README.md  | 17 +++++++++++++++--
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/.gitignore b/.gitignore
index 7b004e5..013ebc7 100644
--- a/.gitignore
+++ b/.gitignore
@@ -191,4 +191,6 @@ cython_debug/
 #  exclude from AI features like autocomplete and code analysis. Recommended for sensitive data
 #  refer to https://docs.cursor.com/context/ignore-files
 .cursorignore
-.cursorindexingignore
\ No newline at end of file
+.cursorindexingignore
+bria.mp3
+sample_fr_hibiki_crepes.mp3
diff --git a/README.md b/README.md
index c8bc0be..21ad38e 100644
--- a/README.md
+++ b/README.md
@@ -36,6 +36,12 @@ with version 0.2.5 or later, which can be installed via pip.
 python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
 ```
 
+If you have `uv` installed, you can skip the installation step and run directly:
+```bash
+uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
+```
+It will install the moshi package in a temporary environment and run the speech-to-text model.
+
 ### MLX implementation
 
 Hugging Face
@@ -48,6 +54,12 @@ with version 0.2.5 or later, which can be installed via pip.
 python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
 ```
 
+If you have `uv` installed, you can skip the installation step and run directly:
+```bash
+uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
+```
+It will install the moshi-mlx package in a temporary environment and run the speech-to-text model.
+
 ### Rust implementation
 
 Hugging Face
@@ -91,8 +103,9 @@ script.
 uv run scripts/asr-streaming-query.py bria.mp3
 ```
 
-The script simulates some real-time processing of the audio. Faster processing
-can be triggered by setting the real-time factor, e.g. `--rtf 500` will process
+The script limits the decoding speed to simulate real-time processing of the audio.
+Faster processing can be triggered by setting
+the real-time factor, e.g. `--rtf 500` will process
 the data as fast as possible.
 
 ## Text-to-Speech

From 1b362905f9bc1685862156c23a9a2c5422a24a8e Mon Sep 17 00:00:00 2001
From: laurent
Date: Thu, 19 Jun 2025 08:52:48 +0200
Subject: [PATCH 2/2] Tweaks.

---
 README.md | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 21ad38e..e8eed61 100644
--- a/README.md
+++ b/README.md
@@ -3,16 +3,23 @@ Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimod
 
 ## Speech-to-text
 
-DSM can be used to build streaming speech-to-text models. These models can be
-batched for efficiency, return word level timestamps, and are great for
-interactive applications. We provide two such models, these models are
-characterized by their size as well as the delay it takes for audio to be
-transcribed into text. We provide two such models:
+DSM can be used to build streaming speech-to-text models. We provide two such models
+with different delays between the audio input and the text output.
 - An English and French model with ~1b parameters using a 0.5 second delay,
   `kyutai/stt-1b-en_fr`.
 - An English only model with ~2.6b parameters using a 2.5 second delay,
   `kyutai/stt-2.6b-en`.
 
+These speech-to-text models have several advantages:
+- Easy batching for maximum efficiency: an H100 can process 400 streams in
+  real-time.
+- Streaming inference: the models can process audio in chunks, which allows
+  for real-time transcription and is great for interactive applications.
+- The models return word-level timestamps.
+- Some models have a semantic Voice Activity Detection (VAD) component that
+  can be used to detect when the user is speaking. This is especially useful
+  for building voice agents.
+
 More details can be found on the [project page](https://kyutai.org/next/stt).
 
 You can retrieve the sample files used in the following snippets via:
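For a quick end-to-end check of the French-capable checkpoint described above, the `uvx` pattern from the first patch can be pointed at the newly ignored French sample. This is a sketch only: it assumes `kyutai/stt-1b-en_fr` is accepted by `moshi.run_inference` in the same way as `kyutai/stt-2.6b-en`, and that `sample_fr_hibiki_crepes.mp3` has already been downloaded locally.

```bash
# Sketch: assumes kyutai/stt-1b-en_fr works with the same moshi.run_inference
# entry point as kyutai/stt-2.6b-en, and that the French sample file
# (sample_fr_hibiki_crepes.mp3, git-ignored by the first patch) is present.
uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr sample_fr_hibiki_crepes.mp3
```

Using `uvx` keeps the temporary-environment approach documented in the patch, so nothing needs to be installed permanently.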