From 1b362905f9bc1685862156c23a9a2c5422a24a8e Mon Sep 17 00:00:00 2001
From: laurent
Date: Thu, 19 Jun 2025 08:52:48 +0200
Subject: [PATCH] Tweaks.

---
 README.md | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 21ad38e..e8eed61 100644
--- a/README.md
+++ b/README.md
@@ -3,16 +3,23 @@ Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimod
 
 ## Speech-to-text
 
-DSM can be used to build streaming speech-to-text models. These models can be
-batched for efficiency, return word level timestamps, and are great for
-interactive applications. We provide two such models, these models are
-characterized by their size as well as the delay it takes for audio to be
-transcribed into text. We provide two such models:
+DSM can be used to build streaming speech-to-text models. We provide two such
+models, which differ in the delay between the audio input and the text output.
 - An English and French model with ~1b parameters using a 0.5 second delay,
   `kyutai/stt-1b-en_fr`.
 - An English only model with ~2.6b parameters using a 2.5 second delay,
   `kyutai/stt-2.6b-en`.
 
+These speech-to-text models have several advantages:
+- Easy batching for maximum efficiency: an H100 can process 400 streams in
+  real time.
+- Streaming inference: the models process audio in chunks, allowing for
+  real-time transcription, which is great for interactive applications.
+- Word-level timestamps are returned alongside the transcript.
+- Some models have a semantic Voice Activity Detection (VAD) component that
+  can be used to detect when the user is speaking. This is especially useful
+  for building voice agents.
+
 More details can be found on the [project page](https://kyutai.org/next/stt).
 
 You can retrieve the sample files used in the following snippets via: