This commit is contained in:
laurent 2025-06-19 08:52:48 +02:00
parent 6f4ef1eae8
commit 1b362905f9


@@ -3,16 +3,23 @@ Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimod
## Speech-to-text
DSM can be used to build streaming speech-to-text models. We provide two such models
with different delays between the audio input and the text output.
- An English and French model with ~1b parameters using a 0.5 second delay,
  `kyutai/stt-1b-en_fr`.
- An English only model with ~2.6b parameters using a 2.5 second delay,
  `kyutai/stt-2.6b-en`.
These speech-to-text models have several advantages:
- Easy batching for maximum efficiency: an H100 can process 400 streams in
  real-time.
- Streaming inference: the models process audio in chunks, enabling real-time
  transcription, which is great for interactive applications.
- Word-level timestamps for the transcribed text.
- Some models have a semantic Voice Activity Detection (VAD) component that
can be used to detect when the user is speaking. This is especially useful
for building voice agents.
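To make the delay concrete, here is a toy simulation of a delayed text stream. This is not the real inference API; the chunk size and one-word-per-chunk pacing are illustrative assumptions. The idea it sketches is that the word aligned with a given audio chunk is only emitted once the model has seen `delay_seconds` of additional audio:

```python
# Toy sketch of a delayed streaming transcriber (NOT the kyutai API):
# each word becomes available only `delay_seconds` after its audio chunk.

CHUNK_SECONDS = 0.5  # assumed audio chunk duration for this sketch


def stream_transcribe(words, delay_seconds):
    """Yield (emit_time, word) pairs, one word per audio chunk.

    `emit_time` is when the word would appear in the output stream:
    the end of its audio chunk plus the model's fixed delay.
    """
    delay_chunks = round(delay_seconds / CHUNK_SECONDS)
    for i, word in enumerate(words):
        emit_time = (i + 1 + delay_chunks) * CHUNK_SECONDS
        yield emit_time, word


if __name__ == "__main__":
    for t, w in stream_transcribe(["hello", "world", "from", "dsm"],
                                  delay_seconds=0.5):
        print(f"{t:.1f}s: {w}")
```

With a 0.5 second delay each word trails its audio by one chunk; with a 2.5 second delay it trails by five, which is the latency/accuracy trade-off between the two released models.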
More details can be found on the [project page](https://kyutai.org/next/stt).
You can retrieve the sample files used in the following snippets via: