From de30f2be23ebd376eb92dd25f3aa4ad97242e8f3 Mon Sep 17 00:00:00 2001
From: laurent
Date: Wed, 18 Jun 2025 07:38:33 +0200
Subject: [PATCH] More tweaks.

---
 README.md | 46 ++++++++++++----------------------------------
 1 file changed, 12 insertions(+), 34 deletions(-)

diff --git a/README.md b/README.md
index 9d4b9fd..1ea4d9b 100644
--- a/README.md
+++ b/README.md
@@ -3,10 +3,17 @@ Delayed Streams Modeling (DSM) is a flexible formulation for streaming, multimod
 
 ## Speech To Text
 
-### English only model
-The main model handles english only, it has ~2.6b parameters.
+DSM can be used to build streaming speech to text models. These models can
+be batched for efficiency, return word-level timestamps, and are great for
+interactive applications. We provide two such models, characterized by
+their size as well as the delay after which audio gets transcribed into
+text:
+- An English only model with ~2.6b parameters and a 2.5 second delay,
+  `kyutai/stt-2.6b-en`.
+- An English and French model with ~1b parameters and a 0.5 second delay,
+  `kyutai/stt-1b-en_fr`.
 
-#### PyTorch implementation
+### PyTorch implementation
 [[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en)
 Open In Colab
 
@@ -18,7 +25,7 @@
 python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
 ```
 
-#### MLX implementation
+### MLX implementation
 [[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-mlx)
 
 ```bash
@@ -26,7 +33,7 @@
 python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
 ```
 
-#### Rust implementation
+### Rust implementation
 [[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-candle)
 
 The Rust implementation provides a server that can process multiple streaming
@@ -59,35 +66,6 @@
 The script simulates some real-time processing of the audio. Faster processing
 can be triggered by setting the real-time factor, e.g. `--rtf 500` will process
 the data as fast as possible.
-### English + French model
-This model has ~1b parameters and supports both English and French.
-
-#### PyTorch implementation
-[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr)
-
-```bash
-# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
-python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr bria.mp3
-```
-
-#### MLX implementation
-[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr-mlx)
-
-```bash
-# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
-python -m moshi_mlx.run_inference --hf-repo kyutai/stt-1b-en_fr-mlx bria.mp3 --temp 0
-```
-
-#### Rust implementation
-[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr-candle)
-
-The only difference with the en only model is the config file used when
-launching the server.
-```bash
-moshi-server worker --config configs/config-stt-enfr-hf.toml
-```
-
-
 ## Text To Speech
 
 We're in the process of open-sourcing our TTS models. Check back for updates!
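
With this patch, the rewritten intro lists both checkpoints while the per-model sections below it only keep the `kyutai/stt-2.6b-en` commands. As a quick sketch of how each checkpoint from the new bullet list is invoked, the commands below are collected verbatim from this revision's own examples (the `stt-1b-en_fr` invocation comes from the section the patch removes); no flags or paths beyond those already shown in the README are assumed.

```bash
# Fetch the sample audio file used throughout the README.
wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3

# English only model (~2.6b parameters, 2.5 second delay), PyTorch:
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3

# English + French model (~1b parameters, 0.5 second delay), PyTorch:
python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr bria.mp3
```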