Shorter names for STT scripts

commit 40c1d812d6
parent 07ac744609
Author: Vaclav Volhejn
Date: 2025-07-02 17:59:44 +02:00

8 changed files with 12 additions and 22 deletions


@@ -48,12 +48,12 @@ Here is how to choose which one to use:
 <a href="https://huggingface.co/kyutai/stt-2.6b-en" target="_blank" style="margin: 2px;">
 <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
 </a>
-<a target="_blank" href="https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/transcribe_via_pytorch.ipynb">
+<a target="_blank" href="https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/stt_pytorch.ipynb">
 <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
 </a>
 For an example of how to use the model in a way where you can directly stream in PyTorch tensors,
-[see our Colab notebook](https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/transcribe_via_pytorch.ipynb).
+[see our Colab notebook](https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/stt_pytorch.ipynb).
 This requires the [moshi package](https://pypi.org/project/moshi/)
 with version 0.2.6 or later, which can be installed via pip.
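The "version 0.2.6 or later" requirement mentioned in the context above can be checked programmatically. A minimal sketch using only the standard library (the `is_at_least` helper is illustrative, not part of the repo, and assumes a plain numeric `x.y.z` version scheme):

```python
from importlib.metadata import PackageNotFoundError, version


def is_at_least(installed: str, required=(0, 2, 6)) -> bool:
    """Compare a dotted version string against a required tuple.

    Assumes a plain numeric "x.y.z" scheme with no pre-release suffixes.
    """
    parts = tuple(int(p) for p in installed.split(".")[:3])
    return parts >= required


def moshi_is_recent_enough() -> bool:
    """True if an installed moshi package satisfies the >= 0.2.6 requirement."""
    try:
        return is_at_least(version("moshi"))
    except PackageNotFoundError:
        return False
```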
@@ -71,7 +71,7 @@ Additionally, we provide two scripts that highlight different usage scenarios. T
 ```bash
 uv run \
-    scripts/transcribe_from_file_via_pytorch.py \
+    scripts/stt_from_file_pytorch.py \
     --hf-repo kyutai/stt-2.6b-en \
     --file audio/bria.mp3
 ```
@@ -85,7 +85,7 @@ uv run scripts/evaluate_on_dataset.py \
 Another example shows how one can provide a text-, audio-, or text-audio prompt to our STT model:
 ```bash
-uv run scripts/transcribe_from_file_via_pytorch_with_prompt.py \
+uv run scripts/stt_from_file_pytorch_with_prompt.py \
     --hf-repo kyutai/stt-2.6b-en \
     --file bria.mp3 \
     --prompt_file ./audio/loonah.mp3 \
@@ -131,12 +131,12 @@ moshi-server worker --config configs/config-stt-en_fr-hf.toml
 Once the server has started you can transcribe audio from your microphone with the following script.
 ```bash
-uv run scripts/transcribe_from_mic_via_rust_server.py
+uv run scripts/stt_from_mic_rust_server.py
 ```
 We also provide a script for transcribing from an audio file.
 ```bash
-uv run scripts/transcribe_from_file_via_rust_server.py audio/bria.mp3
+uv run scripts/stt_from_file_rust_server.py audio/bria.mp3
 ```
 The script limits the decoding speed to simulate real-time processing of the audio.
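The real-time pacing described in the context line above (limiting decoding speed to simulate live audio) can be sketched as follows. This is a hypothetical illustration, not the script's actual code: feed fixed-size chunks to a consumer and sleep whenever wall-clock time runs ahead of audio time.

```python
import time


def stream_in_real_time(samples, sample_rate, chunk_size, process):
    """Feed `samples` to `process` chunk by chunk, no faster than real time."""
    start = time.monotonic()
    for offset in range(0, len(samples), chunk_size):
        chunk = samples[offset:offset + chunk_size]
        process(chunk)
        # Seconds of audio handed over so far.
        audio_time = (offset + len(chunk)) / sample_rate
        # If we are ahead of the wall clock, wait until real time catches up.
        lag = audio_time - (time.monotonic() - start)
        if lag > 0:
            time.sleep(lag)
```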
@@ -181,7 +181,7 @@ and just prefix the command above with `uvx --with moshi-mlx`.
 If you want to transcribe audio from your microphone, use:
 ```bash
-python scripts/transcribe_from_mic_via_mlx.py
+python scripts/stt_from_mic_mlx.py
 ```
 The MLX models can also be used in Swift using the [moshi-swift
@@ -190,7 +190,7 @@ tested to work fine on an iPhone 16 Pro.
 ## Kyutai Text-to-Speech
-<a target="_blank" href="https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/transcribe_via_pytorch.ipynb">
+<a target="_blank" href="https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/stt_pytorch.ipynb">
 <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
 </a>


@@ -14,15 +14,7 @@ import tqdm
 class PromptHook:
-    def __init__(
-        self,
-        tokenizer,
-        prefix,
-        padding_tokens=(
-            0,
-            3,
-        ),
-    ):
+    def __init__(self, tokenizer, prefix, padding_tokens=(0, 3)):
         self.tokenizer = tokenizer
         self.prefix_enforce = deque(self.tokenizer.encode(prefix))
         self.padding_tokens = padding_tokens
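The `PromptHook` above stores the encoded prompt prefix in a `deque` alongside a set of padding token ids. One plausible way such a hook could be used during decoding — a toy stand-in, not the repo's actual logic, with the pass-through treatment of padding tokens being an assumption — is to override sampled tokens with the enforced prefix until the deque is exhausted:

```python
from collections import deque


class ToyPromptHook:
    """Illustrative stand-in for a prefix-enforcing sampling hook.

    While prefix tokens remain, they override whatever the model sampled;
    padding tokens are assumed to pass through untouched (an assumption,
    not taken from the repo).
    """

    def __init__(self, prefix_tokens, padding_tokens=(0, 3)):
        self.prefix_enforce = deque(prefix_tokens)
        self.padding_tokens = padding_tokens

    def on_token(self, sampled: int) -> int:
        # Padding tokens pass through; real tokens consume the prefix first.
        if sampled in self.padding_tokens or not self.prefix_enforce:
            return sampled
        return self.prefix_enforce.popleft()
```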
@@ -141,7 +133,7 @@ def main(args):
     prompt_frames = audio_prompt.shape[1] // mimi.frame_size
     no_prompt_offset_seconds = audio_delay_seconds + audio_silence_prefix_seconds
     no_prompt_offset = int(no_prompt_offset_seconds * mimi.frame_rate)
-    text_tokens = text_tokens[prompt_frames + no_prompt_offset:]
+    text_tokens = text_tokens[prompt_frames + no_prompt_offset :]
     text = tokenizer.decode(
         text_tokens[text_tokens > padding_token_id].numpy().tolist()
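The hunk above trims away the frames that covered the prompt and delay region, then keeps only tokens above the padding id before decoding. A plain-Python mirror of that post-processing (the function name and the padding id of 3 are illustrative, not taken from the repo):

```python
def postprocess_tokens(token_ids, prompt_frames, no_prompt_offset,
                       padding_token_id=3):
    """Drop tokens from the prompt/offset region, then strip padding.

    Mirrors `text_tokens[prompt_frames + no_prompt_offset:]` followed by
    `text_tokens[text_tokens > padding_token_id]` using plain lists.
    """
    kept = token_ids[prompt_frames + no_prompt_offset:]
    return [t for t in kept if t > padding_token_id]
```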


@@ -228,11 +228,9 @@
   "provenance": []
  },
  "kernelspec": {
-  "display_name": "Python 3",
+  "display_name": "Python 3 (ipykernel)",
+  "language": "python",
   "name": "python3"
-  },
-  "language_info": {
-   "name": "python"
   }
 },
 "nbformat": 4,