Fix usage examples and a few small things (#24)

Václav Volhejn 2025-07-02 08:58:45 +02:00 committed by GitHub
parent 4985940aad
commit 395eaeae95
4 changed files with 17 additions and 30 deletions

View File

@@ -59,18 +59,17 @@ Here is how to choose which one to use:
 For an example of how to use the model in a way where you can directly stream in PyTorch tensors,
 [see our Colab notebook](https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/transcribe_via_pytorch.ipynb).
-If you just want to run the model on a file, you can use `moshi.run_inference`.
 This requires the [moshi package](https://pypi.org/project/moshi/)
 with version 0.2.6 or later, which can be installed via pip.
+If you just want to run the model on a file, you can use `moshi.run_inference`.
 
 ```bash
 python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3
 ```
 
-If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step and run directly:
-```bash
-uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3
-```
+If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step
+and just prefix the command above with `uvx --with moshi`.
 
 Additionally, we provide two scripts that highlight different usage scenarios. The first script illustrates how to extract word-level timestamps from the model's outputs:
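For reference, the same file transcription can also be driven from Python by shelling out to the CLI shown above. This is only a minimal sketch, assuming the moshi package (0.2.6 or later) is installed in the current environment and that `audio/bria.mp3` exists:

```python
# Minimal sketch: invoke the documented CLI from Python and capture its output.
# Assumes `pip install "moshi>=0.2.6"` and that audio/bria.mp3 exists locally.
import subprocess
import sys

result = subprocess.run(
    [
        sys.executable, "-m", "moshi.run_inference",
        "--hf-repo", "kyutai/stt-2.6b-en",
        "audio/bria.mp3",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```

With uv, the equivalent is to put `uvx --with moshi` in front of the same `python -m moshi.run_inference ...` invocation instead of installing the package first.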
@@ -157,15 +156,20 @@ hardware acceleration on Apple silicon.
 This requires the [moshi-mlx package](https://pypi.org/project/moshi-mlx/)
 with version 0.2.6 or later, which can be installed via pip.
+If you just want to run the model on a file, you can use `moshi_mlx.run_inference`:
 
 ```bash
 python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx audio/bria.mp3 --temp 0
 ```
 
-If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step and run directly:
-```bash
-uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx audio/bria.mp3 --temp 0
-```
-It will install the moshi package in a temporary environment and run the speech-to-text.
+If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step
+and just prefix the command above with `uvx --with moshi-mlx`.
+
+If you want to transcribe audio from your microphone, use:
+```bash
+python scripts/transcribe_from_mic_via_mlx.py
+```
 
 The MLX models can also be used in swift using the [moshi-swift
 codebase](https://github.com/kyutai-labs/moshi-swift), the 1b model has been
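Before running the microphone example it can be useful to confirm which input device will be captured. The sketch below only lists input-capable devices with sounddevice, which the mic script already depends on; the check itself is not part of the script:

```python
# Sketch: list audio input devices before running the microphone example.
# Assumes the sounddevice package is installed (the mic script requires it anyway).
import sounddevice as sd

print("Default (input, output) device indices:", sd.default.device)
for idx, dev in enumerate(sd.query_devices()):
    if dev["max_input_channels"] > 0:
        print(f"{idx}: {dev['name']} ({dev['max_input_channels']} input channels)")
```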

View File

@@ -14,14 +14,6 @@
 Example implementation of the streaming STT example. Here we group
 test utterances in batches (pre- and post-padded with silence)
 and then feed these batches into the streaming STT model frame-by-frame.
-
-Example command:
-```
-uv run scripts/streaming_stt.py \
-  --dataset meanwhile \
-  --hf-repo kyutai/stt-2.6b-en
-```
 """
 
 # The outputs I get on my H100 using this code with the 2.6B model,
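The docstring's batching scheme, stripped of the model specifics, looks roughly like the sketch below: pad every utterance with silence on both sides, stack the batch, then hand the model one frame per utterance at a time. The frame size (1920 samples, i.e. 80 ms at 24 kHz) and the `stt_step` callable are illustrative assumptions, not the script's actual API:

```python
# Illustrative sketch of the batching described in the docstring above:
# pre/post-pad each utterance with silence, stack, then feed frame-by-frame.
# FRAME_SIZE and `stt_step` are hypothetical stand-ins, not the script's real API.
import torch
import torch.nn.functional as F

FRAME_SIZE = 1920   # assumed 80 ms frames at 24 kHz
PAD_FRAMES = 12     # frames of silence added before and after each utterance


def pad_with_silence(wav: torch.Tensor) -> torch.Tensor:
    """wav: (num_samples,) mono waveform -> waveform with silence on both sides."""
    silence = torch.zeros(PAD_FRAMES * FRAME_SIZE)
    return torch.cat([silence, wav, silence])


def run_batch(utterances: list[torch.Tensor], stt_step) -> None:
    padded = [pad_with_silence(u) for u in utterances]
    max_len = max(p.shape[0] for p in padded)
    # Right-pad to a common length so the utterances can be stacked into a batch.
    batch = torch.stack([F.pad(p, (0, max_len - p.shape[0])) for p in padded])
    for start in range(0, max_len - FRAME_SIZE + 1, FRAME_SIZE):
        frame = batch[:, start : start + FRAME_SIZE]   # one frame per utterance
        stt_step(frame)   # hypothetical: pass the frame batch to the streaming model
```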

View File

@@ -10,13 +10,6 @@
 """An example script that illustrates how one can get per-word timestamps from
 Kyutai STT models.
-
-Usage:
-```
-uv run scripts/streaming_stt_timestamps.py \
-  --hf-repo kyutai/stt-2.6b-en \
-  --file bria.mp3
-```
 """
 
 import argparse
@@ -185,6 +178,8 @@ def main(args):
         if text_tokens is not None:
             text_tokens_accum.append(text_tokens)
+            print(tokenizer.decode(text_tokens.numpy().tolist()))
 
     utterance_tokens = torch.concat(text_tokens_accum, dim=-1)
     timed_text = tokens_to_timestamped_text(
         utterance_tokens,
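Stripped of the model specifics, the pattern in this hunk is: print each newly emitted chunk of text tokens as soon as it arrives, keep every chunk, and concatenate along the time dimension only once at the end. A stand-alone sketch of that pattern (the `fake_decode` helper and the tensor shapes are illustrative, not the real tokenizer):

```python
# Stand-alone sketch of the accumulate-then-concat pattern from the hunk above.
# `fake_decode` stands in for the real text tokenizer; shapes are illustrative.
import torch


def fake_decode(token_ids: list[int]) -> str:
    return " ".join(str(t) for t in token_ids)


text_tokens_accum: list[torch.Tensor] = []
for step in range(3):
    # Pretend the model emitted a (1, 1, 2) chunk of text tokens this step.
    text_tokens = torch.tensor([[[10 * step, 10 * step + 1]]])
    text_tokens_accum.append(text_tokens)
    # Print the partial transcript as soon as the chunk arrives.
    print(fake_decode(text_tokens.flatten().tolist()))

# Concatenate along the last (time) dimension only once, at the end.
utterance_tokens = torch.concat(text_tokens_accum, dim=-1)
print(utterance_tokens.shape)  # torch.Size([1, 1, 6])
```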
@@ -201,11 +196,7 @@
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Example streaming STT w/ timestamps.")
-    parser.add_argument(
-        "--file",
-        required=True,
-        help="File to transcribe.",
-    )
+    parser.add_argument("in_file", help="The file to transcribe.")
     parser.add_argument(
         "--hf-repo", type=str, help="HF repo to load the STT model from. "

View File

@@ -70,7 +70,7 @@ if __name__ == "__main__":
     def audio_callback(indata, _frames, _time, _status):
         block_queue.put(indata.copy())
 
-    print("start recording the user input")
+    print("recording audio from microphone, speak to get your words transcribed")
     with sd.InputStream(
         channels=1,
         dtype="float32",