Compare commits: main...vv/fix-exa (1 commit)

Commit 188e4add10 — changed file: README.md (20 lines)
README.md

@@ -59,18 +59,17 @@ Here is how to choose which one to use:
 For an example of how to use the model in a way where you can directly stream in PyTorch tensors,
 [see our Colab notebook](https://colab.research.google.com/github/kyutai-labs/delayed-streams-modeling/blob/main/transcribe_via_pytorch.ipynb).
 
-If you just want to run the model on a file, you can use `moshi.run_inference`.
 This requires the [moshi package](https://pypi.org/project/moshi/)
 with version 0.2.6 or later, which can be installed via pip.
 
+If you just want to run the model on a file, you can use `moshi.run_inference`.
+
 ```bash
 python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3
 ```
 
-If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step and run directly:
-```bash
-uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3
-```
+If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step
+and just prefix the command above with `uvx --with moshi`.
 
 Additionally, we provide two scripts that highlight different usage scenarios. The first script illustrates how to extract word-level timestamps from the model's outputs:
 
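Reviewer note: the deleted `uvx` block and the new prose describe the same invocation, `uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3`. As a minimal sketch of driving the documented CLI from Python: the command strings come from the README hunk above, while the `transcribe_file` wrapper itself is illustrative and not part of the repo:

```python
import shutil
import subprocess


def transcribe_file(audio_path: str, use_uvx: bool = False) -> str:
    """Run the documented moshi STT CLI on one file and return its stdout."""
    cmd = [
        "python", "-m", "moshi.run_inference",
        "--hf-repo", "kyutai/stt-2.6b-en", audio_path,
    ]
    if use_uvx:
        # Same command, prefixed as the new README text describes;
        # requires uv to be installed and on PATH.
        if shutil.which("uvx") is None:
            raise RuntimeError("uv is not installed")
        cmd = ["uvx", "--with", "moshi", *cmd]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    print(transcribe_file("audio/bria.mp3"))
```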
@@ -157,15 +156,20 @@ hardware acceleration on Apple silicon.
 This requires the [moshi-mlx package](https://pypi.org/project/moshi-mlx/)
 with version 0.2.6 or later, which can be installed via pip.
 
+If you just want to run the model on a file, you can use `moshi_mlx.run_inference`:
+
 ```bash
 python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx audio/bria.mp3 --temp 0
 ```
 
-If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step and run directly:
+If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step
+and just prefix the command above with `uvx --with moshi-mlx`.
+
+If you want to transcribe audio from your microphone, use:
+
 ```bash
-uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx audio/bria.mp3 --temp 0
+python scripts/transcribe_from_mic_via_mlx.py
 ```
-It will install the moshi package in a temporary environment and run the speech-to-text.
 
 The MLX models can also be used in swift using the [moshi-swift
 codebase](https://github.com/kyutai-labs/moshi-swift), the 1b model has been
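Reviewer note: both the kept and the rewritten command pass `--temp 0`, i.e. greedy decoding. A standalone toy illustration of what the temperature parameter does to token sampling; this is not moshi-mlx's actual sampler, and the function and logit values are made up for the demo:

```python
import numpy as np


def sample_token(logits: np.ndarray, temp: float, rng: np.random.Generator) -> int:
    """Pick a token id from raw logits; temp == 0 degenerates to argmax (greedy)."""
    if temp == 0:
        return int(np.argmax(logits))
    scaled = logits / temp
    probs = np.exp(scaled - np.max(scaled))  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))


rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5])
print(sample_token(logits, temp=0, rng=rng))    # deterministic: always token 0
print(sample_token(logits, temp=1.0, rng=rng))  # stochastic: varies with the seed
```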
scripts/streaming_stt.py

@@ -14,14 +14,6 @@
 Example implementation of the streaming STT example. Here we group
 test utterances in batches (pre- and post-padded with silence) and
 and then feed these batches into the streaming STT model frame-by-frame.
-
-Example command:
-```
-uv run scripts/streaming_stt.py \
-  --dataset meanwhile \
-  --hf-repo kyutai/stt-2.6b-en
-```
-
 """
 
 # The outputs I get on my H100 using this code with the 2.6B model,
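The docstring kept above describes the batching scheme: utterances are pre- and post-padded with silence, stacked into a batch, and fed to the model frame-by-frame. A rough sketch of that preprocessing under stated assumptions; the helper, the 1920-sample frame size, and the dummy waveforms are illustrative, not the script's actual code:

```python
import torch
import torch.nn.functional as F


def batch_with_silence(waves: list[torch.Tensor], pad_samples: int) -> torch.Tensor:
    """Pre-/post-pad each 1-D waveform with silence, right-pad all of them to a
    common length, and stack into a (batch, samples) tensor."""
    padded = [F.pad(w, (pad_samples, pad_samples)) for w in waves]
    longest = max(w.shape[-1] for w in padded)
    padded = [F.pad(w, (0, longest - w.shape[-1])) for w in padded]
    return torch.stack(padded)


batch = batch_with_silence([torch.randn(16000), torch.randn(24000)], pad_samples=8000)
frame_size = 1920  # assumed: 24 kHz audio at a 12.5 Hz frame rate
for start in range(0, batch.shape[-1], frame_size):
    chunk = batch[:, start : start + frame_size]  # one frame per model step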
scripts/streaming_stt_timestamps.py

@@ -10,13 +10,6 @@
 """An example script that illustrates how one can get per-word timestamps from
 Kyutai STT models.
 
-Usage:
-```
-uv run scripts/streaming_stt_timestamps.py \
-  --hf-repo kyutai/stt-2.6b-en \
-  --file bria.mp3
-```
-
 """
 
 import argparse
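For context on what `tokens_to_timestamped_text` (seen in the next hunk) builds on: text tokens are emitted on a fixed frame grid, delayed relative to the audio, so a token's position in the source audio can be recovered from its frame index. A toy version with assumed constants; the real frame rate and delay come from the model config, not these numbers:

```python
FRAME_RATE_HZ = 12.5  # assumed token frame rate
TEXT_DELAY_S = 2.0    # assumed delay of the text stream behind the audio


def frame_to_seconds(frame_index: int) -> float:
    """Map a token's frame index back to a position in the source audio."""
    return max(frame_index / FRAME_RATE_HZ - TEXT_DELAY_S, 0.0)


print(frame_to_seconds(40))  # a token emitted at frame 40 maps to ~1.2 s
```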
@@ -185,6 +178,8 @@ def main(args):
         if text_tokens is not None:
             text_tokens_accum.append(text_tokens)
 
+            print(tokenizer.decode(text_tokens.numpy().tolist()))
+
     utterance_tokens = torch.concat(text_tokens_accum, dim=-1)
     timed_text = tokens_to_timestamped_text(
         utterance_tokens,
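The added `print` gives live feedback while the accumulate-then-concatenate pattern stays unchanged. A self-contained miniature of that pattern, with dummy token tensors standing in for the model step and no tokenizer:

```python
import torch

# Stand-ins for per-step model output; None mimics steps with no new text.
steps = [torch.tensor([[1, 2]]), None, torch.tensor([[3]])]

text_tokens_accum = []
for text_tokens in steps:
    if text_tokens is not None:
        text_tokens_accum.append(text_tokens)
        # The diff's added print would decode just this step's tokens here.
        print(text_tokens.numpy().tolist())

utterance_tokens = torch.concat(text_tokens_accum, dim=-1)
print(utterance_tokens)  # tensor([[1, 2, 3]])
```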
@@ -201,11 +196,7 @@ def main(args):
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Example streaming STT w/ timestamps.")
-    parser.add_argument(
-        "--file",
-        required=True,
-        help="File to transcribe.",
-    )
+    parser.add_argument("in_file", help="The file to transcribe.")
 
     parser.add_argument(
         "--hf-repo", type=str, help="HF repo to load the STT model from. "
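The switch from a required `--file` flag to an `in_file` positional changes the CLI shape. A minimal standalone demo of the new form; `parse_args` is fed a literal list here so the snippet runs anywhere:

```python
import argparse

parser = argparse.ArgumentParser(description="Example streaming STT w/ timestamps.")
parser.add_argument("in_file", help="The file to transcribe.")

# Old invocation: script.py --file bria.mp3 -> new invocation: script.py bria.mp3
args = parser.parse_args(["bria.mp3"])
print(args.in_file)  # -> bria.mp3
```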
scripts/transcribe_from_mic_via_mlx.py (file name inferred from the README hunk)

@@ -70,7 +70,7 @@ if __name__ == "__main__":
     def audio_callback(indata, _frames, _time, _status):
         block_queue.put(indata.copy())
 
-    print("start recording the user input")
+    print("recording audio from microphone, speak to get your words transcribed")
     with sd.InputStream(
         channels=1,
         dtype="float32",
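The hunk only rewords the status print, but the surrounding pattern is worth spelling out: the PortAudio callback pushes microphone blocks into a thread-safe queue that the main thread drains. A trimmed sketch under stated assumptions; the 24 kHz sample rate and the fixed block count are illustrative, and `sounddevice` must be installed:

```python
import queue

import sounddevice as sd  # pip install sounddevice

block_queue: queue.Queue = queue.Queue()


def audio_callback(indata, _frames, _time, _status):
    # Runs on the audio thread: copy the block, since PortAudio reuses the buffer.
    block_queue.put(indata.copy())


print("recording audio from microphone, speak to get your words transcribed")
with sd.InputStream(
    channels=1, dtype="float32", samplerate=24000, callback=audio_callback
):
    for _ in range(100):  # a real script loops until interrupted
        block = block_queue.get()  # blocks until the callback delivers audio
        # ...feed `block` to the streaming STT model here...
```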