Add audio samples

Václav Volhejn 2025-06-25 19:13:12 +02:00
parent 9ba717e547
commit bf458a9cb6
4 changed files with 7 additions and 15 deletions

.gitignore (vendored)

@@ -192,5 +192,3 @@ cython_debug/
 # refer to https://docs.cursor.com/context/ignore-files
 .cursorignore
 .cursorindexingignore
-bria.mp3
-sample_fr_hibiki_crepes.mp3

README.md

@@ -48,12 +48,6 @@ Here is how to choose which one to use:
 MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon.
 If you want to run the model on a Mac or an iPhone, choose the MLX implementation.
-You can retrieve the sample files used in the following snippets via:
-```bash
-wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
-wget https://github.com/kyutai-labs/moshi/raw/refs/heads/main/data/sample_fr_hibiki_crepes.mp3
-```
 ### PyTorch implementation
 <a href="https://huggingface.co/kyutai/stt-2.6b-en" target="_blank" style="margin: 2px;">
 <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
@@ -70,12 +64,12 @@ This requires the [moshi package](https://pypi.org/project/moshi/)
 with version 0.2.6 or later, which can be installed via pip.
 ```bash
-python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
+python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3
 ```
 If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step and run directly:
 ```bash
-uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
+uvx --with moshi python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en audio/bria.mp3
 ```
 Additionally, we provide two scripts that highlight different usage scenarios. The first script illustrates how to extract word-level timestamps from the model's outputs:
@@ -84,7 +78,7 @@ Additionally, we provide two scripts that highlight different usage scenarios. T
 uv run \
   scripts/transcribe_from_file_via_pytorch.py \
   --hf-repo kyutai/stt-2.6b-en \
-  --file bria.mp3
+  --file audio/bria.mp3
 ```
 The second script can be used to run a model on an existing Hugging Face dataset and calculate its performance metrics:
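The performance metric for speech-to-text is typically word error rate (WER) and related edit-distance statistics. As a rough illustration of what such a metric computes (a sketch only; this diff does not show the evaluation script's internals, and the use of the `jiwer` package here is an assumption):

```python
# Sketch of WER computation, the usual quality metric for speech-to-text.
# The jiwer package and the example strings are illustrative assumptions,
# not outputs of the Kyutai models or the repository's evaluation script.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + insertions + deletions) / number of reference words
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")  # 2 substitutions over 9 words, about 22.22%
```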
@@ -130,7 +124,7 @@ uv run scripts/transcribe_from_mic_via_rust_server.py
 We also provide a script for transcribing from an audio file.
 ```bash
-uv run scripts/transcribe_from_file_via_rust_server.py bria.mp3
+uv run scripts/transcribe_from_file_via_rust_server.py audio/bria.mp3
 ```
 The script limits the decoding speed to simulate real-time processing of the audio.
@@ -147,7 +141,7 @@ A standalone Rust example script is provided in the `stt-rs` directory in this r
 This can be used as follows:
 ```bash
 cd stt-rs
-cargo run --features cuda -r -- bria.mp3
+cargo run --features cuda -r -- audio/bria.mp3
 ```
 You can get the timestamps by adding the `--timestamps` flag, and see the output
 of the semantic VAD by adding the `--vad` flag.
@@ -164,12 +158,12 @@ This requires the [moshi-mlx package](https://pypi.org/project/moshi-mlx/)
 with version 0.2.6 or later, which can be installed via pip.
 ```bash
-python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
+python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx audio/bria.mp3 --temp 0
 ```
 If you have [uv](https://docs.astral.sh/uv/) installed, you can skip the installation step and run directly:
 ```bash
-uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
+uvx --with moshi-mlx python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx audio/bria.mp3 --temp 0
 ```
 It will install the moshi package in a temporary environment and run the speech-to-text.

audio/bria.mp3 (new binary file, not shown)

audio/sample_fr_hibiki_crepes.mp3 (new binary file, not shown)