Make a pass over the Readme

Commit 9f388c6a70 (parent de8202bddc), changing README.md.
# Delayed Streams Modeling

This repo contains instructions and examples of how to run
Kyutai Speech-To-Text models.
These models are powered by delayed streams modeling (DSM),
a flexible formulation for streaming, multimodal sequence-to-sequence learning.
Text-to-speech models based on DSM coming soon!
## Kyutai Speech-To-Text

Kyutai STT models are optimized for real-time usage, can be batched for efficiency, and return word-level timestamps.

We provide two models:

- `kyutai/stt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https://kyutai.org/next/stt#semantic-vad).
- `kyutai/stt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.

**More details can be found on the [project page](https://kyutai.org/next/stt).**
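If you need to pick between the two checkpoints programmatically, the trade-offs in the list above can be encoded in a small lookup. This is a toy sketch; `MODELS` and `pick_model` are our own illustrative names, not part of any Kyutai package:

```python
# Published Kyutai STT checkpoints and their headline properties,
# as listed in this README.
MODELS = {
    "kyutai/stt-1b-en_fr": {"params_b": 1.0, "delay_s": 0.5, "languages": {"en", "fr"}},
    "kyutai/stt-2.6b-en": {"params_b": 2.6, "delay_s": 2.5, "languages": {"en"}},
}

def pick_model(language: str) -> str:
    """Return the lowest-delay checkpoint that supports `language`."""
    candidates = [m for m, info in MODELS.items() if language in info["languages"]]
    if not candidates:
        raise ValueError(f"no published model supports {language!r}")
    return min(candidates, key=lambda m: MODELS[m]["delay_s"])
```

Note that for English both models qualify, so this heuristic favors latency over model size; invert the key if accuracy matters more.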
You can retrieve the sample files used in the following snippets via:

```bash
# ...
```

### PyTorch implementation

This requires the [moshi package](https://pypi.org/project/moshi/)
with version 0.2.5 or later, which can be installed via pip.

```bash
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```
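To transcribe several files by scripting the CLI above, a minimal wrapper could look like the following. This is a sketch around the documented command only; `stt_command` and `transcribe` are hypothetical helpers, not part of the moshi package:

```python
import subprocess

def stt_command(audio_path, hf_repo="kyutai/stt-2.6b-en"):
    """Build the CLI invocation shown above for one audio file."""
    return ["python", "-m", "moshi.run_inference", "--hf-repo", hf_repo, audio_path]

def transcribe(paths, hf_repo="kyutai/stt-2.6b-en"):
    """Run the CLI once per file and collect stdout (the transcript)."""
    # Each call reloads the model, so for many files the batched
    # Rust server below is the better option.
    return [
        subprocess.run(stt_command(p, hf_repo), capture_output=True, text=True).stdout
        for p in paths
    ]
```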
### Rust server

<a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

The server can be installed with:

```bash
cargo install --features cuda moshi-server
```

Then the server can be started via the following command, using the config file
from this repository:
for `kyutai/stt-1b-en_fr`, use `configs/config-stt-en_fr-hf.toml`,
and for `kyutai/stt-2.6b-en`, use `configs/config-stt-en-hf.toml`.

```bash
moshi-server worker --config configs/config-stt-en_fr-hf.toml
```
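When launching the server from a script, the model-to-config pairing above can be kept in one place. A small sketch; `STT_CONFIGS` and `server_command` are our own names, not part of moshi-server:

```python
# Config files shipped in this repository, keyed by model id.
STT_CONFIGS = {
    "kyutai/stt-1b-en_fr": "configs/config-stt-en_fr-hf.toml",
    "kyutai/stt-2.6b-en": "configs/config-stt-en-hf.toml",
}

def server_command(model_id):
    """Build the moshi-server invocation for a given model id."""
    return ["moshi-server", "worker", "--config", STT_CONFIGS[model_id]]
```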
Once the server has started you can run a streaming inference with the following
script. The script simulates some real-time processing of the audio. Faster processing
can be triggered by setting the real-time factor, e.g. `--rtf 500` will process
the data as fast as possible.
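As a back-of-the-envelope check: the real-time factor is the ratio of audio duration to simulated wall-clock time, so the expected streaming time for a given `--rtf` value is just duration divided by the factor (illustrative only; actual throughput is bounded by model compute):

```python
def expected_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """With real-time factor `rtf`, the stream is fed `rtf` times faster
    than the audio plays, so simulated time is duration / rtf."""
    if rtf <= 0:
        raise ValueError("real-time factor must be positive")
    return audio_seconds / rtf

# A 10-minute file at --rtf 500 streams in about 600 / 500 = 1.2 seconds.
```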
### Rust standalone

<a href="https://huggingface.co/kyutai/stt-2.6b-en-candle" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

A standalone Rust example script is provided in the `stt-rs` directory in this repo.
This can be used as follows:

```bash
cd stt-rs
cargo run --features cuda -r -- bria.mp3
```
### MLX implementation

<a href="https://huggingface.co/kyutai/stt-2.6b-en-mlx" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue" style="display: inline-block; vertical-align: middle;"/>
</a>

[MLX](https://ml-explore.github.io/mlx/build/html/index.html) is Apple's ML framework that allows you to use
hardware acceleration on Apple silicon.

This requires the [moshi-mlx package](https://pypi.org/project/moshi-mlx/)
with version 0.2.5 or later, which can be installed via pip.

```bash
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```
## Text-to-Speech

We're in the process of open-sourcing our TTS models. Check back for updates!