More tweaks.

parent cb1465c706
commit de30f2be23

README.md (46 lines changed)
Delayed Streams Modeling (DSM) is a flexible formulation for streaming,
multimodal sequence-to-sequence learning.

## Speech To Text

DSM can be used to build streaming speech-to-text models. These models can be
batched for efficiency, return word-level timestamps, and are great for
interactive applications. We provide two such models, characterized by their
size and by the delay between receiving audio and producing its transcription:

- An English only model with ~2.6b parameters and a 2.5 second delay,
  `kyutai/stt-2.6b-en`.
- An English and French model with ~1b parameters and a 0.5 second delay,
  `kyutai/stt-1b-en_fr`.
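With the PyTorch implementation described below, switching between the two
models is just a matter of pointing `--hf-repo` at the corresponding
repository; a minimal sketch using the same `bria.mp3` sample as the rest of
this README:

```bash
# English only: ~2.6b parameters, 2.5 second delay.
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3

# English + French: ~1b parameters, 0.5 second delay.
python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr bria.mp3
```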

### English only model

The main model handles English only and has ~2.6b parameters.

#### PyTorch implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en)
<a target="_blank" href="https://colab.research.google.com/drive/1mc0Q-FoHxU2pEvId8rTdS4q1r1zorJhS?usp=sharing">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en bria.mp3
```
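The same entry point can be driven from a small shell loop when there are
several files to transcribe; a minimal sketch (the `audio/` directory and the
`.mp3` extension are assumptions for the example, not part of the repo):

```bash
# Transcribe every .mp3 in a local audio/ directory, one file at a time.
# Directory name and extension are placeholders for this sketch.
for f in audio/*.mp3; do
  python -m moshi.run_inference --hf-repo kyutai/stt-2.6b-en "$f"
done
```

For higher throughput, the Rust server described below is the better fit, as
it is designed to handle multiple streams at once.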

#### MLX implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-mlx)

```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0
```
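The `--temp 0` flag makes decoding deterministic (greedy) rather than sampled,
which is usually what you want for transcription. A quick sanity check is to
run the same file twice and compare the outputs; the sketch below assumes the
transcript ends up on stdout:

```bash
# Two runs with --temp 0 should produce identical transcripts
# (assumes the transcript is printed on stdout).
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0 > first.txt
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-2.6b-en-mlx bria.mp3 --temp 0 > second.txt
diff first.txt second.txt && echo "transcripts match"
```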

#### Rust implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-2.6b-en-candle)

The Rust implementation provides a server that can process multiple streaming
queries in parallel.

[...]

The script simulates some real-time processing of the audio. Faster processing
can be triggered by setting the real-time factor, e.g. `--rtf 500` will process
the data as fast as possible.
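One way to read the real-time factor (an assumption of this sketch rather than
something spelled out here): audio is pushed to the server N times faster than
real time, so the streaming time is roughly the audio duration divided by the
factor.

```bash
# Back-of-the-envelope: streaming time ~= audio duration / rtf.
# The numbers below are only an example.
audio_seconds=300   # a 5 minute file
rtf=500
echo "scale=2; $audio_seconds / $rtf" | bc   # prints .60 (seconds)
```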

### English + French model

This model has ~1b parameters and supports both English and French.

#### PyTorch implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr)

```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi.run_inference --hf-repo kyutai/stt-1b-en_fr bria.mp3
```

#### MLX implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr-mlx)

```bash
# wget https://github.com/metavoiceio/metavoice-src/raw/main/assets/bria.mp3
python -m moshi_mlx.run_inference --hf-repo kyutai/stt-1b-en_fr-mlx bria.mp3 --temp 0
```

#### Rust implementation
[[Hugging Face]](https://huggingface.co/kyutai/stt-1b-en_fr-candle)

The only difference from the English-only model is the config file used when
launching the server.

```bash
moshi-server worker --config configs/config-stt-enfr-hf.toml
```
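The server runs in the foreground, so a typical session uses two shells: one
for the server and one for a streaming client. The client invocation is left
as a commented placeholder because the client script itself is not shown here:

```bash
# Shell 1: start the streaming STT server with the English + French config.
moshi-server worker --config configs/config-stt-enfr-hf.toml

# Shell 2: point the repo's streaming client at the server, e.g. with the
# --rtf option mentioned above. The script name is not shown here, so it is
# left as a placeholder:
# <client-script> bria.mp3 --rtf 500
```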

## Text To Speech

We're in the process of open-sourcing our TTS models. Check back for updates!