CoMoSpeech

Implementation of CoMoSpeech. For full details, see our paper accepted to ACM MM 2023: CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model.

Authors: Zhen Ye, Wei Xue, Xu Tan, Jie Chen, Qifeng Liu, Yike Guo.

Update

2024-04-26

  • We propose FlashSpeech, an efficient zero-shot speech synthesizer based on a latent consistency model and adversarial training (Paper).

2023-12-01

  • We also propose a well-designed Singing Voice Conversion (SVC) version based on the consistency model (Code).

2023-11-30

  • We find that zero-mean Gaussian noise, used instead of the prior from Grad-TTS, can also achieve similar performance. We also release the new code and checkpoints.

2023-10-21

  • We add support for Heun’s 2nd-order method for the teacher model (it can be used for teacher-model sampling and yields a better ODE trajectory for consistency distillation).
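
For reference, one Heun (2nd-order) step along the probability-flow ODE looks roughly like the sketch below. This is only illustrative; the denoiser interface, the noise parameterization, and the conditioning argument mu are assumptions rather than the code in this repository.

    def heun_step(denoiser, x, t, t_next, mu):
        # One Heun (improved Euler) step from noise level t down to t_next.
        d = (x - denoiser(x, t, mu)) / t                      # slope estimate at t
        x_euler = x + (t_next - t) * d                        # Euler predictor
        if t_next > 0:
            d_next = (x_euler - denoiser(x_euler, t_next, mu)) / t_next
            x = x + (t_next - t) * 0.5 * (d + d_next)         # trapezoidal corrector
        else:
            x = x_euler
        return x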

Abstract

Demo page: link.

Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a consistency-model-based speech synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. A consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that by generating audio recordings with a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step-sampling-based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines.
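
To make the one-step idea concrete, here is a minimal sketch contrasting conventional multi-step diffusion sampling with single-step consistency sampling. It is not the repository's implementation: the model interfaces, the noise schedule, and the use of the text-encoder output mu as both prior mean and conditioning are illustrative assumptions (see also the 2023-11-30 update on the choice of prior).

    import torch

    def diffusion_sample(score_model, mu, n_steps, sigma_max=1.0):
        # Conventional multi-step sampling: one network call per step.
        x = mu + sigma_max * torch.randn_like(mu)
        ts = torch.linspace(sigma_max, 1e-3, n_steps)
        for t, t_next in zip(ts[:-1], ts[1:]):
            d = score_model(x, t, mu)         # estimated drift at noise level t
            x = x + (t_next - t) * d          # Euler step along the probability-flow ODE
        return x

    def consistency_sample(consistency_model, mu, sigma_max=1.0):
        # One-step sampling: a single network call maps the noisy prior sample
        # directly to a clean mel-spectrogram.
        x_T = mu + sigma_max * torch.randn_like(mu)
        return consistency_model(x_T, torch.tensor(sigma_max), mu)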

Prepare

Build the monotonic_align code (Cython):

cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..

Inference

Run the script inference.py, providing the path to a text file, the path to a checkpoint, and the number of sampling steps:

    python inference.py -f <text file> -c <checkpoint> -t <sampling steps> 

Generated audio is saved in the out folder. Note that in the params file, Teacher = True selects our teacher model and Teacher = False selects CoMoSpeech. In addition, we use the same vocoder as Grad-TTS; you can download it and put it into the checkpts folder.
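
As a hedged illustration of that flag (the exact spelling and surrounding contents of the params file may differ from what is shown here):

    # params (illustrative excerpt, not the actual file contents)
    Teacher = True     # sample with the multi-step diffusion teacher
    # Teacher = False  # sample with the distilled one-step CoMoSpeech

For example, with Teacher = False one would typically pass a small step count such as -t 1 to inference.py, while the teacher model uses more sampling steps; the specific step counts here are illustrative, not values prescribed by the repository.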

Training

We use the LJSpeech dataset and follow the train/test/val split from FastSpeech2; you can change the split in the fs2_txt folder. Then run the script train.py:

    python train.py 

Note that in the params file, Teacher = True trains our teacher model and Teacher = False trains CoMoSpeech. When training CoMoSpeech, the teacher checkpoint directory must be provided.
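
To make the teacher/student relationship concrete, here is a rough sketch of one consistency-distillation training step. The function names, the noise parameterization, and the EMA target network are assumptions for illustration, not the repository's actual code.

    import torch
    import torch.nn.functional as F

    def consistency_distillation_loss(student, ema_student, teacher_step, x0, mu, t, t_next):
        # 1) Corrupt the clean mel-spectrogram x0 to noise level t.
        x_t = x0 + t * torch.randn_like(x0)
        with torch.no_grad():
            # 2) Move one ODE step towards the smaller noise level t_next with the
            #    frozen teacher (e.g. a Heun or Euler step of the probability-flow ODE).
            x_t_next = teacher_step(x_t, t, t_next, mu)
            # 3) Target: the EMA copy of the student evaluated at the adjacent point.
            target = ema_student(x_t_next, t_next, mu)
        # 4) Consistency constraint: both adjacent trajectory points map to the same output.
        pred = student(x_t, t, mu)
        return F.mse_loss(pred, target)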

Checkpoints trained on LJSpeech can be downloaded from here.

Acknowledgement

I would like to extend special thanks to the authors of Grad-TTS, since our codebase is largely borrowed from Grad-TTS.

Contact

You are welcome to send pull requests or share ideas with me. Contact: Zhen Ye ([email protected]).