CoMoSpeech

Implementation of CoMoSpeech. For full details, see our paper accepted to ACM MM 2023: CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model.

Authors: Zhen Ye, Wei Xue, Xu Tan, Jie Chen, Qifeng Liu, Yike Guo.

Update

2024-04-26

  • We propose FlashSpeech, an efficient zero-shot speech synthesizer based on a latent consistency model and adversarial training (Paper).

2023-12-01

  • We also propose a well-designed Singing Voice Conversion (SVC) version based on the consistency model (Code).

2023-11-30

  • We find that zero-mean Gaussian noise, used instead of the prior from Grad-TTS, can also achieve similar performance. We also release the new code and checkpoints.

2023-10-21

  • We add support for Heun’s 2nd-order method for the teacher model (it can be used for teacher-model sampling and yields a better ODE trajectory for consistency distillation).
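
For reference, one Heun (2nd-order) step along the probability-flow ODE looks roughly like the sketch below. This is only illustrative; the denoiser interface, the noise parameterization, and the conditioning argument mu are assumptions rather than the code in this repository.

    def heun_step(denoiser, x, t, t_next, mu):
        # One Heun (improved Euler) step from noise level t down to t_next.
        d = (x - denoiser(x, t, mu)) / t                      # slope estimate at t
        x_euler = x + (t_next - t) * d                        # Euler predictor
        if t_next > 0:
            d_next = (x_euler - denoiser(x_euler, t_next, mu)) / t_next
            x = x + (t_next - t) * 0.5 * (d + d_next)         # trapezoidal corrector
        else:
            x = x_euler
        return x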

Abstract

Demo page: link.

Denoising diffusion probabilistic models (DDPMs) have shown promising performance for speech synthesis. However, a large number of iterative steps are required to achieve high sample quality, which restricts inference speed. Maintaining sample quality while increasing sampling speed has become a challenging task. In this paper, we propose a consistency-model-based speech synthesis method, CoMoSpeech, which achieves speech synthesis through a single diffusion sampling step while maintaining high audio quality. A consistency constraint is applied to distill a consistency model from a well-designed diffusion-based teacher model, which ultimately yields superior performance in the distilled CoMoSpeech. Our experiments show that by generating audio recordings with a single sampling step, CoMoSpeech achieves an inference speed more than 150 times faster than real-time on a single NVIDIA A100 GPU, which is comparable to FastSpeech2, making diffusion-sampling-based speech synthesis truly practical. Meanwhile, objective and subjective evaluations on text-to-speech and singing voice synthesis show that the proposed teacher models yield the best audio quality, and the one-step-sampling-based CoMoSpeech achieves the best inference speed with better or comparable audio quality to other conventional multi-step diffusion model baselines.
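
To make the one-step idea concrete, here is a minimal sketch contrasting conventional multi-step diffusion sampling with single-step consistency sampling. It is not the repository's implementation: the model interfaces, the noise schedule, and the use of the text-encoder output mu as both prior mean and conditioning are illustrative assumptions (see also the 2023-11-30 update on the choice of prior).

    import torch

    def diffusion_sample(score_model, mu, n_steps, sigma_max=1.0):
        # Conventional multi-step sampling: one network call per step.
        x = mu + sigma_max * torch.randn_like(mu)
        ts = torch.linspace(sigma_max, 1e-3, n_steps)
        for t, t_next in zip(ts[:-1], ts[1:]):
            d = score_model(x, t, mu)         # estimated drift at noise level t
            x = x + (t_next - t) * d          # Euler step along the probability-flow ODE
        return x

    def consistency_sample(consistency_model, mu, sigma_max=1.0):
        # One-step sampling: a single network call maps the noisy prior sample
        # directly to a clean mel-spectrogram.
        x_T = mu + sigma_max * torch.randn_like(mu)
        return consistency_model(x_T, torch.tensor(sigma_max), mu)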

Prepare

Build the monotonic_align code (Cython):

cd model/monotonic_align; python setup.py build_ext --inplace; cd ../..

Inference

Run the script inference.py, providing the path to a text file, the path to a checkpoint, and the number of sampling steps:

    python inference.py -f <text file> -c <checkpoint> -t <sampling steps> 

Generated audio is saved in the out folder. Note that in the params file, Teacher = True selects our teacher model and Teacher = False selects CoMoSpeech. In addition, we use the same vocoder as Grad-TTS; you can download it and put it into the checkpts folder.
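
As a hedged illustration of that flag (the exact spelling and surrounding contents of the params file may differ from what is shown here):

    # params (illustrative excerpt, not the actual file contents)
    Teacher = True     # sample with the multi-step diffusion teacher
    # Teacher = False  # sample with the distilled one-step CoMoSpeech

For example, with Teacher = False one would typically pass a small step count such as -t 1 to inference.py, while the teacher model uses more sampling steps; the specific step counts here are illustrative, not values prescribed by the repository.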

Training

We use the LJSpeech dataset and follow the train/test/val split from FastSpeech2; you can change the split in the fs2_txt folder. Then run the script train.py:

    python train.py 

Note that in the params file, Teacher = True trains our teacher model and Teacher = False trains CoMoSpeech. When training CoMoSpeech, the teacher checkpoint directory must be provided.
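
To make the teacher/student relationship concrete, here is a rough sketch of one consistency-distillation training step. The function names, the noise parameterization, and the EMA target network are assumptions for illustration, not the repository's actual code.

    import torch
    import torch.nn.functional as F

    def consistency_distillation_loss(student, ema_student, teacher_step, x0, mu, t, t_next):
        # 1) Corrupt the clean mel-spectrogram x0 to noise level t.
        x_t = x0 + t * torch.randn_like(x0)
        with torch.no_grad():
            # 2) Move one ODE step towards the smaller noise level t_next with the
            #    frozen teacher (e.g. a Heun or Euler step of the probability-flow ODE).
            x_t_next = teacher_step(x_t, t, t_next, mu)
            # 3) Target: the EMA copy of the student evaluated at the adjacent point.
            target = ema_student(x_t_next, t_next, mu)
        # 4) Consistency constraint: both adjacent trajectory points map to the same output.
        pred = student(x_t, t, mu)
        return F.mse_loss(pred, target)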

Checkpoints trained on LJSpeech can be downloaded from here.

Acknowledgement

I would like to extend special thanks to the authors of Grad-TTS, since our codebase is largely borrowed from Grad-TTS.

Contact

You are welcome to send pull requests or share ideas with me. Contact: Zhen Ye ([email protected]).