vc

This is an extension of standard VITS-based voice conversion (VC).

The codebase is based on QuickVC with several modifications:

  1. TPRLS GAN loss (from StyleTTS2); see the sketch after this list
  2. Multi-resolution spectral GAN discriminator (as in UnivNet/Vocos/StyleTTS2)
  3. ContentVec features instead of HuBERT
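
For reference, a minimal sketch of the TPRLS (truncated pointwise relativistic least-squares) loss, following the StyleTTS2 implementation; the threshold tau=0.04 and the median-based truncation are assumptions carried over from that codebase:

    import torch
    import torch.nn.functional as F

    def discriminator_tprls_loss(dr, dg, tau=0.04):
        # dr: discriminator outputs on real audio, dg: on generated audio.
        # Relativistic LSGAN term computed pointwise, truncated at tau so
        # points the discriminator already separates well stop contributing.
        m_dg = torch.median(dr - dg)
        l_rel = torch.mean(((dr - dg - m_dg) ** 2)[dr < dg + m_dg])
        return tau - F.relu(tau - l_rel)

    def generator_tprls_loss(dr, dg, tau=0.04):
        # Same truncated term with the roles of real and fake swapped.
        return discriminator_tprls_loss(dg, dr, tau=tau)

In StyleTTS2 these terms are added on top of the usual adversarial objectives rather than replacing them.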

Pretrained model

A pretrained model is available on Hugging Face:

https://huggingface.co/alphacep/vosk-vc-ru
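
You can fetch the checkpoint with the huggingface_hub client; only the repo id below comes from this README, and the files you get are whatever the release contains:

    from huggingface_hub import snapshot_download

    # Download all files from the model repo into the local cache
    # and return the path to the downloaded snapshot.
    local_dir = snapshot_download("alphacep/vosk-vc-ru")
    print(local_dir)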

Results

On a Russian dataset we measure speaker similarity with Resemblyzer:

Model                                  Average similarity  Min similarity
Our
  Original QuickVC (trained on VCTK)   0.667               0.477
  Trained on Russian data              0.836               0.692
  With ContentVec                      0.880               0.712
Others
  OpenVoice EN                         0.800               0.653
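
Similarity here is presumably the cosine similarity between Resemblyzer speaker embeddings, which is the standard way that library is used. A minimal sketch of how one such number can be computed (file names are placeholders; how the reported figures are averaged is not specified here):

    import numpy as np
    from resemblyzer import VoiceEncoder, preprocess_wav

    encoder = VoiceEncoder()

    # Embed the target speaker's reference audio and the converted output.
    target = encoder.embed_utterance(preprocess_wav("target_reference.wav"))
    converted = encoder.embed_utterance(preprocess_wav("converted_output.wav"))

    # Resemblyzer embeddings are L2-normalized, so the dot product
    # is the cosine similarity.
    print(float(np.dot(target, converted)))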

TODO

  • Test other VC methods (XTTS, GPT-SoVITS, RVC, UnitSpeech)
  • Collect a wideband dataset (the current one is 16 kHz)
  • Add a better speaker and style encoder (3D-Speaker, OpenVoice)

Inference with the pretrained model

python convert.py

Edit convert.txt to select the source utterances and the target speaker.
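
The exact line format is defined by the sample convert.txt shipped with the repo. As a purely hypothetical illustration, FreeVC-derived codebases typically use pipe-separated triples:

    # hypothetical layout, assumed from FreeVC-style convert.txt files:
    # <output_name>|<source_utterance>|<target_speaker_reference>
    sample1|dataset/src/utt001.wav|dataset/ref/speaker2.wav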

Preprocess

python encode.py dataset/VCTK-16K dataset/VCTK-16K
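
The pipeline expects 16 kHz audio (both the directory name and the TODO above point to this). If your recordings use a different sample rate, a minimal torchaudio resampling sketch (the paths are placeholders):

    import torchaudio

    # Load an utterance and resample it to the 16 kHz expected by encode.py.
    wav, sr = torchaudio.load("input.wav")
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
    torchaudio.save("dataset/VCTK-16K/input.wav", wav, 16000)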

Train

python train.py

References

  • Initial approach: QuickVC
  • Better content/speaker decomposition: ContentVec
  • Fast MB-iSTFT decoder for VITS: MS-iSTFT-VITS
  • HuBERT-soft: Soft-VC
  • Data augmentation (not implemented here): FreeVC
  • TPRLS GAN loss: StyleTTS2 (paper)
  • Multi-resolution spectral discriminator: UnivNet