The codebase is based on QuickVC but contains several modifications:
- TPRLS GAN loss (from StyleTTS2)
- Multi-resolution spectrogram GAN discriminator (UnivNet/Vocos/StyleTTS2)
- ContentVec instead of HuBERT
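A multi-resolution spectrogram discriminator scores the same waveform at several STFT resolutions, so artifacts that are invisible at one analysis window show up at another. Below is a minimal sketch of the multi-resolution front end only; the resolution pairs are illustrative defaults, and the per-resolution conv discriminators that would consume these spectrograms are omitted:

```python
import numpy as np

def stft_mag(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Magnitude STFT with a Hann window (no padding, for brevity)."""
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))  # shape: (n_frames, n_fft // 2 + 1)

# Illustrative (n_fft, hop) pairs -- not necessarily the values this repo uses.
RESOLUTIONS = [(512, 128), (1024, 256), (2048, 512)]

def multires_specs(x: np.ndarray) -> list[np.ndarray]:
    """One magnitude spectrogram per resolution; each would feed its own
    small 2-D conv discriminator in the GAN."""
    return [stft_mag(x, n_fft, hop) for n_fft, hop in RESOLUTIONS]
```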
A pretrained model is available on Hugging Face:

https://huggingface.co/alphacep/vosk-vc-ru
On a Russian dataset we measure speaker similarity with Resemblyzer:

| Model | Average similarity | Min similarity |
|---|---|---|
| **Ours** | | |
| Original QuickVC (trained on VCTK) | 0.667 | 0.477 |
| Trained on Russian data | 0.836 | 0.692 |
| With ContentVec | 0.880 | 0.712 |
| **Others** | | |
| OpenVoice EN | 0.800 | 0.653 |
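The numbers above come from comparing Resemblyzer speaker embeddings (`VoiceEncoder.embed_utterance` on each converted utterance and on the target speaker's reference audio) by cosine similarity, then aggregating over the test pairs. A minimal sketch of the scoring step only; the embedding extraction itself is left to Resemblyzer, and the function names here are illustrative:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_stats(converted: list, targets: list) -> tuple:
    """Average and minimum similarity over converted/target embedding pairs."""
    sims = [cosine_similarity(c, t) for c, t in zip(converted, targets)]
    return sum(sims) / len(sims), min(sims)
```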
TODO:
- Test other VC methods (XTTS, GPT-SoVITS, RVC, UnitSpeech)
- Collect a wideband dataset (currently 16 kHz)
- Add a better speaker and style encoder (3D-Speaker, OpenVoice)
To convert audio, run `python convert.py`. Edit `convert.txt` to select the source and target utterances.
To extract content features from the dataset, run `python encode.py dataset/VCTK-16K dataset/VCTK-16K`.
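The encode step precomputes a feature file per utterance so training does not have to run the content encoder online. A sketch of that shape, with the ContentVec model abstracted behind an `extract` callable; the names here are illustrative, not the actual internals of `encode.py`:

```python
from pathlib import Path
import numpy as np

def encode_dataset(in_dir, out_dir, extract):
    """Walk `in_dir`, run `extract(wav_path) -> np.ndarray` (e.g. a ContentVec
    forward pass) on every .wav, and save the features under the same
    relative path in `out_dir` as a .npy file."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    for wav in sorted(in_dir.rglob("*.wav")):
        feats = extract(wav)
        out = out_dir / wav.relative_to(in_dir).with_suffix(".npy")
        out.parent.mkdir(parents=True, exist_ok=True)
        np.save(out, feats)
```

Passing the same directory twice, as in the command above, stores the features alongside the audio.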
Then start training with `python train.py`.
Credits:

| Component | Source |
|---|---|
| Initial approach | QuickVC |
| Better content/speaker decomposition | ContentVec |
| Fast MB-iSTFT decoder for VITS | MS-iSTFT-VITS |
| HuBERT-soft | Soft-VC |
| Data augmentation (not implemented) | FreeVC |
| Multi-resolution spectral discriminator | UnivNet |