Performance in LLM-based-TTS #40
Comments
We found that, under fair comparison conditions, the speech synthesis quality of a single-layer WavTokenizer outperforms that of the 9-layer DAC in downstream autoregressive TTS models, with slight improvements in other text-to-speech aspects as well.
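One practical reason a single-codebook tokenizer can help a downstream autoregressive TTS model is the much shorter token sequence it must predict per second of audio. A minimal sketch of that arithmetic (the frame rate and codebook counts below are illustrative assumptions, not measurements reported in this thread):

```python
# Illustrative comparison of token-sequence lengths an autoregressive
# TTS model must predict per second of audio. The frame rate and
# codebook counts are assumed values for illustration only.

def tokens_per_second(frame_rate_hz: float, num_codebooks: int) -> float:
    """Total discrete tokens emitted per second of audio."""
    return frame_rate_hz * num_codebooks

# Single-codebook tokenizer (a WavTokenizer-style setup): 1 codebook.
single = tokens_per_second(frame_rate_hz=75, num_codebooks=1)

# Multi-codebook RVQ codec (a 9-layer DAC-style setup): 9 codebooks
# at the same assumed frame rate.
multi = tokens_per_second(frame_rate_hz=75, num_codebooks=9)

print(single, multi)  # for a 10 s utterance: 750 vs 6750 tokens
```

Shorter sequences mean fewer autoregressive steps and fewer chances for the LM to compound prediction errors, which is one plausible reading of the comparison above.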
Thank you for your reply! But I also ran into the same problem as #34: there are mispronunciations in the reconstructed waveform, and phonemes may sound like similar phonemes. Training longer (3 epochs -> 5 epochs) does not seem to alleviate this. Do you have any other ideas for solving it? And do you think HiFi-GAN might be a better model for the decoder part?
Hello, I have a similar problem: phonemes in the synthesized speech drift toward similar-sounding phonemes of the text. Do you have any ideas for solving it?
I used 200 h of data to train an LLM-based TTS model. When the codec was SpeechTokenizer, I got good results, but after switching to WavTokenizer, the problem above occurred. Could this be because the dataset is too small?
Has anyone trained this model and then used it to train an LLM-based TTS system? How is the performance?
I mean the quality of the synthesized waveform, as well as performance in zero-shot TTS.