Performance in LLM-based-TTS #40

Open
Liujingxiu23 opened this issue Sep 26, 2024 · 4 comments
Labels: important

Comments

Liujingxiu23 commented Sep 26, 2024

Has anyone trained this model and used it to train an LLM-based TTS system? How is the performance?
I mean the quality of the generated waveform, as well as the performance in zero-shot TTS.

jishengpeng (Owner) commented Oct 1, 2024

Has anyone trained this model and used it to train an LLM-based TTS system? How is the performance? I mean the quality of the generated waveform, as well as the performance in zero-shot TTS.

We found that, under fair comparison conditions, the speech synthesis quality of a single-layer WavTokenizer outperforms that of the 9-layer DAC in downstream autoregressive TTS models, with slight improvements in other text-to-speech aspects as well.

Liujingxiu23 (Author) commented

Thank you for your reply!

But I have also run into the same problem as #34. There are mispronunciations in the reconstructed waveform: phones may sound like other, similar phones. Training longer (3 epochs -> 5 epochs) does not seem to alleviate this problem. Do you have any other ideas for solving it?

And do you think HiFi-GAN might be a better model for the decoder part?

CriDora commented Dec 16, 2024

Thank you for your reply!

But I have also run into the same problem as #34. There are mispronunciations in the reconstructed waveform: phones may sound like other, similar phones. Training longer (3 epochs -> 5 epochs) does not seem to alleviate this problem. Do you have any other ideas for solving it?

And do you think HiFi-GAN might be a better model for the decoder part?

Hello, I have a similar problem. The pronunciation in the synthesized speech only resembles some of the phonemes in the input text. Do you have any ideas for solving it?

CriDora commented Dec 16, 2024

I used 200 hours of data to train an LLM-based TTS model. With SpeechTokenizer as the codec, I got good results, but after switching to WavTokenizer, the mispronunciation problem described above appeared. Could this be because the dataset is too small?
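
One way to put a number on the mispronunciation reports in this thread is a phoneme error rate (PER) between a reference phoneme sequence and the phonemes recognized from the reconstructed or synthesized audio. The sketch below is only an illustration, not something the WavTokenizer authors describe: it assumes the phoneme sequences already exist (for example from a phoneme recognizer of your choice, which is not shown), and the names `edit_distance` and `phoneme_error_rate` are hypothetical helpers.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference phonemes
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Assumed example: one phone replaced by a similar-sounding one.
reference = ["HH", "AH", "L", "OW"]
recognized = ["HH", "AH", "R", "OW"]
per = phoneme_error_rate(reference, recognized)
```

Computing the same metric for two codecs on the same utterances, or for checkpoints at 3 and 5 epochs, would show whether a change actually reduces phone substitutions rather than relying on listening alone.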
