Performance in LLM-based-TTS #40
Comments
We found that, under fair comparison conditions, the speech synthesis quality of a single-layer WavTokenizer outperforms that of the 9-layer DAC in downstream autoregressive TTS models, with slight improvements in other text-to-speech aspects as well.
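One practical reason a single-codebook tokenizer can help a downstream autoregressive TTS model is the much shorter token sequence it must predict per second of audio. A minimal sketch of that arithmetic (the frame rate and codebook counts below are illustrative assumptions, not measurements reported in this thread):

```python
# Illustrative comparison of token-sequence lengths an autoregressive
# TTS model must predict per second of audio. The frame rate and
# codebook counts are assumed values for illustration only.

def tokens_per_second(frame_rate_hz: float, num_codebooks: int) -> float:
    """Total discrete tokens emitted per second of audio."""
    return frame_rate_hz * num_codebooks

# Single-codebook tokenizer (a WavTokenizer-style setup): 1 codebook.
single = tokens_per_second(frame_rate_hz=75, num_codebooks=1)

# Multi-codebook RVQ codec (a 9-layer DAC-style setup): 9 codebooks
# at the same assumed frame rate.
multi = tokens_per_second(frame_rate_hz=75, num_codebooks=9)

print(single, multi)  # for a 10 s utterance: 750 vs 6750 tokens
```

Shorter sequences mean fewer autoregressive steps and fewer chances for the LM to compound prediction errors, which is one plausible reading of the comparison above.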
Thank you for your reply! But I also ran into the same problem as #34: there are mispronunciations in the reconstructed waveform, and phonemes may sound like similar phonemes. Training longer (3 epochs -> 5 epochs) does not seem to alleviate this. Do you have any other ideas for solving it? And do you think HiFi-GAN might be a better model for the decoder part?
Hello, I have a similar problem: phonemes in the synthesized speech drift toward similar-sounding phonemes of the text. Do you have any ideas for solving it?
I used 200 h of data to train an LLM-based TTS model. When the codec was SpeechTokenizer, I got good results, but after switching to WavTokenizer, the problem above occurred. Could this be because the dataset is too small?
Has anyone trained this model and then used it to train an LLM-based TTS system? How is the performance?
I mean the quality of the synthesized waveform, as well as performance in zero-shot TTS.