-
This kind of issue lies somewhere within the AI model (that's my best understanding). AllTalk certainly passes the text over to the AI model correctly (based on all the tests I've performed in the past); however, occasional skips, or even a word spoken twice (usually at the end of a sentence), do occur from time to time. My loose general findings/beliefs thus far are:
So it's possible that further finetuning may improve the situation, or indeed a different wav sample. Long term, I may introduce other TTS models, giving a variety of ways of generating TTS. Whisper could be an option to compare generated audio to the text and regenerate or flag where necessary.
-
I've managed to code something together that will at least ease the burden. That said, I want to be clear to anyone reading this: this is currently an UNSUPPORTED work in progress. I have created a proof of concept that compares the spoken audio generated (by ID number) against the original text it was requested to generate. When the script runs, it will flag up a list of "ID number didn't match the text". You will need the Nvidia CUDA Toolkit 11.8 set up, as with finetuning: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-important-requirements-cuda-118 You will need to update AllTalk with a
As I say, this is a proof of concept. I have not tested its limits or found all the issues, and of course it isn't integrated into the TTS Generator. Thanks
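To illustrate the idea (this is not the actual script from the update above, just a minimal sketch of the comparison step): transcribe each generated clip with Whisper, normalize both strings, and flag any ID whose transcript drifts too far from the requested text. The `transcribe_clip` stub, the `0.9` threshold, and the tuple layout are all assumptions for illustration; the matching itself uses only the standard library.

```python
import difflib
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial differences don't flag."""
    return " ".join(re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split())

def matches(requested: str, transcript: str, threshold: float = 0.9) -> bool:
    """True if the transcript is 'close enough' to the requested text.

    A similarity ratio below the (arbitrary, illustrative) threshold
    suggests a skipped, doubled, or truncated word.
    """
    ratio = difflib.SequenceMatcher(
        None, normalize(requested), normalize(transcript)
    ).ratio()
    return ratio >= threshold

def flag_mismatches(items):
    """items: iterable of (id_number, requested_text, transcript).

    Returns the ID numbers whose audio didn't match the text, i.e. the
    clips you would regenerate. In a real run the transcript would come
    from something like whisper_model.transcribe(wav_path)["text"]
    (hypothetical call, not shown here).
    """
    return [item_id for item_id, requested, transcript in items
            if not matches(requested, transcript)]

# Example: ID 2's transcript dropped the end of the sentence, so it gets flagged.
flagged = flag_mismatches([
    (1, "Hello world.", "hello world"),
    (2, "Good morning everyone.", "good morning"),
])
print(flagged)
```

The threshold would need tuning in practice: too strict and Whisper's own transcription errors produce false positives, too loose and a dropped final word slips through.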
-
I'm currently using AllTalk TTS to generate audio for audiobooks.
I'm noticing that, every so often, sentences just get truncated for no apparent reason. I'm using the default settings in the TTS Generator (2 chunks). I am using a finetuned model, so could this truncation have to do with how I trained the model (maybe a bad dataset)?