Performance with M4 #263
-
Hi @jsandlerus There are a couple of things to say on this, all of which will impact performance. I am going to assume you are looking at AllTalk v2 and not v1.
To give a bit more context: a full (non-streaming) TTS generation of the 290 words I typed, running on CPU with no specific acceleration, produced the timings below for XTTS and Piper (the timing screenshots are not captured in this export). This was on a Windows machine, which is typically slower than Unix-based OSes, and its CPU is nowhere near as powerful as a Mac's. Streaming generation would obviously be faster, as it hands audio over for playback before generation has completed, e.g. as soon as the first 5 seconds of audio have been generated, playback starts while it continues generating the rest of the TTS.
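The streaming behaviour described above can be sketched as follows. This is a minimal single-threaded illustration, not AllTalk's actual implementation: the chunk sizes and labels are made-up values, and a real engine would run generation and playback concurrently. The point it demonstrates is that playback begins as soon as the first chunk exists, rather than after the whole clip is done.

```python
def generate_chunks(total_seconds: int, chunk_seconds: int = 5):
    """Stand-in for a TTS engine: yields one 'audio chunk' label at a time."""
    for start in range(0, total_seconds, chunk_seconds):
        end = min(start + chunk_seconds, total_seconds)
        yield f"audio[{start}-{end}s]"

def stream_playback(total_seconds: int) -> list[str]:
    """Return an event log showing playback starting right after the first chunk."""
    events = []
    for i, chunk in enumerate(generate_chunks(total_seconds)):
        if i == 0:
            events.append("playback-started")  # the user already hears audio here
        events.append(f"play {chunk}")
    events.append("generation-finished")  # only now is the full clip generated
    return events
```

So for a 12-second clip the listener starts hearing audio after roughly the first 5 seconds are generated, which is why perceived latency is much lower than the full generation time.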
Finally, I do not own a Mac myself, so I have no way to test on the real hardware, either for performance or for compatibility of each individual TTS engine with Mac Metal support. I hope that gives you some insight/answers. Thanks
-
I want to use this to power the voice of an LLM on my website. The idea is to buy the upcoming Mac mini M4 and run this on a local server, then expose that server publicly using ngrok and hit the API every time I need to generate the audio voice. Maybe I can send the file back as the response of the local API. My question is: how performant do you think alltalk_tts would be on the M4 chip? How long do you think it would take to generate audio for 400 characters of text?
If I pull this off, we can basically reduce the cost of TTS for websites dramatically.
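The flow described above could be sketched roughly like this. Note that the endpoint path (`/api/tts-generate`) and parameter names here are assumptions for illustration only; check AllTalk's API documentation for the real ones, and the ngrok URL is a placeholder.

```python
import urllib.parse
import urllib.request

def build_tts_request(base_url: str, text: str) -> tuple[str, dict]:
    """Build the URL and form payload for a TTS generation call.

    The endpoint name and field names are hypothetical placeholders.
    """
    url = f"{base_url.rstrip('/')}/api/tts-generate"  # assumed endpoint path
    payload = {
        "text_input": text,  # assumed parameter name
        "language": "en",
    }
    return url, payload

def request_tts(base_url: str, text: str) -> bytes:
    """POST the request through the public ngrok URL and return the response body."""
    url, payload = build_tts_request(base_url, text)
    data = urllib.parse.urlencode(payload).encode()
    with urllib.request.urlopen(urllib.request.Request(url, data=data)) as resp:
        # Depending on the API, this may be raw audio bytes or JSON
        # containing a URL/path to the generated file.
        return resp.read()
```

The website backend would call `request_tts("https://your-tunnel.ngrok.app", text)` and stream the resulting audio to the browser, so the only recurring cost is the Mac mini's electricity rather than per-character cloud TTS pricing.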