Improve speech recognition and remove postprocessing #837

josancamon19 · 2024-09-14T05:07:50Z

Refactoring STT system

https://artificialanalysis.ai/speech-to-text

Points to https://www.speechmatics.com/ as the winner in WER %

Deepgram's WER is about 40% worse, which is forcing us to run postprocessing with whisper-x.

Also tried AssemblyAI; unfortunately streaming only works for English, so it's discarded.

Speechmatics is marginally better than AssemblyAI, but works with all languages and has interesting, future-proof features.

NOTE: I will run the exact same pipeline in Soniox first. We already have 10k in credits, but I'm not sure I trust their accuracy, as the WER comparison was made by themselves.
Also, they did that research before the latest models were released.

Still, the reason for testing Soniox first is that we already have a good % of the pipeline integrated, so it shouldn't take long.

- Set up a Speechmatics websocket concurrently with the existing Deepgram websocket.
- In the app, add a settings dropdown that allows selecting the transcript model (only while testing).
- Test both options in 10 scenarios: (Deepgram + postprocessing) and (Speechmatics + postprocessing).
- Script to view a line-by-line comparison between each one of them.
  - Prompt GPT to compare the 3 transcripts for each scenario and say which one has better accuracy.
  - (Maybe) Use groq whisper v3 as the source of truth and compute WER against it.
- If the tests show Speechmatics is no more than 5-10% behind the whisper-x results, skip and remove postprocessing.
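
A minimal sketch of what the line-by-line comparison script could look like, using only Python's standard `difflib`. The function names and the sample transcripts are made up for illustration; the real script would load the transcripts saved for each scenario.

```python
import difflib


def compare_transcripts(name_a, lines_a, name_b, lines_b):
    """Return a unified diff (list of strings) between two transcripts.

    Each transcript is a list of lines (e.g. one line per segment).
    n=0 drops context lines so only the differing lines are shown.
    """
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm="", n=0
        )
    )


def word_similarity(lines_a, lines_b):
    """Rough 0..1 similarity over lowercased words. This is NOT WER."""
    words_a = " ".join(lines_a).lower().split()
    words_b = " ".join(lines_b).lower().split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()


# Hypothetical sample data, not real model output.
deepgram = ["hello there how are you", "i am fine thanks"]
speechmatics = ["hello there how are you", "i am fine thank you"]

for line in compare_transcripts("deepgram", deepgram, "speechmatics", speechmatics):
    print(line)
print(round(word_similarity(deepgram, speechmatics), 3))
```

The diff is only useful for eyeballing where two models disagree; it does not say which one is right, which is why a reference transcript (or a GPT judgement) is still needed.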

Important:

~~Need to double check scalability~~ (no response)
~~Need to ask for free credits, it's 4x more expensive than deepgram.~~ (no response).
Speechmatics will only be supported for Opus; 1.0.2 will continue using Deepgram.

Add ons:

VAD implementation will be needed. Finish the ticket, especially for Opus.
Push more users to migrate: start a "campaign" to help users move from 1.0.2 to 1.0.4 in < 30 days so we can deprecate pcm8.
- Understand the data (how many are still on pcm8?)
Improve speech recognition: make sure the file is being sent correctly (use the raw audio .wav instead of the saved Opus-encoded bytes), and double check the duration at which it performs 90% of the time.
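
One way to sanity-check the raw .wav before sending it is to read its duration from the header and compare it with what the recording is expected to be. This is only a sketch with the standard `wave` module; the function names and the 0.5 s tolerance are assumptions, not something from the existing pipeline.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM .wav file, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_matches(path, expected_seconds, tolerance=0.5):
    """True if the file duration is within `tolerance` seconds of the expected value.

    The tolerance is an arbitrary illustrative default.
    """
    return abs(wav_duration_seconds(path) - expected_seconds) <= tolerance
```

A mismatch here would point at a problem in how the audio was written (wrong sample rate, truncated file) rather than at the STT model.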


josancamon19 · 2024-09-14T05:19:24Z

How WER tests were made by artificialanalysis

josancamon19 · 2024-09-14T05:39:34Z

https://soniox.com/media/SonioxEnglishBenchmarks2023.pdf
https://soniox.com/benchmarks/

kodjima33 · 2024-09-14T21:49:36Z

@josancamon19 can you pls specify what languages are required to complete the task? That will help me figure out more quickly whom to ask to do it.

josancamon19 · 2024-09-23T09:41:35Z

Note: had an issue with the Speechmatics diarization results; my hunch is that it is still better than Deepgram.
For WER, used jiwer.
For DER, used `from pyannote.metrics.diarization import DiarizationErrorRate`.

Average WER Table

| Model | Average WER |
| --- | --- |
| soniox | 20.04% |
| speechmatics | 20.85% |
| fal_whisperx | 21.80% |
| deepgram | 31.59% |

Average DER Table

| Rank | Model | Average DER |
| --- | --- | --- |
| 1 | deepgram | 24.07% |
| 1 | soniox | 24.32% |
| 2 | fal_whisperx | 27.93% |
| 3 | speechmatics | 1258.00% |

How was this computed

For WER, groq-whisper-large-v3 was used as the reference, and the other models' results were computed against it.
For DER, pyannote diarization 3.1 was used as the reference baseline, via https://pyannote.ai/
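
For illustration, here is a dependency-free sketch of how a WER against a reference transcript and an average over recordings can be computed. The numbers in this thread came from jiwer, not from this code, and this sketch assumes the average is the plain mean of per-recording WER (as opposed to pooling all words), which is not confirmed here. The function names and the lowercase/whitespace normalization are also assumptions.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance (substitutions + deletions + insertions).

    Returns (errors, number_of_reference_words).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1], len(ref)


def average_wer(references, hypotheses):
    """Mean of per-recording WER; both arguments map scenario name -> transcript."""
    rates = []
    for scenario, ref_text in references.items():
        errors, n_words = word_errors(ref_text, hypotheses[scenario])
        if n_words == 0:
            continue  # WER is undefined for an empty reference
        rates.append(errors / n_words)
    return sum(rates) / len(rates)


# Hypothetical transcripts, not real data from the tests.
reference = {"cafe": "one two three four", "car": "hello world"}
model_out = {"cafe": "one two three", "car": "hello world"}
print(average_wer(reference, model_out))  # 0.125
```

Since the reference is itself a model's output, these WER values measure distance from groq-whisper-large-v3, not from a human transcript.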

In English, Soniox beats Speechmatics on overall WER, sometimes by a huge margin, but Speechmatics was more consistent in WER across the different recording scenarios.

Considering that, we will use Soniox for now, as we have credits, and Speechmatics costs 2.5x as much as Soniox, or 4.5x as much as Deepgram.

Deepgram was slightly better on speaker diarization. Speechmatics was preferred in users' perception, but the pipeline had issues computing the Speechmatics results, so I'm not sure how good it is.
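
As background for reading the DER numbers: DER is (missed speech + false alarm + speaker confusion) divided by the total reference speech, so it has no 100% ceiling. The sketch below is a simplified frame-based version, not pyannote's implementation (which works on timed segments); the function name and the per-frame label format are assumptions. It does not explain what went wrong with the Speechmatics figure, only that values above 100% are arithmetically possible.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Simplified DER over equal-length lists of per-frame speaker labels.

    None means non-speech. Hypothesis speakers are mapped one-to-one onto
    reference speakers by brute force, which is fine for a few speakers.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    pairs = list(zip(reference, hypothesis))
    ref_speech = sum(1 for r, _ in pairs if r is not None)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")

    missed = sum(1 for r, h in pairs if r is not None and h is None)
    false_alarm = sum(1 for r, h in pairs if r is None and h is not None)

    both = [(r, h) for r, h in pairs if r is not None and h is not None]
    ref_speakers = sorted({r for r, _ in both})
    hyp_speakers = sorted({h for _, h in both})

    # Pad with None so a hypothesis speaker may stay unmapped.
    targets = ref_speakers + [None] * len(hyp_speakers)
    best_correct = 0
    for assignment in permutations(targets, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, assignment))
        correct = sum(1 for r, h in both if mapping[h] == r)
        best_correct = max(best_correct, correct)
    confusion = len(both) - best_correct

    return (missed + false_alarm + confusion) / ref_speech


# One wrongly attributed frame out of four: 25% DER.
print(frame_der(["A", "A", "B", "B"], ["x", "x", "y", "x"]))  # 0.25
# Speech hallucinated over 6 silent frames with only 4 real ones: 150% DER.
print(frame_der(["A"] * 4 + [None] * 6, ["x"] * 10))  # 1.5
```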

Still, Soniox is cheaper, and very good.

Postprocessing:
groq+pyannote is definitely a better pipeline than fal_whisperx.

From the results, there's no benefit in using fal whisperx. Even though some results were almost as good as groq-whisper-large-v3, with something like 1% WER and DER, it's still very unreliable: sometimes it outputs only 20% of the expected transcript, or outputs nonsense.

Thus postprocessing will be removed.

beastoin · 2024-09-24T03:05:39Z

👋

josancamon19 · 2024-09-24T06:48:54Z

Asked Speechmatics for credits 3 times, no response. Will keep bothering them; not much we can do for now.

josancamon19 · 2024-09-27T02:57:29Z

Remaining ticket will be here: https://github.com/orgs/BasedHardware/projects/1/views/1?pane=issue&itemId=81004351

josancamon19 self-assigned this Sep 14, 2024

josancamon19 assigned josancamon19 and unassigned josancamon19 Sep 19, 2024

beastoin self-assigned this Sep 24, 2024

josancamon19 closed this as completed Sep 27, 2024
