Question about ASR and diarization #2509
Hello, I'm trying the [...] I need to perform ASR on Italian, so in my code I load the out-of-the-box Italian model [...]
If I have understood correctly, the tutorial shows how to obtain an oracle-VAD file (needed as input for the speakernet model) starting from the CTC labels returned by ASR inference on the audio file. Voice and non-voice segments are obtained by checking for sequences of 'blank' or 'space' characters longer than a threshold (20 in the tutorial). Once the [...] and [...] and they are very important in order to translate a frame number into a timestamp. How are these two parameters determined? Is it correct to say that [...] How is [...] I'm struggling to obtain valid values for the VAD labels using the tutorial values on Italian audio files.
Thank you!
Francesco
Replies: 2 comments
Hello Francesco,
Correct. Most ASR models take a mel spectrogram as input. The spectrogram is usually computed using 20-millisecond windows (25 milliseconds in the case of Citrinet) with a stride of 10 milliseconds. The first convolutional layer of the QuartzNet (and Jasper) ASR model downsamples the input by a factor of 2 (using stride=2). That's why each output frame of QuartzNet has a duration of 20 milliseconds (10*2): timestride=0.02.
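As a rough illustration of that arithmetic, here is a minimal sketch that derives the 20 ms frame duration and uses it to turn the blank-run thresholding from the question into timestamps. It is not the tutorial's code: the helper names `silence_runs` and `frames_to_seconds`, the constant names, and the use of `"_"` to stand for the CTC blank are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

WINDOW_STRIDE_S = 0.01  # spectrogram stride of 10 ms
ENCODER_STRIDE = 2      # first QuartzNet/Jasper conv layer uses stride=2
TIME_STRIDE_S = WINDOW_STRIDE_S * ENCODER_STRIDE  # 0.02 s per output frame


def silence_runs(labels: Sequence[str], silence: Sequence[str],
                 min_len: int = 20) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, where more than
    `min_len` consecutive frames carry a silence label."""
    runs = []
    start = None
    for i, label in enumerate(labels):
        if label in silence:
            if start is None:
                start = i  # a new silence run begins here
        else:
            if start is not None and i - start > min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(labels) - start > min_len:
        runs.append((start, len(labels)))
    return runs


def frames_to_seconds(run: Tuple[int, int],
                      time_stride: float = TIME_STRIDE_S) -> Tuple[float, float]:
    """Convert a (start, end) frame range into seconds."""
    start, end = run
    return (round(start * time_stride, 3), round(end * time_stride, 3))


# Hypothetical per-frame labels: "_" stands for blank, " " for space.
labels = list("ciao") + ["_"] * 25 + list("mondo")
runs = silence_runs(labels, silence=("_", " "), min_len=20)
print(runs)                                  # [(4, 29)]
print([frames_to_seconds(r) for r in runs])  # [(0.08, 0.58)]
```

With a different model the stride product changes, so `TIME_STRIDE_S` would need to be recomputed from that model's window stride and total downsampling factor.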
Correct. The exact number of frames might differ slightly depending on whether there is extra padding at the end of the signal.
We use [...]
Kind regards,