Question about ASR and diarization #2509

francescodaq · 2021-07-19T18:19:24Z


Hello,

I'm trying the tutorials/speaker_recognition/ASR_with_SpeakerDiarization.ipynb tutorial.

I need to perform ASR on Italian, so in my code I load the out-of-the-box Italian model stt_it_quartznet15x5:

# let's do inference once again but without decoder
logits = asr_model.transcribe(files, logprobs=True)[0]
probs = softmax(logits)

If I have understood correctly, the tutorial shows how to obtain an oracle-VAD file (needed as input for the speakernet model) starting from the CTC labels returned by ASR inference on the audio file. Voice and non-voice segments are obtained by checking for sequences of 'blank' or 'space' characters longer than a threshold (20 in the tutorial).
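The run-length check described above can be sketched as follows. This is a minimal illustration and not the tutorial's code: the function name `find_non_speech_runs`, the use of `"_"` as the blank symbol, and the toy label sequence are all assumptions made for the example.

```python
from typing import List, Tuple


def find_non_speech_runs(
    labels: List[str],
    threshold: int = 20,
    silence_symbols: Tuple[str, ...] = ("_", " "),
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive, for every run of
    blank/space labels that is strictly longer than `threshold` frames."""
    runs = []
    run_start = None
    for i, label in enumerate(labels):
        if label in silence_symbols:
            if run_start is None:
                run_start = i  # a new silence run begins here
        else:
            if run_start is not None and i - run_start > threshold:
                runs.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(labels) - run_start > threshold:
        runs.append((run_start, len(labels)))
    return runs


# Toy per-frame labels: 25 blanks, the letters of "ciao", then 30 blanks.
frames = ["_"] * 25 + list("ciao") + ["_"] * 30
non_speech = find_non_speech_runs(frames)
```

In practice the per-frame labels would come from taking the argmax over the model's output probabilities and mapping each index to its vocabulary symbol.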

Once the non_speech array is built, it is used to determine the rttm file containing the speech sequences. To do this, two parameters are used:

# 20ms is duration of a timestep at output of the model
time_stride = 0.02

and

# calibration offset for timestamps: 180 ms
offset = -0.18

and they are very important for translating a frame number into a timestamp.

How are these two parameters determined?
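To make the role of the two parameters concrete, here is a small sketch of the frame-to-timestamp conversion, assuming the timestamp is computed as `frame * time_stride + offset`. The helper names, the clamping at zero, the rounding to milliseconds, and the example frame spans are assumptions for illustration, not the tutorial's exact code.

```python
from typing import List, Tuple

TIME_STRIDE = 0.02  # seconds per output frame (value quoted from the tutorial)
OFFSET = -0.18      # calibration offset in seconds (value quoted from the tutorial)


def frame_to_seconds(frame: int, time_stride: float = TIME_STRIDE,
                     offset: float = OFFSET) -> float:
    """Map an output frame index to a timestamp in seconds.
    Clamping at 0 and rounding to milliseconds are assumptions of this sketch."""
    return round(max(0.0, frame * time_stride + offset), 3)


def speech_segments(non_speech: List[Tuple[int, int]],
                    num_frames: int) -> List[Tuple[float, float]]:
    """Turn non-speech frame runs (end exclusive) into the complementary
    speech segments, expressed as (start_seconds, end_seconds)."""
    segments = []
    cursor = 0  # first frame not yet assigned to any segment
    for start, end in sorted(non_speech):
        if start > cursor:
            segments.append((frame_to_seconds(cursor), frame_to_seconds(start)))
        cursor = max(cursor, end)
    if cursor < num_frames:
        segments.append((frame_to_seconds(cursor), frame_to_seconds(num_frames)))
    return segments


# Hypothetical non-speech runs over a 59-frame output.
segments = speech_segments([(0, 25), (29, 59)], num_frames=59)
```

Each resulting `(start, end)` pair could then be written out as one speech line of the rttm file.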

Is it correct to say that timestride corresponds to the length (in terms of time) of a CTC label frame returned by the ASR model?
Is it correct to expect 100 output frames for a 2-second audio file and a timestride of 0.02?
Do I need to use a different value with the Italian ASR model stt_it_quartznet15x5, or is 0.02 a suitable value for it?

How is offset calculated?

I'm struggling to obtain valid VAD labels when using the tutorial values on Italian audio files.

Thank you!

Francesco

Answered by vsl9


okuchaiev · 2021-07-19T22:28:59Z

okuchaiev
Jul 19, 2021
Collaborator

@nithinraok


vsl9 · 2021-07-20T06:31:45Z

vsl9
Jul 20, 2021
Maintainer

Hello Francesco,

Is it correct to say that timestride corresponds to the length (in terms of time) of a CTC label frame returned by the ASR model?

Correct. Most ASR models take a mel spectrogram as input. The spectrogram is usually computed using 20-millisecond windows (25 milliseconds in the case of Citrinet) with a stride of 10 milliseconds. The first convolutional layer of the QuartzNet (and Jasper) ASR model decimates the input by 2 (using stride=2). That's why each output frame of QuartzNet has a duration of 20 milliseconds (10*2): timestride=0.02.
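The arithmetic behind that value can be written out as a short sketch. The function and variable names are illustrative assumptions; the numbers are the ones given above for QuartzNet.

```python
def output_frame_duration(hop_ms: float, downsampling: int) -> float:
    """Duration in seconds of one encoder output frame: the spectrogram
    stride multiplied by the total temporal downsampling of the encoder."""
    return hop_ms * downsampling / 1000


# QuartzNet / Jasper: 10 ms spectrogram stride, first conv layer with stride=2.
time_stride = output_frame_duration(hop_ms=10, downsampling=2)
```

For a model with a different spectrogram stride or a different total downsampling factor, the same product gives the value to use in place of 0.02.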

Is it correct to expect 100 output frames for a 2-second audio file and a timestride of 0.02?

Correct. The exact number of frames may differ slightly depending on whether there is extra padding at the end of the signal.
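A quick sanity check along these lines is sketched below. The helper names and the tolerance of a few frames are assumptions for illustration; the actual frame count would be read from the first dimension of the model's per-frame output.

```python
def expected_frames(duration_s: float, time_stride: float = 0.02) -> int:
    """Approximate number of output frames for a clip of the given duration."""
    return round(duration_s / time_stride)


def frame_count_plausible(actual: int, duration_s: float,
                          time_stride: float = 0.02, tolerance: int = 2) -> bool:
    """True if the observed frame count is within `tolerance` frames of the
    expectation, leaving room for padding at the end of the signal."""
    return abs(actual - expected_frames(duration_s, time_stride)) <= tolerance


n_frames = expected_frames(2.0)  # 2-second clip at 20 ms per frame
```

If the observed count is far from the expectation, the time stride assumed for the model is probably wrong.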

Do I need to use a different value with the Italian ASR model stt_it_quartznet15x5, or is 0.02 a suitable value for it?

timestride=0.02 is the correct value for QuartzNet and Jasper ASR models.

How is offset calculated?

We use the offset parameter to compensate for the ASR model's delay between input and output. It wasn't calculated; we estimated it empirically by applying a trained English QuartzNet model to different audio clips and checking the alignment. A 180 ms delay (offset=-0.18) works quite well.
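For anyone who wants to repeat that check on another model (for example an Italian one), one possible way to make it quantitative is sketched below. This is not the procedure described above, only an assumption-laden illustration: it supposes you have a few hand-annotated boundary times and the frame indices at which the model placed the same boundaries, and all numbers are made up.

```python
from statistics import median
from typing import Sequence


def estimate_offset(predicted_frames: Sequence[int],
                    reference_seconds: Sequence[float],
                    time_stride: float = 0.02) -> float:
    """Median difference between reference boundary times and the uncorrected
    model times (frame * time_stride). A negative result means the model
    output lags behind the audio."""
    diffs = [ref - frame * time_stride
             for frame, ref in zip(predicted_frames, reference_seconds)]
    return median(diffs)


# Hypothetical data: frames where the model placed three word boundaries,
# and the times (in seconds) where an annotator placed the same boundaries.
predicted = [30, 75, 120]
reference = [0.42, 1.32, 2.22]
offset = estimate_offset(predicted, reference)
```

The median is used here so that a few badly aligned boundaries do not dominate the estimate.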

Kind regards,
Vitaly
