Question about ASR and diarization #2509

Answered by vsl9
francescodaq asked this question in Q&A
Hello Francesco,

Is it correct to say that timestride corresponds to the length (in terms of time) of a CTC label frame returned by ASR model?

Correct. Most ASR models take a mel spectrogram as input. The spectrogram is usually computed using windows 20 milliseconds long (25 milliseconds for Citrinet) with a stride of 10 milliseconds. The first convolutional layer of the QuartzNet (and Jasper) ASR models decimates the input by 2 (using stride=2). That's why each output frame of QuartzNet has a duration of 20 milliseconds (10*2): timestride=0.02.
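
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: `time_stride` and its parameter names are hypothetical helpers, not part of any library API, and the values are the ones quoted for QuartzNet (10 ms spectrogram stride, decimation by 2).

```python
def time_stride(feature_stride_s: float, downsample_factor: int) -> float:
    """Duration in seconds of one model output frame.

    feature_stride_s: hop between consecutive spectrogram windows, in seconds.
    downsample_factor: total temporal decimation applied by the model
    (assumed here to be 2, from the stride=2 first layer of QuartzNet).
    """
    return feature_stride_s * downsample_factor


# QuartzNet values from the answer: 10 ms stride, decimated by 2.
quartznet_timestride = time_stride(0.01, 2)
print(f"timestride = {quartznet_timestride:.2f} s")
```

Note that the window length (20 or 25 ms) does not enter the calculation; only the stride and the model's decimation factor determine the output frame duration.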

Is it correct for a 2 seconds long audio file and a timestride of 0.02 to expect 100 frames as output?

Correct. Exact number of frames might be slightly di…
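
As a rough sketch of the frame-count estimate, assuming only that the count is the audio duration divided by the timestride (the exact number returned by a model might differ slightly, as noted above). The helper names `expected_frames` and `frame_start_time` are hypothetical and not part of any library.

```python
def expected_frames(duration_s: float, timestride_s: float) -> int:
    """Approximate number of output frames for an audio clip."""
    # round() guards against floating-point noise in the division.
    return round(duration_s / timestride_s)


def frame_start_time(frame_index: int, timestride_s: float) -> float:
    """Approximate start time in seconds of a given output frame."""
    return frame_index * timestride_s


n_frames = expected_frames(2.0, 0.02)   # 2-second clip, 20 ms frames
midpoint = frame_start_time(50, 0.02)   # where frame 50 starts
print(n_frames, midpoint)
```

Mapping a frame index back to a time offset in this way is what makes the timestride useful for aligning ASR output with diarization segments.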

Answer selected by francescodaq