Follow the preprocessing steps of Speechformer to preprocess the MuST-C data. Once you obtain the TSV definition of the training data, you can filter it with:
```bash
python examples/speech_to_text/scripts/filter_on_char_ratio.py \
--train-subset $TSV_TRAINING_SET \
--tsv-out $NEW_TSV_TRAINING \
--threshold-min 0.8 --threshold-max 1.6
```
Please note that these thresholds are the ones used in the paper, where the source text is normalized and stripped of punctuation, while the target text is true-cased and contains punctuation. If this is not the case for your data, different thresholds may be more appropriate. We recommend checking your data with a histogram, as done in the paper, to determine suitable thresholds.
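As an illustration of the histogram check, the sketch below reads a speech_to_text TSV and prints the distribution of character ratios, so that the thresholds can be picked visually. It assumes the TSV has `src_text` and `tgt_text` columns and that the ratio is the character length of `tgt_text` over that of `src_text`; please check `filter_on_char_ratio.py` for the exact definition used by the script, as this is only an illustrative approximation.

```python
import csv
from collections import Counter

# Placeholder path: point this to your TSV training set
TSV_PATH = "train_mustc.tsv"

ratios = []
with open(TSV_PATH, encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        src, tgt = row["src_text"], row["tgt_text"]
        if src:
            # Assumed ratio definition: target characters over source characters
            ratios.append(len(tgt) / len(src))

# Plain-text histogram over 0.1-wide bins: look at where the bulk of the mass lies
counts = Counter(round(r, 1) for r in ratios)
for bin_center in sorted(counts):
    share = counts[bin_center] / len(ratios)
    print(f"{bin_center:4.1f} {counts[bin_center]:8d} {'#' * int(200 * share)}")
```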
Our Conformer model without pre-training was obtained by running the following command on 4 A40 GPUs (48GB of RAM each). With different hardware you may need to adjust the `--max-tokens` parameter (in case your GPUs have less RAM) and `--update-freq` (to keep the product of `--max-tokens`, `--update-freq`, and the number of GPUs constant; see the worked example after the command).
```bash
python train.py ${DATA_ROOT} \
--train-subset ${COMMA_SEPARATED_TRAINING_SETS} \
--valid-subset dev_mustc2 \
--save-dir ${ST_SAVE_DIR} \
--ignore-prefix-size 1 \
--num-workers 5 --max-update 200000 --patience 30 --save-interval-updates 1000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml config_st.yaml \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 25000 \
--clip-norm 10.0 \
--seed 1 --update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple > ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
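For reference, the command above processes 4 GPUs × 40,000 max tokens × update-freq 2 = 320,000 tokens per weight update. The minimal sketch below shows how one might pick `--update-freq` on different hardware; the 2-GPU, 20,000-token setting in the example is purely an illustrative assumption.

```python
# Tokens per weight update in the reference setup: 4 GPUs x 40,000 max tokens x update-freq 2
TARGET_TOKENS_PER_UPDATE = 4 * 40_000 * 2  # = 320,000

def update_freq(num_gpus: int, max_tokens: int) -> int:
    """Return the --update-freq that keeps the effective batch size roughly constant."""
    return max(1, round(TARGET_TOKENS_PER_UPDATE / (num_gpus * max_tokens)))

# Example: 2 GPUs with less RAM, --max-tokens halved to 20,000 -> --update-freq 8
print(update_freq(num_gpus=2, max_tokens=20_000))
```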
Similarly, the command to reproduce our fine-tuning is:
```bash
python train.py ${DATA_ROOT} \
--train-subset ${COMMA_SEPARATED_TRAINING_SETS} \
--valid-subset dev_mustc2 \
--save-dir ${ST_SAVE_DIR} \
--ignore-prefix-size 1 \
--num-workers 5 --max-update 200000 --patience 30 --save-interval-updates 1000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml config_st.yaml \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 1e-3 --lr-scheduler fixed --reset-lr-scheduler --reset-optimizer --reset-dataloader \
--clip-norm 10.0 \
--seed 1 --update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple > ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
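The fine-tuning command resets the optimizer, LR scheduler, and dataloader state and uses a fixed learning rate; the weights are restored through fairseq's standard `--restore-file` mechanism (by default, `checkpoint_last.pt` in the save directory). Before launching it, a quick sanity check of the checkpoint you start from can help; the sketch below assumes the standard fairseq checkpoint layout and uses a placeholder path.

```python
import torch

# Placeholder path to the checkpoint you plan to fine-tune from
CKPT = "pretraining_save_dir/checkpoint_last.pt"

# Full (non weights-only) load: fairseq checkpoints also store config objects
state = torch.load(CKPT, map_location="cpu", weights_only=False)
print("checkpoint keys:", list(state.keys()))

# Newer fairseq versions store the training config under "cfg", older ones under "args"
cfg = state.get("cfg", state.get("args"))
print("config type:", type(cfg))

# Rough size check: number of elements across all model state-dict entries
n_params = sum(p.numel() for p in state["model"].values())
print(f"model parameters: {n_params / 1e6:.1f}M")
```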
Here we release the trained models used in our participation in the competition. They all share the same configuration, English and German dictionaries, and English and German SentencePiece models. The names of the models in this table are the same as in Tables 3 and 4 of the paper. Please refer to the paper for a detailed description of each model.
| | Model | SacreBLEU MuST-C tst-COMMON | SacreBLEU MuST-C tst-COMMON SHAS-segmented |
|---|---|---|---|
| I. | conformer | 30.6 | - |
| 1. | conformer_indomainfn | 31.6 | 30.3 |
| 2. | conformer_pretrain_indomainfn | 31.7 | 30.4 |
| 6. | conformer_pretrain_indomainfn_resegmfn | - | 29.7 |
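The scores above are SacreBLEU scores computed against the MuST-C references. As a reference point, a score of this kind can be obtained with the `sacrebleu` Python package as sketched below; the file names are placeholders and the exact SacreBLEU configuration used in the paper may differ.

```python
import sacrebleu

# Placeholder files: one detokenized sentence per line, aligned with the reference
with open("hypotheses.de", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("tst-COMMON.de", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# corpus_bleu expects a list of reference streams (a single reference stream here)
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu)        # full BLEU report
print(bleu.score)  # the corpus-level BLEU value alone
```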
We release here the best models mentioned in our paper that were trained only on MuST-C. They share the same config file, English and German dictionaries, and English and German SentencePiece models. Their description can be found in the paper, where their results are reported in Tables 1 and 2. Namely, we release:
- conformer + CTC compr (25.5 BLEU): the best model without encoder pre-training;
- speechformer hybrid (25.7 BLEU): the best model with encoder pre-training;
- conformer + CTC compr + char-ratio filter (26.7 BLEU): the best model obtained by filtering the MuST-C training set.
```bibtex
@inproceedings{gaido-et-al-2022-efficient,
    title = "Efficient yet Competitive Speech Translation: FBK@IWSLT2022",
    author = {Gaido, Marco and Papi, Sara and Fucci, Dennis and Fiameni, Giuseppe and Negri, Matteo and Turchi, Marco},
    booktitle = "Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)",
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics"
}
```