Follow the preprocessing steps of Speechformer to preprocess the MuST-C data. Once you obtain the TSV definition of the training data, you can filter it with:
```bash
python examples/speech_to_text/scripts/filter_on_char_ratio.py \
--train-subset $TSV_TRAINING_SET \
--tsv-out $NEW_TSV_TRAINING \
--threshold-min 0.8 --threshold-max 1.6
```
Please note that these thresholds are the ones used in the paper, where the source text is normalized and stripped of punctuation, while the target text is true-cased and contains punctuation. If this is not the case for your data, different thresholds may be more appropriate. We recommend checking your data with a histogram, as done in the paper, to determine suitable thresholds.
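As an illustration of the histogram check, the sketch below reads a speech_to_text TSV and prints the distribution of character ratios, so that the thresholds can be picked visually. It assumes the TSV has `src_text` and `tgt_text` columns and that the ratio is the character length of `tgt_text` over that of `src_text`; please check `filter_on_char_ratio.py` for the exact definition used by the script, as this is only an illustrative approximation.

```python
import csv
from collections import Counter

# Placeholder path: point this to your TSV training set
TSV_PATH = "train_mustc.tsv"

ratios = []
with open(TSV_PATH, encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
        src, tgt = row["src_text"], row["tgt_text"]
        if src:
            # Assumed ratio definition: target characters over source characters
            ratios.append(len(tgt) / len(src))

# Plain-text histogram over 0.1-wide bins: look at where the bulk of the mass lies
counts = Counter(round(r, 1) for r in ratios)
for bin_center in sorted(counts):
    share = counts[bin_center] / len(ratios)
    print(f"{bin_center:4.1f} {counts[bin_center]:8d} {'#' * int(200 * share)}")
```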
Our Conformer model without pre-training was obtained by running the following command on 4 A40 GPUs (48GB of RAM each). With different hardware you may need to adjust the `--max-tokens` parameter (in case your GPUs have less RAM) and `--update-freq` (to keep the product of `--max-tokens`, `--update-freq`, and the number of GPUs constant; see the worked example after the command).
```bash
python train.py ${DATA_ROOT} \
--train-subset ${COMMA_SEPARATED_TRAINING_SETS} \
--valid-subset dev_mustc2 \
--save-dir ${ST_SAVE_DIR} \
--ignore-prefix-size 1 \
--num-workers 5 --max-update 200000 --patience 30 --save-interval-updates 1000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml config_st.yaml \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 25000 \
--clip-norm 10.0 \
--seed 1 --update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple > ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
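For reference, the command above processes 4 GPUs × 40,000 max tokens × update-freq 2 = 320,000 tokens per weight update. The minimal sketch below shows how one might pick `--update-freq` on different hardware; the 2-GPU, 20,000-token setting in the example is purely an illustrative assumption.

```python
# Tokens per weight update in the reference setup: 4 GPUs x 40,000 max tokens x update-freq 2
TARGET_TOKENS_PER_UPDATE = 4 * 40_000 * 2  # = 320,000

def update_freq(num_gpus: int, max_tokens: int) -> int:
    """Return the --update-freq that keeps the effective batch size roughly constant."""
    return max(1, round(TARGET_TOKENS_PER_UPDATE / (num_gpus * max_tokens)))

# Example: 2 GPUs with less RAM, --max-tokens halved to 20,000 -> --update-freq 8
print(update_freq(num_gpus=2, max_tokens=20_000))
```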
Similarly, the command to reproduce our fine-tuning is:
```bash
python train.py ${DATA_ROOT} \
--train-subset ${COMMA_SEPARATED_TRAINING_SETS} \
--valid-subset dev_mustc2 \
--save-dir ${ST_SAVE_DIR} \
--ignore-prefix-size 1 \
--num-workers 5 --max-update 200000 --patience 30 --save-interval-updates 1000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml config_st.yaml \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 1e-3 --lr-scheduler fixed --reset-lr-scheduler --reset-optimizer --reset-dataloader \
--clip-norm 10.0 \
--seed 1 --update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple > ${ST_SAVE_DIR}/train.log 2> ${ST_SAVE_DIR}/train.err
```
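The fine-tuning command resets the optimizer, LR scheduler, and dataloader state and uses a fixed learning rate; the weights are restored through fairseq's standard `--restore-file` mechanism (by default, `checkpoint_last.pt` in the save directory). Before launching it, a quick sanity check of the checkpoint you start from can help; the sketch below assumes the standard fairseq checkpoint layout and uses a placeholder path.

```python
import torch

# Placeholder path to the checkpoint you plan to fine-tune from
CKPT = "pretraining_save_dir/checkpoint_last.pt"

# Full (non weights-only) load: fairseq checkpoints also store config objects
state = torch.load(CKPT, map_location="cpu", weights_only=False)
print("checkpoint keys:", list(state.keys()))

# Newer fairseq versions store the training config under "cfg", older ones under "args"
cfg = state.get("cfg", state.get("args"))
print("config type:", type(cfg))

# Rough size check: number of elements across all model state-dict entries
n_params = sum(p.numel() for p in state["model"].values())
print(f"model parameters: {n_params / 1e6:.1f}M")
```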
Here we release the trained models used in our participation in the competition. They all share the same configuration, English and German dictionaries, and English and German SentencePiece models. The names of the models in this table are the same as in Tables 3 and 4 of the paper. Please refer to the paper for a detailed description of each model.
| | Model | SacreBLEU MuST-C tst-COMMON | SacreBLEU MuST-C tst-COMMON SHAS-segmented |
|---|---|---|---|
| I. | conformer | 30.6 | - |
| 1. | conformer_indomainfn | 31.6 | 30.3 |
| 2. | conformer_pretrain_indomainfn | 31.7 | 30.4 |
| 6. | conformer_pretrain_indomainfn_resegmfn | - | 29.7 |
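The scores above are SacreBLEU scores computed against the MuST-C references. As a reference point, a score of this kind can be obtained with the `sacrebleu` Python package as sketched below; the file names are placeholders and the exact SacreBLEU configuration used in the paper may differ.

```python
import sacrebleu

# Placeholder files: one detokenized sentence per line, aligned with the reference
with open("hypotheses.de", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("tst-COMMON.de", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# corpus_bleu expects a list of reference streams (a single reference stream here)
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu)        # full BLEU report
print(bleu.score)  # the corpus-level BLEU value alone
```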
We release here the best models mentioned in our paper that were trained only on MuST-C. They share the same config file, English and German dictionaries, and English and German SentencePiece models. Their description can be found in the paper, where their results are reported in Tables 1 and 2. Namely, we release:
- conformer + CTC compr (25.5 BLEU): the best model without encoder pre-training;
- speechformer hybrid (25.7 BLEU): the best model with encoder pre-training;
- conformer + CTC compr + char-ratio filter (26.7 BLEU): the best model obtained by filtering the MuST-C training set.
```bibtex
@inproceedings{gaido-et-al-2022-efficient,
    title = "Efficient yet Competitive Speech Translation: FBK@IWSLT2022",
    author = {Gaido, Marco and Papi, Sara and Fucci, Dennis and Fiameni, Giuseppe and Negri, Matteo and Turchi, Marco},
    booktitle = "Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)",
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics"
}
```