Code and models for the paper "Direct Speech Translation for Automatic Subtitling" published at TACL 2023.
- English > {Dutch, French, German, Italian, Portuguese, Romanian, Spanish}: | config.yaml | src_vocab.model | src_vocab.txt | tgt_vocab.model | tgt_vocab.txt
- English > German: | config.yaml | src_vocab.model | src_vocab.txt | tgt_vocab.model | tgt_vocab.txt
- English > Spanish: | config.yaml | src_vocab.model | src_vocab.txt | tgt_vocab.model | tgt_vocab.txt
Clone this repository and install it as explained in the original Fairseq(-py).
Download all the corpora listed in our paper and preprocess them as explained here.
To train the model from scratch, please follow the two steps below.
First, a speech translation pre-training is performed using all the corpora listed in our paper,
including MuST-Cinema, from which we removed <eob> and <eol> from the textual parts (transcripts and translations).
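Removing the subtitle markers from a line of text can be sketched as follows. This is a minimal illustration, not the preprocessing code of this repository: the helper name `strip_subtitle_markers` is hypothetical, and it assumes the markers appear as the literal tokens `<eob>` and `<eol>`.

```python
import re

# Assumption: markers are the literal strings "<eob>" and "<eol>".
_MARKER_RE = re.compile(r"\s*<eo[bl]>\s*")


def strip_subtitle_markers(text: str) -> str:
    """Remove <eob>/<eol> markers and collapse the whitespace around them."""
    return _MARKER_RE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "Hello there. <eol> How are you? <eob> Fine. <eob>"
    print(strip_subtitle_markers(sample))  # Hello there. How are you? Fine.
```

Applying the helper to both the transcripts and the translations would give the marker-free text used for the pre-training step.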
Run the following code by setting:
- ${FBK_fairseq} as the folder containing this repository,
- ${DATA_ROOT} as the folder containing the preprocessed datasets,
- ${SPLIT_LIST} as the comma-separated list of the training ST datasets (e.g. mustc_train,europarl_train,...),
- ${MUSTCINEMA_DEV} as the split name of the MuST-Cinema dev set from which <eob> and <eol> have been removed,
- ${ST_SAVE_DIR} as the directory in which the checkpoints will be saved,
- ${CONFIG_YAML} as the path to the yaml file generated after preprocessing.
This script is intended for 4 NVIDIA A100 40GB GPUs. Please set --max-tokens and --update-freq
according to your hardware, so that number of GPUs * max_tokens * update_freq = 320,000.
python ${FBK_fairseq}/ ${DATA_ROOT} \
--train-subset ${SPLIT_LIST} \
--valid-subset ${MUSTCINEMA_DEV} \
--save-dir ${ST_SAVE_DIR} \
--num-workers 2 --max-update 100000 \
--save-interval-updates 1000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml ${CONFIG_YAML} \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 25000 --patience 15 \
--clip-norm 10.0 \
--seed 1 --update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> ${ST_SAVE_DIR}/train.log
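The constraint number of GPUs * max_tokens * update_freq = 320,000 can be checked with a few lines of Python before launching the training. The sketch below is only a convenience calculation; the function name `update_freq_for` is hypothetical and not part of this repository.

```python
# Target from the README: 4 GPUs * 40000 max_tokens * 2 update_freq.
TARGET_TOKENS_PER_UPDATE = 320_000


def update_freq_for(num_gpus: int, max_tokens: int) -> int:
    """Return the --update-freq that keeps the tokens per update at the target."""
    if num_gpus <= 0 or max_tokens <= 0:
        raise ValueError("num_gpus and max_tokens must be positive")
    tokens_per_step = num_gpus * max_tokens
    if TARGET_TOKENS_PER_UPDATE % tokens_per_step != 0:
        raise ValueError(
            f"{tokens_per_step} tokens per step does not divide "
            f"{TARGET_TOKENS_PER_UPDATE}; pick a different --max-tokens"
        )
    return TARGET_TOKENS_PER_UPDATE // tokens_per_step


if __name__ == "__main__":
    print(update_freq_for(4, 40000))  # 2, the setting used in the command above
    print(update_freq_for(1, 40000))  # 8
    print(update_freq_for(2, 20000))  # 8
```

For example, on a single GPU that fits 40,000 tokens per batch, --update-freq 8 gives the same number of tokens per update as the 4-GPU setting.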
Second, we fine-tune the previous model using the previously preprocessed data, but with transcripts and translations
containing <eob> and <eol>. MuST-Cinema already contains the subtitle segmentation markers,
while all the other datasets have to be segmented into subtitles
using the multimodal segmenter.
Please average the checkpoints of the pre-trained ST model as explained
and copy the averaged checkpoint with the name
in the ${SUB_SAVE_DIR} folder.
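Conceptually, checkpoint averaging takes the element-wise mean of each parameter over several checkpoints. The sketch below illustrates only this idea on plain Python lists; it is not the procedure referenced above, and the `Params` type and `average_params` function are hypothetical stand-ins for real model state.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def average_params(checkpoints: List[Params]) -> Params:
    """Element-wise mean of every parameter across the given checkpoints."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must contain the same parameters")
        for name in names:
            if len(ckpt[name]) != len(checkpoints[0][name]):
                raise ValueError(f"size mismatch for parameter {name!r}")
    count = len(checkpoints)
    return {
        name: [sum(values) / count
               for values in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in names
    }


if __name__ == "__main__":
    first = {"w": [1.0, 3.0], "b": [0.0]}
    second = {"w": [3.0, 5.0], "b": [2.0]}
    print(average_params([first, second]))  # {'w': [2.0, 4.0], 'b': [1.0]}
```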
Run the following code by setting:
- ${SPLIT_LIST} as the comma-separated list of the training ST datasets containing <eob> and <eol> (e.g. mustc_sub_train,europarl_sub_train,...),
- ${MUSTCINEMA_DEV} as the split name of the original MuST-Cinema dev set containing <eob> and <eol>,
- ${SUB_SAVE_DIR} as the folder in which to save the checkpoints for the final model.
python ${FBK_fairseq}/ ${DATA_ROOT} \
--train-subset ${SPLIT_LIST} \
--valid-subset ${MUSTCINEMA_DEV} \
--save-dir ${SUB_SAVE_DIR} \
--num-workers 1 --max-update 100000 --save-interval-updates 1000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml ${CONFIG_YAML} \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 1e-3 --lr-scheduler fixed \
--patience 10 \
--clip-norm 10.0 \
--seed 1 --update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> ${SUB_SAVE_DIR}/train.log
Then, average the checkpoints as mentioned above to obtain the final model.
Please use SHAS to generate the automatic segmentation files for the MuST-Cinema test set, EC Short Clips, and EuroParl Interviews, as we do in our paper, and preprocess them.
To generate the srt files, run the script below by setting:
- ${DATA_ROOT} as the folder containing the preprocessed test set,
- ${SPLIT} as the name of the preprocessed test set,
- ${CONFIG_YAML} as the path to the yaml file of the model,
- ${MODEL} as the path to the model checkpoint in pt format,
- as the path to the yaml containing the automatic segmentation obtained by SHAS, which has also been used during the preprocessing,
- ${SRTDIR} as the output folder that will contain the generated srt.
The script performs the generation process in 4 steps. This was done to ease experimentation with different methods, although it is inefficient. An efficient implementation could perform the full generation with a single forward pass of the direct ST model.
data_tmp_dir=$(mktemp -d)
mkdir -p $SRTDIR
# Generates the output subtitles (translations with <eol> and <eob>)
# with the autoregressive decoder
python ${FBK_fairseq}/FBK-fairseq/fairseq_cli/ $DATA_ROOT \
--user-dir examples/speech_to_text --config-yaml $CONFIG_YAML \
--gen-subset $SPLIT \
--max-tokens 16000 --unkpen 10000 --beam 5 \
--model-overrides "{'batch_unsafe_relative_shift': False}" \
--max-source-positions 16000 --max-target-positions 1000 \
--task speech_to_text_ctc --criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy --no-repeat-ngram-size 5 \
--path $MODEL > $data_tmp_dir/translation.out
grep "^D-" $data_tmp_dir/translation.out | cut -c3- | sort -k 1n | cut -f3 > $data_tmp_dir/translation.txt
# Generates the captions (transcripts with <eol> and <eob>) using
# the CTC predictions
python ${FBK_fairseq}/FBK-fairseq/fairseq_cli/ $DATA_ROOT \
--user-dir examples/speech_to_text --config-yaml $CONFIG_YAML \
--gen-subset $SPLIT \
--max-tokens 16000 --unkpen 10000 --beam 5 \
--model-overrides "{'batch_unsafe_relative_shift': False}" \
--max-source-positions 16000 --max-target-positions 1000 \
--task speech_to_text_ctcgen --criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy --no-repeat-ngram-size 5 \
--path $MODEL --lenpen 0.0 > $data_tmp_dir/transcript.out
grep "^D-" $data_tmp_dir/transcript.out | cut -c3- | sort -k 1n | cut -f3 > $data_tmp_dir/transcript.txt
# Runs the CTC segmentation to align the generated transcripts with the source
# audio, hence obtaining the estimated timestamps at block level
python ${FBK_fairseq}/FBK-fairseq/examples/speech_to_text/scripts/ $DATA_ROOT \
--user-dir examples/speech_to_text --config-yaml $CONFIG_YAML \
--gen-subset $SPLIT \
--max-tokens 16000 --beam 5 \
--model-overrides "{'batch_unsafe_relative_shift': False}" \
--max-source-positions 16000 --max-target-positions 1000 \
--split-tokens "<eob>" --feature-duration 0.04 \
--task speech_to_text_ctc \
--criterion ctc_multi_loss --underlying-criterion cross_entropy \
--path $MODEL --text-file $data_tmp_dir/transcript.txt > $data_tmp_dir/transcript_align.out
grep "^SEGM-" $data_tmp_dir/transcript_align.out | cut -c6- | sort -k 1n | cut -f2 > $data_tmp_dir/transcript_align.txt
# Projects the caption timestamps onto the subtitling blocks with the Levenshtein method
python ${FBK_fairseq}/FBK-fairseq/examples/speech_to_text/scripts/ \
$data_tmp_dir/transcript.txt \
$data_tmp_dir/transcript_align.txt \
$data_tmp_dir/translation.txt \
# Creates the SRT files
python ${FBK_fairseq}/FBK-fairseq/examples/speech_to_text/ \
$data_tmp_dir/translation.txt \
$data_tmp_dir/translation_align.txt \
rm -rf $data_tmp_dir
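The `grep | cut | sort | cut` pipelines in the script above select the prefixed lines of the generation output, order them by sample index, and keep one tab-separated field. For readers who prefer to do this in Python, a roughly equivalent sketch is shown below. The function name `extract_field` is hypothetical, and the sketch assumes each selected line has the form `<prefix><index><TAB>...`, as implied by the shell commands.

```python
from typing import Iterable, List


def extract_field(lines: Iterable[str], prefix: str = "D-",
                  field: int = 2) -> List[str]:
    """Keep lines starting with `prefix`, sort them by the integer index that
    follows the prefix, and return the tab-separated field at position
    `field` (0-based, so 2 corresponds to `cut -f3`)."""
    rows = []
    for line in lines:
        if not line.startswith(prefix):
            continue
        fields = line.rstrip("\n")[len(prefix):].split("\t")
        text = fields[field] if len(fields) > field else ""
        rows.append((int(fields[0]), text))
    rows.sort(key=lambda row: row[0])  # numeric sort, like `sort -k 1n`
    return [text for _, text in rows]


if __name__ == "__main__":
    output = [
        "S-1\tsource text",
        "D-1\t-0.5\tsecond block <eob>",
        "H-0\t-0.2\tignored",
        "D-0\t-0.2\tfirst <eol> line <eob>",
    ]
    print(extract_field(output))
    # ['first <eol> line <eob>', 'second block <eob>']
```

With `prefix="SEGM-"` and `field=1`, the same function mirrors the extraction applied to the alignment output.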
Please use the SubER repository to compute the SubER-cased metric.
To evaluate BLEU (BLEUnb) and Sigma, please install EvalSub and mwerSegmenter, and
run the following code by setting:
- ${REF_SRTDIR} as the folder containing the reference srt files,
- ${MWER_DIR} as the folder containing the mwerSegmenter,
- ${EVALSUB_DIR} as the folder containing the EvalSub, and
- ${OUT_FILE} as the path in which to save the EvalSub output.
# These first 4 commands should be skipped for MuST-Cinema.
# For MuST-Cinema, use as reference the amara.$lang files
# instead of generating the text files from the SRTs.
cat ${SRTDIR}/*.srt > ${SRTDIR}/
cat ${REF_SRTDIR}/*.srt > ${REF_SRTDIR}/
python ${SRTDIR}/
python ${REF_SRTDIR}/
${MWER_DIR}/mwerSegmenter \
-mref ${REF_SRTDIR}/ \
-hypfile ${SRTDIR}/
mv __segments ${SRTDIR}/;
python ${EVALSUB_DIR}/ -a -e2e \
-ref ${REF_SRTDIR}/ \
-sys ${SRTDIR}/ \
-res ${OUT_FILE}
To evaluate CPL and CPS conformity, run:
python ${FBK_fairseq}/FBK-fairseq/examples/speech_to_text/scripts/ \
--srt-file ${SRTDIR}/*.srt \
--metrics cpl cps --remove-parenthesis-content
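To make the two conformity metrics concrete, the sketch below computes them on an SRT string: CPL conformity as the percentage of subtitle lines within a maximum number of characters, and CPS conformity as the percentage of blocks within a maximum reading speed. It is an illustration under stated assumptions, not the script of this repository: the limits of 42 characters per line and 21 characters per second, the counting of spaces as characters, and all function names are assumptions, and the parenthesis removal performed by `--remove-parenthesis-content` is not reproduced.

```python
import re
from typing import List, Tuple

MAX_CPL = 42    # assumed maximum characters per line
MAX_CPS = 21.0  # assumed maximum characters per second

_TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = _TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"malformed SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


def parse_srt(content: str) -> List[Tuple[float, float, List[str]]]:
    """Return (start, end, text lines) for every block of an SRT string."""
    blocks = []
    for raw in re.split(r"\n\s*\n", content.strip()):
        lines = raw.splitlines()
        if len(lines) < 3:
            continue  # skip blocks without text
        start, end = lines[1].split("-->")
        blocks.append((to_seconds(start), to_seconds(end), lines[2:]))
    return blocks


def conformity(content: str) -> Tuple[float, float]:
    """Return (CPL conformity, CPS conformity) as percentages."""
    blocks = parse_srt(content)
    if not blocks:
        raise ValueError("no subtitle blocks found")
    all_lines = [line for _, _, lines in blocks for line in lines]
    cpl_ok = sum(len(line) <= MAX_CPL for line in all_lines)
    cps_ok = 0
    for start, end, lines in blocks:
        duration = end - start
        chars = sum(len(line) for line in lines)
        if duration > 0 and chars / duration <= MAX_CPS:
            cps_ok += 1
    return 100.0 * cpl_ok / len(all_lines), 100.0 * cps_ok / len(blocks)


SAMPLE_SRT = """1
00:00:00,000 --> 00:00:02,000
Hello there.
How are you?

2
00:00:02,000 --> 00:00:03,000
This line is definitely far too long to fit the limit.
"""

if __name__ == "__main__":
    cpl, cps = conformity(SAMPLE_SRT)
    print(f"CPL conformity: {cpl:.1f}%  CPS conformity: {cps:.1f}%")
```

In the sample, two of the three lines fit within 42 characters and one of the two blocks stays within 21 characters per second.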
If you use this work, please cite:
title={Direct Speech Translation for Automatic Subtitling},
author={Papi, Sara and Gaido, Marco and Karakanta, Alina and Cettolo, Mauro and Negri, Matteo and Turchi, Marco},
journal={Transactions of the Association for Computational Linguistics},