Code and models for the paper "Direct Speech Translation for Automatic Subtitling" published at TACL 2023.
The following pre-trained models are available for download:
- English > {Dutch, French, German, Italian, Portuguese, Romanian, Spanish}: model.pt | config.yaml | src_vocab.model | src_vocab.txt | tgt_vocab.model | tgt_vocab.txt
- English > German: model.pt | config.yaml | src_vocab.model | src_vocab.txt | tgt_vocab.model | tgt_vocab.txt
- English > Spanish: model.pt | config.yaml | src_vocab.model | src_vocab.txt | tgt_vocab.model | tgt_vocab.txt
Clone this repository and install it as explained in the original Fairseq(-py) documentation.
Download all the corpora listed in our paper and preprocess them as explained here.
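For reference, a minimal installation sketch (the repository URL and the editable pip install are assumptions based on the standard Fairseq(-py) setup):

# Minimal install sketch; follow the upstream Fairseq(-py) instructions
# for the authoritative steps.
git clone https://github.com/hlt-mt/FBK-fairseq.git
cd FBK-fairseq
pip install --editable .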
To train the model from scratch, please follow the two steps below.
First, a speech translation pre-training is performed using all the corpora listed in our paper, including MuST-Cinema, from which we removed <eob> and <eol> from the textual parts (transcripts and translations), e.g. as sketched below.
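A sketch of the marker removal with a simple text substitution (the file name train.de is hypothetical; apply it to every transcript and translation file):

# Replace each <eob>/<eol> marker with a space, then squeeze repeated
# spaces and trim trailing ones.
sed -e 's/<eob>/ /g' -e 's/<eol>/ /g' -e 's/  */ /g' -e 's/ *$//' \
    train.de > train.noeob.de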
Run the following code after setting:
- ${FBK_fairseq}: the root folder of this repository;
- ${DATA_ROOT}: the folder containing the preprocessed datasets;
- ${SPLIT_LIST}: the comma-separated list of the training ST datasets (e.g. mustc_train,europarl_train,...);
- ${MUSTCINEMA_DEV}: the split name of the MuST-Cinema dev set from which <eob> and <eol> have been removed;
- ${ST_SAVE_DIR}: the directory in which the checkpoints will be saved;
- ${CONFIG_YAML}: the path to the yaml file generated by the preprocessing.

This script is intended for 4 NVIDIA A100 40GB GPUs: set --max-tokens and --update-freq according to your hardware, so that number of GPUs * max_tokens * update_freq = 320,000. For example, on 2 such GPUs with --max-tokens 40000, set --update-freq 4 (2 * 40,000 * 4 = 320,000).
python ${FBK_fairseq}/train.py ${DATA_ROOT} \
--train-subset ${SPLIT_LIST} \
--valid-subset ${MUSTCINEMA_DEV} \
--save-dir ${ST_SAVE_DIR} \
--num-workers 2 --max-update 100000 \
--save-interval-updates 1000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml ${CONFIG_YAML} \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 25000 --patience 15 \
--clip-norm 10.0 \
--seed 1 --update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> ${ST_SAVE_DIR}/train.log
Second, we fine-tune the previous model on the same preprocessed data, but with transcripts and translations containing <eob> and <eol>. MuST-Cinema already contains the subtitle segmentation markers, while all the other datasets have to be segmented into subtitles using the multimodal segmenter. Please average the checkpoints of the ST pre-training as explained here, and copy the resulting checkpoint with the name checkpoint_last.pt into the ${SUB_SAVE_DIR} folder.
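A minimal sketch of this averaging step, assuming the standard fairseq scripts/average_checkpoints.py and the 7-checkpoint average used for the final model below:

# Average the last 7 update-based checkpoints of the ST pre-training and
# save the result where the fine-tuning expects checkpoint_last.pt.
python ${FBK_fairseq}/scripts/average_checkpoints.py \
    --inputs ${ST_SAVE_DIR} \
    --num-update-checkpoints 7 \
    --output ${SUB_SAVE_DIR}/checkpoint_last.pt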
Run the following code after setting:
- ${SPLIT_LIST}: the comma-separated list of the training ST datasets containing <eob> and <eol> (e.g. mustc_sub_train,europarl_sub_train,...);
- ${MUSTCINEMA_DEV}: the split name of the original MuST-Cinema dev set containing <eob> and <eol>;
- ${SUB_SAVE_DIR}: the folder in which to save the checkpoints of the final model.
python ${FBK_fairseq}/train.py ${DATA_ROOT} \
--train-subset ${SPLIT_LIST} \
--valid-subset ${MUSTCINEMA_DEV} \
--save-dir ${SUB_SAVE_DIR} \
--num-workers 1 --max-update 100000 --save-interval-updates 1000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml ${CONFIG_YAML} \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 1e-3 --lr-scheduler fixed \
--patience 10 \
--clip-norm 10.0 \
--seed 1 --update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> ${SUB_SAVE_DIR}/train.log
Then, average the checkpoints as mentioned above to obtain the final model checkpoint_avg7.pt.
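Under the same assumption (fairseq's scripts/average_checkpoints.py), this amounts to:

# Average the last 7 checkpoints of the fine-tuning to obtain the final model.
python ${FBK_fairseq}/scripts/average_checkpoints.py \
    --inputs ${SUB_SAVE_DIR} \
    --num-update-checkpoints 7 \
    --output ${SUB_SAVE_DIR}/checkpoint_avg7.pt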
Please use SHAS to generate the automatic segmentation files for the MuST-Cinema test set, EC Short Clips, and EuroParl Interviews, as done in our paper, and preprocess them.
To generate the srt files, run the script below after setting:
- ${DATA_ROOT}: the folder containing the preprocessed test set;
- ${SPLIT}: the name of the preprocessed test set;
- ${CONFIG_YAML}: the path to the yaml file of the model;
- ${MODEL}: the path to the model checkpoint in pt format;
- ${YAML_SEGM}: the path to the yaml containing the automatic segmentation obtained by SHAS, which was also used during preprocessing;
- ${SRTDIR}: the output folder that will contain the generated srt files.
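For reference, ${YAML_SEGM} is assumed here to follow the MuST-C-style segmentation format, one entry per audio segment with its offset and duration in seconds; the values below are purely illustrative, and SHAS may emit additional fields:

- {duration: 4.76, offset: 12.30, speaker_id: NA, wav: sample.wav}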
The script performs the generation process in 4 steps. This was done to ease experimentation with different methods, although it is inefficient: an efficient implementation could perform the full generation with a single forward pass on the direct ST model.
DATA_ROOT=$1
SPLIT=$2
CONFIG_YAML=$3
MODEL=$4
YAML_SEGM=$5
SRTDIR=$6
data_tmp_dir=$(mktemp -d)
mkdir -p $SRTDIR
# Generates the output subtitles (translations with <eol> and <eob>)
# with the autoregressive decoder
python ${FBK_fairseq}/fairseq_cli/generate.py $DATA_ROOT \
--user-dir examples/speech_to_text --config-yaml $CONFIG_YAML \
--gen-subset $SPLIT \
--max-tokens 16000 --unkpen 10000 --beam 5 \
--model-overrides "{'batch_unsafe_relative_shift': False}" \
--max-source-positions 16000 --max-target-positions 1000 \
--task speech_to_text_ctc --criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy --no-repeat-ngram-size 5 \
--path $MODEL > $data_tmp_dir/translation.out
grep "^D-" $data_tmp_dir/translation.out | cut -c3- | sort -k 1n | cut -f3 > $data_tmp_dir/translation.txt
# Generates the captions (transcripts with <eol> and <eob>) using
# the CTC predictions
python ${FBK_fairseq}/fairseq_cli/generate.py $DATA_ROOT \
--user-dir examples/speech_to_text --config-yaml $CONFIG_YAML \
--gen-subset $SPLIT \
--max-tokens 16000 --unkpen 10000 --beam 5 \
--model-overrides "{'batch_unsafe_relative_shift': False}" \
--max-source-positions 16000 --max-target-positions 1000 \
--task speech_to_text_ctcgen --criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy --no-repeat-ngram-size 5 \
--path $MODEL --lenpen 0.0 > $data_tmp_dir/transcript.out
grep "^D-" $data_tmp_dir/transcript.out | cut -c3- | sort -k 1n | cut -f3 > $data_tmp_dir/transcript.txt
# Runs the CTC segmentation to align the generated transcripts with the source
# audio, hence obtaining the estimated timestamps at block level
python ${FBK_fairseq}/examples/speech_to_text/scripts/ctc_align.py $DATA_ROOT \
--user-dir examples/speech_to_text --config-yaml $CONFIG_YAML \
--gen-subset $SPLIT \
--max-tokens 16000 --beam 5 \
--model-overrides "{'batch_unsafe_relative_shift': False}" \
--max-source-positions 16000 --max-target-positions 1000 \
--split-tokens "<eob>" --feature-duration 0.04 \
--task speech_to_text_ctc \
--criterion ctc_multi_loss --underlying-criterion cross_entropy \
--path $MODEL --text-file $data_tmp_dir/transcript.txt > $data_tmp_dir/transcript_align.out
grep "^SEGM-" $data_tmp_dir/transcript_align.out | cut -c6- | sort -k 1n | cut -f2 > $data_tmp_dir/transcript_align.txt
# Projects the caption timestamps onto the subtitling blocks with the Levenshtein method
python ${FBK_fairseq}/examples/speech_to_text/scripts/target_from_source_timestamp_levenshtein.py \
$data_tmp_dir/transcript.txt \
$data_tmp_dir/transcript_align.txt \
$data_tmp_dir/translation.txt \
$data_tmp_dir/translation_align.txt
# Creates the SRT files
python ${FBK_fairseq}/examples/speech_to_text/make_srt.py \
$data_tmp_dir/translation.txt \
$data_tmp_dir/translation_align.txt \
$YAML_SEGM \
$SRTDIR
rm -rf $data_tmp_dir
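If the commands above are saved in a script (the name generate_srt.sh is hypothetical), a complete invocation looks like:

# Positional arguments match $1..$6 at the top of the script.
bash generate_srt.sh ${DATA_ROOT} ${SPLIT} ${CONFIG_YAML} ${MODEL} ${YAML_SEGM} ${SRTDIR}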
Please use the SubER repository for the SubER-cased computation.
To evaluate BLEU (BLEUnb) and Sigma, please install EvalSub and mwerSegmenter, and run the following code after setting:
- ${REF_SRTDIR}: the folder containing the reference srt files;
- ${MWER_DIR}: the folder containing mwerSegmenter;
- ${EVALSUB_DIR}: the folder containing EvalSub;
- ${OUT_FILE}: the path in which to save the EvalSub output.
# These first 4 commands should be skipped for MuST-Cinema.
# For MuST-Cinema, use as reference the amara.$lang files
# instead of generating the text files from the SRTs.
cat ${SRTDIR}/*.srt > ${SRTDIR}/hyp.srt
cat ${REF_SRTDIR}/*.srt > ${REF_SRTDIR}/ref.srt
python from_srt_to_blocks.py ${SRTDIR}/hyp.srt
python from_srt_to_blocks.py ${REF_SRTDIR}/ref.srt
${MWER_DIR}/mwerSegmenter \
-mref ${REF_SRTDIR}/ref.srt.blocks \
-hypfile ${SRTDIR}/hyp.srt.blocks
mv __segments ${SRTDIR}/hyp.srt.blocks.resegm;
python ${EVALSUB_DIR}/evalsub_main.py -a -e2e \
-ref ${REF_SRTDIR}/ref.srt.blocks \
-sys ${SRTDIR}/hyp.srt.blocks.resegm \
-res ${OUT_FILE}
To evaluate CPL and CPS conformity, run:
python ${FBK_fairseq}/examples/speech_to_text/scripts/subtitle_compliance.py \
--srt-file ${SRTDIR}/*.srt \
--metrics cpl cps --remove-parenthesis-content
If you use this work, please cite:
@article{papi2023directsub,
  title={Direct Speech Translation for Automatic Subtitling},
  author={Papi, Sara and Gaido, Marco and Karakanta, Alina and Cettolo, Mauro and Negri, Matteo and Turchi, Marco},
  journal={Transactions of the Association for Computational Linguistics},
  year={2023}
}