Code and models for the paper "Direct Speech Translation for Automatic Subtitling" published at TACL 2023.
The following pre-trained models are available for download:
- English > {Dutch, French, German, Italian, Portuguese, Romanian, Spanish}: model.pt | config.yaml | src_vocab.model | src_vocab.txt | tgt_vocab.model | tgt_vocab.txt
- English > German: model.pt | config.yaml | src_vocab.model | src_vocab.txt | tgt_vocab.model | tgt_vocab.txt
- English > Spanish: model.pt | config.yaml | src_vocab.model | src_vocab.txt | tgt_vocab.model | tgt_vocab.txt
Clone this repository and install it as explained in the original Fairseq(-py) documentation.
Download all the corpora listed in our paper and preprocess them as explained here.
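For reference, a minimal installation sketch (the repository URL and the editable pip install are assumptions based on the standard Fairseq(-py) setup):

# Minimal install sketch; follow the upstream Fairseq(-py) instructions
# for the authoritative steps.
git clone https://github.com/hlt-mt/FBK-fairseq.git
cd FBK-fairseq
pip install --editable .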
To train the model from scratch, please follow the two steps below.
First, a speech translation pre-training is performed using all the corpora listed in our paper, including MuST-Cinema, from which we removed <eob> and <eol> from the textual parts (transcripts and translations), e.g. as sketched below.
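A sketch of the marker removal with a simple text substitution (the file name train.de is hypothetical; apply it to every transcript and translation file):

# Replace each <eob>/<eol> marker with a space, then squeeze repeated
# spaces and trim trailing ones.
sed -e 's/<eob>/ /g' -e 's/<eol>/ /g' -e 's/  */ /g' -e 's/ *$//' \
    train.de > train.noeob.de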
Run the following code after setting:
- ${FBK_fairseq}: the root folder of this repository;
- ${DATA_ROOT}: the folder containing the preprocessed datasets;
- ${SPLIT_LIST}: the comma-separated list of the training ST datasets (e.g. mustc_train,europarl_train,...);
- ${MUSTCINEMA_DEV}: the split name of the MuST-Cinema dev set from which <eob> and <eol> have been removed;
- ${ST_SAVE_DIR}: the directory in which the checkpoints will be saved;
- ${CONFIG_YAML}: the path to the yaml file generated by the preprocessing.

This script is intended for 4 NVIDIA A100 40GB GPUs: set --max-tokens and --update-freq according to your hardware, so that number of GPUs * max_tokens * update_freq = 320,000. For example, on 2 such GPUs with --max-tokens 40000, set --update-freq 4 (2 * 40,000 * 4 = 320,000).
python ${FBK_fairseq}/train.py ${DATA_ROOT} \
--train-subset ${SPLIT_LIST} \
--valid-subset ${MUSTCINEMA_DEV} \
--save-dir ${ST_SAVE_DIR} \
--num-workers 2 --max-update 100000 \
--save-interval-updates 1000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml ${CONFIG_YAML} \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 25000 --patience 15 \
--clip-norm 10.0 \
--seed 1 --update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> ${ST_SAVE_DIR}/train.log
Second, we fine-tune the previous model on the same preprocessed data, but with transcripts and translations containing <eob> and <eol>. MuST-Cinema already contains the subtitle segmentation markers, while all the other datasets have to be segmented into subtitles using the multimodal segmenter. Please average the checkpoints of the ST pre-training as explained here, and copy the resulting checkpoint with the name checkpoint_last.pt into the ${SUB_SAVE_DIR} folder.
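A minimal sketch of this averaging step, assuming the standard fairseq scripts/average_checkpoints.py and the 7-checkpoint average used for the final model below:

# Average the last 7 update-based checkpoints of the ST pre-training and
# save the result where the fine-tuning expects checkpoint_last.pt.
python ${FBK_fairseq}/scripts/average_checkpoints.py \
    --inputs ${ST_SAVE_DIR} \
    --num-update-checkpoints 7 \
    --output ${SUB_SAVE_DIR}/checkpoint_last.pt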
Run the following code after setting:
- ${SPLIT_LIST}: the comma-separated list of the training ST datasets containing <eob> and <eol> (e.g. mustc_sub_train,europarl_sub_train,...);
- ${MUSTCINEMA_DEV}: the split name of the original MuST-Cinema dev set containing <eob> and <eol>;
- ${SUB_SAVE_DIR}: the folder in which to save the checkpoints of the final model.
python ${FBK_fairseq}/train.py ${DATA_ROOT} \
--train-subset ${SPLIT_LIST} \
--valid-subset ${MUSTCINEMA_DEV} \
--save-dir ${SUB_SAVE_DIR} \
--num-workers 1 --max-update 100000 --save-interval-updates 1000 \
--max-tokens 40000 --adam-betas '(0.9, 0.98)' \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml ${CONFIG_YAML} \
--criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch conformer \
--ctc-encoder-layer 8 --ctc-weight 0.5 --ctc-compress-strategy avg \
--optimizer adam --lr 1e-3 --lr-scheduler fixed \
--patience 10 \
--clip-norm 10.0 \
--seed 1 --update-freq 2 \
--skip-invalid-size-inputs-valid-test \
--log-format simple >> ${SUB_SAVE_DIR}/train.log
Then, average the checkpoints as mentioned above to obtain the final model checkpoint_avg7.pt.
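Under the same assumption (fairseq's scripts/average_checkpoints.py), this amounts to:

# Average the last 7 checkpoints of the fine-tuning to obtain the final model.
python ${FBK_fairseq}/scripts/average_checkpoints.py \
    --inputs ${SUB_SAVE_DIR} \
    --num-update-checkpoints 7 \
    --output ${SUB_SAVE_DIR}/checkpoint_avg7.pt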
Please use SHAS to generate the automatic segmentation files for the MuST-Cinema test set, EC Short Clips, and EuroParl Interviews, as done in our paper, and preprocess them.
To generate the srt files, run the script below after setting:
- ${DATA_ROOT}: the folder containing the preprocessed test set;
- ${SPLIT}: the name of the preprocessed test set;
- ${CONFIG_YAML}: the path to the yaml file of the model;
- ${MODEL}: the path to the model checkpoint in pt format;
- ${YAML_SEGM}: the path to the yaml containing the automatic segmentation obtained by SHAS, which was also used during preprocessing;
- ${SRTDIR}: the output folder that will contain the generated srt files.
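For reference, ${YAML_SEGM} is assumed here to follow the MuST-C-style segmentation format, one entry per audio segment with its offset and duration in seconds; the values below are purely illustrative, and SHAS may emit additional fields:

- {duration: 4.76, offset: 12.30, speaker_id: NA, wav: sample.wav}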
The script performs the generation process in 4 steps. This was done to ease experimentation with different methods, although it is inefficient: an efficient implementation could perform the full generation with a single forward pass on the direct ST model.
DATA_ROOT=$1
SPLIT=$2
CONFIG_YAML=$3
MODEL=$4
YAML_SEGM=$5
SRTDIR=$6
data_tmp_dir=$(mktemp -d)
mkdir -p $SRTDIR
# Generates the output subtitles (translations with <eol> and <eob>)
# with the autoregressive decoder
python ${FBK_fairseq}/fairseq_cli/generate.py $DATA_ROOT \
--user-dir examples/speech_to_text --config-yaml $CONFIG_YAML \
--gen-subset $SPLIT \
--max-tokens 16000 --unkpen 10000 --beam 5 \
--model-overrides "{'batch_unsafe_relative_shift': False}" \
--max-source-positions 16000 --max-target-positions 1000 \
--task speech_to_text_ctc --criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy --no-repeat-ngram-size 5 \
--path $MODEL > $data_tmp_dir/translation.out
grep "^D-" $data_tmp_dir/translation.out | cut -c3- | sort -k 1n | cut -f3 > $data_tmp_dir/translation.txt
# Generates the captions (transcripts with <eol> and <eob>) using
# the CTC predictions
python ${FBK_fairseq}/fairseq_cli/generate.py $DATA_ROOT \
--user-dir examples/speech_to_text --config-yaml $CONFIG_YAML \
--gen-subset $SPLIT \
--max-tokens 16000 --unkpen 10000 --beam 5 \
--model-overrides "{'batch_unsafe_relative_shift': False}" \
--max-source-positions 16000 --max-target-positions 1000 \
--task speech_to_text_ctcgen --criterion ctc_multi_loss \
--underlying-criterion label_smoothed_cross_entropy --no-repeat-ngram-size 5 \
--path $MODEL --lenpen 0.0 > $data_tmp_dir/transcript.out
grep "^D-" $data_tmp_dir/transcript.out | cut -c3- | sort -k 1n | cut -f3 > $data_tmp_dir/transcript.txt
# Runs the CTC segmentation to align the generated transcripts with the source
# audio, hence obtaining the estimated timestamps at block level
python ${FBK_fairseq}/examples/speech_to_text/scripts/ctc_align.py $DATA_ROOT \
--user-dir examples/speech_to_text --config-yaml $CONFIG_YAML \
--gen-subset $SPLIT \
--max-tokens 16000 --beam 5 \
--model-overrides "{'batch_unsafe_relative_shift': False}" \
--max-source-positions 16000 --max-target-positions 1000 \
--split-tokens "<eob>" --feature-duration 0.04 \
--task speech_to_text_ctc \
--criterion ctc_multi_loss --underlying-criterion cross_entropy \
--path $MODEL --text-file $data_tmp_dir/transcript.txt > $data_tmp_dir/transcript_align.out
grep "^SEGM-" $data_tmp_dir/transcript_align.out | cut -c6- | sort -k 1n | cut -f2 > $data_tmp_dir/transcript_align.txt
# Projects the caption timestamps onto the subtitling blocks with the Levenshtein method
python ${FBK_fairseq}/examples/speech_to_text/scripts/target_from_source_timestamp_levenshtein.py \
$data_tmp_dir/transcript.txt \
$data_tmp_dir/transcript_align.txt \
$data_tmp_dir/translation.txt \
$data_tmp_dir/translation_align.txt
# Creates the SRT files
python ${FBK_fairseq}/examples/speech_to_text/make_srt.py \
$data_tmp_dir/translation.txt \
$data_tmp_dir/translation_align.txt \
$YAML_SEGM \
$SRTDIR
rm -rf $data_tmp_dir
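If the commands above are saved in a script (the name generate_srt.sh is hypothetical), a complete invocation looks like:

# Positional arguments match $1..$6 at the top of the script.
bash generate_srt.sh ${DATA_ROOT} ${SPLIT} ${CONFIG_YAML} ${MODEL} ${YAML_SEGM} ${SRTDIR}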
Please use the SubER repository for the SubER-cased computation.
To evaluate BLEU (BLEUnb) and Sigma, please install EvalSub and mwerSegmenter, and run the following code after setting:
- ${REF_SRTDIR}: the folder containing the reference srt files;
- ${MWER_DIR}: the folder containing mwerSegmenter;
- ${EVALSUB_DIR}: the folder containing EvalSub;
- ${OUT_FILE}: the path in which to save the EvalSub output.
# These first 4 commands should be skipped for MuST-Cinema.
# For MuST-Cinema, use as reference the amara.$lang files
# instead of generating the text files from the SRTs.
cat ${SRTDIR}/*.srt > ${SRTDIR}/hyp.srt
cat ${REF_SRTDIR}/*.srt > ${REF_SRTDIR}/ref.srt
python from_srt_to_blocks.py ${SRTDIR}/hyp.srt
python from_srt_to_blocks.py ${REF_SRTDIR}/ref.srt
${MWER_DIR}/mwerSegmenter \
-mref ${REF_SRTDIR}/ref.srt.blocks \
-hypfile ${SRTDIR}/hyp.srt.blocks
mv __segments ${SRTDIR}/hyp.srt.blocks.resegm;
python ${EVALSUB_DIR}/evalsub_main.py -a -e2e \
-ref ${REF_SRTDIR}/ref.srt.blocks \
-sys ${SRTDIR}/hyp.srt.blocks.resegm \
-res ${OUT_FILE}
To evaluate CPL and CPS conformity, run:
python ${FBK_fairseq}/examples/speech_to_text/scripts/subtitle_compliance.py \
--srt-file ${SRTDIR}/*.srt \
--metrics cpl cps --remove-parenthesis-content
If you use this work, please cite:
@article{papi2023directsub,
  title={Direct Speech Translation for Automatic Subtitling},
  author={Papi, Sara and Gaido, Marco and Karakanta, Alina and Cettolo, Mauro and Negri, Matteo and Turchi, Marco},
  journal={Transactions of the Association for Computational Linguistics},
  year={2023}
}