[ACM MM24] Official implementation of paper "From Speaker to Dubber: Movie Dubbing with Prosody and Duration Consistency Learning"
- Release the Speaker2Dubber demo here.
- Release the generated test set at Google Drive or Baidu Cloud Drive (Password: mm24).
- Release Speaker2Dubber's training and inference code (Note: there may still be some bugs in the code; feel free to use it after the checkpoints are released).
- Release Speaker2Dubber's model.
- Update README.md (How to use).
- Release the first-stage and second-stage pre-trained checkpoints.
Our Python version is 3.8.18 and our CUDA version is 11.5; other compatible versions may also work. Both training and inference are implemented with PyTorch on a GeForce RTX 4090 GPU.
```bash
conda create -n speaker2dubber python=3.8.18
conda activate speaker2dubber
pip install -r requirements.txt
pip install git+https://github.com/resemble-ai/monotonic_align.git
```
You need to replace the path in `preprocess_config` (see `config/MovieAnimation/preprocess.yaml`) with your own preprocessed data path, then run:
```bash
python train.py -p config/MovieAnimation/preprocess.yaml -m config/MovieAnimation/model.yaml -t config/MovieAnimation/train.yaml
```
There are three inference settings in the V2C task:
```bash
python Synthesis.py --restore_step 50000 -s 1 -n 'YOUR_EXP_NAME'
python Synthesis.py --restore_step 50000 -s 2 -n 'YOUR_EXP_NAME'
python Synthesis.py --restore_step 50000 -s 3 -n 'YOUR_EXP_NAME'
```
The `-s` flag selects the inference setting: `1` uses the ground-truth audio as the reference audio, `2` uses another audio clip from the target speaker as the reference audio, and `3` is the zero-shot setting, which uses reference audio from an unseen dataset.
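The three settings can be summarized as a simple mapping from the `-s` value to its reference-audio source. The dictionary and helper below are illustrative sketches of ours, not code from the Speaker2Dubber repository:

```python
# Illustrative mapping of the -s flag to its reference-audio source
# (names are ours, not from the repo).
REFERENCE_AUDIO_SOURCE = {
    1: "ground-truth audio of the target utterance",
    2: "another clip from the same target speaker",
    3: "a clip from an unseen dataset (zero-shot)",
}

def reference_source(setting: int) -> str:
    """Describe the reference audio used for a given -s value."""
    if setting not in REFERENCE_AUDIO_SOURCE:
        raise ValueError(f"unknown inference setting: {setting}")
    return REFERENCE_AUDIO_SOURCE[setting]
```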
- GRID (BaiduDrive (code: GRID) / GoogleDrive)
- V2C-Animation dataset (chenqi-Denoise2)
We would like to thank the authors of previous related projects for generously sharing their code and insights: HPMDubbing, Monotonic Align, StyleSpeech, FastSpeech2, V2C, StyleDubber, PL-BERT, and HiFi-GAN.
If you find our work useful, please consider citing:
```bibtex
@inproceedings{zhang-etal-2024-speaker2dubber,
  author    = {Zhedong Zhang and Liang Li and Gaoxiang Cong and Haibing Yin and
               Yuhan Gao and Chenggang Yan and Anton van den Hengel and Yuankai Qi},
  title     = {From Speaker to Dubber: Movie Dubbing with Prosody and Duration Consistency Learning},
  booktitle = {Proceedings of the 32nd {ACM} International Conference on Multimedia,
               {MM} 2024, Melbourne, VIC, Australia, 28 October 2024 - 1 November 2024},
  pages     = {7523--7532},
  publisher = {{ACM}},
  year      = {2024}
}
```