Speaker2Dubber

[ACM MM24] Official implementation of paper "From Speaker to Dubber: Movie Dubbing with Prosody and Duration Consistency Learning"


🗒 TODOs

  • Release Speaker2Dubber's demo here.

  • Release the generated test set at Google Drive or Baidu Cloud Drive (Password: mm24).

  • Release Speaker2Dubber's training and inference code (Tip: there may still be some bugs in the code; feel free to use it after the checkpoints are released).

  • Release Speaker2Dubber's model.

  • Update README.md (How to use).

  • Release the first-stage and second-stage pre-trained checkpoints.

🌼 Environment

Our Python version is 3.8.18 and our CUDA version is 11.5; other compatible versions may also work. Both training and inference are implemented with PyTorch on a GeForce RTX 4090 GPU.

conda create -n speaker2dubber python=3.8.18
conda activate speaker2dubber
pip install -r requirements.txt
pip install git+https://github.com/resemble-ai/monotonic_align.git
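After installation, you can optionally run a quick sanity check (not part of the official setup) to confirm that PyTorch is installed and can see the GPU:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"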

🔧 Training

Replace the paths in the preprocess config (see "config/MovieAnimation/preprocess.yaml") with the paths to your own preprocessed data, then run:

python train.py -p config/MovieAnimation/preprocess.yaml -m config/MovieAnimation/model.yaml -t config/MovieAnimation/train.yaml
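For reference, the path entries to edit typically look like the sketch below. This is a hypothetical illustration based on common FastSpeech2-style preprocess configs; the actual key names and values are defined in config/MovieAnimation/preprocess.yaml, so check that file.

path:
  corpus_path: "/path/to/MovieAnimation"                    # raw corpus location (hypothetical key)
  raw_path: "./raw_data/MovieAnimation"                     # extracted raw data (hypothetical key)
  preprocessed_path: "./preprocessed_data/MovieAnimation"   # your preprocessed data (hypothetical key)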

✍ Inference

There are three settings in the V2C task.

python Synthesis.py --restore_step 50000 -s 1 -n 'YOUR_EXP_NAME'
python Synthesis.py --restore_step 50000 -s 2 -n 'YOUR_EXP_NAME'
python Synthesis.py --restore_step 50000 -s 3 -n 'YOUR_EXP_NAME'

The -s flag denotes the inference setting: 1 for Setting 1, which uses the ground-truth audio as the reference audio; 2 for Setting 2, which uses another audio clip from the target speaker as the reference audio; and 3 for the zero-shot setting, which uses reference audio from an unseen dataset.

📊 Dataset

  • GRID (BaiduDrive (code: GRID) / GoogleDrive)
  • V2C-Animation dataset (chenqi-Denoise2)

🙏 Acknowledgments

We would like to thank the authors of previous related projects for generously sharing their code and insights: HPMDubbing, Monotonic Align, StyleSpeech, FastSpeech2, V2C, StyleDubber, PL-BERT, and HiFi-GAN.

🤝 Citation

If you find our work useful, please consider citing:

@inproceedings{zhang-etal-2024-speaker2dubber,
  author       = {Zhedong Zhang and
                  Liang Li and
                  Gaoxiang Cong and
                  Haibing Yin and
                  Yuhan Gao and
                  Chenggang Yan and
                  Anton van den Hengel and
                  Yuankai Qi},
  title        = {From Speaker to Dubber: Movie Dubbing with Prosody and Duration Consistency
                  Learning},
  booktitle    = {Proceedings of the 32nd {ACM} International Conference on Multimedia,
                  {MM} 2024, Melbourne, VIC, Australia, 28 October 2024 - 1 November
                  2024},
  pages        = {7523--7532},
  publisher    = {{ACM}},
  year         = {2024},
}
