This baseline uses existing technologies to create a clean-slate, end-to-end personalized virtual voice platform from only two inputs: a recording of the user's voice and a sample of the target voice. In its current locally-running version, the platform is a script that runs Facebook's Flashlight for automatic speech recognition (ASR) and then SV2TTS to generate the speech in the target voice.
To get started, you will need to set up Flashlight and SV2TTS before running `pipeline_script.sh`.
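For orientation, here is a minimal sketch of what such a two-stage pipeline can look like. Every name in it (the Docker image, the Flashlight binary path, and the SV2TTS wrapper script and its flags) is a placeholder; it is not the contents of the actual `pipeline_script.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the two-stage pipeline; the real pipeline_script.sh
# may differ. Binary names, flags, and scripts below are placeholders.
set -euo pipefail

USER_AUDIO="$1"     # recording of the user's speech (what to say)
TARGET_VOICE="$2"   # sample of the target voice (how to say it)

# Stage 1: transcribe the user's speech with Flashlight ASR inside Docker.
# Substitute Flashlight's actual inference binary and flags from its docs.
docker run --rm -v "$PWD:/data" flml/flashlight \
    /path/to/flashlight_asr_inference "/data/$USER_AUDIO" > transcript.txt

# Stage 2: feed the transcript and the target voice to SV2TTS.
# The stock toolbox's demo_cli.py is interactive; a scripted wrapper is assumed.
python sv2tts_synthesize.py --text "$(cat transcript.txt)" \
    --voice "$TARGET_VOICE" --out cloned_output.wav
```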
- This project runs Flashlight with Docker. Please make sure you have Docker installed and running before using the baseline; a quick sanity check is shown after this list.
- Follow the instructions in the overview to install the example trained models from AWS S3. You do not need the LibriSpeech audio samples for this pipeline, but you may choose to download them as well for testing.
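As a sanity check for the Docker requirement above, something like the following works. The `flml/flashlight` image name and tag are assumptions based on Flashlight's published Docker images, so check Flashlight's documentation for the current tag.

```bash
# Confirm the Docker daemon is reachable (this fails if Docker is not running).
docker info > /dev/null && echo "Docker is up"

# Pull a Flashlight image (tag assumed; see Flashlight's docs for current tags).
docker pull flml/flashlight:cpu-latest
```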
Python 3.6 or 3.7 is needed to run the toolbox.
- Install PyTorch (>=1.0.1).
- Install ffmpeg.
- Run `pip install -r requirements.txt` to install the remaining necessary packages (example commands for these steps follow the list).
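Concretely, on a Debian-like system the steps above might look like the following. The exact PyTorch install command depends on your Python and CUDA setup (see pytorch.org), so treat this as a sketch rather than a canonical recipe.

```bash
# Check that you are on a supported Python version (3.6 or 3.7).
python --version

# Install PyTorch (>=1.0.1); pick the command matching your platform at pytorch.org.
pip install torch

# Install ffmpeg (Debian/Ubuntu shown; on macOS, `brew install ffmpeg`).
sudo apt-get install -y ffmpeg

# Install the remaining Python dependencies.
pip install -r requirements.txt
```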
Download the latest pretrained models here.
By the end of the installation process, you should have a model folder in your local repo.
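As a quick check that the download landed, you can list that folder; the name `model` is taken from the sentence above and may differ in your checkout.

```bash
# Verify the pretrained models are in place (folder name assumed).
ls model
```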
Before you download any dataset, you can begin by testing your configuration with:

```
python demo_cli.py
```

If all tests pass, you're good to go.
For playing with the toolbox alone, I only recommend downloading `LibriSpeech/train-clean-100`. Extract the contents as `<datasets_root>/LibriSpeech/train-clean-100`, where `<datasets_root>` is a directory of your choosing. Other datasets are supported in the toolbox; see here. You're free not to download any dataset, but then you will need your own data as audio files, or you will have to record it with the toolbox.
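For example, assuming `wget` and the official OpenSLR mirror (URL an assumption you should verify), downloading and extracting could look like:

```bash
# Download LibriSpeech train-clean-100 from OpenSLR (URL assumed current; ~6 GB).
wget https://www.openslr.org/resources/12/train-clean-100.tar.gz

# Extract under your chosen <datasets_root>; the archive already contains
# the LibriSpeech/train-clean-100 directory structure.
tar -xzf train-clean-100.tar.gz -C <datasets_root>
```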
You can then try the toolbox:

```
python demo_toolbox.py -d <datasets_root>
```

or

```
python demo_toolbox.py
```

depending on whether you downloaded any datasets. If you are running an X-server or if you have the error `Aborted (core dumped)`, see this issue.