Number of speakers detection
The baseline pipeline is as follows:
- Voice Activity Detection (VAD) first to identify the speech regions
- Mixing the voices and extracting the MFCC features (13-dimensional) per speech frame
- Classification of sequences of N frames using LSTM and CNN classifiers
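The mixing and sequence-building steps above can be sketched in plain Python. This is a minimal illustration, not the code in code/main.py: the function names `mix_signals` and `frame_sequences`, the averaging used for mixing, and the `hop` parameter are all assumptions made for the example. The per-frame feature vectors are dummy 13-dimensional lists standing in for the MFCCs.

```python
def mix_signals(signals):
    """Mix equal-sample-rate waveforms by averaging them sample by sample.

    Shorter signals are treated as zero-padded. Averaging is an assumption
    for this sketch; the project may mix voices differently.
    """
    length = max(len(s) for s in signals)
    mixed = [0.0] * length
    for signal in signals:
        for i, sample in enumerate(signal):
            mixed[i] += sample
    return [x / len(signals) for x in mixed]


def frame_sequences(features, n_frames, hop):
    """Group per-frame feature vectors into sequences of n_frames frames.

    `hop` (hypothetical parameter) is the step between sequence starts;
    hop < n_frames gives overlapping sequences.
    """
    return [
        features[start:start + n_frames]
        for start in range(0, len(features) - n_frames + 1, hop)
    ]


# Dummy data: 10 speech frames, each a 13-dimensional feature vector.
dummy_features = [[0.0] * 13 for _ in range(10)]
sequences = frame_sequences(dummy_features, n_frames=4, hop=2)
print(len(sequences), "sequences of", len(sequences[0]), "frames")
```

Each resulting sequence has shape N x 13 and would be the unit passed to the LSTM or CNN classifier.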
- code/main.py - main code to detect the number of speakers in an audio file
- code/prepare_training_data.ipynb - visualizations, data preprocessing and dataset preparation
- code/train_model_LSTM.ipynb - trains LSTM classifier
- code/train_model_CNN.ipynb - trains CNN classifier
- code/vad.py and code/utils/estnoise_ms.py - Voice Activity Detector, borrowed from [1]
conda config --add channels anaconda
conda config --append channels conda-forge
conda create -n "name_of_new_environment" --file package-list.txt
- activate the environment with
source activate "name_of_new_environment"
- run
pip install python-speech-features
- To download the dataset, run in terminal
wget --mirror --no-parent http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/
- Navigate to the 16kHz_16bit/ directory and unzip all the archives by running in terminal
for i in *.tgz; do echo working on $i; tar xvzf $i ; done
- Move the folder 16kHz_16bit/ with all the nested folders to the folder data/
- Navigate to the folder code/
- Run in terminal
python main.py
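The archive extraction step above can also be done from Python, which may be convenient on systems without a POSIX shell. The sketch below uses only the standard-library `tarfile` and `pathlib` modules. The function name `extract_archives` and its `dest_dir` argument are hypothetical and not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_archives(src_dir, dest_dir=None):
    """Unpack every .tgz archive found directly in src_dir.

    Files are extracted into dest_dir (defaults to src_dir).
    Returns the names of the archives that were processed.
    """
    src = Path(src_dir)
    dest = Path(dest_dir) if dest_dir else src
    dest.mkdir(parents=True, exist_ok=True)
    processed = []
    for archive in sorted(src.glob("*.tgz")):
        print("working on", archive.name)
        # Only extract archives from a source you trust.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)
        processed.append(archive.name)
    return processed


# Example call, assuming the dataset was downloaded as described above:
# extract_archives("16kHz_16bit")
```

Like the shell loop, this extracts each archive in place and leaves the original .tgz files untouched.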
[1] https://github.com/eesungkim/Voice_Activity_Detector
[2] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7298570
[3] https://mycourses.aalto.fi/pluginfile.php/146209/mod_resource/content/1/slides_07_vad.pdf
Python 3.6.3