faster-whisper-transcribe
faster-whisper-transcribe is a project focused on transcribing subtitles from video or audio files recorded during online calls and meetings. It can also produce an .mkv file by merging the input video with the generated subtitles.
GPU Execution Requirements:
- This project requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x for GPU execution. For detailed installation instructions, please refer to the CTranslate2 documentation. A quick sanity check is shown after this list.
- Alternatively, you can use the following command to install the required libraries:
conda install cudatoolkit=11.8 cudnn
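Once these libraries are installed, you can confirm that CTranslate2 detects the GPU. The following one-liner is a minimal sketch; it relies on CTranslate2's get_cuda_device_count() helper and should print a number greater than zero when cuBLAS and cuDNN are set up correctly:
# Prints the number of CUDA devices CTranslate2 can see; 0 means the GPU setup is incomplete.
python -c "import ctranslate2; print(ctranslate2.get_cuda_device_count())"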
- Create the environment:
conda create -n faster-whisper python=3.8 -y
conda activate faster-whisper
- Clone this repository:
git clone https://github.com/RxChi1d/faster-whisper-transcribe.git
cd faster-whisper-transcribe
- Install the required packages (a quick import check follows the commands):
conda install cudatoolkit=11.8 cudnn
pip install -r requirements.txt
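To verify the installation, you can check that the faster-whisper package imports cleanly. This is a minimal sketch; it assumes the installed package exposes a __version__ attribute:
# Prints the installed faster-whisper version if the import succeeds.
python -c "import faster_whisper; print(faster_whisper.__version__)"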
To use faster-whisper-transcribe, execute the transcribe.py script with the following options; an example invocation is shown after the list:
python transcribe.py [options]
- --input_path (-i): Path to the input video or audio file. This option is required.
- --output_path (-o): Path for the output transcription. If not provided, the output will be saved with a default name in the current directory.
- --merge_srt (-s): Flag to merge the generated SRT (subtitle) file with the input video. Default is True.
- --beam_size (-b): Beam size for the model. Default is 5.
- --model_size (-z): Specifies the size of the model to use. Choices include 'tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', and 'large-v2'. Default is 'large-v2'.
- --device_type (-d): Device type on which the model runs. Choices include 'auto', 'cuda', and 'cpu'. Default is 'cuda'.
- --device_index (-x): Device index for the model, useful when multiple GPUs are available. Default is 0.
- --compute_type (-c): Specifies the compute type for the model. Choices include 'default', 'float16', 'int8_float16', and 'int8'. Default is 'float16'.
- --cpu_threads (-t): Specifies the number of CPU threads to use. Default is the number of available CPU cores.
- --language (-l): Language for the model. Choices include 'auto', 'en', 'zh', 'ja', 'fr', and 'de'. Default is 'en'.
- --word_level_timestamps (-w): Flag to enable word-level timestamps in the output. Default is False.
- --vad_filter (-f): Flag to enable the Voice Activity Detection (VAD) filter, which helps filter out non-speech segments. Default is True.
- --vad_filter_min_silence_duration_ms (-g): Minimum silence duration (in milliseconds) for the VAD filter. Default is 50 ms.
- --verbose (-v): Flag to enable verbose mode, which provides detailed logs during execution. Default is True.
- --max_gap_ms_between_two_sentence (-mg): Specifies the maximum gap (in milliseconds) allowed between two sentences. Default is 200 ms.
- --max_length (-ml): Specifies the maximum length of a sentence. Default is 35 words.
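For example, a typical invocation on a recorded meeting could look like the following. This is a hypothetical example: meeting.mp4 and meeting.srt are placeholder file names, and only the options documented above are used:
# Transcribe an English recording with the large-v2 model and write the result to meeting.srt.
python transcribe.py -i meeting.mp4 -o meeting.srt -z large-v2 -l en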
Thanks to the following projects: