# Vosk API Training

This directory contains scripts and tools for training speech recognition models using the Kaldi toolkit.

## Table of Contents

1. [Overview](#overview)
2. [Directory Structure](#directory-structure)
3. [Installation](#installation)
4. [Training Process](#training-process)
   - [Data Preparation](#data-preparation)
   - [Dictionary Preparation](#dictionary-preparation)
   - [MFCC Feature Extraction](#mfcc-feature-extraction)
   - [Acoustic Model Training](#acoustic-model-training)
   - [TDNN Chain Model Training](#tdnn-chain-model-training)
   - [Decoding](#decoding)
5. [Results](#results)
6. [Contributing](#contributing)

## Overview

This repository provides tools for training custom speech recognition models using Kaldi. It supports acoustic model training, language model creation, and decoding pipelines.

## Directory Structure

```plaintext
.
├── cmd.sh                         # Command configuration for training and decoding
├── conf/
│   ├── mfcc.conf                  # Configuration for MFCC feature extraction
│   └── online_cmvn.conf           # Online cepstral mean and variance normalization (currently empty)
├── local/
│   ├── chain/
│   │   ├── run_ivector_common.sh  # i-vector extraction during chain model training
│   │   └── run_tdnn.sh            # Trains a TDNN model
│   ├── data_prep.sh               # Creates Kaldi data directories
│   ├── download_and_untar.sh      # Downloads and extracts datasets
│   ├── download_lm.sh             # Downloads language models
│   ├── prepare_dict.sh            # Prepares the pronunciation dictionary
│   └── score.sh                   # Scoring script for evaluation
├── path.sh                        # Sets Kaldi paths
├── RESULTS                        # Script that prints the best WER results
├── RESULTS.txt                    # WER results from decoding
├── run.sh                         # Main script for the entire training pipeline
├── steps -> ../../wsj/s5/steps/   # Symlink to Kaldi's WSJ steps for acoustic model training
└── utils -> ../../wsj/s5/utils/   # Symlink to Kaldi's utility scripts
```

### Key Files
- **cmd.sh**: Defines commands for running training and decoding tasks.
- **path.sh**: Sets up paths for Kaldi binaries and scripts.
- **run.sh**: Main entry point for the training pipeline; runs tasks in stages.
- **RESULTS**: Displays the Word Error Rate (WER) of the trained models.

## Installation

### Prerequisites
- [Kaldi](https://github.com/kaldi-asr/kaldi): the Kaldi toolkit must be installed and compiled.
- Additional tools: `ffmpeg`, `sox`, and `sctk` for data preparation and scoring.

### Steps
1. Clone the Vosk API repository.
2. Install Kaldi and make sure `KALDI_ROOT` in `path.sh` points to your installation.
3. Source `path.sh` and `cmd.sh` to set up the environment.

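As a sketch, `path.sh` usually just points `KALDI_ROOT` at the Kaldi checkout and pulls the compiled binaries onto `PATH`; the install location below is an assumption and should be adjusted:

```shell
# Hypothetical path.sh contents; /opt/kaldi is an assumed install location
export KALDI_ROOT=/opt/kaldi
export PATH=$KALDI_ROOT/tools/openfst/bin:$PWD/utils:$PATH
# common_path.sh (shipped with Kaldi) adds the compiled binary directories to PATH
. $KALDI_ROOT/tools/config/common_path.sh
export LC_ALL=C   # Kaldi scripts expect the C locale for stable sorting
```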
## Training Process

### Data Preparation
Run the data preparation stage of `run.sh`:
```bash
bash run.sh --stage 0 --stop_stage 0
```
This stage downloads and prepares the LibriSpeech dataset.

### Dictionary Preparation
Prepare the pronunciation dictionary with:
```bash
bash run.sh --stage 1 --stop_stage 1
```
This step generates the necessary files for Kaldi's `prepare_lang.sh` script.

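The dictionary directory consumed by `prepare_lang.sh` has a small, fixed set of files; a toy version with a hypothetical two-word lexicon looks like:

```shell
# Toy dictionary inputs for utils/prepare_lang.sh (hypothetical lexicon)
mkdir -p data/local/dict
cat > data/local/dict/lexicon.txt <<'EOF'
!SIL SIL
<UNK> SPN
hello HH AH0 L OW1
world W ER1 L D
EOF
printf 'SIL\nSPN\n' > data/local/dict/silence_phones.txt   # silence/noise phones
echo 'SIL'          > data/local/dict/optional_silence.txt # phone inserted between words
printf 'HH\nAH0\nL\nOW1\nW\nER1\nD\n' > data/local/dict/nonsilence_phones.txt
```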
### MFCC Feature Extraction
Run the MFCC extraction process:
```bash
bash run.sh --stage 2 --stop_stage 2
```
This step extracts Mel-frequency cepstral coefficient (MFCC) features and computes cepstral mean and variance normalization (CMVN) statistics.

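A typical `conf/mfcc.conf` is only a couple of options; the values below are illustrative and assume 16 kHz input audio:

```shell
# Write an illustrative conf/mfcc.conf (values assume 16 kHz audio)
mkdir -p conf
cat > conf/mfcc.conf <<'EOF'
# sample-frequency must match the audio sample rate
--use-energy=false
--sample-frequency=16000
EOF
```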
### Acoustic Model Training
Train the monophone, LDA+MLLT, and SAT models:
```bash
bash run.sh --stage 3 --stop_stage 3
```
This stage trains the GMM-based models and aligns the data for TDNN training.

### TDNN Chain Model Training
Train a time-delay neural network (TDNN) chain model:
```bash
bash run.sh --stage 4 --stop_stage 4
```
The chain model uses i-vectors for speaker adaptation.

### Decoding
After training, decode the test data:
```bash
bash run.sh --stage 5 --stop_stage 5
```
This step decodes the test set with the trained model and evaluates the Word Error Rate (WER).

## Results

The best WER can be printed by running:
```bash
bash RESULTS
```
Example `RESULTS.txt`:
```plaintext
%WER 14.10 [ 2839 / 20138, 214 ins, 487 del, 2138 sub ] exp/chain/tdnn/decode_test/wer_11_0.0
%WER 12.67 [ 2552 / 20138, 215 ins, 406 del, 1931 sub ] exp/chain/tdnn/decode_test_rescore/wer_11_0.0
```
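Each line reports total errors (insertions + deletions + substitutions) over the number of reference words. The arithmetic for the first line can be reproduced directly:

```shell
# Recompute the first line's WER from its error counts:
# errors = 214 ins + 487 del + 2138 sub = 2839; WER = 100 * errors / 20138 ref words
awk 'BEGIN {
  ins = 214; del = 487; subst = 2138; ref_words = 20138
  printf "%%WER %.2f\n", 100 * (ins + del + subst) / ref_words
}'
# prints: %WER 14.10
```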