Commit a9f27eb: Update README.md with training details (#1637)
Parent: 1b308a3

1 file changed (+119, -2 lines)

training/README.md

# Vosk API Training

This directory contains scripts and tools for training speech recognition models using the Kaldi toolkit.

## Table of Contents

1. [Overview](#overview)
2. [Directory Structure](#directory-structure)
3. [Installation](#installation)
4. [Training Process](#training-process)
   - [Data Preparation](#data-preparation)
   - [Dictionary Preparation](#dictionary-preparation)
   - [MFCC Feature Extraction](#mfcc-feature-extraction)
   - [Acoustic Model Training](#acoustic-model-training)
   - [TDNN Chain Model Training](#tdnn-chain-model-training)
   - [Decoding](#decoding)
5. [Results](#results)
6. [Contributing](#contributing)

## Overview

This repository provides tools for training custom speech recognition models using Kaldi. It supports acoustic model training, language model creation, and decoding pipelines.

## Directory Structure

```plaintext
.
├── cmd.sh                         # Command configuration for training and decoding
├── conf/
│   ├── mfcc.conf                  # Configuration for MFCC feature extraction
│   └── online_cmvn.conf           # Online cepstral mean and variance normalization (currently empty)
├── local/
│   ├── chain/
│   │   ├── run_ivector_common.sh  # i-vector extraction during chain model training
│   │   └── run_tdnn.sh            # Trains the TDNN chain model
│   ├── data_prep.sh               # Creates Kaldi data directories
│   ├── download_and_untar.sh      # Downloads and extracts datasets
│   ├── download_lm.sh             # Downloads language models
│   ├── prepare_dict.sh            # Prepares the pronunciation dictionary
│   └── score.sh                   # Scoring script for evaluation
├── path.sh                        # Sets up Kaldi paths
├── RESULTS                        # Script for printing the best WER results
├── RESULTS.txt                    # WER results from decoding
├── run.sh                         # Main script for the entire training pipeline
├── steps -> ../../wsj/s5/steps/   # Link to Kaldi's WSJ steps for acoustic model training
└── utils -> ../../wsj/s5/utils/   # Link to Kaldi's utility scripts
```

### Key Files

- **cmd.sh**: Defines commands for running training and decoding tasks.
- **path.sh**: Sets up paths for Kaldi binaries and scripts.
- **run.sh**: Main entry point for the training pipeline, running tasks in stages.
- **RESULTS**: Displays Word Error Rate (WER) for the trained models.

## Installation

### Prerequisites

- [Kaldi](https://github.com/kaldi-asr/kaldi): the Kaldi toolkit must be installed and configured.
- Required tools: `ffmpeg`, `sox`, and `sctk` for data preparation and scoring.

### Steps

1. Clone the Vosk API repository.
2. Install Kaldi and ensure `KALDI_ROOT` is set correctly in `path.sh`.
3. Set environment variables using `cmd.sh` and `path.sh`.

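As a sketch of what `path.sh` typically contains in a Kaldi recipe (the `/opt/kaldi` location is an assumption; point `KALDI_ROOT` at your own Kaldi checkout):

```bash
# Assumed location of the local Kaldi checkout; adjust to your installation.
export KALDI_ROOT=/opt/kaldi
# Put recipe utilities and OpenFst binaries on PATH.
export PATH=$PWD/utils:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
# Load Kaldi's common environment settings if present.
[ -f "$KALDI_ROOT/tools/env.sh" ] && . "$KALDI_ROOT/tools/env.sh"
# Kaldi scripts require the C locale for consistent sorting.
export LC_ALL=C
```

If `KALDI_ROOT` points at the wrong place, most stages fail immediately with "command not found" errors, so it is worth verifying before starting a long run.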
## Training Process

### Data Preparation

Run the data preparation stage in `run.sh`:

```bash
bash run.sh --stage 0 --stop_stage 0
```

This stage downloads and prepares the LibriSpeech dataset.

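The result of this stage is a set of Kaldi data directories. As a sketch of their standard layout (the utterance and speaker IDs below are hypothetical; real ones come from the corpus), each directory holds `wav.scp`, `text`, and `utt2spk`, all keyed and sorted by utterance ID:

```bash
# Hypothetical example of a Kaldi data directory layout.
mkdir -p data/train_example
cat > data/train_example/wav.scp <<'EOF'
spk1-utt1 /path/to/audio/spk1-utt1.wav
spk1-utt2 /path/to/audio/spk1-utt2.wav
EOF
cat > data/train_example/text <<'EOF'
spk1-utt1 HELLO WORLD
spk1-utt2 TESTING SPEECH RECOGNITION
EOF
cat > data/train_example/utt2spk <<'EOF'
spk1-utt1 spk1
spk1-utt2 spk1
EOF
# Kaldi requires each file to be sorted by its first column in C-locale order.
LC_ALL=C sort -c data/train_example/wav.scp && echo "data dir ok"
```

Kaldi's `utils/validate_data_dir.sh` performs these checks (and more) on a real data directory.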
### Dictionary Preparation

Prepare the pronunciation dictionary with:

```bash
bash run.sh --stage 1 --stop_stage 1
```

This step generates the necessary files for Kaldi's `prepare_lang.sh` script.

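The central file in the dictionary directory is `lexicon.txt`, which maps each word to its phone sequence, one entry per line. A minimal hypothetical example (the words and ARPAbet-style phones are illustrative only):

```bash
# Tiny illustrative lexicon in Kaldi's format: WORD PHONE1 PHONE2 ...
cat > lexicon.txt <<'EOF'
HELLO HH AH0 L OW1
WORLD W ER1 L D
EOF
# Every entry needs a word plus at least one phone.
awk 'NF < 2 { bad = 1 } END { exit bad }' lexicon.txt && echo "lexicon ok"
```

`prepare_lang.sh` consumes this dictionary directory and compiles it into the `lang` directory used by training and decoding.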
### MFCC Feature Extraction

Run the MFCC extraction process:

```bash
bash run.sh --stage 2 --stop_stage 2
```

This step extracts Mel-frequency cepstral coefficient (MFCC) features and computes cepstral mean and variance normalization (CMVN) statistics.

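Feature extraction is controlled by `conf/mfcc.conf`. A typical configuration for 16 kHz speech looks like the following (these options are common defaults in Kaldi recipes; check the actual file in this directory for the values used here):

```plaintext
--use-energy=false
--sample-frequency=16000
```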
### Acoustic Model Training

Train monophone, LDA+MLLT, and SAT models:

```bash
bash run.sh --stage 3 --stop_stage 3
```

This stage trains GMM-based models and aligns the data for TDNN training.

### TDNN Chain Model Training

Train a Time-Delay Neural Network (TDNN) chain model:

```bash
bash run.sh --stage 4 --stop_stage 4
```

The chain model uses i-vectors for speaker adaptation.

### Decoding

After training, decode the test data:

```bash
bash run.sh --stage 5 --stop_stage 5
```

This step decodes the test set with the trained model and evaluates the Word Error Rate (WER).

## Results

WER can be evaluated by running:

```bash
bash RESULTS
```

Example of `RESULTS.txt`:

```plaintext
%WER 14.10 [ 2839 / 20138, 214 ins, 487 del, 2138 sub ] exp/chain/tdnn/decode_test/wer_11_0.0
%WER 12.67 [ 2552 / 20138, 215 ins, 406 del, 1931 sub ] exp/chain/tdnn/decode_test_rescore/wer_11_0.0
```
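Each line follows Kaldi's scoring format: `%WER <percent> [ <errors> / <reference words>, <ins> ins, <del> del, <sub> sub ] <decode dir>`, where the error count is the sum of insertions, deletions, and substitutions (214 + 487 + 2138 = 2839 in the first line). The percentage can be recomputed directly from the counts:

```bash
# Recompute the WER percentage from the counts in the first RESULTS.txt line.
line='%WER 14.10 [ 2839 / 20138, 214 ins, 487 del, 2138 sub ] exp/chain/tdnn/decode_test/wer_11_0.0'
echo "$line" | awk '{ gsub(",", ""); printf "%.2f\n", 100 * ($7 + $9 + $11) / $6 }'
# Prints 14.10, matching the reported figure.
```

The second decode directory (`decode_test_rescore`) shows the improvement from language model rescoring: 12.67% versus 14.10% WER.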
