Fork of p2fa. This Python script computes an alignment between a speech audio file and a verbatim text transcript. It also calls the CMU Sphinx lmtool to get pronunciations for words that are not in the default dictionary, so internet access is required to run the script.
I've changed the input and output formats, and also how pronunciations are found for words that aren't in the dictionary.
See the original readme.txt in the repo for more details.
This script was used in my research project: Content-Based Tools for Editing Audio Stories [UIST 2013].
Install HTK 3.4. Note: 3.4.1 will not work. Get HTK here.
On OSX this can be a pain. Here's one method that works:
./configure --without-x --disable-hslab CFLAGS='-I/usr/include/malloc'
Then edit HTKLib/esignal.c and replace every occurrence of ARCH with "darwin".
Then run make all && sudo make install
pip install -r requirements.txt
On OSX, with homebrew:
brew install sox
In the p2fa-vislab directory, run:
git submodule init
git submodule update
python align.py audio_file.wav transcript_input.json aligned_output.json
The input audio_file.wav must be 16-bit and mono.
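To check an audio file against this requirement before running the aligner, something like the following sketch may help. It uses only Python's standard wave module; the function name check_wav is hypothetical and is not part of this repo.

```python
import wave


def check_wav(path):
    """Return a list of problems; an empty list means the file is 16-bit mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        if channels != 1:
            problems.append("expected 1 channel, found %d" % channels)
        if sample_width != 2:
            problems.append("expected 16-bit samples, found %d-bit" % (8 * sample_width))
    return problems


# Usage (the file name is a placeholder):
#     for problem in check_wav("audio_file.wav"):
#         print(problem)
```

If the check reports problems, the file needs to be converted (for example with sox) before alignment.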
The input transcript JSON must conform to the following JSON Schema:
{
"title": "Transcript Schema",
"description": "A transcript of an audio file",
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "array",
"items": {
"title": "Line",
"type": "object",
"description": "An individual line or paragraph of the transcript",
"properties": {
"speaker": {
"description": "Speaker of the line or paragraph",
"type": "string"
},
"line": {
"description": "Text of the line or paragraph",
"type": "string"
}
},
"required": ["line", "speaker"]
}
}
For example:
[
{
"speaker": "Steve",
"line": "Hi, my name is Steve."
},
{
"speaker": "Steve",
"line": "What's your name?"
}
]
(There's no need to list those two lines separately, since they have the same speaker.) To convert a plain text transcript into a file that adheres to this schema, see text_to_transcript.py.
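As a quick sanity check before alignment, the requirements of the transcript schema can be tested by hand. This is a minimal sketch using only the standard library; validate_transcript is a hypothetical helper, not something provided by this repo, and it covers only the constraints listed in the schema above (an array of objects with string "speaker" and "line" fields).

```python
import json


def validate_transcript(transcript):
    """Check a parsed transcript against the schema; return a list of error strings."""
    if not isinstance(transcript, list):
        return ["top-level value must be an array"]
    errors = []
    for i, entry in enumerate(transcript):
        if not isinstance(entry, dict):
            errors.append("item %d: must be an object" % i)
            continue
        for key in ("speaker", "line"):
            if key not in entry:
                errors.append("item %d: missing required key %r" % (i, key))
            elif not isinstance(entry[key], str):
                errors.append("item %d: %r must be a string" % (i, key))
    return errors


transcript = json.loads('[{"speaker": "Steve", "line": "Hi, my name is Steve."}]')
errors = validate_transcript(transcript)  # empty list when the transcript is valid
```

A full JSON Schema validator would be the more general choice; the hand-written check just avoids an extra dependency.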
The output will be a JSON file conforming to the following JSON Schema:
{
"title": "Alignment Schema",
"description": "An alignment of a transcript to an audio file",
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"additionalProperties": true,
"properties": {
"words": {
"type": "array",
"items": {
"title": "Word",
"type": "object",
"description": "An individual aligned word of the transcript and audio file",
"properties": {
"word": {
"description": "Original word",
"type": "string"
},
"alignedWord": {
"description": "Word processed by the alignment algorithm",
"type": "string"
},
"start": {
"description": "Start time of the aligned word, in seconds",
"type": "number"
},
"end": {
"description": "End time of the aligned word, in seconds",
"type": "number"
},
"speaker": {
"description": "Speaker of the word",
"type": "string"
},
"line_idx": {
"description": "Index of input line that word came from",
"type": "integer"
}
},
"required": ["word", "alignedWord", "start", "end"]
}
}
}
}
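To illustrate how the output might be consumed, here is a small sketch that groups aligned words by line_idx and computes the time span of each input line. The function name line_spans and the sample data are made up for illustration; only the field names come from the schema above. Because line_idx is not a required field, words without it are skipped.

```python
def line_spans(alignment):
    """Map each line_idx to the (start, end) span, in seconds, of its aligned words."""
    spans = {}
    for word in alignment["words"]:
        idx = word.get("line_idx")  # optional in the schema
        if idx is None:
            continue
        start, end = word["start"], word["end"]
        if idx in spans:
            prev_start, prev_end = spans[idx]
            spans[idx] = (min(prev_start, start), max(prev_end, end))
        else:
            spans[idx] = (start, end)
    return spans


# Made-up sample data shaped like the alignment schema (not real aligner output).
alignment = {
    "words": [
        {"word": "Hi,", "alignedWord": "HI", "start": 0.5, "end": 0.75,
         "speaker": "Steve", "line_idx": 0},
        {"word": "Steve.", "alignedWord": "STEVE", "start": 1.0, "end": 1.25,
         "speaker": "Steve", "line_idx": 0},
        {"word": "name?", "alignedWord": "NAME", "start": 2.0, "end": 2.5,
         "speaker": "Steve", "line_idx": 1},
    ]
}
spans = line_spans(alignment)
```

In a real script, the alignment dictionary would come from loading aligned_output.json with json.load.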
You can also specify --textgrid and --no-json on the command line to get the output of the script as a Praat TextGrid file instead of in the JSON format.