Fashad-Ahmed/The-Speako

Transcribing audio files using Hugging Face's implementation of Wav2Vec2 + "chain-linking" NLP tasks to combine speech-to-text with downstream tasks like translation and keyword extraction

🚀 About Project


Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a program to convert human speech into written text. The Speako is a Natural Language Processing project built on a stack of technologies to transcribe English audio files spoken in any accent. The project also lets the user translate the transcribed text into Urdu, and finally extracts keywords from it.

Features:


  • 🤩 Transcribe your .flac or .wav audio files.
  • 🥳 Translate the transcription into Urdu.
  • 😋 Extract the key points of the text.

📂 Project Organization


├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

📃 Transcription


Model Selection Steps:

We selected four transcription models and evaluated them to choose the best performer among them (an inference sketch with the winning model follows the list):

  • facebook/wav2vec2-large-960h-lv60-self
  • facebook/wav2vec2-lv60-base
  • PyTorch Transformers model
  • DeepSpeech model by Mozilla
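
A minimal sketch of inference with the winning model, assuming the Hugging Face `transformers` and `torchaudio` packages (the input path `sample.wav` is hypothetical):

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL = "facebook/wav2vec2-large-960h-lv60-self"
processor = Wav2Vec2Processor.from_pretrained(MODEL)
model = Wav2Vec2ForCTC.from_pretrained(MODEL)

# "sample.wav" is a hypothetical input file.
waveform, sample_rate = torchaudio.load("sample.wav")
if sample_rate != 16_000:  # the model expects 16 kHz input
    waveform = torchaudio.transforms.Resample(sample_rate, 16_000)(waveform)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
transcription = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
print(transcription)
```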

📈 Dataset:

  • Audio clips of different English accents were collected from several online resources.

🔩 Preprocessing Step:

  • Resampled the audio files to 16 kHz
  • Removed distortion and background noise from the audio (see the sketch below)
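
A minimal sketch of this preprocessing, assuming `torchaudio` for resampling and the `noisereduce` package for denoising (the README does not name the actual noise-removal tool, and the input path is hypothetical):

```python
import noisereduce as nr
import torchaudio

waveform, sr = torchaudio.load("clip.flac")  # hypothetical input clip
if sr != 16_000:  # resample to the 16 kHz rate the models expect
    waveform = torchaudio.transforms.Resample(sr, 16_000)(waveform)
clean = nr.reduce_noise(y=waveform.squeeze().numpy(), sr=16_000)
```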

💭 Evaluation:

The following evaluation metrics were considered when selecting the best-performing model (a worked sketch follows the list):

  • Word Error Rate for each model
  • Match Error Rate for each model
  • Word Information Loss for each model
  • Word Error Rate for each accent
  • Match Error Rate for each accent
  • Word Information Loss for each accent
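
A minimal sketch of computing these three metrics, assuming the `jiwer` package (the README does not name the metric library; the reference/hypothesis strings are placeholders):

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"   # ground-truth transcript
hypothesis = "the quick brown fox jumped over a lazy dog"   # model output

print("WER:", jiwer.wer(reference, hypothesis))  # Word Error Rate
print("MER:", jiwer.mer(reference, hypothesis))  # Match Error Rate
print("WIL:", jiwer.wil(reference, hypothesis))  # Word Information Loss
```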

All evaluation results and metadata are logged to neptune.ai.
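
A minimal logging sketch, assuming the `neptune` client (the project path and metric values are placeholders):

```python
import neptune

# Placeholder project path; the API token is read from the
# NEPTUNE_API_TOKEN environment variable.
run = neptune.init_run(project="workspace/speako")

run["eval/wav2vec2-large-960h-lv60-self/wer"] = 0.12  # placeholder values
run["eval/wav2vec2-large-960h-lv60-self/mer"] = 0.10
run["eval/wav2vec2-large-960h-lv60-self/wil"] = 0.18
run.stop()
```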

Fine-Tuning Steps:

📈 Dataset:

  • TIMIT: a corpus of phonemically and lexically transcribed speech of American English speakers of different genders and dialects.

🔩 Preprocessing Step:

  • Removed irrelevant features from the English dataset (in our dataset: phonetic_detail, word_detail, dialect_region, sentence_type, speaker_id)
  • Removed special characters such as , ? . ! - ; : " from the transcripts
  • Resampled the audio files to 16 kHz (see the sketch below)
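
A minimal sketch of these steps with the Hugging Face `datasets` library (column names follow the public `timit_asr` dataset; the actual preprocessing script is not shown here):

```python
import re
from datasets import Audio, load_dataset

# timit_asr requires a local copy of TIMIT; the path is hypothetical.
timit = load_dataset("timit_asr", data_dir="path/to/TIMIT", split="train")

# Drop features that are irrelevant for ASR fine-tuning.
timit = timit.remove_columns(
    ["phonetic_detail", "word_detail", "dialect_region", "sentence_type", "speaker_id"]
)

# Strip punctuation and special characters from the transcripts.
chars_to_remove = r'[\,\?\.\!\-\;\:\"]'
timit = timit.map(lambda batch: {"text": re.sub(chars_to_remove, "", batch["text"]).lower()})

# Ensure every clip is sampled at 16 kHz.
timit = timit.cast_column("audio", Audio(sampling_rate=16_000))
```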

💭 Evaluation:

  • Used WER (Word Error Rate)

🎶 Training:

🔤 Translation


Supported Language:

URDU

  • Translates the transcription into Urdu.
  • Model used: Helsinki-NLP/opus-mt-en-ur (see the sketch below)
  • In the future we will add a pipeline to preprocess the text and generate results directly.
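
A minimal translation sketch with the model named above, assuming the Hugging Face `transformers` pipeline API (the input sentence is a placeholder):

```python
from transformers import pipeline

# English-to-Urdu translation with the OPUS-MT model named above.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ur")
result = translator("Speech recognition converts spoken language into text.")
print(result[0]["translation_text"])
```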

📌 Keywords Extraction


  • A text-analysis feature that automatically extracts the most important words from a transcription. It helps summarize the content of texts and recognize the main topics discussed.
  • Model used: KeyBERT (see the sketch below)
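
A minimal extraction sketch with KeyBERT (the transcript string is a placeholder):

```python
from keybert import KeyBERT

kw_model = KeyBERT()
transcript = "Speech recognition converts spoken language into written text."
keywords = kw_model.extract_keywords(transcript, top_n=5)
print(keywords)  # list of (keyword, score) pairs
```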

🔮 User Interface


  • The UI of the project is built using Streamlit.
  • It provides a responsive GUI that presents each model's results (a minimal skeleton follows).
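
A minimal Streamlit skeleton of such a front end (the actual `app.py` is not shown in the README; the widget layout here is an assumption):

```python
import streamlit as st

st.title("The Speako")
audio_file = st.file_uploader("Upload a .wav or .flac file", type=["wav", "flac"])
if audio_file is not None:
    st.audio(audio_file)  # play back the uploaded clip
    # ...run transcription, translation, and keyword extraction here...
```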

🏭 Project Pipelining


  • Inference pipeline using ZenML (a skeleton sketch follows)
  • Fine-tuning pipeline for English ASR with Transformers
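
A skeleton sketch of an inference pipeline, assuming ZenML's `@step`/`@pipeline` decorators (step bodies are placeholders, not the project's actual code):

```python
from zenml import pipeline, step

@step
def load_audio() -> str:
    return "path/to/audio.wav"  # hypothetical input

@step
def transcribe(path: str) -> str:
    # Would call the Wav2Vec2 model on the loaded audio here.
    return "transcribed text"

@pipeline
def inference_pipeline():
    transcribe(load_audio())

if __name__ == "__main__":
    inference_pipeline()  # registers and runs the pipeline
```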

🏡 Developer Setup Guide


⏮️ Prerequisites

You can run the Docker image on your local system:

`docker pull taserx/speako:latest`

`docker run -p 8501:8501 taserx/speako:latest`

`docker exec -it <container_name> bash`

`apt-get update && apt-get install libsndfile1`

To run the Python app directly:

`python app.py`

⚒️ Built Upon


- Python
- facebook/wav2vec2-large-960h-lv60-self
- Helsinki-NLP/opus-mt-en-ur
- KeyBERT
- Streamlit
- Docker

🔧 Tools Used


- Visual Studio Code
- Google Colaboratory
- Google Drive Mount
- Neptune AI

👥 Collaborators


📋 License


This project is licensed under the MIT License - see the LICENSE file for details.
