This project implements and compares three Part-of-Speech (POS) tagging algorithms (Eager, Viterbi, and Individually Most Probable Tags) on English, Swedish, and Korean. The three languages differ in morphology and syntax, so comparing tagging accuracy across them shows how each algorithm copes with different linguistic structure.
- Implement three distinct POS tagging algorithms of varying complexity: Eager, Viterbi, and Individually Most Probable Tags.
- Train and evaluate these algorithms using multilingual corpora from the Universal Dependencies Treebank.
- Analyze algorithm performance across English, Swedish, and Korean, and relate the differences to the structure of each language.
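
The three decoders can be sketched on a toy hidden Markov model. This is a minimal illustration, not the code in `pos_tagging.py`: it assumes "Eager" means a greedy left-to-right choice conditioned on the previously chosen tag, and that "Individually Most Probable Tags" means picking the tag with the highest forward-backward posterior at each position. The probability tables, the `UNK` fallback, and all function names are hypothetical.

```python
import math

UNK = 1e-6  # assumed fallback emission probability for unseen (tag, word) pairs

def emit(emit_p, tag, word):
    return emit_p[tag].get(word, UNK)

def eager_tag(words, tags, start_p, trans_p, emit_p):
    """Greedy: commit to the best tag at each step given the previous choice."""
    result, prev = [], None
    for w in words:
        prior = start_p if prev is None else trans_p[prev]
        prev = max(tags, key=lambda t: prior[t] * emit(emit_p, t, w))
        result.append(prev)
    return result

def viterbi_tag(words, tags, start_p, trans_p, emit_p):
    """Viterbi: the single most probable tag sequence (log space)."""
    best = [{t: math.log(start_p[t]) + math.log(emit(emit_p, t, words[0]))
             for t in tags}]
    back = []
    for w in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            p = max(tags, key=lambda q: best[-1][q] + math.log(trans_p[q][t]))
            scores[t] = (best[-1][p] + math.log(trans_p[p][t])
                         + math.log(emit(emit_p, t, w)))
            pointers[t] = p
        best.append(scores)
        back.append(pointers)
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):  # follow backpointers to the start
        path.append(pointers[path[-1]])
    return path[::-1]

def posterior_tag(words, tags, start_p, trans_p, emit_p):
    """Per-position argmax of forward * backward (unscaled, short inputs only)."""
    fwd = [{t: start_p[t] * emit(emit_p, t, words[0]) for t in tags}]
    for w in words[1:]:
        fwd.append({t: emit(emit_p, t, w)
                    * sum(fwd[-1][p] * trans_p[p][t] for p in tags)
                    for t in tags})
    bwd = [{t: 1.0 for t in tags}]
    for w in reversed(words[1:]):
        bwd.insert(0, {p: sum(trans_p[p][t] * emit(emit_p, t, w) * bwd[0][t]
                              for t in tags)
                       for p in tags})
    return [max(tags, key=lambda t: fwd[i][t] * bwd[i][t])
            for i in range(len(words))]

# Toy model with made-up probabilities.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.4, "dog": 0.05},
}
SENTENCE = ["the", "dog", "barks"]
```

On this unambiguous sentence all three decoders return the same tags. They can disagree when a locally attractive tag leads to a poor continuation, which is the case Viterbi is designed to handle.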
| Language | Eager Accuracy (%) | Viterbi Accuracy (%) | Individually Most Probable Tags Accuracy (%) |
|---|---|---|---|
| English | 88.6 | 91.3 | 88.6 |
| Swedish | 85.7 | 90.2 | 85.7 |
| Korean | 80.8 | 79.2 | 80.8 |
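
For reference, an accuracy figure like those in the table could be computed as the share of tokens whose predicted tag matches the gold tag. This sketch assumes token-level accuracy over the whole test set. The source does not state the exact metric, and `token_accuracy` is a hypothetical helper name.

```python
def token_accuracy(gold_sents, pred_sents):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return 100.0 * correct / total if total else 0.0

# Illustrative tags only, not taken from the project's corpora.
gold = [["DET", "NOUN", "VERB"], ["PRON", "VERB"]]
pred = [["DET", "NOUN", "NOUN"], ["PRON", "VERB"]]
score = token_accuracy(gold, pred)  # 4 of 5 tokens correct
```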
pip install conllu
pip install nltk
python3 pos_tagging.py
- Python: Primary programming language for implementation.
- CoNLL-U (`conllu` package): For parsing and preparing corpora.
- NLTK: For calculating emission and transition probabilities.
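
As a rough sketch of the NLTK step, emission and transition probabilities can be estimated from tagged sentences with `ConditionalFreqDist` and `ConditionalProbDist`. The example assumes unsmoothed maximum-likelihood estimates and a hypothetical `<s>` start symbol. The project may use a different estimator or smoothing. The tiny corpus here is invented. In practice the (word, tag) pairs would come from the parsed CoNLL-U files.

```python
from nltk.probability import (
    ConditionalFreqDist,
    ConditionalProbDist,
    MLEProbDist,
)

# Invented toy corpus of (word, tag) pairs, one list per sentence.
tagged_sents = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

# Emission counts: condition on the tag, sample the word.
emission_pairs = [
    (tag, word.lower()) for sent in tagged_sents for word, tag in sent
]

# Transition counts: condition on the previous tag, sample the next tag.
transition_pairs = []
for sent in tagged_sents:
    tags = ["<s>"] + [tag for _, tag in sent]  # "<s>" is an assumed start marker
    transition_pairs.extend(zip(tags, tags[1:]))

emission = ConditionalProbDist(ConditionalFreqDist(emission_pairs), MLEProbDist)
transition = ConditionalProbDist(ConditionalFreqDist(transition_pairs), MLEProbDist)

p_dog_given_noun = emission["NOUN"].prob("dog")      # P(word="dog" | tag=NOUN)
p_noun_after_det = transition["DET"].prob("NOUN")    # P(next=NOUN | prev=DET)
```

Maximum-likelihood estimates assign zero probability to unseen words, so a real tagger would typically swap `MLEProbDist` for a smoothed estimator or handle unknown words separately.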
Thanks to the Universal Dependencies Treebank for providing high-quality multilingual data, which made this comparison of POS tagging algorithms possible.
For more insights, read the associated blog post: