ML Project 2: Disambiguating Voynich Manuscript transliterations with word embeddings

Team members

  • Jirka Lhotka
  • Francesco Salvi
  • Liudvikas Lazauskas

Repo structure

The repository contains 3 main notebooks as well as 4 modules:

  • embeddings_italian.ipynb Trains and evaluates embeddings on Italian text (Dante's Inferno).
  • embeddings_latin.ipynb Trains and evaluates embeddings on Latin text (Albert of Aix).
  • embeddings_voynich.ipynb Trains embeddings on the Voynich Manuscript.
  • corruptions.py Provides methods to compute ambiguity distributions and to artificially corrupt the texts.
  • uncertainties.py Provides a class that represents ambiguities together with their contexts, and methods to build a list of ambiguities from a corrupted text.
  • baseline.py Provides methods to generate baseline predictions by computing letter frequencies in the text.
  • validation.py Provides methods to generate predictions and to evaluate the models by computing their accuracy.
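
As a rough illustration of the idea behind the baseline and validation modules, the sketch below resolves each ambiguity by picking the candidate letter that occurs most often in the text, then computes accuracy against known answers. The function names (`letter_frequencies`, `baseline_predict`, `accuracy`) and the toy data are hypothetical and are not taken from `baseline.py` or `validation.py`; the actual modules may work differently.

```python
from collections import Counter


def letter_frequencies(text):
    """Count how often each alphabetic character occurs in the text."""
    return Counter(ch for ch in text if ch.isalpha())


def baseline_predict(candidates, freqs):
    """Pick the candidate letter with the highest overall frequency."""
    return max(candidates, key=lambda ch: freqs.get(ch, 0))


def accuracy(predictions, truths):
    """Fraction of predictions that match the true letters."""
    if not truths:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)


# Toy example (hypothetical data): first line of Dante's Inferno.
text = "nel mezzo del cammin di nostra vita"
freqs = letter_frequencies(text)

# Each ambiguity is (candidate letters, true letter).
ambiguities = [(("a", "o"), "o"), (("l", "n"), "n"), (("e", "z"), "e")]

predictions = [baseline_predict(cands, freqs) for cands, _ in ambiguities]
score = accuracy(predictions, [truth for _, truth in ambiguities])
print(predictions, score)
```

A frequency baseline like this ignores context entirely, which is what makes it a useful lower bound when comparing against embedding-based predictions.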

Data

The texts used in this project are mostly found in the texts/ folder. It contains historical texts such as Dante's Inferno and Albert of Aix, as well as the Voynich transliterations available here. The transliterations are further processed with ivtt, and the processed texts are found in the data/ folder.

Resources

  • Benchmarks The benchmark used for the Latin synonym selection task can be found in the benchmarks/ folder.

  • Software The software used for filtering and processing the transliterations can be found in the software/ folder, taken from here.

  • Documentation Documentation on using IVTT and the IVTFF format can be found in the documentation/ folder.

Predictions

The resulting predictions of the model trained on Voynich can be found in the predictions/ folder.

Requirements