RDT: Russian Distributional Thesaurus (Русский Дистрибутивный Тезаурус)
This package lets you efficiently use the word graph of the Russian Distributional Thesaurus.
- Download the pre-packed resource:
wget http://panchenko.me/data/russe/rdt.pkl
- Install dependencies, e.g.:
pip install -r requirements.txt
- Load the distributional thesaurus (specify path to the downloaded 'rdt.pkl' file):
from dt import RDT, DistributionalThesaurus
rdt = RDT(dt_pkl_fpath="rdt.pkl")
Loading takes about 5 minutes, and the resulting structure occupies about 1.3 GB of RAM. This is nevertheless more efficient, in both time and memory, than parsing the CSV file into a dict. The implementation relies on a marisa trie for storing keys and on a NumPy array for storing similarity scores.
- Search for nearest neighbours:
for w, s in rdt.most_similar(u"граф"):
    print w, s
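The storage idea mentioned above (compact keys plus a flat array of scores) can be illustrated with a small standard-library sketch. This is not the package's code: `TinyThesaurus` and its fields are hypothetical names, a sorted word list with binary search stands in for the marisa trie, and `array.array` stands in for the NumPy array. It is written for Python 3 and only shows why such a layout is lighter than a dict of dicts: each word becomes an integer id, and all neighbour ids and scores live in a few typed arrays.

```python
from array import array
from bisect import bisect_left


class TinyThesaurus(object):
    """Toy word graph (hypothetical; not the RDT class from this package)."""

    def __init__(self, neighbours):
        # neighbours: dict mapping word -> list of (related_word, score) pairs
        vocab = set(neighbours)
        for pairs in neighbours.values():
            vocab.update(w for w, _ in pairs)
        # Sorted vocabulary: a word's position is its integer id.
        # (Stand-in for the word -> id mapping that a marisa trie would give.)
        self.words = sorted(vocab)
        self.offsets = array("l", [0])  # row i spans offsets[i]:offsets[i + 1]
        self.ids = array("l")           # neighbour ids, all rows concatenated
        self.scores = array("f")        # float32 scores, aligned with ids
        for word in self.words:
            # Store each row sorted by descending similarity.
            pairs = sorted(neighbours.get(word, []), key=lambda p: -p[1])
            for related, score in pairs:
                self.ids.append(self._id(related))
                self.scores.append(score)
            self.offsets.append(len(self.ids))

    def _id(self, word):
        # Binary search in the sorted vocabulary; KeyError for unknown words.
        i = bisect_left(self.words, word)
        if i == len(self.words) or self.words[i] != word:
            raise KeyError(word)
        return i

    def most_similar(self, word, top_n=10):
        row = self._id(word)
        start = self.offsets[row]
        end = min(self.offsets[row + 1], start + top_n)
        return [(self.words[self.ids[k]], self.scores[k]) for k in range(start, end)]


# Made-up toy data, not taken from the real thesaurus.
toy = TinyThesaurus({
    u"граф": [(u"вершина", 0.5), (u"дерево", 0.75)],
    u"дерево": [(u"граф", 0.75), (u"лес", 0.25)],
})
neighbours_of_graph = toy.most_similar(u"граф")
```

The toy scores are chosen to be exactly representable as 32-bit floats; with real data, values read back from the array will differ slightly from the inputs because of the reduced precision.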