This open-source software package implements two components of a pipeline for identifying information relevant to a specific topic in text documents:
- A machine learning backend, for training and testing a neural network model that assigns a relevance score to each token in a document;
- A front-end web interface, for viewing relevance tags produced by information extraction models, ranking documents by relevance, and analyzing qualitative model outcomes.
This system was described in the following paper:
- D Newman-Griffis and E Fosler-Lussier, "HARE: a Flexible Highlighting Annotator for Ranking and Exploration". In Proceedings of EMNLP 2019.
The included `makefile` provides pre-written commands for common tasks in the HARE backend.
The `requirements.txt` file lists all required Python packages, installable with pip; run `pip install -r requirements.txt` to install them all.
Source code for two additional packages (BERT and bert_to_hdf5) is required for generating BERT features; both are downloaded automatically by the `utils/get_bert_hdf5_features.sh` script.
The processing pipeline in this package includes several primary elements, described here with reference to key code files. (For technical reference on script usage, see the `makefile`.)
- Text preprocessing: tokenization and formatting for analysis.
  - See `data/extract_data_files.py` (an illustrative tokenization sketch follows this list)
- Contextualized feature extraction: pre-calculation of embedding features, using contextualized language models.
  - If using static embeddings, features are extracted dynamically at runtime.
  - See `utils/get_bert_hdf5_features.sh` and the `makefile` (see also the HDF5 inspection sketch below)
- Cross-validation split generation: pre-generation of cross-validation splits for a specified dataset, for consistency across experiments.
  - Splitting is done at the document level (documents are assumed to be independent).
  - See `experiments.document_splits` (see the splitting sketch below)
- Token relevance model training: implementation and training of the token-level relevance estimator.
  - Implemented in TensorFlow.
  - For the model, see `model/DNNTokenClassifier.py`
  - For training, see `experiments/train.py` (see the training sketch below)
- Prediction with token relevance model: application of a pre-trained token-level relevance estimator to new data.
  - See `experiments/run_pretrained.py` (see the prediction and ranking sketch below)
- Document-level results visualization: web-based viewing of token-level relevance predictions.
  - See the visualization README for more details.
- Document ranking: web-based interface for ranking documents by relevance scores.
  - See the visualization README for more details.
- Qualitative outcomes analysis: web-based interface for analyzing qualitative trends in model outputs.
  - See the visualization README for more details.
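As a rough illustration of the preprocessing step (the repository's own logic lives in `data/extract_data_files.py`; the snippet below is only a minimal sketch, assuming spaCy and its small English model are installed), document text can be tokenized as follows:

```python
# Minimal tokenization sketch (NOT the repository's preprocessing code).
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenize_document(text):
    """Return a flat list of token strings for one document."""
    return [tok.text for tok in nlp(text)]

if __name__ == "__main__":
    # Invented example sentence, not taken from the demo datasets.
    sample = "Patient ambulates 50 feet with a rolling walker."
    print(tokenize_document(sample))
```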
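The contextualized features produced by `utils/get_bert_hdf5_features.sh` are written to HDF5 files. Their exact internal layout is determined by the bert_to_hdf5 package and is not documented here; the snippet below is only a generic sketch for inspecting such a file with h5py, and the file name is a placeholder:

```python
# Generic sketch for inspecting an HDF5 feature file with h5py.
# "demo_features.hdf5" is a placeholder name; the group/dataset layout
# depends on the bert_to_hdf5 package.
import h5py

with h5py.File("demo_features.hdf5", "r") as f:
    def show(name, obj):
        # Print the path, shape, and dtype of every dataset in the file.
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```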
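The idea behind document-level cross-validation splitting can be sketched with scikit-learn (this is not the `experiments.document_splits` implementation, just an illustration of the principle that whole documents, rather than individual tokens, are assigned to folds):

```python
# Illustrative document-level cross-validation splitting (NOT the actual
# experiments.document_splits implementation).
from sklearn.model_selection import KFold

def document_splits(doc_ids, n_folds=5, seed=13):
    """Yield (train_ids, test_ids) pairs, splitting at the document level."""
    doc_ids = list(doc_ids)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_ix, test_ix in kf.split(doc_ids):
        yield ([doc_ids[i] for i in train_ix],
               [doc_ids[i] for i in test_ix])

if __name__ == "__main__":
    for train, test in document_splits(["doc_%d" % i for i in range(10)]):
        print("train:", train, "| test:", test)
```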
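The token relevance estimator is implemented in `model/DNNTokenClassifier.py` and trained via `experiments/train.py`. As a mental model only (the architecture, loss, and hyperparameters below are assumptions, not those of HARE), a per-token relevance scorer can be sketched in TensorFlow as a small feed-forward network over precomputed token embeddings:

```python
# Hypothetical per-token relevance scorer (NOT the DNNTokenClassifier
# architecture or hyperparameters; dimensions and settings are assumptions).
import numpy as np
import tensorflow as tf

EMBED_DIM = 300  # assumed dimensionality of precomputed token embeddings

def build_token_scorer():
    """Feed-forward net mapping one token embedding to a relevance score in [0, 1]."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(EMBED_DIM,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

if __name__ == "__main__":
    # Toy data: random "embeddings" with random binary relevance labels.
    X = np.random.randn(1000, EMBED_DIM).astype("float32")
    y = np.random.randint(0, 2, size=(1000, 1)).astype("float32")

    model = build_token_scorer()
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=3, batch_size=32)
```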
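Similarly, applying a trained scorer to new documents and ranking them (the roles of `experiments/run_pretrained.py` and the ranking interface) can be approximated as below; using the mean token relevance as the document score is one simple aggregation choice for illustration, not necessarily the one used by HARE:

```python
# Illustrative prediction and document ranking with a token scorer
# (see experiments/run_pretrained.py for the real prediction pipeline).
import numpy as np
import tensorflow as tf

def rank_documents(model, doc_embeddings):
    """Score every token, aggregate per document, and rank documents.

    doc_embeddings: dict mapping document ID -> (n_tokens, embed_dim) array.
    Returns a list of (doc_id, score) pairs, highest-scoring first.
    """
    doc_scores = {}
    for doc_id, embeddings in doc_embeddings.items():
        token_scores = model.predict(embeddings, verbose=0).ravel()
        doc_scores[doc_id] = float(token_scores.mean())  # simple aggregation choice
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Stand-in for a trained model: an untrained toy scorer with random weights.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(300,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    docs = {"doc_a": np.random.randn(40, 300).astype("float32"),
            "doc_b": np.random.randn(25, 300).astype("float32")}
    for doc_id, score in rank_documents(model, docs):
        print(f"{doc_id}\t{score:.3f}")
```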
This package includes two tiny datasets for code demonstration purposes:
- `demo_data/demo_labeled_dataset`: 5 short, synthetic clinical documents with mobility-related information. Text files are located in the `txt` subdirectory, and the `csv` subdirectory contains corresponding CSV files with standoff annotations.
- `demo_data/demo_unlabeled_dataset`: 5 more short, synthetic clinical documents, only one of which contains mobility-related information. Text files are provided without corresponding annotations.
The included `run_demo_experiments.sh` script (written for Bash execution on a Unix machine) uses the provided make targets to run a complete end-to-end experiment, with the following steps:
- Tokenize both labeled and unlabeled datasets, using SpaCy and WordPiece.
- Generate contextualized embedding features for each dataset, using ELMo and BERT.
- Train HARE models on the labeled dataset, using static embeddings, ELMo, and BERT.
- Use the trained models to get predictions on the unlabeled dataset.
- Prepare all output predictions for viewing in the front-end interface.
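To reproduce the whole demo, assuming a Bash shell and that the Python dependencies above have been installed, run `bash run_demo_experiments.sh` from the repository root.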
If you use this software in your own work, please cite the following paper:
@inproceedings{newman-griffis-fosler-lussier-2019-hare,
title = "{HARE}: a Flexible Highlighting Annotator for Ranking and Exploration",
author = "Newman-Griffis, Denis and Fosler-Lussier, Eric",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-3015",
doi = "10.18653/v1/D19-3015",
pages = "85--90",
}
All source code, documentation, and data contained in this package are distributed under the terms in the LICENSE file (modified BSD).