This open-source software package implements two components of a pipeline for identifying information relevant to a specific topic in text documents:
- A machine learning backend, for training and testing a neural network model that assigns a relevance score to each token in a document;
- A front-end web interface, for viewing relevance tags produced by information extraction models, ranking documents by relevance, and analyzing qualitative model outcomes.
This system was described in the following paper:
- D Newman-Griffis and E Fosler-Lussier, "HARE: a Flexible Highlighting Annotator for Ranking and Exploration". In Proceedings of EMNLP 2019.
The included `makefile` provides pre-written commands for common tasks in the HARE backend.
The `requirements.txt` file lists all required Python packages, installable with pip; run `pip install -r requirements.txt` to install them all.
Source code for two additional packages (BERT and bert_to_hdf5) is required for generating BERT features; both are downloaded automatically by the `utils/get_bert_hdf5_features.sh` script.
The processing pipeline in this package includes several primary elements, described here with reference to key code files. (For technical reference on script usage, see the `makefile`.)
- Text preprocessing: tokenization and formatting for analysis.
  - See `data/extract_data_files.py` (an illustrative tokenization sketch follows this list)
- Contextualized feature extraction: pre-calculation of embedding features, using contextualized language models.
  - If using static embeddings, features are extracted dynamically at runtime.
  - See `utils/get_bert_hdf5_features.sh` and the `makefile` (see also the HDF5 inspection sketch below)
- Cross-validation split generation: pre-generation of cross-validation splits for a specified dataset, for consistency across experiments.
  - Splitting is done at the document level (documents are assumed to be independent).
  - See `experiments.document_splits` (see the splitting sketch below)
- Token relevance model training: implementation and training of the token-level relevance estimator.
  - Implemented in TensorFlow.
  - For the model, see `model/DNNTokenClassifier.py`
  - For training, see `experiments/train.py` (see the training sketch below)
- Prediction with token relevance model: application of a pre-trained token-level relevance estimator to new data.
  - See `experiments/run_pretrained.py` (see the prediction and ranking sketch below)
- Document-level results visualization: web-based viewing of token-level relevance predictions.
  - See the visualization README for more details.
- Document ranking: web-based interface for ranking documents by relevance scores.
  - See the visualization README for more details.
- Qualitative outcomes analysis: web-based interface for analyzing qualitative trends in model outputs.
  - See the visualization README for more details.
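As a rough illustration of the preprocessing step (the repository's own logic lives in `data/extract_data_files.py`; the snippet below is only a minimal sketch, assuming spaCy and its small English model are installed), document text can be tokenized as follows:

```python
# Minimal tokenization sketch (NOT the repository's preprocessing code).
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenize_document(text):
    """Return a flat list of token strings for one document."""
    return [tok.text for tok in nlp(text)]

if __name__ == "__main__":
    # Invented example sentence, not taken from the demo datasets.
    sample = "Patient ambulates 50 feet with a rolling walker."
    print(tokenize_document(sample))
```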
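The contextualized features produced by `utils/get_bert_hdf5_features.sh` are written to HDF5 files. Their exact internal layout is determined by the bert_to_hdf5 package and is not documented here; the snippet below is only a generic sketch for inspecting such a file with h5py, and the file name is a placeholder:

```python
# Generic sketch for inspecting an HDF5 feature file with h5py.
# "demo_features.hdf5" is a placeholder name; the group/dataset layout
# depends on the bert_to_hdf5 package.
import h5py

with h5py.File("demo_features.hdf5", "r") as f:
    def show(name, obj):
        # Print the path, shape, and dtype of every dataset in the file.
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```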
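The idea behind document-level cross-validation splitting can be sketched with scikit-learn (this is not the `experiments.document_splits` implementation, just an illustration of the principle that whole documents, rather than individual tokens, are assigned to folds):

```python
# Illustrative document-level cross-validation splitting (NOT the actual
# experiments.document_splits implementation).
from sklearn.model_selection import KFold

def document_splits(doc_ids, n_folds=5, seed=13):
    """Yield (train_ids, test_ids) pairs, splitting at the document level."""
    doc_ids = list(doc_ids)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_ix, test_ix in kf.split(doc_ids):
        yield ([doc_ids[i] for i in train_ix],
               [doc_ids[i] for i in test_ix])

if __name__ == "__main__":
    for train, test in document_splits(["doc_%d" % i for i in range(10)]):
        print("train:", train, "| test:", test)
```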
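The token relevance estimator is implemented in `model/DNNTokenClassifier.py` and trained via `experiments/train.py`. As a mental model only (the architecture, loss, and hyperparameters below are assumptions, not those of HARE), a per-token relevance scorer can be sketched in TensorFlow as a small feed-forward network over precomputed token embeddings:

```python
# Hypothetical per-token relevance scorer (NOT the DNNTokenClassifier
# architecture or hyperparameters; dimensions and settings are assumptions).
import numpy as np
import tensorflow as tf

EMBED_DIM = 300  # assumed dimensionality of precomputed token embeddings

def build_token_scorer():
    """Feed-forward net mapping one token embedding to a relevance score in [0, 1]."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(EMBED_DIM,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

if __name__ == "__main__":
    # Toy data: random "embeddings" with random binary relevance labels.
    X = np.random.randn(1000, EMBED_DIM).astype("float32")
    y = np.random.randint(0, 2, size=(1000, 1)).astype("float32")

    model = build_token_scorer()
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=3, batch_size=32)
```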
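Similarly, applying a trained scorer to new documents and ranking them (the roles of `experiments/run_pretrained.py` and the ranking interface) can be approximated as below; using the mean token relevance as the document score is one simple aggregation choice for illustration, not necessarily the one used by HARE:

```python
# Illustrative prediction and document ranking with a token scorer
# (see experiments/run_pretrained.py for the real prediction pipeline).
import numpy as np
import tensorflow as tf

def rank_documents(model, doc_embeddings):
    """Score every token, aggregate per document, and rank documents.

    doc_embeddings: dict mapping document ID -> (n_tokens, embed_dim) array.
    Returns a list of (doc_id, score) pairs, highest-scoring first.
    """
    doc_scores = {}
    for doc_id, embeddings in doc_embeddings.items():
        token_scores = model.predict(embeddings, verbose=0).ravel()
        doc_scores[doc_id] = float(token_scores.mean())  # simple aggregation choice
    return sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    # Stand-in for a trained model: an untrained toy scorer with random weights.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(300,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    docs = {"doc_a": np.random.randn(40, 300).astype("float32"),
            "doc_b": np.random.randn(25, 300).astype("float32")}
    for doc_id, score in rank_documents(model, docs):
        print(f"{doc_id}\t{score:.3f}")
```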
This package includes two tiny datasets for code demonstration purposes:
- `demo_data/demo_labeled_dataset`: 5 short, synthetic clinical documents with mobility-related information. Text files are located in the `txt` subdirectory, and the `csv` subdirectory contains corresponding CSV files with standoff annotations.
- `demo_data/demo_unlabeled_dataset`: 5 more short, synthetic clinical documents, only one of which contains mobility-related information. Text files are provided without corresponding annotations.
The included `run_demo_experiments.sh` script (written for Bash execution on a Unix machine) uses the provided make targets to run a complete end-to-end experiment, with the following steps:
- Tokenize both labeled and unlabeled datasets, using SpaCy and WordPiece.
- Generate contextualized embedding features for each dataset, using ELMo and BERT.
- Train HARE models on the labeled dataset, using static embeddings, ELMo, and BERT.
- Use the trained models to get predictions on the unlabeled dataset.
- Prepare all output predictions for viewing in the front-end interface.
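To reproduce the whole demo, assuming a Bash shell and that the Python dependencies above have been installed, run `bash run_demo_experiments.sh` from the repository root.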
If you use this software in your own work, please cite the following paper:
@inproceedings{newman-griffis-fosler-lussier-2019-hare,
title = "{HARE}: a Flexible Highlighting Annotator for Ranking and Exploration",
author = "Newman-Griffis, Denis and Fosler-Lussier, Eric",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-3015",
doi = "10.18653/v1/D19-3015",
pages = "85--90",
}
All source code, documentation, and data contained in this package are distributed under the terms in the LICENSE file (modified BSD).