This open-source package implements a series of components required for comprehensive quality assurance on annotations created using GATE.
The requirements.txt file lists all required Python 3 packages, which can be installed with pip3. Run
pip3 install -r requirements.txt
to install all packages.
- Document Validation
- Annotation Validation
- Statistics
- Evaluation
- Discrepancy Analysis
- Generate Reports
- Corpus Viewer
- Annotation Schema
- Help
The code relies entirely on information read from a config file. The config file for QA4IE contains the following information:
- Annotation Directory: the absolute path to the annotations in XML format. The annotations must be stored in a single directory that contains one sub directory per annotator, each named in a way that identifies that annotator. The XML files themselves must be named consistently, so that the paths to the same document differ only in the file's parent directory, for example annotations/anno1/file_1.xml and annotations/anno2/file_1.xml. If the files are named differently, the code will treat them as different files. Files that one annotator has and the others do not are ignored by the tool.
- Output Directory: the absolute path to a results/report directory.
- Task: this should be set to sequence_labeling. In the future, there will be an additional option for classification.
- Encoding: the encoding of the XML files.
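The directory layout described above can be illustrated with a minimal sketch. It assumes that files are matched across annotators by file name only and that a file is kept only when every annotator has it. The function name shared_annotation_files is hypothetical and is not part of QA4IE.

```python
from pathlib import Path
import tempfile


def shared_annotation_files(annotation_dir):
    """Return {annotator: {file_name: path}} limited to files every annotator has.

    Hypothetical helper: an assumption about how matching could work,
    not the tool's actual code.
    """
    root = Path(annotation_dir)
    # Each sub directory is one annotator; its name identifies the annotator.
    per_annotator = {
        sub.name: {f.name: f for f in sub.glob("*.xml")}
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }
    if not per_annotator:
        return {}
    # Files are matched by name only, so keep names present for every annotator.
    common = set.intersection(*(set(files) for files in per_annotator.values()))
    return {
        annotator: {name: files[name] for name in sorted(common)}
        for annotator, files in per_annotator.items()
    }


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("anno1/file_1.xml", "anno1/file_2.xml", "anno2/file_1.xml"):
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("<doc/>", encoding="utf-8")
    matched = shared_annotation_files(tmp)
    matched_names = {annotator: list(files) for annotator, files in matched.items()}
    print(matched_names)
```

In this demonstration file_2.xml exists only for anno1, so it is dropped and only file_1.xml remains for both annotators.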
The config file can define any number of annotation types. The following is an example of how to create these types in the config file.
[type_main]
overlaps =
sub_entities= type_a|type_b|type_c
features= att_1:=:val_1|val_2|val_3||att_2:=:val_1|val_2
[type_a]
overlaps = type_b|type_c
features= att_1:=:val_1|val_2|val_3||att_2:=:val_1|val_2
[type_b]
overlaps = type_a|type_c
features= att_1:=:val_1|val_2|val_3||att_2:=:val_1|val_2
[type_c]
overlaps = type_a|type_b
features= att_1:=:val_1|val_2|val_3||att_2:=:val_1|val_2
Each annotation type accepts up to three options. overlaps should list the other annotation types with which an overlap is allowed. This option is only necessary for main entities and not sub entities. The code determines hierarchical overlaps and allows them based on the information in sub_entities.
sub_entities should only appear under an annotation type that is the main (parent) entity of other sub entities, and those sub entities should also be defined in the config file. The order of the entities in the config file does not affect the code.
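As a rough illustration of the overlaps and sub_entities options, the sketch below reads a trimmed copy of the example with Python's standard configparser module and checks whether two types may overlap. The functions load_types and overlap_allowed are hypothetical. The rule they apply (an explicit listing in either direction, or a parent/sub-entity relation) is an assumption about the behaviour described above, not the tool's actual logic. The sketch also treats every section as an annotation type, which a real config with path and task settings would not do.

```python
import configparser

# Trimmed copy of the example config (features omitted for brevity).
EXAMPLE = """
[type_main]
overlaps =
sub_entities = type_a|type_b|type_c
[type_a]
overlaps = type_b|type_c
[type_b]
overlaps = type_a|type_c
[type_c]
overlaps = type_a|type_b
"""


def split_names(raw):
    """Split a '|'-separated option into a set of names, ignoring blanks."""
    return {name.strip() for name in raw.split("|") if name.strip()}


def load_types(text):
    """Read every section as an annotation type (an assumption of this sketch)."""
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(text)
    return {
        section: {
            "overlaps": split_names(parser.get(section, "overlaps", fallback="")),
            "sub_entities": split_names(parser.get(section, "sub_entities", fallback="")),
        }
        for section in parser.sections()
    }


def overlap_allowed(types, first, second):
    """Assumed rule: explicit overlaps in either direction, or a parent/sub pair."""
    explicit = second in types[first]["overlaps"] or first in types[second]["overlaps"]
    hierarchical = (
        second in types[first]["sub_entities"] or first in types[second]["sub_entities"]
    )
    return explicit or hierarchical


types = load_types(EXAMPLE)
print(overlap_allowed(types, "type_a", "type_b"))     # explicit overlap
print(overlap_allowed(types, "type_main", "type_a"))  # parent and sub entity
```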
To add features to a specific entity, you have to use several separators. (:=:) separates the attribute name from its possible values. (|) separates the values of a specific attribute. (||) separates different attributes in the features dictionary.
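The separator scheme can be sketched as a small parser. This is only an illustration of the format: parse_features is a hypothetical name, and the tool's own parsing may differ, for instance in how it handles whitespace or malformed entries.

```python
def parse_features(raw):
    """Turn 'att_1:=:v1|v2||att_2:=:v1' into {'att_1': ['v1', 'v2'], 'att_2': ['v1']}."""
    features = {}
    # '||' separates attributes; split on it first so single '|' splits stay unambiguous.
    for chunk in raw.split("||"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # ':=:' separates the attribute name from its allowed values.
        name, sep, values = chunk.partition(":=:")
        if not sep:
            raise ValueError(f"missing ':=:' in feature definition: {chunk!r}")
        # '|' separates the individual values of one attribute.
        features[name.strip()] = [v.strip() for v in values.split("|") if v.strip()]
    return features


parsed = parse_features("att_1:=:val_1|val_2|val_3||att_2:=:val_1|val_2")
print(parsed)
```

Applied to the features line from the example config, this yields one entry per attribute, each holding its list of allowed values.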
This package includes one small dataset for code demonstration purposes:
- data/annotations: synthetic notes annotated using the mobility schema
To use the tool, first update the information inside the config file, including the absolute paths for your input and output directories. Once that's done, run
python app.py <path_to_config_file>
where <path_to_config_file> is a placeholder for the absolute path to the config file.
qa4ie_video_demo.mov
LREC_2022_QA4IE_PRESENTATION.mov
'''
GATE
  The Annotation Diff Tool: https://gate.ac.uk/sale/tao/splitch10.html#x14-26300010.2
  Corpus Quality Assurance: https://gate.ac.uk/sale/tao/splitch10.html#x14-26700010.3
  Corpus Benchmark Tool: https://gate.ac.uk/sale/tao/splitch10.html#x14-27500010.4
  A Plugin Computing Inter-Annotator Agreement (IAA): https://gate.ac.uk/sale/tao/splitch10.html#x14-28000010.5
  Quality Assurance Summariser for Teamware: https://gate.ac.uk/sale/tao/splitch10.html#x14-28600010.7
ERAS: https://github.com/jonatasgrosman/eras
Brat
LightTag
MAT: http://mat-annotation.sourceforge.net
Tagtog
TextAE: https://github.com/pubannotation/textae
Watson KS: https://www.ibm.com/cloud/watson-knowledge-studio
WebAnno: https://webanno.github.io/webanno/
Prodigy
SLATE: https://github.com/jkkummerfeld/slate
Knowtator: http://knowtator.sourceforge.net
INCEpTION: https://inception-project.github.io
QA4IE: https://github.com/CC-RMD-EpiBio/QA4IE
'''
If you use this software in your own work, please cite the following paper:
@inproceedings{,
title = "",
author = "",
booktitle = "",
month = ,
year = "",
address = "",
publisher = "",
url = "",
doi = "",
pages = "",
}
All source code, documentation, and data contained in this package are distributed under the terms in the LICENSE file (modified BSD).