Add initial version of README
sudodoki committed Mar 16, 2019
1 parent 3db7062 commit a835807
Showing 19 changed files with 2,107 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
.ipynb_checkpoints/
venv
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2018 sudodoki

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
130 changes: 130 additions & 0 deletions README.md
@@ -0,0 +1,130 @@
# NLP annotate: Howtos

No doubt, there are some great tools out there for annotating NLP tasks and datasets. Sometimes they are somewhat complex, or cost money (and are usually worth it). See [tools](TOOLS.md) for a list if you haven't heard of any annotation tools for NLP. [chbrown/awesome-annotation](https://github.com/chbrown/awesome-annotation) might be of interest to you as well.

This document gives some guidance on how to complete common NLP annotation tasks using simpler (more familiar) tooling. Based on my knowledge of the industry, companies with a real need for annotation usually either roll their own solution or purchase an off-the-shelf one that is powerful enough to cover their use-cases / provides paid support for adding new features. So consider this better suited for setups where you are only bootstrapping a small dataset for experiments.

Below is a list of tasks and descriptions of the process for gathering annotations for each. Use cases not covered here might be supported by the other listed [tools](TOOLS.md).

There is extensive reading on the subject in the form of books, for example:
* [Introduction to Linguistic Annotation and Text Analytics](https://www.amazon.com/Introduction-Linguistic-Annotation-Analytics-Technologies/dp/1598297384) by Graham Wilcock
* [Natural Language Annotation for Machine Learning](http://shop.oreilly.com/product/0636920020578.do) by James Pustejovsky and Amber Stubbs


Contents
=================

* [Binary classification for documents / sentences](#binary-classification-for-documents--sentences)
* [Using folders (Finder)](#using-folders-finder)
* [Using google spreadsheets and data validation](#using-google-spreadsheets-and-data-validation)
* [Using jupyter notebooks](#using-jupyter-notebooks)
* [Multi-class classification for documents / sentences](#multi-class-classification-for-documents--sentences)
* [Using folders (Finder)](#using-folders-finder-1)
* [Using google spreadsheets and data validation](#using-google-spreadsheets-and-data-validation-1)
* [Using jupyter notebooks](#using-jupyter-notebooks-1)
* [Hierarchical Multi-class classification for documents / sentences](#hierarchical-multi-class-classification-for-documents--sentences)
* [Using google spreadsheets and data validation](#using-google-spreadsheets-and-data-validation-2)
* [Using jupyter](#using-jupyter)
* [NER (Span annotations)](#ner-span-annotations)
* [Jupyter](#jupyter)
* [m8nware/ann](#m8nwareann)
* [Do you have more use-cases/solutions?](#do-you-have-more-use-casessolutions)

## Binary classification for documents / sentences

There are multiple ways to assign a single sentence / doc a label that can have at most 2 values (True/False).

### Using folders (Finder)

Use a preview tool for folders that shows thumbnails of file contents, or a dedicated preview area (in Mac's Finder that would be under View > as Cover Flow):
https://www.dropbox.com/s/55fx5wk6b5p3w1y/Screenshot%202018-10-27%2016.49.20.png?dl=0
https://www.dropbox.com/s/0p11jwaljktedjx/Screenshot%202018-10-27%2016.50.52.png?dl=0
and just manually sort the files into two folders. Short sentences in .txt files, or documents that can be identified by their first sentence, work best.
This method also works for **multi-class** and **hierarchical multi-class** classification.
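
Once the files are sorted, the folder structure itself is the annotation. A minimal sketch of collecting it back into rows (folder names `positive`/`negative` are an assumption, use whatever names you chose while sorting):

```python
import tempfile
from pathlib import Path

def collect_labels(root, classes=("positive", "negative")):
    """Walk one sub-folder per class and return (file, text, label) rows."""
    rows = []
    for label in classes:
        for path in sorted(Path(root, label).glob("*.txt")):
            rows.append({"file": path.name,
                         "text": path.read_text(encoding="utf-8").strip(),
                         "label": label})
    return rows

# Tiny self-contained demo: fake a sorted folder tree, then collect it.
root = Path(tempfile.mkdtemp())
(root / "positive").mkdir()
(root / "negative").mkdir()
(root / "positive" / "a.txt").write_text("great product")
(root / "negative" / "b.txt").write_text("terrible service")

rows = collect_labels(root)
```

From here the rows can be written to a CSV or fed straight into a dataframe.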

### Using google spreadsheets and data validation

*better for sentences*

Create a Google [spreadsheet](http://spreadsheet.new) – you can upload CSVs as well. Add a column to hold the target class for each item. Click Data -> Data Validation, select the cell range for the target column (i.e. `Sheet1!C2:C`) and select Criteria - 'Checkbox'. You can now use either the mouse or arrows + space to toggle the target label. You can export the data as a CSV (File -> Download as -> .csv). You'll then have to map `TRUE`/`FALSE` to your target binary classes.
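
That last mapping step can be done with a few lines of stdlib Python. A sketch, assuming a hypothetical `is_relevant` checkbox column (Sheets exports checkboxes as the strings `TRUE`/`FALSE`):

```python
import csv
import io

# Stand-in for the downloaded .csv file; in practice use open("export.csv").
exported = io.StringIO(
    "text,is_relevant\n"
    "the deal was approved,TRUE\n"
    "weather was nice,FALSE\n"
)

# Convert the TRUE/FALSE strings into real booleans.
rows = [
    {**row, "is_relevant": row["is_relevant"] == "TRUE"}
    for row in csv.DictReader(exported)
]
```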

### Using [jupyter](https://jupyter.org/) notebooks

See `samples/Binary_Classification_Annotation.ipynb`. Be aware this was created with Python 3 (it would probably run with minor modifications on Python 2).
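
The general idea can be sketched with plain ipywidgets (the bundled notebook may be built differently, this is just the shape of such a loop): two buttons that record a label and advance to the next sample.

```python
import ipywidgets as widgets
from IPython.display import display

samples = ["sentence one", "sentence two", "sentence three"]
labels = {}          # sample text -> True/False
state = {"i": 0}     # index of the sample currently shown

out = widgets.Label(value=samples[0])

def make_handler(value):
    # Returns a click handler that stores `value` for the current sample.
    def handler(_button):
        i = state["i"]
        if i < len(samples):
            labels[samples[i]] = value
            state["i"] = i + 1
            out.value = samples[state["i"]] if state["i"] < len(samples) else "done"
    return handler

yes, no = widgets.Button(description="True"), widgets.Button(description="False")
yes.on_click(make_handler(True))
no.on_click(make_handler(False))
display(widgets.HBox([yes, no]), out)
```

After clicking through, `labels` holds the finished annotations and can be dumped to CSV.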


## Multi-class classification for documents / sentences

There are two cases for multi-class labelling: assigning a single label to each sample, or assigning multiple labels to a single sample. The notes below cover the single-label case (except where noted in the jupyter notebook).

### Using folders (Finder)

Same as [Binary classification](#using-folders-finder), but using multiple folders.

### Using google spreadsheets and data validation

Create a Google [spreadsheet](http://spreadsheet.new) – you can upload CSVs as well. Add a column to hold the target class for each item. Click Data -> Data Validation, select the cell range for the target column (i.e. `Sheet1!C2:C`), and for Criteria select either 'List of items' or 'List from a range' (you can also use a reference to [named ranges](https://support.google.com/docs/answer/63175?co=GENIE.Platform%3DDesktop&hl=en), which is useful if you reuse them in multiple places). Be sure to have the 'show a dropdown' option checked, as it enables typeahead, which makes it easy to quickly filter the list of classes by typing the first few letters. You can export the data as a CSV (File -> Download as -> .csv).

> Note: if you need multiple labels per row, consider adding additional columns (i.e. 'class 1', 'class 2') and merging them in a post-processing step, or use Google Apps Script, though that might be more cumbersome.

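
A sketch of that post-processing step, assuming the 'class 1' / 'class 2' column names from the note:

```python
import csv
import io

# Stand-in for the downloaded .csv file; in practice use open("export.csv").
exported = io.StringIO(
    "text,class 1,class 2\n"
    "order arrived late,shipping,complaint\n"
    "love the new design,praise,\n"
)

# Collapse the per-column labels into one list, skipping empty cells.
merged = [
    {"text": row["text"],
     "labels": [row[c] for c in ("class 1", "class 2") if row[c]]}
    for row in csv.DictReader(exported)
]
```
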
### Using [jupyter](https://jupyter.org/) notebooks

See `samples/jupyter/Multiclass_Classification_Annotation.ipynb`. Be aware this was created with Python 3 (it would probably run with minor modifications on Python 2).


## Hierarchical Multi-class classification for documents / sentences

This is a task where labels are organized hierarchically: after assigning a first label out of a predefined set, we proceed to picking the next one out of the corresponding child set. For example, our labels might look like the following:
- work-days:
a) Monday
b) Tuesday
c) Wednesday
- weekend:
a) Saturday
b) Sunday
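
The hierarchy above can also be kept as a plain mapping from top-level class to allowed child labels, which is handy for validating annotations outside of spreadsheets (a minimal sketch, not part of the samples):

```python
# Top-level class -> allowed second-level labels, mirroring the example list.
HIERARCHY = {
    "work-days": ["Monday", "Tuesday", "Wednesday"],
    "weekend": ["Saturday", "Sunday"],
}

def is_valid(top, sub):
    """True when `sub` is an allowed child label of `top`."""
    return sub in HIERARCHY.get(top, [])
```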

### Using google spreadsheets and data validation

To accomplish this in Google spreadsheets, you'll need to use [google apps script](https://www.google.com/script/start/). I adapted code from a [Stack Overflow answer](https://stackoverflow.com/questions/34191248/drop-down-dependent-menus-in-google-spreadsheets) for my needs. Below is a step-by-step guide on how to apply it to a hierarchical labelling task.

> In this section I use blockquotes to describe the steps I followed for a somewhat artificial task: labelling dates with their type of day.

You'll need to:
- create a [new spreadsheet](http://spreadsheet.new)
> Here's a [sample spreadsheet](https://docs.google.com/spreadsheets/d/1caBDrV46-p8VWpQVv0m7shMAitY_k9gj5XUoIguYRqA/edit?usp=sharing).
- set up references for the classes. There is also a way to make this work based on indexes, but I would suggest going with [named ranges](https://support.google.com/docs/answer/63175?co=GENIE.Platform%3DDesktop&hl=en) so that items of the higher-level range translate easily into the dependent class's range name.

> I created two named ranges: one named `work_days` (replacing `-` with `_`, to conform with range-name restrictions, from the value in the 'Type of days' column on the `Labels` sheet) and one named `weekend`.
- for the higher-level class, add a data validation that accepts only top-level class values

> On `Data` sheet I added a validation for `Type of date` (Cell range: `Data!B2:B`, List from a range: `Labels!A3:A5`)
- add a script that creates a dynamic validation based on the top-level class value. Go to `Tools` > `Script Editor` and insert the [following code](https://gist.github.com/sudodoki/70c7765e460724ec5d517d13917babef). A high-level overview of what it does: the `to_range_name` function transforms a class value into the dependent class labels' range name (`to_range_name('work-days')` would yield `work_days`); `depDrop_` takes a cell and a range of reference values and adds a dynamic validation to that cell; `onEdit` is a global callback that ties it all together: it takes the current cell, verifies its value is not 'N/A' (I used this value when no class was available and no dependent labels were needed; an empty value would work as well), and, if the value is meaningful, looks up the reference by its transformed name and adds a validation to the cell one column to the right of the edited one.
You'll need to name this project and save it. Then you'll need to run it (by pressing the ▶️ button in the top panel). **Note: if you set up your labelling batches by copying a spreadsheet over and modifying values, you'll need to run this in every spreadsheet**.

- now, whenever you assign or edit a value in the top-level label column, within 2-4 seconds a dropdown with the corresponding values will appear in the following column.

It might be overkill for the task at hand, but with lots of top-level classes and subitems, managing them manually without programmatic restriction/validation can turn out cumbersome. The Google dropdown can be operated from the keyboard and provides typeahead-style input, which is useful if the annotator needs to choose from a long list of possible values.

### Using jupyter

See `samples/jupyter/Hierarchical_multiclass.ipynb`. Be aware this was created with Python 3 (it would probably run with minor modifications on Python 2).


## NER (Span annotations)

These are useful for NER and involve selecting a contiguous span of text in a document and marking it with a corresponding class.

### Jupyter

See `samples/jupyter/Span_annotation.ipynb`. Be aware this was created with Python 3 (it would probably run with minor modifications on Python 2). There's also the possibility of an even more responsive/dynamic solution using JS in a widget, handling mouse events and tracking the current selection.
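
As a rough idea of the display side (not the notebook's actual code): span annotations with character offsets, like those in the `.ann` files under `samples/ann`, can be rendered as highlighted HTML for a notebook.

```python
import html

def render_spans(text, spans, colors={"LOC": "aqua", "ORG": "orange"}):
    """Wrap each (start, end, class) span of `text` in a colored <mark> tag."""
    out, last = [], 0
    for start, end, cls in sorted(spans):
        out.append(html.escape(text[last:start]))
        out.append('<mark style="background:%s">%s</mark>'
                   % (colors.get(cls, "yellow"), html.escape(text[start:end])))
        last = end
    out.append(html.escape(text[last:]))
    return "".join(out)

snippet = render_spans("Disney will acquire 20th Century Fox",
                       [(0, 6, "ORG"), (20, 36, "ORG")])
```

The resulting string can be shown with `IPython.display.HTML(snippet)` inside a notebook.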

### m8nware/ann

See the [samples/ann](samples/ann) sample project, which uses Docker to run a modification of [m8nware/ann](https://github.com/m8nware/ann).


## Do you have more use-cases/solutions?

If you know a solution to a peculiar NLP annotation task, open an issue with a description. Even better, write it down and create a [pull request](https://help.github.com/en/articles/creating-a-pull-request)! I'm almost sure we haven't described every useful tool out there providing solutions to NLP tasks, so please reference one if you know any that aren't in the [tools](TOOLS.md) list. If this was useful to you, let me know as well, either through an issue or in the [gitter channel](https://gitter.im/sudodoki/nlp-how-to-annotate).
21 changes: 21 additions & 0 deletions TOOLS.md
@@ -0,0 +1,21 @@
Not all the tools are of the same quality or price.

https://gate.ac.uk - NER, Text Classification
http://brat.nlplab.org - NER, Relations, Normalization, […etc](http://brat.nlplab.org/examples.html#annotation-examples)
https://prodi.gy - NER, Text Classification
https://lighttag.io - NER, Relations, Classification
https://github.com/m8nware/ann - NER, persists to disk, common lisp, basic auth, built-in diffs
https://github.com/jiesutd/YEDDA - NER annotation, admin interface / comparisons, desktop, python
https://github.com/emanjavacas/cosycat
https://github.com/annefried/swan
https://github.com/tayllan/viper - NER, js based
https://paperai.github.io/htmlanno - NER, relations, paragraph selection, js based
https://github.com/aldanor/plato - binary classification, no backend, csv, hotkeys, js based
http://quepid.com/ - commercial judgement list for relevancy sorting

http://www.janfreyberg.com/superintendent - widgets for jupyter notebooks
https://github.com/natasha/ipyannotate - widgets for jupyter notebooks

https://pybossa.com - more of 'roll your own solution'
https://github.com/danvk/localturk - mimicking [Mechanical Turk](https://www.mturk.com/) API

3 changes: 3 additions & 0 deletions requirements.txt
@@ -0,0 +1,3 @@
ipyannotate
ipywidgets==7.4.2
pandas
22 changes: 22 additions & 0 deletions samples/ann/Dockerfile
@@ -0,0 +1,22 @@
FROM parentheticalenterprises/sbcl-quicklisp-base

RUN apt-get update && apt-get install -y git libyaml-dev gcc locales && apt-get clean

RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

# Used to update builds whenever github repo gets update https://stackoverflow.com/a/49772666/1976857
ARG CACHEBUST
RUN git clone --single-branch --branch master https://github.com/sudodoki/ann.git ann

COPY users.txt ann/users.txt
WORKDIR ann

EXPOSE 7001

RUN echo "(push :dev *features*)\n$(cat run.lisp)" > run.lisp

ENTRYPOINT ["sbcl", "--load", "hunch.lisp", "--load", "run.lisp"]
# ENTRYPOINT ["sbcl", "--noinform", "--disable-ldb", "--lose-on-corruption", "--disable-debugger", "--load", "hunch.lisp", "--load", "run.lisp"]
47 changes: 47 additions & 0 deletions samples/ann/README.md
@@ -0,0 +1,47 @@
# Setting up the host system

It should have Docker installed.

Also, if you see an error like `mmap: Cannot allocate memory: ensure_space: failed to validate XXX bytes at …`, it's a known limitation that can arise in [some environments](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=474402) due to an inability to secure enough memory. The fix, as mentioned in the thread, is to run `echo 1 > /proc/sys/vm/overcommit_memory` (possibly with sudo) on the host system.

## Building & running container

```shell
# after cloning / copying the ann/ folder from this repo
docker build -t ann-spans ./
# place the data to annotate into data/ and the schemas into schemas/, then:
docker run -ti -p 7001:7001 -v /data/:/ann/data -v /schemas/:/ann/schemas ann-spans
# on Mac, mount with absolute paths instead:
# docker run -ti -p 7001:7001 -v "$(pwd)/data/":/ann/data -v "$(pwd)/schemas/":/ann/schemas ann-spans
# go to localhost:7001 to see the annotation tool's UI
```

If you run this on a Mac and the data doesn't show up, you'll have to use `"$(pwd)/data/"` instead when mounting the volume.

## Folder structure

In your `data/` folder you can have any folder structure, but make sure the leaves of the tree are `.txt` files, as those are what `ann` works with. You also need to provide a schema for your dataset: it maps each class code (which you'll get in the final annotation) to a human-readable label displayed in the annotation popup and a CSS color used to highlight the item. See the [sample schema](schemas/ner.yaml). There should also be an `.ann.yaml` config file providing the schema name and a few other settings; see the sample in [data/annotator1/.ann.yaml](data/annotator1/.ann.yaml).

## Updating image after first build

After you've built the image for the first time, whenever there are changes in the upstream ann codebase, be sure to re-run the build and pass an extra argument

```shell
docker build -t ann-spans ./ --build-arg CACHEBUST=something1
```

where the value should be a new one each time you need to update the image.

# Actual UI

After navigating to the designated URL (localhost:7001, or another host/port if you run it on a server or with a different port binding), you'll see a folder view, which you can drill down until you reach a single document. After selecting a range of text (you can also double-click words to select them), you'll be presented with a label prompt to select the appropriate class.
![](https://github.com/sudodoki/sudodoki-public-assets/raw/gh-pages/ann_screenshot.png)
![](https://github.com/sudodoki/sudodoki-public-assets/raw/gh-pages/ann_screenshot_2.png)

# Parsing results

Here's a [code sample](work_with_annotations.ipynb) for parsing the annotations.
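
The core of such parsing can be sketched from the `.txt.ann` format visible under `samples/ann/data` (span id, class, start offset, end offset, covered text); the actual files may separate fields with tabs rather than spaces, so we split on any whitespace:

```python
def parse_ann(lines):
    """Parse span lines like 'T1 LOC 0 14 SALT LAKE CITY' into dicts."""
    spans = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # maxsplit=4 keeps internal spaces of the covered text intact
        span_id, cls, start, end, text = line.split(None, 4)
        spans.append({"id": span_id, "class": cls,
                      "start": int(start), "end": int(end), "text": text})
    return spans

spans = parse_ann(["T1 LOC 0 14 SALT LAKE CITY", "T2 ORG 17 23 Disney"])
```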

# Restricting who has access

There's a built-in flow to restrict who has access to annotating what, via basic auth. When building the Docker container, a `users.txt` file is copied into the container and used to provide the set of users and passwords. Each user is then restricted to items inside the top-level folder with the same name (annotator1 to everything below data/annotator1, etc.). A special case is the "admin" user, who can modify any file.
4 changes: 4 additions & 0 deletions samples/ann/data/annotator1/.ann.yaml
@@ -0,0 +1,4 @@
format: bsf
schema: ner
highlight: background
ext: txt
1 change: 1 addition & 0 deletions samples/ann/data/annotator1/doc1.txt
@@ -0,0 +1 @@
SALT LAKE CITY — Disney will soon finish acquiring 20th Century Fox, which could mean some major changes are in store for the Marvel Cinematic Universe.
4 changes: 4 additions & 0 deletions samples/ann/data/annotator1/doc1.txt.ann
@@ -0,0 +1,4 @@
T1 LOC 0 14 SALT LAKE CITY
T2 ORG 17 23 Disney
T3 ORG 51 67 20th Century Fox
T4 LOC 126 151 Marvel Cinematic Universe
1 change: 1 addition & 0 deletions samples/ann/data/annotator1/doc2.txt
@@ -0,0 +1 @@
Brazil was one of the final countries that had yet to approve the deal, according to Bloomberg. A source told Bloomberg that Disney was willing to unload the Fox Sports network to different buyers.
5 changes: 5 additions & 0 deletions samples/ann/data/annotator1/doc2.txt.ann
@@ -0,0 +1,5 @@
T1 LOC 0 6 Brazil
T2 ORG 85 94 Bloomberg
T3 ORG 110 119 Bloomberg
T4 ORG 125 131 Disney
T5 ORG 158 168 Fox Sports
9 changes: 9 additions & 0 deletions samples/ann/schemas/ner.yaml
@@ -0,0 +1,9 @@
LOC:
desc: Location
color: aqua
PER:
desc: Person
color: green
ORG:
desc: Organization
color: orange
Empty file added samples/ann/users.txt