Add initial version of README
sudodoki committed Mar 16, 2019
1 parent 3db7062 commit a835807
Showing 19 changed files with 2,107 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
.ipynb_checkpoints/
venv
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2018 sudodoki

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
130 changes: 130 additions & 0 deletions README.md
@@ -0,0 +1,130 @@
# NLP annotate: Howtos

No doubt, there are some great tools out there for annotating NLP tasks and datasets. Sometimes they are somewhat complex, or cost money (and are usually worth it). See [tools](TOOLS.md) for a list if you haven't heard of any annotation tools for NLP. [chbrown/awesome-annotation](https://github.com/chbrown/awesome-annotation) might be of interest to you as well.

This document gives some guidance on how to complete common NLP annotation tasks using simpler (more familiar) tooling. Based on my knowledge of the industry, companies with a real need for annotation usually either roll their own solution or purchase an off-the-shelf one that is powerful enough to cover their use-cases / provides paid support for adding new features. So consider this better suited for setups where you are only bootstrapping a small dataset for experiments.

Below is a list of tasks and descriptions of the process for gathering annotations for each. Use cases not covered here might be supported by the other listed [tools](TOOLS.md).

There is extensive reading on the subject in the form of books, for example:
* [Introduction to Linguistic Annotation and Text Analytics](https://www.amazon.com/Introduction-Linguistic-Annotation-Analytics-Technologies/dp/1598297384) by Graham Wilcock
* [Natural Language Annotation for Machine Learning](http://shop.oreilly.com/product/0636920020578.do) by James Pustejovsky and Amber Stubbs


Contents
=================

* [Binary classification for documents / sentences](#binary-classification-for-documents--sentences)
* [Using folders (Finder)](#using-folders-finder)
* [Using google spreadsheets and data validation](#using-google-spreadsheets-and-data-validation)
* [Using jupyter notebooks](#using-jupyter-notebooks)
* [Multi-class classification for documents / sentences](#multi-class-classification-for-documents--sentences)
* [Using folders (Finder)](#using-folders-finder-1)
* [Using google spreadsheets and data validation](#using-google-spreadsheets-and-data-validation-1)
* [Using jupyter notebooks](#using-jupyter-notebooks-1)
* [Hierarchical Multi-class classification for documents / sentences](#hierarchical-multi-class-classification-for-documents--sentences)
* [Using google spreadsheets and data validation](#using-google-spreadsheets-and-data-validation-2)
* [Using jupyter](#using-jupyter)
* [NER (Span annotations)](#ner-span-annotations)
* [Jupyter](#jupyter)
* [m8nware/ann](#m8nwareann)
* [Do you have more use-cases/solutions?](#do-you-have-more-use-casessolutions)

## Binary classification for documents / sentences

There are multiple ways to assign a single sentence / doc a label that can have at most 2 values (True/False).

### Using folders (Finder)

Use a preview tool for folders that shows thumbnails of file contents, or a dedicated preview area (in Mac's Finder that would be under View > as Cover Flow):
https://www.dropbox.com/s/55fx5wk6b5p3w1y/Screenshot%202018-10-27%2016.49.20.png?dl=0
https://www.dropbox.com/s/0p11jwaljktedjx/Screenshot%202018-10-27%2016.50.52.png?dl=0
and just manually sort the files into two folders. Short sentences in .txt files, or documents that can be identified by their first sentence, work best.
This method also works for **multi-class** and **hierarchical multi-class** classification.
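
Once the files are sorted, the folder structure itself is the annotation. A minimal sketch of collecting it back into rows (folder names `positive`/`negative` are an assumption, use whatever names you chose while sorting):

```python
import tempfile
from pathlib import Path

def collect_labels(root, classes=("positive", "negative")):
    """Walk one sub-folder per class and return (file, text, label) rows."""
    rows = []
    for label in classes:
        for path in sorted(Path(root, label).glob("*.txt")):
            rows.append({"file": path.name,
                         "text": path.read_text(encoding="utf-8").strip(),
                         "label": label})
    return rows

# Tiny self-contained demo: fake a sorted folder tree, then collect it.
root = Path(tempfile.mkdtemp())
(root / "positive").mkdir()
(root / "negative").mkdir()
(root / "positive" / "a.txt").write_text("great product")
(root / "negative" / "b.txt").write_text("terrible service")

rows = collect_labels(root)
```

From here the rows can be written to a CSV or fed straight into a dataframe.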

### Using google spreadsheets and data validation

*better for sentences*

Create a Google [spreadsheet](http://spreadsheet.new) – you can upload CSVs as well. Add a column to hold the target class for each item. Click Data -> Data Validation, select the cell range for the target column (i.e. `Sheet1!C2:C`) and select Criteria - 'Checkbox'. You can now use either the mouse or arrows + space to toggle the target label. You can export the data as a CSV (File -> Download as -> .csv). You'll then have to map `TRUE`/`FALSE` to your target binary classes.
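
That last mapping step can be done with a few lines of stdlib Python. A sketch, assuming a hypothetical `is_relevant` checkbox column (Sheets exports checkboxes as the strings `TRUE`/`FALSE`):

```python
import csv
import io

# Stand-in for the downloaded .csv file; in practice use open("export.csv").
exported = io.StringIO(
    "text,is_relevant\n"
    "the deal was approved,TRUE\n"
    "weather was nice,FALSE\n"
)

# Convert the TRUE/FALSE strings into real booleans.
rows = [
    {**row, "is_relevant": row["is_relevant"] == "TRUE"}
    for row in csv.DictReader(exported)
]
```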

### Using [jupyter](https://jupyter.org/) notebooks

See `samples/Binary_Classification_Annotation.ipynb`. Be aware this was created with Python 3 (it would probably run with minor modifications on Python 2).
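
The general idea can be sketched with plain ipywidgets (the bundled notebook may be built differently, this is just the shape of such a loop): two buttons that record a label and advance to the next sample.

```python
import ipywidgets as widgets
from IPython.display import display

samples = ["sentence one", "sentence two", "sentence three"]
labels = {}          # sample text -> True/False
state = {"i": 0}     # index of the sample currently shown

out = widgets.Label(value=samples[0])

def make_handler(value):
    # Returns a click handler that stores `value` for the current sample.
    def handler(_button):
        i = state["i"]
        if i < len(samples):
            labels[samples[i]] = value
            state["i"] = i + 1
            out.value = samples[state["i"]] if state["i"] < len(samples) else "done"
    return handler

yes, no = widgets.Button(description="True"), widgets.Button(description="False")
yes.on_click(make_handler(True))
no.on_click(make_handler(False))
display(widgets.HBox([yes, no]), out)
```

After clicking through, `labels` holds the finished annotations and can be dumped to CSV.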


## Multi-class classification for documents / sentences

There are two cases for multi-class labelling: assigning a single label to each sample, or assigning multiple labels to a single sample. The notes below cover the single-label case (except where noted in the jupyter notebook).

### Using folders (Finder)

Same as [Binary classification](#using-folders-finder), but using multiple folders.

### Using google spreadsheets and data validation

Create a Google [spreadsheet](http://spreadsheet.new) – you can upload CSVs as well. Add a column to hold the target class for each item. Click Data -> Data Validation, select the cell range for the target column (i.e. `Sheet1!C2:C`), and for Criteria select either 'List of items' or 'List from a range' (you can also use a reference to [named ranges](https://support.google.com/docs/answer/63175?co=GENIE.Platform%3DDesktop&hl=en), which is useful if you reuse them in multiple places). Be sure to have the 'show a dropdown' option checked, as it enables typeahead, which makes it easy to quickly filter the list of classes by typing the first few letters. You can export the data as a CSV (File -> Download as -> .csv).

> Note: if you need multiple labels per row, consider adding additional columns (i.e. 'class 1', 'class 2') and merging them in a post-processing step, or use Google Apps Script, though that might be more cumbersome.

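
A sketch of that post-processing step, assuming the 'class 1' / 'class 2' column names from the note:

```python
import csv
import io

# Stand-in for the downloaded .csv file; in practice use open("export.csv").
exported = io.StringIO(
    "text,class 1,class 2\n"
    "order arrived late,shipping,complaint\n"
    "love the new design,praise,\n"
)

# Collapse the per-column labels into one list, skipping empty cells.
merged = [
    {"text": row["text"],
     "labels": [row[c] for c in ("class 1", "class 2") if row[c]]}
    for row in csv.DictReader(exported)
]
```
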
### Using [jupyter](https://jupyter.org/) notebooks

See `samples/jupyter/Multiclass_Classification_Annotation.ipynb`. Be aware this was created with Python 3 (it would probably run with minor modifications on Python 2).


## Hierarchical Multi-class classification for documents / sentences

This is a task where labels are organized hierarchically: after assigning a first label out of a predefined set, we proceed to picking the next one out of the corresponding child set. For example, our labels might look like the following:
- work-days:
a) Monday
b) Tuesday
c) Wednesday
- weekend:
a) Saturday
b) Sunday
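
The hierarchy above can also be kept as a plain mapping from top-level class to allowed child labels, which is handy for validating annotations outside of spreadsheets (a minimal sketch, not part of the samples):

```python
# Top-level class -> allowed second-level labels, mirroring the example list.
HIERARCHY = {
    "work-days": ["Monday", "Tuesday", "Wednesday"],
    "weekend": ["Saturday", "Sunday"],
}

def is_valid(top, sub):
    """True when `sub` is an allowed child label of `top`."""
    return sub in HIERARCHY.get(top, [])
```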

### Using google spreadsheets and data validation

To accomplish this in Google spreadsheets, you'll need to use [google apps script](https://www.google.com/script/start/). I adapted code from a [Stack Overflow answer](https://stackoverflow.com/questions/34191248/drop-down-dependent-menus-in-google-spreadsheets) for my needs. Below is a step-by-step guide on how to apply it to a hierarchical labelling task.

> In this section I use blockquotes to describe the steps I followed for a somewhat artificial task: labelling dates with their type of day.

You'll need to:
- create a [new spreadsheet](http://spreadsheet.new)
> Here's a [sample spreadsheet](https://docs.google.com/spreadsheets/d/1caBDrV46-p8VWpQVv0m7shMAitY_k9gj5XUoIguYRqA/edit?usp=sharing).
- set up references for the classes. There is also a way to make this work based on indexes, but I would suggest going with [named ranges](https://support.google.com/docs/answer/63175?co=GENIE.Platform%3DDesktop&hl=en) so that items of the higher-level range translate easily into the dependent class's range name.

> I created two named ranges: one named `work_days` (replacing `-` with `_`, to conform with range-name restrictions, from the value in the 'Type of days' column on the `Labels` sheet) and one named `weekend`.
- for the higher-level class, add a data validation that accepts only top-level class values

> On `Data` sheet I added a validation for `Type of date` (Cell range: `Data!B2:B`, List from a range: `Labels!A3:A5`)
- add a script that creates a dynamic validation based on the top-level class value. Go to `Tools` > `Script Editor` and insert the [following code](https://gist.github.com/sudodoki/70c7765e460724ec5d517d13917babef). A high-level overview of what it does: the `to_range_name` function transforms a class value into the dependent class labels' range name (`to_range_name('work-days')` would yield `work_days`); `depDrop_` takes a cell and a range of reference values and adds a dynamic validation to that cell; `onEdit` is a global callback that ties it all together: it takes the current cell, verifies its value is not 'N/A' (I used this value when no class was available and no dependent labels were needed; an empty value would work as well), and, if the value is meaningful, looks up the reference by its transformed name and adds a validation to the cell one column to the right of the edited one.
You'll need to name this project and save it. Then you'll need to run it (by pressing the ▶️ button in the top panel). **Note: if you set up your labelling batches by copying a spreadsheet over and modifying values, you'll need to run this in every spreadsheet**.

- now, whenever you assign or edit a value in the top-level label column, within 2-4 seconds a dropdown with the corresponding values will appear in the following column.

It might be overkill for the task at hand, but with lots of top-level classes and subitems, managing them manually without programmatic restriction/validation can turn out cumbersome. The Google dropdown can be operated from the keyboard and provides typeahead-style input, which is useful if the annotator needs to choose from a long list of possible values.

### Using jupyter

See `samples/jupyter/Hierarchical_multiclass.ipynb`. Be aware this was created with Python 3 (it would probably run with minor modifications on Python 2).


## NER (Span annotations)

These are useful for NER and involve selecting a contiguous span of text in a document and marking it with a corresponding class.

### Jupyter

See `samples/jupyter/Span_annotation.ipynb`. Be aware this was created with Python 3 (it would probably run with minor modifications on Python 2). There's also the possibility of an even more responsive/dynamic solution using JS in a widget, handling mouse events and tracking the current selection.
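
As a rough idea of the display side (not the notebook's actual code): span annotations with character offsets, like those in the `.ann` files under `samples/ann`, can be rendered as highlighted HTML for a notebook.

```python
import html

def render_spans(text, spans, colors={"LOC": "aqua", "ORG": "orange"}):
    """Wrap each (start, end, class) span of `text` in a colored <mark> tag."""
    out, last = [], 0
    for start, end, cls in sorted(spans):
        out.append(html.escape(text[last:start]))
        out.append('<mark style="background:%s">%s</mark>'
                   % (colors.get(cls, "yellow"), html.escape(text[start:end])))
        last = end
    out.append(html.escape(text[last:]))
    return "".join(out)

snippet = render_spans("Disney will acquire 20th Century Fox",
                       [(0, 6, "ORG"), (20, 36, "ORG")])
```

The resulting string can be shown with `IPython.display.HTML(snippet)` inside a notebook.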

### m8nware/ann

See the [samples/ann](samples/ann) sample project, which uses Docker to run a modification of [m8nware/ann](https://github.com/m8nware/ann).


## Do you have more use-cases/solutions?

If you know a solution to a peculiar NLP annotation task, open an issue with a description. Even better, write it down and create a [pull request](https://help.github.com/en/articles/creating-a-pull-request)! I'm almost sure we haven't described every useful tool out there providing solutions to NLP tasks, so please reference one if you know any that aren't in the [tools](TOOLS.md) list. If this was useful to you, let me know as well, either through an issue or in the [gitter channel](https://gitter.im/sudodoki/nlp-how-to-annotate).
21 changes: 21 additions & 0 deletions TOOLS.md
@@ -0,0 +1,21 @@
Not all the tools are of the same quality or price.

https://gate.ac.uk - NER, Text Classification
http://brat.nlplab.org - NER, Relations, Normalization, […etc](http://brat.nlplab.org/examples.html#annotation-examples)
https://prodi.gy - NER, Text Classification
https://lighttag.io - NER, Relations, Classification
https://github.com/m8nware/ann - NER, persists to disk, common lisp, basic auth, built-in diffs
https://github.com/jiesutd/YEDDA - NER annotation, admin interface / comparisons, desktop, python
https://github.com/emanjavacas/cosycat
https://github.com/annefried/swan
https://github.com/tayllan/viper - NER, js based
https://paperai.github.io/htmlanno - NER, relations, paragraph selection, js based
https://github.com/aldanor/plato - binary classification, no backend, csv, hotkeys, js based
http://quepid.com/ - commercial judgement list for relevancy sorting

http://www.janfreyberg.com/superintendent - widgets for jupyter notebooks
https://github.com/natasha/ipyannotate - widgets for jupyter notebooks

https://pybossa.com - more of 'roll your own solution'
https://github.com/danvk/localturk - mimicking [Mechanical Turk](https://www.mturk.com/) API

3 changes: 3 additions & 0 deletions requirements.txt
@@ -0,0 +1,3 @@
ipyannotate
ipywidgets==7.4.2
pandas
22 changes: 22 additions & 0 deletions samples/ann/Dockerfile
@@ -0,0 +1,22 @@
FROM parentheticalenterprises/sbcl-quicklisp-base

RUN apt-get update && apt-get install -y git libyaml-dev gcc locales && apt-get clean

RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

# Used to update builds whenever github repo gets update https://stackoverflow.com/a/49772666/1976857
ARG CACHEBUST
RUN git clone --single-branch --branch master https://github.com/sudodoki/ann.git ann

COPY users.txt ann/users.txt
WORKDIR ann

EXPOSE 7001

RUN echo "(push :dev *features*)\n$(cat run.lisp)" > run.lisp

ENTRYPOINT ["sbcl", "--load", "hunch.lisp", "--load", "run.lisp"]
# ENTRYPOINT ["sbcl", "--noinform", "--disable-ldb", "--lose-on-corruption", "--disable-debugger", "--load", "hunch.lisp", "--load", "run.lisp"]
47 changes: 47 additions & 0 deletions samples/ann/README.md
@@ -0,0 +1,47 @@
# Setting up the host system

It should have Docker installed.

Also, if you see an error like `mmap: Cannot allocate memory: ensure_space: failed to validate XXX bytes at …`, it's a known limitation that can arise in [some environments](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=474402) due to an inability to secure enough memory. The fix, as mentioned in the thread, is to run `echo 1 > /proc/sys/vm/overcommit_memory` (possibly with sudo) on the host system.

## Building & running container

```shell
# after cloning / copying the ann/ folder from this repo
docker build -t ann-spans ./
# place the data to annotate into data/ and the schemas into schemas/, then:
docker run -ti -p 7001:7001 -v /data/:/ann/data -v /schemas/:/ann/schemas ann-spans
# on Mac, mount with absolute paths instead:
# docker run -ti -p 7001:7001 -v "$(pwd)/data/":/ann/data -v "$(pwd)/schemas/":/ann/schemas ann-spans
# go to localhost:7001 to see the annotation tool's UI
```

If you run this on a Mac and the data doesn't show up, you'll have to use `"$(pwd)/data/"` instead when mounting the volume.

## Folder structure

In your `data/` folder you can have any folder structure, but make sure the leaves of the tree are `.txt` files, as those are what `ann` works with. You also need to provide a schema for your dataset: it maps each class code (which you'll get in the final annotation) to a human-readable label displayed in the annotation popup and a CSS color used to highlight the item. See the [sample schema](schemas/ner.yaml). There should also be an `.ann.yaml` config file providing the schema name and a few other settings; see the sample in [data/annotator1/.ann.yaml](data/annotator1/.ann.yaml).

## Updating image after first build

After you've built the image for the first time, whenever there are changes in the upstream ann codebase, be sure to re-run the build and pass an extra argument

```shell
docker build -t ann-spans ./ --build-arg CACHEBUST=something1
```

where the value should be a new one each time you need to update the image.

# Actual UI

After navigating to the designated URL (localhost:7001, or another host/port if you run it on a server or with a different port binding), you'll see a folder view, which you can drill down until you reach a single document. After selecting a range of text (you can also double-click words to select them), you'll be presented with a label prompt to select the appropriate class.
![](https://github.com/sudodoki/sudodoki-public-assets/raw/gh-pages/ann_screenshot.png)
![](https://github.com/sudodoki/sudodoki-public-assets/raw/gh-pages/ann_screenshot_2.png)

# Parsing results

Here's a [code sample](work_with_annotations.ipynb) for parsing the annotations.
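
The core of such parsing can be sketched from the `.txt.ann` format visible under `samples/ann/data` (span id, class, start offset, end offset, covered text); the actual files may separate fields with tabs rather than spaces, so we split on any whitespace:

```python
def parse_ann(lines):
    """Parse span lines like 'T1 LOC 0 14 SALT LAKE CITY' into dicts."""
    spans = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # maxsplit=4 keeps internal spaces of the covered text intact
        span_id, cls, start, end, text = line.split(None, 4)
        spans.append({"id": span_id, "class": cls,
                      "start": int(start), "end": int(end), "text": text})
    return spans

spans = parse_ann(["T1 LOC 0 14 SALT LAKE CITY", "T2 ORG 17 23 Disney"])
```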

# Restricting who has access

There's a built-in flow to restrict who has access to annotating what, via basic auth. When building the Docker container, a `users.txt` file is copied into the container and used to provide the set of users and passwords. Each user is then restricted to items inside the top-level folder with the same name (annotator1 to everything below data/annotator1, etc.). A special case is the "admin" user, who can modify any file.
4 changes: 4 additions & 0 deletions samples/ann/data/annotator1/.ann.yaml
@@ -0,0 +1,4 @@
format: bsf
schema: ner
highlight: background
ext: txt
1 change: 1 addition & 0 deletions samples/ann/data/annotator1/doc1.txt
@@ -0,0 +1 @@
SALT LAKE CITY — Disney will soon finish acquiring 20th Century Fox, which could mean some major changes are in store for the Marvel Cinematic Universe.
4 changes: 4 additions & 0 deletions samples/ann/data/annotator1/doc1.txt.ann
@@ -0,0 +1,4 @@
T1 LOC 0 14 SALT LAKE CITY
T2 ORG 17 23 Disney
T3 ORG 51 67 20th Century Fox
T4 LOC 126 151 Marvel Cinematic Universe
1 change: 1 addition & 0 deletions samples/ann/data/annotator1/doc2.txt
@@ -0,0 +1 @@
Brazil was one of the final countries that had yet to approve the deal, according to Bloomberg. A source told Bloomberg that Disney was willing to unload the Fox Sports network to different buyers.
5 changes: 5 additions & 0 deletions samples/ann/data/annotator1/doc2.txt.ann
@@ -0,0 +1,5 @@
T1 LOC 0 6 Brazil
T2 ORG 85 94 Bloomberg
T3 ORG 110 119 Bloomberg
T4 ORG 125 131 Disney
T5 ORG 158 168 Fox Sports
9 changes: 9 additions & 0 deletions samples/ann/schemas/ner.yaml
@@ -0,0 +1,9 @@
LOC:
desc: Location
color: aqua
PER:
desc: Person
color: green
ORG:
desc: Organization
color: orange
Empty file added samples/ann/users.txt