Parsing with SLING

The trained parser model is stored in a Myelin flow file. It contains all the information needed for parsing text:

  • The neural network units (LR, RL, FF) with the parameters learned from training.
  • Feature maps for the lexicon and affixes.
  • The commons store, a SLING store with the schemas for the frames.
  • The action table with all the transition actions.

A pre-trained model can be downloaded from here. The model can be loaded and initialized in the following way:

#include "sling/frame/store.h"
#include "sling/nlp/document/document-tokenizer.h"
#include "sling/nlp/parser/parser.h"

// Load parser model.
sling::Store commons;
sling::nlp::Parser parser;
parser.Load(&commons, "/tmp/caspar.flow");
commons.Freeze();

// Create document tokenizer.
sling::nlp::DocumentTokenizer tokenizer;

Text first needs to be tokenized before it can be parsed. The document, with its text, tokens, and frames, is stored in a local document frame store.

// Create frame store for document.
sling::Store store(&commons);
sling::nlp::Document document(&store);

// Tokenize text.
string text = "John hit the ball with a bat.";
tokenizer.Tokenize(&document, text);

// Parse document.
parser.Parse(&document);
document.Update();

// Output document annotations.
std::cout << sling::ToText(document.top(), 2);
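
The commons store, parser, and tokenizer set up above can be reused across any number of texts; only the local store and document are created per document. A minimal sketch of such a loop (the second example sentence is illustrative):

// Annotate several texts with the same parser. Each document gets its own
// local store layered on the frozen commons store, so all per-document
// frames are released when the local store goes out of scope.
const string texts[] = {
  "John hit the ball with a bat.",
  "Mary caught the ball.",
};
for (const string &text : texts) {
  sling::Store store(&commons);
  sling::nlp::Document document(&store);
  tokenizer.Tokenize(&document, text);
  parser.Parse(&document);
  document.Update();
  std::cout << sling::ToText(document.top(), 2) << "\n";
}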

Myelin-based parser tool

SLING comes with a parsing tool for annotating a corpus of documents with frames using a parser model, benchmarking this annotation process, and optionally evaluating the annotated frames against supplied gold frames.

This tool takes the following command-line arguments:

  • --parser : This should point to a Myelin flow, e.g. one created by the training script.

  • If --text is specified, the parser is run over the supplied text and the annotated frame(s) are printed in text mode. The indentation of the text output can be controlled with --indent, e.g.

    bazel build -c opt sling/nlp/parser/tools:parse
    bazel-bin/sling/nlp/parser/tools/parse --logtostderr \
       --parser=<path to flow file> --text="John loves Mary" --indent=2
    
    {=#1
      :document
      text: "John loves Mary"
      tokens: [{=#2
        word: "John"
        start: 0
        size: 4
      }, {=#3
        word: "loves"
        start: 5
        size: 5
      }, {=#4
        word: "Mary"
        start: 11
        size: 4
      }]
      mention: {=#5
        begin: 0
        evokes: {=#6
          :/saft/person
        }
      }
      mention: {=#7
        begin: 1
        evokes: {=#8
          :/pb/love-01
          /pb/arg0: #6
          /pb/arg1: {=#9
            :/saft/person
          }
        }
      }
      mention: {=#10
        begin: 2
        evokes: #9
      }
    }
    I0927 14:44:25.705880 30901 parse.cc:154] 823.732 tokens/sec
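
    In the frame notation above, {=#N ...} defines a frame with local index N, and a bare #N is a reference back to that frame. For example, the /saft/person frame #6 evoked by the "John" mention fills the /pb/arg0 role of the /pb/love-01 frame, and the /saft/person frame #9 filling /pb/arg1 is also evoked by the "Mary" mention.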
  • If --benchmark is specified then the parser is run on the document corpus specified via --corpus. This corpus should be prepared similarly to how the training/dev corpora were created. The processing can be limited to the first N documents by specifying --maxdocs=N.

     bazel-bin/sling/nlp/parser/tools/parse --logtostderr \
       --parser=sempar.flow --corpus=dev.zip --benchmark --maxdocs=200
    
     I0927 14:45:36.634670 30934 parse.cc:127] Load parser from sempar.flow
     I0927 14:45:37.307870 30934 parse.cc:135] 565.077 ms loading parser
     I0927 14:45:37.307922 30934 parse.cc:161] Benchmarking parser on dev.zip
     I0927 14:45:39.059257 30934 parse.cc:184] 200 documents, 3369 tokens, 2289.91 tokens/sec

    If --profile is specified, the parser will run with profiling instrumentation enabled and output a detailed profile report with execution timing for each operation in the neural network.

  • If --evaluate is specified then the tool expects --corpus to specify a corpus with gold frames. It then runs the parser model over a frame-less version of this corpus and evaluates the annotated frames against the gold frames. Again, --maxdocs can be used to limit the evaluation to the first N documents.

    bazel-bin/sling/nlp/parser/tools/parse --logtostderr \
      --evaluate --parser=sempar.flow --corpus=dev.rec --maxdocs=200
    
    I0927 14:51:39.542151 31336 parse.cc:127] Load parser from sempar.flow
    I0927 14:51:40.211920 31336 parse.cc:135] 562.249 ms loading parser
    I0927 14:51:40.211973 31336 parse.cc:194] Evaluating parser on dev.rec
    SPAN_P+ 1442
    SPAN_P- 93
    SPAN_R+ 1442
    SPAN_R- 133
    SPAN_Precision  93.941368078175884
    SPAN_Recall     91.555555555555557
    SPAN_F1 92.733118971061089
    ...
    <snip>
    ...
    SLOT_F1 78.398993883366586
    COMBINED_P+     4920
    COMBINED_P-     633
    COMBINED_R+     4923
    COMBINED_R-     901
    COMBINED_Precision      88.60075634792004
    COMBINED_Recall 84.529532967032978
    COMBINED_F1     86.517276488704127
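
    Each precision and recall value is computed from its '+' (counted as correct) and '-' (counted as wrong or missing) totals, e.g. SPAN_Precision = 1442 / (1442 + 93) ≈ 93.94 and SPAN_Recall = 1442 / (1442 + 133) ≈ 91.56, and each F1 score is the harmonic mean of the corresponding precision and recall.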