Fact Extraction Functions

Article Processing

Article processing is initiated by calling interpreter.process_article_new(story), where story is a String.

The steps are as follows:

The story string is cleaned to deal with some formatting errors such as lack of spacing between numbers and text:
```
story = self.cleanup(story)
```
We initialize an empty array for keeping track of encountered reports:
```
processed_reports = []
```
Convert the story into a Spacy doc object by running it through the NLP parser, and split it into sentences:
```
story = self.nlp(story)
sentences = list(story.sents)
```
Initialize two variables for keeping track of the most recently encountered dates and locations in the article:
```
dates_memory = None
locations_memory = None
```

Loop through each sentence, and attempt to extract a report using the process_sentence_new function:

```
for sentence in sentences:
    reports = self.process_sentence_new(
        sentence, dates_memory, locations_memory, story)
```

If the current sentence contains locations (independent of whether or not a report has been generated), then we should update the locations_memory variable:
```
current_locations = self.extract_locations(sentence)
if current_locations:
    locations_memory = current_locations
```

Similarly, if the sentence contains dates, then update the dates_memory variable:

```
current_dates = self.extract_dates(sentence, story)
if current_dates:
    dates_memory = current_dates
```

Add any extracted reports to the list of reports for the article:
```
processed_reports.extend(reports)
```
When we are finished processing all of the sentences, return the list of unique reports:
```
return list(set(processed_reports))
```
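
The loop above can be sketched end to end with plain-Python stand-ins. This is a minimal sketch, not the project's implementation: `Report`, `find_locations`, `find_dates` and `process_article` are hypothetical names, sentences are plain strings rather than Spacy spans, and the extractors are simple keyword lookups. It illustrates how the memory variables let a later sentence inherit a location or date, and that reports need to be hashable for the `set()` deduplication to work.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the real extractors: simple keyword lookups.
KNOWN_LOCATIONS = ("Nepal", "Kathmandu")
KNOWN_DATES = ("Monday", "Tuesday")


@dataclass(frozen=True)  # frozen dataclasses are hashable, so set() can deduplicate them
class Report:
    locations: tuple
    dates: tuple
    text: str


def find_locations(sentence):
    return tuple(w for w in KNOWN_LOCATIONS if w in sentence)


def find_dates(sentence):
    return tuple(w for w in KNOWN_DATES if w in sentence)


def process_article(sentences):
    processed_reports = []
    dates_memory = None
    locations_memory = None
    for sentence in sentences:
        current_locations = find_locations(sentence)
        current_dates = find_dates(sentence)
        # Toy relevance test (an assumption): only "displaced" sentences yield a report.
        if "displaced" in sentence:
            processed_reports.append(Report(
                current_locations or locations_memory or (),
                current_dates or dates_memory or (),
                sentence,
            ))
        # Update the memories whether or not a report was produced.
        if current_locations:
            locations_memory = current_locations
        if current_dates:
            dates_memory = current_dates
    return list(set(processed_reports))


article = [
    "Floods hit Nepal on Monday.",
    "About 300 people were displaced.",
    "About 300 people were displaced.",  # duplicate sentence
]
reports = process_article(article)
```

In this toy run the second sentence has no location or date of its own, so it falls back to the values remembered from the first sentence, and the duplicated sentence collapses into a single report.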

Sentence Processing

Sentence processing is initiated by a call to:

```
process_sentence_new(sentence, dates_memory, locations_memory, story)
```

Parameters:

sentence: the sentence to be processed, a Spacy Span object
dates_memory: the most recently extracted dates from the article
locations_memory: the most recently extracted locations from the article
story: the article currently being processed, a Spacy Doc object

The steps for processing a sentence are:

Initialize a list for storing extracted reports:
```
sentence_reports = []
```
Extract the main verbs from the sentence (using the Textacy library that provides a wrapper around Spacy along with many useful functions). Here we expect the verbs to correspond to desired reporting terms:
```
main_verbs = textacy.spacy_utils.get_main_verbs_of_sent(sentence)
```
Loop through each verb, and determine whether or not it is relevant to human displacement, and whether it is a Structural or Person-related reporting term:
```
for v in main_verbs:
    unit_type, verb_lemma = self.verb_relevance(v, story)
```

If the verb is determined to be relevant, initiate a branch search based on that verb to try to extract one or more reports:

```
reports = self.branch_search_new(v, verb_lemma, unit_type, dates_memory, locations_memory,
                                 sentence, story)
sentence_reports.extend(reports)
```

Once all of the verbs have been processed, return the list of extracted reports:
```
return sentence_reports
```

Determining Verb Relevance

During sentence processing, each verb is tested for relevance to human displacement:

```
verb_relevance(verb, story)
```

Parameters:

verb: the verb being tested, a Spacy Token object
story: the article currently being processed, a Spacy Doc object

The steps for determining verb relevance are:

Start by looking for the case where the verb is relevant to both Structural and Person reporting terms:

```
if verb.lemma_ in self.joint_term_lemmas:
    return self.structure_unit_lemmas + self.person_unit_lemmas, Fact(verb, verb, verb.lemma_, "term")
```

Otherwise look for the general case where the verb is equivalent to a Structural reporting term:

```
elif verb.lemma_ in self.structure_term_lemmas:
    return self.structure_unit_lemmas, verb.lemma_
```
Here we return the structural unit lemmas to indicate that we should attempt to extract a report based on structural rather than person-related units.

Similarly, look for the general case where the verb is equivalent to a Person reporting term:

```
elif verb.lemma_ in self.person_term_lemmas:
    return self.person_unit_lemmas, verb.lemma_
```
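
The three general cases amount to a lookup of the verb's lemma against three lists. The following is a minimal sketch under stated assumptions: the lemma lists are invented for illustration, `general_relevance` is a hypothetical name, the verb is a `SimpleNamespace` standing in for a Spacy token, and the `(None, None)` result for an irrelevant verb is an assumption rather than documented behaviour.

```python
from types import SimpleNamespace

# Invented lemma lists; the real ones are attributes of the interpreter.
JOINT_TERMS = {"hit"}
STRUCTURE_TERMS = {"destroy", "damage"}
PERSON_TERMS = {"displace", "evacuate"}
STRUCTURE_UNITS = ["home", "house"]
PERSON_UNITS = ["person", "family"]


def general_relevance(verb):
    """Return (unit_lemmas, term_lemma), or (None, None) if the verb is not relevant."""
    if verb.lemma_ in JOINT_TERMS:
        # Joint terms may apply to either kind of unit, so return both lists.
        return STRUCTURE_UNITS + PERSON_UNITS, verb.lemma_
    elif verb.lemma_ in STRUCTURE_TERMS:
        return STRUCTURE_UNITS, verb.lemma_
    elif verb.lemma_ in PERSON_TERMS:
        return PERSON_UNITS, verb.lemma_
    return None, None  # assumed fallback for irrelevant verbs


units, term = general_relevance(SimpleNamespace(lemma_="destroy"))
```

The first element of the result tells the caller which reporting units to look for around the verb; the second records the reporting term itself.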

These should cover the majority of the cases that we are looking for. However, there are also some special cases that need to be tested for individually.

Case 1: Phrases of the form 'leaving' + reporting term, e.g. 'leaving 60 people homeless'

Test for presence of 'leave' as a main verb:
```
elif verb.lemma_ == 'leave':
```
Identify the children of the verb via the parse tree:
```
children = verb.children
```

Cycle through the children and try to identify an object by looking for Direct Object or Object Predicate dependencies:

```
obj_predicate = None
for child in children:
    if child.dep_ in ('oprd', 'dobj'):
        obj_predicate = child
```

If obj_predicate exists, use tests similar to those above to determine whether it is relevant to Structural or Person-related reporting terms:

```
if obj_predicate:
    if obj_predicate.lemma_ in self.structure_term_lemmas:
        return self.structure_unit_lemmas, 'leave ' + obj_predicate.lemma_
    elif obj_predicate.lemma_ in self.person_term_lemmas:
        return self.person_unit_lemmas, 'leave ' + obj_predicate.lemma_
```
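
To make the 'leave' case concrete, here is a small sketch that runs the same checks on a hand-built parse of 'leaving 60 people homeless'. Everything here is an assumption made for illustration: the `token` helper, the `leave_relevance` name, the lemma lists, and in particular the dependency labels, which were assigned by hand rather than produced by the Spacy parser.

```python
from types import SimpleNamespace

# Invented lemma lists for illustration only.
STRUCTURE_TERMS = {"destroyed", "damaged"}
PERSON_TERMS = {"homeless", "displaced"}
STRUCTURE_UNITS = ["home", "house"]
PERSON_UNITS = ["person", "family"]


def token(lemma, dep, children=()):
    """Build a stand-in for a Spacy token with only the attributes used below."""
    return SimpleNamespace(lemma_=lemma, dep_=dep, children=list(children))


def leave_relevance(verb):
    if verb.lemma_ != "leave":
        return None, None
    obj_predicate = None
    for child in verb.children:
        # No break: if several children match, the last one wins.
        if child.dep_ in ("oprd", "dobj"):
            obj_predicate = child
    if obj_predicate is None:
        return None, None
    if obj_predicate.lemma_ in STRUCTURE_TERMS:
        return STRUCTURE_UNITS, "leave " + obj_predicate.lemma_
    if obj_predicate.lemma_ in PERSON_TERMS:
        return PERSON_UNITS, "leave " + obj_predicate.lemma_
    return None, None


# Hand-assigned parse: 'people' as direct object, 'homeless' as object predicate.
verb = token("leave", "ROOT", [token("people", "dobj"), token("homeless", "oprd")])
units, term = leave_relevance(verb)
```

Because the loop keeps the last matching child, 'homeless' is selected over 'people' in this parse, and the combined term becomes 'leave homeless'.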

Case 2: People or structures being "affected" by specific events

Test for presence of 'affect' as a main verb, and look in the article for key phrases relating to desired events:
```
elif verb.lemma_ == 'affect' and self.article_relevance(story):
```
If both conditions are met, return a combination of structural and person-related reporting units:
```
return self.structure_unit_lemmas + self.person_unit_lemmas, verb.lemma_
```

Case 3: Phrases of the form 'feared' + reporting term, e.g. '100 homes are feared damaged'

Test for presence of 'fear' or 'assume' as a main verb:
```
elif verb.lemma_ in ('fear', 'assume'):
```

Look at the main object of the verb, and compare it to the structural and person-related reporting terms:

```
verb_objects = textacy.spacy_utils.get_objects_of_verb(verb)
if verb_objects:
    verb_object = verb_objects[0]
    if verb_object.lemma_ in self.person_term_lemmas:
        return self.person_unit_lemmas, verb.lemma_ + " " + verb_object.text
    elif verb_object.lemma_ in self.structure_term_lemmas:
        return self.structure_unit_lemmas, verb.lemma_ + " " + verb_object.text
```

Case 4: Phrases of the form 'claimed XXX lives'

Check whether 'claim' is the main verb and 'lives' is among its objects:

```
elif verb.lemma_ == 'claim':
    dobjects = [v.text for v in textacy.spacy_utils.get_objects_of_verb(verb)]
    if 'lives' in dobjects:
        return self.person_unit_lemmas, verb.lemma_ + " " + "lives"
```

Branch Search

For each relevant verb that is encountered, initiate a 'branch search' to attempt to create one or more reports based on that verb:

```
reports = self.branch_search_new(verb, unit_type, dates_memory, locations_memory,
                                 sentence, story)
```

Parameters:

verb: the verb being explored; the snippets below access it through verb.token and verb.lemma_
unit_type: the relevant reporting units to look for, depending on the reporting term (this could be Structural, Person or a combination), a list of terms; the snippets below appear to refer to this argument as search_type
dates_memory: the most recently encountered dates in the article, a list of date-type Facts
locations_memory: the most recently encountered locations in the article, a list of location-type Facts
sentence: the sentence currently being processed, a Spacy Span object
story: the article currently being processed, a Spacy Doc object

The steps followed during the branch search are:

Search for locations and dates within the sentence; if none are found then revert to the most recently encountered ones:

```
possible_locations = self.extract_locations(sentence, verb.token)
possible_dates = self.extract_dates(sentence, story, verb.token)
if not possible_locations:
    possible_locations = locations_memory
if not possible_dates:
    possible_dates = dates_memory
```

Look for the subjects and objects of the verb within the sentence:

```
verb_objects = self.get_subjects_and_objects(story, sentence, verb.token)
```

If there are multiple possible nouns and it is unclear which is the correct one, try the one with the fewest descendants first (a verb object with many descendants is more likely to have its own verb as a descendant):
```
verb_descendent_counts = [(v, len(list(v.subtree))) for v in verb_objects]
verb_objects = [x[0] for x in sorted(verb_descendent_counts, key=lambda x: x[1])]
```
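
The ordering step can be tried out without a parser. In this sketch, `Node` is a hypothetical stand-in for a Spacy token whose `subtree` yields the node itself and all of its descendants, and `fewest_descendants_first` is an illustrative name for the two lines above; the example tree is hand-built.

```python
class Node:
    """Hypothetical stand-in for a Spacy token with a dependency subtree."""

    def __init__(self, text, children=()):
        self.text = text
        self.children = list(children)

    @property
    def subtree(self):
        # Yields the node itself and then all of its descendants.
        yield self
        for child in self.children:
            yield from child.subtree


def fewest_descendants_first(candidates):
    counts = [(c, len(list(c.subtree))) for c in candidates]
    # sorted() is stable, so candidates with equal counts keep their original order.
    return [c for c, _ in sorted(counts, key=lambda pair: pair[1])]


# 'houses' heads a relative clause (5 nodes); 'people' only has a number attached (2 nodes).
houses = Node("houses", [Node("that"), Node("collapsed", [Node("after"), Node("rains")])])
people = Node("people", [Node("200")])
ordered = fewest_descendants_first([houses, people])
```

Here 'people' is tried before 'houses', because the larger subtree under 'houses' contains a verb of its own.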
Loop through each subject and object:
```
for o in verb_objects:
```

Case 1:

Look first at the case where the subject or object is a number; the reporting unit may not be explicit and has to be inferred:
```
if self.basic_number(o):
```

Test whether the word following the number is the verb in question or, for a construction of the form 'leave ____', the word that fills the blank:

```
next_word = self.next_word(story, o)
if next_word and (next_word.i == verb.token.i or next_word.text == verb.lemma_.split(" ")[-1]):
```

Set the reporting unit based on the verb type:

```
if search_type == self.structure_term_lemmas:
    unit = 'house'
else:
    unit = 'person'
```

The quantity is the subject or object, and we can now create the report:

```
quantity = Fact(o, o, o.lemma_, 'quantity')
report = Report([p.text for p in possible_locations], [p.text for p in possible_dates], verb.lemma_,
                unit, quantity, story.text)
```

Record where each of the facts was found in the sentence (the tag spans), add the report to the list of reports, and at this point assume we are done for this verb:
```
report.tag_spans = self.set_report_span([verb, quantity, possible_locations, possible_dates])
reports.append(report)
break
```
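
The implicit-unit test in Case 1 can be illustrated on plain token lists. This is a sketch under assumptions: `implicit_unit` and `make_doc` are hypothetical names, tokens are `SimpleNamespace` objects carrying only an index `i` and a `text`, and the `structural` flag stands in for the comparison against the structural lemma list.

```python
from types import SimpleNamespace


def make_doc(words):
    """Build a list of stand-in tokens with an index and text."""
    return [SimpleNamespace(i=i, text=w) for i, w in enumerate(words)]


def implicit_unit(doc, quantity, verb, term, structural):
    """Return 'house' or 'person' if the number is directly followed by the
    verb (or by the last word of a 'leave ____' term), otherwise None."""
    if quantity.i + 1 >= len(doc):
        return None
    next_word = doc[quantity.i + 1]
    if next_word.i == verb.i or next_word.text == term.split(" ")[-1]:
        return "house" if structural else "person"
    return None


# "More than 300 evacuated after floods": the number directly precedes the verb.
doc_a = make_doc(["More", "than", "300", "evacuated", "after", "floods"])
unit_a = implicit_unit(doc_a, doc_a[2], doc_a[3], "evacuate", structural=False)

# "Fire leaves 20 homeless": the number precedes the last word of 'leave homeless'.
doc_b = make_doc(["Fire", "leaves", "20", "homeless"])
unit_b = implicit_unit(doc_b, doc_b[2], doc_b[1], "leave homeless", structural=False)

# "300 people evacuated": the unit is explicit, so nothing is inferred here.
doc_c = make_doc(["300", "people", "evacuated"])
unit_c = implicit_unit(doc_c, doc_c[0], doc_c[2], "evacuate", structural=False)
```

The third example falls through to Case 2, where the explicit unit 'people' is matched against the reporting units instead.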

Case 2:

Otherwise see if the subject or object matches a reporting unit:
```
elif o.lemma_ in search_type:
    reporting_unit = o
```
If the reporting unit is part of two noun clauses joined by a conjunction (NOUN CONJ NOUN), then try to extract a quantity based on the root of the noun clauses:
```
noun_conj = self.test_noun_conj(sentence, o)
if noun_conj:
    reporting_unit = noun_conj
    # Try and get a number - begin search from noun conjunction root.
```

Otherwise try to extract a quantity based directly on the noun:

```
else:
    # Try and get a number - begin search from noun.
    quantity = self.get_quantity(sentence, o)
```

As above, create a report, get the fact locations and break:

```
reporting_unit = Fact(reporting_unit, reporting_unit, reporting_unit.lemma_, "unit")
report = Report([p.text for p in possible_locations], [p.text for p in possible_dates], verb.lemma_,
                reporting_unit.lemma_, quantity, story.text)
report.tag_spans = self.set_report_span([verb, quantity, reporting_unit, possible_locations, possible_dates])
reports.append(report)
break
```

Extract Subjects and Objects

While attempting to complete a report based upon a verb, start by getting the subjects and objects of that verb:

```
verb_objects = self.get_subjects_and_objects(story, sentence, verb.token)
```

Parameters:

story: the article currently being processed, a Spacy Doc object
sentence: the sentence currently being processed, a Spacy Span object
verb.token: the verb token currently being processed, a Spacy Token object

The core of this function is based on the textacy.spacy_utils functions get_objects_of_verb and get_subjects_of_verb, which are combined in:

```
def simple_subjects_and_objects(self, verb):
    verb_objects = textacy.spacy_utils.get_objects_of_verb(verb)
    verb_subjects = textacy.spacy_utils.get_subjects_of_verb(verb)
    verb_objects.extend(verb_subjects)
    return verb_objects
```

This list is then extended based upon a number of special cases.

Case 1: Look at certain types of tokens that directly precede the verb:

```
if verb.i > 0:
    preceding = story[verb.i - 1]
    if preceding.dep_ in ('pobj', 'dobj', 'nsubj', 'conj') and preceding not in verb_objects:
        verb_objects.append(preceding)
```

Case 2: Look at certain types of tokens that directly follow the verb:

```
if verb.i < len(story) - 1:
    following = story[verb.i + 1]
    if following.dep_ in ('pobj', 'dobj', 'ROOT') and following not in verb_objects:
        verb_objects.append(following)
```

Case 3: See if the verb is part of a conjunction, and add subject tokens to its left or, if there are no tokens to its left, the subjects and objects of its ancestors:

```
if verb.dep_ == 'conj':
    lefts = list(verb.lefts)
    if len(lefts) > 0:
        for token in lefts:
            if token.dep_ in ('nsubj', 'nsubjpass'):
                verb_objects.append(token)
    else:
        ancestors = verb.ancestors
        for anc in ancestors:
            verb_objects.extend(self.simple_subjects_and_objects(anc))
```

Case 4: See if the verb is the root of the sentence, and look for prepositional objects:

```
if verb.dep_ == 'ROOT':
    for token in sentence:
        if token.dep_ == 'pobj':
            verb_objects.append(token)
```

Case 5: See if the verb is part of a relative clause, and look for nouns within the relative clause:

```
if verb.dep_ == 'relcl':
    relcl_noun = self.nouns_from_relative_clause(sentence, verb)
    if relcl_noun:
        verb_objects.append(relcl_noun)
```