Fact Extraction Functions
Article processing is initiated by calling interpreter.process_article_new(story), where story is a string.
The steps are as follows:
The story string is cleaned to deal with some formatting errors such as lack of spacing between numbers and text:
story = self.cleanup(story)
We initialize an empty array for keeping track of encountered reports:
processed_reports = []
Convert the story into a Spacy object by running it through the NLP parser, and split it into sentences:
story = self.nlp(story)
sentences = list(story.sents)
Initialize two variables for keeping track of the most recently encountered dates and locations in the article:
dates_memory = None
locations_memory = None
Loop through each sentence, and attempt to extract a report using the process_sentence_new function:
for sentence in sentences:
    reports = []
    reports = self.process_sentence_new(
        sentence, dates_memory, locations_memory, story)
If the current sentence contains locations (regardless of whether a report has been generated), update the locations_memory variable:
current_locations = self.extract_locations(sentence)
if current_locations:
    locations_memory = current_locations
Similarly, if the sentence contains dates, update the dates_memory variable:
current_dates = self.extract_dates(sentence, story)
if current_dates:
    dates_memory = current_dates
Add any extracted reports to the list of reports for the article:
processed_reports.extend(reports)
When we are finished processing all of the sentences, return the list of unique reports:
return list(set(processed_reports))
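The article-level steps above can be sketched end to end without Spacy. This is a minimal illustration, not the project's implementation: the `Report` dataclass, the regex in `cleanup`, the naive sentence splitter, and the keyword-based extractors are all hypothetical stand-ins. It shows how the date and location memories carry over between sentences, and that `list(set(...))` only removes duplicates when report objects are hashable and compare by value.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen dataclasses are hashable, so set() can deduplicate them
class Report:
    location: str
    date: str
    text: str


# Hypothetical lookup tables standing in for real location and date extraction.
KNOWN_LOCATIONS = ("Kabul", "Lima")
KNOWN_DATES = ("Monday", "Tuesday")


def cleanup(story):
    # Assumed cleanup rule: insert a space between a digit and a following letter.
    return re.sub(r"(\d)([A-Za-z])", r"\1 \2", story)


def split_sentences(story):
    # Naive stand-in for list(self.nlp(story).sents).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", story) if s.strip()]


def extract_locations(sentence):
    return [name for name in KNOWN_LOCATIONS if name in sentence]


def extract_dates(sentence):
    return [name for name in KNOWN_DATES if name in sentence]


def process_sentence(sentence, dates_memory, locations_memory):
    # Toy rule: only sentences mentioning "displaced" produce a report.
    if "displaced" not in sentence:
        return []
    # Fall back to the remembered values when the sentence has none of its own.
    locations = extract_locations(sentence) or locations_memory or []
    dates = extract_dates(sentence) or dates_memory or []
    return [Report(", ".join(locations), ", ".join(dates), sentence)]


def process_article(story):
    story = cleanup(story)
    processed_reports = []
    dates_memory = None
    locations_memory = None
    for sentence in split_sentences(story):
        reports = process_sentence(sentence, dates_memory, locations_memory)
        # Update the memories whether or not a report was produced.
        current_locations = extract_locations(sentence)
        if current_locations:
            locations_memory = current_locations
        current_dates = extract_dates(sentence)
        if current_dates:
            dates_memory = current_dates
        processed_reports.extend(reports)
    return list(set(processed_reports))
```

Because the memories are updated after each sentence is processed, a sentence with no location or date of its own inherits them from the most recent earlier sentence that had them.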
Sentence processing is initiated by a call to:
process_sentence_new(sentence, dates_memory, locations_memory, story)
- sentence: the sentence to be processed, a Spacy object
- dates_memory: the most recently extracted dates from the article
- locations_memory: the most recently extracted locations from the article
- story: the article currently being processed, a Spacy object
The steps for processing a sentence are:
Initialize a list for storing extracted reports:
sentence_reports = []
Extract the main verbs from the sentence (using the Textacy library that provides a wrapper around Spacy along with many useful functions). Here we expect the verbs to correspond to desired reporting terms:
main_verbs = textacy.spacy_utils.get_main_verbs_of_sent(sentence)
Loop through each verb, and determine whether or not it is relevant to human displacement, and whether it is a Structural or Person-related reporting term:
for v in main_verbs:
    unit_type, verb_lemma = self.verb_relevance(v, story)
If the verb is determined to be relevant, initiate a branch search, based upon that verb to try and extract one or more reports:
reports = self.branch_search_new(v, verb_lemma, unit_type, dates_memory,
                                 locations_memory, sentence, story)
sentence_reports.extend(reports)
Once all of the verbs have been processed, return the list of extracted reports:
return sentence_reports
During sentence processing, each verb is tested for relevance to human displacement:
verb_relevance(verb, story)
- verb: the verb being tested, a Spacy object
- story: the article currently being processed, a Spacy object
The steps for determining verb relevance are:
Start by looking for the case where the verb is relevant to both Structural and Person reporting terms:
if verb.lemma_ in self.joint_term_lemmas:
    return self.structure_unit_lemmas + self.person_unit_lemmas, Fact(verb, verb, verb.lemma_, "term")
Otherwise look for the general case where the verb is equivalent to a Structural reporting term:
elif verb.lemma_ in self.structure_term_lemmas:
    return self.structure_unit_lemmas, verb.lemma_
Here we return the structural unit lemmas to indicate that we should attempt to extract a report based on structural rather than person-related terms.
Similarly, look for the general case where the verb is equivalent to a Person reporting term:
elif verb.lemma_ in self.person_term_lemmas:
    return self.person_unit_lemmas, verb.lemma_
These checks should cover the majority of the cases that we are looking for; however, there are also some special cases that need to be tested for individually.
Case 1: Phrases of the form 'leaving' + reporting term, e.g. 'leaving 60 people homeless'
Test for presence of 'leave' as a main verb:
elif verb.lemma_ == 'leave':
Identify the children of the verb via the parse tree:
children = verb.children
Cycle through the children and try and identify an object by looking at dependencies for Direct Objects or Object Predicates:
obj_predicate = None
for child in children:
    if child.dep_ in ('oprd', 'dobj'):
        obj_predicate = child
If obj_predicate exists, use tests similar to those above to determine whether it is relevant to Structural or Person-related reporting terms:
if obj_predicate is not None:
    if obj_predicate.lemma_ in self.structure_term_lemmas:
        return self.structure_unit_lemmas, 'leave ' + obj_predicate.lemma_
    elif obj_predicate.lemma_ in self.person_term_lemmas:
        return self.person_unit_lemmas, 'leave ' + obj_predicate.lemma_
Case 2: People or structures being "affected" by specific events
Test for presence of 'affect' as a main verb, and look in the article for key phrases relating to desired events:
elif verb.lemma_ == 'affect' and self.article_relevance(story):
If both conditions are met, return a combination of structural and person-related reporting units:
return self.structure_unit_lemmas + self.person_unit_lemmas, verb.lemma_
Case 3: Phrases of the form 'feared' + reporting term, e.g. '100 homes are feared damaged'
Test for presence of 'fear' or 'assume' as a main verb:
elif verb.lemma_ in ('fear', 'assume'):
Look at the main object of the verb, and compare it to the structural and person-related reporting terms:
verb_objects = textacy.spacy_utils.get_objects_of_verb(verb)
if verb_objects:
    verb_object = verb_objects[0]
    if verb_object.lemma_ in self.person_term_lemmas:
        return self.person_unit_lemmas, verb.lemma_ + " " + verb_object.text
    elif verb_object.lemma_ in self.structure_term_lemmas:
        return self.structure_unit_lemmas, verb.lemma_ + " " + verb_object.text
Case 4: Phrases of the form 'claimed XXX lives'
Look to see if 'claim' is the main verb and life or lives its direct object:
elif verb.lemma_ == 'claim':
    dobjects = [v.text for v in textacy.spacy_utils.get_objects_of_verb(verb)]
    if 'lives' in dobjects:
        return self.person_unit_lemmas, verb.lemma_ + " " + "lives"
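The lookup logic above can be tried out without loading a language model. The following is a simplified sketch, not the project's code: `Token` is a stand-in exposing only the attributes used here, the term and unit lists are invented examples, and returning `(None, None)` for an irrelevant verb is an assumption. It covers the general structural and person cases plus the 'leave' special case.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    # Minimal stand-in for a Spacy token: only the attributes used below.
    lemma_: str
    dep_: str = ""
    children: list = field(default_factory=list)


# Hypothetical lemma lists; the real ones live on the interpreter object.
STRUCTURE_TERMS = {"destroy", "damage"}
PERSON_TERMS = {"displace", "evacuate", "homeless"}
STRUCTURE_UNITS = ["house", "home"]
PERSON_UNITS = ["person", "family"]


def verb_relevance(verb):
    """Return (unit lemmas to search for, reporting term), or (None, None)."""
    if verb.lemma_ in STRUCTURE_TERMS:
        return STRUCTURE_UNITS, verb.lemma_
    if verb.lemma_ in PERSON_TERMS:
        return PERSON_UNITS, verb.lemma_
    if verb.lemma_ == "leave":
        # Special case, e.g. 'leaving 60 people homeless': inspect each child
        # that is a direct object or object predicate of the verb.
        for child in verb.children:
            if child.dep_ in ("oprd", "dobj"):
                if child.lemma_ in STRUCTURE_TERMS:
                    return STRUCTURE_UNITS, "leave " + child.lemma_
                if child.lemma_ in PERSON_TERMS:
                    return PERSON_UNITS, "leave " + child.lemma_
    # Assumed convention for "not relevant".
    return None, None
```

The first element of the result tells the caller which kind of reporting unit to look for; the second is the reporting term recorded on the report.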
For each relevant verb that is encountered, initiate a 'branch search' to attempt to create one or more reports based on that verb:
reports = self.branch_search_new(verb, unit_type, dates_memory, locations_memory,
sentence, story)
- verb: the verb being explored
- unit_type: the relevant reporting units to look for, depending on the reporting term (this could be Structural, Person or a combination), a list of terms
- dates_memory: the most recently encountered dates in the article, a list of date-type Facts
- locations_memory: the most recently encountered locations in the article, a list of location-type Facts
- story: the article currently being processed, a Spacy object
The steps followed during the branch search are:
Search for locations and dates within the sentence; if none are found then revert to the most recently encountered ones:
possible_locations = self.extract_locations(sentence, verb.token)
possible_dates = self.extract_dates(sentence, story, verb.token)
if not possible_locations:
    possible_locations = locations_memory
if not possible_dates:
    possible_dates = dates_memory
Look for the subjects and objects of the verb within the sentence:
verb_objects = self.get_subjects_and_objects(story, sentence, verb.token)
If there are multiple possible nouns and it is unclear which is the correct one, try the one with the fewest descendants first (a verb object with many descendants is more likely to have its own verb as a descendant):
verb_descendent_counts = [(v, len(list(v.subtree))) for v in verb_objects]
verb_objects = [x[0] for x in sorted(verb_descendent_counts, key=lambda x: x[1])]
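To make the ordering heuristic concrete, here is a small sketch using a hypothetical `Node` class in place of Spacy tokens. Its `subtree` property mimics Spacy's `Token.subtree`, which yields the token itself followed by all of its syntactic descendants.

```python
class Node:
    """Hypothetical stand-in for a parsed token with syntactic children."""

    def __init__(self, text, children=()):
        self.text = text
        self.children = list(children)

    @property
    def subtree(self):
        # Yield this node and then every descendant, depth first.
        yield self
        for child in self.children:
            yield from child.subtree


def order_by_subtree_size(candidates):
    # Candidates with the smallest subtrees come first.
    return sorted(candidates, key=lambda node: len(list(node.subtree)))
```

For example, a bare noun such as "homes" (subtree size 1) is tried before "people" when "people" heads a relative clause such as "people who fled" (subtree size 3).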
Loop through each subject and object:
for o in verb_objects:
Case 1:
Look first at the case where the subject or object is a number; the reporting unit may not be explicit and may have to be inferred:
if self.basic_number(o):
Test whether the word following the number is either the verb in question or, for constructions of the form 'leave ____', the word that fills the blank:
next_word = self.next_word(story, o)
if next_word and (next_word.i == verb.token.i or next_word.text == verb.lemma_.split(" ")[-1]):
Set the reporting unit based on the verb type:
if search_type == self.structure_term_lemmas:
    unit = 'house'
else:
    unit = 'person'
The quantity is the subject or object, and we can now create the report:
quantity = Fact(o, o, o.lemma_, 'quantity')
report = Report([p.text for p in possible_locations],
                [p.text for p in possible_dates],
                verb.lemma_, unit, quantity, story.text)
Record the positions within the text of each fact found (the tag spans), add the report to the list, and at this point assume we are done for this verb:
report.tag_spans = self.set_report_span([verb, quantity, possible_locations, possible_dates])
reports.append(report)
break
Case 2:
Otherwise see if the subject / object matches a reporting unit:
elif o.lemma_ in search_type:
    reporting_unit = o
If the reporting unit is part of two noun clauses joined by a conjunction (NOUN CONJ NOUN), then try to extract a quantity based on the root of the noun clauses:
noun_conj = self.test_noun_conj(sentence, o)
if noun_conj:
    reporting_unit = noun_conj
    # Try and get a number - begin search from noun conjunction root.
Otherwise try and extract a quantity based directly on the noun:
else:
    # Try and get a number - begin search from noun.
    quantity = self.get_quantity(sentence, o)
As per above, create a report, get the fact locations and break:
reporting_unit = Fact(reporting_unit, reporting_unit, reporting_unit.lemma_, "unit")
report = Report([p.text for p in possible_locations],
                [p.text for p in possible_dates],
                verb.lemma_, reporting_unit.lemma_, quantity, story.text)
report.tag_spans = self.set_report_span([verb, quantity, reporting_unit, possible_locations, possible_dates])
reports.append(report)
break
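The two cases in the loop above can be illustrated with a much-reduced sketch. Everything here is a simplifying assumption rather than the project's logic: `Word` stands in for a Spacy token, `is_number` only recognises digit strings, and the quantity search in Case 2 looks only at the immediately preceding word, whereas the real `get_quantity` searches the parse tree.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical stand-in for a token: surface text, lemma, position in the sentence.
    text: str
    lemma: str
    i: int


def is_number(word):
    return word.text.replace(",", "").isdigit()


def find_unit_and_quantity(words, verb, candidates, unit_lemmas, default_unit):
    """Return (unit, quantity) for the first candidate that fits, else None."""
    for cand in candidates:
        if is_number(cand):
            # Case 1: a number directly followed by the verb, so the unit is implied.
            if cand.i + 1 < len(words) and words[cand.i + 1].i == verb.i:
                return default_unit, cand.text
        elif cand.lemma in unit_lemmas:
            # Case 2: an explicit reporting unit; use a preceding number as the quantity.
            prev = words[cand.i - 1] if cand.i > 0 else None
            quantity = prev.text if prev is not None and is_number(prev) else None
            return cand.lemma, quantity
    return None
```

As in the original loop, the search stops at the first candidate that yields a report, which is why the ordering of candidates matters.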
While attempting to complete a report based upon a verb, start by getting the subjects and objects of that verb:
verb_objects = self.get_subjects_and_objects(story, sentence, verb.token)
- story: the article currently being processed, a Spacy object
- sentence: the sentence currently being processed, a Spacy object
- verb.token: the verb token currently being processed, a Spacy object
The core of this function is based on the Textacy.spacy_utils functions get_objects_of_verb and get_subjects_of_verb, which are implemented through:
def simple_subjects_and_objects(self, verb):
    verb_objects = textacy.spacy_utils.get_objects_of_verb(verb)
    verb_subjects = textacy.spacy_utils.get_subjects_of_verb(verb)
    # Combine subjects with objects so that both are returned
    verb_objects.extend(verb_subjects)
    return verb_objects
This list is then extended based upon a number of special cases.
Case 1: Look at certain types of tokens that directly precede the verb:
if verb.i > 0:
    preceding = story[verb.i - 1]
    if preceding.dep_ in ('pobj', 'dobj', 'nsubj', 'conj') and preceding not in verb_objects:
Case 2: Look at certain types of tokens that directly follow the verb:
if verb.i < len(story) - 1:
    following = story[verb.i + 1]
    if following.dep_ in ('pobj', 'dobj', 'ROOT') and following not in verb_objects:
Case 3: See if verb is part of a conjunction, and add certain tokens that are to the left-of or ancestors of the verb:
if verb.dep_ == 'conj':
    lefts = list(verb.lefts)
    if len(lefts) > 0:
        for token in lefts:
            if token.dep_ in ('nsubj', 'nsubjpass'):
    ancestors = verb.ancestors
    for anc in ancestors:
Case 4: See if verb is Root of sentence, and look for prepositional objects
if verb.dep_ == 'ROOT':
    for token in sentence:
        if token.dep_ == 'pobj':
Case 5: See if verb is part of a relative clause, look for nouns within the relative clause
if verb.dep_ == 'relcl':
    relcl_noun = self.nouns_from_relative_clause(sentence, verb)
    if relcl_noun:
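Cases 1 and 2 above (tokens directly before and after the verb) can be sketched with stand-in tokens. The snippets in this section show only the conditions, so the action taken when a condition matches, appending the token to the candidate list, is an assumption here, as are the `Tok` class and the `add_neighbours` name.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False keeps identity comparison, so "not in" checks the same object
class Tok:
    # Hypothetical stand-in for a Spacy token.
    text: str
    dep_: str
    i: int


def add_neighbours(doc, verb, verb_objects):
    """Extend the candidate list with suitable tokens adjacent to the verb."""
    found = list(verb_objects)
    # Case 1: the token directly preceding the verb.
    if verb.i > 0:
        preceding = doc[verb.i - 1]
        if preceding.dep_ in ("pobj", "dobj", "nsubj", "conj") and preceding not in found:
            found.append(preceding)
    # Case 2: the token directly following the verb.
    if verb.i < len(doc) - 1:
        following = doc[verb.i + 1]
        if following.dep_ in ("pobj", "dobj", "ROOT") and following not in found:
            found.append(following)
    return found
```

The `not in` checks keep a token from being listed twice when it has already been found by the core subject and object lookup.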