Fact Extraction Functions

Article Processing

Article processing is initiated by calling interpreter.process_article_new(story), where story is a string containing the article text.

The steps are as follows:

  1. The story string is cleaned to deal with some formatting errors, such as a lack of spacing between numbers and text:

    story = self.cleanup(story)
    
  2. We initialize an empty array for keeping track of encountered reports:

    processed_reports = []
    
  3. Convert the story into a Spacy Doc object by running it through the NLP parser, and split it into sentences:

    story = self.nlp(story)
    sentences = list(story.sents)
    
  4. Initialize two variables for keeping track of the most recently encountered dates and locations in the article:

    dates_memory = None
    locations_memory = None
    
  5. Loop through each sentence, and attempt to extract a report using the process_sentence_new function:

    for sentence in sentences:
        reports = []
        reports = self.process_sentence_new(
                    sentence, dates_memory, locations_memory, story)
    
  6. If the current sentence contains locations (independent of whether or not a report has been generated), then we should update the locations_memory variable:

    current_locations = self.extract_locations(sentence)
    if current_locations:
        locations_memory = current_locations
    
  7. Similarly, if the sentence contains dates, then update the dates_memory variable:

    current_dates = self.extract_dates(sentence, story)
    if current_dates:
        dates_memory = current_dates
    
  8. Add any extracted reports to the list of reports for the article:

    processed_reports.extend(reports)
    
  9. When we are finished processing all of the sentences, return the list of unique reports:

    return list(set(processed_reports))
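
Putting these steps together, a minimal sketch of the whole loop (assembled from the snippets above, assuming the helper methods behave as described on this page) looks like:

def process_article_new(self, story):
    processed_reports = []
    story = self.cleanup(story)
    story = self.nlp(story)
    sentences = list(story.sents)
    dates_memory = None
    locations_memory = None
    for sentence in sentences:
        reports = self.process_sentence_new(
            sentence, dates_memory, locations_memory, story)
        # Update the running location/date memories from the current sentence
        current_locations = self.extract_locations(sentence)
        if current_locations:
            locations_memory = current_locations
        current_dates = self.extract_dates(sentence, story)
        if current_dates:
            dates_memory = current_dates
        processed_reports.extend(reports)
    return list(set(processed_reports))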
    

Sentence Processing

Sentence processing is initiated by a call to:

process_sentence_new(sentence, dates_memory, locations_memory, story)

Parameters:

  • sentence: the sentence to be processed, a Spacy Span object
  • dates_memory: the most recently extracted dates from the article
  • locations_memory: the most recently extracted locations from the article
  • story: the article currently being processed, a Spacy Doc object

The steps for processing a sentence are:

  1. Initialize a list for storing extracted reports:

    sentence_reports = []
    
  2. Extract the main verbs from the sentence (using the Textacy library, which provides a wrapper around Spacy along with many useful functions). Here we expect the verbs to correspond to the desired reporting terms:

    main_verbs = textacy.spacy_utils.get_main_verbs_of_sent(sentence)
    
  3. Loop through each verb, and determine whether or not it is relevant to human displacement, and whether it is a Structural or Person-related reporting term:

    for v in main_verbs:
        unit_type, verb_lemma = self.verb_relevance(v, story)
    
  4. If the verb is determined to be relevant, initiate a branch search based upon that verb, to try and extract one or more reports:

    reports = self.branch_search_new(v, verb_lemma, unit_type, dates_memory, locations_memory,
                                     sentence, story)
    sentence_reports.extend(reports)
    
  5. Once all of the verbs have been processed, return the list of extracted reports:

    return sentence_reports
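
Assembled into a single function, the sentence-processing flow above might look like the following sketch (the explicit check on the return value of verb_relevance is an assumption about how irrelevant verbs are skipped):

def process_sentence_new(self, sentence, dates_memory, locations_memory, story):
    sentence_reports = []
    main_verbs = textacy.spacy_utils.get_main_verbs_of_sent(sentence)
    for v in main_verbs:
        relevance = self.verb_relevance(v, story)
        # Assumption: verb_relevance returns None (falsy) for irrelevant verbs
        if relevance:
            unit_type, verb_lemma = relevance
            reports = self.branch_search_new(v, verb_lemma, unit_type,
                                             dates_memory, locations_memory,
                                             sentence, story)
            sentence_reports.extend(reports)
    return sentence_reports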
    

Determining Verb Relevance

During sentence processing, each verb is tested for relevance to human displacement:

verb_relevance(verb, story)

Parameters:

  • verb: the verb being tested, a Spacy Token object
  • story: the article currently being processed, a Spacy Doc object

The steps for determining verb relevance are:

  1. Start by looking for the case where the verb is relevant to both Structural and Person reporting terms:

    if verb.lemma_ in self.joint_term_lemmas:
        return self.structure_unit_lemmas + self.person_unit_lemmas, Fact(verb, verb, verb.lemma_, "term")
    
    
  2. Otherwise look for the general case where the verb is equivalent to a Structural reporting term:

    elif verb.lemma_ in self.structure_term_lemmas:
        return self.structure_unit_lemmas, verb.lemma_

    Here we return the structural unit lemmas to indicate that we should attempt to extract a report based on structural rather than person-related terms.
    
    
  3. Similarly, look for the general case where the verb is equivalent to a Person reporting term:

    elif verb.lemma_ in self.person_term_lemmas:
        return self.person_unit_lemmas, verb.lemma_
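
The matching above is done on lemmas, so inflected verb forms all reduce to a common base form before being compared against the term lists. A quick illustration with spaCy (the model name is an assumption about the local installation, and output is typical but model-dependent):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Hundreds of homes were destroyed and thousands were displaced.')
for tok in doc:
    if tok.pos_ == 'VERB':
        # Inflected forms map to the lemmas that the term lists contain
        print(tok.text, '->', tok.lemma_)
# destroyed -> destroy
# displaced -> displace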
    
    

This should cover the majority of the cases that we are looking for; however, there are also some special cases that need to be taken into account and tested for individually.

Case 1: Phrases of the form 'leaving' + reporting term, e.g. 'leaving 60 people homeless'

  1. Test for presence of 'leave' as a main verb:

    elif verb.lemma_ == 'leave':
    
  2. Identify the children of the verb via the parse tree:

    children = verb.children
    
  3. Cycle through the children and try to identify an object by looking at dependencies for Direct Objects or Object Predicates:

    obj_predicate = None
    for child in children:
        if child.dep_ in ('oprd', 'dobj'):
            obj_predicate = child
    
  4. If the obj_predicate exists, then use tests similar to those above to determine whether it is relevant to Structural or Person-related reporting terms:

    if obj_predicate.lemma_ in self.structure_term_lemmas:
        return self.structure_unit_lemmas, 'leave ' + obj_predicate.lemma_
    elif obj_predicate.lemma_ in self.person_term_lemmas:
        return self.person_unit_lemmas, 'leave ' + obj_predicate.lemma_
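
To see the dependency labels this case relies on, it can help to inspect a sample parse; a minimal check (sentence and model are illustrative, and exact labels can vary between models):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The storm left 60 people homeless.')
for tok in doc:
    print(tok.text, tok.dep_, '<-', tok.head.text)
# 'homeless' typically attaches to 'left' with dep_ == 'oprd',
# which is what the loop over verb.children is looking for.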
    

Case 2: People or structures being "affected" by specific events

  1. Test for presence of 'affect' as a main verb, and look in the article for key phrases relating to desired events:

    elif verb.lemma_ == 'affect' and self.article_relevance(story):
    
  2. If both conditions are met, return a combination of structural and person-related reporting units:

    return self.structure_unit_lemmas + self.person_unit_lemmas, verb.lemma_
    

Case 3: Phrases of the form 'feared' + reporting term, e.g. '100 homes are feared damaged'

  1. Test for presence of 'fear' or 'assume' as a main verb:

    elif verb.lemma_ in ('fear', 'assume'):
    
  2. Look at the main object of the verb, and compare it to structural and person-related units:

    verb_objects = textacy.spacy_utils.get_objects_of_verb(verb)
    if verb_objects:
        verb_object = verb_objects[0]
        if verb_object.lemma_ in self.person_term_lemmas:
            return self.person_unit_lemmas, verb.lemma_ + " " + verb_object.text
        elif verb_object.lemma_ in self.structure_term_lemmas:
            return self.structure_unit_lemmas, verb.lemma_ + " " + verb_object.text
    
    

Case 4: Phrases of the form 'claimed XXX lives'

  1. Look to see if 'claim' is the main verb and 'lives' is its direct object:

    elif verb.lemma_ == 'claim':
        dobjects = [v.text for v in textacy.spacy_utils.get_objects_of_verb(verb)]
        if 'lives' in dobjects:
            return self.person_unit_lemmas, verb.lemma_ + " " + "lives"
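
The get_objects_of_verb helper used throughout these cases can be exercised directly; a small sketch of the 'claimed lives' pattern (assuming a textacy release that still ships the textacy.spacy_utils module, as this code targets; later releases moved these helpers):

import spacy
import textacy.spacy_utils

nlp = spacy.load('en_core_web_sm')
doc = nlp('The floods claimed 12 lives.')
verb = [t for t in doc if t.lemma_ == 'claim'][0]
dobjects = [t.text for t in textacy.spacy_utils.get_objects_of_verb(verb)]
print(dobjects)  # typically ['lives']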
    

Branch Search

For each relevant verb that is encountered, initiate a 'branch search' to attempt to create one or more reports based on that verb:

reports = self.branch_search_new(verb, verb_lemma, unit_type, dates_memory,
                                 locations_memory, sentence, story)

Parameters:

  • verb: the verb being explored
  • verb_lemma: the lemma (or combined term) for the verb, as returned by verb_relevance
  • unit_type: the relevant reporting units to look for depending on the reporting term (this could be Structural, Person or a combination), a list of terms
  • dates_memory: the most recently encountered dates in the article, a list of date-type Facts
  • locations_memory: the most recently encountered locations in the article, a list of location-type Facts
  • sentence: the sentence currently being processed, a Spacy Span object
  • story: the article currently being processed, a Spacy Doc object

The steps followed during the branch search are:

  1. Search for locations and dates within the sentence; if none are found, then revert to the most recently encountered ones:

    possible_locations = self.extract_locations(sentence, verb.token)
    possible_dates = self.extract_dates(sentence, story, verb.token)
    if not possible_locations:
        possible_locations = locations_memory
    if not possible_dates:
        possible_dates = dates_memory
    
  2. Look for the subjects and objects of the verb within the sentence:

    verb_objects = self.get_subjects_and_objects(story, sentence, verb.token)
    
  3. If there are multiple possible nouns and it is unclear which is the correct one, choose the one with the fewest descendants, since a verb object with many descendants is more likely to have its own verb as a descendant (see the sketch after this list):

    verb_descendent_counts = [(v, len(list(v.subtree))) for v in verb_objects]
    verb_objects = [x[0] for x in sorted(verb_descendent_counts, key=lambda x: x[1])]
    
  4. Loop through each subject and object:

    for o in verb_objects:
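
Before working through the individual cases, here is a self-contained sketch of the descendant-count heuristic from step 3 (the sentence and model are illustrative assumptions):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Officials said the flood destroyed 200 homes in the north.')
candidates = [t for t in doc if t.dep_ in ('nsubj', 'dobj')]
# Rank candidates by subtree size: a small subtree is less likely to
# contain another verb, so it is tried first.
ranked = sorted(candidates, key=lambda t: len(list(t.subtree)))
print([(t.text, len(list(t.subtree))) for t in ranked])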
    

Case 1:

  1. Look first at the case where the subject or object is a number; it may be the case that the reporting unit is not explicit but inferred:

    if self.basic_number(o):
    
  2. Test whether the next word is either the verb in question or, for combined terms of the form 'leave ____', the word that completes the term:

    next_word = self.next_word(story, o)
    if next_word and (next_word.i == verb.token.i or next_word.text == verb.lemma_.split(" ")[-1]):
    
  3. Set the reporting unit based on the verb type:

    if search_type == self.structure_term_lemmas:
        unit = 'house'
    else:
        unit = 'person'
    
  4. The quantity is the subject or object, and we can now create the report:

    quantity = Fact(o, o, o.lemma_, 'quantity')
    report = Report([p.text for p in possible_locations], [p.text for p in possible_dates], verb.lemma_,
                    unit, quantity, story.text)
    
  5. Record the spans of the verb and each of the facts found, add the report to the list, and at this point assume that we are done with this verb:

    report.tag_spans = self.set_report_span([verb, quantity, possible_locations, possible_dates])
    reports.append(report)
    break
    

Case 2:

  1. Otherwise, see if the subject / object matches a reporting unit:

    elif o.lemma_ in search_type:
        reporting_unit = o
    
  2. If the reporting unit is part of two noun clauses joined by a conjunction (NOUN CONJ NOUN), then try and extract a quantity based on the root of the noun clauses:

    noun_conj = self.test_noun_conj(sentence, o)
    if noun_conj:
        reporting_unit = noun_conj
        # Try and get a number - begin search from the noun conjunction root.
        quantity = self.get_quantity(sentence, noun_conj.root)
    
  3. Otherwise try and extract a quantity based directly on the noun:

    else:
        # Try and get a number - begin search from noun.
        quantity = self.get_quantity(sentence, o)
    
    
  4. As above, create the report, record the spans of the facts found, and break:

    reporting_unit = Fact(reporting_unit, reporting_unit, reporting_unit.lemma_, "unit")
    report = Report([p.text for p in possible_locations], [p.text for p in possible_dates], verb.lemma_,
                   reporting_unit.lemma_, quantity, story.text)
    report.tag_spans = self.set_report_span([verb, quantity, reporting_unit, possible_locations, possible_dates])
    reports.append(report)
    break
    
    

Extract Subjects and Objects

While attempting to complete a report based upon a verb, start by getting the subjects and objects of that verb:

verb_objects = self.get_subjects_and_objects(story, sentence, verb.token)

Parameters:

  • story: the article currently being processed, a Spacy Doc object
  • sentence: the sentence currently being processed, a Spacy Span object
  • verb.token: the verb token currently being processed, a Spacy Token object

The core of this function is based on the Textacy.spacy_utils functions get_objects_of_verb and get_subjects_of_verb, which are combined through:

def simple_subjects_and_objects(self, verb):
    verb_objects = textacy.spacy_utils.get_objects_of_verb(verb)
    verb_subjects = textacy.spacy_utils.get_subjects_of_verb(verb)
    verb_objects.extend(verb_subjects)
    return verb_objects
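
These Textacy helpers can also be exercised directly; a minimal sketch (again assuming a textacy release that provides textacy.spacy_utils):

import spacy
import textacy.spacy_utils

nlp = spacy.load('en_core_web_sm')
doc = nlp('The earthquake displaced thousands of residents.')
verb = [t for t in doc if t.lemma_ == 'displace'][0]
print(textacy.spacy_utils.get_subjects_of_verb(verb))  # typically [earthquake]
print(textacy.spacy_utils.get_objects_of_verb(verb))   # typically [thousands]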

This list is then extended based upon a number of special cases.

Case 1: Look at certain types of tokens that directly precede the verb:

if verb.i > 0:
    preceding = story[verb.i - 1]
    if preceding.dep_ in ('pobj', 'dobj', 'nsubj', 'conj') and preceding not in verb_objects:
        verb_objects.append(preceding)

Case 2: Look at certain types of tokens that directly follow the verb:

if verb.i < len(story) - 1:
    following = story[verb.i + 1]
    if following.dep_ in ('pobj', 'dobj', 'ROOT') and following not in verb_objects:
        verb_objects.append(following)

Case 3: See if the verb is part of a conjunction, and add certain tokens that are to the left of, or are ancestors of, the verb:

if verb.dep_ == 'conj':
    lefts = list(verb.lefts)
    if len(lefts) > 0:
        for token in lefts:
            if token.dep_ in ('nsubj', 'nsubjpass'):
                verb_objects.append(token)
    else:
        ancestors = verb.ancestors
        for anc in ancestors:
            verb_objects.extend(self.simple_subjects_and_objects(anc))

Case 4: See if the verb is the Root of the sentence, and look for prepositional objects:

if verb.dep_ == 'ROOT':
    for token in sentence:
        if token.dep_ == 'pobj':
            verb_objects.append(token)

Case 5: See if the verb is part of a relative clause, and look for nouns within the relative clause:

if verb.dep_ == 'relcl':
    relcl_noun = self.nouns_from_relative_clause(sentence, verb)
    if relcl_noun:
        verb_objects.append(relcl_noun)
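
Since Cases 1-5 all key off dependency labels ('pobj', 'conj', 'ROOT', 'relcl', and so on), a quick way to see which branches would fire for a given sentence is to dump the parse (the sentence here is purely illustrative):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Floods that hit the region displaced thousands, who fled to camps in the north.')
for tok in doc:
    # Token, its dependency label, and the head it attaches to
    print(tok.text, tok.dep_, 'head =', tok.head.text)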