Development & Enhancements

Introduction

The project is built in a modular fashion to facilitate further development and enhancements, as well as 'plug and play' behavior.

Front End

The front end was created using Node.js and is separate from the back-end code. Interaction between the two occurs via the database.

Database

The project currently uses a PostgreSQL database.

The underlying data model is in internal_displacement/model/model.py.
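
As a minimal sketch of connecting to that database with SQLAlchemy, assuming the declarative base is exposed as Base in internal_displacement/model/model.py (the connection string is a placeholder):

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from internal_displacement.model.model import Base  # assumed export

# Placeholder credentials; substitute your own PostgreSQL settings.
engine = create_engine("postgresql://user:password@localhost:5432/id_db")

Base.metadata.create_all(engine)   # create the tables defined by the data model
Session = sessionmaker(bind=engine)
session = Session()                # later handed to the Pipeline (see below)
```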

Back End

The back-end is based on three primary classes, each of which deals with a particular part of the process.

Parser

The Parser class contains all of the code required for opening and parsing URLs and web-based PDF files. The core dependencies are newspaper and BeautifulSoup for parsing HTML articles, and textract for parsing PDF files.

The Parser returns a tuple of the elements extracted from the article, and can also be used independently for URL processing, as sketched below.
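
As a hedged illustration, this snippet parses a single URL and unpacks the returned tuple in the order documented under Further Development & Enhancements; the method name fetch_article and the module path are assumptions rather than the confirmed API:

```python
from internal_displacement.parser import Parser  # module path assumed

parser = Parser()

# Hypothetical method name; the tuple ordering follows the contract
# listed at the end of this page.
(article_text, article_pub_date, article_title,
 article_content_type, article_authors, article_domain) = parser.fetch_article(
    "http://example.com/news/flood-displacement"
)
```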

Interpreter

The Interpreter class contains all of the code required for Natural Language Processing, article classification and report extraction. The core dependencies are:

  • spaCy and textacy for Natural Language Processing
  • scikit-learn for machine learning (article classification and report extraction)
  • pycountry for identifying countries
  • parsedatetime for working with date-like entities extracted from text

The Interpreter can also be used independently of other components for specific tasks (a usage sketch follows this list), including:

  • Identifying the language of written text
  • Identifying countries mentioned in text
  • Extracting reports from text
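
A minimal sketch of standalone use; all three method names below are illustrative assumptions, and the real API may differ:

```python
from internal_displacement.interpreter import Interpreter  # module path assumed

interpreter = Interpreter()  # model arguments are covered under Models below

text = "Heavy flooding in Sri Lanka displaced 10,000 people last week."

language = interpreter.check_language(text)      # hypothetical, e.g. "en"
countries = interpreter.extract_countries(text)  # hypothetical, e.g. ["LKA"]
reports = interpreter.extract_reports(text)      # hypothetical, list of ExtractedReport
```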

Notes:

  1. When extracting reports, the Interpreter returns an ExtractedReport object. This is simply a wrapper to facilitate working with reports, and individual elements can be accessed via:
  • Reporting unit: report.subject_term
  • Reporting term: report.event_term
  • Quantity: report.quantity
  • Locations: report.locations
  2. During report extraction, another wrapper class, Fact, is used to facilitate working with and keeping track of each identified report element, including:
  • The underlying spaCy token or tokens
  • The token lemma
  • The type of element (Unit, Term, Location, etc.)
  • The location of the element within the underlying text (start index and end index)
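
For concreteness, a short sketch of inspecting an extracted report; the attribute names come from the notes above, while obtaining reports via extract_reports is an assumption carried over from the earlier sketch:

```python
# `interpreter` and `text` as defined in the earlier sketch.
reports = interpreter.extract_reports(text)  # hypothetical method name

for report in reports:
    print("Reporting unit:", report.subject_term)  # e.g. "people"
    print("Reporting term:", report.event_term)    # e.g. "displaced"
    print("Quantity:", report.quantity)            # e.g. 10,000
    print("Locations:", report.locations)          # e.g. ["Sri Lanka"]
```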

Models

The models used for classification and report extraction are stored as pickle files and passed as arguments when initializing an Interpreter object.
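
A sketch of such an initialization, under the assumption that the constructor takes keyword arguments for the model paths (the parameter names classifier_path and extractor_path are placeholders):

```python
from internal_displacement.interpreter import Interpreter  # module path assumed

interpreter = Interpreter(
    classifier_path="models/article_classifier.pkl",  # placeholder name/path
    extractor_path="models/report_extractor.pkl",     # placeholder name/path
)
```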

Pipeline

The Pipeline class brings together the database and data model, the Scraper, and the Interpreter for end-to-end processing of URLs.

When initializing the Pipeline, the required arguments are:

  • A SqlAlchemy Session object for interacting with the database
  • A Scraper object for working with and parsing URLs
  • An Interpreter object for natural language processing and report extraction

The key method within Pipeline is process_url, which receives a URL as an argument and carries out the end-to-end processing in real time.
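
Tying the pieces together, a hedged end-to-end sketch; the module paths, the Scraper initializer, and the Pipeline argument order are assumptions:

```python
from internal_displacement.pipeline import Pipeline  # module path assumed
from internal_displacement.scraper import Scraper    # module path assumed

# `session` and `interpreter` as constructed in the earlier sketches.
scraper = Scraper()

pipeline = Pipeline(session, scraper, interpreter)   # argument order assumed
pipeline.process_url("http://example.com/news/flood-displacement")
```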

Further Development & Enhancements

Given the modular nature of the back end, enhancements and changes can be made to one part of the solution without affecting the others. Some possibilities include:

  1. Changing the underlying database solution

This simply requires modifying the files in docker/localdb.

  2. Providing new classification or report extraction models

This can be done by passing new arguments to the Interpreter: paths to the pickle files containing the models.

  3. Modifying or replacing the URL scraping functionality

To avoid affecting the Pipeline, the only requirement is to return the extracted elements as a tuple (see the sketch after this list):

article_text, article_pub_date, article_title, article_content_type, article_authors, article_domain

or, if there is an error:

"retrieval_failed", None, "", datetime.datetime.now(), "", ""
  1. Modifying or replacing the report extraction functionality

Modifications to the report extraction require more care due to the use of the ExtractedReport and Fact wrapper classes. For more information, see Fact and ExtractedReport.
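
To illustrate the scraping contract from point 3, a skeleton replacement scraper; the class and method names are placeholders, and only the two return tuples are taken from this page:

```python
import datetime
from urllib.parse import urlparse

import requests  # stand-in fetching library; any approach works


class CustomScraper:
    """Placeholder drop-in scraper; only the return contract matters."""

    def scrape(self, url):  # method name is a placeholder
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            # Real text/title/author extraction goes here; the dummy values
            # below only illustrate the required tuple shape.
            return (response.text,            # article_text
                    datetime.datetime.now(),  # article_pub_date
                    "",                       # article_title
                    "article",                # article_content_type
                    [],                       # article_authors
                    urlparse(url).netloc)     # article_domain
        except Exception:
            # Failure tuple, verbatim from the contract above:
            return ("retrieval_failed", None, "", datetime.datetime.now(), "", "")
```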
