Development & Enhancements
The project has been created in a modular fashion in order to facilitate further development and enhancements, as well as 'plug and play' behavior.
The front end was created using Node.js and is separate from the back-end code. Interaction between the two occurs via the database.
The project currently uses a PostgreSQL database.
The underlying data model is in `internal_displacement/model/model.py`.
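Interaction with the database goes through SQLAlchemy (a `Session` object is also what the `Pipeline` expects, as described below). A minimal sketch of creating a session against a PostgreSQL instance; the connection URL here is a placeholder, not the project's actual configuration:

```python
# Sketch only: the connection URL is a placeholder -- substitute your own
# credentials, host and database name.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("postgresql://user:password@localhost:5432/internal_displacement")
Session = sessionmaker(bind=engine)
session = Session()  # this Session object is what the Pipeline takes (see below)
```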
The back-end is based on three primary classes, each of which deals with a particular part of the process.
The `Parser` class contains all of the code required for opening and parsing URLs and web-based PDF files. The core dependencies are `newspaper` and `BeautifulSoup` for parsing HTML articles, and `textract` for parsing PDF files.
The `Parser` returns a tuple of the elements extracted from the article, and can also be used independently for URL processing.
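A rough sketch of standalone use; the import path and method name here are assumptions, so check the class itself for the exact API. The tuple layout matches the contract described later on this page:

```python
# Sketch only: the import path and method name are assumptions.
from internal_displacement.parser import Parser  # assumed location

parser = Parser()
# The extracted elements come back as a tuple:
# (article_text, article_pub_date, article_title,
#  article_content_type, article_authors, article_domain)
result = parser.parse("http://example.com/some-news-story")  # assumed method name
article_text, pub_date, title, content_type, authors, domain = result
```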
The `Interpreter` class contains all of the code required for Natural Language Processing, article classification and report extraction. The core dependencies are:
- `spacy` and `textacy` for Natural Language Processing
- `scikit-learn` for machine learning (article classification and report extraction)
- `pycountry` for identifying countries
- `parsedatetime` for working with date-like entities extracted from text
The `Interpreter` can also be used independently from other components for specific tasks, including:
- Identifying the language of written text
- Identifying countries mentioned in text
- Extracting reports from text
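For illustration, a sketch of those standalone tasks. The method names are assumptions and may differ from the actual class, and `interpreter` is assumed to be an already-initialized `Interpreter` (see the initialization sketch further down):

```python
# Sketch only: method names are assumptions; `interpreter` is an already
# initialized Interpreter instance (see the initialization example below).
text = "Flooding in the region has forced 2,000 families to flee their homes."

language = interpreter.detect_language(text)     # assumed method name, e.g. 'en'
countries = interpreter.extract_countries(text)  # assumed method name, e.g. country codes
reports = interpreter.extract_reports(text)      # assumed method name; ExtractedReport objects
```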
Notes:
- When extracting reports, the `Interpreter` returns an `ExtractedReport` object. This is simply a wrapper to facilitate working with reports, and individual elements can easily be accessed via the following attributes (see the example after these notes):
  - Reporting unit: `report.subject_term`
  - Reporting term: `report.event_term`
  - Quantity: `report.quantity`
  - Locations: `report.locations`
- During report extraction, another wrapper class, `Fact`, is used to facilitate working with and keeping track of each identified report element. It enables us to keep track of details such as (an illustrative sketch follows after these notes):
  - The underlying spaCy token or tokens
  - The token lemma
  - The type of element (Unit, Term, Location etc.)
  - The location of the element within the underlying text (start index and end index)
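For example, accessing those elements on each extracted report (assuming `reports` is the list returned by the report-extraction step shown earlier):

```python
# Each ExtractedReport exposes the elements listed above.
for report in reports:
    print("Reporting unit:", report.subject_term)
    print("Reporting term:", report.event_term)
    print("Quantity:", report.quantity)
    print("Locations:", report.locations)
```

And, purely as an illustration of what a `Fact` keeps track of (the field names below are invented for the sketch, not the actual attribute names):

```python
from dataclasses import dataclass


@dataclass
class FactSketch:
    """Illustrative stand-in for Fact; see the real class for the actual attributes."""
    tokens: list       # the underlying spaCy token or tokens
    lemma: str         # the token lemma
    fact_type: str     # 'Unit', 'Term', 'Location', etc.
    start_index: int   # start of the element within the underlying text
    end_index: int     # end of the element within the underlying text
```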
The model(s) used for classification and report extraction are stored as pickle files and provided as arguments when initializing an `Interpreter` object.
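For instance, something along these lines; the parameter names and import path are assumptions, and only the fact that pickled model paths are passed as arguments comes from the description above:

```python
# Sketch only: parameter names and import path are assumptions.
from internal_displacement.interpreter import Interpreter  # assumed location

interpreter = Interpreter(
    classifier_model="models/article_classifier.pkl",  # placeholder path to a pickled model
    extraction_model="models/report_extractor.pkl",    # placeholder path to a pickled model
)
```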
The `Pipeline` class brings together the database and data model, `Scraper` and `Interpreter` for end-to-end processing of URLs.
When initializing the `Pipeline`, the required arguments are:
- A SQLAlchemy `Session` object for interacting with the database
- A `Scraper` object for working with and parsing URLs
- An `Interpreter` object for natural language processing and report extraction
The key method within `Pipeline` is `process_url`, which receives a URL as an argument and carries out the end-to-end processing in real time.
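Putting the pieces together, a rough end-to-end sketch; the import path and constructor argument order are assumptions:

```python
# Sketch only: import path and argument order are assumptions. `session` and
# `interpreter` come from the earlier sketches; `scraper` is the URL-parsing
# object described above.
from internal_displacement.pipeline import Pipeline  # assumed location

pipeline = Pipeline(session, scraper, interpreter)

# Scrapes, interprets and stores the results for a single URL in real time.
pipeline.process_url("http://example.com/some-news-story")
```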
Given the modular nature of the back end, enhancements and changes can be made to different parts of the solution without affecting each other. Some possibilities include:
- Changing the underlying database solution: this simply requires modifying the files in `docker/localdb`.
- Providing new classification or report extraction models: this can be done by providing new arguments to `Interpreter`, which should be paths to pickle files of the models.
- Modifying or replacing the URL scraping functionality: the only requirement for not affecting the `Pipeline` is to return the extracted elements as a tuple:
  `article_text, article_pub_date, article_title, article_content_type, article_authors, article_domain`
  or, if there is an error:
  `"retrieval_failed", None, "", datetime.datetime.now(), "", ""`
  A sketch of a drop-in replacement is given after this list.
- Modifying or replacing the report extraction functionality: modifications to the report extraction require more care due to the use of the `ExtractedReport` and `Fact` wrapper classes. For more information see Fact and ExtractedReport.
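As noted above, a minimal sketch of a replacement scraper that honours the tuple contract; the class and method names are placeholders, and only the return values reflect the requirement stated in the list:

```python
import datetime


class ReplacementScraper:
    """Sketch of a drop-in scraper; only the return contract matters."""

    def scrape(self, url):  # method name is a placeholder
        try:
            text, pub_date, title, content_type, authors, domain = self._fetch(url)
            return text, pub_date, title, content_type, authors, domain
        except Exception:
            # Failure tuple expected by the rest of the pipeline:
            return "retrieval_failed", None, "", datetime.datetime.now(), "", ""

    def _fetch(self, url):
        # Replace with real retrieval and parsing logic (requests, newspaper, textract, ...).
        raise NotImplementedError
```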