Article Classification

Simon Bedford edited this page Apr 28, 2017 · 1 revision

The Article Classification method is split into two steps:

  1. Classifying articles as relevant or not
  2. Classifying relevant articles into Conflict & Violence or Disaster
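The two steps above can be read as a simple cascade: the category classifier only runs on articles that pass the relevance check. The sketch below is illustrative only. The function names and the toy classifiers are hypothetical stand-ins and are not taken from the project code.

```python
from typing import Callable, Optional


def classify_article(
    text: str,
    is_relevant: Callable[[str], bool],
    categorize: Callable[[str], str],
) -> Optional[str]:
    """Return None for irrelevant articles, otherwise the category label."""
    # Step 1: relevance filter.
    if not is_relevant(text):
        return None
    # Step 2: category, only reached for relevant articles.
    return categorize(text)


# Toy stand-ins for the two classifiers (hypothetical, for illustration only).
def toy_relevance(text: str) -> bool:
    return "displaced" in text.lower()


def toy_category(text: str) -> str:
    return "Disaster" if "flood" in text.lower() else "Conflict & Violence"
```

For example, `classify_article("Floods displaced 300 families", toy_relevance, toy_category)` goes through both steps, while an article that fails the relevance check never reaches the category step.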

Article Relevance

The Article Relevance classification method combines the results of manually crafted rules with those of a machine learning classifier.

Keyword Approach

The keyword approach starts from reading through the texts and identifying tokens that are likely to be specific to displacement events. Each text is then tokenized and stemmed, and each resulting token is compared against the list of possible keywords.
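A minimal sketch of the tokenize, stem and compare steps might look like the following. Several things here are assumptions: the keyword stems are invented for illustration (the actual keyword list is not given on this page), the suffix stripper is a crude stand-in for a real stemmer such as Porter, and flagging an article as relevant on any single match is one possible rule, not necessarily the one used.

```python
import re
from typing import List

# Hypothetical keyword stems; the real list is not given on this page.
KEYWORD_STEMS = {"displac", "evacuat", "refug"}

# Suffixes removed by the crude stemmer, checked in this order.
SUFFIXES = ("ion", "ment", "ing", "ed", "es", "s", "e")


def crude_stem(token: str) -> str:
    """Repeatedly strip common suffixes, keeping at least three characters."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                stripped = True
                break
    return token


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def is_relevant_by_keywords(text: str) -> bool:
    """Assumed rule: relevant if any stemmed token matches a keyword stem."""
    return any(crude_stem(tok) in KEYWORD_STEMS for tok in tokenize(text))
```

Stemming is what lets "displaced" and "displacement" match the same keyword without listing every inflected form.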

This gives results:

  • Precision: 0.83
  • Recall: 0.81
  • F1 Score: 0.81
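For reference, precision, recall and F1 can be computed from true and predicted labels as sketched below. The page does not say whether the reported scores are for a single class or averaged across classes, so this sketch assumes a single positive class (relevant). The function name is hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for the positive (True) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # relevant, predicted relevant
    fp = sum(1 for t, p in pairs if not t and p)    # not relevant, predicted relevant
    fn = sum(1 for t, p in pairs if t and not p)    # relevant, predicted not relevant

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```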

Machine Learning

The general machine learning approach converts documents to a TF-IDF representation and then models topics (or simply reduces dimensionality) by applying LSI (latent semantic indexing). The resulting vectors are then used as features for training a Random Forest (1,000 estimators).
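To make the first stage concrete, here is a minimal plain-Python sketch of TF-IDF weighting. It uses the textbook form (term frequency times log of inverse document frequency). Library implementations usually apply smoothing and normalization, and the page does not say which implementation or settings were used. The LSI step (a truncated SVD of the TF-IDF matrix) and the Random Forest are not shown here.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy tokenized documents (invented for illustration).
docs = [
    ["flood", "displaced", "families"],
    ["flood", "warning"],
    ["election", "results"],
]
vectors = tf_idf(docs)
```

In the first toy document, "displaced" receives a higher weight than "flood" because it appears in fewer documents, which is the property that makes TF-IDF useful as classifier input.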

On its own, this gives results:

  • Precision: 0.79
  • Recall: 0.79
  • F1 Score: 0.79

Combined Approach

The keyword and machine learning results were combined based on the following rule:

Where there is disagreement (i.e. one approach says not relevant and the other says relevant), choose relevant.
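Written as code, this rule is equivalent to a logical OR of the two predictions. The function and parameter names below are hypothetical.

```python
def combine_relevance(keyword_says_relevant: bool, model_says_relevant: bool) -> bool:
    """Pass agreeing predictions through; resolve any disagreement to relevant."""
    if keyword_says_relevant == model_says_relevant:
        return keyword_says_relevant
    # The two approaches disagree, so the rule picks relevant.
    return True
```

An article is therefore marked not relevant only when both approaches agree that it is not relevant.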

This gives results:

  • Precision: 0.84
  • Recall: 0.80
  • F1 Score: 0.80

Article Category

The method used for article categorization (among relevant articles) is a pure machine learning approach, as no additional improvement was obtained by incorporating keyword analysis.

The general machine learning approach converts documents to a TF-IDF representation and then models topics (or simply reduces dimensionality) by applying LSI (latent semantic indexing). The resulting vectors are then used as features for training a Random Forest (1,000 estimators).

This gives results:

  • Precision: 0.98
  • Recall: 0.98
  • F1 Score: 0.98