Future Work

Simon Bedford edited this page Apr 28, 2017 · 2 revisions

Although this solution meets the basic requirements of the challenge, we believe there are still considerable opportunities to improve both the results and the underlying end-to-end (E2E) processing and visualization pipeline.

These include:

Obtaining additional tagged training data

We have been working in partnership with CrowdFlower to obtain labelled training data through crowdsourcing. So far this has provided data for training our classifier; in the future we would extend it to obtain more labelled excerpts as well.

Enhancing classification and report extraction

Based on initial results, we believe that a larger set of labelled excerpts could significantly improve the precision and recall of the report extraction machine learning models.
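To make the two metrics concrete, here is a minimal sketch of how precision and recall could be computed for a binary labelling task. The function name and the example labels are hypothetical and are not taken from the project's code or data.

```python
def precision_recall(true_labels, predicted_labels, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: predicted negative but actually positive.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels: 1 = excerpt contains a report, 0 = it does not.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
precision, recall = precision_recall(y_true, y_pred)
```

Tracking both numbers as the labelled set grows would show whether additional excerpts mainly reduce false positives (precision) or missed reports (recall).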

Visualizations

We believe there are still many opportunities to add further visualizations to our tool. We would also like to give analysts the flexibility to choose or create their own visualizations based on the underlying data.

Additional Complementary Fields

While designing the tool, we identified a number of opportunities to make it more useful to analysts by adding meta-fields for both articles and reports. Some ideas that have not yet been implemented include:

Article Reliability

A measure of the estimated reliability of a given article that could take numerous factors into account, including the underlying domain, manually captured analyst ratings, and similarity of content with other articles.
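One possible shape for such a measure is a weighted average of whichever signals are available for an article. The sketch below is only an illustration: the function name, the three signal names, and the weights are all assumptions, and each signal is assumed to be already normalised to the range 0 to 1.

```python
def article_reliability(domain_score=None, analyst_rating=None,
                        corroboration=None, weights=None):
    """Weighted average of the available signals, or None if there are none.

    All inputs are assumed to lie in [0, 1]; the default weights are
    illustrative placeholders, not tuned values.
    """
    weights = weights or {"domain": 0.4, "analyst": 0.4, "corroboration": 0.2}
    signals = {
        "domain": domain_score,          # reputation of the source domain
        "analyst": analyst_rating,       # manually captured analyst rating
        "corroboration": corroboration,  # similarity with other articles
    }
    # Ignore signals that have not been captured for this article.
    available = {name: value for name, value in signals.items()
                 if value is not None}
    if not available:
        return None
    # Renormalise so that missing signals do not drag the score down.
    total_weight = sum(weights[name] for name in available)
    weighted_sum = sum(weights[name] * value
                       for name, value in available.items())
    return weighted_sum / total_weight
```

Renormalising over the available signals means that an article with no analyst rating yet is scored on its domain and corroboration alone rather than being penalised.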

Report Accuracy

A measure of the likely accuracy of an extracted report, which could, for example, take into account:

  • Incidence of certain key words
  • Presence of conflicting or synonymous words
  • Output probabilities of ML models etc.
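The factors listed above could be combined in many ways. The following is a rough heuristic sketch under stated assumptions, not a proposed design: the function name, the word lists, the scaling, and the penalty value are all hypothetical.

```python
import re


def report_accuracy(text, model_probability, key_words, conflicting_words,
                    conflict_penalty=0.5):
    """Heuristic accuracy estimate in [0, 1] for an extracted report.

    model_probability: output probability of the ML model (assumed 0..1).
    key_words: set of terms whose presence supports the report.
    conflicting_words: set of terms that conflict when several co-occur.
    """
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    # Fraction of the expected key words that appear in the text.
    keyword_score = (len(tokens & key_words) / len(key_words)
                     if key_words else 0.0)
    # Apply the penalty only when two or more conflicting terms co-occur.
    has_conflict = len(tokens & conflicting_words) > 1
    penalty = conflict_penalty if has_conflict else 0.0
    # Key words scale the model probability between 50% and 100% of its value.
    return model_probability * (0.5 + 0.5 * keyword_score) * (1.0 - penalty)


# Hypothetical example text and word lists.
score = report_accuracy(
    "Hundreds of families were displaced, while others were evacuated.",
    model_probability=0.8,
    key_words={"families", "displaced"},
    conflicting_words={"displaced", "evacuated"},
)
```

In this example both key words are present, but two conflicting terms co-occur, so the model probability of 0.8 is halved by the penalty.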