[Paragraph extraction] backend #7130

gabriel-piles · 2024-08-16T12:26:51Z

The paragraph extraction feature should operate on top of the segmentation process. Segmentation is activated on a per-instance basis, processing all PDFs within that instance, including newly uploaded ones. Segmentation is controlled by the following feature toggle:

features.segmentation: {url: "http://10.0.11.196:5051/async_extraction"}
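
As a rough illustration of how backend code might check this toggle before attempting paragraph extraction, here is a minimal sketch. The `InstanceSettings` interface and the `getSegmentationUrl` helper are hypothetical names, not existing Uwazi code. Only the `features.segmentation.url` shape comes from the toggle above.

```typescript
// Hypothetical settings shape; only features.segmentation.url is taken from the issue.
interface InstanceSettings {
  features?: {
    segmentation?: { url: string };
  };
}

// Hypothetical helper: returns the segmentation service URL,
// or undefined when the toggle is not set for this instance.
function getSegmentationUrl(settings: InstanceSettings): string | undefined {
  return settings.features?.segmentation?.url;
}

const enabled: InstanceSettings = {
  features: { segmentation: { url: 'http://10.0.11.196:5051/async_extraction' } },
};
const disabled: InstanceSettings = { features: {} };

console.log(getSegmentationUrl(enabled) !== undefined); // segmentation active
console.log(getSegmentationUrl(disabled) !== undefined); // segmentation inactive
```

Paragraph extraction could then be skipped for instances where the helper returns undefined, since no segmentation data would exist for them.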

For backend paragraph extraction, the system should retrieve segmentation data from the segmentation MongoDB collection and filter segment types to exclude unwanted content such as page headers, text on pictures, or formulas.

The segmentation paragraphs in Mongo look like this:

left: number;
top: number;
width: number;
height: number;
page_number: number;
text: string;
type: string; // not in Uwazi yet

The segmentation service (document layout analysis async) returns the following segment types:

"Caption"
"Footnote"
"Formula"
"List item"
"Page footer"
"Page header"
"Picture"
"Section header"
"Table"
"Text"
"Title"

The desired paragraph types could be this short list:

"List item"
"Section header"
"Text"
"Title"
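
The filtering step described above could be sketched as follows. This is only an illustration under assumptions: `PARAGRAPH_TYPES` and `filterParagraphSegments` are hypothetical names, the sample data is made up, and the segments are assumed to have already been loaded from the segmentation collection with the `type` field present.

```typescript
// Segment shape as listed in this issue; `type` is the field not yet in Uwazi.
interface Segment {
  left: number;
  top: number;
  width: number;
  height: number;
  page_number: number;
  text: string;
  type: string;
}

// Hypothetical constant holding the proposed short list of paragraph types.
const PARAGRAPH_TYPES: ReadonlySet<string> = new Set([
  'List item',
  'Section header',
  'Text',
  'Title',
]);

// Hypothetical helper: keeps only the segments whose type is in the short list,
// preserving their original order.
function filterParagraphSegments(segments: Segment[]): Segment[] {
  return segments.filter(segment => PARAGRAPH_TYPES.has(segment.type));
}

// Made-up sample data for illustration only.
const box = { left: 0, top: 0, width: 100, height: 10, page_number: 1 };
const sample: Segment[] = [
  { ...box, text: 'Annual report', type: 'Title' },
  { ...box, text: 'Page 1 of 10', type: 'Page header' },
  { ...box, text: 'The committee met in May.', type: 'Text' },
  { ...box, text: 'E = mc^2', type: 'Formula' },
];

const paragraphs = filterParagraphSegments(sample);
console.log(paragraphs.map(paragraph => paragraph.text));
```

Keeping the allowed types in a single set makes the short list easy to adjust if, for example, "Caption" or "Footnote" should later count as paragraphs.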

Find more information about the segmentation in the following repositories:

https://github.com/huridocs/pdf-document-layout-analysis-async
https://github.com/huridocs/pdf-document-layout-analysis

gabriel-piles added the Backend 💾 label Aug 16, 2024

RafaPolit added the Priority: Medium label Aug 16, 2024

aphilop added this to the Paragraph Extraction milestone Aug 16, 2024
