[Paragraph extraction] backend #7130

Open
gabriel-piles opened this issue Aug 16, 2024 · 0 comments

The paragraph extraction feature should operate on top of the segmentation process. Segmentation is activated on a per-instance basis, processing all PDFs within that instance, including newly uploaded ones. Segmentation is controlled by the following feature toggle:

features.segmentation: {url: "http://10.0.11.196:5051/async_extraction"}

For backend paragraph extraction, the system should retrieve segmentation data from the segmentation MongoDB collection and filter segment types to exclude unwanted content such as page headers, text on pictures, or formulas.

The segmentation paragraphs in Mongo look like this:

left: number;
top: number;
width: number;
height: number;
page_number: number;
text: string;
type: string; // not in Uwazi yet

The segmentation service (also known as document layout analysis async) returns the following segment types:

"Caption"
"Footnote"
"Formula"
"List item"
"Page footer"
"Page header"
"Picture"
"Section header"
"Table"
"Text"
"Title"

The desired paragraph types could be this short list:

"List item"
"Section header"
"Text"
"Title"

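The type filtering described above could be sketched roughly as follows. This is only an illustration under assumptions: `Segment`, `PARAGRAPH_TYPES`, and `filterParagraphs` are hypothetical names, not existing Uwazi code, and the input is assumed to be segments already read from the segmentation collection.

```typescript
// Segment shape as listed in this issue; `type` is optional because it is not in Uwazi yet.
interface Segment {
  left: number;
  top: number;
  width: number;
  height: number;
  page_number: number;
  text: string;
  type?: string;
}

// Hypothetical constant: the short list of desired types proposed in this issue.
const PARAGRAPH_TYPES: ReadonlySet<string> = new Set([
  'List item',
  'Section header',
  'Text',
  'Title',
]);

// Hypothetical helper: keep only segments whose type is in the short list.
// Dropping segments that have no `type` is an assumption, not a stated requirement.
function filterParagraphs(segments: Segment[]): Segment[] {
  return segments.filter(
    segment => segment.type !== undefined && PARAGRAPH_TYPES.has(segment.type)
  );
}

// Made-up sample data, for illustration only.
const exampleSegments: Segment[] = [
  { left: 10, top: 5, width: 200, height: 12, page_number: 1, text: 'Report 2024', type: 'Page header' },
  { left: 10, top: 40, width: 200, height: 20, page_number: 1, text: 'Introduction', type: 'Section header' },
  { left: 10, top: 70, width: 200, height: 60, page_number: 1, text: 'First paragraph.', type: 'Text' },
  { left: 10, top: 140, width: 200, height: 30, page_number: 1, text: 'E = mc^2', type: 'Formula' },
];

const paragraphs = filterParagraphs(exampleSegments);
console.log(paragraphs.map(paragraph => paragraph.text));
```

Keeping the allowed types in a single set would make it easy to adjust the short list later without touching the filtering logic.
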
Find more information about the segmentation in the following repositories:

https://github.com/huridocs/pdf-document-layout-analysis-async
https://github.com/huridocs/pdf-document-layout-analysis
