The paragraph extraction feature should operate on top of the segmentation process. Segmentation is activated on a per-instance basis, processing all PDFs within that instance, including newly uploaded ones. Segmentation is controlled by the following feature toggle:
features.segmentation: {url: "http://10.0.11.196:5051/async_extraction"}
For backend paragraph extraction, the system should retrieve segmentation data from the segmentation MongoDB collection and filter segment types to exclude unwanted content such as page headers, text on pictures, or formulas.
Each segmentation paragraph stored in Mongo looks like this:
left: number;
top: number;
width: number;
height: number;
page_number: number;
text: string;
type: string; // not in Uwazi yet
The segmentation service (document layout analysis async) returns the following segment types:
"Caption"
"Footnote"
"Formula"
"List item"
"Page footer"
"Page header"
"Picture"
"Section header"
"Table"
"Text"
"Title"
The segment types to keep as paragraphs could be this short list:
"List item"
"Section header"
"Text"
"Title"
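The filtering step could be sketched roughly as below. This is only an illustration, not existing Uwazi code: the names `SegmentBox`, `PARAGRAPH_TYPES` and `filterParagraphSegments` are hypothetical, and the sketch assumes the segments have already been read from the segmentation collection with the `type` field present.

```typescript
// Shape of one segment, using the fields listed in this issue.
// The interface name is an assumption made for this sketch.
interface SegmentBox {
  left: number;
  top: number;
  width: number;
  height: number;
  page_number: number;
  text: string;
  type: string; // not in Uwazi yet
}

// Segment types kept as paragraphs (the short list proposed above).
const PARAGRAPH_TYPES: ReadonlySet<string> = new Set([
  'List item',
  'Section header',
  'Text',
  'Title',
]);

// Hypothetical helper: keeps only segments whose type is in the allow-list,
// preserving the original order of the segments.
function filterParagraphSegments(segments: SegmentBox[]): SegmentBox[] {
  return segments.filter(segment => PARAGRAPH_TYPES.has(segment.type));
}

// Made-up sample data, only to show the behaviour.
const exampleSegments: SegmentBox[] = [
  { left: 10, top: 5, width: 200, height: 12, page_number: 1, text: 'Report 2024', type: 'Page header' },
  { left: 10, top: 40, width: 300, height: 24, page_number: 1, text: 'Annual report', type: 'Title' },
  { left: 10, top: 80, width: 120, height: 20, page_number: 1, text: 'E = mc^2', type: 'Formula' },
  { left: 10, top: 120, width: 400, height: 60, page_number: 1, text: 'First paragraph.', type: 'Text' },
];

const paragraphs = filterParagraphSegments(exampleSegments);
```

An allow-list is used here instead of excluding the unwanted types one by one, so that any new type the service starts returning is left out by default until it is explicitly added.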
Find more information about the segmentation in the following repositories:
https://github.com/huridocs/pdf-document-layout-analysis-async
https://github.com/huridocs/pdf-document-layout-analysis