hOCR renderer writes "x_size" (instead of "x_fsize") property to ocr_line/ocr_header/... #3303

MerlijnWajer · 2021-02-16T15:44:52Z

Environment

Tesseract Version: master and 4.1.1

Current Behavior:

Function AddBoxTohOCR in src/api/hocrrenderer.cpp writes property x_size for the height of a line/header/etc, but it looks like the spec suggests the usage of x_fsize; x_size is nowhere to be mentioned in the hOCR spec: http://kba.cloud/hocr-spec/1.2/#propdef-x_fsize

For the ocr_word elements, x_fsize is used.

The spec does suggest the usage of x_fsize for ocr_line: http://kba.cloud/hocr-spec/1.2/#elementdef-ocr_line

Expected Behavior:

It might make sense to write x_fsize instead of x_size for ocr_line elements. But perhaps there was a reason x_fsize was not used for ocr_line (and similar) elements? There might be something to be said for backwards compatibility.

Suggested Fix:

Evaluate situation and perhaps replace x_size with x_fsize.

The text was updated successfully, but these errors were encountered:

stweil · 2021-02-16T16:12:08Z

For the history of that code see also pull request #27 and issue #225. Commit 4317862 introduced the current code.

That code added several properties which are not part of the latest specification: x_size, x_ascenders and x_descenders.

Can x_size (height of row) directly be replaced by x_fsize (font size)? I think we should test that by writing some text for example with a 12 pt font, export that as image and look what Tesseract produces.

stweil · 2021-02-16T16:14:34Z

There is also a TODO comment in that code: "Do we want to limit these to a single decimal place?". I think that might be reasonable.

MerlijnWajer · 2021-03-14T00:18:58Z

#3326 is somewhat related. I believe there is probably value in keeping a size in pixels that doesn't relate to the DPI, but I'm not sure I can make a recommendation here anymore. x_size probably makes sense (even though not in spec) if it indeed is not a font size, but just the size in pixels. Not sure why Tesseract specifies it per line, but not per word, but I suppose that might be too verbose.

MerlijnWajer · 2021-07-05T12:10:41Z

FYI: It looks like x_fsize changes depending on the provided dpi, but x_size does not (makes sense, the line height is independent on the DPI, but the font size is not), so that needs special handling in case we'd like to move forward with this.

tfmorris · 2023-11-12T00:27:26Z

There is also a TODO comment in that code: "Do we want to limit these to a single decimal place?". I think that might be reasonable.

Although the TODO has Ray's name on it, it's from me. (All the attribution was lost when the code was moved into a separate module.)

When I implemented this in 2016, I would have been looking at the HOCR 1.0 spec, but that seems to have been taken off line. I'm not sure if anyone still has a copy available. The HOCR 1.2 spec says that x_fsize is an unsigned int, but doesn't say what the units should be, although it would make sense for it to be points, rather than pixels as used for x_size.

Note that since the x_ prefix introduces a non-standard parameter, they might not have been in the spec at all. It might just have been me trying to make sure that all the information that we had available got added to the output, but I don't remember at this distance.

I don't know if anyone is using the current parameters, but I'd suggest adding x_fsize separately for compatibility (presuming you can figure out what units it should be in).

p.s. HOCR 1.2 spec has moved to https://kba.github.io/hocr-spec/1.2/

stweil · 2023-11-12T07:48:00Z

I think Konstantin made a Markdown copy from the hOCR 1.0 spec: kba/hocr-spec@910c8c9.

MerlijnWajer · 2023-11-12T10:01:09Z

If x_fsize is added unconditionally next to x_size (which makes sense to me), it should be worth noting that as far as I understand the DPI should be stored somewhere too, otherwise it's not clear how to get back to a pixel size given a font size. (Although I guess we would provide that explicitly through x_size)

Currently the scan_res property is only written if the DPI is known at OCR time (passed in command line or in image technical metadata tags), but if Tesseract guesses the DPI (I think for segmentation purposes), it will not be written, so in that case I suppose the x_fsize would be in 70 (or 72, I don't remember) DPI, but that currently would be implicit, not explicit.

stweil · 2024-08-11T16:20:26Z

@kba, Tesseract uses three properties in its hOCR output which are not part of the standard: x_size, x_ascenders and x_descenders. Do you think that this is an error (like issue #4300 assumes), or is this allowed since the x_ prefix introduces a non-standard parameter?

tfmorris · 2024-08-11T21:21:36Z

My comment above:

Note that since the x_ prefix introduces a non-standard parameter, they might not have been in the spec at all. It might just have been me trying to make sure that all the information that we had available got added to the output, but I don't remember at this distance.

is addressed by my comment in the original issue #225:

The hOCR spec places all its information in the title attribute and I believe it makes most sense to use that mechanism for the extended information, using the x_ prefix to avoid collisions with future extensions to the spec.

which makes it clear what my approach/rationale was - output all the information that Tesseract had available using the hOCR extension mechanism. Given that the x_ prefix is for extensions, it seems odd that they could cause a validation failure.

The hOCR spec is internally inconsistent in that it says in the definition of property (2.2.2)

Property names must be either from those defined in § 4 The properties of hOCR or begin with x_ to denote implementation-specific extensions.

but then it also includes a bunch of standard (?) properties with an x_ prefix to their name, including: x_bboxes, x_font, x_fsize, x_confs, x_scanner, x_source, and x_wconf. That defeats the whole point of having an extension namespace.

I think that x_ascenders and x_descenders may be (mostly?) duplicative with the baseline property, so could be candidates for removal.

The pointsize which is already being calculated to use with ocrx_word can be repurposed to use as the x_fsize in contexts where x_size is currently emitted. It's computed from row (ie line) attributes, so it's constant anyway. It will, however, be zero if the Y resolution is unknown, so I'd recommend keeping the x_size (in addition for backward compatibility).

stweil mentioned this issue Aug 11, 2024

hocr syntax errs #4300

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hOCR renderer writes "x_size" (instead of "x_fsize") property to ocr_line/ocr_header/... #3303

hOCR renderer writes "x_size" (instead of "x_fsize") property to ocr_line/ocr_header/... #3303

MerlijnWajer commented Feb 16, 2021

stweil commented Feb 16, 2021

stweil commented Feb 16, 2021

MerlijnWajer commented Mar 14, 2021

MerlijnWajer commented Jul 5, 2021

tfmorris commented Nov 12, 2023 •

edited

Loading

stweil commented Nov 12, 2023

MerlijnWajer commented Nov 12, 2023

stweil commented Aug 11, 2024

tfmorris commented Aug 11, 2024

hOCR renderer writes "x_size" (instead of "x_fsize") property to ocr_line/ocr_header/... #3303

hOCR renderer writes "x_size" (instead of "x_fsize") property to ocr_line/ocr_header/... #3303

Comments

MerlijnWajer commented Feb 16, 2021

Environment

Current Behavior:

Expected Behavior:

Suggested Fix:

stweil commented Feb 16, 2021

stweil commented Feb 16, 2021

MerlijnWajer commented Mar 14, 2021

MerlijnWajer commented Jul 5, 2021

tfmorris commented Nov 12, 2023 • edited Loading

stweil commented Nov 12, 2023

MerlijnWajer commented Nov 12, 2023

stweil commented Aug 11, 2024

tfmorris commented Aug 11, 2024

tfmorris commented Nov 12, 2023 •

edited

Loading