Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hOCR renderer writes "x_size" (instead of "x_fsize") property to ocr_line/ocr_header/... #3303

Open
MerlijnWajer opened this issue Feb 16, 2021 · 9 comments

Comments

@MerlijnWajer
Copy link
Contributor

Environment

  • Tesseract Version: master and 4.1.1

Current Behavior:

Function AddBoxTohOCR in src/api/hocrrenderer.cpp writes property x_size for the height of a line/header/etc, but it looks like the spec suggests the usage of x_fsize; x_size is nowhere to be mentioned in the hOCR spec: http://kba.cloud/hocr-spec/1.2/#propdef-x_fsize

For the ocr_word elements, x_fsize is used.

The spec does suggest the usage of x_fsize for ocr_line: http://kba.cloud/hocr-spec/1.2/#elementdef-ocr_line

Expected Behavior:

It might make sense to write x_fsize instead of x_size for ocr_line elements. But perhaps there was a reason x_fsize was not used for ocr_line (and similar) elements? There might be something to be said for backwards compatibility.

Suggested Fix:

Evaluate situation and perhaps replace x_size with x_fsize.

@stweil
Copy link
Contributor

stweil commented Feb 16, 2021

For the history of that code see also pull request #27 and issue #225. Commit 4317862 introduced the current code.

That code added several properties which are not part of the latest specification: x_size, x_ascenders and x_descenders.

Can x_size (height of row) directly be replaced by x_fsize (font size)? I think we should test that by writing some text for example with a 12 pt font, export that as image and look what Tesseract produces.

@stweil
Copy link
Contributor

stweil commented Feb 16, 2021

There is also a TODO comment in that code: "Do we want to limit these to a single decimal place?". I think that might be reasonable.

@MerlijnWajer
Copy link
Contributor Author

#3326 is somewhat related. I believe there is probably value in keeping a size in pixels that doesn't relate to the DPI, but I'm not sure I can make a recommendation here anymore. x_size probably makes sense (even though not in spec) if it indeed is not a font size, but just the size in pixels. Not sure why Tesseract specifies it per line, but not per word, but I suppose that might be too verbose.

@MerlijnWajer
Copy link
Contributor Author

FYI: It looks like x_fsize changes depending on the provided dpi, but x_size does not (makes sense, the line height is independent on the DPI, but the font size is not), so that needs special handling in case we'd like to move forward with this.

@tfmorris
Copy link
Contributor

tfmorris commented Nov 12, 2023

There is also a TODO comment in that code: "Do we want to limit these to a single decimal place?". I think that might be reasonable.

Although the TODO has Ray's name on it, it's from me. (All the attribution was lost when the code was moved into a separate module.)

When I implemented this in 2016, I would have been looking at the HOCR 1.0 spec, but that seems to have been taken off line. I'm not sure if anyone still has a copy available. The HOCR 1.2 spec says that x_fsize is an unsigned int, but doesn't say what the units should be, although it would make sense for it to be points, rather than pixels as used for x_size.

Note that since the x_ prefix introduces a non-standard parameter, they might not have been in the spec at all. It might just have been me trying to make sure that all the information that we had available got added to the output, but I don't remember at this distance.

I don't know if anyone is using the current parameters, but I'd suggest adding x_fsize separately for compatibility (presuming you can figure out what units it should be in).

p.s. HOCR 1.2 spec has moved to https://kba.github.io/hocr-spec/1.2/

@stweil
Copy link
Contributor

stweil commented Nov 12, 2023

I think Konstantin made a Markdown copy from the hOCR 1.0 spec: kba/hocr-spec@910c8c9.

@MerlijnWajer
Copy link
Contributor Author

If x_fsize is added unconditionally next to x_size (which makes sense to me), it should be worth noting that as far as I understand the DPI should be stored somewhere too, otherwise it's not clear how to get back to a pixel size given a font size. (Although I guess we would provide that explicitly through x_size)

Currently the scan_res property is only written if the DPI is known at OCR time (passed in command line or in image technical metadata tags), but if Tesseract guesses the DPI (I think for segmentation purposes), it will not be written, so in that case I suppose the x_fsize would be in 70 (or 72, I don't remember) DPI, but that currently would be implicit, not explicit.

@stweil
Copy link
Contributor

stweil commented Aug 11, 2024

@kba, Tesseract uses three properties in its hOCR output which are not part of the standard: x_size, x_ascenders and x_descenders. Do you think that this is an error (like issue #4300 assumes), or is this allowed since the x_ prefix introduces a non-standard parameter?

@tfmorris
Copy link
Contributor

My comment above:

Note that since the x_ prefix introduces a non-standard parameter, they might not have been in the spec at all. It might just have been me trying to make sure that all the information that we had available got added to the output, but I don't remember at this distance.

is addressed by my comment in the original issue #225:

The hOCR spec places all its information in the title attribute and I believe it makes most sense to use that mechanism for the extended information, using the x_ prefix to avoid collisions with future extensions to the spec.

which makes it clear what my approach/rationale was - output all the information that Tesseract had available using the hOCR extension mechanism. Given that the x_ prefix is for extensions, it seems odd that they could cause a validation failure.

The hOCR spec is internally inconsistent in that it says in the definition of property (2.2.2)

Property names must be either from those defined in § 4 The properties of hOCR or begin with x_ to denote implementation-specific extensions.

but then it also includes a bunch of standard (?) properties with an x_ prefix to their name, including: x_bboxes, x_font, x_fsize, x_confs, x_scanner, x_source, and x_wconf. That defeats the whole point of having an extension namespace.

I think that x_ascenders and x_descenders may be (mostly?) duplicative with the baseline property, so could be candidates for removal.

The pointsize which is already being calculated to use with ocrx_word can be repurposed to use as the x_fsize in contexts where x_size is currently emitted. It's computed from row (ie line) attributes, so it's constant anyway. It will, however, be zero if the Y resolution is unknown, so I'd recommend keeping the x_size (in addition for backward compatibility).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants