Add better support for Brazilian Portuguese #4302

Open
insinfo opened this issue Aug 21, 2024 · 4 comments

insinfo commented Aug 21, 2024

I tested Tesseract on scanned documents in Brazilian Portuguese and found that it makes a lot of recognition mistakes on them.

Current Behavior

Result from https://huggingface.co/spaces/kneelesh48/Tesseract-OCR for the first image:

1-1

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

“Chast vO

Precesse: 18457 J 2003 Data: 03/09/2003 Hora: 10:53:56
Requerente: COSCARELLI E CIALTDA ME 2 ;
* Sec.Destino: Secretaria Municipal de Fazend we
Dept.Destine: Dept? de Tributes @ Fiscalizagao

4
Assunto: ALVARA o Lh 3. )40

Expected Behavior

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Processo: 18457 / 2003
Data: 03/09/2003
Hora: 10:53:56
Requerente: COSCARELLI E CIA LTDA ME
Sec. Destino: Secretaria Municipal de Fazenda
Dept. Destino: Depto. de Tributos e Fiscalização
Assunto: ALVARÁ

Current Behavior

Result from https://huggingface.co/spaces/kneelesh48/Tesseract-OCR for the second image:

110-1

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Frocesso 153

14 ¢ 2003 data 2540712003 Hora: 16:48:28

COLOMIA DE PESCADOPES 2.00

a oe

pcos

Expected Behavior

The correct output would be:

ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Processo: 15314 / 2003
Data: 25/07/2003
Hora: 16:18:28

Requerente: COLÔNIA DE PESCADORES Z-22
Sec. Destino: Sec. Mun. Urbanismo Obras e S. Pub.
Dept. Destino: 0
Assunto: AGRADECIMENTO / FAZ

Windows 11

https://huggingface.co/spaces/kneelesh48/Tesseract-OCR

stweil (Contributor) commented Aug 21, 2024

Latest Tesseract with the model script/Latin gives a better result for the first image:

ESTADO DO RIO DE JANEIRO

Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Cent EO
Processo: 18457 / 2003 Data: 03/09/2003 Hora: 10:53:56
Requerente: COSCARELLI E CIA LTDA ME 2, '
` Sec Destino: Secretaria Municipal de rarako OS
Dept.Destino: Dept? de Tributos è Fiscalização

Assunto: ALVARA A i L J: j4 0

ES
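To put a number on "better", one option is a character error rate: the edit distance between the OCR output and the expected text, divided by the length of the expected text. The following is only a sketch; the function names are made up for illustration, and the sample strings are the "Processo" field as it appears in the outputs quoted in this thread.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Field text copied from the outputs quoted in this thread.
expected = "Processo: 18457 / 2003"
first_attempt = "Precesse: 18457 J 2003"
latin_model = "Processo: 18457 / 2003"

print(char_error_rate(expected, first_attempt))
print(char_error_rate(expected, latin_model))
```

A lower value means fewer character-level mistakes. Comparing whole pages this way is sensitive to line order and whitespace, so comparing field by field is probably more informative.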

filipe-smartins commented

@stweil

Which config gives this result for Portuguese? Is it "-l lat+script/Latin" or "-l por+script/Latin"?

config_tesseract = fr'--tessdata-dir "{TESSDATA_PREFIX}" -l lat+script/Latin --oem 3 --psm 6'

stweil (Contributor) commented Sep 7, 2024

It's simply -l script/Latin (or -l Latin, depending on your Linux distribution or local installation). The Latin script model covers all Western European languages that use that script (as opposed to Greek or Cyrillic).
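Since the model name depends on the installation, a small script can check which name is available before running OCR. This is a minimal sketch, not an official recipe: the helper names and the file name scan.png are hypothetical, it assumes the tesseract executable is on PATH, and it assumes that `tesseract --list-langs` prints one header line followed by one model name per line.

```python
import subprocess
from typing import List, Optional


def parse_list_langs(output: str) -> List[str]:
    """Extract model names from `tesseract --list-langs` output, skipping the header."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def pick_latin_model(installed: List[str]) -> Optional[str]:
    """Prefer script/Latin and fall back to Latin; None if neither is installed."""
    for name in ("script/Latin", "Latin"):
        if name in installed:
            return name
    return None


def build_command(image_path: str, model: str, psm: int = 6) -> List[str]:
    """Build a tesseract call that writes the recognised text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", model, "--psm", str(psm)]


def ocr_with_latin_model(image_path: str) -> str:
    """Run OCR with whichever Latin script model this installation provides."""
    listing = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Older releases may print the list to stderr, so both streams are scanned.
    model = pick_latin_model(parse_list_langs(listing.stdout + listing.stderr))
    if model is None:
        raise RuntimeError("no Latin script model is installed")
    result = subprocess.run(
        build_command(image_path, model), capture_output=True, text=True, check=True
    )
    return result.stdout
```

With the script model installed, `ocr_with_latin_model("scan.png")` would return the recognised text. The `--psm 6` default is taken from the config string quoted earlier in this thread and may not suit every page layout.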

stweil (Contributor) commented Sep 7, 2024

Note also that a correct installation of Tesseract does not need --tessdata-dir or TESSDATA_PREFIX, so avoid both (unless you have very special needs).
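For a config string like the one quoted earlier in this thread, that advice amounts to dropping the --tessdata-dir option and not exporting TESSDATA_PREFIX. A rough sketch of both steps follows; the helper names and the example path are hypothetical, and only the space-separated form `--tessdata-dir PATH` is handled.

```python
import shlex
from typing import Dict


def strip_tessdata_dir(config: str) -> str:
    """Return the config string without --tessdata-dir and its path argument."""
    kept = []
    skip_next = False
    for token in shlex.split(config):
        if skip_next:
            # This token is the path that followed --tessdata-dir.
            skip_next = False
            continue
        if token == "--tessdata-dir":
            skip_next = True
            continue
        kept.append(token)
    return shlex.join(kept)


def env_without_tessdata_prefix(env: Dict[str, str]) -> Dict[str, str]:
    """Copy an environment mapping, leaving out TESSDATA_PREFIX."""
    return {key: value for key, value in env.items() if key != "TESSDATA_PREFIX"}


# Hypothetical config in the style of the one quoted above.
old_config = '--tessdata-dir "C:/tessdata files" -l script/Latin --oem 3 --psm 6'
new_config = strip_tessdata_dir(old_config)
print(new_config)
```

The cleaned environment can be passed as the `env` argument of `subprocess.run` when calling tesseract directly, so that a stale TESSDATA_PREFIX from the shell does not redirect the model lookup.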
