[IX] - Uwazi is sending chaotic data for training (ensure training is using denormalized data) #7183

Open
txau opened this issue Sep 3, 2024 · 4 comments

Comments

@txau
Collaborator

txau commented Sep 3, 2024

E.g., in this case for selects:

Number of options: 16
Number of samples: 500
Languages
es 281
en 207
pt 12
Options
Medidas Provisionales 255
Supervisión de cumplimiento de Sentencia 187
Monitoring compliance with Judgment 24
Precautionary Measures 19
Otros 9
Fondo de asistencia a víctimas 6
Interpretación 2
Medidas Provisorias 2
Outros 1
Victims' Legal Assistance Fund 1

Labels are coming in mixed languages. This is probably because the labels are taken from the denormalized metadata, and that denormalization does not account for translations.
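
A minimal sketch of how mixed-language labels inflate the class count, and how keying on the option id would collapse them. The sample shape (`option_id`, `label`, `language`) and the ids are hypothetical, not Uwazi's actual payload. It also assumes that "Otros" and "Outros" are translations of the same thesaurus value.

```python
from collections import Counter

# Hypothetical training samples: (option_id, label, language).
# The ids are made up; the labels are taken from the report above.
samples = [
    ("opt-others", "Otros", "es"),
    ("opt-others", "Outros", "pt"),
    ("opt-others", "Otros", "es"),
]

# Counting by label treats each translation as a separate class.
by_label = Counter(label for _, label, _ in samples)

# Counting by id collapses the translations into a single class.
by_id = Counter(option_id for option_id, _, _ in samples)

print(len(by_label), len(by_id))
```

If the training payload only carries labels, the same grouping would need a label-to-id lookup built from the thesaurus translations.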

(related) https://github.com/huridocs/ml-backlog/issues/31 #7184

@txau txau added the Bug 🐞 label Sep 3, 2024
@aphilop aphilop added this to the Information Extraction milestone Sep 4, 2024
@txau
Collaborator Author

txau commented Sep 4, 2024

Another oddity:

2024-09-04 15:42:47,019 [INFO] 
Number of options: 4
Number of samples: 500
Languages
es 281
en 207
pt 12
Options
Corte Interamericana de Derechos Humanos 500 for {"run_name":"cejil","extraction_name":"66d87fa4d5516a82a6f0e1e7","metadata":{}}

It says "Number of options: 4", but all 500 samples are reported under the same option. Something is off here. cc @gabriel-piles
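
A consistency check along these lines could surface the mismatch before training. This is only a sketch: the function name and the plain-string inputs are hypothetical, and "option B/C/D" are placeholders because the log does not list the other three options.

```python
def unused_options(declared_options, sample_labels):
    """Return the declared options that no sample uses, preserving order."""
    used = set(sample_labels)
    return [option for option in declared_options if option not in used]


# Hypothetical inputs mirroring the log: 4 declared options, 500 samples,
# all labeled with the same option.
declared = [
    "Corte Interamericana de Derechos Humanos",
    "option B",
    "option C",
    "option D",
]
labels = ["Corte Interamericana de Derechos Humanos"] * 500

missing = unused_options(declared, labels)
if missing:
    print(f"{len(missing)} of {len(declared)} options have no samples: {missing}")
```

Options without samples are not necessarily a bug, but a training set where every sample falls under one option is worth flagging.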

@RafaPolit
Member

@txau has a hunch this is being affected by #7184.

We need to do, at least, two things:

If both conditions are met, then this would be a non-issue, and can be closed.

If we are NOT sending the correctly translated / denormalized data, then we need to take steps to ensure that we do.

Maybe the right approach is to not rely on denormalized data for critical db-integrity processes?

@RafaPolit RafaPolit changed the title [IX] - Uwazi is sending chaotic data for training [IX] - Uwazi is sending chaotic data for training (ensure training is using denormalized data) Sep 6, 2024
@gabriel-piles
Member

A new error in the metadata extractor service could be related to this issue.

The error message says that there is a sample with a value that is not included in the options list. This could be due to a translated label:

Option(id='mrl5hyschs', label='المحكمة الإدارية ') is not in list
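
The wording of that message matches the `ValueError` that Python's `list.index` raises when an item is not found. Here is a sketch of how a translated label could trigger it, assuming `Option` compares both `id` and `label` for equality. The dataclass below is a stand-in, not the service's actual class, and the English label is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    id: str
    label: str


# Options list as sent in one language (this label is hypothetical).
options = [Option(id="mrl5hyschs", label="Administrative Court")]

# Sample value carrying the same id but a translated label.
sample_value = Option(id="mrl5hyschs", label="المحكمة الإدارية ")

try:
    options.index(sample_value)  # dataclass equality compares id AND label
    found_by_equality = True
except ValueError:
    found_by_equality = False

# Matching on id alone is not affected by the label's language.
index_by_id = next(
    (i for i, option in enumerate(options) if option.id == sample_value.id),
    None,
)
print(found_by_equality, index_by_id)
```

If that assumption holds, either sending labels in a single language or matching on `id` would avoid the error.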

@aphilop

aphilop commented Sep 26, 2024

Changed priority to high and put in the backlog column as discussed in MM.


4 participants