[IX] - Uwazi is sending chaotic data for training (ensure training is using denormalized data) #7183

txau · 2024-09-03T16:36:23Z

E.g., in this case for selects:

Number of options: 16
Number of samples: 500
Languages
es 281
en 207
pt 12
Options
Medidas Provisionales 255
Supervisión de cumplimiento de Sentencia 187
Monitoring compliance with Judgment 24
Precautionary Measures 19
Otros 9
Fondo de asistencia a víctimas 6
Interpretación 2
Medidas Provisorias 2
Outros 1
Victims' Legal Assistance Fund 1

Labels are coming in mixed languages. This is probably because the labels are being taken from the denormalized metadata, and that denormalization is not accounting for translations.
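A minimal sketch of how this could be checked on the training payload: group the sample labels by option id and flag any option that shows up under more than one label. The sample shape `(language, option_id, label)`, the ids `opt-a`/`opt-b`/`opt-c`, and the pairing of labels as translations of each other are assumptions for illustration, not Uwazi's actual data model.

```python
from collections import defaultdict

# Hypothetical samples: (language, option_id, label).
# The ids and the grouping of labels under one id are assumed for illustration.
samples = [
    ("es", "opt-a", "Supervisión de cumplimiento de Sentencia"),
    ("en", "opt-a", "Monitoring compliance with Judgment"),
    ("es", "opt-b", "Otros"),
    ("pt", "opt-b", "Outros"),
    ("es", "opt-c", "Interpretación"),
]


def mixed_language_options(samples):
    """Return {option_id: sorted labels} for options seen under more than one label."""
    labels_by_id = defaultdict(set)
    for _language, option_id, label in samples:
        labels_by_id[option_id].add(label)
    return {
        option_id: sorted(labels)
        for option_id, labels in labels_by_id.items()
        if len(labels) > 1
    }


for option_id, labels in mixed_language_options(samples).items():
    print(option_id, labels)
```

If training received one canonical label per option id, this check would return an empty dict.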

(related) https://github.com/huridocs/ml-backlog/issues/31 #7184


txau · 2024-09-04T15:46:00Z

Another oddity:

2024-09-04 15:42:47,019 [INFO] 
Number of options: 4
Number of samples: 500
Languages
es 281
en 207
pt 12
Options
Corte Interamericana de Derechos Humanos 500 for {"run_name":"cejil","extraction_name":"66d87fa4d5516a82a6f0e1e7","metadata":{}}

It says "Number of options: 4", but it reports 500 items, all from the same option. Something is off here. cc @gabriel-piles
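A rough sketch of a sanity check for this case: compare the declared options with the labels that actually occur in the samples, so it is visible which options have no samples and which sample labels are not declared at all. The function name and the three placeholder options are hypothetical; only the first label and the count of 500 come from the log above.

```python
from collections import Counter


def compare_options_to_samples(declared_labels, sample_labels):
    """Report per-label counts, declared options with no samples, and undeclared labels."""
    counts = Counter(sample_labels)
    declared = set(declared_labels)
    seen = set(counts)
    return {
        "counts": dict(counts),
        "without_samples": sorted(declared - seen),
        "undeclared": sorted(seen - declared),
    }


# Four declared options; the last three are placeholders (the log does not list them).
declared = [
    "Corte Interamericana de Derechos Humanos",
    "hypothetical option B",
    "hypothetical option C",
    "hypothetical option D",
]
sample_labels = ["Corte Interamericana de Derechos Humanos"] * 500

report = compare_options_to_samples(declared, sample_labels)
print(report["without_samples"])
print(report["undeclared"])
```

A result with three options in `without_samples` would match the log; whether that is plain class imbalance or a symptom of the translation problem cannot be told from the counts alone.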

RafaPolit · 2024-09-06T15:01:49Z

@txau has a hunch this is being affected by #7184.

We need to do, at least, two things:

1. Fix Uwazi not denormalizing translations #7184
2. Ensure that training data is being sent from the denormalized data.

If both conditions are met, then this would be a non-issue, and can be closed.

If we are NOT sending the correctly translated / denormalized data, then we need to take steps to ensure that.

Maybe the right approach is not relying on denormalized data for critical db-integrity processes?

gabriel-piles · 2024-09-24T14:33:08Z

A new error in the metadata extractor service could be related to this issue.

The error message says that there is a sample with a value that is not included in the options list. This could be due to a translated label:

Option(id='mrl5hyschs', label='المحكمة الإدارية ') is not in list
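To illustrate how a translated label could produce this kind of error, here is a small sketch. It assumes the lookup compares whole `Option` values (id and label together), so a sample carrying the same id with a different label is not found. The `Option` dataclass below is a stand-in, the label `"Administrative Court"` in the options list is hypothetical, and `index_by_id` is an illustrative helper, not the service's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    # Stand-in for the service's Option type; equality compares id AND label.
    id: str
    label: str


# Options list as it might be sent in one language (label is hypothetical).
options = [Option(id="mrl5hyschs", label="Administrative Court")]

# Sample value carrying the same id but the translated label from the error message.
value = Option(id="mrl5hyschs", label="المحكمة الإدارية ")


def index_by_id(options, value):
    """Find the position of an option by id only, ignoring the label."""
    for position, option in enumerate(options):
        if option.id == value.id:
            return position
    raise ValueError(f"{value!r} is not in list")


try:
    options.index(value)  # full equality: fails because the labels differ
except ValueError as error:
    print("lookup by value failed:", error)

print("lookup by id:", index_by_id(options, value))
```

Matching on the id alone would tolerate translated labels, although it would also hide the inconsistency rather than fix it at the source.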

aphilop · 2024-09-26T16:58:39Z

Changed priority to high and put in the backlog column as discussed in MM.

txau added the Bug 🐞 label Sep 3, 2024

aphilop added this to the Information Extraction milestone Sep 4, 2024

RafaPolit changed the title ~~[IX] - Uwazi is sending chaotic data for training~~ [IX] - Uwazi is sending chaotic data for training (ensure training is using denormalized data) Sep 6, 2024

RafaPolit added the Priority: Medium label Sep 6, 2024

txau mentioned this issue Sep 23, 2024

Inheriting select properties does not bring translated thesauri values #7259

Open

aphilop added the Priority: High label Sep 26, 2024

aphilop removed the Priority: Medium label Sep 26, 2024
