Not getting prediction correctly using the model trained on the custom dataset (similar format as CORD-V2 dataset) #297
Comments
@SiriusPoint any updates? I have the same problem.
@CarlosSerrano88, not yet. I am still trying but not getting appropriate results.
The transformers implementation of Donut seems to have broken saving and loading at some point. Try transformers==4.26.1 and see if that works.
+1
With transformers==4.25.1 it works perfectly!
I have the same problem with the latest version of transformers. After going back to 4.40.1, the saved model works again.
Did you solve the similar problem when training on CORD-v2 by using transformers 4.25.1?
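The version suggestions in the comments above can be checked against the local environment before retraining. This is a minimal sketch, not an official compatibility list: the set `THREAD_VERSIONS` only repeats the versions mentioned in this thread, and the helper names (`parse_version`, `is_thread_version`, `installed_transformers_version`) are hypothetical.

```python
import re
from importlib import metadata

# Versions mentioned by commenters in this thread (assumption: not independently verified).
THREAD_VERSIONS = {(4, 25, 1), (4, 26, 1), (4, 40, 1)}


def parse_version(version: str) -> tuple:
    """Extract the leading major.minor.patch numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_thread_version(version: str) -> bool:
    """Return True if the version is one of those mentioned in the thread."""
    return parse_version(version) in THREAD_VERSIONS


def installed_transformers_version():
    """Return the installed transformers version string, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_transformers_version()
    if found is None:
        print("transformers is not installed")
    elif is_thread_version(found):
        print(f"transformers {found} is one of the versions mentioned in the thread")
    else:
        print(f"transformers {found} is not one of the versions mentioned in the thread")
```

If the installed version is not in the set, pinning one of the listed versions (for example `pip install transformers==4.25.1`) and re-running the save and load step is the experiment the commenters describe.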
I have trained the Donut model on a custom dataset that follows the same format as the CORD-v2 dataset. Each image has multiple values per line, and each document has around 23 to 24 lines. I used "naver-clova-ix/donut-base" as the base model.
I am using 149 documents in total, split as follows:
training = 119 images
validation = 22 images
testing = 8 images
I have created 3 metadata.jsonl files, i.e. one each for train, validation and test. Below is a sample entry from the metadata.jsonl file of the training dataset:
{"file_name": "IOB_Bank_31_image_0.jpg", "ground_truth": "{\"gt_parse\": {\"bank_stmt_entries\": [{\"TXN_DATE\": \"02-11-2023\", \"TXN_DESC\": \"SB Int: 10-2023:0\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": null, \"DEPOSIT_AMT\": \"93.00\", \"BALANCE_AMT\": \"10901.92\"}, {\"TXN_DATE\": \"09-12-2023\", \"TXN_DESC\": \"CHRGS- SMS ALERT\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": \"1.06\", \"DEPOSIT_AMT\": null, \"BALANCE_AMT\": \"10900.86\"}, {\"TXN_DATE\": \"02-02-2024\", \"TXN_DESC\": \"Debit Card AMC-2\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": \"295.00\", \"DEPOSIT_AMT\": null, \"BALANCE_AMT\": \"10605.86\"}, {\"TXN_DATE\": \"02-02-2024\", \"TXN_DESC\": \"SB Int: 01-2024: 0\", \"CHEQUE_REF_NO\": null, \"WITHDRAWAL_AMT\": null, \"DEPOSIT_AMT\": \"75,00\", \"BALANCE_AMT\": \"10680.86\"}]}}"}
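Because `ground_truth` is itself a JSON string nested inside each JSON line, a quick way to rule out formatting problems is to decode both layers and confirm that `gt_parse` is present. The following is a minimal sketch under that assumption; `parse_metadata_line` is a hypothetical helper, and the sample record is a shortened copy of the entry above.

```python
import json


def parse_metadata_line(line: str) -> dict:
    """Decode one metadata.jsonl line and its nested ground_truth JSON string."""
    record = json.loads(line)                         # outer layer: one JSON object per line
    ground_truth = json.loads(record["ground_truth"]) # inner layer: JSON stored as a string
    if "gt_parse" not in ground_truth:
        raise ValueError(f"{record['file_name']}: ground_truth has no 'gt_parse' key")
    return {"file_name": record["file_name"], "gt_parse": ground_truth["gt_parse"]}


# Shortened sample built with json.dumps so the escaping matches the file format.
sample_ground_truth = {
    "gt_parse": {
        "bank_stmt_entries": [
            {
                "TXN_DATE": "02-11-2023",
                "TXN_DESC": "SB Int: 10-2023:0",
                "CHEQUE_REF_NO": None,
                "WITHDRAWAL_AMT": None,
                "DEPOSIT_AMT": "93.00",
                "BALANCE_AMT": "10901.92",
            }
        ]
    }
}
sample_line = json.dumps(
    {"file_name": "IOB_Bank_31_image_0.jpg", "ground_truth": json.dumps(sample_ground_truth)}
)

parsed = parse_metadata_line(sample_line)
first_entry = parsed["gt_parse"]["bank_stmt_entries"][0]
print(parsed["file_name"], first_entry["TXN_DATE"], first_entry["DEPOSIT_AMT"])
```

Running the same helper over every line of the three metadata.jsonl files would surface any record whose nested JSON fails to decode.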
I trained the model for 30 epochs. The resulting loss and val_edit_distance values are:
loss = 0.03544
val_edit_distance = 0.3443
Following are the config parameters used for the training
When I run prediction on the test dataset, I get the following output (the labels come from print statements I added at specific locations):
seq ==>:
<s_bank-stmt>署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署12-323310-3012-510-3021-32-2021-2021-2021-2021-2021-2021-2021-2021-3021-32419181mt-3021-3241.4351.4351.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.4
3.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.seq after token2json ==>: {'text_sequence': '署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署署12-323310-3012-510-3021-32-2021-2021-2021-2021-2021-2021-2021-2021-3021-32419181mt-3021-3241.4351.4351.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.
43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.43.'}
ground_truth after json load ==>: {'gt_parse': {'bank_stmt_entries': [{'TXN_DATE': '02-11-2023', 'TXN_DESC': 'SB Int: 10-2023:0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '93.00', 'BALANCE_AMT': '10901.92'}, {'TXN_DATE': '09-12-2023', 'TXN_DESC': 'CHRGS- SMS ALERT', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '1.06', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10900.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'Debit Card AMC-2', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '295.00', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10605.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'SB Int: 01-2024: 0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '75,00', 'BALANCE_AMT': '10680.86'}]}}
ground_truth ==>: {'bank_stmt_entries': [{'TXN_DATE': '02-11-2023', 'TXN_DESC': 'SB Int: 10-2023:0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '93.00', 'BALANCE_AMT': '10901.92'}, {'TXN_DATE': '09-12-2023', 'TXN_DESC': 'CHRGS- SMS ALERT', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '1.06', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10900.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'Debit Card AMC-2', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': '295.00', 'DEPOSIT_AMT': None, 'BALANCE_AMT': '10605.86'}, {'TXN_DATE': '02-02-2024', 'TXN_DESC': 'SB Int: 01-2024: 0', 'CHEQUE_REF_NO': None, 'WITHDRAWAL_AMT': None, 'DEPOSIT_AMT': '75,00', 'BALANCE_AMT': '10680.86'}]}
evaluator ==>: <donut.util.JSONParseEvaluator object at 0x7d697edbfc10>
score ==>: 0
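The predicted sequence above collapses into long runs of a single repeated fragment, which is why the score is 0. When looping over a test set, such degenerate outputs can be flagged automatically. The sketch below is one simple heuristic, not part of Donut or transformers: `repetition_ratio` is a hypothetical helper, and the 0.2 threshold is an assumption chosen for illustration.

```python
from collections import Counter


def repetition_ratio(text: str, n: int = 3) -> float:
    """Share of all character n-grams taken up by the single most frequent n-gram."""
    if len(text) < n:
        return 0.0
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    most_common_count = Counter(grams).most_common(1)[0][1]
    return most_common_count / len(grams)


# Synthetic strings shaped like the outputs in this issue (not the full original output).
degenerate = "署" * 50 + ".43" * 100
healthy = "02-11-2023 SB Int: 10-2023:0 93.00 10901.92"

THRESHOLD = 0.2  # assumed cut-off; tune on real predictions

for name, sequence in [("degenerate", degenerate), ("healthy", healthy)]:
    ratio = repetition_ratio(sequence)
    flag = "REPETITIVE" if ratio > THRESHOLD else "ok"
    print(f"{name}: ratio={ratio:.3f} -> {flag}")
```

A high ratio on nearly every test image suggests a systematic problem (for example the save and load issue discussed in the comments) rather than a few hard documents.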
I used the following notebook as a reference:
https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb
Please help me identify and resolve the issue, and let me know if you need more information.
Thank you in advance.