Wrong file name parsing due to missing extension #7135

Closed
txau opened this issue Aug 20, 2024 · 4 comments

Comments

@txau
Collaborator

txau commented Aug 20, 2024

We are creating entries of this form:

"_id" : ObjectId("668c6173afac83a434cd9beb"),
    "entity" : "rnu3ljgi0df",
    "type" : "document",
    "filename" : "1720476018696wx3osn5bm4.com discussing the case",
    "originalname" : "News article from LiveLaw.com discussing the case",
    "mimetype" : "application/pdf",

Here the "filename" field is wrongly inferred, apparently because 1) the original name has no extension and 2) it contains a period in the middle, so the string after that period is wrongly used as the file extension.

This is causing "file not found" errors.

Fixes?

  1. Ensure this kind of entry is not created anymore
  2. Find out why this is giving "file not found" exceptions and either fix the data or adapt the retrieval approach.
@RafaPolit
Member

Priorities to determine extension:

  • use the actual extension
  • use a "known" utility to extract the actual extension (and not just split and use the last occurrence)
  • if everything else fails, simply keep the file without an extension
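
A minimal TypeScript sketch of that priority order, under stated assumptions: `safeExtension` and `storedFilename` are hypothetical names, not the project's actual code or the fix in #7331, and the "plausible extension" rule (a dot followed by 1 to 8 letters or digits, no spaces) is an assumption chosen for illustration. A hand-rolled check stands in here for whatever well-known utility the project picks.

```typescript
// Assumption: a real extension is short and has no spaces or punctuation.
const PLAUSIBLE_EXTENSION = /^\.[A-Za-z0-9]{1,8}$/;

// Hypothetical helper: returns the extension (with its leading dot,
// lowercased) when it looks like a real one, otherwise an empty string.
function safeExtension(originalName: string): string {
  // Drop any directory part so only the base name is inspected.
  const base = originalName.split(/[\\/]/).pop() ?? "";
  const dot = base.lastIndexOf(".");
  // No dot at all, or only a leading dot (a dotfile): no extension.
  if (dot <= 0) return "";
  const candidate = base.slice(dot);
  // Reject candidates such as ".com discussing the case" or ". Telegram".
  return PLAUSIBLE_EXTENSION.test(candidate) ? candidate.toLowerCase() : "";
}

// Hypothetical helper: builds the stored name from a generated unique id.
function storedFilename(uniqueId: string, originalName: string): string {
  return uniqueId + safeExtension(originalName);
}
```

With this sketch, `storedFilename("1720476018696wx3osn5bm4", "News article from LiveLaw.com discussing the case")` yields the bare id with no extension, while an original name such as `report.pdf` keeps `.pdf`. One limitation: a name that ends in something like `LiveLaw.com` would still pass the check, so cross-checking against the mimetype may be worth considering.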

@Joao-vi
Collaborator

Joao-vi commented Oct 9, 2024

@RafaPolit @txau

Hey everyone, could someone help me reproduce item 2? For example, which page shows this "file not found" error?

Item 1 is fixed, so from now on every newly created document should work as expected.

You can follow the progress right here (#7331)

@txau
Collaborator Author

txau commented Oct 9, 2024

We've been seeing these errors in our logs, in this form:

2024-10-08T18:02:14.894Z [cyrilla-ix-test] 
File 1682523685131ns7nb4qqq8p. Telegram not found in s3 storage
original error: {
 "filename": "1682523685131ns7nb4qqq8p. Telegram",
 "storage": "s3"
}

@Joao-vi
Collaborator

Joao-vi commented Oct 11, 2024

Closing this issue, you can check the fix right here (#7331).

I'm moving item 2 to a separate issue because there isn't enough information to reproduce this error.

@Joao-vi closed this as completed Oct 11, 2024