ability to handle large, flat directories without blowing up memory #452
Ok, so here's the good news: the memory problems that restic (and kopia) face with large directories have been solved in plakar. plakar had them fully until June 2024, then only partially, and no longer has them, thanks to algorithmic changes, a packfile-backed btree, and caching through databases that use in-memory indexes to on-disk objects. You should be able to back up several million files, whether spread across a filesystem or all in a single directory, with no resource issues on plakar and with very similar performance.

The three cases you pointed out are still valid, though they are not as deeply ingrained in the backup phase:

The utils.go part is to ensure that the root pathname is normalized to its proper case on case-insensitive filesystems before beginning a backup (ie: if I enter ~/Wip by doing

The second case is just a restore test to validate that restore works with a single file, so it's not going to be an issue. I'll think of a way to handle this differently in tests, though, as bigger tests would be an issue and we need them too.

The third case is part of the repository code and might actually be slightly problematic when all packfiles are listed by the client (not part of the actual backup process), so I'll investigate that too.

Thanks
If you open the directory as a file (getting a file descriptor first), there is an os.File method of the same name, ReadDir, but with very different properties: it does not sort, and it returns only a small batch at a time if a batch size > 0 is requested, so one can make multiple calls and handle one small batch at a time. https://pkg.go.dev/os#File.ReadDir
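A minimal sketch of that batched approach, for illustration only. The helper name forEachEntry, the batch size of 256, and the callback shape are assumptions made here, not code from plakar or restic; only os.Open and (*os.File).ReadDir are real standard-library calls.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// forEachEntry is a hypothetical helper. It opens dir as a file and reads
// its entries in batches of at most batchSize, so only one batch is held in
// memory at a time. Entries arrive in directory order, not sorted.
func forEachEntry(dir string, batchSize int, fn func(name string) error) error {
	if batchSize <= 0 {
		// With n <= 0, File.ReadDir returns everything in one slice,
		// which defeats the purpose of batching.
		return errors.New("batchSize must be > 0")
	}
	f, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer f.Close()

	for {
		entries, err := f.ReadDir(batchSize)
		for _, e := range entries {
			if cbErr := fn(e.Name()); cbErr != nil {
				return cbErr
			}
		}
		if err == io.EOF {
			return nil // end of directory reached
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	count := 0
	err := forEachEntry(".", 256, func(name string) error {
		count++
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
		os.Exit(1)
	}
	fmt.Println("entries seen:", count)
}
```

Peak memory here is bounded by the batch size rather than by the directory size. The trade-off is that the caller no longer gets entries in sorted order.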
Examples: https://github.com/glycerine/b3/blob/master/walk.go#L49 Also, how to scan directories in parallel:
Even mature backup programs like restic have memory issues with large numbers of files in one dir (think an S3 bucket, or backing up a mail program's data directory). For example:
restic/restic#2446
These problems are often due to trying to process an entire directory at once, rather than part by part.
For instance, os.ReadDir() loads all directory entries and sorts them by filename before returning anything, which can be problematic for very large directories: https://pkg.go.dev/os#ReadDir
I see four uses at the current origin main HEAD 506a908
Testing for and handling a large number of files in one directory is common enough that it deserves its own set of test cases.
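As a rough idea of what such a test case could look like, here is a sketch of a fixture that builds a large flat directory and checks that a batched scan sees every entry without ever holding more than one batch. The names makeFlatDir and scanStats, the file count, and the batch size are all hypothetical; nothing here is taken from the plakar test suite.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// makeFlatDir (hypothetical fixture) creates a temporary directory holding
// n empty files and returns its path plus a cleanup function.
func makeFlatDir(n int) (string, func(), error) {
	dir, err := os.MkdirTemp("", "flatdir-*")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(dir) }
	for i := 0; i < n; i++ {
		name := filepath.Join(dir, fmt.Sprintf("f%07d", i))
		if err := os.WriteFile(name, nil, 0o600); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return dir, cleanup, nil
}

// scanStats (hypothetical) reads dir in batches of at most batchSize and
// reports the total number of entries and the largest batch returned.
func scanStats(dir string, batchSize int) (total, largest int, err error) {
	if batchSize <= 0 {
		return 0, 0, errors.New("batchSize must be > 0")
	}
	f, err := os.Open(dir)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	for {
		entries, rerr := f.ReadDir(batchSize)
		total += len(entries)
		if len(entries) > largest {
			largest = len(entries)
		}
		if rerr == io.EOF {
			return total, largest, nil
		}
		if rerr != nil {
			return total, largest, rerr
		}
	}
}

func main() {
	dir, cleanup, err := makeFlatDir(5000)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fixture failed:", err)
		os.Exit(1)
	}
	defer cleanup()

	total, largest, err := scanStats(dir, 128)
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
		return
	}
	fmt.Printf("total=%d largest batch=%d\n", total, largest)
}
```

A real test would likely use far more files and also measure heap usage during the backup itself. This sketch only checks the entry count and the batch bound.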