
ability to handle large, flat directories without blowing up memory #452

Open
glycerine opened this issue Feb 8, 2025 · 2 comments
@glycerine

Even mature backup programs like restic have memory issues with large numbers of files in one dir (think an S3 bucket, or backing up a mail program's data directory). For example:

restic/restic#2446

These problems are often due to trying to process an entire directory at once, rather than part by part.

For instance, os.ReadDir() loads all directory entries and sorts them before returning anything, which could be problematic: https://pkg.go.dev/os#ReadDir

I see four uses at the current origin main HEAD 506a908

~/go/src/github.com/PlakarKorp/plakar (main) $ ack os.ReadDir
cmd/plakar/utils/utils.go
529:            dirEntries, err := os.ReadDir(normalizedPath)

snapshot/restore_test.go
52:     files, err := os.ReadDir(exporterInstance.Root())

storage/backends/fs/buckets.go
54:     bucketsDir, err := os.ReadDir(buckets.path)
67:             entries, err := os.ReadDir(path)
~/go/src/github.com/PlakarKorp/plakar (main) $ 

Testing for and handling a large number of files in one directory is common enough that it deserves its own set of test cases.

@poolpOrg
Collaborator

poolpOrg commented Feb 8, 2025

Ok, so here's the good news:

The memory issue restic (and kopia) face with large directories has been solved in plakar: it suffered from it fully until June 2024, partially after that, and no longer at all, thanks to algorithmic changes, a packfile-backed btree, and caching through databases that use in-memory indexes to on-disk objects. You should be able to back up several million files, spread across a filesystem or all in a single directory, with no resource issues and very similar performance.

The three cases you pointed out are still valid, though they are not as deeply ingrained in the backup phase:

The utils.go part ensures that the root pathname is normalized to its proper case on case-insensitive filesystems before a backup begins (i.e., if I reach ~/Wip by doing cd ~/wip, it works on my macOS, but things get weird because pathnames are relative to ~/wip while some system calls return ~/Wip). I will see how I can implement it without ReadDir.

The second case is just a restore test validating that restore works with a single file, so it's not going to be an issue. I'll think of a way to handle this differently in tests though, as bigger tests would hit the same problem and we need those too.

The third case is part of the repository code and might actually be slightly problematic when all packfiles are listed by the client (not part of the actual backup process), so I'll investigate that too.

Thanks

@poolpOrg poolpOrg self-assigned this Feb 8, 2025
@glycerine
Author

glycerine commented Feb 8, 2025

I will see how I can implement it without ReadDir.

If you open the directory as a file (getting a file descriptor first), there is an os.File method of the same name, ReadDir, but with very different properties: it does not sort, and when a batch size n > 0 is requested it returns only up to n entries per call, so one can make multiple calls and handle a small batch at a time.
By "directory order", the docs mean not sorted but simply the order in which entries are found (probably creation order; the point is the fastest order, with no sorting applied).

https://pkg.go.dev/os#File.ReadDir

func (f *File) ReadDir(n int) ([]DirEntry, error)

ReadDir reads the contents of the directory associated with the file f and returns a 
slice of DirEntry values in directory order. Subsequent calls on the same file 
will yield later DirEntry records in the directory.

If n > 0, ReadDir returns at most n DirEntry records...

examples:

https://github.com/glycerine/b3/blob/master/walk.go#L49

Also, how to scan directories in parallel:

https://github.com/glycerine/parallelwalk
