
ability to handle large, flat directories without blowing up memory #452

Open
glycerine opened this issue Feb 8, 2025 · 2 comments
@glycerine

Even mature backup programs like restic have memory issues with large numbers of files in one dir (think an S3 bucket, or backing up a mail program's data directory). For example:

restic/restic#2446

These problems are often due to trying to process an entire directory at once, rather than part by part.

For instance, os.ReadDir() loads all directory entries and sorts them before returning anything, which could be problematic: https://pkg.go.dev/os#ReadDir

I see four uses at the current origin main HEAD 506a908

~/go/src/github.com/PlakarKorp/plakar (main) $ ack os.ReadDir
cmd/plakar/utils/utils.go
529:            dirEntries, err := os.ReadDir(normalizedPath)

snapshot/restore_test.go
52:     files, err := os.ReadDir(exporterInstance.Root())

storage/backends/fs/buckets.go
54:     bucketsDir, err := os.ReadDir(buckets.path)
67:             entries, err := os.ReadDir(path)
~/go/src/github.com/PlakarKorp/plakar (main) $ 

Testing for and handling a large number of files in one directory is common enough that it deserves its own set of test cases.

@poolpOrg
Collaborator

poolpOrg commented Feb 8, 2025

Ok, so here's the good news:

The memory issue restic (and kopia) face with large directories has been solved in plakar: it suffered from it fully until June 2024, partially after that, and no longer at all, thanks to algorithmic changes, a packfile-backed btree, and caching through databases that use in-memory indexes to on-disk objects. You should be able to back up several million files, spread across a filesystem or all in a single directory, with no resource issues and very similar performance.

The three cases you pointed out are still valid, though they are not as deeply ingrained in the backup phase:

The utils.go part ensures that the root pathname is normalized to its proper case on case-insensitive filesystems before a backup begins (i.e., if I reach ~/Wip by doing cd ~/wip, it works on my macOS, but things get weird because pathnames are relative to ~/wip while some system calls return ~/Wip). I will see how I can implement it without ReadDir.

The second case is just a restore test validating that restore works with a single file, so it's not going to be an issue. I'll think of a way to handle this differently in tests though, as bigger tests would hit the same problem and we need those too.

The third case is part of the repository code and might actually be slightly problematic when all packfiles are listed by the client (not part of the actual backup process), so I'll investigate that too.

Thanks

@poolpOrg poolpOrg self-assigned this Feb 8, 2025
@glycerine
Author

glycerine commented Feb 8, 2025

I will see how I can implement it without ReadDir.

If you open the directory as a file (getting a file descriptor first), there is an os.File method of the same name, ReadDir, but with very different properties: it does not sort, and when a batch size n > 0 is requested it returns only up to n entries per call, so one can make multiple calls and handle a small batch at a time.
By "directory order", the docs mean not sorted but simply the order in which entries are found (probably creation order; the point is the fastest order, with no sorting applied).

https://pkg.go.dev/os#File.ReadDir

func (f *File) ReadDir(n int) ([]DirEntry, error)

ReadDir reads the contents of the directory associated with the file f and returns a 
slice of DirEntry values in directory order. Subsequent calls on the same file 
will yield later DirEntry records in the directory.

If n > 0, ReadDir returns at most n DirEntry records...

examples:

https://github.com/glycerine/b3/blob/master/walk.go#L49

Also, how to scan directories in parallel:

https://github.com/glycerine/parallelwalk
