Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dsync not updating times correctly #560

Open
dvickernasa opened this issue Oct 2, 2023 · 87 comments
Open

dsync not updating times correctly #560

dvickernasa opened this issue Oct 2, 2023 · 87 comments

Comments

@dvickernasa
Copy link

Hello,

I recently discovered mpifileutils and they are awesome. I moved about 200 TB in 4 million files between two different lustre filesystems over the weekend. Thank you for your work on these tools. After doing an initial data move, my intent was to keep the data in sync for a period of time using rsync. With the bulk of the data transferred, this should go fast but it would also catch any file types that `dsync doesn't handle (hard links, etc.). Here is the script I'm using.

source=/nobackup/PROJECTS/ccdev/
dest=/nobackup2/PROJECTS/ccdev/

mpiexec --allow-run-as-root dsync -X non-lustre $source $dest > dsync-ccdev.out.$( date +%F )

rsync -aPHASx $source $dest > rsync-ccdev.out.$( date +%F )

The problem I'm running into is that rsync is copying files that dync already copied. Here is one example of a file that rsync is trying to transfer again.

Source:

238412051 -rw-r----- 1 dvicker eg3 244210724864 Aug 23 16:38 /nobackup/PROJECTS/ccdev/boeing/aero/adb/ar/ar-204.squashfs

Destination:

238487228 -rw-r----- 1 dvicker eg3 244210724864 Sep 29 17:15 /nobackup2/PROJECTS/ccdev/boeing/aero/adb/ar/ar-204.squashfs

The file size it correct but it looks like the date on the destination is wrong - this is the date that that the file got created. After poking around in the destination more, it looks like this is prevalent - dates were not preserved. This will trip up rsync's default algorithm of checking by date and size. I could switch to rsync's checksum mode to get around this but that will be quite a bit more time consuming for this dataset.

Is this a bug with dsync or am I possibly doing something wrong? I'm using version 0.11.1 of mpifileutils.

@dvickernasa dvickernasa changed the title dsync followed by rsync dsync not updating times correctly Oct 2, 2023
@dvickernasa
Copy link
Author

Changed the subject since the root problem here is the time stamp info not being preserved.

Here is a more comprehensive example. There is an odd mix of correct and incorrect time stamps in the destination.

[dvicker@monte dsync]$ ls -l /nobackup/PROJECTS/ccdev/spacex/aero/adb/ar/
total 24040207564
  781477845 -rw-r----- 1 dvicker  ccp_spacex   791038201856 Oct 27  2022 ar-108.squashfs
        187 -rw-rw---- 1 dvicker  ccp_spacex         742598 Aug 15 11:12 ar-108.squashfs.toc
 5835780999 -rw-rw---- 1 ema      ccp_spacex  5977433890816 Aug 24 03:22 ar-109.squashfs
       1721 -rw-rw---- 1 ema      ccp_spacex        9413597 Aug 24 03:22 ar-109.squashfs.toc
   30794854 -rw-r----- 1 dvicker  ccp_spacex    25150177280 Oct 29  2022 ar-126.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          89461 Aug 15 11:12 ar-126.squashfs.toc
  561289162 -rw-r----- 1 dvicker  ccp_spacex   568532955136 Oct 29  2022 ar-127.squashfs
        233 -rw-rw---- 1 dvicker  ccp_spacex        1003455 Aug 15 11:12 ar-127.squashfs.toc
 1042993450 -rw-r----- 1 dvicker  ccp_spacex  1056351469568 Oct 29  2022 ar-131.squashfs
       2651 -rw-rw---- 1 dvicker  ccp_spacex       14523295 Aug 15 11:13 ar-131.squashfs.toc
  882730771 -rw-rw---- 1 ema      ccp_spacex   899033808896 Aug 24 11:42 ar-132.squashfs
       4046 -rw-rw---- 1 ema      ccp_spacex       22383607 Aug 24 11:42 ar-132.squashfs.toc
 1777595522 -rw-r----- 1 dvicker  ccp_spacex  1815639597056 Oct 28  2022 ar-133.squashfs
       1721 -rw-rw---- 1 dvicker  ccp_spacex        9352445 Aug 15 11:13 ar-133.squashfs.toc
  101115096 -rw-r----- 1 dvicker  ccp_spacex   102275829760 Oct 28  2022 ar-134.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          93327 Aug 15 11:13 ar-134.squashfs.toc
  753055384 -rw-r----- 1 dvicker  ccp_spacex   771118510080 Oct 30  2022 ar-135.squashfs
        466 -rw-rw---- 1 dvicker  ccp_spacex        2126037 Aug 15 11:13 ar-135.squashfs.toc
  158022760 -rw-r----- 1 dvicker  ccp_spacex   161615790080 Oct 29  2022 ar-136.squashfs
         94 -rw-rw---- 1 dvicker  ccp_spacex         225492 Aug 15 11:13 ar-136.squashfs.toc
11896898702 -rw-r----- 1 dvicker  ccp_spacex 12047965007872 Aug 20 22:27 ar-138.squashfs
      32762 -rw-rw---- 1 dvicker  ccp_spacex       64656431 Aug 20 22:27 ar-138.squashfs.toc
   46078532 -rw-r----- 1 dvicker  ccp_spacex    38567428096 Oct 27  2022 ar-140.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          99596 Aug 15 11:13 ar-140.squashfs.toc
   69136835 -rw-r----- 1 dvicker  ccp_spacex    64205635584 Oct 28  2022 ar-144.squashfs
         47 -rw-rw---- 1 dvicker  ccp_spacex          61816 Aug 15 11:13 ar-144.squashfs.toc
  103192925 -rw-r--r-- 1 dvicker  eg3          105696878592 Aug 22 20:03 Dragon2_RCS_JI_Joe_MSFC.squashfs
        629 -rw-r--r-- 1 dvicker  eg3               3418271 Aug 22 20:03 Dragon2_RCS_JI_Joe_MSFC.squashfs.toc
         36 -rwxr-xr-x 1 aschwing ccp_spacex            203 Feb 27  2019 update_perms.csh*
[dvicker@monte dsync]$ ls -l /nobackup2/PROJECTS/ccdev/spacex/aero/adb/ar/
total 23852316152
  772498852 -rw-r----- 1 dvicker  ccp_spacex   791038201856 Oct 27  2022 ar-108.squashfs
        728 -rw-rw---- 1 dvicker  ccp_spacex         742598 Sep 29 22:48 ar-108.squashfs.toc
 5837342120 -rw-rw---- 1 ema      ccp_spacex  5977433890816 Aug 24 03:22 ar-109.squashfs
       9196 -rw-rw---- 1 ema      ccp_spacex        9413597 Sep 29 22:03 ar-109.squashfs.toc
   24560752 -rw-r----- 1 dvicker  ccp_spacex    25150177280 Sep 29 20:41 ar-126.squashfs
         88 -rw-rw---- 1 dvicker  ccp_spacex          89461 Aug 15 11:12 ar-126.squashfs.toc
  555208412 -rw-r----- 1 dvicker  ccp_spacex   568532955136 Sep 29 23:53 ar-127.squashfs
        980 -rw-rw---- 1 dvicker  ccp_spacex        1003455 Sep 30 00:05 ar-127.squashfs.toc
 1031594016 -rw-r----- 1 dvicker  ccp_spacex  1056351469568 Oct 29  2022 ar-131.squashfs
      14184 -rw-rw---- 1 dvicker  ccp_spacex       14523295 Sep 29 17:44 ar-131.squashfs.toc
  877963384 -rw-rw---- 1 ema      ccp_spacex   899033808896 Sep 29 19:22 ar-132.squashfs
      21860 -rw-rw---- 1 ema      ccp_spacex       22383607 Sep 30 00:14 ar-132.squashfs.toc
 1773086904 -rw-r----- 1 dvicker  ccp_spacex  1815639597056 Sep 29 20:43 ar-133.squashfs
       9136 -rw-rw---- 1 dvicker  ccp_spacex        9352445 Aug 15 11:13 ar-133.squashfs.toc
   99878836 -rw-r----- 1 dvicker  ccp_spacex   102275829760 Sep 30 00:08 ar-134.squashfs
         92 -rw-rw---- 1 dvicker  ccp_spacex          93327 Aug 15 11:13 ar-134.squashfs.toc
  753046000 -rw-r----- 1 dvicker  ccp_spacex   771118510080 Oct 30  2022 ar-135.squashfs
       2080 -rw-rw---- 1 dvicker  ccp_spacex        2126037 Sep 29 22:02 ar-135.squashfs.toc
  157828056 -rw-r----- 1 dvicker  ccp_spacex   161615790080 Sep 29 23:49 ar-136.squashfs
        224 -rw-rw---- 1 dvicker  ccp_spacex         225492 Sep 29 17:19 ar-136.squashfs.toc
11765599492 -rw-r----- 1 dvicker  ccp_spacex 12047965007872 Aug 20 22:27 ar-138.squashfs
      63144 -rw-rw---- 1 dvicker  ccp_spacex       64656431 Sep 29 23:49 ar-138.squashfs.toc
   37663536 -rw-r----- 1 dvicker  ccp_spacex    38567428096 Sep 29 16:54 ar-140.squashfs
        100 -rw-rw---- 1 dvicker  ccp_spacex          99596 Sep 29 22:03 ar-140.squashfs.toc
   62700864 -rw-r----- 1 dvicker  ccp_spacex    64205635584 Oct 28  2022 ar-144.squashfs
         64 -rw-rw---- 1 dvicker  ccp_spacex          61816 Sep 30 02:00 ar-144.squashfs.toc
  103219708 -rw-r--r-- 1 dvicker  eg3          105696878592 Sep 30 02:26 Dragon2_RCS_JI_Joe_MSFC.squashfs
       3340 -rw-r--r-- 1 dvicker  eg3               3418271 Sep 29 16:46 Dragon2_RCS_JI_Joe_MSFC.squashfs.toc
          4 -rwxr-xr-x 1 aschwing ccp_spacex            203 Sep 29 16:46 update_perms.csh*
[dvicker@monte dsync]$ 

@adammoody
Copy link
Member

adammoody commented Oct 3, 2023

Hi, @dvickernasa . Great to hear mpiFileUtils has been helpful. Thanks.

dsync should be setting the atime and mtime on the destination files to match the source file. However, it does that as the very last step, after all files have been copied. I wonder if the job might have been interrupted or hit a fatal error before it could complete that step.

Do you see any such errors or early termination from your job output log?

Tip: When copying a large number of files, I like to recommend using the --batch-files option, as in:

dsync --batch-files 100000 ...

This changes dsync to copy files in batches rather than all files at once. It sets the timestamps at the completion of each batch. If the job is interrupted, one can then execute the same dsync command again. dsync skips copying any files that were completed in earlier runs. By default, it avoids copying any file whose destination exists and has a matching file type, size, and mtime.

@dvickernasa
Copy link
Author

That is possible. Here is the end of the output from the dsync run.

[2023-09-30T08:19:28] Syncing directory updates to disk.
[2023-09-30T08:19:33] Sync completed in 4.942 seconds. 
[2023-09-30T08:19:33] Started: Sep-29-2023,16:34:18
[2023-09-30T08:19:33] Completed: Sep-30-2023,08:19:33
[2023-09-30T08:19:33] Seconds: 56715.151
[2023-09-30T08:19:33] Items: 1076815
[2023-09-30T08:19:33]   Directories: 35829
[2023-09-30T08:19:33]   Files: 1037753
[2023-09-30T08:19:33]   Links: 3233
[2023-09-30T08:19:33] Data: 48.152 TiB (52944066031967 bytes)
[2023-09-30T08:19:33] Rate: 890.263 MiB/s (52944066031967 bytes in 56715.151 seconds)
[2023-09-30T08:19:33] Updating timestamps on newly copied files
[2023-09-30T08:19:33] Completed updating timestamps
[2023-09-30T08:19:33] Completed sync
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[49249,1],5]
  Exit code:    1
--------------------------------------------------------------------------   

I'll start using the --batch-files option, thanks for the tip.

I'm attempting to recover from this using an rsync command to update the time stamps.

rsync -rv --size-only --times $source $dest >> rsync-ccdev.out.times.$( date +%F )

But its going fairly slow. Woud just repeating the dsync update the timestamps?

@dvickernasa
Copy link
Author

Actually, the data I copied was done with two separate dsync runs since they were in two different areas. I've attached the output from both runs. The ccdev run showed the error, but the cev-repo run did not. But the destinations for both runs have files with incorrect time stamps. I will attached the full output of both runs.

Any recommendations for fixing the time stamps would be appreciated. The rsync command I posted above should do it, but its slow. I'm thinking dsync won't work since the output contains Updating timestamps on newly copied files. That makes it sound like dsync won't update the timestamps on the files that were previously copied.

dsync-ccdev.out.2023-09-29.txt
dsync-cev-repo.out.2023-09-29.txt

@dvickernasa
Copy link
Author

dvickernasa commented Oct 3, 2023

I've been doing some tests this afternoon using a smaller dataset. Basically did this:

  • dsync /src /dest
  • cd /dest ; lfs find . | xargs touch
  • dsync /src /dest

Even though no actual data was transferred for the 2nd dsync (I think, the 2nd transfer went fast), the time stamps in the destination got updated to their source time stamps. I used rsync after the 2nd dsync to see if it found any files it wanted to transfer again and it did not, indicating that the size and times of all files did indeed get updated.

I've restarted the dsync's of my original, large datasets so we'll see what happens.

@adammoody
Copy link
Member

Ok, thanks for the update. I'm still puzzled as to why the timestamps aren't being set correctly the first time.

How many nodes and processes are you using?

@adammoody
Copy link
Member

Does it do a better job at setting timestamps the first time if you add the --direct option?

@adilger
Copy link
Contributor

adilger commented Oct 4, 2023

@adammoody is there a reason not to set the timestamps on a file immediately after it is finished processing? That would avoid the risk of missing them being set, and also reduces overhead on the filesystem because it doesn't need to do another set of filename lookups on all of the files.

It is understandable that the timestamps on the directories should be deferred until all of the entries in the directory are updated (and also to avoid setting the directory timestamp millions of times for a large directory), but that is not the case for regular files.

At worst, if a large file is being written by multiple MPI ranks concurrently, the timestamp might be set multiple times, but that is fine given it only happens for very large files, and one of the threads will be the last one writing and will reset the timestamp when it is finished.

Separately, is the "--batch-files" option mainly to work around the lack of timestamp updates on files, or are there other reasons to use it? Should this be set to eg. 10000, or 100 x mpi_ranks, or similar by default? That would increase the directory timestamp updates marginally, but is unlikely to hurt overall transfer performance significantly (I guess some amount of straggler/idle thread overhead for each batch).

@dvickernasa
Copy link
Author

How many nodes and processes are you using?

I'm using 16 cores on a single workstation for these transfers.

Does it do a better job at setting timestamps the first time if you add the --direct option?

In my small tests, I wasn't using --direct and the times were getting set properly. I verified by following up with an rsync, which didn't copy anything new. I can try a big transfer with --direct next time and see if that works better.

The 2nd dsync for the large dataset just went OK. The intent here was just to correct the times. For some reason, this happened:

[2023-10-03T18:42:03] Comparing file sizes and modification times of 3093774 items
[2023-10-03T18:42:03] Started   : Oct-03-2023, 18:42:01
[2023-10-03T18:42:03] Completed : Oct-03-2023, 18:42:03
[2023-10-03T18:42:03] Seconds   : 2.043
[2023-10-03T18:42:03] Items     : 3093774
[2023-10-03T18:42:03] Item Rate : 3093774 items in 2.042868 seconds (1514426.851390 items/sec)
[2023-10-03T18:42:03] Deleting items from destination
[2023-10-03T18:42:03] Removing 1740489 items ...

<cut>

[2023-10-03T18:46:44] Removed 1676960 items (96.35%) in 280.076 secs (5987.513 items/sec) 10 secs remaining ...
[2023-10-03T18:46:54] Removed 1739691 items (99.95%) in 290.075 secs (5997.386 items/sec) 0 secs remaining ...
[2023-10-03T18:46:54] Removed 1740489 items (100.00%) in 290.399 secs (5993.446 items/sec) done
[2023-10-03T18:46:54] Removed 1740489 items in 290.457 seconds (5992.237 items/sec)
[2023-10-03T18:46:54] Copying items to destination
[2023-10-03T18:46:54] Copying to /nobackup2/PROJECTS/orion/cev-repo
[2023-10-03T18:46:54] Items: 1740490
[2023-10-03T18:46:54]   Directories: 0
[2023-10-03T18:46:54]   Files: 1740490
[2023-10-03T18:46:54]   Links: 0
[2023-10-03T18:46:54] Data: 94.113 TiB (56.699 MiB per file)

The source data is almost completely static so I'm not sure why dsync decided it needed to delete and recopy about half of the data. Perhaps the incorrect dates from the first run threw it off? Its hard to tell for sure but it looked like there were far fewer than 50% of the files with wrong dates in the destination from the 1st dsync. Any thoughts on this mass deletion would be welcome.

@adammoody
Copy link
Member

adammoody commented Oct 4, 2023

Regarding all of those deletes, I think that second dsync detected the mismatch in mtime between the source and destination files. It then assumes the files are different, it deletes the destination files, and it re-copies those files from the source. It would obviously have been smoother if the first dsync properly set the mtimes.

From your output logs, I don't see the types of errors that I'd expect to see if the job had been interrupted. It looks to print the message that it has successfully updated all timestamps. It looks buggy to me that it somehow failed.

One guess is that perhaps a trailing write came in to the file after the timestamp update had been processed on the metadata server. That shouldn't be happening, but maybe it is. I don't think that would happen either for a single node run or when using --direct which uses O_DIRECT, which is what looking for with those questions.

I'll also run some tests to see if I can reproduce that locally.

@adammoody
Copy link
Member

adammoody commented Oct 4, 2023

Thank you for following along and for your suggestions, @adilger .

Yes, we currently set the timestamps once at the end to handle files that are split among multiple ranks. We don't have communication to know when all ranks are done with a given file, so instead we wait until all ranks are done with all files.

The --batch-files option is there to add intermediate global sync points for completing files early. It may provide some secondary benefit to sync procs up more frequently and perhaps improve work balance, but it's mainly there to set timestamps before waiting until the very end.

Picking the best batch size is not well defined. It effectively acts like a checkpoint operation, so it makes sense to pick a batch size based on the expected runtime and failure rate. It also depends on amount of data and the file sizes. I suggested 100k here since there were 4m files in total, so 100k just seemed like a decent checkpoint frequency. Pretty arbitrary...

I like the idea of setting the timestamp multiple times from multiple ranks. That does add extra load on the metadata server, but that's probably not a huge hit.

However, a remaining challenge is to figure out how to handle setting other properties like ownership, permission bits, and extended attributes. We set those at the end, and then we set the timestamps as the very last step. This ensures that the mtime on the destination only matches the source after the file has been fully copied (all data and attributes).

I know some properties can't be set too early, since for example, some of those can disable write access to the destination file.

And I think we'd need to consider the case where dsync might be interrupted and leaves a destination file with the correct size and mtime before all ranks had actually completed their writes. (The file size can be correct since the last bytes in the file might actually be written first.) We could perhaps write to a temporary file and rename to handle that, but then I think we run into a similar problem of knowing when it's time to do the rename.

Have more insight into those?

@dvickernasa
Copy link
Author

Regarding all of those deletes, I think that second dsync detected the mismatch in mtime between the source and destination files. It then assumes the files are different, it deletes the destination files, and it re-copies those files from the source. It would obviously have been smoother if the first dsync properly set the mtimes.

Given this, I think @adilger's suggestion of writting the times right after the file write makes a lot of sense if its not too much of a performance hit. If dsync's behavior is to delete the file and start over again if the times don't match, it would be very beneficial to get the times updated ASAP so that if the transfer is interrupted, the next dsync can avoid as much re-transfer as possible.

One other possibility is implementing rsync's --partial behavior. If that is turned on, the part of the file that is transferred is kept and subsequent rsync runs pick up where the the previous left off. Of course, this would also require implementing the delta-transfer algorithm and I have no idea how much effort that would entail.

@adilger
Copy link
Contributor

adilger commented Oct 5, 2023

One option for detecting partial file transfer would be to check the blocks allocated to the file against the source file. If they are grossly smaller on the target, then it is clearly not finished copying the interior.

Alternately (and this might be better for performance reasons as well) you could make the work distribution function smarter and have only a single rank per OST object and copy the objects linearly to avoid holes. With modern Lustre, large O_DIRECT writes are much faster than buffered writes 20GB/s vs. 2GB/s).

Also, I imagine you only split a file into multiple parts to copy in parallel once it is over 1GiB in allocated space, so the cost of doing an MPI comm to the other ranks is pretty tiny compared to the file data moved, so making the client do multiple locks/opens for these files is probably more work in total.

@dvickernasa
Copy link
Author

Sorry, I know this isn't directly related to this issue, but I have a question about dsync's behavior. Here is an example of the end of a transfer.

[2023-10-03T19:32:44] Copied 2.673 TiB (66%) in 2738.947 secs (0.999 GiB/s) 1438 secs left ...
[2023-10-03T19:32:51] Copied 2.684 TiB (66%) in 2746.006 secs (1.001 GiB/s) 1424 secs left ...
[2023-10-03T19:33:04] Copied 2.695 TiB (66%) in 2759.073 secs (1.000 GiB/s) 1414 secs left ...
[2023-10-03T19:33:14] Copied 2.707 TiB (66%) in 2769.405 secs (1.001 GiB/s) 1401 secs left ...
[2023-10-03T19:33:24] Copied 2.719 TiB (67%) in 2779.124 secs (1.002 GiB/s) 1388 secs left ...
[2023-10-03T19:33:34] Copied 2.730 TiB (67%) in 2789.049 secs (1.002 GiB/s) 1376 secs left ...
[2023-10-03T19:33:49] Copied 2.738 TiB (67%) in 2804.159 secs (1.000 GiB/s) 1370 secs left ...
[2023-10-03T19:34:04] Copied 2.755 TiB (68%) in 2818.953 secs (1.001 GiB/s) 1353 secs left ...
[2023-10-03T19:34:15] Copied 2.766 TiB (68%) in 2829.957 secs (1.001 GiB/s) 1341 secs left ...
[2023-10-03T19:34:21] Copied 2.777 TiB (68%) in 2836.497 secs (1.002 GiB/s) 1328 secs left ...
[2023-10-03T19:34:39] Copied 2.788 TiB (68%) in 2854.734 secs (1.000 GiB/s) 1320 secs left ...
[2023-10-03T19:34:48] Copied 2.802 TiB (69%) in 2863.235 secs (1.002 GiB/s) 1303 secs left ...
[2023-10-03T19:35:08] Copied 2.813 TiB (69%) in 2883.456 secs (0.999 GiB/s) 1295 secs left ...
[2023-10-03T20:09:39] Copied 2.831 TiB (69%) in 4954.226 secs (599.206 MiB/s) 2180 secs left ...
[2023-10-03T20:09:39] Copied 4.077 TiB (100%) in 4954.227 secs (862.846 MiB/s) done
[2023-10-03T20:09:39] Copy data: 4.077 TiB (4482383360736 bytes)
[2023-10-03T20:09:39] Copy rate: 862.846 MiB/s (4482383360736 bytes in 4954.227 seconds)
[2023-10-03T20:09:39] Syncing data to disk.
[2023-10-03T20:09:45] Sync completed in 6.131 seconds.

Note that the percentage is progressing along fairly linearly, then suddenly jumps a lot and finishes much earlier than predicted. I see this happen a lot. Do you know what's going on there?

My 2nd dsync of the large dataset finished and I have an rsync running to verify it doesn't find anything else to transfer. Its going to take awhile for that to finish but its been running most of the day and it hasn't found anything yet, which is a good sign. My original rsync verification starting finding files it wanted to transfer almost right away.

@adammoody
Copy link
Member

The missing progress messages is a known problem. This is also noted in #530

These progress messages are implemented using a tree-based communication pattern across ranks using point-to-point messages. Every 10 seconds or so, rank 0 initiates a message down the tree to request progress. Each intermediate rank receives that request message from its parent and forwards it down to its children. When the request message reaches the leaves, ranks respond and send their current progress status back up the tree. Intermediate ranks receive that data, merge in their own progress status, and respond back up the tree. Rank 0 receives the messages from its children, merges its data, and prints the final result. These reductions are asynchronous. Ranks periodically check for and respond to these messages as they come in.

I think what probably happens for these progress messages to get stalled is that some rank (any rank) gets stuck in an I/O call. Because it is blocked in the I/O call, it doesn't respond to MPI messages, which then blocks the progress tree. If that's what's happening, one solution would be to move the I/O calls into a pthread or maybe use async I/O routines so that the process can still make MPI calls while waiting on its current I/O operation to complete.

@adilger
Copy link
Contributor

adilger commented Oct 6, 2023

Is it possible to set a timeout on the MPI broadcast that doesn't also result in a total MPI failure? Since the status updates are somewhat advisory, unlike an HPC calculation, getting a partial status is probably fine.

@adammoody
Copy link
Member

adammoody commented Oct 6, 2023

Yes, I think timeouts on the reduction operation could also work. I think we'd need to pair that with adding sequence numbers to the messages or internal counters so that a parent can distinguish late-arriving responses from its children.

It might also get strange, since the progress "sum" would be missing any results from the subtree rooted at the blocked rank. The reported progress would appear to bounce up and down during the run if such stalls are frequent as random ranks come and go. That might be countered by caching the last reported value from each child. I'd have to check whether that makes sense for the various reduction ops...

There may be another advantage to moving the I/O to a thread. For algorithms that use libcircle for work stealing, work on blocked ranks could be moved to other ranks faster. Having said that, changing all of the tools to move I/O to a separate thread is a bigger overhaul.

@adammoody
Copy link
Member

Oh, wait. I'm describing the point-to-point reduction in libcircle:

https://github.com/hpc/libcircle/blob/406a0464795ef5db783971e578a798707ee79dc8/libcircle/token.c#L97

There is a separate progress reduction algorithm in mpiFileUtils, and I forgot that it uses non-blocking MPI collectives for that:

https://github.com/hpc/mpifileutils/blob/main/src/common/mfu_progress.c

That'd probably first need to be switched over to a point-to-point method (like in libcircle) to add timeouts.

@adammoody
Copy link
Member

Since this is getting a bit involved, let's migrate discussion about missing progress messages to issue #530.

@dvickernasa
Copy link
Author

I've got a few more datapoints on setting the times.

First of all, the rsync verification of our 141 TB, 2.7M files dataset is still going. Its been running since 10/5. We closed off the permissions of the top level directory to ensure people can't access the data. So far, rsync has found about a couple dozen files that it wanted to re-transfer.

Over the weekend, we had another few sets of data we needed to transfer, each set containing about 250 GB and 1.7M files. Using a single node we can transfer one set in less than an hour using dsync. However, we found more indications of timestamps getting set incorrectly. This time we kept running dsync. After 2 or 3 times of dsync finding new files to transfer, we could get to the point where dsync itself no longer found new files to transfer. This was still not using --direct. The next time I use dsync, I'll try using --direct.

@adammoody
Copy link
Member

adammoody commented Oct 13, 2023

To be sure, your latest transfer of 1.7M files saw mismatched timestamps, and this only ran dsync using a single node, right? That info helps to eliminate one theory I had.

And this is a Lustre-to-Lustre transfer, right?

Did you see any errors reported in the output of that first transfer?

@dvickernasa
Copy link
Author

Correct - single node and lustre to lustre.

@dvickernasa
Copy link
Author

Sorry, to answer your questions about errors, Yes, there were some errors. They did a couple intermediate transfers while the datasets were in use and being modified in an effort to reduce the time it took to transfer the data when the runs were completed.

ERROR: Source file `XXX' shorter than expected 4007 (errno=0 Success)

Although, I found this odd (the source data was smaller not larger), we didn't spend any time digging into this.

One the job were done and the data was static, the transfer did show just a couple of errors.

ERROR: Failed to open output file `XXX' (errno=13 Permission denied)
ERROR: Failed to change ownership on `XXX' lchown() (errno=13 Permission denied)

Which is kind of strange since the source and destination directories were both owned by the person doing the transfer. One more dsync run got all files and finished without errors.

@dvickernasa
Copy link
Author

FYI, I started another transfer on Thursday. This one was quite large WRT the number of files (44.7M)

[2023-10-12T14:42:45] Copying to /nobackup2/PROJECTS/orion/gnc/em1data-copy
[2023-10-12T14:42:46] Items: 44656137
[2023-10-12T14:42:46]   Directories: 3593494
[2023-10-12T14:42:46]   Files: 39432055
[2023-10-12T14:42:46]   Links: 1630588
[2023-10-12T14:42:46] Data: 49.134 TiB (1.307 MiB per file)

I used --direct from the beginning on this one.

source=/.scratch00/agdl/em1data/
dest=/nobackup2/PROJECTS/orion/gnc/em1data-copy/
dsync_opts='-X non-lustre --batch-files 100000 --direct'

mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-em1.out.$( date +%F-%H-%M )

The scratch filesystem is NFS so this is a an NFS to lustre transfer. Unfortunately, it looks like this one died sometime on Saturday.

[2023-10-14T11:13:45] Copied 262.948 GiB (80%) in 851.497 secs (316.218 MiB/s) 219 secs left ...
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 4679 on node monte exited on signal 9 (Killed).
--------------------------------------------------------------------------

I'm not sure why it died yet - I'll be digging into that some more. However, after poking around in the partially transferred directory structure, it looks like all directories are owned by root. Or at least most of them. Here is an example of the top level directory.

[root@dvicker dsync]# ls -l /.scratch00/agdl/em1data/
total 20381086
drwxrws---  41 root     mpcvdata          57 Mar  6  2023 AR1_MER_GEM
drwxrwxrwx  11 mgualdon cev               11 Mar 21  2023 AR1_MER_POLARIS
-rw-r--r--   1 lhmolina mpcvdata 18450288381 Dec 10  2018 Data_20180219_0300.tgz
drwxr-sr-x   2 root     mpcvdata     2206262 Dec  7  2020 dumap_html
-rw-r--r--   1 root     mpcvdata        2782 Dec  7  2020 dumap.html
-rw-r--r--   1 root     mpcvdata  6744041294 Dec  4  2020 du.txt
drwxr-sr-x   8 choltz   mpcvdata           8 Jun 26  2018 EM1_DATA
drwxr-sr-x   5 choltz   mpcvdata           5 Aug 24  2016 EM1_FSW
drwxrws---   5 lmcev    mpcvdata           5 Jul 13  2020 EM1_iVAC
drwxrws---+ 13 choltz   mpcvdata          13 Jan 25  2023 EM1_RUNS
drwxr-sr-x  20 mfrostad mpcvdata          20 Jan 14  2021 EM1_SLS_Traj
drwxrws---   2 idux     mpcvdata           4 Apr 21  2021 EM1_TMT_Database
drwxr-sr-x   6 dheacock mpcvdata           6 Jul 13  2020 EM1_VAC
drwxr-s---   3 jarnold1 em1tppi            3 Aug  8  2019 EM_Flight_Traj_Limited_Rights
drwxr-sr-x  10 choltz   mpcvdata          16 May  3  2016 jfreechart-1.0.19
drwxrwsrwx   4 jmarcy   mpcvdata           4 Aug 10  2022 jmarcy
-rwxrwxrwx   1 jmarcy   mpcvdata      659220 Jan 19  2021 MEP_28ep9_20201215.xml
drwxrwsrwx   2 mstoft   mpcvdata           2 Jan 11  2022 mstoft
drwxrwsr-x+ 10 bawood   mpcvdata          14 May  8 13:58 orion_veras_archive
drwxr-sr-x   3 lgefert  mpcvdata           3 Feb  7  2019 prop_data
drwxrwsrwx  16 jmarcy   mpcvdata          16 Sep 21  2020 ramtares
drwxrwsrwx   6 mstoft   mpcvdata           7 Feb 10  2021 Rel28eP9Eng_Debug
drwxr-sr-x   5 choltz   mpcvdata           6 Nov 27  2017 socgnc_base
drwxr-s---  36 choltz   em1tppi           39 Oct 17  2022 socgnc_releases
drwxr-sr-x   5 choltz   mpcvdata           5 Mar 22  2016 SSW
[root@dvicker dsync]# ls -l /nobackup2/PROJECTS/orion/gnc/em1data-copy/
total 36212336
drwx------  41 root     root            4096 Oct 14 10:16 AR1_MER_GEM
drwx------  11 root     root            4096 Oct 12 14:42 AR1_MER_POLARIS
-rw-r--r--   1 lhmolina mpcvdata 18450288381 Dec 10  2018 Data_20180219_0300.tgz
drwx------   2 root     root       180535296 Oct 14 10:08 dumap_html
drwx------   8 root     root            4096 Oct 12 14:42 EM1_DATA
drwx------   5 root     root            4096 Oct 12 14:42 EM1_FSW
drwx------   5 root     root            4096 Oct 12 14:42 EM1_iVAC
drwxrwx---+ 13 root     root            4096 Oct 12 14:42 EM1_RUNS
drwx------  20 root     root            4096 Oct 12 14:42 EM1_SLS_Traj
drwx------   2 root     root            4096 Oct 12 14:42 EM1_TMT_Database
drwx------   6 root     root            4096 Oct 12 14:42 EM1_VAC
drwx------   3 root     root            4096 Oct 12 14:42 EM_Flight_Traj_Limited_Rights
drwx------  10 root     root            4096 Oct 14 10:05 jfreechart-1.0.19
drwx------   4 root     root            4096 Oct 12 14:42 jmarcy
drwx------   2 root     root            4096 Oct 12 14:42 mstoft
drwx------+ 10 root     root            4096 Oct 14 10:05 orion_veras_archive
drwx------   3 root     root            4096 Oct 12 14:42 prop_data
drwx------  16 root     root            4096 Oct 12 14:42 ramtares
drwx------   6 root     root            4096 Oct 12 14:42 Rel28eP9Eng_Debug
drwx------   5 root     root            4096 Oct 12 14:42 socgnc_base
drwx------  36 root     root            4096 Oct 12 18:55 socgnc_releases
drwx------   5 root     root            4096 Oct 12 14:42 SSW
[root@dvicker dsync]# 

Is this expected?

@dvickernasa
Copy link
Author

dvickernasa commented Oct 16, 2023

A few updates.

The first dsync run of this latest transfer died due to an OOM event on the machine I was running on. I think it was dsync itself that caused the OOM, since this was a pretty isolated machine and nothing else was running there. But I'm not 100% on that. I didn't mention it before, but this is also a single node run.

The sync is restarted and, unfortunately, I'm having similar issues. dsync decided to delete a lot of what it already transferred.

[2023-10-16T11:22:01] Comparing file sizes and modification times of 17155249 items
[2023-10-16T11:22:19] Started   : Oct-16-2023, 11:21:26
[2023-10-16T11:22:19] Completed : Oct-16-2023, 11:22:19
[2023-10-16T11:22:19] Seconds   : 53.312
[2023-10-16T11:22:19] Items     : 17155249
[2023-10-16T11:22:19] Item Rate : 17155249 items in 53.312461 seconds (321786.852653 items/sec)
[2023-10-16T11:22:19] Deleting items from destination
[2023-10-16T11:22:19] Removing 10915151 items

I'm attaching the full output for both the original and the restart for this latest run.

dsync-em1.out.2023-10-16-09-50.txt
dsync-em1.out.2023-10-12-12-52.txt

@adammoody
Copy link
Member

Ugh, that's frustrating. I suspect that the timestamps for the files on that first run are still not being set correctly.

It looks like the first run successfully completed a number of 100k file-transfer batches before it died (193 batches to be precise).

[2023-10-14T10:56:04] Copied 19300000 of 44656137 items (43.219%)

So at most it should have had to delete about 100k files, which would be the files just from the batch being worked when the run died. Instead, it thought 11m million files needed to be replaced.

It seems like it's either missing setting the timestamps on some files, or perhaps it's hitting an error and not reporting it. I'll review the code again with an eye for those kinds of problem.

You might as well drop --direct for future runs. We now know that doesn't seem to help, and it may slow down your transfers. Thanks for checking that.

Having root ownership left on the directories is expected, because the directory permissions are not updated until all files have been transferred. Since the first run didn't complete, it didn't get a chance to change the ownership on the directories. However, it should have changed ownership and set timestamps on all of those previous files.

It'd be helpful if we could find a reproducer where you don't have to transfer a ton of data. You mentioned some smaller transfers worked on the first attempt.

Do you have any idea of how large you have to get before you see things break?

@dvickernasa
Copy link
Author

You might as well drop --direct for future runs. We now know that doesn't seem to help, and it may slow down your transfers. Thanks for checking that.

No problem and I will drop this for future runs.

It'd be helpful if we could find a reproducer where you don't have to transfer a ton of data. You mentioned some smaller transfers worked on the first attempt.

Do you have any idea of how large you have to get before you see things break?

Yes, I agree. I will see if I can find a small-ish dataset to reproduce this.

@adammoody
Copy link
Member

adammoody commented Oct 16, 2023

This will produce some long output, but let's add some debug statements to print info when dsync sets the timestamp on a file.

Can you modify src/common/mfu_io.c to add the printf and preceding formatting lines as shown below and rebuild?

int mfu_utimensat(int dirfd, const char* pathname, const struct timespec times[2], int flags)
{   
    int rc;
    int tries = MFU_IO_TRIES;
retry:
    errno = 0;
    rc = utimensat(dirfd, pathname, times, flags);
    if (rc != 0) {
        if (errno == EINTR || errno == EIO) {
            tries--;
            if (tries > 0) {
                /* sleep a bit before consecutive tries */
                usleep(MFU_IO_USLEEP);
                goto retry;
            }
        }
    }

    //------------------------------
    char access_s[30];
    char modify_s[30]; 
    size_t access_rc = strftime(access_s, sizeof(access_s) - 1, "%FT%T", localtime(&times[0].tv_sec));
    size_t modify_rc = strftime(modify_s, sizeof(modify_s) - 1, "%FT%T", localtime(&times[1].tv_sec));
    printf("%s rc=%d errno=%d atime_sec=%llu atime_ns=%llu mtime_sec=%llu mtime_ns=%llu atime=%s mtime=%s\n",
        pathname, rc, errno, times[0].tv_sec, times[0].tv_nsec, times[1].tv_sec, times[1].tv_nsec, access_s, modify_s);
    //------------------------------

    return rc;
}

Also add an #include <time.h> statement at the top of the file.

Execute a dsync and if you find any files have mismatching timestamps, look for those same paths in the output. Let's verify that a line is actually printed, that the input timestamps are correct, and that the utimensat call returns with rc=0.

BTW, it seems like it's not too hard to find files with mismatched timestamps, but if you need to scan a lot of files manually, we can probably use dcmp to help with that.

No need to copy the actual output here, but let me know what you find.

@adammoody
Copy link
Member

@dvickernasa , have you had any transfers reproduce the Lustre timestamp problem?

@dvickernasa
Copy link
Author

dvickernasa commented Nov 3, 2023

Sorry, this fell off my radar. The "production" transfers I needed to do are done and I didn't get back to my testing. I'm restarting a transfer as a test. I'm going to recreate the original transfer this time, including the original options I used (most notably without --batch-files. I don't think I've run into a case where this didn't work when using --batch-files and I'm starting to suspect that is why. This new transfer is the exact same set of files from the first post in this issue. The source and destination filesystem are the same as well, I'm just using a different path in the destination FS. Fingers crossed that this recreates it. Details below. I'm using the debugtimestamps branch for both dsync runs so if we do catch the problem, we should have some good data to go over.

ml load mpifileutils/debugtimestamps

source=/nobackup/PROJECTS/ccdev.to-delete/
dest=/nobackup2/dvicker/dsync_test/ccdev/

dsync_opts='-X non-lustre'

mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out
mpiexec --allow-run-as-root dsync $dsync_opts $source $dest >> dsync-testing.$( date +%FT%H-%M-%S ).out

@dvickernasa
Copy link
Author

OK, I think we caught a few files this time.

FYI, the source was closed off to regular users so I'm certain it was static.

[root@dvicker dsync]# ls -ld /nobackup/PROJECTS/ccdev.to-delete 
d--------- 5 root root 11776 Aug 24 17:08 /nobackup/PROJECTS/ccdev.to-delete
You have new mail in /var/spool/mail/root
[root@dvicker dsync]# 

Here is the end of the first transfer:

[dvicker@dvicker dsync]$ tail -15 dsync-testing.2023-11-03T13-27-57.out
[2023-11-04T06:10:03] Updated 1151121 items in 408.878 seconds (2815.316 items/sec)
[2023-11-04T06:10:03] Syncing directory updates to disk.
[2023-11-04T06:10:06] Sync completed in 2.394 seconds.
[2023-11-04T06:10:06] Started: Nov-03-2023,14:05:41
[2023-11-04T06:10:06] Completed: Nov-04-2023,06:10:06
[2023-11-04T06:10:06] Seconds: 57864.748
[2023-11-04T06:10:06] Items: 1151121
[2023-11-04T06:10:06]   Directories: 35775
[2023-11-04T06:10:06]   Files: 1112126
[2023-11-04T06:10:06]   Links: 3220
[2023-11-04T06:10:06] Data: 47.312 TiB (52019905315792 bytes)
[2023-11-04T06:10:06] Rate: 857.345 MiB/s (52019905315792 bytes in 57864.748 seconds)
[2023-11-04T06:10:06] Updating timestamps on newly copied files
[2023-11-04T06:10:06] Completed updating timestamps
[2023-11-04T06:10:06] Completed sync
[dvicker@dvicker dsync]$ 

And the output from immediately running the second dsync.

[2023-11-04T06:41:24] Comparing file sizes and modification times of 1112126 items
[2023-11-04T06:41:24] Started   : Nov-04-2023, 06:41:23
[2023-11-04T06:41:24] Completed : Nov-04-2023, 06:41:24
[2023-11-04T06:41:24] Seconds   : 0.702
[2023-11-04T06:41:24] Items     : 1112126
[2023-11-04T06:41:24] Item Rate : 1112126 items in 0.702232 seconds (1583701.944237 items/sec)
[2023-11-04T06:41:24] Deleting items from destination
[2023-11-04T06:41:24] Removing 3 items
[2023-11-04T06:41:24] Removed 3 items in 0.002 seconds (1491.750 items/sec)
[2023-11-04T06:41:24] Copying items to destination
[2023-11-04T06:41:24] Copying to /nobackup2/dvicker/dsync_test/ccdev
[2023-11-04T06:41:24] Items: 3
[2023-11-04T06:41:24]   Directories: 0
[2023-11-04T06:41:24]   Files: 3
[2023-11-04T06:41:24]   Links: 0
[2023-11-04T06:41:24] Data: 55.123 GiB (18.374 GiB per file)
[2023-11-04T06:41:24] Creating 3 files.
[2023-11-04T06:41:24] Copying data.
[2023-11-04T06:41:45] Copied 10.328 GiB (19%) in 20.811 secs (508.184 MiB/s) 90 secs left ...
[2023-11-04T06:41:58] Copied 21.477 GiB (39%) in 34.094 secs (645.048 MiB/s) 53 secs left ...
[2023-11-04T06:42:21] Copied 34.646 GiB (63%) in 56.483 secs (628.105 MiB/s) 33 secs left ...
[2023-11-04T06:42:21] Copied 55.123 GiB (100%) in 56.483 secs (999.351 MiB/s) done
[2023-11-04T06:42:21] Copy data: 55.123 GiB (59188151153 bytes)
[2023-11-04T06:42:21] Copy rate: 999.351 MiB/s (59188151153 bytes in 56.483 seconds)
[2023-11-04T06:42:21] Syncing data to disk.
[2023-11-04T06:42:23] Sync completed in 2.393 seconds.
[2023-11-04T06:42:23] Setting ownership, permissions, and timestamps.
/nobackup2/dvicker/dsync_test/ccdev/boeing/aerothermal/steam_database_solutions/bp_results/MIIIb_traj_MC_690to700_run_00214.cdat/misc/processed_hifi_trajs.h5 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/Crew-3/boat-and-recovery-video/TRaTS Raw Converted Stills/00025276_000000000062B5E5_Lk5x5_B1250_L16384.bmp atime_sec=1699081733 atime_ns=0 mtime_sec=1651831020 mtime_ns=0
/nobackup2/dvicker/dsync_test/ccdev/boeing/aerothermal/steam_database_solutions/bp_results/MIIIb_traj_MC_690to700_run_00214.cdat/misc/processed_hifi_trajs.h5 rc=0 errno=0 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-11-04T02:06:19 mtime=2023-06-06T11:29:24
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/Crew-3/boat-and-recovery-video/TRaTS Raw Converted Stills/00025276_000000000062B5E5_Lk5x5_B1250_L16384.bmp rc=0 errno=0 atime_sec=1699081733 atime_ns=0 mtime_sec=1651831020 mtime_ns=0 atime=2023-11-04T02:08:53 mtime=2022-05-06T04:57:00
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-11-04T02:08:28 mtime=2023-01-11T05:34:17
[2023-11-04T06:42:23] Updated 3 items in 0.008 seconds (398.938 items/sec)
[2023-11-04T06:42:23] Syncing directory updates to disk.
[2023-11-04T06:42:25] Sync completed in 2.375 seconds.
[2023-11-04T06:42:25] Started: Nov-04-2023,06:41:24
[2023-11-04T06:42:25] Completed: Nov-04-2023,06:42:25
[2023-11-04T06:42:25] Seconds: 61.262
[2023-11-04T06:42:25] Items: 3
[2023-11-04T06:42:25]   Directories: 0
[2023-11-04T06:42:25]   Files: 3
[2023-11-04T06:42:25]   Links: 0
[2023-11-04T06:42:25] Data: 55.123 GiB (59188151153 bytes)
[2023-11-04T06:42:25] Rate: 921.392 MiB/s (59188151153 bytes in 61.262 seconds)
[2023-11-04T06:42:25] Updating timestamps on newly copied files

One thing stood out to me. The output said it was deleting and re-transferring 3 files, but the subsequent output shows 6 files.

I picked one of those files and grepped it out of the output.

[dvicker@dvicker dsync]$ grep 2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV dsync-testing.2023-11-03T13-27-57.out dsync-testing.2023-11-04T06-10-06.out
dsync-testing.2023-11-03T13-27-57.out:/nobackup/PROJECTS/ccdev.to/nobackup/PROJECTS/ccdev.to-delete/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1697860054 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-03T13-27-57.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1697860054 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-03T13-27-57.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1697860054 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-10-20T22:47:34 mtime=2023-01-11T05:34:17
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-11-04T02:08:28 mtime=2023-01-11T05:34:17
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0
dsync-testing.2023-11-04T06-10-06.out:/nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV rc=0 errno=0 atime_sec=1699081708 atime_ns=0 mtime_sec=1673436857 mtime_ns=0 atime=2023-11-04T02:08:28 mtime=2023-01-11T05:34:17
[dvicker@dvicker dsync]$ sudo ls -l /nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV
-rw-rw---- 1 sajansse ccp_spacex 17839545344 Jan 11  2023 /nobackup2/dvicker/dsync_test/ccdev/spacex/chutes/video/CRS-26/boat-and-recovery-video/2023-01-11-SpaceX-CRS26-MX20-MWIR-Overlay-Raw.MOV
[dvicker@dvicker dsync]$ 

Even with gunzip --best, these output files are over the upload limit so I had to split them into pieces. I've done that so I can upload them and you can see the entire output.

[dvicker@dvicker dsync]$ split -n 3 dsync-testing.2023-11-03T13-27-57.out{,.split}
[dvicker@dvicker dsync]$ split -n 2 dsync-testing.2023-11-04T06-10-06.out{,.split}

dsync-testing.2023-11-03T13-27-57.out.splitaa.gz
dsync-testing.2023-11-03T13-27-57.out.splitab.gz
dsync-testing.2023-11-03T13-27-57.out.splitac.gz
dsync-testing.2023-11-04T06-10-06.out.splitaa.gz
dsync-testing.2023-11-04T06-10-06.out.splitab.gz

@adammoody
Copy link
Member

Thanks for running these additional tests.

I think there are just 3 files, but it prints two debug messages for each file.

Looking at the *.h5 file in the set, grep prints these 2 lines in the first dsync:

atime_sec=1696039618 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
rc=0 errno=0 atime_sec=1696039618 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-09-29T21:06:58 mtime=2023-06-06T11:29:24

And then for that file in the second dsync, I see these 4 lines:

atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
rc=0 errno=0 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-11-04T02:06:19 mtime=2023-06-06T11:29:24
atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0
rc=0 errno=0 atime_sec=1699081579 atime_ns=0 mtime_sec=1686068964 mtime_ns=0 atime=2023-11-04T02:06:19 mtime=2023-06-06T11:29:24

So for this particular file, it looks dsync succeeded each time that it set the timestamps (rc=0 every time).

The mtime values did not change between the two runs (mtime_sec=1686068964 mtime_ns=0).

I do see that the access time changed from atime=2023-09-29T21:06:58 to atime=2023-11-04T02:06:19. That's probably due to dsync updating the atime when it reads the source file to begin with. The next release will have the --open-noatime to avoid that.

Given that the mtime is the same, I'm not sure why dsync decided to delete and re-copy the file. Perhaps the file size was different between the two?

Given that the mtime didn't change on the source file, and knowing that the file system was locked down, I really doubt the size of the source file changed. Maybe dsync never fully copied the file to the destination in the first run, even though it thought it had?

It would have been a good idea to print the file size, too, so that I could verify. There might be a couple other checks, like if the file type came up different somehow. I'll look to add that.

@adammoody
Copy link
Member

@dvickernasa , I've updated the debugtimestamps branch. I merged a commit which required a force push, so hopefully that doesn't trip you up.

https://github.com/hpc/mpifileutils/tree/debugtimestamps

Since the mtime handling in the output of your last test looked to be correct, I've disabled the messages that print getting/setting timestamps on each file so that it's not as noisy. I added new messages to print what difference dsync finds between the source/destination when it decides to replace a destination file.

If you have a chance, would you please kick off a repeat of that test?

I realize that takes a while to run, so I appreciate your time. Thanks.

@adammoody
Copy link
Member

Oh, this branch does include the new --open-noatime option, too, if you want to use that. I don't think that will interfere with the bug we're currently chasing. It should prevent dsync from updating the source atime values, which should also speed up the second dsync a bit. Without that option, I would expect the second dsync to be setting the timestamps on all destination files to update atime.

@dvickernasa
Copy link
Author

I've compiled the update and I'll repeat the test. I'm not going to use the new option for now, just to be consistent. I'll email the output for the latest results when its done.

@adammoody
Copy link
Member

Thanks, @dvickernasa . In this latest test, the second dsync copied 3 files. The new "DIFF" debug statements show that it detected a difference in the size/mtime check:

DIFF size/mtime src ... src_size=41347294011 dst_size=41347294011 src_mtime=1686068964 dst_mtime=1699435045
DIFF size/mtime src ... src_size=5307980 dst_size=5307980 src_mtime=1502986683 dst_mtime=1699435081
DIFF size/mtime src ... src_size=4759 dst_size=4759 src_mtime=1444167688 dst_mtime=1699434866

Here the file size values match, but the mtime values are different. The destination mtimes are all much more recent than the source file mtimes, and the destination mtime values look like they probably match up with the time at which the dsync was running.

From your previous test, we saw that dsync had correctly set the mtime on all files, including the 3 files that were copied by the second run. Unfortunately, I can't verify whether the mtime values had been set correctly in this test, because I dropped those messages.

I suspect that to be the case again, but I'd like to add those messages back and run once more to be certain.

Would you mind pulling those changes and running once more?

Thanks!

@adammoody
Copy link
Member

With this next test, the goal is to look for messages that verify that the timestamps are being set correctly and then must somehow be getting changed again behind the scenes.

I'm back to wondering if this is due to outstanding writes that are buffered on the server and then getting applied after the timestamps have been set. Those "late writes" would change the mtime to the time that last write is applied. dsync has a sync() call and then an MPI_Barrier() which is meant to finalize all writes before setting any timestamps, but maybe that's not sufficient.

write all data
sync()
MPI_Barrier(MPI_COMM_WORLD);
set timestamps

For this test, you don't need to send the full output to me. You can grep for "DIFF" in the second run to find any files that were re-copied. Even one file should be sufficient. Then also grep for the source and destination path of that file in both the first and second run output. We should only need those lines.

Assuming that behaves as expected, I'd like to run one more test where we modify dsync to call fsync() before closing every file. I'd like to check whether that forces the writes to happen.

Thanks again for your help with this, @dvickernasa .

@adilger
Copy link
Contributor

adilger commented Nov 10, 2023

@adammoody just following along at home, and wanted to note that sync() on one client node is not going to flush out dirty data on all of the other client nodes, otherwise the servers would see a crippling number of sync writes as every connected client is causing sync writes on every other client. Also, IIRC, sync() at the OS level may start writes, but does not necessarily wait to finish all writes on all filesystems (this has changed over the years between kernels.

Once the data has been flushed from the clients to the servers, Lustre will generally write it to storage in order, so a later "sync" will ensure all earlier writes are also committed to storage.

The "guaranteed" way to ensure all of the data is flushed is to call fsync(fd) or fdatasync(fd) on each files written by each node. If dsync is writing files in batches, then it could f(data)sync(fd) in reverse order (and/or after some delay) so that most of the time it is not waiting for any data to be written or actually triggering a sync on the storage.

Calling the llapi_file_flush() interface will revoke the locks and cache on all clients to flush their data, but of course has no control over whether there are still active processes writing new data to the file, so that would need to be done at the end (I don't know if there is any coordination between writers?). Internally, this is calling llapi_get_data_version(). For debugging, calling llapi_get_data_vesion() when the data writers are finished, and then again before setting the timestamp would tell you if the data has been modified on any OST objects since it was supposedly flushed.

Since these APIs could trigger sync writes for every file accessed (if they have dirty data), it would best to call it on a file only when all of the writers are done, just before the timestamp is reset. That should avoid too much overhead.

While there is an S_NOCMTIME flag that we set on inodes when writing on the OSS (so that the local server clock does not mess with the client timestamps) there is unfortunately no kernel API to set this on the clients.

@adammoody
Copy link
Member

Thanks, @adilger . I was hoping you might still be following this thread. The global barrier should ensure that sync has been called by all ranks before any rank starts to set timestamps. However, if sync returns before all data has been written by the servers, then that could lead to the problem that we're seeing.

It sounds like I'll need the fsync or fdatasync for this. We used to have an fsync in there. I think we took that out to improve performance and added the single sync() at the end in the hopes it would accomplish the same thing but more efficiently. I like the idea of reversing the order of fsync calls to speed things up, though I might need to reduce the batch size to avoid keeping too many files open.

For a large file that gets split across clients, I presume a single call to llapi_file_flush would be more efficient than having each client call fsync individually? That might be a good Lustre-specific optimization.

@dvickernasa
Copy link
Author

@adammoody, the latest test completed. It didn't re-transfer files on the 2nd dsync this time so I'm not sure how useful the results are but I'll send them to you anyway.

I'm pulling the latest changes and repeating the test now.

@adammoody
Copy link
Member

Thanks, @dvickernasa . Right, unfortunately it looks like this latest run didn't trigger the problem. Hopefully, the next run will catch it. Thanks as always for your help.

When you run using that latest branch, it will be quite verbose again as it prints a message whenever getting/setting timestamps.

@dvickernasa
Copy link
Author

Sorry for the delay. I was at SC23 all last week and didn't get back to this. Looks like my latest transfer with the updated code caught one file.

[2023-11-14T03:38:05] Comparing file sizes and modification times of 1112126 items
[2023-11-14T03:38:05] Started   : Nov-14-2023, 03:38:05
[2023-11-14T03:38:05] Completed : Nov-14-2023, 03:38:05
[2023-11-14T03:38:05] Seconds   : 0.721
[2023-11-14T03:38:05] Items     : 1112126
[2023-11-14T03:38:05] Item Rate : 1112126 items in 0.721121 seconds (1542218.650501 items/sec)
[2023-11-14T03:38:05] Deleting items from destination
[2023-11-14T03:38:05] Removing 1 items
[2023-11-14T03:38:05] Removed 1 items in 0.001 seconds (964.208 items/sec)
[2023-11-14T03:38:05] Copying items to destination
[2023-11-14T03:38:05] Copying to /nobackup2/dvicker/dsync_test/ccdev
[2023-11-14T03:38:05] Items: 1
[2023-11-14T03:38:05]   Directories: 0
[2023-11-14T03:38:05]   Files: 1
[2023-11-14T03:38:05]   Links: 0
[2023-11-14T03:38:05] Data: 917.512 MiB (917.512 MiB per file)
[2023-11-14T03:38:05] Creating 1 files.
[2023-11-14T03:38:05] Copying data.
[2023-11-14T03:38:09] Copy data: 917.512 MiB (962081164 bytes)
[2023-11-14T03:38:09] Copy rate: 236.609 MiB/s (962081164 bytes in 3.878 seconds)
[2023-11-14T03:38:09] Syncing data to disk.
[2023-11-14T03:38:12] Sync completed in 2.778 seconds.
[2023-11-14T03:38:12] Setting ownership, permissions, and timestamps.           

@adammoody I'll email you the latest output.

@adammoody
Copy link
Member

Thanks again, @dvickernasa . No problem, and I hope you had a good time at SC.

Yes, this set of runs captured good evidence of what we were looking for. Copying bits from your email, here is the note about the second run that finds a difference in mtime:

DIFF /src/file.mpg src_size=962081164 src_mstr=2023-04-15T08:54:32 src_mtime=1681566872 src_mtime2=1681566872
DIFF /dst/file.mpg dst_size=962081164 dst_mstr=2023-11-13T23:02:23 dst_mtime=1699938143 dst_mtime2=1699938143

The source file has an mtime of 2023-04-15T08:54:32 while the destination file has something more recent at 2023-11-13T23:02:23.

However, for this file, I see that the output from the first run reports that it had set the mtime to the correct value 2023-04-15T08:54:32 and without error rc=0:

dsync-testing.2023-11-13T10-00-06.out:dst /dst/file.mpg rc=0 errno=0 atime_sec=1699621797 atime_ns=0 mtime_sec=1681566872 mtime_ns=0 atime=2023-11-10T07:09:57 mtime=2023-04-15T08:54:32

Let me update the branch to enable the fsync() before closing each file. This may slow things down during the copy, but I'd like to see whether the problem still shows up in that case.

@adammoody
Copy link
Member

adammoody commented Nov 20, 2023

@dvickernasa , I've pushed a commit to the debugtimestamps branch that forces the fsync during the copy.

c17d0c6

When you get a chance, would you please pull this change, rebuild, and try another test?

This will add a lot of fsync calls, which may slow down the copy speed. It will also change the timing of things.

Given that this looks to be a race condition, the timing change alone might hide the problem. It'll be hard to trust that it's really gone without running more tests or running larger. On the other hand, if we still see the problem show up with fsync, we might then try the llapi_file_flush API.

I think what we're seeing is that some outstanding writes are being buffered by a server and applied after the timestamp. I'm looking for different ways to force those writes to finish.

@dvickernasa
Copy link
Author

I'm re-running my test with the updated code. Should be finished in the morning. Since I've transferred this same set of files a few times now, we should get a good idea if the extra fsyncs's slow things down much.

@adammoody
Copy link
Member

adammoody commented Nov 20, 2023

@adilger , assuming this is what is happening, I see the following in the notes section of the man 2 sync page on my system:

According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However Linux waits for I/O completions, and thus sync() or syncfs() provide the same guarantees as fsync called on every file in the system or filesystem respectively.

dsync currently uses sync as though it flushes (to disk) any data that the calling process has written. I think the Linux behavior described above would be sufficient, but the POSIX.1-2001 specification would not be.

Do you know how Lustre treats sync?

From your earlier comments, it sounds like it may be more along the lines of POSIX, where sync starts the writes but does not wait for them to finish. You've already described that above, but I just wanted to confirm. I feel like the text in the man page confuses things, since I know Lustre is POSIX-compliant, but this is also on a Linux system.

@adilger
Copy link
Contributor

adilger commented Nov 21, 2023

@adammoody I don't think that sync() will at most flush a single client's dirty writes to the filesystem.

@dvickernasa
Copy link
Author

The updated test ran this morning. The 2nd transfer did not re-transfer anything, so that's encouraging. I'll be out the rest of the week for Thanksgiving but I'm going to put this in a loop and run in several times while I'm gone, so we'll have some more data next week.

Also, here is some data on how the extra sync's affected the transfer times. There is a lot of variation in the runs but it would appear that there isn't a huge affect. Or, it might be nice to add some (potentially just debug) timing statements to directly calculate the overhead of calling the sync.

[dvicker@dvicker dsync]$ for file in  dsync-testing.*.out; do
> grep Rate: $file | tail -1 
> done
[2023-10-16T20:12:58] Rate: 730.639 MiB/s (5357437501746 bytes in 6992.849 seconds)
[2023-10-17T10:19:23] Rate: 35.958 B/s (753 bytes in 20.941 seconds)
[2023-10-19T20:44:39] Rate: 756.352 MiB/s (5736324718486 bytes in 7232.852 seconds)
[2023-10-20T14:05:21] Rate: 655.065 MiB/s (5736324718486 bytes in 8351.211 seconds)
[2023-10-20T16:24:18] Rate: 700.910 MiB/s (5736324718486 bytes in 7804.975 seconds)
[2023-10-20T18:43:53] Rate: 693.610 MiB/s (5736324718486 bytes in 7887.125 seconds)
[2023-10-20T21:00:51] Rate: 701.543 MiB/s (5736324718486 bytes in 7797.935 seconds)
[2023-10-20T23:18:05] Rate: 703.717 MiB/s (5736324718486 bytes in 7773.846 seconds)
[2023-10-21T12:12:59] Rate: 479.823 MiB/s (2842477276278 bytes in 5649.579 seconds)
[2023-11-04T06:10:06] Rate: 857.345 MiB/s (52019905315792 bytes in 57864.748 seconds)
[2023-11-04T06:42:25] Rate: 921.392 MiB/s (59188151153 bytes in 61.262 seconds)
[2023-11-08T07:21:31] Rate: 843.019 MiB/s (52019905315792 bytes in 58848.101 seconds)
[2023-11-08T07:58:05] Rate: 877.777 MiB/s (41352606750 bytes in 44.928 seconds)
[2023-11-10T11:29:01] Rate: 836.292 MiB/s (52019905315792 bytes in 59321.474 seconds)
[2023-11-14T03:05:10] Rate: 835.834 MiB/s (52019905315792 bytes in 59353.928 seconds)
[2023-11-14T03:38:15] Rate: 98.241 MiB/s (962081164 bytes in 9.339 seconds)
[2023-11-21T09:28:14] Rate: 814.605 MiB/s (52019905315792 bytes in 60900.757 seconds)
[dvicker@dvicker dsync]$ 

@adammoody
Copy link
Member

That's great. Thanks, @dvickernasa . Have a Happy Thanksgiving!

@ofaaland
Copy link
Collaborator

Hi @dvickernasa, hope you had a good Thanksgiving.

Is the destination file system built with OSTs backed by ZFS, and if so, is "options osd_zfs osd_object_sync_delay_us=<some_positive_number>" set on those servers? If so I might have an explanation for what we're seeing (although I'm not sure). Thanks!

@dvickernasa
Copy link
Author

@adammoody, I got 6 more iterations on this done over the break. It looks like all of them found more data to copy on the 2nd transfer.

[dvicker@dvicker dsync]$ grep -e Removing dsync-testing.1.2023-11-22T11-04-07.out dsync-testing.2.2023-11-23T05-21-15.out dsync-testing.3.2023-11-24T00-25-20.out dsync-testing.4.2023-11-24T19-19-35.out dsync-testing.5.2023-11-25T13-22-08.out dsync-testing.6.2023-11-26T06-59-07.out
dsync-testing.1.2023-11-22T11-04-07.out:[2023-11-22T11:41:24] Removing 2 items
dsync-testing.2.2023-11-23T05-21-15.out:[2023-11-23T06:03:11] Removing 5 items
dsync-testing.3.2023-11-24T00-25-20.out:[2023-11-24T01:08:11] Removing 1 items
dsync-testing.4.2023-11-24T19-19-35.out:[2023-11-24T19:57:52] Removing 1 items
dsync-testing.5.2023-11-25T13-22-08.out:[2023-11-25T13:56:00] Removing 2 items
dsync-testing.6.2023-11-26T06-59-07.out:[2023-11-26T07:36:36] Removing 1 items
[dvicker@dvicker dsync]$ ll dsync-testing.?.*.out
 54186 -rw-r--r-- 1 root root 328067432 Nov 22 11:04 dsync-testing.1.2023-11-21T17-21-53.out
 67614 -rw-r--r-- 1 root root 321097260 Nov 22 11:52 dsync-testing.1.2023-11-22T11-04-07.out
 54351 -rw-r--r-- 1 root root 328053051 Nov 23 05:21 dsync-testing.2.2023-11-22T11-55-54.out
 68126 -rw-r--r-- 1 root root 322586738 Nov 23 06:15 dsync-testing.2.2023-11-23T05-21-15.out
 55228 -rw-r--r-- 1 root root 328051558 Nov 24 00:25 dsync-testing.3.2023-11-23T06-18-43.out
 68071 -rw-r--r-- 1 root root 322618344 Nov 24 01:29 dsync-testing.3.2023-11-24T00-25-20.out
 55603 -rw-r--r-- 1 root root 328055348 Nov 24 19:19 dsync-testing.4.2023-11-24T01-32-36.out
 68034 -rw-r--r-- 1 root root 322569590 Nov 24 20:20 dsync-testing.4.2023-11-24T19-19-35.out
 54397 -rw-r--r-- 1 root root 328050794 Nov 25 13:22 dsync-testing.5.2023-11-24T20-23-39.out
 67888 -rw-r--r-- 1 root root 322593290 Nov 25 14:07 dsync-testing.5.2023-11-25T13-22-08.out
 54698 -rw-r--r-- 1 root root 328059803 Nov 26 06:59 dsync-testing.6.2023-11-25T14-10-39.out
 68098 -rw-r--r-- 1 root root 322600133 Nov 26 07:37 dsync-testing.6.2023-11-26T06-59-07.out
[dvicker@dvicker dsync]$ 

I'll sent you some more details via email.

@ofaaland, for the testing I've been using in the last several posts (and most of the other examples in this issue), the source LFS does have ZFS OST's. The destination LFS is ldiskfs. Even so, we are setting the parameter you mentioned to zero.

[root@hpfs-fsl-oss00 ~]# cat /etc/modprobe.d/lustre.conf
# Lustre modprobe configuration

options lnet networks=tcp0(enp4s0),o2ib0(ib1),o2ib1(ib0)
options ko2iblnd map_on_demand=32
options osd_zfs osd_txg_sync_delay_us=0 osd_object_sync_delay_us=0
options zfs zfs_txg_history=120 zfs_txg_timeout=15 zfs_prefetch_disable=1 zfs_txg_timeout=30 zfs_vdev_scheduler=deadline

[root@hpfs-fsl-oss00 ~]# 

@ofaaland
Copy link
Collaborator

Hi @dvickernasa Since the destination LFS is ldiskfs, that rules out my theory. Back to the drawing board for us. Thank you.

@adammoody
Copy link
Member

adammoody commented Nov 28, 2023

Thanks for that additional test output, @dvickernasa . This shows that the problem is still there even if we fsync() every file. I was hoping that might be a work around, but it's instructive to know that it's not.

I've pushed a couple more commits to the branch.

The first one calls llapi_file_flush() as @adilger suggested. I left the fsync in there, so now dsync will be doing both. I'd like to see whether this might do better than fsync. As added, this will only work properly when the destination file system is Lustre.

I also added a call to stat the destination file and check the timestamps immediately after setting them. This will print a new DIFF message if it finds a difference. For any file that is re-copied in the second run, if we do not see those new DIFF messages show up in the output of the first run, then it helps further verify that the mtime was set correctly to begin with.

When you have a chance, would you please pull, rebuild, and submit a new round of tests?

@dvickernasa
Copy link
Author

I've started my test with the updated code.

@dvickernasa
Copy link
Author

@adammoody, we talked a while ago about adding the --input option to dsync. If its easy, that would come in handy for all these tests. It takes about 30 minutes to walk the source data for the test I've been running (which is on our older lustre filesystem).

@adammoody
Copy link
Member

Oh, right. That slipped my mind. I'll work on that.

@dvickernasa
Copy link
Author

The latest test finished (sorry for the delay). The latest run found 9 files to retransfer.

[2023-11-29T11:34:37] Started   : Nov-29-2023, 11:34:36
[2023-11-29T11:34:37] Completed : Nov-29-2023, 11:34:37
[2023-11-29T11:34:37] Seconds   : 0.717
[2023-11-29T11:34:37] Items     : 1112126
[2023-11-29T11:34:37] Item Rate : 1112126 items in 0.717182 seconds (1550688.322885 items/sec)
[2023-11-29T11:34:37] Deleting items from destination
[2023-11-29T11:34:37] Removing 9 items
[2023-11-29T11:34:37] Removed 9 items in 0.019 seconds (473.855 items/sec)
[2023-11-29T11:34:37] Copying items to destination
[2023-11-29T11:34:37] Copying to /nobackup2/dvicker/dsync_test/ccdev
[2023-11-29T11:34:37] Items: 9
[2023-11-29T11:34:37]   Directories: 0
[2023-11-29T11:34:37]   Files: 9
[2023-11-29T11:34:37]   Links: 0
[2023-11-29T11:34:37] Data: 101.952 GiB (11.328 GiB per file)
[2023-11-29T11:34:37] Creating 9 files.
[2023-11-29T11:34:37] Copying data.
[2023-11-29T11:34:51] Copied 9.484 GiB (9%) in 14.754 secs (658.244 MiB/s) 144 secs left ...
[2023-11-29T11:36:20] Copied 20.059 GiB (20%) in 103.611 secs (198.241 MiB/s) 423 secs left ...
[2023-11-29T11:36:20] Copied 101.952 GiB (100%) in 103.611 secs (0.984 GiB/s) done
[2023-11-29T11:36:20] Copy data: 101.952 GiB (109470191676 bytes)
[2023-11-29T11:36:20] Copy rate: 0.984 GiB/s (109470191676 bytes in 103.611 seconds)
[2023-11-29T11:36:20] Syncing data to disk.
[2023-11-29T11:36:23] Sync completed in 2.766 seconds.
[2023-11-29T11:36:23] Setting ownership, permissions, and timestamps.

@adammoody
Copy link
Member

Thanks, @dvickernasa . Wow, this problem is stubborn. So it would seem the llapi_file_flush() calls are also not helping.

Based on the data from the latest run that you emailed to me separately, I don't see any of the new DIFF messages that would have been printed by the first run if it detected a wrong mtime immediately after it set the timestamp. That further suggests that the timetamps are being set correctly initially and then being changed again later.

Even more puzzling is that several of the files that have incorrect mtimes are of size=0. There should not be any write() calls issued for those files. Maybe this not a buffered write after all, but something else. Let me take another look through the code with that in mind.

BTW, I am working to have the --input-src option for your next run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants