mknod() failed (errno=2 No such file or directory) #574
Here's another case with
Does that mount have some kind of caching or delayed update? I'm more familiar with DAOS than Lustre, but I've seen similar behavior with DAOS and certain cache settings where a rank creates the directory on one host, then a rank on a different host doesn't see that directory immediately.
That's a good thought and could explain the raciness. Lustre is fully coherent across nodes, but I'm not sure how
I don't think so, unless there's a bug somewhere. This is the function that creates the directories: mpifileutils/src/common/mfu_flist_copy.c, line 1042 (commit 0a4c530).
And notice there is a barrier: mpifileutils/src/common/mfu_flist_copy.c, line 1123 (commit 0a4c530).
And that function is called before creating the files: mpifileutils/src/common/mfu_flist_copy.c, lines 2490 to 2495 (commit 0a4c530).
As far as I remember (I'm not the primary author!) dcp doesn't directly handle coherency. It creates all directories upfront, barrier, then creates files, barrier, then copies file data.
Does the job2 directory ever get created at all? In the example above, I see that not only did the file copies to job2 fail, but the chmod on job2 itself also failed. That chmod is a third phase (mkdirs, then file copies, then set metadata), so multiple barriers later. |
This is what I'm referring to with respect to the chmod failing. I see this in both the first and second examples you posted. So, I wonder if the mkdir() failed, but for some reason dcp didn't know it failed. Can you instrument your test to see what each node sees under the destination root directory after dcp finishes?
Good point. If mkdir fails, it looks like the copy continues anyway: mpifileutils/src/common/mfu_flist_copy.c, lines 2490 to 2499 (commit 0a4c530).
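The failure mode being hypothesized here can be demonstrated in isolation. This is a minimal sketch with hypothetical paths (not the mpifileutils code): if a `mkdir()` failure is ignored, the later file create inside that directory fails with ENOENT, matching the `errno=2 No such file or directory` in this report.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Demonstrates why an ignored mkdir() failure surfaces later as
 * ENOENT: creating a file inside a directory that was never created
 * fails with errno=2 (No such file or directory). */
int create_in_dir(const char *parent)
{
    char dir[4096], file[4096];
    snprintf(dir, sizeof(dir), "%s/job2", parent);
    snprintf(file, sizeof(file), "%s/job2/data.bin", parent);

    /* Suppose mkdir() fails here (e.g. the parent is missing or the
     * create raced) and the error is not checked -- the copy
     * "continues anyway", as in the code path linked above. */
    mkdir(dir, 0700);   /* return value deliberately ignored */

    int fd = open(file, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return errno;   /* ENOENT when the directory never appeared */
    close(fd);
    return 0;
}
```

Checking the `mkdir()` return value (treating only `EEXIST` as benign) would turn this delayed ENOENT into an immediate, attributable error.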
I was able to add a pause in our code after the dcp failure occurs to go take a look at the lustre filesystem on each of the nodes. Obviously this isn't in real time and any sort of caching issue would have time to catch up since I have to do this manually right now, but I was able to take a peek at the two nodes and the files: node0:
node1:
You can see the discrepancy between the two nodes above. However, in this run, I'm seeing an actual error message for a failed mkdir(). Here is the output:
So even though
What version of Lustre is in use on the servers? There was a race in the server request processing code that resulted in newly created files/directories having permission 0000; it was fixed in https://jira.whamcloud.com/browse/LU-16056 and backported to 2.15.2 servers.
@adilger we're seeing this with 2.15.4_6.llnl, which does include the fix for LU-16056. For what it's worth, we didn't notice this issue with 2.15.3_5.llnl, but we haven't yet tried rolling the servers back. We do have a small patch stack applied, but we had most of those changes applied to both Lustre versions and nothing in there really jumps out at me. It's mostly kfilnd/lnet fixes.
@behlendorf is this reproducible enough that it could be bisected between 2.15.3 and 2.15.4? |
@bdevcich can you modify your build of mpifileutils so |
I'm back from vacation and I will start on this. |
Here is the output with some added debugging lines from the
@bdevcich I see the stat() of the job directory returned ENOENT
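A stat()-returning-ENOENT check of the kind being debugged here can be sketched as follows (a minimal illustration with a hypothetical helper name, not the actual added debugging code):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Returns 1 if the path exists, 0 if stat() reports ENOENT, and -1 on
 * any other error.  Logging the errno string makes a missing job
 * directory unambiguous in the dcp output. */
int dir_exists(const char *path)
{
    struct stat sb;
    if (stat(path, &sb) == 0)
        return 1;
    if (errno == ENOENT) {
        fprintf(stderr, "stat(%s): %s (errno=%d)\n",
                path, strerror(errno), errno);
        return 0;
    }
    return -1;
}
```

Running this on each node right after the directory-creation phase would show whether the job directory is visible everywhere before file creation begins.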
It's concerning that mkdir would fail. @behlendorf tells me you're able to reproduce this on elcap, so I'll try myself once the nodes are back up post firmware update. |
On our internal system where I see this, we're using |
As I mentioned in the rabbit issue, if I put a 30s pause after the mount of ephemeral Lustre, I cannot reproduce this issue. I've removed the pause and did my best to capture the Lustre logs from the two rabbit nodes when this occurs. I did a
Hello,

I have been seeing this issue when using `dcp` to recursively copy a directory from one Lustre filesystem to another. This is being run via `mpirun` and over multiple hosts (e.g. `mpirun -H host1,host2`). I have not seen this when run with a single host (or when the launcher is on the same host as the 1 worker/host).

The error in the `dcp` output suggests `no such file or directory`, which seems like the parent directory (e.g. `/mnt/nnf/.../job2/`) is not being created first. I originally thought the `Original directory exists, skip the creation` output might suggest that none of the directories are being created, but that output shows up in the case where this issue does not surface.

I don't have a way to reproduce this directly, but if I run our tests enough times this will be hit. Each test is copying the same source directory to essentially the same location on an ephemeral Lustre file system. Here is the source directory and its contents that `dcp` is attempting to copy:

Here is the error output from `dcp`:

The `system call failed during shared memory initialization` message is something that I see when the copy is successful, so I don't think it contributes to the problem here.

Is there anything I can enable to trace this further?