Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mknod() failed (errno=2 No such file or directory) #574

Open
bdevcich opened this issue May 29, 2024 · 17 comments
Open

mknod() failed (errno=2 No such file or directory) #574

bdevcich opened this issue May 29, 2024 · 17 comments

Comments

@bdevcich
Copy link
Contributor

Hello,

I have been seeing this issue when using dcp to recursive copy a directory from one lustre filesystem to another. This is being ran via mpirun and over multiple hosts (e.g. mpirun -H host1,host2). I have not seen this when ran with a single host (or when the launcher is on the same host as the 1 worker/host).

The error in the dcp output suggests no such file or directory, which seems like the parent directory (e.g. /mnt/nnf/.../job2/) is not being created first. I originally thought the Original directory exists, skip the creation output might suggest that none of the directories are being created, but that output shows up the case where this issue does not surface.

I don't have a way to reproduce this directly, but if I run our tests enough times this will be hit.

Each test is copying the same source directory to essentially the same location on an ephemeral lustre file system. Here is the source directory and its contents that dcp is attempting to copy:

$ tree /lus/global/blake/dm-system-test/src
/lus/global/blake/dm-system-test/src
├── job
│   ├── data2.out
│   ├── data.out
│   └── job2
│       └── data3.out
└── job2
    ├── data2.out
    └── data.out

Here is the error output from dcp:

...
[2024-05-28T17:00:04] Walking /lus/global/blake/dm-system-test/src
[2024-05-28T17:00:04] Walked 9 items in 0.011 secs (807.850 items/sec) ...
[2024-05-28T17:00:04] Walked 9 items in 0.012 seconds (747.192 items/sec)
[2024-05-28T17:00:04] Copying to /mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0
[2024-05-28T17:00:04] Items: 9
[2024-05-28T17:00:04]   Directories: 4
[2024-05-28T17:00:04]   Files: 5
[2024-05-28T17:00:04]   Links: 0
[2024-05-28T17:00:04] Data: 60.000 B (12.000 B per file)
[2024-05-28T17:00:04] Creating 4 directories
[2024-05-28T17:00:04] Original directory exists, skip the creation: `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0' (errno=17 File exists)
[2024-05-28T17:00:04] Creating 5 files.
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data.out' mknod() failed (errno=2 No such file or directory)
[2024-05-28T17:00:04] [14] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data2.out' mknod() failed (errno=2 No such file or directory)
[2024-05-28T17:00:04] Copying data.
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data.out' (errno=2 No such file or directory)
[2024-05-28T17:00:04] [4] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data2.out' (errno=2 No such file or directory)
[2024-05-28T17:00:04] [14] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job2/data2.out' to `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data2.out'
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job2/data.out' to `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data.out'
[2024-05-28T17:00:04] Copy data: 60.000 B (60 bytes)
[2024-05-28T17:00:04] Copy rate: 2.796 KiB/s (60 bytes in 0.021 seconds)
[2024-05-28T17:00:04] Syncing data to disk.
[2024-05-28T17:00:04] Sync completed in 0.038 seconds.
[2024-05-28T17:00:04] Fixing permissions.
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data.out' chmod() (errno=2 No such file or directory)
[2024-05-28T17:00:04] [14] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2/data2.out' chmod() (errno=2 No such file or directory)
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/94a97335-dde0-4eb9-b5ac-844ff6534688-0/job2' chmod() (errno=2 No such file or directory)
[2024-05-28T17:00:04] Updated 9 items in 0.004 seconds (2140.376 items/sec)
[2024-05-28T17:00:04] Syncing directory updates to disk.
[2024-05-28T17:00:04] Sync completed in 0.001 seconds.
[2024-05-28T17:00:04] Started: May-28-2024,17:00:04
[2024-05-28T17:00:04] Completed: May-28-2024,17:00:04
[2024-05-28T17:00:04] Seconds: 0.105
[2024-05-28T17:00:04] Items: 7
[2024-05-28T17:00:04]   Directories: 4
[2024-05-28T17:00:04]   Files: 3
[2024-05-28T17:00:04]   Links: 0
[2024-05-28T17:00:04] Data: 36.000 B (36 bytes)
[2024-05-28T17:00:04] Rate: 343.790 B/s (036 bytes in 0.105 seconds)
[2024-05-28T17:00:04] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

 Local host:  nnf-dm-worker-7c96s
 System call: unlink(2) /dev/shm/vader_segment.nnf-dm-worker-7c96s.c7370001.3
 Error:       Operation not permitted (errno 1)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

 Process name: [[50999,1],15]
 Exit code:    1
--------------------------------------------------------------------------

The system call failed during shared memory initialization is something that I see when the copy is successful so I don't think it contributes to the problem here.

Is there anything I can enable to trace this further?

@bdevcich
Copy link
Contributor Author

bdevcich commented May 29, 2024

Here's another case with dcp -d dbg enabled. According to the debug output, mkdir appears to have worked.

[2024-05-29T17:58:39] Walking /lus/global/blake/dm-system-test/src
   [2024-05-29T17:58:39] Walked 9 items in 0.013 secs (706.261 items/sec) ...
   [2024-05-29T17:58:39] Walked 9 items in 0.014 seconds (660.435 items/sec)
   [2024-05-29T17:58:39] Copying to /mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0
   [2024-05-29T17:58:39] Items: 9
   [2024-05-29T17:58:39]   Directories: 4
   [2024-05-29T17:58:39]   Files: 5
   [2024-05-29T17:58:39]   Links: 0
   [2024-05-29T17:58:39] Data: 60.000 B (12.000 B per file)
   [2024-05-29T17:58:39] Creating 4 directories
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0'
   [2024-05-29T17:58:39] Original directory exists, skip the creation: `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0' (errno=17 File exists)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job'
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2'
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job/job2'
   [2024-05-29T17:58:39] Creating 5 files.
   [2024-05-29T17:58:39] [7] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out' mknod() failed (errno=2 No such file or directory)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out' mknod() failed (errno=2 No such file or directory)
   [2024-05-29T17:58:39] Copying data.
   [2024-05-29T17:58:39] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out' (errno=2 No such file or directory)
   [2024-05-29T17:58:39] [3] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out' (errno=2 No such file or directory)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job2/data.out' to `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out'
   [2024-05-29T17:58:39] [7] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job2/data2.out' to `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out'
   [2024-05-29T17:58:39] Copy data: 60.000 B (60 bytes)
   [2024-05-29T17:58:39] Copy rate: 3.881 KiB/s (60 bytes in 0.015 seconds)
   [2024-05-29T17:58:39] Syncing data to disk.
   [2024-05-29T17:58:39] Sync completed in 0.037 seconds.
   [2024-05-29T17:58:39] Fixing permissions.
   [2024-05-29T17:58:39] [7] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data2.out' chmod() (errno=2 No such file or directory)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2/data.out' chmod() (errno=2 No such file or directory)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2' chmod() (errno=2 No such file or directory)
   [2024-05-29T17:58:39] Updated 9 items in 0.004 seconds (2078.435 items/sec)
   [2024-05-29T17:58:39] Syncing directory updates to disk.
   [2024-05-29T17:58:39] Sync completed in 0.001 seconds.
   [2024-05-29T17:58:39] Started: May-29-2024,17:58:39
   [2024-05-29T17:58:39] Completed: May-29-2024,17:58:39
   [2024-05-29T17:58:39] Seconds: 0.099
   [2024-05-29T17:58:39] Items: 7
   [2024-05-29T17:58:39]   Directories: 4
   [2024-05-29T17:58:39]   Files: 3
   [2024-05-29T17:58:39]   Links: 0
   [2024-05-29T17:58:39] Data: 36.000 B (36 bytes)
   [2024-05-29T17:58:39] Rate: 363.605 B/s (036 bytes in 0.099 seconds)
   [2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)

@daltonbohning
Copy link
Collaborator

I have not seen this when ran with a single host (or when the launcher is on the same host as the 1 worker/host)

Does that mount have some kind of caching or delayed update? I'm more familiar with DAOS than Lustre, but I've seen similar behavior with DAOS and certain cache settings where a rank creates the directory on one host, then a rank on a different host doesn't see that directory immediately

@behlendorf
Copy link

That's an good thought and could explain the raciness. Lustre is fully coherent across nodes but I'm not sure how dcp itself coordinates this internally. Perhaps it's possible that dcp could issue the file create on host-B before the mkdir(2) syscall on host-A completes?

@daltonbohning
Copy link
Collaborator

Perhaps it's possible that dcp could issue the file create on host-B before the mkdir(2) syscall on host-A completes?

I don't think so, unless there's a bug somewhere. This is the function that creates the directories

static int mfu_create_directories(

And notice there is a barrier
MPI_Barrier(MPI_COMM_WORLD);

And that function is called before creating the files
/* create directories, from top down */
int tmp_rc = mfu_create_directories(levels, minlevel, lists, numpaths,
paths, destpath, copy_opts, mfu_src_file, mfu_dst_file);
if (tmp_rc < 0) {
rc = -1;
}

As far as I remember (I'm not the primary author!) dcp doesn't directly handle coherency. It creates all directories upfront, barrier, then creates files, barrier, then copies file data

@ofaaland
Copy link
Collaborator

ofaaland commented May 30, 2024

Does the job2 directory ever get created at all? In the example above, I see that not only did the file copies to job2 fail, but the chmod on job2 itself also failed. That chmod is a third phase (mkdirs, then file copies, then set metadata), so multiple barriers later.

@ofaaland
Copy link
Collaborator

ofaaland commented May 30, 2024

[2024-05-29T17:58:39] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b2632390-930f-4cf4-bd39-4441c892bf6f-0/job2' chmod() (errno=2 No such file or directory)

This is what I'm referring to WRT to the chmod failing. I see this in both the first and second examples you posted. So, I wonder if the mkdir() failed, but for some reason dcp didn't know it failed.

Can you instrument your test to see what each node sees under the destination root directory after dcp finishes?

@daltonbohning
Copy link
Collaborator

So, I wonder if the mkdir() failed, but for some reason dcp didn't know it failed.

Good point. If mkdir fails, it looks like the copy continues anyway

/* create directories, from top down */
int tmp_rc = mfu_create_directories(levels, minlevel, lists, numpaths,
paths, destpath, copy_opts, mfu_src_file, mfu_dst_file);
if (tmp_rc < 0) {
rc = -1;
}
/* operate on files in batches if batch size is given */
uint64_t batch_size = copy_opts->batch_files;
if (batch_size > 0) {

@bdevcich
Copy link
Contributor Author

I was able to add a pause in our code after the dcp failure occurs to go take a look at the lustre filesystem on each of the nodes. Obviously this isn't in real time and any sort of caching issue would have time to catch up since I have to do this manually right now, but I was able to take a peek at the two nodes and the files:

node0:

[root@rabbit-node-1 ~]# ls -laR /mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0
/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0:
total 98
drwxr-xr-x 5 blake users 33280 May 30 12:44 .
drwx------ 2 blake users 33280 May 30 12:44 job
drwxr-xr-x 2 blake users 33280 May 30 12:44 job2

/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job:
total 65
drwx------ 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..

/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job2:
total 90
drwxr-xr-x 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..
-rw-r--r-- 1 blake users    12 May 30 12:44 data2.out
-rw-r--r-- 1 blake users    12 May 30 12:44 data.out

node1:

[root@rabbit-node-2 ~]# ls -laR /mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0
/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0:
total 294
drwxr-xr-x    5 blake users  33280 May 30 12:44 .
drwxr-xr-x 2772 root  root  159744 May 30 12:43 ..
drwx------    2 blake users  33280 May 30 12:44 job
drwxr-xr-x    2 blake users  33280 May 30 12:44 job2

/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job:
total 65
drwx------ 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..

/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job2:
total 90
drwxr-xr-x 2 blake users 33280 May 30 12:44 .
drwxr-xr-x 5 blake users 33280 May 30 12:44 ..
-rw-r--r-- 1 blake users    12 May 30 12:44 data2.out
-rw-r--r-- 1 blake users    12 May 30 12:44 data.out

You can see that the job/ directory does get created in the end result.

However, in this run, I'm seeing an actual error message for a failed mkdir(). Here is the output:

...
[2024-05-30T17:44:03] Debug level set to: debug
...
[2024-05-30T17:44:03] Walking /lus/global/blake/dm-system-test/src
[2024-05-30T17:44:03] Walked 9 items in 0.024 secs (373.053 items/sec) ...
[2024-05-30T17:44:03] Walked 9 items in 0.025 seconds (354.708 items/sec)
[2024-05-30T17:44:03] Copying to /mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0
[2024-05-30T17:44:03] Items: 9
[2024-05-30T17:44:03]   Directories: 4
[2024-05-30T17:44:03]   Files: 5
[2024-05-30T17:44:03]   Links: 0
[2024-05-30T17:44:03] Data: 60.000 B (12.000 B per file)
[2024-05-30T17:44:03] Creating 4 directories
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0'
[2024-05-30T17:44:03] Original directory exists, skip the creation: `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0' (errno=17 File exists)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job2'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1001] Creating directory `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1010] ERROR: Create `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2' mkdir() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] Creating 5 files.
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out' mknod() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] [15] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out' mknod() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1254] ERROR: File `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out' mknod() failed (errno=2 No such file or directory)
[2024-05-30T17:44:03] Copying data.
[2024-05-30T17:44:03] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out' (errno=2 No such file or directory)
[2024-05-30T17:44:03] [4] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out' (errno=2 No such file or directory)
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2082] ERROR: Failed to open output file `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out' (errno=2 No such file or di
rectory)
[2024-05-30T17:44:03] [15] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data.out' to `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out'
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data2.out' to `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out'
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2237] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/job2/data3.out' to `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out'
[2024-05-30T17:44:03] Copy data: 60.000 B (60 bytes)
[2024-05-30T17:44:03] Copy rate: 2.696 KiB/s (60 bytes in 0.022 seconds)
[2024-05-30T17:44:03] Syncing data to disk.
[2024-05-30T17:44:03] Sync completed in 0.037 seconds.
[2024-05-30T17:44:03] Fixing permissions.
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2/data3.out' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data2.out' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [15] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/data.out' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job/job2' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/b61b1d0f-eea3-4298-b85b-e8731913b838-0/job' chmod() (errno=2 No such file or directory)
[2024-05-30T17:44:03] Updated 9 items in 0.009 seconds (1052.615 items/sec)
[2024-05-30T17:44:03] Syncing directory updates to disk.
[2024-05-30T17:44:03] Sync completed in 0.001 seconds.
[2024-05-30T17:44:03] Started: May-30-2024,17:44:03
[2024-05-30T17:44:03] Completed: May-30-2024,17:44:03
[2024-05-30T17:44:03] Seconds: 0.121
[2024-05-30T17:44:03] Items: 5
[2024-05-30T17:44:03]   Directories: 3
[2024-05-30T17:44:03]   Files: 2
[2024-05-30T17:44:03]   Links: 0
[2024-05-30T17:44:03] Data: 24.000 B (24 bytes)
[2024-05-30T17:44:03] Rate: 198.718 B/s (024 bytes in 0.121 seconds)
[2024-05-30T17:44:03] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)

So even though job/ does show up in the end result, any sort of action in that directory fails. dcp fails to make job/job2/, job/data.out, and job/data2.out.

@adilger
Copy link
Contributor

adilger commented May 31, 2024

What version of Lustre is in use on the servers? There was a race in the server request processing code that resulted in their-created files/directories having permission 0000, and was fixed in https://jira.whamcloud.com/browse/LU-16056 and backported to 2.15.2 servers.

@behlendorf
Copy link

@adilger we're seeing this with 2.15.4_6.llnl which does include the fix for LU-16056. For what it's worth we didn't notice this issue with 2.15.3_5.llnl but haven't yet tried rolling back to the servers. We do have a small patch stack applied but we had most of those changes applied to both Lustre versions and nothing in there really jumps out at me. It's mostly kfilnd/lnet fixes.

@adilger
Copy link
Contributor

adilger commented May 31, 2024

@behlendorf is this reproducible enough that it could be bisected between 2.15.3 and 2.15.4?

@ofaaland
Copy link
Collaborator

ofaaland commented Jun 1, 2024

@bdevcich can you modify your build of mpifileutils so dcp does a stat() and reports ownership + permissions on those directories after the mkdir returns?

@bdevcich
Copy link
Contributor Author

bdevcich commented Jun 4, 2024

@bdevcich can you modify your build of mpifileutils so dcp does a stat() and reports ownership + permissions on those directories after the mkdir returns?

I'm back from vacation and I will start on this.

@bdevcich
Copy link
Contributor Author

bdevcich commented Jun 5, 2024

Here is the output with some added debugging lines from the stat():

   [2024-06-05T15:00:57] Walking /lus/global/blake/dm-system-test/src
   [2024-06-05T15:00:57] Walked 9 items in 0.013 secs (702.758 items/sec) ...
   [2024-06-05T15:00:57] Walked 9 items in 0.014 seconds (660.454 items/sec)
   [2024-06-05T15:00:57] Copying to /mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0
   [2024-06-05T15:00:57] Items: 9
   [2024-06-05T15:00:57]   Directories: 4
   [2024-06-05T15:00:57]   Files: 5
   [2024-06-05T15:00:57]   Links: 0
   [2024-06-05T15:00:57] Data: 60.000 B (12.000 B per file)
   [2024-06-05T15:00:57] Creating 4 directories
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0'
   [2024-06-05T15:00:57] Original directory exists, skip the creation: `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0' (errno=17 File exists)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1044] TEST: statbuf: '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1045]    mode: 40755
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:951]    Permissions (octal): 755
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:958]    Non-permission bits (octal): 40000
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1047]    UID: 1060
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1048]    GID: 100
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1049]    numLinks: 3
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job2'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job2'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1044] TEST: statbuf: '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job2'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1045]    mode: 40700
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:951]    Permissions (octal): 700
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:958]    Non-permission bits (octal): 40000
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1047]    UID: 1060
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1048]    GID: 100
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1049]    numLinks: 2
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1042] ERROR: TEST: could not stat '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job' (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1028] ERROR: Create `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2' mkdir() failed (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1042] ERROR: TEST: could not stat '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2' (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1054] ERROR: TEST: mkdir failed, done
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1153] ERROR: TEST: Error creating directory, rc: -1, destpath: �3�V
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2537] ERROR: Error creating directories, rc: -1
   [2024-06-05T15:00:57] Creating 5 files.
   [2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1296] ERROR: File `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out' mknod() failed (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [5] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1296] ERROR: File `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out' mknod() failed (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1296] ERROR: File `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out' mknod() failed (errno=2 No such file or directory)
   [2024-06-05T15:00:57] Copying data.
   [2024-06-05T15:00:57] [4] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2124] ERROR: Failed to open output file `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out' (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [1] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2124] ERROR: Failed to open output file `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out' (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2124] ERROR: Failed to open output file `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out' (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2279] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/job2/data3.out' to `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out'
   [2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2279] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data.out' to `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out'
   [2024-06-05T15:00:57] [5] [/deps/mpifileutils/src/common/mfu_flist_copy.c:2279] ERROR: Failed to copy `/lus/global/blake/dm-system-test/src/job/data2.out' to `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out'
   [2024-06-05T15:00:57] Copy data: 60.000 B (60 bytes)
   [2024-06-05T15:00:57] Copy rate: 3.742 KiB/s (60 bytes in 0.016 seconds)
   [2024-06-05T15:00:57] Syncing data to disk.
   [2024-06-05T15:00:57] Sync completed in 0.036 seconds.
   [2024-06-05T15:00:57] Fixing permissions.
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2/data3.out' chmod() (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [5] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data2.out' chmod() (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [2] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/data.out' chmod() (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job/job2' chmod() (errno=2 No such file or directory)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:500] ERROR: Failed to change permissions on `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job' chmod() (errno=2 No such file or directory)
   [2024-06-05T15:00:57] Updated 9 items in 0.003 seconds (3391.414 items/sec)
   [2024-06-05T15:00:57] Syncing directory updates to disk.
   [2024-06-05T15:00:57] Sync completed in 0.001 seconds.
   [2024-06-05T15:00:57] Started: Jun-05-2024,15:00:57
   [2024-06-05T15:00:57] Completed: Jun-05-2024,15:00:57
   [2024-06-05T15:00:57] Seconds: 0.092
   [2024-06-05T15:00:57] Items: 5
   [2024-06-05T15:00:57]   Directories: 3
   [2024-06-05T15:00:57]   Files: 2
   [2024-06-05T15:00:57]   Links: 0
   [2024-06-05T15:00:57] Data: 24.000 B (24 bytes)
   [2024-06-05T15:00:57] Rate: 261.025 B/s (024 bytes in 0.092 seconds)
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/dcp/dcp.c:592] ERROR: One or more errors were detected while copying: MFU_ERR(-1101)

@ofaaland
Copy link
Collaborator

ofaaland commented Jun 5, 2024

@bdevcich I see the stat() of the job directory returned ENOENT

   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1016] Creating directory `/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1040] TEST: Running stat on '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job'
   [2024-06-05T15:00:57] [0] [/deps/mpifileutils/src/common/mfu_flist_copy.c:1042] ERROR: TEST: could not stat '/mnt/nnf/0951169d-0182-4e57-b921-d0eff2ccffed-0/job' (errno=2 No such file or directory)

It's concerning that mkdir would fail. @behlendorf tells me you're able to reproduce this on elcap, so I'll try myself once the nodes are back up post firmware update.

@bdevcich
Copy link
Contributor Author

bdevcich commented Jun 7, 2024

What version of Lustre is in use on the servers? There was a race in the server request processing code that resulted in their-created files/directories having permission 0000, and was fixed in https://jira.whamcloud.com/browse/LU-16056 and backported to 2.15.2 servers.

On our internal system where I see this, we're using lustre-2.15.4_4.llnl-1.t4.x86_64

@bdevcich
Copy link
Contributor Author

bdevcich commented Jun 7, 2024

As I mentioned in the rabbit issue, if I put a 30s pause after the mount of ephemeral lustre, I cannot reproduce this issue.

I've removed the pause and did my best to capture the lustre logs from the two rabbit nodes when this occurs. I did a lctl dk after the issue occurred and I did my best to try to do a lctl clear after the successful runs to try to limit the size of the logs.

mknod-issue.rabbit-node-1.log
mknod-issue.rabbit-node-2.log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants