mknod() failed (errno=2 No such file or directory) #574
Here's another case with
Does that mount have some kind of caching or delayed update? I'm more familiar with DAOS than Lustre, but I've seen similar behavior with DAOS and certain cache settings where a rank creates the directory on one host, then a rank on a different host doesn't see that directory immediately.
That's a good thought and could explain the raciness. Lustre is fully coherent across nodes, but I'm not sure how
I don't think so, unless there's a bug somewhere. This is the function that creates the directories: mpifileutils/src/common/mfu_flist_copy.c, line 1042 (commit 0a4c530).
And notice there is a barrier: mpifileutils/src/common/mfu_flist_copy.c, line 1123 (commit 0a4c530).
And that function is called before creating the files: mpifileutils/src/common/mfu_flist_copy.c, lines 2490 to 2495 (commit 0a4c530).
As far as I remember (I'm not the primary author!) dcp doesn't directly handle coherency. It creates all directories upfront, barrier, then creates files, barrier, then copies file data.
Does the job2 directory ever get created at all? In the example above, I see that not only did the file copies to job2 fail, but the chmod on job2 itself also failed. That chmod is a third phase (mkdirs, then file copies, then set metadata), so multiple barriers later. |
This is what I'm referring to with respect to the chmod failing. I see this in both the first and second examples you posted. So, I wonder if the mkdir() failed, but for some reason dcp didn't know it failed. Can you instrument your test to see what each node sees under the destination root directory after dcp finishes?
Good point. If mkdir fails, it looks like the copy continues anyway: mpifileutils/src/common/mfu_flist_copy.c, lines 2490 to 2499 (commit 0a4c530).
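The failure mode being hypothesized here can be demonstrated in isolation. This is a minimal sketch with hypothetical paths (not the mpifileutils code): if a `mkdir()` failure is ignored, the later file create inside that directory fails with ENOENT, matching the `errno=2 No such file or directory` in this report.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Demonstrates why an ignored mkdir() failure surfaces later as
 * ENOENT: creating a file inside a directory that was never created
 * fails with errno=2 (No such file or directory). */
int create_in_dir(const char *parent)
{
    char dir[4096], file[4096];
    snprintf(dir, sizeof(dir), "%s/job2", parent);
    snprintf(file, sizeof(file), "%s/job2/data.bin", parent);

    /* Suppose mkdir() fails here (e.g. the parent is missing or the
     * create raced) and the error is not checked -- the copy
     * "continues anyway", as in the code path linked above. */
    mkdir(dir, 0700);   /* return value deliberately ignored */

    int fd = open(file, O_CREAT | O_WRONLY, 0600);
    if (fd < 0)
        return errno;   /* ENOENT when the directory never appeared */
    close(fd);
    return 0;
}
```

Checking the `mkdir()` return value (treating only `EEXIST` as benign) would turn this delayed ENOENT into an immediate, attributable error.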
I was able to add a pause in our code after the dcp failure occurs to go take a look at the lustre filesystem on each of the nodes. Obviously this isn't in real time and any sort of caching issue would have time to catch up since I have to do this manually right now, but I was able to take a peek at the two nodes and the files: node0:
node1:
You can see the discrepancy between the two nodes above. However, in this run, I'm seeing an actual error message for a failed mkdir(). Here is the output:
So even though
What version of Lustre is in use on the servers? There was a race in the server request processing code that resulted in newly created files/directories having permission 0000; it was fixed in https://jira.whamcloud.com/browse/LU-16056 and backported to 2.15.2 servers.
@adilger we're seeing this with 2.15.4_6.llnl, which does include the fix for LU-16056. For what it's worth, we didn't notice this issue with 2.15.3_5.llnl, but we haven't yet tried rolling the servers back. We do have a small patch stack applied, but we had most of those changes applied to both Lustre versions and nothing in there really jumps out at me. It's mostly kfilnd/lnet fixes.
@behlendorf is this reproducible enough that it could be bisected between 2.15.3 and 2.15.4? |
@bdevcich can you modify your build of mpifileutils so |
I'm back from vacation and I will start on this. |
Here is the output with some added debugging lines from the
@bdevcich I see the stat() of the job directory returned ENOENT
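A stat()-returning-ENOENT check of the kind being debugged here can be sketched as follows (a minimal illustration with a hypothetical helper name, not the actual added debugging code):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Returns 1 if the path exists, 0 if stat() reports ENOENT, and -1 on
 * any other error.  Logging the errno string makes a missing job
 * directory unambiguous in the dcp output. */
int dir_exists(const char *path)
{
    struct stat sb;
    if (stat(path, &sb) == 0)
        return 1;
    if (errno == ENOENT) {
        fprintf(stderr, "stat(%s): %s (errno=%d)\n",
                path, strerror(errno), errno);
        return 0;
    }
    return -1;
}
```

Running this on each node right after the directory-creation phase would show whether the job directory is visible everywhere before file creation begins.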
It's concerning that mkdir would fail. @behlendorf tells me you're able to reproduce this on elcap, so I'll try myself once the nodes are back up post firmware update. |
On our internal system where I see this, we're using |
As I mentioned in the rabbit issue, if I put a 30s pause after the mount of ephemeral Lustre, I cannot reproduce this issue. I've removed the pause and did my best to capture the Lustre logs from the two rabbit nodes when this occurs. I did a
Hello,

I have been seeing this issue when using `dcp` to recursively copy a directory from one Lustre filesystem to another. This is being run via `mpirun` and over multiple hosts (e.g. `mpirun -H host1,host2`). I have not seen this when run with a single host (or when the launcher is on the same host as the 1 worker/host).

The error in the `dcp` output suggests `no such file or directory`, which seems like the parent directory (e.g. `/mnt/nnf/.../job2/`) is not being created first. I originally thought the `Original directory exists, skip the creation` output might suggest that none of the directories are being created, but that output shows up in the case where this issue does not surface.

I don't have a way to reproduce this directly, but if I run our tests enough times this will be hit. Each test is copying the same source directory to essentially the same location on an ephemeral Lustre file system. Here is the source directory and its contents that `dcp` is attempting to copy:

Here is the error output from `dcp`:

The `system call failed during shared memory initialization` message is something that I see when the copy is successful, so I don't think it contributes to the problem here.

Is there anything I can enable to trace this further?