Need help for restore failed issue #2552

Open
melsamathew opened this issue Dec 18, 2024 · 28 comments

Comments

@melsamathew

We are currently using CRIU version 3.15 for checkpoint and restore operations in our Linux application.
The environment consists of Glibc 2.40 and GCC 14.2.0.
While the checkpoint (dumping) process is successful, we encounter an issue during restore, where the process fails with the following error:

(00.064198) pie: 5495: `- skip pagemap
(00.064199) pie: 5495: `- skip pagemap
(00.064201) pie: 5495: `- skip pagemap
(00.064297) Error (criu/cr-restore.c:1573): 5495 killed by signal 127: Unknown signal 127
(00.064335) Error (criu/cr-restore.c:2498): Restoring FAILED.
Our query is whether there could be any compatibility issues between CRIU 3.15 and the newer versions of Glibc (2.40) and GCC (14.2.0). Specifically, we would like to know:

Under what circumstances can CRIU throw an "unknown signal 127" during a restore?
How can we debug this issue further to pinpoint the cause?
We appreciate any insights or suggestions to help resolve this issue.

CRIU logs and information:
Restore.log

CRIU full dump/restore logs: [Dump.log](https://github.com/user-attachments/files/18176588/Dump.log)
Output of `criu --version`:

criu --version
Version: 3.15

Output of `criu check --all`:

criu check --all

Warn (criu/cr-check.c:859): Dirty tracking is OFF. Memory snapshot will not work.
Warn (criu/cr-check.c:1194): Loginuid restore is OFF.
Error (criu/cr-check.c:1216): UFFD is not supported
Error (criu/cr-check.c:1216): UFFD is not supported
Warn (criu/cr-check.c:1239): clone3() with set_tid not supported
Error (criu/cr-check.c:1281): Time namespaces are not supported
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.

Additional environment details:
The environment consists of Glibc 2.40 and GCC 14.2.0. We recently upgraded from Glibc 2.23; CRIU was working with that glibc version.

@adrianreber
Member

Your CRIU version is too old (as mentioned in the other ticket) to work with restartable sequences. Either update CRIU or you can set an environment variable to disable restartable sequences in glibc (as mentioned in the other ticket).

@melsamathew
Author

melsamathew commented Dec 19, 2024

Thank you for your response and suggestions. I have upgraded CRIU to versions 3.17 and 4.0 and tried to run the simple loop program (from https://criu.org/Simple_loop) to verify that CRIU is functioning correctly. However, the dump process failed with a segmentation fault.

In CRIU-4.0
ps -C test.sh
PID TTY TIME CMD
8401 pts/1 00:00:00 test.sh
criu dump -vvvv -o dump.log -t 8401 --shell-job && echo OK
Segmentation fault (core dumped)

Dump log ends in

(00.045690) Add cgroup ns 8 pid 7153
(00.045693) cg: Dumping cgroups for thread 7153
(00.045727) cg: - New css ID 1
(00.045729) cg: - [cpuset,cpu,cpuacct,blkio,memory,devices,freezer,perf_event,hugetlb,pids,rdma] -> [/] [0]
(00.045732) cg: Set 1 is criu one

Or when tried from root

(00.055347) Error (criu/parasite-syscall.c:88): si_code=4 si_pid=13325 si_status=11
(00.055351) Error (criu/parasite-syscall.c:95): 13325 was stopped by 11 unexpectedly

Could you please suggest how to make this work?

@adrianreber
Member

As described in #1696 you need at least Linux Kernel version 5.13 for restartable sequences to work with CRIU. According to your log files you have 5.4.282-staros-v3-scale-64.

With the glibc version you are using restartable sequences are always used and you need at least kernel 5.13 and CRIU 3.17.

You can use a newer kernel, an older glibc or export the environment variable I mentioned earlier.

@melsamathew
Author

Thanks for your input. Can you tell us which combinations of CRIU, glibc, and kernel we can try to make this work?
Also, which version of CRIU is compatible with Glibc 2.40?

@adrianreber
Member

If glibc >= 2.35 you need at least kernel 5.13 and CRIU 3.17.
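
The rule above can be turned into a quick pre-flight check. This is only a minimal sketch and not part of CRIU: the function names `version_ge` and `rseq_combo_ok` are made up for illustration, and it relies on GNU `sort -V`. It compares upstream version numbers only, so it will wrongly reject a vendor kernel that carries the required patch as a backport.

```shell
#!/bin/sh
# Hypothetical helpers, not part of CRIU.

# version_ge A B: succeed if version A >= version B (needs GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# rseq_combo_ok GLIBC KERNEL CRIU
# Encodes the rule from this thread:
# glibc >= 2.35 needs kernel >= 5.13 and CRIU >= 3.17.
rseq_combo_ok() {
    glibc=$1
    kernel=$2
    criu=$3
    if ! version_ge "$glibc" 2.35; then
        return 0   # the rule does not apply to older glibc
    fi
    version_ge "$kernel" 5.13 && version_ge "$criu" 3.17
}

# The combination reported in this issue fails the check.
rseq_combo_ok 2.40 5.4.282-staros-v3-scale-64 3.15 ||
    echo "combination not expected to work"
```

On a live system the three arguments could be filled in from `uname -r`, `criu --version`, and the glibc version of the target application.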

@melsamathew
Author

Thanks adrianreber. I understand that we also need an upgraded kernel for everything to work properly.
Other than setting environment variables, are there any patches that can be applied to CRIU 3.15 to make it work with restartable sequences without upgrading the kernel version?

@adrianreber
Member

No

@melsamathew
Author

We need some more help here.
Currently, we are unable to upgrade the kernel to version 5.13.
The issue still persists after setting the environment variable with glibc 2.40 and CRIU 3.15.
We are now trying to debug this combination. Can you suggest how to further debug the unknown signal 127?
Are there any logs we can enable to identify what is triggering this signal?
Additionally, could the introduction of rseq in the new glibc be the primary cause of the restore failure?

@adrianreber
Member

I guess you are setting the environment variable wrong. Can you post the exact steps you are doing, so that I can reproduce it?

@melsamathew
Author

We have exported the environment variable in our application as export GLIBC_TUNABLES=glibc.pthread.rseq=0
After setting it, the list of tunables is as follows.
/lib64/ld-linux-x86-64.so.2 --list-tunables
glibc.cpu.hwcaps:
glibc.cpu.plt_rewrite: 0 (min: 0, max: 2)
glibc.cpu.prefer_map_32bit_exec: 0 (min: 0, max: 1)
glibc.cpu.x86_data_cache_size: 0x8000 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_ibt:
glibc.cpu.x86_memset_non_temporal_threshold: 0x180000 (min: 0x4040, max: 0xffffffffffffffff)
glibc.cpu.x86_non_temporal_threshold: 0x180000 (min: 0x4040, max: 0xfffffffffffffff)
glibc.cpu.x86_rep_movsb_threshold: 0x2000 (min: 0x100, max: 0xffffffffffffffff)
glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
glibc.cpu.x86_shared_cache_size: 0x200000 (min: 0x0, max: 0xffffffffffffffff)
glibc.cpu.x86_shstk:
glibc.elision.enable: 0 (min: 0, max: 1)
glibc.elision.skip_lock_after_retries: 3 (min: 0, max: 2147483647)
glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
glibc.elision.skip_lock_internal_abort: 3 (min: 0, max: 2147483647)
glibc.elision.skip_trylock_internal_abort: 3 (min: 0, max: 2147483647)
glibc.elision.tries: 3 (min: 0, max: 2147483647)
glibc.gmon.maxarcs: 1048576 (min: 50, max: 2147483647)
glibc.gmon.minarcs: 50 (min: 50, max: 2147483647)
glibc.malloc.arena_max: 0x0 (min: 0x1, max: 0xffffffffffffffff)
glibc.malloc.arena_test: 0x0 (min: 0x1, max: 0xffffffffffffffff)
glibc.malloc.check: 0 (min: 0, max: 3)
glibc.malloc.hugetlb: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mmap_max: 0 (min: 0, max: 2147483647)
glibc.malloc.mmap_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.mxfast: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.perturb: 0 (min: 0, max: 255)
glibc.malloc.tcache_count: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.tcache_max: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.tcache_unsorted_limit: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
glibc.malloc.trim_threshold: 0x0 (min: 0x0, max: 0xffffffffffffffff)
glibc.mem.decorate_maps: 0 (min: 0, max: 1)
glibc.mem.tagging: 0 (min: 0, max: 255)
glibc.pthread.mutex_spin_count: 100 (min: 0, max: 32767)
glibc.pthread.rseq: 0 (min: 0, max: 1)
glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
glibc.pthread.stack_hugetlb: 1 (min: 0, max: 1)
glibc.rtld.dynamic_sort: 2 (min: 1, max: 2)
glibc.rtld.enable_secure: 0 (min: 0, max: 1)
glibc.rtld.nns: 0x4 (min: 0x1, max: 0x10)
glibc.rtld.optional_static_tls: 0x200 (min: 0x0, max: 0xffffffffffffffff)
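
The listing above reflects the environment of the shell that ran `--list-tunables`. glibc reads `GLIBC_TUNABLES` when a program starts, so what matters for CRIU is the environment the checkpointed process was started with. A minimal sketch for checking that on Linux follows; the helper name `has_rseq_disabled` is made up for illustration.

```shell
#!/bin/sh
# has_rseq_disabled PID: succeed if the process was started with
# glibc.pthread.rseq=0 in GLIBC_TUNABLES (hypothetical helper).
has_rseq_disabled() {
    environ_file="/proc/$1/environ"
    [ -r "$environ_file" ] || return 1
    # /proc/PID/environ holds the NUL-separated environment from exec time.
    tr '\0' '\n' < "$environ_file" |
        grep -Eq '^GLIBC_TUNABLES=(.*:)?glibc\.pthread\.rseq=0(:|$)'
}

# Example (8401 is the PID from the earlier ps output):
#   has_rseq_disabled 8401 && echo "rseq disabled for this process"
```

If the check fails for the process being dumped, the variable was most likely exported after that process had already started, or in a different shell.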

@melsamathew
Author

Basically, the unknown signal relates to "a standard error message that indicates a command could not be found or executed".
Is there a chance of CRIU encountering such an issue after a glibc upgrade?
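
For reference, the "command not found" meaning of 127 applies to shell exit statuses, which are a different thing from signal numbers. The sketch below only illustrates that shell convention; it makes no claim about how CRIU arrived at "signal 127" in the log above. The helper name `describe_status` is made up for illustration.

```shell
#!/bin/sh
# describe_status N: interpret a shell exit status by convention
# (hypothetical helper; a program may also exit with these values itself).
describe_status() {
    if [ "$1" -eq 127 ]; then
        echo "command not found"
    elif [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
    else
        echo "exited with status $1"
    fi
}

# A missing command gives exit status 127.
status=0
sh -c 'no_such_command_xyz' 2>/dev/null || status=$?
describe_status "$status"

# A process terminated by SIGTERM (15) gives exit status 128 + 15.
status=0
sh -c 'kill -s TERM $$' 2>/dev/null || status=$?
describe_status "$status"
```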

@melsamathew
Author

Hi adrianreber,
Can you help with this? I was waiting for your reply. Thanks.

@melsamathew
Author

Hi, we are trying to upgrade the Linux kernel, but it has a lot of dependencies on other packages and will take time. Can you please point out which area of the Linux kernel the latest CRIU depends on?

@adrianreber
Member

> Can you please point out which area of linux kernel is depending on latest criu?

As described in #1696 you need torvalds/linux@90f093f

RHEL8 for example has that patch backported on a 4.18 kernel.

But the problem you have is not really understood. I am at this point just assuming it is related to restartable sequences because it makes sense with what you described. It still could be something else. I would recommend that you try it, for testing, on an up to date Linux distribution which is known to work. If it works there you know it is your environment.

@melsamathew
Author

Hi,
Thanks for the patch. I tried applying it in this environment:
Glibc 2.40, CRIU 3.17, kernel 5.4.282 with the above patch.

Now the restore fails with a segmentation fault:
(00.062154) 17144: Error (criu/cr-restore.c:1508): 17428 killed by signal 11: Segmentation fault
(00.067228) Error (criu/cr-restore.c:1504): 17144 exited, status=1
(00.067259) Error (criu/cr-restore.c:2550): Restoring FAILED.
Earlier, when I tried without this patch, the dump itself failed, but now the restore fails with a segmentation fault.
Is this a known issue, or how can I debug this further?

@melsamathew
Author

Can you please help how to proceed on this issue?

@adrianreber
Member

At this point I cannot help you any more. Not sure what is going on. Maybe somebody else has an idea. If your operating system comes with support, maybe talk to your vendor.

@adrianreber
Member

At least post the complete log of the failure.

@melsamathew
Author

Hi Adrianreber,
I have tried running the simple loop with CRIU 3.17 and the Linux patch, and got these logs for the restore:
simple_loop_3_17restore.log

@adrianreber
Member

I don't see any mention of restartable sequences in the restore log. So something is not set up right. Why are you using 3.17 and 4.0?

The latest git checkout gives me the following log content:

(00.036868) pie: 1: rseq: rseq_abi_pointer = 0x7f782ced5060 signature = 0x53053053

Please try with 4.0 and provide the logs.

@melsamathew
Author

When I tried CRIU 4.0 with the Linux patch, the dump itself failed with a segmentation fault:
criu_4_dump.log
That's why I checked with CRIU 3.17, where the dump succeeded.

@adrianreber
Member

Thanks. Looks like your kernel is really too old for CRIU. You need a newer kernel. We do not test on such old kernels.

@adrianreber
Member

Maybe try 3.19 and 3.18 to see if one of those versions works better. For the latest glibc, CRIU 4.0 might be necessary. Not sure anymore.

@melsamathew
Author

Hi,

I have tried upgrading to 3.19, also with the Linux patch. That didn't help either, and the dump failed. Please find the dump logs below.

criu_3_19_dump_logs.txt

@adrianreber
Member

Why is the information about the kernel missing in the log file? Looks like you removed information.

Also I still see (00.043768) ptrace(PTRACE_GET_RSEQ_CONFIGURATION) is not supported, which means your kernel does not have the necessary support to handle restartable sequences with glibc 2.40.

Also, you are trying to checkpoint non-existing grep processes. The PID is wrong in both examples.

This will not work as long as your kernel does not support ptrace(PTRACE_GET_RSEQ_CONFIGURATION). This has already been mentioned a couple of times in the ticket.
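
The two log messages quoted in this thread can be checked for mechanically before posting a log. The following is a minimal sketch; the helper names are made up for illustration, and the strings they search for are simply the ones quoted above, which may differ between CRIU versions.

```shell
#!/bin/sh
# Hypothetical helpers for scanning a CRIU log; not part of CRIU.

# Succeeds if the log contains the "not supported" message quoted above.
kernel_lacks_rseq_ptrace() {
    grep -qF 'ptrace(PTRACE_GET_RSEQ_CONFIGURATION) is not supported' "$1"
}

# Succeeds if the log contains an rseq line like the one shown earlier
# from a latest git checkout.
log_mentions_rseq() {
    grep -qF 'rseq: rseq_abi_pointer' "$1"
}

# Example (the log file name is a placeholder):
#   kernel_lacks_rseq_ptrace dump.log && echo "kernel lacks rseq ptrace support"
```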

@melsamathew
Author

Thanks for pointing that out. The logs above covered attempts both with and without the Linux patch. With the Linux patch, I can now see that the dump succeeds and the restore proceeds but ends with a segfault.

criu_3_19_dump_logs.txt

@adrianreber
Member

Now it seems that CRIU crashes. Can you attach gdb to the core file and show a backtrace?
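
A minimal sketch of collecting such a backtrace non-interactively is shown below. The function name `criu_backtrace` is made up for illustration, both paths are placeholders, and where the core file ends up depends on the system's core dump configuration (core dumps may first need to be enabled, for example with `ulimit -c unlimited`).

```shell
#!/bin/sh
# criu_backtrace BINARY CORE: print backtraces of all threads from a
# core file (hypothetical helper; requires gdb to be installed).
criu_backtrace() {
    if [ ! -x "$1" ] || [ ! -r "$2" ]; then
        echo "usage: criu_backtrace /path/to/criu /path/to/core" >&2
        return 1
    fi
    # --batch exits after running the commands given with -ex.
    gdb --batch -ex 'thread apply all bt full' "$1" "$2"
}

# Example (placeholder paths):
#   criu_backtrace ./criu ./core > backtrace.txt
```

A CRIU binary built with debug symbols makes the resulting backtrace far more readable.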

But there are also log messages which do not exist upstream. Are you using local modifications? Please use an unmodified version if you want our help. Not sure if we are now debugging your changes.

@melsamathew
Author

Hi Adrian. As the upgrade has a lot of dependencies, I was trying to debug on the existing system by setting the environment variable.
Glibc 2.40, CRIU 3.15, kernel 5.4.282, and glibc.pthread.rseq: 0.
Here I would like to know about the options in criu dump, as it is giving some result. Is there any difference between using --tree and -t?
./criu dump --tree
./criu dump -t
