* [RCU] zombie task hung in synchronize_rcu_expedited
@ 2024-06-05 23:42 Rachel Menge
2024-06-06 11:10 ` Oleg Nesterov
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Rachel Menge @ 2024-06-05 23:42 UTC (permalink / raw)
To: linux-kernel, rcu
Cc: Wei Fu, apais, Sudhanva Huruli, fuweid89, Jens Axboe,
Christian Brauner, Oleg Nesterov, Andrew Morton, Mike Christie,
Joel Granados, Mateusz Guzik, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang
Hello,
We are facing a soft lockup on our systems which appears to be related
to RCU scheduling.
The bug manifests as high CPU usage, and dmesg shows a soft lockup
associated with "zap_pid_ns_processes". I have confirmed the behavior on
the 5.15 and 6.8 kernels.
This example was taken from an Ubuntu 22.04 VM running in a hyper-v
environment.
rachel@ubuntu:~$ uname -a
Linux ubuntu 5.15.0-107-generic #117-Ubuntu SMP Fri Apr 26 12:26:49 UTC
2024 x86_64 x86_64 x86_64 GNU/Linux
dmesg snippet:
watchdog: BUG: soft lockup - CPU#0 stuck for 212s! [npm start:306207]
Modules linked in: veth nf_conntrack_netlink xt_conntrack nft_chain_nat
xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables
nfnetlink binfmt_misc nls_iso8859_1 intel_rapl_msr serio_raw
intel_rapl_common hyperv_fb hv_balloon joydev mac_hid sch_fq_codel
dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua overlay
iptable_filter ip6table_filter ip6_tables br_netfilter bridge stp llc
arp_tables msr efi_pstore ip_tables x_tables autofs4 btrfs
blake2b_generic zstd_compress raid10 raid456 async_raid6_recov
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
raid0 multipath linear hyperv_drm drm_kms_helper syscopyarea sysfillrect
sysimgblt fb_sys_fops crct10dif_pclmul cec hv_storvsc crc32_pclmul
hid_generic hv_netvsc ghash_clmulni_intel scsi_transport_fc rc_core
sha256_ssse3 hid_hyperv drm sha1_ssse3 hv_utils hid hyperv_keyboard
aesni_intel crypto_simd cryptd hv_vmbus
CPU: 0 PID: 306207 Comm: npm start Tainted: G L
5.15.0-107-generic #117-Ubuntu
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine,
BIOS Hyper-V UEFI Release v4.1 04/06/2022
RIP: 0010:_raw_spin_unlock_irqrestore+0x25/0x30
Code: eb 8d cc cc cc 0f 1f 44 00 00 55 48 89 e5 e8 3a b8 36 ff 66 90 f7
c6 00 02 00 00 75 06 5d e9 e2 cb 22 00 fb 66 0f 1f 44 00 00 <5d> e9 d5
cb 22 00 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 8b 07
RSP: 0018:ffffb15fc915bc60 EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffffb15fc915bcf8 RCX: 0000000000000000
RDX: ffff9d4713f9c828 RSI: 0000000000000246 RDI: ffff9d4713f9c820
RBP: ffffb15fc915bc60 R08: ffff9d4713f9c828 R09: ffff9d4713f9c828
R10: 0000000000000228 R11: ffffb15fc915bcf0 R12: ffff9d4713f9c820
R13: 0000000000000004 R14: ffff9d47305a9980 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff9d4643c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd63a1b6008 CR3: 0000000288bd6003 CR4: 0000000000370ef0
Call Trace:
<IRQ>
? show_trace_log_lvl+0x1d6/0x2ea
? show_trace_log_lvl+0x1d6/0x2ea
? add_wait_queue+0x6b/0x80
? show_regs.part.0+0x23/0x29
? show_regs.cold+0x8/0xd
? watchdog_timer_fn+0x1be/0x220
? lockup_detector_update_enable+0x60/0x60
? __hrtimer_run_queues+0x107/0x230
? read_hv_clock_tsc_cs+0x9/0x30
? hrtimer_interrupt+0x101/0x220
? hv_stimer0_isr+0x20/0x30
? __sysvec_hyperv_stimer0+0x32/0x70
? sysvec_hyperv_stimer0+0x7b/0x90
</IRQ>
<TASK>
? asm_sysvec_hyperv_stimer0+0x1b/0x20
? _raw_spin_unlock_irqrestore+0x25/0x30
add_wait_queue+0x6b/0x80
do_wait+0x52/0x310
kernel_wait4+0xaf/0x150
? thread_group_exited+0x50/0x50
zap_pid_ns_processes+0x111/0x1a0
forget_original_parent+0x348/0x360
exit_notify+0x4a/0x210
do_exit+0x24f/0x3c0
do_group_exit+0x3b/0xb0
__x64_sys_exit_group+0x18/0x20
x64_sys_call+0x1937/0x1fa0
do_syscall_64+0x56/0xb0
? do_user_addr_fault+0x1e7/0x670
? exit_to_user_mode_prepare+0x37/0xb0
? irqentry_exit_to_user_mode+0x17/0x20
? irqentry_exit+0x1d/0x30
? exc_page_fault+0x89/0x170
entry_SYSCALL_64_after_hwframe+0x67/0xd1
RIP: 0033:0x7f60019daf8e
Code: Unable to access opcode bytes at RIP 0x7f60019daf64.
RSP: 002b:00007fff2812a468 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f5ffeda01b0 RCX: 00007f60019daf8e
RDX: 00007f6001a560c0 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007fff2812a4b0 R08: 0000000000000024 R09: 0000000800000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
R13: 00007f60016f4a90 R14: 0000000000000000 R15: 00007f5ffede4d50
</TASK>
Looking at the running processes, there are zombie threads:
root@ubuntu:/home/rachel# ps aux | grep Z
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
rachel 305832 0.5 0.0 0 0 ? Zsl 01:55 0:00 [npm
start] <defunct>
rachel 308234 0.3 0.0 0 0 ? Zl 01:55 0:00 [npm
run zombie] <defunct>
rachel 308987 0.0 0.0 0 0 ? Z 01:55 0:00 [sh]
<defunct>
root 345328 0.0 0.0 6480 2220 pts/5 S+ 01:56 0:00 grep
--color=auto Z
The zombie thread group "308234" shows a thread stuck in
synchronize_rcu_expedited:
root@ubuntu:/home/rachel# ls /proc/308234/task
308234 308312
root@ubuntu:/home/rachel# cat /proc/308312/stack
[<0>] exp_funnel_lock+0x1eb/0x230
[<0>] synchronize_rcu_expedited+0x6d/0x1b0
[<0>] namespace_unlock+0xd6/0x1b0
[<0>] put_mnt_ns+0x74/0xa0
[<0>] free_nsproxy+0x1c/0x1b0
[<0>] switch_task_namespaces+0x5e/0x70
[<0>] exit_task_namespaces+0x10/0x20
[<0>] do_exit+0x212/0x3c0
[<0>] io_sq_thread+0x457/0x5b0
[<0>] ret_from_fork+0x22/0x30
To consistently reproduce the issue, disable "CONFIG_PREEMPT_RCU". It is
unclear whether enabling it completely prevents the issue, but the issue
is much easier to reproduce with RCU preemption off. I was able to
reproduce it on Ubuntu 22.04 (5.15.0-107-generic) and 24.04
(6.8.0-30-generic). There are two methods of reproducing; both are
hosted at https://github.com/rlmenge/rcu-soft-lock-issue-repro .
Repro using npm and docker:
Get the script here:
https://github.com/rlmenge/rcu-soft-lock-issue-repro/blob/main/rcu-npm-repro.sh
# get the image first so that the script doesn't keep pulling images
$ sudo docker run telescope.azurecr.io/issue-repro/zombie:v1.1.11
$ sudo ./rcu-npm-repro.sh
This script creates several containers. Each container runs in new pid
and mount namespaces. The container's entrypoint is `npm run task && npm
start`.
npm run task: runs `npm run zombie & npm run done`.
npm run zombie: runs `while true; do echo zombie; sleep 1; done`, an
infinite loop printing "zombie".
npm run done: runs `echo done`; a short-lived process.
npm start: also a short-lived process; it exits in a few seconds.
When `npm start` exits, the process tree in that pid namespace will be like
npm start (pid 1)
|__npm run zombie
|__ sh -c "while true; do echo zombie; sleep 1; done"
Repro using golang:
Use the go module found here:
https://github.com/rlmenge/rcu-soft-lock-issue-repro/blob/main/rcudeadlock.go
Run
$ go mod init rcudeadlock.go
$ go mod tidy
$ CGO_ENABLED=0 go build -o ./rcudeadlock ./
$ sudo ./rcudeadlock
This Go program simulates the npm reproducer without docker as a
dependency. The binary re-execs itself to support multiple subcommands.
It also sets up processes in new pid and mount namespaces via unshare,
since `put_mnt_ns` is a critical kernel code path for reproducing this
issue. Both mount and pid namespaces are required to reproduce it.
The entrypoint of new pid and mount namespaces is `rcudeadlock task &&
rcudeadlock start`.
rcudeadlock task: runs `rcudeadlock zombie & rcudeadlock done`.
rcudeadlock zombie: runs `bash -c "while true; do echo zombie; sleep 1;
done"`, an infinite loop printing "zombie".
rcudeadlock done: prints done and exits.
rcudeadlock start: prints `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA` 10 times
and exits.
When `rcudeadlock start` exits, the process tree in that pid namespace
will be like
rcudeadlock start (pid 1)
|__rcudeadlock zombie
|__bash -c "while true; do echo zombie; sleep 1; done".
Each rcudeadlock process sets up 4 idle io_uring threads before handling
its command (`task`, `zombie`, `done` or `start`), similar to the npm
reproducer. I am not sure the issue is specific to io_uring, but with
idle io_uring threads it is easy to reproduce.
Thank you,
Rachel
^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RCU] zombie task hung in synchronize_rcu_expedited
  2024-06-05 23:42 [RCU] zombie task hung in synchronize_rcu_expedited Rachel Menge
@ 2024-06-06 11:10 ` Oleg Nesterov
  2024-06-06 15:45   ` Wei Fu
  2024-06-08 12:06 ` [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING Oleg Nesterov
  2024-06-08 15:48 ` [PATCH] zap_pid_ns_processes: don't send SIGKILL to sub-threads Oleg Nesterov
  2 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2024-06-06 11:10 UTC (permalink / raw)
  To: Rachel Menge
  Cc: linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Jens Axboe,
	Christian Brauner, Andrew Morton, Mike Christie, Joel Granados,
	Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker,
	Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
	Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang,
	Eric W. Biederman

Add Eric.

Well, due to unfortunate design zap_pid_ns_processes() can hang "forever"
if this namespace has a (zombie) task injected from the parent ns, this
task should be reaped by its parent.

But zap_pid_ns_processes() shouldn't cause the soft-lockup, it should
sleep in kernel_wait4().

Any chance you can test the patch below? This patch makes sense anyway,
I'll send it later. But I am not sure it can fix your problem.

Oleg.

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index dc48fecfa1dc..25f3cf679b35 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	 */
 	do {
 		clear_thread_flag(TIF_SIGPENDING);
+		clear_thread_flag(TIF_NOTIFY_SIGNAL);
 		rc = kernel_wait4(-1, NULL, __WALL, NULL);
 	} while (rc != -ECHILD);

On 06/05, Rachel Menge wrote:
>
> Hello,
>
> We are facing a soft lockup on our systems which appears to be related to
> rcu scheduling.
>
> [...]

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* Re: [RCU] zombie task hung in synchronize_rcu_expedited
  2024-06-06 11:10 ` Oleg Nesterov
@ 2024-06-06 15:45   ` Wei Fu
  2024-06-06 17:28     ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Wei Fu @ 2024-06-06 15:45 UTC (permalink / raw)
  To: oleg
  Cc: Sudhanva.Huruli, akpm, apais, axboe, boqun.feng, brauner,
	ebiederm, frederic, fuweid89, j.granados, jiangshanlai, joel,
	josh, linux-kernel, mathieu.desnoyers, michael.christie,
	mjguzik, neeraj.upadhyay, paulmck, qiang.zhang1211,
	rachelmenge, rcu, rostedt, weifu

Hi!

> Add Eric.
>
> Well, due to unfortunate design zap_pid_ns_processes() can hang "forever"
> if this namespace has a (zombie) task injected from the parent ns, this
> task should be reaped by its parent.

That zombie task was cloned by the pid-1 process in that pid namespace.
In my last reproduced log, the process tree in that pid namespace looks
like

```
# unshare(CLONE_NEWPID | CLONE_NEWNS)

npm start (pid 2522045)
|__npm run zombie (pid 2522605)
|__ sh -c "while true; do echo zombie; sleep 1; done" (pid 2522869)
```

The `npm start (pid 2522045)` was stuck in kernel_wait4. And its child,
`npm run zombie (pid 2522605)`, has two threads. One of them was in D
status. As far as I know, pid-2522605 can't be reaped by its parent
pid-2522045 until that thread returns from `synchronize_rcu_expedited`.
```
$ sudo cat /proc/2522605/task/*/stack
[<0>] synchronize_rcu_expedited+0x177/0x1f0
[<0>] namespace_unlock+0xd6/0x1b0
[<0>] put_mnt_ns+0x73/0xa0
[<0>] free_nsproxy+0x1c/0x1b0
[<0>] switch_task_namespaces+0x5d/0x70
[<0>] exit_task_namespaces+0x10/0x20
[<0>] do_exit+0x2ce/0x500
[<0>] io_sq_thread+0x48e/0x5a0
[<0>] ret_from_fork+0x3c/0x60
[<0>] ret_from_fork_asm+0x1b/0x30

$ sudo cat /proc/2522605/task/2522645/status
Name:	iou-sqp-2522605
State:	D (disk sleep)
Tgid:	2522605
Ngid:	0
Pid:	2522645
PPid:	2522045
TracerPid:	0
Uid:	1000	1000	1000	1000
Gid:	1000	1000	1000	1000
FDSize:	0
Groups:	1000
NStgid:	2522605	25
NSpid:	2522645	40
NSpgid:	2522045	1
NSsid:	2522045	1
Kthread:	0
Threads:	2
SigQ:	0/128311
SigPnd:	0000000000000000
ShdPnd:	0000000000000100
SigBlk:	fffffffffffbfeff
SigIgn:	0000000001001000
SigCgt:	0000000000014602
CapInh:	0000000000000000
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	00000000a80425fb
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	2
Seccomp_filters:	1
Speculation_Store_Bypass:	vulnerable
SpeculationIndirectBranch:	always enabled
Cpus_allowed:	ff
Cpus_allowed_list:	0-7
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000001
Mems_allowed_list:	0
voluntary_ctxt_switches:	11
nonvoluntary_ctxt_switches:	21
```

>
> But zap_pid_ns_processes() shouldn't cause the soft-lockup, it should
> sleep in kernel_wait4().

I ran `cat /proc/2522045/status` and found that the status kept switching
between running and sleeping. But the kernel was still reporting soft
lockup.

And there is a log from dmesg. CPU 5 wasn't able to report the quiescent
state. It seems that [rcu_flavor_sched_clock_irq][1] wasn't able to call
[rcu_qs][2].
```
rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 	5-....: (15000 ticks this GP) idle=db4c/1/0x4000000000000000 softirq=14924115/14924115 fqs=7430
rcu: 	          hardirqs   softirqs   csw/system
rcu: 	 number:        0        833          0
rcu: 	cputime:        0          0      29996   ==> 30000(ms)
rcu: 	(t=15003 jiffies g=44379053 q=145851 ncpus=8)
CPU: 5 PID: 2522045 Comm: npm start Tainted: G             L     6.5.0-1021-azure #22~22.04.1-Ubuntu
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x20
Code: cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 62 06 00 00 90 f7 c6 00 02 00 00 74 01 fb 5d <e9> d2 19 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffa666c4bafc30 EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffffa666c4bafcc0 RCX: 0000000000000020
RDX: ffff8a3d82130928 RSI: 0000000000000282 RDI: ffff8a3d82130920
RBP: ffffa666c4bafc48 R08: ffff8a3d82130928 R09: ffff8a3d82130928
R10: 0000000000000040 R11: 0000000000000002 R12: ffff8a3d82130920
R13: ffff8a44f3db9980 R14: ffff8a44f3db9980 R15: ffff8a44f3db9970
FS:  0000000000000000(0000) GS:ffff8a451fd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000767ea57cc000 CR3: 00000005db436002 CR4: 0000000000370ee0
Call Trace:
 <IRQ>
 ? show_regs+0x6a/0x80
 ? dump_cpu_task+0x71/0x90
 ? rcu_dump_cpu_stacks+0xe8/0x180
 ? print_cpu_stall+0x131/0x290
 ? load_balance+0x160/0x870
 ? check_cpu_stall+0x1d8/0x270
 ? rcu_pending+0x32/0x1e0
 ? rcu_sched_clock_irq+0x16e/0x290
 ? update_process_times+0x63/0xa0
 ? tick_sched_handle+0x28/0x70
 ? tick_sched_timer+0x77/0x90
 ? __pfx_tick_sched_timer+0x10/0x10
 ? __hrtimer_run_queues+0x111/0x240
 ? srso_alias_return_thunk+0x5/0x7f
 ? hrtimer_interrupt+0x101/0x240
 ? hv_stimer0_isr+0x20/0x30
 ? __sysvec_hyperv_stimer0+0x32/0x70
 ? sysvec_hyperv_stimer0+0x7b/0x90
 </IRQ>
 <TASK>
 ? asm_sysvec_hyperv_stimer0+0x1b/0x20
 ? _raw_spin_unlock_irqrestore+0x19/0x20
 ? remove_wait_queue+0x47/0x50
 do_wait+0x19f/0x300
 kernel_wait4+0xaf/0x150
 ? __pfx_child_wait_callback+0x10/0x10
 zap_pid_ns_processes+0x105/0x190
 forget_original_parent+0x2e4/0x360
 exit_notify+0x4a/0x210
 do_exit+0x30b/0x500
 ? srso_alias_return_thunk+0x5/0x7f
 ? wake_up_state+0x10/0x20
 ? srso_alias_return_thunk+0x5/0x7f
 do_group_exit+0x35/0x90
 __x64_sys_exit_group+0x18/0x20
 x64_sys_call+0xd95/0x1ff0
 do_syscall_64+0x56/0x80
 ? srso_alias_return_thunk+0x5/0x7f
 ? handle_mm_fault+0x128/0x290
 ? srso_alias_return_thunk+0x5/0x7f
 ? srso_alias_return_thunk+0x5/0x7f
 ? exit_to_user_mode_prepare+0x49/0x100
 ? srso_alias_return_thunk+0x5/0x7f
 ? irqentry_exit_to_user_mode+0x19/0x30
 ? srso_alias_return_thunk+0x5/0x7f
 ? irqentry_exit+0x1d/0x30
 ? srso_alias_return_thunk+0x5/0x7f
 ? exc_page_fault+0x80/0x160
 entry_SYSCALL_64_after_hwframe+0x73/0xdd
RIP: 0033:0x75fce9367f8e
Code: Unable to access opcode bytes at 0x75fce9367f64.
RSP: 002b:00007ffc80c04b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 000075fce672d1b0 RCX: 000075fce9367f8e
RDX: 000075fce93e30c0 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007ffc80c04b60 R08: 0000000000000024 R09: 0000000800000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
R13: 000075fce9081a90 R14: 0000000000000000 R15: 000075fce6771d50
 </TASK>
rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 5-.... } 15359 jiffies s: 90777 root: 0x20/.
rcu: blocking rcu_node structures (internal RCU debug):
Sending NMI from CPU 4 to CPUs 5:
NMI backtrace for cpu 5
CPU: 5 PID: 2522045 Comm: npm start Tainted: G             L     6.5.0-1021-azure #22~22.04.1-Ubuntu
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
RIP: 0010:do_wait+0x11/0x300
Code: 8b 4d d4 e9 28 fd ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 <53> 48 89 fb 48 83 ec 08 48 8b 77 08 0f 1f 44 00 00 65 4c 8b 34 25
RSP: 0018:ffffa666c4bafc68 EFLAGS: 00000202
RAX: 0000000000000000 RBX: 0000000040000004 RCX: 0000000000000000
RDX: 0000000040000000 RSI: 0000000000000000 RDI: ffffa666c4bafc98
RBP: ffffa666c4bafc88 R08: ffff8a3d82130928 R09: ffff8a3d82130928
R10: 0000000000000040 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8a451fd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000767ea57cc000 CR3: 00000005db436002 CR4: 0000000000370ee0
Call Trace:
 <NMI>
 ? show_regs+0x6a/0x80
 ? nmi_cpu_backtrace+0x9c/0x100
 ? nmi_cpu_backtrace_handler+0x11/0x20
 ? nmi_handle+0x62/0x160
 ? default_do_nmi+0x45/0x120
 ? exc_nmi+0x19f/0x250
 ? end_repeat_nmi+0x16/0x67
 ? do_wait+0x11/0x300
 ? do_wait+0x11/0x300
 ? do_wait+0x11/0x300
 </NMI>
 <TASK>
 kernel_wait4+0xaf/0x150
 ? __pfx_child_wait_callback+0x10/0x10
 zap_pid_ns_processes+0x105/0x190
 forget_original_parent+0x2e4/0x360
 exit_notify+0x4a/0x210
 do_exit+0x30b/0x500
 ? srso_alias_return_thunk+0x5/0x7f
 ? wake_up_state+0x10/0x20
 ? srso_alias_return_thunk+0x5/0x7f
 do_group_exit+0x35/0x90
 __x64_sys_exit_group+0x18/0x20
 x64_sys_call+0xd95/0x1ff0
 do_syscall_64+0x56/0x80
 ? srso_alias_return_thunk+0x5/0x7f
 ? handle_mm_fault+0x128/0x290
 ? srso_alias_return_thunk+0x5/0x7f
 ? srso_alias_return_thunk+0x5/0x7f
 ? exit_to_user_mode_prepare+0x49/0x100
 ? srso_alias_return_thunk+0x5/0x7f
 ? irqentry_exit_to_user_mode+0x19/0x30
 ? srso_alias_return_thunk+0x5/0x7f
 ? irqentry_exit+0x1d/0x30
 ? srso_alias_return_thunk+0x5/0x7f
 ? exc_page_fault+0x80/0x160
 entry_SYSCALL_64_after_hwframe+0x73/0xdd
RIP: 0033:0x75fce9367f8e
Code: Unable to access opcode bytes at 0x75fce9367f64.
RSP: 002b:00007ffc80c04b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 000075fce672d1b0 RCX: 000075fce9367f8e
RDX: 000075fce93e30c0 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007ffc80c04b60 R08: 0000000000000024 R09: 0000000800000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
R13: 000075fce9081a90 R14: 0000000000000000 R15: 000075fce6771d50
 </TASK>
watchdog: BUG: soft lockup - CPU#5 stuck for 85s! [npm start:2522045]
Modules linked in: tls raw_diag unix_diag af_packet_diag netlink_diag udp_diag tcp_diag inet_diag xt_statistic xt_mark veth xt_comment xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_nat xt_MASQUERADE nft_chain_nat nf_nat nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter bridge stp llc overlay binfmt_misc nls_iso8859_1 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_owner xt_tcpudp nft_compat crct10dif_pclmul crc32_pclmul nf_tables polyval_clmulni polyval_generic ghash_clmulni_intel libcrc32c sha256_ssse3 joydev sha1_ssse3 hid_generic nfnetlink aesni_intel crypto_simd hyperv_drm cryptd hid_hyperv serio_raw drm_kms_helper hv_netvsc hid hyperv_keyboard pata_acpi drm_shmem_helper dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sch_fq_codel msr drm efi_pstore i2c_core ip_tables x_tables autofs4
CPU: 5 PID: 2522045 Comm: npm start Tainted: G             L     6.5.0-1021-azure #22~22.04.1-Ubuntu
Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018
RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x20
Code: cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 62 06 00 00 90 f7 c6 00 02 00 00 74 01 fb 5d <e9> d2 19 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffa666c4bafc30 EFLAGS: 00000206
RAX: 0000000000000001 RBX: ffffa666c4bafcc0 RCX: 0000000000000020
RDX: ffff8a3d82130928 RSI: 0000000000000282 RDI: ffff8a3d82130920
RBP: ffffa666c4bafc48 R08: ffff8a3d82130928 R09: ffff8a3d82130928
R10: 0000000000000040 R11: 0000000000000002 R12: ffff8a3d82130920
R13: ffff8a44f3db9980 R14: ffff8a44f3db9980 R15: ffff8a44f3db9970
FS:  0000000000000000(0000) GS:ffff8a451fd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000767ea57cc000 CR3: 00000005db436002 CR4: 0000000000370ee0
Call Trace:
 <IRQ>
 ? show_regs+0x6a/0x80
 ? watchdog_timer_fn+0x1ce/0x230
 ? __pfx_watchdog_timer_fn+0x10/0x10
 ? __hrtimer_run_queues+0x111/0x240
 ? srso_alias_return_thunk+0x5/0x7f
 ? hrtimer_interrupt+0x101/0x240
 ? hv_stimer0_isr+0x20/0x30
 ? __sysvec_hyperv_stimer0+0x32/0x70
 ? sysvec_hyperv_stimer0+0x7b/0x90
 </IRQ>
 <TASK>
 ? asm_sysvec_hyperv_stimer0+0x1b/0x20
 ? _raw_spin_unlock_irqrestore+0x19/0x20
 ? remove_wait_queue+0x47/0x50
 do_wait+0x19f/0x300
 kernel_wait4+0xaf/0x150
 ? __pfx_child_wait_callback+0x10/0x10
 zap_pid_ns_processes+0x105/0x190
 forget_original_parent+0x2e4/0x360
 exit_notify+0x4a/0x210
 do_exit+0x30b/0x500
 ? srso_alias_return_thunk+0x5/0x7f
 ? wake_up_state+0x10/0x20
 ? srso_alias_return_thunk+0x5/0x7f
 do_group_exit+0x35/0x90
 __x64_sys_exit_group+0x18/0x20
 x64_sys_call+0xd95/0x1ff0
 do_syscall_64+0x56/0x80
 ? srso_alias_return_thunk+0x5/0x7f
 ? handle_mm_fault+0x128/0x290
 ? srso_alias_return_thunk+0x5/0x7f
 ? srso_alias_return_thunk+0x5/0x7f
 ? exit_to_user_mode_prepare+0x49/0x100
 ? srso_alias_return_thunk+0x5/0x7f
 ? irqentry_exit_to_user_mode+0x19/0x30
 ? srso_alias_return_thunk+0x5/0x7f
 ? irqentry_exit+0x1d/0x30
 ? srso_alias_return_thunk+0x5/0x7f
 ? exc_page_fault+0x80/0x160
 entry_SYSCALL_64_after_hwframe+0x73/0xdd
RIP: 0033:0x75fce9367f8e
Code: Unable to access opcode bytes at 0x75fce9367f64.
RSP: 002b:00007ffc80c04b18 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 000075fce672d1b0 RCX: 000075fce9367f8e
RDX: 000075fce93e30c0 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007ffc80c04b60 R08: 0000000000000024 R09: 0000000800000000
R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000001
R13: 000075fce9081a90 R14: 0000000000000000 R15: 000075fce6771d50
 </TASK>
```

>
> Any chance you can test the patch below? This patch makes sense anyway,
> I'll send it later. But I am not sure it can fix your problem.

Sure! Will do!

Thanks

> Oleg.
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index dc48fecfa1dc..25f3cf679b35 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	 */
>  	do {
>  		clear_thread_flag(TIF_SIGPENDING);
> +		clear_thread_flag(TIF_NOTIFY_SIGNAL);
>  		rc = kernel_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);

Wei, Fu

[1]: https://elixir.bootlin.com/linux/v6.5/source/kernel/rcu/tree_plugin.h#L964
[2]: https://elixir.bootlin.com/linux/v6.5/source/kernel/rcu/tree_plugin.h#L848

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RCU] zombie task hung in synchronize_rcu_expedited 2024-06-06 15:45 ` Wei Fu @ 2024-06-06 17:28 ` Oleg Nesterov 2024-06-07 3:02 ` Wei Fu 0 siblings, 1 reply; 23+ messages in thread From: Oleg Nesterov @ 2024-06-06 17:28 UTC (permalink / raw) To: Wei Fu Cc: Sudhanva.Huruli, akpm, apais, axboe, boqun.feng, brauner, ebiederm, frederic, j.granados, jiangshanlai, joel, josh, linux-kernel, mathieu.desnoyers, michael.christie, mjguzik, neeraj.upadhyay, paulmck, qiang.zhang1211, rachelmenge, rcu, rostedt, weifu Hi Wei, thanks for more info. On 06/06, Wei Fu wrote: > > > Well, due to unfortunate design zap_pid_ns_processes() can hang "forever" > > if this namespace has a (zombie) task injected from the parent ns, this > > task should be reaped by its parent. > > That zombie task was cloned by pid-1 process in that pid namespace. In my last > reproduced log, the process tree in that pid namespace looks like OK, > ``` > # unshare(CLONE_NEWPID | CLONE_NEWNS) > > npm start (pid 2522045) > |__npm run zombie (pid 2522605) > |__ sh -c "whle true; do echo zombie; sleep 1; done" (pid 2522869) > ``` only 3 processes? nothing is running? Is the last process 2522869 a zombie too? Could you show your .config? In particular, CONFIG_PREEMPT... > The `npm start (pid 2522045)` was stuck in kernel_wait4. And its child, so this is the init task in this namespace, > `npm run zombie (pid 2522605)`, has two threads. One of them was in D status. ... > $ sudo cat /proc/2522605/task/*/stack > [<0>] synchronize_rcu_expedited+0x177/0x1f0 > [<0>] namespace_unlock+0xd6/0x1b0 > [<0>] put_mnt_ns+0x73/0xa0 > [<0>] free_nsproxy+0x1c/0x1b0 > [<0>] switch_task_namespaces+0x5d/0x70 > [<0>] exit_task_namespaces+0x10/0x20 > [<0>] do_exit+0x2ce/0x500 > [<0>] io_sq_thread+0x48e/0x5a0 > [<0>] ret_from_fork+0x3c/0x60 > [<0>] ret_from_fork_asm+0x1b/0x30 so I guess this is the trace of its sub-thread 2522645. What about the process 2522605? Has it exited too? 
> > But zap_pid_ns_processes() shouldn't cause the soft-lockup, it should > > sleep in kernel_wait4(). > > I run `cat /proc/2522045/status` and found that the status was kept switching > between running and sleeping. OK, this shouldn't happen in this case. So it really looks like it spins in a busy-wait loop because TIF_NOTIFY_SIGNAL is not cleared. It can be reported as sleeping because do_wait() sets/clears TASK_INTERRUPTIBLE, although the window is small... Oleg. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RCU] zombie task hung in synchronize_rcu_expedited
  2024-06-06 17:28     ` Oleg Nesterov
@ 2024-06-07  3:02       ` Wei Fu
  2024-06-07  6:25         ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Wei Fu @ 2024-06-07  3:02 UTC (permalink / raw)
  To: oleg
  Cc: Sudhanva.Huruli, akpm, apais, axboe, boqun.feng, brauner, ebiederm,
      frederic, fuweid89, j.granados, jiangshanlai, joel, josh,
      linux-kernel, mathieu.desnoyers, michael.christie, mjguzik,
      neeraj.upadhyay, paulmck, qiang.zhang1211, rachelmenge, rcu,
      rostedt, weifu

> > ```
> > # unshare(CLONE_NEWPID | CLONE_NEWNS)
> >
> > npm start (pid 2522045)
> > |__npm run zombie (pid 2522605)
> >    |__ sh -c "whle true; do echo zombie; sleep 1; done" (pid 2522869)
> > ```
>
> only 3 processes? nothing is running? Is the last process 2522869 a
> zombie too?

Yes. The pid-2522045 process sent SIGKILL to all the processes in that
pid namespace when it exited. The last process 2522869 was a zombie as
well.

Sometimes, `npm start` could exit before `npm run zombie` forks `sh`,
so you might see only two processes in that pid namespace.

>
> Could you show your .config? In particular, CONFIG_PREEMPT...

I'm using the [6.5.0-1021-azure][1] kernel and preempt is disabled.
Highlighted parts of the .config:
``` $ cat /boot/config-6.5.0-1021-azure | grep _RCU CONFIG_TREE_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_TASKS_RCU_GENERIC=y CONFIG_TASKS_RUDE_RCU=y CONFIG_TASKS_TRACE_RCU=y CONFIG_RCU_STALL_COMMON=y CONFIG_RCU_NEED_SEGCBLIST=y CONFIG_RCU_NOCB_CPU=y # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set # CONFIG_RCU_LAZY is not set CONFIG_MMU_GATHER_RCU_TABLE_FREE=y # CONFIG_RCU_SCALE_TEST is not set # CONFIG_RCU_TORTURE_TEST is not set # CONFIG_RCU_REF_SCALE_TEST is not set CONFIG_RCU_CPU_STALL_TIMEOUT=60 CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0 CONFIG_RCU_CPU_STALL_CPUTIME=y # CONFIG_RCU_TRACE is not set # CONFIG_RCU_EQS_DEBUG is not set $ cat /boot/config-6.5.0-1021-azure | grep _PREEMPT CONFIG_PREEMPT_VOLUNTARY_BUILD=y # CONFIG_PREEMPT_NONE is not set CONFIG_PREEMPT_VOLUNTARY=y # CONFIG_PREEMPT is not set # CONFIG_PREEMPT_DYNAMIC is not set CONFIG_HAVE_PREEMPT_DYNAMIC=y CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y CONFIG_PREEMPT_NOTIFIERS=y CONFIG_DRM_I915_PREEMPT_TIMEOUT=640 CONFIG_DRM_I915_PREEMPT_TIMEOUT_COMPUTE=7500 # CONFIG_PREEMPTIRQ_DELAY_TEST is not set $ cat /boot/config-6.5.0-1021-azure | grep HZ CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set # CONFIG_NO_HZ_IDLE is not set CONFIG_NO_HZ_FULL=y CONFIG_NO_HZ=y # CONFIG_HZ_100 is not set CONFIG_HZ_250=y # CONFIG_HZ_300 is not set # CONFIG_HZ_1000 is not set CONFIG_HZ=250 CONFIG_MACHZ_WDT=m ``` > > > The `npm start (pid 2522045)` was stuck in kernel_wait4. And its child, > > so this is the init task in this namespace, Yes~ > > > `npm run zombie (pid 2522605)`, has two threads. One of them was in D status. > ... 
> > $ sudo cat /proc/2522605/task/*/stack
> > [<0>] synchronize_rcu_expedited+0x177/0x1f0
> > [<0>] namespace_unlock+0xd6/0x1b0
> > [<0>] put_mnt_ns+0x73/0xa0
> > [<0>] free_nsproxy+0x1c/0x1b0
> > [<0>] switch_task_namespaces+0x5d/0x70
> > [<0>] exit_task_namespaces+0x10/0x20
> > [<0>] do_exit+0x2ce/0x500
> > [<0>] io_sq_thread+0x48e/0x5a0
> > [<0>] ret_from_fork+0x3c/0x60
> > [<0>] ret_from_fork_asm+0x1b/0x30
>
> so I guess this is the trace of its sub-thread 2522645.

Sorry for the unclear message. Yes~

>
> What about the process 2522605? Has it exited too?

The process-2522605 has two threads. The main thread-2522605 was in
zombie status, so yes, the main thread has exited as well. Only
thread-2522645 was stuck in synchronize_rcu_expedited.

> > > But zap_pid_ns_processes() shouldn't cause the soft-lockup, it should
> > > sleep in kernel_wait4().
> >
> > I run `cat /proc/2522045/status` and found that the status was kept switching
> > between running and sleeping.
>
> OK, this shouldn't happen in this case. So it really looks like it spins
> in a busy-wait loop because TIF_NOTIFY_SIGNAL is not cleared. It can be
> reported as sleeping because do_wait() sets/clears TASK_INTERRUPTIBLE,
> although the window is small...
>

I can reproduce this issue on v5.15, v6.1, v6.5, v6.8, v6.9 and
v6.10-rc2. All the kernels disable CONFIG_PREEMPT and PREEMPT_RCU.

It's very easy to reproduce this on v5.15.x with 8 vcores within a few
minutes. For the other kernel versions, it can take 30 minutes or a few
hours.

Rachel provided a [golang-repro][2] which is similar to the docker
repro. It can be built as a static binary, which makes it easy to
reproduce.

Hope this information can help.

Thanks,

Wei

[1]: https://gist.github.com/fuweid/ae8bad349fee3e00a4f1ce82397831ac
[2]: https://github.com/rlmenge/rcu-soft-lock-issue-repro?tab=readme-ov-file#golang-repro

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RCU] zombie task hung in synchronize_rcu_expedited
  2024-06-07  3:02       ` Wei Fu
@ 2024-06-07  6:25         ` Oleg Nesterov
  2024-06-07 15:04           ` Wei Fu
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2024-06-07  6:25 UTC (permalink / raw)
  To: Wei Fu
  Cc: Sudhanva.Huruli, akpm, apais, axboe, boqun.feng, brauner, ebiederm,
      frederic, j.granados, jiangshanlai, joel, josh, linux-kernel,
      mathieu.desnoyers, michael.christie, mjguzik, neeraj.upadhyay,
      paulmck, qiang.zhang1211, rachelmenge, rcu, rostedt, weifu

Thanks for this info,

On 06/07, Wei Fu wrote:
>
> All the kernels disable CONFIG_PREEMPT and PREEMPT_RCU.

Ah, this can explain both the soft-lockup and the synchronize_rcu()
hang, if my theory is correct.

Can you try the patch I sent?

Oleg.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RCU] zombie task hung in synchronize_rcu_expedited
  2024-06-07  6:25         ` Oleg Nesterov
@ 2024-06-07 15:04           ` Wei Fu
  2024-06-07 21:22             ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Wei Fu @ 2024-06-07 15:04 UTC (permalink / raw)
  To: oleg
  Cc: Sudhanva.Huruli, akpm, apais, axboe, boqun.feng, brauner, ebiederm,
      frederic, fuweid89, j.granados, jiangshanlai, joel, josh,
      linux-kernel, mathieu.desnoyers, michael.christie, mjguzik,
      neeraj.upadhyay, paulmck, qiang.zhang1211, rachelmenge, rcu,
      rostedt, weifu

Hi!

> On 06/07, Wei Fu wrote:
> >
> > All the kernels disable CONFIG_PREEMPT and PREEMPT_RCU.
>
> Ah, this can explain both the soft-lockup and the synchronize_rcu()
> hang, if my theory is correct.
>
> Can you try the patch I sent?
>
> Oleg.
>

Yes. I applied your patch on v5.15.160 and ran the reproducer for 5
hours. I didn't see this issue. Currently, it looks good! I will
continue the test this weekend.

In your last reply, you mentioned TIF_NOTIFY_SIGNAL in relation to the
busy-wait loop. Would you please explain why clearing the flag works
here?
Thanks, Wei ``` ➜ linux git:(v5.15.160) ✗ git --no-pager show commit c61bd26ae81a896c8660150b4e356153da30880a (HEAD, tag: v5.15.160, origin/linux-5.15.y) Author: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Date: Sat May 25 16:20:19 2024 +0200 Linux 5.15.160 Link: https://lore.kernel.org/r/20240523130327.956341021@linuxfoundation.org Tested-by: SeongJae Park <sj@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Tested-by: Florian Fainelli <florian.fainelli@broadcom.com> Tested-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Shuah Khan <skhan@linuxfoundation.org> Tested-by: Ron Economos <re@w6rz.net> Tested-by: Kelsey Steele <kelseysteele@linux.microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> diff --git a/Makefile b/Makefile index 5cbfe2be72dd..bfc863d71978 100644 --- a/Makefile +++ b/Makefile @@ -1,7 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 VERSION = 5 PATCHLEVEL = 15 -SUBLEVEL = 159 +SUBLEVEL = 160 EXTRAVERSION = NAME = Trick or Treat ➜ linux git:(v5.15.160) ✗ git --no-pager diff . diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 259fc4ca0d9c..40b011f88067 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -214,6 +214,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) */ do { clear_thread_flag(TIF_SIGPENDING); + clear_thread_flag(TIF_NOTIFY_SIGNAL); rc = kernel_wait4(-1, NULL, __WALL, NULL); } while (rc != -ECHILD); ``` ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [RCU] zombie task hung in synchronize_rcu_expedited
  2024-06-07 15:04           ` Wei Fu
@ 2024-06-07 21:22             ` Oleg Nesterov
  2024-06-08 12:42               ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2024-06-07 21:22 UTC (permalink / raw)
  To: Wei Fu
  Cc: Sudhanva.Huruli, akpm, apais, axboe, boqun.feng, brauner, ebiederm,
      frederic, j.granados, jiangshanlai, joel, josh, linux-kernel,
      mathieu.desnoyers, michael.christie, mjguzik, neeraj.upadhyay,
      paulmck, qiang.zhang1211, rachelmenge, rcu, rostedt, weifu

On 06/07, Wei Fu wrote:
>
> Yes. I applied your patch on v5.15.160 and ran the reproducer for 5
> hours. I didn't see this issue. Currently, it looks good! I will
> continue the test this weekend.

Great, thanks!

> In your last reply, you mentioned TIF_NOTIFY_SIGNAL in relation to the
> busy-wait loop. Would you please explain why clearing the flag works
> here?

Sure, I'll write the changelog with the explanation and send the patch
over the weekend, if it passes your testing.

But in short this is very simple. zap_pid_ns_processes() clears
TIF_SIGPENDING exactly because we want to avoid the busy-wait loop. But
today this is not enough to make signal_pending() return false, see
include/linux/sched/signal.h:signal_pending().

Thanks,

Oleg.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RCU] zombie task hung in synchronize_rcu_expedited
  2024-06-07 21:22             ` Oleg Nesterov
@ 2024-06-08 12:42               ` Oleg Nesterov
  2024-06-10  0:07                 ` Wei Fu
  0 siblings, 1 reply; 23+ messages in thread
From: Oleg Nesterov @ 2024-06-08 12:42 UTC (permalink / raw)
  To: Wei Fu
  Cc: Sudhanva.Huruli, akpm, apais, axboe, boqun.feng, brauner, ebiederm,
      frederic, j.granados, jiangshanlai, joel, josh, linux-kernel,
      mathieu.desnoyers, michael.christie, mjguzik, neeraj.upadhyay,
      paulmck, qiang.zhang1211, rachelmenge, rcu, rostedt, weifu

On 06/07, Oleg Nesterov wrote:
>
> On 06/07, Wei Fu wrote:
> >
> > Yes. I applied your patch on v5.15.160 and ran the reproducer for 5
> > hours. I didn't see this issue. Currently, it looks good! I will
> > continue the test this weekend.
>
> Great, thanks!
>
> > In your last reply, you mentioned TIF_NOTIFY_SIGNAL in relation to
> > the busy-wait loop. Would you please explain why clearing the flag
> > works here?
>
> Sure, I'll write the changelog with the explanation and send the patch
> over the weekend, if it passes your testing.

Please see the patch I've sent. The changelog doesn't bother to describe
this particular problem because busy-waiting can obviously cause multiple
problems, especially without CONFIG_PREEMPT or if rt_task().

So let me add more details about this particular deadlock here.

The sub-namespace init task T spins in a tight loop calling kernel_wait4()
which returns -EINTR without sleeping because its child C has not exited
yet and signal_pending(T) is true due to TIF_NOTIFY_SIGNAL.

The exiting child C sleeps in synchronize_rcu() which hangs exactly because
T never calls schedule/rcu_note_context_switch, it can't be preempted
because CONFIG_PREEMPT is not enabled.

Note also that without PREEMPT_RCU __rcu_read_lock() is just
preempt_disable() which is a nop without CONFIG_PREEMPT.

Oleg.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [RCU] zombie task hung in synchronize_rcu_expedited 2024-06-08 12:42 ` Oleg Nesterov @ 2024-06-10 0:07 ` Wei Fu 0 siblings, 0 replies; 23+ messages in thread From: Wei Fu @ 2024-06-10 0:07 UTC (permalink / raw) To: oleg Cc: Sudhanva.Huruli, akpm, apais, axboe, boqun.feng, brauner, ebiederm, frederic, fuweid89, j.granados, jiangshanlai, joel, josh, linux-kernel, mathieu.desnoyers, michael.christie, mjguzik, neeraj.upadhyay, paulmck, qiang.zhang1211, rachelmenge, rcu, rostedt, weifu > > On 06/07, Oleg Nesterov wrote: > > > > On 06/07, Wei Fu wrote: > > > > > > Yes. I applied your patch on v5.15.160 and run reproducer for 5 hours. > > > I didn't see this issue. Currently, it looks good!. I will continue that test > > > on this weekend. > > > > Great, thanks! > > > > > In last reply, you mentioned TIF_NOTIFY_SIGNAL related to busy-wait loop. > > > Would you please explain why flag-clear works here? > > > > Sure, I'll write the changelog with the explanation and send the patch on > > weekend. If it passes your testing. > > Please see the patch I've sent. The changelog doesn't bother to describe this > particular problem because busy-waiting can obviously cause multiple problems, > especially without CONFIG_PREEMPT or if rt_task(). > > So let me add more details about this particular deadlock here. > > The sub-namespace init task T spins in a tight loop calling kernel_wait4() > which returns -EINTR without sleeping because its child C has not exited > yet and signal_pending(T) is true due to TIF_NOTIFY_SIGNAL. > > The exiting child C sleeps in synchronize_rcu() which hangs exactly because > T never calls schedule/rcu_note_context_switch, it can't be preempted because > CONFIG_PREEMPT is not enabled. > > Note also that without PREEMPT_RCU __rcu_read_lock() is just preempt_disable() > which is nop without CONFIG_PREEMPT. > > Oleg. > > Thanks for the update. That's really helpful! Wei ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING 2024-06-05 23:42 [RCU] zombie task hung in synchronize_rcu_expedited Rachel Menge 2024-06-06 11:10 ` Oleg Nesterov @ 2024-06-08 12:06 ` Oleg Nesterov 2024-06-08 17:00 ` Boqun Feng ` (3 more replies) 2024-06-08 15:48 ` [PATCH] zap_pid_ns_processes: don't send SIGKILL to sub-threads Oleg Nesterov 2 siblings, 4 replies; 23+ messages in thread From: Oleg Nesterov @ 2024-06-08 12:06 UTC (permalink / raw) To: Andrew Morton, Rachel Menge Cc: linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Jens Axboe, Christian Brauner, Mike Christie, Joel Granados, Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang kernel_wait4() doesn't sleep and returns -EINTR if there is no eligible child and signal_pending() is true. That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending() return false and avoid a busy-wait loop. Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") Reported-by: Rachel Menge <rachelmenge@linux.microsoft.com> Closes: https://lore.kernel.org/all/1386cd49-36d0-4a5c-85e9-bc42056a5a38@linux.microsoft.com/ Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/pid_namespace.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index dc48fecfa1dc..25f3cf679b35 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) */ do { clear_thread_flag(TIF_SIGPENDING); + clear_thread_flag(TIF_NOTIFY_SIGNAL); rc = kernel_wait4(-1, NULL, __WALL, NULL); } while (rc != -ECHILD); -- 2.25.1.362.g51ebf55 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING 2024-06-08 12:06 ` [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING Oleg Nesterov @ 2024-06-08 17:00 ` Boqun Feng 2024-06-09 14:12 ` Wei Fu ` (2 subsequent siblings) 3 siblings, 0 replies; 23+ messages in thread From: Boqun Feng @ 2024-06-08 17:00 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Rachel Menge, linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Jens Axboe, Christian Brauner, Mike Christie, Joel Granados, Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang On Sat, Jun 08, 2024 at 02:06:16PM +0200, Oleg Nesterov wrote: > kernel_wait4() doesn't sleep and returns -EINTR if there is no > eligible child and signal_pending() is true. > > That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not > enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending() > return false and avoid a busy-wait loop. > > Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") > Reported-by: Rachel Menge <rachelmenge@linux.microsoft.com> > Closes: https://lore.kernel.org/all/1386cd49-36d0-4a5c-85e9-bc42056a5a38@linux.microsoft.com/ > Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Boqun Feng <boqun.feng@gmail.com> Wei, appreciate it if you could share some test result and provide a Tested-by tag. Thanks! 
Regards, Boqun > --- > kernel/pid_namespace.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > index dc48fecfa1dc..25f3cf679b35 100644 > --- a/kernel/pid_namespace.c > +++ b/kernel/pid_namespace.c > @@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) > */ > do { > clear_thread_flag(TIF_SIGPENDING); > + clear_thread_flag(TIF_NOTIFY_SIGNAL); > rc = kernel_wait4(-1, NULL, __WALL, NULL); > } while (rc != -ECHILD); > > -- > 2.25.1.362.g51ebf55 > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING 2024-06-08 12:06 ` [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING Oleg Nesterov 2024-06-08 17:00 ` Boqun Feng @ 2024-06-09 14:12 ` Wei Fu 2024-06-12 16:57 ` Jens Axboe 2024-06-13 12:40 ` Eric W. Biederman 3 siblings, 0 replies; 23+ messages in thread From: Wei Fu @ 2024-06-09 14:12 UTC (permalink / raw) To: oleg Cc: Sudhanva.Huruli, akpm, apais, axboe, boqun.feng, brauner, frederic, fuweid89, j.granados, jiangshanlai, joel, josh, linux-kernel, mathieu.desnoyers, michael.christie, mjguzik, neeraj.upadhyay, paulmck, qiang.zhang1211, rachelmenge, rcu, rostedt > kernel_wait4() doesn't sleep and returns -EINTR if there is no > eligible child and signal_pending() is true. > > That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not > enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending() > return false and avoid a busy-wait loop. > > Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL") > Reported-by: Rachel Menge <rachelmenge@linux.microsoft.com> > Closes: https://lore.kernel.org/all/1386cd49-36d0-4a5c-85e9-bc42056a5a38@linux.microsoft.com/ > Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-By: Wei Fu <fuweid89@gmail.com> This change looks good to me! I used [rcudeadlock-v1][1] to verify this patch on v5.15.160 for more than 30 hours. The soft lockup didn't show up. If there is no such patch, that test will trigger soft-lockup in 10 minutes. ``` root@(none):/# uname -a Linux (none) 5.15.160-dirty #7 SMP Fri Jun 7 15:25:30 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux root@(none):/# ps -ef | grep rcu root 3 2 0 Jun07 ? 00:00:00 [rcu_gp] root 4 2 0 Jun07 ? 00:00:00 [rcu_par_gp] root 11 2 0 Jun07 ? 00:00:00 [rcu_tasks_rude_] root 12 2 0 Jun07 ? 00:00:00 [rcu_tasks_trace] root 15 2 0 Jun07 ? 00:03:31 [rcu_sched] root 145 141 0 Jun07 ? 00:15:29 ./rcudeadlock root 5372 141 0 13:37 ? 
00:00:00 grep rcu
root@(none):/# date
Sun Jun  9 13:37:38 UTC 2024
```

I used [rcudeadlock-v2][2] to verify this patch on v6.10-rc2 for more
than 2 hours. The soft lockup didn't show up. Without this patch, that
test triggers the soft lockup within 1 minute.

```
root@(none):/# uname -a
Linux (none) 6.10.0-rc2-dirty #4 SMP Sun Jun 9 11:19:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
root@(none):/# ps -ef | grep rcu
root           4       2  0 11:20 ?        00:00:00 [kworker/R-rcu_g]
root          13       2  0 11:20 ?        00:00:00 [rcu_tasks_rude_kthread]
root          14       2  0 11:20 ?        00:00:00 [rcu_tasks_trace_kthread]
root          16       2  0 11:20 ?        00:00:03 [rcu_sched]
root          17       2  0 11:20 ?        00:00:00 [rcu_exp_par_gp_kthread_worker/0]
root          18       2  0 11:20 ?        00:00:12 [rcu_exp_gp_kthread_worker]
root         117     108  0 11:21 ?        00:01:06 ./rcudeadlock
root       14451     108  0 13:37 ?        00:00:00 grep rcu
root@(none):/# date
Sun Jun  9 13:37:15 UTC 2024
```

The issue is about a data race during cleanup of an active
iou-wrk-thread. I'll share the idea about how to verify this patch
below.

> ---
>  kernel/pid_namespace.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index dc48fecfa1dc..25f3cf679b35 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	 */
>  	do {
>  		clear_thread_flag(TIF_SIGPENDING);
> +		clear_thread_flag(TIF_NOTIFY_SIGNAL);
>  		rc = kernel_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);
>
> --
> 2.25.1.362.g51ebf55
>
>

Let's assume there is a new pid namespace unshared from the host pid
namespace, named `PA`. There are two processes in `PA`. The init
process is named `X` and its child is named `Y`.

```
unshare(CLONE_NEWPID|CLONE_NEWNS)

X
|__ Y
```

The main-thread of process X creates one active io_uring worker thread
`iou-wrk-X`. When process X exits, the main-thread of process X wakes
up and sets the `TIF_NOTIFY_SIGNAL` flag on the `iou-wrk-X` thread.

However, if the `iou-wrk-X` thread receives the signal from the
main-thread and wakes up, that thread isn't able to clear the
`TIF_NOTIFY_SIGNAL` flag. And since that `iou-wrk-X` thread is the last
thread in process-X, it will carry the `TIF_NOTIFY_SIGNAL` flag into
`zap_pid_ns_processes`. It can be described by the following diagram.

```
== X main-thread ==            == X iou-wrk-X ==             == Y main-thread ==

do_exit
  kill iou-wrk-X thread
  io_uring_files_cancel        io_wq_worker
    set TIF_NOTIFY_SIGNAL
      on iou-wrk-X thread      do_exit(0)
exit_task_namespace            exit_task_namespace
do_task_dead                   exit_notify
                                 forget_original_parent
                                   find_child_reaper
                                     zap_pid_ns_processes    do_exit
                                                             exit_task_namespace
                                                             ...
                                                             namespace_unlock
                                                             synchronize_rcu_expedited
```

The `iou-wrk-X` thread kills process-Y, which is the only one holding
the mount namespace reference. Process-Y will get into
`synchronize_rcu_expedited`.

Since the kernel doesn't enable preempt and the `iou-wrk-X` thread has
the `TIF_NOTIFY_SIGNAL` flag, the `iou-wrk-X` thread will get into an
infinite loop, which causes the soft lockup.

So, in the [rcudeadlock-v2][2] test, I create more active iou-wrk-
threads in the init process so that there is a higher chance of having
an iou-wrk- thread in the `zap_pid_ns_processes` function.

Hope it can help.

Thanks,

Wei

[1]: https://github.com/rlmenge/rcu-soft-lock-issue-repro/blob/662b8e414ff15d75419e2286b8121b7c2049a37c/rcudeadlock.go#L1
[2]: https://github.com/rlmenge/rcu-soft-lock-issue-repro/pull/1

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING 2024-06-08 12:06 ` [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING Oleg Nesterov 2024-06-08 17:00 ` Boqun Feng 2024-06-09 14:12 ` Wei Fu @ 2024-06-12 16:57 ` Jens Axboe 2024-06-13 12:40 ` Eric W. Biederman 3 siblings, 0 replies; 23+ messages in thread From: Jens Axboe @ 2024-06-12 16:57 UTC (permalink / raw) To: Oleg Nesterov, Andrew Morton, Rachel Menge Cc: linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Christian Brauner, Mike Christie, Joel Granados, Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang On 6/8/24 6:06 AM, Oleg Nesterov wrote: > kernel_wait4() doesn't sleep and returns -EINTR if there is no > eligible child and signal_pending() is true. > > That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not > enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending() > return false and avoid a busy-wait loop. Reviewed-by: Jens Axboe <axboe@kernel.dk> -- Jens Axboe ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING 2024-06-08 12:06 ` [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING Oleg Nesterov ` (2 preceding siblings ...) 2024-06-12 16:57 ` Jens Axboe @ 2024-06-13 12:40 ` Eric W. Biederman 2024-06-13 14:02 ` Wei Fu 2024-06-13 15:30 ` Oleg Nesterov 3 siblings, 2 replies; 23+ messages in thread From: Eric W. Biederman @ 2024-06-13 12:40 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Rachel Menge, linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Jens Axboe, Christian Brauner, Mike Christie, Joel Granados, Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang Oleg Nesterov <oleg@redhat.com> writes: > kernel_wait4() doesn't sleep and returns -EINTR if there is no > eligible child and signal_pending() is true. > > That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not > enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending() > return false and avoid a busy-wait loop. I took a look through the code. It used to be that TIF_NOTIFY_SIGNAL was all about waking up a task so that task_work_run can be used. io_uring still mostly uses it that way. There is also a use in kthread_stop that just uses it as a TIF_SIGPENDING without having a pending signal. At the point in do_exit where exit_notify and thus zap_pid_ns_processes is called I can't possibly see a use for TIF_NOTIFY_SIGNAL. exit_task_work, exit_signals, and io_uring_cancel have all been called. So TIF_NOTIFY_SIGNAL should be spurious at this point and safe to clear. Why it remains set is a mystery to me. If I had infinite time and energy the ideal is to rework the pid namespace exit logic so that waiting for everything to exit works like delay_group_leader in wait_task_consider. 
Simply blocking reaping of the pid namespace leader until everything in
the pid namespace has been reaped.

I think acct_exit_ns is the only piece of code that needs to be moved
to allow that, and acct_exit_ns is purely bookkeeping so does not
affect userspace-visible semantics.

This active waiting is weird and non-standard in the kernel and winds
up causing a problem every couple of years because of that.

>
> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> Reported-by: Rachel Menge <rachelmenge@linux.microsoft.com>
> Closes: https://lore.kernel.org/all/1386cd49-36d0-4a5c-85e9-bc42056a5a38@linux.microsoft.com/
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> ---
>  kernel/pid_namespace.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index dc48fecfa1dc..25f3cf679b35 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -218,6 +218,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	 */
>  	do {
>  		clear_thread_flag(TIF_SIGPENDING);
> +		clear_thread_flag(TIF_NOTIFY_SIGNAL);
>  		rc = kernel_wait4(-1, NULL, __WALL, NULL);
>  	} while (rc != -ECHILD);

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
  2024-06-13 12:40                 ` Eric W. Biederman
@ 2024-06-13 14:02                   ` Wei Fu
  2024-06-13 14:49                     ` Oleg Nesterov
  0 siblings, 1 reply; 23+ messages in thread
From: Wei Fu @ 2024-06-13 14:02 UTC (permalink / raw)
  To: ebiederm
  Cc: Sudhanva.Huruli, akpm, apais, axboe, boqun.feng, brauner,
      frederic, fuweid89, j.granados, jiangshanlai, joel, josh,
      linux-kernel, mathieu.desnoyers, michael.christie, mjguzik,
      neeraj.upadhyay, oleg, paulmck, qiang.zhang1211, rachelmenge,
      rcu, rostedt

>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > kernel_wait4() doesn't sleep and returns -EINTR if there is no
> > eligible child and signal_pending() is true.
> >
> > That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not
> > enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending()
> > return false and avoid a busy-wait loop.
>
> I took a look through the code. It used to be that TIF_NOTIFY_SIGNAL
> was all about waking up a task so that task_work_run can be used.
> io_uring still mostly uses it that way. There is also a use in
> kthread_stop that just uses it as a TIF_SIGPENDING without having a
> pending signal.
>
> At the point in do_exit where exit_notify and thus zap_pid_ns_processes
> is called I can't possibly see a use for TIF_NOTIFY_SIGNAL.
> exit_task_work, exit_signals, and io_uring_cancel have all been called.
>
> So TIF_NOTIFY_SIGNAL should be spurious at this point and safe to clear.
> Why it remains set is a mystery to me.

I think there is a case in which TIF_NOTIFY_SIGNAL remains set.

The init process has a main-thread, sub-thread-X and iou-wrk-thread-X
(created by sub-thread-X). When the main-thread enters exit_group, both
sub-thread-X and iou-wrk-thread-X get TIF_SIGPENDING set and wake up.
The sub-thread-X could call io_uring_cancel to set TIF_NOTIFY_SIGNAL
for iou-wrk-thread-X, which doesn't get a chance to clear it.

And then iou-wrk-thread-X enters the zap_pid_ns_processes function with
the TIF_NOTIFY_SIGNAL flag still set. If there are active processes in
that pid namespace, it will run into this issue.

Wei

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING
  2024-06-13 14:02                   ` Wei Fu
@ 2024-06-13 14:49                     ` Oleg Nesterov
  0 siblings, 0 replies; 23+ messages in thread
From: Oleg Nesterov @ 2024-06-13 14:49 UTC (permalink / raw)
  To: Wei Fu
  Cc: ebiederm, Sudhanva.Huruli, akpm, apais, axboe, boqun.feng,
      brauner, frederic, j.granados, jiangshanlai, joel, josh,
      linux-kernel, mathieu.desnoyers, michael.christie, mjguzik,
      neeraj.upadhyay, paulmck, qiang.zhang1211, rachelmenge, rcu,
      rostedt

On 06/13, Wei Fu wrote:
>
> I think there is a case in which TIF_NOTIFY_SIGNAL remains set.

[...snip...]

Of course! But please forget about io_uring, even if currently io_uring/
is the only user of TWA_SIGNAL. Just suppose that the exiting
task/thread races with task_work_add(TWA_SIGNAL); TIF_NOTIFY_SIGNAL
won't be cleared.

This is fine in that the exiting task T will do exit_task_work() and
after that task_work_add(T) can't succeed with or without TWA_SIGNAL.
So it can't miss the pending work.

But I think we can forget about TIF_NOTIFY_SIGNAL. To me, the problem
is that the state of signal_pending() of the exiting task was never
clearly defined, and I can't even recall how many times I mentioned
this in the previous discussions. TIF_NOTIFY_SIGNAL doesn't add more
confusion, imo.

Oleg.

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING 2024-06-13 12:40 ` Eric W. Biederman 2024-06-13 14:02 ` Wei Fu @ 2024-06-13 15:30 ` Oleg Nesterov 1 sibling, 0 replies; 23+ messages in thread From: Oleg Nesterov @ 2024-06-13 15:30 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Rachel Menge, linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Jens Axboe, Christian Brauner, Mike Christie, Joel Granados, Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang On 06/13, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > kernel_wait4() doesn't sleep and returns -EINTR if there is no > > eligible child and signal_pending() is true. > > > > That is why zap_pid_ns_processes() clears TIF_SIGPENDING but this is not > > enough, it should also clear TIF_NOTIFY_SIGNAL to make signal_pending() > > return false and avoid a busy-wait loop. > > I took a look through the code. It used to be that TIF_NOTIFY_SIGNAL > was all about waking up a task so that task_work_run can be used. > io_uring still mostly uses it that way. There is also a use in > kthread_stop that just uses it as a TIF_SIGPENDING without having a > pending signal. > > At the point in do_exit where exit_notify and thus zap_pid_ns_processes > is called I can't possibly see a use for TIF_NOTIFY_SIGNAL. > exit_task_work, exit_signals, and io_uring_cancel have all been called. > > So TIF_NOTIFY_SIGNAL should be spurious at this point and safe to clear. > Why it remains set is a mystery to me. because exit_task_work() -> task_work_run() doesn't clear TIF_NOTIFY_SIGNAL. So yes, it is spurious, but to me a possible TIF_SIGPENDING is even more "spurious". See my reply to Wei. 
We don't need to clear TIF_NOTIFY_SIGNAL inside the loop, task_work_add() can't succeed after exit_task_work() sets ->task_works = &work_exited, but this is another story and this doesn't (well, shouldn't) differ from TIF_SIGPENDING. > If I had infinite time and energy the ideal is to rework the pid > namespace exit logic Perhaps in this case you could take a look at the next loop waiting for pid_ns->pid_allocated == init_pids ;) I always hated the fact that the exiting sub-namespace init can "hang forever" if this namespace has tasks injected from the parent namespace. And I had enough hard-to-debug internal bug reports which blamed the kernel. > This active waiting is weird and non-standard in the kernel and winds up > causeing a problem every couple of years because of that. Agreed. Oleg. ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH] zap_pid_ns_processes: don't send SIGKILL to sub-threads 2024-06-05 23:42 [RCU] zombie task hung in synchronize_rcu_expedited Rachel Menge 2024-06-06 11:10 ` Oleg Nesterov 2024-06-08 12:06 ` [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING Oleg Nesterov @ 2024-06-08 15:48 ` Oleg Nesterov 2024-06-13 13:01 ` Eric W. Biederman 2 siblings, 1 reply; 23+ messages in thread From: Oleg Nesterov @ 2024-06-08 15:48 UTC (permalink / raw) To: Andrew Morton, Rachel Menge Cc: linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Jens Axboe, Christian Brauner, Mike Christie, Joel Granados, Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang The comment above the idr_for_each_entry_continue() loop tries to explain why we have to signal each thread in the namespace, but it is outdated. This code no longer uses kill_proc_info(), we have a target task so we can check thread_group_leader() and avoid the unnecessary group_send_sig_info. Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID, this way it returns NULL if this pid is not a group-leader pid. Also, change this code to check SIGNAL_GROUP_EXIT, the exiting process / thread doesn't necessarily have a pending SIGKILL. Either way these checks are racy without siglock, so the patch uses data_race() to shut up KCSAN. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- kernel/pid_namespace.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-) diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c index 25f3cf679b35..0f9bd67c9e75 100644 --- a/kernel/pid_namespace.c +++ b/kernel/pid_namespace.c @@ -191,21 +191,14 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) * The last thread in the cgroup-init thread group is terminating. * Find remaining pid_ts in the namespace, signal and wait for them * to exit. 
- * - * Note: This signals each threads in the namespace - even those that - * belong to the same thread group, To avoid this, we would have - * to walk the entire tasklist looking a processes in this - * namespace, but that could be unnecessarily expensive if the - * pid namespace has just a few processes. Or we need to - * maintain a tasklist for each pid namespace. - * */ rcu_read_lock(); read_lock(&tasklist_lock); nr = 2; idr_for_each_entry_continue(&pid_ns->idr, pid, nr) { - task = pid_task(pid, PIDTYPE_PID); - if (task && !__fatal_signal_pending(task)) + task = pid_task(pid, PIDTYPE_TGID); + /* reading signal->flags is racy without sighand->siglock */ + if (task && !(data_race(task->signal->flags) & SIGNAL_GROUP_EXIT)) group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX); } read_unlock(&tasklist_lock); -- 2.25.1.362.g51ebf55 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH] zap_pid_ns_processes: don't send SIGKILL to sub-threads 2024-06-08 15:48 ` [PATCH] zap_pid_ns_processes: don't send SIGKILL to sub-threads Oleg Nesterov @ 2024-06-13 13:01 ` Eric W. Biederman 2024-06-13 15:00 ` Oleg Nesterov 0 siblings, 1 reply; 23+ messages in thread From: Eric W. Biederman @ 2024-06-13 13:01 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Rachel Menge, linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Jens Axboe, Christian Brauner, Mike Christie, Joel Granados, Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang Oleg Nesterov <oleg@redhat.com> writes: > The comment above the idr_for_each_entry_continue() loop tries to explain > why we have to signal each thread in the namespace, but it is outdated. > This code no longer uses kill_proc_info(), we have a target task so we can > check thread_group_leader() and avoid the unnecessary group_send_sig_info. > Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID, > this way it returns NULL if this pid is not a group-leader pid. > > Also, change this code to check SIGNAL_GROUP_EXIT, the exiting process / > thread doesn't necessarily has a pending SIGKILL. Either way these checks > are racy without siglock, so the patch uses data_race() to shut up KCSAN. You remove the comment but the meat of what it was trying to say remains true. For processes in a session or processes in a process group, a list of all such processes is kept. No such list is kept for a pid namespace. So the best we can do is walk through the allocated pid numbers in the pid namespace. It would also help if this explains that in the case of SIGKILL complete_signal always sets SIGNAL_GROUP_EXIT which makes that a good check to use to see if the process has been killed (with SIGKILL). 
There are races with coredump here but *shrug* I don't think this changes behavior in that situation. Eric > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > --- > kernel/pid_namespace.c | 13 +++---------- > 1 file changed, 3 insertions(+), 10 deletions(-) > > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > index 25f3cf679b35..0f9bd67c9e75 100644 > --- a/kernel/pid_namespace.c > +++ b/kernel/pid_namespace.c > @@ -191,21 +191,14 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) > * The last thread in the cgroup-init thread group is terminating. > * Find remaining pid_ts in the namespace, signal and wait for them > * to exit. > - * > - * Note: This signals each threads in the namespace - even those that > - * belong to the same thread group, To avoid this, we would have > - * to walk the entire tasklist looking a processes in this > - * namespace, but that could be unnecessarily expensive if the > - * pid namespace has just a few processes. Or we need to > - * maintain a tasklist for each pid namespace. > - * > */ > rcu_read_lock(); > read_lock(&tasklist_lock); > nr = 2; > idr_for_each_entry_continue(&pid_ns->idr, pid, nr) { > - task = pid_task(pid, PIDTYPE_PID); > - if (task && !__fatal_signal_pending(task)) > + task = pid_task(pid, PIDTYPE_TGID); > + /* reading signal->flags is racy without sighand->siglock */ > + if (task && !(data_race(task->signal->flags) & SIGNAL_GROUP_EXIT)) > group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX); > } > read_unlock(&tasklist_lock); ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] zap_pid_ns_processes: don't send SIGKILL to sub-threads 2024-06-13 13:01 ` Eric W. Biederman @ 2024-06-13 15:00 ` Oleg Nesterov 2024-06-13 16:23 ` Eric W. Biederman 2024-07-05 16:08 ` Oleg Nesterov 0 siblings, 2 replies; 23+ messages in thread From: Oleg Nesterov @ 2024-06-13 15:00 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Rachel Menge, linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Jens Axboe, Christian Brauner, Mike Christie, Joel Granados, Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang On 06/13, Eric W. Biederman wrote: > > Oleg Nesterov <oleg@redhat.com> writes: > > > The comment above the idr_for_each_entry_continue() loop tries to explain > > why we have to signal each thread in the namespace, but it is outdated. > > This code no longer uses kill_proc_info(), we have a target task so we can > > check thread_group_leader() and avoid the unnecessary group_send_sig_info. > > Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID, > > this way it returns NULL if this pid is not a group-leader pid. > > > > Also, change this code to check SIGNAL_GROUP_EXIT, the exiting process / > > thread doesn't necessarily has a pending SIGKILL. Either way these checks > > are racy without siglock, so the patch uses data_race() to shut up KCSAN. > > You remove the comment but the meat of what it was trying to say remains > true. For processes in a session or processes is a process group a list > of all such processes is kept. No such list is kept for a pid > namespace. So the best we can do is walk through the allocated pid > numbers in the pid namespace. OK, I'll recheck tomorrow. Yet I think it doesn't make sense to send SIGKILL to sub-threads, and the comment looks misleading today. This was the main motivation, but again, I'll recheck. 
> It would also help if this explains that in the case of SIGKILL > complete_signal always sets SIGNAL_GROUP_EXIT which makes that a good > check to use to see if the process has been killed (with SIGKILL). Well, if SIGNAL_GROUP_EXIT is set we do not care if this process was killed or not. It (the whole thread group) is going to exit, that is all. We can even remove this check, it is just the optimization, just like the current fatal_signal_pending() check. Oleg. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] zap_pid_ns_processes: don't send SIGKILL to sub-threads 2024-06-13 15:00 ` Oleg Nesterov @ 2024-06-13 16:23 ` Eric W. Biederman 2024-07-05 16:08 ` Oleg Nesterov 1 sibling, 0 replies; 23+ messages in thread From: Eric W. Biederman @ 2024-06-13 16:23 UTC (permalink / raw) To: Oleg Nesterov Cc: Andrew Morton, Rachel Menge, linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Jens Axboe, Christian Brauner, Mike Christie, Joel Granados, Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang Oleg Nesterov <oleg@redhat.com> writes: > On 06/13, Eric W. Biederman wrote: >> >> Oleg Nesterov <oleg@redhat.com> writes: >> >> > The comment above the idr_for_each_entry_continue() loop tries to explain >> > why we have to signal each thread in the namespace, but it is outdated. >> > This code no longer uses kill_proc_info(), we have a target task so we can >> > check thread_group_leader() and avoid the unnecessary group_send_sig_info. >> > Better yet, we can change pid_task() to use PIDTYPE_TGID rather than _PID, >> > this way it returns NULL if this pid is not a group-leader pid. >> > >> > Also, change this code to check SIGNAL_GROUP_EXIT, the exiting process / >> > thread doesn't necessarily has a pending SIGKILL. Either way these checks >> > are racy without siglock, so the patch uses data_race() to shut up KCSAN. >> >> You remove the comment but the meat of what it was trying to say remains >> true. For processes in a session or processes is a process group a list >> of all such processes is kept. No such list is kept for a pid >> namespace. So the best we can do is walk through the allocated pid >> numbers in the pid namespace. > > OK, I'll recheck tomorrow. Yet I think it doesn't make sense to send > SIGKILL to sub-threads, and the comment looks misleading today. This was > the main motivation, but again, I'll recheck. 
Yes, we need to send SIGKILL to only one thread. Of course there are a few weird cases with zombie leader threads, but I think the pattern you are using handles them. >> It would also help if this explains that in the case of SIGKILL >> complete_signal always sets SIGNAL_GROUP_EXIT which makes that a good >> check to use to see if the process has been killed (with SIGKILL). > > Well, if SIGNAL_GROUP_EXIT is set we do not care if this process was > killed or not. It (the whole thread group) is going to exit, that is all. > > We can even remove this check, it is just the optimization, just like > the current fatal_signal_pending() check. I just meant that the optimization is effective because group_send_sig_info calls complete_signal which sets SIGNAL_GROUP_EXIT, which makes it an almost 100% accurate test and thus a very good optimization. Especially in the case of multi-threaded processes where the code will arrive there for every thread. Eric ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] zap_pid_ns_processes: don't send SIGKILL to sub-threads 2024-06-13 15:00 ` Oleg Nesterov 2024-06-13 16:23 ` Eric W. Biederman @ 2024-07-05 16:08 ` Oleg Nesterov 1 sibling, 0 replies; 23+ messages in thread From: Oleg Nesterov @ 2024-07-05 16:08 UTC (permalink / raw) To: Eric W. Biederman Cc: Andrew Morton, Rachel Menge, linux-kernel, rcu, Wei Fu, apais, Sudhanva Huruli, Jens Axboe, Christian Brauner, Mike Christie, Joel Granados, Mateusz Guzik, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang On 06/13, Oleg Nesterov wrote: > > Well, if SIGNAL_GROUP_EXIT is set we do not care if this process was > killed or not. It (the whole thread group) is going to exit, that is all. OOPS. I forgot again that you removed SIGNAL_GROUP_COREDUMP and thus we can't rely on SIGNAL_GROUP_EXIT in this case. I still think this was not right, but it is too late to complain. Andrew, please drop this patch. Currently zap_pid_ns_processes() kills the coredumping tasks, with this patch it doesn't. Oleg. ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2024-07-05 16:10 UTC | newest] Thread overview: 23+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2024-06-05 23:42 [RCU] zombie task hung in synchronize_rcu_expedited Rachel Menge 2024-06-06 11:10 ` Oleg Nesterov 2024-06-06 15:45 ` Wei Fu 2024-06-06 17:28 ` Oleg Nesterov 2024-06-07 3:02 ` Wei Fu 2024-06-07 6:25 ` Oleg Nesterov 2024-06-07 15:04 ` Wei Fu 2024-06-07 21:22 ` Oleg Nesterov 2024-06-08 12:42 ` Oleg Nesterov 2024-06-10 0:07 ` Wei Fu 2024-06-08 12:06 ` [PATCH] zap_pid_ns_processes: clear TIF_NOTIFY_SIGNAL along with TIF_SIGPENDING Oleg Nesterov 2024-06-08 17:00 ` Boqun Feng 2024-06-09 14:12 ` Wei Fu 2024-06-12 16:57 ` Jens Axboe 2024-06-13 12:40 ` Eric W. Biederman 2024-06-13 14:02 ` Wei Fu 2024-06-13 14:49 ` Oleg Nesterov 2024-06-13 15:30 ` Oleg Nesterov 2024-06-08 15:48 ` [PATCH] zap_pid_ns_processes: don't send SIGKILL to sub-threads Oleg Nesterov 2024-06-13 13:01 ` Eric W. Biederman 2024-06-13 15:00 ` Oleg Nesterov 2024-06-13 16:23 ` Eric W. Biederman 2024-07-05 16:08 ` Oleg Nesterov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox