* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
  [not found] <CAGis_TWyhciem6bPzR98ysj1+gOVPHRGqSUNiiyvS1RnEidExw@mail.gmail.com>
From: Oleg Nesterov @ 2025-09-19 14:37 UTC
To: Matt Fleming; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges

Hi Matt,

On 09/19, Matt Fleming wrote:
>
> Hi there,
>
> We're running into an intermittent issue where tasks end up in a state with
> p->on_rq=1 and p->se.on_rq=0 when delivering a fatal signal to a thread
> group. The thread handling the signal sits in wait_task_inactive() after
> sending a SIGKILL to all other threads, most of which pass through
> coredump_task_exit() just fine, but occasionally one thread calls into
> coredump_task_exit()->schedule() and never comes back because of the above
> state.

I guess you mean coredump_wait() -> wait_task_inactive() ... Sorry I have no
clue at least right now... And I don't see any problem in coredump_task_exit().

Stupid question. Any chance you can reproduce, figure out the pid of that
sub-thread which fools wait_task_inactive() and, say, do
"cat /proc/pid-of-that-thread/stack" ? Or any other info, everything can
help. Crash dump? Yes, you have already mentioned this is hard-to-reproduce :(

Oleg.

* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Matt Fleming @ 2025-09-19 15:16 UTC
To: Oleg Nesterov; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges

On Fri, 19 Sept 2025 at 15:39, 'Oleg Nesterov' via kernel-team
<kernel-team@cloudflare.com> wrote:
>
> Hi Matt,
>
> On 09/19, Matt Fleming wrote:
> >
> > Hi there,
> >
> > We're running into an intermittent issue where tasks end up in a state with
> > p->on_rq=1 and p->se.on_rq=0 when delivering a fatal signal to a thread
> > group. The thread handling the signal sits in wait_task_inactive() after
> > sending a SIGKILL to all other threads, most of which pass through
> > coredump_task_exit() just fine, but occasionally one thread calls into
> > coredump_task_exit()->schedule() and never comes back because of the above
> > state.
>
> I guess you mean coredump_wait() -> wait_task_inactive() ... Sorry I have no
> clue at least right now... And I don't see any problem in coredump_task_exit().
>
> Stupid question. Any chance you can reproduce, figure out the pid of that
> sub-thread which fools wait_task_inactive() and, say, do
> "cat /proc/pid-of-that-thread/stack" ? Or any other info, everything can
> help. Crash dump? Yes, you have already mentioned this is hard-to-reproduce :(

I do have some info. The callstack for the lost thread is:

Call Trace:
 <TASK>
 __schedule+0x4fb/0xbf0
 ? srso_return_thunk+0x5/0x5f
 schedule+0x27/0xf0
 do_exit+0xdd/0xaa0
 ? __pfx_futex_wake_mark+0x10/0x10
 do_group_exit+0x30/0x80
 get_signal+0x81e/0x860
 ? srso_return_thunk+0x5/0x5f
 ? futex_wake+0x177/0x1a0
 arch_do_signal_or_restart+0x2e/0x1f0
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? __x64_sys_futex+0x10c/0x1d0
 syscall_exit_to_user_mode+0xa5/0x130
 do_syscall_64+0x57/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

do_exit+0xdd is here in coredump_task_wait():

	for (;;) {
		set_current_state(TASK_IDLE|TASK_FREEZABLE);
		if (!self.task) /* see coredump_finish() */
			break;
		schedule();
	}

i.e. the task calls schedule() and never comes back. The waiting task
sees p->on_rq=1 for this lost thread and spins in wait_task_inactive()
forever.

I have been able to use drgn to inspect the live system when the error
occurred so if you have specific things you want me to look at the
task_struct state I can try that. Or if you would like me to grab some
diagnostic info whenever the issue crops up again, just let me know.

* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Oleg Nesterov @ 2025-09-19 16:13 UTC
To: Matt Fleming; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges

Let me repeat that currently I have no idea, so let me ask another stupid
question...

On 09/19, Matt Fleming wrote:
>
> I do have some info. The callstack for the lost thread is:
>
> Call Trace:
>  <TASK>
>  __schedule+0x4fb/0xbf0
>  ? srso_return_thunk+0x5/0x5f
>  schedule+0x27/0xf0
>  do_exit+0xdd/0xaa0
>  ? __pfx_futex_wake_mark+0x10/0x10
>  do_group_exit+0x30/0x80
>  get_signal+0x81e/0x860
>  ? srso_return_thunk+0x5/0x5f
>  ? futex_wake+0x177/0x1a0
>  arch_do_signal_or_restart+0x2e/0x1f0
>  ? srso_return_thunk+0x5/0x5f
>  ? srso_return_thunk+0x5/0x5f
>  ? __x64_sys_futex+0x10c/0x1d0
>  syscall_exit_to_user_mode+0xa5/0x130
>  do_syscall_64+0x57/0x110
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e

OK, thanks. Nothing "interesting" at first glance.

> do_exit+0xdd is here in coredump_task_wait():
>
> 	for (;;) {
> 		set_current_state(TASK_IDLE|TASK_FREEZABLE);
> 		if (!self.task) /* see coredump_finish() */
> 			break;
> 		schedule();
> 	}
>
> i.e. the task calls schedule() and never comes back.

Are you sure it never comes back and doesn't loop?

> The waiting task
> sees p->on_rq=1 for this lost thread

Strange...

Oleg.

* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Matt Fleming @ 2025-09-20 22:10 UTC
To: Oleg Nesterov; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges

On Fri, 19 Sept 2025 at 17:15, Oleg Nesterov <oleg@redhat.com> wrote:
>
> OK, thanks. Nothing "interesting" at first glance.

Chris (Cc'd) and I managed to get a reproducer and I think I know
what's happening now.

When a task A gets the SIGKILL from whichever thread is handling the
coredump (let's say task B) it might hit the delayed dequeue path in
schedule() and call set_delayed(), e.g.

	dequeue_entity+1263
	dequeue_entities+216
	dequeue_task_fair+224
	__schedule+468
	schedule+39
	do_exit+221
	do_group_exit+48
	get_signal+2078
	arch_do_signal_or_restart+46
	irqentry_exit_to_user_mode+132
	asm_sysvec_apic_timer_interrupt+26

At this point task A has ->on_rq=1, ->se.sched_delayed=1 and ->se.on_rq=1.

Now when task B calls into wait_task_inactive(), it sees
->se.sched_delayed=1 and calls dequeue_task().

At this point task A has ->on_rq=1, ->se.sched_delayed=0 and ->se.on_rq=0

Unfortunately, task B still thinks that task A is scheduled because
task_on_rq_queued(A) is true, but it's not runnable and will never run
because it's no longer in the fair rbtree and the only task that will
enqueue it again is task B once it leaves wait_task_inactive() and
hits coredump_finish().

> > do_exit+0xdd is here in coredump_task_wait():
> >
> > 	for (;;) {
> > 		set_current_state(TASK_IDLE|TASK_FREEZABLE);
> > 		if (!self.task) /* see coredump_finish() */
> > 			break;
> > 		schedule();
> > 	}
> >
> > i.e. the task calls schedule() and never comes back.
>
> Are you sure it never comes back and doesn't loop?

Yeah, positive:

$ sudo perf stat -e cycles -t 1546531 -- sleep 30

 Performance counter stats for thread id '1546531':

     <not counted>      cycles

      30.001671072 seconds time elapsed

* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Oleg Nesterov @ 2025-09-21 19:27 UTC
To: Matt Fleming; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges

Thanks Matt!

So I guess that this has nothing to do with coredump and wait_task_inactive()
is broken...

I am wondering if this code

	/*
	 * If task is sched_delayed, force dequeue it, to avoid always
	 * hitting the tick timeout in the queued case
	 */
	if (p->se.sched_delayed)
		dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);

is actually correct but I know nothing about the sched_delayed logic.

I will leave this to scheduler experts ;) I can't really help.

Oleg.

On 09/20, Matt Fleming wrote:
>
> On Fri, 19 Sept 2025 at 17:15, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > OK, thanks. Nothing "interesting" at first glance.
>
> Chris (Cc'd) and I managed to get a reproducer and I think I know
> what's happening now.
>
> When a task A gets the SIGKILL from whichever thread is handling the
> coredump (let's say task B) it might hit the delayed dequeue path in
> schedule() and call set_delayed(), e.g.
>
> 	dequeue_entity+1263
> 	dequeue_entities+216
> 	dequeue_task_fair+224
> 	__schedule+468
> 	schedule+39
> 	do_exit+221
> 	do_group_exit+48
> 	get_signal+2078
> 	arch_do_signal_or_restart+46
> 	irqentry_exit_to_user_mode+132
> 	asm_sysvec_apic_timer_interrupt+26
>
> At this point task A has ->on_rq=1, ->se.sched_delayed=1 and ->se.on_rq=1.
>
> Now when task B calls into wait_task_inactive(), it sees
> ->se.sched_delayed=1 and calls dequeue_task().
>
> At this point task A has ->on_rq=1, ->se.sched_delayed=0 and ->se.on_rq=0
>
> Unfortunately, task B still thinks that task A is scheduled because
> task_on_rq_queued(A) is true, but it's not runnable and will never run
> because it's no longer in the fair rbtree and the only task that will
> enqueue it again is task B once it leaves wait_task_inactive() and
> hits coredump_finish().
>
> > > do_exit+0xdd is here in coredump_task_wait():
> > >
> > > 	for (;;) {
> > > 		set_current_state(TASK_IDLE|TASK_FREEZABLE);
> > > 		if (!self.task) /* see coredump_finish() */
> > > 			break;
> > > 		schedule();
> > > 	}
> > >
> > > i.e. the task calls schedule() and never comes back.
> >
> > Are you sure it never comes back and doesn't loop?
>
> Yeah, positive:
>
> $ sudo perf stat -e cycles -t 1546531 -- sleep 30
>
>  Performance counter stats for thread id '1546531':
>
>      <not counted>      cycles
>
>       30.001671072 seconds time elapsed
>