* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Oleg Nesterov @ 2025-09-19 14:37 UTC
To: Matt Fleming; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges
Hi Matt,
On 09/19, Matt Fleming wrote:
>
> Hi there,
>
> We're running into an intermittent issue where tasks end up in a state with
> p->on_rq=1 and p->se.on_rq=0 when delivering a fatal signal to a thread
> group. The thread handling the signal sits in wait_task_inactive() after
> sending a SIGKILL to all other threads, most of which pass through
> coredump_task_exit() just fine, but occasionally one thread calls into
> coredump_task_exit()->schedule() and never comes back because of the above
> state.
I guess you mean coredump_wait() -> wait_task_inactive() ... Sorry I have no
clue at least right now... And I don't see any problem in coredump_task_exit().
Stupid question. Any chance you can reproduce, figure out the pid of that
sub-thread which fools wait_task_inactive() and, say, do
"cat /proc/pid-of-that-thread/stack" ? Or any other info, everything can
help. Crash dump? Yes, you have already mentioned this is hard-to-reproduce :(
Oleg.
* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Matt Fleming @ 2025-09-19 15:16 UTC
To: Oleg Nesterov; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges
On Fri, 19 Sept 2025 at 15:39, 'Oleg Nesterov' via kernel-team
<kernel-team@cloudflare.com> wrote:
>
> Hi Matt,
>
> On 09/19, Matt Fleming wrote:
> >
> > Hi there,
> >
> > We're running into an intermittent issue where tasks end up in a state with
> > p->on_rq=1 and p->se.on_rq=0 when delivering a fatal signal to a thread
> > group. The thread handling the signal sits in wait_task_inactive() after
> > sending a SIGKILL to all other threads, most of which pass through
> > coredump_task_exit() just fine, but occasionally one thread calls into
> > coredump_task_exit()->schedule() and never comes back because of the above
> > state.
>
> I guess you mean coredump_wait() -> wait_task_inactive() ... Sorry I have no
> clue at least right now... And I don't see any problem in coredump_task_exit().
>
> Stupid question. Any chance you can reproduce, figure out the pid of that
> sub-thread which fools wait_task_inactive() and, say, do
> "cat /proc/pid-of-that-thread/stack" ? Or any other info, everything can
> help. Crash dump? Yes, you have already mentioned this is hard-to-reproduce :(
I do have some info. The callstack for the lost thread is:
Call Trace:
<TASK>
__schedule+0x4fb/0xbf0
? srso_return_thunk+0x5/0x5f
schedule+0x27/0xf0
do_exit+0xdd/0xaa0
? __pfx_futex_wake_mark+0x10/0x10
do_group_exit+0x30/0x80
get_signal+0x81e/0x860
? srso_return_thunk+0x5/0x5f
? futex_wake+0x177/0x1a0
arch_do_signal_or_restart+0x2e/0x1f0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? __x64_sys_futex+0x10c/0x1d0
syscall_exit_to_user_mode+0xa5/0x130
do_syscall_64+0x57/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
do_exit+0xdd is here in coredump_task_exit():
for (;;) {
set_current_state(TASK_IDLE|TASK_FREEZABLE);
if (!self.task) /* see coredump_finish() */
break;
schedule();
}
i.e. the task calls schedule() and never comes back. The waiting task
sees p->on_rq=1 for this lost thread and spins in wait_task_inactive()
forever.
I have been able to use drgn to inspect the live system when the error
occurred so if you have specific things you want me to look at the
task_struct state I can try that. Or if you would like me to grab some
diagnostic info whenever the issue crops up again, just let me know.
* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Oleg Nesterov @ 2025-09-19 16:13 UTC
To: Matt Fleming; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges
Let me repeat that currently I have no idea, so let me ask another stupid
question...
On 09/19, Matt Fleming wrote:
>
> I do have some info. The callstack for the lost thread is:
>
> Call Trace:
> <TASK>
> __schedule+0x4fb/0xbf0
> ? srso_return_thunk+0x5/0x5f
> schedule+0x27/0xf0
> do_exit+0xdd/0xaa0
> ? __pfx_futex_wake_mark+0x10/0x10
> do_group_exit+0x30/0x80
> get_signal+0x81e/0x860
> ? srso_return_thunk+0x5/0x5f
> ? futex_wake+0x177/0x1a0
> arch_do_signal_or_restart+0x2e/0x1f0
> ? srso_return_thunk+0x5/0x5f
> ? srso_return_thunk+0x5/0x5f
> ? __x64_sys_futex+0x10c/0x1d0
> syscall_exit_to_user_mode+0xa5/0x130
> do_syscall_64+0x57/0x110
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
OK, thanks. Nothing "interesting" at first glance.
> do_exit+0xdd is here in coredump_task_exit():
>
> for (;;) {
> set_current_state(TASK_IDLE|TASK_FREEZABLE);
> if (!self.task) /* see coredump_finish() */
> break;
> schedule();
> }
>
> i.e. the task calls schedule() and never comes back.
Are you sure it never comes back and doesn't loop?
> The waiting task
> sees p->on_rq=1 for this lost thread
Strange...
Oleg.
* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Matt Fleming @ 2025-09-20 22:10 UTC
To: Oleg Nesterov; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges
On Fri, 19 Sept 2025 at 17:15, Oleg Nesterov <oleg@redhat.com> wrote:
>
> OK, thanks. Nothing "interesting" at first glance.
Chris (Cc'd) and I managed to get a reproducer and I think I know
what's happening now.
When a task A gets the SIGKILL from whichever thread is handling the
coredump (let's say task B) it might hit the delayed dequeue path in
schedule() and call set_delayed(), e.g.
dequeue_entity+1263
dequeue_entities+216
dequeue_task_fair+224
__schedule+468
schedule+39
do_exit+221
do_group_exit+48
get_signal+2078
arch_do_signal_or_restart+46
irqentry_exit_to_user_mode+132
asm_sysvec_apic_timer_interrupt+26
At this point task A has ->on_rq=1, ->se.sched_delayed=1 and ->se.on_rq=1.
Now when task B calls into wait_task_inactive(), it sees
->se.sched_delayed=1 and calls dequeue_task().
At this point task A has ->on_rq=1, ->se.sched_delayed=0 and ->se.on_rq=0.
Unfortunately, task B still thinks that task A is scheduled because
task_on_rq_queued(A) is true, but it's not runnable and will never run
because it's no longer in the fair rbtree and the only task that will
enqueue it again is task B once it leaves wait_task_inactive() and
hits coredump_finish().
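The sequence above can be replayed in a small userspace toy model. This is only a sketch of the state transitions as described in this message, not kernel code: struct toy_task and every toy_* function are hypothetical names invented for illustration, and the forced dequeue is modelled purely by its reported effect on the three fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the three task_struct fields named in the thread. */
struct toy_task {
    int on_rq;            /* models p->on_rq; 1 means "queued" */
    int se_on_rq;         /* models p->se.on_rq */
    int se_sched_delayed; /* models p->se.sched_delayed */
};

/* Task A blocks in schedule(), but its dequeue is delayed. */
static void toy_block_with_delayed_dequeue(struct toy_task *p)
{
    assert(p->on_rq == 1 && p->se_on_rq == 1);
    p->se_sched_delayed = 1;
}

/*
 * The forced dequeue, modelled by the effect reported in the thread:
 * the entity-level flags are cleared while p->on_rq stays at 1.
 */
static void toy_forced_dequeue(struct toy_task *p)
{
    p->se_sched_delayed = 0;
    p->se_on_rq = 0;
}

/* Toy version of the "is this task queued?" check. */
static bool toy_task_on_rq_queued(const struct toy_task *p)
{
    return p->on_rq == 1;
}

/* "Lost": still reported as queued, but no longer on the fair runqueue. */
static bool toy_task_is_lost(const struct toy_task *p)
{
    return toy_task_on_rq_queued(p) && p->se_on_rq == 0;
}

/*
 * Toy waiter: poll until the task stops looking queued, giving up after
 * max_polls. Returns true if the wait would have completed.
 */
static bool toy_wait_task_inactive(struct toy_task *p, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (p->se_sched_delayed)
            toy_forced_dequeue(p);
        if (!toy_task_on_rq_queued(p))
            return true;
    }
    return false;
}
```

Starting from { .on_rq = 1, .se_on_rq = 1 }, calling toy_block_with_delayed_dequeue() and then toy_wait_task_inactive() leaves the toy task with on_rq=1, se_sched_delayed=0 and se_on_rq=0, and the waiter never returns true no matter how large max_polls is, which is the hang described above.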
> > do_exit+0xdd is here in coredump_task_exit():
> >
> > for (;;) {
> > set_current_state(TASK_IDLE|TASK_FREEZABLE);
> > if (!self.task) /* see coredump_finish() */
> > break;
> > schedule();
> > }
> >
> > i.e. the task calls schedule() and never comes back.
>
> Are you sure it never comes back and doesn't loop?
Yeah, positive:
$ sudo perf stat -e cycles -t 1546531 -- sleep 30
Performance counter stats for thread id '1546531':
<not counted> cycles
30.001671072 seconds time elapsed
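A rough cross-check that does not need perf is to sample the thread's accumulated CPU time from /proc twice and compare. The sketch below is an assumption-laden illustration, not something used in this thread: parse_cpu_ticks() and read_cpu_ticks() are hypothetical helper names, and they read utime and stime (fields 14 and 15 in proc(5)) from /proc/<pid>/task/<tid>/stat.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse utime and stime (fields 14 and 15, in clock ticks) from one
 * stat line. The comm field may contain spaces and parentheses, so
 * parsing starts after the last ')'. Returns 0 on success, -1 on error.
 */
static int parse_cpu_ticks(const char *line, unsigned long long *utime,
                           unsigned long long *stime)
{
    const char *p;
    int field = 2; /* comm, which ends at the last ')', is field 2 */

    assert(line && utime && stime);
    p = strrchr(line, ')');
    if (!p)
        return -1;
    p++;

    /* Skip fields 3..13 so that p ends up just before field 14. */
    while (field < 13) {
        while (*p == ' ')
            p++;
        if (!*p)
            return -1;
        while (*p && *p != ' ')
            p++;
        field++;
    }
    return sscanf(p, "%llu %llu", utime, stime) == 2 ? 0 : -1;
}

/* Read utime + stime for one thread. Returns 0 on success, -1 on error. */
static int read_cpu_ticks(int pid, int tid, unsigned long long *total)
{
    char path[64], line[1024];
    unsigned long long ut, st;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/task/%d/stat", pid, tid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    if (parse_cpu_ticks(line, &ut, &st))
        return -1;
    *total = ut + st;
    return 0;
}
```

Calling read_cpu_ticks() for the stuck thread before and after a sleep and seeing the same total is consistent with the thread never running, but the tick granularity is coarse, so the cycle count above is the stronger evidence.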
* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Oleg Nesterov @ 2025-09-21 19:27 UTC
To: Matt Fleming; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges
Thanks Matt!
So I guess that this has nothing to do with coredump and wait_task_inactive()
is broken...
I am wondering if this code
/*
* If task is sched_delayed, force dequeue it, to avoid always
* hitting the tick timeout in the queued case
*/
if (p->se.sched_delayed)
dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
is actually correct, but I know nothing about the sched_delayed logic.
I will leave this to scheduler experts ;) I can't really help.
Oleg.
On 09/20, Matt Fleming wrote:
>
> On Fri, 19 Sept 2025 at 17:15, Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > OK, thanks. Nothing "interesting" at first glance.
>
> Chris (Cc'd) and I managed to get a reproducer and I think I know
> what's happening now.
>
> When a task A gets the SIGKILL from whichever thread is handling the
> coredump (let's say task B) it might hit the delayed dequeue path in
> schedule() and call set_delayed(), e.g.
>
> dequeue_entity+1263
> dequeue_entities+216
> dequeue_task_fair+224
> __schedule+468
> schedule+39
> do_exit+221
> do_group_exit+48
> get_signal+2078
> arch_do_signal_or_restart+46
> irqentry_exit_to_user_mode+132
> asm_sysvec_apic_timer_interrupt+26
>
> At this point task A has ->on_rq=1, ->se.sched_delayed=1 and ->se.on_rq=1.
>
> Now when task B calls into wait_task_inactive(), it sees
> ->se.sched_delayed=1 and calls dequeue_task().
>
> At this point task A has ->on_rq=1, ->se.sched_delayed=0 and ->se.on_rq=0
>
> Unfortunately, task B still thinks that task A is scheduled because
> task_on_rq_queued(A) is true, but it's not runnable and will never run
> because it's no longer in the fair rbtree and the only task that will
> enqueue it again is task B once it leaves wait_task_inactive() and
> hits coredump_finish().
>
> > > do_exit+0xdd is here in coredump_task_exit():
> > >
> > > for (;;) {
> > > set_current_state(TASK_IDLE|TASK_FREEZABLE);
> > > if (!self.task) /* see coredump_finish() */
> > > break;
> > > schedule();
> > > }
> > >
> > > i.e. the task calls schedule() and never comes back.
> >
> > Are you sure it never comes back and doesn't loop?
>
> Yeah, positive:
>
> $ sudo perf stat -e cycles -t 1546531 -- sleep 30
>
> Performance counter stats for thread id '1546531':
>
> <not counted> cycles
>
> 30.001671072 seconds time elapsed
>