Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)

* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Oleg Nesterov @ 2025-09-19 14:37 UTC
  To: Matt Fleming; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges

Hi Matt,

On 09/19, Matt Fleming wrote:
>
> Hi there,
>
> We're running into an intermittent issue where tasks end up in a state with
> p->on_rq=1 and p->se.on_rq=0 when delivering a fatal signal to a thread
> group. The thread handling the signal sits in wait_task_inactive() after
> sending a SIGKILL to all other threads, most of which pass through
> coredump_task_exit() just fine, but occasionally one thread calls into
> coredump_task_exit()->schedule() and never comes back because of the above
> state.

I guess you mean coredump_wait() -> wait_task_inactive() ... Sorry I have no
clue at least right now... And I don't see any problem in coredump_task_exit().

Stupid question. Any chance you can reproduce, figure out the pid of that
sub-thread which fools wait_task_inactive() and, say, do
"cat /proc/pid-of-that-thread/stack" ? Or any other info, everything can
help. Crash dump? Yes, you have already mentioned this is hard-to-reproduce :(

Oleg.

* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Matt Fleming @ 2025-09-19 15:16 UTC
  To: Oleg Nesterov; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges

On Fri, 19 Sept 2025 at 15:39, 'Oleg Nesterov' via kernel-team
<kernel-team@cloudflare.com> wrote:
>
> Hi Matt,
>
> On 09/19, Matt Fleming wrote:
> >
> > Hi there,
> >
> > We're running into an intermittent issue where tasks end up in a state with
> > p->on_rq=1 and p->se.on_rq=0 when delivering a fatal signal to a thread
> > group. The thread handling the signal sits in wait_task_inactive() after
> > sending a SIGKILL to all other threads, most of which pass through
> > coredump_task_exit() just fine, but occasionally one thread calls into
> > coredump_task_exit()->schedule() and never comes back because of the above
> > state.
>
> I guess you mean coredump_wait() -> wait_task_inactive() ... Sorry I have no
> clue at least right now... And I don't see any problem in coredump_task_exit().
>
> Stupid question. Any chance you can reproduce, figure out the pid of that
> sub-thread which fools wait_task_inactive() and, say, do
> "cat /proc/pid-of-that-thread/stack" ? Or any other info, everything can
> help. Crash dump? Yes, you have already mentioned this is hard-to-reproduce :(

I do have some info. The callstack for the lost thread is:

Call Trace:
 <TASK>
 __schedule+0x4fb/0xbf0
 ? srso_return_thunk+0x5/0x5f
 schedule+0x27/0xf0
 do_exit+0xdd/0xaa0
 ? __pfx_futex_wake_mark+0x10/0x10
 do_group_exit+0x30/0x80
 get_signal+0x81e/0x860
 ? srso_return_thunk+0x5/0x5f
 ? futex_wake+0x177/0x1a0
 arch_do_signal_or_restart+0x2e/0x1f0
 ? srso_return_thunk+0x5/0x5f
 ? srso_return_thunk+0x5/0x5f
 ? __x64_sys_futex+0x10c/0x1d0
 syscall_exit_to_user_mode+0xa5/0x130
 do_syscall_64+0x57/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

do_exit+0xdd is here in coredump_task_exit():

                for (;;) {
                        set_current_state(TASK_IDLE|TASK_FREEZABLE);
                        if (!self.task) /* see coredump_finish() */
                                break;
                        schedule();
                }

i.e. the task calls schedule() and never comes back. The waiting task
sees p->on_rq=1 for this lost thread and spins in wait_task_inactive()
forever.
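
For readers following along, here is a rough userspace sketch of the
handshake that loop relies on. It is a model under stated assumptions,
not kernel code: struct model_waiter, model_schedule() and
model_wait_for_dumper() are hypothetical names, and the only behaviour
borrowed from the kernel is that the waiter leaves the loop once its
->task pointer has been cleared (the "see coredump_finish()" comment
above). The max_sleeps bound exists only so the model terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-thread waiter record ("self" above). */
struct model_waiter {
	const void *task;	/* non-NULL until the dumper releases us */
	int wakeups_needed;	/* model-only: sleeps before the release; 0 = never */
};

/*
 * Models schedule() returning after a wakeup. On the last needed wakeup
 * the dumper is assumed to have cleared ->task, as the comment in the
 * quoted loop suggests coredump_finish() does.
 */
void model_schedule(struct model_waiter *self)
{
	if (self->wakeups_needed > 0 && --self->wakeups_needed == 0)
		self->task = NULL;
}

/*
 * Same shape as the quoted loop. Returns how many times the waiter
 * slept, or -1 if it was still unreleased after max_sleeps sleeps.
 */
int model_wait_for_dumper(struct model_waiter *self, int max_sleeps)
{
	int sleeps = 0;

	assert(max_sleeps >= 0);
	for (;;) {
		if (!self->task)	/* released by the dumper */
			break;
		if (sleeps == max_sleeps)
			return -1;	/* model-only escape hatch */
		model_schedule(self);
		sleeps++;
	}
	return sleeps;
}
```

A waiter with wakeups_needed = 2 leaves the loop after two sleeps, while
one with wakeups_needed = 0 is never released and hits the bound. The
second case is the closest analogue to this report: the dumping thread
is stuck in wait_task_inactive(), so it never gets as far as releasing
the waiter.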

I was able to use drgn to inspect the live system when the error
occurred, so if there is anything specific in the task_struct state you
want me to look at, I can try that. Or if you would like me to grab some
diagnostic info whenever the issue crops up again, just let me know.

* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Oleg Nesterov @ 2025-09-19 16:13 UTC
  To: Matt Fleming; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges

Let me repeat that currently I have no idea, so let me ask another stupid
question...

On 09/19, Matt Fleming wrote:
>
> I do have some info. The callstack for the lost thread is:
>
> Call Trace:
>  <TASK>
>  __schedule+0x4fb/0xbf0
>  ? srso_return_thunk+0x5/0x5f
>  schedule+0x27/0xf0
>  do_exit+0xdd/0xaa0
>  ? __pfx_futex_wake_mark+0x10/0x10
>  do_group_exit+0x30/0x80
>  get_signal+0x81e/0x860
>  ? srso_return_thunk+0x5/0x5f
>  ? futex_wake+0x177/0x1a0
>  arch_do_signal_or_restart+0x2e/0x1f0
>  ? srso_return_thunk+0x5/0x5f
>  ? srso_return_thunk+0x5/0x5f
>  ? __x64_sys_futex+0x10c/0x1d0
>  syscall_exit_to_user_mode+0xa5/0x130
>  do_syscall_64+0x57/0x110
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e

OK, thanks. Nothing "interesting" at first glance.

> do_exit+0xdd is here in coredump_task_exit():
>
>                 for (;;) {
>                         set_current_state(TASK_IDLE|TASK_FREEZABLE);
>                         if (!self.task) /* see coredump_finish() */
>                                 break;
>                         schedule();
>                 }
>
> i.e. the task calls schedule() and never comes back.

Are you sure it never comes back and doesn't loop?

> The waiting task
> sees p->on_rq=1 for this lost thread

Strange...

Oleg.


* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Matt Fleming @ 2025-09-20 22:10 UTC
  To: Oleg Nesterov; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges

On Fri, 19 Sept 2025 at 17:15, Oleg Nesterov <oleg@redhat.com> wrote:
>
> OK, thanks. Nothing "interesting" at first glance.

Chris (Cc'd) and I managed to get a reproducer and I think I know
what's happening now.

When a task A gets the SIGKILL from whichever thread is handling the
coredump (let's say task B) it might hit the delayed dequeue path in
schedule() and call set_delayed(), e.g.

        dequeue_entity+1263
        dequeue_entities+216
        dequeue_task_fair+224
        __schedule+468
        schedule+39
        do_exit+221
        do_group_exit+48
        get_signal+2078
        arch_do_signal_or_restart+46
        irqentry_exit_to_user_mode+132
        asm_sysvec_apic_timer_interrupt+26

At this point task A has ->on_rq=1, ->se.sched_delayed=1 and ->se.on_rq=1.

Now when task B calls into wait_task_inactive(), it sees
->se.sched_delayed=1 and calls dequeue_task().

At this point task A has ->on_rq=1, ->se.sched_delayed=0 and ->se.on_rq=0.

Unfortunately, task B still thinks that task A is queued because
task_on_rq_queued(A) is true. But task A is not runnable and will never
run: it's no longer in the fair rbtree, and the only task that would
enqueue it again is task B, once it leaves wait_task_inactive() and
hits coredump_finish().
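
The sequence above can be written down as a tiny userspace model. This
is only a sketch of the state transitions as described in this thread,
not the scheduler's actual code: struct model_task and all model_*()
functions are hypothetical, and the three ints merely mirror ->on_rq,
->se.on_rq and ->se.sched_delayed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three fields discussed above. */
struct model_task {
	int on_rq;		/* mirrors p->on_rq (1 == queued) */
	int se_on_rq;		/* mirrors p->se.on_rq */
	int se_sched_delayed;	/* mirrors p->se.sched_delayed */
};

/* Task A blocks in schedule(), but its dequeue is delayed. */
void model_block_with_delayed_dequeue(struct model_task *t)
{
	t->on_rq = 1;
	t->se_on_rq = 1;
	t->se_sched_delayed = 1;
}

/*
 * Task B's forced dequeue as observed in this thread: the sched entity
 * fields are cleared, but on_rq is left untouched.
 */
void model_forced_dequeue(struct model_task *t)
{
	assert(t->se_sched_delayed);
	t->se_sched_delayed = 0;
	t->se_on_rq = 0;
	/* t->on_rq stays 1: this is the inconsistency being reported */
}

/* Models the check that task B relies on. */
bool model_task_on_rq_queued(const struct model_task *t)
{
	return t->on_rq == 1;
}

/* "Lost": still looks queued, but no longer in the fair rbtree. */
bool model_is_lost(const struct model_task *t)
{
	return model_task_on_rq_queued(t) && !t->se_on_rq;
}
```

Calling model_block_with_delayed_dequeue() and then
model_forced_dequeue() on the same task leaves it in the lost state:
model_task_on_rq_queued() still returns true, so a waiter polling that
check keeps spinning, while nothing in the model will ever run the task
again.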

> > do_exit+0xdd is here in coredump_task_exit():
> >
> >                 for (;;) {
> >                         set_current_state(TASK_IDLE|TASK_FREEZABLE);
> >                         if (!self.task) /* see coredump_finish() */
> >                                 break;
> >                         schedule();
> >                 }
> >
> > i.e. the task calls schedule() and never comes back.
>
> Are you sure it never comes back and doesn't loop?

Yeah, positive:

$ sudo perf stat -e cycles -t 1546531 -- sleep 30

 Performance counter stats for thread id '1546531':

     <not counted>      cycles

      30.001671072 seconds time elapsed

* Re: Debugging lost task in wait_task_inactive() when delivering signal (6.12)
From: Oleg Nesterov @ 2025-09-21 19:27 UTC
  To: Matt Fleming; +Cc: Peter Zijlstra, John Stultz, kernel-team, LKML, Chris Arges

Thanks Matt!

So I guess that this has nothing to do with coredump and wait_task_inactive()
is broken...

I am wondering if this code

		/*
		 * If task is sched_delayed, force dequeue it, to avoid always
		 * hitting the tick timeout in the queued case
		 */
		if (p->se.sched_delayed)
			dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);

is actually correct, but I know nothing about the sched_delayed logic.

I will leave this to scheduler experts ;) I can't really help.

Oleg.
