public inbox for linux-kernel@vger.kernel.org
From: Thomas Gleixner <tglx@kernel.org>
To: Ihor Solodrai <ihor.solodrai@linux.dev>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>
Cc: Gabriele Monaco <gmonaco@redhat.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Michael Jeanson <mjeanson@efficios.com>,
	Jens Axboe <axboe@kernel.dk>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	"Gautham R. Shenoy" <gautham.shenoy@amd.com>,
	Florian Weimer <fweimer@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Yury Norov <yury.norov@gmail.com>, bpf <bpf@vger.kernel.org>,
	sched-ext@lists.linux.dev, Kernel Team <kernel-team@meta.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Puranjay Mohan <puranjay@kernel.org>, Tejun Heo <tj@kernel.org>
Subject: Re: [patch V5 00/20] sched: Rewrite MM CID management
Date: Thu, 29 Jan 2026 18:06:15 +0100
Message-ID: <87h5s4mjqw.ffs@tglx>
In-Reply-To: <6a77996f-5b08-4db6-8631-031ce3e52145@linux.dev>

On Wed, Jan 28 2026 at 15:08, Ihor Solodrai wrote:
> On 1/28/26 2:33 PM, Ihor Solodrai wrote:
>> [...]
>> 
>> We have a steady stream of jobs running, so if it's not a one-off it's
>> likely to happen again. I'll share if we get anything.
>
> Here is another one, with backtraces of other CPUs:
>
> [   59.133925] CPU: 2 UID: 0 PID: 127 Comm: test_progs Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
> [   59.133935] RIP: 0010:queued_spin_lock_slowpath+0x3a9/0xac0
> [   59.133985]  do_raw_spin_lock+0x1d9/0x270
> [   59.134001]  task_rq_lock+0xcf/0x3c0
> [   59.134007]  mm_cid_fixup_task_to_cpu+0xb0/0x460
> [   59.134025]  sched_mm_cid_fork+0x6da/0xc20

Compared to Shrikanth's splat this is the reverse situation, i.e. fork()
reached the point where it needs to switch to per CPU mode and the fixup
function is stuck on a runqueue lock.

> [   59.134176] CPU: 3 UID: 0 PID: 67 Comm: kworker/3:1 Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
> [   59.134186] Workqueue: events drain_vmap_area_work
> [   59.134194] RIP: 0010:smp_call_function_many_cond+0x772/0xe60
> [   59.134250]  on_each_cpu_cond_mask+0x24/0x40
> [   59.134254]  flush_tlb_kernel_range+0x402/0x6b0

CPU3 is unrelated as it does not hold runqueue lock.

> [   59.134374] NMI backtrace for cpu 1
> [   59.134388] RIP: 0010:_find_first_zero_bit+0x50/0x90
> [   59.134423]  __schedule+0x3312/0x4390
> [   59.134430]  ? __pfx___schedule+0x10/0x10
> [   59.134434]  ? trace_rcu_watching+0x105/0x150
> [   59.134440]  schedule_idle+0x59/0x90

CPU1 holds runqueue lock and find_first_zero_bit() suggests that this
comes from mm_get_cid(), but w/o decoding the return address it's hard
to tell for sure.

> [   59.134474] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20

CPU0 is idle and not involved at all.

So the situation is:

test_progs creates the 4th child, which exceeds the number of CPUs, so
it switches to per CPU mode.

At this point each task of test_progs has a CID associated. Let's
assume thread creation order assignment for simplicity.

   T0 (main thread)       CID0  runs fork()
   T1 (1st child)         CID1
   T2 (2nd child)         CID2
   T3 (3rd child)         CID3
   T4 (4th child)         ---   is about to be forked and causes the
                                mode switch

T0 sets mm_cid::percpu = true
   transfers the CID from T0 to CPU2

   Starts the fixup which walks through the threads

While that is in progress, T1 - T3 are free to schedule in and out
before the fixup has caught up with them. I played through all possible
permutations with a Python script and came up with the following snafu:

   T1 schedules in on CPU3 and observes percpu == true, so it transfers
      its CID to CPU3

   T1 is migrated to CPU1 and on schedule in observes percpu == true,
      but CPU1 does not have a CID associated and T1 already transferred
      its own to CPU3

      So it has to allocate one with CPU1 runqueue lock held, but the
      pool is empty, so it keeps looping.

Now T0 reaches T1 in the thread walk and tries to lock the corresponding
runqueue lock, which is held. ---> Livelock
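
A minimal sketch of the sequence above, written as a toy Python model.
It is not the permutation script mentioned earlier and not the kernel
code. The class, its methods and NR_CIDS are hypothetical names, and the
model assumes that the CID space is limited to the four CPUs visible in
the backtraces. Where the real scheduler would keep looping with the
runqueue lock held, the model simply returns False.

```python
NR_CIDS = 4  # assumption: one CID per CPU (CPU0-CPU3 in the backtraces)


class CidModel:
    """Toy model of CID ownership; all names here are hypothetical."""

    def __init__(self):
        # Per-task ownership before the mode switch (creation order).
        self.task_cid = {"T0": 0, "T1": 1, "T2": 2, "T3": 3}
        self.cpu_cid = {}  # CPU number -> CID, filled in per CPU mode

    def free_cids(self):
        used = {c for c in self.task_cid.values() if c is not None}
        used |= set(self.cpu_cid.values())
        return sorted(set(range(NR_CIDS)) - used)

    def sched_in_percpu(self, task, cpu):
        """Schedule-in with percpu == true; False means 'would keep looping'."""
        if cpu in self.cpu_cid:
            return True                      # CPU already owns a CID
        cid = self.task_cid[task]
        if cid is not None:                  # hand the task's CID to the CPU
            self.cpu_cid[cpu], self.task_cid[task] = cid, None
            return True
        free = self.free_cids()
        if not free:
            return False                     # pool empty
        self.cpu_cid[cpu] = free[0]
        return True


def replay():
    m = CidModel()
    steps = [
        m.sched_in_percpu("T0", 2),  # T0 transfers CID0 to CPU2
        m.sched_in_percpu("T1", 3),  # T1 transfers CID1 to CPU3
        m.sched_in_percpu("T1", 1),  # T1 migrated to CPU1: nothing to hand over
    ]
    return steps, m.free_cids()
```

In this model replay() returns ([True, True, False], []): CID0 and CID1
are owned by CPU2 and CPU3, CID2 and CID3 are still owned by T2 and T3,
so the allocation on CPU1 finds an empty pool.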

So this side needs the same MM_CID_TRANSIT treatment as the other side,
which brings me back to the splat Shrikanth observed.

I used the same script to run through all possible permutations on that
side too, but nothing showed up there. Yesterday's finding is harmless
because it only creates slightly inconsistent state, as the task is
already marked CID inactive. But the CID has the MM_CID_TRANSIT bit set,
so the CID is dropped back into the pool when the exiting task schedules
out via preemption or the final schedule().

So I scratched my head some more and stared at the code with two things
in mind:

   1) It seems to be hard to reproduce
   2) It happened on a weakly ordered architecture

and indeed there is an opportunity to get this wrong:

The mode switch does:

    WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
    WRITE_ONCE(mm->mm_cid.percpu, ....);
    
sched_in() does:

    if (!READ_ONCE(mm->mm_cid.percpu))
       ...
       cid |= READ_ONCE(mm->mm_cid.transit);

so it can observe percpu == false and transit == 0 even if the fixup
function has not yet completed. As a consequence the task will not drop
the CID when scheduling out before the fixup is completed, which means
the CID space can be exhausted and the next task scheduling in will loop
in mm_get_cid() and the fixup thread can livelock on the held runqueue
lock as above.
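
The reordering window can be illustrated with a small enumeration. This
is a deliberately simplified sketch and not the Linux kernel memory
model: weak ordering is approximated by allowing the two stores, or the
two loads, to take effect in either order. TRANSIT is a placeholder for
MM_CID_TRANSIT, and the model assumes the switch starts from
percpu == true and writes percpu = false. The function name and its
parameters are hypothetical.

```python
from itertools import permutations

TRANSIT = 1  # placeholder value standing in for MM_CID_TRANSIT


def observable(writer_ordered, reader_ordered):
    """Return every (percpu, transit) pair the reader can observe."""
    ops = ("W_transit", "W_percpu", "R_percpu", "R_transit")
    seen = set()
    for order in permutations(ops):
        pos = {op: i for i, op in enumerate(order)}
        # Program order is only enforced when the side is marked ordered.
        if writer_ordered and pos["W_transit"] > pos["W_percpu"]:
            continue
        if reader_ordered and pos["R_percpu"] > pos["R_transit"]:
            continue
        mem = {"percpu": True, "transit": 0}  # state before the switch
        got = {}
        for op in order:
            if op == "W_transit":
                mem["transit"] = TRANSIT
            elif op == "W_percpu":
                mem["percpu"] = False
            elif op == "R_percpu":
                got["percpu"] = mem["percpu"]
            else:
                got["transit"] = mem["transit"]
        seen.add((got["percpu"], got["transit"]))
    return seen


BAD = (False, 0)  # percpu == false observed together with transit == 0
```

In this model BAD is in observable(False, False), and it is still there
when only one side is ordered. It disappears only with
observable(True, True), i.e. when both the stores and the loads are kept
in program order.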

I'll send out a series to address all of that later this evening when
tests have completed and changelogs are polished.

Thanks,

        tglx
