public inbox for bpf@vger.kernel.org
From: Thomas Gleixner <tglx@kernel.org>
To: Ihor Solodrai <ihor.solodrai@linux.dev>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>
Cc: Gabriele Monaco <gmonaco@redhat.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Michael Jeanson <mjeanson@efficios.com>,
	Jens Axboe <axboe@kernel.dk>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	"Gautham R. Shenoy" <gautham.shenoy@amd.com>,
	Florian Weimer <fweimer@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Yury Norov <yury.norov@gmail.com>, bpf <bpf@vger.kernel.org>,
	sched-ext@lists.linux.dev, Kernel Team <kernel-team@meta.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Puranjay Mohan <puranjay@kernel.org>, Tejun Heo <tj@kernel.org>
Subject: Re: [patch V5 00/20] sched: Rewrite MM CID management
Date: Thu, 29 Jan 2026 18:06:15 +0100	[thread overview]
Message-ID: <87h5s4mjqw.ffs@tglx> (raw)
In-Reply-To: <6a77996f-5b08-4db6-8631-031ce3e52145@linux.dev>

On Wed, Jan 28 2026 at 15:08, Ihor Solodrai wrote:
> On 1/28/26 2:33 PM, Ihor Solodrai wrote:
>> [...]
>> 
>> We have a steady stream of jobs running, so if it's not a one-off it's
>> likely to happen again. I'll share if we get anything.
>
> Here is another one, with backtraces of other CPUs:
>
> [   59.133925] CPU: 2 UID: 0 PID: 127 Comm: test_progs Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
> [   59.133935] RIP: 0010:queued_spin_lock_slowpath+0x3a9/0xac0
> [   59.133985]  do_raw_spin_lock+0x1d9/0x270
> [   59.134001]  task_rq_lock+0xcf/0x3c0
> [   59.134007]  mm_cid_fixup_task_to_cpu+0xb0/0x460
> [   59.134025]  sched_mm_cid_fork+0x6da/0xc20

Compared to Shrikanth's splat this is the reverse situation, i.e. fork()
reached the point where it needs to switch to per CPU mode and the fixup
function is stuck on a runqueue lock.

> [   59.134176] CPU: 3 UID: 0 PID: 67 Comm: kworker/3:1 Tainted: G           OE       6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
> [   59.134186] Workqueue: events drain_vmap_area_work
> [   59.134194] RIP: 0010:smp_call_function_many_cond+0x772/0xe60
> [   59.134250]  on_each_cpu_cond_mask+0x24/0x40
> [   59.134254]  flush_tlb_kernel_range+0x402/0x6b0

CPU3 is unrelated as it does not hold a runqueue lock.

> [   59.134374] NMI backtrace for cpu 1
> [   59.134388] RIP: 0010:_find_first_zero_bit+0x50/0x90
> [   59.134423]  __schedule+0x3312/0x4390
> [   59.134430]  ? __pfx___schedule+0x10/0x10
> [   59.134434]  ? trace_rcu_watching+0x105/0x150
> [   59.134440]  schedule_idle+0x59/0x90

CPU1 holds runqueue lock and find_first_zero_bit() suggests that this
comes from mm_get_cid(), but w/o decoding the return address it's hard
to tell for sure.

> [   59.134474] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20

CPU0 is idle and not involved at all.

So the situation is:

test_progs creates the 4th child, which brings the number of threads
above the number of CPUs, so it switches to per CPU mode.

At this point each task of test_progs has a CID associated. Let's
assume CIDs were assigned in thread creation order for simplicity.

   T0 (main thread)       CID0  runs fork()
   T1 (1st child)         CID1
   T2 (2nd child)         CID2
   T3 (3rd child)         CID3
   T4 (4th child)         ---   is about to be forked and causes the
                                mode switch

T0 sets mm_cid::percpu = true
   transfers the CID from T0 to CPU2

   Starts the fixup which walks through the threads

During that T1 - T3 are free to schedule in and out before the fixup
catches up with them. Now I played through all possible permutations
with a Python script and came up with the following snafu:

   T1 schedules in on CPU3 and observes percpu == true, so it transfers
      its CID to CPU3

   T1 is migrated to CPU1 and the schedule-in path observes percpu ==
      true, but CPU1 does not have a CID associated and T1 transferred
      its own to CPU3

      So it has to allocate one with CPU1 runqueue lock held, but the
      pool is empty, so it keeps looping.

Now T0 reaches T1 in the thread walk and tries to lock the corresponding
runqueue lock, which is held. ---> Livelock
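
The scenario can be played through as a toy model (a deliberate
simplification, not kernel code; all names below are illustrative):
four CPUs, CIDs 0-3 owned by T0-T3, free pool empty.

```python
# Toy model of the livelock scenario above. 4 CPUs, 4 CIDs, all owned,
# free pool empty. Names and structure are illustrative only.
NR_CPUS = 4

def livelock_precondition():
    pool = set()                                  # free CIDs: none left
    task_cid = {f"T{i}": i for i in range(4)}     # T0..T3 own CID0..CID3
    cpu_cid = {cpu: None for cpu in range(NR_CPUS)}

    # Mode switch: T0 transfers its CID to CPU2 and starts the fixup walk
    cpu_cid[2] = task_cid.pop("T0")

    # T1 schedules in on CPU3, observes percpu == true, transfers its CID
    cpu_cid[3] = task_cid.pop("T1")

    # T1 migrates to CPU1: CPU1 has no CID and T1 owns none anymore, so
    # it has to allocate from the pool with CPU1's runqueue lock held
    must_allocate = cpu_cid[1] is None and "T1" not in task_cid
    return must_allocate and not pool             # True: mm_get_cid() loops
```

With both conditions true T1 loops in allocation while holding CPU1's
runqueue lock, which is exactly the lock the fixup walk tries to take
next.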

So this side needs the same MM_CID_TRANSIT treatment as the other side,
which brings me back to the splat Shrikanth observed.

I used the same script to run through all possible permutations on that
side too, but nothing showed up there, and yesterday's finding is
harmless because it only creates slightly inconsistent state as the
task is already marked CID inactive. The CID still has the
MM_CID_TRANSIT bit set, so it is dropped back into the pool when the
exiting task schedules out via preemption or the final schedule().

So I scratched my head some more and stared at the code with two things
in mind:

   1) It seems to be hard to reproduce
   2) It happened on a weakly ordered architecture

and indeed there is an opportunity to get this wrong:

The mode switch does:

    WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
    WRITE_ONCE(mm->mm_cid.percpu, ....);
    
sched_in() does:

    if (!READ_ONCE(mm->mm_cid.percpu))
       ...
       cid |= READ_ONCE(mm->mm_cid.transit);

so it can observe percpu == false and transit == 0 even if the fixup
function has not yet completed. As a consequence the task will not drop
the CID when scheduling out before the fixup is completed, which means
the CID space can be exhausted and the next task scheduling in will loop
in mm_get_cid() and the fixup thread can livelock on the held runqueue
lock as above.
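
The same kind of script shows it: model weak ordering by letting the
two stores become visible in either order, taking the switch towards
per task mode as the example since that is where percpu == false with
transit == 0 is harmful. MM_CID_TRANSIT's value below is a placeholder,
not the real constant.

```python
# Model of the store reordering window. Weak ordering is approximated by
# making the two WRITE_ONCE() stores visible in either order.
from itertools import permutations

MM_CID_TRANSIT = 1 << 31      # placeholder value for illustration

def lost_transit_observable():
    # Switch towards per task mode: transit is stored first, then percpu
    stores = (("transit", MM_CID_TRANSIT), ("percpu", False))
    for order in permutations(stores):
        seen = {"transit": 0, "percpu": True}     # pre-switch state
        for field, value in order:
            seen[field] = value
            # sched_in() observing the fields at this point:
            if not seen["percpu"] and not seen["transit"]:
                return True   # per task path taken, TRANSIT bit lost
    return False

def lost_transit_with_store_ordering():
    # With the store order enforced (e.g. a write barrier between the
    # stores paired with a read barrier on the reader side) the bad
    # observation cannot happen:
    seen = {"transit": 0, "percpu": True}
    for field, value in (("transit", MM_CID_TRANSIT), ("percpu", False)):
        seen[field] = value
        if not seen["percpu"] and not seen["transit"]:
            return True
    return False
```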

I'll send out a series to address all of that later this evening when
tests have completed and changelogs are polished.

Thanks,

        tglx

Thread overview: 9+ messages
     [not found] <20251119171016.815482037@linutronix.de>
2026-01-28  0:01 ` [patch V5 00/20] sched: Rewrite MM CID management Ihor Solodrai
2026-01-28  8:46   ` Peter Zijlstra
2026-01-28 11:57   ` Thomas Gleixner
2026-01-28 12:58     ` Shrikanth Hegde
2026-01-28 13:56       ` Thomas Gleixner
2026-01-28 22:24         ` Thomas Gleixner
2026-01-28 22:33           ` Ihor Solodrai
2026-01-28 23:08             ` Ihor Solodrai
2026-01-29 17:06               ` Thomas Gleixner [this message]
