From mboxrd@z Thu Jan 1 00:00:00 1970
From: Thomas Gleixner
To: Ihor Solodrai, Shrikanth Hegde, Peter Zijlstra, LKML
Cc: Gabriele Monaco, Mathieu Desnoyers, Michael Jeanson, Jens Axboe,
 "Paul E. McKenney", "Gautham R. Shenoy", Florian Weimer, Tim Chen,
 Yury Norov, bpf, sched-ext@lists.linux.dev, Kernel Team,
 Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Puranjay Mohan,
 Tejun Heo
Subject: Re: [patch V5 00/20] sched: Rewrite MM CID management
In-Reply-To: <6a77996f-5b08-4db6-8631-031ce3e52145@linux.dev>
References: <20251119171016.815482037@linutronix.de>
 <2b7463d7-0f58-4e34-9775-6e2115cfb971@linux.dev>
 <877bt29cgv.ffs@tglx> <87y0lh96xo.ffs@tglx> <87jyx1ml3h.ffs@tglx>
 <70335ad4-59b6-45fd-8a76-bd91d9658810@linux.dev>
 <6a77996f-5b08-4db6-8631-031ce3e52145@linux.dev>
Date: Thu, 29 Jan 2026 18:06:15 +0100
Message-ID: <87h5s4mjqw.ffs@tglx>
X-Mailing-List: bpf@vger.kernel.org

On Wed, Jan 28 2026 at 15:08, Ihor Solodrai wrote:
> On 1/28/26 2:33 PM, Ihor Solodrai wrote:
>> [...]
>>
>> We have a steady stream of jobs running, so if it's not a one-off it's
>> likely to happen again. I'll share if we get anything.
>
> Here is another one, with backtraces of other CPUs:
>
> [   59.133925] CPU: 2 UID: 0 PID: 127 Comm: test_progs Tainted: G OE 6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
> [   59.133935] RIP: 0010:queued_spin_lock_slowpath+0x3a9/0xac0
> [   59.133985]  do_raw_spin_lock+0x1d9/0x270
> [   59.134001]  task_rq_lock+0xcf/0x3c0
> [   59.134007]  mm_cid_fixup_task_to_cpu+0xb0/0x460
> [   59.134025]  sched_mm_cid_fork+0x6da/0xc20

Compared to Shrikanth's splat this is the reverse situation, i.e. fork()
reached the point where it needs to switch to per CPU mode and the fixup
function is stuck on a runqueue lock.
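Being stuck on a runqueue lock only becomes fatal when the lock holder
in turn waits on something that only the stuck side can provide. That
wait-for cycle can be sketched as follows (a hypothetical model with
invented names, not kernel code):

```python
def livelocks(holders, waiters):
    """Detect a cycle in a wait-for relation.

    holders maps a resource to the task holding it; waiters maps a task
    to the resource it is spinning on.  A cycle means nobody can make
    progress: a livelock.
    """
    for start in waiters:
        task, seen = start, set()
        while task in waiters:
            if task in seen:
                return True          # walked back to a task already seen
            seen.add(task)
            task = holders.get(waiters[task])
            if task is None:         # resource is free: no cycle here
                break
    return False

# Simplified model of the splat: the fixup side spins on CPU1's rq lock;
# the task on CPU1 holds that lock while spinning for a free CID, and a
# CID only becomes free if the fixup makes progress.
holders = {"rq1_lock": "task_on_cpu1", "free_cid": "fixup_side"}
waiters = {"fixup_side": "rq1_lock", "task_on_cpu1": "free_cid"}
print(livelocks(holders, waiters))   # cycle detected: prints True
```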
> [   59.134176] CPU: 3 UID: 0 PID: 67 Comm: kworker/3:1 Tainted: G OE 6.19.0-rc5-gbe9790cb9e63-dirty #1 PREEMPT(full)
> [   59.134186] Workqueue: events drain_vmap_area_work
> [   59.134194] RIP: 0010:smp_call_function_many_cond+0x772/0xe60
> [   59.134250]  on_each_cpu_cond_mask+0x24/0x40
> [   59.134254]  flush_tlb_kernel_range+0x402/0x6b0

CPU3 is unrelated as it does not hold a runqueue lock.

> [   59.134374] NMI backtrace for cpu 1
> [   59.134388] RIP: 0010:_find_first_zero_bit+0x50/0x90
> [   59.134423]  __schedule+0x3312/0x4390
> [   59.134430]  ? __pfx___schedule+0x10/0x10
> [   59.134434]  ? trace_rcu_watching+0x105/0x150
> [   59.134440]  schedule_idle+0x59/0x90

CPU1 holds a runqueue lock and find_first_zero_bit() suggests that this
comes from mm_get_cid(), but w/o decoding the return address it's hard
to tell for sure.

> [   59.134474] NMI backtrace for cpu 0 skipped: idling at default_idle+0xf/0x20

CPU0 is idle and not involved at all.

So the situation is: test_progs creates the 4th child, which exceeds the
number of CPUs, so it switches to per CPU mode. At this point each task
of test_progs has a CID associated. Let's assume thread creation order
assignment for simplicity:

   T0 (main thread)  CID0   runs fork()
   T1 (1st child)    CID1
   T2 (2nd child)    CID2
   T3 (3rd child)    CID3
   T4 (4th child)    ---    is about to be forked and causes the mode switch

T0 sets mm_cid::percpu = true, transfers the CID from T0 to CPU2 and
starts the fixup which walks through the threads. During that, T1 - T3
are free to schedule in and out before the fixup has caught up with
them.

Now I played through all possible permutations with a python script and
came up with the following snafu:

  T1 schedules in on CPU3 and observes percpu == true, so it transfers
  its CID to CPU3.

  T1 is migrated to CPU1, schedules in and observes percpu == true, but
  CPU1 does not have a CID associated and T1 transferred its own to
  CPU3. So it has to allocate one with CPU1's runqueue lock held, but
  the pool is empty, so it keeps looping.
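The bookkeeping behind this permutation can be modeled in a few lines of
python (a hypothetical sketch along the lines of such a script, not the
actual one; the data structures are invented for illustration):

```python
def simulate(events):
    """Replay one interleaving of schedule-in events during the fixup.

    Each event is (task, cpu): the task schedules in on that CPU while
    mm_cid.percpu is already true.  A task that still owns a CID hands
    it to the CPU; a task without one must allocate from the free pool.
    Returns the set of tasks that found the pool empty (the loopers).
    """
    # State right after the mode switch: T0 transferred CID0 to CPU2,
    # T1..T3 still own CID1..CID3, so the free pool is empty.
    task_cid = {"T1": 1, "T2": 2, "T3": 3}
    cpu_cid = {2: 0}
    pool = set()
    stuck = set()
    for task, cpu in events:
        if cpu in cpu_cid:
            continue                     # CPU already has a CID to use
        cid = task_cid.pop(task, None)   # task donates its own CID ...
        if cid is None:                  # ... unless it gave it away
            if not pool:
                stuck.add(task)          # mm_get_cid() keeps looping
                continue
            cid = pool.pop()
        cpu_cid[cpu] = cid
    return stuck

# T1 donates CID1 to CPU3, then migrates to CPU1 which has no CID:
print(simulate([("T1", 3), ("T1", 1)]))  # prints {'T1'}
```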
Now T0 reaches T1 in the thread walk and tries to lock the corresponding
runqueue lock, which is held.

  ---> Livelock

So this side needs the same MM_CID_TRANSIT treatment as the other side,
which brings me back to the splat Shrikanth observed. I used the same
script to run through all possible permutations on that side too, but
nothing showed up there, and yesterday's finding is harmless because it
only creates slightly inconsistent state as the task is already marked
CID inactive. But the CID has the MM_CID_TRANSIT bit set, so the CID is
dropped back into the pool when the exiting task schedules out via
preemption or the final schedule().

So I scratched my head some more and stared at the code with two things
in mind:

  1) It seems to be hard to reproduce

  2) It happened on a weakly ordered architecture

and indeed there is an opportunity to get this wrong. The mode switch
does:

    WRITE_ONCE(mm->mm_cid.transit, MM_CID_TRANSIT);
    WRITE_ONCE(mm->mm_cid.percpu, ....);

sched_in() does:

    if (!READ_ONCE(mm->mm_cid.percpu))
        ...
    cid |= READ_ONCE(mm->mm_cid.transit);

so it can observe percpu == false and transit == 0 even if the fixup
function has not yet completed. As a consequence the task will not drop
the CID when scheduling out before the fixup is completed, which means
the CID space can be exhausted and the next task scheduling in will loop
in mm_get_cid() and the fixup thread can livelock on the held runqueue
lock as above.

I'll send out a series to address all of that later this evening when
tests have completed and changelogs are polished.

Thanks,

        tglx