* [PATCH v3] sched/mmcid: fix OOB clear_bit when CID is MM_CID_UNSET in fixup path
@ 2026-06-16 20:38 Rik van Riel
  2026-06-16 21:40 ` Mathieu Desnoyers
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Rik van Riel @ 2026-06-16 20:38 UTC
  To: linux-kernel
  Cc: kernel-team, Rik van Riel, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Thomas Gleixner, Mathieu Desnoyers,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, K Prateek Nayak

In mm_cid_fixup_cpus_to_tasks(), when rq->curr has the target mm and
mm_cid.active is set, the CID is checked with cid_in_transit() before
setting the transition bit.  In per-CPU mode a newly forked or exec'd
task can be running with mm_cid.cid == MM_CID_UNSET because CIDs are
assigned lazily on schedule-in.  Because the guard only tests
cid_in_transit(), it passes for MM_CID_UNSET (no transit bit set); the
code then converts the value to MM_CID_UNSET | MM_CID_TRANSIT and stores
it back.  Later, mm_cid_schedout() strips the transit bit and feeds
MM_CID_UNSET to clear_bit() as the bit number, triggering an
out-of-bounds write.
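
The sequence above can be modelled in userspace C.  This is only a
sketch: the model_* names are hypothetical, the bit position used for the
transit flag is an assumption (this changelog only states that
MM_CID_UNSET is BIT(31)), and the helper bodies are guesses at what
cid_in_transit(), cid_to_transit_cid() and cid_from_transit_cid() do,
based on their names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; these are NOT the kernel's definitions. */
#define MODEL_CID_UNSET   (UINT32_C(1) << 31) /* BIT(31), per the changelog */
#define MODEL_CID_TRANSIT (UINT32_C(1) << 29) /* ASSUMED bit position */

/* Assumed behaviour of cid_in_transit(): test the transit flag. */
static bool model_cid_in_transit(uint32_t cid)
{
	return (cid & MODEL_CID_TRANSIT) != 0;
}

/* Assumed behaviour of cid_to_transit_cid(): set the transit flag. */
static uint32_t model_cid_to_transit(uint32_t cid)
{
	return cid | MODEL_CID_TRANSIT;
}

/* Assumed behaviour of cid_from_transit_cid(): strip the transit flag. */
static uint32_t model_cid_from_transit(uint32_t cid)
{
	return cid & ~MODEL_CID_TRANSIT;
}

/* Old guard: only the transit flag is checked, so UNSET slips through. */
static uint32_t model_fixup_old(uint32_t cid)
{
	if (!model_cid_in_transit(cid))
		cid = model_cid_to_transit(cid);
	return cid;
}

/* Bit number that the schedule-out path would hand to clear_bit(). */
static uint32_t model_schedout_bitnr(uint32_t cid)
{
	return model_cid_from_transit(cid);
}
```

Under these assumptions, model_schedout_bitnr(model_fixup_old(MODEL_CID_UNSET))
evaluates to MODEL_CID_UNSET, so the sentinel itself ends up being used
as a bit number.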

Symptoms: this is genuine memory corruption, but a bounded out-of-bounds
write, not an arbitrary one.  MM_CID_UNSET is the fixed sentinel BIT(31),
so once the bad value reaches mm_cid_schedout() the cid_from_transit_cid()
strip leaves MM_CID_UNSET, which fails the "cid < max_cids" convergence
test and falls into mm_drop_cid() -> clear_bit(MM_CID_UNSET,
mm_cidmask(mm)).  The cid bitmap is embedded in the mm_struct slab object
(after cpu_bitmap and mm_cpus_allowed) and is only num_possible_cpus()
bits wide, so clearing bit 2^31 is a deterministic OOB bit-clear at a
fixed offset of 2^31 / 8 == 256 MiB past the bitmap base.  The address is
not attacker-influenced (fixed sentinel -> fixed offset) and the op only
clears a single bit; what sits 256 MiB further along the direct map is
whatever kernel object happens to live there, so this corrupts one bit of
unpredictable kernel memory -- it is not an arbitrary-address or
arbitrary-value write.
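
The offset arithmetic can be checked with a few lines of C.  This is a
worked example, not kernel code: the function names are hypothetical and
the bitmap size helper assumes 64-bit longs.

```c
#include <assert.h>
#include <stdint.h>

/* Sentinel value used as the bit number: BIT(31), per the changelog. */
#define MODEL_CID_UNSET (UINT64_C(1) << 31)

/* Byte offset, from the bitmap base, of the byte holding bit 'nr'. */
static uint64_t bit_byte_offset(uint64_t nr)
{
	return nr / 8;
}

/* Convert a byte count to whole MiB. */
static uint64_t bytes_to_mib(uint64_t bytes)
{
	return bytes >> 20;
}

/*
 * Size in bytes of a bitmap holding 'nbits' bits, rounded up to whole
 * longs.  ASSUMPTION: sizeof(long) == 8.
 */
static uint64_t bitmap_bytes(uint64_t nbits)
{
	return ((nbits + 63) / 64) * 8;
}
```

For instance, a bitmap sized for 512 possible CPUs occupies
bitmap_bytes(512) == 64 bytes, while bit_byte_offset(MODEL_CID_UNSET)
is 268435456 bytes, which bytes_to_mib() reports as 256.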

It triggers only in per-CPU CID mode, when a CPU is running an active
task of the target mm whose cid is still MM_CID_UNSET -- the
fork()/execve() window before that task's next schedule-in assigns it a
real CID -- and a per-CPU -> per-task fixup walks over it (the mode
fallback driven by a thread exit, sched_mm_cid_exit(), or by the deferred
max_cids recompute in mm_cid_work_fn()).

In practice syzkaller surfaced it as a KASAN use-after-free reported in
__schedule -> mm_cid_switch_to, where the offending clear_bit() is inlined
via mm_cid_schedout() -> mm_drop_cid().

Guard the transition-bit assignment against MM_CID_UNSET, in addition to
the existing cid_in_transit() check, so the bit is only set on a genuine
task-owned CID.  A CPU-owned (MM_CID_ONCPU) CID of a running active task
is handled by the cid_on_cpu(pcp->cid) branch above and never reaches
this path, so excluding MM_CID_UNSET (and the already-transitioning case)
is sufficient.
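
The effect of the extra check can be sketched in userspace C as well.
As before, this is a model under assumptions and not the kernel code:
the model_* names are hypothetical and the transit flag's bit position
is assumed; only the shape of the condition mirrors the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model only; these are NOT the kernel's definitions. */
#define MODEL_CID_UNSET   (UINT32_C(1) << 31) /* BIT(31), per the changelog */
#define MODEL_CID_TRANSIT (UINT32_C(1) << 29) /* ASSUMED bit position */

/*
 * Fixed guard: set the transit flag only on a value that is neither the
 * UNSET sentinel nor already in transit.  Everything else is returned
 * unchanged.
 */
static uint32_t model_fixup_fixed(uint32_t cid)
{
	if (cid != MODEL_CID_UNSET && !(cid & MODEL_CID_TRANSIT))
		cid |= MODEL_CID_TRANSIT;
	return cid;
}
```

In this model, model_fixup_fixed(MODEL_CID_UNSET) returns the sentinel
untouched, an ordinary value such as 5 gains the transit flag, and a
value that already carries the flag is left alone.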

Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Assisted-by: Claude:claude-opus-4-8 syzkaller
Signed-off-by: Rik van Riel <riel@surriel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/core.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b791e9e9f67..3cc6fb1d2054 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10909,8 +10909,19 @@ static void mm_cid_fixup_cpus_to_tasks(struct mm_struct *mm)
 		} else if (rq->curr->mm == mm && rq->curr->mm_cid.active) {
 			unsigned int cid = rq->curr->mm_cid.cid;
 
-			/* Ensure it has the transition bit set */
-			if (!cid_in_transit(cid)) {
+			/*
+			 * Set the transition bit only on a genuine task-owned
+			 * CID. A running active task can legitimately have
+			 * MM_CID_UNSET here: in per-CPU mode CIDs are assigned
+			 * lazily on schedule-in, so the fork()/execve() window
+			 * leaves the task active with no owned CID. Setting the
+			 * transition bit on MM_CID_UNSET would later feed
+			 * clear_bit() an out-of-bounds bit number via
+			 * mm_cid_schedout(), so exclude it. A CPU-owned
+			 * (MM_CID_ONCPU) CID is handled by the cid_on_cpu()
+			 * branch above and never reaches here.
+			 */
+			if (cid != MM_CID_UNSET && !cid_in_transit(cid)) {
 				cid = cid_to_transit_cid(cid);
 				rq->curr->mm_cid.cid = cid;
 				pcp->cid = cid;
-- 
2.53.0-Meta



* Re: [PATCH v3] sched/mmcid: fix OOB clear_bit when CID is MM_CID_UNSET in fixup path
From: Mathieu Desnoyers @ 2026-06-16 21:40 UTC
  To: Rik van Riel, linux-kernel, Thomas Gleixner
  Cc: kernel-team, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, K Prateek Nayak

On 2026-06-16 16:38, Rik van Riel wrote:
> In mm_cid_fixup_cpus_to_tasks(), when rq->curr has the target mm and
> mm_cid.active is set, the CID is checked with cid_in_transit() before
> setting the transition bit.  In per-CPU mode a newly forked or exec'd
> task can be running with mm_cid.cid == MM_CID_UNSET because CIDs are
> assigned lazily on schedule-in.  With cid_in_transit() the guard passes
> for MM_CID_UNSET (no transit bit), converts it to MM_CID_UNSET |
> MM_CID_TRANSIT and stores it back; later mm_cid_schedout() feeds this
> to clear_bit() with MM_CID_UNSET as the bit number, triggering an
> out-of-bounds write.

Thomas, can you have a look as well in case we missed something
subtle?

Rik, did you check whether there are other instances of that
MM_CID_UNSET issue lurking in the code, or was your analysis
focused on the reproduced bug?

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [PATCH v3] sched/mmcid: fix OOB clear_bit when CID is MM_CID_UNSET in fixup path
From: Thomas Gleixner @ 2026-06-19 19:40 UTC
  To: Rik van Riel, linux-kernel
  Cc: kernel-team, Rik van Riel, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Mathieu Desnoyers, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	K Prateek Nayak

On Tue, Jun 16 2026 at 16:38, Rik van Riel wrote:
> In mm_cid_fixup_cpus_to_tasks(), when rq->curr has the target mm and
> mm_cid.active is set, the CID is checked with cid_in_transit() before
> setting the transition bit.  In per-CPU mode a newly forked or exec'd
> task can be running with mm_cid.cid == MM_CID_UNSET because CIDs are
> assigned lazily on schedule-in.  With cid_in_transit() the guard passes
> for MM_CID_UNSET (no transit bit), converts it to MM_CID_UNSET |
> MM_CID_TRANSIT and stores it back; later mm_cid_schedout() feeds this
> to clear_bit() with MM_CID_UNSET as the bit number, triggering an
> out-of-bounds write.
>
> Symptoms: this is genuine memory corruption, but a bounded out-of-bounds
> write, not an arbitrary one.  MM_CID_UNSET is the fixed sentinel BIT(31),
> so once the bad value reaches mm_cid_schedout() the cid_from_transit_cid()
> strip leaves MM_CID_UNSET, which fails the "cid < max_cids" convergence
> test and falls into mm_drop_cid() -> clear_bit(MM_CID_UNSET,
> mm_cidmask(mm)).  The cid bitmap is embedded in the mm_struct slab object
> (after cpu_bitmap and mm_cpus_allowed) and is only num_possible_cpus()
> bits wide, so clearing bit 2^31 is a deterministic OOB bit-clear at a
> fixed offset of 2^31 / 8 == 256 MiB past the bitmap base.  The address is
> not attacker-influenced (fixed sentinel -> fixed offset) and the op only
> clears a single bit; what sits 256 MiB further along the direct map is
> whatever kernel object happens to live there, so this corrupts one bit of
> unpredictable kernel memory -- it is not an arbitrary-address or
> arbitrary-value write.
>
> It triggers only in per-CPU CID mode, when a CPU is running an active
> task of the target mm whose cid is still MM_CID_UNSET -- the
> fork()/execve() window before that task's next schedule-in assigns it a
> real CID -- and a per-CPU -> per-task fixup walks over it (the mode
> fallback driven by a thread exit, sched_mm_cid_exit(), or by the deferred
> max_cids recompute in mm_cid_work_fn()).
>
> In practice syzkaller surfaced it as a KASAN use-after-free reported in
> __schedule -> mm_cid_switch_to, where the offending clear_bit() is inlined
> via mm_cid_schedout() -> mm_drop_cid().
>
> Guard the transition-bit assignment against MM_CID_UNSET, in addition to
> the existing cid_in_transit() check, so the bit is only set on a genuine
> task-owned CID.  A CPU-owned (MM_CID_ONCPU) CID of a running active task
> is handled by the cid_on_cpu(pcp->cid) branch above and never reaches
> this path, so excluding MM_CID_UNSET (and the already-transitioning case)
> is sufficient.

Duh. Now that you explained it, it's obvious. Thanks for tracking this
nasty down!


* [tip: core/urgent] sched/mmcid: Fix OOB clear_bit when CID is MM_CID_UNSET in fixup path
From: tip-bot2 for Rik van Riel @ 2026-06-19 19:50 UTC
  To: linux-tip-commits
  Cc: Rik van Riel, Thomas Gleixner, Mathieu Desnoyers, stable, x86,
	linux-kernel

The following commit has been merged into the core/urgent branch of tip:

Commit-ID:     de3ab9bd3133899efb92e4cd05ba4203e58fc0a3
Gitweb:        https://git.kernel.org/tip/de3ab9bd3133899efb92e4cd05ba4203e58fc0a3
Author:        Rik van Riel <riel@surriel.com>
AuthorDate:    Tue, 16 Jun 2026 16:38:17 -04:00
Committer:     Thomas Gleixner <tglx@kernel.org>
CommitterDate: Fri, 19 Jun 2026 21:44:16 +02:00

sched/mmcid: Fix OOB clear_bit when CID is MM_CID_UNSET in fixup path

In mm_cid_fixup_cpus_to_tasks(), when rq->curr has the target mm and
mm_cid.active is set, the CID is checked with cid_in_transit() before
setting the transition bit.  In per-CPU mode a newly forked or exec'd
task can be running with mm_cid.cid == MM_CID_UNSET because CIDs are
assigned lazily on schedule-in.  Because the guard only tests
cid_in_transit(), it passes for MM_CID_UNSET (no transit bit set); the
code then converts the value to MM_CID_UNSET | MM_CID_TRANSIT and stores
it back.  Later, mm_cid_schedout() strips the transit bit and feeds
MM_CID_UNSET to clear_bit() as the bit number, triggering an
out-of-bounds write.

Symptoms: this is genuine memory corruption, but a bounded out-of-bounds
write, not an arbitrary one.  MM_CID_UNSET is the fixed sentinel BIT(31),
so once the bad value reaches mm_cid_schedout() the cid_from_transit_cid()
strip leaves MM_CID_UNSET, which fails the "cid < max_cids" convergence
test and falls into mm_drop_cid() -> clear_bit(MM_CID_UNSET,
mm_cidmask(mm)).  The cid bitmap is embedded in the mm_struct slab object
(after cpu_bitmap and mm_cpus_allowed) and is only num_possible_cpus()
bits wide, so clearing bit 2^31 is a deterministic OOB bit-clear at a
fixed offset of 2^31 / 8 == 256 MiB past the bitmap base.  The address is
not attacker-influenced (fixed sentinel -> fixed offset) and the op only
clears a single bit; what sits 256 MiB further along the direct map is
whatever kernel object happens to live there, so this corrupts one bit of
unpredictable kernel memory -- it is not an arbitrary-address or
arbitrary-value write.

It triggers only in per-CPU CID mode, when a CPU is running an active
task of the target mm whose cid is still MM_CID_UNSET -- the
fork()/execve() window before that task's next schedule-in assigns it a
real CID -- and a per-CPU -> per-task fixup walks over it (the mode
fallback driven by a thread exit, sched_mm_cid_exit(), or by the deferred
max_cids recompute in mm_cid_work_fn()).

In practice syzkaller surfaced it as a KASAN use-after-free reported in
__schedule -> mm_cid_switch_to, where the offending clear_bit() is inlined
via mm_cid_schedout() -> mm_drop_cid().

Guard the transition-bit assignment against MM_CID_UNSET, in addition to
the existing cid_in_transit() check, so the bit is only set on a genuine
task-owned CID.  A CPU-owned (MM_CID_ONCPU) CID of a running active task
is handled by the cid_on_cpu(pcp->cid) branch above and never reaches
this path, so excluding MM_CID_UNSET (and the already-transitioning case)
is sufficient.

Fixes: fbd0e71dc370 ("sched/mmcid: Provide CID ownership mode fixup functions")
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Assisted-by: Claude:claude-opus-4-8 syzkaller
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260616203818.1516263-1-riel@surriel.com
---
 kernel/sched/core.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b791e9..3cc6fb1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10909,8 +10909,19 @@ static void mm_cid_fixup_cpus_to_tasks(struct mm_struct *mm)
 		} else if (rq->curr->mm == mm && rq->curr->mm_cid.active) {
 			unsigned int cid = rq->curr->mm_cid.cid;
 
-			/* Ensure it has the transition bit set */
-			if (!cid_in_transit(cid)) {
+			/*
+			 * Set the transition bit only on a genuine task-owned
+			 * CID. A running active task can legitimately have
+			 * MM_CID_UNSET here: in per-CPU mode CIDs are assigned
+			 * lazily on schedule-in, so the fork()/execve() window
+			 * leaves the task active with no owned CID. Setting the
+			 * transition bit on MM_CID_UNSET would later feed
+			 * clear_bit() an out-of-bounds bit number via
+			 * mm_cid_schedout(), so exclude it. A CPU-owned
+			 * (MM_CID_ONCPU) CID is handled by the cid_on_cpu()
+			 * branch above and never reaches here.
+			 */
+			if (cid != MM_CID_UNSET && !cid_in_transit(cid)) {
 				cid = cid_to_transit_cid(cid);
 				rq->curr->mm_cid.cid = cid;
 				pcp->cid = cid;

