Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "zhidao su" <soolaugust@gmail.com>
Cc: sched-ext@lists.linux.dev
Subject: Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
Date: Wed, 06 May 2026 06:16:32 +0000	[thread overview]
Message-ID: <20260506061633.73070C2BCB8@smtp.kernel.org> (raw)
In-Reply-To: <20260506054001.1105522-1-suzhidao@xiaomi.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Calling `scx_set_task_state(p, SCX_TASK_READY)` without holding `task_rq_lock(p)` introduces a Time-Of-Check to Time-Of-Use (TOCTOU) data race against `sched_ext_dead()`.
- [High] A memory ordering vulnerability defeats the bug 1 fix on weakly ordered architectures, allowing the NULL pointer dereference to still occur.
- [Medium] Missing `SCX_TASK_OFF_TASKS` flag on the `scx_root_enable_workfn()` cleanup path exposes dead tasks to cgroup iterators, causing sub-scheduler enables to abort.
--

commit 1d291365affa3415c03f2031e9198ebe7208a245
Author: zhidao su <soolaugust@gmail.com>

sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()

In CONFIG_EXT_SUB_SCHED, scx_task_sched(p) returns p->scx.sched instead
of scx_root. scx_root_enable_workfn() iterates all tasks and for each
releases scx_tasks_lock via scx_task_iter_unlock() before calling
scx_init_task().

> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5600618dcef8e..98505577d56ca 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -3690,7 +3690,15 @@ static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork
>  		/*
>  		 * While @p's rq is not locked. @p is not visible to the rest of
>  		 * SCX yet and it's safe to update the flags and state.
> +		 *
> +		 * Install p->scx.sched before transitioning state off NONE so
> +		 * that the invariant state!=NONE => p->scx.sched!=NULL holds as
> +		 * soon as state becomes observable.  A concurrent sched_ext_dead()
> +		 * that races the INIT window will then always find a valid
> +		 * scheduler pointer and can call scx_disable_and_exit_task()
> +		 * to release resources allocated by ops.init_task().
>  		 */
> +		scx_set_task_sched(p, sch);
>  		p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
>  		scx_set_task_state(p, SCX_TASK_INIT);
>  	}

Can this result in a memory ordering vulnerability on weakly ordered
architectures?

In scx_init_task(), scx_set_task_sched() uses rcu_assign_pointer(), which
relies on smp_store_release(). While smp_store_release() guarantees prior
operations are visible, it does not prevent subsequent memory operations
from being reordered before the release.

Could the CPU commit the store updating p->scx.flags to SCX_TASK_INIT
before the store to p->scx.sched is completed? If a concurrent
sched_ext_dead() observes state == SCX_TASK_INIT, branches into the cleanup
path, and reads a NULL scheduler pointer, it would pass it to
scx_disable_and_exit_task() which crashes in SCX_HAS_OP().

Does this require a full write memory barrier (e.g., smp_wmb()) between
setting the scheduler pointer and transitioning the state?

[ ... ]

> @@ -7130,7 +7133,49 @@ static void scx_root_enable_workfn(struct kthread_work *work)
>  			goto err_disable_unlock_all;
>  		}
>  
> -		scx_set_task_sched(p, sch);
> +		/*
> +		 * sched_ext_dead() may have raced while locks were dropped in
> +		 * scx_task_iter_unlock().  Two cases:
> +		 *
> +		 * (a) sched_ext_dead() ran after scx_init_task() set state=INIT:
> +		 *     it called scx_disable_and_exit_task() (cancelled=true) and
> +		 *     reset state to NONE.  ops.exit_task() already ran; skip.
> +		 *
> +		 * (b) sched_ext_dead() ran before scx_init_task() (state=NONE at
> +		 *     the time): it skipped scx_disable_and_exit_task() because
> +		 *     state was NONE.  scx_init_task() subsequently called
> +		 *     ops.init_task() and set state=INIT, leaving allocated
> +		 *     resources with no owner.  We must call
> +		 *     scx_disable_and_exit_task() here to release them.
> +		 *
> +		 * Distinguish case (a) from (b) by reading state: (a) leaves
> +		 * state=NONE (reset by scx_disable_and_exit_task); (b) leaves
> +		 * state=INIT (set by scx_init_task, never reset).
> +		 */
> +		{
> +			bool p_dead = false, need_exit = false;
> +
> +			scoped_guard(raw_spinlock_irq, &scx_tasks_lock) {
> +				if (list_empty(&p->scx.tasks_node)) {
> +					p_dead = true;
> +					need_exit = scx_get_task_state(p) != SCX_TASK_NONE;
> +				}
> +			}
> +
> +			if (p_dead) {
> +				if (need_exit) {
> +					struct rq_flags rf;
> +					struct rq *rq;
> +
> +					rq = task_rq_lock(p, &rf);
> +					scx_disable_and_exit_task(sch, p);
> +					task_rq_unlock(rq, p, &rf);

Is this cleanup path missing the SCX_TASK_OFF_TASKS flag?

If sched_ext_dead() runs before scx_init_task() completes, it sees state
NONE and skips setting SCX_TASK_OFF_TASKS. When this cleanup path calls
scx_disable_and_exit_task() under task_rq_lock(), the SCX_TASK_OFF_TASKS
flag is not set before the runqueue is unlocked.

Because the flag is missing, if a cgroup iterator traverses the cgroup
before cgroup_task_dead() removes the task, could the iterator erroneously
yield the dead task with state NONE, hit assert_task_ready_or_enabled(p),
and abort the sub-scheduler enable?

> +				}
> +				put_task_struct(p);
> +				continue;
> +			}
> +		}
> +
>  		scx_set_task_state(p, SCX_TASK_READY);

Can this cause a Time-Of-Check to Time-Of-Use (TOCTOU) data race against
sched_ext_dead()?

The code checks list_empty(&p->scx.tasks_node) under scx_tasks_lock to
determine if a concurrent sched_ext_dead() has started processing.

However, once scx_tasks_lock is dropped, sched_ext_dead() can acquire it,
remove the task, and proceed to exit the task, eventually doing
p->scx.flags |= SCX_TASK_OFF_TASKS under task_rq_lock(p).

Meanwhile, scx_root_enable_workfn() continues past the lock and
unconditionally calls scx_set_task_state(p, SCX_TASK_READY) without holding
task_rq_lock(p). Because scx_set_task_state() performs a non-atomic
read-modify-write on p->scx.flags, could this overwrite and permanently
lose the SCX_TASK_OFF_TASKS flag?

Also, could this incorrectly transition a dead task to SCX_TASK_READY after
its scheduler resources were torn down, triggering a warning?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260506054001.1105522-1-suzhidao@xiaomi.com?part=1

next prev parent reply	other threads:[~2026-05-06  6:16 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-29 13:31 [PATCH] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn() zhidao su
2026-05-04 20:31 ` Tejun Heo
2026-05-06  5:40 ` [PATCH v2] " zhidao su
2026-05-06  6:16   ` sashiko-bot [this message]
2026-05-07  2:32     ` zhidao su
2026-05-10 13:55   ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260506061633.73070C2BCB8@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=sashiko@lists.linux.dev \
    --cc=sched-ext@lists.linux.dev \
    --cc=soolaugust@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.