Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()

Sashiko discussions
 help / color / mirror / Atom feed

From: zhidao su <soolaugust@gmail.com>
To: sashiko@lists.linux.dev
Cc: sched-ext@lists.linux.dev, Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH v2] sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
Date: Thu, 7 May 2026 10:32:21 +0800	[thread overview]
Message-ID: <5636DBB8-B5F3-4F88-AA54-B388DB3A80F1@gmail.com> (raw)
In-Reply-To: <20260506061633.73070C2BCB8@smtp.kernel.org>

On Wed, May 06 2026, sashiko-bot@kernel.org wrote:

> - [High] Calling `scx_set_task_state(p, SCX_TASK_READY)` without holding
>  `task_rq_lock(p)` introduces a Time-Of-Check to Time-Of-Use (TOCTOU)
>  data race against `sched_ext_dead()`.
> [...]
> Can this cause a Time-Of-Check to Time-Of-Use (TOCTOU) data race against
> sched_ext_dead()?

When p_dead is true the code does `continue`, so scx_set_task_state(p,
SCX_TASK_READY) is never reached for dead tasks. The TOCTOU window
doesn't exist.

> - [High] A memory ordering vulnerability defeats the bug 1 fix on weakly
>  ordered architectures, allowing the NULL pointer dereference to still occur.
> [...]
> Does this require a full write memory barrier (e.g., smp_wmb()) between
> setting the scheduler pointer and transitioning the state?

I'm not confident whether the store ordering between scx_set_task_sched()
and the state transition needs an explicit barrier here.

Tejun, what's the intended model for scx flags access in this path?

> - [Medium] Missing `SCX_TASK_OFF_TASKS` flag on the scx_root_enable_workfn()
>  cleanup path exposes dead tasks to cgroup iterators.
> [...]
> Is this cleanup path missing the SCX_TASK_OFF_TASKS flag?

Not needed here. scx_task_iter_next_locked() checks SCX_TASK_OFF_TASKS
only for cgroup-scoped iteration, which is only reachable from
scx_sub_enable_workfn(). That function requires scx_enabled() at entry;
scx_root_enable_workfn() has not called scx_ops_enable() yet at this
point, so no sub-sched enable can be running concurrently.


> 2026年5月6日 14:16，sashiko-bot@kernel.org 写道：
> 
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [High] Calling `scx_set_task_state(p, SCX_TASK_READY)` without holding `task_rq_lock(p)` introduces a Time-Of-Check to Time-Of-Use (TOCTOU) data race against `sched_ext_dead()`.
> - [High] A memory ordering vulnerability defeats the bug 1 fix on weakly ordered architectures, allowing the NULL pointer dereference to still occur.
> - [Medium] Missing `SCX_TASK_OFF_TASKS` flag on the `scx_root_enable_workfn()` cleanup path exposes dead tasks to cgroup iterators, causing sub-scheduler enables to abort.
> --
> 
> commit 1d291365affa3415c03f2031e9198ebe7208a245
> Author: zhidao su <soolaugust@gmail.com>
> 
> sched_ext: Fix sched_ext_dead() race with scx_root_enable_workfn()
> 
> In CONFIG_EXT_SUB_SCHED, scx_task_sched(p) returns p->scx.sched instead
> of scx_root. scx_root_enable_workfn() iterates all tasks and for each
> releases scx_tasks_lock via scx_task_iter_unlock() before calling
> scx_init_task().
> 
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 5600618dcef8e..98505577d56ca 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -3690,7 +3690,15 @@ static int scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork
>> /*
>> * While @p's rq is not locked. @p is not visible to the rest of
>> * SCX yet and it's safe to update the flags and state.
>> + *
>> + * Install p->scx.sched before transitioning state off NONE so
>> + * that the invariant state!=NONE => p->scx.sched!=NULL holds as
>> + * soon as state becomes observable.  A concurrent sched_ext_dead()
>> + * that races the INIT window will then always find a valid
>> + * scheduler pointer and can call scx_disable_and_exit_task()
>> + * to release resources allocated by ops.init_task().
>> */
>> + scx_set_task_sched(p, sch);
>> p->scx.flags |= SCX_TASK_RESET_RUNNABLE_AT;
>> scx_set_task_state(p, SCX_TASK_INIT);
>> }
> 
> Can this result in a memory ordering vulnerability on weakly ordered
> architectures?
> 
> In scx_init_task(), scx_set_task_sched() uses rcu_assign_pointer(), which
> relies on smp_store_release(). While smp_store_release() guarantees prior
> operations are visible, it does not prevent subsequent memory operations
> from being reordered before the release.
> 
> Could the CPU commit the store updating p->scx.flags to SCX_TASK_INIT
> before the store to p->scx.sched is completed? If a concurrent
> sched_ext_dead() observes state == SCX_TASK_INIT, branches into the cleanup
> path, and reads a NULL scheduler pointer, it would pass it to
> scx_disable_and_exit_task() which crashes in SCX_HAS_OP().
> 
> Does this require a full write memory barrier (e.g., smp_wmb()) between
> setting the scheduler pointer and transitioning the state?
> 
> [ ... ]
> 
>> @@ -7130,7 +7133,49 @@ static void scx_root_enable_workfn(struct kthread_work *work)
>> goto err_disable_unlock_all;
>> }
>> 
>> - scx_set_task_sched(p, sch);
>> + /*
>> + * sched_ext_dead() may have raced while locks were dropped in
>> + * scx_task_iter_unlock().  Two cases:
>> + *
>> + * (a) sched_ext_dead() ran after scx_init_task() set state=INIT:
>> + *     it called scx_disable_and_exit_task() (cancelled=true) and
>> + *     reset state to NONE.  ops.exit_task() already ran; skip.
>> + *
>> + * (b) sched_ext_dead() ran before scx_init_task() (state=NONE at
>> + *     the time): it skipped scx_disable_and_exit_task() because
>> + *     state was NONE.  scx_init_task() subsequently called
>> + *     ops.init_task() and set state=INIT, leaving allocated
>> + *     resources with no owner.  We must call
>> + *     scx_disable_and_exit_task() here to release them.
>> + *
>> + * Distinguish case (a) from (b) by reading state: (a) leaves
>> + * state=NONE (reset by scx_disable_and_exit_task); (b) leaves
>> + * state=INIT (set by scx_init_task, never reset).
>> + */
>> + {
>> + bool p_dead = false, need_exit = false;
>> +
>> + scoped_guard(raw_spinlock_irq, &scx_tasks_lock) {
>> + if (list_empty(&p->scx.tasks_node)) {
>> + p_dead = true;
>> + need_exit = scx_get_task_state(p) != SCX_TASK_NONE;
>> + }
>> + }
>> +
>> + if (p_dead) {
>> + if (need_exit) {
>> + struct rq_flags rf;
>> + struct rq *rq;
>> +
>> + rq = task_rq_lock(p, &rf);
>> + scx_disable_and_exit_task(sch, p);
>> + task_rq_unlock(rq, p, &rf);
> 
> Is this cleanup path missing the SCX_TASK_OFF_TASKS flag?
> 
> If sched_ext_dead() runs before scx_init_task() completes, it sees state
> NONE and skips setting SCX_TASK_OFF_TASKS. When this cleanup path calls
> scx_disable_and_exit_task() under task_rq_lock(), the SCX_TASK_OFF_TASKS
> flag is not set before the runqueue is unlocked.
> 
> Because the flag is missing, if a cgroup iterator traverses the cgroup
> before cgroup_task_dead() removes the task, could the iterator erroneously
> yield the dead task with state NONE, hit assert_task_ready_or_enabled(p),
> and abort the sub-scheduler enable?
> 
>> + }
>> + put_task_struct(p);
>> + continue;
>> + }
>> + }
>> +
>> scx_set_task_state(p, SCX_TASK_READY);
> 
> Can this cause a Time-Of-Check to Time-Of-Use (TOCTOU) data race against
> sched_ext_dead()?
> 
> The code checks list_empty(&p->scx.tasks_node) under scx_tasks_lock to
> determine if a concurrent sched_ext_dead() has started processing.
> 
> However, once scx_tasks_lock is dropped, sched_ext_dead() can acquire it,
> remove the task, and proceed to exit the task, eventually doing
> p->scx.flags |= SCX_TASK_OFF_TASKS under task_rq_lock(p).
> 
> Meanwhile, scx_root_enable_workfn() continues past the lock and
> unconditionally calls scx_set_task_state(p, SCX_TASK_READY) without holding
> task_rq_lock(p). Because scx_set_task_state() performs a non-atomic
> read-modify-write on p->scx.flags, could this overwrite and permanently
> lose the SCX_TASK_OFF_TASKS flag?
> 
> Also, could this incorrectly transition a dead task to SCX_TASK_READY after
> its scheduler resources were torn down, triggering a warning?
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260506054001.1105522-1-suzhidao@xiaomi.com?part=1

          parent reply	other threads:[~2026-05-07  2:32 UTC|newest]

Thread overview: expand[flat|nested]  mbox.gz  Atom feed
 [parent not found: <20260506061633.73070C2BCB8@smtp.kernel.org>]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5636DBB8-B5F3-4F88-AA54-B388DB3A80F1@gmail.com \
    --to=soolaugust@gmail.com \
    --cc=sashiko@lists.linux.dev \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox