Re: [PATCH sched_ext/for-7.3 19/32] sched_ext: Add sub_ecaps_updated() effective-cap change notifier

Sched_ext development
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Tejun Heo" <tj@kernel.org>
Cc: sched-ext@lists.linux.dev
Subject: Re: [PATCH sched_ext/for-7.3 19/32] sched_ext: Add sub_ecaps_updated() effective-cap change notifier
Date: Fri, 03 Jul 2026 12:25:13 +0000	[thread overview]
Message-ID: <20260703122514.03E091F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260703080159.2314350-20-tj@kernel.org>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Concurrent freeing of a sub-scheduler can permanently discard ecaps syncs for all active sub-schedulers on a CPU that is coming online.
- [High] scx_discard_ecaps_to_sync() can hang indefinitely in a worker thread if the target CPU is continuously executing higher-priority tasks, locking up the system RCU workqueue.
- [High] Early clearing of `ecaps_to_sync_node` in `scx_process_sync_ecaps()` causes a Use-After-Free of `pcpu` and `sch` during sub-scheduler teardown.
--

commit f00aae3acf0710a9423504a897d70bfb5d9571f6
Author: Tejun Heo <tj@kernel.org>
Subject: sched_ext: Add sub_ecaps_updated() effective-cap change notifier

The patch adds an ops.sub_ecaps_updated() callback fired when a
sub-scheduler's effective caps on a cid change. CPU hotplug uses this
mechanism to zero caps when a CPU goes down, and to resync when coming
back online.

> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> index 08d9367cf2188..55437f1d1965f 100644
> --- a/kernel/sched/ext/sub.c
> +++ b/kernel/sched/ext/sub.c

[ ... ]

> @@ -351,33 +354,130 @@ void scx_process_sync_ecaps(struct rq *rq)
>  	if (likely(llist_empty(&rq->scx.ecaps_to_sync)))
>  		return;
>  
> +	/*
> +	 * ecaps are zeroed while the cpu is inactive and must stay zero.
> +	 * Discard queued syncs instead of processing them - the
> +	 * scx_online_ecaps() reseed re-syncs every sched on activation.
> +	 * cpu_active() clears before the offline zeroing and sets before the
> +	 * reseed is queued, so this test can neither miss a racing sync nor
> +	 * eat the reseed.
> +	 */
> +	if (unlikely(!cpu_active(cpu))) {
> +		discard_queued_syncs(rq);
> +		return;
> +	}
> +
>  	batch = llist_del_all(&rq->scx.ecaps_to_sync);
>  	llist_for_each_safe(pos, tmp, batch) {
>  		struct scx_sched_pcpu *pcpu =
>  			container_of(pos, struct scx_sched_pcpu, ecaps_to_sync_node);
>  		struct scx_pshard *ps = pcpu->sch->pshard[shard];
> +		u64 ecaps;
>  
>  		init_llist_node(pos);
>  
>  		/* pairs with smp_mb() in queue_sync_ecaps(), see there */
>  		smp_mb();
>  
> -		WRITE_ONCE(pcpu->ecaps, calc_effective_caps(ps, cid));
> +		ecaps = calc_effective_caps(ps, cid);
> +		WRITE_ONCE(pcpu->ecaps, ecaps);
> +
> +		/* tell the sched its effective caps on this cid changed */
> +		if (ecaps != pcpu->reported_ecaps &&
> +		    SCX_HAS_OP(pcpu->sch, sub_ecaps_updated) &&
> +		    !scx_bypassing(pcpu->sch, cpu)) {
> +			struct scx_dsp_ctx *dspc = &pcpu->dsp_ctx;
> +
> +			dspc->rq = rq;
> +			/* stash @prev so nested dispatches can access it */
> +			rq->scx.sub_dispatch_prev = prev;
> +			SCX_CALL_OP(pcpu->sch, sub_ecaps_updated, rq, scx_cpu_arg(cpu),
> +				    pcpu->reported_ecaps, ecaps);
> +			rq->scx.sub_dispatch_prev = NULL;
> +			scx_flush_dispatch_buf(pcpu->sch, rq);
> +			pcpu->reported_ecaps = ecaps;

[Severity: High]
Does this sequence create a use-after-free of pcpu?

In scx_process_sync_ecaps(), the pcpu node is unlinked via init_llist_node(pos).
Later, scx_flush_dispatch_buf() might drop the runqueue lock if it dispatches
to a remote CPU's local DSQ (e.g., via dispatch_to_local_dsq()).

If the lock is dropped, scx_discard_ecaps_to_sync() executing in the RCU worker
can acquire the lock, observe !llist_on_list(), and return, allowing the
pcpu structure to be freed.

When scx_flush_dispatch_buf() finishes and reacquires the lock,
scx_process_sync_ecaps() then writes to the freed pcpu:

    pcpu->reported_ecaps = ecaps;

> +		}
> +	}
> +}

[ ... ]

>  void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu)
>  {
> -	scoped_guard (rq_lock_irqsave, cpu_rq(cpu))
> -		scx_process_sync_ecaps(cpu_rq(cpu));
> +	struct rq *rq = cpu_rq(cpu);
>  
> -	WARN_ON_ONCE(llist_on_list(&pcpu->ecaps_to_sync_node));
> +	while (true) {
> +		scoped_guard (rq_lock_irqsave, rq) {
> +			/*
> +			 * scx_process_sync_ecaps() takes the node off the list
> +			 * before it is done accessing @pcpu but does all of it
> +			 * under the rq lock. Off-list observed under the rq
> +			 * lock guarantees that the sync is complete.
> +			 */
> +			if (!llist_on_list(&pcpu->ecaps_to_sync_node))
> +				return;
> +			if (!scx_enabled() || !scx_rq_online(rq)) {

[Severity: High]
Could this destructively drop valid queued syncs for active sub-schedulers?

If a sub-scheduler is being freed via scx_sched_free_rcu_work() concurrently
with CPU hotplug (scx_online_ecaps()), scx_rq_online() checks SCX_RQ_ONLINE.
This flag is set by sched_set_rq_online() after the CPU comes online.

During the window where !scx_rq_online(rq) is true, discard_queued_syncs()
deletes all queued syncs for all sub-schedulers via llist_del_all().

Checking !cpu_active(cpu) instead (as in scx_process_sync_ecaps) might avoid
this race.

> +				discard_queued_syncs(rq);
> +				return;
> +			}
> +		}
> +		resched_cpu(cpu);
> +		msleep(1);

[Severity: High]
Can this loop hang indefinitely if the target CPU is saturated with higher
priority tasks?

The loop spins waiting for the target CPU to process the pending ecaps sync.
The node is removed when scx_process_sync_ecaps() runs on the target CPU,
which is called from balance_one().

Because ext_sched_class has a lower priority than FAIR or RT classes,
if the target CPU is continuously executing higher priority tasks,
balance_one() is never called. The RCU worker thread will spin forever
waiting for !llist_on_list().

> +	}
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=19

next prev parent reply	other threads:[~2026-07-03 12:25 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 01/32] sched_ext: Fix premature ops->priv publication in scx_alloc_and_add_sched() Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 02/32] tools/sched_ext: scx - Fix cmask_subset(), cmask_equal() and cmask_weight() Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 03/32] sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants Tejun Heo
2026-07-03  8:33   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 04/32] tools/sched_ext: scx_qmap - Use bare u64/u32/s32 integer types Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 05/32] sched_ext: Reject direct slice and dsq_vtime writes for cid-form schedulers Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 06/32] sched_ext: Make scx_bpf_kick_cid() return void Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 07/32] sched_ext: Make the kick machinery per-sched Tejun Heo
2026-07-03  9:02   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 08/32] sched_ext: Add ops.init_cids() to finalize the cid layout before init Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 09/32] sched_ext: Add CID sharding Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 10/32] sched_ext: Add shard boundaries to scx_bpf_cid_override() Tejun Heo
2026-07-03  9:51   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 11/32] sched_ext: Defer scx_sched kobj sysfs add into the enable workfns Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 12/32] sched_ext: Add per-shard scx_sched storage scaffolding Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 13/32] sched_ext: Add scx_cmask_ref for validated arena cmask access Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 14/32] sched_ext: RCU-protect the sub-sched tree's children/sibling lists Tejun Heo
2026-07-03 10:49   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 15/32] sched_ext: Add scx_skip_subtree_pre() Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 16/32] sched_ext: Add per-shard cap delegation for sub-schedulers Tejun Heo
2026-07-03 11:17   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 17/32] sched_ext: Add coalescing sub_caps_updated() notifier " Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 18/32] sched_ext: Maintain per-cpu effective cap copies for single-read checks Tejun Heo
2026-07-03 12:05   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 19/32] sched_ext: Add sub_ecaps_updated() effective-cap change notifier Tejun Heo
2026-07-03 12:25   ` sashiko-bot [this message]
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 20/32] sched_ext: Generalize local-DSQ handling to rq-owned DSQs Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 21/32] sched_ext: Add reject DSQ for cap-rejected dispatches Tejun Heo
2026-07-03 12:57   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 22/32] sched_ext: Add the SCX_CAP_ENQ_IMMED cap Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 23/32] sched_ext: Assign a unique id to each scheduler instance Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 24/32] sched_ext: Route task slice writes through set_task_slice() Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 25/32] sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice Tejun Heo
2026-07-03 13:34   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 26/32] sched_ext: Add the SCX_CAP_ENQ cap Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 27/32] sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT Tejun Heo
2026-07-03 14:01   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 28/32] sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds Tejun Heo
2026-07-03 14:14   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 29/32] sched_ext: Replay ecaps notifications suppressed by bypass Tejun Heo
2026-07-03 14:28   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 30/32] sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler Tejun Heo
2026-07-03 14:45   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 31/32] tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling Tejun Heo
2026-07-03 14:57   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 32/32] tools/sched_ext: scx_qmap - Add sub-sched cap fault injection Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260703122514.03E091F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox