From: Tejun Heo <tj@kernel.org>
To: David Vernet <void@manifault.com>,
	Andrea Righi <arighi@nvidia.com>,
	Changwoo Min <changwoo@igalia.com>
Cc: sched-ext@lists.linux.dev, Emil Tsalapatis <emil@etsalapatis.com>,
	linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>
Subject: [PATCH sched_ext/for-7.3 27/32] sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT
Date: Thu,  2 Jul 2026 22:01:54 -1000
Message-ID: <20260703080159.2314350-28-tj@kernel.org>
In-Reply-To: <20260703080159.2314350-1-tj@kernel.org>

A kick forces a scheduling event on the target cpu, and a preemption also
evicts the running task. Gate both on caps. Any kick requires baseline
access on the cid, and preempting a task the sub-sched does not own -
whether by a SCX_ENQ_PREEMPT insert or a SCX_KICK_PREEMPT kick - requires
the new SCX_CAP_PREEMPT. Gating either alone would leave a hole - the
weakest cap authorizing preempting kicks, or plain kicks disturbing cpus the
kicker has no access to.

Preempting the sched's own subtree is always allowed, and the cap extends
the right to any task on the cid. PREEMPT implies ENQ, and so ENQ_IMMED.

A preempting insert tests the running task under the target rq lock. It is
rejected and reenqueued unless the victim is in the inserter's subtree or
the inserter holds PREEMPT. A migration-disabled task is admitted
regardless, but with SCX_ENQ_PREEMPT stripped.

Kicks are enforced on the delivery path, where the effective caps can be
read coherently under the target rq's lock. A kick from a sub-sched lacking
SCX_CAP_BASE on the cid is dropped, and a SCX_KICK_PREEMPT kick without
PREEMPT for a task outside the kicker's subtree degrades to a plain
reschedule.

Unlike the enqueue caps, PREEMPT is checked only at the instant of the
insert or kick, never as a standing property of a queued task.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
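Note for readers: the insert-path rules in the description above can be
modeled in a few lines of userspace C. This is only a sketch under stated
assumptions, not the kernel code. caps_implied() mirrors scx_caps_implied()
from this patch. caps_usable(), insert_decide(), the insert_target enum and
the ENQ_* flag values are hypothetical names and values made up for
illustration, and the mapping from enqueue flags to ENQ_IMMED/ENQ is assumed
from the cap table in internal.h because scx_caps_for_enq() is not part of
this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cap bits, same positions as enum scx_cap_flags in the patch. */
#define CAP_ENQ_IMMED	(1ULL << 0)
#define CAP_ENQ		(1ULL << 1)
#define CAP_PREEMPT	(1ULL << 2)
#define CAP_ALL		(CAP_ENQ_IMMED | CAP_ENQ | CAP_PREEMPT)

/* Illustrative flag values only; the real SCX_ENQ_* bits differ. */
#define ENQ_IMMED	(1ULL << 0)
#define ENQ_PREEMPT	(1ULL << 1)

/* Hypothetical outcome names for the three paths in the description. */
enum insert_target { TO_LOCAL, TO_LOCAL_FORCED, TO_REJECT };

/* Caps implied by holding @cap, as in scx_caps_implied(). */
static uint64_t caps_implied(uint64_t cap)
{
	switch (cap) {
	case CAP_PREEMPT:
		return CAP_ENQ | CAP_ENQ_IMMED;
	case CAP_ENQ:
		return CAP_ENQ_IMMED;
	}
	return 0;
}

/* Hypothetical helper: fold implied caps into the literally held set. */
static uint64_t caps_usable(uint64_t held)
{
	uint64_t usable = held;

	assert(!(held & ~CAP_ALL));
	for (uint64_t cap = 1; cap <= CAP_PREEMPT; cap <<= 1)
		if (held & cap)
			usable |= caps_implied(cap);
	return usable;
}

/*
 * Decide where an insert lands. @victim_needs_cap is true when the task
 * running on the target is an ext task outside the inserter's subtree.
 */
static enum insert_target insert_decide(uint64_t held, uint64_t *enq_flags,
					bool victim_needs_cap,
					bool migration_disabled)
{
	/* Assumption: IMMED inserts need ENQ_IMMED, everything else ENQ. */
	uint64_t needed = (*enq_flags & ENQ_IMMED) ? CAP_ENQ_IMMED : CAP_ENQ;

	if ((*enq_flags & ENQ_PREEMPT) && victim_needs_cap)
		needed |= CAP_PREEMPT;

	/* requirements met */
	if (!(needed & ~caps_usable(held)))
		return TO_LOCAL;

	/* must run here: admit, but refuse the preemption */
	if (migration_disabled) {
		*enq_flags &= ~ENQ_PREEMPT;
		return TO_LOCAL_FORCED;
	}

	/* diverted to reject: neither IMMED nor PREEMPT is honored */
	*enq_flags &= ~(ENQ_IMMED | ENQ_PREEMPT);
	return TO_REJECT;
}
```

In this model a sched holding only ENQ can preempt inside its own subtree,
is rejected when the victim belongs to someone else, and is still admitted
without preemption when the inserted task is migration-disabled.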
 kernel/sched/ext/ext.c      | 34 +++++++++++++++++++++++++---------
 kernel/sched/ext/internal.h | 24 ++++++++++++++++++++++--
 kernel/sched/ext/sub.c      | 20 +++++++++++++++-----
 kernel/sched/ext/sub.h      | 15 +++++++++++++++
 4 files changed, 77 insertions(+), 16 deletions(-)
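
Note for readers: the delivery-path handling of kicks described in the
commit message can be summarized as a small decision function. The sketch
below is a userspace model, not the kernel implementation. kick_decide() and
the kick_outcome enum are hypothetical names. It assumes the effective caps
passed in already include implied caps, that the part of kick_one_cpu() not
shown in the hunk goes on to reschedule the target, and it leaves out the
cpu-online and higher-class checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Cap bits, same positions as enum scx_cap_flags in the patch. */
#define CAP_ENQ_IMMED	(1ULL << 0)
#define CAP_ENQ		(1ULL << 1)
#define CAP_PREEMPT	(1ULL << 2)
#define CAP_ALL		(CAP_ENQ_IMMED | CAP_ENQ | CAP_PREEMPT)
#define CAP_BASE	CAP_ENQ_IMMED	/* minimal cap to use a cpu at all */

/* Hypothetical outcome names for a delivered kick. */
enum kick_outcome { KICK_DROPPED, KICK_RESCHED, KICK_PREEMPT };

/* Cap needed to preempt the current task, 0 if none (cf. scx_caps_for_preempt()). */
static uint64_t caps_for_preempt(bool curr_is_ext, bool curr_in_subtree)
{
	/* a non-ext task can't be preempted by ext, own-subtree needs no cap */
	if (!curr_is_ext || curr_in_subtree)
		return 0;
	return CAP_PREEMPT;
}

/*
 * @ecaps: kicker's effective caps on the target cid (implied caps included)
 * @want_preempt: the kick carried SCX_KICK_PREEMPT
 */
static enum kick_outcome kick_decide(uint64_t ecaps, bool want_preempt,
				     bool curr_is_ext, bool curr_in_subtree)
{
	assert(!(ecaps & ~CAP_ALL));

	/* no baseline access: the kick is dropped entirely */
	if (!(ecaps & CAP_BASE))
		return KICK_DROPPED;

	/* plain kick, or nothing ext to evict: just reschedule */
	if (!want_preempt || !curr_is_ext)
		return KICK_RESCHED;

	/* denied preemption degrades to a plain reschedule */
	if (caps_for_preempt(curr_is_ext, curr_in_subtree) & ~ecaps)
		return KICK_RESCHED;

	return KICK_PREEMPT;
}
```

The degraded case is the one the patch counts in SCX_EV_SUB_PREEMPT_DENIED:
the target still reschedules, but the running task keeps its slice.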

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 7e521dc7e1b7..5a2c96bf8aa9 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -294,7 +294,7 @@ static bool u32_before(u32 a, u32 b)
  *
  * Test whether @sch is a descendant of @ancestor.
  */
-static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
+bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
 {
 	if (sch->level < ancestor->level)
 		return false;
@@ -4903,6 +4903,7 @@ SCX_ATTR(events);
 static const char *scx_cap_names[__SCX_NR_CAPS] = {
 	[__SCX_CAP_ENQ_IMMED]	= "enq_immed",
 	[__SCX_CAP_ENQ]		= "enq",
+	[__SCX_CAP_PREEMPT]	= "preempt",
 };
 
 static ssize_t scx_attr_caps_show(struct kobject *kobj,
@@ -7812,13 +7813,22 @@ static bool kick_one_cpu(s32 cpu, struct scx_sched_pcpu *pcpu, struct rq *this_r
 	 * During CPU hotplug, a CPU may depend on kicking itself to make
 	 * forward progress. Allow kicking self regardless of online state. If
 	 * @cpu is running a higher class task, we have no control over @cpu.
-	 * Skip kicking.
+	 * Skip kicking. A sub-sched lacking baseline access on @cid has no
+	 * business forcing a reschedule there - skip. This is the authoritative
+	 * cap check: ecaps is read here under @rq's lock.
 	 */
 	if ((cpu_online(cpu) || cpu == cpu_of(this_rq)) &&
-	    !sched_class_above(cur_class, &ext_sched_class)) {
+	    !sched_class_above(cur_class, &ext_sched_class) &&
+	    !scx_missing_caps(pcpu->sch, cpu, SCX_CAP_BASE)) {
 		if (cpumask_test_cpu(cpu, pcpu->cpus_to_preempt)) {
-			if (cur_class == &ext_sched_class)
-				set_task_slice(rq->curr, 0);
+			if (cur_class == &ext_sched_class) {
+				if (likely(!scx_missing_caps(pcpu->sch, cpu,
+							     scx_caps_for_preempt(pcpu->sch, rq))))
+					set_task_slice(rq->curr, 0);
+				else
+					__scx_add_event(pcpu->sch,
+							SCX_EV_SUB_PREEMPT_DENIED, 1);
+			}
 			cpumask_clear_cpu(cpu, pcpu->cpus_to_preempt);
 		}
 
@@ -7842,15 +7852,18 @@ static bool kick_one_cpu(s32 cpu, struct scx_sched_pcpu *pcpu, struct rq *this_r
 	return should_wait;
 }
 
-static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq)
+static void kick_one_cpu_if_idle(s32 cpu, struct scx_sched_pcpu *pcpu,
+				 struct rq *this_rq)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
 	raw_spin_rq_lock_irqsave(rq, flags);
 
+	/* idle kicks need baseline access too, see kick_one_cpu() */
 	if (!can_skip_idle_kick(rq) &&
-	    (cpu_online(cpu) || cpu == cpu_of(this_rq)))
+	    (cpu_online(cpu) || cpu == cpu_of(this_rq)) &&
+	    !scx_missing_caps(pcpu->sch, cpu, SCX_CAP_BASE))
 		resched_curr(rq);
 
 	raw_spin_rq_unlock_irqrestore(rq, flags);
@@ -7887,7 +7900,7 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 		}
 
 		for_each_cpu(cpu, pcpu->cpus_to_kick_if_idle) {
-			kick_one_cpu_if_idle(cpu, this_rq);
+			kick_one_cpu_if_idle(cpu, pcpu, this_rq);
 			cpumask_clear_cpu(cpu, pcpu->cpus_to_kick_if_idle);
 		}
 	}
@@ -8912,7 +8925,10 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux
  * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
  *
  * cid-addressed equivalent of scx_bpf_kick_cpu(). An invalid @cid aborts the
- * scheduler via scx_cid_to_cpu().
+ * scheduler via scx_cid_to_cpu(). Caps are enforced on the delivery path: a
+ * kick is dropped if the caller lacks baseline access on @cid, and a
+ * %SCX_KICK_PREEMPT degrades to a plain reschedule if the caller lacks
+ * %SCX_CAP_PREEMPT for a task outside its subtree.
  */
 __bpf_kfunc void scx_bpf_kick_cid(s32 cid, u64 flags, const struct bpf_prog_aux *aux)
 {
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 80913365e19a..6e2daf90a4ac 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1155,6 +1155,13 @@ struct scx_event_stats {
 	 * it can't be rejected. The violation is counted here.
 	 */
 	s64		SCX_EV_SUB_FORCED_ADMIT;
+
+	/*
+	 * The number of times a preempting kick was refused because the
+	 * sub-sched lacked SCX_CAP_PREEMPT for a task outside its subtree. The
+	 * kick degrades to a plain reschedule.
+	 */
+	s64		SCX_EV_SUB_PREEMPT_DENIED;
 };
 
 #define SCX_EVENTS_LIST(SCX_EVENT)					\
@@ -1173,7 +1180,8 @@ struct scx_event_stats {
 	SCX_EVENT(SCX_EV_BYPASS_ACTIVATE);				\
 	SCX_EVENT(SCX_EV_INSERT_NOT_OWNED);				\
 	SCX_EVENT(SCX_EV_SUB_BYPASS_DISPATCH);				\
-	SCX_EVENT(SCX_EV_SUB_FORCED_ADMIT)
+	SCX_EVENT(SCX_EV_SUB_FORCED_ADMIT);				\
+	SCX_EVENT(SCX_EV_SUB_PREEMPT_DENIED)
 
 struct scx_sched;
 
@@ -1287,22 +1295,33 @@ struct scx_sched_pnode {
  * the allocation pattern.
  *
  * ENQ_IMMED  insert an IMMED task onto the cid's local DSQ
+ *            - kick the cid's cpu (except SCX_KICK_PREEMPT)
  *
  * ENQ        insert any task onto the cid's local DSQ (implies ENQ_IMMED)
  *
+ * PREEMPT    preempt any task running on the cid regardless of the owning
+ *            sched (implies ENQ). Preempting a task in the sched's own subtree
+ *            doesn't require any cap.
+ *            - SCX_ENQ_PREEMPT inserts
+ *            - SCX_KICK_PREEMPT kicks
+ *
  * Implied caps apply to the holder's own use of a cid, not to delegation.
  * scx_bpf_sub_grant() delegates literally-held caps, so a cap held only through
- * implication is usable but cannot be re-delegated to a child.
+ * implication is usable but cannot be re-delegated to a child. When granting a
+ * cap, it usually makes sense to delegate its implied caps explicitly alongside
+ * it.
  */
 enum scx_cap_flags {
 	__SCX_CAP_ENQ_IMMED		= 0,
 	__SCX_CAP_ENQ			= 1,
+	__SCX_CAP_PREEMPT		= 2,
 
 	__SCX_NR_CAPS,
 	__SCX_CAP_ALL			= BIT_U64(__SCX_NR_CAPS) - 1,
 
 	SCX_CAP_ENQ_IMMED		= BIT_U64(__SCX_CAP_ENQ_IMMED),
 	SCX_CAP_ENQ			= BIT_U64(__SCX_CAP_ENQ),
+	SCX_CAP_PREEMPT			= BIT_U64(__SCX_CAP_PREEMPT),
 
 	/* alias for minimal cap to make any use of a cpu */
 	SCX_CAP_BASE			= SCX_CAP_ENQ_IMMED,
@@ -1911,6 +1930,7 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 					  struct scx_sched *parent);
 int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops);
 int scx_sched_sysfs_add(struct scx_sched *sch);
+bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor);
 
 extern raw_spinlock_t scx_sched_lock;
 extern struct mutex scx_enable_mutex;
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index 2f1e19db8e72..67ba352828e0 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -209,11 +209,14 @@ void scx_init_root_caps(struct scx_sched *sch)
  * @sch: enqueuing sub-sched
  * @rq: rq whose local DSQ @p targets
  * @p: task being inserted
- * @enq_flags: in/out; %SCX_ENQ_IMMED is cleared when diverting to reject
+ * @enq_flags: in/out, unhonored flags are cleared
  *
  * Return @rq's local DSQ if @sch holds the required caps on @rq's cid,
  * otherwise @rq's reject DSQ after recording the reenq reason on @p.
  *
+ * %SCX_ENQ_IMMED and %SCX_ENQ_PREEMPT are cleared when diverting to reject.
+ * %SCX_ENQ_PREEMPT is also cleared on a fallback migration-disabled admission.
+ *
  * Bypass doesn't need special-casing as a bypassing sched's tasks are enqueued
  * to and run by its nearest non-bypassing ancestor. If root is bypassing, it
  * always holds all caps.
@@ -222,7 +225,12 @@ struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq
 					       struct task_struct *p, u64 *enq_flags)
 {
 	s32 cid = __scx_cpu_to_cid(cpu_of(rq));
-	u64 missing = scx_missing_caps(sch, cpu_of(rq), scx_caps_for_enq(*enq_flags));
+	u64 needed = scx_caps_for_enq(*enq_flags);
+	u64 missing;
+
+	if (*enq_flags & SCX_ENQ_PREEMPT)
+		needed |= scx_caps_for_preempt(sch, rq);
+	missing = scx_missing_caps(sch, cpu_of(rq), needed);
 
 	/* requirements met */
 	if (likely(!missing))
@@ -230,10 +238,11 @@ struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq
 
 	/*
 	 * A migration-disabled task must run on this CPU. Let it run and count
-	 * the violation.
+	 * the violation. Refuse preemptions.
 	 */
 	if (unlikely(is_migration_disabled(p))) {
 		__scx_add_event(sch, SCX_EV_SUB_FORCED_ADMIT, 1);
+		*enq_flags &= ~SCX_ENQ_PREEMPT;
 		return &rq->scx.local_dsq;
 	}
 
@@ -243,9 +252,10 @@ struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq
 	/*
 	 * Only local DSQ can honor IMMED and dsq_inc_nr() WARNs on IMMED into
 	 * others. Strip both the enq flag and the sticky task flag - the
-	 * latter can carry in from an earlier admitted IMMED insert.
+	 * latter can carry in from an earlier admitted IMMED insert. Strip
+	 * PREEMPT too.
 	 */
-	*enq_flags &= ~SCX_ENQ_IMMED;
+	*enq_flags &= ~(SCX_ENQ_IMMED | SCX_ENQ_PREEMPT);
 	p->scx.flags &= ~SCX_TASK_IMMED;
 
 	return &rq->scx.reject_dsq;
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index 7d8c1632f58f..9f74c142b73f 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -118,10 +118,24 @@ static inline u64 scx_caps_for_task(struct task_struct *p)
 	return SCX_CAP_ENQ;
 }
 
+/* the cap @sch needs to preempt @rq's current task, 0 if none */
+static inline u64 scx_caps_for_preempt(struct scx_sched *sch, struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+
+	/* a non-ext task can't be preempted by ext, own-subtree needs no cap */
+	if (curr->sched_class != &ext_sched_class ||
+	    scx_is_descendant(scx_task_sched(curr), sch))
+		return 0;
+	return SCX_CAP_PREEMPT;
+}
+
 /* caps implied by holding @cap */
 static inline u64 scx_caps_implied(u64 cap)
 {
 	switch (cap) {
+	case SCX_CAP_PREEMPT:
+		return SCX_CAP_ENQ | SCX_CAP_ENQ_IMMED;
 	case SCX_CAP_ENQ:
 		return SCX_CAP_ENQ_IMMED;
 	}
@@ -141,6 +155,7 @@ static inline bool scx_task_can_stay_on_cpu(struct rq *rq, struct task_struct *p
 #else	/* CONFIG_EXT_SUB_SCHED */
 
 static inline u64 scx_missing_caps(struct scx_sched *sch, s32 cpu, u64 needed) { return 0; }
+static inline u64 scx_caps_for_preempt(struct scx_sched *sch, struct rq *rq) { return 0; }
 static inline bool scx_task_can_stay_on_cpu(struct rq *rq, struct task_struct *p) { return true; }
 
 #endif	/* CONFIG_EXT_SUB_SCHED */
-- 
2.54.0

