[PATCH AUTOSEL 6.17] sched_ext: defer queue_balance_callback() until after ops.dispatch

patches.lists.linux.dev archive mirror
 help / color / mirror / Atom feed

From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Emil Tsalapatis <etsal@meta.com>,
	Jakub Kicinski <kuba@kernel.org>,
	"Emil Tsalapatis (Meta)" <emil@etsalapatis.com>,
	Andrea Righi <arighi@nvidia.com>, Tejun Heo <tj@kernel.org>,
	Sasha Levin <sashal@kernel.org>,
	mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, sched-ext@lists.linux.dev,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject: [PATCH AUTOSEL 6.17] sched_ext: defer queue_balance_callback() until after ops.dispatch
Date: Mon,  3 Nov 2025 13:02:25 -0500	[thread overview]
Message-ID: <20251103180246.4097432-12-sashal@kernel.org> (raw)
In-Reply-To: <20251103180246.4097432-1-sashal@kernel.org>

From: Emil Tsalapatis <etsal@meta.com>

[ Upstream commit a8ad873113d3fe01f9b5d737d4b0570fa36826b0 ]

The sched_ext code calls queue_balance_callback() during enqueue_task()
to defer operations that drop multiple locks until we can unpin them.
The call assumes that the rq lock is held until the callbacks are
invoked, and the pending callbacks will not be visible to any other
threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock().

However, balance_one() may actually drop the lock during a BPF dispatch
call. Another thread may win the race to get the rq lock and see the
pending callback. To avoid this, sched_ext must only queue the callback
after the dispatch calls have completed.

CPU 0                   CPU 1           CPU 2

scx_balance()
  rq_unpin_lock()
  scx_balance_one()
    |= IN_BALANCE	scx_enqueue()
    ops.dispatch()
      rq_unlock()
                        rq_lock()
                        queue_balance_callback()
                        rq_unlock()
                                        [WARN] rq_pin_lock()
      rq_lock()
    &= ~IN_BALANCE
rq_repin_lock()

Changelog

v2-> v1 (https://lore.kernel.org/sched-ext/aOgOxtHCeyRT_7jn@gpd4)

- Fixed explanation in patch description (Andrea)
- Fixed scx_rq mask state updates (Andrea)
- Added Reviewed-by tag from Andrea

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

Based on my comprehensive analysis using semantic code analysis tools
and repository examination, here is my recommendation:

## **ANSWER: YES** (with critical caveat)

This commit **should be backported to stable kernel trees**, BUT the
follow-up fix commit a3c4a0a42e61a ("sched_ext: fix flag check for
deferred callbacks") **MUST be included together**, as the original
patch has a bug.

---

## **Extensive Analysis:**

### **1. Feature Availability Analysis**

Using git history analysis, I determined:
- **sched_ext was introduced in v6.12-rc1** (commit f0e1a0643a59b)
- **v6.11 and earlier kernels DO NOT have sched_ext** (verified with
  `git ls-tree`)
- **Only stable trees v6.12+ need this fix** (v6.12.x, v6.13.x, v6.16.x,
  v6.17.x all have kernel/sched/ext.c)

### **2. Semantic Code Analysis Using MCP Tools**

**Functions analyzed:**
- `mcp__semcode__find_function`: Located schedule_deferred(),
  balance_one(), balance_scx()
- `mcp__semcode__find_callers`: Traced call graph to understand impact
  scope

**Call chain discovered:**
```
Core scheduler → balance_scx (.balance callback)
  ↓
balance_one() [sets SCX_RQ_IN_BALANCE flag]
  ↓
ops.dispatch() [BPF scheduler callback - CAN DROP RQ LOCK]
  ↓
[RACE WINDOW - other CPUs can acquire lock]
  ↓
schedule_deferred() → queue_balance_callback()
  ↓
WARN_ON_ONCE() in rq_pin_lock() on CPU 2
```

**Impact scope:**
- schedule_deferred() called by: direct_dispatch()
- direct_dispatch() called by: do_enqueue_task()
- do_enqueue_task() called by: enqueue_task_scx, put_prev_task_scx,
  scx_bpf_reenqueue_local
- These are **core scheduler operations** triggered by normal task
  scheduling
- **User-space exposure**: Yes, any process using sched_ext can trigger
  this

### **3. Bug Severity Analysis**

**Race condition mechanism** (from commit message and code):
1. CPU 0: balance_one() sets IN_BALANCE flag, calls ops.dispatch()
2. ops.dispatch() **drops rq lock** during BPF execution
3. CPU 1: Acquires lock, calls schedule_deferred(), sees IN_BALANCE,
   queues callback
4. CPU 2: Calls rq_pin_lock(), sees pending callback → **WARN_ON_ONCE()
   triggers**

**Code reference** (kernel/sched/sched.h:1790-1797):
```c
static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
{
    rf->cookie = lockdep_pin_lock(__rq_lockp(rq));
    rq->clock_update_flags &= (RQCF_REQ_SKIP|RQCF_ACT_SKIP);
    rf->clock_update_flags = 0;
    WARN_ON_ONCE(rq->balance_callback &&
                 rq->balance_callback != &balance_push_callback);  // ←
VIOLATION
}
```

**Severity**: Medium-High
- Not a crash, but scheduler correctness issue
- Generates kernel warnings in logs
- Indicates inconsistent scheduler state
- Reported by Jakub Kicinski (well-known kernel developer)

### **4. Code Changes Analysis**

**Changes are minimal and focused:**
- kernel/sched/ext.c: +29 lines, -2 lines
- kernel/sched/sched.h: +1 line (new flag SCX_RQ_BAL_CB_PENDING)

**Behavioral change:**
- BEFORE: queue_balance_callback() called immediately when
  SCX_RQ_IN_BALANCE set
- AFTER: Set SCX_RQ_BAL_CB_PENDING flag, defer actual queuing until
  after ops.dispatch()
- NEW: maybe_queue_balance_callback() called after balance_one()
  completes

**No architectural changes:** Just timing adjustment to avoid race
window

### **5. Critical Follow-up Fix Required**

**Commit a3c4a0a42e61a analysis:**
```diff
- if (rq->scx.flags & SCX_RQ_BAL_PENDING)
+       if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING)
```

The original patch checks the **wrong flag** in schedule_deferred().
This means:
- Without the follow-up, multiple deferred operations could be queued
  incorrectly
- **Both commits must be backported together**

### **6. Stable Tree Compliance**

✅ **Fixes important bug**: Race condition causing kernel warnings
✅ **No new features**: Pure bug fix
✅ **No architectural changes**: Small, contained fix
✅ **Minimal regression risk**: Changes only affect sched_ext code path
✅ **Subsystem confined**: Only affects sched_ext subsystem

### **7. Backport Recommendation**

**YES**, backport to all stable trees with sched_ext (v6.12+), with
these requirements:

1. **MUST include both commits:**
   - a8ad873113d3 ("sched_ext: defer queue_balance_callback()")
   - a3c4a0a42e61a ("sched_ext: fix flag check for deferred callbacks")

2. **Target stable trees:** 6.12.x, 6.13.x, 6.14.x, 6.15.x, 6.16.x,
   6.17.x

3. **Not needed for:** v6.11.x and earlier (no sched_ext code)

**Reasoning:** This is a correctness fix for a real race condition in
core scheduler code that can be triggered by normal scheduling
operations when using BPF extensible schedulers. The fix is small,
contained, and follows all stable kernel rules.

 kernel/sched/ext.c   | 29 +++++++++++++++++++++++++++--
 kernel/sched/sched.h |  1 +
 2 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 040ca7419b4f9..b796ce247fffd 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -820,13 +820,23 @@ static void schedule_deferred(struct rq *rq)
 	if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
 		return;
 
+	/* Don't do anything if there already is a deferred operation. */
+	if (rq->scx.flags & SCX_RQ_BAL_PENDING)
+		return;
+
 	/*
 	 * If in balance, the balance callbacks will be called before rq lock is
 	 * released. Schedule one.
+	 *
+	 *
+	 * We can't directly insert the callback into the
+	 * rq's list: The call can drop its lock and make the pending balance
+	 * callback visible to unrelated code paths that call rq_pin_lock().
+	 *
+	 * Just let balance_one() know that it must do it itself.
 	 */
 	if (rq->scx.flags & SCX_RQ_IN_BALANCE) {
-		queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
-				       deferred_bal_cb_workfn);
+		rq->scx.flags |= SCX_RQ_BAL_CB_PENDING;
 		return;
 	}
 
@@ -2043,6 +2053,19 @@ static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
 	dspc->cursor = 0;
 }
 
+static inline void maybe_queue_balance_callback(struct rq *rq)
+{
+	lockdep_assert_rq_held(rq);
+
+	if (!(rq->scx.flags & SCX_RQ_BAL_CB_PENDING))
+		return;
+
+	queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
+				deferred_bal_cb_workfn);
+
+	rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
+}
+
 static int balance_one(struct rq *rq, struct task_struct *prev)
 {
 	struct scx_sched *sch = scx_root;
@@ -2190,6 +2213,8 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
 #endif
 	rq_repin_lock(rq, rf);
 
+	maybe_queue_balance_callback(rq);
+
 	return ret;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72fb9129afb6a..c7f67f54d4e3e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -782,6 +782,7 @@ enum scx_rq_flags {
 	SCX_RQ_BAL_KEEP		= 1 << 3, /* balance decided to keep current */
 	SCX_RQ_BYPASSING	= 1 << 4,
 	SCX_RQ_CLK_VALID	= 1 << 5, /* RQ clock is fresh and valid */
+	SCX_RQ_BAL_CB_PENDING	= 1 << 6, /* must queue a cb after dispatching */
 
 	SCX_RQ_IN_WAKEUP	= 1 << 16,
 	SCX_RQ_IN_BALANCE	= 1 << 17,
-- 
2.51.0

next prev parent reply	other threads:[~2025-11-03 18:03 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-03 18:02 [PATCH AUTOSEL 6.17-5.10] net: tls: Cancel RX async resync request on rcd_delta overflow Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17] sched_ext: Allocate scx_kick_cpus_pnt_seqs lazily using kvzalloc() Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17] perf/x86/intel/uncore: Add uncore PMU support for Wildcat Lake Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17-6.12] net: tls: Change async resync helpers argument Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17-6.1] bcma: don't register devices disabled in OF Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17-6.12] blk-crypto: use BLK_STS_INVAL for alignment errors Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17] drm/msm: Fix pgtable prealloc error path Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17] ALSA: hda/realtek: Add quirk for Lenovo Yoga 7 2-in-1 14AKP10 Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17-6.1] cifs: fix typo in enable_gcm_256 module parameter Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17] smb: client: handle lack of IPC in dfs_cache_refresh() Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17] ASoC: rt721: fix prepare clock stop failed Sasha Levin
2025-11-03 18:02 ` Sasha Levin [this message]
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17-5.4] kconfig/nconf: Initialize the default locale at startup Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17-5.10] scsi: core: Fix a regression triggered by scsi_host_busy() Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17-5.15] selftests: net: use BASH for bareudp testing Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17] ALSA: hda/realtek: Fix mute led for HP Victus 15-fa1xxx (MB 8C2D) Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17-6.6] x86/microcode/AMD: Limit Entrysign signature checking to known generations Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17] x86/CPU/AMD: Extend Zen6 model range Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17-5.4] kconfig/mconf: Initialize the default locale at startup Sasha Levin
2025-11-03 18:02 ` [PATCH AUTOSEL 6.17] selftests: cachestat: Fix warning on declaration under label Sasha Levin

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:040ca7419b4f dfblob:b796ce247fff dfblob:72fb9129afb6
dfblob:c7f67f54d4e3 )
 OR (
bs:"[PATCH AUTOSEL 6.17] sched_ext: defer queue_balance_callback() until after ops.dispatch" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251103180246.4097432-12-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=arighi@nvidia.com \
    --cc=bpf@vger.kernel.org \
    --cc=emil@etsalapatis.com \
    --cc=etsal@meta.com \
    --cc=juri.lelli@redhat.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=patches@lists.linux.dev \
    --cc=peterz@infradead.org \
    --cc=sched-ext@lists.linux.dev \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).