patches.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Tejun Heo <tj@kernel.org>, Andrea Righi <arighi@nvidia.com>,
	Sasha Levin <sashal@kernel.org>,
	changwoo@igalia.com, me@mostlynerdless.de, ggherdovich@suse.com,
	jameshongleiwang@126.com, sched-ext@lists.linux.dev,
	bpf@vger.kernel.org
Subject: [PATCH AUTOSEL 6.17-6.12] sched_ext: Make qmap dump operation non-destructive
Date: Mon,  6 Oct 2025 14:17:50 -0400	[thread overview]
Message-ID: <20251006181835.1919496-18-sashal@kernel.org> (raw)
In-Reply-To: <20251006181835.1919496-1-sashal@kernel.org>

From: Tejun Heo <tj@kernel.org>

[ Upstream commit d452972858e5cfa4262320ab74fe8f016460b96f ]

The qmap dump operation was destructively consuming queue entries while
displaying them. As dump can be triggered anytime, this can easily lead to
stalls. Add a temporary dump_store queue and modify the dump logic to pop
entries, display them, and then restore them back to the original queue.
This allows dump operations to be performed without affecting the
scheduler's queue state.

Note that if racing against new enqueues during dump, ordering can get
mixed up, but this is acceptable for debugging purposes.

Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Backport Analysis: YES

**This commit SHOULD be backported to stable kernel trees** (and
notably, it has already been backported to 6.17 stable as commit
2f64156df4204 by Sasha Levin on Oct 1, 2025).

### Detailed Analysis

#### 1. **Nature of the Bug (Critical Factor)**

The bug is **real and impactful**, not theoretical:

- **Location**: `tools/sched_ext/scx_qmap.bpf.c` lines 567-588
  (qmap_dump function)
- **Problem**: The dump operation uses `bpf_map_pop_elem(fifo, &pid)` to
  display queue contents but **never restores the entries**
- **Impact**: Tasks are permanently removed from scheduler queues,
  causing **system stalls**
- **Trigger**: Can be invoked at any time via SysRq-D (as documented in
  commit 07814a9439a3b) or during error exits

From the old code (lines 581-586):
```c
bpf_repeat(4096) {
    if (bpf_map_pop_elem(fifo, &pid))
        break;
    scx_bpf_dump(" %d", pid);  // ← Never restored!
}
```

This is a destructive read that removes tasks from the runnable queue,
effectively "losing" them from the scheduler.

#### 2. **The Fix is Simple and Safe**

The fix adds 17 insertions, 1 deletion (well under the 100-line limit):

- Adds one new queue map (`dump_store`) for temporary storage
- Modifies dump logic to: pop → store → display → restore
- Two `bpf_repeat` loops: first to pop and display, second to restore
- Low regression risk: only affects dump operations, not scheduling path

**Code changes at lines 579-600:**
```c
// First loop: pop from queue, save to dump_store, display
bpf_map_push_elem(&dump_store, &pid, 0);  // ← Save for restoration
scx_bpf_dump(" %d", pid);

// Second loop: restore from dump_store back to original queue
bpf_map_push_elem(fifo, &pid, 0);  // ← Restore to scheduler queue
```

#### 3. **Meets Stable Kernel Criteria**

Per `Documentation/process/stable-kernel-rules.rst`:

✅ **Already in mainline**: Upstream commit d452972858e5c
✅ **Obviously correct**: Simple save-restore pattern
✅ **Small size**: 41 total lines of diff
✅ **Fixes real bug**: Prevents stalls from destructive dump operations
✅ **User impact**: Anyone triggering dumps (SysRq-D, error exits) on
systems running scx_qmap would experience task loss

#### 4. **Why This Qualifies Despite Being in tools/**

While `tools/` changes are typically not backported, this case is
exceptional:

1. **BPF programs run in kernel space**: `scx_qmap.bpf.c` is not
   userspace tooling—it's a BPF program loaded into the kernel that
   implements actual scheduling decisions

2. **sched_ext schedulers are functional**: Although documented as
   "example schedulers" in the README (lines 6-15), they are
   **production-capable**. The README states: "Some of the examples are
   performant, production-ready schedulers" (line 11)

3. **Debugging is critical infrastructure**: The dump operation (added
   in commit 07814a9439a3b "Print debug dump after an error exit") is
   essential for debugging BPF scheduler failures. A broken dump that
   causes stalls defeats its purpose

4. **Already validated by stable maintainer**: Sasha Levin backported
   this on Oct 1, 2025, confirming it meets stable criteria

#### 5. **Historical Context**

- **sched_ext introduced**: v6.12-rc1 (commit f0e1a0643a59b)
- **Dump operations added**: June 18, 2024 (commit 07814a9439a3b)
- **Bug window**: ~15 months of potential stalls from dump operations
- **Fix date**: September 23, 2025 (upstream d452972858e5c)

#### 6. **No Security CVE, But Real Impact**

My search specialist agent found no CVE assigned to this issue, but that
doesn't diminish its importance:

- Stalls impact system availability
- Debugging a broken scheduler with a broken dump tool compounds
  problems
- Users investigating scheduler issues via SysRq-D would inadvertently
  cause more stalls

#### 7. **Risk Assessment**

**Regression risk**: **Very Low**
- Only modifies dump operations (debugging path)
- Does not touch scheduling hot paths
- Temporary storage pattern is standard and safe
- Race condition with concurrent enqueues is explicitly acceptable (per
  commit message: "ordering can get mixed up, but this is acceptable for
  debugging purposes")

**Benefit**: **High for affected users**
- Makes dump operations actually usable
- Prevents cascading failures during debugging
- Enables proper root cause analysis of scheduler issues

### Conclusion

**YES - This commit should be backported.** It fixes a real bug causing
system stalls, is small and safe, and affects functionality that users
rely on for debugging. The fact that it has already been accepted into
6.17 stable by Sasha Levin validates this assessment. This is an
appropriate stable backport that improves system reliability for users
of sched_ext schedulers.

 tools/sched_ext/scx_qmap.bpf.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 69d877501cb72..cd50a94326e3a 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -56,7 +56,8 @@ struct qmap {
   queue1 SEC(".maps"),
   queue2 SEC(".maps"),
   queue3 SEC(".maps"),
-  queue4 SEC(".maps");
+  queue4 SEC(".maps"),
+  dump_store SEC(".maps");
 
 struct {
 	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
@@ -578,11 +579,26 @@ void BPF_STRUCT_OPS(qmap_dump, struct scx_dump_ctx *dctx)
 			return;
 
 		scx_bpf_dump("QMAP FIFO[%d]:", i);
+
+		/*
+		 * Dump can be invoked anytime and there is no way to iterate in
+		 * a non-destructive way. Pop and store in dump_store and then
+		 * restore afterwards. If racing against new enqueues, ordering
+		 * can get mixed up.
+		 */
 		bpf_repeat(4096) {
 			if (bpf_map_pop_elem(fifo, &pid))
 				break;
+			bpf_map_push_elem(&dump_store, &pid, 0);
 			scx_bpf_dump(" %d", pid);
 		}
+
+		bpf_repeat(4096) {
+			if (bpf_map_pop_elem(&dump_store, &pid))
+				break;
+			bpf_map_push_elem(fifo, &pid, 0);
+		}
+
 		scx_bpf_dump("\n");
 	}
 }
-- 
2.51.0


  parent reply	other threads:[~2025-10-06 18:19 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-06 18:17 [PATCH AUTOSEL 6.17-5.4] x86/build: Remove cc-option from stack alignment flags Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.1] btrfs: zoned: refine extent allocator hint selection Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.1] arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] btrfs: abort transaction on specific error places when walking log tree Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-5.4] btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17] x86/bugs: Add attack vector controls for VMSCAPE Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.16] sched_ext: Keep bypass on between enable failure and scx_disable_workfn() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] perf/x86/intel: Add ICL_FIXED_0_ADAPTIVE bit into INTEL_FIXED_BITS_MASK Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] btrfs: abort transaction if we fail to update inode in log replay dir fixup Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.1] EDAC/mc_sysfs: Increase legacy channel support to 16 Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] btrfs: abort transaction in the process_one_buffer() log tree walk callback Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.6] btrfs: zoned: return error from btrfs_zone_finish_endio() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.6] perf: Skip user unwind if the task is a kernel thread Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.1] perf: Have get_perf_callchain() return NULL if crosstask and user are set Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.16] EDAC: Fix wrong executable file modes for C source files Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.6] perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.6] btrfs: use level argument in log tree walk callback replay_one_buffer() Sasha Levin
2025-10-06 18:17 ` Sasha Levin [this message]
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] seccomp: passthrough uprobe systemcall without filtering Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] cpuset: Use new excpus for nocpu error check when enabling root partition Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] btrfs: tree-checker: add inode extref checks Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17] sched/fair: update_cfs_group() for throttled cfs_rqs Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-5.15] btrfs: scrub: replace max_t()/min_t() with clamp() in scrub_throttle_dev_io() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-5.4] x86/bugs: Fix reporting of LFENCE retpoline Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-5.10] btrfs: always drop log root tree reference in btrfs_replay_log() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.6] audit: record fanotify event regardless of presence of rules Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17] EDAC/ie31200: Add two more Intel Alder Lake-S SoCs for EDAC support Sasha Levin
2025-10-06 18:18 ` [PATCH AUTOSEL 6.17-6.6] x86/bugs: Report correct retbleed mitigation status Sasha Levin
2025-10-06 21:55 ` [PATCH AUTOSEL 6.17-5.4] x86/build: Remove cc-option from stack alignment flags Nathan Chancellor
2025-10-06 23:13   ` Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251006181835.1919496-18-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=arighi@nvidia.com \
    --cc=bpf@vger.kernel.org \
    --cc=changwoo@igalia.com \
    --cc=ggherdovich@suse.com \
    --cc=jameshongleiwang@126.com \
    --cc=me@mostlynerdless.de \
    --cc=patches@lists.linux.dev \
    --cc=sched-ext@lists.linux.dev \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).