From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F11BD3C65FF; Mon, 20 Apr 2026 13:17:36 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776691057; cv=none; b=SSkx4J6P7WB0gTe3fnvZr+capJf08OIWRXiNvIujaBQG1gBAPUxd7GLlEI09IO0znQyg6IrNCDLObrNfyRHStUM0evWv7U3ha14iqE1ps//SGzE3exDOynyxc6cMX7ggf05t8iW2Bb6bNRA0KbPxHGWNyxSjvj3Jq8FDeUaTUMc= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776691057; c=relaxed/simple; bh=DL6OvkoEDX+5B552BSy/iHSkR0WkeMYjdeHssdVlyZg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=BQFuiaA97nm1XSHua5op+lPnjBsjaG1VxDaWQIWD/WuJVVkStgy63stOkHlnJfRqy5B9SAfPVC6sLh2t0zQ6X/LZ1Wojs/WShiHkU35hbMzdwY2Q/cpivaBPjEnSR/WkFCxjMEELZJt1OOcjLOisAEcKlSVItaYwgwA3nDQb74g= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=VLXwcGhg; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="VLXwcGhg" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 24A90C19425; Mon, 20 Apr 2026 13:17:35 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1776691056; bh=DL6OvkoEDX+5B552BSy/iHSkR0WkeMYjdeHssdVlyZg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=VLXwcGhgvHM0YqqobwEu/h8ExNS1HqLk8hwOgwSgCGAmG+vwtIKxGWhOM9HJzeuC+ FaGkmxxlQr44ys//4dzLVdg4nujsadTQbtRleMyqOUy8IIlwzBehGiLGxcFGEHyZxb O9QGzNHd9jNCi3WDjgVKmI3pAsIGpW1KLBQ+Gc2jm556HVCo7oNB7Mvs7mWV3+o6lr x3IsLfzP45/6L9UToggqh6mjDJttx+oBtAxZ+76qMxjKvm8VTpdI4MgJ+81FEfD7pj laFwtAbgRjyFu9uh5iSTzc592J+G9h+mRJCvo2CujNF39BfD+qnK7+/pI46AEn+TnW rcS7yWBuFN2vg== From: Sasha Levin To: patches@lists.linux.dev, stable@vger.kernel.org Cc: Dengjun Su , "Peter Zijlstra (Intel)" , Sasha Levin , mingo@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, linux-kernel@vger.kernel.org Subject: [PATCH AUTOSEL 7.0-6.1] sched: Fix incorrect schedstats for rt and dl thread Date: Mon, 20 Apr 2026 09:08:54 -0400 Message-ID: <20260420131539.986432-68-sashal@kernel.org> X-Mailer: git-send-email 2.53.0 In-Reply-To: <20260420131539.986432-1-sashal@kernel.org> References: <20260420131539.986432-1-sashal@kernel.org> Precedence: bulk X-Mailing-List: patches@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-stable: review X-Patchwork-Hint: Ignore X-stable-base: Linux 7.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit From: Dengjun Su [ Upstream commit c0e1832ba6dad7057acf3f485a87e0adccc23141 ] For RT and DL thread, only 'set_next_task_(rt/dl)' will call 'update_stats_wait_end_(rt/dl)' to update schedstats information. However, during the migration process, 'update_stats_wait_start_(rt/dl)' will be called twice, which will cause the values of wait_max and wait_sum to be incorrect. The specific output as follows: $ cat /proc/6046/task/6046/sched | grep wait wait_start : 0.000000 wait_max : 496717.080029 wait_sum : 7921540.776553 A complete schedstats information update flow of migrate should be __update_stats_wait_start() [enter queue A, stage 1] -> __update_stats_wait_end() [leave queue A, stage 2] -> __update_stats_wait_start() [enter queue B, stage 3] -> __update_stats_wait_end() [start running on queue B, stage 4] Stage 1: prev_wait_start is 0, and in the end, wait_start records the time of entering the queue. Stage 2: task_on_rq_migrating(p) is true, and wait_start is updated to the waiting time on queue A. Stage 3: prev_wait_start is the waiting time on queue A, wait_start is the time of entering queue B, and wait_start is expected to be greater than prev_wait_start. Under this condition, wait_start is updated to (the moment of entering queue B) - (the waiting time on queue A). Stage 4: the final wait time = (time when starting to run on queue B) - (time of entering queue B) + (waiting time on queue A) = waiting time on queue B + waiting time on queue A. The current problem is that stage 2 does not call __update_stats_wait_end to update wait_start, which causes the final computed wait time = waiting time on queue B + the moment of entering queue A, leading to incorrect wait_max and wait_sum. Add 'update_stats_wait_end_(rt/dl)' in 'update_stats_dequeue_(rt/dl)' to update schedstats information when dequeue_task. Signed-off-by: Dengjun Su Signed-off-by: Peter Zijlstra (Intel) Link: https://patch.msgid.link/20260204115959.3183567-1-dengjun.su@mediatek.com Signed-off-by: Sasha Levin --- LLM Generated explanations, may be completely bogus: Now I have all the information needed for a thorough analysis. --- ## PHASE 1: COMMIT MESSAGE FORENSICS ### Step 1.1: Subject Line - **Subsystem**: `sched` (core scheduler) - **Action verb**: "Fix" - **Summary**: Fixes incorrect schedstats wait_max and wait_sum for RT and DL tasks during migration Record: [sched] [fix] [incorrect schedstats for RT/DL threads during migration] ### Step 1.2: Tags - `Signed-off-by: Dengjun Su ` - author from MediaTek - `Signed-off-by: Peter Zijlstra (Intel) ` - scheduler maintainer accepted the patch - `Link: https://patch.msgid.link/20260204115959.3183567-1- dengjun.su@mediatek.com` - patch submission link - No Fixes: tag (expected for our review pipeline) - No Cc: stable tag (expected) - No Reported-by (author discovered it themselves) Record: Accepted by Peter Zijlstra, the primary scheduler maintainer. Single patch (not a series). ### Step 1.3: Commit Body Analysis The commit provides a detailed 4-stage explanation of how wait accounting should work during migration: 1. Enter queue A: `wait_start` = time of entering 2. Leave queue A (migration): `wait_start` should be updated to "wait time on A" 3. Enter queue B: `wait_start` adjusted by subtracting "wait time on A" 4. Start running on B: final wait = time on B + time on A **The bug**: Stage 2 is missing for RT/DL — `update_stats_wait_end` is not called during dequeue, so the raw timestamp from Stage 1 persists. This causes `__update_stats_wait_start` in Stage 3 to compute an absurdly large value (subtracting a timestamp from a timestamp, rather than a delta from a timestamp), resulting in wildly incorrect `wait_max` (496717ms) and `wait_sum` (7921540ms) values. Record: Bug is clearly described with a concrete demonstration of incorrect output. The root cause (missing `update_stats_wait_end` call during dequeue) is clearly identified. ### Step 1.4: Hidden Bug Fix? This is explicitly a bug fix, not disguised. ## PHASE 2: DIFF ANALYSIS ### Step 2.1: Inventory - `kernel/sched/rt.c`: +5 lines, -1 line (net +4) - `kernel/sched/deadline.c`: +3 lines, 0 removed (net +3) - Functions modified: `update_stats_dequeue_rt()`, `update_stats_dequeue_dl()` - Scope: Surgical fix to two parallel functions ### Step 2.2: Code Flow Change **RT (`update_stats_dequeue_rt`):** - Before: Only recorded sleep/block stats on `DEQUEUE_SLEEP` - After: Also calls `update_stats_wait_end_rt()` when `p != rq->curr` before the existing sleep/block handling **DL (`update_stats_dequeue_dl`):** - Before: Only recorded sleep/block stats on `DEQUEUE_SLEEP` - After: Also calls `update_stats_wait_end_dl()` when `p != rq->curr` before the existing sleep/block handling ### Step 2.3: Bug Mechanism This is a **logic/correctness bug** — the RT and DL scheduler classes were missing the `update_stats_wait_end` call in their `update_stats_dequeue` functions, which the fair scheduler class already has. Looking at the fair scheduler reference: ```1420:1446:kernel/sched/fair.c update_stats_dequeue_fair(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { if (!schedstat_enabled()) return; if (se != cfs_rq->curr) update_stats_wait_end_fair(cfs_rq, se); // <-- THIS WAS MISSING IN RT/DL if ((flags & DEQUEUE_SLEEP) && entity_is_task(se)) { // ... sleep/block stats } } ``` The fix adds the identical pattern to RT and DL. ### Step 2.4: Fix Quality - **Obviously correct**: Directly parallels the CFS implementation - **Minimal/surgical**: 7 net lines added across 2 files - **Regression risk**: Extremely low — calls `update_stats_wait_end_rt/dl` which is already an existing function; the `p != rq->curr` guard ensures it only operates on waiting tasks (not the running task). Furthermore, `__update_stats_wait_end` has built-in migration handling (sets `wait_start = delta` and returns early when `task_on_rq_migrating`). ## PHASE 3: GIT HISTORY ### Step 3.1: Blame Both `update_stats_dequeue_rt` and `update_stats_dequeue_dl` were entirely introduced by: - `57a5c2dafca8e` ("sched/rt: Support schedstats for RT sched class") - Sep 2021 - `b5eb4a5f6521d` ("sched/dl: Support schedstats for deadline sched class") - Sep 2021 Both merged in **v5.16-rc1**. The bug has been present since v5.16. ### Step 3.2: Fixes Target No explicit Fixes: tag, but the bug clearly traces to `57a5c2dafca8` and `b5eb4a5f6521d`. ### Step 3.3: File History No other commits have touched the `update_stats_dequeue_rt/dl` functions since their introduction (the DL server schedstat fix `9c602adb799e7` only changed `__schedstats_from_dl_se()` and the wait_start/wait_end wrappers, not `update_stats_dequeue_dl`). ### Step 3.4: Author Dengjun Su from MediaTek. First contribution to the scheduler subsystem. However, the patch is accepted by Peter Zijlstra, the scheduler maintainer, which is a strong quality signal. ### Step 3.5: Dependencies This patch is standalone. It only adds calls to existing functions (`update_stats_wait_end_rt`, `update_stats_wait_end_dl`) that are already present in all stable trees since v5.16. No prerequisite patches needed. Note: For stable trees < v6.12, the DL side might need minor adaptation because `9c602adb799e7` (v6.12) changed the DL schedstats wrapper pattern. In older trees, `update_stats_wait_end_dl` takes different parameter styles, but the fix concept is the same. ## PHASE 4: MAILING LIST b4 dig could not find the original submission. The `Link:` tag points to `patch.msgid.link` which serves lore content. The patch was accepted by Peter Zijlstra, the primary scheduler maintainer, which is a strong endorsement. ## PHASE 5: CODE SEMANTIC ANALYSIS ### Step 5.1: Modified Functions - `update_stats_dequeue_rt()` — called from `dequeue_rt_entity()` - `update_stats_dequeue_dl()` — called from `dequeue_task_dl()` (indirectly via `dequeue_dl_entity`) ### Step 5.2: Callers - `update_stats_dequeue_rt` is called from `dequeue_rt_entity()` (rt.c:1414), which is called from `dequeue_task_rt()` — this is the main RT task dequeue path, triggered on every RT task dequeue (migration, sleep, etc.) - `update_stats_dequeue_dl` is called from code in `deadline.c` at the `dequeue_dl_entity` path ### Step 5.4: Reachability The buggy code path is triggered whenever an RT or DL task migrates between CPUs. This is common in multi-core systems with RT workloads, especially when using CPU affinity changes or load balancing. ## PHASE 6: STABLE TREE ANALYSIS ### Step 6.1: Buggy Code in Stable The buggy code was introduced in v5.16. It exists in all active stable trees: 6.1.y, 6.6.y, 6.12.y, etc. ### Step 6.2: Backport Complications For the RT side: The `update_stats_dequeue_rt` function has been unchanged since v5.16 in all stable trees. The patch should apply cleanly. For the DL side: Commit `9c602adb799e7` (v6.12) changed the DL schedstat wrapper pattern. For stable trees < 6.12 (i.e., 6.6.y, 6.1.y), the `update_stats_dequeue_dl` function has `schedstat_enabled()` check inline rather than in the wrapper. A minor context adjustment may be needed, but the fix is conceptually identical. ### Step 6.3: No other fix for this bug exists in stable. ## PHASE 7: SUBSYSTEM CONTEXT ### Step 7.1: Subsystem - `kernel/sched/` — **CORE** subsystem (scheduler) - Scheduler statistics (`schedstat`) are used by performance monitoring tools and reported via `/proc/[pid]/sched` ### Step 7.2: Activity The scheduler is one of the most actively developed subsystems. ## PHASE 8: IMPACT AND RISK ### Step 8.1: Affected Users - All users running RT or DL tasks on multi-core systems with `schedstat` enabled - This affects monitoring/debugging workflows for real-time systems (embedded, audio, industrial) ### Step 8.2: Trigger Conditions - Task migration of RT/DL tasks (common on multi-core systems) - `schedstat` must be enabled (common for monitoring) ### Step 8.3: Failure Mode - **Incorrect statistics**: `wait_max` and `wait_sum` report wildly inflated values (shown: 496717ms wait_max) - Severity: **MEDIUM** — no crash or data corruption, but provides misleading scheduling statistics that could cause incorrect tuning decisions for real-time systems ### Step 8.4: Risk-Benefit - **Benefit**: Corrects misleading scheduler statistics for RT/DL tasks during migration. Important for real-time system monitoring. - **Risk**: Very low — 7 lines, follows exact CFS pattern, only affects statistics (not scheduling decisions), guarded by `schedstat_enabled()` and `p != rq->curr`. ## PHASE 9: FINAL SYNTHESIS ### Step 9.1: Evidence Summary **FOR backporting:** - Fixes a real, demonstrable bug (incorrect schedstat values shown in commit message) - Small, surgical fix (7 net lines) - Obviously correct — mirrors the existing fair scheduler pattern exactly - Accepted by Peter Zijlstra, the primary scheduler maintainer - Bug has existed since v5.16 (affects all active stable trees) - No dependencies on other patches - Very low regression risk (statistics-only code path) - Standalone single patch **AGAINST backporting:** - No crash/security/corruption — only incorrect statistics - No Reported-by (single author, may not be widely noticed) - Statistics correctness is a lower priority than crash fixes ### Step 9.2: Stable Rules Checklist 1. Obviously correct and tested? **YES** — mirrors CFS pattern, accepted by maintainer 2. Fixes a real bug? **YES** — incorrect schedstat values for RT/DL on migration 3. Important issue? **MEDIUM** — not a crash, but incorrect diagnostic data for real-time systems 4. Small and contained? **YES** — 7 lines across 2 files 5. No new features/APIs? **YES** — no new features 6. Can apply to stable? **YES** — standalone fix, may need minor context adjustment for < 6.12 ### Step 9.3: Exception Categories Not an exception category — this is a standard correctness bug fix. ### Step 9.4: Decision This is a clear, correct bug fix that affects scheduler statistics for RT and DL tasks. While not a crash or security fix, it corrects meaningfully wrong data that users rely on for real-time performance monitoring. The fix is tiny, obviously correct (mirrors CFS), accepted by the scheduler maintainer, and has essentially zero regression risk. It meets all stable kernel criteria. ## Verification - [Phase 1] Parsed tags: Signed-off-by Peter Zijlstra (scheduler maintainer), Link to patch.msgid.link - [Phase 2] Diff analysis: +5 lines in rt.c, +3 lines in deadline.c; adds `update_stats_wait_end_rt/dl()` calls matching CFS pattern - [Phase 2] Verified CFS reference: `update_stats_dequeue_fair` (fair.c:1420-1446) calls `update_stats_wait_end_fair` with `se != cfs_rq->curr` guard — RT/DL fix mirrors this exactly - [Phase 2] Verified `__update_stats_wait_end` (stats.c:21-46) has `task_on_rq_migrating` handling that preserves delta as `wait_start` — the mechanism described in commit message - [Phase 3] git blame: `update_stats_dequeue_rt` introduced by 57a5c2dafca8 (v5.16-rc1), untouched since - [Phase 3] git blame: `update_stats_dequeue_dl` introduced by b5eb4a5f6521 (v5.16-rc1), untouched since - [Phase 3] git describe: Both original commits in v5.16-rc1 — bug present in all stable trees (6.1.y, 6.6.y, 6.12.y) - [Phase 3] Author check: Dengjun Su has no prior scheduler commits, but patch accepted by PeterZ - [Phase 4] b4 dig failed to find submission; lore.kernel.org blocked by anti-scraping - [Phase 5] Callers: `update_stats_dequeue_rt` called from `dequeue_rt_entity()` (verified at rt.c:1414); `update_stats_dequeue_dl` called during DL dequeue path - [Phase 6] 9c602adb799e7 (v6.12) changed DL wrapper pattern — may need minor backport adjustment for 6.6.y/6.1.y - [Phase 8] Failure mode: incorrect wait_max/wait_sum in /proc/[pid]/sched, severity MEDIUM **YES** kernel/sched/deadline.c | 4 ++++ kernel/sched/rt.c | 7 ++++++- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 674de6a48551b..4c882d1e359b5 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -2142,10 +2142,14 @@ update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se, int flags) { struct task_struct *p = dl_task_of(dl_se); + struct rq *rq = rq_of_dl_rq(dl_rq); if (!schedstat_enabled()) return; + if (p != rq->curr) + update_stats_wait_end_dl(dl_rq, dl_se); + if ((flags & DEQUEUE_SLEEP)) { unsigned int state; diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index f69e1f16d9238..3d823f5ffe2c8 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1302,13 +1302,18 @@ update_stats_dequeue_rt(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se, int flags) { struct task_struct *p = NULL; + struct rq *rq = rq_of_rt_rq(rt_rq); if (!schedstat_enabled()) return; - if (rt_entity_is_task(rt_se)) + if (rt_entity_is_task(rt_se)) { p = rt_task_of(rt_se); + if (p != rq->curr) + update_stats_wait_end_rt(rt_rq, rt_se); + } + if ((flags & DEQUEUE_SLEEP) && p) { unsigned int state; -- 2.53.0