patches.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Aaron Lu <ziqianlu@bytedance.com>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Sasha Levin <sashal@kernel.org>,
	mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, linux-kernel@vger.kernel.org
Subject: [PATCH AUTOSEL 6.17] sched/fair: update_cfs_group() for throttled cfs_rqs
Date: Mon,  6 Oct 2025 14:17:54 -0400	[thread overview]
Message-ID: <20251006181835.1919496-22-sashal@kernel.org> (raw)
In-Reply-To: <20251006181835.1919496-1-sashal@kernel.org>

From: Aaron Lu <ziqianlu@bytedance.com>

[ Upstream commit fcd394866e3db344cbe0bb485d7e3f741ac07245 ]

With task based throttle model, tasks in a throttled hierarchy are
allowed to continue to run if they are running in kernel mode. For this
reason, PELT clock is not stopped for these cfs_rqs in throttled
hierarchy when they still have tasks running or queued.

Since PELT clock is not stopped, whether to allow update_cfs_group()
doing its job for cfs_rqs which are in throttled hierarchy but still
have tasks running/queued is a question.

The good side is, continue to run update_cfs_group() can get these
cfs_rq entities with an up2date weight and that up2date weight can be
useful to derive an accurate load for the CPU as well as ensure fairness
if multiple tasks of different cgroups are running on the same CPU.
OTOH, as Benjamin Segall pointed: when unthrottle comes around the most
likely correct distribution is the distribution we had at the time of
throttle.

In reality, either way may not matter that much if tasks in throttled
hierarchy don't run in kernel mode for too long. But in case that
happens, let these cfs_rq entities have an up2date weight seems a good
thing to do.

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## Backport Status: YES (with critical dependencies)

### Executive Summary

This commit **should be backported** to stable kernel trees, but **only
as part of the complete task-based throttle model series**. In fact,
this commit has **already been backported** to Linux 6.17 stable tree
(commit 4ee521004f8ef) by stable maintainer Sasha Levin, which is strong
evidence of its suitability for stable backporting.

### Code Change Analysis

The commit makes a minimal code change in `kernel/sched/fair.c`,
removing 3 lines from `update_cfs_group()`:

```c
- if (throttled_hierarchy(gcfs_rq))
- return;
-
```

This removal allows `update_cfs_group()` to continue updating group
entity weights even for cgroups in throttled hierarchies. Previously,
line 3960-3961 would cause an early return, preventing weight
recalculation for any throttled cfs_rq.

### Context and Dependencies

**Critical Finding**: This commit is **NOT standalone**. It is part 5 of
a 7-commit series implementing the task-based throttle model:

1. **e1fad12dcb66b** - "Switch to task based throttle model" (341 line
   change - the base)
2. **eb962f251fbba** - "Task based throttle time accounting"
3. **5b726e9bf9544** - "Get rid of throttled_lb_pair()"
4. **fe8d238e646e1** - "Propagate load for throttled cfs_rq"
5. **fcd394866e3db** - "update_cfs_group() for throttled cfs_rqs" ←
   **This commit**
6. **253b3f5872419** - "Do not special case tasks in throttled
   hierarchy" (follow-up fix)
7. **0d4eaf8caf8cd** - "Do not balance task to a throttled cfs_rq"
   (follow-up performance fix)

All 7 commits were backported together to Linux 6.17 stable tree.

### Why This Change Is Necessary

Under the **old throttle model**: When a cfs_rq was throttled, its
entity was dequeued from the CPU's runqueue, preventing all tasks from
running. The PELT clock stopped, so updating group weights was
unnecessary and prevented by the `throttled_hierarchy()` check at line
3960.

Under the **new task-based throttle model** (introduced by commit
e1fad12dcb66b):
- Tasks in throttled hierarchies **continue running if in kernel mode**
- PELT clock **remains active** while throttled tasks still run/queue
- The `throttled_hierarchy()` check at line 3960 becomes **incorrect** -
  it prevents weight updates even though PELT is still running

**The fix**: Remove lines 3960-3961 to allow `calc_group_shares()` (line
3963) and `reweight_entity()` (line 3965) to execute, giving throttled
cfs_rq entities up-to-date weights for accurate CPU load calculation and
cross-cgroup fairness.

### Benefits and Trade-offs

**Benefits** (from commit message):
- Up-to-date weights enable accurate CPU load derivation
- Ensures fairness when multiple tasks from different cgroups run on
  same CPU
- Prevents stale weight values during extended kernel-mode execution

**Trade-offs** (acknowledged in commit):
- As Benjamin Segall noted: "the most likely correct distribution is the
  distribution we had at the time of throttle"
- May not matter much if tasks don't run in kernel mode for long periods
- Performance tuning was needed (see follow-up commit 0d4eaf8caf8cd
  which addresses hackbench regression by preventing load balancing to
  throttled cfs_rqs)

### What Problems Does This Solve?

The base task-based throttle model (e1fad12dcb66b) solves a **real
bug**: With the old model, a task holding a percpu_rwsem as reader in a
throttled cgroup couldn't run until the next period, causing:
- Writers waiting longer
- Reader build-up
- **Task hung warnings**

This specific commit ensures the new model works correctly by keeping
weight calculations accurate during kernel-mode execution of throttled
tasks.

### Risk Assessment

**Low to Medium Risk** for the following reasons:

**Mitigating factors**:
- Small code change (3 lines removed)
- Already backported to 6.17 stable by experienced maintainer
- Well-tested by multiple developers (Valentin Schneider, Chen Yu,
  Matteo Martelli, K Prateek Nayak)
- Part of thoroughly reviewed patch series linked at
  https://lore.kernel.org/r/20250829081120.806-4-ziqianlu@bytedance.com

**Risk factors**:
- Modifies core scheduler behavior in subtle ways
- Requires entire series (cannot be cherry-picked alone)
- Follow-up performance fixes needed (commit 0d4eaf8caf8cd mentions AMD
  Genoa performance degradation with hackbench that required additional
  checks)
- Affects PELT weight calculations during throttling edge cases

**No evidence of**:
- Reverts
- CVE assignments
- Major regression reports
- Security implications

### Backporting Requirements

If backporting to stable trees **without** the task-based throttle
model:

**DO NOT BACKPORT** - This commit will break things. The
`throttled_hierarchy()` check at line 3960 exists for a reason in the
old throttle model where PELT clocks stop on throttle.

If backporting to stable trees **with** the task-based throttle model:

**MUST BACKPORT** as part of the complete series:
1. Base commit e1fad12dcb66b (341 lines - major change)
2. Commits eb962f251fbba, 5b726e9bf9544, fe8d238e646e1
3. **This commit** (fcd394866e3db)
4. **Follow-up fixes** 253b3f5872419 and 0d4eaf8caf8cd

### Stable Tree Rules Compliance

- ✅ **Fixes important bugs**: Yes (task hung due to percpu_rwsem
  interactions)
- ✅ **Relatively small change**: Yes for this commit (3 lines), but
  series is large
- ✅ **Minimal side effects**: When backported with complete series
- ❌ **No major architectural changes**: No - this IS part of a major
  architectural change
- ✅ **Clear benefits**: Yes - prevents task hangs, improves fairness
- ⚠️ **Explicit stable tag**: No "Cc: stable" tag, but manually selected
  by stable maintainer
- ✅ **Minimal regression risk**: When backported with complete series
  and follow-ups

### Recommendation

**YES - Backport this commit**, with the following requirements:

1. **MUST include the entire task-based throttle series** (commits 1-7
   listed above)
2. **MUST include follow-up performance fixes** (especially
   0d4eaf8caf8cd)
3. **Target kernel version**: 6.17+ (already done) or newer LTS versions
   planning major scheduler updates
4. **Not suitable for**: Older stable trees without appetite for the
   341-line base architectural change

The fact that Sasha Levin already backported this entire series to 6.17
stable is the strongest indicator this is appropriate for stable
backporting.

 kernel/sched/fair.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f9..eea0b6571af5a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3957,9 +3957,6 @@ static void update_cfs_group(struct sched_entity *se)
 	if (!gcfs_rq || !gcfs_rq->load.weight)
 		return;
 
-	if (throttled_hierarchy(gcfs_rq))
-		return;
-
 	shares = calc_group_shares(gcfs_rq);
 	if (unlikely(se->load.weight != shares))
 		reweight_entity(cfs_rq_of(se), se, shares);
-- 
2.51.0


  parent reply	other threads:[~2025-10-06 18:19 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-06 18:17 [PATCH AUTOSEL 6.17-5.4] x86/build: Remove cc-option from stack alignment flags Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.1] btrfs: zoned: refine extent allocator hint selection Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.1] arch: Add the macro COMPILE_OFFSETS to all the asm-offsets.c Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] btrfs: abort transaction on specific error places when walking log tree Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-5.4] btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17] x86/bugs: Add attack vector controls for VMSCAPE Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.16] sched_ext: Keep bypass on between enable failure and scx_disable_workfn() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] perf/x86/intel: Add ICL_FIXED_0_ADAPTIVE bit into INTEL_FIXED_BITS_MASK Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] btrfs: abort transaction if we fail to update inode in log replay dir fixup Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.1] EDAC/mc_sysfs: Increase legacy channel support to 16 Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] btrfs: abort transaction in the process_one_buffer() log tree walk callback Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.6] btrfs: zoned: return error from btrfs_zone_finish_endio() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.6] perf: Skip user unwind if the task is a kernel thread Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.1] perf: Have get_perf_callchain() return NULL if crosstask and user are set Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.16] EDAC: Fix wrong executable file modes for C source files Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.6] perf: Use current->flags & PF_KTHREAD|PF_USER_WORKER instead of current->mm == NULL Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.6] btrfs: use level argument in log tree walk callback replay_one_buffer() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] sched_ext: Make qmap dump operation non-destructive Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] seccomp: passthrough uprobe systemcall without filtering Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] cpuset: Use new excpus for nocpu error check when enabling root partition Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.12] btrfs: tree-checker: add inode extref checks Sasha Levin
2025-10-06 18:17 ` Sasha Levin [this message]
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-5.15] btrfs: scrub: replace max_t()/min_t() with clamp() in scrub_throttle_dev_io() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-5.4] x86/bugs: Fix reporting of LFENCE retpoline Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-5.10] btrfs: always drop log root tree reference in btrfs_replay_log() Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17-6.6] audit: record fanotify event regardless of presence of rules Sasha Levin
2025-10-06 18:17 ` [PATCH AUTOSEL 6.17] EDAC/ie31200: Add two more Intel Alder Lake-S SoCs for EDAC support Sasha Levin
2025-10-06 18:18 ` [PATCH AUTOSEL 6.17-6.6] x86/bugs: Report correct retbleed mitigation status Sasha Levin
2025-10-06 21:55 ` [PATCH AUTOSEL 6.17-5.4] x86/build: Remove cc-option from stack alignment flags Nathan Chancellor
2025-10-06 23:13   ` Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251006181835.1919496-22-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=patches@lists.linux.dev \
    --cc=peterz@infradead.org \
    --cc=stable@vger.kernel.org \
    --cc=vincent.guittot@linaro.org \
    --cc=ziqianlu@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).