From: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
To: Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
Tim C Chen <tim.c.chen@linux.intel.com>,
Chen Yu <yu.c.chen@intel.com>,
Christian Loehle <christian.loehle@arm.com>,
Barry Song <baohua@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>,
Len Brown <lenb@kernel.org>,
ricardo.neri@intel.com, linux-kernel@vger.kernel.org,
Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Subject: [PATCH v2 0/4] sched: Fix cluster scheduling in the presence of asymmetric capacity
Date: Wed, 29 Apr 2026 14:19:43 -0700
Message-ID: <20260429-rneri-fix-cas-clusters-v2-0-cd787de35cc6@linux.intel.com>
Cluster scheduling aims to maximize performance by spreading load across
clusters of CPUs that share mid-level resources [1]. It works well on
uniform systems, but it breaks down on topologies with big and small
cores arranged in clusters. As a result, it fails on several generations
of Intel processors, both shipped and upcoming.
Consider the topology below of big (B) cores and clusters of small (s)
cores.
 ------   ------
|  B   | |  B   |  -----------------  -----------------
|      | |      |  | s | s | s | s |  | s | s | s | s |
 ------   ------   -----------------  -----------------
|  L2  | |  L2  |  |      L2       |  |      L2       |
-------------------------------------------------------
|                         L3                          |
-------------------------------------------------------
On a partially busy system (one with idle CPUs; busy CPUs have one task
each), scheduling for asymmetric capacity ensures that misfit tasks land on
the big CPUs. The remaining tasks, misfit or not, run on the small CPUs.
When CONFIG_SCHED_CLUSTER is enabled, these remaining tasks are supposed to
be evenly spread among the small-CPU clusters. Today, this does not
happen.
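As a rough illustration of what "evenly spread" means here, the sketch
below computes how many tasks a less busy small-CPU cluster would need to
pull from a busier sibling cluster of equal capacity. It is a toy
user-space model, not kernel code: toy_tasks_to_pull() and its arguments
are hypothetical names, and the real balancer weighs many more factors.

```c
#include <assert.h>

/*
 * Toy model (hypothetical helper, not a kernel function): number of
 * tasks to move from the busier of two equal-capacity clusters to the
 * less busy one so that their task counts differ by at most one.
 */
static unsigned int toy_tasks_to_pull(unsigned int local_nr_running,
				      unsigned int busiest_nr_running)
{
	unsigned int diff;

	/* Nothing to do if the "busiest" cluster is not actually busier. */
	if (busiest_nr_running <= local_nr_running)
		return 0;

	diff = busiest_nr_running - local_nr_running;

	/* A difference of one task cannot be improved by migrating. */
	if (diff <= 1)
		return 0;

	/* Moving half the difference evens out the two clusters. */
	return diff / 2;
}
```

For two four-CPU clusters running 0 and 4 tasks, toy_tasks_to_pull(0, 4)
yields 2, which leaves two tasks in each cluster.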
Several issues in the load balancer prevent a small CPU in one cluster
from pulling tasks from another:
  a) update_sd_pick_busiest() may select a fully_busy group with higher
     per-CPU capacity as the busiest, preventing a subsequent fully_busy
     group of equal capacity from being correctly selected.
  b) Misfit-load statistics are used to identify tasks that would benefit
     from migrating to bigger CPUs. Accounting misfit load is pointless if
     the destination CPU is equally small, and it also blocks balancing
     between clusters.
  c) Due to b), groups that are truly has_spare or fully_busy get
     misclassified as misfit_task. update_sd_pick_busiest() then skips
     them, since a small destination CPU cannot help with misfit tasks.
  d) Once a busiest group has been identified, sched_balance_find_src_rq()
     will refuse to migrate tasks to CPUs of equal capacity, even when
     doing so is precisely what is required to balance small-CPU clusters.
  e) The SD_PREFER_SIBLING flag is missing from scheduling domains with
     asymmetric capacity, preventing the balancer from equalizing load
     across sibling small-core clusters.
Together, these issues prevent cluster-level balancing on systems with
asymmetric CPU capacity.
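To make issues b) and c) concrete, here is a toy user-space model of group
classification as seen from a destination CPU. It is a sketch under stated
assumptions, not the code in kernel/sched/fair.c: struct toy_group,
toy_classify() and the three-way enum are hypothetical, the kernel has more
group types and conditions, and it compares capacities with a margin
rather than the plain ">" used here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy subset of the group classifications discussed above. */
enum toy_group_type { TOY_HAS_SPARE, TOY_FULLY_BUSY, TOY_MISFIT_TASK };

/* Hypothetical, simplified view of one scheduling group. */
struct toy_group {
	unsigned int nr_cpus;
	unsigned int nr_running;	/* runnable tasks in the group */
	unsigned long cpu_capacity;	/* per-CPU capacity, 1024 = biggest */
	bool has_misfit;		/* some task does not fit its CPU */
};

/*
 * Classify a group from the point of view of a destination CPU.
 * Misfit load only counts when the destination is bigger than the
 * group's CPUs, i.e. when pulling the task could actually help.
 * Otherwise the group is classified by how busy it is.
 */
static enum toy_group_type toy_classify(const struct toy_group *g,
					unsigned long dst_capacity)
{
	assert(g->nr_cpus > 0);

	if (g->has_misfit && dst_capacity > g->cpu_capacity)
		return TOY_MISFIT_TASK;
	if (g->nr_running >= g->nr_cpus)
		return TOY_FULLY_BUSY;
	return TOY_HAS_SPARE;
}
```

In this model, a four-CPU group of capacity 512 that runs four tasks, one
of them misfit, is fully busy for a destination CPU of capacity 512 and
misfit_task only for a destination of capacity 1024. A small destination
CPU therefore still sees the group as a candidate for cluster balancing.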
This series addresses each problem and restores the intended behavior.
Details, rationale, and code changes are explained in each patch.
I tested these patches on Alder Lake (with Hyper-Threading disabled),
Lunar Lake and Panther Lake. I also tested configurations with only one
CPU online per cluster to ensure that systems without cluster topology
continue to behave as expected.
Link: https://lore.kernel.org/r/20210924085104.44806-1-21cnbao@gmail.com/ [1]
Changes in v2:
- Patch 1: Rewrote patch description for clarity. Added a note
  clarifying that SD_ASYM_CPUCAPACITY and SMT are mutually
  exclusive. (Tim)
- Patch 2: Fixed a bug where the capacity check inadvertently broke
  the mutual exclusion of the sched_reduced_capacity() path. Keep
  marking the root domain as overloaded when misfit tasks are present
  to allow bigger CPUs to help via newly idle balance. (sashiko)
  Fixed the description to state that capacity_greater() looks for
  differences of ~5% or more, not 20%. (Christian)
- Patch 3: Use arch_scale_cpu_capacity() instead of capacity_of() to
  ignore runtime capacity variability. Inverted the capacity check.
  (Christian)
- Patch 4: Reworded the patch description for clarity.
- Link to v1: https://lore.kernel.org/r/20260330-rneri-fix-cas-clusters-v1-0-1e465b6fecb2@linux.intel.com/
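
For reference, the ~5% margin mentioned for capacity_greater() can be
checked in user space with the sketch below. The macro body is reproduced
from memory of mainline kernel/sched/fair.c, so treat the exact constants
as an assumption and verify them against the tree you build.
toy_dst_is_bigger() is a hypothetical wrapper added only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Margin-based comparison, assumed to match kernel/sched/fair.c:
 * cap1 must exceed cap2 by roughly 5% (1078/1024) to count as greater.
 */
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

/* Hypothetical wrapper: is the destination CPU noticeably bigger? */
static bool toy_dst_is_bigger(unsigned long dst_cap, unsigned long src_cap)
{
	return capacity_greater(dst_cap, src_cap);
}
```

With these constants, a destination of capacity 1024 is not considered
bigger than a source of capacity 1024 or 1000, but it is considered bigger
than a source of capacity 970.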
---
Ricardo Neri (4):
      sched/fair: Check CPU capacity before comparing group types during load balance
      sched/fair: Skip misfit load accounting when the destination CPU cannot help
      sched/fair: Allow load balancing between CPUs of identical capacity
      sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters

 kernel/sched/fair.c     | 52 +++++++++++++++++++++++++++++++++++--------------
 kernel/sched/topology.c | 11 +++++++++--
 2 files changed, 46 insertions(+), 17 deletions(-)
---
base-commit: 8f1aacb683ef4a49b83dcc40bfce022aaa4aa597
change-id: 20250620-rneri-fix-cas-clusters-bb4287d1e152
Best regards,
--
Ricardo Neri <ricardo.neri-calderon@linux.intel.com>