* [PATCH v5 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity
@ 2026-06-23  0:05 Ricardo Neri
From: Ricardo Neri @ 2026-06-23  0:05 UTC
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Tim C Chen, Chen Yu, Christian Loehle,
	K Prateek Nayak, Barry Song
  Cc: Rafael J. Wysocki, Andrea Righi, Len Brown, ricardo.neri,
	linux-kernel, Ricardo Neri, Vincent Guittot

Hi,

This is v5 of the patch series. The only changes are optimizations that
perform the same-capacity and busy-SMT-sibling checks only when needed.
Please read the changelog for details.

Cluster scheduling aims to maximize performance by spreading load across
clusters of CPUs that share mid-level resources [2]. It works well on
uniform systems, but it breaks down on topologies with big and small
cores arranged in clusters. As a result, it fails on several generations
of Intel processors, both already shipped and upcoming.

Consider the topology below of big (B) cores and clusters of small (s)
cores.
         ------   ------
         | B  |   | B  |   -----------------   -----------------
         |    |   |    |   | s | s | s | s |   | s | s | s | s |
         ------   ------   -----------------   -----------------
         | L2 |   | L2 |   |      L2       |   |       L2      |
         -------------------------------------------------------
         |                          L3                         |
         -------------------------------------------------------

On a partially busy system (one with idle CPUs; busy CPUs have one task
each), scheduling for asymmetric capacity ensures that misfit tasks land on
the big CPUs. The remaining tasks, misfit or not, run on the small CPUs.
When CONFIG_SCHED_CLUSTER is enabled, these remaining tasks are supposed to
be evenly spread among the small-CPU clusters. Today, this does not
happen.
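
To make the intended outcome concrete, here is a toy userspace model of
the placement described above. It is not kernel code. The constants match
the diagram (2 big CPUs, 2 clusters of 4 small CPUs), and the names
struct placement and intended_placement() are made up for illustration.
It assumes big CPUs host only misfit tasks and that every busy CPU runs
exactly one task.

```c
#define NR_BIG       2	/* big CPUs in the diagram */
#define NR_CLUSTERS  2	/* small-CPU clusters in the diagram */
#define CLUSTER_SIZE 4	/* small CPUs per cluster */

/* Hypothetical summary of where tasks end up. */
struct placement {
	unsigned int on_big;			/* tasks on big CPUs */
	unsigned int on_cluster[NR_CLUSTERS];	/* tasks per small-CPU cluster */
};

/*
 * Toy model of the intended steady state on a partially busy system:
 * misfit tasks fill the big CPUs first; every remaining task, misfit or
 * not, is dealt round-robin across the small-CPU clusters.
 */
struct placement intended_placement(unsigned int nr_misfit,
				    unsigned int nr_other)
{
	struct placement p = { 0 };
	unsigned int remaining, i;

	p.on_big = nr_misfit < NR_BIG ? nr_misfit : NR_BIG;
	remaining = (nr_misfit - p.on_big) + nr_other;

	/*
	 * More tasks than small CPUs is no longer the partially busy
	 * case; the model ignores the excess.
	 */
	if (remaining > NR_CLUSTERS * CLUSTER_SIZE)
		remaining = NR_CLUSTERS * CLUSTER_SIZE;

	for (i = 0; i < remaining; i++)
		p.on_cluster[i % NR_CLUSTERS]++;

	return p;
}
```

With 3 misfit tasks and 3 other tasks, the model puts 2 tasks on the big
CPUs and 2 tasks on each small-CPU cluster. The behavior reported in this
cover letter is that the last step does not happen: tasks stay packed in
one cluster.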

Several issues in the load balancer prevent a small CPU in one cluster
from pulling tasks from another:

 a) update_sd_pick_busiest() may select a fully_busy group with higher
    per-CPU capacity as the busiest, preventing a subsequent fully_busy
    group of equal capacity from being correctly selected.
 b) Misfit-load statistics are used to identify tasks that would benefit
    from migrating to bigger CPUs. Accounting misfit load is pointless if
    the destination CPU is equally small, and it also blocks balancing
    between clusters.
 c) Due to b), groups that are truly has_spare or fully_busy get
    misclassified as misfit_task. update_sd_pick_busiest() then skips
    them, since a small destination CPU cannot help with misfit tasks.
 d) Once a busiest group has been identified, sched_balance_find_src_rq()
    will refuse to migrate tasks to CPUs of equal capacity, even when
    doing so is precisely what is required to balance small-CPU clusters.
 e) The SD_PREFER_SIBLING flag is missing from scheduling domains with
    asymmetric capacity, preventing the balancer from equalizing load
    across sibling small-core clusters.

Together, these issues prevent cluster-level balancing on systems with
asymmetric CPU capacity.
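
As an illustration of issue d), the sketch below models the capacity
check in userspace. The capacity_greater() macro is the one from
kernel/sched/fair.c. The two functions are hypothetical simplifications
written for this example: skip_src_rq_today() condenses the existing
condition for domains with SD_ASYM_CPUCAPACITY, and
skip_src_rq_relaxed() shows one possible relaxation. It is not the code
in patch 5.

```c
#include <stdbool.h>

/* Is cap1 noticeably (~5%) greater than cap2? As in kernel/sched/fair.c. */
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

/*
 * Simplified model of the existing check: a source runqueue with a
 * single task is skipped unless the destination CPU is noticeably
 * bigger. Two small CPUs of equal capacity never pass it.
 */
bool skip_src_rq_today(unsigned long dst_cap, unsigned long src_cap,
		       unsigned int nr_running)
{
	return nr_running == 1 && !capacity_greater(dst_cap, src_cap);
}

/*
 * Hypothetical relaxation: CPUs of identical capacity may balance
 * among themselves; everything else keeps the existing behavior.
 */
bool skip_src_rq_relaxed(unsigned long dst_cap, unsigned long src_cap,
			 unsigned int nr_running)
{
	if (dst_cap == src_cap)
		return false;

	return skip_src_rq_today(dst_cap, src_cap, nr_running);
}
```

For two small CPUs of capacity 512 and a source runqueue with one task,
skip_src_rq_today(512, 512, 1) returns true, so the source is never
considered, while skip_src_rq_relaxed(512, 512, 1) returns false. A
destination of lower capacity than the source is still skipped in both
variants.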

This series addresses each problem and restores the intended behavior.
Details, rationale, and code changes are explained in each patch.

I tested these patches on Alder Lake, which has both SMT Pcores and
clusters of Ecores. I tested with SMT both disabled and enabled. I also
tested on Lunar Lake and Panther Lake, which have an Ecore cluster not
connected to the L3 cache. I repeated the same experiments with
CONFIG_SCHED_CLUSTER disabled. In all cases, the load balancer behaves
as expected.

Christian also tested this patchset on a synthetic arm64 QEMU topology and
observed the expected behavior [3].

Link: https://lore.kernel.org/all/20260509180955.1840064-1-arighi@nvidia.com/ [1]
Link: https://lore.kernel.org/r/20210924085104.44806-1-21cnbao@gmail.com/ [2]
Link: https://lore.kernel.org/all/e08492e0-d9f3-4574-8841-b633db008507@arm.com/ [3]

Changes in v5:
 - Added Tested-by tags from Christian. Thanks!
 - Patch 1 (pre-work): Optimized logic to identify CPUs with busy SMT
   siblings only when needed. (Prateek, Chen Yu)
 - Patch 5: Optimized logic to check for architectural capacity only when
   needed.
 - Added Reviewed-by tag from Prateek. Thanks!
 - Link to v4: https://lore.kernel.org/r/20260608-rneri-fix-cas-clusters-v4-0-1526711c944c@linux.intel.com

Changes in v4:
 - Patch 1 (pre-work): Fixed a bug that would block load balancing on SMT
   cores with more than one busy sibling.
 - Patch 2 (pre-work): Fixed a bug that would needlessly update
   sg_overloaded.
 - Patch 5: Reworked logic using a local variable for improved
   readability.
 - Added Reviewed-by tags from Chen Yu, Tim, and Vincent. Thanks!
 - Link to v3: https://lore.kernel.org/r/20260514-rneri-fix-cas-clusters-v3-0-0037869554bd@linux.intel.com

Changes in v3:
 - Patch 3: Reverted the inverted runtime capacity check. The inverted
   form resulted in migrations to CPUs of slightly lower capacity. Guarded
   the check for architectural capacity with the sched_cluster_active
   static key.
 - Patch 4: Expanded the patch description to explain the behavior of
   overloaded groups and low-capacity clusters with spare capacity.
 - Added Reviewed-by tags from Christian. Thanks!
 - Link to v2: https://lore.kernel.org/r/20260429-rneri-fix-cas-clusters-v2-0-cd787de35cc6@linux.intel.com

Changes in v2:
 - Patch 1: Rewrote patch description for clarity. Added a note
   clarifying that SD_ASYM_CPUCAPACITY and SMT are mutually
   exclusive. (Tim)
 - Patch 2: Fixed a bug where the capacity check inadvertently broke
   the mutual exclusion of the sched_reduced_capacity() path. Keep
   marking the root domain as overloaded when misfit tasks are present
   to allow bigger CPUs to help via newly idle balance. (sashiko)
   Fixed the description to state that capacity_greater() looks for
   differences of ~5% or more, not 20%. (Christian)
 - Patch 3: Use arch_scale_cpu_capacity() instead of capacity_of() to
   ignore runtime capacity variability. Inverted the capacity check.
   (Christian)
 - Patch 4: Reworded the patch description for clarity.
 - Link to v1: https://lore.kernel.org/r/20260330-rneri-fix-cas-clusters-v1-0-1e465b6fecb2@linux.intel.com/
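
For reference on the ~5% figure in the v2 notes, the arithmetic can be
checked in userspace. Both macros below are copied from
kernel/sched/fair.c; fits_capacity() is a separate helper with a ~20%
margin and is shown only for contrast. The helper functions
min_greater_than() and max_fitting_util() are made up for this sketch.

```c
/* Does cap fit in max with ~20% headroom? As in kernel/sched/fair.c. */
#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)

/* Is cap1 noticeably (~5%) greater than cap2? As in kernel/sched/fair.c. */
#define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)

/* Smallest capacity that capacity_greater() treats as greater than cap. */
unsigned long min_greater_than(unsigned long cap)
{
	unsigned long c = cap;

	while (!capacity_greater(c, cap))
		c++;

	return c;
}

/* Largest utilization that fits_capacity() accepts for a CPU of capacity max. */
unsigned long max_fitting_util(unsigned long max)
{
	unsigned long u = 0;

	while (fits_capacity(u + 1, max))
		u++;

	return u;
}
```

For a capacity of 1024, min_greater_than() returns 1079, a difference of
about 5%, and max_fitting_util() returns 819, which leaves about 20% of
headroom.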

---
Ricardo Neri (6):
      sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings
      sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY
      sched/fair: Check CPU capacity before comparing group types during load balance
      sched/fair: Skip misfit load accounting when the destination CPU cannot help
      sched/fair: Allow load balancing between CPUs of identical capacity
      sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters

 include/linux/sched/sd_flags.h |  3 +-
 kernel/sched/fair.c            | 66 ++++++++++++++++++++++++++++++------------
 kernel/sched/topology.c        | 14 +++++++--
 3 files changed, 62 insertions(+), 21 deletions(-)
---
base-commit: 50436392fe2359ea108fd27308f86c8283be1622
change-id: 20250620-rneri-fix-cas-clusters-bb4287d1e152

Best regards,
-- 
Ricardo Neri <ricardo.neri-calderon@linux.intel.com>


Thread overview: 22+ messages
2026-06-23  0:05 [PATCH v5 0/6] sched: Fix cluster scheduling in the presence of asymmetric capacity Ricardo Neri
2026-06-23  0:05 ` [PATCH v5 1/6] sched/fair: Do not skip CPUs of similar capacity with busy SMT siblings Ricardo Neri
2026-06-23  7:13   ` Vincent Guittot
2026-06-24  5:25     ` Ricardo Neri
2026-06-23  0:05 ` [PATCH v5 2/6] sched/fair: Also gate overloaded status update for SD_ASYM_CPUCAPACITY Ricardo Neri
2026-06-23  7:14   ` Vincent Guittot
2026-06-23  0:05 ` [PATCH v5 3/6] sched/fair: Check CPU capacity before comparing group types during load balance Ricardo Neri
2026-06-23  0:05 ` [PATCH v5 4/6] sched/fair: Skip misfit load accounting when the destination CPU cannot help Ricardo Neri
2026-06-23  0:05 ` [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity Ricardo Neri
2026-06-23  7:20   ` Vincent Guittot
2026-06-23  7:45     ` Christian Loehle
2026-06-24  5:25       ` Ricardo Neri
2026-06-26  0:11         ` Ricardo Neri
2026-06-26 14:50           ` Vincent Guittot
2026-06-27  2:02             ` Ricardo Neri
2026-06-26 15:20       ` Vincent Guittot
2026-06-27 19:07   ` Andrea Righi
2026-06-23  0:05 ` [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters Ricardo Neri
2026-06-23  7:26   ` Vincent Guittot
2026-06-24  5:14     ` Ricardo Neri
2026-06-26  0:19       ` Ricardo Neri
2026-06-26 14:54         ` Vincent Guittot
