[RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy
@ 2025-03-13  9:37 K Prateek Nayak
  2025-03-13  9:37 ` [RFC PATCH 1/8] sched/topology: Assign sd_share for all non NUMA sched domains K Prateek Nayak
                   ` (16 more replies)
  0 siblings, 17 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13  9:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

tl;dr

This prototype is currently limited in the sense that it can only reuse
statistics for busy load balancing. Reusing stats for newidle load
balancing specifically ran into issues elaborated below.

Despite these shortcomings, the data around load balancing cost suggests
the possibility of a huge savings if the stats can be reused
efficiently as much as possible.

This RFC is intended to start a discussion on the reuse strategy and
find solutions to the issues we've encountered along the way.

Most benchmarks are neutral to this change except few cases of tail
latencies in schbench shooting up that remains to be investigated. To
see the benchmarks results, please jump to "Benchmark results" section.

This series in based on tip:sched/core at commit fd881d0a085f ("rseq:
Fix segfault on registration when rseq_cs is non-zero") with upstream
commit 3d252160b818 ("fs/pipe: Read pipe->{head,tail} atomically outside
pipe->mutex") cherry picked on top to fix an issue that would cause
sched-messaging to hang.

With those details out of the way, lets jump into it ...


Motivation
==========

With larger core counts, load balancing at domains with large number of
CPUs can take up significant amount of time. This was first noted and
tackled by Chenyu using ILB_UTIL [1] which bailed out of the load
balancing after scanning a subset of groups of the sched domain based on
the total utilization of CPUs in the sched domain.

Matt Fleming too reported that he had observed newidle balance popping
up regularly when profiling a production server [2] which is not
uncommon since users of large server tend to run with idle headroom to
maintain SLA requirements and handle unexpected bursts.

We've had reports of load balancing jitters being a hinderance for
certain classes of workloads which is why the idea to add a new sched
domain demarcating NPS quadrants when running in NPS1 mode was not too
well received [3]

David had proposed SHARED_RUNQ [4] to improve on the shortcomings of
newidle balance for Meta's production workloads.

There have been multiple proposals to optimize newidle balance in the
past specifically for issues seen on larger core count systems.


Proposal
========

Each sched group, except for those of the lowest domain, derive
themselves for a sched domain struct of a higher domain. Since each
of these sub-NUMA sched domain contains a sd->shared reference, it can
be used to communicate data between the sched_group and the sched_domain
it is derived from.

This series introduces a sg->shared that references the sd->shared of
the sched domain the group is derived from. The shared struct now embeds
in it a "sg_lb_stats" object that can be used to aggregate the
statistics from load balancing at lower domain to update the stats of a
higher domain.

When update_sg_lb_stats() is called, the "sg_lb_stats" object in
sg->shared is checked for validity and is returned as the group stats if
it is still valid.

update_sd_lb_stats() aggregates the stats from all the sched groups
traversed and updates the stats in sd->shared for it to be reused when
load balancing at a higher domain.


                                update_sd_lb_stats(PKG)
                                           |
                                           v
                                 update_sg_lb_stats()
                                           |
                                           v
                                      sg->shared
                                           ^
                                           |
    update_sd_lb_stats(MC) -> updates sd->shared
             |
             v
    update_sg_lb_stats()
             |
             v
          sg->shared


Caveats of aggressive reuse
===========================

The initial version allowed stats to be propagated for all types of load
balancing however, with a wakeup intensive workload, newidle balance ran
into multiple problems.

Allowing stats to be updated frequently can lead to long delays for
readers trying to see a consistent stats object. Initial design used
seqcount_t to allow concurrent update to shared stats but it ran into
rcu stalls on the read side.

The next experiment used "jiffies" as a measure for stats freshness
which led to a mixed bag of results:

  Benchmark                                		 %diff

  schbench (new) 256-worker (wakeup latency)		12.81%
  schbench (new) 256-worker (request latency)		18.42%
  DeathStarBench 2x (throughput)			 8.10%
  DeathStarBench 3x (throughput)			 6.27%
  DeathStarBench 6x (throughput)			 2.25%

  sched-messaging 4-group (throughput)		       -47.27%
  sched-messaging 8-groups (throughput)		       -48.99%
  sched-messaging 16-group (throughput)		       -20.52%
  tbench 256-clients (throughput)		       -10.15%
  schbench (old) 128-workers (tail latency)	      -390.11%
  schbench (old) 256-workers (tail latency)	      -102.43%
  schbench (new) 128-workers (wakeup latency)	      -114.97%
  schbench (new) 512-workers (wakeup latency)	       -20.89%
  schbench (new) 256-worker (request latency)	       -32.44%
  HammerDB + MySQL8 16VU (throughput)	               -13.14%
  HammerDB + MySQL8 64VU (throughput)	                -4.46%

  ( Benchmark cases not listed above were found to be perf neutral. )

hackbench 4-groups was chosen for analysis since it is short running and
produces stable results across multiple iterations. Using "perf sched
stats" [5] to profile the workload revealed the following information:

  ----------------------------------------------------------------------------------------------------
  CPU <ALL CPUS SUMMARY>
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2   PCT_CHANGE    PCT_CHANGE1 PCT_CHANGE2
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                              :           0,          0  |     0.00% |
  Legacy counter can be ignored                                    :           0,          0  |     0.00% |
  schedule() called                                                :      117639,     125511  |     6.69% |
  schedule() left the processor idle                               :       58581,      58104  |    -0.81% |  (    49.80%,     46.29% )
  try_to_wake_up() was called                                      :       58973,      65141  |    10.46% |
  try_to_wake_up() was called to wake up the local cpu             :          30,        191  |   536.67% |  (     0.05%,      0.29% )
  total runtime by tasks on this processor (in jiffies)            :  2338057783, 2190178756  |    -6.32% |
  total waittime by tasks on this processor (in jiffies)           :    27373872,  284078866  |   937.77% |  (     1.17%,     12.97% )
  total timeslices run on this cpu                                 :       59052,      67395  |    14.13% |
  ----------------------------------------------------------------------------------------------------

  Summary:

  o The wait time shoots up by 9x
  o The number of local CPU wakeups go up 5x

Analyzing the lb statistics revealed the following for newidle balance
(please jump to summary for the insights from the data):

  SMT
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :        1576,      11323  |   618.46% |  $        0.77,        0.15 $
  load_balance() found balanced on cpu newly idle                  :        1552,      11146  |   618.17% |  $        0.78,        0.15 $
  load_balance() move task failed on cpu newly idle                :           4,         48  |  1100.00% |  $      303.75,       34.27 $
  pull_task() count on cpu newly idle                              :          19,        128  |   573.68% |
  pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |     0.00% |
  load_balance() failed to find busier queue on cpu newly idle     :           0,          0  |     0.00% |  $        0.00,        0.00 $
  load_balance() failed to find busier group on cpu newly idle     :        1342,      10519  |   683.83% |  $        0.91,        0.16 $
  *load_balance() success count on cpu newly idle                  :          20,        129  |   545.00% |
  *avg task pulled per successful lb attempt (cpu newly idle)      :        0.95,       0.99  |     4.45% |

  MC
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :         717,       8824  |  1130.68% |  $        1.69,        0.19 $
  load_balance() found balanced on cpu newly idle                  :         503,       5608  |  1014.91% |  $        2.42,        0.29 $
  load_balance() move task failed on cpu newly idle                :         203,       3200  |  1476.35% |  $        5.99,        0.51 $
  pull_task() count on cpu newly idle                              :          10,         15  |    50.00% |
  pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |     0.00% |
  load_balance() failed to find busier queue on cpu newly idle     :           0,       1724  |     0.00% |  $        0.00,        0.95 $
  load_balance() failed to find busier group on cpu newly idle     :         475,       3756  |   690.74% |  $        2.56,        0.44 $
  *load_balance() success count on cpu newly idle                  :          11,         16  |    45.45% |
  *avg task pulled per successful lb attempt (cpu newly idle)      :        0.91,       0.94  |     3.12% |

  PKG
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :         112,       4199  |  3649.11% |  $       10.85,        0.39 $
  load_balance() found balanced on cpu newly idle                  :          39,       1989  |  5000.00% |  $       31.15,        0.83 $
  load_balance() move task failed on cpu newly idle                :          66,       2198  |  3230.30% |  $       18.41,        0.75 $
  pull_task() count on cpu newly idle                              :           6,         12  |   100.00% |
  pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |     0.00% |
  load_balance() failed to find busier queue on cpu newly idle     :           0,       1059  |     0.00% |  $        0.00,        1.55 $
  load_balance() failed to find busier group on cpu newly idle     :          17,        816  |  4700.00% |  $       71.47,        2.02 $
  *load_balance() success count on cpu newly idle                  :           7,         12  |    71.43% |
  *avg task pulled per successful lb attempt (cpu newly idle)      :        0.86,       1.00  |    16.67% |

  NUMA
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :          46,        844  |  1734.78% |  $       26.41,        1.95 $
  load_balance() found balanced on cpu newly idle                  :          36,        836  |  2222.22% |  $       33.75,        1.97 $
  load_balance() move task failed on cpu newly idle                :           9,          7  |   -22.22% |  $      135.00,      235.00 $
  pull_task() count on cpu newly idle                              :           1,          0  |  -100.00% |
  pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |     0.00% |
  load_balance() failed to find busier queue on cpu newly idle     :           0,          0  |     0.00% |  $        0.00,        0.00 $
  load_balance() failed to find busier group on cpu newly idle     :          10,        781  |  7710.00% |  $      121.50,        2.11 $
  *load_balance() success count on cpu newly idle                  :           1,          1  |     0.00% |
  *avg task pulled per successful lb attempt (cpu newly idle)      :        1.00,       0.00  |  -100.00% |
  
  ---
  Summary:

  o The number of newidle balancing attempts have shot up drastically
  o "avg task pulled per successful lb attempt" stats are similar
  o "load_balance() failed to find busier queue on cpu newly idle"
    shoots up with stats reuse.

This led to two conclusions:

o Something is tripping the newidle balance guards (cost, rd->overloaded
  check).
o Stats going stale quickly can lead to failure to find a busy src_rq to
  pull the task from.


Cost of load balancing
======================

sg_overloaded flag was verified to be correct using the sched debug
data - this could also mean that the imbalance was failed to be balanced
however at this point, even the cost of newidle balance was analyzed to
spot any benefits of stats reuse:

o Bailout count: Number of instance where load balancing bailed out vs
  was attempted at a particular domain.

  - tip:sched/core

            Bailout           lb_count[CPU_NEWLY_IDLE]
    Early     57035                    N/A
    SMT           0                   1576
    MC          629                    717
    PKG         566                    112
    NUMA         37                     46

  - stats_prop

            Bailout           lb_count[CPU_NEWLY_IDLE]
    Early     46568 (-18%)             N/A
    SMT           1                  11323 (618%)
    MC         1745                   8824 (1131%)
    PKG        4481                   4199 (3949%)
    NUMA       3230                    844 (1735%)

tip:sched/core bails out early compared to with the stats_prop variant.
This suggests that rd->overloaded is set and the cost for a newidle
balance is cheap.

o Cost of newidle balance: These are the load balancing time in ns when
  sched_balance_rq() returned > 0 with "continue_balancing" set to true
  ensuring at least one loop in sched_balance_rq()

  - tip:sched/core

    Case   Min Cost   Max Cost   Avg Cost
    SMT       646ns    16187ns      878ns
    MC       1150ns    15732ns     1685ns
    PKG      7087ns    44092ns    10837ns
    NUMA    13765ns    33651ns    20153ns

  - stats_prop

    Case   Min Cost   Max Cost   Avg Cost
    SMT       753ns    14963ns      848ns (-3%)
    MC        454ns    15130ns      511ns (-70%)
    PKG       832ns    20551ns     1528ns (-86%)
    NUMA      477ns    18258ns      678ns (-97%)

The data suggests there is a significant cost savings possible if the
stats can be reused efficiently.


Stats invalidation
==================

Another experiment was tried with stats invalidation where the cached
statistics were invalidated once a decision to migrate the task was
taken.

With stats invalidation the situation improved but the regressions were
still very red:

  Benchmark                                              %diff

  netperf 128-clients (throughput)                       8.62%
  DeathStarBench 2x (throughput)                         2.91%
  DeathStarBench 3x (throughput)                         6.59%
  DeathStarBench 6x (throughput)                         2.87%

  sched-messaging 4-group (throughput)                 -41.60%
  sched-messaging 8-groups (throughput)                -45.88%
  sched-messaging 16-group (throughput)                -12.48%
  tbench 256-clients (throughput)                       -8.93%
  schbench (old) 128-workers (tail latency)           -673.33%
  schbench (old) 256-workers (tail latency)            -92.72%
  schbench (new) 128-workers (wakeup latency)          -67.91%
  schbench (new) 512-workers (wakeup latency)          -20.89%
  schbench (new) 256-worker (request latency)           -6.30%
  HammerDB + MySQL8 16VU (throughput)                  -12.02%
  HammerDB + MySQL8 64VU (throughput)                   -3.52%

For sched-messaging 4-group case, the wait time reduced compared to the
original approach but it was still 7x the baseline on tip:

  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2   PCT_CHANGE    PCT_CHANGE1 PCT_CHANGE2
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                              :           0,          0  |     0.00% |
  Legacy counter can be ignored                                    :           0,          0  |     0.00% |
  schedule() called                                                :      117639,     118350  |     0.60% |
  schedule() left the processor idle                               :       58581,      55392  |    -5.44% |  (    49.80%,     46.80% )
  try_to_wake_up() was called                                      :       58973,      61005  |     3.45% |
  try_to_wake_up() was called to wake up the local cpu             :          30,        199  |   563.33% |  (     0.05%,      0.33% )
  total runtime by tasks on this processor (in jiffies)            :  2338057783, 2395802991  |     2.47% |
  total waittime by tasks on this processor (in jiffies)           :    27373872,  232983564  |   751.12% |  (     1.17%,      9.72% )
  total timeslices run on this cpu                                 :       59052,      62949  |     6.60% |
  ----------------------------------------------------------------------------------------------------

  Summary:

  o Wait tim reduced to 7x from 9x
  o Local CPU wakeups still remain at 5x.

Even with stats invalidation, the load balancing stats look similar with
high number of attempts and an increase in "load_balance() failed to
find busier queue on cpu newly idle" with stats reuse.

  SMT
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :        1576,       3300  |   109.39% |  $        0.77,        0.46 $
  load_balance() found balanced on cpu newly idle                  :        1552,       3226  |   107.86% |  $        0.78,        0.47 $
  load_balance() move task failed on cpu newly idle                :           4,         22  |   450.00% |  $      303.75,       69.32 $
  pull_task() count on cpu newly idle                              :          19,         50  |   163.16% |
  pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |     0.00% |
  load_balance() failed to find busier queue on cpu newly idle     :           0,          0  |     0.00% |  $        0.00,        0.00 $
  load_balance() failed to find busier group on cpu newly idle     :        1342,       3015  |   124.66% |  $        0.91,        0.51 $
  *load_balance() success count on cpu newly idle                  :          20,         52  |   160.00% |
  *avg task pulled per successful lb attempt (cpu newly idle)      :        0.95,       0.96  |     1.21% |

  MC
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :         717,       2567  |   258.02% |  $        1.69,        0.59 $
  load_balance() found balanced on cpu newly idle                  :         503,       1398  |   177.93% |  $        2.42,        1.09 $
  load_balance() move task failed on cpu newly idle                :         203,       1144  |   463.55% |  $        5.99,        1.33 $
  pull_task() count on cpu newly idle                              :          10,         24  |   140.00% |
  pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |     0.00% |
  load_balance() failed to find busier queue on cpu newly idle     :           0,        604  |     0.00% |  $        0.00,        2.52 $
  load_balance() failed to find busier group on cpu newly idle     :         475,        746  |    57.05% |  $        2.56,        2.04 $
  *load_balance() success count on cpu newly idle                  :          11,         25  |   127.27% |
  *avg task pulled per successful lb attempt (cpu newly idle)      :        0.91,       0.96  |     5.60% |

  PKG
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :         112,        790  |   605.36% |  $       10.85,        1.93 $
  load_balance() found balanced on cpu newly idle                  :          39,        326  |   735.90% |  $       31.15,        4.68 $
  load_balance() move task failed on cpu newly idle                :          66,        442  |   569.70% |  $       18.41,        3.45 $
  pull_task() count on cpu newly idle                              :           6,         21  |   250.00% |
  pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |     0.00% |
  load_balance() failed to find busier queue on cpu newly idle     :           0,        169  |     0.00% |  $        0.00,        9.02 $
  load_balance() failed to find busier group on cpu newly idle     :          17,         80  |   370.59% |  $       71.47,       19.06 $
  *load_balance() success count on cpu newly idle                  :           7,         22  |   214.29% |
  *avg task pulled per successful lb attempt (cpu newly idle)      :        0.86,       0.95  |    11.36% |

  Summary:

  o Still a very high count of load balancing attempts
  o "avg task pulled per successful lb attempt" is similar still
  o "load_balance() failed to find busier queue on cpu newly idle" is a
    large value once stats are reused.


Ripple effect
=============

The idle_cpu() function searching for idle CPU on SMP looks as follows:

  int idle_cpu(int cpu)
  {
          struct rq *rq = cpu_rq(cpu);
  
          if (rq->curr != rq->idle) /* (1) */
                  return 0;
  
          if (rq->nr_running)
                  return 0;
  
          if (rq->ttwu_pending)
                  return 0;
  
          return 1;
  }

The very first condition (1) rules out an CPU that is not running the
idle task. Since newidle balance happens at select_next_task_fair(),
rq->curr points to the previous task that was blocked which leads to
idle_cpu() checks returning false for CPUs currently in newidle balance.

Modifying this leads to change in baseline numbers which again needs to
be investigated separately.


Limiting stats reuse
====================

With the number of problems building, the stats propagation strategy was
then limited to periodic busy load balancing where the conditions in
should_we_balance() ensure that only a single CPU is performing the
periodic balance in a given domain and since idleness stats remain same
in case of busy balancing since all CPUs are busy and running tasks
cannot be migrated.

The newidle balance case will be worked on and revisited post OSPM since
it requires more evaluation.


Benchmark results
=================

  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)      stats_prop[pct imp](CV)
   1-groups     1.00 [ -0.00](10.12)     1.09 [ -9.11](11.93)
   2-groups     1.00 [ -0.00]( 6.92)     1.00 [ -0.22]( 4.57)
   4-groups     1.00 [ -0.00]( 3.14)     0.99 [  0.83]( 1.77)
   8-groups     1.00 [ -0.00]( 1.35)     1.00 [ -0.31]( 2.24)
  16-groups     1.00 [ -0.00]( 1.32)     0.99 [  0.84]( 0.67)


  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)      stats_prop[pct imp](CV)
      1     1.00 [  0.00]( 0.43)     0.99 [ -0.87]( 1.34)
      2     1.00 [  0.00]( 0.58)     1.02 [  2.14]( 0.29)
      4     1.00 [  0.00]( 0.54)     1.01 [  1.24]( 0.82)
      8     1.00 [  0.00]( 0.49)     1.01 [  0.62]( 0.97)
     16     1.00 [  0.00]( 1.06)     1.01 [  0.94]( 0.70)
     32     1.00 [  0.00]( 1.27)     0.99 [ -1.24]( 1.38)
     64     1.00 [  0.00]( 1.54)     1.00 [ -0.43]( 0.36)
    128     1.00 [  0.00]( 0.38)     1.00 [ -0.01]( 1.22)
    256     1.00 [  0.00]( 1.85)     1.02 [  1.58]( 0.90)
    512     1.00 [  0.00]( 0.31)     1.01 [  0.76]( 1.19)
   1024     1.00 [  0.00]( 0.19)     1.00 [  0.44]( 0.35)


  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      stats_prop[pct imp](CV)
   Copy     1.00 [  0.00](11.31)     1.02 [  1.69]( 6.44)
  Scale     1.00 [  0.00]( 6.62)     1.01 [  0.80]( 5.37)
    Add     1.00 [  0.00]( 7.06)     1.02 [  1.54]( 6.72)
  Triad     1.00 [  0.00]( 8.91)     1.01 [  1.36]( 6.73)


  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      stats_prop[pct imp](CV)
   Copy     1.00 [  0.00]( 2.01)     0.98 [ -1.55]( 2.15)
  Scale     1.00 [  0.00]( 1.49)     1.00 [  0.23]( 0.58)
    Add     1.00 [  0.00]( 2.67)     1.01 [  0.65]( 1.95)
  Triad     1.00 [  0.00]( 2.19)     1.01 [  0.61]( 1.37)


  ==================================================================
  Test          : netperf
  Units         : Normalized Througput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
     Clients:       tip[pct imp](CV)      stats_prop[pct imp](CV)
    1-clients     1.00 [  0.00]( 1.43)     1.00 [  0.17]( 0.32)
    2-clients     1.00 [  0.00]( 1.02)     1.01 [  1.00]( 0.44)
    4-clients     1.00 [  0.00]( 0.83)     1.01 [  0.62]( 0.36)
    8-clients     1.00 [  0.00]( 0.73)     1.00 [ -0.11]( 0.65)
   16-clients     1.00 [  0.00]( 0.97)     1.00 [  0.49]( 0.77)
   32-clients     1.00 [  0.00]( 0.88)     1.00 [  0.30]( 0.94)
   64-clients     1.00 [  0.00]( 1.49)     1.00 [  0.36]( 1.57)
  128-clients     1.00 [  0.00]( 1.05)     1.00 [  0.14]( 1.46)
  256-clients     1.00 [  0.00]( 3.85)     1.00 [ -0.04]( 4.85)
  512-clients     1.00 [  0.00](59.63)     1.00 [ -0.02](62.28)


  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:     tip[pct imp](CV)      stats_prop[pct imp](CV)
    1         1.00 [ -0.00]( 6.67)     0.76 [ 24.44](35.80)
    2         1.00 [ -0.00](10.18)     0.87 [ 13.04](10.38)
    4         1.00 [ -0.00]( 4.49)     1.04 [ -4.26]( 3.14)
    8         1.00 [ -0.00]( 6.68)     0.98 [  1.89]( 8.07)
   16         1.00 [ -0.00]( 1.87)     1.03 [ -3.28]( 5.21)
   32         1.00 [ -0.00]( 4.01)     0.98 [  2.20]( 1.31)
   64         1.00 [ -0.00]( 3.21)     1.00 [ -0.00]( 3.23)
  128         1.00 [ -0.00](44.13)     1.06 [ -6.43](113.66)
  256         1.00 [ -0.00](14.46)     1.04 [ -3.52]( 8.43)
  512         1.00 [ -0.00]( 1.95)     1.02 [ -1.80]( 1.14)


  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers:      tip[pct imp](CV)      stats_prop[pct imp](CV)
    1          1.00 [  0.00]( 0.46)     1.00 [  0.00]( 0.55)
    2          1.00 [  0.00]( 0.15)     0.99 [ -0.88]( 0.26)
    4          1.00 [  0.00]( 0.15)     0.99 [ -0.59]( 0.15)
    8          1.00 [  0.00]( 0.15)     0.99 [ -0.88]( 0.26)
   16          1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
   32          1.00 [  0.00]( 3.40)     1.07 [  6.59]( 0.16)
   64          1.00 [  0.00]( 7.09)     1.00 [ -0.38]( 0.96)
  128          1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.20)
  256          1.00 [  0.00]( 1.12)     1.00 [ -0.30]( 1.50)
  512          1.00 [  0.00]( 0.22)     1.05 [  4.86]( 0.71)


  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:     tip[pct imp](CV)      stats_prop[pct imp](CV)
    1         1.00 [ -0.00](19.72)     0.85 [ 15.38](16.75)
    2         1.00 [ -0.00](15.96)     1.00 [ -0.00]( 0.00)
    4         1.00 [ -0.00]( 3.87)     1.00 [ -0.00]( 4.08)
    8         1.00 [ -0.00]( 8.15)     1.00 [ -0.00](11.71)
   16         1.00 [ -0.00]( 3.87)     0.92 [  7.69]( 4.19)
   32         1.00 [ -0.00](12.99)     0.73 [ 26.67]( 0.00)
   64         1.00 [ -0.00]( 6.20)     1.12 [-12.50]( 9.94)
  128         1.00 [ -0.00]( 0.96)     0.98 [  1.55]( 0.95)
  256         1.00 [ -0.00]( 2.76)     0.99 [  1.45]( 1.38)
  512         1.00 [ -0.00]( 0.20)     1.20 [-20.42]( 0.00)


  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:     tip[pct imp](CV)      stats_prop[pct imp](CV)
    1         1.00 [ -0.00]( 1.07)     1.02 [ -2.08]( 0.13)
    2         1.00 [ -0.00]( 0.14)     1.04 [ -3.97]( 0.13)
    4         1.00 [ -0.00]( 1.39)     1.03 [ -3.15]( 0.13)
    8         1.00 [ -0.00]( 0.36)     1.03 [ -3.16]( 0.00)
   16         1.00 [ -0.00]( 1.18)     1.02 [ -1.59]( 0.75)
   32         1.00 [ -0.00]( 8.42)     0.81 [ 19.08]( 0.25)
   64         1.00 [ -0.00]( 4.85)     1.01 [ -1.10]( 2.58)
  128         1.00 [ -0.00]( 0.28)     1.00 [ -0.21]( 0.38)
  256         1.00 [ -0.00](10.52)     0.95 [  4.74]( 6.94)
  512         1.00 [ -0.00]( 0.69)     1.09 [ -8.99]( 0.27)


  ==================================================================
  Test          : Various longer running benchmarks
  Units         : %diff in throughput reported
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  Benchmarks:                 %diff

  ycsb-cassandra             -0.54%
  ycsb-mongodb                0.09%

  deathstarbench-1x          -0.30%
  deathstarbench-2x           2.38%
  deathstarbench-3x           0.58%
  deathstarbench-6x           0.62%

  hammerdb+mysql 16VU         0.76%
  hammerdb+mysql 64VU         0.74%
---


Outlook
=======

Since the cost savings are evident, stats propagation can be viable to
reduce the load balancing cost and the load balancing jitter on large
core count machines.

This topic will be discussed at OSPM'25. If folks have ideas around
stats propagation or newidle balance, please feel free to reach out to
us or reply on the thread. Any and all feedback is appreciated.


Future work
===========

Since the current prototype is limited to periodic load balancing,
experiments are under progress to improve the scope to newidle balance
and the idle load balance.

Tail latencies of load balancing after synchronizing the intervals are
also being investigated and will be reported on the next version.

Some bits like the "stats_lock" can be optimized further and replaced
with an atomic variable similar to the SD_SERIALIZE handling in
sched_balance_domains().


Acknowledgements
================

Thanks to Chenyu, David for having paved the way before me. Thanks to
Swapnil, Ravi for working on perf sched stats which has proved to be an
invaluable tool for analysis and debug.


References
==========

[1] https://lore.kernel.org/lkml/cover.1690273854.git.yu.c.chen@intel.com/
[2] https://lore.kernel.org/lkml/20240716141645.637620-1-mfleming@cloudflare.com/
[3] https://lore.kernel.org/lkml/3402dcc4-d52f-d99f-e6ce-b435478a5a59@amd.com/
[4] https://lore.kernel.org/lkml/20231212003141.216236-1-void@manifault.com/
[5] https://lore.kernel.org/lkml/20250311120230.61774-1-swapnil.sapkal@amd.com/

---
Chen Yu (1):
  sched/topology: Assign sd_share for all non NUMA sched domains

K Prateek Nayak (7):
  sched/topology: Introduce sg->shared
  sched/fair: Move "struct sg_lb_stats" and its dependencies to sched.h
  sched/fair: Move sg_{overloaded,overutilized} calculation to
    sg_lb_stats
  sched/topology: Define sg_lb_stats_prop and embed it inside
    sched_domain_shared
  sched/fair: Increase probability of lb stats being reused
  sched/fair: Retrieve cached group stats from sg_lb_stats_prop
  sched/fair: Update stats for sched_domain using the sched_group stats

 include/linux/sched/topology.h |   9 +-
 kernel/sched/fair.c            | 277 ++++++++++++++++++++++-----------
 kernel/sched/sched.h           |  82 ++++++++++
 kernel/sched/topology.c        |  55 ++++++-
 4 files changed, 324 insertions(+), 99 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH 1/8] sched/topology: Assign sd_share for all non NUMA sched domains
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
@ 2025-03-13  9:37 ` K Prateek Nayak
  2025-03-13  9:37 ` [RFC PATCH 2/8] sched/topology: Introduce sg->shared K Prateek Nayak
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13  9:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

From: Chen Yu <yu.c.chen@intel.com>

Currently, only the domain with SD_SHARE_PKG_RESOURCES flag
would share 1 sd_share for every CPU in this domain. Remove this
restriction and extend it for other sched domains under NUMA
domain.

This shared field will be used by a later patch which optimizes
newidle balancing.

Suggested-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index c49aea8c1025..815474823b3f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1680,10 +1680,10 @@ sd_init(struct sched_domain_topology_level *tl,
 	}
 
 	/*
-	 * For all levels sharing cache; connect a sched_domain_shared
+	 * For all levels except for NUMA; connect a sched_domain_shared
 	 * instance.
 	 */
-	if (sd->flags & SD_SHARE_LLC) {
+	if (!(sd->flags & SD_NUMA)) {
 		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
 		atomic_inc(&sd->shared->ref);
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 2/8] sched/topology: Introduce sg->shared
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
  2025-03-13  9:37 ` [RFC PATCH 1/8] sched/topology: Assign sd_share for all non NUMA sched domains K Prateek Nayak
@ 2025-03-13  9:37 ` K Prateek Nayak
  2025-03-13  9:37 ` [RFC PATCH 3/8] sched/fair: Move "struct sg_lb_stats" and its dependencies to sched.h K Prateek Nayak
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13  9:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

sched_group(s) of a particular sched_domain are created using the
sched_domain struct of the child domain. Attach the sched_domain_shared
struct from the corresponding child domain to the sched_group.

This shared struct will be used to propagate the sched group stats up
the sched domain hierarchy to optimize load balancing in subsequent
commits.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/sched.h    |  3 +++
 kernel/sched/topology.c | 27 +++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 023b844159c9..38aa4cba5d1f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2089,6 +2089,9 @@ struct sched_group {
 	int			asym_prefer_cpu;	/* CPU of highest priority in group */
 	int			flags;
 
+	/* sd->shared of the domain from which this group was created */
+	struct sched_domain_shared *shared;
+
 	/*
 	 * The CPUs this group covers.
 	 *
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 815474823b3f..508ee8aa492b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -612,6 +612,23 @@ static struct root_domain *alloc_rootdomain(void)
 	return rd;
 }
 
+static void link_sg_shared(struct sched_group *sg, struct sched_domain_shared *sds)
+{
+	if (!sds)
+		return;
+
+	sg->shared = sds;
+	atomic_inc(&sds->ref);
+}
+
+static void free_sg_shared(struct sched_group *sg)
+{
+	if (sg->shared && atomic_dec_and_test(&sg->shared->ref))
+		kfree(sg->shared);
+
+	sg->shared = NULL;
+}
+
 static void free_sched_groups(struct sched_group *sg, int free_sgc)
 {
 	struct sched_group *tmp, *first;
@@ -626,6 +643,8 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)
 		if (free_sgc && atomic_dec_and_test(&sg->sgc->ref))
 			kfree(sg->sgc);
 
+		free_sg_shared(sg);
+
 		if (atomic_dec_and_test(&sg->ref))
 			kfree(sg);
 		sg = tmp;
@@ -746,6 +765,9 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 			if (parent->parent) {
 				parent->parent->child = tmp;
 				parent->parent->groups->flags = tmp->flags;
+
+				free_sg_shared(parent->parent->groups);
+				link_sg_shared(parent->parent->groups, tmp->shared);
 			}
 
 			/*
@@ -773,6 +795,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
 			 * the child is being destroyed.
 			 */
 			do {
+				free_sg_shared(sg);
 				sg->flags = 0;
 			} while (sg != sd->groups);
 
@@ -972,10 +995,12 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
 	if (!sg)
 		return NULL;
 
+	sg->shared = NULL;
 	sg_span = sched_group_span(sg);
 	if (sd->child) {
 		cpumask_copy(sg_span, sched_domain_span(sd->child));
 		sg->flags = sd->child->flags;
+		link_sg_shared(sg, sd->child->shared);
 	} else {
 		cpumask_copy(sg_span, sched_domain_span(sd));
 	}
@@ -1225,9 +1250,11 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
 	if (already_visited)
 		return sg;
 
+	sg->shared = NULL;
 	if (child) {
 		cpumask_copy(sched_group_span(sg), sched_domain_span(child));
 		cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
+		link_sg_shared(sg, child->shared);
 		sg->flags = child->flags;
 	} else {
 		cpumask_set_cpu(cpu, sched_group_span(sg));
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 3/8] sched/fair: Move "struct sg_lb_stats" and its dependencies to sched.h
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
  2025-03-13  9:37 ` [RFC PATCH 1/8] sched/topology: Assign sd_share for all non NUMA sched domains K Prateek Nayak
  2025-03-13  9:37 ` [RFC PATCH 2/8] sched/topology: Introduce sg->shared K Prateek Nayak
@ 2025-03-13  9:37 ` K Prateek Nayak
  2025-03-13  9:37 ` [RFC PATCH 4/8] sched/fair: Move sg_{overloaded,overutilized} calculation to sg_lb_stats K Prateek Nayak
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13  9:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

"struct sg_lb_stats" will be embedded into "struct sched_domain_shared"
to propagate load balancing information up the sched domain hierarchy in
the subsequent commits. Move it, and the internal types in depends on
from fair.c to sched.h

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c  | 66 --------------------------------------------
 kernel/sched/sched.h | 66 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+), 66 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9dafb374d76d..39bee40dde27 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9168,49 +9168,6 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
-/*
- * 'group_type' describes the group of CPUs at the moment of load balancing.
- *
- * The enum is ordered by pulling priority, with the group with lowest priority
- * first so the group_type can simply be compared when selecting the busiest
- * group. See update_sd_pick_busiest().
- */
-enum group_type {
-	/* The group has spare capacity that can be used to run more tasks.  */
-	group_has_spare = 0,
-	/*
-	 * The group is fully used and the tasks don't compete for more CPU
-	 * cycles. Nevertheless, some tasks might wait before running.
-	 */
-	group_fully_busy,
-	/*
-	 * One task doesn't fit with CPU's capacity and must be migrated to a
-	 * more powerful CPU.
-	 */
-	group_misfit_task,
-	/*
-	 * Balance SMT group that's fully busy. Can benefit from migration
-	 * a task on SMT with busy sibling to another CPU on idle core.
-	 */
-	group_smt_balance,
-	/*
-	 * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
-	 * and the task should be migrated to it instead of running on the
-	 * current CPU.
-	 */
-	group_asym_packing,
-	/*
-	 * The tasks' affinity constraints previously prevented the scheduler
-	 * from balancing the load across the system.
-	 */
-	group_imbalanced,
-	/*
-	 * The CPU is overloaded and can't provide expected CPU cycles to all
-	 * tasks.
-	 */
-	group_overloaded
-};
-
 enum migration_type {
 	migrate_load = 0,
 	migrate_util,
@@ -9916,29 +9873,6 @@ static void sched_balance_update_blocked_averages(int cpu)
 
 /********** Helpers for sched_balance_find_src_group ************************/
 
-/*
- * sg_lb_stats - stats of a sched_group required for load-balancing:
- */
-struct sg_lb_stats {
-	unsigned long avg_load;			/* Avg load            over the CPUs of the group */
-	unsigned long group_load;		/* Total load          over the CPUs of the group */
-	unsigned long group_capacity;		/* Capacity            over the CPUs of the group */
-	unsigned long group_util;		/* Total utilization   over the CPUs of the group */
-	unsigned long group_runnable;		/* Total runnable time over the CPUs of the group */
-	unsigned int sum_nr_running;		/* Nr of all tasks running in the group */
-	unsigned int sum_h_nr_running;		/* Nr of CFS tasks running in the group */
-	unsigned int idle_cpus;                 /* Nr of idle CPUs         in the group */
-	unsigned int group_weight;
-	enum group_type group_type;
-	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
-	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
-	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
-#ifdef CONFIG_NUMA_BALANCING
-	unsigned int nr_numa_running;
-	unsigned int nr_preferred_running;
-#endif
-};
-
 /*
  * sd_lb_stats - stats of a sched_domain required for load-balancing:
  */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 38aa4cba5d1f..dc9d6e4c704b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2102,6 +2102,72 @@ struct sched_group {
 	unsigned long		cpumask[];
 };
 
+/*
+ * 'group_type' describes the group of CPUs at the moment of load balancing.
+ *
+ * The enum is ordered by pulling priority, with the group with lowest priority
+ * first so the group_type can simply be compared when selecting the busiest
+ * group. See update_sd_pick_busiest().
+ */
+enum group_type {
+	/* The group has spare capacity that can be used to run more tasks.  */
+	group_has_spare = 0,
+	/*
+	 * The group is fully used and the tasks don't compete for more CPU
+	 * cycles. Nevertheless, some tasks might wait before running.
+	 */
+	group_fully_busy,
+	/*
+	 * One task doesn't fit with CPU's capacity and must be migrated to a
+	 * more powerful CPU.
+	 */
+	group_misfit_task,
+	/*
+	 * Balance SMT group that's fully busy. Can benefit from migration
+	 * a task on SMT with busy sibling to another CPU on idle core.
+	 */
+	group_smt_balance,
+	/*
+	 * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
+	 * and the task should be migrated to it instead of running on the
+	 * current CPU.
+	 */
+	group_asym_packing,
+	/*
+	 * The tasks' affinity constraints previously prevented the scheduler
+	 * from balancing the load across the system.
+	 */
+	group_imbalanced,
+	/*
+	 * The CPU is overloaded and can't provide expected CPU cycles to all
+	 * tasks.
+	 */
+	group_overloaded
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load-balancing:
+ */
+struct sg_lb_stats {
+	unsigned long avg_load;			/* Avg load            over the CPUs of the group */
+	unsigned long group_load;		/* Total load          over the CPUs of the group */
+	unsigned long group_capacity;		/* Capacity            over the CPUs of the group */
+	unsigned long group_util;		/* Total utilization   over the CPUs of the group */
+	unsigned long group_runnable;		/* Total runnable time over the CPUs of the group */
+	unsigned int sum_nr_running;		/* Nr of all tasks running in the group */
+	unsigned int sum_h_nr_running;		/* Nr of CFS tasks running in the group */
+	unsigned int idle_cpus;                 /* Nr of idle CPUs         in the group */
+	unsigned int group_weight;
+	enum group_type group_type;
+	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
+	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
+	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned int nr_numa_running;
+	unsigned int nr_preferred_running;
+#endif
+};
+
 static inline struct cpumask *sched_group_span(struct sched_group *sg)
 {
 	return to_cpumask(sg->cpumask);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 4/8] sched/fair: Move sg_{overloaded,overutilized} calculation to sg_lb_stats
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (2 preceding siblings ...)
  2025-03-13  9:37 ` [RFC PATCH 3/8] sched/fair: Move "struct sg_lb_stats" and its dependencies to sched.h K Prateek Nayak
@ 2025-03-13  9:37 ` K Prateek Nayak
  2025-03-13  9:37 ` [RFC PATCH 5/8] sched/topology: Define sg_lb_stats_prop and embed it inside sched_domain_shared K Prateek Nayak
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13  9:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

update_sg_lb_stats() used a pointer to update the group overloaded and
overutilized status to propagate to root domain. Discard the pointer
passing and use a flag in sg_lb_stats struct to indicate the overloaded
and overutilized status. This will be used in subsequent commits to
propagate the overloaded and overutilized status up the sched domain
hierarchy and set these status at highest domain.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c  | 14 +++++++-------
 kernel/sched/sched.h |  6 ++++--
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39bee40dde27..3b1ed14e4b5e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10287,9 +10287,7 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 static inline void update_sg_lb_stats(struct lb_env *env,
 				      struct sd_lb_stats *sds,
 				      struct sched_group *group,
-				      struct sg_lb_stats *sgs,
-				      bool *sg_overloaded,
-				      bool *sg_overutilized)
+				      struct sg_lb_stats *sgs)
 {
 	int i, nr_running, local_group, sd_flags = env->sd->flags;
 	bool balancing_at_rd = !env->sd->parent;
@@ -10311,7 +10309,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->sum_nr_running += nr_running;
 
 		if (cpu_overutilized(i))
-			*sg_overutilized = 1;
+			sgs->overutilized = 1;
 
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
@@ -10324,7 +10322,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		/* Overload indicator is only updated at root domain */
 		if (balancing_at_rd && nr_running > 1)
-			*sg_overloaded = 1;
+			sgs->overloaded = 1;
 
 #ifdef CONFIG_NUMA_BALANCING
 		/* Only fbq_classify_group() uses this to classify NUMA groups */
@@ -10340,7 +10338,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			/* Check for a misfit task on the cpu */
 			if (sgs->group_misfit_task_load < rq->misfit_task_load) {
 				sgs->group_misfit_task_load = rq->misfit_task_load;
-				*sg_overloaded = 1;
+				sgs->overloaded = 1;
 			}
 		} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
 			/* Check for a task running on a CPU with reduced capacity */
@@ -10982,7 +10980,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 				update_group_capacity(env->sd, env->dst_cpu);
 		}
 
-		update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized);
+		update_sg_lb_stats(env, sds, sg, sgs);
 
 		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -10992,6 +10990,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sg_overloaded |= sgs->overloaded;
+		sg_overutilized |= sgs->overutlizied;
 
 		sum_util += sgs->group_util;
 		sg = sg->next;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc9d6e4c704b..9372a75ab3cf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2159,8 +2159,10 @@ struct sg_lb_stats {
 	unsigned int idle_cpus;                 /* Nr of idle CPUs         in the group */
 	unsigned int group_weight;
 	enum group_type group_type;
-	unsigned int group_asym_packing;	/* Tasks should be moved to preferred CPU */
-	unsigned int group_smt_balance;		/* Task on busy SMT be moved */
+	unsigned char group_asym_packing;	/* Tasks should be moved to preferred CPU */
+	unsigned char group_smt_balance;	/* Task on busy SMT be moved */
+	unsigned char overloaded;		/* Contains at least one overloaded CPU */
+	unsigned char overutilized;		/* Contains at least one overutilized CPU */
 	unsigned long group_misfit_task_load;	/* A CPU has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 5/8] sched/topology: Define sg_lb_stats_prop and embed it inside sched_domain_shared
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (3 preceding siblings ...)
  2025-03-13  9:37 ` [RFC PATCH 4/8] sched/fair: Move sg_{overloaded,overutilized} calculation to sg_lb_stats K Prateek Nayak
@ 2025-03-13  9:37 ` K Prateek Nayak
  2025-03-13  9:37 ` [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused K Prateek Nayak
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13  9:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

"struct sg_lb_stats_prop" is a container around "sg_lb_stats" to help
propagate the load balancing stats up the sched domain hierarchy. Embed
the same in "sched_domain_shared" for concurrent load balancing
instances to reuse the statistics collected for domains below.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 include/linux/sched/topology.h |  9 +++++----
 kernel/sched/sched.h           | 11 +++++++++++
 kernel/sched/topology.c        | 26 +++++++++++++++++++++++---
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7f3dbafe1817..a16d7d9dd9d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -78,10 +78,11 @@ extern int sched_domain_level_max;
 struct sched_group;
 
 struct sched_domain_shared {
-	atomic_t	ref;
-	atomic_t	nr_busy_cpus;
-	int		has_idle_cores;
-	int		nr_idle_scan;
+	atomic_t		ref;
+	atomic_t		nr_busy_cpus;
+	int			has_idle_cores;
+	int			nr_idle_scan;
+	void			*private;	/* lb stats propagation field */
 };
 
 struct sched_domain {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9372a75ab3cf..391c4180eeb3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2170,6 +2170,17 @@ struct sg_lb_stats {
 #endif
 };
 
+/*
+ * sg_lb_stats_prop - Load balancer stats propagation container.
+ * This is embedded in sg->shared->private and is used to propagate
+ * sched_domain load balancing statistics up the hierarchy.
+ */
+struct sg_lb_stats_prop {
+	raw_spinlock_t          stats_lock;	/* Lock for updating the cached stats */
+	unsigned long		last_update;	/* Time when stats was last updated (jiffies) */
+	struct sg_lb_stats	sg_stats;	/* Cached sched_group stats */
+};
+
 static inline struct cpumask *sched_group_span(struct sched_group *sg)
 {
 	return to_cpumask(sg->cpumask);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 508ee8aa492b..aeb55f66e8d6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -621,10 +621,19 @@ static void link_sg_shared(struct sched_group *sg, struct sched_domain_shared *s
 	atomic_inc(&sds->ref);
 }
 
+static void free_sched_domain_shared(struct sched_domain_shared *sd_shared)
+{
+	if (!sd_shared)
+		return;
+
+	kfree(sd_shared->private);
+	kfree(sd_shared);
+}
+
 static void free_sg_shared(struct sched_group *sg)
 {
 	if (sg->shared && atomic_dec_and_test(&sg->shared->ref))
-		kfree(sg->shared);
+		free_sched_domain_shared(sg->shared);
 
 	sg->shared = NULL;
 }
@@ -661,7 +670,7 @@ static void destroy_sched_domain(struct sched_domain *sd)
 	free_sched_groups(sd->groups, 1);
 
 	if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
-		kfree(sd->shared);
+		free_sched_domain_shared(sd->shared);
 	kfree(sd);
 }
 
@@ -2273,6 +2282,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 			struct sched_domain_shared *sds;
 			struct sched_group *sg;
 			struct sched_group_capacity *sgc;
+			struct sg_lb_stats_prop *sg_stats;
 
 			sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
@@ -2288,6 +2298,16 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 
 			*per_cpu_ptr(sdd->sds, j) = sds;
 
+			sg_stats = kzalloc_node(sizeof(struct sg_lb_stats_prop),
+					GFP_KERNEL, cpu_to_node(j));
+
+			if (!sg_stats)
+				return -ENOMEM;
+
+			raw_spin_lock_init(&sg_stats->stats_lock);
+			sg_stats->last_update = 0;
+			sds->private = (void *)sg_stats;
+
 			sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
 			if (!sg)
@@ -2332,7 +2352,7 @@ static void __sdt_free(const struct cpumask *cpu_map)
 			}
 
 			if (sdd->sds)
-				kfree(*per_cpu_ptr(sdd->sds, j));
+				free_sched_domain_shared(*per_cpu_ptr(sdd->sds, j));
 			if (sdd->sg)
 				kfree(*per_cpu_ptr(sdd->sg, j));
 			if (sdd->sgc)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (4 preceding siblings ...)
  2025-03-13  9:37 ` [RFC PATCH 5/8] sched/topology: Define sg_lb_stats_prop and embed it inside sched_domain_shared K Prateek Nayak
@ 2025-03-13  9:37 ` K Prateek Nayak
  2025-03-17 18:07   ` Chen, Yu C
  2025-03-13  9:37 ` [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop K Prateek Nayak
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13  9:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

The load balancer will start caching the sg_lb_stats during load
balancing and propagate it up the sched domain hierarchy in the
subsequent commits.

Increase the probability of load balancing intervals across domains to
be aligned to improve the reuse efficiency of the propagated stats.
Go one step further and proactively explore balancing at a higher domain
if the next update time for a higher domain in before the next update
time for its children.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b1ed14e4b5e..60517a732c10 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11956,15 +11956,6 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
 
 	/* scale ms to jiffies */
 	interval = msecs_to_jiffies(interval);
-
-	/*
-	 * Reduce likelihood of busy balancing at higher domains racing with
-	 * balancing at lower domains by preventing their balancing periods
-	 * from being multiples of each other.
-	 */
-	if (cpu_busy)
-		interval -= 1;
-
 	interval = clamp(interval, 1UL, max_load_balance_interval);
 
 	return interval;
@@ -12126,7 +12117,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 	int continue_balancing = 1;
 	int cpu = rq->cpu;
 	int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
-	unsigned long interval;
+	unsigned long interval, prev_sd_next_balance = 0;
 	struct sched_domain *sd;
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
@@ -12136,6 +12127,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+		unsigned long next_interval;
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains.
@@ -12162,7 +12155,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 				goto out;
 		}
 
-		if (time_after_eq(jiffies, sd->last_balance + interval)) {
+		next_interval = sd->last_balance + interval;
+		if (time_after_eq(jiffies, next_interval) ||
+		    (prev_sd_next_balance && time_after(prev_sd_next_balance, next_interval))) {
 			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
 				/*
 				 * The LBF_DST_PINNED logic could have changed
@@ -12174,6 +12169,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 			}
 			sd->last_balance = jiffies;
 			interval = get_sd_balance_interval(sd, busy);
+			prev_sd_next_balance = sd->last_balance + interval;
 		}
 		if (need_serialize)
 			atomic_set_release(&sched_balance_running, 0);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused
  2025-03-13  9:37 ` [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused K Prateek Nayak
@ 2025-03-17 18:07   ` Chen, Yu C
  2025-03-19  6:51     ` K Prateek Nayak
  0 siblings, 1 reply; 24+ messages in thread
From: Chen, Yu C @ 2025-03-17 18:07 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, yu.c.chen,
	yu.chen.surf

On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
> The load balancer will start caching the sg_lb_stats during load
> balancing and propagate it up the sched domain hierarchy in the
> subsequent commits.
> 
> Increase the probability of load balancing intervals across domains to
> be aligned to improve the reuse efficiency of the propagated stats.
> Go one step further and proactively explore balancing at a higher domain
> if the next update time for a higher domain in before the next update
> time for its children.
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>   kernel/sched/fair.c | 18 +++++++-----------
>   1 file changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3b1ed14e4b5e..60517a732c10 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11956,15 +11956,6 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>   
>   	/* scale ms to jiffies */
>   	interval = msecs_to_jiffies(interval);
> -
> -	/*
> -	 * Reduce likelihood of busy balancing at higher domains racing with
> -	 * balancing at lower domains by preventing their balancing periods
> -	 * from being multiples of each other.
> -	 */
> -	if (cpu_busy)
> -		interval -= 1;
> -
>   	interval = clamp(interval, 1UL, max_load_balance_interval);
>   
>   	return interval;
> @@ -12126,7 +12117,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>   	int continue_balancing = 1;
>   	int cpu = rq->cpu;
>   	int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> -	unsigned long interval;
> +	unsigned long interval, prev_sd_next_balance = 0;
>   	struct sched_domain *sd;
>   	/* Earliest time when we have to do rebalance again */
>   	unsigned long next_balance = jiffies + 60*HZ;
> @@ -12136,6 +12127,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>   
>   	rcu_read_lock();
>   	for_each_domain(cpu, sd) {
> +		unsigned long next_interval;
> +
>   		/*
>   		 * Decay the newidle max times here because this is a regular
>   		 * visit to all the domains.
> @@ -12162,7 +12155,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>   				goto out;
>   		}
>   
> -		if (time_after_eq(jiffies, sd->last_balance + interval)) {
> +		next_interval = sd->last_balance + interval;
> +		if (time_after_eq(jiffies, next_interval) ||
> +		    (prev_sd_next_balance && time_after(prev_sd_next_balance, next_interval))) {

(prev_sd_next_balance && time_after(jiffies, prev_sd_next_balance))?

thanks,
Chenyu

>   			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
>   				/*
>   				 * The LBF_DST_PINNED logic could have changed
> @@ -12174,6 +12169,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>   			}
>   			sd->last_balance = jiffies;
>   			interval = get_sd_balance_interval(sd, busy);
> +			prev_sd_next_balance = sd->last_balance + interval;
>   		}
>   		if (need_serialize)
>   			atomic_set_release(&sched_balance_running, 0);

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused
  2025-03-17 18:07   ` Chen, Yu C
@ 2025-03-19  6:51     ` K Prateek Nayak
  0 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-19  6:51 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, linux-kernel, yu.chen.surf

Hello Chenyu,

On 3/17/2025 11:37 PM, Chen, Yu C wrote:
> On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
>> The load balancer will start caching the sg_lb_stats during load
>> balancing and propagate it up the sched domain hierarchy in the
>> subsequent commits.
>>
>> Increase the probability of load balancing intervals across domains to
>> be aligned to improve the reuse efficiency of the propagated stats.
>> Go one step further and proactively explore balancing at a higher domain
>> if the next update time for a higher domain in before the next update
>> time for its children.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>>   kernel/sched/fair.c | 18 +++++++-----------
>>   1 file changed, 7 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3b1ed14e4b5e..60517a732c10 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11956,15 +11956,6 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>>       /* scale ms to jiffies */
>>       interval = msecs_to_jiffies(interval);
>> -
>> -    /*
>> -     * Reduce likelihood of busy balancing at higher domains racing with
>> -     * balancing at lower domains by preventing their balancing periods
>> -     * from being multiples of each other.
>> -     */
>> -    if (cpu_busy)
>> -        interval -= 1;
>> -
>>       interval = clamp(interval, 1UL, max_load_balance_interval);
>>       return interval;
>> @@ -12126,7 +12117,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>       int continue_balancing = 1;
>>       int cpu = rq->cpu;
>>       int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
>> -    unsigned long interval;
>> +    unsigned long interval, prev_sd_next_balance = 0;
>>       struct sched_domain *sd;
>>       /* Earliest time when we have to do rebalance again */
>>       unsigned long next_balance = jiffies + 60*HZ;
>> @@ -12136,6 +12127,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>       rcu_read_lock();
>>       for_each_domain(cpu, sd) {
>> +        unsigned long next_interval;
>> +
>>           /*
>>            * Decay the newidle max times here because this is a regular
>>            * visit to all the domains.
>> @@ -12162,7 +12155,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>                   goto out;
>>           }
>> -        if (time_after_eq(jiffies, sd->last_balance + interval)) {
>> +        next_interval = sd->last_balance + interval;
>> +        if (time_after_eq(jiffies, next_interval) ||
>> +            (prev_sd_next_balance && time_after(prev_sd_next_balance, next_interval))) {
> 
> (prev_sd_next_balance && time_after(jiffies, prev_sd_next_balance))?

So the rationale here is to sync the balancing at different levels if the
load balancing interval at the parent is somewhere between now and the
next load balancing interval of the child domain:

                                            Move MC balance
                                             here for more
                                                 reuse
                                                   v
     jiffies <---------------------------------------------
                  ^                 ^              ^
             Next balance      Next balance   Current balance
             at SMT domain     at MC domain    at SMT domain


On some topology, it can mean slightly more aggressive load balancing at
higher domains but the goal is that cost savings of a stats reuse will
eventually hide this jitter of doing load balancing at multiple domains
at once.

I would like to fo one step further and modify the cpumask_first() is
should we balance to instead return the last CPU doing load balancing
for this tick but it became slightly harder to cover the case of delay
in SOFTIRQ handler being executed so I left it out of this prototype.
I'll try to add something in proper v2.

-- 
Thanks and Regards,
Prateek

> 
> thanks,
> Chenyu
> 
>>               if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
>>                   /*
>>                    * The LBF_DST_PINNED logic could have changed
>> @@ -12174,6 +12169,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>               }
>>               sd->last_balance = jiffies;
>>               interval = get_sd_balance_interval(sd, busy);
>> +            prev_sd_next_balance = sd->last_balance + interval;
>>           }
>>           if (need_serialize)
>>               atomic_set_release(&sched_balance_running, 0);




^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (5 preceding siblings ...)
  2025-03-13  9:37 ` [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused K Prateek Nayak
@ 2025-03-13  9:37 ` K Prateek Nayak
  2025-03-17 18:04   ` Chen, Yu C
  2025-03-13  9:37 ` [RFC PATCH 8/8] sched/fair: Update stats for sched_domain using the sched_group stats K Prateek Nayak
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13  9:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

Allow update_sg_lb_stats() to retrieve the group stats cached in
sg_lb_stats_prop saved by another CPU performing load balancing around
the same time (same jiffy)

Current implementation without invalidation of cached stats have few
limitations namely that the stats reuse is limited to busy load
balancing since stats can only be updated once a jiffy. Newidle Balance
can happen frequently and concurrently on many CPUs which can result in
readers seeing inconsitent values for the propagated stats.

For this iteration, the focus is to reduce the time taken for busy load
balancing allowing the busy CPU to resume renning the task as quickly as
possible.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 81 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 60517a732c10..3b402f294f0b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10275,6 +10275,75 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
+static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
+{
+	/*
+	 * Only under perioric load balancing can we ensure that no concurrent
+	 * CPUs modifies the stats being propagated upwards since
+	 * should_we_balance() can allow multiple concurrent newidle balance
+	 * to progress and an idle -> busy transition for idle balance will
+	 * require the stats to be recomputed since idleness metrics will
+	 * change with migration.
+	 */
+	if (idle)
+		return 0;
+
+	/*
+	 * If individual groups are separate NUMA domains, migrations can cause
+	 * preferred task statistics to change and will require recomputing of
+	 * stats.
+	 */
+	if (sd->child && (sd->child->flags & SD_NUMA))
+		return 0;
+
+	/*
+	 * misfit_task_load requires recalculation on SD_ASYM_CPUCAPACITY
+	 * domains. Skip caching stats for them.
+	 */
+	if (sd->flags & SD_ASYM_CPUCAPACITY)
+		return 0;
+
+	/*
+	 * TODO: For CPU_IDLE case, invalidate stats for an idle -> busy
+	 * transition but for the time being, save some cycles during busy
+	 * load balancing.
+	 */
+	return 1;
+}
+
+static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_stats *sg_stats)
+{
+	struct sched_domain_shared *sg_share = group->shared;
+	unsigned long current_jiffy = jiffies;
+	struct sg_lb_stats_prop *lb_prop;
+
+	if (!sg_share)
+		return 0;
+
+	lb_prop = (struct sg_lb_stats_prop *)sg_share->private;
+	if (!lb_prop)
+		return 0;
+
+	/* Stale stats */
+	if (READ_ONCE(lb_prop->last_update) != current_jiffy)
+		return 0;
+
+	/*
+	 * Pairs against the update to sgs_prop->last_update to
+	 * prevent readers from seeing an inconsistent value of
+	 * the propagated stats from a concurrent update.
+	 */
+	smp_rmb();
+	*sg_stats = lb_prop->sg_stats;
+
+	/*
+	 * If stats were read in the same interval, it cannot
+	 * read an inconsistent state since stats are only
+	 * updated once per jiffy.
+	 */
+	return time_before_eq(jiffies, current_jiffy);
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -10292,10 +10361,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	int i, nr_running, local_group, sd_flags = env->sd->flags;
 	bool balancing_at_rd = !env->sd->parent;
 
-	memset(sgs, 0, sizeof(*sgs));
-
 	local_group = group == sds->local;
 
+	/*
+	 * If stats can be retrieved, we are doing a busy load balancing.
+	 * Skip right ahead to group_classify() since group_asym_packing and
+	 * group_smt_balance is not possible under busy load balancing.
+	 */
+	if (can_retrieve_stats(env->sd, env->idle) &&
+	    retrieve_cached_stats(group, sgs))
+		goto group_classify;
+
+	memset(sgs, 0, sizeof(*sgs));
+
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
 		unsigned long load = cpu_load(rq);
@@ -10360,6 +10438,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	if (!local_group && smt_balance(env, sgs, group))
 		sgs->group_smt_balance = 1;
 
+group_classify:
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	/* Computing avg_load makes sense only when group is overloaded */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop
  2025-03-13  9:37 ` [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop K Prateek Nayak
@ 2025-03-17 18:04   ` Chen, Yu C
  2025-03-19  6:42     ` K Prateek Nayak
  0 siblings, 1 reply; 24+ messages in thread
From: Chen, Yu C @ 2025-03-17 18:04 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, yu.c.chen, yu.chen.surf,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	linux-kernel

On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
> Allow update_sg_lb_stats() to retrieve the group stats cached in
> sg_lb_stats_prop saved by another CPU performing load balancing around
> the same time (same jiffy)
> 

If I understand correctly, we allow update_sg_lb_stats() to retrieve
cached sg stats if another CPU in the same sched group has just done
load balance within a jiffy ago, say 10ms for CONFIG_100_HZ.

There are two roles,  writer who updates the cached stats,
the reader who reads the cache stats. For both cache writer and
the cache reader, do we trigger them only when it is in busy periodic
load balance? If yes, consider the periodic load balance is usually
triggered on 1 CPU in each SD(should_we_balance()),  and the
interval increases with the number of CPUs in that sd, just wonder
if 10 ms is a little short to find a cached stats on large LLC?
thanks,
Chenyu


> Current implementation without invalidation of cached stats have few
> limitations namely that the stats reuse is limited to busy load
> balancing since stats can only be updated once a jiffy. Newidle Balance
> can happen frequently and concurrently on many CPUs which can result in
> readers seeing inconsitent values for the propagated stats.
> 
> For this iteration, the focus is to reduce the time taken for busy load
> balancing allowing the busy CPU to resume renning the task as quickly as
> possible.
> 
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>   kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 81 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 60517a732c10..3b402f294f0b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10275,6 +10275,75 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
>   	return check_cpu_capacity(rq, sd);
>   }
>   
> +static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
> +{
> +	/*
> +	 * Only under perioric load balancing can we ensure that no concurrent
> +	 * CPUs modifies the stats being propagated upwards since
> +	 * should_we_balance() can allow multiple concurrent newidle balance
> +	 * to progress and an idle -> busy transition for idle balance will
> +	 * require the stats to be recomputed since idleness metrics will
> +	 * change with migration.
> +	 */
> +	if (idle)
> +		return 0;
> +
> +	/*
> +	 * If individual groups are separate NUMA domains, migrations can cause
> +	 * preferred task statistics to change and will require recomputing of
> +	 * stats.
> +	 */
> +	if (sd->child && (sd->child->flags & SD_NUMA))
> +		return 0;
> +
> +	/*
> +	 * misfit_task_load requires recalculation on SD_ASYM_CPUCAPACITY
> +	 * domains. Skip caching stats for them.
> +	 */
> +	if (sd->flags & SD_ASYM_CPUCAPACITY)
> +		return 0;
> +
> +	/*
> +	 * TODO: For CPU_IDLE case, invalidate stats for an idle -> busy
> +	 * transition but for the time being, save some cycles during busy
> +	 * load balancing.
> +	 */
> +	return 1;
> +}
> +
> +static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_stats *sg_stats)
> +{
> +	struct sched_domain_shared *sg_share = group->shared;
> +	unsigned long current_jiffy = jiffies;
> +	struct sg_lb_stats_prop *lb_prop;
> +
> +	if (!sg_share)
> +		return 0;
> +
> +	lb_prop = (struct sg_lb_stats_prop *)sg_share->private;
> +	if (!lb_prop)
> +		return 0;
> +
> +	/* Stale stats */
> +	if (READ_ONCE(lb_prop->last_update) != current_jiffy)
> +		return 0;
> +
> +	/*
> +	 * Pairs against the update to sgs_prop->last_update to
> +	 * prevent readers from seeing an inconsistent value of
> +	 * the propagated stats from a concurrent update.
> +	 */
> +	smp_rmb();
> +	*sg_stats = lb_prop->sg_stats;
> +
> +	/*
> +	 * If stats were read in the same interval, it cannot
> +	 * read an inconsistent state since stats are only
> +	 * updated once per jiffy.
> +	 */
> +	return time_before_eq(jiffies, current_jiffy);
> +}
> +
>   /**
>    * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>    * @env: The load balancing environment.
> @@ -10292,10 +10361,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   	int i, nr_running, local_group, sd_flags = env->sd->flags;
>   	bool balancing_at_rd = !env->sd->parent;
>   
> -	memset(sgs, 0, sizeof(*sgs));
> -
>   	local_group = group == sds->local;
>   
> +	/*
> +	 * If stats can be retrieved, we are doing a busy load balancing.
> +	 * Skip right ahead to group_classify() since group_asym_packing and
> +	 * group_smt_balance is not possible under busy load balancing.
> +	 */
> +	if (can_retrieve_stats(env->sd, env->idle) &&
> +	    retrieve_cached_stats(group, sgs))
> +		goto group_classify;
> +
> +	memset(sgs, 0, sizeof(*sgs));
> +
>   	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
>   		struct rq *rq = cpu_rq(i);
>   		unsigned long load = cpu_load(rq);
> @@ -10360,6 +10438,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>   	if (!local_group && smt_balance(env, sgs, group))
>   		sgs->group_smt_balance = 1;
>   
> +group_classify:
>   	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>   
>   	/* Computing avg_load makes sense only when group is overloaded */

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop
  2025-03-17 18:04   ` Chen, Yu C
@ 2025-03-19  6:42     ` K Prateek Nayak
  0 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-19  6:42 UTC (permalink / raw)
  To: Chen, Yu C
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, yu.chen.surf, Ingo Molnar,
	Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-kernel

Hello Chenyu,

Thank you for taking a look at the series.

On 3/17/2025 11:34 PM, Chen, Yu C wrote:
> On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
>> Allow update_sg_lb_stats() to retrieve the group stats cached in
>> sg_lb_stats_prop saved by another CPU performing load balancing around
>> the same time (same jiffy)
>>
> 
> If I understand correctly, we allow update_sg_lb_stats() to retrieve
> cached sg stats if another CPU in the same sched group has just done
> load balance within a jiffy ago, say 10ms for CONFIG_100_HZ.

Quick disclaimer: All of this is best effort currently.

Periodic load balancing is easy to start since it only happens once a
jiffy at the maximum so "last_update" as a jiffy counter should be
good enough (in most cases).

Secondly, and this is slightly harder to solve for, is to get all the
CPUs to actually sync. Currently it is a best effort case since the
tick can fire late due to disabled interrupts on CPU, SCHED_SOFTIRQ
may run at different times depending on how much work is done at tick
prior to reaching the softirq handler etc.

But assuming some amount of sync, I would like:

- During busy balance only one CPU gets to proceed as per
   should_we_balance() heuristics. In addition to that, since all CPUs
   are busy (should_we_balance() would have allowed the first idle CPU
   to go ahead otherwise) the "idle_cpus" and "overloaded" situations
   may change and those are hard to propagate.

- By the time this CPU does busy balancing, other groups below it
   hopefully had enough time to reach update_sd_lb_stats() and cache
   their copy for this jiffy in there. If not - the load balancing CPU
   will recompute.

- Since stats at a higher domain is used only once, there was no need
   to invalidate them which I couldn't get right back then (or maybe
   even now :)

> 
> There are two roles,  writer who updates the cached stats,
> the reader who reads the cache stats. For both cache writer and
> the cache reader, do we trigger them only when it is in busy periodic
> load balance? If yes, consider the periodic load balance is usually
> triggered on 1 CPU in each SD(should_we_balance()),  and the
> interval increases with the number of CPUs in that sd, just wonder
> if 10 ms is a little short to find a cached stats on large LLC?

So the reader is always the CPU going to the higher domain and
recomputing stats. The writer should have updated the stats by then
or the reader won't care and recompute it.

At the very least, since the CPU has to look at local stats too, the
logic ensures at least that is reused and not recomputed.

Beyond the annotated PATCH 9, I've moved to a versioning scheme that
could also be reused for newidle balancing with stats invalidation
and that should help reuse stats more. There are some stats on the
empty PATCH 9.

-- 
Thanks and Regards,
Prateek

> thanks,
> Chenyu
> 
> 
>> Current implementation without invalidation of cached stats have few
>> limitations namely that the stats reuse is limited to busy load
>> balancing since stats can only be updated once a jiffy. Newidle Balance
>> can happen frequently and concurrently on many CPUs which can result in
>> readers seeing inconsitent values for the propagated stats.
>>
>> For this iteration, the focus is to reduce the time taken for busy load
>> balancing allowing the busy CPU to resume renning the task as quickly as
>> possible.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---




^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH 8/8] sched/fair: Update stats for sched_domain using the sched_group stats
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (6 preceding siblings ...)
  2025-03-13  9:37 ` [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop K Prateek Nayak
@ 2025-03-13  9:37 ` K Prateek Nayak
  2025-03-16 10:29 ` [RFC PATCH 09/08] [ANNOTATE] sched/fair: Stats versioning and invalidation K Prateek Nayak
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13  9:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

Aggregate the individual sched_group stats to compute the stat for the
entire sched_domain. Cache this in sd->shared which the sg->shared also
points to for the corresponding sched_group of sd for its parent. This
ensures that the stats are readily available for the higher domains if
the load balancing continues.

With the new infrastructure in place, following are the benchmark
numbers:

  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)      stats_prop[pct imp](CV)
   1-groups     1.00 [ -0.00](10.12)     1.09 [ -9.11](11.93)
   2-groups     1.00 [ -0.00]( 6.92)     1.00 [ -0.22]( 4.57)
   4-groups     1.00 [ -0.00]( 3.14)     0.99 [  0.83]( 1.77)
   8-groups     1.00 [ -0.00]( 1.35)     1.00 [ -0.31]( 2.24)
  16-groups     1.00 [ -0.00]( 1.32)     0.99 [  0.84]( 0.67)

  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)      stats_prop[pct imp](CV)
      1     1.00 [  0.00]( 0.43)     0.99 [ -0.87]( 1.34)
      2     1.00 [  0.00]( 0.58)     1.02 [  2.14]( 0.29)
      4     1.00 [  0.00]( 0.54)     1.01 [  1.24]( 0.82)
      8     1.00 [  0.00]( 0.49)     1.01 [  0.62]( 0.97)
     16     1.00 [  0.00]( 1.06)     1.01 [  0.94]( 0.70)
     32     1.00 [  0.00]( 1.27)     0.99 [ -1.24]( 1.38)
     64     1.00 [  0.00]( 1.54)     1.00 [ -0.43]( 0.36)
    128     1.00 [  0.00]( 0.38)     1.00 [ -0.01]( 1.22)
    256     1.00 [  0.00]( 1.85)     1.02 [  1.58]( 0.90)
    512     1.00 [  0.00]( 0.31)     1.01 [  0.76]( 1.19)
   1024     1.00 [  0.00]( 0.19)     1.00 [  0.44]( 0.35)

  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      stats_prop[pct imp](CV)
   Copy     1.00 [  0.00](11.31)     1.02 [  1.69]( 6.44)
  Scale     1.00 [  0.00]( 6.62)     1.01 [  0.80]( 5.37)
    Add     1.00 [  0.00]( 7.06)     1.02 [  1.54]( 6.72)
  Triad     1.00 [  0.00]( 8.91)     1.01 [  1.36]( 6.73)

  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      stats_prop[pct imp](CV)
   Copy     1.00 [  0.00]( 2.01)     0.98 [ -1.55]( 2.15)
  Scale     1.00 [  0.00]( 1.49)     1.00 [  0.23]( 0.58)
    Add     1.00 [  0.00]( 2.67)     1.01 [  0.65]( 1.95)
  Triad     1.00 [  0.00]( 2.19)     1.01 [  0.61]( 1.37)

  ==================================================================
  Test          : netperf
  Units         : Normalized Througput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
     Clients:       tip[pct imp](CV)      stats_prop[pct imp](CV)
    1-clients     1.00 [  0.00]( 1.43)     1.00 [  0.17]( 0.32)
    2-clients     1.00 [  0.00]( 1.02)     1.01 [  1.00]( 0.44)
    4-clients     1.00 [  0.00]( 0.83)     1.01 [  0.62]( 0.36)
    8-clients     1.00 [  0.00]( 0.73)     1.00 [ -0.11]( 0.65)
   16-clients     1.00 [  0.00]( 0.97)     1.00 [  0.49]( 0.77)
   32-clients     1.00 [  0.00]( 0.88)     1.00 [  0.30]( 0.94)
   64-clients     1.00 [  0.00]( 1.49)     1.00 [  0.36]( 1.57)
  128-clients     1.00 [  0.00]( 1.05)     1.00 [  0.14]( 1.46)
  256-clients     1.00 [  0.00]( 3.85)     1.00 [ -0.04]( 4.85)
  512-clients     1.00 [  0.00](59.63)     1.00 [ -0.02](62.28)

  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:     tip[pct imp](CV)      stats_prop[pct imp](CV)
    1         1.00 [ -0.00]( 6.67)     0.76 [ 24.44](35.80)
    2         1.00 [ -0.00](10.18)     0.87 [ 13.04](10.38)
    4         1.00 [ -0.00]( 4.49)     1.04 [ -4.26]( 3.14)
    8         1.00 [ -0.00]( 6.68)     0.98 [  1.89]( 8.07)
   16         1.00 [ -0.00]( 1.87)     1.03 [ -3.28]( 5.21)
   32         1.00 [ -0.00]( 4.01)     0.98 [  2.20]( 1.31)
   64         1.00 [ -0.00]( 3.21)     1.00 [ -0.00]( 3.23)
  128         1.00 [ -0.00](44.13)     1.06 [ -6.43](113.66)
  256         1.00 [ -0.00](14.46)     1.04 [ -3.52]( 8.43)
  512         1.00 [ -0.00]( 1.95)     1.02 [ -1.80]( 1.14)

  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers:      tip[pct imp](CV)      stats_prop[pct imp](CV)
    1          1.00 [  0.00]( 0.46)     1.00 [  0.00]( 0.55)
    2          1.00 [  0.00]( 0.15)     0.99 [ -0.88]( 0.26)
    4          1.00 [  0.00]( 0.15)     0.99 [ -0.59]( 0.15)
    8          1.00 [  0.00]( 0.15)     0.99 [ -0.88]( 0.26)
   16          1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
   32          1.00 [  0.00]( 3.40)     1.07 [  6.59]( 0.16)
   64          1.00 [  0.00]( 7.09)     1.00 [ -0.38]( 0.96)
  128          1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.20)
  256          1.00 [  0.00]( 1.12)     1.00 [ -0.30]( 1.50)
  512          1.00 [  0.00]( 0.22)     1.05 [  4.86]( 0.71)

  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:     tip[pct imp](CV)      stats_prop[pct imp](CV)
    1         1.00 [ -0.00](19.72)     0.85 [ 15.38](16.75)
    2         1.00 [ -0.00](15.96)     1.00 [ -0.00]( 0.00)
    4         1.00 [ -0.00]( 3.87)     1.00 [ -0.00]( 4.08)
    8         1.00 [ -0.00]( 8.15)     1.00 [ -0.00](11.71)
   16         1.00 [ -0.00]( 3.87)     0.92 [  7.69]( 4.19)
   32         1.00 [ -0.00](12.99)     0.73 [ 26.67]( 0.00)
   64         1.00 [ -0.00]( 6.20)     1.12 [-12.50]( 9.94)
  128         1.00 [ -0.00]( 0.96)     0.98 [  1.55]( 0.95)
  256         1.00 [ -0.00]( 2.76)     0.99 [  1.45]( 1.38)
  512         1.00 [ -0.00]( 0.20)     1.20 [-20.42]( 0.00)

  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:     tip[pct imp](CV)      stats_prop[pct imp](CV)
    1         1.00 [ -0.00]( 1.07)     1.02 [ -2.08]( 0.13)
    2         1.00 [ -0.00]( 0.14)     1.04 [ -3.97]( 0.13)
    4         1.00 [ -0.00]( 1.39)     1.03 [ -3.15]( 0.13)
    8         1.00 [ -0.00]( 0.36)     1.03 [ -3.16]( 0.00)
   16         1.00 [ -0.00]( 1.18)     1.02 [ -1.59]( 0.75)
   32         1.00 [ -0.00]( 8.42)     0.81 [ 19.08]( 0.25)
   64         1.00 [ -0.00]( 4.85)     1.01 [ -1.10]( 2.58)
  128         1.00 [ -0.00]( 0.28)     1.00 [ -0.21]( 0.38)
  256         1.00 [ -0.00](10.52)     0.95 [  4.74]( 6.94)
  512         1.00 [ -0.00]( 0.69)     1.09 [ -8.99]( 0.27)

  ==================================================================
  Test          : Various longer running benchmarks
  Units         : %diff in throughput reported
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  Benchmarks:                 %diff

  ycsb-cassandra             -0.54%
  ycsb-mongodb                0.09%

  deathstarbench-1x          -0.30%
  deathstarbench-2x           2.38%
  deathstarbench-3x           0.58%
  deathstarbench-6x           0.62%

  hammerdb+mysql 16VU         0.76%
  hammerdb+mysql 64VU         0.74%

* The tail latencies reported by schbench increases possibly due to the
  sync in load balancing across multiple domains however it remains to
  be investigated.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 92 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b402f294f0b..212bee3e9f35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10275,6 +10275,38 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
+static inline void cache_sd_stats(struct sched_domain *sd, struct sg_lb_stats *sd_stats)
+{
+	struct sched_domain_shared *sd_share = sd->shared;
+	unsigned long current_jiffy = jiffies;
+	struct sg_lb_stats_prop *lb_prop;
+
+	if (!sd_share)
+		return;
+
+	lb_prop = (struct sg_lb_stats_prop *)sd_share->private;
+	if (!lb_prop)
+		return;
+
+	/* Concurrent load balancing instance already updated the stats */
+	if (READ_ONCE(lb_prop->last_update) == current_jiffy)
+		return;
+
+	scoped_guard(raw_spinlock_irqsave_try, &lb_prop->stats_lock) {
+		if (READ_ONCE(lb_prop->last_update) == current_jiffy)
+			break;
+
+		lb_prop->sg_stats = *sd_stats;
+
+		/*
+		 *  Pairs against readers checking the last_update
+		 *  before reading the cached stats.
+		 */
+		smp_wmb();
+		WRITE_ONCE(lb_prop->last_update, current_jiffy);
+	}
+}
+
 static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
 {
 	/*
@@ -10344,6 +10376,35 @@ static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_
 	return time_before_eq(jiffies, current_jiffy);
 }
 
+/**
+ * aggregate_sd_prop_stats - Compute sched domains's stats from group stats.
+ * @env: The load balancing environment.
+ * @sgs_prop: variable to hold the statistics to propagate for the sd
+ * @sgs: Group stat that was computed or retrieved
+ */
+static inline void aggregate_sd_stats(struct lb_env *env,
+				      struct sg_lb_stats *sd_stats,
+				      struct sg_lb_stats *sg_stats)
+{
+	sd_stats->group_load += sg_stats->group_load;
+	sd_stats->group_util += sg_stats->group_util;
+	sd_stats->group_runnable += sg_stats->group_runnable;
+	sd_stats->sum_h_nr_running += sg_stats->sum_h_nr_running;
+	sd_stats->sum_nr_running += sg_stats->sum_nr_running;
+	sd_stats->idle_cpus += sg_stats->idle_cpus;
+	sd_stats->group_capacity += sg_stats->group_capacity;
+	sd_stats->group_weight += sg_stats->group_weight;
+	sd_stats->overloaded |= sg_stats->overloaded;
+	sd_stats->overutilized |= sg_stats->overutilized;
+
+#ifdef CONFIG_NUMA_BALANCING
+	if (env->sd->flags & SD_NUMA) {
+		sd_stats->nr_numa_running += sg_stats->nr_numa_running;
+		sd_stats->nr_preferred_running += sg_stats->nr_preferred_running;
+	}
+#endif
+}
+
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -11041,9 +11102,18 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 {
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
-	struct sg_lb_stats tmp_sgs;
-	unsigned long sum_util = 0;
 	bool sg_overloaded = 0, sg_overutilized = 0;
+	struct sg_lb_stats tmp_sgs, sd_stats;
+	unsigned long sum_util = 0;
+	bool should_prop = false;
+
+	/*
+	 * If a parent domain exists and the cached stats can be retrieved when
+	 * load balancing there, aggregate the statistics at current domain
+	 * to be retrieved when load balancing at parent.
+	 */
+	if (env->sd->parent && can_retrieve_stats(env->sd->parent, env->idle))
+		should_prop = true;
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
@@ -11061,21 +11131,36 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		update_sg_lb_stats(env, sds, sg, sgs);
 
+		if (should_prop)
+			aggregate_sd_stats(env, &sd_stats, sgs);
+
 		if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
 			sds->busiest_stat = *sgs;
 		}
 
 		/* Now, start updating sd_lb_stats */
-		sds->total_load += sgs->group_load;
-		sds->total_capacity += sgs->group_capacity;
-		sg_overloaded |= sgs->overloaded;
-		sg_overutilized |= sgs->overutlizied;
+		if (!should_prop) {
+			sds->total_load += sgs->group_load;
+			sds->total_capacity += sgs->group_capacity;
+			sg_overloaded |= sgs->overloaded;
+			sg_overutilized |= sgs->overutilized;
+			sum_util += sgs->group_util;
+		}
 
-		sum_util += sgs->group_util;
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
+	if (should_prop) {
+		sds->total_load = sd_stats.group_load;
+		sds->total_capacity = sd_stats.group_capacity;
+		sg_overloaded = sd_stats.overloaded;
+		sg_overutilized = sd_stats.overutilized;
+		sum_util = sd_stats.group_util;
+
+		cache_sd_stats(env->sd, &sd_stats);
+	}
+
 	/*
 	 * Indicate that the child domain of the busiest group prefers tasks
 	 * go to a child's sibling domains first. NB the flags of a sched group
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 09/08] [ANNOTATE] sched/fair: Stats versioning and invalidation
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (7 preceding siblings ...)
  2025-03-13  9:37 ` [RFC PATCH 8/8] sched/fair: Update stats for sched_domain using the sched_group stats K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
  2025-03-16 10:29 ` [RFC PATCH 10/08] sched/fair: Compute nr_{numa,preferred}_running for non-NUMA domains K Prateek Nayak
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

I would have loved to spin another version of this but me being slightly
short on time before OSPM decided to add these bits on top of the RFC.
Sorry for the inconvenience.

Stats versioning
================

Earlier experiments looked at aggressive stats caching and reuse. Load
balancing instances computed and cached the stats for non-local groups
hoping that they would be reused.

With stats versioning, the load balancing CPU only caches the stats for
the local hierarchy. Instead of the jiffy based "last_update" freshness,
this moves to versioning based on sched_clock_cpu() value.

Stats cached are invalidated if the CPU doing load balance is done
allowing fresher stats to be propagated. Stats computed by a concurrent
load balancing instance can now be reused allowing idle and newidle
balance to reuse stats effectively.

Stats versioning nuances are explained in Patch 11/08. Since idle and
newidle balance can reuse stats, changes have been made in aggregation
to consider reduced capacity, but also forego computing total capacity.

Benchmark results are as follows:

  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)      versioning[pct imp](CV)
   1-groups     1.00 [ -0.00](10.12)     1.00 [  0.44](13.86)
   2-groups     1.00 [ -0.00]( 6.92)     1.04 [ -4.32]( 3.00)
   4-groups     1.00 [ -0.00]( 3.14)     1.00 [ -0.21]( 2.16)
   8-groups     1.00 [ -0.00]( 1.35)     1.01 [ -1.25]( 1.32)
  16-groups     1.00 [ -0.00]( 1.32)     1.01 [ -0.50]( 2.00)

  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)      versioning[pct imp](CV)
      1     1.00 [  0.00]( 0.43)     0.98 [ -1.65]( 0.15)
      2     1.00 [  0.00]( 0.58)     1.01 [  1.27]( 0.49)
      4     1.00 [  0.00]( 0.54)     1.00 [  0.47]( 0.40)
      8     1.00 [  0.00]( 0.49)     1.00 [ -0.44]( 1.18)
     16     1.00 [  0.00]( 1.06)     1.00 [ -0.07]( 1.14)
     32     1.00 [  0.00]( 1.27)     1.00 [  0.02]( 0.11)
     64     1.00 [  0.00]( 1.54)     0.99 [ -1.12]( 1.09)
    128     1.00 [  0.00]( 0.38)     0.98 [ -2.43]( 1.00)
    256     1.00 [  0.00]( 1.85)     0.99 [ -0.50]( 0.94)
    512     1.00 [  0.00]( 0.31)     0.99 [ -1.03]( 0.35)
   1024     1.00 [  0.00]( 0.19)     0.99 [ -0.56]( 0.42)

  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      versioning[pct imp](CV)
   Copy     1.00 [  0.00](11.31)     1.08 [  7.51]( 4.74)
  Scale     1.00 [  0.00]( 6.62)     1.00 [ -0.31]( 7.45)
    Add     1.00 [  0.00]( 7.06)     1.02 [  2.50]( 7.34)
  Triad     1.00 [  0.00]( 8.91)     1.08 [  7.78]( 2.88)

  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)     versioning[pct imp](CV)
   Copy     1.00 [  0.00]( 2.01)     1.02 [  1.82]( 1.26)
  Scale     1.00 [  0.00]( 1.49)     1.00 [  0.26]( 0.80)
    Add     1.00 [  0.00]( 2.67)     1.01 [  0.98]( 1.29)
  Triad     1.00 [  0.00]( 2.19)     1.02 [  2.06]( 1.01)

  ==================================================================
  Test          : netperf
  Units         : Normalized Througput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:         tip[pct imp](CV)      versioning[pct imp](CV)
   1-clients     1.00 [  0.00]( 1.43)     0.99 [ -0.72]( 0.81)
   2-clients     1.00 [  0.00]( 1.02)     1.00 [ -0.09]( 1.11)
   4-clients     1.00 [  0.00]( 0.83)     1.00 [  0.31]( 0.29)
   8-clients     1.00 [  0.00]( 0.73)     1.00 [ -0.25]( 0.61)
  16-clients     1.00 [  0.00]( 0.97)     1.00 [ -0.26]( 0.89)
  32-clients     1.00 [  0.00]( 0.88)     0.99 [ -0.61]( 0.82)
  64-clients     1.00 [  0.00]( 1.49)     0.99 [ -1.11]( 1.77)
  128-clients    1.00 [  0.00]( 1.05)     1.00 [ -0.03]( 1.13)
  256-clients    1.00 [  0.00]( 3.85)     1.00 [ -0.24]( 2.63)
  512-clients    1.00 [  0.00](59.63)     0.99 [ -0.74](59.01)

  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:     tip[pct imp](CV)     versioning[pct imp](CV)
    1         1.00 [ -0.00]( 6.67)     0.93 [  6.67](15.25)
    2         1.00 [ -0.00](10.18)     0.83 [ 17.39]( 7.15)
    4         1.00 [ -0.00]( 4.49)     1.04 [ -4.26]( 6.12)
    8         1.00 [ -0.00]( 6.68)     1.06 [ -5.66](12.98)
   16         1.00 [ -0.00]( 1.87)     1.00 [ -0.00]( 3.38)
   32         1.00 [ -0.00]( 4.01)     0.98 [  2.20]( 4.79)
   64         1.00 [ -0.00]( 3.21)     1.02 [ -1.68]( 0.84)
  128         1.00 [ -0.00](44.13)     1.16 [-15.98](14.99)
  256         1.00 [ -0.00](14.46)     0.90 [  9.99](17.45)
  512         1.00 [ -0.00]( 1.95)     0.98 [  1.54]( 1.13)

  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers:    tip[pct imp](CV)      versioning[pct imp](CV)
    1        1.00 [  0.00]( 0.46)     1.00 [  0.00]( 0.26)
    2        1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.15)
    4        1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.30)
    8        1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.26)
   16        1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
   32        1.00 [  0.00]( 3.40)     1.06 [  5.93]( 1.22)
   64        1.00 [  0.00]( 7.09)     1.00 [  0.00]( 0.20)
  128        1.00 [  0.00]( 0.00)     0.98 [ -1.52]( 0.34)
  256        1.00 [  0.00]( 1.12)     0.98 [ -2.41]( 1.19)
  512        1.00 [  0.00]( 0.22)     1.00 [  0.00]( 0.43)

  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:    tip[pct imp](CV)      versioning[pct imp](CV)
    1        1.00 [ -0.00](19.72)     1.00 [ -0.00]( 8.37)
    2        1.00 [ -0.00](15.96)     1.09 [ -9.09](11.08)
    4        1.00 [ -0.00]( 3.87)     1.15 [-15.38](17.44)
    8        1.00 [ -0.00]( 8.15)     0.92 [  8.33]( 8.85)
   16        1.00 [ -0.00]( 3.87)     1.23 [-23.08]( 5.59)
   32        1.00 [ -0.00](12.99)     0.73 [ 26.67](16.75)
   64        1.00 [ -0.00]( 6.20)     1.25 [-25.00]( 2.63)
  128        1.00 [ -0.00]( 0.96)     1.62 [-62.37]( 1.30)
  256        1.00 [ -0.00]( 2.76)     0.82 [ 17.89](10.56)
  512        1.00 [ -0.00]( 0.20)     1.00 [ -0.00]( 0.34)

  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:    tip[pct imp](CV)      versioning[pct imp](CV)
    1        1.00 [ -0.00]( 1.07)     1.02 [ -2.34]( 0.13)
    2        1.00 [ -0.00]( 0.14)     1.04 [ -3.97]( 0.13)
    4        1.00 [ -0.00]( 1.39)     1.03 [ -3.15]( 0.13)
    8        1.00 [ -0.00]( 0.36)     1.03 [ -3.43]( 0.66)
   16        1.00 [ -0.00]( 1.18)     0.99 [  0.79]( 1.22)
   32        1.00 [ -0.00]( 8.42)     0.82 [ 18.29]( 9.02)
   64        1.00 [ -0.00]( 4.85)     1.00 [ -0.44]( 1.61)
  128        1.00 [ -0.00]( 0.28)     1.06 [ -5.64]( 1.10)
  256        1.00 [ -0.00](10.52)     0.81 [ 19.18](12.55)
  512        1.00 [ -0.00]( 0.69)     1.00 [  0.33]( 1.27)

  ==================================================================
  Test          : Various longer running benchmarks
  Units         : %diff in throughput reported
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  Benchmarks:                 %diff

  ycsb-cassandra             -0.76%
  ycsb-mongodb                0.49%

  deathstarbench-1x          -2.37%
  deathstarbench-2x           0.12%
  deathstarbench-3x           2.30%
  deathstarbench-6x           1.88%

  hammerdb+mysql 16VU         3.85%
  hammerdb+mysql 64VU         0.27%

Following are the schedstats diff for sched-messaging 4-group and
16-groups:

o 4-groups:

  ----------------------------------------------------------------------------------------------------
  CPU <ALL CPUS SUMMARY>
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2   PCT_CHANGE    PCT_CHANGE1 PCT_CHANGE2
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                              :           0,          0  |     0.00% |
  Legacy counter can be ignored                                    :           0,          0  |     0.00% |
  schedule() called                                                :      174683,     176871  |     1.25% |
  schedule() left the processor idle                               :       86742,      88113  |     1.58% |  (    49.66%,     49.82% )
  try_to_wake_up() was called                                      :       87675,      88622  |     1.08% |
  try_to_wake_up() was called to wake up the local cpu             :          28,         26  |    -7.14% |  (     0.03%,      0.03% )
  total runtime by tasks on this processor (in jiffies)            :  2124248214, 2118780927  |    -0.26% |
  total waittime by tasks on this processor (in jiffies)           :    24160304,   16912073  |   -30.00% |  (     1.14%,      0.80% )
  total timeslices run on this cpu                                 :       87936,      88753  |     0.93% |
  ----------------------------------------------------------------------------------------------------

  ---------------------------------------- <Category newidle> ----------------------------------------
  SMT:

  load_balance() total time to balance on newly idle               :      449650,     465044  |     3.42% |
  load_balance() stats reused on newly idle                        :           0,          0  |     0.00% |
  load_balance() stats recomputed on newly idle                    :        2493,       2679  |     7.46% |

  MC:

  load_balance() total time to balance on newly idle               :      660742,     610346  |    -7.63% |
  load_balance() stats reused on newly idle                        :           0,       1898  |     0.00% |
  load_balance() stats recomputed on newly idle                    :        3985,       3527  |   -11.49% |

  PKG:

  load_balance() total time to balance on newly idle               :      725938,     530707  |   -26.89% |
  load_balance() stats reused on newly idle                        :           0,        401  |     0.00% |
  load_balance() stats recomputed on newly idle                    :         722,        474  |   -34.35% |

  NUMA:

  load_balance() total time to balance on newly idle               :      406862,     410386  |     0.87% |
  load_balance() stats reused on newly idle                        :           0,         36  |     0.00% |
  load_balance() stats recomputed on newly idle                    :          48,         39  |   -18.75% |

o 16-groups:

  ----------------------------------------------------------------------------------------------------
  CPU <ALL CPUS SUMMARY>
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2   PCT_CHANGE    PCT_CHANGE1 PCT_CHANGE2
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                              :           0,          0  |     0.00% |
  Legacy counter can be ignored                                    :           0,          0  |     0.00% |
  schedule() called                                                :      566558,     554784  |    -2.08% |
  schedule() left the processor idle                               :      222161,     212164  |    -4.50% |  (    39.21%,     38.24% )
  try_to_wake_up() was called                                      :      325303,     322690  |    -0.80% |
  try_to_wake_up() was called to wake up the local cpu             :         990,       1017  |     2.73% |  (     0.30%,      0.32% )
  total runtime by tasks on this processor (in jiffies)            :  8807593610, 9142526964  |     3.80% |
  total waittime by tasks on this processor (in jiffies)           :  4093286876, 4314147489  |     5.40% |  (    46.47%,     47.19% )
  total timeslices run on this cpu                                 :      344281,     342495  |    -0.52% |
  ----------------------------------------------------------------------------------------------------

  ---------------------------------------- <Category newidle> ----------------------------------------
  SMT:

  load_balance() total time to balance on newly idle               :     9841719,   11615891  |    18.03% |
  load_balance() stats reused on newly idle                        :           0,          0  |     0.00% |
  load_balance() stats recomputed on newly idle                    :       28103,      27084  |    -3.63% |

  MC:

  load_balance() total time to balance on newly idle               :    20079305,   18103792  |    -9.84% |
  load_balance() stats reused on newly idle                        :           0,      37820  |     0.00% |
  load_balance() stats recomputed on newly idle                    :       63885,      33518  |   -47.53% |

  PKG:

  load_balance() total time to balance on newly idle               :    17972213,   16430220  |    -8.58% |
  load_balance() stats reused on newly idle                        :           0,       8461  |     0.00% |
  load_balance() stats recomputed on newly idle                    :       11513,       6318  |   -45.12% |

  NUMA:

  load_balance() total time to balance on newly idle               :    11050651,    9890509  |   -10.50% |
  load_balance() stats reused on newly idle                        :           0,        496  |     0.00% |
  load_balance() stats recomputed on newly idle                    :         827,        524  |   -36.64% |

---

Note: perf sched stats cannot properly aggregate "min" and "max" fields
yet.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
-- 
2.43.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC PATCH 10/08] sched/fair: Compute nr_{numa,preferred}_running for non-NUMA domains
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (8 preceding siblings ...)
  2025-03-16 10:29 ` [RFC PATCH 09/08] [ANNOTATE] sched/fair: Stats versioning and invalidation K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
  2025-03-16 10:29 ` [RFC PATCH 11/08] sched/fair: Move from "last_update" to stats versioning K Prateek Nayak
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

Migrations within a NUMA domain will not change
nr_{numa,preferred}_running stats. Compute it for non-NUMA groups for it
to be propagated and reused for the first NUMA domain when it exists.

While at it, also clear sd_stats before aggregation.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 212bee3e9f35..d09f900a3107 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10398,10 +10398,8 @@ static inline void aggregate_sd_stats(struct lb_env *env,
 	sd_stats->overutilized |= sg_stats->overutilized;
 
 #ifdef CONFIG_NUMA_BALANCING
-	if (env->sd->flags & SD_NUMA) {
-		sd_stats->nr_numa_running += sg_stats->nr_numa_running;
-		sd_stats->nr_preferred_running += sg_stats->nr_preferred_running;
-	}
+	sd_stats->nr_numa_running += sg_stats->nr_numa_running;
+	sd_stats->nr_preferred_running += sg_stats->nr_preferred_running;
 #endif
 }
 
@@ -10464,11 +10462,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->overloaded = 1;
 
 #ifdef CONFIG_NUMA_BALANCING
-		/* Only fbq_classify_group() uses this to classify NUMA groups */
-		if (sd_flags & SD_NUMA) {
-			sgs->nr_numa_running += rq->nr_numa_running;
-			sgs->nr_preferred_running += rq->nr_preferred_running;
-		}
+		sgs->nr_numa_running += rq->nr_numa_running;
+		sgs->nr_preferred_running += rq->nr_preferred_running;
 #endif
 		if (local_group)
 			continue;
@@ -11112,8 +11107,10 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	 * load balancing there, aggregate the statistics at current domain
 	 * to be retrieved when load balancing at parent.
 	 */
-	if (env->sd->parent && can_retrieve_stats(env->sd->parent, env->idle))
+	if (env->sd->parent && can_retrieve_stats(env->sd->parent, env->idle)) {
+		memset(&sd_stats, 0, sizeof(sd_stats));
 		should_prop = true;
+	}
 
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 11/08] sched/fair: Move from "last_update" to stats versioning
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (9 preceding siblings ...)
  2025-03-16 10:29 ` [RFC PATCH 10/08] sched/fair: Compute nr_{numa,preferred}_running for non-NUMA domains K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
  2025-03-16 10:29 ` [RFC PATCH 12/08] sched/fair: Record the cpu that updated the stats last K Prateek Nayak
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

The combination of "stats_lock" and jiffy based "last_update" is not
scalable for newidle balance. Instead move to a versioning-based scheme
where the version number helps with both readers reading consistent data
without the need for a lock and writers using the version for both
locking and indicating stats freshness.

Additional semantics have been added for the writers to update state
stats if the time elapsed since last update has crossed the 50us
threshold.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c     | 83 +++++++++++++++++++++++++++--------------
 kernel/sched/sched.h    | 22 ++++++++++-
 kernel/sched/topology.c |  3 +-
 3 files changed, 77 insertions(+), 31 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d09f900a3107..6c486e194a9d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10275,11 +10275,13 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
 	return check_cpu_capacity(rq, sd);
 }
 
-static inline void cache_sd_stats(struct sched_domain *sd, struct sg_lb_stats *sd_stats)
+static inline void cache_sd_stats(struct lb_env *env, struct sg_lb_stats *sd_stats)
 {
-	struct sched_domain_shared *sd_share = sd->shared;
-	unsigned long current_jiffy = jiffies;
+	struct sched_domain_shared *sd_share = env->sd->shared;
 	struct sg_lb_stats_prop *lb_prop;
+	int cpu, retry_limit = 3;
+	u64 time, lock;
+	long version;
 
 	if (!sd_share)
 		return;
@@ -10288,23 +10290,52 @@ static inline void cache_sd_stats(struct sched_domain *sd, struct sg_lb_stats *s
 	if (!lb_prop)
 		return;
 
-	/* Concurrent load balancing instance already updated the stats */
-	if (READ_ONCE(lb_prop->last_update) == current_jiffy)
+	version = atomic_long_read_acquire(&lb_prop->version);
+	if (version < 0) /* Raced with a concurrent update. */
 		return;
 
-	scoped_guard(raw_spinlock_irqsave_try, &lb_prop->stats_lock) {
-		if (READ_ONCE(lb_prop->last_update) == current_jiffy)
-			break;
+	guard(irqsave)(); /* Minimize interruptions. */
+
+	cpu = smp_processor_id();
+	time = sched_clock_cpu(cpu);
 
-		lb_prop->sg_stats = *sd_stats;
+	/* Version is still fresh, no need to be rude yet. */
+	if (version > 0 && (s64)(time - version) <= 50 * NSEC_PER_USEC)
+		return;
 
+retry:
+	/*
+	 * Try to grab the stats for update. If the cmpxchg fails,
+	 * a concurrent writer succeeded to grab the stats before
+	 * this load balancing instance did. The acquire ordering
+	 * also pairs against readers checking the version after
+	 * reading the stats to ensure consistent state.
+	 */
+	lock = atomic_long_cmpxchg_acquire(&lb_prop->version, version, LLONG_MIN);
+
+	/* Someone else grabbed the version. */
+	if (lock != version) {
 		/*
-		 *  Pairs against readers checking the last_update
-		 *  before reading the cached stats.
+		 * Version is up for grabs! Try again. If the CPU grabs
+		 * the lock next time around lock = version = 0 and this
+		 * is skipped. If it cannot grab the version, lock != 0
+		 * and we return from here thus ensuring on a single
+		 * retry.
 		 */
-		smp_wmb();
-		WRITE_ONCE(lb_prop->last_update, current_jiffy);
+		if (!lock) {
+			version = 0;
+			goto retry;
+		}
+		return;
 	}
+
+	lb_prop->sg_stats = *sd_stats;
+
+	/*
+	 * Pairs against readers checking the version
+	 * before reading the stats.
+	 */
+	atomic_long_set_release(&lb_prop->version, time);
 }
 
 static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
@@ -10346,8 +10377,8 @@ static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type
 static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_stats *sg_stats)
 {
 	struct sched_domain_shared *sg_share = group->shared;
-	unsigned long current_jiffy = jiffies;
 	struct sg_lb_stats_prop *lb_prop;
+	long version;
 
 	if (!sg_share)
 		return 0;
@@ -10356,24 +10387,22 @@ static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_
 	if (!lb_prop)
 		return 0;
 
-	/* Stale stats */
-	if (READ_ONCE(lb_prop->last_update) != current_jiffy)
-		return 0;
-
 	/*
-	 * Pairs against the update to sgs_prop->last_update to
-	 * prevent readers from seeing an inconsistent value of
-	 * the propagated stats from a concurrent update.
+	 * Pairs with writer atomically updating version after
+	 * writing the stats.
 	 */
-	smp_rmb();
+	version = atomic_long_read_acquire(&lb_prop->version);
+	if (version <= 0) /* Stats have gone stale / being updated. */
+		return 0;
+
 	*sg_stats = lb_prop->sg_stats;
 
 	/*
-	 * If stats were read in the same interval, it cannot
-	 * read an inconsistent state since stats are only
-	 * updated once per jiffy.
+	 * Pairs with writer atomically invalidating a version
+	 * before updating the stats.
 	 */
-	return time_before_eq(jiffies, current_jiffy);
+	smp_rmb();
+	return atomic_long_read(&lb_prop->version) == version;
 }
 
 /**
@@ -11155,7 +11184,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sg_overutilized = sd_stats.overutilized;
 		sum_util = sd_stats.group_util;
 
-		cache_sd_stats(env->sd, &sd_stats);
+		cache_sd_stats(env, &sd_stats);
 	}
 
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 391c4180eeb3..64f7e013fd59 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2176,8 +2176,26 @@ struct sg_lb_stats {
  * sched_domain load balancing statistics up the hierarchy.
  */
 struct sg_lb_stats_prop {
-	raw_spinlock_t          stats_lock;	/* Lock for updating the cached stats */
-	unsigned long		last_update;	/* Time when stats was last updated (jiffies) */
+	/*
+	 * Stats version has the following semantics:
+	 *
+	 * When 0, stats are considered state. A writer can lock the
+	 * stats by atomically changing it to LLONG_MIN. Once the
+	 * stats are written, the version is atomically updated to the
+	 * value returned by sched_clock_cpu().
+	 *
+	 * If the reader finds a positive value for version, the stats
+	 * are considered to be fresh and the reader will copy it for
+	 * load balancing. The version seen before and after the read
+	 * is compared to ensure the stats copied are consistent.
+	 *
+	 * Since invalidations under uncertain circumstances can take a
+	 * long time, a rude writer can always attempt to take over the
+	 * stats by atomically updating the version to LLONG_MIN if it
+	 * finds a large difference betwwen a valid version and the
+	 * value returned by sched_clock_cpu().
+	 */
+	atomic_long_t		version;
 	struct sg_lb_stats	sg_stats;	/* Cached sched_group stats */
 };
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index aeb55f66e8d6..2e72ef8d8d8e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2304,8 +2304,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 			if (!sg_stats)
 				return -ENOMEM;
 
-			raw_spin_lock_init(&sg_stats->stats_lock);
-			sg_stats->last_update = 0;
+			atomic_long_set(&sg_stats->version, 0);
 			sds->private = (void *)sg_stats;
 
 			sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 12/08] sched/fair: Record the cpu that updated the stats last
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (10 preceding siblings ...)
  2025-03-16 10:29 ` [RFC PATCH 11/08] sched/fair: Move from "last_update" to stats versioning K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
  2025-03-16 10:29 ` [RFC PATCH 13/08] sched/fair: Invalidate stats once the load balancing instance is done K Prateek Nayak
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

Record which CPU updated the stats last. This will be used to invalidate
the stats in the following commits.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c  | 5 +++--
 kernel/sched/sched.h | 1 +
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6c486e194a9d..2a34d73d824b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10279,9 +10279,9 @@ static inline void cache_sd_stats(struct lb_env *env, struct sg_lb_stats *sd_sta
 {
 	struct sched_domain_shared *sd_share = env->sd->shared;
 	struct sg_lb_stats_prop *lb_prop;
-	int cpu, retry_limit = 3;
 	u64 time, lock;
 	long version;
+	int cpu;
 
 	if (!sd_share)
 		return;
@@ -10319,7 +10319,7 @@ static inline void cache_sd_stats(struct lb_env *env, struct sg_lb_stats *sd_sta
 		 * Version is up for grabs! Try again. If the CPU grabs
 		 * the lock next time around lock = version = 0 and this
 		 * is skipped. If it cannot grab the version, lock != 0
-		 * and we return from here thus ensuring on a single
+		 * and we return from here thus ensuring only a single
 		 * retry.
 		 */
 		if (!lock) {
@@ -10330,6 +10330,7 @@ static inline void cache_sd_stats(struct lb_env *env, struct sg_lb_stats *sd_sta
 	}
 
 	lb_prop->sg_stats = *sd_stats;
+	lb_prop->update_cpu = cpu;
 
 	/*
 	 * Pairs against readers checking the version
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 64f7e013fd59..adf4fa2ed031 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2197,6 +2197,7 @@ struct sg_lb_stats_prop {
 	 */
 	atomic_long_t		version;
 	struct sg_lb_stats	sg_stats;	/* Cached sched_group stats */
+	int			update_cpu;	/* CPU that updated the stats */
 };
 
 static inline struct cpumask *sched_group_span(struct sched_group *sg)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 13/08] sched/fair: Invalidate stats once the load balancing instance is done
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (11 preceding siblings ...)
  2025-03-16 10:29 ` [RFC PATCH 12/08] sched/fair: Record the cpu that updated the stats last K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
  2025-03-16 10:29 ` [RFC PATCH 14/08] [DEBUG] sched/fair: Add more lb_stats around lb_time and stats reuse K Prateek Nayak
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

The CPU doing the load balancing propagates the stats bottom-up, reusing
them as it traverses up the hierarchy. Once done, or if a decision is
taken to migrate the tasks that affect the stats, the old version needs
to be invalidated for a newer CPU with a recent view to recompute and
cache it.

Invalidate the old version once load balancing instance is done. Rudely
take over the stats if another CPU sees that cached stats are older than
50us.

This allows idle and newidle balance to propagate at the very least its
local stats up the hierarchy.

  ==================================================================
  Test          : hackbench
  Units         : Normalized time in seconds
  Interpretation: Lower is better
  Statistic     : AMean
  ==================================================================
  Case:           tip[pct imp](CV)      versioning[pct imp](CV)
   1-groups     1.00 [ -0.00](10.12)     1.00 [  0.44](13.86)
   2-groups     1.00 [ -0.00]( 6.92)     1.04 [ -4.32]( 3.00)
   4-groups     1.00 [ -0.00]( 3.14)     1.00 [ -0.21]( 2.16)
   8-groups     1.00 [ -0.00]( 1.35)     1.01 [ -1.25]( 1.32)
  16-groups     1.00 [ -0.00]( 1.32)     1.01 [ -0.50]( 2.00)

  ==================================================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:    tip[pct imp](CV)      versioning[pct imp](CV)
      1     1.00 [  0.00]( 0.43)     0.98 [ -1.65]( 0.15)
      2     1.00 [  0.00]( 0.58)     1.01 [  1.27]( 0.49)
      4     1.00 [  0.00]( 0.54)     1.00 [  0.47]( 0.40)
      8     1.00 [  0.00]( 0.49)     1.00 [ -0.44]( 1.18)
     16     1.00 [  0.00]( 1.06)     1.00 [ -0.07]( 1.14)
     32     1.00 [  0.00]( 1.27)     1.00 [  0.02]( 0.11)
     64     1.00 [  0.00]( 1.54)     0.99 [ -1.12]( 1.09)
    128     1.00 [  0.00]( 0.38)     0.98 [ -2.43]( 1.00)
    256     1.00 [  0.00]( 1.85)     0.99 [ -0.50]( 0.94)
    512     1.00 [  0.00]( 0.31)     0.99 [ -1.03]( 0.35)
   1024     1.00 [  0.00]( 0.19)     0.99 [ -0.56]( 0.42)

  ==================================================================
  Test          : stream-10
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)      versioning[pct imp](CV)
   Copy     1.00 [  0.00](11.31)     1.08 [  7.51]( 4.74)
  Scale     1.00 [  0.00]( 6.62)     1.00 [ -0.31]( 7.45)
    Add     1.00 [  0.00]( 7.06)     1.02 [  2.50]( 7.34)
  Triad     1.00 [  0.00]( 8.91)     1.08 [  7.78]( 2.88)

  ==================================================================
  Test          : stream-100
  Units         : Normalized Bandwidth, MB/s
  Interpretation: Higher is better
  Statistic     : HMean
  ==================================================================
  Test:       tip[pct imp](CV)     versioning[pct imp](CV)
   Copy     1.00 [  0.00]( 2.01)     1.02 [  1.82]( 1.26)
  Scale     1.00 [  0.00]( 1.49)     1.00 [  0.26]( 0.80)
    Add     1.00 [  0.00]( 2.67)     1.01 [  0.98]( 1.29)
  Triad     1.00 [  0.00]( 2.19)     1.02 [  2.06]( 1.01)

  ==================================================================
  Test          : netperf
  Units         : Normalized Througput
  Interpretation: Higher is better
  Statistic     : AMean
  ==================================================================
  Clients:         tip[pct imp](CV)      versioning[pct imp](CV)
   1-clients     1.00 [  0.00]( 1.43)     0.99 [ -0.72]( 0.81)
   2-clients     1.00 [  0.00]( 1.02)     1.00 [ -0.09]( 1.11)
   4-clients     1.00 [  0.00]( 0.83)     1.00 [  0.31]( 0.29)
   8-clients     1.00 [  0.00]( 0.73)     1.00 [ -0.25]( 0.61)
  16-clients     1.00 [  0.00]( 0.97)     1.00 [ -0.26]( 0.89)
  32-clients     1.00 [  0.00]( 0.88)     0.99 [ -0.61]( 0.82)
  64-clients     1.00 [  0.00]( 1.49)     0.99 [ -1.11]( 1.77)
  128-clients    1.00 [  0.00]( 1.05)     1.00 [ -0.03]( 1.13)
  256-clients    1.00 [  0.00]( 3.85)     1.00 [ -0.24]( 2.63)
  512-clients    1.00 [  0.00](59.63)     0.99 [ -0.74](59.01)

  ==================================================================
  Test          : schbench
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:     tip[pct imp](CV)     versioning[pct imp](CV)
    1         1.00 [ -0.00]( 6.67)     0.93 [  6.67](15.25)
    2         1.00 [ -0.00](10.18)     0.83 [ 17.39]( 7.15)
    4         1.00 [ -0.00]( 4.49)     1.04 [ -4.26]( 6.12)
    8         1.00 [ -0.00]( 6.68)     1.06 [ -5.66](12.98)
   16         1.00 [ -0.00]( 1.87)     1.00 [ -0.00]( 3.38)
   32         1.00 [ -0.00]( 4.01)     0.98 [  2.20]( 4.79)
   64         1.00 [ -0.00]( 3.21)     1.02 [ -1.68]( 0.84)
  128         1.00 [ -0.00](44.13)     1.16 [-15.98](14.99)
  256         1.00 [ -0.00](14.46)     0.90 [  9.99](17.45)
  512         1.00 [ -0.00]( 1.95)     0.98 [  1.54]( 1.13)

  ==================================================================
  Test          : new-schbench-requests-per-second
  Units         : Normalized Requests per second
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  #workers:    tip[pct imp](CV)      versioning[pct imp](CV)
    1        1.00 [  0.00]( 0.46)     1.00 [  0.00]( 0.26)
    2        1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.15)
    4        1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.30)
    8        1.00 [  0.00]( 0.15)     1.00 [ -0.29]( 0.26)
   16        1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.00)
   32        1.00 [  0.00]( 3.40)     1.06 [  5.93]( 1.22)
   64        1.00 [  0.00]( 7.09)     1.00 [  0.00]( 0.20)
  128        1.00 [  0.00]( 0.00)     0.98 [ -1.52]( 0.34)
  256        1.00 [  0.00]( 1.12)     0.98 [ -2.41]( 1.19)
  512        1.00 [  0.00]( 0.22)     1.00 [  0.00]( 0.43)

  ==================================================================
  Test          : new-schbench-wakeup-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:    tip[pct imp](CV)      versioning[pct imp](CV)
    1        1.00 [ -0.00](19.72)     1.00 [ -0.00]( 8.37)
    2        1.00 [ -0.00](15.96)     1.09 [ -9.09](11.08)
    4        1.00 [ -0.00]( 3.87)     1.15 [-15.38](17.44)
    8        1.00 [ -0.00]( 8.15)     0.92 [  8.33]( 8.85)
   16        1.00 [ -0.00]( 3.87)     1.23 [-23.08]( 5.59)
   32        1.00 [ -0.00](12.99)     0.73 [ 26.67](16.75)
   64        1.00 [ -0.00]( 6.20)     1.25 [-25.00]( 2.63)
  128        1.00 [ -0.00]( 0.96)     1.62 [-62.37]( 1.30)
  256        1.00 [ -0.00]( 2.76)     0.82 [ 17.89](10.56)
  512        1.00 [ -0.00]( 0.20)     1.00 [ -0.00]( 0.34)

  ==================================================================
  Test          : new-schbench-request-latency
  Units         : Normalized 99th percentile latency in us
  Interpretation: Lower is better
  Statistic     : Median
  ==================================================================
  #workers:    tip[pct imp](CV)      versioning[pct imp](CV)
    1        1.00 [ -0.00]( 1.07)     1.02 [ -2.34]( 0.13)
    2        1.00 [ -0.00]( 0.14)     1.04 [ -3.97]( 0.13)
    4        1.00 [ -0.00]( 1.39)     1.03 [ -3.15]( 0.13)
    8        1.00 [ -0.00]( 0.36)     1.03 [ -3.43]( 0.66)
   16        1.00 [ -0.00]( 1.18)     0.99 [  0.79]( 1.22)
   32        1.00 [ -0.00]( 8.42)     0.82 [ 18.29]( 9.02)
   64        1.00 [ -0.00]( 4.85)     1.00 [ -0.44]( 1.61)
  128        1.00 [ -0.00]( 0.28)     1.06 [ -5.64]( 1.10)
  256        1.00 [ -0.00](10.52)     0.81 [ 19.18](12.55)
  512        1.00 [ -0.00]( 0.69)     1.00 [  0.33]( 1.27)

  ==================================================================
  Test          : Various longer running benchmarks
  Units         : %diff in throughput reported
  Interpretation: Higher is better
  Statistic     : Median
  ==================================================================
  Benchmarks:                 %diff

  ycsb-cassandra             -0.76%
  ycsb-mongodb                0.49%

  deathstarbench-1x          -2.37%
  deathstarbench-2x           0.12%
  deathstarbench-3x           2.30%
  deathstarbench-6x           1.88%

  hammerdb+mysql 16VU         3.85%
  hammerdb+mysql 64VU         0.27%

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 92 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 71 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a34d73d824b..31501b933d45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10341,17 +10341,6 @@ static inline void cache_sd_stats(struct lb_env *env, struct sg_lb_stats *sd_sta
 
 static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
 {
-	/*
-	 * Only under perioric load balancing can we ensure that no concurrent
-	 * CPUs modifies the stats being propagated upwards since
-	 * should_we_balance() can allow multiple concurrent newidle balance
-	 * to progress and an idle -> busy transition for idle balance will
-	 * require the stats to be recomputed since idleness metrics will
-	 * change with migration.
-	 */
-	if (idle)
-		return 0;
-
 	/*
 	 * If individual groups are separate NUMA domains, migrations can cause
 	 * preferred task statistics to change and will require recomputing of
@@ -10422,8 +10411,6 @@ static inline void aggregate_sd_stats(struct lb_env *env,
 	sd_stats->sum_h_nr_running += sg_stats->sum_h_nr_running;
 	sd_stats->sum_nr_running += sg_stats->sum_nr_running;
 	sd_stats->idle_cpus += sg_stats->idle_cpus;
-	sd_stats->group_capacity += sg_stats->group_capacity;
-	sd_stats->group_weight += sg_stats->group_weight;
 	sd_stats->overloaded |= sg_stats->overloaded;
 	sd_stats->overutilized |= sg_stats->overutilized;
 
@@ -10431,6 +10418,52 @@ static inline void aggregate_sd_stats(struct lb_env *env,
 	sd_stats->nr_numa_running += sg_stats->nr_numa_running;
 	sd_stats->nr_preferred_running += sg_stats->nr_preferred_running;
 #endif
+
+	if (env->idle &&
+	    sg_stats->group_misfit_task_load > sd_stats->group_misfit_task_load)
+		sd_stats->group_misfit_task_load = sg_stats->group_misfit_task_load;
+}
+
+static inline void __invalidate_stats(struct sched_domain *sd)
+{
+	struct sched_domain_shared *sd_share = sd->shared;
+	struct sg_lb_stats_prop *lb_prop;
+	long version;
+
+	if (!sd_share)
+		return;
+
+	lb_prop = (struct sg_lb_stats_prop *)sd_share->private;
+	if (!lb_prop)
+		return;
+
+	/*
+	 * The acquire ordering pairs against the writer updating the
+	 * "update_cpu" before setting a valid version.
+	 */
+	version = atomic_long_read_acquire(&lb_prop->version);
+	if (version <= 0) /* Stats are invalidated / being updated. */
+		return;
+
+	guard(irqsave)();
+
+	/*
+	 * Stats were not updated by this CPU. Leave it to the
+	 * update_cpu to clean it up.
+	 */
+	if (lb_prop->update_cpu != smp_processor_id())
+		return;
+
+	/* Invalidate the stats. */
+	atomic_long_cmpxchg_release(&lb_prop->version, version, 0);
+}
+
+static inline void invalidate_below(struct sched_domain *sd)
+{
+	struct sched_domain *child;
+
+	for (child = sd->child; child; child = child->child)
+		__invalidate_stats(child);
 }
 
 /**
@@ -10495,8 +10528,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->nr_numa_running += rq->nr_numa_running;
 		sgs->nr_preferred_running += rq->nr_preferred_running;
 #endif
-		if (local_group)
-			continue;
 
 		if (sd_flags & SD_ASYM_CPUCAPACITY) {
 			/* Check for a misfit task on the cpu */
@@ -10511,10 +10542,13 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		}
 	}
 
+group_classify:
 	sgs->group_capacity = group->sgc->capacity;
-
 	sgs->group_weight = group->group_weight;
 
+	if (local_group || !env->idle)
+		sgs->group_misfit_task_load = 0;
+
 	/* Check if dst CPU is idle and preferred to this group */
 	if (!local_group && env->idle && sgs->sum_h_nr_running &&
 	    sched_group_asym(env, sgs, group))
@@ -10524,7 +10558,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	if (!local_group && smt_balance(env, sgs, group))
 		sgs->group_smt_balance = 1;
 
-group_classify:
 	sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
 
 	/* Computing avg_load makes sense only when group is overloaded */
@@ -11167,9 +11200,10 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		}
 
 		/* Now, start updating sd_lb_stats */
+		sds->total_capacity += sgs->group_capacity;
+
 		if (!should_prop) {
 			sds->total_load += sgs->group_load;
-			sds->total_capacity += sgs->group_capacity;
 			sg_overloaded |= sgs->overloaded;
 			sg_overutilized |= sgs->overutilized;
 			sum_util += sgs->group_util;
@@ -11180,7 +11214,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 	if (should_prop) {
 		sds->total_load = sd_stats.group_load;
-		sds->total_capacity = sd_stats.group_capacity;
 		sg_overloaded = sd_stats.overloaded;
 		sg_overutilized = sd_stats.overutilized;
 		sum_util = sd_stats.group_util;
@@ -11947,6 +11980,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 
 		if (cur_ld_moved) {
 			attach_tasks(&env);
+			/*
+			 * If tasks have moved to an idle CPU, the idleness
+			 * metrics have changed. Invalidate stats for the
+			 * next instance to compute them afresh.
+			 */
+			if (env.idle)
+				__invalidate_stats(env.sd);
 			ld_moved += cur_ld_moved;
 		}
 
@@ -12308,12 +12348,12 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 	int continue_balancing = 1;
 	int cpu = rq->cpu;
 	int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
+	int need_serialize, need_decay = 0, invalidate = 1;
 	unsigned long interval, prev_sd_next_balance = 0;
 	struct sched_domain *sd;
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
 	int update_next_balance = 0;
-	int need_serialize, need_decay = 0;
 	u64 max_cost = 0;
 
 	rcu_read_lock();
@@ -12333,6 +12373,11 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 		 * actively.
 		 */
 		if (!continue_balancing) {
+			if (invalidate)  {
+				invalidate_below(sd);
+				invalidate = 0;
+			}
+
 			if (need_decay)
 				continue;
 			break;
@@ -12369,6 +12414,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 			next_balance = sd->last_balance + interval;
 			update_next_balance = 1;
 		}
+
+		if (!sd->parent)
+			invalidate_below(sd);
 	}
 	if (need_decay) {
 		/*
@@ -12987,8 +13035,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
 		 * Stop searching for tasks to pull if there are
 		 * now runnable tasks on this rq.
 		 */
-		if (pulled_task || !continue_balancing)
+		if (pulled_task || !continue_balancing || !sd->parent) {
+			invalidate_below(sd);
 			break;
+		}
 	}
 	rcu_read_unlock();
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 14/08] [DEBUG] sched/fair: Add more lb_stats around lb_time and stats reuse
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (12 preceding siblings ...)
  2025-03-16 10:29 ` [RFC PATCH 13/08] sched/fair: Invalidate stats once the load balancing instance is done K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
  2025-03-16 10:29 ` [RFC PATCH 15/08] [DEBUG] tools/lib/perf: Extend schedstats v17 headers to include the new debug fields K Prateek Nayak
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

Add stats for load balancing time and stats reuse efficiency.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 include/linux/sched/topology.h |  5 +++++
 kernel/sched/fair.c            | 21 ++++++++++++++++++++-
 kernel/sched/stats.c           |  9 +++++++--
 3 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a16d7d9dd9d3..dea65eb263c6 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -123,6 +123,11 @@ struct sched_domain {
 	unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
 	unsigned int lb_nobusyq[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_min_time[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_max_time[CPU_MAX_IDLE_TYPES];
+	unsigned long lb_total_time[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_stats_reused[CPU_MAX_IDLE_TYPES];
+	unsigned int lb_stats_recomputed[CPU_MAX_IDLE_TYPES];
 
 	/* Active load balancing */
 	unsigned int alb_count;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31501b933d45..bb7b21421415 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10491,10 +10491,13 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	 * group_smt_balance is not possible under busy load balancing.
 	 */
 	if (can_retrieve_stats(env->sd, env->idle) &&
-	    retrieve_cached_stats(group, sgs))
+	    retrieve_cached_stats(group, sgs)) {
+		schedstat_inc(env->sd->lb_stats_reused[env->idle]);
 		goto group_classify;
+	}
 
 	memset(sgs, 0, sizeof(*sgs));
+	schedstat_inc(env->sd->lb_stats_recomputed[env->idle]);
 
 	for_each_cpu_and(i, sched_group_span(group), env->cpus) {
 		struct rq *rq = cpu_rq(i);
@@ -11901,6 +11904,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 {
 	int ld_moved, cur_ld_moved, active_balance = 0;
 	struct sched_domain *sd_parent = sd->parent;
+	u64 lb_start = sched_clock_cpu(this_cpu);
 	struct sched_group *group;
 	struct rq *busiest;
 	struct rq_flags rf;
@@ -12174,6 +12178,21 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	    sd->balance_interval < sd->max_interval)
 		sd->balance_interval *= 2;
 out:
+	if (schedstat_enabled()) {
+		u64 now = sched_clock_cpu(this_cpu);
+		u64 elapsed = now - lb_start;
+
+		if (!schedstat_val(sd->lb_min_time[idle]) ||
+		    elapsed < schedstat_val(sd->lb_min_time[idle]))
+			__schedstat_set(sd->lb_min_time[idle], elapsed);
+
+		if (!schedstat_val(sd->lb_max_time[idle]) ||
+		    elapsed > schedstat_val(sd->lb_max_time[idle]))
+			__schedstat_set(sd->lb_max_time[idle], elapsed);
+
+		__schedstat_add(sd->lb_total_time[idle], elapsed);
+	}
+
 	return ld_moved;
 }
 
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 4346fd81c31f..b2ace3c51062 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -141,7 +141,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
 			seq_printf(seq, "domain%d %s %*pb", dcount++, sd->name,
 				   cpumask_pr_args(sched_domain_span(sd)));
 			for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
-				seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
+				seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u %u %lu %u %u",
 				    sd->lb_count[itype],
 				    sd->lb_balanced[itype],
 				    sd->lb_failed[itype],
@@ -152,7 +152,12 @@ static int show_schedstat(struct seq_file *seq, void *v)
 				    sd->lb_gained[itype],
 				    sd->lb_hot_gained[itype],
 				    sd->lb_nobusyq[itype],
-				    sd->lb_nobusyg[itype]);
+				    sd->lb_nobusyg[itype],
+				    sd->lb_min_time[itype],
+				    sd->lb_max_time[itype],
+				    sd->lb_total_time[itype],
+				    sd->lb_stats_reused[itype],
+				    sd->lb_stats_recomputed[itype]);
 			}
 			seq_printf(seq,
 				   " %u %u %u %u %u %u %u %u %u %u %u %u\n",
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RFC PATCH 15/08] [DEBUG] tools/lib/perf: Extend schedstats v17 headers to include the new debug fields
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (13 preceding siblings ...)
  2025-03-16 10:29 ` [RFC PATCH 14/08] [DEBUG] sched/fair: Add more lb_stats around lb_time and stats reuse K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
  2025-03-17 17:25 ` [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy Peter Zijlstra
  2025-03-21 10:04 ` Libo Chen
  16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
	linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak

The previous commit hacked up schedstats v17 to add more fields.
Extend the header file for perf sched stats for analysis. These changes
depend on perf sched stats tools being developed in [1].

Link: https://lore.kernel.org/lkml/20250311120230.61774-1-swapnil.sapkal@amd.com/ [1]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 tools/lib/perf/include/perf/schedstat-v17.h | 30 +++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/tools/lib/perf/include/perf/schedstat-v17.h b/tools/lib/perf/include/perf/schedstat-v17.h
index 00009bd5f006..888dfa982a55 100644
--- a/tools/lib/perf/include/perf/schedstat-v17.h
+++ b/tools/lib/perf/include/perf/schedstat-v17.h
@@ -47,6 +47,16 @@ DOMAIN_FIELD(__u32, busy_lb_nobusyq,
 	     "load_balance() failed to find busier queue on cpu busy", "%11u", true, v17);
 DOMAIN_FIELD(__u32, busy_lb_nobusyg,
 	     "load_balance() failed to find busier group on cpu busy", "%11u", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_min_time,
+	     "load_balance() min time to balance on busy", "%11u", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_max_time,
+	     "load_balance() max time to balance on busy", "%11u", true, v17);
+DOMAIN_FIELD(unsigned long, busy_lb_total_time,
+	     "load_balance() total time to balance on busy", "%11lu", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_stats_reused,
+	     "load_balance() stats reused on busy", "%11u", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_stats_recompute,
+	     "load_balance() stats recomputed on busy", "%11u", true, v17);
 #ifdef DERIVED_CNT_FIELD
 DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
 		  busy_lb_count, busy_lb_balanced, busy_lb_failed, v17);
@@ -80,6 +90,16 @@ DOMAIN_FIELD(__u32, idle_lb_nobusyq,
 	     "load_balance() failed to find busier queue on cpu idle", "%11u", true, v17);
 DOMAIN_FIELD(__u32, idle_lb_nobusyg,
 	     "load_balance() failed to find busier group on cpu idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_min_time,
+	     "load_balance() min time to balance on idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_max_time,
+	     "load_balance() max time to balance on idle", "%11u", true, v17);
+DOMAIN_FIELD(unsigned long, idle_lb_total_time,
+	     "load_balance() total time to balance on idle", "%11lu", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_stats_reused,
+	     "load_balance() stats reused on idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_stats_recompute,
+	     "load_balance() stats recomputed on idle", "%11u", true, v17);
 #ifdef DERIVED_CNT_FIELD
 DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
 		  idle_lb_count, idle_lb_balanced, idle_lb_failed, v17);
@@ -113,6 +133,16 @@ DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
 	     "load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v17);
 DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
 	     "load_balance() failed to find busier group on cpu newly idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_min_time,
+	     "load_balance() min time to balance on newly idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_max_time,
+	     "load_balance() max time to balance on newly idle", "%11u", true, v17);
+DOMAIN_FIELD(unsigned long, newidle_lb_total_time,
+	     "load_balance() total time to balance on newly idle", "%11lu", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_stats_reused,
+	     "load_balance() stats reused on newly idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_stats_recompute,
+	     "load_balance() stats recomputed on newly idle", "%11u", true, v17);
 #ifdef DERIVED_CNT_FIELD
 DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
 		  newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v17);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (14 preceding siblings ...)
  2025-03-16 10:29 ` [RFC PATCH 15/08] [DEBUG] tools/lib/perf: Extend schedstats v17 headers to include the new debug fields K Prateek Nayak
@ 2025-03-17 17:25 ` Peter Zijlstra
  2025-03-17 18:23   ` Chen, Yu C
  2025-03-21 10:04 ` Libo Chen
  16 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2025-03-17 17:25 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Chen Yu, linux-kernel,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde

On Thu, Mar 13, 2025 at 09:37:38AM +0000, K Prateek Nayak wrote:
> tl;dr
> 
> This prototype is currently limited in the sense that it can only reuse
> statistics for busy load balancing. Reusing stats for newidle load
> balancing specifically ran into issues elaborated below.

Right, it makes sense for busy load balance, newidle I think:

> David had proposed SHARED_RUNQ [4] to improve on the shortcomings of
> newidle balance for Meta's production workloads.

we need to look at this again. Something around the EEVDF merge made the
thing unhappy -- if we figure out what and fix it, I think this makes
more sense than trying to optimize the current scheme for newidle.

newidle really is about getting *any* work fast, which is a totally
different game than the regular busy balancing.

Anyway, I'll try and have a look through the patches.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy
  2025-03-17 17:25 ` [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy Peter Zijlstra
@ 2025-03-17 18:23   ` Chen, Yu C
  0 siblings, 0 replies; 24+ messages in thread
From: Chen, Yu C @ 2025-03-17 18:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak, yu.c.chen,
	yu.chen.surf

On 3/18/2025 1:25 AM, Peter Zijlstra wrote:
> On Thu, Mar 13, 2025 at 09:37:38AM +0000, K Prateek Nayak wrote:
>> tl;dr
>>
>> This prototype is currently limited in the sense that it can only reuse
>> statistics for busy load balancing. Reusing stats for newidle load
>> balancing specifically ran into issues elaborated below.
> 
> Right, it makes sense for busy load balance, newidle I think:
> 
>> David had proposed SHARED_RUNQ [4] to improve on the shortcomings of
>> newidle balance for Meta's production workloads.
> 
> we need to look at this again. Something around the EEVDF merge made the
> thing unhappy -- if we figure out what and fix it, I think this makes

Could you give some links on what the issue is? The newly-idle balance 
fail to pull tasks after switching to EEVDF?(I don't
see the connection between EEVDF and newly-idle balance on top of
my head)

> more sense than trying to optimize the current scheme for newidle.
> 
> newidle really is about getting *any* work fast, which is a totally
> different game than the regular busy balancing.
> 

The newly idle iterates every CPU in the domain to find the busiest one, 
would the following work: find a relative busy CPU and stop the search, 
say, rq->nr_running >= 2 and also consider the candidate task's average 
duration.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy
  2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
                   ` (15 preceding siblings ...)
  2025-03-17 17:25 ` [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy Peter Zijlstra
@ 2025-03-21 10:04 ` Libo Chen
  2025-03-24  3:58   ` K Prateek Nayak
  16 siblings, 1 reply; 24+ messages in thread
From: Libo Chen @ 2025-03-21 10:04 UTC (permalink / raw)
  To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Chen Yu, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde



On 3/13/25 02:37, K Prateek Nayak wrote:

> Benchmark results
> =================
> 

Hi Prateek,

Definitely like the idea, esp. if we can pull this off on newidle lb
which tends to be more problematic on systems with a large number
of cores. But the data below on periodic lb isn't I guess as good as
I expect. So I am wondering if the cost of update_[sd|sg]_lb_stats() 
actually went down as the result of the caching?

Thanks,
Libo
>   ==================================================================
>   Test          : hackbench
>   Units         : Normalized time in seconds
>   Interpretation: Lower is better
>   Statistic     : AMean
>   ==================================================================
>   Case:           tip[pct imp](CV)      stats_prop[pct imp](CV)
>    1-groups     1.00 [ -0.00](10.12)     1.09 [ -9.11](11.93)
>    2-groups     1.00 [ -0.00]( 6.92)     1.00 [ -0.22]( 4.57)
>    4-groups     1.00 [ -0.00]( 3.14)     0.99 [  0.83]( 1.77)
>    8-groups     1.00 [ -0.00]( 1.35)     1.00 [ -0.31]( 2.24)
>   16-groups     1.00 [ -0.00]( 1.32)     0.99 [  0.84]( 0.67)
> 
> 
>   ==================================================================
>   Test          : tbench
>   Units         : Normalized throughput
>   Interpretation: Higher is better
>   Statistic     : AMean
>   ==================================================================
>   Clients:    tip[pct imp](CV)      stats_prop[pct imp](CV)
>       1     1.00 [  0.00]( 0.43)     0.99 [ -0.87]( 1.34)
>       2     1.00 [  0.00]( 0.58)     1.02 [  2.14]( 0.29)
>       4     1.00 [  0.00]( 0.54)     1.01 [  1.24]( 0.82)
>       8     1.00 [  0.00]( 0.49)     1.01 [  0.62]( 0.97)
>      16     1.00 [  0.00]( 1.06)     1.01 [  0.94]( 0.70)
>      32     1.00 [  0.00]( 1.27)     0.99 [ -1.24]( 1.38)
>      64     1.00 [  0.00]( 1.54)     1.00 [ -0.43]( 0.36)
>     128     1.00 [  0.00]( 0.38)     1.00 [ -0.01]( 1.22)
>     256     1.00 [  0.00]( 1.85)     1.02 [  1.58]( 0.90)
>     512     1.00 [  0.00]( 0.31)     1.01 [  0.76]( 1.19)
>    1024     1.00 [  0.00]( 0.19)     1.00 [  0.44]( 0.35)
> 
> 
>   ==================================================================
>   Test          : stream-10
>   Units         : Normalized Bandwidth, MB/s
>   Interpretation: Higher is better
>   Statistic     : HMean
>   ==================================================================
>   Test:       tip[pct imp](CV)      stats_prop[pct imp](CV)
>    Copy     1.00 [  0.00](11.31)     1.02 [  1.69]( 6.44)
>   Scale     1.00 [  0.00]( 6.62)     1.01 [  0.80]( 5.37)
>     Add     1.00 [  0.00]( 7.06)     1.02 [  1.54]( 6.72)
>   Triad     1.00 [  0.00]( 8.91)     1.01 [  1.36]( 6.73)
> 
> 
>   ==================================================================
>   Test          : stream-100
>   Units         : Normalized Bandwidth, MB/s
>   Interpretation: Higher is better
>   Statistic     : HMean
>   ==================================================================
>   Test:       tip[pct imp](CV)      stats_prop[pct imp](CV)
>    Copy     1.00 [  0.00]( 2.01)     0.98 [ -1.55]( 2.15)
>   Scale     1.00 [  0.00]( 1.49)     1.00 [  0.23]( 0.58)
>     Add     1.00 [  0.00]( 2.67)     1.01 [  0.65]( 1.95)
>   Triad     1.00 [  0.00]( 2.19)     1.01 [  0.61]( 1.37)
> 
> 
>   ==================================================================
>   Test          : netperf
>   Units         : Normalized Througput
>   Interpretation: Higher is better
>   Statistic     : AMean
>   ==================================================================
>      Clients:       tip[pct imp](CV)      stats_prop[pct imp](CV)
>     1-clients     1.00 [  0.00]( 1.43)     1.00 [  0.17]( 0.32)
>     2-clients     1.00 [  0.00]( 1.02)     1.01 [  1.00]( 0.44)
>     4-clients     1.00 [  0.00]( 0.83)     1.01 [  0.62]( 0.36)
>     8-clients     1.00 [  0.00]( 0.73)     1.00 [ -0.11]( 0.65)
>    16-clients     1.00 [  0.00]( 0.97)     1.00 [  0.49]( 0.77)
>    32-clients     1.00 [  0.00]( 0.88)     1.00 [  0.30]( 0.94)
>    64-clients     1.00 [  0.00]( 1.49)     1.00 [  0.36]( 1.57)
>   128-clients     1.00 [  0.00]( 1.05)     1.00 [  0.14]( 1.46)
>   256-clients     1.00 [  0.00]( 3.85)     1.00 [ -0.04]( 4.85)
>   512-clients     1.00 [  0.00](59.63)     1.00 [ -0.02](62.28)
> 
> 
>   ==================================================================
>   Test          : schbench
>   Units         : Normalized 99th percentile latency in us
>   Interpretation: Lower is better
>   Statistic     : Median
>   ==================================================================
>   #workers:     tip[pct imp](CV)      stats_prop[pct imp](CV)
>     1         1.00 [ -0.00]( 6.67)     0.76 [ 24.44](35.80)
>     2         1.00 [ -0.00](10.18)     0.87 [ 13.04](10.38)
>     4         1.00 [ -0.00]( 4.49)     1.04 [ -4.26]( 3.14)
>     8         1.00 [ -0.00]( 6.68)     0.98 [  1.89]( 8.07)
>    16         1.00 [ -0.00]( 1.87)     1.03 [ -3.28]( 5.21)
>    32         1.00 [ -0.00]( 4.01)     0.98 [  2.20]( 1.31)
>    64         1.00 [ -0.00]( 3.21)     1.00 [ -0.00]( 3.23)
>   128         1.00 [ -0.00](44.13)     1.06 [ -6.43](113.66)
>   256         1.00 [ -0.00](14.46)     1.04 [ -3.52]( 8.43)
>   512         1.00 [ -0.00]( 1.95)     1.02 [ -1.80]( 1.14)
> 
> 
>   ==================================================================
>   Test          : new-schbench-requests-per-second
>   Units         : Normalized Requests per second
>   Interpretation: Higher is better
>   Statistic     : Median
>   ==================================================================
>   #workers:      tip[pct imp](CV)      stats_prop[pct imp](CV)
>     1          1.00 [  0.00]( 0.46)     1.00 [  0.00]( 0.55)
>     2          1.00 [  0.00]( 0.15)     0.99 [ -0.88]( 0.26)
>     4          1.00 [  0.00]( 0.15)     0.99 [ -0.59]( 0.15)
>     8          1.00 [  0.00]( 0.15)     0.99 [ -0.88]( 0.26)
>    16          1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
>    32          1.00 [  0.00]( 3.40)     1.07 [  6.59]( 0.16)
>    64          1.00 [  0.00]( 7.09)     1.00 [ -0.38]( 0.96)
>   128          1.00 [  0.00]( 0.00)     1.00 [  0.00]( 0.20)
>   256          1.00 [  0.00]( 1.12)     1.00 [ -0.30]( 1.50)
>   512          1.00 [  0.00]( 0.22)     1.05 [  4.86]( 0.71)
> 
> 
>   ==================================================================
>   Test          : new-schbench-wakeup-latency
>   Units         : Normalized 99th percentile latency in us
>   Interpretation: Lower is better
>   Statistic     : Median
>   ==================================================================
>   #workers:     tip[pct imp](CV)      stats_prop[pct imp](CV)
>     1         1.00 [ -0.00](19.72)     0.85 [ 15.38](16.75)
>     2         1.00 [ -0.00](15.96)     1.00 [ -0.00]( 0.00)
>     4         1.00 [ -0.00]( 3.87)     1.00 [ -0.00]( 4.08)
>     8         1.00 [ -0.00]( 8.15)     1.00 [ -0.00](11.71)
>    16         1.00 [ -0.00]( 3.87)     0.92 [  7.69]( 4.19)
>    32         1.00 [ -0.00](12.99)     0.73 [ 26.67]( 0.00)
>    64         1.00 [ -0.00]( 6.20)     1.12 [-12.50]( 9.94)
>   128         1.00 [ -0.00]( 0.96)     0.98 [  1.55]( 0.95)
>   256         1.00 [ -0.00]( 2.76)     0.99 [  1.45]( 1.38)
>   512         1.00 [ -0.00]( 0.20)     1.20 [-20.42]( 0.00)
> 
> 
>   ==================================================================
>   Test          : new-schbench-request-latency
>   Units         : Normalized 99th percentile latency in us
>   Interpretation: Lower is better
>   Statistic     : Median
>   ==================================================================
>   #workers:     tip[pct imp](CV)      stats_prop[pct imp](CV)
>     1         1.00 [ -0.00]( 1.07)     1.02 [ -2.08]( 0.13)
>     2         1.00 [ -0.00]( 0.14)     1.04 [ -3.97]( 0.13)
>     4         1.00 [ -0.00]( 1.39)     1.03 [ -3.15]( 0.13)
>     8         1.00 [ -0.00]( 0.36)     1.03 [ -3.16]( 0.00)
>    16         1.00 [ -0.00]( 1.18)     1.02 [ -1.59]( 0.75)
>    32         1.00 [ -0.00]( 8.42)     0.81 [ 19.08]( 0.25)
>    64         1.00 [ -0.00]( 4.85)     1.01 [ -1.10]( 2.58)
>   128         1.00 [ -0.00]( 0.28)     1.00 [ -0.21]( 0.38)
>   256         1.00 [ -0.00](10.52)     0.95 [  4.74]( 6.94)
>   512         1.00 [ -0.00]( 0.69)     1.09 [ -8.99]( 0.27)
> 
> 
>   ==================================================================
>   Test          : Various longer running benchmarks
>   Units         : %diff in throughput reported
>   Interpretation: Higher is better
>   Statistic     : Median
>   ==================================================================
>   Benchmarks:                 %diff
> 
>   ycsb-cassandra             -0.54%
>   ycsb-mongodb                0.09%
> 
>   deathstarbench-1x          -0.30%
>   deathstarbench-2x           2.38%
>   deathstarbench-3x           0.58%
>   deathstarbench-6x           0.62%
> 
>   hammerdb+mysql 16VU         0.76%
>   hammerdb+mysql 64VU         0.74%
> ---
> 



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy
  2025-03-21 10:04 ` Libo Chen
@ 2025-03-24  3:58   ` K Prateek Nayak
  0 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-24  3:58 UTC (permalink / raw)
  To: Libo Chen, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Chen Yu, linux-kernel
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, David Vernet, Gautham R. Shenoy,
	Swapnil Sapkal, Shrikanth Hegde

Hello Libo,

Thank you for taking a look at the series and sorry for the late
response.

On 3/21/2025 3:34 PM, Libo Chen wrote:
> 
> 
> On 3/13/25 02:37, K Prateek Nayak wrote:
> 
>> Benchmark results
>> =================
>>
> 
> Hi Prateek,
> 
> Definitely like the idea, esp. if we can pull this off on newidle lb
> which tends to be more problematic on systems with a large number
> of cores. But the data below on periodic lb isn't I guess as good as
> I expect. So I am wondering if the cost of update_[sd|sg]_lb_stats()
> actually went down as the result of the caching?

I have some numbers for versioning idea that I got working just before
OSPM in [1] The benchmark results don't move much but the total time
for newidle balance reduces by ~5% overall.

There is a ~30% overhead of aggregating and propagating the stats
upwards at SMT domain that offsets some of the benefits of propagation
at higher domains but I'm working to see if this can be reduced and
only done if required.

Some ideas were discussed at OSPM to reduce the overheads further and
shared the burden of busy load balancing across all CPUs in the domain
and I'll tackle that next.

If you have any benchmark where this shows up prominently, please do let
me know and I can try adding it to the bunch.

[1] https://lore.kernel.org/lkml/20250316102916.10614-1-kprateek.nayak@amd.com/

-- 
Thanks and Regards,
Prateek

> 
> Thanks,
> Libo

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-03-24  3:59 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-13  9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 1/8] sched/topology: Assign sd_share for all non NUMA sched domains K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 2/8] sched/topology: Introduce sg->shared K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 3/8] sched/fair: Move "struct sg_lb_stats" and its dependencies to sched.h K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 4/8] sched/fair: Move sg_{overloaded,overutilized} calculation to sg_lb_stats K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 5/8] sched/topology: Define sg_lb_stats_prop and embed it inside sched_domain_shared K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused K Prateek Nayak
2025-03-17 18:07   ` Chen, Yu C
2025-03-19  6:51     ` K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop K Prateek Nayak
2025-03-17 18:04   ` Chen, Yu C
2025-03-19  6:42     ` K Prateek Nayak
2025-03-13  9:37 ` [RFC PATCH 8/8] sched/fair: Update stats for sched_domain using the sched_group stats K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 09/08] [ANNOTATE] sched/fair: Stats versioning and invalidation K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 10/08] sched/fair: Compute nr_{numa,preferred}_running for non-NUMA domains K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 11/08] sched/fair: Move from "last_update" to stats versioning K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 12/08] sched/fair: Record the cpu that updated the stats last K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 13/08] sched/fair: Invalidate stats once the load balancing instance is done K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 14/08] [DEBUG] sched/fair: Add more lb_stats around lb_time and stats reuse K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 15/08] [DEBUG] tools/lib/perf: Extend schedstats v17 headers to include the new debug fields K Prateek Nayak
2025-03-17 17:25 ` [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy Peter Zijlstra
2025-03-17 18:23   ` Chen, Yu C
2025-03-21 10:04 ` Libo Chen
2025-03-24  3:58   ` K Prateek Nayak

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox