linux-kernel.vger.kernel.org archive mirror
* IPC drop down on AMD epyc 7702P
@ 2025-04-17 21:08 Jean-Baptiste Roquefere
  2025-04-18  6:39 ` K Prateek Nayak
  0 siblings, 1 reply; 16+ messages in thread
From: Jean-Baptiste Roquefere @ 2025-04-17 21:08 UTC (permalink / raw)
  To: stable@vger.kernel.org
  Cc: regressions@lists.linux.dev, mingo@kernel.org,
	linux-kernel@vger.kernel.org


[-- Attachment #1.1: Type: text/plain, Size: 4173 bytes --]



Hi,
We (Ateme, a video encoding company) may have found an unwanted behavior in the scheduler, introduced in 5.10 (commit 16b0a7a1a0af), then 5.16 (commit c5b0a7eefc70), then 5.19 (commit not found yet), and maybe by some other commits from 5.19 to 6.12, with an IPC decrease as a consequence. The problem still appears on the latest 6.12, 6.13 and 6.14.

We have reverted both the 16b0a7a1a0af and c5b0a7eefc70 commits that reduce our performance (see fair.patch attached, applicable on 6.12.17). Performance increases but still doesn't reach our reference on 5.4.152.

Instead of trying to find every single commit from 5.18 to 6.12 that could decrease our performance, I chose to bench 5.4.152 versus 6.12.17 with and without fair.patch.

The problem appeared clearly: a lot of CPU migrations go out of the CCX, leading to L3 misses and, in turn, an IPC decrease.

Context of our bench: a video decoder which works at a regulated speed, 1 process, 21 main threads, each of which creates 10 threads, 8 of them with a fine granularity, meaning they go to sleep quite often, giving the scheduler a lot of opportunities to act.
Hardware is an AMD EPYC 7702P, 128 CPUs, grouped by shared LLC into 4 cores + 4 hyperthreaded siblings. NUMA topology is set by the BIOS to 1 node per socket.
Every pthread is created with default attributes.
I use AMDuProf (-C -A system -a -m ipc,l1,l2,l3,memory) for CPU utilization (%), CPU effective frequency, IPC, L2 access (pti), L2 miss (pti), L3 miss (absolute) and Mem (GB/s), and perf (stat -d -d -d -a) for context switches, CPU migrations and real time (s).


We noted that upgrading from 5.4.152 to 6.12.17, without any special preempt configuration, gives:
- a two-fold increase in CPU migrations
- a 30% increase in memory bandwidth
- a 20% increase in L3 cache misses
- a 10% decrease in IPC

With the attached fair.patch applied to 6.12.17 (reminder: this patch reverts one commit that appeared in 5.10 and another in 5.16) we managed to reduce CPU migrations and increase IPC, but not as much as we had on 5.4.152. Our goal is to keep the kernel "clean" without any patch (we don't want to apply and maintain fair.patch), so for the rest of this email we will consider the stock kernel 6.12.17.

I've reduced the "sub thread count" to stay below 128 threads: still 21 main threads, but 5 workers per main thread instead of 10 (4 of them with fine granularity), giving 105 pthreads -> everything goes fine on 6.12.17, no extra CPU migrations, no extra memory bandwidth...

But as soon as we increase the worker thread count (10 instead of 5) the problem appears.

We know our decoder may have too many threads, but that's out of our scope: it was designed like that some years ago, and moving from "lots of small threads" to "a few big threads" is not possible for now.

We have a workaround: we group threads using pthread affinities. Every main thread (and, by inheritance of affinities, every worker thread it creates) is pinned to a single CCX, which reduces their L3 misses, then decreases memory bandwidth, and finally increases IPC. A minimal sketch is shown just below.
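
For illustration only, here is a minimal sketch of that kind of grouping; the consecutive CCX-to-CPU numbering and the helper below are hypothetical, not our actual code (on the real machine the masks should come from the cache topology, e.g. /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/*
 * Hypothetical layout: 8 logical CPUs (4 cores + 4 SMT siblings) per CCX,
 * numbered consecutively. The real mapping must be read from the cache
 * topology in sysfs; this is only to show the mechanism.
 */
#define CPUS_PER_CCX	8

static void *main_thread_fn(void *arg)
{
	/*
	 * Decode work goes here. Worker threads created from this function
	 * with default attributes inherit the affinity mask set below.
	 */
	return NULL;
}

/* Pin main thread 'idx' to one CCX; its future workers inherit the mask. */
static int spawn_pinned_main_thread(pthread_t *tid, int idx, int nr_ccx, void *arg)
{
	pthread_attr_t attr;
	cpu_set_t set;
	int first = (idx % nr_ccx) * CPUS_PER_CCX;
	int ret;

	CPU_ZERO(&set);
	for (int cpu = first; cpu < first + CPUS_PER_CCX; cpu++)
		CPU_SET(cpu, &set);

	pthread_attr_init(&attr);
	pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
	ret = pthread_create(tid, &attr, main_thread_fn, arg);
	pthread_attr_destroy(&attr);
	return ret;
}

pthread_attr_setaffinity_np() is a GNU extension; worker threads created later from main_thread_fn() with default attributes inherit the creating thread's affinity mask, which is what keeps each group on one CCX.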

With that solution we go above our original performance, on both kernels, and they perform at the same level. However, it is impractical to productize as such.

I've tried many kernel build configurations (CONFIG_PREEMPT_*, CONFIG_SCHEDULER_*, tuning of fair.c:sysctl_sched_migration_cost) on 6.12.17, 6.12.21 (longterm), 6.13.9 (mainline), and 6.14.0. Nothing changes.

Q: Is there any way to tune the kernel so we can get our performance back without using the pthread affinity workaround?

Feel free to ask for an archive containing binaries and payload.

I first posted on https://bugzilla.kernel.org/show_bug.cgi?id=220000 but I was told the best way to get answers was on these mailing lists.

Regards,


Jean-Baptiste Roquefere, Ateme



 Attached bench.tar.gz:
 * bench/fair.patch
 * bench/bench.ods with 2 sheets:
    - regulated: decoder speed is regulated to keep real time constant
    - no regul: decoder speed is not regulated and uses from 1 to 76 main threads with 10 workers per main thread
 * bench/regulated.csv: bench.ods:regulated exported in csv format
 * bench/not-regulated: bench.ods:no regul exported in csv format

[-- Attachment #1.2: Type: text/html, Size: 5184 bytes --]

[-- Attachment #2: bench.tar.gz --]
[-- Type: application/x-gzip, Size: 20312 bytes --]


* Re: IPC drop down on AMD epyc 7702P
  2025-04-17 21:08 IPC drop down on AMD epyc 7702P Jean-Baptiste Roquefere
@ 2025-04-18  6:39 ` K Prateek Nayak
  2025-04-28  7:43   ` Jean-Baptiste Roquefere
  0 siblings, 1 reply; 16+ messages in thread
From: K Prateek Nayak @ 2025-04-18  6:39 UTC (permalink / raw)
  To: Jean-Baptiste Roquefere, stable@vger.kernel.org,
	Gautham R. Shenoy, Swapnil Sapkal
  Cc: regressions@lists.linux.dev, mingo@kernel.org,
	linux-kernel@vger.kernel.org, Borislav Petkov

Hello Jean,

On 4/18/2025 2:38 AM, Jean-Baptiste Roquefere wrote:
> 
> 
> Hi,
> We (Ateme, a video encoding company) may have found an unwanted behavior in the scheduler since 5.10 (commit 16b0a7a1a0af), then 5.16 (commit c5b0a7eefc70), then 5.19 (commit not found yet), then maybe some other commits from 5.19 to 6.12, with a consequence of IPC decrease. Problem still appears on lasts 6.12, 6.13 and 6.14

Looking at the commit logs, it looks like these commits do solve other
problems around load balancing and might not be trivial to revert
without evaluating the damage.

> 
> We have reverted both 16b0a7a1a0af and c5b0a7eefc70 commits that reduce our performances (see fair.patch attached, applicable on 6.12.17). Performances increase but still doesnt reach our reference on 5.4.152.
> 
> Instead of trying to find every single commits from 5.18 to 6.12 that could decrease our performance, I chosed to bench 5.4.152 versus 6.12.17 with and without fair.patch.
> 
> The problem appeared clear : a lot of CPU migrations go out of CCX, then L3 miss, then IPC decrease.
> 
> Context of our bench: video decoder which work at a regulated speed, 1 process, 21 main threads, everyone of them creates 10 threads, 8 of them have a fine granularity, meaning they go to sleep quite often, giving the scheduler a lot of opportunities to act).
> Hardware is an AMD Epyc 7702P, 128 cores, grouped by shared LLC 4 cores +4 hyperthreaded cores. NUMA topology is set by the BIOS to 1 node per socket.
> Every pthread are created with default attributes.
> I use AMDuProf (-C -A system -a -m ipc,l1,l2,l3,memory) for CPU Utilization (%), CPU effective freq, IPC, L2 access (pti), L2 miss (pti), L3 miss (absolute) and Mem (GB/s, and perf (stat -d -d -d -a) for Context switches, CPU migrations and Real time (s).
> 
> 
> We noted that upgrade 5.4.152 to 6.12.17 without any special preempt configuration :
> Two fold increase in CPU migration
> 30% memory bandwidth increase
> 20% L3 cache misses increase
> 10% IPC decrease
> 
> With the attached fair.patch applied to 6.12.17 (reminder : this patch reverts one commit appeared in 5.10 and another in 5.16) we managed to reduce CPU migrations and increase IPC but not as much as we had on 5.4.152. Our goal is to keep kernel "clean" without any patch (we don't want to apply and maintain fair.patch) then for the rest of my email we will consider stock kernel 6.12.17.
> 
> I've reduced the "sub threads count" to stays below 128 threads. Then still 21 main threads and instead of 10 worker per main thread I set 5 workers (4 of them with fine granularity) giving 105 pthreads -> everything goes fine in 6.12.17, no extra CPU migration, no extra memory bandwidth...

The processor you are running on, the AMD EPYC 7702P based on the Zen2
architecture, contains 4 cores / 8 threads per CCX (LLC domain), which is
perhaps why reducing the thread count to below this limit is helping
your workload.

What we suspect is that when running the workload, the threads that
regularly sleep trigger a newidle balancing which causes them to move
to another CCX leading to higher number of L3 misses.

To confirm this, would it be possible to run the workload with the
not-yet-upstream perf sched stats [1] tool and share the result from
perf sched stats diff for the data from v6.12.17 and v6.12.17 + patch
to rule out any other second order effect.

[1] https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/

> 
> But as soon as we increase worker threads count (10 instead of 5) the problem appears.
> 
> We know our decoder may have too many threads but that's out of our scope, it has been designed like that some years ago and moving from "lot of small threads to few of big thread" is for now not possible.
> 
> We have a work around : we group threads using pthread affinities. Every main thread (and by inheritance of affinities every worker threads) on a single CCX so we reduce the L3 miss for them, then decrease memory bandwidth, then finally increasing IPC.
> 
> With that solution, we go above our original performances, for both kernels, and they perform at the same level. However, it is impractical to productize as such.
> 
> I've tried many kernel build configurations (CONFIG_PREMPT_*, CONFIG_SCHEDULER_*, tuning of fair.c:sysctl_sched_migration_cost) on 6.12.17, 6.12.21 (longterm), 6.13.9 (mainline), and 6.14.0 Nothing changes.
> 
> Q: Is there anyway to tune the kernel so we can get our performance back without using the pthread affinities work around ?

Assuming you control these deployments, would it be possible to run
the workload on a kernel running with the "relax_domain_level=2" kernel
cmdline, which restricts newidle balance to only within the CCX? As a
side effect, it also limits task wakeups to the same LLC domain, but
I would still like to know if this makes a difference to the
workload you are running.

Note: This is a system-wide knob and will affect all workloads
running on the system and is better used for debug purposes.

-- 
Thanks and Regards,
Prateek

> 
> Feel free to ask an archive containing binaries and payload.
> 
> I first posted on https://bugzilla.kernel.org/show_bug.cgi?id=220000 but one told me the best way to get answers where these mailing lists
> 
> Regards,
> 
> 
> Jean-Baptiste Roquefere, Ateme
> 
> 
> 
>   Attached bench.tar.gz :
>   * bench/fair.patch
>   * bench/bench.ods with 2 sheets :
>      - regulated : decoder speed is regulated to keep real time constant
>      - no regul : decoder speed is not regulated and uses from 1 to 76 main threads with 10 worker per main thread
> * bench/regulated.csv : bench.ods:regulated exported in csv format
> * bench/not-regulated : bench.ods:no regul exported in csv format
> 




* Re: IPC drop down on AMD epyc 7702P
  2025-04-18  6:39 ` K Prateek Nayak
@ 2025-04-28  7:43   ` Jean-Baptiste Roquefere
  2025-04-30  9:13     ` K Prateek Nayak
  0 siblings, 1 reply; 16+ messages in thread
From: Jean-Baptiste Roquefere @ 2025-04-28  7:43 UTC (permalink / raw)
  To: K Prateek Nayak, stable@vger.kernel.org, Gautham R. Shenoy,
	Swapnil Sapkal
  Cc: regressions@lists.linux.dev, mingo@kernel.org,
	linux-kernel@vger.kernel.org, Borislav Petkov

[-- Attachment #1: Type: text/plain, Size: 3169 bytes --]

Hello Prateek,

thanks for your response.


> Looking at the commit logs, it looks like these commits do solve other
> problems around load balancing and might not be trivial to revert
> without evaluating the damages.

it's definitely not a productizable workaround!

> The processor you are running on, the AME EPYC 7702P based on the Zen2
> architecture contains 4 cores / 8 threads per CCX (LLC domain) which is
> perhaps why reducing the thread count to below this limit is helping
> your workload.
>
> What we suspect is that when running the workload, the threads that
> regularly sleep trigger a newidle balancing which causes them to move
> to another CCX leading to higher number of L3 misses.
>
> To confirm this, would it be possible to run the workload with the
> not-yet-upstream perf sched stats [1] tool and share the result from
> perf sched stats diff for the data from v6.12.17 and v6.12.17 + patch
> to rule out any other second order effect.
>
> [1] 
> https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/

I had to patch tools/perf/util/session.c: static int
open_file_read(struct perf_data *data), due to "failed to open perf.data:
File exists" (it looked more like a compiler issue than a tools/perf issue).

$ ./perf sched stats diff perf.data.6.12.17 perf.data.6.12.17patched > perf.diff (see perf.diff attached)

> Assuming you control these deployments, would it possible to run
> the workload on a kernel running with "relax_domain_level=2" kernel
> cmdline that restricts newidle balance to only within the CCX. As a
> side effect, it also limits  task wakeups to the same LLC domain but
> I would still like to know if this makes a difference to the
> workload you are running.
On vanilla 6.12.17 it gives the IPC we expected:

+--------------------+--------------------------+----------------------+
|                    | relax_domain_level unset | relax_domain_level=2 |
+--------------------+--------------------------+----------------------+
| Threads            | 210                      | 210                  |
| Utilization (%)    | 65,86                    | 52,01                |
| CPU effective freq | 1 622,93                 | 1 294,12             |
| IPC                | 1,14                     | 1,42                 |
| L2 access (pti)    | 34,36                    | 38,18                |
| L2 miss   (pti)    | 7,34                     | 7,78                 |
| L3 miss   (abs)    | 39 711 971 741           | 33 929 609 924       |
| Mem (GB/s)         | 70,68                    | 49,10                |
| Context switches   | 109 281 524              | 107 896 729          |
+--------------------+--------------------------+----------------------+

Kind regards,

JB

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: perf.diff --]
[-- Type: text/x-patch; name="perf.diff", Size: 20149 bytes --]

Columns description
----------------------------------------------------------------------------------------------------
DESC			-> Description of the field
COUNT			-> Value of the field
PCT_CHANGE		-> Percent change with corresponding base value
AVG_JIFFIES		-> Avg time in jiffies between two consecutive occurrence of event
----------------------------------------------------------------------------------------------------
Time elapsed (in jiffies)                                        :       48349,      48345
----------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>
----------------------------------------------------------------------------------------------------
DESC                                                                    COUNT1      COUNT2   PCT_CHANGE    PCT_CHANGE1 PCT_CHANGE2
----------------------------------------------------------------------------------------------------
sched_yield() count                                              :           0,          8  |     0.00% |
Legacy counter can be ignored                                    :           0,          0  |     0.00% |
schedule() called                                                :      856174,     886448  |     3.54% |
schedule() left the processor idle                               :      354060,     396363  |    11.95% |  (    41.35%,     44.71% )
try_to_wake_up() was called                                      :      478156,     469763  |    -1.76% |
try_to_wake_up() was called to wake up the local cpu             :       71136,      42146  |   -40.75% |  (    14.88%,      8.97% )
total runtime by tasks on this processor (in jiffies)            : 123927676874,108531911002  |   -12.42% |
total waittime by tasks on this processor (in jiffies)           : 34729211241,27076295778  |   -22.04% |  (    28.02%,     24.95% )
total timeslices run on this cpu                                 :      501606,     489799  |    -2.35% |
----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>, DOMAIN 0
----------------------------------------------------------------------------------------------------
DESC                                                                    COUNT1      COUNT2   PCT_CHANGE     AVG_JIFFIES1 AVG_JIFFIES2
----------------------------------------- <Category busy> ------------------------------------------
load_balance() count on cpu busy                                 :        2494,        730  |   -70.73% |  $       19.39,       66.23 $
load_balance() found balanced on cpu busy                        :        2445,        641  |   -73.78% |  $       19.77,       75.42 $
load_balance() move task failed on cpu busy                      :          20,         55  |   175.00% |  $     2417.45,      879.00 $
imbalance sum on cpu busy                                        :         453,      29661  |  6447.68% |
pull_task() count on cpu busy                                    :          29,         35  |    20.69% |
pull_task() when target task was cache-hot on cpu busy           :           0,          0  |     0.00% |
load_balance() failed to find busier queue on cpu busy           :           0,          0  |     0.00% |  $        0.00,        0.00 $
load_balance() failed to find busier group on cpu busy           :        2445,        641  |   -73.78% |  $       19.77,       75.42 $
*load_balance() success count on cpu busy                        :          29,         34  |    17.24% |
*avg task pulled per successful lb attempt (cpu busy)            :        1.00,       1.03  |     2.94% |
----------------------------------------- <Category idle> ------------------------------------------
load_balance() count on cpu idle                                 :       11936,      14590  |    22.24% |  $        4.05,        3.31 $
load_balance() found balanced on cpu idle                        :       11690,      14069  |    20.35% |  $        4.14,        3.44 $
load_balance() move task failed on cpu idle                      :           8,        164  |  1950.00% |  $     6043.62,      294.79 $
imbalance sum on cpu idle                                        :         253,     154633  | 61019.76% |
pull_task() count on cpu idle                                    :         240,        363  |    51.25% |
pull_task() when target task was cache-hot on cpu idle           :           0,          0  |     0.00% |
load_balance() failed to find busier queue on cpu idle           :           0,          0  |     0.00% |  $        0.00,        0.00 $
load_balance() failed to find busier group on cpu idle           :       11689,      14069  |    20.36% |  $        4.14,        3.44 $
*load_balance() success count on cpu idle                        :         238,        357  |    50.00% |
*avg task pulled per successful lb attempt (cpu idle)            :        1.01,       1.02  |     0.83% |
---------------------------------------- <Category newidle> ----------------------------------------
load_balance() count on cpu newly idle                           :      331664,      31153  |   -90.61% |  $        0.15,        1.55 $
load_balance() found balanced on cpu newly idle                  :      302817,      28735  |   -90.51% |  $        0.16,        1.68 $
load_balance() move task failed on cpu newly idle                :         461,        874  |    89.59% |  $      104.88,       55.31 $
imbalance sum on cpu newly idle                                  :       28955,     829603  |  2765.15% |
pull_task() count on cpu newly idle                              :       28493,       1557  |   -94.54% |
pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |     0.00% |
load_balance() failed to find busier queue on cpu newly idle     :           0,          0  |     0.00% |  $        0.00,        0.00 $
load_balance() failed to find busier group on cpu newly idle     :      300234,      28470  |   -90.52% |  $        0.16,        1.70 $
*load_balance() success count on cpu newly idle                  :       28386,       1544  |   -94.56% |
*avg task pulled per successful lb attempt (cpu newly idle)      :        1.00,       1.01  |     0.46% |
--------------------------------- <Category active_load_balance()> ---------------------------------
active_load_balance() count                                      :           0,          0  |     0.00% |
active_load_balance() move task failed                           :           0,          0  |     0.00% |
active_load_balance() successfully moved a task                  :           0,          0  |     0.00% |
--------------------------------- <Category sched_balance_exec()> ----------------------------------
sbe_count is not used                                            :           0,          0  |     0.00% |
sbe_balanced is not used                                         :           0,          0  |     0.00% |
sbe_pushed is not used                                           :           0,          0  |     0.00% |
--------------------------------- <Category sched_balance_fork()> ----------------------------------
sbf_count is not used                                            :           0,          0  |     0.00% |
sbf_balanced is not used                                         :           0,          0  |     0.00% |
sbf_pushed is not used                                           :           0,          0  |     0.00% |
------------------------------------------ <Wakeup Info> -------------------------------------------
try_to_wake_up() awoke a task that last ran on a diff cpu        :       25939,      31717  |    22.28% |
try_to_wake_up() moved task because cache-cold on own cpu        :        7221,       5908  |   -18.18% |
try_to_wake_up() started passive balancing                       :           0,          0  |     0.00% |
----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>, DOMAIN 1
----------------------------------------------------------------------------------------------------
DESC                                                                    COUNT1      COUNT2   PCT_CHANGE     AVG_JIFFIES1 AVG_JIFFIES2
----------------------------------------- <Category busy> ------------------------------------------
load_balance() count on cpu busy                                 :          45,         17  |   -62.22% |  $     1074.42,     2843.82 $
load_balance() found balanced on cpu busy                        :          45,         16  |   -64.44% |  $     1074.42,     3021.56 $
load_balance() move task failed on cpu busy                      :           0,          0  |     0.00% |  $        0.00,        0.00 $
imbalance sum on cpu busy                                        :           2,        356  | 17700.00% |
pull_task() count on cpu busy                                    :           0,          0  |     0.00% |
pull_task() when target task was cache-hot on cpu busy           :           0,          0  |     0.00% |
load_balance() failed to find busier queue on cpu busy           :           0,          0  |     0.00% |  $        0.00,        0.00 $
load_balance() failed to find busier group on cpu busy           :           8,          2  |   -75.00% |  $     6043.62,    24172.50 $
*load_balance() success count on cpu busy                        :           0,          1  |     0.00% |
*avg task pulled per successful lb attempt (cpu busy)            :        0.00,       0.00  |     0.00% |
----------------------------------------- <Category idle> ------------------------------------------
load_balance() count on cpu idle                                 :        7753,       7930  |     2.28% |  $        6.24,        6.10 $
load_balance() found balanced on cpu idle                        :        6208,       6591  |     6.17% |  $        7.79,        7.34 $
load_balance() move task failed on cpu idle                      :        1334,       1000  |   -25.04% |  $       36.24,       48.35 $
imbalance sum on cpu idle                                        :        1612,     274184  | 16908.93% |
pull_task() count on cpu idle                                    :         216,        357  |    65.28% |
pull_task() when target task was cache-hot on cpu idle           :           0,         10  |     0.00% |
load_balance() failed to find busier queue on cpu idle           :           0,          0  |     0.00% |  $        0.00,        0.00 $
load_balance() failed to find busier group on cpu idle           :        4065,       4062  |    -0.07% |  $       11.89,       11.90 $
*load_balance() success count on cpu idle                        :         211,        339  |    60.66% |
*avg task pulled per successful lb attempt (cpu idle)            :        1.02,       1.05  |     2.87% |
---------------------------------------- <Category newidle> ----------------------------------------
load_balance() count on cpu newly idle                           :      258017,      29345  |   -88.63% |  $        0.19,        1.65 $
load_balance() found balanced on cpu newly idle                  :      131570,      16162  |   -87.72% |  $        0.37,        2.99 $
load_balance() move task failed on cpu newly idle                :      103161,      11002  |   -89.34% |  $        0.47,        4.39 $
imbalance sum on cpu newly idle                                  :      131916,    2537851  |  1823.84% |
pull_task() count on cpu newly idle                              :       23922,       2213  |   -90.75% |
pull_task() when target task was cache-hot on cpu newly idle     :           5,          5  |     0.00% |
load_balance() failed to find busier queue on cpu newly idle     :           0,          2  |     0.00% |  $        0.00,    24172.50 $
load_balance() failed to find busier group on cpu newly idle     :      131096,      16081  |   -87.73% |  $        0.37,        3.01 $
*load_balance() success count on cpu newly idle                  :       23286,       2181  |   -90.63% |
*avg task pulled per successful lb attempt (cpu newly idle)      :        1.03,       1.01  |    -1.23% |
--------------------------------- <Category active_load_balance()> ---------------------------------
active_load_balance() count                                      :           0,          1  |     0.00% |
active_load_balance() move task failed                           :           0,          0  |     0.00% |
active_load_balance() successfully moved a task                  :           0,          1  |     0.00% |
--------------------------------- <Category sched_balance_exec()> ----------------------------------
sbe_count is not used                                            :           0,          0  |     0.00% |
sbe_balanced is not used                                         :           0,          0  |     0.00% |
sbe_pushed is not used                                           :           0,          0  |     0.00% |
--------------------------------- <Category sched_balance_fork()> ----------------------------------
sbf_count is not used                                            :           0,          0  |     0.00% |
sbf_balanced is not used                                         :           0,          0  |     0.00% |
sbf_pushed is not used                                           :           0,          0  |     0.00% |
------------------------------------------ <Wakeup Info> -------------------------------------------
try_to_wake_up() awoke a task that last ran on a diff cpu        :      209758,     283095  |    34.96% |
try_to_wake_up() moved task because cache-cold on own cpu        :       37946,      33835  |   -10.83% |
try_to_wake_up() started passive balancing                       :           0,          0  |     0.00% |
----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>, DOMAIN 2
----------------------------------------------------------------------------------------------------
DESC                                                                    COUNT1      COUNT2   PCT_CHANGE     AVG_JIFFIES1 AVG_JIFFIES2
----------------------------------------- <Category busy> ------------------------------------------
load_balance() count on cpu busy                                 :           0,          0  |     0.00% |  $        0.00,        0.00 $
load_balance() found balanced on cpu busy                        :           0,          0  |     0.00% |  $        0.00,        0.00 $
load_balance() move task failed on cpu busy                      :           0,          0  |     0.00% |  $        0.00,        0.00 $
imbalance sum on cpu busy                                        :           0,          0  |     0.00% |
pull_task() count on cpu busy                                    :           0,          0  |     0.00% |
pull_task() when target task was cache-hot on cpu busy           :           0,          0  |     0.00% |
load_balance() failed to find busier queue on cpu busy           :           0,          0  |     0.00% |  $        0.00,        0.00 $
load_balance() failed to find busier group on cpu busy           :           0,          0  |     0.00% |  $        0.00,        0.00 $
*load_balance() success count on cpu busy                        :           0,          0  |     0.00% |
*avg task pulled per successful lb attempt (cpu busy)            :        0.00,       0.00  |     0.00% |
----------------------------------------- <Category idle> ------------------------------------------
load_balance() count on cpu idle                                 :        1285,       1321  |     2.80% |  $       37.63,       36.60 $
load_balance() found balanced on cpu idle                        :         908,       1006  |    10.79% |  $       53.25,       48.06 $
load_balance() move task failed on cpu idle                      :         310,        209  |   -32.58% |  $      155.96,      231.32 $
imbalance sum on cpu idle                                        :      251700,     220823  |   -12.27% |
pull_task() count on cpu idle                                    :          75,        136  |    81.33% |
pull_task() when target task was cache-hot on cpu idle           :           0,          0  |     0.00% |
load_balance() failed to find busier queue on cpu idle           :           2,          0  |  -100.00% |  $    24174.50,        0.00 $
load_balance() failed to find busier group on cpu idle           :          62,         45  |   -27.42% |  $      779.82,     1074.33 $
*load_balance() success count on cpu idle                        :          67,        106  |    58.21% |
*avg task pulled per successful lb attempt (cpu idle)            :        1.12,       1.28  |    14.62% |
---------------------------------------- <Category newidle> ----------------------------------------
load_balance() count on cpu newly idle                           :      124013,      27086  |   -78.16% |  $        0.39,        1.78 $
load_balance() found balanced on cpu newly idle                  :       13528,       3242  |   -76.03% |  $        3.57,       14.91 $
load_balance() move task failed on cpu newly idle                :       96593,      19105  |   -80.22% |  $        0.50,        2.53 $
imbalance sum on cpu newly idle                                  :    23681561,   10057827  |   -57.53% |
pull_task() count on cpu newly idle                              :       14841,       5231  |   -64.75% |
pull_task() when target task was cache-hot on cpu newly idle     :           4,          3  |   -25.00% |
load_balance() failed to find busier queue on cpu newly idle     :        1211,         30  |   -97.52% |  $       39.92,     1611.50 $
load_balance() failed to find busier group on cpu newly idle     :       11812,       3063  |   -74.07% |  $        4.09,       15.78 $
*load_balance() success count on cpu newly idle                  :       13892,       4739  |   -65.89% |
*avg task pulled per successful lb attempt (cpu newly idle)      :        1.07,       1.10  |     3.32% |
--------------------------------- <Category active_load_balance()> ---------------------------------
active_load_balance() count                                      :           0,          0  |     0.00% |
active_load_balance() move task failed                           :           0,          0  |     0.00% |
active_load_balance() successfully moved a task                  :           0,          0  |     0.00% |
--------------------------------- <Category sched_balance_exec()> ----------------------------------
sbe_count is not used                                            :           0,          0  |     0.00% |
sbe_balanced is not used                                         :           0,          0  |     0.00% |
sbe_pushed is not used                                           :           0,          0  |     0.00% |
--------------------------------- <Category sched_balance_fork()> ----------------------------------
sbf_count is not used                                            :           0,          0  |     0.00% |
sbf_balanced is not used                                         :           0,          0  |     0.00% |
sbf_pushed is not used                                           :           0,          0  |     0.00% |
------------------------------------------ <Wakeup Info> -------------------------------------------
try_to_wake_up() awoke a task that last ran on a diff cpu        :      171321,     112803  |   -34.16% |
try_to_wake_up() moved task because cache-cold on own cpu        :       47112,      18467  |   -60.80% |
try_to_wake_up() started passive balancing                       :           0,          0  |     0.00% |
----------------------------------------------------------------------------------------------------


* Re: IPC drop down on AMD epyc 7702P
  2025-04-28  7:43   ` Jean-Baptiste Roquefere
@ 2025-04-30  9:13     ` K Prateek Nayak
  2025-04-30  9:25       ` Peter Zijlstra
                         ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: K Prateek Nayak @ 2025-04-30  9:13 UTC (permalink / raw)
  To: Jean-Baptiste Roquefere, Peter Zijlstra, mingo@kernel.org,
	Juri Lelli, Vincent Guittot, linux-kernel@vger.kernel.org
  Cc: Borislav Petkov, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Gautham R. Shenoy, Swapnil Sapkal, Valentin Schneider,
	linux-kernel, regressions@lists.linux.dev, stable@vger.kernel.org

(+ more scheduler folks)

tl;dr

JB has a workload that hates aggressive migration on the 2nd Generation
EPYC platform that has a small LLC domain (4C/8T) and very noticeable
C2C latency.

Based on JB's observation so far, reverting commit 16b0a7a1a0af
("sched/fair: Ensure tasks spreading in LLC during LB") and commit
c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
condition") helps the workload. Both those commits allow aggressive
migrations for work conservation, but they also increase cache
misses, which slows the workload quite a bit.

"relax_domain_level" helps but cannot be set at runtime and I couldn't
think of any stable / debug interfaces that JB hasn't tried out
already that can help this workload.

There is a patch towards the end to set "relax_domain_level" at
runtime, but given that cpusets did away with this when transitioning to
cgroup-v2, I don't know what the sentiments are around its usage.
Any input / feedback is greatly appreciated.

On 4/28/2025 1:13 PM, Jean-Baptiste Roquefere wrote:
> Hello Prateek,
> 
> thank's for your reponse.
> 
> 
>> Looking at the commit logs, it looks like these commits do solve other
>> problems around load balancing and might not be trivial to revert
>> without evaluating the damages.
> 
> it's definitely not a productizable workaround !
> 
>> The processor you are running on, the AME EPYC 7702P based on the Zen2
>> architecture contains 4 cores / 8 threads per CCX (LLC domain) which is
>> perhaps why reducing the thread count to below this limit is helping
>> your workload.
>>
>> What we suspect is that when running the workload, the threads that
>> regularly sleep trigger a newidle balancing which causes them to move
>> to another CCX leading to higher number of L3 misses.
>>
>> To confirm this, would it be possible to run the workload with the
>> not-yet-upstream perf sched stats [1] tool and share the result from
>> perf sched stats diff for the data from v6.12.17 and v6.12.17 + patch
>> to rule out any other second order effect.
>>
>> [1]
>> https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/
> 
> I had to patch tools/perf/util/session.c : static int
> open_file_read(struct perf_data *data) due to "failed to open perf.data:
> File exists" (looked more like a compiler issue than a tool/perf issue)
> 
> $ ./perf sched stats diff perf.data.6.12.17 perf.data.6.12.17patched >
> perf.diff (see perf.diff attached)

Thank you for all the information Jean. I'll highlight the interesting
bits (at least the bits that stood out to me)

(left is mainline, right is mainline with the two commits mentioned by
  JB reverted)

total runtime by tasks on this processor (in jiffies)            : 123927676874,108531911002  |   -12.42% |
total waittime by tasks on this processor (in jiffies)           :  34729211241, 27076295778  |   -22.04% |  (    28.02%,     24.95% )
total timeslices run on this cpu                                 :       501606,      489799  |    -2.35% |

Since "total runtime" is lower on the right, it means that the CPUs
were not as well utilized with the commits reverted however the
reduction in the "total waittime" suggests things are running faster
and on overage there are 0.28 waiting tasks on mainline compared to
0.24 with the commits reverted.

---------------------------------------- <Category newidle - SMT> ----------------------------------------
load_balance() count on cpu newly idle                           :      331664,      31153  |   -90.61% |  $        0.15,        1.55 $
load_balance() failed to find busier group on cpu newly idle     :      300234,      28470  |   -90.52% |  $        0.16,        1.70 $
*load_balance() success count on cpu newly idle                  :       28386,       1544  |   -94.56% |
*avg task pulled per successful lb attempt (cpu newly idle)      :        1.00,       1.01  |     0.46% |
---------------------------------------- <Category newidle - MC > ----------------------------------------
load_balance() count on cpu newly idle                           :      258017,      29345  |   -88.63% |  $        0.19,        1.65 $
load_balance() failed to find busier group on cpu newly idle     :      131096,      16081  |   -87.73% |  $        0.37,        3.01 $
*load_balance() success count on cpu newly idle                  :       23286,       2181  |   -90.63% |
*avg task pulled per successful lb attempt (cpu newly idle)      :        1.03,       1.01  |    -1.23% |
---------------------------------------- <Category newidle - PKG> ----------------------------------------
load_balance() count on cpu newly idle                           :      124013,      27086  |   -78.16% |  $        0.39,        1.78 $
load_balance() failed to find busier group on cpu newly idle     :       11812,       3063  |   -74.07% |  $        4.09,       15.78 $
*load_balance() success count on cpu newly idle                  :       13892,       4739  |   -65.89% |
*avg task pulled per successful lb attempt (cpu newly idle)      :        1.07,       1.10  |     3.32% |
----------------------------------------------------------------------------------------------------------

Most migrations are from newidle balancing, which seems to move tasks
across cores (> 50% of the time) and across the LLC too (~8% of the time).

> 
>> Assuming you control these deployments, would it possible to run
>> the workload on a kernel running with "relax_domain_level=2" kernel
>> cmdline that restricts newidle balance to only within the CCX. As a
>> side effect, it also limits  task wakeups to the same LLC domain but
>> I would still like to know if this makes a difference to the
>> workload you are running.
> On vanilla 6.12.17 it gives the IPC we expected:

Thank you JB for trying out this experiment. I'm not very sure what
the views are on "relax_domain_level" and I'm hoping the other
scheduler folks will chime in here - Is it a debug knob? Can it
be used in production?

I know it had additional uses with cpuset in cgroup-v1 but was not
adopted in v2 - are there any nasty historic reasons for this?

> 
> +--------------------+--------------------------+-----------------------+
> |                    | relax_domain_level unset | relax_domain_level=2  |
> +--------------------+--------------------------+-----------------------+
> | Threads            |  210                     | 210                  |
> | Utilization (%)    |  65,86                   | 52,01                |
> | CPU effective freq |  1 622,93                |  1 294,12             |
> | IPC                |  1,14                    | 1,42                 |
> | L2 access (pti)    |  34,36                   | 38,18                |
> | L2 miss   (pti)    |  7,34                    | 7,78                 |
> | L3 miss   (abs)    |  39 711 971 741          |  33 929 609 924       |
> | Mem (GB/s)         |  70,68                   | 49,10                |
> | Context switches   |  109 281 524             |  107 896 729          |
> +--------------------+--------------------------+-----------------------+
> 
> Kind regards,
> 
> JB

JB asked if there is any way to toggle "relax_domain_level" at runtime
on mainline and I couldn't find any easy way other than using cpusets
with cgroup-v1 which is probably harder to deploy at scale than the
pinning strategy that JB mentioned originally.

I cannot currently think of any stable interface that allows sticky
behavior and mitigates aggressive migration for work conservation -
JB did try almost everything available, as he summarized in his
original report.

Could something like the below be a stop-gap band-aid to remedy the
case of workloads that don't mind a temporary imbalance in favor of
cache hotness?

---
From: K Prateek Nayak <kprateek.nayak@amd.com>
Subject: [RFC PATCH] sched/debug: Allow overriding "relax_domain_level" at runtime

Jean-Baptiste noted that Ateme's workload experiences poor IPC on a 2nd
Generation EPYC system and narrowed down the major culprits to commit
16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") and
commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
condition") both of which enable more aggressive migrations in favor of
work conservation.

The larger C2C latency on the platform, coupled with a smaller LLC domain
of 4C/8T, makes the downside of aggressive balancing very prominent.
Looking at the perf sched stats report from JB [1], when the two commits
are reverted, despite the "total runtime" seeing a dip of 11% showing a
better load distribution on mainline, the "total waittime" dips by 22%,
showing that despite the imbalance the workload runs faster; this
improvement correlates with the higher IPC and the reduced L3
misses in the data shared by JB. Most of the migrations during load
balancing can be attributed to newidle balancing.

JB confirmed that using "relax_domain_level=2" on the kernel cmdline helps
this particular workload by restricting the scope of wakeups and
migrations during newidle balancing. However, "relax_domain_level" works
on topology levels before degeneration, and setting the level before
inspecting the topology might not be trivial at boot time.

Furthermore, a runtime knob that can help quickly narrow down any changes
in workload behavior to aggressive migrations during load balancing can
be helpful during debugging.

Introduce "relax_domain_level" in sched debugfs and allow overriding the
knob at runtime.

   # cat /sys/kernel/debug/sched/relax_domain_level
   -1

   # echo Y > /sys/kernel/debug/sched/verbose
   # cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
   SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
   SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
   SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
   SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA

To restrict newidle balance to only within the LLC, "relax_domain_level"
can be set to level 3 (SMT, CLUSTER, *MC*, PKG, NUMA)

   # echo 3 > /sys/kernel/debug/sched/relax_domain_level
   # cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
   SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
   SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
   SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
   SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA

"relax_domain_level" forgives short term imbalances. Longer term
imbalances will be eventually caught by the periodic load balancer and
the system will reach a state of balance, only slightly later.

Link: https://lore.kernel.org/all/996ca8cb-3ac8-4f1b-93f1-415f43922d7a@ateme.com/ [1]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
  include/linux/sched/topology.h |  6 ++--
  kernel/sched/debug.c           | 52 ++++++++++++++++++++++++++++++++++
  kernel/sched/topology.c        |  2 +-
  3 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 198bb5cc1774..5f59bdc1d5b1 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -65,8 +65,10 @@ struct sched_domain_attr {
  	int relax_domain_level;
  };
  
-#define SD_ATTR_INIT	(struct sched_domain_attr) {	\
-	.relax_domain_level = -1,			\
+extern int default_relax_domain_level;
+
+#define SD_ATTR_INIT	(struct sched_domain_attr) {		\
+	.relax_domain_level = default_relax_domain_level,	\
  }
  
  extern int sched_domain_level_max;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 557246880a7e..cc6944b35535 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -214,6 +214,57 @@ static const struct file_operations sched_scaling_fops = {
  	.release	= single_release,
  };
  
+DEFINE_MUTEX(relax_domain_mutex);
+
+static ssize_t sched_relax_domain_write(struct file *filp,
+					const char __user *ubuf,
+					size_t cnt, loff_t *ppos)
+{
+	int relax_domain_level;
+	char buf[16];
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtoint(buf, 10, &relax_domain_level))
+		return -EINVAL;
+
+	if (relax_domain_level < -1 || relax_domain_level > sched_domain_level_max + 1)
+		return -EINVAL;
+
+	guard(mutex)(&relax_domain_mutex);
+
+	if (relax_domain_level != default_relax_domain_level) {
+		default_relax_domain_level = relax_domain_level;
+		rebuild_sched_domains();
+	}
+
+	*ppos += cnt;
+	return cnt;
+}
+static int sched_relax_domain_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", default_relax_domain_level);
+	return 0;
+}
+
+static int sched_relax_domain_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_relax_domain_show, NULL);
+}
+
+static const struct file_operations sched_relax_domain_fops = {
+	.open		= sched_relax_domain_open,
+	.write		= sched_relax_domain_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
  #endif /* SMP */
  
  #ifdef CONFIG_PREEMPT_DYNAMIC
@@ -516,6 +567,7 @@ static __init int sched_init_debug(void)
  	debugfs_create_file("tunable_scaling", 0644, debugfs_sched, NULL, &sched_scaling_fops);
  	debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
  	debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);
+	debugfs_create_file("relax_domain_level", 0644, debugfs_sched, NULL, &sched_relax_domain_fops);
  
  	sched_domains_mutex_lock();
  	update_sched_domain_debugfs();
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a2a38e1b6f18..eb5c8a9cd904 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1513,7 +1513,7 @@ static void asym_cpu_capacity_scan(void)
   * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
   */
  
-static int default_relax_domain_level = -1;
+int default_relax_domain_level = -1;
  int sched_domain_level_max;
  
  static int __init setup_relax_domain_level(char *str)
-- 

Thanks and Regards,
Prateek



* Re: IPC drop down on AMD epyc 7702P
  2025-04-30  9:13     ` K Prateek Nayak
@ 2025-04-30  9:25       ` Peter Zijlstra
  2025-04-30 10:41       ` Libo Chen
  2025-05-05 10:28       ` Vincent Guittot
  2 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2025-04-30  9:25 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Jean-Baptiste Roquefere, mingo@kernel.org, Juri Lelli,
	Vincent Guittot, linux-kernel@vger.kernel.org, Borislav Petkov,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, Swapnil Sapkal, Valentin Schneider,
	regressions@lists.linux.dev, stable@vger.kernel.org

On Wed, Apr 30, 2025 at 02:43:00PM +0530, K Prateek Nayak wrote:
> (+ more scheduler folks)
> 
> tl;dr
> 
> JB has a workload that hates aggressive migration on the 2nd Generation
> EPYC platform that has a small LLC domain (4C/8T) and very noticeable
> C2C latency.

Seems like the kind of chip the cache aware scheduling crud should be
good for. Of course, it's still early days on that, so it might not be
in good enough shape to help yet.

But long term, that should definitely be the goal, rather than finding
ways to make relax_domain hacks available again.




* Re: IPC drop down on AMD epyc 7702P
  2025-04-30  9:13     ` K Prateek Nayak
  2025-04-30  9:25       ` Peter Zijlstra
@ 2025-04-30 10:41       ` Libo Chen
  2025-04-30 11:29         ` K Prateek Nayak
  2025-05-05 10:28       ` Vincent Guittot
  2 siblings, 1 reply; 16+ messages in thread
From: Libo Chen @ 2025-04-30 10:41 UTC (permalink / raw)
  To: K Prateek Nayak, Jean-Baptiste Roquefere, Peter Zijlstra,
	mingo@kernel.org, Juri Lelli, Vincent Guittot,
	linux-kernel@vger.kernel.org
  Cc: Borislav Petkov, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Gautham R. Shenoy, Swapnil Sapkal, Valentin Schneider,
	regressions@lists.linux.dev, stable@vger.kernel.org, Konrad Wilk



On 4/30/25 02:13, K Prateek Nayak wrote:
> (+ more scheduler folks)
> 
> tl;dr
> 
> JB has a workload that hates aggressive migration on the 2nd Generation
> EPYC platform that has a small LLC domain (4C/8T) and very noticeable
> C2C latency.
> 
> Based on JB's observation so far, reverting commit 16b0a7a1a0af
> ("sched/fair: Ensure tasks spreading in LLC during LB") and commit
> c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
> condition") helps the workload. Both those commits allow aggressive
> migrations for work conservation except it also increased cache
> misses which slows the workload quite a bit.
> 
> "relax_domain_level" helps but cannot be set at runtime and I couldn't
> think of any stable / debug interfaces that JB hasn't tried out
> already that can help this workload.
> 
> There is a patch towards the end to set "relax_domain_level" at
> runtime but given cpusets got away with this when transitioning to
> cgroup-v2, I don't know what the sentiments are around its usage.
> Any input / feedback is greatly appreciated.
> 


Hi Prateek,

Oh no, not "relax_domain_level" again, this can lead to load imbalance
in a variety of ways. We were so glad this one went away with cgroup v2;
it tends to be abused by users as an "easy" fix for some urgent perf
issues instead of addressing their root causes.


Thanks,
Libo





* Re: IPC drop down on AMD epyc 7702P
  2025-04-30 10:41       ` Libo Chen
@ 2025-04-30 11:29         ` K Prateek Nayak
  2025-05-01  2:46           ` Libo Chen
  0 siblings, 1 reply; 16+ messages in thread
From: K Prateek Nayak @ 2025-04-30 11:29 UTC (permalink / raw)
  To: Libo Chen, Jean-Baptiste Roquefere, Peter Zijlstra,
	mingo@kernel.org, Juri Lelli, Vincent Guittot,
	linux-kernel@vger.kernel.org
  Cc: Borislav Petkov, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Gautham R. Shenoy, Swapnil Sapkal, Valentin Schneider,
	regressions@lists.linux.dev, stable@vger.kernel.org, Konrad Wilk

Hello Libo,

On 4/30/2025 4:11 PM, Libo Chen wrote:
> 
> 
> On 4/30/25 02:13, K Prateek Nayak wrote:
>> (+ more scheduler folks)
>>
>> tl;dr
>>
>> JB has a workload that hates aggressive migration on the 2nd Generation
>> EPYC platform that has a small LLC domain (4C/8T) and very noticeable
>> C2C latency.
>>
>> Based on JB's observation so far, reverting commit 16b0a7a1a0af
>> ("sched/fair: Ensure tasks spreading in LLC during LB") and commit
>> c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
>> condition") helps the workload. Both those commits allow aggressive
>> migrations for work conservation except it also increased cache
>> misses which slows the workload quite a bit.
>>
>> "relax_domain_level" helps but cannot be set at runtime and I couldn't
>> think of any stable / debug interfaces that JB hasn't tried out
>> already that can help this workload.
>>
>> There is a patch towards the end to set "relax_domain_level" at
>> runtime but given cpusets got away with this when transitioning to
>> cgroup-v2, I don't know what the sentiments are around its usage.
>> Any input / feedback is greatly appreciated.
>>
> 
> 
> Hi Prateek,
> 
> Oh no, not "relax_domain_level" again, this can lead to load imbalance
> in variety of ways. We were so glad this one went away with cgroupv2,

I agree it is not pretty. JB also tried strategic pinning and they
did report that things are better overall but unfortunately, it is
very hard to deploy across multiple architectures and would also
require some redesign + testing from their application side.
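
(As an aside, and purely as a hypothetical sketch rather than anything
from this thread: the per-LLC masks could in principle be discovered at
runtime from sysfs instead of being hardcoded per platform, assuming
cache index3 describes the LLC on the machines in question:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Parse a sysfs cpulist such as "0-3,64-67" into a cpu_set_t. */
static int cpulist_to_cpuset(const char *list, cpu_set_t *set)
{
	CPU_ZERO(set);
	while (*list && *list != '\n') {
		int lo, hi, n = sscanf(list, "%d-%d", &lo, &hi);

		if (n == 1)
			hi = lo;
		else if (n != 2)
			return -1;
		for (int cpu = lo; cpu <= hi; cpu++)
			CPU_SET(cpu, set);
		list = strchr(list, ',');
		if (!list)
			break;
		list++;
	}
	return 0;
}

/* Fill 'set' with the CPUs sharing an LLC with 'cpu' (index3 assumed to be the LLC). */
static int llc_mask_of_cpu(int cpu, cpu_set_t *set)
{
	char path[128], buf[256];
	FILE *f;
	int ok;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list", cpu);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fgets(buf, sizeof(buf), f) != NULL;
	fclose(f);
	return ok ? cpulist_to_cpuset(buf, set) : -1;
}

That still leaves the redesign and testing burden on the application
side, of course.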

> it tends to be abused by users as an "easy" fix for some urgent perf
> issues instead of addressing their root causes.

Was there ever a report of a similar issue where migrations for the right
reasons have led to performance degradation as a result of the platform
architecture? I doubt there is a straightforward way to solve this
using the current interfaces - at least I haven't found one yet.

Perhaps cache-aware scheduling is the way forward to solve these
set of issues as Peter highlighted.

> 
> 
> Thanks,
> Libo
> 

-- 
Thanks and Regards,
Prateek



* Re: IPC drop down on AMD epyc 7702P
  2025-04-30 11:29         ` K Prateek Nayak
@ 2025-05-01  2:46           ` Libo Chen
  0 siblings, 0 replies; 16+ messages in thread
From: Libo Chen @ 2025-05-01  2:46 UTC (permalink / raw)
  To: K Prateek Nayak, Jean-Baptiste Roquefere, Peter Zijlstra,
	mingo@kernel.org, Juri Lelli, Vincent Guittot,
	linux-kernel@vger.kernel.org
  Cc: Borislav Petkov, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Gautham R. Shenoy, Swapnil Sapkal, Valentin Schneider,
	regressions@lists.linux.dev, stable@vger.kernel.org, Konrad Wilk

Hi Prateek,

On 4/30/25 04:29, K Prateek Nayak wrote:
> Hello Libo,
> 
> On 4/30/2025 4:11 PM, Libo Chen wrote:
>>
>>
>> On 4/30/25 02:13, K Prateek Nayak wrote:
>>> (+ more scheduler folks)
>>>
>>> tl;dr
>>>
>>> JB has a workload that hates aggressive migration on the 2nd Generation
>>> EPYC platform that has a small LLC domain (4C/8T) and very noticeable
>>> C2C latency.
>>>
>>> Based on JB's observation so far, reverting commit 16b0a7a1a0af
>>> ("sched/fair: Ensure tasks spreading in LLC during LB") and commit
>>> c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
>>> condition") helps the workload. Both those commits allow aggressive
>>> migrations for work conservation except it also increased cache
>>> misses which slows the workload quite a bit.
>>>
>>> "relax_domain_level" helps but cannot be set at runtime and I couldn't
>>> think of any stable / debug interfaces that JB hasn't tried out
>>> already that can help this workload.
>>>
>>> There is a patch towards the end to set "relax_domain_level" at
>>> runtime but given cpusets got away with this when transitioning to
>>> cgroup-v2, I don't know what the sentiments are around its usage.
>>> Any input / feedback is greatly appreciated.
>>>
>>
>>
>> Hi Prateek,
>>
>> Oh no, not "relax_domain_level" again, this can lead to load imbalance
>> in variety of ways. We were so glad this one went away with cgroupv2,
> 
> I agree it is not pretty. JB also tried strategic pinning and they
> did report that things are better overall but unfortunately, it is
> very hard to deploy across multiple architectures and would also
> require some redesign + testing from their application side.
> 

I was stressing more broadly how badly setting "relax_domain_level"
can go wrong if a user doesn't know it essentially disables newidle
balancing at the higher levels, so the ability to balance load across
CCXes or NUMA nodes becomes a lot weaker. A subset of CCXes may
consistently get much more load for a whole bunch of reasons. Sometimes
this is hard to spot in testing, but it does show up in real-world
scenarios, esp. when users have other weird hacks.
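
A quick way to spot that kind of skew, by the way, is to group CPUs by
shared L3 and then eyeball per-CPU load - just a sketch, assuming the
usual sysfs cache layout where index3 is the L3 and that sysstat's
mpstat is installed:

  # one line per CCX: which CPUs share an L3
  sort -u /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list
  # per-CPU utilization over 5s; compare CPUs within and between groups
  mpstat -P ALL 5 1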

>> it tends to be abused by users as an "easy" fix for some urgent perf
>> issues instead of addressing their root causes.
> 
> Was there ever a report of similar issue where migrations for right
> reasons has led to performance degradation as a result of platform
> architecture? I doubt there is a straightforward way to solve this
> using the current interfaces - at least I haven't found one yet.
> 

For us it wasn't due to the platform architecture but rather an
"exotic" NUMA topology (like a cube: a node is one hop away from 3
neighbors and two hops away from the other 4) in combination with
certain user-level settings that caused more wakeups in a subset of
domains. If relax_domain_level is left untouched, you get no load
imbalance but perf is bad. But once you set relax_domain_level to
restrict newidle balancing to the lower domain levels, you actually see
better performance numbers in testing even though the CPU loads are not
well balanced. Until one day you find out the imbalance is so bad that
it slows down everything. Luckily it wasn't too hard to fix from the
application side.

I get that it may not be easy to fix from their application side in
this case, but I still think this is too hacky; one may end up
regretting it.

I certainly want to hear what others think about relax_domain_level!
  
> Perhaps cache-aware scheduling is the way forward to solve these
> set of issues as Peter highlighted.
> 

Hope so! We will start testing that series and provide feedback.


Thanks,
Libo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: IPC drop down on AMD epyc 7702P
  2025-04-30  9:13     ` K Prateek Nayak
  2025-04-30  9:25       ` Peter Zijlstra
  2025-04-30 10:41       ` Libo Chen
@ 2025-05-05 10:28       ` Vincent Guittot
  2025-05-05 12:29         ` K Prateek Nayak
  2 siblings, 1 reply; 16+ messages in thread
From: Vincent Guittot @ 2025-05-05 10:28 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Jean-Baptiste Roquefere, Peter Zijlstra, mingo@kernel.org,
	Juri Lelli, linux-kernel@vger.kernel.org, Borislav Petkov,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, Swapnil Sapkal, Valentin Schneider,
	regressions@lists.linux.dev, stable@vger.kernel.org

On Wed, 30 Apr 2025 at 11:13, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> (+ more scheduler folks)
>
> tl;dr
>
> JB has a workload that hates aggressive migration on the 2nd Generation
> EPYC platform that has a small LLC domain (4C/8T) and very noticeable
> C2C latency.
>
> Based on JB's observation so far, reverting commit 16b0a7a1a0af
> ("sched/fair: Ensure tasks spreading in LLC during LB") and commit
> c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
> condition") helps the workload. Both those commits allow aggressive
> migrations for work conservation except it also increased cache
> misses which slows the workload quite a bit.

commit 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC
during LB") eases the spreading of tasks inside an LLC, so it's not
obvious to me how it would lead to "a lot of CPU migrations go out of
CCX, then L3 miss". On the other hand, it will spread tasks across SMT
and the LLC, which can prevent running at the highest freq on some
systems, but I don't know whether that is relevant for this SoC.

commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
condition") makes newly idle migrations happen more often, which can
then migrate tasks across LLCs. But then it's more a question of why
newly idle load balancing is enabled outside the LLC if it is so
costly.
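
A rough way to quantify that on the live workload is to count how many
sched migrations actually cross an LLC - only a sketch: the CCX-id math
below assumes 4 cores per CCX with SMT siblings at cpu N and N+64,
which matches this 7702P's numbering but is not universal, and tracefs
may be mounted at /sys/kernel/tracing instead:

  cd /sys/kernel/debug/tracing
  echo 1 > events/sched/sched_migrate_task/enable
  echo > trace
  sleep 10
  echo 0 > events/sched/sched_migrate_task/enable
  grep -o 'orig_cpu=[0-9]* dest_cpu=[0-9]*' trace |
  awk -F'[= ]' '{ o = int(($2 % 64) / 4); d = int(($4 % 64) / 4);
                  if (o == d) same++; else cross++ }
                END { printf "same-CCX: %d  cross-CCX: %d\n", same, cross }'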

>
> "relax_domain_level" helps but cannot be set at runtime and I couldn't
> think of any stable / debug interfaces that JB hasn't tried out
> already that can help this workload.
>
> There is a patch towards the end to set "relax_domain_level" at
> runtime but given cpusets got away with this when transitioning to
> cgroup-v2, I don't know what the sentiments are around its usage.
> Any input / feedback is greatly appreciated.
>
> On 4/28/2025 1:13 PM, Jean-Baptiste Roquefere wrote:
> > Hello Prateek,
> >
> > thank's for your reponse.
> >
> >
> >> Looking at the commit logs, it looks like these commits do solve other
> >> problems around load balancing and might not be trivial to revert
> >> without evaluating the damages.
> >
> > it's definitely not a productizable workaround !
> >
> >> The processor you are running on, the AME EPYC 7702P based on the Zen2
> >> architecture contains 4 cores / 8 threads per CCX (LLC domain) which is
> >> perhaps why reducing the thread count to below this limit is helping
> >> your workload.
> >>
> >> What we suspect is that when running the workload, the threads that
> >> regularly sleep trigger a newidle balancing which causes them to move
> >> to another CCX leading to higher number of L3 misses.
> >>
> >> To confirm this, would it be possible to run the workload with the
> >> not-yet-upstream perf sched stats [1] tool and share the result from
> >> perf sched stats diff for the data from v6.12.17 and v6.12.17 + patch
> >> to rule out any other second order effect.
> >>
> >> [1]
> >> https://lore.kernel.org/all/20250311120230.61774-1-swapnil.sapkal@amd.com/
> >
> > I had to patch tools/perf/util/session.c : static int
> > open_file_read(struct perf_data *data) due to "failed to open perf.data:
> > File exists" (looked more like a compiler issue than a tool/perf issue)
> >
> > $ ./perf sched stats diff perf.data.6.12.17 perf.data.6.12.17patched >
> > perf.diff (see perf.diff attached)
>
> Thank you for all the information Jean. I'll highlight the interesting
> bits (at least the bits that stood out to me)
>
> (left is mainline, right is mainline with the two commits mentioned by
>   JB reverted)
>
> total runtime by tasks on this processor (in jiffies)            : 123927676874,108531911002  |   -12.42% |
> total waittime by tasks on this processor (in jiffies)           :  34729211241, 27076295778  |   -22.04% |  (    28.02%,     24.95% )
> total timeslices run on this cpu                                 :       501606,      489799  |    -2.35% |
>
> Since "total runtime" is lower on the right, it means that the CPUs
> were not as well utilized with the commits reverted however the
> reduction in the "total waittime" suggests things are running faster
> and on overage there are 0.28 waiting tasks on mainline compared to
> 0.24 with the commits reverted.
>
> ---------------------------------------- <Category newidle - SMT> ----------------------------------------
> load_balance() count on cpu newly idle                           :      331664,      31153  |   -90.61% |  $        0.15,        1.55 $
> load_balance() failed to find busier group on cpu newly idle     :      300234,      28470  |   -90.52% |  $        0.16,        1.70 $
> *load_balance() success count on cpu newly idle                  :       28386,       1544  |   -94.56% |
> *avg task pulled per successful lb attempt (cpu newly idle)      :        1.00,       1.01  |     0.46% |
> ---------------------------------------- <Category newidle - MC > ----------------------------------------
> load_balance() count on cpu newly idle                           :      258017,      29345  |   -88.63% |  $        0.19,        1.65 $
> load_balance() failed to find busier group on cpu newly idle     :      131096,      16081  |   -87.73% |  $        0.37,        3.01 $
> *load_balance() success count on cpu newly idle                  :       23286,       2181  |   -90.63% |
> *avg task pulled per successful lb attempt (cpu newly idle)      :        1.03,       1.01  |    -1.23% |
> ---------------------------------------- <Category newidle - PKG> ----------------------------------------
> load_balance() count on cpu newly idle                           :      124013,      27086  |   -78.16% |  $        0.39,        1.78 $
> load_balance() failed to find busier group on cpu newly idle     :       11812,       3063  |   -74.07% |  $        4.09,       15.78 $
> *load_balance() success count on cpu newly idle                  :       13892,       4739  |   -65.89% |
> *avg task pulled per successful lb attempt (cpu newly idle)      :        1.07,       1.10  |     3.32% |
> ----------------------------------------------------------------------------------------------------------
>
> Most migrations are from newidle balancing, which seems to move tasks
> across cores (> 50% of the time) and across the LLC too (~8% of the time).
>
> >
> >> Assuming you control these deployments, would it possible to run
> >> the workload on a kernel running with "relax_domain_level=2" kernel
> >> cmdline that restricts newidle balance to only within the CCX. As a
> >> side effect, it also limits  task wakeups to the same LLC domain but
> >> I would still like to know if this makes a difference to the
> >> workload you are running.
> > On vanilla 6.12.17 it gives the IPC we expected:
>
> Thank you JB for trying out this experiment. I'm not very sure what
> the views are on "relax_domain_level" and I'm hoping the other
> scheduler folks will chime in here - Is it a debug knob? Can it
> be used in production?
>
> I know it had additional uses with cpuset in cgroup-v1 but was not
> adopted in v2 - are there any nasty historic reasons for this?
>
> >
> > +--------------------+--------------------------+----------------------+
> > |                    | relax_domain_level unset | relax_domain_level=2 |
> > +--------------------+--------------------------+----------------------+
> > | Threads            | 210                      | 210                  |
> > | Utilization (%)    | 65,86                    | 52,01                |
> > | CPU effective freq | 1 622,93                 | 1 294,12             |
> > | IPC                | 1,14                     | 1,42                 |
> > | L2 access (pti)    | 34,36                    | 38,18                |
> > | L2 miss   (pti)    | 7,34                     | 7,78                 |
> > | L3 miss   (abs)    | 39 711 971 741           | 33 929 609 924       |
> > | Mem (GB/s)         | 70,68                    | 49,10                |
> > | Context switches   | 109 281 524              | 107 896 729          |
> > +--------------------+--------------------------+----------------------+
> >
> > Kind regards,
> >
> > JB
>
> JB asked if there is any way to toggle "relax_domain_level" at runtime
> on mainline and I couldn't find any easy way other than using cpusets
> with cgroup-v1 which is probably harder to deploy at scale than the
> pinning strategy that JB mentioned originally.
>
> I currently cannot think of any stable interface that exists currently
> to allow sticky behavior and mitigate aggressive migration for work
> conservation - JB did try almost everything available that he
> summarized in his original report.
>
> Could something like below be a stop-gap band-aid to remedy the case
> of workloads that don't mind a temporary imbalance in favor of cache
> hotness?
>
> ---
> From: K Prateek Nayak <kprateek.nayak@amd.com>
> Subject: [RFC PATCH] sched/debug: Allow overriding "relax_domain_level" at runtime
>
> Jean-Baptiste noted that Ateme's workload experiences poor IPC on a 2nd
> Generation EPYC system and narrowed down the major culprits to commit
> 16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB") and
> commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
> condition") both of which enable more aggressive migrations in favor of
> work conservation.
>
> The larger C2C latency on the platform, coupled with the small 4C/8T
> LLC domain, makes the downside of aggressive balancing very prominent.
> Looking at the perf sched stats report from JB [1], when the two
> commits are reverted the "total runtime" sees an 11% dip, showing a
> better load distribution on mainline, but the "total waittime" dips by
> 22%, showing that despite the imbalance the workload runs faster; this
> improvement correlates with the higher IPC and the reduced L3 misses
> in the data shared by JB. Most of the migrations during load balancing
> can be attributed to newidle balance.
>
> JB confirmed that using "relax_domain_level=2" in the kernel cmdline
> helps this particular workload by restricting the scope of wakeups and
> migrations during newidle balancing; however, "relax_domain_level"
> works on topology levels before degeneration, and setting the level
> before inspecting the topology might not be trivial at boot time.
>
> Furthermore, a runtime knob that can help quickly narrow down any changes
> in workload behavior to aggressive migrations during load balancing can
> be helpful during debugs.
>
> Introduce "relax_domain_level" in sched debugfs and allow overriding the
> knob at runtime.
>
>    # cat /sys/kernel/debug/sched/relax_domain_level
>    -1
>
>    # echo Y > /sys/kernel/debug/sched/verbose
>    # cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
>    SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
>    SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
>    SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
>    SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
>
> To restrict newidle balance to only within the LLC, "relax_domain_level"
> can be set to level 3 (SMT, CLUSTER, *MC* , PKG, NUMA)
>
>    # echo 3 > /sys/kernel/debug/sched/relax_domain_level
>    # cat /sys/kernel/debug/sched/domains/cpu0/domain*/flags
>    SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_CPUCAPACITY SD_SHARE_LLC SD_PREFER_SIBLING
>    SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SHARE_LLC SD_PREFER_SIBLING
>    SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_PREFER_SIBLING
>    SD_BALANCE_EXEC SD_BALANCE_FORK SD_WAKE_AFFINE SD_SERIALIZE SD_OVERLAP SD_NUMA
>
> "relax_domain_level" forgives short term imbalances. Longer term
> imbalances will be eventually caught by the periodic load balancer and
> the system will reach a state of balance, only slightly later.
>
> Link: https://lore.kernel.org/all/996ca8cb-3ac8-4f1b-93f1-415f43922d7a@ateme.com/ [1]
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>   include/linux/sched/topology.h |  6 ++--
>   kernel/sched/debug.c           | 52 ++++++++++++++++++++++++++++++++++
>   kernel/sched/topology.c        |  2 +-
>   3 files changed, 57 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 198bb5cc1774..5f59bdc1d5b1 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -65,8 +65,10 @@ struct sched_domain_attr {
>         int relax_domain_level;
>   };
>
> -#define SD_ATTR_INIT   (struct sched_domain_attr) {    \
> -       .relax_domain_level = -1,                       \
> +extern int default_relax_domain_level;
> +
> +#define SD_ATTR_INIT   (struct sched_domain_attr) {            \
> +       .relax_domain_level = default_relax_domain_level,       \
>   }
>
>   extern int sched_domain_level_max;
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 557246880a7e..cc6944b35535 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -214,6 +214,57 @@ static const struct file_operations sched_scaling_fops = {
>         .release        = single_release,
>   };
>
> +DEFINE_MUTEX(relax_domain_mutex);
> +
> +static ssize_t sched_relax_domain_write(struct file *filp,
> +                                       const char __user *ubuf,
> +                                       size_t cnt, loff_t *ppos)
> +{
> +       int relax_domain_level;
> +       char buf[16];
> +
> +       if (cnt > 15)
> +               cnt = 15;
> +
> +       if (copy_from_user(&buf, ubuf, cnt))
> +               return -EFAULT;
> +       buf[cnt] = '\0';
> +
> +       if (kstrtoint(buf, 10, &relax_domain_level))
> +               return -EINVAL;
> +
> +       if (relax_domain_level < -1 || relax_domain_level > sched_domain_level_max + 1)
> +               return -EINVAL;
> +
> +       guard(mutex)(&relax_domain_mutex);
> +
> +       if (relax_domain_level != default_relax_domain_level) {
> +               default_relax_domain_level = relax_domain_level;
> +               rebuild_sched_domains();
> +       }
> +
> +       *ppos += cnt;
> +       return cnt;
> +}
> +static int sched_relax_domain_show(struct seq_file *m, void *v)
> +{
> +       seq_printf(m, "%d\n", default_relax_domain_level);
> +       return 0;
> +}
> +
> +static int sched_relax_domain_open(struct inode *inode, struct file *filp)
> +{
> +       return single_open(filp, sched_relax_domain_show, NULL);
> +}
> +
> +static const struct file_operations sched_relax_domain_fops = {
> +       .open           = sched_relax_domain_open,
> +       .write          = sched_relax_domain_write,
> +       .read           = seq_read,
> +       .llseek         = seq_lseek,
> +       .release        = single_release,
> +};
> +
>   #endif /* SMP */
>
>   #ifdef CONFIG_PREEMPT_DYNAMIC
> @@ -516,6 +567,7 @@ static __init int sched_init_debug(void)
>         debugfs_create_file("tunable_scaling", 0644, debugfs_sched, NULL, &sched_scaling_fops);
>         debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
>         debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);
> +       debugfs_create_file("relax_domain_level", 0644, debugfs_sched, NULL, &sched_relax_domain_fops);
>
>         sched_domains_mutex_lock();
>         update_sched_domain_debugfs();
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index a2a38e1b6f18..eb5c8a9cd904 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1513,7 +1513,7 @@ static void asym_cpu_capacity_scan(void)
>    * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
>    */
>
> -static int default_relax_domain_level = -1;
> +int default_relax_domain_level = -1;
>   int sched_domain_level_max;
>
>   static int __init setup_relax_domain_level(char *str)
> --
>
> Thanks and Regards,
> Prateek
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: IPC drop down on AMD epyc 7702P
  2025-05-05 10:28       ` Vincent Guittot
@ 2025-05-05 12:29         ` K Prateek Nayak
  2025-05-05 15:10           ` Vincent Guittot
  0 siblings, 1 reply; 16+ messages in thread
From: K Prateek Nayak @ 2025-05-05 12:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Jean-Baptiste Roquefere, Peter Zijlstra, mingo@kernel.org,
	Juri Lelli, linux-kernel@vger.kernel.org, Borislav Petkov,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, Swapnil Sapkal, Valentin Schneider,
	regressions@lists.linux.dev, stable@vger.kernel.org

Hello Vincent,

On 5/5/2025 3:58 PM, Vincent Guittot wrote:
> On Wed, 30 Apr 2025 at 11:13, K Prateek Nayak<kprateek.nayak@amd.com> wrote:
>> (+ more scheduler folks)
>>
>> tl;dr
>>
>> JB has a workload that hates aggressive migration on the 2nd Generation
>> EPYC platform that has a small LLC domain (4C/8T) and very noticeable
>> C2C latency.
>>
>> Based on JB's observation so far, reverting commit 16b0a7a1a0af
>> ("sched/fair: Ensure tasks spreading in LLC during LB") and commit
>> c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
>> condition") helps the workload. Both those commits allow aggressive
>> migrations for work conservation except it also increased cache
>> misses which slows the workload quite a bit.
> commit 16b0a7a1a0af  ("sched/fair: Ensure tasks spreading in LLC
> during LB") eases the spread of task inside a LLC so It's not obvious
> for me how it would increase "a lot of CPU migrations go out of CCX,
> then L3 miss,". On the other hand, it will spread task in SMT and in
> LLC which can prevent running at highest freq on some system but I
> don't know if it's relevant for this SoC.

I misspoke there. JB's workload seems to be sensitive even to
core-to-core migrations - "relax_domain_level=2" actually disables
newidle balance above the CLUSTER level, which is a subset of MC on x86
and gets degenerated into the SMT domain.

> 
> commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
> condition") makes newly idle migration happen more often which can
> then do migrate tasks across LLC. But then It's more about why
> enabling newly idle load balance out of LLC if it is so costly.

It seems to be a very workload- and possibly platform-specific
characteristic where re-priming the cache is actually very costly.
I'm not sure if there are any other uarch factors at play here
(branch prediction, prefetcher, etc.) that need re-priming after a
task migration before the same IPC is reached.

Essentially "relax_domain_level" gets the desired characteristic
where only the periodic balance will balance long-term imbalance
but as Libo mentioned the short term imbalances can build up
and using "relax_domain_level" might lead to other problems.

Short of pinning / more analysis of which of the migrations make the
workload unhappy, I couldn't think of a better way to communicate
this requirement.
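
For experiments, though, the knob can at least be flipped at runtime
through the legacy cpuset interface - a sketch only, assuming a v1
cpuset hierarchy can still be mounted on the test box (it cannot if the
cpuset controller is already claimed by cgroup-v2) and with placeholder
cpus/mems/PID values:

  mkdir -p /mnt/cpuset
  mount -t cgroup -o cpuset cpuset /mnt/cpuset
  mkdir /mnt/cpuset/video
  echo 0-127 > /mnt/cpuset/video/cpuset.cpus
  echo 0     > /mnt/cpuset/video/cpuset.mems
  echo 2     > /mnt/cpuset/video/cpuset.sched_relax_domain_level
  echo $PID  > /mnt/cpuset/video/tasks   # move the decoder into the cpuset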

> 
>> "relax_domain_level" helps but cannot be set at runtime and I couldn't
>> think of any stable / debug interfaces that JB hasn't tried out
>> already that can help this workload.
>>
>> There is a patch towards the end to set "relax_domain_level" at
>> runtime but given cpusets got away with this when transitioning to
>> cgroup-v2, I don't know what the sentiments are around its usage.
>> Any input / feedback is greatly appreciated.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: IPC drop down on AMD epyc 7702P
  2025-05-05 12:29         ` K Prateek Nayak
@ 2025-05-05 15:10           ` Vincent Guittot
  2025-05-05 15:16             ` K Prateek Nayak
  0 siblings, 1 reply; 16+ messages in thread
From: Vincent Guittot @ 2025-05-05 15:10 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Jean-Baptiste Roquefere, Peter Zijlstra, mingo@kernel.org,
	Juri Lelli, linux-kernel@vger.kernel.org, Borislav Petkov,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, Swapnil Sapkal, Valentin Schneider,
	regressions@lists.linux.dev, stable@vger.kernel.org

On Mon, 5 May 2025 at 14:29, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Vincent,
>
> On 5/5/2025 3:58 PM, Vincent Guittot wrote:
> > On Wed, 30 Apr 2025 at 11:13, K Prateek Nayak<kprateek.nayak@amd.com> wrote:
> >> (+ more scheduler folks)
> >>
> >> tl;dr
> >>
> >> JB has a workload that hates aggressive migration on the 2nd Generation
> >> EPYC platform that has a small LLC domain (4C/8T) and very noticeable
> >> C2C latency.
> >>
> >> Based on JB's observation so far, reverting commit 16b0a7a1a0af
> >> ("sched/fair: Ensure tasks spreading in LLC during LB") and commit
> >> c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
> >> condition") helps the workload. Both those commits allow aggressive
> >> migrations for work conservation except it also increased cache
> >> misses which slows the workload quite a bit.
> > commit 16b0a7a1a0af  ("sched/fair: Ensure tasks spreading in LLC
> > during LB") eases the spread of task inside a LLC so It's not obvious
> > for me how it would increase "a lot of CPU migrations go out of CCX,
> > then L3 miss,". On the other hand, it will spread task in SMT and in
> > LLC which can prevent running at highest freq on some system but I
> > don't know if it's relevant for this SoC.
>
> I misspoke there. JB's workload seems to be sensitive even to core to
> core migrations - "relax_domain_level=2" actually disabled newidle
> balance above CLUSTER level which is a subset of MC on x86 and gets

Did he try with relax_domain_level=3, i.e. preventing newidle
balance between LLCs? I don't see results showing that it isn't enough
to prevent newly idle migrations between LLCs.

> degenerated into the SMT domain.
>
> >
> > commit c5b0a7eefc70 ("sched/fair: Remove sysctl_sched_migration_cost
> > condition") makes newly idle migration happen more often which can
> > then do migrate tasks across LLC. But then It's more about why
> > enabling newly idle load balance out of LLC if it is so costly.
>
> It seems to be very workload + possibly platform specific
> characteristic where re-priming the cache is actually very costly.
> I'm not sure if there are any other uarch factors at play here that
> require repriming (branch prediction, prefetcher, etc.) after a task
> migration to reach same IPC.
>
> Essentially "relax_domain_level" gets the desired characteristic
> where only the periodic balance will balance long-term imbalance
> but as Libo mentioned the short term imbalances can build up
> and using "relax_domain_level" might lead to other problems.
>
> Short of pinning / more analysis of which part of migrations make
> the workload unhappy, I couldn't think of a better way to
> communicate this requirement.
>
> >
> >> "relax_domain_level" helps but cannot be set at runtime and I couldn't
> >> think of any stable / debug interfaces that JB hasn't tried out
> >> already that can help this workload.
> >>
> >> There is a patch towards the end to set "relax_domain_level" at
> >> runtime but given cpusets got away with this when transitioning to
> >> cgroup-v2, I don't know what the sentiments are around its usage.
> >> Any input / feedback is greatly appreciated.
>
> --
> Thanks and Regards,
> Prateek
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: IPC drop down on AMD epyc 7702P
  2025-05-05 15:10           ` Vincent Guittot
@ 2025-05-05 15:16             ` K Prateek Nayak
  2025-05-16 15:05               ` Jean-Baptiste Roquefere
  0 siblings, 1 reply; 16+ messages in thread
From: K Prateek Nayak @ 2025-05-05 15:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Jean-Baptiste Roquefere, Peter Zijlstra, mingo@kernel.org,
	Juri Lelli, linux-kernel@vger.kernel.org, Borislav Petkov,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Gautham R. Shenoy, Swapnil Sapkal, Valentin Schneider,
	regressions@lists.linux.dev, stable@vger.kernel.org

Hello Vincent,

On 5/5/2025 8:40 PM, Vincent Guittot wrote:
>>> commit 16b0a7a1a0af  ("sched/fair: Ensure tasks spreading in LLC
>>> during LB") eases the spread of task inside a LLC so It's not obvious
>>> for me how it would increase "a lot of CPU migrations go out of CCX,
>>> then L3 miss,". On the other hand, it will spread task in SMT and in
>>> LLC which can prevent running at highest freq on some system but I
>>> don't know if it's relevant for this SoC.
>>
>> I misspoke there. JB's workload seems to be sensitive even to core to
>> core migrations - "relax_domain_level=2" actually disabled newidle
>> balance above CLUSTER level which is a subset of MC on x86 and gets
> 
> Did he try with relax_domain_level=3, i.e. prevent newilde idle
> balance between LLC ? I don't see results showing that it's not enough
> to prevent newly idle migration between LLC

I don't think he did. JB, if it isn't too much trouble, could you please
try running with "relax_domain_level=3" in the kernel cmdline and see if
the performance is similar to "relax_domain_level=2".

I only realized later that "relax_domain_level" works on topology
levels before degeneration.
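
For reference, a quick way to see which domains actually survive
degeneration on a given CPU (note these debugfs indices are
post-degeneration, so they do not map 1:1 onto the pre-degeneration
levels that "relax_domain_level" counts):

  echo Y > /sys/kernel/debug/sched/verbose
  grep . /sys/kernel/debug/sched/domains/cpu0/domain*/name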

-- 
Thanks and Regards,
Prateek



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: IPC drop down on AMD epyc 7702P
  2025-05-05 15:16             ` K Prateek Nayak
@ 2025-05-16 15:05               ` Jean-Baptiste Roquefere
  2025-05-22 14:51                 ` Vincent Guittot
  0 siblings, 1 reply; 16+ messages in thread
From: Jean-Baptiste Roquefere @ 2025-05-16 15:05 UTC (permalink / raw)
  To: K Prateek Nayak, Vincent Guittot
  Cc: Peter Zijlstra, mingo@kernel.org, Juri Lelli,
	linux-kernel@vger.kernel.org, Borislav Petkov, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Gautham R. Shenoy,
	Swapnil Sapkal, Valentin Schneider, regressions@lists.linux.dev,
	stable@vger.kernel.org

Hello Prateek,
long time no see... I've been very busy lately.

>> Did he try with relax_domain_level=3, i.e. preventing newidle balance
>> between LLCs? I don't see results showing that it isn't enough to
>> prevent newly idle migrations between LLCs.
>
> I don't think he did. JB if it isn't too much trouble, could you please
> try running with "relax_domain_level=3" in kernel cmdline and see if
> the performance is similar to "relax_domain_level=2".

I just tried relax_domain_level=3 on my payload. As you can see,
relax_domain_level=3 performance is more or less the same:

+--------------------+---------------------+---------------------+
| Kernel             | 6.12.17 relax dom 2 | 6.12.17 relax dom 3 |
+--------------------+---------------------+---------------------+
| Utilization (%)    | 52,01               | 52,15               |
| CPU effective freq | 1 294,12            | 1 309,85            |
| IPC                | 1,42                | 1,40                |
| L2 access (pti)    | 38,18               | 38,03               |
| L2 miss   (pti)    | 7,78                | 7,90                |
| L3 miss   (abs)    | 33 929 609 924,00   | 33 705 899 797,00   |
| Mem (GB/s)         | 49,10               | 48,91               |
| Context switches   | 107 896 729,00      | 106 441 463,00      |
| CPU migrations     | 16 075 947,00       | 18 129 700,00       |
| Real time (s)      | 193,39              | 193,41              |
+--------------------+---------------------+---------------------+

We got the point that tuning this variable is not a good solution, but
for now it's the only one we can apply.

Without this tuning our solution loses real-time video processing; with
it, we keep real time.


Thanks for your help. I'll stay alert on this thread in case a better
solution emerges someday.


Regards,


jb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: IPC drop down on AMD epyc 7702P
  2025-05-16 15:05               ` Jean-Baptiste Roquefere
@ 2025-05-22 14:51                 ` Vincent Guittot
  2025-05-23 12:24                   ` Jean-Baptiste Roquefere
  0 siblings, 1 reply; 16+ messages in thread
From: Vincent Guittot @ 2025-05-22 14:51 UTC (permalink / raw)
  To: Jean-Baptiste Roquefere
  Cc: K Prateek Nayak, Peter Zijlstra, mingo@kernel.org, Juri Lelli,
	linux-kernel@vger.kernel.org, Borislav Petkov, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Gautham R. Shenoy,
	Swapnil Sapkal, Valentin Schneider, regressions@lists.linux.dev,
	stable@vger.kernel.org

Hi Jean-Baptiste,

On Fri, 16 May 2025 at 17:05, Jean-Baptiste Roquefere
<jb.roquefere@ateme.com> wrote:
>
> Hello Prateek,
> long time no see... I've been very busy lately.
>
> Did he try with relax_domain_level=3, i.e. prevent newilde idle
>
>
> >> balance between LLC ? I don't see results showing that it's not enough
> >> to prevent newly idle migration between LLC
> >
> > I don't think he did. JB if it isn't too much trouble, could you please
> > try running with "relax_domain_level=3" in kernel cmdline and see if
> > the performance is similar to "relax_domain_level=2".
>
> I just tried relax_domain_level=3 on my payload. As you can see
> relax_domain_level=3 performances are more or less the same

As there is no difference between level 2 and level 3, I assume that
the problem is not linked to core-to-core migrations but only to
migrations between LLCs.

As said previously, I don't see an obvious connection with commit
16b0a7a1a0af ("sched/fair: Ensure tasks spreading in LLC during LB"),
which mainly ensures a better usage of the CPUs inside an LLC. Do you
have cpufreq and frequency scaling enabled? The only link I could think
of is that spreading tasks inside an LLC favors inter-LLC newly idle
load balancing.
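
Something like the below would tell - these are just the standard
sysfs / cpupower checks, and the sysfs nodes simply won't exist if no
scaling driver is registered:

  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  cpupower frequency-info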

>
> +--------------------+---------------------+---------------------+
> | Kernel             | 6.12.17 relax dom 2 | 6.12.17 relax dom 3 |
> +--------------------+---------------------+---------------------+
> | Utilization (%)    | 52,01               | 52,15               |
> | CPU effective freq | 1 294,12            | 1 309,85            |
> | IPC                | 1,42                | 1,40                |
> | L2 access (pti)    | 38,18               | 38,03               |
> | L2 miss   (pti)    | 7,78                | 7,90                |
> | L3 miss   (abs)    | 33 929 609 924,00   | 33 705 899 797,00   |
> | Mem (GB/s)         | 49,10               | 48,91               |
> | Context switches   | 107 896 729,00      | 106 441 463,00      |
> | CPU migrations     | 16 075 947,00       | 18 129 700,00       |
> | Real time (s)      | 193,39              | 193,41              |
> +--------------------+---------------------+---------------------+
>
> We got the point that tuning this variable is not a good solution, but
> for now it's the only one we can apply.
>
> Without this tuning our solution loses real time video processing. With
> : we keep real time on.
>
>
> Thanks for your help, I'll stay alert on this thread if someday a better
> solution can emerge.
>
>
> Regards,
>
>
> jb
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: IPC drop down on AMD epyc 7702P
  2025-05-22 14:51                 ` Vincent Guittot
@ 2025-05-23 12:24                   ` Jean-Baptiste Roquefere
  2025-05-26  7:53                     ` Vincent Guittot
  0 siblings, 1 reply; 16+ messages in thread
From: Jean-Baptiste Roquefere @ 2025-05-23 12:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: K Prateek Nayak, Peter Zijlstra, mingo@kernel.org, Juri Lelli,
	linux-kernel@vger.kernel.org, Borislav Petkov, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Gautham R. Shenoy,
	Swapnil Sapkal, Valentin Schneider, regressions@lists.linux.dev,
	stable@vger.kernel.org

Hello Vincent,

> As said previously, I don't see an obvious connection between commit
> 16b0a7a1a0af  ("sched/fair: Ensure tasks spreading in LLC during LB")
> which mainly ensures a better usage of CPUs inside a LLC. Do you have
> cpufreq and freq scaling enabled ? The only link that I could think
> of, is that the spread of task inside a llc favors inter LLC newly
> idle load balance
# lsmod | grep cpufreq
cpufreq_userspace      16384  0
cpufreq_conservative    16384  0
cpufreq_powersave      16384  0


but I'm not sure a cpufreq driver is actually loaded:

# cpupower frequency-info
analyzing CPU 0:
   no or unknown cpufreq driver is active on this CPU
   CPUs which run at the same hardware frequency: Not Available
   CPUs which need to have their frequency coordinated by software: Not Available
   maximum transition latency:  Cannot determine or is not supported.
Not Available
   available cpufreq governors: Not Available
   Unable to determine current policy
   current CPU frequency: Unable to call hardware
   current CPU frequency:  Unable to call to kernel
   boost state support:
     Supported: yes
     Active: yes
     Boost States: 0
     Total States: 3
     Pstate-P0:  2000MHz
     Pstate-P1:  1800MHz
     Pstate-P2:  1500MHz

And I can't find cpufreq/ under /sys/devices/system/cpu/cpu*/
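
In case it matters, I can also double-check whether a driver was built
at all with something like this (assuming the usual /boot/config-*
naming on this distro):

  grep -E 'CONFIG_CPU_FREQ=|CONFIG_X86_ACPI_CPUFREQ|CONFIG_X86_AMD_PSTATE' \
      /boot/config-$(uname -r)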


Thanks for your help,

jb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: IPC drop down on AMD epyc 7702P
  2025-05-23 12:24                   ` Jean-Baptiste Roquefere
@ 2025-05-26  7:53                     ` Vincent Guittot
  0 siblings, 0 replies; 16+ messages in thread
From: Vincent Guittot @ 2025-05-26  7:53 UTC (permalink / raw)
  To: Jean-Baptiste Roquefere
  Cc: K Prateek Nayak, Peter Zijlstra, mingo@kernel.org, Juri Lelli,
	linux-kernel@vger.kernel.org, Borislav Petkov, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Gautham R. Shenoy,
	Swapnil Sapkal, Valentin Schneider, regressions@lists.linux.dev,
	stable@vger.kernel.org

On Fri, 23 May 2025 at 14:24, Jean-Baptiste Roquefere
<jb.roquefere@ateme.com> wrote:
>
> Hello Vincent,
>
> > As said previously, I don't see an obvious connection between commit
> > 16b0a7a1a0af  ("sched/fair: Ensure tasks spreading in LLC during LB")
> > which mainly ensures a better usage of CPUs inside a LLC. Do you have
> > cpufreq and freq scaling enabled ? The only link that I could think
> > of, is that the spread of task inside a llc favors inter LLC newly
> > idle load balance
> # lsmod | grep cpufreq
> cpufreq_userspace      16384  0
> cpufreq_conservative    16384  0
> cpufreq_powersave      16384  0
>
>
> but I'm not sure cpufreq is well loaded :
>
> # cpupower frequency-info
> analyzing CPU 0:
>    no or unknown cpufreq driver is active on this CPU
>    CPUs which run at the same hardware frequency: Not Available
>    CPUs which need to have their frequency coordinated by software: Not
> Available
>    maximum transition latency:  Cannot determine or is not supported.
> Not Available
>    available cpufreq governors: Not Available
>    Unable to determine current policy
>    current CPU frequency: Unable to call hardware
>    current CPU frequency:  Unable to call to kernel
>    boost state support:
>      Supported: yes
>      Active: yes
>      Boost States: 0
>      Total States: 3
>      Pstate-P0:  2000MHz
>      Pstate-P1:  1800MHz
>      Pstate-P2:  1500MHz
>
> And I cant find cpufreq/ under /sys/devices/system/cpu/cpu*/

Looks like you don't have a cpufreq driver, so we can rule out a perf
drop because of a lower average frequency.
Thanks

>
>
> Thanks for your help,
>
> jb
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-05-26  7:53 UTC | newest]

Thread overview: 16+ messages
2025-04-17 21:08 IPC drop down on AMD epyc 7702P Jean-Baptiste Roquefere
2025-04-18  6:39 ` K Prateek Nayak
2025-04-28  7:43   ` Jean-Baptiste Roquefere
2025-04-30  9:13     ` K Prateek Nayak
2025-04-30  9:25       ` Peter Zijlstra
2025-04-30 10:41       ` Libo Chen
2025-04-30 11:29         ` K Prateek Nayak
2025-05-01  2:46           ` Libo Chen
2025-05-05 10:28       ` Vincent Guittot
2025-05-05 12:29         ` K Prateek Nayak
2025-05-05 15:10           ` Vincent Guittot
2025-05-05 15:16             ` K Prateek Nayak
2025-05-16 15:05               ` Jean-Baptiste Roquefere
2025-05-22 14:51                 ` Vincent Guittot
2025-05-23 12:24                   ` Jean-Baptiste Roquefere
2025-05-26  7:53                     ` Vincent Guittot
