public inbox for linux-kernel@vger.kernel.org
* scheduler performance regression since v6.11
@ 2025-05-07 23:13 Chris Mason
  2025-05-09 19:49 ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Mason @ 2025-05-07 23:13 UTC (permalink / raw)
  To: Peter Zijlstra, linux-kernel

Hi everyone,

I've spent some time trying to track down a regression in a networking
benchmark, where it looks like we're spending roughly 10% more time in
new idle balancing than 6.9 did.

I'm not sure if I've reproduced that exact regression, but with some
changes to schbench, I was able to bisect a regression of some kind down
to commits in v6.11.

The actual result of the bisect was:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
781773e3b68031bd001c0c18aa72e8470c225ebd
e1459a50ba31831efdfc35278023d959e4ba775b
f12e148892ede8d9ee82bcd3e469e6d01fc077ac
152e11f6df293e816a6a37c69757033cdc72667d
2e0199df252a536a03f4cb0810324dff523d1e79
54a58a78779169f9c92a51facf6de7ce94962328
We cannot bisect more!

And this roughly matches the commits that introduce DELAY_DEQUEUE and
DELAY_ZERO.  Sadly, echoing NO_DELAY_DEQUEUE/ZERO doesn't fix the regression
in 6.15.  I had to skip ~5 commits because they didn't compile, but
Peter might have already worked that part of the bisect out by the time
this email hits the lists.

I've landed a few modifications to schbench
(https://github.com/masoncl/schbench), but the basic workload is:

 - 4 message threads
  - each message thread pinned to its own single CPU
 - each waking up 128 worker threads
  - each worker sharing all the remaining CPUs

Neither the message threads nor the worker threads are doing anything
other than waking up and going to sleep, so the whole RPS count from
schbench is basically just how fast 4 threads can wake up everyone else
on the system.

The answer is that v6.9 can do it roughly 8% faster than 6.11+.  I've
tried on a few different CPUs; some have bigger or smaller deltas, but
the regression is consistently there.

The exact command line I'm using:

schbench -L -m 4 -M auto -t 128 -n 0 -r 60

-L turns off the locking complexity I use to simulate our web workload
-m 4 is 4 message threads
-t 128 is 128 workers per message thread
-n 0 turns off all the think time
-r 60 is runtime in seconds
-M auto is the new CPU pinning described above

I haven't actually tried to fix this yet, but I do have some profiling
that might help, or is maybe interesting.  The most important thing in
the data down below is probably that 6.9 is calling available_idle_cpu()
16M times in 10 seconds, and 6.15 is calling it 56M times in 10 seconds.

The next step for me is to stall for time while hoping someone else
fixes this, but I'll try and understand why we're available_idle_cpuing
so hard in later kernels.

6.15.0-rc5 (git sha 0d8d44db295c, Linus as of May 6)

schbench RPS percentiles (requests) runtime 60 (s) (61 total samples)
	  20.0th: 1767424    (13 samples)
	* 50.0th: 1816576    (18 samples)
	  90.0th: 1841152    (27 samples)
	  min=1733207, max=2027049
average rps: 1813674.47

v6.9

schbench RPS percentiles (requests) runtime 60 (s) (61 total samples)
	  20.0th: 1955840    (14 samples)
	* 50.0th: 1968128    (17 samples)
	  90.0th: 1984512    (26 samples)
	  min=1908492, max=2122446
average rps: 1972418.82

6.9 perf from CPU #2

    15.50%  schbench  [kernel.kallsyms]  [k] available_idle_cpu
     7.45%  schbench  [kernel.kallsyms]  [k] llist_add_batch
     6.92%  schbench  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     5.90%  schbench  [kernel.kallsyms]  [k] futex_wake
     4.62%  schbench  schbench           [.] fpost
     4.57%  schbench  [kernel.kallsyms]  [k] futex_wake_mark
     4.46%  schbench  [kernel.kallsyms]  [k] select_task_rq_fair
     4.26%  schbench  [kernel.kallsyms]  [k] _raw_spin_lock
     4.19%  schbench  schbench           [.] xlist_wake_all
     3.95%  schbench  [kernel.kallsyms]  [k] _find_next_bit
     3.55%  schbench  [kernel.kallsyms]  [k] __futex_unqueue

I don't know why perf record -g and perf report -g aren't giving me call
graphs; it must be something funky with the fleet version of perf and my
hand built kernels.  But here's the top call stack for
available_idle_cpu(), along with line numbers from blazesym.

12313 samples (6.66%) Comms: schbench
available_idle_cpu @ linux/kernel/sched/core.c:7437:7
idle_cpu @ linux/kernel/sched/core.c:7415:5 [inlined]
select_idle_core @ linux/kernel/sched/fair.c:7301:6
select_task_rq_fair @ linux/kernel/sched/fair.c:8219:13
select_idle_sibling @ linux/kernel/sched/fair.c:7618:6 [inlined]
select_idle_cpu @ linux/kernel/sched/fair.c:7420:24 [inlined]
try_to_wake_up @ linux/kernel/sched/core.c:4363:9
select_task_rq @ linux/kernel/sched/core.c:3637:9 [inlined]
wake_up_q @ linux/kernel/sched/core.c:1030:3
put_task_struct @ linux/include/linux/sched/task.h:127:7 [inlined]
futex_wake @ linux/kernel/futex/waitwake.c:200:9
do_futex @ linux/kernel/futex/syscalls.c:131:1
__x64_sys_futex @ linux/kernel/futex/syscalls.c:160:1
__se_sys_futex @ linux/kernel/futex/syscalls.c:160:1 [inlined]
__do_sys_futex @ linux/kernel/futex/syscalls.c:179:9 [inlined]
do_syscall_64 @ linux/arch/x86/entry/common.c:83:7
do_syscall_x64 @ linux/arch/x86/entry/common.c:52:12 [inlined]
entry_SYSCALL_64_after_hwframe

Call frequency count of available_idle_cpu() (10 seconds of sampling,
system wide)

@counts: 16273429

v6.15-rc5 (sha 0d8d44db295c)

perf from CPU #2

    17.53%  schbench  [kernel.kallsyms] [k] available_idle_cpu
     7.11%  schbench  [kernel.kallsyms] [k] llist_add_batch
     5.75%  schbench  [kernel.kallsyms] [k] futex_wake
     4.98%  schbench  [kernel.kallsyms] [k] try_to_wake_up
     4.89%  schbench  [kernel.kallsyms] [k] futex_wake_mark
     4.51%  schbench  schbench          [.] fpost
     4.48%  schbench  [kernel.kallsyms] [k] select_task_rq_fair
     4.28%  schbench  [kernel.kallsyms] [k] _find_next_bit
     4.21%  schbench  [kernel.kallsyms] [k] _raw_spin_lock
     3.98%  schbench  schbench          [.] xlist_wake_all
     3.20%  schbench  [kernel.kallsyms] [k] migrate_task_rq_fair
     2.95%  schbench  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
     2.90%  schbench  [kernel.kallsyms] [k] select_idle_core.constprop.0
     2.83%  schbench  [kernel.kallsyms] [k] __futex_unqueue
     2.78%  schbench  [kernel.kallsyms] [k] set_task_cpu
     2.70%  schbench  [kernel.kallsyms] [k] clear_bhb_loop
     2.14%  schbench  [kernel.kallsyms] [k] remove_entity_load_avg
     2.10%  schbench  [kernel.kallsyms] [k] ttwu_queue_wakelist
     2.09%  schbench  [kernel.kallsyms] [k] call_function_single_prep_ipi
     1.34%  schbench  libc.so.6         [.] syscall

17128 samples (9.06%) Comms: schbench
available_idle_cpu @ linux/kernel/sched/syscalls.c:228:7
idle_cpu @ linux/kernel/sched/syscalls.c:206:5 [inlined]
select_idle_core @ linux/kernel/sched/fair.c:7621:6
select_task_rq_fair @ linux/kernel/sched/fair.c:8637:13
select_idle_sibling @ linux/kernel/sched/fair.c:7938:6 [inlined]
select_idle_cpu @ linux/kernel/sched/fair.c:7740:24 [inlined]
try_to_wake_up @ linux/kernel/sched/core.c:4313:9
select_task_rq @ linux/kernel/sched/core.c:3583:9 [inlined]
wake_up_q @ linux/kernel/sched/core.c:1081:3
put_task_struct @ linux/./include/linux/sched/task.h:134:7 [inlined]
futex_wake @ linux/kernel/futex/waitwake.c:200:9
do_futex @ linux/kernel/futex/syscalls.c:131:1
__x64_sys_futex @ linux/kernel/futex/syscalls.c:160:1
__se_sys_futex @ linux/kernel/futex/syscalls.c:160:1 [inlined]
__do_sys_futex @ linux/kernel/futex/syscalls.c:179:9 [inlined]
do_syscall_64 @ linux/arch/x86/entry/syscall_64.c:94:7
do_syscall_x64 @ linux/arch/x86/entry/syscall_64.c:63:12 [inlined]
entry_SYSCALL_64_after_hwframe

Call frequency count of available_idle_cpu(), system-wide, 10 seconds:

@counts: 55867130

-chris



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: scheduler performance regression since v6.11
  2025-05-07 23:13 scheduler performance regression since v6.11 Chris Mason
@ 2025-05-09 19:49 ` Peter Zijlstra
  2025-05-12 18:08   ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-09 19:49 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-kernel, Ingo Molnar, dietmar.eggemann, vschneid

On Wed, May 07, 2025 at 07:13:18PM -0400, Chris Mason wrote:
> Hi everyone,
> 
> I've spent some time trying to track down a regression in a networking
> benchmark, where it looks like we're spending roughly 10% more time in
> new idle balancing than 6.9 did.
> 
> I'm not sure if I've reproduced that exact regression, but with some
> changes to schbench, I was able to bisect a regression of some kind down
> to commits in v6.11.
> 
> The actual result of the bisect was:
> 
> There are only 'skip'ped commits left to test.
> The first bad commit could be any of:
> 781773e3b68031bd001c0c18aa72e8470c225ebd
> e1459a50ba31831efdfc35278023d959e4ba775b
> f12e148892ede8d9ee82bcd3e469e6d01fc077ac
> 152e11f6df293e816a6a37c69757033cdc72667d
> 2e0199df252a536a03f4cb0810324dff523d1e79
> 54a58a78779169f9c92a51facf6de7ce94962328
> We cannot bisect more!
> 
> And this roughly matches the commits that introduce DELAY_DEQUEUE and
> DELAY_ZERO.  Sadly, echoing NO_DELAY_DEQUEUE/ZERO doesn't fix the regression
> in 6.15.  I had to skip ~5 commits because they didn't compile, but
> Peter might have already worked that part of the bisect out by the time
> this email hits the lists.
> 
> I've landed a few modifications to schbench
> (https://github.com/masoncl/schbench), but the basic workload is:
> 
>  - 4 message threads
>   - each message thread pinned to its own single CPU
>  - each waking up 128 worker threads
>   - each worker sharing all the remaining CPUs
> 
> Neither the message threads nor the worker threads are doing anything
> other than waking up and going to sleep, so the whole RPS count from
> schbench is basically just how fast 4 threads can wake up everyone else
> on the system.
> 
> The answer is that v6.9 can do it roughly 8% faster than 6.11+.  I've
> tried on a few different CPUs; some have bigger or smaller deltas, but
> the regression is consistently there.
> 
> The exact command line I'm using:
> 
> schbench -L -m 4 -M auto -t 128 -n 0 -r 60
> 
> -L turns off the locking complexity I use to simulate our web workload
> -m 4 is 4 message threads
> -t 128 is 128 workers per message thread
> -n 0 turns off all the think time
> -r 60 is runtime in seconds
> -M auto is the new CPU pinning described above
> 
> I haven't actually tried to fix this yet, but I do have some profiling
> that might help, or is maybe interesting.  The most important thing in
> the data down below is probably that 6.9 is calling available_idle_cpu()
> 16M times in 10 seconds, and 6.15 is calling it 56M times in 10 seconds.
> 
> The next step for me is to stall for time while hoping someone else
> fixes this, but I'll try and understand why we're available_idle_cpuing
> so hard in later kernels.
> 
> 6.15.0-rc5 (git sha 0d8d44db295c, Linus as of May 6)
> 
> schbench RPS percentiles (requests) runtime 60 (s) (61 total samples)
> 	  20.0th: 1767424    (13 samples)
> 	* 50.0th: 1816576    (18 samples)
> 	  90.0th: 1841152    (27 samples)
> 	  min=1733207, max=2027049
> average rps: 1813674.47
> 
> v6.9
> 
> schbench RPS percentiles (requests) runtime 60 (s) (61 total samples)
> 	  20.0th: 1955840    (14 samples)
> 	* 50.0th: 1968128    (17 samples)
> 	  90.0th: 1984512    (26 samples)
> 	  min=1908492, max=2122446
> average rps: 1972418.82

Right, so when you poked me on IRC I started with testing v6.9,
tip/master and the two commits on either end of that range you
mentioned.

For me, the change is outside of that range, notably both ends perform
similar to v6.9.

As you mention, I spent a good few hours doing a traditional git-bisect
and got absolutely nowhere. After a proper bit of cursing git, Ingo
helped me out and got me a git command to list all the scheduler commits
that have DELAY_ZERO on.

Using that, I've done a manual bisect today, results below.

TL;DR: for me it looks like the drop happens at commit 729288bc6856
("kernel/sched: Fix util_est accounting for DELAY_DEQUEUE").

I've not yet tried to make sense of this result -- or double check.

(the results from 161 onwards are from the initial bisect run and have
been re-used here)

I'll prod more at this on Monday or so.

---

$ /usr/local/src/schbench/schbench -L -m 4 -t 160 -r 60 -n 0 -M 0-3 -W 4-1024

6.9: average rps: 2117400.22
     average rps: 2143333.58
     average rps: 2105387.47

tip: average rps: 1925521.90
     average rps: 1970553.90
     average rps: 1990727.87

$ git log --oneline --no-merges --ancestry-path=54a58a787791 54a58a787791..origin/master kernel/sched/ | grep -v sched_ext | awk '{ printf "%03d %s\n", ++i, $0 }'

001 bbce3de72be5 sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
	average rps: 1932495.83
	average rps: 1946343.63

002 75da043d8f88 cpufreq/sched: Set need_freq_update in ignore_dl_rate_limit()
003 79443a7e9da3 cpufreq/sched: Explicitly synchronize limits_changed flag handling
004 cfde542df7dd cpufreq/sched: Fix the usage of CPUFREQ_NEED_UPDATE_LIMITS
005 8fa7292fee5c treewide: Switch/rename to timer_delete[_sync]()
006 169eae7711ea rseq: Eliminate useless task_work on execve
007 26f80681a09b sched: Add sched tracepoints for RV task model
008 2cbb20b008db tracing: Disable branch profiling in noinstr code
009 dd5bdaf2b72d sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional
010 57903f72f270 sched/debug: Make 'const_debug' tunables unconditional __read_mostly
011 f7d2728cc032 sched/debug: Change SCHED_WARN_ON() to WARN_ON_ONCE()
012 d128130f486b sched/topology: Stop exposing partition_sched_domains_locked
013 d735bab3d58c sched/topology: Remove redundant dl_clear_root_domain call
014 2ff899e35164 sched/deadline: Rebuild root domain accounting after every update
015 45007c6fb586 sched/deadline: Generalize unique visiting of root domains
016 56209334dda1 sched/topology: Wrappers for sched_domains_mutex
017 f6147af176ea sched/deadline: Ignore special tasks when rebuilding domains
018 8bdc5daaa01e sched: Add a generic function to return the preemption string
019 76f970ce51c8 Revert "sched/core: Reduce cost of sched_move_task when config autogroup"
020 4bc45824149e sched/uclamp: Optimize sched_uclamp_used static key enabling
021 5fca5a4cf973 sched/uclamp: Use the uclamp_is_used() helper instead of open-coding it
022 f3fa0e40df17 sched/clock: Don't define sched_clock_irqtime as static key
023 14672f059d83 sched/deadline: Use online cpus for validating runtime
024 3b4035ddbfc8 sched/fair: Fix potential memory corruption in child_cfs_rq_on_list
025 82c387ef7568 sched/core: Prevent rescheduling when interrupts are disabled
026 1a5d3492f8e1 sched: Add unlikey branch hints to several system calls
027 b796ea848991 sched/core: Remove duplicate included header file stats.h
028 ee13da875b8a sched: Switch to use hrtimer_setup()
029 02d954c0fdf9 sched: Compact RSEQ concurrency IDs with reduced threads and affinity
030 d34e798094ca sched/fair: Refactor can_migrate_task() to elimate looping
031 563bc2161b94 sched/eevdf: Force propagating min_slice of cfs_rq when {en,de}queue tasks
032 b9f2b29b9494 sched: Don't define sched_clock_irqtime as static key
033 2ae891b82695 sched: Reduce the default slice to avoid tasks getting an extra tick
034 f553741ac8c0 sched: Cancel the slice protection of the idle entity
035 bcc6244e13b4 sched: Clarify wake_up_q()'s write to task->wake_q.next
036 2c00e1199c06 sched: update __cond_resched comment about RCU quiescent states
037 9065ce69754d sched/debug: Provide slice length for fair tasks
038 1751f872cc97 treewide: const qualify ctl_tables where applicable
039 65ef17aa0711 hung_task: add task->flags, blocked by coredump to log
040 93940fbdc468 cpufreq/schedutil: Only bind threads if needed
041 3429dd57f0de sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue
042 21641bd9a7a7 lazy tlb: fix hotplug exit race with MMU_LAZY_TLB_SHOOTDOWN
043 d40797d6720e kasan: make kasan_record_aux_stack_noalloc() the default behaviour
044 7d9da040575b psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
045 a6fd16148fdd sched, psi: Don't account irq time if sched_clock_irqtime is disabled
046 763a744e24a8 sched: Don't account irq time if sched_clock_irqtime is disabled
047 8722903cbb8f sched: Define sched_clock_irqtime as static key
048 3229adbe7875 sched/fair: Do not compute overloaded status unnecessarily during lb
049 0ac1ee9ebfb7 sched/fair: Do not compute NUMA Balancing stats unnecessarily during lb
050 873199d27bb2 sched/core: Prioritize migrating eligible tasks in sched_balance_rq()
051 8061b9f5e111 sched/debug: Change need_resched warnings to pr_err
052 2cf9ac40073d sched/fair: Encapsulate set custom slice in a __setparam_fair() function
053 5d808c78d972 sched: Fix race between yield_to() and try_to_wake_up()
054 66951e4860d3 sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE
055 f65c64f311ee delayacct: add delay min to record delay peak
056 658eb5ab916d delayacct: add delay max to record delay peak
057 6d71a9c61604 sched/fair: Fix EEVDF entity placement bug causing scheduling lag
058 b04e317b5226 treewide: Introduce kthread_run_worker[_on_cpu]()
059 3a5446612a3f sched,arm64: Handle CPU isolation on last resort fallback rq selection
060 7c8cd569ff66 docs: Update Schedstat version to 17
061 011b3a14dc66 sched/stats: Print domain name in /proc/schedstat
062 1c055a0f5d3b sched: Move sched domain name out of CONFIG_SCHED_DEBUG
063 3b2a793ea70f sched: Report the different kinds of imbalances in /proc/schedstat
064 c3856c9ce6b8 sched/fair: Cleanup in migrate_degrades_locality() to improve readability
065 a430d99e3490 sched/fair: Fix value reported by hot tasks pulled in /proc/schedstat
066 ee8118c1f186 sched/fair: Update comments after sched_tick() rename.
067 ebeeee390b6a PM: EM: Move sched domains rebuild function from schedutil to EM
068 8e461a1cb43d cpufreq: schedutil: Fix superfluous updates caused by need_freq_update
069 af98d8a36a96 sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug
070 c7f7e9c73178 sched/dlserver: Fix dlserver time accounting
071 b53127db1dbf sched/dlserver: Fix dlserver double enqueue
072 7675361ff9a1 sched: deadline: Cleanup goto label in pick_earliest_pushable_dl_task
073 2a77e4be12cb sched/fair: Untangle NEXT_BUDDY and pick_next_task()
074 95d9fed3a2ae sched/fair: Mark m*_vruntime() with __maybe_unused
075 0429489e0928 sched/fair: Fix variable declaration position
076 61b82dfb6b7e sched/fair: Do not try to migrate delayed dequeue task
077 736c55a02c47 sched/fair: Rename cfs_rq.nr_running into nr_queued
078 43eef7c3a4a6 sched/fair: Remove unused cfs_rq.idle_nr_running
079 31898e7b87dd sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
080 9216582b0bfb sched/fair: Removed unsued cfs_rq.h_nr_delayed
	average rps: 1988000.78
	average rps: 1928511.27

081 1a49104496d3 sched/fair: Use the new cfs_rq.h_nr_runnable
082 c2a295bffeaf sched/fair: Add new cfs_rq.h_nr_runnable
083 7b8a702d9438 sched/fair: Rename h_nr_running into h_nr_queued
084 76f2f783294d sched/eevdf: More PELT vs DELAYED_DEQUEUE
085 c1f43c342e1f sched/fair: Fix sched_can_stop_tick() for fair tasks
086 493afbd187c4 sched/fair: Fix NEXT_BUDDY
087 5f1b64e9a9b7 sched/numa: fix memory leak due to the overwritten vma->numab_state
088 c907cd44a108 sched: Unify HK_TYPE_{TIMER|TICK|MISC} to HK_TYPE_KERNEL_NOISE
089 6010d245ddc9 sched/isolation: Consolidate housekeeping cpumasks that are always identical
090 1174b9344bc7 sched/isolation: Make "isolcpus=nohz" equivalent to "nohz_full"
091 ae5c677729e9 sched/core: Remove HK_TYPE_SCHED
092 a76328d44c7a sched/fair: Remove CONFIG_CFS_BANDWIDTH=n definition of cfs_bandwidth_used()
093 3a181f20fb4e sched/deadline: Consolidate Timer Cancellation
094 53916d5fd3c0 sched/deadline: Check bandwidth overflow earlier for hotplug
095 d4742f6ed7ea sched/deadline: Correctly account for allocated bandwidth during hotplug
096 41d4200b7103 sched/deadline: Restore dl_server bandwidth on non-destructive root domain changes
097 59297e2093ce sched: add READ_ONCE to task_on_rq_queued
098 108ad0999085 sched: Don't try to catch up excess steal time.
099 0664e2c311b9 sched/deadline: Fix warning in migrate_enable for boosted tasks
100 e932c4ab38f0 sched/core: Prevent wakeup of ksoftirqd during idle load balance
101 ff47a0acfcce sched/fair: Check idle_cpu() before need_resched() to detect ilb CPU turning busy
102 ea9cffc0a154 sched/core: Remove the unnecessary need_resched() check in nohz_csd_func()
103 70ee7947a290 sched: fix warning in sched_setaffinity
104 22368fe1f9bb sched/deadline: Fix replenish_dl_new_period dl_server condition
105 70d8b6485b0b sched/cpufreq: Ensure sd is rebuilt for EAS check
106 46d076af6d64 sched/idle: Switch to use hrtimer_setup_on_stack()
107 35772d627b55 sched: Enable PREEMPT_DYNAMIC for PREEMPT_RT
108 7c70cb94d29c sched: Add Lazy preemption model
109 26baa1f1c4bd sched: Add TIF_NEED_RESCHED_LAZY infrastructure
110 0f0d1b8e5010 sched/ext: Remove sched_fork() hack
111 b23decf8ac91 sched: Initialize idle tasks only once
112 69d5e722be94 sched/ext: Fix scx vs sched_delayed
113 5db91545ef81 sched: Pass correct scheduling policy to __setscheduler_class
114 1a6151017ee5 sched: psi: pass enqueue/dequeue flags to psi callbacks directly
115 9c70b2a33cd2 sched/numa: Fix the potential null pointer dereference in task_numa_work()
116 23f1178ad706 sched/uclamp: Fix unnused variable warning
117 0e7ffff1b811 scx: Fix raciness in scx_ops_bypass()
118 b55945c500c5 sched: Fix pick_next_task_fair() vs try_to_wake_up() race
119 112cca098a70 sched_getattr: port to copy_struct_to_user
120 af0c8b2bf67b sched: Split scheduler and execution contexts
	average rps: 2067733.00
	average rps: 1949134.92

121 7b3d61f6578a sched: Split out __schedule() deactivate task logic into a helper
122 18adad1dac33 sched: Consolidate pick_*_task to task_is_pushable helper
123 2b05a0b4c08f sched: Add move_queued_task_locked helper
124 7e019dcc470f sched: Improve cache locality of RSEQ concurrency IDs for intermittent workloads
125 8e113df990c9 sched: idle: Optimize the generic idle loop by removing needless memory barrier
126 cd9626e9ebc7 sched/fair: Fix external p->on_rq users
127 c6508124193d sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug
128 f5aaff7bfa11 sched/core: Dequeue PSI signals for blocked tasks that are delayed
129 98442f0ccd82 sched: Fix delayed_dequeue vs switched_from_fair()
130 73ab05aa46b0 sched/core: Disable page allocation in task_tick_mm_cid()
131 d16b7eb6f523 sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl()
132 f207dc2dcdcf sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called
133 b62933eee41e sched/core: Make select_task_rq() take the pointer to wake_flags instead of value
134 0ac8f14ef22a sched/wait: Remove unused bit_wait_io_timeout
135 b15148ce21c1 sched/fair: fix the comment for PREEMPT_SHORT
136 4423af84b297 sched/fair: optimize the PLACE_LAG when se->vlag is zero
137 e31488c9df27 sched/fair: remove the DOUBLE_TICK feature
138 bf39882edc79 sched: Document wait_var_event() family of functions and wake_up_var()
139 3cdee6b359f1 sched: Improve documentation for wake_up_bit/wait_on_bit family of functions
140 2382d68d7d43 sched: change wake_up_bit() and related function to expect unsigned long *
	average rps: 1979616.95
	average rps: 2001639.92

141 3840cbe24cf0 sched: psi: fix bogus pressure spikes from aggregation race
142 d4ac164bde7a sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
143 9b5ce1a37e90 sched: Fix sched_delayed vs cfs_bandwidth
144 161853a78bb2 sched/ext: Use tg_cgroup() to elieminate duplicate code
145 e418cd2b80f5 sched/ext: Fix unmatch trailing comment of CONFIG_EXT_GROUP_SCHED
146 7ebd84d627e4 sched: Put task_group::idle under CONFIG_GROUP_SCHED_WEIGHT
147 bdeb868c0ddf sched: Add dummy version of sched_group_set_idle()
148 902d67a2d40f sched: Move update_other_load_avgs() to kernel/sched/pelt.c
149 bc9057da1a22 sched/cpufreq: Use NSEC_PER_MSEC for deadline task
150 2cab4bd024d2 sched/debug: Fix the runnable tasks output
	average rps: 1964020.73
	average rps: 2164525.02
	average rps: 1948710.93

151 c662e2b1e8cf sched: Fix sched_delayed vs sched_core
152 729288bc6856 kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
	average rps: 1936829.47
	average rps: 1950715.10

153 84d265281d6c sched/pelt: Use rq_clock_task() for hw_pressure
	average rps: 2176857.32
	average rps: 2223004.23

154 5d871a63997f sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c
155 3dcac251b066 sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule()
	average rps: 2187538.05
	average rps: 2166339.93

156 e179e80c5d4f sched: Introduce CONFIG_GROUP_SCHED_WEIGHT
157 41082c1d1d2b sched: Make cpu_shares_read_u64() use tg_weight()
158 859dc4ec5a43 sched: Expose css_tg()
159 b2d70222dbf2 sched: Add put_prev_task(.next)
160 bd9bbc96e835 sched: Rework dl_server
161 436f3eed5c69 sched: Combine the last put_prev_task() and the first set_next_task()
	average rps: 2179736.88
	average rps: 2141045.32
	average rps: 2188505.50

162 fd03c5b85855 sched: Rework pick_next_task()
163 260598f142c3 sched: Split up put_prev_task_balance()
164 4686cc598f66 sched: Clean up DL server vs core sched
165 dae4320b29f0 sched: Fixup set_next_task() implementations
166 7d2180d9d943 sched: Use set_next_task(.first) where required
167 75b6499024a6 sched/fair: Properly deactivate sched_delayed task upon class change
168 9c602adb799e sched/deadline: Fix schedstats vs deadline servers
	average rps: 2006393.02
	average rps: 2214295.33
	average rps: 2217932.42

169 aef6987d8954 sched/eevdf: Propagate min_slice up the cgroup hierarchy
170 857b158dc5e8 sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion
171 85e511df3cec sched/eevdf: Allow shorter slices to wakeup-preempt
	average rps: 2002835.95
	average rps: 2199140.30
	average rps: 2174758.77

172 82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
	average rps: 2011212.10
	average rps: 2160537.57

173 fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
	average rps: 2014455.98
	average rps: 2207044.23
	average rps: 2235079.32

174 54a58a787791 sched/fair: Implement DELAY_ZERO
	average rps: 2231845.500
	average rps: 2195502.80
	average rps: 2192547.32


* Re: scheduler performance regression since v6.11
  2025-05-09 19:49 ` Peter Zijlstra
@ 2025-05-12 18:08   ` Peter Zijlstra
  2025-05-12 19:39     ` Chris Mason
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-12 18:08 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-kernel, Ingo Molnar, dietmar.eggemann, vschneid

On Fri, May 09, 2025 at 09:49:55PM +0200, Peter Zijlstra wrote:
> 152 729288bc6856 kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
> 	average rps: 1936829.47
> 	average rps: 1950715.10
> 
> 153 84d265281d6c sched/pelt: Use rq_clock_task() for hw_pressure
> 	average rps: 2176857.32
> 	average rps: 2223004.23

So, a little more data on this.

The result appears stable, but reverting 729288bc6856 on master does not
seem to cure things.

OTOH, switching to performance governor completely cures things here.

My machine defaults to schedutil governor.

Chris, can you confirm -- or did we manage to find different issues?


* Re: scheduler performance regression since v6.11
  2025-05-12 18:08   ` Peter Zijlstra
@ 2025-05-12 19:39     ` Chris Mason
  2025-05-12 22:35       ` Chris Mason
  0 siblings, 1 reply; 21+ messages in thread
From: Chris Mason @ 2025-05-12 19:39 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, dietmar.eggemann, vschneid



On 5/12/25 2:08 PM, Peter Zijlstra wrote:
> On Fri, May 09, 2025 at 09:49:55PM +0200, Peter Zijlstra wrote:
>> 152 729288bc6856 kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
>> 	average rps: 1936829.47
>> 	average rps: 1950715.10
>>
>> 153 84d265281d6c sched/pelt: Use rq_clock_task() for hw_pressure
>> 	average rps: 2176857.32
>> 	average rps: 2223004.23
> 
> So, a little more data on this.
> 
> The result appears stable, but reverting 729288bc6856 on master does not
> seem to cure things.

729288bc6856 is in that section of commits where the regression was ~1/2
as bad (bad: 1.8M, good: 2M, middle ground: 1.9M).  I get the same RPS
with and without it.

I called it "good" in my bisect run, but I'll take a second pass through
those potential suspects and see if the bad really starts there.

> 
> OTOH, switching to performance governor completely cures things here.
> 
> My machine defaults to schedutil governor.
> 
> Chris, can you confirm -- or did we manage to find different issues?

My repro script sets to performance w/turbo on as step one.  I didn't
turbostat my way to making sure both sides were actually running at the
same frequency, but I'll check that too.

-chris



* Re: scheduler performance regression since v6.11
  2025-05-12 19:39     ` Chris Mason
@ 2025-05-12 22:35       ` Chris Mason
  2025-05-13  7:15         ` Peter Zijlstra
  2025-05-16 10:18         ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Chris Mason @ 2025-05-12 22:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, Ingo Molnar, dietmar.eggemann, vschneid



On 5/12/25 3:39 PM, Chris Mason wrote:
> 
> 
> On 5/12/25 2:08 PM, Peter Zijlstra wrote:
>> On Fri, May 09, 2025 at 09:49:55PM +0200, Peter Zijlstra wrote:
>>> 152 729288bc6856 kernel/sched: Fix util_est accounting for DELAY_DEQUEUE
>>> 	average rps: 1936829.47
>>> 	average rps: 1950715.10
>>>
>>> 153 84d265281d6c sched/pelt: Use rq_clock_task() for hw_pressure
>>> 	average rps: 2176857.32
>>> 	average rps: 2223004.23
>>
>> So, a little more data on this.
>>
>> The result appears stable, but reverting 729288bc6856 on master does not
>> seem to cure things.
> 
> 729288bc6856 is in that section of commits where the regression was ~1/2
> as bad.  (bad: 1.8M, good: 2M, middle ground: 1.9M) I get the same RPS
> with and without it.
> 
> I called it "good" in my bisect run, but I'll take a second pass through
> those potential suspects and see if the bad really starts there.

This commit drops us from 2M RPS down to 1.95M.  I did a bunch of runs
on the surrounding commits and the drop was very reliable:

commit 152e11f6df293e816a6a37c69757033cdc72667d
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Thu May 23 12:25:32 2024 +0200

    sched/fair: Implement delayed dequeue


e1459a50ba31, which was the next earliest commit, is consistently fast.


If I take 54a58a787791 and turn off the DELAY_DEQUEUE/ZERO features at
run time, I don't get the performance back.  But if I patch them so that
DELAY_DEQUEUE/ZERO default off:

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 7fdeb5576188c..94409e9831e97 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -37,8 +37,8 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  *
  * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
  */
-SCHED_FEAT(DELAY_DEQUEUE, true)
-SCHED_FEAT(DELAY_ZERO, true)
+SCHED_FEAT(DELAY_DEQUEUE, false)
+SCHED_FEAT(DELAY_ZERO, false)

It runs at 2M RPS again.  If I re-enable DELAY_DEQUEUE/ZERO, I go back to 1.95M.

This is roughly 1/2 of the total regression; v6.15 runs at 1.88M, maybe
1.85M on a bad run.

Peter sent me a patch to try improving my ttwu_queue rate, I'll give
that a shot in the morning.

-chris


* Re: scheduler performance regression since v6.11
  2025-05-12 22:35       ` Chris Mason
@ 2025-05-13  7:15         ` Peter Zijlstra
  2025-05-16 10:18         ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-13  7:15 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-kernel, Ingo Molnar, dietmar.eggemann, vschneid

On Mon, May 12, 2025 at 06:35:24PM -0400, Chris Mason wrote:

> If I take 54a58a787791 and turn off the
> DELAY_DEQUEUE/ZERO features at run time, I don't get the performance
> back.  But if I patch them so that DELAY_DEQUEUE/ZERO default off:
> 
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 7fdeb5576188c..94409e9831e97 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -37,8 +37,8 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
>   *
>   * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
>   */
> -SCHED_FEAT(DELAY_DEQUEUE, true)
> -SCHED_FEAT(DELAY_ZERO, true)
> +SCHED_FEAT(DELAY_DEQUEUE, false)
> +SCHED_FEAT(DELAY_ZERO, false)
> 
> It runs at 2M RPS again.  If I enable DELAY_DEQUEUE/ZERO, I go back to 1.95M

Well, that's odd... there should not be residual effects like that. I'll
see if I can spot anything.


* Re: scheduler performance regression since v6.11
  2025-05-12 22:35       ` Chris Mason
  2025-05-13  7:15         ` Peter Zijlstra
@ 2025-05-16 10:18         ` Peter Zijlstra
  2025-05-20 14:38           ` Dietmar Eggemann
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-16 10:18 UTC (permalink / raw)
  To: Chris Mason
  Cc: linux-kernel, Ingo Molnar, dietmar.eggemann, vschneid, Juri Lelli,
	Thomas Gleixner

On Mon, May 12, 2025 at 06:35:24PM -0400, Chris Mason wrote:

Right, so I can reproduce on Thomas' SKL and maybe see some of it on my
SPR.

I've managed to discover a whole bunch of ways that ttwu() can explode
again :-) But as you surmised, your workload *LOVES* TTWU_QUEUE, and
DELAYED_DEQUEUE takes some of that away: those delayed tasks remain
on-rq, and ttwu() can't deal with that other than by doing the wakeup
in-line, which is exactly the thing this workload hates most.

(I'll keep poking at ttwu() to see if I can get a combination of
TTWU_QUEUE and DELAYED_DEQUEUE that does not explode in 'fun' ways)

However, I've found that flipping the default in ttwu_queue_cond() seems
to make up for quite a bit -- for your workload.

(basically, all the work we can get away from those pinned message CPUs
is a win)

Also, meanwhile you discovered that the other part of your performance
woes were due to dl_server, specifically, disabling that gave you back a
healthy chunk of your performance.

The problem is indeed that we toggle the dl_server on every nr_running
transition from 0 and back to 0, and your workload has a shit-ton of
those, so every time we get the overhead of starting and stopping this
thing.

In hindsight, that's a fairly stupid setup, and the below patch changes
this to keep the dl_server around until it's not seen fair activity for
a whole period. This appears to fully recover this dip.

Trouble seems to be that dl_server_update() always gets tickled by
random garbage, so in the end the dl_server never stops... oh well.

Juri, could you have a look at this, perhaps I messed up something
trivial -- it's been like that this week :/

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..1f92572b20c0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -702,6 +702,7 @@ struct sched_dl_entity {
 	unsigned int			dl_defer	  : 1;
 	unsigned int			dl_defer_armed	  : 1;
 	unsigned int			dl_defer_running  : 1;
+	unsigned int			dl_server_idle    : 1;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c81cf642dba0..010537a2f368 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3964,7 +3964,7 @@ static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 	if (!cpu_rq(cpu)->nr_running)
 		return true;
 
-	return false;
+	return sched_feat(TTWU_QUEUE_DEFAULT);
 }
 
 static bool ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index ad45a8fea245..dce3a95cb8bc 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1639,8 +1639,10 @@ void dl_server_update_idle_time(struct rq *rq, struct task_struct *p)
 void dl_server_update(struct sched_dl_entity *dl_se, s64 delta_exec)
 {
 	/* 0 runtime = fair server disabled */
-	if (dl_se->dl_runtime)
+	if (dl_se->dl_runtime) {
+		dl_se->dl_server_idle = 0;
 		update_curr_dl_se(dl_se->rq, dl_se, delta_exec);
+	}
 }
 
 void dl_server_start(struct sched_dl_entity *dl_se)
@@ -1663,20 +1665,24 @@ void dl_server_start(struct sched_dl_entity *dl_se)
 		setup_new_dl_entity(dl_se);
 	}
 
-	if (!dl_se->dl_runtime)
+	if (!dl_se->dl_runtime || dl_se->dl_server_active)
 		return;
 
+	trace_printk("dl_server starting\n");
+
 	dl_se->dl_server_active = 1;
 	enqueue_dl_entity(dl_se, ENQUEUE_WAKEUP);
 	if (!dl_task(dl_se->rq->curr) || dl_entity_preempt(dl_se, &rq->curr->dl))
 		resched_curr(dl_se->rq);
 }
 
-void dl_server_stop(struct sched_dl_entity *dl_se)
+static void __dl_server_stop(struct sched_dl_entity *dl_se)
 {
 	if (!dl_se->dl_runtime)
 		return;
 
+	trace_printk("dl_server stopping\n");
+
 	dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
 	hrtimer_try_to_cancel(&dl_se->dl_timer);
 	dl_se->dl_defer_armed = 0;
@@ -1684,6 +1690,10 @@ void dl_server_stop(struct sched_dl_entity *dl_se)
 	dl_se->dl_server_active = 0;
 }
 
+void dl_server_stop(struct sched_dl_entity *dl_se)
+{
+}
+
 void dl_server_init(struct sched_dl_entity *dl_se, struct rq *rq,
 		    dl_server_has_tasks_f has_tasks,
 		    dl_server_pick_f pick_task)
@@ -2436,6 +2446,9 @@ static struct task_struct *__pick_task_dl(struct rq *rq)
 		p = dl_se->server_pick_task(dl_se);
 		if (!p) {
 			if (dl_server_active(dl_se)) {
+				if (dl_se->dl_server_idle)
+					__dl_server_stop(dl_se);
+				dl_se->dl_server_idle = 1;
 				dl_se->dl_yielded = 1;
 				update_curr_dl_se(rq, dl_se, 0);
 			}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 3c12d9f93331..75aa7fdc4c98 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,6 +81,7 @@ SCHED_FEAT(TTWU_QUEUE, false)
  */
 SCHED_FEAT(TTWU_QUEUE, true)
 #endif
+SCHED_FEAT(TTWU_QUEUE_DEFAULT, false)
 
 /*
  * When doing wakeups, attempt to limit superfluous scans of the LLC domain.


* Re: scheduler performance regression since v6.11
  2025-05-16 10:18         ` Peter Zijlstra
@ 2025-05-20 14:38           ` Dietmar Eggemann
  2025-05-20 14:53             ` Chris Mason
  2025-05-20 19:38             ` Peter Zijlstra
  0 siblings, 2 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2025-05-20 14:38 UTC (permalink / raw)
  To: Peter Zijlstra, Chris Mason
  Cc: linux-kernel, Ingo Molnar, vschneid, Juri Lelli, Thomas Gleixner

On 16/05/2025 12:18, Peter Zijlstra wrote:
> On Mon, May 12, 2025 at 06:35:24PM -0400, Chris Mason wrote:
> 
> Right, so I can reproduce on Thomas' SKL and maybe see some of it on my
> SPR.
> 
> I've managed to discover a whole bunch of ways that ttwu() can explode
> again :-) But as you surmised, your workload *LOVES* TTWU_QUEUE, and
> DELAYED_DEQUEUE takes some of that away, because those delayed things
> remain on-rq and ttwu() can't deal with that other than by doing the
> wakeup in-line and that's exactly the thing this workload hates most.
> 
> (I'll keep poking at ttwu() to see if I can get a combination of
> TTWU_QUEUE and DELAYED_DEQUEUE that does not explode in 'fun' ways)
> 
> However, I've found that flipping the default in ttwu_queue_cond() seems
> to make up for quite a bit -- for your workload.
> 
> (basically, all the work we can get away from those pinned message CPUs
> is a win)
> 
> Also, meanwhile you discovered that the other part of your performance
> woes were due to dl_server, specifically, disabling that gave you back a
> healthy chunk of your performance.
> 
> The problem is indeed that we toggle the dl_server on every nr_running
> from 0 and to 0 transition, and your workload has a shit-ton of those,
> so every time we get the overhead of starting and stopping this thing.
> 
> In hindsight, that's a fairly stupid setup, and the below patch changes
> this to keep the dl_server around until it's not seen fair activity for
> a whole period. This appears to fully recover this dip.
> 
> Trouble seems to be that dl_server_update() always gets tickled by
> random garbage, so in the end the dl_server never stops... oh well.
> 
> Juri, could you have a look at this, perhaps I messed up something
> trivial -- its been like that this week :/

On the same VM I use as a SUT for the 'hammerdb-mysqld' tests:

https://lkml.kernel.org/r/d6692902-837a-4f30-913b-763f01a5a7ea@arm.com

I can't spot any v6.11-related changes (dl_server or TTWU_QUEUE), but a
PSI-related one for v6.12 results in a ~8% schbench regression.

VM (m7gd.16xlarge, 16 logical CPUs) on Graviton3:

schbench -L -m 4 -M auto -t 128 -n 0 -r 60

3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race

With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
4 times per task switch in my setup) in:

__schedule() -> psi_sched_switch() -> psi_task_switch() ->
psi_group_change().

There seem to be one or more other v6.12-related patches later which
cause another 4% regression that I have yet to track down.

[...]


* Re: scheduler performance regression since v6.11
  2025-05-20 14:38           ` Dietmar Eggemann
@ 2025-05-20 14:53             ` Chris Mason
  2025-05-21 13:59               ` Dietmar Eggemann
  2025-05-20 19:38             ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Chris Mason @ 2025-05-20 14:53 UTC (permalink / raw)
  To: Dietmar Eggemann, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, vschneid, Juri Lelli, Thomas Gleixner

On 5/20/25 10:38 AM, Dietmar Eggemann wrote:
> On 16/05/2025 12:18, Peter Zijlstra wrote:
>> On Mon, May 12, 2025 at 06:35:24PM -0400, Chris Mason wrote:
>>
>> Right, so I can reproduce on Thomas' SKL and maybe see some of it on my
>> SPR.
>>
>> I've managed to discover a whole bunch of ways that ttwu() can explode
>> again :-) But as you surmised, your workload *LOVES* TTWU_QUEUE, and
>> DELAYED_DEQUEUE takes some of that away, because those delayed things
>> remain on-rq and ttwu() can't deal with that other than by doing the
>> wakeup in-line and that's exactly the thing this workload hates most.
>>
>> (I'll keep poking at ttwu() to see if I can get a combination of
>> TTWU_QUEUE and DELAYED_DEQUEUE that does not explode in 'fun' ways)
>>
>> However, I've found that flipping the default in ttwu_queue_cond() seems
>> to make up for quite a bit -- for your workload.
>>
>> (basically, all the work we can get away from those pinned message CPUs
>> is a win)
>>
>> Also, meanwhile you discovered that the other part of your performance
>> woes were due to dl_server, specifically, disabling that gave you back a
>> healthy chunk of your performance.
>>
>> The problem is indeed that we toggle the dl_server on every nr_running
>> from 0 and to 0 transition, and your workload has a shit-ton of those,
>> so every time we get the overhead of starting and stopping this thing.
>>
>> In hindsight, that's a fairly stupid setup, and the below patch changes
>> this to keep the dl_server around until it's not seen fair activity for
>> a whole period. This appears to fully recover this dip.
>>
>> Trouble seems to be that dl_server_update() always gets tickled by
>> random garbage, so in the end the dl_server never stops... oh well.
>>
>> Juri, could you have a look at this, perhaps I messed up something
>> trivial -- its been like that this week :/
> 
> On the same VM I use as a SUT for the 'hammerdb-mysqld' tests:
> 
> https://lkml.kernel.org/r/d6692902-837a-4f30-913b-763f01a5a7ea@arm.com 
> 
> I can't spot any v6.11 related changes (dl_server or TTWU_QUEUE) but a
> PSI related one for v6.12 results in a ~8% schbench regression.
> 
> VM (m7gd.16xlarge, 16 logical CPUs) on Graviton3:
> 
> schbench -L -m 4 -M auto -t 128 -n 0 -r 60
> 
> 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race

I also saw a regression on this one, but it wasn't stable enough for me
to be sure.  I'll retest, but I'm guessing this is made worse by the VM
/ graviton setup?

I've been testing Peter's changes, and they do help on my skylake box
but not as much on the big turin machines.  I'm trying to sort that out,
but we have some other variables wrt PGO/LTO that I need to rule out.

-chris


* Re: scheduler performance regression since v6.11
  2025-05-20 14:38           ` Dietmar Eggemann
  2025-05-20 14:53             ` Chris Mason
@ 2025-05-20 19:38             ` Peter Zijlstra
  2025-05-21 14:02               ` Dietmar Eggemann
  2025-05-21 14:54               ` Peter Zijlstra
  1 sibling, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-20 19:38 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Chris Mason, linux-kernel, Ingo Molnar, vschneid, Juri Lelli,
	Thomas Gleixner

On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:

> 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
> 
> With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
> 4 times per task switch in my setup) in:
> 
> __schedule() -> psi_sched_switch() -> psi_task_switch() ->
> psi_group_change().
> 
> There seems to be another/other v6.12 related patch(es) later which
> cause(s) another 4% regression I yet have to discover.

Urgh, let me add this to the pile to look at. Thanks!


* Re: scheduler performance regression since v6.11
  2025-05-20 14:53             ` Chris Mason
@ 2025-05-21 13:59               ` Dietmar Eggemann
  2025-05-21 14:32                 ` Chris Mason
  0 siblings, 1 reply; 21+ messages in thread
From: Dietmar Eggemann @ 2025-05-21 13:59 UTC (permalink / raw)
  To: Chris Mason, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, vschneid, Juri Lelli, Thomas Gleixner

On 20/05/2025 16:53, Chris Mason wrote:
> On 5/20/25 10:38 AM, Dietmar Eggemann wrote:
>> On 16/05/2025 12:18, Peter Zijlstra wrote:
>>> On Mon, May 12, 2025 at 06:35:24PM -0400, Chris Mason wrote:

[...]

>> I can't spot any v6.11 related changes (dl_server or TTWU_QUEUE) but a
>> PSI related one for v6.12 results in a ~8% schbench regression.
>>
>> VM (m7gd.16xlarge, 16 logical CPUs) on Graviton3:
>>
>> schbench -L -m 4 -M auto -t 128 -n 0 -r 60
>>
>> 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
> 
> I also saw a regression on this one, but it wasn't stable enough for me
> to be sure.  I'll retest, but I'm guessing this is made worse by the VM
> / graviton setup?

For me the 8% regression here is pretty stable. I have to add that I ran
schbench in:

  /sys/fs/cgroup/user.slice/user-1000.slice/session-33.scope

So IMHO that explains the 4 calls to psi_group_change() from
psi_task_switch(), each now doing its own 'now = cpu_clock(cpu)' call.

schbench-6509    [004] d....   689.050466: psi_task_switch: CPU4 [schbench 6509] -> [schbench 6514] ->
schbench-6509    [004] d....   689.050466: psi_group_change: CPU4 now=689050466118
schbench-6509    [004] d....   689.050467: psi_group_change: CPU4 now=689050466537
schbench-6509    [004] d....   689.050467: psi_group_change: CPU4 now=689050466950
schbench-6509    [004] d....   689.050468: psi_group_change: CPU4 now=689050467838
schbench-6509    [004] d....   689.050468: psi_task_switch: CPU4 [schbench 6509] -> [schbench 6514] <-
 
> I've been testing Peter's changes, and they do help on my skylake box
> but not as much on the big turin machines.  I'm trying to sort that out,

Turin vs. SKL,SPR ?

> but we have some other variables wrt PGO/LTO that I need to rule out.

OK, thanks! We also started testing this on an Arm64 server with 20+ cores ...

[...]


* Re: scheduler performance regression since v6.11
  2025-05-20 19:38             ` Peter Zijlstra
@ 2025-05-21 14:02               ` Dietmar Eggemann
  2025-05-21 15:02                 ` Peter Zijlstra
  2025-05-21 14:54               ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Dietmar Eggemann @ 2025-05-21 14:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, linux-kernel, Ingo Molnar, vschneid, Juri Lelli,
	Thomas Gleixner

On 20/05/2025 21:38, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:
> 
>> 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
>>
>> With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
>> 4 times per task switch in my setup) in:
>>
>> __schedule() -> psi_sched_switch() -> psi_task_switch() ->
>> psi_group_change().
>>
>> There seems to be another/other v6.12 related patch(es) later which
>> cause(s) another 4% regression I yet have to discover.
> 
> Urgh, let me add this to the pile to look at. Thanks!

Not sure how expensive 'cpu_clock(cpu)' is on bare-metal.

But I also don't get why PSI needs per group 'now' values when we
iterate over cgroup levels?


* Re: scheduler performance regression since v6.11
  2025-05-21 13:59               ` Dietmar Eggemann
@ 2025-05-21 14:32                 ` Chris Mason
  0 siblings, 0 replies; 21+ messages in thread
From: Chris Mason @ 2025-05-21 14:32 UTC (permalink / raw)
  To: Dietmar Eggemann, Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, vschneid, Juri Lelli, Thomas Gleixner

On 5/21/25 9:59 AM, Dietmar Eggemann wrote:
> On 20/05/2025 16:53, Chris Mason wrote:
>> On 5/20/25 10:38 AM, Dietmar Eggemann wrote:
>>> On 16/05/2025 12:18, Peter Zijlstra wrote:
>>>> On Mon, May 12, 2025 at 06:35:24PM -0400, Chris Mason wrote:
> 
> [...]
> 
>>> I can't spot any v6.11 related changes (dl_server or TTWU_QUEUE) but a
>>> PSI related one for v6.12 results in a ~8% schbench regression.
>>>
>>> VM (m7gd.16xlarge, 16 logical CPUs) on Graviton3:
>>>
>>> schbench -L -m 4 -M auto -t 128 -n 0 -r 60
>>>
>>> 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
>>
>> I also saw a regression on this one, but it wasn't stable enough for me
>> to be sure.  I'll retest, but I'm guessing this is made worse by the VM
>> / graviton setup?
> 
> For me the 8% regression here is pretty stable. I have to add that I ran
> schbench in:
> 
>   /sys/fs/cgroup/user.slice/user-1000.slice/session-33.scope
> 
> So that explains IMHO the 4 calls to psi_group_change() from
> psi_task_switch() now doing all their own 'now = cpu_clock(cpu)' call.

Makes sense.  If you pull the latest schbench, you can add -s 0 to the
command line.   That removes the usleep done by the workers, which
focuses things even more on the CPU selection when message threads wake
up the workers.

On turin, I'm seeing ~35% lower RPS with later kernels than with 6.9
when I add -s 0.  I'm also seeing 35% higher wakeup latencies, so I'll
spend some time today measuring placement decisions between the two.

> 
> schbench-6509    [004] d....   689.050466: psi_task_switch: CPU4 [schbench 6509] -> [schbench 6514] ->
> schbench-6509    [004] d....   689.050466: psi_group_change: CPU4 now=689050466118
> schbench-6509    [004] d....   689.050467: psi_group_change: CPU4 now=689050466537
> schbench-6509    [004] d....   689.050467: psi_group_change: CPU4 now=689050466950
> schbench-6509    [004] d....   689.050468: psi_group_change: CPU4 now=689050467838
> schbench-6509    [004] d....   689.050468: psi_task_switch: CPU4 [schbench 6509] -> [schbench 6514] <-
>  
>> I've been testing Peter's changes, and they do help on my skylake box
>> but not as much on the big turin machines.  I'm trying to sort that out,
> 
> Turin vs. SKL,SPR ?

This started with a bad networking benchmark on turin, where later
kernels have regressed since 6.9.  I made some changes to schbench to
try and model that regression, but until I can claw back the performance
on turin, I won't really be sure schbench isn't just exposing other
unrelated problems.

I also pulled in SKL and copperlake because they are easiest for me to
test on, but the turin machines take a much bigger hit.  It's a single-socket
machine, but the high thread count seems to be making this set of
regressions much worse.

-chris



* Re: scheduler performance regression since v6.11
  2025-05-20 19:38             ` Peter Zijlstra
  2025-05-21 14:02               ` Dietmar Eggemann
@ 2025-05-21 14:54               ` Peter Zijlstra
  2025-05-22  8:48                 ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-21 14:54 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Chris Mason, linux-kernel, Ingo Molnar, vschneid, Juri Lelli,
	Thomas Gleixner, hannes

On Tue, May 20, 2025 at 09:38:31PM +0200, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:
> 
> > 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
> > 
> > With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
> > 4 times per task switch in my setup) in:
> > 
> > __schedule() -> psi_sched_switch() -> psi_task_switch() ->
> > psi_group_change().
> > 
> > There seems to be another/other v6.12 related patch(es) later which
> > cause(s) another 4% regression I yet have to discover.
> 
> Urgh, let me add this to the pile to look at. Thanks!

Does something like the compile-tested-only hackery below work?

---
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 1396674fa722..98c50bd4ce76 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -768,30 +768,20 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
 		groupc->times[PSI_NONIDLE] += delta;
 }
 
+#define for_each_group(iter, group) \
+	for (typeof(group) iter = group; iter; iter = iter->parent)
+
 static void psi_group_change(struct psi_group *group, int cpu,
 			     unsigned int clear, unsigned int set,
-			     bool wake_clock)
+			     u64 now, bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
 	unsigned int t, m;
 	u32 state_mask;
-	u64 now;
 
 	lockdep_assert_rq_held(cpu_rq(cpu));
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
-	/*
-	 * First we update the task counts according to the state
-	 * change requested through the @clear and @set bits.
-	 *
-	 * Then if the cgroup PSI stats accounting enabled, we
-	 * assess the aggregate resource states this CPU's tasks
-	 * have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 */
-	write_seqcount_begin(&groupc->seq);
-	now = cpu_clock(cpu);
-
 	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
 	 * task count - it's just a boolean flag directly encoded in
@@ -843,7 +833,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
 
 		groupc->state_mask = state_mask;
 
-		write_seqcount_end(&groupc->seq);
 		return;
 	}
 
@@ -864,8 +853,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
 
 	groupc->state_mask = state_mask;
 
-	write_seqcount_end(&groupc->seq);
-
 	if (state_mask & group->rtpoll_states)
 		psi_schedule_rtpoll_work(group, 1, false);
 
@@ -901,6 +888,7 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
+	u64 now;
 
 	if (!task->pid)
 		return;
@@ -908,16 +896,31 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 	psi_flags_change(task, clear, set);
 
 	group = task_psi_group(task);
-	do {
-		psi_group_change(group, cpu, clear, set, true);
-	} while ((group = group->parent));
+
+	for_each_group(tmp, group)
+		raw_write_seqcount_begin(&per_cpu_ptr(tmp->pcpu, cpu)->seq);
+
+	now = cpu_clock(cpu);
+
+	for_each_group(tmp, group) {
+		psi_group_change(tmp, cpu, clear, set, now, true);
+		raw_write_seqcount_end(&per_cpu_ptr(tmp->pcpu, cpu)->seq);
+	}
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep)
 {
-	struct psi_group *group, *common = NULL;
+	struct psi_group *prev_group = task_psi_group(prev);
+	struct psi_group *next_group = task_psi_group(next);
+	struct psi_group *common = NULL;
 	int cpu = task_cpu(prev);
+	u64 now;
+
+	if (prev->pid) {
+		for_each_group(tmp, prev_group)
+			raw_write_seqcount_begin(&per_cpu_ptr(tmp->pcpu, cpu)->seq);
+	}
 
 	if (next->pid) {
 		psi_flags_change(next, 0, TSK_ONCPU);
@@ -926,16 +929,27 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * ancestors with @prev, those will already have @prev's
 		 * TSK_ONCPU bit set, and we can stop the iteration there.
 		 */
-		group = task_psi_group(next);
-		do {
-			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
-			    PSI_ONCPU) {
-				common = group;
+
+		for_each_group(tmp, next_group) {
+			struct psi_group_cpu *groupc = per_cpu_ptr(tmp->pcpu, cpu);
+
+			if (groupc->state_mask & PSI_ONCPU) {
+				common = tmp;
 				break;
 			}
+			raw_write_seqcount_begin(&per_cpu_ptr(tmp->pcpu, cpu)->seq);
+		}
+	}
+
+	now = cpu_clock(cpu);
 
-			psi_group_change(group, cpu, 0, TSK_ONCPU, true);
-		} while ((group = group->parent));
+	if (next->pid) {
+		for_each_group(tmp, next_group) {
+			if (tmp == common)
+				break;
+			psi_group_change(tmp, cpu, 0, TSK_ONCPU, now, true);
+			raw_write_seqcount_end(&per_cpu_ptr(tmp->pcpu, cpu)->seq);
+		}
 	}
 
 	if (prev->pid) {
@@ -968,12 +982,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 		psi_flags_change(prev, clear, set);
 
-		group = task_psi_group(prev);
-		do {
-			if (group == common)
+		for_each_group(tmp, prev_group) {
+			if (tmp == common)
 				break;
-			psi_group_change(group, cpu, clear, set, wake_clock);
-		} while ((group = group->parent));
+			psi_group_change(tmp, cpu, clear, set, now, wake_clock);
+			raw_write_seqcount_end(&per_cpu_ptr(tmp->pcpu, cpu)->seq);
+		}
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If there are
@@ -983,8 +997,10 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 */
 		if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
 			clear &= ~TSK_ONCPU;
-			for (; group; group = group->parent)
-				psi_group_change(group, cpu, clear, set, wake_clock);
+			for_each_group(tmp, common) {
+				psi_group_change(tmp, cpu, clear, set, now, wake_clock);
+				raw_write_seqcount_end(&per_cpu_ptr(tmp->pcpu, cpu)->seq);
+			}
 		}
 	}
 }
@@ -1221,12 +1237,14 @@ void psi_cgroup_restart(struct psi_group *group)
 		return;
 
 	for_each_possible_cpu(cpu) {
-		struct rq *rq = cpu_rq(cpu);
-		struct rq_flags rf;
+		u64 now;
 
-		rq_lock_irq(rq, &rf);
-		psi_group_change(group, cpu, 0, 0, true);
-		rq_unlock_irq(rq, &rf);
+		guard(rq_lock_irq)(cpu_rq(cpu));
+
+		raw_write_seqcount_begin(&per_cpu_ptr(group->pcpu, cpu)->seq);
+		now = cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+		raw_write_seqcount_end(&per_cpu_ptr(group->pcpu, cpu)->seq);
 	}
 }
 #endif /* CONFIG_CGROUPS */


* Re: scheduler performance regression since v6.11
  2025-05-21 14:02               ` Dietmar Eggemann
@ 2025-05-21 15:02                 ` Peter Zijlstra
  2025-05-21 19:00                   ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-21 15:02 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Chris Mason, linux-kernel, Ingo Molnar, vschneid, Juri Lelli,
	Thomas Gleixner

On Wed, May 21, 2025 at 04:02:46PM +0200, Dietmar Eggemann wrote:
> On 20/05/2025 21:38, Peter Zijlstra wrote:
> > On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:
> > 
> >> 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
> >>
> >> With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
> >> 4 times per task switch in my setup) in:
> >>
> >> __schedule() -> psi_sched_switch() -> psi_task_switch() ->
> >> psi_group_change().
> >>
> >> There seems to be another/other v6.12 related patch(es) later which
> >> cause(s) another 4% regression I yet have to discover.
> > 
> > Urgh, let me add this to the pile to look at. Thanks!
> 
> Not sure how expensive 'cpu_clock(cpu)' is on bare-metal.
> 
> But I also don't get why PSI needs per group 'now' values when we
> iterate over cgroup levels?

IIUC the read side does something like:

 real-read + guestimate(now, read-time);

And if the timestamp is from before the write_seqcount_begin(), the
guestimate part goes sideways.

My 'fix' is fairly simple, straightforward brute force, but ideally this
whole thing gets some actual thinking done -- but my brain is fried from
staring at the wakeup path too long and I need to do simple things for a
few days ;-)


* Re: scheduler performance regression since v6.11
  2025-05-21 15:02                 ` Peter Zijlstra
@ 2025-05-21 19:00                   ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-21 19:00 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Chris Mason, linux-kernel, Ingo Molnar, vschneid, Juri Lelli,
	Thomas Gleixner

On Wed, May 21, 2025 at 05:02:07PM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2025 at 04:02:46PM +0200, Dietmar Eggemann wrote:
> > On 20/05/2025 21:38, Peter Zijlstra wrote:
> > > On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:
> > > 
> > >> 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
> > >>
> > >> With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
> > >> 4 times per task switch in my setup) in:
> > >>
> > >> __schedule() -> psi_sched_switch() -> psi_task_switch() ->
> > >> psi_group_change().
> > >>
> > >> There seems to be another/other v6.12 related patch(es) later which
> > >> cause(s) another 4% regression I yet have to discover.
> > > 
> > > Urgh, let me add this to the pile to look at. Thanks!
> > 
> > Not sure how expensive 'cpu_clock(cpu)' is on bare-metal.
> > 
> > But I also don't get why PSI needs per group 'now' values when we
> > iterate over cgroup levels?
> 
> IIUC the read side does something like:
> 
>  real-read + guestimate(now, read-time);
> 
> And if the time-stamp is from before the write_seqcount_begin(), the
> guestimate part goes side-ways.
> 
> My 'fix' is fairly simple straight forward brute force, but ideally this
> whole thing gets some actual thinking done -- but my brain is fried from
> staring at the wakeup path too long and I need to do simple things for a
> few days ;-)

Ah, what probably wants to be done is move to a single seqcount_t per
CPU. It makes no sense to have this per group.


* Re: scheduler performance regression since v6.11
  2025-05-21 14:54               ` Peter Zijlstra
@ 2025-05-22  8:48                 ` Peter Zijlstra
  2025-05-22 15:00                   ` Johannes Weiner
                                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-22  8:48 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Chris Mason, linux-kernel, Ingo Molnar, vschneid, Juri Lelli,
	Thomas Gleixner, hannes

On Wed, May 21, 2025 at 04:54:47PM +0200, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 09:38:31PM +0200, Peter Zijlstra wrote:
> > On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:
> > 
> > > 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
> > > 
> > > With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
> > > 4 times per task switch in my setup) in:
> > > 
> > > __schedule() -> psi_sched_switch() -> psi_task_switch() ->
> > > psi_group_change().
> > > 
> > > There seems to be another/other v6.12 related patch(es) later which
> > > cause(s) another 4% regression I yet have to discover.
> > 
> > Urgh, let me add this to the pile to look at. Thanks!
> 
> Does something like the compile tested only hackery below work?

possibly better hackery :-)

---
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 1396674fa722..8fb9d28f2bc8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -136,6 +136,10 @@
  * cost-wise, yet way more sensitive and accurate than periodic
  * sampling of the aggregate task states would be.
  */
+#include <linux/sched/clock.h>
+#include <linux/workqueue.h>
+#include <linux/psi.h>
+#include "sched.h"
 
 static int psi_bug __read_mostly;
 
@@ -172,6 +176,30 @@ struct psi_group psi_system = {
 	.pcpu = &system_group_pcpu,
 };
 
+static inline void psi_write_begin(int cpu)
+{
+	struct psi_group_cpu *groupc = per_cpu_ptr(&system_group_pcpu, cpu);
+	write_seqcount_begin(&groupc->seq);
+}
+
+static inline void psi_write_end(int cpu)
+{
+	struct psi_group_cpu *groupc = per_cpu_ptr(&system_group_pcpu, cpu);
+	write_seqcount_end(&groupc->seq);
+}
+
+static inline u32 psi_read_begin(int cpu)
+{
+	struct psi_group_cpu *groupc = per_cpu_ptr(&system_group_pcpu, cpu);
+	return read_seqcount_begin(&groupc->seq);
+}
+
+static inline bool psi_read_retry(int cpu, u32 seq)
+{
+	struct psi_group_cpu *groupc = per_cpu_ptr(&system_group_pcpu, cpu);
+	return read_seqcount_retry(&groupc->seq, seq);
+}
+
 static void psi_avgs_work(struct work_struct *work);
 
 static void poll_timer_fn(struct timer_list *t);
@@ -262,14 +290,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
 
 	/* Snapshot a coherent view of the CPU state */
 	do {
-		seq = read_seqcount_begin(&groupc->seq);
+		seq = psi_read_begin(cpu);
 		now = cpu_clock(cpu);
 		memcpy(times, groupc->times, sizeof(groupc->times));
 		state_mask = groupc->state_mask;
 		state_start = groupc->state_start;
 		if (cpu == current_cpu)
 			memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
-	} while (read_seqcount_retry(&groupc->seq, seq));
+	} while (psi_read_retry(cpu, seq));
 
 	/* Calculate state time deltas against the previous snapshot */
 	for (s = 0; s < NR_PSI_STATES; s++) {
@@ -768,30 +796,20 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
 		groupc->times[PSI_NONIDLE] += delta;
 }
 
+#define for_each_group(iter, group) \
+	for (typeof(group) iter = group; iter; iter = iter->parent)
+
 static void psi_group_change(struct psi_group *group, int cpu,
 			     unsigned int clear, unsigned int set,
-			     bool wake_clock)
+			     u64 now, bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
 	unsigned int t, m;
 	u32 state_mask;
-	u64 now;
 
 	lockdep_assert_rq_held(cpu_rq(cpu));
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
-	/*
-	 * First we update the task counts according to the state
-	 * change requested through the @clear and @set bits.
-	 *
-	 * Then if the cgroup PSI stats accounting enabled, we
-	 * assess the aggregate resource states this CPU's tasks
-	 * have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 */
-	write_seqcount_begin(&groupc->seq);
-	now = cpu_clock(cpu);
-
 	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
 	 * task count - it's just a boolean flag directly encoded in
@@ -843,7 +861,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
 
 		groupc->state_mask = state_mask;
 
-		write_seqcount_end(&groupc->seq);
 		return;
 	}
 
@@ -864,8 +881,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
 
 	groupc->state_mask = state_mask;
 
-	write_seqcount_end(&groupc->seq);
-
 	if (state_mask & group->rtpoll_states)
 		psi_schedule_rtpoll_work(group, 1, false);
 
@@ -900,24 +915,29 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
-	struct psi_group *group;
+	u64 now;
 
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
 
-	group = task_psi_group(task);
-	do {
-		psi_group_change(group, cpu, clear, set, true);
-	} while ((group = group->parent));
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
+	for_each_group(group, task_psi_group(task))
+		psi_group_change(group, cpu, clear, set, now, true);
+	psi_write_end(cpu);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep)
 {
-	struct psi_group *group, *common = NULL;
+	struct psi_group *common = NULL;
 	int cpu = task_cpu(prev);
+	u64 now;
+
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
 
 	if (next->pid) {
 		psi_flags_change(next, 0, TSK_ONCPU);
@@ -926,16 +946,15 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * ancestors with @prev, those will already have @prev's
 		 * TSK_ONCPU bit set, and we can stop the iteration there.
 		 */
-		group = task_psi_group(next);
-		do {
-			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
-			    PSI_ONCPU) {
+		for_each_group(group, task_psi_group(next)) {
+			struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+
+			if (groupc->state_mask & PSI_ONCPU) {
 				common = group;
 				break;
 			}
-
-			psi_group_change(group, cpu, 0, TSK_ONCPU, true);
-		} while ((group = group->parent));
+			psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
+		}
 	}
 
 	if (prev->pid) {
@@ -968,12 +987,11 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 		psi_flags_change(prev, clear, set);
 
-		group = task_psi_group(prev);
-		do {
+		for_each_group(group, task_psi_group(prev)) {
 			if (group == common)
 				break;
-			psi_group_change(group, cpu, clear, set, wake_clock);
-		} while ((group = group->parent));
+			psi_group_change(group, cpu, clear, set, now, wake_clock);
+		}
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If there are
@@ -983,20 +1001,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 */
 		if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
 			clear &= ~TSK_ONCPU;
-			for (; group; group = group->parent)
-				psi_group_change(group, cpu, clear, set, wake_clock);
+			for_each_group(group, common)
+				psi_group_change(group, cpu, clear, set, now, wake_clock);
 		}
 	}
+	psi_write_end(cpu);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
 {
 	int cpu = task_cpu(curr);
-	struct psi_group *group;
 	struct psi_group_cpu *groupc;
 	s64 delta;
 	u64 irq;
+	u64 now;
 
 	if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
 		return;
@@ -1005,8 +1024,7 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 		return;
 
 	lockdep_assert_rq_held(rq);
-	group = task_psi_group(curr);
-	if (prev && task_psi_group(prev) == group)
+	if (prev && task_psi_group(prev) == task_psi_group(curr))
 		return;
 
 	irq = irq_time_read(cpu);
@@ -1015,25 +1033,22 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 		return;
 	rq->psi_irq_time = irq;
 
-	do {
-		u64 now;
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
 
+	for_each_group(group, task_psi_group(curr)) {
 		if (!group->enabled)
 			continue;
 
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
-		write_seqcount_begin(&groupc->seq);
-		now = cpu_clock(cpu);
-
 		record_times(groupc, now);
 		groupc->times[PSI_IRQ_FULL] += delta;
 
-		write_seqcount_end(&groupc->seq);
-
 		if (group->rtpoll_states & (1 << PSI_IRQ_FULL))
 			psi_schedule_rtpoll_work(group, 1, false);
-	} while ((group = group->parent));
+	}
+	psi_write_end(cpu);
 }
 #endif
 
@@ -1221,12 +1236,14 @@ void psi_cgroup_restart(struct psi_group *group)
 		return;
 
 	for_each_possible_cpu(cpu) {
-		struct rq *rq = cpu_rq(cpu);
-		struct rq_flags rf;
+		u64 now;
 
-		rq_lock_irq(rq, &rf);
-		psi_group_change(group, cpu, 0, 0, true);
-		rq_unlock_irq(rq, &rf);
+		guard(rq_lock_irq)(cpu_rq(cpu));
+
+		psi_write_begin(cpu);
+		now = cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+		psi_write_end(cpu);
 	}
 }
 #endif /* CONFIG_CGROUPS */


* Re: scheduler performance regression since v6.11
  2025-05-22  8:48                 ` Peter Zijlstra
@ 2025-05-22 15:00                   ` Johannes Weiner
  2025-05-23 15:40                     ` Peter Zijlstra
  2025-05-23 12:27                   ` Dietmar Eggemann
  2025-07-10 12:46                   ` [tip: sched/core] sched/psi: Optimize psi_group_change() cpu_clock() usage tip-bot2 for Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: Johannes Weiner @ 2025-05-22 15:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Chris Mason, linux-kernel, Ingo Molnar,
	vschneid, Juri Lelli, Thomas Gleixner

On Thu, May 22, 2025 at 10:48:44AM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2025 at 04:54:47PM +0200, Peter Zijlstra wrote:
> > On Tue, May 20, 2025 at 09:38:31PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:
> > > 
> > > > 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
> > > > 
> > > > With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
> > > > 4 times per task switch in my setup) in:
> > > > 
> > > > __schedule() -> psi_sched_switch() -> psi_task_switch() ->
> > > > psi_group_change().
> > > > 
> > > > There seems to be another/other v6.12 related patch(es) later which
> > > > cause(s) another 4% regression I yet have to discover.
> > > 
> > > Urgh, let me add this to the pile to look at. Thanks!
> > 
> > Does something like the compile tested only hackery below work?
> 
> possibly better hackery :-)
> 
> ---
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 1396674fa722..8fb9d28f2bc8 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -136,6 +136,10 @@
>   * cost-wise, yet way more sensitive and accurate than periodic
>   * sampling of the aggregate task states would be.
>   */
> +#include <linux/sched/clock.h>
> +#include <linux/workqueue.h>
> +#include <linux/psi.h>
> +#include "sched.h"
>  
>  static int psi_bug __read_mostly;
>  
> @@ -172,6 +176,30 @@ struct psi_group psi_system = {
>  	.pcpu = &system_group_pcpu,
>  };
>  
> +static inline void psi_write_begin(int cpu)
> +{
> +	struct psi_group_cpu *groupc = per_cpu_ptr(&system_group_pcpu, cpu);
> +	write_seqcount_begin(&groupc->seq);

Ah right, since all the ancestor walks would ultimately end up at the
system's seq anyway. And always have, really.

It does stretch the critical section, of course. I ran perf bench
sched messaging to saturate all CPUs in state changes and then read a
pressure file 1000x. This is on a 32-way machine:

     0.18%     +1.34%  [kernel.kallsyms]     [k] collect_percpu_times

and annotation shows it's indeed retrying on the seq-read a bit more.

But that seems well within tolerance, and obviously worth it assuming
it fixes the cpu_clock() regression on the sched side.

At that point, though, it probably makes sense to move seq out of
psi_group_cpu altogether? More for clarity, really - it won't save
much right away given that deliberate 2-cacheline-layout.

/* Serializes task state changes against aggregation runs */
static DEFINE_PER_CPU(seqcount_t, psi_seq);

Otherwise, the patch looks great to me. Thanks for including a couple
of cleanups as well.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: scheduler performance regression since v6.11
  2025-05-22  8:48                 ` Peter Zijlstra
  2025-05-22 15:00                   ` Johannes Weiner
@ 2025-05-23 12:27                   ` Dietmar Eggemann
  2025-07-10 12:46                   ` [tip: sched/core] sched/psi: Optimize psi_group_change() cpu_clock() usage tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 21+ messages in thread
From: Dietmar Eggemann @ 2025-05-23 12:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, linux-kernel, Ingo Molnar, vschneid, Juri Lelli,
	Thomas Gleixner, hannes

On 22/05/2025 10:48, Peter Zijlstra wrote:
> On Wed, May 21, 2025 at 04:54:47PM +0200, Peter Zijlstra wrote:
>> On Tue, May 20, 2025 at 09:38:31PM +0200, Peter Zijlstra wrote:
>>> On Tue, May 20, 2025 at 04:38:09PM +0200, Dietmar Eggemann wrote:
>>>
>>>> 3840cbe24cf0 - sched: psi: fix bogus pressure spikes from aggregation race
>>>>
>>>> With CONFIG_PSI enabled we call cpu_clock(cpu) now multiple times (up to
>>>> 4 times per task switch in my setup) in:
>>>>
>>>> __schedule() -> psi_sched_switch() -> psi_task_switch() ->
>>>> psi_group_change().
>>>>
>>>> There seems to be another/other v6.12 related patch(es) later which
>>>> cause(s) another 4% regression I yet have to discover.
>>>
>>> Urgh, let me add this to the pile to look at. Thanks!
>>
>> Does something like the compile tested only hackery below work?
> 
> possibly better hackery :-)

Ah OK, moving the seqlock protection outside of the single CPU clock read.

schbench number is up again ~9% with 'tip sched/core' on this VM thanks
to this single CPU clock read per task switch.

Doesn't show a big effect on AWS' 'hammerdb-mysqld' repro-collection
test though (only 0.5% on NOPM (New Operations Per Minute)), even though
mysqld threads are running in '/system.slice/mysql.service'.

schbench is way more aggressive on task switch.

[...]


* Re: scheduler performance regression since v6.11
  2025-05-22 15:00                   ` Johannes Weiner
@ 2025-05-23 15:40                     ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2025-05-23 15:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dietmar Eggemann, Chris Mason, linux-kernel, Ingo Molnar,
	vschneid, Juri Lelli, Thomas Gleixner

On Thu, May 22, 2025 at 11:00:35AM -0400, Johannes Weiner wrote:

> Ah right, since all the ancestor walks would ultimately end up at the
> system's seq anyway. And always have, really.
> 
> It does stretch the critical section, of course.

Right, with a cost proportional to the depth of the cgroup tree. So deep
trees will have it worse.

> I ran perf bench
> sched messaging to saturate all CPUs in state changes and then read a
> pressure file 1000x. This is on a 32-way machine:
> 
>      0.18%     +1.34%  [kernel.kallsyms]     [k] collect_percpu_times
> 
> and annotation shows it's indeed retrying on the seq-read a bit more.
> 
> But that seems well within tolerance, and obviously worth it assuming
> it fixes the cpu_clock() regression on the sched side.

Right, it appears it does :-)

> At that point, though, it probably makes sense to move seq out of
> psi_group_cpu altogether? More for clarity, really - it won't save
> much right away given that deliberate 2-cacheline-layout.
> 
> /* Serializes task state changes against aggregation runs */
> static DEFINE_PER_CPU(seqcount_t, psi_seq);

Sure, let me do that.

> Otherwise, the patch looks great to me. Thanks for including a couple
> of cleanups as well.

Yeah, mostly due to the back and forth -- my earlier ugly hack added
more tree iterations and me being lazy I didn't want to type all that
out :-)

Also, I've been migrating to using neovim with clangd language server,
and that doesn't much like the weird scheduler build setup, so I added
sufficient headers for the thing to 'compile' as a stand alone unit
(which is what clangd does).

I'll break out this patch and clean up all the sched bits to be clangd
clean -- I've been sitting on that patch long enough.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks!


* [tip: sched/core] sched/psi: Optimize psi_group_change() cpu_clock() usage
  2025-05-22  8:48                 ` Peter Zijlstra
  2025-05-22 15:00                   ` Johannes Weiner
  2025-05-23 12:27                   ` Dietmar Eggemann
@ 2025-07-10 12:46                   ` tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 21+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2025-07-10 12:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dietmar Eggemann, Peter Zijlstra (Intel), Johannes Weiner, x86,
	linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     570c8efd5eb79c3725ba439ce105ed1bedc5acd9
Gitweb:        https://git.kernel.org/tip/570c8efd5eb79c3725ba439ce105ed1bedc5acd9
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 23 May 2025 17:28:00 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 09 Jul 2025 13:40:21 +02:00

sched/psi: Optimize psi_group_change() cpu_clock() usage

Dietmar reported that commit 3840cbe24cf0 ("sched: psi: fix bogus
pressure spikes from aggregation race") caused a regression for him on
a high context switch rate benchmark (schbench) due to the now
repeating cpu_clock() calls.

In particular the problem is that get_recent_times() will extrapolate
the current state to 'now'. But if an update uses a timestamp from
before the start of the update, it is possible to get two reads
with inconsistent results. It is effectively back-dating an update.

(note that this all hard-relies on the clock being synchronized across
CPUs -- if this is not the case, all bets are off).

Combine this problem with the fact that there are per-group-per-cpu
seqcounts, the commit in question pushed the clock read into the group
iteration, causing tree-depth cpu_clock() calls. On architectures
where cpu_clock() has appreciable overhead, this hurts.

Instead move to a per-cpu seqcount, which allows us to have a single
clock read for all group updates, increasing internal consistency and
lowering update overhead. This comes at the cost of a longer update
side (proportional to the tree depth) which can cause the read side to
retry more often.

Fixes: 3840cbe24cf0 ("sched: psi: fix bogus pressure spikes from aggregation race")
Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>,
Link: https://lkml.kernel.org/20250522084844.GC31726@noisy.programming.kicks-ass.net
---
 include/linux/psi_types.h |   6 +--
 kernel/sched/psi.c        | 121 ++++++++++++++++++++-----------------
 2 files changed, 68 insertions(+), 59 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index f1fd3a8..dd10c22 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -84,11 +84,9 @@ enum psi_aggregators {
 struct psi_group_cpu {
 	/* 1st cacheline updated by the scheduler */
 
-	/* Aggregator needs to know of concurrent changes */
-	seqcount_t seq ____cacheline_aligned_in_smp;
-
 	/* States of the tasks belonging to this group */
-	unsigned int tasks[NR_PSI_TASK_COUNTS];
+	unsigned int tasks[NR_PSI_TASK_COUNTS]
+			____cacheline_aligned_in_smp;
 
 	/* Aggregate pressure state derived from the tasks */
 	u32 state_mask;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 5837cd8..2024c1d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -176,6 +176,28 @@ struct psi_group psi_system = {
 	.pcpu = &system_group_pcpu,
 };
 
+static DEFINE_PER_CPU(seqcount_t, psi_seq);
+
+static inline void psi_write_begin(int cpu)
+{
+	write_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline void psi_write_end(int cpu)
+{
+	write_seqcount_end(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline u32 psi_read_begin(int cpu)
+{
+	return read_seqcount_begin(per_cpu_ptr(&psi_seq, cpu));
+}
+
+static inline bool psi_read_retry(int cpu, u32 seq)
+{
+	return read_seqcount_retry(per_cpu_ptr(&psi_seq, cpu), seq);
+}
+
 static void psi_avgs_work(struct work_struct *work);
 
 static void poll_timer_fn(struct timer_list *t);
@@ -186,7 +208,7 @@ static void group_init(struct psi_group *group)
 
 	group->enabled = true;
 	for_each_possible_cpu(cpu)
-		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
+		seqcount_init(per_cpu_ptr(&psi_seq, cpu));
 	group->avg_last_update = sched_clock();
 	group->avg_next_update = group->avg_last_update + psi_period;
 	mutex_init(&group->avgs_lock);
@@ -266,14 +288,14 @@ static void get_recent_times(struct psi_group *group, int cpu,
 
 	/* Snapshot a coherent view of the CPU state */
 	do {
-		seq = read_seqcount_begin(&groupc->seq);
+		seq = psi_read_begin(cpu);
 		now = cpu_clock(cpu);
 		memcpy(times, groupc->times, sizeof(groupc->times));
 		state_mask = groupc->state_mask;
 		state_start = groupc->state_start;
 		if (cpu == current_cpu)
 			memcpy(tasks, groupc->tasks, sizeof(groupc->tasks));
-	} while (read_seqcount_retry(&groupc->seq, seq));
+	} while (psi_read_retry(cpu, seq));
 
 	/* Calculate state time deltas against the previous snapshot */
 	for (s = 0; s < NR_PSI_STATES; s++) {
@@ -772,31 +794,21 @@ static void record_times(struct psi_group_cpu *groupc, u64 now)
 		groupc->times[PSI_NONIDLE] += delta;
 }
 
+#define for_each_group(iter, group) \
+	for (typeof(group) iter = group; iter; iter = iter->parent)
+
 static void psi_group_change(struct psi_group *group, int cpu,
 			     unsigned int clear, unsigned int set,
-			     bool wake_clock)
+			     u64 now, bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
 	unsigned int t, m;
 	u32 state_mask;
-	u64 now;
 
 	lockdep_assert_rq_held(cpu_rq(cpu));
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
 	/*
-	 * First we update the task counts according to the state
-	 * change requested through the @clear and @set bits.
-	 *
-	 * Then if the cgroup PSI stats accounting enabled, we
-	 * assess the aggregate resource states this CPU's tasks
-	 * have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 */
-	write_seqcount_begin(&groupc->seq);
-	now = cpu_clock(cpu);
-
-	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
 	 * task count - it's just a boolean flag directly encoded in
 	 * the state mask. Clear, set, or carry the current state if
@@ -847,7 +859,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
 
 		groupc->state_mask = state_mask;
 
-		write_seqcount_end(&groupc->seq);
 		return;
 	}
 
@@ -868,8 +879,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
 
 	groupc->state_mask = state_mask;
 
-	write_seqcount_end(&groupc->seq);
-
 	if (state_mask & group->rtpoll_states)
 		psi_schedule_rtpoll_work(group, 1, false);
 
@@ -904,24 +913,29 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
-	struct psi_group *group;
+	u64 now;
 
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
 
-	group = task_psi_group(task);
-	do {
-		psi_group_change(group, cpu, clear, set, true);
-	} while ((group = group->parent));
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
+	for_each_group(group, task_psi_group(task))
+		psi_group_change(group, cpu, clear, set, now, true);
+	psi_write_end(cpu);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep)
 {
-	struct psi_group *group, *common = NULL;
+	struct psi_group *common = NULL;
 	int cpu = task_cpu(prev);
+	u64 now;
+
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
 
 	if (next->pid) {
 		psi_flags_change(next, 0, TSK_ONCPU);
@@ -930,16 +944,15 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * ancestors with @prev, those will already have @prev's
 		 * TSK_ONCPU bit set, and we can stop the iteration there.
 		 */
-		group = task_psi_group(next);
-		do {
-			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
-			    PSI_ONCPU) {
+		for_each_group(group, task_psi_group(next)) {
+			struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+
+			if (groupc->state_mask & PSI_ONCPU) {
 				common = group;
 				break;
 			}
-
-			psi_group_change(group, cpu, 0, TSK_ONCPU, true);
-		} while ((group = group->parent));
+			psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
+		}
 	}
 
 	if (prev->pid) {
@@ -972,12 +985,11 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 		psi_flags_change(prev, clear, set);
 
-		group = task_psi_group(prev);
-		do {
+		for_each_group(group, task_psi_group(prev)) {
 			if (group == common)
 				break;
-			psi_group_change(group, cpu, clear, set, wake_clock);
-		} while ((group = group->parent));
+			psi_group_change(group, cpu, clear, set, now, wake_clock);
+		}
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If there are
@@ -987,20 +999,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 */
 		if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) {
 			clear &= ~TSK_ONCPU;
-			for (; group; group = group->parent)
-				psi_group_change(group, cpu, clear, set, wake_clock);
+			for_each_group(group, common)
+				psi_group_change(group, cpu, clear, set, now, wake_clock);
 		}
 	}
+	psi_write_end(cpu);
 }
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_struct *prev)
 {
 	int cpu = task_cpu(curr);
-	struct psi_group *group;
 	struct psi_group_cpu *groupc;
 	s64 delta;
 	u64 irq;
+	u64 now;
 
 	if (static_branch_likely(&psi_disabled) || !irqtime_enabled())
 		return;
@@ -1009,8 +1022,7 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 		return;
 
 	lockdep_assert_rq_held(rq);
-	group = task_psi_group(curr);
-	if (prev && task_psi_group(prev) == group)
+	if (prev && task_psi_group(prev) == task_psi_group(curr))
 		return;
 
 	irq = irq_time_read(cpu);
@@ -1019,25 +1031,22 @@ void psi_account_irqtime(struct rq *rq, struct task_struct *curr, struct task_st
 		return;
 	rq->psi_irq_time = irq;
 
-	do {
-		u64 now;
+	psi_write_begin(cpu);
+	now = cpu_clock(cpu);
 
+	for_each_group(group, task_psi_group(curr)) {
 		if (!group->enabled)
 			continue;
 
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
-		write_seqcount_begin(&groupc->seq);
-		now = cpu_clock(cpu);
-
 		record_times(groupc, now);
 		groupc->times[PSI_IRQ_FULL] += delta;
 
-		write_seqcount_end(&groupc->seq);
-
 		if (group->rtpoll_states & (1 << PSI_IRQ_FULL))
 			psi_schedule_rtpoll_work(group, 1, false);
-	} while ((group = group->parent));
+	}
+	psi_write_end(cpu);
 }
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
 
@@ -1225,12 +1234,14 @@ void psi_cgroup_restart(struct psi_group *group)
 		return;
 
 	for_each_possible_cpu(cpu) {
-		struct rq *rq = cpu_rq(cpu);
-		struct rq_flags rf;
+		u64 now;
 
-		rq_lock_irq(rq, &rf);
-		psi_group_change(group, cpu, 0, 0, true);
-		rq_unlock_irq(rq, &rf);
+		guard(rq_lock_irq)(cpu_rq(cpu));
+
+		psi_write_begin(cpu);
+		now = cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+		psi_write_end(cpu);
 	}
 }
 #endif /* CONFIG_CGROUPS */


Thread overview: 21+ messages
2025-05-07 23:13 scheduler performance regression since v6.11 Chris Mason
2025-05-09 19:49 ` Peter Zijlstra
2025-05-12 18:08   ` Peter Zijlstra
2025-05-12 19:39     ` Chris Mason
2025-05-12 22:35       ` Chris Mason
2025-05-13  7:15         ` Peter Zijlstra
2025-05-16 10:18         ` Peter Zijlstra
2025-05-20 14:38           ` Dietmar Eggemann
2025-05-20 14:53             ` Chris Mason
2025-05-21 13:59               ` Dietmar Eggemann
2025-05-21 14:32                 ` Chris Mason
2025-05-20 19:38             ` Peter Zijlstra
2025-05-21 14:02               ` Dietmar Eggemann
2025-05-21 15:02                 ` Peter Zijlstra
2025-05-21 19:00                   ` Peter Zijlstra
2025-05-21 14:54               ` Peter Zijlstra
2025-05-22  8:48                 ` Peter Zijlstra
2025-05-22 15:00                   ` Johannes Weiner
2025-05-23 15:40                     ` Peter Zijlstra
2025-05-23 12:27                   ` Dietmar Eggemann
2025-07-10 12:46                   ` [tip: sched/core] sched/psi: Optimize psi_group_change() cpu_clock() usage tip-bot2 for Peter Zijlstra
