* [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time
@ 2024-08-20 16:34 Qais Yousef
2024-08-20 16:34 ` [RFC PATCH 01/16] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom Qais Yousef
` (16 more replies)
0 siblings, 17 replies; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:34 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
This series is a reincarnation of "Remove Hardcoded Margins" posted a while ago
https://lore.kernel.org/lkml/20231208002342.367117-1-qyousef@layalina.io/
The original series attempted to address response time related issues stemming
from the hardcoded migration margin in fits_capacity() on HMP systems, and from
the DVFS headroom, whose constant 25% boost is bad for power and thermal on
powerful systems. The main goal was saving power by automatically reducing
these values to the smallest possible ones based on the anticipated worst case
scenario.
A tricky point was uncovered and demonstrated in the migration margin table in
this posting
https://lore.kernel.org/lkml/20240205223344.2280519-4-qyousef@layalina.io/
namely that to keep the system responsive to sudden changes, we actually need
a larger migration margin the smaller the core capacity is:
cap threshold % threshold-tick %
0 0 0 0 0
16 0 0 0 0
32 1 3.12 0 0
48 3 6.25 2 4.16
64 4 6.25 2 3.12
80 6 7.5 5 6.25
96 10 10.41 8 8.33
112 14 12.5 11 9.82
128 18 14.06 16 12.5
144 21 14.58 18 12.5
160 26 16.25 23 14.37
176 33 18.75 29 16.47
192 39 20.31 35 18.22
208 47 22.59 43 20.67
224 55 24.55 50 22.32
240 63 26.25 59 24.58
256 73 28.51 68 26.56
272 82 30.14 77 28.30
288 93 32.29 87 30.20
304 103 33.88 97 31.90
320 114 35.62 108 33.75
336 126 37.5 120 35.71
352 138 39.20 132 37.5
368 151 41.03 144 39.13
384 163 42.44 157 40.88
The current 80% margin is valid for CPUs with capacities in the 700-750 range,
which might have been true for the original generations of HMP systems:
704 557 79.11 550 78.12
720 578 80.27 572 79.44
736 606 82.33 600 81.52
752 633 84.17 627 83.37
This result contradicts the original goal of saving power, as it indicates we
must be more aggressive with the margin. The original observation, however,
was that some workloads have a steady utilization hovering at a level that is
higher than this margin but lower than the capacity of the CPU (mid CPUs
particularly); for those, neither the aggressive upmigration nor the push to
run at max freq is desired, since we could have run at a lower freq with no
impact on perf.
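For reference, the two hardcoded values under discussion boil down to the
following arithmetic. This is a Python transliteration of the kernel's
fits_capacity() macro and the 1.25 DVFS headroom (map_util_perf()); a sketch
for illustration, not the actual implementation:

```python
def fits_capacity(cap, max_cap):
    # Kernel macro: fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
    # i.e. util must stay below ~80% of the CPU's capacity to "fit".
    return cap * 1280 < max_cap * 1024

def dvfs_headroom(util):
    # Kernel: map_util_perf() returns util + (util >> 2), a constant 1.25x.
    return util + (util >> 2)

print(fits_capacity(800, 1024))   # True: 800 is below ~80% of 1024
print(fits_capacity(820, 1024))   # False: 820 no longer fits
print(dvfs_headroom(400))         # 500: constant 25% boost on top of util
```
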
Further analysis was done using a simple rampup [1] test that spawns a busy
task starting from util_avg/est = 0 that never goes to sleep. The purpose is
to measure the actual system response time for workloads that are bursty and
need to transition from a lower to a higher performance level quickly.
This led to a more surprising discovery caused by utilization invariance;
I call it the black hole effect.
There's a black hole in the scheduler:
======================================
It is no surprise to anyone that DVFS and HMP systems have a time stretching
effect where the same workload will take longer to do the same amount of work
the lower the frequency/capacity.
This is countered in the system via clock_pelt, which is central to
implementing utilization invariance. This helps ensure that the utilization
signal still accurately represents the computational demand of sched_entities.
But this introduces the black hole effect of time dilation. The concept of the
passage of time is now different from the task's perspective compared to an
external observer's. The task will think 1ms has passed, but depending on the
capacity or the freq, 25 or even 30ms may have passed in reality from the
external observer's point of view.
This has a terrible impact on the utilization signal's rise time. And since
the utilization signal is central to many scheduler decisions, like estimating
how loaded the CPU is, whether a task is misfit, and what freq to run at when
schedutil is being used, this leads to suboptimal decisions being made and
gives the external observer (userspace) the impression that the system is not
responsive or reactive. This manifests as problems like:
* My task is stuck on the little core for too long
* My task is running at lower frequency causing missing important
deadlines although it has been running for the past 30ms
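The scale of the dilation can be sketched with a simplified PELT model:
a geometric series with a 32ms half-life, where invariance multiplies elapsed
time by (capacity/1024) * (freq/max_freq). This ignores PELT's discrete 1024us
periods, so the numbers are only approximate, but they line up with the
measurements below:

```python
import math

HALFLIFE_MS = 32.0  # PELT half-life

def pelt_util(wall_ms, scale=1.0):
    # Approximate util_avg of an always-running task after wall_ms of
    # wall-clock time, with scale = (capacity/1024) * (freq/max_freq).
    return 1024.0 * (1.0 - 0.5 ** (scale * wall_ms / HALFLIFE_MS))

def time_to_util(target, scale=1.0):
    # Wall-clock ms for util to ramp from 0 to target under the same model.
    return -HALFLIFE_MS * math.log2(1.0 - target / 1024.0) / scale

# Biggest core at max freq: 0 -> ~1000 in roughly 170ms
print(round(time_to_util(1000)))
# Little core (cap 459) at its lowest freq (0.6 vs 2.06 GHz): ~13% scale,
# so the same ramp takes over 1.3 seconds of wall-clock time
scale = (459 / 1024) * (0.6 / 2.06)
print(round(time_to_util(1000, scale)))
```
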
As a demonstration, here is the rampup test running on a Mac mini with an M1
SoC, on a 6.8 kernel with 1ms TICK/HZ=1000.
$ grep . /sys/devices/system/cpu/cpu*/cpu_capacity
/sys/devices/system/cpu/cpu0/cpu_capacity:459
/sys/devices/system/cpu/cpu1/cpu_capacity:459
/sys/devices/system/cpu/cpu2/cpu_capacity:459
/sys/devices/system/cpu/cpu3/cpu_capacity:459
/sys/devices/system/cpu/cpu4/cpu_capacity:1024
/sys/devices/system/cpu/cpu5/cpu_capacity:1024
/sys/devices/system/cpu/cpu6/cpu_capacity:1024
/sys/devices/system/cpu/cpu7/cpu_capacity:1024
Ideal response time running at max performance level
----------------------------------------------------
$ uclampset -m 1024 rampup
rampup-5088 util_avg running
┌────────────────────────────────────────────────────────────────────────┐
1015.0┤ ▄▄▄▄▄▄▄▄▄▟▀▀▀▀▀▀▀▀▀▀▀▀│
│ ▗▄▄▄▛▀▀▀▀▘ │
│ ▗▄▟▀▀▀ │
│ ▄▟▀▀ │
761.2┤ ▄▟▀▘ │
│ ▗▛▘ │
│ ▗▟▀ │
507.5┤ ▗▟▀ │
│ ▗▛ │
│ ▄▛ │
│ ▟▘ │
253.8┤ ▐▘ │
│ ▟▀ │
│ ▗▘ │
│ ▗▛ │
0.0┤ ▗ ▛ │
└┬───────┬───────┬───────┬───────┬──────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────── rampup-5088 util_avg running residency (ms) ──────────────────
0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.4000000000000001
26.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
47.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
67.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
86.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
105.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
124.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
143.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
161.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
178.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
196.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
213.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
229.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
245.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
277.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
292.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
307.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
322.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
336.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
350.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
364.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
378.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
391.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
404.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
416.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
429.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
441.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
453.0 ▇▇▇▇▇▇▇▇▇▇▇ 1.0
Observations:
-------------
* It takes ~233ms to go from 0 to ~1015
* It takes ~167ms to go from 0 to ~1000
* Utilization increases every tick or 1ms
* It takes ~29.5ms to reach a util of ~450
Worst case scenario running at lowest performance level
-------------------------------------------------------
$ uclampset -M 0 rampup
(Note the difference in the x-axis)
rampup-3740 util_avg running
┌─────────────────────────────────────────────────────────────────────────┐
989.0┤ ▄▄▄▄▄▄▄▄▄▛▀▀▀▀▀▀▀│
│ ▗▄▄▄▄▄▛▀▀▀▀▀▘ │
│ ▄▄▛▀▀▀ │
│ ▄▄▟▀▀▘ │
741.8┤ ▄▄▛▀▘ │
│ ▗▄▛▀▘ │
│ ▄▟▀ │
494.5┤ ▗▟▀▘ │
│ ▄▛▀ │
│ ▗▛▘ │
│ ▄▛▀ │
247.2┤ ▗▟▘ │
│ ▗▛ │
│ ▟▀ │
│ ▗▟▘ │
0.0┤ ▗▟ │
└┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬┘
0.60 0.76 0.91 1.07 1.22 1.38 1.53 1.69 1.84 2.00
───────────────── rampup-3740 util_avg running residency (ms) ──────────────────
0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.6
10.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
30.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.0
54.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
75.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
95.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
115.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.3
133.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.7
151.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.0
172.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
190.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
207.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
225.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
242.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
258.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
274.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
290.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
306.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
321.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
336.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
351.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
365.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.9
381.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
394.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
407.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
420.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
433.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
446.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
458.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
Observations:
-------------
* It takes 1.350 seconds (!) to go from 0 to ~1000
* Utilization updates every 8ms most of the time
* It takes ~223ms to reach a util of ~450
Default response time with 10ms rate_limit_us
---------------------------------------------
$ rampup
rampup-6338 util_avg running
┌─────────────────────────────────────────────────────────────────────────┐
986.0┤ ▄▄▄▄▄▟▀▀▀▀│
│ ▗▄▄▟▀▀▀▘ │
│ ▗▄▟▀▀ │
│ ▄▟▀▀ │
739.5┤ ▄▟▀▘ │
│ ▗▄▛▘ │
│ ▗▟▀ │
493.0┤ ▗▛▀ │
│ ▗▄▛▀ │
│ ▄▟▀ │
│ ▄▛▘ │
246.5┤ ▗▟▀▘ │
│ ▄▟▀▀ │
│ ▗▄▄▛▘ │
│ ▗▄▄▄▟▀ │
0.0┤ ▗ ▗▄▄▟▀▀ │
└┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────── rampup-6338 util_avg running residency (ms) ──────────────────
0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.5
15.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
36.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
57.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
78.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
98.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
117.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
137.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
156.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
176.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
191.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
211.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
230.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
248.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
266.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
277.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
294.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.6
311.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.4
327.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
340.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
358.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
371.0 ▇▇▇▇▇▇▇▇▇ 1.0
377.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
389.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
401.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
413.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
431.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
442.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
456.0 ▇▇▇▇▇▇▇▇▇ 1.0
───────────────────────── Sum Time Running on CPU (ms) ─────────────────────────
CPU0.0 ▇▇▇▇▇ 90.39
CPU4.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1156.93
6338 rampup CPU0.0 Frequency
┌──────────────────────────────────────────────────────────────────────────┐
2.06┤ ▛▀▀ │
│ ▌ │
│ ▌ │
│ ▌ │
1.70┤ ▛▀▀▘ │
│ ▌ │
│ ▌ │
1.33┤ ▗▄▄▄▌ │
│ ▐ │
│ ▐ │
│ ▐ │
0.97┤ ▗▄▄▄▟ │
│ ▐ │
│ ▐ │
│ ▐ │
0.60┤ ▗ ▗▄▄▄▄▄▄▄▄▟ │
└┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
6338 rampup CPU4.0 Frequency
┌──────────────────────────────────────────────────────────────────────────┐
3.20┤ ▐▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀│
│ ▐ │
│ ▛▀▀ │
│ ▌ │
2.78┤ ▐▀▀▘ │
│ ▗▄▟ │
│ ▌ │
2.35┤ ▗▄▄▌ │
│ ▐ │
│ ▄▄▟ │
│ ▌ │
1.93┤ ▗▄▄▌ │
│ ▐ │
│ ▐ │
│ ▐ │
1.50┤ ▗▄▄▟ │
└┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────── 6338 rampup CPU0.0 Frequency residency (ms) ──────────────────
0.6 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 37.300000000000004
0.972 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 15.0
1.332 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 15.0
1.704 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.0
2.064 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.1
───────────────── 6338 rampup CPU4.0 Frequency residency (ms) ──────────────────
1.5 ▇▇▇▇▇▇▇▇▇▇ 11.9
1.956 ▇▇▇▇▇▇▇▇ 10.0
2.184 ▇▇▇▇▇▇▇▇ 10.0
2.388 ▇▇▇▇▇▇▇▇▇ 11.0
2.592 ▇▇▇▇▇▇▇▇ 10.0
2.772 ▇▇▇▇▇▇▇▇ 10.0
2.988 ▇▇▇▇▇▇▇▇ 10.0
3.204 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 85.3
Observations:
-------------
* It takes ~284ms to go from 0 to ~1000
* Utilization initially ramps up every 8ms, then starts to speed up, not
reaching 1ms updates until util ~450. It actually flips between 1ms and
2ms for a while after this value.
* It takes ~105ms to reach a util ~450
* The task runs on the little core for a whopping 90ms before it migrates
to the big core, despite obviously being an always running task. That is
with the current 80% migration margin.
* When running on the little CPU, it stays at the lowest freq for a whopping
37ms. It takes that long for util to reach a value high enough to move
on to the next freq; that is with the 1.25 DVFS headroom.
* Moving across frequencies remains slow afterwards on the little core. On
the big core, it seems to be capped at 10ms due to rate_limit_us, which
was addressed already in [2].
Default response time with 70us rate_limit_us
---------------------------------------------
rampup-6581 util_avg running
┌───────────────────────────────────────────────────────────────────────────┐
984┤ ▄▄▄▄▄▟▀▀▀▀│
│ ▗▄▄▛▀▀▀▘ │
│ ▗▄▞▀▀ │
│ ▄▄▛▀ │
738┤ ▗▟▀▘ │
│ ▄▛▀ │
│ ▗▟▀▘ │
492┤ ▄▟▀ │
│ ▄▛▘ │
│ ▗▄▛▘ │
│ ▄▟▀ │
246┤ ▄▟▀▘ │
│ ▗▄▟▘ │
│ ▗▟▀▀ │
│ ▄▄▄▄▛▀▀ │
0┤ ▗ ▗▄▄▛▀▘ │
└┬───────┬───────┬────────┬───────┬───────┬───────┬────────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────── 6581 rampup CPU4.0 Frequency residency (ms) ──────────────────
1.5 ▇▇ 2.9
1.728 ▇▇▇▇▇▇▇ 8.0
1.956 ▇▇▇▇▇▇▇▇ 9.0
2.184 ▇▇▇▇▇▇▇▇ 9.0
2.388 ▇▇▇▇▇▇ 7.0
2.592 ▇▇▇▇▇▇▇ 8.0
2.772 ▇▇▇▇▇▇ 7.0
2.988 ▇▇▇▇▇▇▇▇ 10.0
3.096 ▇▇▇▇▇ 6.0
3.144 ▇▇▇ 3.0
3.204 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 82.4
Observations:
-------------
* Results are more or less the same
* It takes ~264ms to go from 0 to ~1000
* With a better rate limit we still see slow jumps in freqs on the big
core, which demonstrates that the bottleneck is in the utilization
signal's rampup time
It gets worse on systems with smaller cores
-------------------------------------------
Mobile systems, which commonly contain littles with a capacity of ~200 and
sometimes less, will suffer even more from this bad impact. The smaller the
core/freq, the greater the gravitational pull!
Tasks were measured to stay stuck on the little core for over 100ms, and to
stay stuck for longer at the lowest frequencies when picking up from 0 util.
The solution:
=============
The proposal to remove the hardcoded DVFS headroom and migration margin in [3]
is still valid, and we build on top of it.
But to address the utilization invariance black hole problem, I add a number of
patches on top to extend util_est.
This black hole effect only matters for tasks that are transitioning from one
steady state to another. For completely periodic tasks, which the system is
traditionally built around, the current utilization signal is a good
approximation of their _compute_ demand. But observers (userspace) care about
real time.
Computational domain vs Time domain:
------------------------------------
util_avg is a good representation of the compute demand of periodic tasks,
and it should remain as such. But when tasks are no longer periodic, looking
at the computational domain doesn't make sense, as we have no idea what the
actual compute demand of the task is; it's in transition. During this
transition we need to fall back to a time domain based signal, which is simply
done by ignoring invariance and letting the util accumulate based on the
observer's time.
Coherent response time:
-----------------------
Basing transient tasks on the observer's time will create a coherent and
constant response time: the time it takes util_avg to ramp up from 0 to max
on the biggest core running at max freq (or performance level 1024/max).
IOW, the rampup time of the util signal should appear to be the same on all
capacities/frequencies, as if we were running at the highest performance level
all the time. This gives the observer (userspace) the expected behavior of
things moving through the motions with a constant response time regardless of
initial conditions.
util_est extension:
-------------------
The extension is quite simple. util_est currently latches onto util_avg at
enqueue/dequeue to act as a hold function for when busy tasks sleep for a long
period and decay prematurely.
The extension is to also account for the RUNNING time of the task in util_est,
which is currently ignored.
When a task is RUNNING, we accumulate delta_exec across context switches and
accumulate util_est the same way we accumulate util_avg, but simply without
any invariance taken into account. This means that when tasks are RUNNABLE and
continue to run, util_est acts as our time based signal to help with the
faster and 'constant' rampup response.
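A minimal sketch of that accumulation (my illustration of the idea, not the
actual patch code), feeding raw, non-invariant execution time through the same
geometric update that util_avg uses:

```python
HALFLIFE_MS = 32.0  # PELT half-life

def accumulate_util_est(util_est, delta_exec_ms):
    # While the task is RUNNING, grow util_est using raw (non-invariant)
    # wall-clock execution time, so the rampup rate looks the same
    # regardless of the CPU's capacity or frequency.
    decay = 0.5 ** (delta_exec_ms / HALFLIFE_MS)
    return util_est * decay + 1024.0 * (1.0 - decay)

# On any core, at any frequency, 32ms of wall-clock running time takes
# util_est from 0 to ~512 (one half-life).
print(round(accumulate_util_est(0.0, 32.0)))   # 512
```
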
Periodic vs Transient tasks:
----------------------------
It is important now to make a distinction between periodic tasks, whose
util_avg is a good, faithful representation of their compute demand, and
transient tasks that need help to move faster to their next steady state
point.
In the code this distinction is made based on util_avg. In theory (I think we
have bugs, will send a separate report), util_avg should be near constant for
a periodic task. So transient tasks are simply ones whose util_avg keeps
growing across activations. And this is our trigger point for knowing whether
we need to accumulate variant (real time based) util_est.
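The trigger could be sketched as follows. The helper name and the bare
comparison are assumptions for illustration; the actual code may apply extra
filtering or a tolerance:

```python
def is_transient(util_avg_prev_activation, util_avg):
    # A periodic task's util_avg is (in theory) near constant across
    # activations. A util_avg that keeps growing marks the task as
    # transient, enabling the time-based util_est accumulation.
    return util_avg > util_avg_prev_activation

print(is_transient(300, 300))  # False: steady, periodic behavior
print(is_transient(300, 340))  # True: still ramping to a new steady state
```
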
Rampup multipliers and sched-qos:
---------------------------------
It turns out the slow rampup time is great for power. With the fix, many
tasks will start triggering higher freqs.
Equally, the speed up will not be good enough for some workloads that need to
move even faster than the default response time.
To cater for those, introduce a per-task rampup multiplier. It can be set to
0 to keep tasks that don't care about performance from burning power, and it
can be set higher than 1 to make tasks go even faster through the motions.
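As an illustration of the knob's semantics (a sketch on top of a simplified
time-based accumulation, not the patch's code):

```python
HALFLIFE_MS = 32.0  # PELT half-life

def accumulate(util_est, delta_exec_ms, rampup_multiplier=1):
    # Per-task rampup multiplier: 0 opts the task out of the faster
    # time-based response entirely, >1 makes the signal rise faster.
    if rampup_multiplier == 0:
        return util_est  # task doesn't care about perf; save power
    decay = 0.5 ** (rampup_multiplier * delta_exec_ms / HALFLIFE_MS)
    return util_est * decay + 1024.0 * (1.0 - decay)

print(round(accumulate(0.0, 32.0)))                        # 512: default
print(round(accumulate(0.0, 32.0, rampup_multiplier=2)))   # 768: 2x faster
print(accumulate(100.0, 32.0, rampup_multiplier=0))        # 100.0: frozen
```
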
The multiplier is introduced as a first implementation of a generic sched-qos
framework. Based on various discussions in many threads, there's a burning
need to provide more hints to enable smarter resource management based on
userspace choices/trade-offs. Hopefully this framework will make the job
simpler, both for adding deprecatable kernel hints and for userspace, as there
won't be a need to continuously extend sched_attr; we just add a new enum, and
userspace should be able to reuse the sched-qos wrappers to make use of new
hints more readily.
The patches:
============
Patch 1 is a repost of an existing patch on the list but is required for the
series.
Patches 2 and 3 add helper functions to accumulate util_avg and calculate the
rampup time from any point.
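Under a simplified continuous PELT model, the two helpers reduce to the closed
forms below (a sketch; the kernel versions operate on the discrete 1024us PELT
periods rather than a continuous formula):

```python
import math

HALFLIFE_MS = 32.0  # PELT half-life

def approximate_util_avg(util, delta_ms):
    # Future util_avg of a continuously running task after delta_ms.
    decay = 0.5 ** (delta_ms / HALFLIFE_MS)
    return util * decay + 1024.0 * (1.0 - decay)

def approximate_runtime(util):
    # Runtime (ms) needed for util_avg to grow from 0 to util.
    return -HALFLIFE_MS * math.log2(1.0 - util / 1024.0)

# The two are inverses of each other:
print(round(approximate_util_avg(0, 32)))   # 512 after one half-life
print(round(approximate_runtime(512)))      # 32
```
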
Patches 4 and 5 remove the hardcoded margins in favour of a more automatic and
'deterministic' behavior based on the worst case scenario for current
configuration (TICK mostly, but base_slice too).
Patch 6 adds a new tunable to schedutil to dictate the rampup response time,
allowing it to be sped up or slowed down. Since the utilization signal has
a constant response time on *all* systems regardless of how powerful or weak
they are, this should allow userspace to control it more sensibly based on
their system and workload characteristics and their efficiency goals.
Patch 7 adds a multiplier to change the PELT time constant. I am not sure if
this is still necessary after introducing per-task rampup multipliers. The
original rationale was to help cater for different hardware given the constant
util_avg response time. I might drop this in future postings. I haven't tested
the latest version, which follows a new implementation suggested by Vincent.
Patches 8 and 9 implement util_est extensions to better handle periodic vs
transient tasks.
Patches 10 and 11 add sched-qos and implements SCHED_QOS_RAMPUP_MULTIPLIER.
Patches 12 and 13 further improve the DVFS headroom definition by taking into
account waiting_avg. waiting_avg is a new signal that accumulates how long
a task is RUNNABLE && !RUNNING. This is an important source of latency and
affects the perception of responsiveness. It decouples util_avg, which is
a measure of computational demand, from the requirement to run at a specific
freq to meet this demand, and from the fact that a slower frequency could mean
tasks end up waiting longer behind other tasks. If the waiting time is long,
the DVFS headroom needs to increase, so we add it to the list of items to take
into account.
I tested to ensure that waiting_avg looks sane, but haven't done proper
verification of how it helps frequency selection in contended situations.
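One way to picture how waiting_avg could feed the headroom. This is purely an
illustration of the rationale, with a made-up combination formula; I have not
checked how the patches actually combine the two signals:

```python
def dvfs_headroom(util, waiting_avg):
    # Hypothetical: grow the headroom with the time tasks spend
    # RUNNABLE && !RUNNING. waiting_avg is on the same 0..1024 scale
    # as util. The exact blend here is an assumption for illustration.
    base = util + (util >> 2)  # current constant 1.25 headroom
    # If tasks queue up behind each other, request more performance.
    return min(1024, base + (util * waiting_avg) // 1024)

print(dvfs_headroom(400, 0))     # 500: no contention, plain 1.25x
print(dvfs_headroom(400, 512))   # 700: heavy waiting pushes freq further up
```
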
Patch 14 implements an optimization to ignore the DVFS headroom when the
utilization signal is falling. Falling utilization indicates that we are
already running faster than we should, so we should be able to save power
safely. This patch still needs more verification to ensure it produces the
desired impact.
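The idea can be illustrated as follows (hypothetical helper name; a sketch,
not the patch's code):

```python
def effective_perf_request(util, prev_util):
    # If util is falling, we are already running faster than needed,
    # so follow util directly and save power; otherwise apply the
    # usual 1.25x DVFS headroom.
    if util < prev_util:
        return util
    return util + (util >> 2)

print(effective_perf_request(400, 300))  # 500: rising, headroom applied
print(effective_perf_request(400, 500))  # 400: decaying, headroom skipped
```
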
Patch 15 uses rampup_multiplier = 0 to disable util_est completely, assuming
that tasks which don't care about perf are okay without util_est altogether.
Patch 16 implements another optimization to keep util_avg at 0 on fork, given
that we now have enough means for tasks to manage their perf requirements and
we can never crystal-ball what util_avg should be after fork. Be consistent
and start from the same lowest point, preserving precious resources.
The series needs more polishing, but I am posting it now to help the
discussion during LPC and to ensure it is moving in the right direction.
Results:
========
Response time with rampup multiplier = 1
-----------------------------------------
rampup-2234 util_avg running
┌───────────────────────────────────────────────────────────────────────────┐
984┤ ▗▄▄▄▄▄▛▀▀▀▀│
│ ▄▄▟▀▀▀▀ │
│ ▄▄▟▀▀ │
│ ▄▟▀▘ │
738┤ ▄▟▀▘ │
│ ▗▟▀▘ │
│ ▗▟▀ │
492┤ ▗▟▀ │
│ ▗▟▀ │
│ ▟▀ │
│ ▄▛▘ │
246┤ ▗▟▘ │
│ ▗▟▀ │
│ ▗▟▀ │
│ ▗▟▀ │
0┤ ▄▄▄▛▀ │
└┬───────┬───────┬────────┬───────┬───────┬───────┬────────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────── rampup-2234 util_avg running residency (ms) ──────────────────
0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.6000000000000005
15.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
39.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
61.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
85.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
99.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
120.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
144.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
160.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
176.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
192.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
210.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
228.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
246.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
263.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
282.0 ▇▇▇▇▇▇▇ 1.0
291.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
309.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
327.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
344.0 ▇▇▇▇▇▇▇ 1.0
354.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
373.0 ▇▇▇▇▇▇▇ 1.0
382.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
400.0 ▇▇▇▇▇▇▇ 1.0
408.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
425.0 ▇▇▇▇▇▇▇ 1.0
434.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
452.0 ▇▇▇▇▇▇▇ 1.0
2234 rampup CPU1.0 Frequency
┌──────────────────────────────────────────────────────────────────────────┐
2.06┤ ▐▀ │
│ ▐ │
│ ▐ │
│ ▐ │
1.70┤ ▛▀ │
│ ▌ │
│ ▌ │
1.33┤ ▄▌ │
│ ▌ │
│ ▌ │
│ ▌ │
0.97┤ ▗▄▌ │
│ ▐ │
│ ▐ │
│ ▐ │
0.60┤ ▗▄▄▟ │
└┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
2234 rampup CPU4.0 Frequency
┌──────────────────────────────────────────────────────────────────────────┐
3.10┤ ▐▀▀▀▀▀▀▀▀▀▀▀▀▀│
│ ▛▀▀▀▀▀▀▀▀▀▀▀ │
│ ▌ │
│ ▐▀▀▀▀▘ │
2.70┤ ▐ │
│ ▐▀▀▀▀ │
│ ▐ │
2.30┤ ▛▀▀ │
│ ▌ │
│ ▐▀▀▘ │
│ ▐ │
1.90┤ ▐▀▀ │
│ ▐ │
│ ▗▄▟ │
│ ▐ │
1.50┤ ▗▟ │
└┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────────────── Sum Time Running on CPU (ms) ─────────────────────────
CPU1.0 ▇▇▇▇ 32.53
CPU4.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 540.3
───────────────── 2234 rampup CPU1.0 Frequency residency (ms) ──────────────────
0.6 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.1
0.972 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.5
1.332 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.7
1.704 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.5
2.064 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.8
───────────────── 2234 rampup CPU4.0 Frequency residency (ms) ──────────────────
1.5 ▇▇▇▇▇ 4.0
1.728 ▇▇▇▇▇▇▇▇▇▇ 8.0
1.956 ▇▇▇▇▇▇▇▇▇▇▇▇ 9.0
2.184 ▇▇▇▇▇▇▇▇▇▇▇▇ 9.0
2.388 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.0
2.592 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 16.0
2.772 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 18.0
2.988 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 47.0
3.096 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 53.4
Response time with rampup multiplier = 2
-----------------------------------------
rampup-2331 util_avg running
┌────────────────────────────────────────────────────────────────────────┐
1002.0┤ ▄▄▄▄▄▄▄▛▀▀▀▀▀▀│
│ ▄▄▄▟▀▀▀▀▘ │
│ ▗▄▟▀▀▘ │
│ ▗▄▛▀ │
751.5┤ ▗▄▛▀ │
│ ▟▀ │
│ ▗▟▀▘ │
501.0┤ ▟▀ │
│ ▗▟▘ │
│ ▄▛ │
│ ▟▘ │
250.5┤ ▗▛▘ │
│ ▄▛ │
│ ▟▘ │
│ ▄▛▘ │
0.0┤ ▄▄▛ │
└┬───────┬───────┬───────┬───────┬──────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────── rampup-2331 util_avg running residency (ms) ──────────────────
0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.7
4.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.0
26.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
52.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
67.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
93.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.9000000000000001
106.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
126.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
149.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
170.0 ▇▇▇▇▇▇▇▇▇ 1.0
182.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
206.0 ▇▇▇▇▇▇▇▇▇ 1.0
217.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
239.0 ▇▇▇▇▇▇▇▇▇ 1.0
251.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
275.0 ▇▇▇▇▇▇▇▇▇ 1.0
286.0 ▇▇▇▇▇▇▇▇▇ 1.0
299.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
322.0 ▇▇▇▇▇▇▇▇▇ 1.0
334.0 ▇▇▇▇▇▇▇▇▇ 1.0
345.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
368.0 ▇▇▇▇▇▇▇▇▇ 1.0
379.0 ▇▇▇▇▇▇▇▇▇ 1.0
391.0 ▇▇▇▇▇▇▇▇▇ 1.0
402.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
424.0 ▇▇▇▇▇▇▇▇▇ 1.0
434.0 ▇▇▇▇▇▇▇▇▇ 1.0
445.0 ▇▇▇▇▇▇▇▇▇ 1.0
455.0 ▇▇▇▇▇▇▇▇▇ 1.0
───────────────────────── Sum Time Running on CPU (ms) ─────────────────────────
CPU0.0 ▇ 16.740000000000002
CPU4.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 726.91
2331 rampup CPU0.0 Frequency
┌──────────────────────────────────────────────────────────────────────────┐
2.06┤ ▛ │
│ ▌ │
│ ▌ │
│ ▌ │
1.70┤ ▛▘ │
│ ▌ │
│ ▌ │
1.33┤ ▗▌ │
│ ▐ │
│ ▐ │
│ ▐ │
0.97┤ ▟ │
│ ▌ │
│ ▌ │
│ ▌ │
0.60┤ ▗▄▌ │
└┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
2331 rampup CPU4.0 Frequency
┌──────────────────────────────────────────────────────────────────────────┐
3.14┤ ▄▄▄▄▄▟▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀│
│ ▛▀▀▀▀▘ │
│ ▄▄▌ │
│ ▄▄▌ │
2.51┤ ▄▌ │
│ ▌ │
│ ▐▀▘ │
1.87┤ ▐▀ │
│ ▗▟ │
│ ▌ │
│ ▌ │
1.24┤ ▌ │
│ ▌ │
│ ▌ │
│ ▌ │
0.60┤ ▌ │
└┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────── 2331 rampup CPU0.0 Frequency residency (ms) ──────────────────
0.6 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.7
0.972 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
1.332 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
1.704 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
2.064 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
───────────────── 2331 rampup CPU4.0 Frequency residency (ms) ──────────────────
0.6 ▇ 1.0
1.728 ▇▇ 2.9
1.956 ▇▇ 4.0
2.184 ▇▇▇ 6.0
2.388 ▇▇ 4.0
2.592 ▇▇▇▇ 7.0
2.772 ▇▇▇▇▇ 9.0
2.988 ▇▇▇▇▇▇▇▇▇▇▇ 20.0
3.096 ▇▇▇▇▇▇▇▇▇▇▇▇▇ 23.0
3.144 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 118.3
Speedometer score:
------------------
With the fix [2] applied to keep rate_limit_us as small as possible
| score
----------------------+--------
default | 352
rampup multiplier = 1 | 388
rampup multiplier = 2 | 427
rampup multiplier = 3 | 444
rampup multiplier = 4 | 456
[1] https://github.com/qais-yousef/rampup
[2] https://lore.kernel.org/lkml/20240728192659.58115-1-qyousef@layalina.io/
[3] https://lore.kernel.org/lkml/20231208002342.367117-1-qyousef@layalina.io/
Qais Yousef (16):
sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom
sched/pelt: Add a new function to approximate the future util_avg
value
sched/pelt: Add a new function to approximate runtime to reach given
util
sched/fair: Remove magic hardcoded margin in fits_capacity()
sched: cpufreq: Remove magic 1.25 headroom from
sugov_apply_dvfs_headroom()
sched/schedutil: Add a new tunable to dictate response time
sched/pelt: Introduce PELT multiplier boot time parameter
sched/fair: Extend util_est to improve rampup time
sched/fair: util_est: Take into account periodic tasks
sched/qos: Add a new sched-qos interface
sched/qos: Add rampup multiplier QoS
sched/pelt: Add new waiting_avg to record when runnable && !running
sched/schedutil: Take into account waiting_avg in apply_dvfs_headroom
sched/schedutil: Ignore dvfs headroom when util is decaying
sched/fair: Enable disabling util_est via rampup_multiplier
sched/fair: Don't mess with util_avg post init
Documentation/admin-guide/pm/cpufreq.rst | 17 +-
Documentation/scheduler/index.rst | 1 +
Documentation/scheduler/sched-qos.rst | 44 +++++
drivers/cpufreq/cpufreq.c | 4 +-
include/linux/cpufreq.h | 3 +
include/linux/sched.h | 12 ++
include/linux/sched/cpufreq.h | 5 -
include/uapi/linux/sched.h | 6 +
include/uapi/linux/sched/types.h | 46 +++++
kernel/sched/core.c | 71 +++++++
kernel/sched/cpufreq_schedutil.c | 174 +++++++++++++++++-
kernel/sched/debug.c | 5 +
kernel/sched/fair.c | 149 +++++++++++++--
kernel/sched/pelt.c | 140 ++++++++++++--
kernel/sched/sched.h | 12 ++
kernel/sched/syscalls.c | 37 ++++
.../trace/beauty/include/uapi/linux/sched.h | 4 +
17 files changed, 685 insertions(+), 45 deletions(-)
create mode 100644 Documentation/scheduler/sched-qos.rst
--
2.34.1
^ permalink raw reply [flat|nested] 37+ messages in thread
* [RFC PATCH 01/16] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
@ 2024-08-20 16:34 ` Qais Yousef
2024-08-20 16:34 ` [RFC PATCH 02/16] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
` (15 subsequent siblings)
16 siblings, 0 replies; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:34 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
We are providing headroom for the utilization to grow until the next decision
point to pick the next frequency. Give the function a better name and give it
some documentation; it is not really mapping anything.
Also move it to cpufreq_schedutil.c. This function relies on updating the
util signal appropriately to give it headroom to grow. This is tied to
schedutil and the scheduler, and is not something that can be shared with
other governors.
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
include/linux/sched/cpufreq.h | 5 -----
kernel/sched/cpufreq_schedutil.c | 20 +++++++++++++++++++-
2 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index bdd31ab93bc5..d01755d3142f 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -28,11 +28,6 @@ static inline unsigned long map_util_freq(unsigned long util,
{
return freq * util / cap;
}
-
-static inline unsigned long map_util_perf(unsigned long util)
-{
- return util + (util >> 2);
-}
#endif /* CONFIG_CPU_FREQ */
#endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index eece6244f9d2..575df3599813 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -178,12 +178,30 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
return cpufreq_driver_resolve_freq(policy, freq);
}
+/*
+ * DVFS decisions are made at discrete points. If the CPU stays busy, the util
+ * will continue to grow, which means it could need to run at a higher frequency
+ * before the next decision point is reached. IOW, we can't follow the util as
+ * it grows immediately; there's a delay before we issue a request to go to
+ * a higher frequency. The headroom caters for this delay so the system
+ * continues to run at an adequate performance point.
+ *
+ * This function provides enough headroom to provide adequate performance
+ * assuming the CPU continues to be busy.
+ *
+ * At the moment it is a constant multiplication with 1.25.
+ */
+static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util)
+{
+ return util + (util >> 2);
+}
+
unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long min,
unsigned long max)
{
/* Add dvfs headroom to actual utilization */
- actual = map_util_perf(actual);
+ actual = sugov_apply_dvfs_headroom(actual);
/* Actually we don't need to target the max performance */
if (actual < max)
max = actual;
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 02/16] sched/pelt: Add a new function to approximate the future util_avg value
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
2024-08-20 16:34 ` [RFC PATCH 01/16] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom Qais Yousef
@ 2024-08-20 16:34 ` Qais Yousef
2024-08-20 16:34 ` [RFC PATCH 03/16] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
` (14 subsequent siblings)
16 siblings, 0 replies; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:34 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
Given a util_avg value, the new function will return the future one
given a runtime delta.
This will be useful in later patches to help replace some magic margins
with more deterministic behavior.
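For intuition, the new helper can be modelled in closed form: an always-running entity's util_avg converges geometrically toward SCHED_CAPACITY_SCALE. The sketch below is a userspace approximation (continuous model, default 32ms PELT halflife), not the kernel's fixed-point accumulate_sum()/___update_load_avg() path:

```python
HALFLIFE_MS = 32          # default PELT halflife
MAX_UTIL = 1024           # SCHED_CAPACITY_SCALE

def approximate_util_avg(util, delta_ms):
    """Future util_avg assuming the entity keeps running for delta_ms."""
    y = 0.5 ** (1.0 / HALFLIFE_MS)       # per-ms decay factor, y**32 == 0.5
    # the gap to MAX_UTIL halves every HALFLIFE_MS of continuous running
    return MAX_UTIL - (MAX_UTIL - util) * y ** delta_ms
```

E.g. starting from util 512, one halflife (32ms) of continuous running lands halfway between 512 and 1024, i.e. 768.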
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
kernel/sched/pelt.c | 22 +++++++++++++++++++++-
kernel/sched/sched.h | 1 +
2 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index fa52906a4478..2ce83e880bd5 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -466,4 +466,24 @@ int update_irq_load_avg(struct rq *rq, u64 running)
return ret;
}
-#endif
+#endif /* CONFIG_HAVE_SCHED_AVG_IRQ */
+
+/*
+ * Approximate the new util_avg value assuming an entity has continued to run
+ * for @delta us.
+ */
+unsigned long approximate_util_avg(unsigned long util, u64 delta)
+{
+ struct sched_avg sa = {
+ .util_sum = util * PELT_MIN_DIVIDER,
+ .util_avg = util,
+ };
+
+ if (unlikely(!delta))
+ return util;
+
+ accumulate_sum(delta, &sa, 1, 0, 1);
+ ___update_load_avg(&sa, 0);
+
+ return sa.util_avg;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c36cc680361..294c6769e330 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3064,6 +3064,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long min,
unsigned long max);
+unsigned long approximate_util_avg(unsigned long util, u64 delta);
/*
* Verify the fitness of task @p to run on @cpu taking into account the
--
2.34.1
* [RFC PATCH 03/16] sched/pelt: Add a new function to approximate runtime to reach given util
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
2024-08-20 16:34 ` [RFC PATCH 01/16] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom Qais Yousef
2024-08-20 16:34 ` [RFC PATCH 02/16] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
@ 2024-08-20 16:34 ` Qais Yousef
2024-08-22 5:36 ` Sultan Alsawaf (unemployed)
2024-08-20 16:35 ` [RFC PATCH 04/16] sched/fair: Remove magic hardcoded margin in fits_capacity() Qais Yousef
` (13 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:34 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
It is basically the ramp-up time from 0 to a given value. It will be
used later to implement a new tunable to control response time for
schedutil.
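This is the inverse of the previous patch's approximation; in the same continuous PELT model (default 32ms halflife, not the kernel's iterative fixed-point loop) it has a closed form:

```python
import math

HALFLIFE_MS = 32
MAX_UTIL = 1024

def approximate_runtime(util):
    """Runtime in ms for util_avg to ramp from 0 to util (continuous model)."""
    if util <= 0:
        return 0.0
    if util >= MAX_UTIL:
        return math.inf      # util_avg only asymptotically approaches 1024
    y = 0.5 ** (1.0 / HALFLIFE_MS)
    return math.log(1.0 - util / MAX_UTIL) / math.log(y)
```

E.g. reaching util 512 takes one halflife (32ms), and util 768 takes two (64ms).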
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
kernel/sched/pelt.c | 21 +++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 22 insertions(+)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 2ce83e880bd5..06cb881ba582 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -487,3 +487,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
return sa.util_avg;
}
+
+/*
+ * Approximate the amount of runtime (in ms) required to reach @util.
+ */
+u64 approximate_runtime(unsigned long util)
+{
+ struct sched_avg sa = {};
+ u64 delta = 1024; // period = 1024 = ~1ms
+ u64 runtime = 0;
+
+ if (unlikely(!util))
+ return runtime;
+
+ while (sa.util_avg < util) {
+ accumulate_sum(delta, &sa, 1, 0, 1);
+ ___update_load_avg(&sa, 0);
+ runtime++;
+ }
+
+ return runtime;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 294c6769e330..47f158b2cdc2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3065,6 +3065,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long max);
unsigned long approximate_util_avg(unsigned long util, u64 delta);
+u64 approximate_runtime(unsigned long util);
/*
* Verify the fitness of task @p to run on @cpu taking into account the
--
2.34.1
* [RFC PATCH 04/16] sched/fair: Remove magic hardcoded margin in fits_capacity()
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (2 preceding siblings ...)
2024-08-20 16:34 ` [RFC PATCH 03/16] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-08-22 5:09 ` Sultan Alsawaf (unemployed)
2024-08-20 16:35 ` [RFC PATCH 05/16] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom() Qais Yousef
` (12 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
Replace hardcoded margin value in fits_capacity() with better dynamic
logic.
The 80% margin is a magic value that has served its purpose for now, but
it no longer fits the variety of systems that exist today. If a system
is overpowered in particular, this 80% means we leave a lot of capacity
unused before we decide to upmigrate on an HMP system.
On many systems the little cores are underpowered, and the ability to
migrate away from them faster is desired.
Redefine misfit migration to mean the utilization threshold at which the
task would become misfit at the next load balance event assuming it
becomes an always running task.
To calculate this threshold, we use the new approximate_util_avg()
function: based on arch_scale_cpu_capacity(), the task will be misfit if
it continues to run for TICK_USEC, which is our worst case delay before
misfit migration kicks in.
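The update_cpu_capacity() computation can be sketched with the same continuous PELT model used for the earlier patches (TICK_USEC here assumes HZ=250; this deliberately mirrors the patch in not scaling time for invariance, and the kernel uses the fixed-point helpers instead):

```python
import math

HALFLIFE_MS = 32
MAX_UTIL = 1024           # SCHED_CAPACITY_SCALE
TICK_USEC = 4000          # assumption: HZ=250

def fits_capacity_threshold(cpu_capacity):
    y = 0.5 ** (1.0 / HALFLIFE_MS)
    # us of continuous running needed for util to reach this CPU's capacity
    runtime_us = 1000.0 * math.log(1.0 - cpu_capacity / MAX_UTIL) / math.log(y)
    # back off by one tick, the worst-case misfit load-balance delay
    limit_ms = (runtime_us - TICK_USEC) / 1000.0
    # the util reached after that time is the misfit threshold
    return MAX_UTIL * (1.0 - y ** limit_ms)
```

Under this simplified model the threshold shrinks relative to capacity as the CPU gets smaller, i.e. little cores effectively get a larger migration margin, which is the trend the cover letter's table demonstrates.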
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
kernel/sched/sched.h | 1 +
3 files changed, 34 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6d35c48239be..402ee4947ef0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8266,6 +8266,7 @@ void __init sched_init(void)
rq->sd = NULL;
rq->rd = NULL;
rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+ rq->fits_capacity_threshold = SCHED_CAPACITY_SCALE;
rq->balance_callback = &balance_push_callback;
rq->active_balance = 0;
rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9057584ec06d..e5e986af18dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -95,11 +95,15 @@ int __weak arch_asym_cpu_priority(int cpu)
}
/*
- * The margin used when comparing utilization with CPU capacity.
- *
- * (default: ~20%)
+ * fits_capacity() must ensure that a task will not be 'stuck' on a CPU with
+ * lower capacity for too long. The threshold is the util value at which a
+ * task, if it becomes always busy, could miss the misfit migration load
+ * balance event. So we consider a task misfit before it reaches this point.
*/
-#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
+static inline bool fits_capacity(unsigned long util, int cpu)
+{
+ return util < cpu_rq(cpu)->fits_capacity_threshold;
+}
/*
* The margin used when comparing CPU capacities.
@@ -4978,14 +4982,13 @@ static inline int util_fits_cpu(unsigned long util,
unsigned long uclamp_max,
int cpu)
{
- unsigned long capacity = capacity_of(cpu);
unsigned long capacity_orig;
bool fits, uclamp_max_fits;
/*
* Check if the real util fits without any uclamp boost/cap applied.
*/
- fits = fits_capacity(util, capacity);
+ fits = fits_capacity(util, cpu);
if (!uclamp_is_used())
return fits;
@@ -9592,12 +9595,33 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
unsigned long capacity = scale_rt_capacity(cpu);
struct sched_group *sdg = sd->groups;
+ struct rq *rq = cpu_rq(cpu);
+ u64 limit;
if (!capacity)
capacity = 1;
- cpu_rq(cpu)->cpu_capacity = capacity;
- trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+ rq->cpu_capacity = capacity;
+ trace_sched_cpu_capacity_tp(rq);
+
+ /*
+ * Calculate the util at which the task must be considered a misfit.
+ *
+ * We must ensure that a task experiences the same ramp-up time to
+ * reach max performance point of the system regardless of the CPU it
+ * is running on (due to invariance, time will stretch and task will
+ * take longer to achieve the same util value compared to a task
+ * running on a big CPU) and a delay in misfit migration which depends
+ * on TICK doesn't end up hurting it as it can happen after we would
+ * have crossed this threshold.
+ *
+ * To ensure that invariance is taken into account, we don't scale time
+ * and use it as-is; approximate_util_avg() will then give us our
+ * threshold.
+ */
+ limit = approximate_runtime(arch_scale_cpu_capacity(cpu)) * USEC_PER_MSEC;
+ limit -= TICK_USEC; /* sd->balance_interval is more accurate */
+ rq->fits_capacity_threshold = approximate_util_avg(0, limit);
sdg->sgc->capacity = capacity;
sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 47f158b2cdc2..ab4672675b84 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1093,6 +1093,7 @@ struct rq {
struct sched_domain __rcu *sd;
unsigned long cpu_capacity;
+ unsigned long fits_capacity_threshold;
struct balance_callback *balance_callback;
--
2.34.1
* [RFC PATCH 05/16] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom()
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (3 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 04/16] sched/fair: Remove magic hardcoded margin in fits_capacity() Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-11-13 4:51 ` John Stultz
2024-08-20 16:35 ` [RFC PATCH 06/16] sched/schedutil: Add a new tunable to dictate response time Qais Yousef
` (11 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
Replace 1.25 headroom in sugov_apply_dvfs_headroom() with better dynamic
logic.
Instead of the magical 1.25 headroom, use the new approximate_util_avg()
to provide headroom based on the dvfs_update_delay, which is the period
at which the cpufreq governor will send DVFS updates to the hardware, or
min(curr.se.slice, TICK_USEC) which is the max delay for util signal to
change and promote a cpufreq update; whichever is higher.
Add a new percpu dvfs_update_delay that can be cheaply accessed whenever
sugov_apply_dvfs_headroom() is called. We expect cpufreq governors that
rely on util to drive their DVFS logic/algorithm to populate these percpu
variables. schedutil is the only such governor at the moment.
The behavior of schedutil will change. Some systems will experience
faster DVFS ramp-up (because of a higher TICK or rate_limit_us), others
will experience slower ramp-up.
The impact on performance should not be visible, were it not for the
black hole effect of utilization invariance, a problem that will be
addressed in later patches.
Later patches will also address how to provide better control of how
fast or slow the system should respond to allow userspace to select
their power/perf/thermal trade-off.
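The worst-case-delay selection described above can be sketched as follows (same continuous PELT model as the earlier patches; TICK_USEC assumes HZ=250, the slice is taken in microseconds, and the kernel's fixed-point approximate_util_avg() is replaced by the closed form):

```python
HALFLIFE_MS = 32
MAX_UTIL = 1024
TICK_USEC = 4000          # assumption: HZ=250

def apply_dvfs_headroom(util, dvfs_update_delay_us, slice_us=None, nr_running=1):
    # worst case delay before a util change can trigger a freq update:
    # a context switch (the slice) if the CPU is shared, else the tick
    if nr_running > 1 and slice_us is not None:
        delay_us = min(slice_us, TICK_USEC)
    else:
        delay_us = TICK_USEC
    # never less than the rate at which the governor sends DVFS requests
    delay_us = max(delay_us, dvfs_update_delay_us)
    # grow util as if the CPU stays busy for that delay
    y = 0.5 ** (1.0 / HALFLIFE_MS)
    return MAX_UTIL - (MAX_UTIL - util) * y ** (delay_us / 1000.0)
```

So a policy with a 32ms rate limit gets roughly the old 1.5x-per-halflife growth, while a fast-switching policy only pays for one tick of headroom.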
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
kernel/sched/core.c | 1 +
kernel/sched/cpufreq_schedutil.c | 36 ++++++++++++++++++++++++++------
kernel/sched/sched.h | 9 ++++++++
3 files changed, 40 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 402ee4947ef0..7099e40cc8bd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -118,6 +118,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp);
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_READ_MOSTLY(u64, dvfs_update_delay);
#ifdef CONFIG_SCHED_DEBUG
/*
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 575df3599813..303b0ab227e7 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -187,13 +187,28 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 * to run at an adequate performance point.
*
* This function provides enough headroom to provide adequate performance
- * assuming the CPU continues to be busy.
+ * assuming the CPU continues to be busy. This headroom is based on the
+ * dvfs_update_delay of the cpufreq governor or min(curr.se.slice, TICK_USEC),
+ * whichever is higher.
*
- * At the moment it is a constant multiplication with 1.25.
+ * XXX: Should we provide headroom when the util is decaying?
*/
-static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util)
+static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util, int cpu)
{
- return util + (util >> 2);
+ struct rq *rq = cpu_rq(cpu);
+ u64 delay;
+
+ /*
+ * What is the possible worst case scenario for updating util_avg, ctx
+ * switch or TICK?
+ */
+ if (rq->cfs.h_nr_running > 1)
+ delay = min(rq->curr->se.slice / 1000, TICK_USEC);
+ else
+ delay = TICK_USEC;
+ delay = max(delay, per_cpu(dvfs_update_delay, cpu));
+
+ return approximate_util_avg(util, delay);
}
unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
@@ -201,7 +216,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long max)
{
/* Add dvfs headroom to actual utilization */
- actual = sugov_apply_dvfs_headroom(actual);
+ actual = sugov_apply_dvfs_headroom(actual, cpu);
/* Actually we don't need to target the max performance */
if (actual < max)
max = actual;
@@ -579,15 +594,21 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
struct sugov_policy *sg_policy;
unsigned int rate_limit_us;
+ int cpu;
if (kstrtouint(buf, 10, &rate_limit_us))
return -EINVAL;
tunables->rate_limit_us = rate_limit_us;
- list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+
sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+ for_each_cpu(cpu, sg_policy->policy->cpus)
+ per_cpu(dvfs_update_delay, cpu) = rate_limit_us;
+ }
+
return count;
}
@@ -868,6 +889,9 @@ static int sugov_start(struct cpufreq_policy *policy)
memset(sg_cpu, 0, sizeof(*sg_cpu));
sg_cpu->cpu = cpu;
sg_cpu->sg_policy = sg_policy;
+
+ per_cpu(dvfs_update_delay, cpu) = sg_policy->tunables->rate_limit_us;
+
cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, uu);
}
return 0;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ab4672675b84..c2d9fba6ea7a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3068,6 +3068,15 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long approximate_util_avg(unsigned long util, u64 delta);
u64 approximate_runtime(unsigned long util);
+/*
+ * Any governor that relies on util signal to drive DVFS, must populate these
+ * percpu dvfs_update_delay variables.
+ *
+ * It should describe the rate/delay at which the governor sends DVFS freq
+ * update to the hardware in us.
+ */
+DECLARE_PER_CPU_READ_MOSTLY(u64, dvfs_update_delay);
+
/*
* Verify the fitness of task @p to run on @cpu taking into account the
* CPU original capacity and the runtime/deadline ratio of the task.
--
2.34.1
* [RFC PATCH 06/16] sched/schedutil: Add a new tunable to dictate response time
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (4 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 05/16] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom() Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-09-16 22:22 ` Dietmar Eggemann
2024-08-20 16:35 ` [RFC PATCH 07/16] sched/pelt: Introduce PELT multiplier boot time parameter Qais Yousef
` (10 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
The new tunable, response_time_ms, allows us to speed up or slow down
the response time of the policy to meet the perf, power and thermal
characteristic desired by the user/sysadmin. There's no single universal
trade-off that we can apply for all systems even if they use the same
SoC. The form factor of the system, the dominant use case, and in case
of battery powered systems, the size of the battery and presence or
absence of active cooling can play a big role on what would be best to
use.
The new tunable provides sensible defaults, but yet gives the power to
control the response time to the user/sysadmin, if they wish to.
This tunable is applied before we apply the DVFS headroom.
The default behavior of applying the 1.25 headroom can be reinstated easily
now. But we continue to keep the min required headroom to overcome
hardware limitation in its speed to change DVFS. And any additional
headroom to speed things up must be applied by userspace to match their
expectation for best perf/watt as it dictates a type of policy that will
be better for some systems, but worse for others.
There's a whitespace clean up included in sugov_start().
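The resulting fixed-point multiplier logic can be sketched as follows (a userspace model of sugov_update_response_time_mult() and sugov_apply_response_time(); the 200ms intrinsic response time used below is an illustrative value, not a measured one):

```python
SCHED_CAPACITY_SCALE = 1024
SCHED_CAPACITY_SHIFT = 10

def response_time_mult(freq_response_time_ms, response_time_ms):
    # fixed-point multiplier: 1024 means unchanged, >1024 speeds the
    # response up, <1024 slows it down
    return freq_response_time_ms * SCHED_CAPACITY_SCALE // response_time_ms

def apply_response_time(util, mult):
    # applied to util before the DVFS headroom when selecting a frequency
    return (mult * util) >> SCHED_CAPACITY_SHIFT
```

With the user value equal to the intrinsic one, util passes through unchanged; halving response_time_ms doubles the effective util, and doubling it halves the util, which is how the top frequencies can end up chopped off as the documentation warns.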
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
Documentation/admin-guide/pm/cpufreq.rst | 17 +++-
drivers/cpufreq/cpufreq.c | 4 +-
include/linux/cpufreq.h | 3 +
kernel/sched/cpufreq_schedutil.c | 115 ++++++++++++++++++++++-
4 files changed, 132 insertions(+), 7 deletions(-)
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index 6adb7988e0eb..fa0d602a920e 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -417,7 +417,7 @@ is passed by the scheduler to the governor callback which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.
-This governor exposes only one tunable:
+This governor exposes two tunables:
``rate_limit_us``
Minimum time (in microseconds) that has to pass between two consecutive
@@ -427,6 +427,21 @@ This governor exposes only one tunable:
The purpose of this tunable is to reduce the scheduler context overhead
of the governor which might be excessive without it.
+``response_time_ms``
+ Amount of time (in milliseconds) required to ramp the policy from
+ lowest to highest frequency. Can be decreased to speed up the
+ responsiveness of the system, or increased to slow the system down in
+ the hope of saving power. The best perf/watt will depend on the system
+ characteristics and the dominant workload you expect to run. For
+ userspace that has smart context on the type of workload running (like
+ in Android), one can tune this to suit the demand of that workload.
+
+ Note that when slowing the response down, you can end up effectively
+ chopping off the top frequencies for that policy as the util is capped
+ to 1024. On HMP systems this chopping effect will only occur on the
+ biggest core whose capacity is 1024. Don't rely on this behavior as
+ this is a limitation that can hopefully be improved in the future.
+
This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index a45aac17c20f..5dc44c3694fe 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -533,8 +533,8 @@ void cpufreq_disable_fast_switch(struct cpufreq_policy *policy)
}
EXPORT_SYMBOL_GPL(cpufreq_disable_fast_switch);
-static unsigned int __resolve_freq(struct cpufreq_policy *policy,
- unsigned int target_freq, unsigned int relation)
+unsigned int __resolve_freq(struct cpufreq_policy *policy,
+ unsigned int target_freq, unsigned int relation)
{
unsigned int idx;
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 20f7e98ee8af..c14ffdcd8933 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -622,6 +622,9 @@ int cpufreq_driver_target(struct cpufreq_policy *policy,
int __cpufreq_driver_target(struct cpufreq_policy *policy,
unsigned int target_freq,
unsigned int relation);
+unsigned int __resolve_freq(struct cpufreq_policy *policy,
+ unsigned int target_freq,
+ unsigned int relation);
unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
unsigned int target_freq);
unsigned int cpufreq_policy_transition_delay_us(struct cpufreq_policy *policy);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 303b0ab227e7..94e35b7c972d 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -8,9 +8,12 @@
#define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)
+DEFINE_PER_CPU_READ_MOSTLY(unsigned long, response_time_mult);
+
struct sugov_tunables {
struct gov_attr_set attr_set;
unsigned int rate_limit_us;
+ unsigned int response_time_ms;
};
struct sugov_policy {
@@ -22,6 +25,7 @@ struct sugov_policy {
raw_spinlock_t update_lock;
u64 last_freq_update_time;
s64 freq_update_delay_ns;
+ unsigned int freq_response_time_ms;
unsigned int next_freq;
unsigned int cached_raw_freq;
@@ -59,6 +63,70 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
/************************ Governor internals ***********************/
+static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
+{
+ int cpu = cpumask_first(sg_policy->policy->cpus);
+ unsigned long cap = arch_scale_cpu_capacity(cpu);
+ unsigned int max_freq, sec_max_freq;
+
+ max_freq = sg_policy->policy->cpuinfo.max_freq;
+ sec_max_freq = __resolve_freq(sg_policy->policy,
+ max_freq - 1,
+ CPUFREQ_RELATION_H);
+
+ /*
+ * We will request max_freq as soon as util crosses the capacity at
+ * second highest frequency. So effectively our response time is the
+ * util at which we cross the cap@2nd_highest_freq.
+ */
+ cap = sec_max_freq * cap / max_freq;
+
+ return approximate_runtime(cap + 1);
+}
+
+static inline void sugov_update_response_time_mult(struct sugov_policy *sg_policy)
+{
+ unsigned long mult;
+ int cpu;
+
+ if (unlikely(!sg_policy->freq_response_time_ms))
+ sg_policy->freq_response_time_ms = sugov_calc_freq_response_ms(sg_policy);
+
+ mult = sg_policy->freq_response_time_ms * SCHED_CAPACITY_SCALE;
+ mult /= sg_policy->tunables->response_time_ms;
+
+ if (SCHED_WARN_ON(!mult))
+ mult = SCHED_CAPACITY_SCALE;
+
+ for_each_cpu(cpu, sg_policy->policy->cpus)
+ per_cpu(response_time_mult, cpu) = mult;
+}
+
+/*
+ * Shrink or expand how long it takes to reach the maximum performance of the
+ * policy.
+ *
+ * sg_policy->freq_response_time_ms is a constant value defined by PELT
+ * HALFLIFE and the capacity of the policy (assuming HMP systems).
+ *
+ * sg_policy->tunables->response_time_ms is a user defined response time. By
+ * setting it lower than sg_policy->freq_response_time_ms, the system will
+ * respond faster to changes in util, which will result in reaching maximum
+ * performance point quicker. By setting it higher, it'll slow down the amount
+ * of time required to reach the maximum OPP.
+ *
+ * This should be applied when selecting the frequency.
+ */
+static inline unsigned long
+sugov_apply_response_time(unsigned long util, int cpu)
+{
+ unsigned long mult;
+
+ mult = per_cpu(response_time_mult, cpu) * util;
+
+ return mult >> SCHED_CAPACITY_SHIFT;
+}
+
static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
s64 delta_ns;
@@ -215,7 +283,10 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
unsigned long min,
unsigned long max)
{
- /* Add dvfs headroom to actual utilization */
+ /*
+ * Speed up/slow down the response time first, then apply the DVFS headroom.
+ */
+ actual = sugov_apply_response_time(actual, cpu);
actual = sugov_apply_dvfs_headroom(actual, cpu);
/* Actually we don't need to target the max performance */
if (actual < max)
@@ -614,8 +685,42 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+static ssize_t response_time_ms_show(struct gov_attr_set *attr_set, char *buf)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+
+ return sprintf(buf, "%u\n", tunables->response_time_ms);
+}
+
+static ssize_t
+response_time_ms_store(struct gov_attr_set *attr_set, const char *buf, size_t count)
+{
+ struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
+ struct sugov_policy *sg_policy;
+ unsigned int response_time_ms;
+
+ if (kstrtouint(buf, 10, &response_time_ms))
+ return -EINVAL;
+
+ /* XXX need special handling for high values? */
+
+ tunables->response_time_ms = response_time_ms;
+
+ list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+ if (sg_policy->tunables == tunables) {
+ sugov_update_response_time_mult(sg_policy);
+ break;
+ }
+ }
+
+ return count;
+}
+
+static struct governor_attr response_time_ms = __ATTR_RW(response_time_ms);
+
static struct attribute *sugov_attrs[] = {
&rate_limit_us.attr,
+ &response_time_ms.attr,
NULL
};
ATTRIBUTE_GROUPS(sugov);
@@ -803,11 +908,13 @@ static int sugov_init(struct cpufreq_policy *policy)
goto stop_kthread;
}
- tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
-
policy->governor_data = sg_policy;
sg_policy->tunables = tunables;
+ tunables->rate_limit_us = cpufreq_policy_transition_delay_us(policy);
+ tunables->response_time_ms = sugov_calc_freq_response_ms(sg_policy);
+ sugov_update_response_time_mult(sg_policy);
+
ret = kobject_init_and_add(&tunables->attr_set.kobj, &sugov_tunables_ktype,
get_governor_parent_kobj(policy), "%s",
schedutil_gov.name);
@@ -867,7 +974,7 @@ static int sugov_start(struct cpufreq_policy *policy)
void (*uu)(struct update_util_data *data, u64 time, unsigned int flags);
unsigned int cpu;
- sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+ sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
sg_policy->last_freq_update_time = 0;
sg_policy->next_freq = 0;
sg_policy->work_in_progress = false;
--
2.34.1
* [RFC PATCH 07/16] sched/pelt: Introduce PELT multiplier boot time parameter
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (5 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 06/16] sched/schedutil: Add a new tunable to dictate response time Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-08-20 16:35 ` [RFC PATCH 08/16] sched/fair: Extend util_est to improve rampup time Qais Yousef
` (9 subsequent siblings)
16 siblings, 0 replies; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
The param is set as read-only and can only be changed at boot time via:
kernel.sched_pelt_multiplier=[1, 2, 4]
PELT has a big impact on the overall system response and reactiveness to
change. A smaller PELT HF means it'll require less time to reach the
maximum performance point of the system when the system becomes fully
busy; and equally a shorter time to go back to the lowest performance
point when the system goes back to idle.
This faster reaction impacts both DVFS response and migration time
between clusters in HMP system.
Smaller PELT values (higher multiplier) are expected to give better
performance at the cost of more power. Under-powered systems can
particularly benefit from faster response time. Powerful systems can
still benefit from a faster response time if they want to be tuned more
towards perf when power is not their major concern.
This, combined with response_time_ms from schedutil, should give the
user and sysadmin a deterministic way to control the power, perf and
thermal triangle for their system. The default response_time_ms will
halve as the PELT HF halves.
Update approximate_{util_avg, runtime}() to take into account the PELT
HALFLIFE multiplier.
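The effect of the multiplier on the earlier approximation helper can be modelled like this (continuous PELT sketch; the multiplier left-shifts time, i.e. divides the 32ms halflife):

```python
MAX_UTIL = 1024

def approximate_util_avg(util, delta_ms, multiplier=1):
    # multiplier in {1, 2, 4} shrinks the halflife to 32ms, 16ms or 8ms
    halflife_ms = 32.0 / multiplier
    y = 0.5 ** (1.0 / halflife_ms)
    return MAX_UTIL - (MAX_UTIL - util) * y ** delta_ms
```

With multiplier 2, 16ms of continuous running covers one halflife (util 0 reaches 512), where the default needs 32ms; overall ramp-up time to the maximum performance point halves accordingly, as the MODULE_PARM_DESC text states.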
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[qyousef: Commit message and boot param]
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
kernel/sched/pelt.c | 62 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 58 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 06cb881ba582..536575757420 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -24,6 +24,9 @@
* Author: Vincent Guittot <vincent.guittot@linaro.org>
*/
+static __read_mostly unsigned int sched_pelt_lshift;
+static unsigned int sched_pelt_multiplier = 1;
+
/*
* Approximate:
* val * y^n, where y^32 ~= 0.5 (~1 scheduling period)
@@ -180,6 +183,7 @@ static __always_inline int
___update_load_sum(u64 now, struct sched_avg *sa,
unsigned long load, unsigned long runnable, int running)
{
+ int time_shift;
u64 delta;
delta = now - sa->last_update_time;
@@ -195,12 +199,17 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
/*
* Use 1024ns as the unit of measurement since it's a reasonable
* approximation of 1us and fast to compute.
+ * On top of this, we can change the half-time period from the default
+ * 32ms to a shorter value. This is equivalent to left shifting the
+ * time.
+ * Merge both right and left shifts in one single right shift
*/
- delta >>= 10;
+ time_shift = 10 - sched_pelt_lshift;
+ delta >>= time_shift;
if (!delta)
return 0;
- sa->last_update_time += delta << 10;
+ sa->last_update_time += delta << time_shift;
/*
* running is a subset of runnable (weight) so running can't be set if
@@ -468,6 +477,51 @@ int update_irq_load_avg(struct rq *rq, u64 running)
}
#endif /* CONFIG_HAVE_SCHED_AVG_IRQ */
+static int set_sched_pelt_multiplier(const char *val, const struct kernel_param *kp)
+{
+ int ret;
+
+ ret = param_set_int(val, kp);
+ if (ret)
+ goto error;
+
+ switch (sched_pelt_multiplier) {
+ case 1:
+ fallthrough;
+ case 2:
+ fallthrough;
+ case 4:
+ WRITE_ONCE(sched_pelt_lshift,
+ sched_pelt_multiplier >> 1);
+ break;
+ default:
+ ret = -EINVAL;
+ goto error;
+ }
+
+ return 0;
+
+error:
+ sched_pelt_multiplier = 1;
+ return ret;
+}
+
+static const struct kernel_param_ops sched_pelt_multiplier_ops = {
+ .set = set_sched_pelt_multiplier,
+ .get = param_get_int,
+};
+
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+/* XXX: should we use sched as prefix? */
+#define MODULE_PARAM_PREFIX "kernel."
+module_param_cb(sched_pelt_multiplier, &sched_pelt_multiplier_ops, &sched_pelt_multiplier, 0444);
+MODULE_PARM_DESC(sched_pelt_multiplier, "PELT HALFLIFE helps control the responsiveness of the system.");
+MODULE_PARM_DESC(sched_pelt_multiplier, "Accepted value: 1 32ms PELT HALFLIFE - roughly 200ms to go from 0 to max performance point (default).");
+MODULE_PARM_DESC(sched_pelt_multiplier, "                2 16ms PELT HALFLIFE - roughly 100ms to go from 0 to max performance point.");
+MODULE_PARM_DESC(sched_pelt_multiplier, "                4 8ms PELT HALFLIFE - roughly 50ms to go from 0 to max performance point.");
+
/*
* Approximate the new util_avg value assuming an entity has continued to run
* for @delta us.
@@ -482,7 +536,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
if (unlikely(!delta))
return util;
- accumulate_sum(delta, &sa, 1, 0, 1);
+ accumulate_sum(delta << sched_pelt_lshift, &sa, 1, 0, 1);
___update_load_avg(&sa, 0);
return sa.util_avg;
@@ -494,7 +548,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
u64 approximate_runtime(unsigned long util)
{
struct sched_avg sa = {};
- u64 delta = 1024; // period = 1024 = ~1ms
+ u64 delta = 1024 << sched_pelt_lshift; // period = 1024 = ~1ms
u64 runtime = 0;
if (unlikely(!util))
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 08/16] sched/fair: Extend util_est to improve rampup time
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (6 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 07/16] sched/pelt: Introduce PELT multiplier boot time parameter Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-09-17 19:21 ` Dietmar Eggemann
2024-10-14 16:04 ` Christian Loehle
2024-08-20 16:35 ` [RFC PATCH 09/16] sched/fair: util_est: Take into account periodic tasks Qais Yousef
` (8 subsequent siblings)
16 siblings, 2 replies; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
Utilization invariance can cause big delays. When tasks are running,
accumulate a non-invariant version of utilization to help tasks settle
down to their new util_avg values faster.
Keep track of delta_exec while runnable across activations to help
update util_est accurately for a long running task. util_est should
still behave the same at enqueue/dequeue.
Before this patch, a busy task ramping up would experience the
following transitions, running on an M1 Mac Mini:
rampup-6338 util_avg running
┌─────────────────────────────────────────────────────────────────────────┐
986.0┤ ▄▄▄▄▄▟▀▀▀▀│
│ ▗▄▄▟▀▀▀▘ │
│ ▗▄▟▀▀ │
│ ▄▟▀▀ │
739.5┤ ▄▟▀▘ │
│ ▗▄▛▘ │
│ ▗▟▀ │
493.0┤ ▗▛▀ │
│ ▗▄▛▀ │
│ ▄▟▀ │
│ ▄▛▘ │
246.5┤ ▗▟▀▘ │
│ ▄▟▀▀ │
│ ▗▄▄▛▘ │
│ ▗▄▄▄▟▀ │
0.0┤ ▗ ▗▄▄▟▀▀ │
└┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────── rampup-6338 util_avg running residency (ms) ──────────────────
0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.5
15.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
36.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
57.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
78.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
98.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
117.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
137.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
156.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
176.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
191.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
211.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
230.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
248.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
266.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
277.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
294.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.6
311.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.4
327.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
340.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
358.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
371.0 ▇▇▇▇▇▇▇▇▇ 1.0
377.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
389.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
401.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
413.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
431.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
442.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
456.0 ▇▇▇▇▇▇▇▇▇ 1.0
───────────────────────── Sum Time Running on CPU (ms) ─────────────────────────
CPU0.0 ▇▇▇▇▇ 90.39
CPU4.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1156.93
6338 rampup CPU0.0 Frequency
┌──────────────────────────────────────────────────────────────────────────┐
2.06┤ ▛▀▀ │
│ ▌ │
│ ▌ │
│ ▌ │
1.70┤ ▛▀▀▘ │
│ ▌ │
│ ▌ │
1.33┤ ▗▄▄▄▌ │
│ ▐ │
│ ▐ │
│ ▐ │
0.97┤ ▗▄▄▄▟ │
│ ▐ │
│ ▐ │
│ ▐ │
0.60┤ ▗ ▗▄▄▄▄▄▄▄▄▟ │
└┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
6338 rampup CPU4.0 Frequency
┌──────────────────────────────────────────────────────────────────────────┐
3.20┤ ▐▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀│
│ ▐ │
│ ▛▀▀ │
│ ▌ │
2.78┤ ▐▀▀▘ │
│ ▗▄▟ │
│ ▌ │
2.35┤ ▗▄▄▌ │
│ ▐ │
│ ▄▄▟ │
│ ▌ │
1.93┤ ▗▄▄▌ │
│ ▐ │
│ ▐ │
│ ▐ │
1.50┤ ▗▄▄▟ │
└┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────── 6338 rampup CPU0.0 Frequency residency (ms) ──────────────────
0.6 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 37.3
0.972 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 15.0
1.332 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 15.0
1.704 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.0
2.064 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.1
───────────────── 6338 rampup CPU4.0 Frequency residency (ms) ──────────────────
1.5 ▇▇▇▇▇▇▇▇▇▇ 11.9
1.956 ▇▇▇▇▇▇▇▇ 10.0
2.184 ▇▇▇▇▇▇▇▇ 10.0
2.388 ▇▇▇▇▇▇▇▇▇ 11.0
2.592 ▇▇▇▇▇▇▇▇ 10.0
2.772 ▇▇▇▇▇▇▇▇ 10.0
2.988 ▇▇▇▇▇▇▇▇ 10.0
3.204 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 85.3
After the patch, the response is improved: the task ramps up
frequencies faster and migrates off the little CPU quicker:
rampup-2234 util_avg running
┌───────────────────────────────────────────────────────────────────────────┐
984┤ ▗▄▄▄▄▄▛▀▀▀▀│
│ ▄▄▟▀▀▀▀ │
│ ▄▄▟▀▀ │
│ ▄▟▀▘ │
738┤ ▄▟▀▘ │
│ ▗▟▀▘ │
│ ▗▟▀ │
492┤ ▗▟▀ │
│ ▗▟▀ │
│ ▟▀ │
│ ▄▛▘ │
246┤ ▗▟▘ │
│ ▗▟▀ │
│ ▗▟▀ │
│ ▗▟▀ │
0┤ ▄▄▄▛▀ │
└┬───────┬───────┬────────┬───────┬───────┬───────┬────────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────── rampup-2234 util_avg running residency (ms) ──────────────────
0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.6
15.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
39.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
61.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
85.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
99.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
120.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
144.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
160.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
176.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
192.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
210.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
228.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
246.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
263.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
282.0 ▇▇▇▇▇▇▇ 1.0
291.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
309.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
327.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
344.0 ▇▇▇▇▇▇▇ 1.0
354.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
373.0 ▇▇▇▇▇▇▇ 1.0
382.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
400.0 ▇▇▇▇▇▇▇ 1.0
408.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
425.0 ▇▇▇▇▇▇▇ 1.0
434.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
452.0 ▇▇▇▇▇▇▇ 1.0
2234 rampup CPU1.0 Frequency
┌──────────────────────────────────────────────────────────────────────────┐
2.06┤ ▐▀ │
│ ▐ │
│ ▐ │
│ ▐ │
1.70┤ ▛▀ │
│ ▌ │
│ ▌ │
1.33┤ ▄▌ │
│ ▌ │
│ ▌ │
│ ▌ │
0.97┤ ▗▄▌ │
│ ▐ │
│ ▐ │
│ ▐ │
0.60┤ ▗▄▄▟ │
└┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
2234 rampup CPU4.0 Frequency
┌──────────────────────────────────────────────────────────────────────────┐
3.10┤ ▐▀▀▀▀▀▀▀▀▀▀▀▀▀│
│ ▛▀▀▀▀▀▀▀▀▀▀▀ │
│ ▌ │
│ ▐▀▀▀▀▘ │
2.70┤ ▐ │
│ ▐▀▀▀▀ │
│ ▐ │
2.30┤ ▛▀▀ │
│ ▌ │
│ ▐▀▀▘ │
│ ▐ │
1.90┤ ▐▀▀ │
│ ▐ │
│ ▗▄▟ │
│ ▐ │
1.50┤ ▗▟ │
└┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
1.700 1.733 1.767 1.800 1.833 1.867 1.900 1.933 1.967 2.000
───────────────────────── Sum Time Running on CPU (ms) ─────────────────────────
CPU1.0 ▇▇▇▇ 32.53
CPU4.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 540.3
───────────────── 2234 rampup CPU1.0 Frequency residency (ms) ──────────────────
0.6 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.1
0.972 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.5
1.332 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.7
1.704 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.5
2.064 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.8
───────────────── 2234 rampup CPU4.0 Frequency residency (ms) ──────────────────
1.5 ▇▇▇▇▇ 4.0
1.728 ▇▇▇▇▇▇▇▇▇▇ 8.0
1.956 ▇▇▇▇▇▇▇▇▇▇▇▇ 9.0
2.184 ▇▇▇▇▇▇▇▇▇▇▇▇ 9.0
2.388 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.0
2.592 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 16.0
2.772 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 18.0
2.988 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 47.0
3.096 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 53.4
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 43 +++++++++++++++++++++++++++++++------------
3 files changed, 33 insertions(+), 12 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e..8db8f4085d84 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -544,6 +544,7 @@ struct sched_entity {
unsigned int on_rq;
u64 exec_start;
+ u64 delta_exec;
u64 sum_exec_runtime;
u64 prev_sum_exec_runtime;
u64 vruntime;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7099e40cc8bd..e2b4b87ec2b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4318,6 +4318,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.on_rq = 0;
p->se.exec_start = 0;
+ p->se.delta_exec = 0;
p->se.sum_exec_runtime = 0;
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5e986af18dc..a6421e4032c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1118,6 +1118,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
curr->exec_start = now;
curr->sum_exec_runtime += delta_exec;
+ curr->delta_exec = delta_exec;
if (schedstat_enabled()) {
struct sched_statistics *stats;
@@ -1126,7 +1127,6 @@ static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
__schedstat_set(stats->exec_max,
max(delta_exec, stats->exec_max));
}
-
return delta_exec;
}
@@ -4890,16 +4890,20 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
if (!sched_feat(UTIL_EST))
return;
- /*
- * Skip update of task's estimated utilization when the task has not
- * yet completed an activation, e.g. being migrated.
- */
- if (!task_sleep)
- return;
-
/* Get current estimate of utilization */
ewma = READ_ONCE(p->se.avg.util_est);
+ /*
+ * If a task is running, update util_est ignoring utilization
+ * invariance so that if the task suddenly becomes busy we will rampup
+ * quickly to settle down to our new util_avg.
+ */
+ if (!task_sleep) {
+ ewma &= ~UTIL_AVG_UNCHANGED;
+ ewma = approximate_util_avg(ewma, p->se.delta_exec / 1000);
+ goto done;
+ }
+
/*
* If the PELT values haven't changed since enqueue time,
* skip the util_est update.
@@ -4968,6 +4972,14 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
trace_sched_util_est_se_tp(&p->se);
}
+static inline void util_est_update_running(struct cfs_rq *cfs_rq,
+ struct task_struct *p)
+{
+ util_est_dequeue(cfs_rq, p);
+ util_est_update(cfs_rq, p, false);
+ util_est_enqueue(cfs_rq, p);
+}
+
static inline unsigned long get_actual_cpu_capacity(int cpu)
{
unsigned long capacity = arch_scale_cpu_capacity(cpu);
@@ -5164,13 +5176,13 @@ static inline int sched_balance_newidle(struct rq *rq, struct rq_flags *rf)
static inline void
util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
-
static inline void
util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
-
static inline void
-util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p,
- bool task_sleep) {}
+util_est_update(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep) {}
+static inline void
+util_est_update_running(struct cfs_rq *cfs_rq, struct task_struct *p) {}
+
static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
#endif /* CONFIG_SMP */
@@ -6906,6 +6918,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
rq->next_balance = jiffies;
dequeue_throttle:
+ if (task_sleep)
+ p->se.delta_exec = 0;
util_est_update(&rq->cfs, p, task_sleep);
hrtick_update(rq);
}
@@ -8546,6 +8560,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
set_next_entity(cfs_rq, se);
}
+ if (prev->on_rq)
+ util_est_update_running(&rq->cfs, prev);
+
goto done;
simple:
#endif
@@ -12710,6 +12727,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
entity_tick(cfs_rq, se, queued);
}
+ util_est_update_running(&rq->cfs, curr);
+
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 09/16] sched/fair: util_est: Take into account periodic tasks
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (7 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 08/16] sched/fair: Extend util_est to improve rampup time Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-11-13 4:57 ` John Stultz
2024-08-20 16:35 ` [RFC PATCH 10/16] sched/qos: Add a new sched-qos interface Qais Yousef
` (7 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
The new faster rampup is great for performance, but terrible for power.
We want the faster rampup to apply only to tasks that are transitioning
from one periodic/steady state to another. If they are stably periodic,
the faster rampup doesn't make sense, as util_avg describes their
computational demand accurately and we can rely on it to make accurate
decisions, preserving the power savings that come from being exact with
the resources we give to the task (i.e. smaller DVFS headroom).
We detect periodic tasks based on util_avg across util_est_update()
calls. If it is rising, the task is going through a transition.
We rely on util_avg being stable for periodic tasks, with very little
variation around one stable point.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 17 ++++++++++++++---
3 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db8f4085d84..2e8c5a9ffa76 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -829,6 +829,8 @@ struct task_struct {
struct uclamp_se uclamp[UCLAMP_CNT];
#endif
+ unsigned long util_avg_dequeued;
+
struct sched_statistics stats;
#ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e2b4b87ec2b7..c91e6a62c7ab 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4331,6 +4331,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.cfs_rq = NULL;
#endif
+ p->util_avg_dequeued = 0;
+
#ifdef CONFIG_SCHEDSTATS
/* Even if schedstat is disabled, there should not be garbage */
memset(&p->stats, 0, sizeof(p->stats));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a6421e4032c0..0c10e2afb52d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4832,6 +4832,11 @@ static inline unsigned long task_util(struct task_struct *p)
return READ_ONCE(p->se.avg.util_avg);
}
+static inline unsigned long task_util_dequeued(struct task_struct *p)
+{
+ return READ_ONCE(p->util_avg_dequeued);
+}
+
static inline unsigned long task_runnable(struct task_struct *p)
{
return READ_ONCE(p->se.avg.runnable_avg);
@@ -4899,9 +4904,12 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* quickly to settle down to our new util_avg.
*/
if (!task_sleep) {
- ewma &= ~UTIL_AVG_UNCHANGED;
- ewma = approximate_util_avg(ewma, p->se.delta_exec / 1000);
- goto done;
+ if (task_util(p) > task_util_dequeued(p)) {
+ ewma &= ~UTIL_AVG_UNCHANGED;
+ ewma = approximate_util_avg(ewma, p->se.delta_exec / 1000);
+ goto done;
+ }
+ return;
}
/*
@@ -4914,6 +4922,9 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
/* Get utilization at dequeue */
dequeued = task_util(p);
+ if (!task_on_rq_migrating(p))
+ p->util_avg_dequeued = dequeued;
+
/*
* Reset EWMA on utilization increases, the moving average is used only
* to smooth utilization decreases.
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 10/16] sched/qos: Add a new sched-qos interface
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (8 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 09/16] sched/fair: util_est: Take into account periodic tasks Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-11-28 1:47 ` John Stultz
2024-08-20 16:35 ` [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS Qais Yousef
` (6 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
The need to describe the conflicting demands of various workloads has
never been higher. Both hardware and software have moved rapidly in the
past decade; system usage is more diverse, and the number of workloads
expected to run on the same machine, whether in Mobile or Server
markets, has created a big dilemma on how to better manage those
requirements.
The problem is that we lack mechanisms that allow these workloads to
describe what they need, and then allow the kernel to do its best
effort to manage those demands transparently, based on the hardware it
is running on and the current system state.
Example of conflicting requirements that come across frequently:
1. Improve wake up latency for SCHED_OTHER. Many tasks end up
using SCHED_FIFO/SCHED_RR to compensate for this shortcoming.
RT tasks lack power management and fairness and can be hard
and error prone to use correctly and portably.
2. Prefer spreading vs prefer packing on wake up for a group of
tasks. Geekbench-like workloads would benefit from
parallelising on different CPUs. hackbench type of workloads
can benefit from waking up on the same CPU or a CPU that is
closer in the cache hierarchy.
3. Nice values for SCHED_OTHER are system wide and require
privileges. Many workloads would like a way to set a relative
nice value so they can preempt each other, but not impact
or be impacted by tasks belonging to different workloads
on the system.
4. Provide a way to tag some tasks as 'background' to keep them
out of the way. SCHED_IDLE is too strong for some of these
tasks, yet they can be computationally heavy. Example
tasks are garbage collectors. Their work is both important
and not important.
5. Provide a way to improve DVFS/upmigration rampup time for
specific tasks that are bursty in nature and highly
interactive.
Whether any of these use cases warrants an additional QoS hint is
something to be discussed individually. But the main point is to
introduce an interface that can be extended to cater for potentially
those requirements and more. rampup_multiplier, to improve
DVFS/upmigration for bursty tasks, will be the first user in a later
patch.
It is desired to have apps (and benchmarks!) directly use this interface
for optimal perf/watt. But in the absence of such support, it should be
possible to write a userspace daemon to monitor workloads and apply
these QoS hints on apps' behalf, based on analysis done by anyone
interested in improving the performance of those workloads.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
Documentation/scheduler/index.rst | 1 +
Documentation/scheduler/sched-qos.rst | 44 ++++++++++++++++++
include/uapi/linux/sched.h | 4 ++
include/uapi/linux/sched/types.h | 46 +++++++++++++++++++
kernel/sched/syscalls.c | 3 ++
.../trace/beauty/include/uapi/linux/sched.h | 4 ++
6 files changed, 102 insertions(+)
create mode 100644 Documentation/scheduler/sched-qos.rst
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 43bd8a145b7a..f49b8b021d97 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -21,6 +21,7 @@ Scheduler
sched-rt-group
sched-stats
sched-debug
+ sched-qos
text_files
diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
new file mode 100644
index 000000000000..0911261cb124
--- /dev/null
+++ b/Documentation/scheduler/sched-qos.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Scheduler QoS
+=============
+
+1. Introduction
+===============
+
+Different workloads have different scheduling requirements to operate
+optimally. The same applies to tasks within the same workload.
+
+To enable smarter usage of system resources and to cater for the conflicting
+demands of various tasks, Scheduler QoS provides a mechanism to provide more
+information about those demands so that scheduler can do best-effort to
+honour them.
+
+ @sched_qos_type what QoS hint to apply
+ @sched_qos_value value of the QoS hint
+ @sched_qos_cookie magic cookie to tag a group of tasks for which the QoS
+ applies. If 0, the hint will apply globally system
+ wide. If not 0, the hint will be relative to tasks that
+ have the same cookie value only.
+
+QoS hints are set once and not inherited by children by design. The
+rationale is that each task has its individual characteristics and it is
+encouraged to describe each of these separately. Also since system resources
+are finite, there's a limit to what can be done to honour these requests
+before reaching a tipping point where there are too many requests for
+a particular QoS that is impossible to service for all of them at once and
+some will start to lose out. For example if 10 tasks require better wake
+up latencies on a 4 CPUs SMP system, then if they all wake up at once, only
+4 can perceive the hint honoured and the rest will have to wait. Inheritance
+can lead these 10 to become a 100 or a 1000 more easily, and then the QoS
+hint will lose its meaning and effectiveness rapidly. The chances of 10
+tasks waking up at the same time is lower than a 100 and lower than a 1000.
+
+To set multiple QoS hints, a syscall is required for each. This is a
+trade-off to reduce the churn of extending the interface. The hope is
+for this to evolve as workloads and hardware get more sophisticated,
+and when the need for extension arises the task should be simpler: add
+the kernel extension and allow userspace to use it readily by setting
+the newly added flag, without having to update the whole of
+sched_attr.
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3bac0a8ceab2..67ef99f64ddc 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -102,6 +102,9 @@ struct clone_args {
__aligned_u64 set_tid_size;
__aligned_u64 cgroup;
};
+
+enum sched_qos_type {
+};
#endif
#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
@@ -132,6 +135,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_QOS 0x80
#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index 90662385689b..55e4b1e79ed2 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -94,6 +94,48 @@
* scheduled on a CPU with no more capacity than the specified value.
*
* A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * Scheduler QoS
+ * =============
+ *
+ * Different workloads have different scheduling requirements to operate
+ * optimally. The same applies to tasks within the same workload.
+ *
+ * To enable smarter usage of system resources and to cater for the conflicting
+ * demands of various tasks, Scheduler QoS provides a mechanism to provide more
+ * information about those demands so that scheduler can do best-effort to
+ * honour them.
+ *
+ * @sched_qos_type what QoS hint to apply
+ * @sched_qos_value value of the QoS hint
+ * @sched_qos_cookie magic cookie to tag a group of tasks for which the QoS
+ * applies. If 0, the hint will apply globally system
+ * wide. If not 0, the hint will be relative to tasks that
+ * have the same cookie value only.
+ *
+ * QoS hints are set once and not inherited by children by design. The
+ * rationale is that each task has its individual characteristics and it is
+ * encouraged to describe each of these separately. Also since system resources
+ * are finite, there's a limit to what can be done to honour these requests
+ * before reaching a tipping point where there are too many requests for
+ * a particular QoS that is impossible to service for all of them at once and
+ * some will start to lose out. For example if 10 tasks require better wake
+ * up latencies on a 4 CPUs SMP system, then if they all wake up at once, only
+ * 4 can perceive the hint honoured and the rest will have to wait. Inheritance
+ * can lead these 10 to become a 100 or a 1000 more easily, and then the QoS
+ * hint will lose its meaning and effectiveness rapidly. The chances of 10
+ * tasks waking up at the same time is lower than a 100 and lower than a 1000.
+ *
+ * To set multiple QoS hints, a syscall is required for each. This is a
+ * trade-off to reduce the churn of extending the interface. The hope is
+ * for this to evolve as workloads and hardware get more sophisticated,
+ * and when the need for extension arises the task should be simpler: add
+ * the kernel extension and allow userspace to use it readily by setting
+ * the newly added flag, without having to update the whole of
+ * sched_attr.
+ *
+ * Details about the available QoS hints can be found in:
+ * Documentation/scheduler/sched-qos.rst
*/
struct sched_attr {
__u32 size;
@@ -116,6 +158,10 @@ struct sched_attr {
__u32 sched_util_min;
__u32 sched_util_max;
+ __u32 sched_qos_type;
+ __s64 sched_qos_value;
+ __u32 sched_qos_cookie;
+
};
#endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index ae1b42775ef9..a7d4dfdfed43 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -668,6 +668,9 @@ int __sched_setscheduler(struct task_struct *p,
return retval;
}
+ if (attr->sched_flags & SCHED_FLAG_QOS)
+ return -EOPNOTSUPP;
+
/*
* SCHED_DEADLINE bandwidth accounting relies on stable cpusets
* information.
diff --git a/tools/perf/trace/beauty/include/uapi/linux/sched.h b/tools/perf/trace/beauty/include/uapi/linux/sched.h
index 3bac0a8ceab2..67ef99f64ddc 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/sched.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/sched.h
@@ -102,6 +102,9 @@ struct clone_args {
__aligned_u64 set_tid_size;
__aligned_u64 cgroup;
};
+
+enum sched_qos_type {
+};
#endif
#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
@@ -132,6 +135,7 @@ struct clone_args {
#define SCHED_FLAG_KEEP_PARAMS 0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
+#define SCHED_FLAG_QOS 0x80
#define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
SCHED_FLAG_KEEP_PARAMS)
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (9 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 10/16] sched/qos: Add a new sched-qos interface Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-09-17 20:09 ` Dietmar Eggemann
` (3 more replies)
2024-08-20 16:35 ` [RFC PATCH 12/16] sched/pelt: Add new waiting_avg to record when runnable && !running Qais Yousef
` (5 subsequent siblings)
16 siblings, 4 replies; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
Bursty tasks are hard to predict. To use resources efficiently, the
system would like to be as exact as possible. But this poses
a challenge for these bursty tasks that need to get access to more
resources quickly.
The new SCHED_QOS_RAMPUP_MULTIPLIER allows userspace to do that. As the
name implies, it only helps tasks transition to a higher performance
state when they get _busier_. That is, perfectly periodic tasks by
definition are not going through a transition and will run at a constant
performance level. It is the tasks that need to transition from one
periodic state to another periodic state at a higher level that
rampup_multiplier will help with. It also slows down the ewma decay
of util_est, which should help those bursty tasks keep their faster
rampup.
This should work complementary to uclamp. uclamp tells the system
about min and max perf requirements, which can be applied immediately.
rampup_multiplier is about the reactiveness of the task to change,
specifically a change to a higher performance level. The task might not
necessarily have min perf requirements, but it can have sudden bursts
of change that require a higher perf level, and it needs the system
to provide this faster.
TODO: update the sched_qos docs
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
include/linux/sched.h | 7 ++++
include/uapi/linux/sched.h | 2 ++
kernel/sched/core.c | 66 ++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 6 ++--
kernel/sched/syscalls.c | 38 ++++++++++++++++++++--
5 files changed, 115 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2e8c5a9ffa76..a30ee43a25fb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -404,6 +404,11 @@ struct sched_info {
#endif /* CONFIG_SCHED_INFO */
};
+struct sched_qos {
+ DECLARE_BITMAP(user_defined, SCHED_QOS_MAX);
+ unsigned int rampup_multiplier;
+};
+
/*
* Integer metrics need fixed point arithmetic, e.g., sched/fair
* has a few: load, load_avg, util_avg, freq, and capacity.
@@ -882,6 +887,8 @@ struct task_struct {
struct sched_info sched_info;
+ struct sched_qos sched_qos;
+
struct list_head tasks;
#ifdef CONFIG_SMP
struct plist_node pushable_tasks;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 67ef99f64ddc..0baba91ba5b8 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -104,6 +104,8 @@ struct clone_args {
};
enum sched_qos_type {
+ SCHED_QOS_RAMPUP_MULTIPLIER,
+ SCHED_QOS_MAX,
};
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c91e6a62c7ab..54faa845cb29 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -152,6 +152,8 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
*/
const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
+unsigned int sysctl_sched_qos_default_rampup_multiplier = 1;
+
__read_mostly int scheduler_running;
#ifdef CONFIG_SCHED_CORE
@@ -4488,6 +4490,47 @@ static int sysctl_schedstats(struct ctl_table *table, int write, void *buffer,
#endif /* CONFIG_SCHEDSTATS */
#ifdef CONFIG_SYSCTL
+static void sched_qos_sync_sysctl(void)
+{
+ struct task_struct *g, *p;
+
+ guard(rcu)();
+ for_each_process_thread(g, p) {
+ struct rq_flags rf;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &rf);
+ if (!test_bit(SCHED_QOS_RAMPUP_MULTIPLIER, p->sched_qos.user_defined))
+ p->sched_qos.rampup_multiplier = sysctl_sched_qos_default_rampup_multiplier;
+ task_rq_unlock(rq, p, &rf);
+ }
+}
+
+static int sysctl_sched_qos_handler(struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ unsigned int old_rampup_mult;
+ int result;
+
+ old_rampup_mult = sysctl_sched_qos_default_rampup_multiplier;
+
+ result = proc_dointvec(table, write, buffer, lenp, ppos);
+ if (result)
+ goto undo;
+ if (!write)
+ return 0;
+
+ if (old_rampup_mult != sysctl_sched_qos_default_rampup_multiplier) {
+ sched_qos_sync_sysctl();
+ }
+
+ return 0;
+
+undo:
+ sysctl_sched_qos_default_rampup_multiplier = old_rampup_mult;
+ return result;
+}
+
static struct ctl_table sched_core_sysctls[] = {
#ifdef CONFIG_SCHEDSTATS
{
@@ -4534,6 +4577,13 @@ static struct ctl_table sched_core_sysctls[] = {
.extra2 = SYSCTL_FOUR,
},
#endif /* CONFIG_NUMA_BALANCING */
+ {
+ .procname = "sched_qos_default_rampup_multiplier",
+ .data = &sysctl_sched_qos_default_rampup_multiplier,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = sysctl_sched_qos_handler,
+ },
};
static int __init sched_core_sysctl_init(void)
{
@@ -4543,6 +4593,21 @@ static int __init sched_core_sysctl_init(void)
late_initcall(sched_core_sysctl_init);
#endif /* CONFIG_SYSCTL */
+static void sched_qos_fork(struct task_struct *p)
+{
+ /*
+ * We always force reset sched_qos on fork. These sched_qos are treated
+ * as finite resources to help improve quality of life. Inheriting them
+ * by default can easily lead to a situation where the QoS hint become
+ * meaningless because all tasks in the system have it.
+ *
+ * Every task must request the QoS explicitly if it needs it. No
+ * accidental inheritance is allowed to keep the default behavior sane.
+ */
+ bitmap_zero(p->sched_qos.user_defined, SCHED_QOS_MAX);
+ p->sched_qos.rampup_multiplier = sysctl_sched_qos_default_rampup_multiplier;
+}
+
/*
* fork()/clone()-time setup:
*/
@@ -4562,6 +4627,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
p->prio = current->normal_prio;
uclamp_fork(p);
+ sched_qos_fork(p);
/*
* Revert to default priority/policy on fork if requested.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c10e2afb52d..3d9794db58e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4906,7 +4906,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
if (!task_sleep) {
if (task_util(p) > task_util_dequeued(p)) {
ewma &= ~UTIL_AVG_UNCHANGED;
- ewma = approximate_util_avg(ewma, p->se.delta_exec / 1000);
+ ewma = approximate_util_avg(ewma, (p->se.delta_exec/1000) * p->sched_qos.rampup_multiplier);
goto done;
}
return;
@@ -4974,6 +4974,8 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
*/
ewma <<= UTIL_EST_WEIGHT_SHIFT;
+ if (p->sched_qos.rampup_multiplier)
+ last_ewma_diff /= p->sched_qos.rampup_multiplier;
ewma -= last_ewma_diff;
ewma >>= UTIL_EST_WEIGHT_SHIFT;
done:
@@ -9643,7 +9645,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
* on TICK doesn't end up hurting it as it can happen after we would
* have crossed this threshold.
*
- * To ensure that invaraince is taken into account, we don't scale time
+ * To ensure that invariance is taken into account, we don't scale time
* and use it as-is, approximate_util_avg() will then let us know the
* our threshold.
*/
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index a7d4dfdfed43..dc7d7bcaae7b 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -543,6 +543,35 @@ static void __setscheduler_uclamp(struct task_struct *p,
const struct sched_attr *attr) { }
#endif
+static inline int sched_qos_validate(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ switch (attr->sched_qos_type) {
+ case SCHED_QOS_RAMPUP_MULTIPLIER:
+ if (attr->sched_qos_cookie)
+ return -EINVAL;
+ if (attr->sched_qos_value < 0)
+ return -EINVAL;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static void __setscheduler_sched_qos(struct task_struct *p,
+ const struct sched_attr *attr)
+{
+ switch (attr->sched_qos_type) {
+ case SCHED_QOS_RAMPUP_MULTIPLIER:
+ set_bit(SCHED_QOS_RAMPUP_MULTIPLIER, p->sched_qos.user_defined);
+ p->sched_qos.rampup_multiplier = attr->sched_qos_value;
+ default:
+ break;
+ }
+}
+
/*
* Allow unprivileged RT tasks to decrease priority.
* Only issue a capable test if needed and only once to avoid an audit
@@ -668,8 +697,11 @@ int __sched_setscheduler(struct task_struct *p,
return retval;
}
- if (attr->sched_flags & SCHED_FLAG_QOS)
- return -EOPNOTSUPP;
+ if (attr->sched_flags & SCHED_FLAG_QOS) {
+ retval = sched_qos_validate(p, attr);
+ if (retval)
+ return retval;
+ }
/*
* SCHED_DEADLINE bandwidth accounting relies on stable cpusets
@@ -799,7 +831,9 @@ int __sched_setscheduler(struct task_struct *p,
__setscheduler_params(p, attr);
__setscheduler_prio(p, newprio);
}
+
__setscheduler_uclamp(p, attr);
+ __setscheduler_sched_qos(p, attr);
if (queued) {
/*
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 12/16] sched/pelt: Add new waiting_avg to record when runnable && !running
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (10 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-09-18 7:01 ` Dietmar Eggemann
2024-08-20 16:35 ` [RFC PATCH 13/16] sched/schedutil: Take into account waiting_avg in apply_dvfs_headroom Qais Yousef
` (4 subsequent siblings)
16 siblings, 1 reply; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
This info will be useful to understand how long tasks end up waiting
behind other tasks. This info is recorded for tasks only, and
added/subtracted from the root cfs_rq on __update_load_avg_se().
It also helps to decouple util_avg, which indicates a task's
computational demand, from the fact that the CPU might need to run
faster to reduce the waiting time. This has been a point of confusion in
the past while discussing uclamp and util_avg: not keeping the frequency
high means tasks will take longer to run and cause delays. Isolating
this source of delay into its own signal would be a better way to take
it into account when making decisions, independently of the task's/CPU's
computational demands.
It is not used now, but will be used later to help drive the DVFS
headroom. It could become a helpful metric to help us manage waiting
latencies in general, for example in load balance.
TODO: waiting_avg should use rq_clock_task() as it doesn't care about
invariance. Waiting time should reflect actual wait in realtime as this
is the measure of latency that users care about.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
include/linux/sched.h | 2 ++
kernel/sched/debug.c | 5 +++++
kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++-
kernel/sched/pelt.c | 45 ++++++++++++++++++++++++++++++-------------
4 files changed, 70 insertions(+), 14 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a30ee43a25fb..f332ce5e226f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -477,10 +477,12 @@ struct sched_avg {
u64 last_update_time;
u64 load_sum;
u64 runnable_sum;
+ u64 waiting_sum;
u32 util_sum;
u32 period_contrib;
unsigned long load_avg;
unsigned long runnable_avg;
+ unsigned long waiting_avg;
unsigned long util_avg;
unsigned int util_est;
} ____cacheline_aligned;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c1eb9a1afd13..5fa2662a4a50 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -528,6 +528,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
P(se->avg.load_avg);
P(se->avg.util_avg);
P(se->avg.runnable_avg);
+ P(se->avg.waiting_avg);
#endif
#undef PN_SCHEDSTAT
@@ -683,6 +684,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "runnable_avg",
cfs_rq->avg.runnable_avg);
+ SEQ_printf(m, " .%-30s: %lu\n", "waiting_avg",
+ cfs_rq->avg.waiting_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
SEQ_printf(m, " .%-30s: %u\n", "util_est",
@@ -1071,9 +1074,11 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
#ifdef CONFIG_SMP
P(se.avg.load_sum);
P(se.avg.runnable_sum);
+ P(se.avg.waiting_sum);
P(se.avg.util_sum);
P(se.avg.load_avg);
P(se.avg.runnable_avg);
+ P(se.avg.waiting_avg);
P(se.avg.util_avg);
P(se.avg.last_update_time);
PM(se.avg.util_est, ~UTIL_AVG_UNCHANGED);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d9794db58e1..a8dbba0b755e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4726,6 +4726,22 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
trace_pelt_cfs_tp(cfs_rq);
}
+static inline void add_waiting_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ unsigned long waiting_avg;
+ waiting_avg = READ_ONCE(cfs_rq->avg.waiting_avg);
+ waiting_avg += READ_ONCE(se->avg.waiting_avg);
+ WRITE_ONCE(cfs_rq->avg.waiting_avg, waiting_avg);
+}
+
+static inline void sub_waiting_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+ unsigned long waiting_avg;
+ waiting_avg = READ_ONCE(cfs_rq->avg.waiting_avg);
+ waiting_avg -= min(waiting_avg, READ_ONCE(se->avg.waiting_avg));
+ WRITE_ONCE(cfs_rq->avg.waiting_avg, waiting_avg);
+}
+
/*
* Optional action to be done while updating the load average
*/
@@ -4744,8 +4760,15 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
* Track task load average for carrying it to new CPU after migrated, and
* track group sched_entity load average for task_h_load calculation in migration
*/
- if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
+ if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
+ bool update_rq_waiting_avg = entity_is_task(se) && se_runnable(se);
+
+ if (update_rq_waiting_avg)
+ sub_waiting_avg(&rq_of(cfs_rq)->cfs, se);
__update_load_avg_se(now, cfs_rq, se);
+ if (update_rq_waiting_avg)
+ add_waiting_avg(&rq_of(cfs_rq)->cfs, se);
+ }
decayed = update_cfs_rq_load_avg(now, cfs_rq);
decayed |= propagate_entity_load_avg(se);
@@ -5182,6 +5205,11 @@ attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static inline void
detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void
+add_waiting_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+static inline void
+sub_waiting_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
+
static inline int sched_balance_newidle(struct rq *rq, struct rq_flags *rf)
{
return 0;
@@ -6786,6 +6814,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
* estimated utilization, before we update schedutil.
*/
util_est_enqueue(&rq->cfs, p);
+ add_waiting_avg(&rq->cfs, se);
/*
* If in_iowait is set, the code below may not trigger any cpufreq
@@ -6874,6 +6903,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
bool was_sched_idle = sched_idle_rq(rq);
util_est_dequeue(&rq->cfs, p);
+ sub_waiting_avg(&rq->cfs, se);
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 536575757420..f0974abf8566 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -103,7 +103,8 @@ static u32 __accumulate_pelt_segments(u64 periods, u32 d1, u32 d3)
*/
static __always_inline u32
accumulate_sum(u64 delta, struct sched_avg *sa,
- unsigned long load, unsigned long runnable, int running)
+ unsigned long load, unsigned long runnable, int running,
+ bool is_task)
{
u32 contrib = (u32)delta; /* p == 0 -> delta < 1024 */
u64 periods;
@@ -118,6 +119,7 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
sa->load_sum = decay_load(sa->load_sum, periods);
sa->runnable_sum =
decay_load(sa->runnable_sum, periods);
+ sa->waiting_sum = decay_load((u64)(sa->waiting_sum), periods);
sa->util_sum = decay_load((u64)(sa->util_sum), periods);
/*
@@ -147,6 +149,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
sa->runnable_sum += runnable * contrib << SCHED_CAPACITY_SHIFT;
if (running)
sa->util_sum += contrib << SCHED_CAPACITY_SHIFT;
+ if (is_task && runnable && !running)
+ sa->waiting_sum += contrib << SCHED_CAPACITY_SHIFT;
return periods;
}
@@ -181,7 +185,8 @@ accumulate_sum(u64 delta, struct sched_avg *sa,
*/
static __always_inline int
___update_load_sum(u64 now, struct sched_avg *sa,
- unsigned long load, unsigned long runnable, int running)
+ unsigned long load, unsigned long runnable, int running,
+ bool is_task)
{
int time_shift;
u64 delta;
@@ -232,7 +237,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
* Step 1: accumulate *_sum since last_update_time. If we haven't
* crossed period boundaries, finish.
*/
- if (!accumulate_sum(delta, sa, load, runnable, running))
+ if (!accumulate_sum(delta, sa, load, runnable, running, is_task))
return 0;
return 1;
@@ -272,6 +277,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
*/
sa->load_avg = div_u64(load * sa->load_sum, divider);
sa->runnable_avg = div_u64(sa->runnable_sum, divider);
+ sa->waiting_avg = div_u64(sa->waiting_sum, divider);
WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
}
@@ -303,7 +309,7 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
{
- if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
+ if (___update_load_sum(now, &se->avg, 0, 0, 0, false)) {
___update_load_avg(&se->avg, se_weight(se));
trace_pelt_se_tp(se);
return 1;
@@ -314,10 +320,17 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ bool is_task = entity_is_task(se);
+
+ if (is_task)
+ rq_of(cfs_rq)->cfs.avg.waiting_avg -= se->avg.waiting_avg;
+
if (___update_load_sum(now, &se->avg, !!se->on_rq, se_runnable(se),
- cfs_rq->curr == se)) {
+ cfs_rq->curr == se, is_task)) {
___update_load_avg(&se->avg, se_weight(se));
+ if (is_task)
+ rq_of(cfs_rq)->cfs.avg.waiting_avg += se->avg.waiting_avg;
cfs_se_util_change(&se->avg);
trace_pelt_se_tp(se);
return 1;
@@ -331,7 +344,8 @@ int __update_load_avg_cfs_rq(u64 now, struct cfs_rq *cfs_rq)
if (___update_load_sum(now, &cfs_rq->avg,
scale_load_down(cfs_rq->load.weight),
cfs_rq->h_nr_running,
- cfs_rq->curr != NULL)) {
+ cfs_rq->curr != NULL,
+ false)) {
___update_load_avg(&cfs_rq->avg, 1);
trace_pelt_cfs_tp(cfs_rq);
@@ -357,7 +371,8 @@ int update_rt_rq_load_avg(u64 now, struct rq *rq, int running)
if (___update_load_sum(now, &rq->avg_rt,
running,
running,
- running)) {
+ running,
+ false)) {
___update_load_avg(&rq->avg_rt, 1);
trace_pelt_rt_tp(rq);
@@ -383,7 +398,8 @@ int update_dl_rq_load_avg(u64 now, struct rq *rq, int running)
if (___update_load_sum(now, &rq->avg_dl,
running,
running,
- running)) {
+ running,
+ false)) {
___update_load_avg(&rq->avg_dl, 1);
trace_pelt_dl_tp(rq);
@@ -414,7 +430,8 @@ int update_hw_load_avg(u64 now, struct rq *rq, u64 capacity)
if (___update_load_sum(now, &rq->avg_hw,
capacity,
capacity,
- capacity)) {
+ capacity,
+ false)) {
___update_load_avg(&rq->avg_hw, 1);
trace_pelt_hw_tp(rq);
return 1;
@@ -462,11 +479,13 @@ int update_irq_load_avg(struct rq *rq, u64 running)
ret = ___update_load_sum(rq->clock - running, &rq->avg_irq,
0,
0,
- 0);
+ 0,
+ false);
ret += ___update_load_sum(rq->clock, &rq->avg_irq,
1,
1,
- 1);
+ 1,
+ false);
if (ret) {
___update_load_avg(&rq->avg_irq, 1);
@@ -536,7 +555,7 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
if (unlikely(!delta))
return util;
- accumulate_sum(delta << sched_pelt_lshift, &sa, 1, 0, 1);
+ accumulate_sum(delta << sched_pelt_lshift, &sa, 1, 0, 1, false);
___update_load_avg(&sa, 0);
return sa.util_avg;
@@ -555,7 +574,7 @@ u64 approximate_runtime(unsigned long util)
return runtime;
while (sa.util_avg < util) {
- accumulate_sum(delta, &sa, 1, 0, 1);
+ accumulate_sum(delta, &sa, 1, 0, 1, false);
___update_load_avg(&sa, 0);
runtime++;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 13/16] sched/schedutil: Take into account waiting_avg in apply_dvfs_headroom
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (11 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 12/16] sched/pelt: Add new waiting_avg to record when runnable && !running Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-08-20 16:35 ` [RFC PATCH 14/16] sched/schedutil: Ignore dvfs headroom when util is decaying Qais Yousef
` (3 subsequent siblings)
16 siblings, 0 replies; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
We now have three sources of delays:
1. How often we send cpufreq updates
2. How often we update util_avg
3. How long tasks wait in RUNNABLE to become RUNNING
The headroom should cater for all these types of delays to ensure the
system is running at an adequate performance point.
We want to pick the maximum headroom required by any of these sources of
delays.
TODO: the signal should use task clock not pelt as this should be
real time based and we don't care about invariance.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
kernel/sched/cpufreq_schedutil.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 94e35b7c972d..318b09bc4ab1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -259,10 +259,15 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* dvfs_update_delay of the cpufreq governor or min(curr.se.slice, TICK_US),
* whichever is higher.
*
+ * Also take into accounting how long tasks have been waiting in runnable but
+ * !running state. If it is high, it means we need higher DVFS headroom to
+ * reduce it.
+ *
* XXX: Should we provide headroom when the util is decaying?
*/
static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util, int cpu)
{
+ unsigned long update_headroom, waiting_headroom;
struct rq *rq = cpu_rq(cpu);
u64 delay;
@@ -276,7 +281,10 @@ static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util, int c
delay = TICK_USEC;
delay = max(delay, per_cpu(dvfs_update_delay, cpu));
- return approximate_util_avg(util, delay);
+ update_headroom = approximate_util_avg(util, delay);
+ waiting_headroom = util + READ_ONCE(rq->cfs.avg.waiting_avg);
+
+ return max(update_headroom, waiting_headroom);
}
unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 14/16] sched/schedutil: Ignore dvfs headroom when util is decaying
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (12 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 13/16] sched/schedutil: Take into account waiting_avg in apply_dvfs_headroom Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-08-22 5:29 ` Sultan Alsawaf (unemployed)
2024-09-18 10:40 ` Christian Loehle
2024-08-20 16:35 ` [RFC PATCH 15/16] sched/fair: Enable disabling util_est via rampup_multiplier Qais Yousef
` (2 subsequent siblings)
16 siblings, 2 replies; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
If util is decaying, it means we're idling or doing less work and are
already running at a higher value. No need to apply any dvfs headroom in
this case.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
kernel/sched/cpufreq_schedutil.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 318b09bc4ab1..4a1a8b353d51 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -9,6 +9,7 @@
#define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)
DEFINE_PER_CPU_READ_MOSTLY(unsigned long, response_time_mult);
+DEFINE_PER_CPU(unsigned long, last_update_util);
struct sugov_tunables {
struct gov_attr_set attr_set;
@@ -262,15 +263,19 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
* Also take into accounting how long tasks have been waiting in runnable but
* !running state. If it is high, it means we need higher DVFS headroom to
* reduce it.
- *
- * XXX: Should we provide headroom when the util is decaying?
*/
static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util, int cpu)
{
- unsigned long update_headroom, waiting_headroom;
+ unsigned long update_headroom, waiting_headroom, prev_util;
struct rq *rq = cpu_rq(cpu);
u64 delay;
+ prev_util = per_cpu(last_update_util, cpu);
+ per_cpu(last_update_util, cpu) = util;
+
+ if (util < prev_util)
+ return util;
+
/*
* What is the possible worst case scenario for updating util_avg, ctx
* switch or TICK?
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 15/16] sched/fair: Enable disabling util_est via rampup_multiplier
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (13 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 14/16] sched/schedutil: Ignore dvfs headroom when util is decaying Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-08-20 16:35 ` [RFC PATCH 16/16] sched/fair: Don't mess with util_avg post init Qais Yousef
2024-09-16 12:21 ` [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Dietmar Eggemann
16 siblings, 0 replies; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
util_est is a great feature that enables busy tasks with long sleep
times to maintain their perf level. But it can also be expensive in
terms of power for tasks that have no such perf requirements and just
happened to be busy in their last activation.
If a task sets its rampup_multiplier to 0, it indicates that it is
happy to glide along with the system default response and doesn't
require responsiveness. We can use that to further imply that the task
is happy for its util to decay over long sleeps too, and disable
util_est.
XXX: This could be overloading this QoS. We could add a separate more
explicit QoS to disable util_est for tasks that don't care.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a8dbba0b755e..ad72db5a266c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4918,6 +4918,14 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
if (!sched_feat(UTIL_EST))
return;
+ /*
+ * rampup_multiplier = 0 indicates util_est is disabled.
+ */
+ if (!p->sched_qos.rampup_multiplier) {
+ ewma = 0;
+ goto done;
+ }
+
/* Get current estimate of utilization */
ewma = READ_ONCE(p->se.avg.util_est);
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* [RFC PATCH 16/16] sched/fair: Don't mess with util_avg post init
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (14 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 15/16] sched/fair: Enable disabling util_est via rampup_multiplier Qais Yousef
@ 2024-08-20 16:35 ` Qais Yousef
2024-09-16 12:21 ` [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Dietmar Eggemann
16 siblings, 0 replies; 37+ messages in thread
From: Qais Yousef @ 2024-08-20 16:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel, Qais Yousef
The extrapolation logic for util_avg for newly forked tasks tries to
crystal ball the task's demand. This has worked well when the system
didn't have the means to help these tasks otherwise. But now we have
util_est, which will ramp up faster, and uclamp_min to ensure a good
starting point if they really care.
Since we really can't crystal ball the behavior, giving the same
starting value to all forked tasks is more consistent, and it helps to
preserve system resources for tasks to compete for if they truly need
them. So set the initial util_avg to 0 when the util_est feature is
enabled.
This should not impact workloads that need best single-threaded
performance (like geekbench), given the previous improvements introduced
to help with faster rampup to reach the max perf point more coherently
and consistently across systems.
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
kernel/sched/fair.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad72db5a266c..45be77d1112f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1031,6 +1031,19 @@ void init_entity_runnable_average(struct sched_entity *se)
}
/*
+ * When util_est is used, the tasks can rampup much faster by default. And with
+ * the rampup_multiplier, tasks can ask for faster rampup after fork. And with
+ * uclamp, they can ensure a min perf requirement. Given all these factors, we
+ * keep util_avg at 0 as we can't crystal ball the task demand after fork.
+ * Userspace have enough ways to ensure good perf for tasks after fork. Keeping
+ * the util_avg to 0 is good way to ensure a uniform start for all tasks. And
+ * it is good to preserve precious resources. Truly busy forked tasks can
+ * compete for the resources without the need for initial 'cheat' to ramp them
+ * up automagically.
+ *
+ * When util_est is not present, the extrapolation logic below will still
+ * apply.
+ *
* With new tasks being created, their initial util_avgs are extrapolated
* based on the cfs_rq's current util_avg:
*
@@ -1080,6 +1093,12 @@ void post_init_entity_util_avg(struct task_struct *p)
return;
}
+ /*
+ * Tasks can rampup faster with util_est, so don't mess with util_avg.
+ */
+ if (sched_feat(UTIL_EST))
+ return;
+
if (cap > 0) {
if (cfs_rq->avg.util_avg != 0) {
sa->util_avg = cfs_rq->avg.util_avg * se_weight(se);
--
2.34.1
^ permalink raw reply related [flat|nested] 37+ messages in thread
* Re: [RFC PATCH 04/16] sched/fair: Remove magic hardcoded margin in fits_capacity()
2024-08-20 16:35 ` [RFC PATCH 04/16] sched/fair: Remove magic hardcoded margin in fits_capacity() Qais Yousef
@ 2024-08-22 5:09 ` Sultan Alsawaf (unemployed)
2024-09-17 19:41 ` Dietmar Eggemann
0 siblings, 1 reply; 37+ messages in thread
From: Sultan Alsawaf (unemployed) @ 2024-08-22 5:09 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
John Stultz, linux-pm, linux-kernel
Hi Qais,
On Tue, Aug 20, 2024 at 05:35:00PM +0100, Qais Yousef wrote:
> Replace hardcoded margin value in fits_capacity() with better dynamic
> logic.
>
> 80% margin is a magic value that has served its purpose for now, but it
> no longer fits the variety of systems that exist today. If a system is
> over powered specifically, this 80% will mean we leave a lot of capacity
> unused before we decide to upmigrate on HMP system.
>
> On many systems the little cores are under powered and ability to
> migrate faster away from them is desired.
>
> Redefine misfit migration to mean the utilization threshold at which the
> task would become misfit at the next load balance event assuming it
> becomes an always running task.
>
> To calculate this threshold, we use the new approximate_util_avg()
> function to find out the threshold, based on arch_scale_cpu_capacity()
> the task will be misfit if it continues to run for a TICK_USEC which is
> our worst case scenario for when misfit migration will kick in.
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> kernel/sched/core.c | 1 +
> kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++--------
> kernel/sched/sched.h | 1 +
> 3 files changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6d35c48239be..402ee4947ef0 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8266,6 +8266,7 @@ void __init sched_init(void)
> rq->sd = NULL;
> rq->rd = NULL;
> rq->cpu_capacity = SCHED_CAPACITY_SCALE;
> + rq->fits_capacity_threshold = SCHED_CAPACITY_SCALE;
> rq->balance_callback = &balance_push_callback;
> rq->active_balance = 0;
> rq->next_balance = jiffies;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9057584ec06d..e5e986af18dc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -95,11 +95,15 @@ int __weak arch_asym_cpu_priority(int cpu)
> }
>
> /*
> - * The margin used when comparing utilization with CPU capacity.
> - *
> - * (default: ~20%)
> + * fits_capacity() must ensure that a task will not be 'stuck' on a CPU with
> + * lower capacity for too long. This the threshold is the util value at which
> + * if a task becomes always busy it could miss misfit migration load balance
> + * event. So we consider a task is misfit before it reaches this point.
> */
> -#define fits_capacity(cap, max) ((cap) * 1280 < (max) * 1024)
> +static inline bool fits_capacity(unsigned long util, int cpu)
> +{
> + return util < cpu_rq(cpu)->fits_capacity_threshold;
> +}
>
> /*
> * The margin used when comparing CPU capacities.
> @@ -4978,14 +4982,13 @@ static inline int util_fits_cpu(unsigned long util,
> unsigned long uclamp_max,
> int cpu)
> {
> - unsigned long capacity = capacity_of(cpu);
> unsigned long capacity_orig;
> bool fits, uclamp_max_fits;
>
> /*
> * Check if the real util fits without any uclamp boost/cap applied.
> */
> - fits = fits_capacity(util, capacity);
> + fits = fits_capacity(util, cpu);
>
> if (!uclamp_is_used())
> return fits;
> @@ -9592,12 +9595,33 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
> {
> unsigned long capacity = scale_rt_capacity(cpu);
> struct sched_group *sdg = sd->groups;
> + struct rq *rq = cpu_rq(cpu);
> + u64 limit;
>
> if (!capacity)
> capacity = 1;
>
> - cpu_rq(cpu)->cpu_capacity = capacity;
> - trace_sched_cpu_capacity_tp(cpu_rq(cpu));
> + rq->cpu_capacity = capacity;
> + trace_sched_cpu_capacity_tp(rq);
> +
> + /*
> + * Calculate the util at which the task must be considered a misfit.
> + *
> + * We must ensure that a task experiences the same ramp-up time to
> + * reach max performance point of the system regardless of the CPU it
> + * is running on (due to invariance, time will stretch and task will
> + * take longer to achieve the same util value compared to a task
> + * running on a big CPU) and a delay in misfit migration which depends
> + * on TICK doesn't end up hurting it as it can happen after we would
> + * have crossed this threshold.
> + *
> + * To ensure that invariance is taken into account, we don't scale time
> + * and use it as-is; approximate_util_avg() will then let us know our
> + * threshold.
> + */
> + limit = approximate_runtime(arch_scale_cpu_capacity(cpu)) * USEC_PER_MSEC;
Perhaps it makes more sense to use `capacity` here instead of
`arch_scale_cpu_capacity(cpu)`? Seems like reduced capacity due to HW pressure
(and IRQs + RT util) should be considered, e.g. for a capacity inversion due to
HW pressure on a mid core that results in a little core being faster.
Also, multiplying by the PELT period (1024 us) rather than USEC_PER_MSEC would
be more accurate.
> + limit -= TICK_USEC; /* sd->balance_interval is more accurate */
I think `limit` could easily wrap here, especially with a 100 Hz tick, and make
it seem like an ultra-slow core (e.g. due to HW pressure) can suddenly fit any
task.
How about `lsub_positive(&limit, TICK_USEC)` instead?
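A quick userspace sketch of the hazard; `sub_clamped()` is a stand-in for the kernel's `lsub_positive()`, and the 3 ms ramp-to-capacity runtime for the slow core is an illustrative value:

```c
#include <stdint.h>

#define TICK_USEC_AT_100HZ 10000ULL

/* Stand-in for the kernel's lsub_positive(): subtract, clamping at zero. */
static uint64_t sub_clamped(uint64_t a, uint64_t b)
{
	return a > b ? a - b : 0;
}

/* limit as computed in update_cpu_capacity() above, with and without the
 * clamp; runtime_usec is a hypothetical ramp-to-capacity runtime. */
static uint64_t limit_plain(uint64_t runtime_usec)
{
	return runtime_usec - TICK_USEC_AT_100HZ;	/* can wrap */
}

static uint64_t limit_clamped(uint64_t runtime_usec)
{
	return sub_clamped(runtime_usec, TICK_USEC_AT_100HZ);
}
```

With a runtime below one tick, the plain subtraction wraps to a huge unsigned value (so the slow core appears to fit any task), while the clamped version degrades to 0.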
> + rq->fits_capacity_threshold = approximate_util_avg(0, limit);
>
> sdg->sgc->capacity = capacity;
> sdg->sgc->min_capacity = capacity;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 47f158b2cdc2..ab4672675b84 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1093,6 +1093,7 @@ struct rq {
> struct sched_domain __rcu *sd;
>
> unsigned long cpu_capacity;
> + unsigned long fits_capacity_threshold;
>
> struct balance_callback *balance_callback;
>
> --
> 2.34.1
>
Cheers,
Sultan
^ permalink raw reply	[flat|nested] 37+ messages in thread
* Re: [RFC PATCH 14/16] sched/schedutil: Ignore dvfs headroom when util is decaying
2024-08-20 16:35 ` [RFC PATCH 14/16] sched/schedutil: Ignore dvfs headroom when util is decaying Qais Yousef
@ 2024-08-22 5:29 ` Sultan Alsawaf (unemployed)
2024-09-18 10:40 ` Christian Loehle
1 sibling, 0 replies; 37+ messages in thread
From: Sultan Alsawaf (unemployed) @ 2024-08-22 5:29 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
John Stultz, linux-pm, linux-kernel
On Tue, Aug 20, 2024 at 05:35:10PM +0100, Qais Yousef wrote:
> It means we're idling or doing less work and are already running
> at a higher value. No need to apply any dvfs headroom in this case.
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> kernel/sched/cpufreq_schedutil.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 318b09bc4ab1..4a1a8b353d51 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -9,6 +9,7 @@
> #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)
>
> DEFINE_PER_CPU_READ_MOSTLY(unsigned long, response_time_mult);
> +DEFINE_PER_CPU(unsigned long, last_update_util);
>
> struct sugov_tunables {
> struct gov_attr_set attr_set;
> @@ -262,15 +263,19 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> * Also take into accounting how long tasks have been waiting in runnable but
> * !running state. If it is high, it means we need higher DVFS headroom to
> * reduce it.
> - *
> - * XXX: Should we provide headroom when the util is decaying?
> */
> static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util, int cpu)
> {
> - unsigned long update_headroom, waiting_headroom;
> + unsigned long update_headroom, waiting_headroom, prev_util;
> struct rq *rq = cpu_rq(cpu);
> u64 delay;
>
> + prev_util = per_cpu(last_update_util, cpu);
> + per_cpu(last_update_util, cpu) = util;
> +
> + if (util < prev_util)
> + return util;
> +
> /*
> * What is the possible worst case scenario for updating util_avg, ctx
> * switch or TICK?
> --
> 2.34.1
>
Hmm, after the changes in "sched: cpufreq: Remove magic 1.25 headroom from
sugov_apply_dvfs_headroom()", won't sugov_apply_dvfs_headroom() already decay
the headroom gracefully in step with the decaying util? I suspect that abruptly
killing the headroom entirely could be premature depending on the workload, and
lead to util bouncing back up due to the time dilation effect you described in
the cover letter.
Cheers,
Sultan
* Re: [RFC PATCH 03/16] sched/pelt: Add a new function to approximate runtime to reach given util
2024-08-20 16:34 ` [RFC PATCH 03/16] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
@ 2024-08-22 5:36 ` Sultan Alsawaf (unemployed)
2024-09-16 15:31 ` Christian Loehle
0 siblings, 1 reply; 37+ messages in thread
From: Sultan Alsawaf (unemployed) @ 2024-08-22 5:36 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
John Stultz, linux-pm, linux-kernel
On Tue, Aug 20, 2024 at 05:34:59PM +0100, Qais Yousef wrote:
> It is basically the ramp-up time from 0 to a given value. It will be used
> later to implement a new tunable to control response time for schedutil.
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> kernel/sched/pelt.c | 21 +++++++++++++++++++++
> kernel/sched/sched.h | 1 +
> 2 files changed, 22 insertions(+)
>
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 2ce83e880bd5..06cb881ba582 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -487,3 +487,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
>
> return sa.util_avg;
> }
> +
> +/*
> + * Approximate the amount of runtime in ms required to reach @util.
> + */
> +u64 approximate_runtime(unsigned long util)
> +{
> + struct sched_avg sa = {};
> + u64 delta = 1024; // period = 1024 = ~1ms
> + u64 runtime = 0;
> +
> + if (unlikely(!util))
> + return runtime;
Seems like this check can be removed since it's covered by the loop condition.
> +
> + while (sa.util_avg < util) {
> + accumulate_sum(delta, &sa, 1, 0, 1);
> + ___update_load_avg(&sa, 0);
> + runtime++;
> + }
I think this could be a lookup table (probably 1024 * u8), for constant-time
runtime approximation.
> +
> + return runtime;
> +}
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 294c6769e330..47f158b2cdc2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3065,6 +3065,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
> unsigned long max);
>
> unsigned long approximate_util_avg(unsigned long util, u64 delta);
> +u64 approximate_runtime(unsigned long util);
>
> /*
> * Verify the fitness of task @p to run on @cpu taking into account the
> --
> 2.34.1
>
Cheers,
Sultan
* Re: [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
` (15 preceding siblings ...)
2024-08-20 16:35 ` [RFC PATCH 16/16] sched/fair: Don't mess with util_avg post init Qais Yousef
@ 2024-09-16 12:21 ` Dietmar Eggemann
16 siblings, 0 replies; 37+ messages in thread
From: Dietmar Eggemann @ 2024-09-16 12:21 UTC (permalink / raw)
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Rafael J. Wysocki, Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, John Stultz, linux-pm, linux-kernel
On 20/08/2024 18:34, Qais Yousef wrote:
> This series is a re-incarnation of Remove Hardcoded Margings posted a while ago
>
> https://lore.kernel.org/lkml/20231208002342.367117-1-qyousef@layalina.io/
>
Looks like some of the ideas were already discussed under
https://lkml.kernel.org/r/20230827233203.1315953-1-qyousef@layalina.io
back in Aug/Sept 23.
> The original series attempted to address response time related issues stemming
> from hardcoding migration margin in fits_capacity() on HMP system, and DVFS
> headroom which had a constant 25% boost that is bad for power and thermal on
> powerful systems. Saving power was the main goal by reducing these values to
> the smallest possible value automatically based on anticipated worst case
> scenario.
>
> A tricky point was uncovered and demonstrated in the migration margin table in
> this posting
>
> https://lore.kernel.org/lkml/20240205223344.2280519-4-qyousef@layalina.io/
>
> is that to make the system responsive to sudden changes, we actually need
> a larger migration margin the smaller the core capacity is
>
> cap threshold % threshold-tick %
> 0 0 0 0 0
> 16 0 0 0 0
> 32 1 3.12 0 0
> 48 3 6.25 2 4.16
> 64 4 6.25 2 3.12
> 80 6 7.5 5 6.25
> 96 10 10.41 8 8.33
> 112 14 12.5 11 9.82
> 128 18 14.06 16 12.5
> 144 21 14.58 18 12.5
> 160 26 16.25 23 14.37
Not sure what this 'misfit threshold' should be?
160 * 1024 / 1280 = 128 so threshold = 32 ?
I know that you want to make the threshold bigger for smaller CPUs
[PATCH 04/16]. I get:
update_cpu_capacity(): cpu=0 arch_scale_cpu_capacity=160
approx_runtime=8 limit=4000 rq->fits_capacity_threshold=83
for the little CPU on Pix6, I just don't know how this relates to 26 or 23.
> 176 33 18.75 29 16.47
> 192 39 20.31 35 18.22
> 208 47 22.59 43 20.67
> 224 55 24.55 50 22.32
> 240 63 26.25 59 24.58
> 256 73 28.51 68 26.56
> 272 82 30.14 77 28.30
> 288 93 32.29 87 30.20
> 304 103 33.88 97 31.90
> 320 114 35.62 108 33.75
> 336 126 37.5 120 35.71
> 352 138 39.20 132 37.5
> 368 151 41.03 144 39.13
> 384 163 42.44 157 40.88
>
> The current 80% margin is valid for CPU with capacities in the 700-750 range,
> which might have been true in the original generations of HMP systems.
>
> 704 557 79.11 550 78.12
> 720 578 80.27 572 79.44
> 736 606 82.33 600 81.52
> 752 633 84.17 627 83.37
>
> This result contradicts the original goal of saving power as it indicates we
> must be more aggressive with the margin, while the original observation was
> that there are workloads with steady utilization that is hovering at a level
> that is higher than this margin but lower than the capacity of the CPU (mid
> CPUs particularly) and the aggressive upmigration is not desired, nor the
> higher push to run at max freq where we could have run at a lower freq with no
> impact on perf.
>
> Further analysis was done using a simple rampup [1] test that spawns a busy
> task that starts from util_avg/est = 0 and never goes to sleep. The purpose
> is to measure
> the actual system response time for workloads that are bursty and need to
> transition from lower to higher performance level quickly.
>
> This led to a more surprising discovery due to utilization invariance;
> I call it the black hole effect.
>
> There's a black hole in the scheduler:
> ======================================
>
> It is no surprise to anyone that DVFS and HMP systems have a time stretching
> effect where the same workload will take longer to do the same amount of work
> the lower the frequency/capacity.
>
> This is countered in the system via clock_pelt, which is central to
> implementing utilization invariance. This helps ensure that the utilization
> signal still accurately represents the computation demand of sched_entities.
>
> But this introduces the black hole effect of time dilation. The concept of
> the passage of time is now different from the task's perspective compared to
> an external observer's. The task will think 1ms has passed, but depending on
> the capacity or the freq, 25 or even 30ms may have passed in reality from
> the external observer's point of view.
But only the PELT angle (and here especially p->se.avg.util_avg) of the
task related accounting, right?
> This has a terrible impact on utilization signal rise time. And since the
> utilization signal is central to many scheduler decisions, like estimating
> how loaded the CPU is, whether a task is misfit, and what freq to run at when
> schedutil is being used, this leads to suboptimal decisions being made and
> gives the external observer (userspace) the impression that the system is
> not responsive or reactive. This manifests as problems like:
This can be described by:
t = 1/cap_factor * hl * ln(1 - S_n/S_inf)/ln(0.5)
cap_factor ... arch_scale_cpu_capacity(cpu)/SCHED_CAPACITY_SCALE
S_n ... partial sum
S_inf ... infinite sum
hl ... halflife
t_1024(cap=1024) = 323ms
t_1024(cap=160) = 2063ms
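The relation can be checked with a small libm-free sketch: ln(1 - S_n/S_inf)/ln(0.5) equals log2(1024/(1024 - util)) for S_n/S_inf = util/1024, exact when that ratio is a power of two. The discrete kernel sum gives 323/2063 ms where this closed form gives 320/2048 ms:

```c
#include <stdint.h>

#define SCHED_CAPACITY_SCALE	1024
#define PELT_HALFLIFE_MS	32

/* Integer log2, enough for power-of-two ratios. */
static unsigned int ilog2_u32(uint32_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * t = 1/cap_factor * hl * ln(1 - S_n/S_inf)/ln(0.5), rewritten as
 * hl * log2(1024/(1024 - util)), stretched by 1024/cap for slower CPUs.
 */
static uint64_t rampup_ms(unsigned int util, unsigned int cap)
{
	unsigned int ratio = SCHED_CAPACITY_SCALE /
			     (SCHED_CAPACITY_SCALE - util);

	return (uint64_t)PELT_HALFLIFE_MS * ilog2_u32(ratio) *
	       SCHED_CAPACITY_SCALE / cap;
}
```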
[...]
> Computational domain vs Time domain:
> ------------------------------------
>
> The util_avg is a good representation of the compute demand of periodic
> tasks, and should remain as such. But when tasks are no longer periodic,
> looking at the computational domain doesn't make sense, as we have no idea
> what the actual compute demand of the task is; it's in transition. During
> this transition we need to fall back to a time-domain based signal, which is
> simply done by ignoring invariance and letting the util accumulate based on
> the observer's time.
And this is achieved by:
time = approximate_runtime(util)
and
util_avg_end = approximate_util_avg(util_avg_start, time_delta)
These functions allow you to switch between both domains. They do not
consider invariance and are based on the 'util_avg - time curve' of the
big CPU at max CPU frequency.
> Coherent response time:
> -----------------------
>
> Moving transient tasks to be based on the observer's time will create a
> coherent and constant response time, which is the time it takes util_avg to
> ramp up from 0 to max on the biggest core running at max freq (or
> performance level 1024/max).
>
> IOW, the rampup time of util signal should appear to be the same on all
> capacities/frequencies as if we are running at the highest performance level
> all the time. This will give the observer (userspace) the expected behavior of
> things moving through the motions in a constant response time regardless of
> initial conditions.
>
> util_est extension:
> -------------------
>
> The extension is quite simple. util_est currently latches to util_avg at
> enqueue/dequeue to act as a hold function for when busy tasks sleep for a
> long period and decay prematurely.
>
> The extension is to account for RUNNING time of the task in util_est too, which
> is currently ignored.
>
> When a task is RUNNING, we accumulate delta_exec across context switches and
> accumulate util_est as we're accumulating util_avg, but simply without any
> invariance taken into account. This means when tasks are RUNNABLE, and continue
> to run, util_est will act as our time based signal to help with the faster and
> 'constant' rampup response.
>
> Periodic vs Transient tasks:
> ----------------------------
>
> It is important to make a distinction now between tasks that are periodic,
> whose util_avg is a good, faithful representation of their compute demand,
> and transient tasks that need help to move faster to their next steady
> state point.
>
> In the code this distinction is made based on util_avg. In theory (I think we
> have bugs, will send a separate report), util_avg should be near constant for
Do you mean bugs in maintaining util_avg signal for tasks/taskgroups or
cfs_rq?
> a periodic task. So simply transient tasks are ones that lead to util_avg being
> higher across activations. And this is our trigger point to know whether we
Activations as in enqueue_entity()/dequeue_entity() or
set_next_entity()/put_prev_entity()?
[...]
> Patch 7 adds a multiplier to change the PELT time constant. I am not sure if
> this is necessary now after introducing per-task rampup multipliers. The
> original rationale was to help cater for different hardware against the
> constant util_avg response time. I might drop this in future postings.
> I haven't tested the latest version, which follows a new implementation
> suggested by Vincent.
This one definitely stands out here. I remember that the PELT halflife
multiplier never had a chance in mainline so far (compile-time or
boot-time) since the actual problem it solves couldn't be explained
sufficiently.
In previous discussions we went via the UTIL_EST_FASTER discussion to
'runnable boosting', which is now in mainline.
https://lkml.kernel.org/r/20230907130805.GE10955@noisy.programming.kicks-ass.net
[...]
* Re: [RFC PATCH 03/16] sched/pelt: Add a new function to approximate runtime to reach given util
2024-08-22 5:36 ` Sultan Alsawaf (unemployed)
@ 2024-09-16 15:31 ` Christian Loehle
0 siblings, 0 replies; 37+ messages in thread
From: Christian Loehle @ 2024-09-16 15:31 UTC (permalink / raw)
To: Sultan Alsawaf (unemployed), Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
John Stultz, linux-pm, linux-kernel
On 8/22/24 06:36, Sultan Alsawaf (unemployed) wrote:
> On Tue, Aug 20, 2024 at 05:34:59PM +0100, Qais Yousef wrote:
>> It is basically the ramp-up time from 0 to a given value. It will be used
>> later to implement a new tunable to control response time for schedutil.
>>
>> Signed-off-by: Qais Yousef <qyousef@layalina.io>
>> ---
>> kernel/sched/pelt.c | 21 +++++++++++++++++++++
>> kernel/sched/sched.h | 1 +
>> 2 files changed, 22 insertions(+)
>>
>> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
>> index 2ce83e880bd5..06cb881ba582 100644
>> --- a/kernel/sched/pelt.c
>> +++ b/kernel/sched/pelt.c
>> @@ -487,3 +487,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
>>
>> return sa.util_avg;
>> }
>> +
>> +/*
>> + * Approximate the amount of runtime in ms required to reach @util.
>> + */
>> +u64 approximate_runtime(unsigned long util)
>> +{
>> + struct sched_avg sa = {};
>> + u64 delta = 1024; // period = 1024 = ~1ms
>> + u64 runtime = 0;
>> +
>> + if (unlikely(!util))
>> + return runtime;
>
> Seems like this check can be removed since it's covered by the loop condition.
>
>> +
>> + while (sa.util_avg < util) {
>> + accumulate_sum(delta, &sa, 1, 0, 1);
>> + ___update_load_avg(&sa, 0);
>> + runtime++;
>> + }
>
> I think this could be a lookup table (probably 1024 * u8), for constant-time
> runtime approximation.
Somewhat agreed. Given that we don't seem to care about the 2.4% error
margin, we could tolerate even more error here. Something like 50 values
should be more than enough (and might fit nicely into a simple formula,
too?).
FWIW
util: approximate_runtime(util)
160: 8
192: 10
224: 12
256: 14
288: 16
320: 18
352: 20
384: 22
416: 25
448: 27
480: 30
512: 32
544: 35
576: 39
608: 42
640: 46
672: 50
704: 54
736: 59
768: 64
800: 71
832: 78
864: 86
896: 96
928: 109
960: 128
992: 159
1024: 323
Fine for a RFC though.
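As a rough sketch of the "~50 values" idea, the measured points above can serve directly as anchors for a linearly interpolated lookup (u16 rather than u8, since the worst case of 323 ms overflows a u8); errors stay within a couple of ms except near saturation:

```c
#include <stddef.h>
#include <stdint.h>

/* Anchor points taken from the measured approximate_runtime() values above. */
static const struct { uint16_t util; uint16_t ms; } anchors[] = {
	{    0,   0 }, {  160,   8 }, {  192,  10 }, {  224,  12 },
	{  256,  14 }, {  288,  16 }, {  320,  18 }, {  352,  20 },
	{  384,  22 }, {  416,  25 }, {  448,  27 }, {  480,  30 },
	{  512,  32 }, {  544,  35 }, {  576,  39 }, {  608,  42 },
	{  640,  46 }, {  672,  50 }, {  704,  54 }, {  736,  59 },
	{  768,  64 }, {  800,  71 }, {  832,  78 }, {  864,  86 },
	{  896,  96 }, {  928, 109 }, {  960, 128 }, {  992, 159 },
	{ 1024, 323 },
};

static uint16_t approximate_runtime_lut(unsigned int util)
{
	size_t n = sizeof(anchors) / sizeof(anchors[0]);
	size_t i;

	if (util >= anchors[n - 1].util)
		return anchors[n - 1].ms;

	for (i = 1; i < n; i++) {
		if (util <= anchors[i].util) {
			/* Linear interpolation between bracketing anchors. */
			unsigned int du = anchors[i].util - anchors[i - 1].util;
			unsigned int dm = anchors[i].ms - anchors[i - 1].ms;

			return anchors[i - 1].ms +
			       ((util - anchors[i - 1].util) * dm + du / 2) / du;
		}
	}
	return anchors[n - 1].ms;
}
```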
* Re: [RFC PATCH 06/16] sched/schedutil: Add a new tunable to dictate response time
2024-08-20 16:35 ` [RFC PATCH 06/16] sched/schedutil: Add a new tunable to dictate response time Qais Yousef
@ 2024-09-16 22:22 ` Dietmar Eggemann
2024-09-17 10:22 ` Christian Loehle
0 siblings, 1 reply; 37+ messages in thread
From: Dietmar Eggemann @ 2024-09-16 22:22 UTC (permalink / raw)
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Rafael J. Wysocki, Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, John Stultz, linux-pm, linux-kernel
On 20/08/2024 18:35, Qais Yousef wrote:
> The new tunable, response_time_ms, allows us to speed up or slow down
> the response time of the policy to meet the perf, power and thermal
> characteristics desired by the user/sysadmin. There's no single universal
> trade-off that we can apply for all systems even if they use the same
> SoC. The form factor of the system, the dominant use case, and in case
> of battery powered systems, the size of the battery and presence or
> absence of active cooling can play a big role on what would be best to
> use.
>
> The new tunable provides sensible defaults, yet gives the user/sysadmin
> the power to control the response time, if they wish to.
>
> This tunable is applied before we apply the DVFS headroom.
>
> The default behavior of applying 1.25 headroom can be re-instated easily
> now. But we continue to keep the min required headroom to overcome
> hardware limitation in its speed to change DVFS. And any additional
> headroom to speed things up must be applied by userspace to match their
> expectation for best perf/watt as it dictates a type of policy that will
> be better for some systems, but worse for others.
>
> There's a whitespace clean up included in sugov_start().
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> Documentation/admin-guide/pm/cpufreq.rst | 17 +++-
> drivers/cpufreq/cpufreq.c | 4 +-
> include/linux/cpufreq.h | 3 +
> kernel/sched/cpufreq_schedutil.c | 115 ++++++++++++++++++++++-
> 4 files changed, 132 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
> index 6adb7988e0eb..fa0d602a920e 100644
> --- a/Documentation/admin-guide/pm/cpufreq.rst
> +++ b/Documentation/admin-guide/pm/cpufreq.rst
> @@ -417,7 +417,7 @@ is passed by the scheduler to the governor callback which causes the frequency
> to go up to the allowed maximum immediately and then draw back to the value
> returned by the above formula over time.
>
> -This governor exposes only one tunable:
> +This governor exposes two tunables:
>
> ``rate_limit_us``
> Minimum time (in microseconds) that has to pass between two consecutive
> @@ -427,6 +427,21 @@ This governor exposes only one tunable:
> The purpose of this tunable is to reduce the scheduler context overhead
> of the governor which might be excessive without it.
>
> +``respone_time_ms``
> + Amount of time (in milliseconds) required to ramp the policy from
> + lowest to highest frequency. Can be decreased to speed up the
^^^^^^^^^^^^^^^^^
This has changed IMHO. Should be the time from lowest (or better 0) to
second highest frequency.
https://lkml.kernel.org/r/20230827233203.1315953-6-qyousef@layalina.io
[...]
> @@ -59,6 +63,70 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>
> /************************ Governor internals ***********************/
>
> +static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
> +{
> + int cpu = cpumask_first(sg_policy->policy->cpus);
> + unsigned long cap = arch_scale_cpu_capacity(cpu);
> + unsigned int max_freq, sec_max_freq;
> +
> + max_freq = sg_policy->policy->cpuinfo.max_freq;
> + sec_max_freq = __resolve_freq(sg_policy->policy,
> + max_freq - 1,
> + CPUFREQ_RELATION_H);
> +
> + /*
> + * We will request max_freq as soon as util crosses the capacity at
> + * second highest frequency. So effectively our response time is the
> + * util at which we cross the cap@2nd_highest_freq.
> + */
> + cap = sec_max_freq * cap / max_freq;
> +
> + return approximate_runtime(cap + 1);
> +}
Still uses the CPU capacity value based on dt-entry
capacity-dmips-mhz = <578> (CPU0 on juno-r0)
^^^
i.e. frequency invariance is not considered.
[ 1.943356] CPU0 max_freq=850000 sec_max_freq=775000 cap=578 cap_at_sec_max_opp=527 runtime=34
^^^^^^^
[ 1.957593] CPU1 max_freq=1100000 sec_max_freq=950000 cap=1024 cap_at_sec_max_opp=884 runtime=92
# cat /sys/devices/system/cpu/cpu*/cpu_capacity
446
^^^
1024
1024
446
446
446
[...]
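For reference, the cap@2nd_highest_freq values in the trace follow from linear frequency scaling, as sugov_calc_freq_response_ms() computes it; a sketch using the juno-r0 numbers quoted above (note this takes the dt-based capacity as input, which is Dietmar's point about the missing frequency invariance):

```c
/* cap at a given OPP, assuming capacity scales linearly with frequency. */
static unsigned long cap_at_freq(unsigned long cap, unsigned long freq,
				 unsigned long max_freq)
{
	return freq * cap / max_freq;
}
```

CPU0: cap_at_freq(578, 775000, 850000) yields the traced 527; CPU1: cap_at_freq(1024, 950000, 1100000) yields 884.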
* Re: [RFC PATCH 06/16] sched/schedutil: Add a new tunable to dictate response time
2024-09-16 22:22 ` Dietmar Eggemann
@ 2024-09-17 10:22 ` Christian Loehle
0 siblings, 0 replies; 37+ messages in thread
From: Christian Loehle @ 2024-09-17 10:22 UTC (permalink / raw)
To: Dietmar Eggemann, Qais Yousef, Ingo Molnar, Peter Zijlstra,
Vincent Guittot, Rafael J. Wysocki, Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, John Stultz, linux-pm, linux-kernel
On 9/16/24 23:22, Dietmar Eggemann wrote:
> On 20/08/2024 18:35, Qais Yousef wrote:
>> The new tunable, response_time_ms, allows us to speed up or slow down
>> the response time of the policy to meet the perf, power and thermal
>> characteristics desired by the user/sysadmin. There's no single universal
>> trade-off that we can apply for all systems even if they use the same
>> SoC. The form factor of the system, the dominant use case, and in case
>> of battery powered systems, the size of the battery and presence or
>> absence of active cooling can play a big role on what would be best to
>> use.
>>
>> The new tunable provides sensible defaults, yet gives the user/sysadmin
>> the power to control the response time, if they wish to.
>>
>> This tunable is applied before we apply the DVFS headroom.
>>
>> The default behavior of applying 1.25 headroom can be re-instated easily
>> now. But we continue to keep the min required headroom to overcome
>> hardware limitation in its speed to change DVFS. And any additional
>> headroom to speed things up must be applied by userspace to match their
>> expectation for best perf/watt as it dictates a type of policy that will
>> be better for some systems, but worse for others.
>>
>> There's a whitespace clean up included in sugov_start().
>>
>> Signed-off-by: Qais Yousef <qyousef@layalina.io>
>> ---
>> Documentation/admin-guide/pm/cpufreq.rst | 17 +++-
>> drivers/cpufreq/cpufreq.c | 4 +-
>> include/linux/cpufreq.h | 3 +
>> kernel/sched/cpufreq_schedutil.c | 115 ++++++++++++++++++++++-
>> 4 files changed, 132 insertions(+), 7 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
>> index 6adb7988e0eb..fa0d602a920e 100644
>> --- a/Documentation/admin-guide/pm/cpufreq.rst
>> +++ b/Documentation/admin-guide/pm/cpufreq.rst
>> @@ -417,7 +417,7 @@ is passed by the scheduler to the governor callback which causes the frequency
>> to go up to the allowed maximum immediately and then draw back to the value
>> returned by the above formula over time.
>>
>> -This governor exposes only one tunable:
>> +This governor exposes two tunables:
>>
>> ``rate_limit_us``
>> Minimum time (in microseconds) that has to pass between two consecutive
>> @@ -427,6 +427,21 @@ This governor exposes only one tunable:
>> The purpose of this tunable is to reduce the scheduler context overhead
>> of the governor which might be excessive without it.
>>
>> +``respone_time_ms``
s/respone/response
>> + Amount of time (in milliseconds) required to ramp the policy from
>> + lowest to highest frequency. Can be decreased to speed up the
> ^^^^^^^^^^^^^^^^^
>
> This has changed IMHO. Should be the time from lowest (or better 0) to
> second highest frequency.
>
> https://lkml.kernel.org/r/20230827233203.1315953-6-qyousef@layalina.io
>
> [...]
>
Isn't it even more complicated than that?
We have the headroom applied on top of response_time_ms, so response_time_ms
will be longer than the time it takes to reach the highest-cap OPP.
Furthermore, applying this to a big CPU, e.g. with an OPP0 cap of 200,
starting from 0 is (usually?) irrelevant, as we likely wouldn't be here if
we were at 0.
I get the intent, but conveying this in an understandable interface is hard.
* Re: [RFC PATCH 08/16] sched/fair: Extend util_est to improve rampup time
2024-08-20 16:35 ` [RFC PATCH 08/16] sched/fair: Extend util_est to improve rampup time Qais Yousef
@ 2024-09-17 19:21 ` Dietmar Eggemann
2024-10-14 16:04 ` Christian Loehle
1 sibling, 0 replies; 37+ messages in thread
From: Dietmar Eggemann @ 2024-09-17 19:21 UTC (permalink / raw)
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Rafael J. Wysocki, Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, John Stultz, linux-pm, linux-kernel
On 20/08/2024 18:35, Qais Yousef wrote:
> Utilization invariance can cause big delays. When tasks are running,
> accumulate a non-invariant version of utilization to help tasks settle
> down to their new util_avg values faster.
>
> Keep track of delta_exec during runnable across activations to help
> update util_est for a long running task accurately. util_est should
> still behave the same at enqueue/dequeue.
>
> Before this patch, a busy task ramping up would experience the
> following transitions, running on an M1 Mac Mini
[...]
> @@ -4890,16 +4890,20 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> if (!sched_feat(UTIL_EST))
> return;
>
> - /*
> - * Skip update of task's estimated utilization when the task has not
> - * yet completed an activation, e.g. being migrated.
> - */
> - if (!task_sleep)
> - return;
> -
> /* Get current estimate of utilization */
> ewma = READ_ONCE(p->se.avg.util_est);
>
> + /*
> + * If a task is running, update util_est ignoring utilization
> + * invariance so that if the task suddenly becomes busy we will rampup
> + * quickly to settle down to our new util_avg.
> + */
> + if (!task_sleep) {
> + ewma &= ~UTIL_AVG_UNCHANGED;
> + ewma = approximate_util_avg(ewma, p->se.delta_exec / 1000);
> + goto done;
> + }
> +
Can you not use the UTIL_EST_FASTER idea for that? I mean speeding up
ramp-up on little CPUs for truly ramping-up tasks, to fight the influence
of the invariant util_avg->util_est here.
https://lkml.kernel.org/r/Y2kLA8x40IiBEPYg@hirez.programming.kicks-ass.net
I do understand that runnable_avg boosting wont help here since we're
not fighting contention.
It uses the sum of all activations since wake-up so it should be faster
than just using the last activation.
It uses existing infrastructure: __accumulate_pelt_segments()
If you use it inside the task- and/or cpu-util functions, you don't need to
make util_est state handling more complicated (distinguishing periodic and
ramp-up tasks, including PATCH 09/16).
From your workload analysis, do you have examples of Android tasks which
are clearly ramp-up tasks and maybe also affine to the little CPUs
(thanks to Android BACKGROUND group) which would require this correction
of the invariant util_avg->util_est signals?
[...]
* Re: [RFC PATCH 04/16] sched/fair: Remove magic hardcoded margin in fits_capacity()
2024-08-22 5:09 ` Sultan Alsawaf (unemployed)
@ 2024-09-17 19:41 ` Dietmar Eggemann
0 siblings, 0 replies; 37+ messages in thread
From: Dietmar Eggemann @ 2024-09-17 19:41 UTC (permalink / raw)
To: Sultan Alsawaf (unemployed), Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, John Stultz, linux-pm,
linux-kernel
On 22/08/2024 07:09, Sultan Alsawaf (unemployed) wrote:
> Hi Qais,
>
> On Tue, Aug 20, 2024 at 05:35:00PM +0100, Qais Yousef wrote:
>> Replace hardcoded margin value in fits_capacity() with better dynamic
>> logic.
>>
>> 80% margin is a magic value that has served its purpose for now, but it
>> no longer fits the variety of systems that exist today. If a system is
>> over powered specifically, this 80% will mean we leave a lot of capacity
>> unused before we decide to upmigrate on HMP system.
>>
>> On many systems the little cores are under powered and ability to
>> migrate faster away from them is desired.
>>
>> Redefine misfit migration to mean the utilization threshold at which the
>> task would become misfit at the next load balance event assuming it
>> becomes an always running task.
>>
>> To calculate this threshold, we use the new approximate_util_avg()
>> function: based on arch_scale_cpu_capacity(), the task will be misfit if
>> it continues to run for TICK_USEC, which is our worst-case scenario for
>> when misfit migration will kick in.
[...]
>> + /*
>> + * Calculate the util at which the task must be considered a misfit.
>> + *
>> + * We must ensure that a task experiences the same ramp-up time to
>> + * reach max performance point of the system regardless of the CPU it
>> + * is running on (due to invariance, time will stretch and task will
>> + * take longer to achieve the same util value compared to a task
>> + * running on a big CPU) and a delay in misfit migration which depends
>> + * on TICK doesn't end up hurting it as it can happen after we would
>> + * have crossed this threshold.
>> + *
>> + * To ensure that invariance is taken into account, we don't scale time
>> + * and use it as-is; approximate_util_avg() will then let us know
>> + * our threshold.
>> + */
>> + limit = approximate_runtime(arch_scale_cpu_capacity(cpu)) * USEC_PER_MSEC;
>
> Perhaps it makes more sense to use `capacity` here instead of
> `arch_scale_cpu_capacity(cpu)`? Seems like reduced capacity due to HW pressure
> (and IRQs + RT util) should be considered, e.g. for a capacity inversion due to
> HW pressure on a mid core that results in a little core being faster.
If you want to keep it strictly 'uarch & freq-invariant' based, then it
wouldn't have to be called periodically in update_cpu_capacity(). Just
set rq->fits_capacity_threshold once after cpu_scale has been fully
(uArch & Freq) normalized.
[...]
* Re: [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS
2024-08-20 16:35 ` [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS Qais Yousef
@ 2024-09-17 20:09 ` Dietmar Eggemann
2024-09-17 21:43 ` Ricardo Neri
` (2 subsequent siblings)
3 siblings, 0 replies; 37+ messages in thread
From: Dietmar Eggemann @ 2024-09-17 20:09 UTC (permalink / raw)
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Rafael J. Wysocki, Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, John Stultz, linux-pm, linux-kernel
On 20/08/2024 18:35, Qais Yousef wrote:
[...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0c10e2afb52d..3d9794db58e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4906,7 +4906,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> if (!task_sleep) {
> if (task_util(p) > task_util_dequeued(p)) {
> ewma &= ~UTIL_AVG_UNCHANGED;
> - ewma = approximate_util_avg(ewma, p->se.delta_exec / 1000);
> + ewma = approximate_util_avg(ewma, (p->se.delta_exec/1000) * p->sched_qos.rampup_multiplier);
Isn't this exactly the idea from UTIL_EST_FASTER?
faster_est_approx(delta * 2) ... double speed even w/o contention?
[...]
* Re: [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS
2024-08-20 16:35 ` [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS Qais Yousef
2024-09-17 20:09 ` Dietmar Eggemann
@ 2024-09-17 21:43 ` Ricardo Neri
2024-09-18 21:21 ` Ricardo Neri
2024-10-14 16:06 ` Christian Loehle
2024-11-28 0:12 ` John Stultz
3 siblings, 1 reply; 37+ messages in thread
From: Ricardo Neri @ 2024-09-17 21:43 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
John Stultz, linux-pm, linux-kernel
On Tue, Aug 20, 2024 at 05:35:07PM +0100, Qais Yousef wrote:
> Bursty tasks are hard to predict. To use resources efficiently, the
> system would like to be exact as much as possible. But this poses
> a challenge for these bursty tasks that need to get access to more
> resources quickly.
>
> The new SCHED_QOS_RAMPUP_MULTIPLIER allows userspace to do that. As the
> name implies, it only helps them to transition to a higher performance
> state when they get _busier_. That is, perfectly periodic tasks by
> definition are not going through a transition and will run at a constant
> performance level. It is the tasks that need to transition from one
> periodic state to another periodic state that is at a higher level that
> this rampup_multiplier will help with. It also slows down the ewma decay
> of util_est which should help those bursty tasks to keep their faster
> rampup.
>
> This should work complementarily with uclamp. uclamp tells the system
> about min and max perf requirements which can be applied immediately.
>
> rampup_multiplier is about the reactiveness of the task to change.
> Specifically, to a change to a higher performance level. The task might
> not necessarily have min perf requirements, but it can have sudden
> bursts of changes that require a higher perf level, and it needs the
> system to provide this faster.
>
> TODO: update the sched_qos docs
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> include/linux/sched.h | 7 ++++
> include/uapi/linux/sched.h | 2 ++
> kernel/sched/core.c | 66 ++++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 6 ++--
> kernel/sched/syscalls.c | 38 ++++++++++++++++++++--
> 5 files changed, 115 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2e8c5a9ffa76..a30ee43a25fb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -404,6 +404,11 @@ struct sched_info {
> #endif /* CONFIG_SCHED_INFO */
> };
>
> +struct sched_qos {
> + DECLARE_BITMAP(user_defined, SCHED_QOS_MAX);
> + unsigned int rampup_multiplier;
> +};
> +
> /*
> * Integer metrics need fixed point arithmetic, e.g., sched/fair
> * has a few: load, load_avg, util_avg, freq, and capacity.
> @@ -882,6 +887,8 @@ struct task_struct {
>
> struct sched_info sched_info;
>
> + struct sched_qos sched_qos;
> +
> struct list_head tasks;
> #ifdef CONFIG_SMP
> struct plist_node pushable_tasks;
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 67ef99f64ddc..0baba91ba5b8 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -104,6 +104,8 @@ struct clone_args {
> };
>
> enum sched_qos_type {
> + SCHED_QOS_RAMPUP_MULTIPLIER,
> + SCHED_QOS_MAX,
> };
> #endif
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c91e6a62c7ab..54faa845cb29 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -152,6 +152,8 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
> */
> const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
>
> +unsigned int sysctl_sched_qos_default_rampup_multiplier = 1;
> +
> __read_mostly int scheduler_running;
>
> #ifdef CONFIG_SCHED_CORE
> @@ -4488,6 +4490,47 @@ static int sysctl_schedstats(struct ctl_table *table, int write, void *buffer,
> #endif /* CONFIG_SCHEDSTATS */
>
> #ifdef CONFIG_SYSCTL
> +static void sched_qos_sync_sysctl(void)
> +{
> + struct task_struct *g, *p;
> +
> + guard(rcu)();
> + for_each_process_thread(g, p) {
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + rq = task_rq_lock(p, &rf);
> + if (!test_bit(SCHED_QOS_RAMPUP_MULTIPLIER, p->sched_qos.user_defined))
> + p->sched_qos.rampup_multiplier = sysctl_sched_qos_default_rampup_multiplier;
> + task_rq_unlock(rq, p, &rf);
> + }
> +}
> +
> +static int sysctl_sched_qos_handler(struct ctl_table *table, int write,
> + void *buffer, size_t *lenp, loff_t *ppos)
> +{
> + unsigned int old_rampup_mult;
> + int result;
> +
> + old_rampup_mult = sysctl_sched_qos_default_rampup_multiplier;
> +
> + result = proc_dointvec(table, write, buffer, lenp, ppos);
> + if (result)
> + goto undo;
> + if (!write)
> + return 0;
> +
> + if (old_rampup_mult != sysctl_sched_qos_default_rampup_multiplier) {
> + sched_qos_sync_sysctl();
> + }
> +
> + return 0;
> +
> +undo:
> + sysctl_sched_qos_default_rampup_multiplier = old_rampup_mult;
> + return result;
> +}
> +
> static struct ctl_table sched_core_sysctls[] = {
> #ifdef CONFIG_SCHEDSTATS
> {
> @@ -4534,6 +4577,13 @@ static struct ctl_table sched_core_sysctls[] = {
> .extra2 = SYSCTL_FOUR,
> },
> #endif /* CONFIG_NUMA_BALANCING */
> + {
> + .procname = "sched_qos_default_rampup_multiplier",
> + .data = &sysctl_sched_qos_default_rampup_multiplier,
> + .maxlen = sizeof(unsigned int),
IIUC, user space needs to select a value between 0 and (2^32 - 1). Does
this mean that it will need fine-tuning for each product and application?
Could there be some translation to a smaller number of QoS levels that are
qualitatively defined?
Also, I think about Intel processors. They work with hardware-controlled
performance scaling. The proposed interface would help us to communicate
per-task multipliers to hardware, but they would be used as hints to
hardware and not acted upon by the kernel to scale frequency.
* Re: [RFC PATCH 12/16] sched/pelt: Add new waiting_avg to record when runnable && !running
2024-08-20 16:35 ` [RFC PATCH 12/16] sched/pelt: Add new waiting_avg to record when runnable && !running Qais Yousef
@ 2024-09-18 7:01 ` Dietmar Eggemann
0 siblings, 0 replies; 37+ messages in thread
From: Dietmar Eggemann @ 2024-09-18 7:01 UTC (permalink / raw)
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Rafael J. Wysocki, Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, John Stultz, linux-pm, linux-kernel
On 20/08/2024 18:35, Qais Yousef wrote:
> This info will be useful to understand how long tasks end up waiting
> behind other tasks. This info is recorded for tasks only, and
> added/subtracted from root cfs_rq on __update_load_avg_se().
>
> It also helps to decouple util_avg which indicates tasks computational
> demand from the fact that the CPU might need to run faster to reduce the
> waiting time. It has been a point of confusion in the past while
> discussing uclamp and util_avg and the fact that not keeping freq high
> means tasks will take longer to run and cause delays. Isolating the
> source of delay into its own signal would be a better way to take this
> source of delay into account when making decisions independently of
> task's/CPU's computational demands.
>
> It is not used now. But will be used later to help drive DVFS headroom.
> It could become a helpful metric to help us manage waiting latencies in
> general, for example in load balance.
>
> TODO: waiting_avg should use rq_clock_task() as it doesn't care about
> invariance. Waiting time should reflect actual wait in realtime as this
> is the measure of latency that users care about.
Since you use PELT for the update, you're bound to use rq_clock_pelt().
If we could have PELT with two time values, then we could have
'util_avg' and 'invariant util_avg' to cure the slow ramp-up on tiny CPU
and/or low OPPs and we wouldn't have to add all of this extra code.
[...]
> @@ -4744,8 +4760,15 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> * Track task load average for carrying it to new CPU after migrated, and
> * track group sched_entity load average for task_h_load calculation in migration
> */
> - if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
> + if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
> + bool update_rq_waiting_avg = entity_is_task(se) && se_runnable(se);
> +
> + if (update_rq_waiting_avg)
> + sub_waiting_avg(&rq_of(cfs_rq)->cfs, se);
> __update_load_avg_se(now, cfs_rq, se);
> + if (update_rq_waiting_avg)
> + add_waiting_avg(&rq_of(cfs_rq)->cfs, se);
> + }
That's a pretty convoluted design. util_est-style attach/detach within
the PELT update but only for tasks and not all se's.
Doesn't 'p->se.avg.runnable_avg - p->se.avg.util_avg' give you what you
want? It's invariant but so is this here.
Commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task
utilization") uses some of it already.
+ /*
+ * To avoid underestimate of task utilization, skip updates of EWMA if
+ * we cannot grant that thread got all CPU time it wanted.
+ */
+ if ((ue.enqueued + UTIL_EST_MARGIN) < task_runnable(p))
+ goto done;
[...]
> @@ -6786,6 +6814,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> * estimated utilization, before we update schedutil.
> */
> util_est_enqueue(&rq->cfs, p);
> + add_waiting_avg(&rq->cfs, se);
This would also have to be checked against the new p->se.sched_delayed
thing.
> /*
> * If in_iowait is set, the code below may not trigger any cpufreq
> @@ -6874,6 +6903,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> bool was_sched_idle = sched_idle_rq(rq);
>
> util_est_dequeue(&rq->cfs, p);
> + sub_waiting_avg(&rq->cfs, se);
^^
This won't compile. se vs. &p->se
[...]
* Re: [RFC PATCH 14/16] sched/schedutil: Ignore dvfs headroom when util is decaying
2024-08-20 16:35 ` [RFC PATCH 14/16] sched/schedutil: Ignore dvfs headroom when util is decaying Qais Yousef
2024-08-22 5:29 ` Sultan Alsawaf (unemployed)
@ 2024-09-18 10:40 ` Christian Loehle
1 sibling, 0 replies; 37+ messages in thread
From: Christian Loehle @ 2024-09-18 10:40 UTC (permalink / raw)
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Rafael J. Wysocki, Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel
On 8/20/24 17:35, Qais Yousef wrote:
> It means we're idling or doing less work and are already running
> at a higher value. No need to apply any dvfs headroom in this case.
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> kernel/sched/cpufreq_schedutil.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 318b09bc4ab1..4a1a8b353d51 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -9,6 +9,7 @@
> #define IOWAIT_BOOST_MIN (SCHED_CAPACITY_SCALE / 8)
>
> DEFINE_PER_CPU_READ_MOSTLY(unsigned long, response_time_mult);
> +DEFINE_PER_CPU(unsigned long, last_update_util);
>
> struct sugov_tunables {
> struct gov_attr_set attr_set;
> @@ -262,15 +263,19 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> * Also take into accounting how long tasks have been waiting in runnable but
> * !running state. If it is high, it means we need higher DVFS headroom to
> * reduce it.
> - *
> - * XXX: Should we provide headroom when the util is decaying?
> */
> static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util, int cpu)
> {
> - unsigned long update_headroom, waiting_headroom;
> + unsigned long update_headroom, waiting_headroom, prev_util;
> struct rq *rq = cpu_rq(cpu);
> u64 delay;
>
> + prev_util = per_cpu(last_update_util, cpu);
> + per_cpu(last_update_util, cpu) = util;
> +
> + if (util < prev_util)
> + return util;
> +
> /*
> * What is the possible worst case scenario for updating util_avg, ctx
> * switch or TICK?
Kind of in the same vein as Sultan here: -/+1 util really doesn't tell much,
and I would be wary of basing any special behavior on that.
This goes for here but also for the 'periodic'-task detection in
[RFC PATCH 09/16] sched/fair: util_est: Take into account periodic tasks.
In my experience, as soon as we leave the world of rt-app workloads behind,
these aren't stable enough at that granularity.
* Re: [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS
2024-09-17 21:43 ` Ricardo Neri
@ 2024-09-18 21:21 ` Ricardo Neri
0 siblings, 0 replies; 37+ messages in thread
From: Ricardo Neri @ 2024-09-18 21:21 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
John Stultz, linux-pm, linux-kernel
On Tue, Sep 17, 2024 at 02:43:37PM -0700, Ricardo Neri wrote:
> On Tue, Aug 20, 2024 at 05:35:07PM +0100, Qais Yousef wrote:
> > Bursty tasks are hard to predict. To use resources efficiently, the
> > system would like to be exact as much as possible. But this poses
> > a challenge for these bursty tasks that need to get access to more
> > resources quickly.
> >
> > The new SCHED_QOS_RAMPUP_MULTIPLIER allows userspace to do that. As the
> > name implies, it only helps them to transition to a higher performance
> > state when they get _busier_. That is, perfectly periodic tasks by
> > definition are not going through a transition and will run at a constant
> > performance level. It is the tasks that need to transition from one
> > periodic state to another periodic state that is at a higher level that
> > this rampup_multiplier will help with. It also slows down the ewma decay
> > of util_est which should help those bursty tasks to keep their faster
> > rampup.
> >
> > This should work complementarily with uclamp. uclamp tells the system
> > about min and max perf requirements which can be applied immediately.
> >
> > rampup_multiplier is about the reactiveness of the task to change.
> > Specifically, to a change to a higher performance level. The task might
> > not necessarily have min perf requirements, but it can have sudden
> > bursts of changes that require a higher perf level, and it needs the
> > system to provide this faster.
> >
> > TODO: update the sched_qos docs
> >
> > Signed-off-by: Qais Yousef <qyousef@layalina.io>
> > ---
> > include/linux/sched.h | 7 ++++
> > include/uapi/linux/sched.h | 2 ++
> > kernel/sched/core.c | 66 ++++++++++++++++++++++++++++++++++++++
> > kernel/sched/fair.c | 6 ++--
> > kernel/sched/syscalls.c | 38 ++++++++++++++++++++--
> > 5 files changed, 115 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 2e8c5a9ffa76..a30ee43a25fb 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -404,6 +404,11 @@ struct sched_info {
> > #endif /* CONFIG_SCHED_INFO */
> > };
> >
> > +struct sched_qos {
> > + DECLARE_BITMAP(user_defined, SCHED_QOS_MAX);
> > + unsigned int rampup_multiplier;
> > +};
> > +
> > /*
> > * Integer metrics need fixed point arithmetic, e.g., sched/fair
> > * has a few: load, load_avg, util_avg, freq, and capacity.
> > @@ -882,6 +887,8 @@ struct task_struct {
> >
> > struct sched_info sched_info;
> >
> > + struct sched_qos sched_qos;
> > +
> > struct list_head tasks;
> > #ifdef CONFIG_SMP
> > struct plist_node pushable_tasks;
> > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> > index 67ef99f64ddc..0baba91ba5b8 100644
> > --- a/include/uapi/linux/sched.h
> > +++ b/include/uapi/linux/sched.h
> > @@ -104,6 +104,8 @@ struct clone_args {
> > };
> >
> > enum sched_qos_type {
> > + SCHED_QOS_RAMPUP_MULTIPLIER,
> > + SCHED_QOS_MAX,
> > };
> > #endif
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index c91e6a62c7ab..54faa845cb29 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -152,6 +152,8 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
> > */
> > const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
> >
> > +unsigned int sysctl_sched_qos_default_rampup_multiplier = 1;
> > +
> > __read_mostly int scheduler_running;
> >
> > #ifdef CONFIG_SCHED_CORE
> > @@ -4488,6 +4490,47 @@ static int sysctl_schedstats(struct ctl_table *table, int write, void *buffer,
> > #endif /* CONFIG_SCHEDSTATS */
> >
> > #ifdef CONFIG_SYSCTL
> > +static void sched_qos_sync_sysctl(void)
> > +{
> > + struct task_struct *g, *p;
> > +
> > + guard(rcu)();
> > + for_each_process_thread(g, p) {
> > + struct rq_flags rf;
> > + struct rq *rq;
> > +
> > + rq = task_rq_lock(p, &rf);
> > + if (!test_bit(SCHED_QOS_RAMPUP_MULTIPLIER, p->sched_qos.user_defined))
> > + p->sched_qos.rampup_multiplier = sysctl_sched_qos_default_rampup_multiplier;
> > + task_rq_unlock(rq, p, &rf);
> > + }
> > +}
> > +
> > +static int sysctl_sched_qos_handler(struct ctl_table *table, int write,
> > + void *buffer, size_t *lenp, loff_t *ppos)
> > +{
> > + unsigned int old_rampup_mult;
> > + int result;
> > +
> > + old_rampup_mult = sysctl_sched_qos_default_rampup_multiplier;
> > +
> > + result = proc_dointvec(table, write, buffer, lenp, ppos);
> > + if (result)
> > + goto undo;
> > + if (!write)
> > + return 0;
> > +
> > + if (old_rampup_mult != sysctl_sched_qos_default_rampup_multiplier) {
> > + sched_qos_sync_sysctl();
> > + }
> > +
> > + return 0;
> > +
> > +undo:
> > + sysctl_sched_qos_default_rampup_multiplier = old_rampup_mult;
> > + return result;
> > +}
> > +
> > static struct ctl_table sched_core_sysctls[] = {
> > #ifdef CONFIG_SCHEDSTATS
> > {
> > @@ -4534,6 +4577,13 @@ static struct ctl_table sched_core_sysctls[] = {
> > .extra2 = SYSCTL_FOUR,
> > },
> > #endif /* CONFIG_NUMA_BALANCING */
> > + {
> > + .procname = "sched_qos_default_rampup_multiplier",
> > + .data = &sysctl_sched_qos_default_rampup_multiplier,
> > + .maxlen = sizeof(unsigned int),
>
> IIUC, user space needs to select a value between 0 and (2^32 - 1). Does
> this mean that it will need fine-tuning for each product and application?
>
> Could there be some translation to a smaller number of QoS levels that are
> qualitatively defined?
>
> Also, I think about Intel processors. They work with hardware-controlled
> performance scaling. The proposed interface would help us to communicate
> per-task multipliers to hardware, but they would be used as hints to
> hardware and not acted upon by the kernel to scale frequency.
Also, as discussed during LPC 2024 it might be good to have an interface
that is compatible with other operating systems. They have qualitative
descriptions of QoS levels (see Len Brown's LPC 2022 presentation [1]).
It can be this hint or a new one.
[1]. https://lpc.events/event/16/contributions/1276/attachments/1070/2039/Brown-Shankar%20LPC%202022.09.13%20Sched%20QOS%20API.pdf
* Re: [RFC PATCH 08/16] sched/fair: Extend util_est to improve rampup time
2024-08-20 16:35 ` [RFC PATCH 08/16] sched/fair: Extend util_est to improve rampup time Qais Yousef
2024-09-17 19:21 ` Dietmar Eggemann
@ 2024-10-14 16:04 ` Christian Loehle
1 sibling, 0 replies; 37+ messages in thread
From: Christian Loehle @ 2024-10-14 16:04 UTC (permalink / raw)
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Rafael J. Wysocki, Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel
On 8/20/24 17:35, Qais Yousef wrote:
> Utilization invariance can cause big delays. When tasks are running,
> accumulate a non-invariant version of utilization to help tasks settle
> down to their new util_avg values faster.
>
> Keep track of delta_exec during runnable across activations to help
> update util_est for a long running task accurately. util_est should
> still behave the same at enqueue/dequeue.
For periodic tasks that have longer slices (~tick) this overestimates
util_est by a lot.
AFAICS this also breaks util_est for co-scheduling tasks of different slice
lengths.
I'm testing with HZ=1000, but should work for any. On a RK3399, all pinned
to one big.
Having task A be 10ms period with 20% util (running for 2ms when scheduled)
and tasks B+ with 1ms period and 1% util.
I guess 9/16 tries to work around that somewhat, but without any leeway
that doesn't work. Even rt-app tasks will vary slightly in their util_est values:
Task A only:
  mainline:
    A:        util_avg: 192  util_est: 204
  9/16 "sched/fair: util_est: Take into account periodic tasks":
    A:        util_avg: 185  util_est: 423

8 tasks:
  mainline:
    A:        util_avg: 229  util_est: 229
    The rest: util_avg: 12   util_est: 24
  9/16 "sched/fair: util_est: Take into account periodic tasks":
    A:        util_avg: 242  util_est: 643
    The rest: util_avg: 12   util_est: 50
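The workload described above could be expressed as an rt-app configuration along these lines (a sketch: the CPU number, calibration, and duration are assumptions, and the key names follow rt-app's JSON grammar as commonly used; times are in microseconds):

```json
{
  "global": {
    "duration": 10,
    "default_policy": "SCHED_OTHER",
    "calibration": "CPU4"
  },
  "tasks": {
    "taskA": {
      "cpus": [4],
      "loop": -1,
      "run": 2000,
      "timer": { "ref": "tA", "period": 10000 }
    },
    "taskB": {
      "instance": 7,
      "cpus": [4],
      "loop": -1,
      "run": 10,
      "timer": { "ref": "tB", "period": 1000 }
    }
  }
}
```

Here taskA runs 2ms every 10ms (~20% util) and seven taskB instances run 10us every 1ms (~1% util each), all pinned to one big CPU.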
* Re: [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS
2024-08-20 16:35 ` [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS Qais Yousef
2024-09-17 20:09 ` Dietmar Eggemann
2024-09-17 21:43 ` Ricardo Neri
@ 2024-10-14 16:06 ` Christian Loehle
2024-11-28 0:12 ` John Stultz
3 siblings, 0 replies; 37+ messages in thread
From: Christian Loehle @ 2024-10-14 16:06 UTC (permalink / raw)
To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
Rafael J. Wysocki, Viresh Kumar
Cc: Juri Lelli, Steven Rostedt, Dietmar Eggemann, John Stultz,
linux-pm, linux-kernel
On 8/20/24 17:35, Qais Yousef wrote:
> Bursty tasks are hard to predict. To use resources efficiently, the
> system would like to be exact as much as possible. But this poses
> a challenge for these bursty tasks that need to get access to more
> resources quickly.
>
> The new SCHED_QOS_RAMPUP_MULTIPLIER allows userspace to do that. As the
> name implies, it only helps them to transition to a higher performance
> state when they get _busier_. That is, perfectly periodic tasks by
> definition are not going through a transition and will run at a constant
> performance level. It is the tasks that need to transition from one
> periodic state to another periodic state that is at a higher level that
> this rampup_multiplier will help with. It also slows down the ewma decay
> of util_est which should help those bursty tasks to keep their faster
> rampup.
>
> This should work complementarily with uclamp. uclamp tells the system
> about min and max perf requirements which can be applied immediately.
>
> rampup_multiplier is about the reactiveness of the task to change.
> Specifically, to a change to a higher performance level. The task might
> not necessarily have min perf requirements, but it can have sudden
> bursts of changes that require a higher perf level, and it needs the
> system to provide this faster.
>
> TODO: update the sched_qos docs
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> include/linux/sched.h | 7 ++++
> include/uapi/linux/sched.h | 2 ++
> kernel/sched/core.c | 66 ++++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 6 ++--
> kernel/sched/syscalls.c | 38 ++++++++++++++++++++--
> 5 files changed, 115 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2e8c5a9ffa76..a30ee43a25fb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -404,6 +404,11 @@ struct sched_info {
> #endif /* CONFIG_SCHED_INFO */
> };
>
> +struct sched_qos {
> + DECLARE_BITMAP(user_defined, SCHED_QOS_MAX);
> + unsigned int rampup_multiplier;
> +};
> +
> /*
> * Integer metrics need fixed point arithmetic, e.g., sched/fair
> * has a few: load, load_avg, util_avg, freq, and capacity.
> @@ -882,6 +887,8 @@ struct task_struct {
>
> struct sched_info sched_info;
>
> + struct sched_qos sched_qos;
> +
> struct list_head tasks;
> #ifdef CONFIG_SMP
> struct plist_node pushable_tasks;
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 67ef99f64ddc..0baba91ba5b8 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -104,6 +104,8 @@ struct clone_args {
> };
>
> enum sched_qos_type {
> + SCHED_QOS_RAMPUP_MULTIPLIER,
> + SCHED_QOS_MAX,
> };
> #endif
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c91e6a62c7ab..54faa845cb29 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -152,6 +152,8 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
> */
> const_debug unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
>
> +unsigned int sysctl_sched_qos_default_rampup_multiplier = 1;
> +
> __read_mostly int scheduler_running;
>
> #ifdef CONFIG_SCHED_CORE
> @@ -4488,6 +4490,47 @@ static int sysctl_schedstats(struct ctl_table *table, int write, void *buffer,
> #endif /* CONFIG_SCHEDSTATS */
>
> #ifdef CONFIG_SYSCTL
> +static void sched_qos_sync_sysctl(void)
> +{
> + struct task_struct *g, *p;
> +
> + guard(rcu)();
> + for_each_process_thread(g, p) {
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + rq = task_rq_lock(p, &rf);
> + if (!test_bit(SCHED_QOS_RAMPUP_MULTIPLIER, p->sched_qos.user_defined))
> + p->sched_qos.rampup_multiplier = sysctl_sched_qos_default_rampup_multiplier;
> + task_rq_unlock(rq, p, &rf);
> + }
> +}
> +
> +static int sysctl_sched_qos_handler(struct ctl_table *table, int write,
> + void *buffer, size_t *lenp, loff_t *ppos)
table should be const struct ctl_table *table for this to build on 6.11 at least.
* Re: [RFC PATCH 05/16] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom()
2024-08-20 16:35 ` [RFC PATCH 05/16] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom() Qais Yousef
@ 2024-11-13 4:51 ` John Stultz
0 siblings, 0 replies; 37+ messages in thread
From: John Stultz @ 2024-11-13 4:51 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
linux-pm, linux-kernel
On Tue, Aug 20, 2024 at 9:35 AM Qais Yousef <qyousef@layalina.io> wrote:
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 575df3599813..303b0ab227e7 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -187,13 +187,28 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> * to run at adequate performance point.
> *
> * This function provides enough headroom to provide adequate performance
> - * assuming the CPU continues to be busy.
> + * assuming the CPU continues to be busy. This headroom is based on the
> + * dvfs_update_delay of the cpufreq governor or min(curr.se.slice, TICK_US),
> + * whichever is higher.
> *
> - * At the moment it is a constant multiplication with 1.25.
> + * XXX: Should we provide headroom when the util is decaying?
> */
> -static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util)
> +static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util, int cpu)
> {
> - return util + (util >> 2);
> + struct rq *rq = cpu_rq(cpu);
> + u64 delay;
> +
> + /*
> + * What is the possible worst case scenario for updating util_avg, ctx
> + * switch or TICK?
> + */
> + if (rq->cfs.h_nr_running > 1)
> + delay = min(rq->curr->se.slice/1000, TICK_USEC);
Nit: this fails to build on 32bit due to the u64 division.
Need something like:
    if (rq->cfs.h_nr_running > 1) {
            u64 slice = rq->curr->se.slice;

            do_div(slice, 1000);
            delay = min(slice, TICK_USEC);
    } else
            ...
thanks
-john
* Re: [RFC PATCH 09/16] sched/fair: util_est: Take into account periodic tasks
2024-08-20 16:35 ` [RFC PATCH 09/16] sched/fair: util_est: Take into account periodic tasks Qais Yousef
@ 2024-11-13 4:57 ` John Stultz
0 siblings, 0 replies; 37+ messages in thread
From: John Stultz @ 2024-11-13 4:57 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
linux-pm, linux-kernel
On Tue, Aug 20, 2024 at 9:36 AM Qais Yousef <qyousef@layalina.io> wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a6421e4032c0..0c10e2afb52d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4899,9 +4904,12 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> * quickly to settle down to our new util_avg.
> */
> if (!task_sleep) {
> - ewma &= ~UTIL_AVG_UNCHANGED;
> - ewma = approximate_util_avg(ewma, p->se.delta_exec / 1000);
> - goto done;
> + if (task_util(p) > task_util_dequeued(p)) {
> + ewma &= ~UTIL_AVG_UNCHANGED;
> + ewma = approximate_util_avg(ewma, p->se.delta_exec / 1000);
Same 32bit build issue due to 64bit division here. Need to rework w/ do_div().
thanks
-john
* Re: [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS
2024-08-20 16:35 ` [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS Qais Yousef
` (2 preceding siblings ...)
2024-10-14 16:06 ` Christian Loehle
@ 2024-11-28 0:12 ` John Stultz
3 siblings, 0 replies; 37+ messages in thread
From: John Stultz @ 2024-11-28 0:12 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
linux-pm, linux-kernel
On Tue, Aug 20, 2024 at 9:36 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> Bursty tasks are hard to predict. To use resources efficiently, the
> system would like to be exact as much as possible. But this poses
> a challenge for these bursty tasks that need to get access to more
> resources quickly.
>
> The new SCHED_QOS_RAMPUP_MULTIPLIER allows userspace to do that. As the
> name implies, it only helps tasks transition to a higher performance
> state when they get _busier_. Perfectly periodic tasks are, by
> definition, not going through a transition and will run at a constant
> performance level. It is tasks that need to transition from one
> periodic state to another periodic state at a higher level that
> rampup_multiplier helps with. It also slows down the ewma decay
> of util_est, which should help those bursty tasks keep their faster
> rampup.
>
> This should work complementarily with uclamp. uclamp tells the system
> about min and max perf requirements, which can be applied immediately.
>
> rampup_multiplier is about the reactiveness of the task to change,
> specifically a change to a higher performance level. The task might
> not necessarily have a min perf requirement, but it can have sudden
> bursts of activity that require a higher perf level, and it needs the
> system to provide this faster.
>
> TODO: update the sched_qos docs
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
> include/linux/sched.h | 7 ++++
> include/uapi/linux/sched.h | 2 ++
> kernel/sched/core.c | 66 ++++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 6 ++--
> kernel/sched/syscalls.c | 38 ++++++++++++++++++++--
> 5 files changed, 115 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2e8c5a9ffa76..a30ee43a25fb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -404,6 +404,11 @@ struct sched_info {
> #endif /* CONFIG_SCHED_INFO */
> };
>
> +struct sched_qos {
> + DECLARE_BITMAP(user_defined, SCHED_QOS_MAX);
> + unsigned int rampup_multiplier;
> +};
> +
> /*
> * Integer metrics need fixed point arithmetic, e.g., sched/fair
> * has a few: load, load_avg, util_avg, freq, and capacity.
> @@ -882,6 +887,8 @@ struct task_struct {
>
> struct sched_info sched_info;
>
> + struct sched_qos sched_qos;
> +
> struct list_head tasks;
> #ifdef CONFIG_SMP
> struct plist_node pushable_tasks;
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 67ef99f64ddc..0baba91ba5b8 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -104,6 +104,8 @@ struct clone_args {
> };
>
> enum sched_qos_type {
> + SCHED_QOS_RAMPUP_MULTIPLIER,
> + SCHED_QOS_MAX,
> };
> #endif
...
> +static void __setscheduler_sched_qos(struct task_struct *p,
> + const struct sched_attr *attr)
> +{
> + switch (attr->sched_qos_type) {
> + case SCHED_QOS_RAMPUP_MULTIPLIER:
> + set_bit(SCHED_QOS_RAMPUP_MULTIPLIER, p->sched_qos.user_defined);
> + p->sched_qos.rampup_multiplier = attr->sched_qos_value;
> + default:
> + break;
> + }
> +}
> +
> /*
> * Allow unprivileged RT tasks to decrease priority.
> * Only issue a capable test if needed and only once to avoid an audit
...
> @@ -799,7 +831,9 @@ int __sched_setscheduler(struct task_struct *p,
> __setscheduler_params(p, attr);
> __setscheduler_prio(p, newprio);
> }
> +
> __setscheduler_uclamp(p, attr);
> + __setscheduler_sched_qos(p, attr);
>
Hey Qais,
Started tinkering a bit more with this patch series and found that
a number of tasks were unexpectedly getting their rampup_multiplier
value set to zero.
It looks like the issue is that the SCHED_QOS_RAMPUP_MULTIPLIER enum
value is 0, so the switch (attr->sched_qos_type) always catches the
uninitialized/unset value during any sched_setscheduler() call.
Further, the call to __setscheduler_sched_qos() isn't protected by an
(attr->sched_flags & SCHED_FLAG_QOS) check as is done for
sched_qos_validate(), so we always end up falling into it and setting
the rampup_multiplier.
The easiest fix is probably just to have a SCHED_QOS_NONE base value
in the sched_qos_type enum, but we could also add checks on sched_flags
& SCHED_FLAG_QOS. Or do you have another idea?
thanks
-john
* Re: [RFC PATCH 10/16] sched/qos: Add a new sched-qos interface
2024-08-20 16:35 ` [RFC PATCH 10/16] sched/qos: Add a new sched-qos interface Qais Yousef
@ 2024-11-28 1:47 ` John Stultz
0 siblings, 0 replies; 37+ messages in thread
From: John Stultz @ 2024-11-28 1:47 UTC (permalink / raw)
To: Qais Yousef
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
Viresh Kumar, Juri Lelli, Steven Rostedt, Dietmar Eggemann,
linux-pm, linux-kernel
On Tue, Aug 20, 2024 at 9:36 AM Qais Yousef <qyousef@layalina.io> wrote:
>
> The need to describe the conflicting demands of various workloads has
> never been higher. Both hardware and software have moved rapidly in
> the past decade, system usage is more diverse, and the number of
> workloads expected to run on the same machine, whether in Mobile or
> Server markets, has created a big dilemma on how to better manage
> those requirements.
>
> The problem is that we lack mechanisms to allow these workloads to
> describe what they need, and then allow the kernel to make a best
> effort to manage those demands transparently, based on the hardware
> it is running on and the current system state.
>
> Example of conflicting requirements that come across frequently:
>
> 1. Improve wake up latency for SCHED_OTHER. Many tasks end up
> using SCHED_FIFO/SCHED_RR to compensate for this shortcoming.
> RT tasks lack power management and fairness and can be hard
> and error prone to use correctly and portably.
>
> 2. Prefer spreading vs prefer packing on wake up for a group of
> tasks. Geekbench-like workloads would benefit from
> parallelising on different CPUs. hackbench type of workloads
> can benefit from waking on up same CPUs or a CPU that is
> closer in the cache hierarchy.
>
> 3. Nice values for SCHED_OTHER are system wide and require
> privileges. Many workloads would like a way to set relative
> nice values so their tasks can preempt each other, but not
> impact or be impacted by tasks belonging to different
> workloads on the system.
>
> 4. Provide a way to tag some tasks as 'background' to keep them
> out of the way. SCHED_IDLE is too strong for some of these
> tasks, yet they can be computationally heavy. Example
> tasks are garbage collectors: their work is both important
> and not important.
>
> 5. Provide a way to improve DVFS/upmigration rampup time for
> specific tasks that are bursty in nature and highly
> interactive.
>
> Whether any of these use cases warrants an additional QoS hint is
> something to be discussed individually. But the main point is to
> introduce an interface that can be extendable to cater for potentially
> those requirements and more. rampup_multiplier to improve
> DVFS/upmigration for bursty tasks will be the first user in later patch.
>
> It is desired to have apps (and benchmarks!) directly use this interface
> for optimal perf/watt. But in the absence of such support, it should be
> possible to write a userspace daemon to monitor workloads and apply
> these QoS hints on apps behalf based on analysis done by anyone
> interested in improving the performance of those workloads.
>
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
...
> diff --git a/tools/perf/trace/beauty/include/uapi/linux/sched.h b/tools/perf/trace/beauty/include/uapi/linux/sched.h
> index 3bac0a8ceab2..67ef99f64ddc 100644
> --- a/tools/perf/trace/beauty/include/uapi/linux/sched.h
> +++ b/tools/perf/trace/beauty/include/uapi/linux/sched.h
> @@ -102,6 +102,9 @@ struct clone_args {
> __aligned_u64 set_tid_size;
> __aligned_u64 cgroup;
> };
> +
> +enum sched_qos_type {
> +};
> #endif
>
> #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> @@ -132,6 +135,7 @@ struct clone_args {
> #define SCHED_FLAG_KEEP_PARAMS 0x10
> #define SCHED_FLAG_UTIL_CLAMP_MIN 0x20
> #define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
> +#define SCHED_FLAG_QOS 0x80
>
Hey Qais,
Just a heads up: it seems this flag needs to be added to SCHED_FLAG_ALL
for the code in later patches to be reachable.
thanks
-john
end of thread, other threads:[~2024-11-28 1:47 UTC | newest]
Thread overview: 37+ messages
2024-08-20 16:34 [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Qais Yousef
2024-08-20 16:34 ` [RFC PATCH 01/16] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom Qais Yousef
2024-08-20 16:34 ` [RFC PATCH 02/16] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
2024-08-20 16:34 ` [RFC PATCH 03/16] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
2024-08-22 5:36 ` Sultan Alsawaf (unemployed)
2024-09-16 15:31 ` Christian Loehle
2024-08-20 16:35 ` [RFC PATCH 04/16] sched/fair: Remove magic hardcoded margin in fits_capacity() Qais Yousef
2024-08-22 5:09 ` Sultan Alsawaf (unemployed)
2024-09-17 19:41 ` Dietmar Eggemann
2024-08-20 16:35 ` [RFC PATCH 05/16] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom() Qais Yousef
2024-11-13 4:51 ` John Stultz
2024-08-20 16:35 ` [RFC PATCH 06/16] sched/schedutil: Add a new tunable to dictate response time Qais Yousef
2024-09-16 22:22 ` Dietmar Eggemann
2024-09-17 10:22 ` Christian Loehle
2024-08-20 16:35 ` [RFC PATCH 07/16] sched/pelt: Introduce PELT multiplier boot time parameter Qais Yousef
2024-08-20 16:35 ` [RFC PATCH 08/16] sched/fair: Extend util_est to improve rampup time Qais Yousef
2024-09-17 19:21 ` Dietmar Eggemann
2024-10-14 16:04 ` Christian Loehle
2024-08-20 16:35 ` [RFC PATCH 09/16] sched/fair: util_est: Take into account periodic tasks Qais Yousef
2024-11-13 4:57 ` John Stultz
2024-08-20 16:35 ` [RFC PATCH 10/16] sched/qos: Add a new sched-qos interface Qais Yousef
2024-11-28 1:47 ` John Stultz
2024-08-20 16:35 ` [RFC PATCH 11/16] sched/qos: Add rampup multiplier QoS Qais Yousef
2024-09-17 20:09 ` Dietmar Eggemann
2024-09-17 21:43 ` Ricardo Neri
2024-09-18 21:21 ` Ricardo Neri
2024-10-14 16:06 ` Christian Loehle
2024-11-28 0:12 ` John Stultz
2024-08-20 16:35 ` [RFC PATCH 12/16] sched/pelt: Add new waiting_avg to record when runnable && !running Qais Yousef
2024-09-18 7:01 ` Dietmar Eggemann
2024-08-20 16:35 ` [RFC PATCH 13/16] sched/schedutil: Take into account waiting_avg in apply_dvfs_headroom Qais Yousef
2024-08-20 16:35 ` [RFC PATCH 14/16] sched/schedutil: Ignore dvfs headroom when util is decaying Qais Yousef
2024-08-22 5:29 ` Sultan Alsawaf (unemployed)
2024-09-18 10:40 ` Christian Loehle
2024-08-20 16:35 ` [RFC PATCH 15/16] sched/fair: Enable disabling util_est via rampup_multiplier Qais Yousef
2024-08-20 16:35 ` [RFC PATCH 16/16] sched/fair: Don't mess with util_avg post init Qais Yousef
2024-09-16 12:21 ` [RFC PATCH 00/16] sched/fair/schedutil: Better manage system response time Dietmar Eggemann