* [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time
@ 2026-05-04  1:59 Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 01/13] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom Qais Yousef
                   ` (12 more replies)
  0 siblings, 13 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

This is the long delayed follow up to the series sent back in August 2024 [1].
Life got in the way to some extent (I had a baby, and the time I used to spend
on upstream work late at night is gone :). Apologies to those who replied and
didn't get a response.

The series is now rebased on top of tip/sched/core 78cde54ea5f0. I removed
a number of optimization patches that are not necessary for this initial merge
and can be treated as their own separate topics once this is hopefully
accepted.

I discussed the problem at LPC in 2024 [2] and the initial cover letter
contains all the details. I hope all the key parties are up-to-date on the
problem details by now.

As a brief recap, there are some hardcoded constants in the kernel that
introduce a bias which frequently fails to deliver the best outcome on various
systems. It turns out these constants seem to help somewhat against a bigger
problem: utilization signal distortion due to utilization invariance, causing
what I call the black hole effect. The lower the capacity, the harder it is to
accumulate enough runtime for the signal to rise, acting like a gravitational
pull causing time dilation.
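
To illustrate with a simplified sketch (my illustration, not code from this
series; the real scaling lives in the PELT update path): runtime is scaled by
both the current frequency and the CPU capacity before it is accumulated, so
a low-capacity CPU at a low OPP accrues PELT time far slower than wall clock:

	/*
	 * Simplified sketch of PELT invariance scaling (illustrative only).
	 * Both scale factors are in [0..SCHED_CAPACITY_SCALE (1024)].
	 */
	static u64 scale_pelt_delta(u64 delta, unsigned long scale_freq,
				    unsigned long scale_cpu)
	{
		delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
		delta = (delta * scale_cpu) >> SCHED_CAPACITY_SHIFT;
		return delta;
	}

E.g. a CPU of capacity 160 running at half its max frequency turns 1000us of
wall-clock runtime into roughly 1000 * 512/1024 * 160/1024 ~= 78us of PELT
time, so util rises ~13x slower than on a big CPU at max frequency.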

One of the major difficulties we will face is that this distortion turns out
bad for performance but good for power. The fix will inevitably rebalance the
system, in the right way, but also in a surprising way that could leave some
unhappy. sched_features were added to ensure those unhappy folks can revert
the system to the old behavior while still allowing us to make the right
progress.

That is, to retain the old behavior one must:

	echo 0 | sudo tee /proc/sys/kernel/sched_qos_default_rampup_multiplier
	echo CONST_DVFS_HEADROOM NO_UTIL_EST_RAMPUP_ZERO UTIL_EST_FORCE_POST_INIT > /sys/kernel/debug/sched/features

Note that for the migration margin there are no sched_features, since I think
the old behavior was worse for both perf and power and doesn't warrant
reverting to.

The system is going to be a lot faster now by default with
sched_qos_default_rampup_multiplier=1 since it fixes the distortion issue and
provides a constant rise time regardless of DVFS latencies.

The desired behavior is for the default rampup_multiplier to be 0, with only
interactive tasks requesting a higher rampup multiplier. Preliminary
integration with schedqos is available [3] for those who want to see the full
benefit of fine grained control to manage perf and power.

Open questions:

* The details of the QoS interface are the biggest one.
* Would debugfs be better for setting the default rampup multiplier instead of sysctl?
* Patch 13 makes updating load_avg unconditional rather than only on period boundaries.

Patches 1-3 are preparatory patches renaming a function and introducing new ones.

Patches 4-5 handle the magic margin problem by making the margins dynamic
based on actual hardware limitations.

Patches 6-7 fix the black hole problem and teach the scheduler how to handle
bursty and periodic tasks by extending util_est.

Patches 8-9 are where I expect most of the discussion, as I introduce a new
sched_qos interface to support the new rampup_multiplier to help manage DVFS.

Patches 10-11 introduce a couple of necessary optimizations to counter the
power impact of the increased responsiveness by disabling some features that
we now know how to handle better.

Patches 12-13 fix a couple of issues causing the util_est and util_avg values
to swing for a periodic task. Patch 12 must go via stable.

The Mac Mini M1 system I previously did the testing on is down, and it has
proven difficult to revive before sending this series. I will revive it and
repeat the testing to ensure all is okay after the rebase.

I did test on an AMD system, but it has only 3 frequencies, so there are no
real perf numbers to report since it just whizzes through these 3 freqs
anyway. But I did spend enough time verifying that util_est behaves as
expected under different scenarios. More testing would still be appreciated :)

[1] https://lore.kernel.org/lkml/20240820163512.1096301-1-qyousef@layalina.io/
[2] https://lpc.events/event/18/contributions/1880/
[3] https://github.com/qais-yousef/schedqos/compare/main...schedqos

Qais Yousef (13):
  sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom
  sched/pelt: Add a new function to approximate the future util_avg
    value
  sched/pelt: Add a new function to approximate runtime to reach given
    util
  sched/fair: Remove magic hardcoded margin in fits_capacity()
  sched: cpufreq: Remove magic 1.25 headroom from
    sugov_apply_dvfs_headroom()
  sched/fair: Extend util_est to improve rampup time
  sched/fair: util_est: Take into account periodic tasks
  sched/qos: Add a new sched-qos interface
  sched/qos: Add rampup multiplier QoS
  sched/fair: Disable util_est when rampup_multiplier is 0
  sched/fair: Don't mess with util_avg post init
  sched/fair: Call update_util_est() after dequeue_entities()
  sched/pelt: Always allow load updates

 Documentation/scheduler/index.rst             |   1 +
 Documentation/scheduler/sched-qos.rst         |  66 ++++++++++
 include/linux/sched.h                         |  10 ++
 include/linux/sched/cpufreq.h                 |   5 -
 include/uapi/linux/sched.h                    |  10 +-
 include/uapi/linux/sched/types.h              |  46 +++++++
 kernel/sched/core.c                           |  71 ++++++++++
 kernel/sched/cpufreq_schedutil.c              |  49 ++++++-
 kernel/sched/debug.c                          |   1 +
 kernel/sched/fair.c                           | 124 ++++++++++++++++--
 kernel/sched/features.h                       |  21 +++
 kernel/sched/pelt.c                           |  44 ++++++-
 kernel/sched/sched.h                          |  12 ++
 kernel/sched/syscalls.c                       |  61 +++++++++
 .../trace/beauty/include/uapi/linux/sched.h   |   4 +
 15 files changed, 501 insertions(+), 24 deletions(-)
 create mode 100644 Documentation/scheduler/sched-qos.rst

-- 
2.34.1



* [PATCH v2 01/13] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
@ 2026-05-04  1:59 ` Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 02/13] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

We are providing headroom for the utilization to grow until the next
decision point at which the next frequency is picked. Give the function a
better name and some documentation, since it is not really mapping anything.

Also move it to cpufreq_schedutil.c. The function relies on the util signal
being updated appropriately to provide headroom to grow. This is tied to
schedutil and the scheduler, and is not something that can be shared with
other governors.
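
For example, with util = 800 the current implementation requests capacity
for 800 + (800 >> 2) = 1000, i.e. a fixed 25% headroom on top of the current
utilization.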

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 include/linux/sched/cpufreq.h    |  5 -----
 kernel/sched/cpufreq_schedutil.c | 20 +++++++++++++++++++-
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index bdd31ab93bc5..d01755d3142f 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -28,11 +28,6 @@ static inline unsigned long map_util_freq(unsigned long util,
 {
 	return freq * util / cap;
 }
-
-static inline unsigned long map_util_perf(unsigned long util)
-{
-	return util + (util >> 2);
-}
 #endif /* CONFIG_CPU_FREQ */
 
 #endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 153232dd8276..f6de241fc62c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -206,12 +206,30 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
 
+/*
+ * DVFS decisions are made at discrete points. If the CPU stays busy, util
+ * will continue to grow, which means the CPU could need to run at a higher
+ * frequency before the next decision point is reached. IOW, we can't follow
+ * util immediately as it grows; there's a delay before we issue a request to
+ * go to a higher frequency. The headroom caters for this delay so the system
+ * continues to run at an adequate performance point.
+ *
+ * This function provides enough headroom to provide adequate performance
+ * assuming the CPU continues to be busy.
+ *
+ * At the moment it is a constant multiplication with 1.25.
+ */
+static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util)
+{
+	return util + (util >> 2);
+}
+
 unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 				 unsigned long min,
 				 unsigned long max)
 {
 	/* Add dvfs headroom to actual utilization */
-	actual = map_util_perf(actual);
+	actual = sugov_apply_dvfs_headroom(actual);
 	/* Actually we don't need to target the max performance */
 	if (actual < max)
 		max = actual;
-- 
2.34.1



* [PATCH v2 02/13] sched/pelt: Add a new function to approximate the future util_avg value
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 01/13] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom Qais Yousef
@ 2026-05-04  1:59 ` Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 03/13] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

Given a util_avg value, the new function will return the future one
given a runtime delta.

This will be useful in later patches to help replace some magic margins
with more deterministic behavior.
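
For reference, for an always-running entity the function iterates what is
roughly this closed form (assuming the standard 32ms PELT halflife):

	new_util ~= 1024 - (1024 - util) * 0.5^(delta_ms / 32)

e.g. starting from util = 100, a further 16ms of runtime yields roughly
1024 - 924 * 0.5^0.5 ~= 370.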

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 kernel/sched/pelt.c  | 20 ++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 21 insertions(+)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 897790889ba3..5a8f4dc99ffc 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -488,3 +488,23 @@ bool update_other_load_avgs(struct rq *rq)
 		update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) |
 		update_irq_load_avg(rq, 0);
 }
+
+/*
+ * Approximate the new util_avg value assuming an entity has continued to run
+ * for @delta us.
+ */
+unsigned long approximate_util_avg(unsigned long util, u64 delta)
+{
+	struct sched_avg sa = {
+		.util_sum = util * PELT_MIN_DIVIDER,
+		.util_avg = util,
+	};
+
+	if (unlikely(!delta))
+		return util;
+
+	accumulate_sum(delta, &sa, 1, 0, 1);
+	___update_load_avg(&sa, 0);
+
+	return sa.util_avg;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c95584191d58..190515b50dc8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3527,6 +3527,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 				 unsigned long min,
 				 unsigned long max);
 
+unsigned long approximate_util_avg(unsigned long util, u64 delta);
 
 /*
  * Verify the fitness of task @p to run on @cpu taking into account the
-- 
2.34.1



* [PATCH v2 03/13] sched/pelt: Add a new function to approximate runtime to reach given util
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 01/13] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 02/13] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
@ 2026-05-04  1:59 ` Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 04/13] sched/fair: Remove magic hardcoded margin in fits_capacity() Qais Yousef
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

It is basically the ramp-up time from 0 to a given value. It will be used
later to implement a new tunable to control response time for schedutil.
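
This is roughly the inverse of the PELT rise curve (again assuming the
standard 32ms halflife):

	runtime_ms ~= -32 * log2(1 - util / 1024)

e.g. reaching util = 896 = 1024 * (1 - 1/8) takes about 3 halflives,
i.e. ~96ms of runtime.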

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 kernel/sched/pelt.c  | 21 +++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 5a8f4dc99ffc..dbd450798b03 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -508,3 +508,24 @@ unsigned long approximate_util_avg(unsigned long util, u64 delta)
 
 	return sa.util_avg;
 }
+
+/*
+ * Approximate the amount of runtime (in ms) required to reach @util.
+ */
+u64 approximate_runtime(unsigned long util)
+{
+	struct sched_avg sa = {};
+	u64 delta = 1024; // period = 1024 = ~1ms
+	u64 runtime = 0;
+
+	if (unlikely(!util))
+		return runtime;
+
+	while (sa.util_avg < util) {
+		accumulate_sum(delta, &sa, 1, 0, 1);
+		___update_load_avg(&sa, 0);
+		runtime++;
+	}
+
+	return runtime;
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 190515b50dc8..a445add5cc3a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3528,6 +3528,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 				 unsigned long max);
 
 unsigned long approximate_util_avg(unsigned long util, u64 delta);
+u64 approximate_runtime(unsigned long util);
 
 /*
  * Verify the fitness of task @p to run on @cpu taking into account the
-- 
2.34.1



* [PATCH v2 04/13] sched/fair: Remove magic hardcoded margin in fits_capacity()
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
                   ` (2 preceding siblings ...)
  2026-05-04  1:59 ` [PATCH v2 03/13] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
@ 2026-05-04  1:59 ` Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 05/13] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom() Qais Yousef
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

Replace the hardcoded margin value in fits_capacity() with better dynamic
logic.

The 80% margin is a magic value that has served its purpose for now, but it
no longer fits the variety of systems that exist today. If a system is
over-powered specifically, this 80% means we leave a lot of capacity unused
before we decide to upmigrate on an HMP system.

On many systems the little cores are under-powered and the ability to
migrate away from them faster is desired.

Redefine misfit migration to mean the utilization threshold at which a task
would become misfit at the next load balance event, assuming it becomes an
always running task.

To calculate this threshold, we use the new approximate_util_avg() function
to find out, based on arch_scale_cpu_capacity(), the point at which the task
will be misfit if it continues to run for TICK_USEC, which is our worst case
scenario for when misfit migration will kick in.
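
As an illustrative example (my numbers; assuming HZ=250 so TICK_USEC=4000
and the standard 32ms PELT halflife): for a little CPU with
arch_scale_cpu_capacity() = 160, approximate_runtime(160) is ~8ms;
subtracting one tick leaves ~4ms, and approximate_util_avg(0, ~4000) gives
a threshold of roughly 80. A task crossing util ~80 on that CPU is
considered misfit, since by the next tick-driven load balance it could
already have outgrown the CPU.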

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 kernel/sched/core.c  |  1 +
 kernel/sched/fair.c  | 40 ++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h |  1 +
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49cd5d217161..47ec8ea7c52e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8921,6 +8921,7 @@ void __init sched_init(void)
 		rq->sd = NULL;
 		rq->rd = NULL;
 		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+		rq->fits_capacity_threshold = SCHED_CAPACITY_SCALE;
 		rq->balance_callback = &balance_push_callback;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f179faf7a6a1..4e1ed3c7f96e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -97,11 +97,15 @@ int __weak arch_asym_cpu_priority(int cpu)
 }
 
 /*
- * The margin used when comparing utilization with CPU capacity.
- *
- * (default: ~20%)
+ * fits_capacity() must ensure that a task will not be 'stuck' on a CPU with
+ * lower capacity for too long. The threshold is the util value at which, if
+ * a task becomes always busy, it could miss the misfit migration load balance
+ * event. So we consider a task misfit before it reaches this point.
  */
-#define fits_capacity(cap, max)	((cap) * 1280 < (max) * 1024)
+static inline bool fits_capacity(unsigned long util, int cpu)
+{
+	return util < cpu_rq(cpu)->fits_capacity_threshold;
+}
 
 /*
  * The margin used when comparing CPU capacities.
@@ -5180,14 +5184,13 @@ static inline int util_fits_cpu(unsigned long util,
 				unsigned long uclamp_max,
 				int cpu)
 {
-	unsigned long capacity = capacity_of(cpu);
 	unsigned long capacity_orig;
 	bool fits, uclamp_max_fits;
 
 	/*
 	 * Check if the real util fits without any uclamp boost/cap applied.
 	 */
-	fits = fits_capacity(util, capacity);
+	fits = fits_capacity(util, cpu);
 
 	if (!uclamp_is_used())
 		return fits;
@@ -10299,12 +10302,33 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	unsigned long capacity = scale_rt_capacity(cpu);
 	struct sched_group *sdg = sd->groups;
+	struct rq *rq = cpu_rq(cpu);
+	u64 limit;
 
 	if (!capacity)
 		capacity = 1;
 
-	cpu_rq(cpu)->cpu_capacity = capacity;
-	trace_sched_cpu_capacity_tp(cpu_rq(cpu));
+	rq->cpu_capacity = capacity;
+	trace_sched_cpu_capacity_tp(rq);
+
+	/*
+	 * Calculate the util at which a task must be considered a misfit.
+	 *
+	 * We must ensure that a task experiences the same ramp-up time to
+	 * reach the max performance point of the system regardless of the
+	 * CPU it is running on (due to invariance, time will stretch and the
+	 * task will take longer to achieve the same util value compared to
+	 * a task running on a big CPU), and that a delay in misfit migration,
+	 * which depends on TICK, doesn't end up hurting it, as the migration
+	 * could happen after we would have crossed this threshold.
+	 *
+	 * To ensure that invariance is taken into account, we don't scale
+	 * time and use it as-is; approximate_util_avg() will then tell us
+	 * our threshold.
+	 */
+	limit = approximate_runtime(arch_scale_cpu_capacity(cpu)) * USEC_PER_MSEC;
+	limit -= TICK_USEC; /* sd->balance_interval is more accurate */
+	rq->fits_capacity_threshold = approximate_util_avg(0, limit);
 
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = capacity;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a445add5cc3a..24008f1ec812 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1236,6 +1236,7 @@ struct rq {
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
 
+	unsigned long		fits_capacity_threshold;
 	unsigned long		misfit_task_load;
 
 	/* For active balancing */
-- 
2.34.1



* [PATCH v2 05/13] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom()
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
                   ` (3 preceding siblings ...)
  2026-05-04  1:59 ` [PATCH v2 04/13] sched/fair: Remove magic hardcoded margin in fits_capacity() Qais Yousef
@ 2026-05-04  1:59 ` Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 06/13] sched/fair: Extend util_est to improve rampup time Qais Yousef
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

Replace the 1.25 headroom in sugov_apply_dvfs_headroom() with better dynamic
logic.

Instead of the magical 1.25 headroom, use the new approximate_util_avg() to
provide headroom based on the dvfs_update_delay, which is the period at which
the cpufreq governor will send DVFS updates to the hardware, or
min(curr.se.slice, TICK_USEC), which is the max delay for the util signal to
change and prompt a cpufreq update; whichever is higher.
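
As an illustrative example (my numbers; assuming HZ=250 so TICK_USEC=4000,
a single running task and rate_limit_us=2000): the delay resolves to
max(4000, 2000) = 4000us, so for util = 512 the headroom becomes
approximate_util_avg(512, 4000) ~= 1024 - (1024 - 512) * 0.5^(4/32) ~= 554,
compared with 512 * 1.25 = 640 from the constant headroom.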

Add a new percpu dvfs_update_delay that can be cheaply accessed whenever
sugov_apply_dvfs_headroom() is called. We expect cpufreq governors that rely
on util to drive their DVFS logic/algorithm to populate these percpu
variables. schedutil is the only such governor at the moment.

The behavior of schedutil will change. Some systems will experience
faster dvfs rampup (because of higher TICK or rate_limit_us), others
will experience slower rampup.

The impact on performance would not be visible if not for the black hole
effect of utilization invariance, a problem that will be addressed in later
patches.

The CONST_DVFS_HEADROOM sched_feat allows reverting to the old behavior for
easy backward compatibility.

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 kernel/sched/core.c              |  1 +
 kernel/sched/cpufreq_schedutil.c | 39 +++++++++++++++++++++++++++-----
 kernel/sched/features.h          |  6 +++++
 kernel/sched/sched.h             |  9 ++++++++
 4 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 47ec8ea7c52e..3fbf560203f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -124,6 +124,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_exit_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_set_need_resched_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
+DEFINE_PER_CPU_READ_MOSTLY(u64, dvfs_update_delay);
 DEFINE_PER_CPU(struct rnd_state, sched_rnd_state);
 
 #ifdef CONFIG_SCHED_PROXY_EXEC
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index f6de241fc62c..b529f5b96f6e 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -215,13 +215,31 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
  * continues to run at an adequate performance point.
  *
  * This function provides enough headroom to provide adequate performance
- * assuming the CPU continues to be busy.
+ * assuming the CPU continues to be busy. This headroom is based on the
+ * dvfs_update_delay of the cpufreq governor or min(curr.se.slice, TICK_USEC),
+ * whichever is higher.
  *
- * At the moment it is a constant multiplication with 1.25.
+ * XXX: Should we provide headroom when the util is decaying?
  */
-static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util)
+static inline unsigned long sugov_apply_dvfs_headroom(unsigned long util, int cpu)
 {
-	return util + (util >> 2);
+	struct rq *rq = cpu_rq(cpu);
+	u64 delay;
+
+	if (sched_feat(CONST_DVFS_HEADROOM))
+		return util + (util >> 2);
+
+	/*
+	 * What is the possible worst case scenario for updating util_avg, ctx
+	 * switch or TICK?
+	 */
+	if (rq->cfs.h_nr_queued > 1)
+		delay = min(rq->curr->se.slice/1000, TICK_USEC);
+	else
+		delay = TICK_USEC;
+	delay = max(delay, per_cpu(dvfs_update_delay, cpu));
+
+	return approximate_util_avg(util, delay);
 }
 
 unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
@@ -229,7 +247,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 				 unsigned long max)
 {
 	/* Add dvfs headroom to actual utilization */
-	actual = sugov_apply_dvfs_headroom(actual);
+	actual = sugov_apply_dvfs_headroom(actual, cpu);
 	/* Actually we don't need to target the max performance */
 	if (actual < max)
 		max = actual;
@@ -615,15 +633,21 @@ rate_limit_us_store(struct gov_attr_set *attr_set, const char *buf, size_t count
 	struct sugov_tunables *tunables = to_sugov_tunables(attr_set);
 	struct sugov_policy *sg_policy;
 	unsigned int rate_limit_us;
+	int cpu;
 
 	if (kstrtouint(buf, 10, &rate_limit_us))
 		return -EINVAL;
 
 	tunables->rate_limit_us = rate_limit_us;
 
-	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook)
+	list_for_each_entry(sg_policy, &attr_set->policy_list, tunables_hook) {
+
 		sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
 
+		for_each_cpu(cpu, sg_policy->policy->cpus)
+			per_cpu(dvfs_update_delay, cpu) = rate_limit_us;
+	}
+
 	return count;
 }
 
@@ -886,6 +910,9 @@ static int sugov_start(struct cpufreq_policy *policy)
 		memset(sg_cpu, 0, sizeof(*sg_cpu));
 		sg_cpu->cpu = cpu;
 		sg_cpu->sg_policy = sg_policy;
+
+		per_cpu(dvfs_update_delay, cpu) = sg_policy->tunables->rate_limit_us;
+
 		cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util, uu);
 	}
 	return 0;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a25f97201ab9..6f7e5bba854f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -129,3 +129,9 @@ SCHED_FEAT(LATENCY_WARN, false)
  */
 SCHED_FEAT(NI_RANDOM, true)
 SCHED_FEAT(NI_RATE, true)
+
+/*
+ * For backward compatibility. Use the constant 1.25 dvfs headroom in
+ * schedutil instead of the dynamic one.
+ */
+SCHED_FEAT(CONST_DVFS_HEADROOM, false)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24008f1ec812..16ebd8eb48d5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3531,6 +3531,15 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 unsigned long approximate_util_avg(unsigned long util, u64 delta);
 u64 approximate_runtime(unsigned long util);
 
+/*
+ * Any governor that relies on util signal to drive DVFS, must populate these
+ * percpu dvfs_update_delay variables.
+ *
+ * It should describe the rate/delay at which the governor sends DVFS freq
+ * update to the hardware in us.
+ */
+DECLARE_PER_CPU_READ_MOSTLY(u64, dvfs_update_delay);
+
 /*
  * Verify the fitness of task @p to run on @cpu taking into account the
  * CPU original capacity and the runtime/deadline ratio of the task.
-- 
2.34.1



* [PATCH v2 06/13] sched/fair: Extend util_est to improve rampup time
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
                   ` (4 preceding siblings ...)
  2026-05-04  1:59 ` [PATCH v2 05/13] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom() Qais Yousef
@ 2026-05-04  1:59 ` Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 07/13] sched/fair: util_est: Take into account periodic tasks Qais Yousef
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

Utilization invariance can cause big delays. When tasks are running,
accumulate a non-invariant version of utilization to help tasks settle down
to their new util_avg values faster.

Keep track of delta_exec while runnable across activations to help update
util_est accurately for a long running task. util_est should still behave
the same at enqueue/dequeue.
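
For example (illustrative): a task that has been running non-stop for 8ms
since its estimate last changed gets util_est ~=
approximate_util_avg(prev_est, 8000), i.e. the estimate follows the
wall-clock PELT curve, with no invariance scaling slowing it down on
a little CPU or at a low frequency.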

Before this patch, a busy task ramping up would experience the following
transitions, running on an M1 Mac Mini:

                            rampup-6338 util_avg running
     ┌─────────────────────────────────────────────────────────────────────────┐
986.0┤                                                               ▄▄▄▄▄▟▀▀▀▀│
     │                                                        ▗▄▄▟▀▀▀▘         │
     │                                                    ▗▄▟▀▀                │
     │                                                 ▄▟▀▀                    │
739.5┤                                              ▄▟▀▘                       │
     │                                           ▗▄▛▘                          │
     │                                         ▗▟▀                             │
493.0┤                                       ▗▛▀                               │
     │                                    ▗▄▛▀                                 │
     │                                  ▄▟▀                                    │
     │                                ▄▛▘                                      │
246.5┤                             ▗▟▀▘                                        │
     │                          ▄▟▀▀                                           │
     │                      ▗▄▄▛▘                                              │
     │                 ▗▄▄▄▟▀                                                  │
  0.0┤  ▗         ▗▄▄▟▀▀                                                       │
     └┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬┘
    1.700   1.733   1.767   1.800   1.833   1.867   1.900   1.933   1.967 2.000

───────────────── rampup-6338 util_avg running residency (ms) ──────────────────
0.0   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.5
15.0  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
36.0  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
57.0  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
78.0  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.9
98.0  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
117.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
137.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
156.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
176.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
191.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
211.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
230.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
248.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
266.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
277.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
294.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.6
311.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.4
327.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
340.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
358.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
371.0 ▇▇▇▇▇▇▇▇▇ 1.0
377.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
389.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
401.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
413.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
431.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
442.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
456.0 ▇▇▇▇▇▇▇▇▇ 1.0

───────────────────────── Sum Time Running on CPU (ms) ─────────────────────────
CPU0.0 ▇▇▇▇▇ 90.39
CPU4.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1156.93

                            6338 rampup CPU0.0 Frequency
    ┌──────────────────────────────────────────────────────────────────────────┐
2.06┤                                ▛▀▀                                       │
    │                                ▌                                         │
    │                                ▌                                         │
    │                                ▌                                         │
1.70┤                             ▛▀▀▘                                         │
    │                             ▌                                            │
    │                             ▌                                            │
1.33┤                         ▗▄▄▄▌                                            │
    │                         ▐                                                │
    │                         ▐                                                │
    │                         ▐                                                │
0.97┤                     ▗▄▄▄▟                                                │
    │                     ▐                                                    │
    │                     ▐                                                    │
    │                     ▐                                                    │
0.60┤  ▗         ▗▄▄▄▄▄▄▄▄▟                                                    │
    └┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
   1.700   1.733   1.767   1.800   1.833    1.867   1.900   1.933   1.967 2.000

                            6338 rampup CPU4.0 Frequency
    ┌──────────────────────────────────────────────────────────────────────────┐
3.20┤                                                    ▐▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀│
    │                                                    ▐                     │
    │                                                  ▛▀▀                     │
    │                                                  ▌                       │
2.78┤                                               ▐▀▀▘                       │
    │                                             ▗▄▟                          │
    │                                             ▌                            │
2.35┤                                          ▗▄▄▌                            │
    │                                          ▐                               │
    │                                        ▄▄▟                               │
    │                                        ▌                                 │
1.93┤                                     ▗▄▄▌                                 │
    │                                     ▐                                    │
    │                                     ▐                                    │
    │                                     ▐                                    │
1.50┤                                  ▗▄▄▟                                    │
    └┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
   1.700   1.733   1.767   1.800   1.833    1.867   1.900   1.933   1.967 2.000

───────────────── 6338 rampup CPU0.0 Frequency residency (ms) ──────────────────
0.6   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 37.3
0.972 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 15.0
1.332 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 15.0
1.704 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.0
2.064 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.1

───────────────── 6338 rampup CPU4.0 Frequency residency (ms) ──────────────────
1.5   ▇▇▇▇▇▇▇▇▇▇ 11.9
1.956 ▇▇▇▇▇▇▇▇ 10.0
2.184 ▇▇▇▇▇▇▇▇ 10.0
2.388 ▇▇▇▇▇▇▇▇▇ 11.0
2.592 ▇▇▇▇▇▇▇▇ 10.0
2.772 ▇▇▇▇▇▇▇▇ 10.0
2.988 ▇▇▇▇▇▇▇▇ 10.0
3.204 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 85.3

After the patch, the response is improved: frequencies ramp up faster and
the task migrates off the little CPU quicker:

                           rampup-2234 util_avg running
   ┌───────────────────────────────────────────────────────────────────────────┐
984┤                                                                ▗▄▄▄▄▄▛▀▀▀▀│
   │                                                          ▄▄▟▀▀▀▀          │
   │                                                     ▄▄▟▀▀                 │
   │                                                  ▄▟▀▘                     │
738┤                                               ▄▟▀▘                        │
   │                                            ▗▟▀▘                           │
   │                                          ▗▟▀                              │
492┤                                        ▗▟▀                                │
   │                                      ▗▟▀                                  │
   │                                     ▟▀                                    │
   │                                   ▄▛▘                                     │
246┤                                 ▗▟▘                                       │
   │                               ▗▟▀                                         │
   │                             ▗▟▀                                           │
   │                           ▗▟▀                                             │
  0┤                       ▄▄▄▛▀                                               │
   └┬───────┬───────┬────────┬───────┬───────┬───────┬────────┬───────┬───────┬┘
  1.700   1.733   1.767    1.800   1.833   1.867   1.900    1.933   1.967 2.000

───────────────── rampup-2234 util_avg running residency (ms) ──────────────────
0.0   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.6
15.0  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.0
39.0  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.0
61.0  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.0
85.0  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
99.0  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
120.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.0
144.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
160.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
176.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
192.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
210.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
228.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
246.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
263.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
282.0 ▇▇▇▇▇▇▇ 1.0
291.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
309.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
327.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
344.0 ▇▇▇▇▇▇▇ 1.0
354.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
373.0 ▇▇▇▇▇▇▇ 1.0
382.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
400.0 ▇▇▇▇▇▇▇ 1.0
408.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
425.0 ▇▇▇▇▇▇▇ 1.0
434.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.0
452.0 ▇▇▇▇▇▇▇ 1.0

                            2234 rampup CPU1.0 Frequency
    ┌──────────────────────────────────────────────────────────────────────────┐
2.06┤                             ▐▀                                           │
    │                             ▐                                            │
    │                             ▐                                            │
    │                             ▐                                            │
1.70┤                            ▛▀                                            │
    │                            ▌                                             │
    │                            ▌                                             │
1.33┤                           ▄▌                                             │
    │                           ▌                                              │
    │                           ▌                                              │
    │                           ▌                                              │
0.97┤                         ▗▄▌                                              │
    │                         ▐                                                │
    │                         ▐                                                │
    │                         ▐                                                │
0.60┤                      ▗▄▄▟                                                │
    └┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
   1.700   1.733   1.767   1.800   1.833    1.867   1.900   1.933   1.967 2.000

                            2234 rampup CPU4.0 Frequency
    ┌──────────────────────────────────────────────────────────────────────────┐
3.10┤                                                            ▐▀▀▀▀▀▀▀▀▀▀▀▀▀│
    │                                                 ▛▀▀▀▀▀▀▀▀▀▀▀             │
    │                                                 ▌                        │
    │                                            ▐▀▀▀▀▘                        │
2.70┤                                            ▐                             │
    │                                        ▐▀▀▀▀                             │
    │                                        ▐                                 │
2.30┤                                      ▛▀▀                                 │
    │                                      ▌                                   │
    │                                   ▐▀▀▘                                   │
    │                                   ▐                                      │
1.90┤                                 ▐▀▀                                      │
    │                                 ▐                                        │
    │                               ▗▄▟                                        │
    │                               ▐                                          │
1.50┤                              ▗▟                                          │
    └┬───────┬───────┬───────┬───────┬────────┬───────┬───────┬───────┬───────┬┘
   1.700   1.733   1.767   1.800   1.833    1.867   1.900   1.933   1.967 2.000

───────────────────────── Sum Time Running on CPU (ms) ─────────────────────────
CPU1.0 ▇▇▇▇ 32.53
CPU4.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 540.3

───────────────── 2234 rampup CPU1.0 Frequency residency (ms) ──────────────────
0.6   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12.1
0.972 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.5
1.332 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.7
1.704 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.5
2.064 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.8

───────────────── 2234 rampup CPU4.0 Frequency residency (ms) ──────────────────
1.5   ▇▇▇▇▇ 4.0
1.728 ▇▇▇▇▇▇▇▇▇▇ 8.0
1.956 ▇▇▇▇▇▇▇▇▇▇▇▇ 9.0
2.184 ▇▇▇▇▇▇▇▇▇▇▇▇ 9.0
2.388 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.0
2.592 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 16.0
2.772 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 18.0
2.988 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 47.0
3.096 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 53.4

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   |  1 +
 kernel/sched/fair.c   | 41 ++++++++++++++++++++++++++++++++++-------
 3 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8ec3b6d7d718..b61da16861e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -589,6 +589,7 @@ struct sched_entity {
 					/* hole */
 
 	u64				exec_start;
+	u64				delta_exec;
 	u64				sum_exec_runtime;
 	u64				prev_sum_exec_runtime;
 	u64				vruntime;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3fbf560203f3..fe14fd4a2d53 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4391,6 +4391,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 
 	p->se.on_rq			= 0;
 	p->se.exec_start		= 0;
+	p->se.delta_exec		= 0;
 	p->se.sum_exec_runtime		= 0;
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e1ed3c7f96e..c6363ec5de9d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1344,6 +1344,7 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 		 */
 		running->se.exec_start = now;
 		running->se.sum_exec_runtime += delta_exec;
+		running->se.delta_exec += delta_exec;
 
 		trace_sched_stat_runtime(running, delta_exec);
 		account_group_exec_runtime(running, delta_exec);
@@ -1362,7 +1363,6 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
 		__schedstat_set(stats->exec_max,
 				max(delta_exec, stats->exec_max));
 	}
-
 	return delta_exec;
 }
 
@@ -5099,15 +5099,30 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	if (!sched_feat(UTIL_EST))
 		return;
 
+	/* Get current estimate of utilization */
+	ewma = READ_ONCE(p->se.avg.util_est);
+
 	/*
-	 * Skip update of task's estimated utilization when the task has not
-	 * yet completed an activation, e.g. being migrated.
+	 * If a task is running, update util_est ignoring utilization
+	 * invariance so that if the task suddenly becomes busy we will ramp up
+	 * quickly to settle down to our new util_avg.
 	 */
-	if (!task_sleep)
-		return;
+	if (!task_sleep) {
+		u64 delta = p->se.delta_exec;
+		unsigned int prev_ewma = ewma & ~UTIL_AVG_UNCHANGED;
 
-	/* Get current estimate of utilization */
-	ewma = READ_ONCE(p->se.avg.util_est);
+		do_div(delta, 1000);
+		ewma = approximate_util_avg(prev_ewma, delta);
+		/*
+		 * Keep accumulating delta_exec if it is too small to cause
+		 * a change.
+		 */
+		if (ewma != prev_ewma)
+			p->se.delta_exec = 0;
+		goto done;
+	} else {
+		p->se.delta_exec = 0;
+	}
 
 	/*
 	 * If the PELT values haven't changed since enqueue time,
@@ -5170,6 +5185,14 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	trace_sched_util_est_se_tp(&p->se);
 }
 
+static inline void util_est_update_running(struct cfs_rq *cfs_rq,
+					   struct task_struct *p)
+{
+	util_est_dequeue(cfs_rq, p);
+	util_est_update(cfs_rq, p, false);
+	util_est_enqueue(cfs_rq, p);
+}
+
 static inline unsigned long get_actual_cpu_capacity(int cpu)
 {
 	unsigned long capacity = arch_scale_cpu_capacity(cpu);
@@ -9245,6 +9268,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 simple:
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 	put_prev_set_next_task(rq, prev, p);
+	if (prev->on_rq)
+		util_est_update_running(&rq->cfs, prev);
 	return p;
 
 idle:
@@ -13670,6 +13695,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		entity_tick(cfs_rq, se, queued);
 	}
 
+	util_est_update_running(&rq->cfs, curr);
+
 	if (queued) {
 		if (!need_resched())
 			hrtick_start_fair(rq, curr);
-- 
2.34.1



* [PATCH v2 07/13] sched/fair: util_est: Take into account periodic tasks
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
                   ` (5 preceding siblings ...)
  2026-05-04  1:59 ` [PATCH v2 06/13] sched/fair: Extend util_est to improve rampup time Qais Yousef
@ 2026-05-04  1:59 ` Qais Yousef
  2026-05-04  1:59 ` [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface Qais Yousef
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

The new faster rampup is great for performance, but terrible for power. We
want the faster rampup to be applied only to tasks that are transitioning
from one periodic/steady state to another. If they are stably periodic, the
faster rampup doesn't make sense, as util_avg describes their computational
demand accurately and we can rely on it to make accurate decisions, and
preserve the power savings of being exact with the resources we give to the
task (i.e. smaller DVFS headroom).

We detect periodic tasks based on util_avg across util_est_update() calls.
If it is rising, then the task is going through a transition.

We rely on util_avg being stable for periodic tasks, with very little
variation around one stable point.
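
For example (illustrative): a stable 10ms-every-30ms periodic task dequeues
at roughly the same util_avg each period, so task_util(p) -
task_util_dequeued(p) stays within UTIL_EST_MARGIN and the fast non-invariant
rampup is skipped while it runs. Only when its duty cycle grows does the
difference exceed the margin and the fast path kick in.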

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 35 ++++++++++++++++++++++++-----------
 3 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b61da16861e7..70517497e80b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -907,6 +907,8 @@ struct task_struct {
 	struct uclamp_se		uclamp[UCLAMP_CNT];
 #endif
 
+	unsigned long			util_avg_dequeued;
+
 	struct sched_statistics         stats;
 
 #ifdef CONFIG_PREEMPT_NOTIFIERS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe14fd4a2d53..82189bdc85b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4409,6 +4409,8 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
 #endif
 #endif
 
+	p->util_avg_dequeued		= 0;
+
 #ifdef CONFIG_SCHEDSTATS
 	/* Even if schedstat is disabled, there should not be garbage */
 	memset(&p->stats, 0, sizeof(p->stats));
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6363ec5de9d..d9729da3901a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5041,6 +5041,11 @@ static inline unsigned long task_util(struct task_struct *p)
 	return READ_ONCE(p->se.avg.util_avg);
 }
 
+static inline unsigned long task_util_dequeued(struct task_struct *p)
+{
+	return READ_ONCE(p->util_avg_dequeued);
+}
+
 static inline unsigned long task_runnable(struct task_struct *p)
 {
 	return READ_ONCE(p->se.avg.runnable_avg);
@@ -5108,18 +5113,22 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	 * quickly to settle down to our new util_avg.
 	 */
 	if (!task_sleep) {
-		u64 delta = p->se.delta_exec;
-		unsigned int prev_ewma = ewma & ~UTIL_AVG_UNCHANGED;
+		if (task_util(p) > task_util_dequeued(p) &&
+		    task_util(p) - task_util_dequeued(p) > UTIL_EST_MARGIN) {
+			u64 delta = p->se.delta_exec;
+			unsigned int prev_ewma = ewma & ~UTIL_AVG_UNCHANGED;
 
-		do_div(delta, 1000);
-		ewma = approximate_util_avg(prev_ewma, delta);
-		/*
-		 * Keep accumulating delta_exec if it is too small to cause
-		 * a change.
-		 */
-		if (ewma != prev_ewma)
-			p->se.delta_exec = 0;
-		goto done;
+			do_div(delta, 1000);
+			ewma = approximate_util_avg(prev_ewma, delta);
+			/*
+			 * Keep accumulating delta_exec if it is too small to cause
+			 * a change.
+			 */
+			if (ewma != prev_ewma)
+				p->se.delta_exec = 0;
+			goto done_running;
+		}
+		return;
 	} else {
 		p->se.delta_exec = 0;
 	}
@@ -5134,6 +5143,9 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	/* Get utilization at dequeue */
 	dequeued = task_util(p);
 
+	if (!task_on_rq_migrating(p))
+		p->util_avg_dequeued = dequeued;
+
 	/*
 	 * Reset EWMA on utilization increases, the moving average is used only
 	 * to smooth utilization decreases.
@@ -5180,6 +5192,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	ewma >>= UTIL_EST_WEIGHT_SHIFT;
 done:
 	ewma |= UTIL_AVG_UNCHANGED;
+done_running:
 	WRITE_ONCE(p->se.avg.util_est, ewma);
 
 	trace_sched_util_est_se_tp(&p->se);
-- 
2.34.1



* [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
                   ` (6 preceding siblings ...)
  2026-05-04  1:59 ` [PATCH v2 07/13] sched/fair: util_est: Take into account periodic tasks Qais Yousef
@ 2026-05-04  1:59 ` Qais Yousef
  2026-05-06 20:38   ` Tim Chen
  2026-05-04  1:59 ` [PATCH v2 09/13] sched/qos: Add rampup multiplier QoS Qais Yousef
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

Provide a generic and extensible interface to describe arbitrary QoS tags
to tell the kernel about specific behavior that doesn't fall under the
existing sched_attr.

The interface is broken into three parts:

* Type
* Value
* Cookie

Type is an enum that should give us enough space to extend (and deprecate)
comfortably.

Value is a signed 64bit number to allow for arbitrarily high values.

Cookie is to help group tasks selectively, since some QoS hints might want to
operate on tasks per group. A value of 0 indicates system wide.
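
A minimal sketch of how a privileged helper is anticipated to set a hint
(illustrative only: the sched_qos_* field names follow the uapi documentation
in this patch, while SCHED_QOS_RAMPUP_MULTIPLIER is a stand-in for an enum
value only added in a later patch):

	#include <linux/sched/types.h>	/* struct sched_attr */
	#include <sys/syscall.h>
	#include <unistd.h>

	static int set_rampup_qos(pid_t pid, __s64 value)
	{
		struct sched_attr attr = {
			.size			= sizeof(attr),
			.sched_flags		= SCHED_FLAG_QOS,
			.sched_qos_type		= SCHED_QOS_RAMPUP_MULTIPLIER,
			.sched_qos_value	= value,
			.sched_qos_cookie	= 0,	/* 0 == system wide */
		};

		/* sched_setattr() has no glibc wrapper; go via syscall(2) */
		return syscall(SYS_sched_setattr, pid, &attr, 0);
	}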

There are two anticipated users being discussed on the list.

1. Per task rampup multiplier to allow controlling how fast util rises,
   and by implication how fast a task can migrate between cores on HMP
   systems and cause freqs to rise with schedutil.

2. Tag a group of tasks that are memory dependent for Cache Aware
   Scheduling.

The interface is anticipated to be provisioned to apps via utilities and
libraries. schedqos [1] is an example of how such an interface can be used
to provide a higher level QoS abstraction to describe workloads without
baking it into the binaries, and by implication without worrying about
potential abuse. The interface requires privileged access since QoS is
considered a scarce resource and requires admin control to ensure it is set
properly. Again, that admin control is anticipated to be the schedqos
utility service.

QoS is treated as a scarce resource and the intention is for a syscall to be
done for each individual QoS tag. For the same reason, QoS tags are not
inherited on fork by default.

A reasonable point of debate is whether to make sched_qos an array of 3 or
5 values to avoid a potential bottleneck if this grows large and users end
up having to issue too many syscalls to set all QoS hints. Being limited as
it is now helps enforce intentionality and scarcity of tagging.

[1] https://github.com/qais-yousef/schedqos

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 Documentation/scheduler/index.rst             |  1 +
 Documentation/scheduler/sched-qos.rst         | 44 ++++++++++++++++++
 include/uapi/linux/sched.h                    |  4 ++
 include/uapi/linux/sched/types.h              | 46 +++++++++++++++++++
 kernel/sched/syscalls.c                       | 10 ++++
 .../trace/beauty/include/uapi/linux/sched.h   |  4 ++
 6 files changed, 109 insertions(+)
 create mode 100644 Documentation/scheduler/sched-qos.rst

diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index 17ce8d76befc..6652f18e553b 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -23,5 +23,6 @@ Scheduler
     sched-stats
     sched-ext
     sched-debug
+    sched-qos
 
     text_files
diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
new file mode 100644
index 000000000000..0911261cb124
--- /dev/null
+++ b/Documentation/scheduler/sched-qos.rst
@@ -0,0 +1,44 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+Scheduler QoS
+=============
+
+1. Introduction
+===============
+
+Different workloads have different scheduling requirements to operate
+optimally. The same applies to tasks within the same workload.
+
+To enable smarter usage of system resources and to cater for the conflicting
+demands of various tasks, Scheduler QoS provides a mechanism to provide more
+information about those demands so that the scheduler can make a best effort
+to honour them.
+
+  @sched_qos_type	what QoS hint to apply
+  @sched_qos_value	value of the QoS hint
+  @sched_qos_cookie	magic cookie to tag a group of tasks to which the QoS
+			applies. If 0, the hint will apply globally system
+			wide. If not 0, the hint will be relative only to
+			tasks that have the same cookie value.
+
+QoS hints are set once and not inherited by children by design. The
+rationale is that each task has its individual characteristics and it is
+encouraged to describe each of these separately. Also, since system resources
+are finite, there's a limit to what can be done to honour these requests
+before reaching a tipping point where there are too many requests for
+a particular QoS to service all of them at once, and some will start to lose
+out. For example, if 10 tasks require better wake up latencies on a 4 CPU
+SMP system and they all wake up at once, only 4 can perceive the hint as
+honoured and the rest will have to wait. Inheritance can easily turn these
+10 into a 100 or a 1000, and then the QoS hint will rapidly lose its meaning
+and effectiveness. The chances of 10 tasks waking up at the same time are
+lower than those of a 100, and lower still than those of a 1000.
+
+To set multiple QoS hints, a syscall is required for each. This is a
+trade-off to reduce the churn of extending the interface, as the hope is
+for this to evolve as workloads and hardware get more sophisticated and
+the need for extension arises; when this happens, it should be simpler
+to add the kernel extension and allow userspace to use it readily by
+setting the newly added flag, without having to update the whole of
+sched_attr.
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 52b69ce89368..3cdba44bc1cb 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -102,6 +102,9 @@ struct clone_args {
 	__aligned_u64 set_tid_size;
 	__aligned_u64 cgroup;
 };
+
+enum sched_qos_type {
+};
 #endif
 
 #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
@@ -133,6 +136,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_QOS			0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index bf6e9ae031c1..b65da4938f43 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -94,6 +94,48 @@
  * scheduled on a CPU with no more capacity than the specified value.
  *
  * A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * Scheduler QoS
+ * =============
+ *
+ * Different workloads have different scheduling requirements to operate
+ * optimally. The same applies to tasks within the same workload.
+ *
+ * To enable smarter usage of system resources and to cater for the conflicting
+ * demands of various tasks, Scheduler QoS provides a mechanism to pass more
+ * information about those demands so that the scheduler can make a best effort
+ * to honour them.
+ *
+ *  @sched_qos_type	what QoS hint to apply
+ *  @sched_qos_value	value of the QoS hint
+ *  @sched_qos_cookie	magic cookie to tag a group of tasks to which the QoS
+ *			applies. If 0, the hint will apply globally system
+ *			wide. If not 0, the hint will be relative to tasks that
+ *			have the same cookie value only.
+ *
+ * QoS hints are set once and not inherited by children by design. The
+ * rationale is that each task has its own individual characteristics and it
+ * is encouraged to describe each of these separately. Also, since system
+ * resources are finite, there's a limit to what can be done to honour these
+ * requests before reaching a tipping point where there are too many requests
+ * for a particular QoS to service all of them at once and some will start to
+ * lose out. For example, if 10 tasks require better wake up latencies on
+ * a 4-CPU SMP system and they all wake up at once, only 4 can perceive the
+ * hint as honoured and the rest will have to wait. Inheritance can easily
+ * turn these 10 tasks into 100 or 1000, and then the QoS hint will rapidly
+ * lose its meaning and effectiveness. The chance of all tasks waking up at
+ * the same time is much lower with 10 tasks than with 100 or 1000.
+ *
+ * To set multiple QoS hints, a syscall is required for each. This is a
+ * trade-off to reduce the churn of extending the interface: the hope is
+ * for it to evolve as workloads and hardware get more sophisticated and
+ * the need for extensions arises; when this happens, it should be simpler
+ * to add the extension to the kernel, and userspace can readily use it by
+ * setting the newly added flag without having to update the whole of
+ * sched_attr.
+ *
+ * Details about the available QoS hints can be found in:
+ * Documentation/scheduler/sched-qos.rst
  */
 struct sched_attr {
 	__u32 size;
@@ -116,6 +158,10 @@ struct sched_attr {
 	__u32 sched_util_min;
 	__u32 sched_util_max;
 
+	__u32 sched_qos_type;
+	__s64 sched_qos_value;
+	__u32 sched_qos_cookie;
+
 };
 
 #endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index b215b0ead9a6..88feedd2f7c9 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -481,6 +481,13 @@ static int user_check_sched_setscheduler(struct task_struct *p,
 	if (p->sched_reset_on_fork && !reset_on_fork)
 		goto req_priv;
 
+	/*
+	 * Normal users can't set QoS on their own, must go via admin
+	 * controlled service
+	 */
+	if (attr->sched_flags & SCHED_FLAG_QOS)
+		goto req_priv;
+
 	return 0;
 
 req_priv:
@@ -552,6 +559,9 @@ int __sched_setscheduler(struct task_struct *p,
 			return retval;
 	}
 
+	if (attr->sched_flags & SCHED_FLAG_QOS)
+		return -EOPNOTSUPP;
+
 	/*
 	 * SCHED_DEADLINE bandwidth accounting relies on stable cpusets
 	 * information.
diff --git a/tools/perf/trace/beauty/include/uapi/linux/sched.h b/tools/perf/trace/beauty/include/uapi/linux/sched.h
index 359a14cc76a4..4ff525928430 100644
--- a/tools/perf/trace/beauty/include/uapi/linux/sched.h
+++ b/tools/perf/trace/beauty/include/uapi/linux/sched.h
@@ -102,6 +102,9 @@ struct clone_args {
 	__aligned_u64 set_tid_size;
 	__aligned_u64 cgroup;
 };
+
+enum sched_qos_type {
+};
 #endif
 
 #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
@@ -133,6 +136,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_QOS			0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 09/13] sched/qos: Add rampup multiplier QoS
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
                   ` (7 preceding siblings ...)
  2026-05-04  1:59 ` [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface Qais Yousef
@ 2026-05-04  1:59 ` Qais Yousef
  2026-05-04  2:00 ` [PATCH v2 10/13] sched/fair: Disable util_est when rampup_multiplier is 0 Qais Yousef
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  1:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

Bursty tasks are hard to predict. To use resources efficiently, the
system would like to be as exact as possible. But this poses a challenge
for bursty tasks that need to get access to more resources quickly.

The new SCHED_QOS_RAMPUP_MULTIPLIER allows userspace to do that. As the
name implies, it only helps tasks transition to a higher performance
state when they get _busier_. That is, perfectly periodic tasks are by
definition not going through a transition and will run at a constant
performance level; it is the tasks that need to transition from one
periodic state to another one at a higher level that the
rampup_multiplier will help with. It also slows down the ewma decay of
util_est, which should help those bursty tasks keep their faster
rampup.

This should be complementary to uclamp. uclamp tells the system about
min and max perf requirements, which can be applied immediately.

rampup_multiplier is about reactiveness to change in behavior;
specifically when a task gets a sudden burst of work and gets busier.

In practice this is found to be a much better control than uclamp_min
as it is a relative parameter and doesn't require an absolute
description. It allows the task to go through the motions faster without
knowing exactly how busy it can get at any particular point in time.

The intention is for this rampup multiplier to be applied only during
a burst. It has no effect on perfectly periodic tasks.
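
As a back-of-envelope illustration of what the multiplier does to rise
time (a standalone sketch, not kernel code; it assumes the standard PELT
half-life of 32ms and ignores the details approximate_util_avg() handles;
build with -lm):

	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
		const double y = pow(0.5, 1.0 / 32.0);	/* PELT decay per 1ms period */
		const int mult[] = { 1, 4, 16 };

		for (int i = 0; i < 3; i++) {
			/* util(t) ~= 1024 * (1 - y^(mult * t)); solve for util = 1000 */
			double t = log(1.0 - 1000.0 / 1024.0) / (mult[i] * log(y));

			printf("multiplier %2d: ~%3.0fms from util_avg 0 to 1000\n",
			       mult[i], t);
		}
		return 0;
	}

which prints roughly 173ms, 43ms and 11ms respectively.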

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 Documentation/scheduler/sched-qos.rst | 22 +++++++++
 include/linux/sched.h                 |  7 +++
 include/uapi/linux/sched.h            |  6 ++-
 kernel/sched/core.c                   | 66 +++++++++++++++++++++++++++
 kernel/sched/debug.c                  |  1 +
 kernel/sched/fair.c                   |  6 ++-
 kernel/sched/syscalls.c               | 55 +++++++++++++++++++++-
 7 files changed, 158 insertions(+), 5 deletions(-)

diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
index 0911261cb124..f68856f23b6b 100644
--- a/Documentation/scheduler/sched-qos.rst
+++ b/Documentation/scheduler/sched-qos.rst
@@ -42,3 +42,25 @@ need for extension will arise; and when this happen the task should be
 simpler to add the kernel extension and allow userspace to use readily by
 setting the newly added flag without having to update the whole of
 sched_attr.
+
+2. QoS Tags
+===========
+
+SCHED_QOS_RAMPUP_MULTIPLIER
+---------------------------
+
+Controls how fast the util signal rises. This affects frequency selection when
+schedutil is in use, and how fast tasks migrate between clusters on HMP systems.
+
+It affects bursty tasks only. Perfectly periodic tasks are well described by
+util_avg and the rampup multiplier will have no effect on them.
+
+When set to 0, util_est will be disabled to help further with power saving.
+This behavior can be controlled via UTIL_EST_RAMPUP_ZERO sched_feature.
+
+The value is not capped to retain flexibility, but the effect tapers off very
+quickly and it is hard to notice a difference above 16. Roughly, it takes
+~200ms for util_avg to reach 1000 starting from 0; with 16 it should take
+~12.5ms. A range of 0-8 is advised for general use.
+
+Cookie must always be set to 0.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 70517497e80b..38f0f507960a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -443,6 +443,11 @@ struct sched_info {
 #endif /* CONFIG_SCHED_INFO */
 };
 
+struct sched_qos {
+	DECLARE_BITMAP(user_defined, SCHED_QOS_MAX);
+	unsigned int rampup_multiplier;
+};
+
 /*
  * Integer metrics need fixed point arithmetic, e.g., sched/fair
  * has a few: load, load_avg, util_avg, freq, and capacity.
@@ -954,6 +959,8 @@ struct task_struct {
 
 	struct sched_info		sched_info;
 
+	struct sched_qos		sched_qos;
+
 	struct list_head		tasks;
 	struct plist_node		pushable_tasks;
 	struct rb_node			pushable_dl_tasks;
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 3cdba44bc1cb..2247fe805abc 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -104,6 +104,9 @@ struct clone_args {
 };
 
 enum sched_qos_type {
+	SCHED_QOS_NONE,
+	SCHED_QOS_RAMPUP_MULTIPLIER,
+	SCHED_QOS_MAX,
 };
 #endif
 
@@ -148,7 +151,8 @@ enum sched_qos_type {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_QOS)
 
 /* Only for sched_getattr() own flag param, if task is SCHED_DEADLINE */
 #define SCHED_GETATTR_FLAG_DL_DYNAMIC	0x01
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 82189bdc85b7..2b06701191c5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -186,6 +186,8 @@ __read_mostly int sysctl_resched_latency_warn_once = 1;
  */
 __read_mostly unsigned int sysctl_sched_nr_migrate = SCHED_NR_MIGRATE_BREAK;
 
+unsigned int sysctl_sched_qos_default_rampup_multiplier = 1;
+
 __read_mostly int scheduler_running;
 
 #ifdef CONFIG_SCHED_CORE
@@ -4567,6 +4569,47 @@ static int sysctl_schedstats(const struct ctl_table *table, int write, void *buf
 #endif /* CONFIG_SCHEDSTATS */
 
 #ifdef CONFIG_SYSCTL
+static void sched_qos_sync_sysctl(void)
+{
+	struct task_struct *g, *p;
+
+	guard(rcu)();
+	for_each_process_thread(g, p) {
+		struct rq_flags rf;
+		struct rq *rq;
+
+		rq = task_rq_lock(p, &rf);
+		if (!test_bit(SCHED_QOS_RAMPUP_MULTIPLIER, p->sched_qos.user_defined))
+			p->sched_qos.rampup_multiplier = sysctl_sched_qos_default_rampup_multiplier;
+		task_rq_unlock(rq, p, &rf);
+	}
+}
+
+static int sysctl_sched_qos_handler(const struct ctl_table *table, int write,
+				    void *buffer, size_t *lenp, loff_t *ppos)
+{
+	unsigned int old_rampup_mult;
+	int result;
+
+	old_rampup_mult = sysctl_sched_qos_default_rampup_multiplier;
+
+	result = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (result)
+		goto undo;
+	if (!write)
+		return 0;
+
+	if (old_rampup_mult != sysctl_sched_qos_default_rampup_multiplier) {
+		sched_qos_sync_sysctl();
+	}
+
+	return 0;
+
+undo:
+	sysctl_sched_qos_default_rampup_multiplier = old_rampup_mult;
+	return result;
+}
+
 static const struct ctl_table sched_core_sysctls[] = {
 #ifdef CONFIG_SCHEDSTATS
 	{
@@ -4613,6 +4656,13 @@ static const struct ctl_table sched_core_sysctls[] = {
 		.extra2		= SYSCTL_FOUR,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+	{
+		.procname	= "sched_qos_default_rampup_multiplier",
+		.data           = &sysctl_sched_qos_default_rampup_multiplier,
+		.maxlen         = sizeof(unsigned int),
+		.mode           = 0644,
+		.proc_handler   = sysctl_sched_qos_handler,
+	},
 };
 static int __init sched_core_sysctl_init(void)
 {
@@ -4622,6 +4672,21 @@ static int __init sched_core_sysctl_init(void)
 late_initcall(sched_core_sysctl_init);
 #endif /* CONFIG_SYSCTL */
 
+static void sched_qos_fork(struct task_struct *p)
+{
+	/*
+	 * We always force reset sched_qos on fork. These sched_qos are treated
+	 * as finite resources to help improve quality of life. Inheriting them
+	 * by default can easily lead to a situation where the QoS hint becomes
+	 * meaningless because all tasks in the system have it.
+	 *
+	 * Every task must request the QoS explicitly if it needs it. No
+	 * accidental inheritance is allowed to keep the default behavior sane.
+	 */
+	bitmap_zero(p->sched_qos.user_defined, SCHED_QOS_MAX);
+	p->sched_qos.rampup_multiplier = sysctl_sched_qos_default_rampup_multiplier;
+}
+
 /*
  * fork()/clone()-time setup:
  */
@@ -4641,6 +4706,7 @@ int sched_fork(u64 clone_flags, struct task_struct *p)
 	p->prio = current->normal_prio;
 
 	uclamp_fork(p);
+	sched_qos_fork(p);
 
 	/*
 	 * Revert to default priority/policy on fork if requested.
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 74c1617cf652..60a0d4b0e6a6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1357,6 +1357,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	__PS("effective uclamp.min", uclamp_eff_value(p, UCLAMP_MIN));
 	__PS("effective uclamp.max", uclamp_eff_value(p, UCLAMP_MAX));
 #endif /* CONFIG_UCLAMP_TASK */
+	__PS("sched_qos.rampup_multiplier", p->sched_qos.rampup_multiplier);
 	P(policy);
 	P(prio);
 	if (task_has_dl_policy(p)) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9729da3901a..8124bcc602d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5119,7 +5119,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 			unsigned int prev_ewma = ewma & ~UTIL_AVG_UNCHANGED;
 
 			do_div(delta, 1000);
-			ewma = approximate_util_avg(prev_ewma, delta);
+			ewma = approximate_util_avg(prev_ewma, delta * p->sched_qos.rampup_multiplier);
 			/*
 			 * Keep accumulating delta_exec if it is too small to cause
 			 * a change.
@@ -5188,6 +5188,8 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	 * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
 	 */
 	ewma <<= UTIL_EST_WEIGHT_SHIFT;
+	if (p->sched_qos.rampup_multiplier)
+		last_ewma_diff /= p->sched_qos.rampup_multiplier;
 	ewma  -= last_ewma_diff;
 	ewma >>= UTIL_EST_WEIGHT_SHIFT;
 done:
@@ -10360,7 +10362,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	 * on TICK doesn't end up hurting it as it can happen after we would
 	 * have crossed this threshold.
 	 *
-	 * To ensure that invaraince is taken into account, we don't scale time
+	 * To ensure that invariance is taken into account, we don't scale time
 	 * and use it as-is, approximate_util_avg() will then let us know the
 	 * our threshold.
 	 */
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 88feedd2f7c9..3bf9a8b32f7d 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -427,6 +427,39 @@ static void __setscheduler_uclamp(struct task_struct *p,
 				  const struct sched_attr *attr) { }
 #endif /* !CONFIG_UCLAMP_TASK */
 
+static inline int sched_qos_validate(struct task_struct *p,
+				     const struct sched_attr *attr)
+{
+	switch (attr->sched_qos_type) {
+	case SCHED_QOS_RAMPUP_MULTIPLIER:
+		if (attr->sched_qos_cookie)
+			return -EINVAL;
+		if (attr->sched_qos_value < 0)
+			return -EINVAL;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static void __setscheduler_sched_qos(struct task_struct *p,
+				     const struct sched_attr *attr)
+{
+	if ((attr->sched_flags & SCHED_FLAG_QOS) == 0)
+		return;
+
+	switch (attr->sched_qos_type) {
+	case SCHED_QOS_RAMPUP_MULTIPLIER:
+		set_bit(SCHED_QOS_RAMPUP_MULTIPLIER, p->sched_qos.user_defined);
+		p->sched_qos.rampup_multiplier = attr->sched_qos_value;
+		break;
+	default:
+		break;
+	}
+}
+
 /*
  * Allow unprivileged RT tasks to decrease priority.
  * Only issue a capable test if needed and only once to avoid an audit
@@ -559,8 +592,11 @@ int __sched_setscheduler(struct task_struct *p,
 			return retval;
 	}
 
-	if (attr->sched_flags & SCHED_FLAG_QOS)
-		return -EOPNOTSUPP;
+	if (attr->sched_flags & SCHED_FLAG_QOS) {
+		retval = sched_qos_validate(p, attr);
+		if (retval)
+			return retval;
+	}
 
 	/*
 	 * SCHED_DEADLINE bandwidth accounting relies on stable cpusets
@@ -697,6 +733,7 @@ int __sched_setscheduler(struct task_struct *p,
 			__setscheduler_dl_pi(newprio, policy, p, scope);
 		}
 		__setscheduler_uclamp(p, attr);
+		__setscheduler_sched_qos(p, attr);
 
 		if (scope->queued) {
 			/*
@@ -1108,6 +1145,20 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 		kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
 		kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
 #endif
+		if (copy_from_user(&kattr.sched_qos_type,
+				   &uattr->sched_qos_type,
+				   sizeof(kattr.sched_qos_type))) {
+			return -EFAULT;
+		}
+
+		switch (kattr.sched_qos_type) {
+		case SCHED_QOS_RAMPUP_MULTIPLIER:
+			kattr.sched_qos_value = p->sched_qos.rampup_multiplier;
+			kattr.sched_qos_cookie = 0;
+			break;
+		default:
+			break;
+		}
 	}
 
 	kattr.size = min(usize, sizeof(kattr));
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 10/13] sched/fair: Disable util_est when rampup_multiplier is 0
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
                   ` (8 preceding siblings ...)
  2026-05-04  1:59 ` [PATCH v2 09/13] sched/qos: Add rampup multiplier QoS Qais Yousef
@ 2026-05-04  2:00 ` Qais Yousef
  2026-05-04  2:00 ` [PATCH v2 11/13] sched/fair: Don't mess with util_avg post init Qais Yousef
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  2:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

util_est is a great feature to enable busy tasks with long sleep times
to maintain their perf level. But it can also be expensive in terms of
power for tasks that have no such perf requirements and just happened to
be busy in their last activation.

If a task sets its rampup_multiplier to 0, it indicates that it is happy
to glide along with the system's default response and doesn't require
extra responsiveness. We can use that to further imply that the task is
happy for its util to decay over long sleeps too, and disable util_est.

The behavior can be controlled via the UTIL_EST_RAMPUP_ZERO sched_feat,
which is enabled by default.
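
For example, to turn the behavior off at runtime (the standard
sched_feat toggle, assuming CONFIG_SCHED_DEBUG):

	echo NO_UTIL_EST_RAMPUP_ZERO > /sys/kernel/debug/sched/features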

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 kernel/sched/fair.c     | 8 ++++++++
 kernel/sched/features.h | 5 +++++
 2 files changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8124bcc602d3..a36d6abaf6d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5104,6 +5104,14 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
 	if (!sched_feat(UTIL_EST))
 		return;
 
+	/*
+	 * rampup_multiplier = 0 indicates util_est is disabled.
+	 */
+	if (sched_feat(UTIL_EST_RAMPUP_ZERO) && !p->sched_qos.rampup_multiplier) {
+		ewma = 0;
+		goto done;
+	}
+
 	/* Get current estimate of utilization */
 	ewma = READ_ONCE(p->se.avg.util_est);
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 6f7e5bba854f..05eed37a9064 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -122,6 +122,11 @@ SCHED_FEAT(WA_BIAS, true)
  */
 SCHED_FEAT(UTIL_EST, true)
 
+/*
+ * Disable util_est when rampup_multiplier is 0.
+ */
+SCHED_FEAT(UTIL_EST_RAMPUP_ZERO, true)
+
 SCHED_FEAT(LATENCY_WARN, false)
 
 /*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 11/13] sched/fair: Don't mess with util_avg post init
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
                   ` (9 preceding siblings ...)
  2026-05-04  2:00 ` [PATCH v2 10/13] sched/fair: Disable util_est when rampup_multiplier is 0 Qais Yousef
@ 2026-05-04  2:00 ` Qais Yousef
  2026-05-04  2:00 ` [PATCH v2 12/13] sched/fair: Call util_est_update() after dequeue_entities() Qais Yousef
  2026-05-04  2:00 ` [PATCH v2 RFC 13/13] sched/pelt: Always allow load updates Qais Yousef
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  2:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

The extrapolation logic for util_avg of newly forked tasks tries to
crystal ball the task's demand. This has worked well when the system
didn't have other means to help these tasks. But now we do have
util_est, which ramps up faster, and uclamp_min to ensure a good
starting point for tasks that really care.

Since we really can't crystal ball the behavior, giving the same
starting value to all forked tasks is more consistent, and it helps
preserve system resources for the tasks that truly care to compete for
them. So set the initial util_avg to 0 when the util_est feature is
enabled.

This should not impact workloads that need the best single threaded
performance (like geekbench), given the previous improvements introduced
to help with faster rampup to reach the max perf point more coherently
and consistently across systems.

The logic can be forced back on using UTIL_EST_FORCE_POST_INIT
sched_feat.

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 kernel/sched/fair.c     | 19 +++++++++++++++++++
 kernel/sched/features.h | 10 ++++++++++
 2 files changed, 29 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a36d6abaf6d2..d0f646b32c2d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1261,6 +1261,19 @@ void init_entity_runnable_average(struct sched_entity *se)
 }
 
 /*
+ * When util_est is used, tasks can rampup much faster by default. With the
+ * rampup_multiplier, tasks can ask for a faster rampup after fork, and with
+ * uclamp they can ensure a min perf requirement. Given all these factors, we
+ * keep util_avg at 0 as we can't crystal ball the task's demand after fork.
+ * Userspace has enough ways to ensure good perf for tasks after fork. Keeping
+ * util_avg at 0 is a good way to ensure a uniform start for all tasks, and it
+ * is good to preserve precious resources. Truly busy forked tasks can compete
+ * for the resources without needing an initial 'cheat' to ramp them up
+ * automagically.
+ *
+ * When util_est is not present, the extrapolation logic below still
+ * applies.
+ *
  * With new tasks being created, their initial util_avgs are extrapolated
  * based on the cfs_rq's current util_avg:
  *
@@ -1310,6 +1323,12 @@ void post_init_entity_util_avg(struct task_struct *p)
 		return;
 	}
 
+	/*
+	 * Tasks can rampup faster with util_est, so don't mess with util_avg.
+	 */
+	if (sched_feat(UTIL_EST) && !sched_feat(UTIL_EST_FORCE_POST_INIT))
+		return;
+
 	if (cap > 0) {
 		if (cfs_rq->avg.util_avg != 0) {
 			sa->util_avg  = cfs_rq->avg.util_avg * se_weight(se);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 05eed37a9064..fa8e7d458029 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -127,6 +127,16 @@ SCHED_FEAT(UTIL_EST, true)
  */
 SCHED_FEAT(UTIL_EST_RAMPUP_ZERO, true)
 
+/*
+ * Force extrapolating util_avg on fork.
+ *
+ * When util_est is enabled the extrapolation is not necessary since tasks can
+ * rampup faster and can be controlled with a rampup multiplier to get better
+ * responses making the need for the extrapolation moot. Switch this on to
+ * force the extrapolation logic.
+ */
+SCHED_FEAT(UTIL_EST_FORCE_POST_INIT, false)
+
 SCHED_FEAT(LATENCY_WARN, false)
 
 /*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 12/13] sched/fair: Call util_est_update() after dequeue_entities()
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
                   ` (10 preceding siblings ...)
  2026-05-04  2:00 ` [PATCH v2 11/13] sched/fair: Don't mess with util_avg post init Qais Yousef
@ 2026-05-04  2:00 ` Qais Yousef
  2026-05-04  2:00 ` [PATCH v2 RFC 13/13] sched/pelt: Always allow load updates Qais Yousef
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  2:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

util_est_update() reads task_util() at dequeue, which is only updated in
dequeue_entities(). To read an accurate util_avg at dequeue, make sure
to do the read after load_avg has been updated in dequeue_entities().

util_est for a periodic task before

                                periodic-3114 util_est.enqueued running
   ┌───────────────────────────────────────────────────────────────────────────────────────────────┐
183┤                ▖▗  ▐▖         ▖ ▗▙   ▗   ▗▙▖▖       ▖▖   ▖       ▖▖        ▗  ▟  ▗▄▖          │
139┤               ▐▛█▜▙▞▀▄▄▞▚▄▟█▞▙█▄▟▀▚▄▄▞▚▄▄▟▀▀▛▄▝▄▄▄▙█▛▛█▛▜▛▄▄▀▄█▙▛▛▛▙▄▀▄▄▖▜▄▟█▟▀▜▟▄▜▀▄▄▟▙▖     │
 95┤              ▐▀    ▘   ▝   ▝        ▝▘        ▘   ▘▘       ▝▘       ▝▘  ▝    ▝        ▀       │
   │              ▛                                                                                │
 51┤             ▐▘                                                                                │
  7┤      ▖▗▗  ▗▄▐                                                                                 │
   └┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬──────────┬─────────┬┘
  0.00      0.65       1.30      1.96       2.61      3.26       3.91      4.57       5.22     5.87

and after

                                 periodic-2977 util_est.enqueued running
     ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
 82.0┤             ▟                                                                               │
     │             ▌                                                                               │
 44.5┤             ▌                                                                               │
  7.0┤      ▗   ▗▖ ▌                                                                               │
     └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
    0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86

Note how the signal was noisier before and could peak at 183, vs 157 now.

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
 kernel/sched/fair.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0f646b32c2d..2fec5b6a7c30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7455,6 +7455,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  */
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
+	int ret;
+
 	if (task_is_throttled(p)) {
 		dequeue_throttled_task(p, flags);
 		return true;
@@ -7463,8 +7465,9 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!p->se.sched_delayed)
 		util_est_dequeue(&rq->cfs, p);
 
+	ret = dequeue_entities(rq, &p->se, flags);
 	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
-	if (dequeue_entities(rq, &p->se, flags) < 0)
+	if (ret < 0)
 		return false;
 
 	/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 RFC 13/13] sched/pelt: Always allow load updates
  2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
                   ` (11 preceding siblings ...)
  2026-05-04  2:00 ` [PATCH v2 12/13] sched/fair: Call util_est_update() after dequeue_entities() Qais Yousef
@ 2026-05-04  2:00 ` Qais Yousef
  12 siblings, 0 replies; 15+ messages in thread
From: Qais Yousef @ 2026-05-04  2:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Rafael J. Wysocki,
	Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Tim Chen, Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm,
	Qais Yousef

The 1024us period can cause a problem at dequeue if the last update (due
to tick) happened less than one period ago. Running a periodic task,
I can see the dequeued util_avg changing by 15-20 points due to this
variation - which on an HMP system with small cores can mean a big jump
in freqs.
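
As a sanity check on the magnitude: assuming the standard 32ms PELT
half-life, one missed 1024us period of running time is worth roughly

	(1 - 0.5^(1/32)) * 1024 ~= 22

points of util_avg, so skipping the accumulation for up to one period at
dequeue plausibly accounts for the 15-20 points of variation.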

Before

                                         periodic-2977 util_avg
     ┌┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┐
140.0┼┼─▐▀▛▜─▄▄▄▖─────────┼──────────┼─────────┼────────▄▄──▗▄▄▄──▗▄▄▄────────┼─────────┼─────────┼┤
     ││ ▐ ▌▐ ▌▐ ▌  ▛▜▄   ▄▟▜▄▖  ▐▜▛▌ │▗▄       │        ▌▐  ▐ ▌▐  ▐│▌▐ ▛▙▄    ▗▄    ▐▜  │ ▗▄      ││
     │▀▙▟ ▌▐ ▌▐ ▌  ▌▐▐ ▐▜▌█▐▌█▜▛█▐▌▌ │▐▐       │  ▛▜▀▛▜▀▌▐▀▙▟ ▌▐  ▐│▌▐ ▌█▐ ▗▄▛█▐▛█▜▄▟▐  │ ▐▐▀▌▗▄▖ ││
     ││▌▐ ▌▐▄▌▐ ▌  ▌▐▐ ▐▐▌█▐▌█▐▌█▐▌▌ │▐▐  ▐▀▙▄ ▛▜▄▌▐ ▌▐ ▌▐ ▌▐ ▌▐▄▖▐│▌▐ ▌█▐ ▐▐▌█▐▌█▐▌█▐▛▙▄ ▐▐ ▌▐ ▛▜││
129.5┼┼▌▐─▌▐─▌▐─▌──▌▐▐─▐▐▌█▐▌█▐▌█▐▌▙▄▄▟▐──▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▛▜┼▌▐─▌█▐─▐▐▌█▐▌█▐▌█▐▌▌▐─▐▐─▌▐─▌▐┼┤
     ││▌▐ ▌▐ ▌▐ ▛▜▀▌▐▐ ▐▐▌█▐▌█▐▌█▐▌█▐│▛▐  ▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐▄▌█▐ ▐▐▌█▐▌█▐▌█▐▌▌▐▀▛▐ ▌▐ ▌▐││
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐ ▐▐▌█▐▌█▐▌█▐▌█▐│▌▐▀▙▟ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌█▐ ▐▐▌█▐▌█▐▌█▐▌▌▐ ▌▐ ▙▟ ▌▐││
119.0┼┼▌▐─▌▐─▌▐─▌▐─▌▐▐▛█▐▌█▐▌█▐▌█▐▌█▐┼▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐┼▌▐─▌█▐▄▟▐▌█▐▌█▐▌█▐▌▌▐─▌▐─▌▐─▌▐┼┤
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌█▐▌█▐▌█▐│▌▐ ▌▐ ▌▐▄▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌█▐▌█▐▌█▐▌█▐▌█▐▌▌▐ ▌▐ ▌▐ ▌▐▀│
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌▛▐▘█▐▌█▐│▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌█▐▌█▐▌█▐▌▛▐▘█▐▌▌▐ ▌▐ ▌▐ ▌▐││
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌▌▐ █▐▘█▐│▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌█▐▌█▐▌█▐▌▌▐ █▐▘▌▐ ▌▐ ▌▐ ▌▐││
108.5┼┼▌▐─▌▐─▌▐─▌▐─▌▐▐▌█▐▌█▐▌▌▐─█▐─▌▐┼▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐┼▌▐─▌█▐▌█▐▌█▐▌▌▐─█▐─▌▐─▌▐─▌▐─▌▐┼┤
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌▌▐ █▐ ▌▐│▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌█▐▌█▐▌█▐▌▌▐ █▐ ▌▐ ▌▐ ▌▐ ▌▐││
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌▌▐ █▐ ▌▐│▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌█▐▌█▐▌█▐▌▌▐ █▐ ▌▐ ▌▐ ▌▐ ▌▐││
     ││         │   ▝ ▘█▐▌█▝▘▘▝ ▀▝ ▘▝│▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▘▝  │         │    ▝▐▌█▐▌█▝▘▘▝ ▀▝ ▘▝ ▘▐ ▌▐ ▌▐││
 98.0┼┼─────────┼───────▝▘┼──────────┼───▘▝─▌▐─▌▐─▌──────┼─────────┼───────▀▝▘┼─────────┼────▘▝─▘▝┼┤
     └┼─────────┼─────────┼──────────┼─────────┼─────────┼─────────┼──────────┼─────────┼─────────┼┘
    2.00      2.11      2.22       2.33      2.44      2.56      2.67       2.78      2.89     3.00

After

                                         periodic-2968 util_avg
     ┌┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┐
139.0┼▄▖──▄▄▄▄▄▄▖─────────┼──────────┼─────────┼──────▗▄▄▟▀▛▜▀▛▜▄▖─┼──────────┼─────────┼─────────┼┤
     ││▛▜▀▌▐ ▌▐ ▛▜▀▙▄    ▄▟▀▛▜▄      │         │   ▐▀▛▜ ▌▐ ▌▐ ▌▐ ▛▜▀▙▄        │         │         ││
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐▀▙▟▀▌▐ ▌▐▐▛█▜▛█▜▛█▜▛▙▄    │ ▄▞▜ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐▀▛▜    ▄▄▄▄▟▜▛█▜▛█▜▛█▜▛█▜▄▄▄││
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌█▐▌█▐▛█▜▛█▜▌▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌▐▀▛▜▀▌▐ ▌█▐▌█▐▌█▐▌█▐▌█▐▌█▐▛│
128.2┼┼▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐▐▌█▐▌█▐▌█▐▌█▐▌█▐▌█▐▌▌▐─▌▐─▌▐─▌▐─▌▐─▌▐┼▌▐─▌▐─▌▐─▌▐─▌█▐▌█▐▌█▐▌█▐▌█▐▌█▐▌┤
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌█▐▌█▐▌█▐▌█▐▌▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌▐ ▌▐ ▌▐ ▌█▐▌█▐▌█▐▌█▐▌█▐▌█▐▌│
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌█▐▌█▐▌█▐▌█▐▌▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌▐ ▌▐ ▌▐ ▌█▐▌█▐▌█▐▌█▐▌█▐▌█▐▌│
117.5┼┼▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐▐▌█▐▌█▐▌█▐▌█▐▌█▐▌█▐▌▌▐─▌▐─▌▐─▌▐─▌▐─▌▐┼▌▐─▌▐─▌▐─▌▐─▌█▐▌█▐▌█▐▌█▐▌▛▐▘█▐▌┤
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌█▐▌█▐ ▛▐▌█▐▌▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌▐ ▌▐ ▌▐ ▌█▐▌█▐▌█▐▌█▐▌▌▐ ▛▐▌│
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌█▐▌█▐ ▌▐ ▛▐▌▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌▐ ▌▐ ▌▐ ▌█▐▌█▐▌█▐▌█▐▌▌▐ ▌▐││
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌█▐▌█▐▌█▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌▐ ▌▐ ▌▐ ▌█▐▌█▐▌█▐▌█▐▌▌▐ ▌▐││
106.8┼┼▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐▐▌█▐▌▛▐▌█▐▌█▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐─▌▐┼▌▐─▌▐─▌▐─▌▐─▌█▐▌█▐▌▛▐▌█▐▌▌▐─▌▐┼┤
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌▌▐│▌▐▌█▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐│▌▐ ▌▐ ▌▐ ▌▐ ▌█▐▌█▐▌▌▐ ▌▐▘▌▐ ▌▐││
     ││▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐▐▌█▐▌▌▐│▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▌▐ ▘▝ ▘   ▝ ▌▐│▌▐ ▌▐ ▌▐ ▌▐ ▌█▐▌█▐▌▌▐ ▌▐ ▌▐ ▌▐││
     ││         │   ▝ ▘▐ ▘│    ▘▀▝▘▘▝│▌▐ ▌▐ ▌▐ ▌▐ ▌▝     │         │ ▝ ▘▝ ▌▐ ▌▐ ▌▜▐▌█▐▌▘▝ ▘▝ ▘▐ ▌▐││
 96.0┼┼─────────┼──────▝──┼──────────┼────▝─▘▝─▘▐─▘──────┼─────────┼──────▘▐─▘▝─▘───────┼───────▘─┼┤
     └┼─────────┼─────────┼──────────┼─────────┼─────────┼─────────┼──────────┼─────────┼─────────┼┘
    2.00      2.11      2.22       2.33      2.44      2.56      2.67       2.78      2.89     3.00

Also, the new util_est periodic detection logic can be thrown off by
this variation. With this fix, it now stabilizes pretty well.

Before

                                 periodic-2977 util_est.enqueued running
     ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
157.0┤               ▙▄ ▗▄  ▗▄▄▄ ▗▄  ▗▄▄▄▗▄▄  ▗▄▄▖ ▄   ▄▄▄   ▄  ▄▖▖  ▄▄▄▄▄▖▖▝▙▄▄▄▄▄▄▖ ▗▄           │
119.5┤             ▗▄▌▘▀▀ ▀▀▀ ▝▀▀▘▝▀▀▀ ▝▀▘ ▝▀▀▘ ▀▝▀▘▀▀▀▘▝▀▀▀▀▀▀▀▘▝▝▀▀ ▀   ▝▝▀  ▀   ▀▀▀▀            │
 82.0┤             ▟                                                                               │
     │             ▌                                                                               │
 44.5┤             ▌                                                                               │
  7.0┤      ▗   ▗▖ ▌                                                                               │
     └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
    0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.56      5.21     5.86

After

                                 periodic-2968 util_est.enqueued running
     ┌─────────────────────────────────────────────────────────────────────────────────────────────┐
139.0┤              ▗▟▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀    │
106.5┤             ▐▛                                                                              │
 74.0┤             ▟                                                                               │
     │             ▌                                                                               │
 41.5┤             ▌                                                                               │
  9.0┤        ▗▖  ▗▌                                                                               │
     └┬─────────┬─────────┬──────────┬─────────┬─────────┬─────────┬──────────┬─────────┬─────────┬┘
    0.00      0.65      1.30       1.95      2.60      3.25      3.90       4.55      5.20     5.85

Signed-off-by: Qais Yousef <qyousef@layalina.io>
---

I tried to do the update every 256us instead of every period, but this
didn't help to flatten util_est.

If always doing the update is too much, AND I didn't miss something else
that could be contributing to this problem, would another sched feature
be acceptable, letting those who want accuracy vs those who want minimal
overhead take their pick?

 kernel/sched/pelt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index dbd450798b03..64f9e60023a9 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -224,8 +224,7 @@ ___update_load_sum(u64 now, struct sched_avg *sa,
 	 * Step 1: accumulate *_sum since last_update_time. If we haven't
 	 * crossed period boundaries, finish.
 	 */
-	if (!accumulate_sum(delta, sa, load, runnable, running))
-		return 0;
+	accumulate_sum(delta, sa, load, runnable, running);
 
 	return 1;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface
  2026-05-04  1:59 ` [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface Qais Yousef
@ 2026-05-06 20:38   ` Tim Chen
  0 siblings, 0 replies; 15+ messages in thread
From: Tim Chen @ 2026-05-06 20:38 UTC (permalink / raw)
  To: Qais Yousef, Ingo Molnar, Peter Zijlstra, Vincent Guittot,
	Rafael J. Wysocki, Viresh Kumar
  Cc: Juri Lelli, Steven Rostedt, John Stultz, Dietmar Eggemann,
	Chen, Yu C, Thomas Gleixner, linux-kernel, linux-pm

On Mon, 2026-05-04 at 02:59 +0100, Qais Yousef wrote:
> Provide a generic and extensible interface to describe arbitrary QoS
> tags to tell the kernel about specific behavior that doesn't fall
> into the existing sched_attr.
> 
> The interface is broken into three parts:
> 
> * Type
> * Value
> * Cookie
> 
> Type is an enum that should give us enough space to extend (and
> deprecate) comfortably.
> 
> Value is a signed 64bit number to allow for arbitrary high values.
> 
> Cookie is to help group tasks selectively, since some QoS hints might
> want to operate on tasks per group. A value of 0 indicates system wide.
> 
> There are two anticipated users being discussed on the list.
> 
> 1. Per task rampup multiplier to allow controlling how fast util rises,
>    and by implication how fast a task migrates between cores on HMP
>    systems and how fast freqs rise with schedutil.
> 
> 2. Tag a group of task that are memory dependent for Cache Aware
>    Scheduling.
> 
> The interface is anticipated to be provisioned to apps via utilities and
> libraries. schedqos [1] is an example of how such an interface can be used
> to provide a higher level QoS abstraction to describe workloads without
> baking it into the binaries, and by implication without worrying about
> potential abuse. The interface requires privileged access since QoS is
> considered a scarce resource and requires admin control to ensure it is
> set properly. Again, that admin control is anticipated to be the schedqos
> utility service.
> 
> QoS is treated as a scarce resource and the intention is for
> a syscall to be issued for each individual QoS tag. For the same
> reason, QoS tags are not inherited on fork by default.
> 
> A reasonable point of debate is whether to make the sched_qos fields
> an array of 3 or 5 values, in case the list of QoS hints grows large
> and users end up hitting the bottleneck of having to issue too many
> syscalls to set them all. Being limited as it is now helps enforce
> intentionality and the scarcity of tagging.
> 
> [1] https://github.com/qais-yousef/schedqos
> 
> Signed-off-by: Qais Yousef <qyousef@layalina.io>
> ---
>  Documentation/scheduler/index.rst             |  1 +
>  Documentation/scheduler/sched-qos.rst         | 44 ++++++++++++++++++
>  include/uapi/linux/sched.h                    |  4 ++
>  include/uapi/linux/sched/types.h              | 46 +++++++++++++++++++
>  kernel/sched/syscalls.c                       | 10 ++++
>  .../trace/beauty/include/uapi/linux/sched.h   |  4 ++
>  6 files changed, 109 insertions(+)
>  create mode 100644 Documentation/scheduler/sched-qos.rst
> 
> diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
> index 17ce8d76befc..6652f18e553b 100644
> --- a/Documentation/scheduler/index.rst
> +++ b/Documentation/scheduler/index.rst
> @@ -23,5 +23,6 @@ Scheduler
>      sched-stats
>      sched-ext
>      sched-debug
> +    sched-qos
>  
>      text_files
> diff --git a/Documentation/scheduler/sched-qos.rst b/Documentation/scheduler/sched-qos.rst
> new file mode 100644
> index 000000000000..0911261cb124
> --- /dev/null
> +++ b/Documentation/scheduler/sched-qos.rst
> @@ -0,0 +1,44 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +Scheduler QoS
> +=============
> +
> +1. Introduction
> +===============
> +
> +Different workloads have different scheduling requirements to operate
> +optimally. The same applies to tasks within the same workload.
> +
> +To enable smarter usage of system resources and to cater for the conflicting
> +demands of various tasks, Scheduler QoS provides a mechanism to pass more
> +information about those demands so that the scheduler can make a best effort
> +to honour them.
> +
> +  @sched_qos_type	what QoS hint to apply
> +  @sched_qos_value	value of the QoS hint
> +  @sched_qos_cookie	magic cookie to tag a group of tasks to which the QoS
> +			applies. If 0, the hint will apply globally system
> +			wide. If not 0, the hint will be relative to tasks that
> +			have the same cookie value only.

Qais,

Thanks for your proposal. I have some follow up thoughts.

How can we query all the tasks that use a cookie?
A scenario I can think of is that there may be two groups of tasks, and
we may want to merge the two groups into one when they start sharing
data in the context of cache aware scheduling.  In that case, we
need to get all the tasks under the second cookie and change them to
that of the first.  We may need to link together tasks sharing a cookie.

We probably need a sched_qos_cookie structure defined analogously to
sched_core_cookie to anchor the tasks.  And sched_qos_cookie could be
a pointer value to that structure, as with sched_core_cookie, instead
of it being a __u32 as in the patch below.
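
Something along these lines, as a sketch of the idea only:

	/*
	 * Sketch: analogous to sched_core_cookie, but with a list so that
	 * all tasks sharing a cookie can be walked and e.g. re-tagged or
	 * merged.
	 */
	struct sched_qos_cookie {
		refcount_t		refcnt;
		spinlock_t		lock;	/* protects tasks */
		struct list_head	tasks;	/* tasks sharing this cookie */
	};

with a matching list node added to task_struct.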

Tim

> +
> +QoS hints are set once and not inherited by children by design. The
> +rationale is that each task has its own individual characteristics and it
> +is encouraged to describe each of these separately. Also, since system
> +resources are finite, there's a limit to what can be done to honour these
> +requests before reaching a tipping point where there are too many requests
> +for a particular QoS to service all of them at once and some will start to
> +lose out. For example, if 10 tasks require better wake up latencies on
> +a 4-CPU SMP system and they all wake up at once, only 4 can perceive the
> +hint as honoured and the rest will have to wait. Inheritance can easily
> +turn these 10 tasks into 100 or 1000, and then the QoS hint will rapidly
> +lose its meaning and effectiveness. The chance of all tasks waking up at
> +the same time is much lower with 10 tasks than with 100 or 1000.
> +
> +To set multiple QoS hints, a syscall is required for each. This is a
> +trade-off to reduce the churn of extending the interface: the hope is
> +for it to evolve as workloads and hardware get more sophisticated and
> +the need for extensions arises; when this happens, it should be simpler
> +to add the extension to the kernel, and userspace can readily use it by
> +setting the newly added flag without having to update the whole of
> +sched_attr.
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 52b69ce89368..3cdba44bc1cb 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -102,6 +102,9 @@ struct clone_args {
>  	__aligned_u64 set_tid_size;
>  	__aligned_u64 cgroup;
>  };
> +
> +enum sched_qos_type {
> +};
>  #endif
>  
>  #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> @@ -133,6 +136,7 @@ struct clone_args {
>  #define SCHED_FLAG_KEEP_PARAMS		0x10
>  #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
>  #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
> +#define SCHED_FLAG_QOS			0x80
>  
>  #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
>  				 SCHED_FLAG_KEEP_PARAMS)
> diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
> index bf6e9ae031c1..b65da4938f43 100644
> --- a/include/uapi/linux/sched/types.h
> +++ b/include/uapi/linux/sched/types.h
> @@ -94,6 +94,48 @@
>   * scheduled on a CPU with no more capacity than the specified value.
>   *
>   * A task utilization boundary can be reset by setting the attribute to -1.
> + *
> + * Scheduler QoS
> + * =============
> + *
> + * Different workloads have different scheduling requirements to operate
> + * optimally. The same applies to tasks within the same workload.
> + *
> + * To enable smarter usage of system resources and to cater for the conflicting
> + * demands of various tasks, Scheduler QoS provides a mechanism to pass more
> + * information about those demands so that the scheduler can make a best effort
> + * to honour them.
> + *
> + *  @sched_qos_type	what QoS hint to apply
> + *  @sched_qos_value	value of the QoS hint
> + *  @sched_qos_cookie	magic cookie to tag a group of tasks to which the QoS
> + *			applies. If 0, the hint will apply globally system
> + *			wide. If not 0, the hint will be relative to tasks that
> + *			have the same cookie value only.
> + *
> + * QoS hints are set once and not inherited by children by design. The
> + * rationale is that each task has its own individual characteristics and it
> + * is encouraged to describe each of these separately. Also, since system
> + * resources are finite, there's a limit to what can be done to honour these
> + * requests before reaching a tipping point where there are too many requests
> + * for a particular QoS to service all of them at once and some will start to
> + * lose out. For example, if 10 tasks require better wake up latencies on
> + * a 4-CPU SMP system and they all wake up at once, only 4 can perceive the
> + * hint as honoured and the rest will have to wait. Inheritance can easily
> + * turn these 10 tasks into 100 or 1000, and then the QoS hint will rapidly
> + * lose its meaning and effectiveness. The chance of all tasks waking up at
> + * the same time is much lower with 10 tasks than with 100 or 1000.
> + *
> + * To set multiple QoS hints, a syscall is required for each. This is a
> + * trade-off to reduce the churn of extending the interface: the hope is
> + * for it to evolve as workloads and hardware get more sophisticated and
> + * the need for extensions arises; when this happens, it should be simpler
> + * to add the extension to the kernel, and userspace can readily use it by
> + * setting the newly added flag without having to update the whole of
> + * sched_attr.
> + *
> + * Details about the available QoS hints can be found in:
> + * Documentation/scheduler/sched-qos.rst
>   */
>  struct sched_attr {
>  	__u32 size;
> @@ -116,6 +158,10 @@ struct sched_attr {
>  	__u32 sched_util_min;
>  	__u32 sched_util_max;
>  
> +	__u32 sched_qos_type;
> +	__s64 sched_qos_value;
> +	__u32 sched_qos_cookie;
> +
>  };
>  
>  #endif /* _UAPI_LINUX_SCHED_TYPES_H */
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index b215b0ead9a6..88feedd2f7c9 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -481,6 +481,13 @@ static int user_check_sched_setscheduler(struct task_struct *p,
>  	if (p->sched_reset_on_fork && !reset_on_fork)
>  		goto req_priv;
>  
> +	/*
> +	 * Normal users can't set QoS on their own, must go via admin
> +	 * controlled service
> +	 */
> +	if (attr->sched_flags & SCHED_FLAG_QOS)
> +		goto req_priv;
> +
>  	return 0;
>  
>  req_priv:
> @@ -552,6 +559,9 @@ int __sched_setscheduler(struct task_struct *p,
>  			return retval;
>  	}
>  
> +	if (attr->sched_flags & SCHED_FLAG_QOS)
> +		return -EOPNOTSUPP;
> +
>  	/*
>  	 * SCHED_DEADLINE bandwidth accounting relies on stable cpusets
>  	 * information.
> diff --git a/tools/perf/trace/beauty/include/uapi/linux/sched.h b/tools/perf/trace/beauty/include/uapi/linux/sched.h
> index 359a14cc76a4..4ff525928430 100644
> --- a/tools/perf/trace/beauty/include/uapi/linux/sched.h
> +++ b/tools/perf/trace/beauty/include/uapi/linux/sched.h
> @@ -102,6 +102,9 @@ struct clone_args {
>  	__aligned_u64 set_tid_size;
>  	__aligned_u64 cgroup;
>  };
> +
> +enum sched_qos_type {
> +};
>  #endif
>  
>  #define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> @@ -133,6 +136,7 @@ struct clone_args {
>  #define SCHED_FLAG_KEEP_PARAMS		0x10
>  #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
>  #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
> +#define SCHED_FLAG_QOS			0x80
>  
>  #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
>  				 SCHED_FLAG_KEEP_PARAMS)

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2026-05-06 20:38 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-04  1:59 [PATCH v2 00/13] sched/fair/schedutil: Better manage system response time Qais Yousef
2026-05-04  1:59 ` [PATCH v2 01/13] sched: cpufreq: Rename map_util_perf to sugov_apply_dvfs_headroom Qais Yousef
2026-05-04  1:59 ` [PATCH v2 02/13] sched/pelt: Add a new function to approximate the future util_avg value Qais Yousef
2026-05-04  1:59 ` [PATCH v2 03/13] sched/pelt: Add a new function to approximate runtime to reach given util Qais Yousef
2026-05-04  1:59 ` [PATCH v2 04/13] sched/fair: Remove magic hardcoded margin in fits_capacity() Qais Yousef
2026-05-04  1:59 ` [PATCH v2 05/13] sched: cpufreq: Remove magic 1.25 headroom from sugov_apply_dvfs_headroom() Qais Yousef
2026-05-04  1:59 ` [PATCH v2 06/13] sched/fair: Extend util_est to improve rampup time Qais Yousef
2026-05-04  1:59 ` [PATCH v2 07/13] sched/fair: util_est: Take into account periodic tasks Qais Yousef
2026-05-04  1:59 ` [PATCH v2 RFC 08/13] sched/qos: Add a new sched-qos interface Qais Yousef
2026-05-06 20:38   ` Tim Chen
2026-05-04  1:59 ` [PATCH v2 09/13] sched/qos: Add rampup multiplier QoS Qais Yousef
2026-05-04  2:00 ` [PATCH v2 10/13] sched/fair: Disable util_est when rampup_multiplier is 0 Qais Yousef
2026-05-04  2:00 ` [PATCH v2 11/13] sched/fair: Don't mess with util_avg post init Qais Yousef
2026-05-04  2:00 ` [PATCH v2 12/13] sched/fair: Call util_est_update() after dequeue_entities() Qais Yousef
2026-05-04  2:00 ` [PATCH v2 RFC 13/13] sched/pelt: Always allow load updates Qais Yousef

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox