* [PATCH v2 1/8] Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"
From: Hongyan Xia @ 2025-03-04 14:23 UTC
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel, Hongyan Xia
From: Hongyan Xia <Hongyan.Xia2@arm.com>
That commit creates further problems, because a spare capacity of 0 can
either be a real indication that the CPU is maxed out, or mean that the
CPU is UCLAMP_MAX throttled. We end up giving all such CPUs a chance
anyway, which can result in bogus energy calculations. It also tends to
schedule tasks on the same CPU and requires load balancing patches. Sum
aggregation solves these problems and this patch is no longer needed.
This reverts commit 6b00a40147653c8ea748e8f4396510f252763364.
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
kernel/sched/fair.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 857808da23d8..71fc86eafbd9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8417,10 +8417,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
for (; pd; pd = pd->next) {
unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_actual_cap, util;
- long prev_spare_cap = -1, max_spare_cap = -1;
+ unsigned long cur_delta, max_spare_cap = 0;
unsigned long rq_util_min, rq_util_max;
- unsigned long cur_delta, base_energy;
+ unsigned long prev_spare_cap = 0;
int max_spare_cap_cpu = -1;
+ unsigned long base_energy;
int fits, max_fits = -1;
cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
@@ -8482,7 +8483,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
prev_spare_cap = cpu_cap;
prev_fits = fits;
} else if ((fits > max_fits) ||
- ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
+ ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
/*
* Find the CPU with the maximum spare capacity
* among the remaining CPUs in the performance
@@ -8494,7 +8495,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
}
}
- if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
+ if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
continue;
eenv_pd_busy_time(&eenv, cpus, p);
@@ -8502,7 +8503,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
base_energy = compute_energy(&eenv, pd, cpus, p, -1);
/* Evaluate the energy impact of using prev_cpu. */
- if (prev_spare_cap > -1) {
+ if (prev_spare_cap > 0) {
prev_delta = compute_energy(&eenv, pd, cpus, p,
prev_cpu);
/* CPU utilization has changed */
--
2.34.1
* [PATCH v2 2/8] sched/uclamp: Track a new util_avg_bias signal
From: Hongyan Xia @ 2025-03-04 14:23 UTC
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel
Add a util_avg_bias signal in sched_avg, which is obtained by:
util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg
The task utilization after considering uclamp is:
util_avg_uclamp = util_avg + util_avg_bias
We then sum up all biases on the same rq and use the total bias to bias
the rq utilization. This is the core idea of uclamp sum aggregation. The
rq utilization will be
rq_util_avg_uclamp = rq_util_avg + total_util_avg_bias
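As an illustrative example (the numbers here are made up): a task with
util_avg of 200 and uclamp_min of 512 carries a bias of

clamp(200, 512, 1024) - 200 = +312

while a task with util_avg of 800 and uclamp_max of 512 carries a bias of

clamp(800, 0, 512) - 800 = -288

If both tasks are on the same rq, whose rq_util_avg is then 1000,

rq_util_avg_uclamp = 1000 + 312 + (-288) = 1024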
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
include/linux/sched.h | 3 ++-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 33 +++++++++++++++++++++++++++++++++
kernel/sched/pelt.c | 37 +++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 24 ++++++++++++++++++++++++
5 files changed, 97 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9632e3318e0d..1f3b06aa024d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -487,7 +487,8 @@ struct sched_avg {
u32 period_contrib;
unsigned long load_avg;
unsigned long runnable_avg;
- unsigned long util_avg;
+ unsigned int util_avg;
+ int util_avg_bias;
unsigned int util_est;
} ____cacheline_aligned;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ef047add7f9e..264ee83958b5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -853,7 +853,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "runnable_avg",
cfs_rq->avg.runnable_avg);
- SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
+ SEQ_printf(m, " .%-30s: %u\n", "util_avg",
cfs_rq->avg.util_avg);
SEQ_printf(m, " .%-30s: %u\n", "util_est",
cfs_rq->avg.util_est);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 71fc86eafbd9..438755f55624 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1131,6 +1131,7 @@ void post_init_entity_util_avg(struct task_struct *p)
}
sa->runnable_avg = sa->util_avg;
+ sa->util_avg_bias = 0;
}
#else /* !CONFIG_SMP */
@@ -4852,6 +4853,32 @@ static inline unsigned long task_util_est(struct task_struct *p)
return max(task_util(p), _task_util_est(p));
}
+#ifdef CONFIG_UCLAMP_TASK
+static inline long task_util_bias(struct task_struct *p)
+{
+ return READ_ONCE(p->se.avg.util_avg_bias);
+}
+
+static inline unsigned long task_util_uclamp(struct task_struct *p)
+{
+ long ret = task_util(p);
+
+ ret += task_util_bias(p);
+
+ return max(ret, 0L);
+}
+#else
+static inline long task_util_bias(struct task_struct *p)
+{
+ return 0;
+}
+
+static inline unsigned long task_util_uclamp(struct task_struct *p)
+{
+ return task_util(p);
+}
+#endif
+
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
struct task_struct *p)
{
@@ -7027,6 +7054,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* At this point se is NULL and we are at root level*/
add_nr_running(rq, 1);
+ util_bias_enqueue(rq, p);
+ /* XXX: We should skip the update above and only do it once here. */
+ if (task_util_bias(p) > 0)
+ cpufreq_update_util(rq, 0);
/*
* Since new tasks are assigned an initial util_avg equal to
@@ -7150,6 +7181,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
}
sub_nr_running(rq, h_nr_queued);
+ if (p)
+ util_bias_dequeue(rq, p);
if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
dl_server_stop(&rq->fair_server);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 7a8534a2deff..f38abe6f0b8b 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -266,6 +266,39 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
}
+#ifdef CONFIG_UCLAMP_TASK
+/* avg must belong to the queue this se is on. */
+static void util_bias_update(struct task_struct *p)
+{
+ unsigned int util, uclamp_min, uclamp_max;
+ struct rq *rq;
+ int old, new;
+
+ util = READ_ONCE(p->se.avg.util_avg);
+ uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
+ uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
+ /*
+ * uclamp_max at the max value means there is no uclamp_max, and should
+ * not have any clamping effect at all here.
+ */
+ if (uclamp_max == SCHED_CAPACITY_SCALE)
+ uclamp_max = UINT_MAX;
+ old = READ_ONCE(p->se.avg.util_avg_bias);
+ new = (int)clamp(util, uclamp_min, uclamp_max) - (int)util;
+
+ WRITE_ONCE(p->se.avg.util_avg_bias, new);
+ if (!p->se.on_rq)
+ return;
+ rq = task_rq(p);
+ WRITE_ONCE(rq->cfs.avg.util_avg_bias,
+ READ_ONCE(rq->cfs.avg.util_avg_bias) + new - old);
+}
+#else /* !CONFIG_UCLAMP_TASK */
+static void util_bias_update(struct task_struct *p)
+{
+}
+#endif
+
/*
* sched_entity:
*
@@ -296,6 +329,8 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
{
if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
___update_load_avg(&se->avg, se_weight(se));
+ if (entity_is_task(se))
+ util_bias_update(task_of(se));
trace_pelt_se_tp(se);
return 1;
}
@@ -310,6 +345,8 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
___update_load_avg(&se->avg, se_weight(se));
cfs_se_util_change(&se->avg);
+ if (entity_is_task(se))
+ util_bias_update(task_of(se));
trace_pelt_se_tp(se);
return 1;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ab16d3d0e51c..74363bc74e23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3468,6 +3468,22 @@ uclamp_se_set(struct uclamp_se *uc_se, unsigned int value, bool user_defined)
uc_se->user_defined = user_defined;
}
+static inline void util_bias_enqueue(struct rq *rq, struct task_struct *p)
+{
+ int rq_val = READ_ONCE(rq->cfs.avg.util_avg_bias);
+ int p_val = READ_ONCE(p->se.avg.util_avg_bias);
+
+ WRITE_ONCE(rq->cfs.avg.util_avg_bias, rq_val + p_val);
+}
+
+static inline void util_bias_dequeue(struct rq *rq, struct task_struct *p)
+{
+ int rq_val = READ_ONCE(rq->cfs.avg.util_avg_bias);
+ int p_val = READ_ONCE(p->se.avg.util_avg_bias);
+
+ WRITE_ONCE(rq->cfs.avg.util_avg_bias, rq_val - p_val);
+}
+
#else /* !CONFIG_UCLAMP_TASK: */
static inline unsigned long
@@ -3505,6 +3521,14 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return false;
}
+static inline void util_bias_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void util_bias_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
+
#endif /* !CONFIG_UCLAMP_TASK */
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
--
2.34.1
* [PATCH v2 3/8] sched/uclamp: Add util_est_uclamp
From: Hongyan Xia @ 2025-03-04 14:23 UTC
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel
The new util_est_uclamp signal is essentially
clamp(util_est, uclamp_min, uclamp_max) and follows how util_est
operates.
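For example (illustrative numbers), a task with util_est of 100 and
uclamp_min of 300 gets a util_est_uclamp of clamp(100, 300, 1024) = 300,
and the rq-level util_est_uclamp is the sum of the enqueued tasks'
contributions, mirroring how the rq-level util_est is maintained.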
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1f3b06aa024d..a4bdfa1d6be1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -490,6 +490,7 @@ struct sched_avg {
unsigned int util_avg;
int util_avg_bias;
unsigned int util_est;
+ unsigned int util_est_uclamp;
} ____cacheline_aligned;
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 438755f55624..e9aa93f99a4e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4867,6 +4867,16 @@ static inline unsigned long task_util_uclamp(struct task_struct *p)
return max(ret, 0L);
}
+
+static inline unsigned long _task_util_est_uclamp(struct task_struct *p)
+{
+ return READ_ONCE(p->se.avg.util_est_uclamp);
+}
+
+static inline unsigned long task_util_est_uclamp(struct task_struct *p)
+{
+ return max(task_util_uclamp(p), _task_util_est_uclamp(p));
+}
#else
static inline long task_util_bias(struct task_struct *p)
{
@@ -4877,6 +4887,16 @@ static inline unsigned long task_util_uclamp(struct task_struct *p)
{
return task_util(p);
}
+
+static inline unsigned long _task_util_est_uclamp(struct task_struct *p)
+{
+ return _task_util_est(p);
+}
+
+static inline unsigned long task_util_est_uclamp(struct task_struct *p)
+{
+ return task_util_est(p);
+}
#endif
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
@@ -4891,6 +4911,9 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
enqueued = cfs_rq->avg.util_est;
enqueued += _task_util_est(p);
WRITE_ONCE(cfs_rq->avg.util_est, enqueued);
+ enqueued = cfs_rq->avg.util_est_uclamp;
+ enqueued += _task_util_est_uclamp(p);
+ WRITE_ONCE(cfs_rq->avg.util_est_uclamp, enqueued);
trace_sched_util_est_cfs_tp(cfs_rq);
}
@@ -4907,6 +4930,9 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
enqueued = cfs_rq->avg.util_est;
enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
WRITE_ONCE(cfs_rq->avg.util_est, enqueued);
+ enqueued = cfs_rq->avg.util_est_uclamp;
+ enqueued -= _task_util_est_uclamp(p);
+ WRITE_ONCE(cfs_rq->avg.util_est_uclamp, enqueued);
trace_sched_util_est_cfs_tp(cfs_rq);
}
@@ -4994,6 +5020,10 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
ewma -= last_ewma_diff;
ewma >>= UTIL_EST_WEIGHT_SHIFT;
done:
+ WRITE_ONCE(p->se.avg.util_est_uclamp,
+ clamp(ewma,
+ (unsigned int)uclamp_eff_value(p, UCLAMP_MIN),
+ (unsigned int)uclamp_eff_value(p, UCLAMP_MAX)));
ewma |= UTIL_AVG_UNCHANGED;
WRITE_ONCE(p->se.avg.util_est, ewma);
--
2.34.1
* [PATCH v2 4/8] sched/fair: Use util biases for utilization and frequency
From: Hongyan Xia @ 2025-03-04 14:23 UTC
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider
Cc: Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel, linux-pm
Use the new util_avg_bias for task and runqueue utilization. We also
maintain separate util_est and util_est_uclamp signals.
Now that we have the uclamp sum-aggregated CFS utilization value, we no
longer need to consult uclamp buckets to know how the frequency should
be clamped. We simply look at the aggregated top-level
rq->cfs.avg.util_avg + rq->cfs.avg.util_avg_bias and
rq->cfs.avg.util_est_uclamp to decide what frequency to choose and how
to place tasks.
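As a minimal sketch of this idea (the helper name below is hypothetical
and this is not the exact code in this series; see root_cfs_util_uclamp()
and cpu_util_uclamp() in the diff for the real thing), the top-level
utilization consulted for frequency selection looks like:

static unsigned long rq_util_freq_sketch(struct rq *rq)
{
	long util = READ_ONCE(rq->cfs.avg.util_avg);

	/* util_avg_bias can be negative; clamp the sum at 0. */
	util += READ_ONCE(rq->cfs.avg.util_avg_bias);
	util = max(util, 0L);

	/* Honour the clamped utilization estimate, as with util_est. */
	return max_t(unsigned long, util,
		     READ_ONCE(rq->cfs.avg.util_est_uclamp));
}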
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
kernel/sched/cpufreq_schedutil.c | 6 +-
kernel/sched/fair.c | 296 +++++++++++--------------------
kernel/sched/sched.h | 19 +-
3 files changed, 101 insertions(+), 220 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 1a19d69b91ed..7b1616b139f6 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -200,7 +200,7 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
unsigned long min, max, util = scx_cpuperf_target(sg_cpu->cpu);
if (!scx_switched_all())
- util += cpu_util_cfs_boost(sg_cpu->cpu);
+ util += cpu_util_cfs_boost_uclamp(sg_cpu->cpu);
util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
util = max(util, boost);
sg_cpu->bw_min = min;
@@ -340,10 +340,6 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
if (scx_switched_all())
return false;
- /* if capped by uclamp_max, always update to be in compliance */
- if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
- return false;
-
/*
* Maintain the frequency if the CPU has not been idle recently, as
* reduction is likely to be premature.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9aa93f99a4e..3d91dbd19a85 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4877,6 +4877,15 @@ static inline unsigned long task_util_est_uclamp(struct task_struct *p)
{
return max(task_util_uclamp(p), _task_util_est_uclamp(p));
}
+
+static inline unsigned long root_cfs_util_uclamp(struct rq *rq)
+{
+ long ret = READ_ONCE(rq->cfs.avg.util_avg);
+
+ ret += READ_ONCE(rq->cfs.avg.util_avg_bias);
+
+ return max(ret, 0L);
+}
#else
static inline long task_util_bias(struct task_struct *p)
{
@@ -4897,6 +4906,11 @@ static inline unsigned long task_util_est_uclamp(struct task_struct *p)
{
return task_util_est(p);
}
+
+static inline unsigned long root_cfs_util_uclamp(struct rq *rq)
+{
+ return READ_ONCE(rq->cfs.avg.util_avg);
+}
#endif
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
@@ -5039,135 +5053,24 @@ static inline unsigned long get_actual_cpu_capacity(int cpu)
return capacity;
}
-static inline int util_fits_cpu(unsigned long util,
- unsigned long uclamp_min,
- unsigned long uclamp_max,
- int cpu)
+static inline int util_fits_cpu(unsigned long util, int cpu)
{
unsigned long capacity = capacity_of(cpu);
- unsigned long capacity_orig;
- bool fits, uclamp_max_fits;
-
- /*
- * Check if the real util fits without any uclamp boost/cap applied.
- */
- fits = fits_capacity(util, capacity);
-
- if (!uclamp_is_used())
- return fits;
-
- /*
- * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
- * uclamp_max. We only care about capacity pressure (by using
- * capacity_of()) for comparing against the real util.
- *
- * If a task is boosted to 1024 for example, we don't want a tiny
- * pressure to skew the check whether it fits a CPU or not.
- *
- * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
- * should fit a little cpu even if there's some pressure.
- *
- * Only exception is for HW or cpufreq pressure since it has a direct impact
- * on available OPP of the system.
- *
- * We honour it for uclamp_min only as a drop in performance level
- * could result in not getting the requested minimum performance level.
- *
- * For uclamp_max, we can tolerate a drop in performance level as the
- * goal is to cap the task. So it's okay if it's getting less.
- */
- capacity_orig = arch_scale_cpu_capacity(cpu);
-
- /*
- * We want to force a task to fit a cpu as implied by uclamp_max.
- * But we do have some corner cases to cater for..
- *
- *
- * C=z
- * | ___
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | |
- * | | | | | | | (util somewhere in this region)
- * | | | | | | |
- * | | | | | | |
- * +----------------------------------------
- * CPU0 CPU1 CPU2
- *
- * In the above example if a task is capped to a specific performance
- * point, y, then when:
- *
- * * util = 80% of x then it does not fit on CPU0 and should migrate
- * to CPU1
- * * util = 80% of y then it is forced to fit on CPU1 to honour
- * uclamp_max request.
- *
- * which is what we're enforcing here. A task always fits if
- * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
- * the normal upmigration rules should withhold still.
- *
- * Only exception is when we are on max capacity, then we need to be
- * careful not to block overutilized state. This is so because:
- *
- * 1. There's no concept of capping at max_capacity! We can't go
- * beyond this performance level anyway.
- * 2. The system is being saturated when we're operating near
- * max capacity, it doesn't make sense to block overutilized.
- */
- uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
- uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
- fits = fits || uclamp_max_fits;
- /*
- *
- * C=z
- * | ___ (region a, capped, util >= uclamp_max)
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
- * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
- * | | | | | | |
- * | | | | | | | (region c, boosted, util < uclamp_min)
- * +----------------------------------------
- * CPU0 CPU1 CPU2
- *
- * a) If util > uclamp_max, then we're capped, we don't care about
- * actual fitness value here. We only care if uclamp_max fits
- * capacity without taking margin/pressure into account.
- * See comment above.
- *
- * b) If uclamp_min <= util <= uclamp_max, then the normal
- * fits_capacity() rules apply. Except we need to ensure that we
- * enforce we remain within uclamp_max, see comment above.
- *
- * c) If util < uclamp_min, then we are boosted. Same as (b) but we
- * need to take into account the boosted value fits the CPU without
- * taking margin/pressure into account.
- *
- * Cases (a) and (b) are handled in the 'fits' variable already. We
- * just need to consider an extra check for case (c) after ensuring we
- * handle the case uclamp_min > uclamp_max.
- */
- uclamp_min = min(uclamp_min, uclamp_max);
- if (fits && (util < uclamp_min) &&
- (uclamp_min > get_actual_cpu_capacity(cpu)))
- return -1;
+ if (fits_capacity(util, capacity))
+ return 1;
- return fits;
+ return 0;
}
static inline int task_fits_cpu(struct task_struct *p, int cpu)
{
- unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
- unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
- unsigned long util = task_util_est(p);
+ unsigned long util_uclamp = task_util_est_uclamp(p);
/*
* Return true only if the cpu fully fits the task requirements, which
* include the utilization but also the performance hints.
*/
- return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
+ return (util_fits_cpu(util_uclamp, cpu) > 0);
}
static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -6886,18 +6789,18 @@ static inline void hrtick_update(struct rq *rq)
#endif
#ifdef CONFIG_SMP
+static unsigned long cpu_util_cfs_uclamp(int cpu);
+
static inline bool cpu_overutilized(int cpu)
{
- unsigned long rq_util_min, rq_util_max;
+ unsigned long util_uclamp;
if (!sched_energy_enabled())
return false;
- rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
- rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
+ util_uclamp = cpu_util_cfs_uclamp(cpu);
- /* Return true only if the utilization doesn't fit CPU's capacity */
- return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+ return !util_fits_cpu(util_uclamp, cpu);
}
/*
@@ -7828,7 +7731,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
- unsigned long task_util, util_min, util_max, best_cap = 0;
+ unsigned long task_util_uclamp, best_cap = 0;
int fits, best_fits = 0;
int cpu, best_cpu = -1;
struct cpumask *cpus;
@@ -7836,9 +7739,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
- task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);
+ task_util_uclamp = task_util_est_uclamp(p);
for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
@@ -7846,7 +7747,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
continue;
- fits = util_fits_cpu(task_util, util_min, util_max, cpu);
+ fits = util_fits_cpu(task_util_uclamp, cpu);
/* This CPU fits with all requirements */
if (fits > 0)
@@ -7874,8 +7775,6 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
}
static inline bool asym_fits_cpu(unsigned long util,
- unsigned long util_min,
- unsigned long util_max,
int cpu)
{
if (sched_asym_cpucap_active())
@@ -7883,7 +7782,7 @@ static inline bool asym_fits_cpu(unsigned long util,
* Return true only if the cpu fully fits the task requirements
* which include the utilization and the performance hints.
*/
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return (util_fits_cpu(util, cpu) > 0);
return true;
}
@@ -7895,7 +7794,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
bool has_idle_core = false;
struct sched_domain *sd;
- unsigned long task_util, util_min, util_max;
+ unsigned long task_util_uclamp;
int i, recent_used_cpu, prev_aff = -1;
/*
@@ -7904,9 +7803,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (sched_asym_cpucap_active()) {
sync_entity_load_avg(&p->se);
- task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);
+ task_util_uclamp = task_util_est_uclamp(p);
}
/*
@@ -7915,7 +7812,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
lockdep_assert_irqs_disabled();
if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
- asym_fits_cpu(task_util, util_min, util_max, target))
+ asym_fits_cpu(task_util_uclamp, target))
return target;
/*
@@ -7923,7 +7820,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
- asym_fits_cpu(task_util, util_min, util_max, prev)) {
+ asym_fits_cpu(task_util_uclamp, prev)) {
if (!static_branch_unlikely(&sched_cluster_active) ||
cpus_share_resources(prev, target))
@@ -7944,7 +7841,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
in_task() &&
prev == smp_processor_id() &&
this_rq()->nr_running <= 1 &&
- asym_fits_cpu(task_util, util_min, util_max, prev)) {
+ asym_fits_cpu(task_util_uclamp, prev)) {
return prev;
}
@@ -7956,7 +7853,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
- asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
+ asym_fits_cpu(task_util_uclamp, recent_used_cpu)) {
if (!static_branch_unlikely(&sched_cluster_active) ||
cpus_share_resources(recent_used_cpu, target))
@@ -8124,16 +8021,67 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
return min(util, arch_scale_cpu_capacity(cpu));
}
+/* This is basically a copy-paste from cpu_util(), but instead using uclamp values. */
+static unsigned long
+cpu_util_uclamp(int cpu, struct task_struct *p, int dst_cpu, int boost)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ unsigned long util = root_cfs_util_uclamp(rq);
+
+ if (boost) {
+ unsigned long runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
+ unsigned long util_raw = READ_ONCE(cfs_rq->avg.util_avg);
+
+ util = max(util, util_raw ? util * runnable / util_raw : 0);
+ }
+
+ if (p) {
+ if (task_cpu(p) == cpu && !p->se.on_rq) {
+ util += task_util_bias(p);
+ if ((long)util < 0)
+ util = 0;
+ }
+ if (task_cpu(p) == cpu && dst_cpu != cpu)
+ lsub_positive(&util, task_util_uclamp(p));
+ else if (task_cpu(p) != cpu && dst_cpu == cpu)
+ util += task_util_uclamp(p);
+ }
+
+ if (sched_feat(UTIL_EST)) {
+ unsigned long util_est = READ_ONCE(cfs_rq->avg.util_est_uclamp);
+
+ if (dst_cpu == cpu)
+ util_est += _task_util_est_uclamp(p);
+ else if (p && unlikely(task_on_rq_queued(p) || current == p))
+ lsub_positive(&util_est, _task_util_est_uclamp(p));
+
+ util = max(util, util_est);
+ }
+
+ return min(util, arch_scale_cpu_capacity(cpu));
+}
+
unsigned long cpu_util_cfs(int cpu)
{
return cpu_util(cpu, NULL, -1, 0);
}
-unsigned long cpu_util_cfs_boost(int cpu)
+static unsigned long cpu_util_cfs_uclamp(int cpu)
+{
+ return cpu_util_uclamp(cpu, NULL, -1, 0);
+}
+
+static unsigned long cpu_util_cfs_boost(int cpu)
{
return cpu_util(cpu, NULL, -1, 1);
}
+unsigned long cpu_util_cfs_boost_uclamp(int cpu)
+{
+ return cpu_util_uclamp(cpu, NULL, -1, 1);
+}
+
/*
* cpu_util_without: compute cpu utilization without any contributions from *p
* @cpu: the CPU which utilization is requested
@@ -8206,7 +8154,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* steals time to the deadline task.
* - The minimum performance requirement for CFS and/or RT.
*/
- *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
+ *min = irq + cpu_bw_dl(rq);
/*
* When an RT task is runnable and uclamp is not used, we must
@@ -8230,7 +8178,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
* than the actual utilization because of uclamp_max requirements.
*/
if (max)
- *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+ *max = scale;
if (util >= scale)
return scale;
@@ -8343,33 +8291,15 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
int cpu;
for_each_cpu(cpu, pd_cpus) {
- struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
- unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+ unsigned long util = cpu_util_uclamp(cpu, p, dst_cpu, 1);
unsigned long eff_util, min, max;
/*
- * Performance domain frequency: utilization clamping
- * must be considered since it affects the selection
- * of the performance domain frequency.
- * NOTE: in case RT tasks are running, by default the min
- * utilization can be max OPP.
+ * NOTE: in case RT tasks are running, by default the
+ * FREQUENCY_UTIL's utilization can be max OPP.
*/
eff_util = effective_cpu_util(cpu, util, &min, &max);
- /* Task's uclamp can modify min and max value */
- if (tsk && uclamp_is_used()) {
- min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
- /*
- * If there is no active max uclamp constraint,
- * directly use task's one, otherwise keep max.
- */
- if (uclamp_rq_is_idle(cpu_rq(cpu)))
- max = uclamp_eff_value(p, UCLAMP_MAX);
- else
- max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
- }
-
eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
max_util = max(max_util, eff_util);
}
@@ -8443,8 +8373,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
- unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
- unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
struct root_domain *rd = this_rq()->rd;
int cpu, best_energy_cpu, target = -1;
int prev_fits = -1, best_fits = -1;
@@ -8472,16 +8400,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = prev_cpu;
sync_entity_load_avg(&p->se);
- if (!task_util_est(p) && p_util_min == 0)
+ if (!task_util_est_uclamp(p))
goto unlock;
eenv_task_busy_time(&eenv, p, prev_cpu);
for (; pd; pd = pd->next) {
- unsigned long util_min = p_util_min, util_max = p_util_max;
- unsigned long cpu_cap, cpu_actual_cap, util;
+ unsigned long cpu_cap, cpu_actual_cap, util_uclamp;
unsigned long cur_delta, max_spare_cap = 0;
- unsigned long rq_util_min, rq_util_max;
unsigned long prev_spare_cap = 0;
int max_spare_cap_cpu = -1;
unsigned long base_energy;
@@ -8500,8 +8426,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv.pd_cap = 0;
for_each_cpu(cpu, cpus) {
- struct rq *rq = cpu_rq(cpu);
-
eenv.pd_cap += cpu_actual_cap;
if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
@@ -8510,37 +8434,15 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
continue;
- util = cpu_util(cpu, p, cpu, 0);
+ util_uclamp = cpu_util_uclamp(cpu, p, cpu, 0);
cpu_cap = capacity_of(cpu);
- /*
- * Skip CPUs that cannot satisfy the capacity request.
- * IOW, placing the task there would make the CPU
- * overutilized. Take uclamp into account to see how
- * much capacity we can get out of the CPU; this is
- * aligned with sched_cpu_util().
- */
- if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
- /*
- * Open code uclamp_rq_util_with() except for
- * the clamp() part. I.e.: apply max aggregation
- * only. util_fits_cpu() logic requires to
- * operate on non clamped util but must use the
- * max-aggregated uclamp_{min, max}.
- */
- rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
- rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
-
- util_min = max(rq_util_min, p_util_min);
- util_max = max(rq_util_max, p_util_max);
- }
-
- fits = util_fits_cpu(util, util_min, util_max, cpu);
- if (!fits)
+ fits = util_fits_cpu(util_uclamp, cpu);
+ if (fits == 1)
+ lsub_positive(&cpu_cap, util_uclamp);
+ else
continue;
- lsub_positive(&cpu_cap, util);
-
if (cpu == prev_cpu) {
/* Always use prev_cpu as a candidate. */
prev_spare_cap = cpu_cap;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74363bc74e23..b50e3d6e79c4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3379,7 +3379,7 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
extern unsigned long cpu_util_cfs(int cpu);
-extern unsigned long cpu_util_cfs_boost(int cpu);
+extern unsigned long cpu_util_cfs_boost_uclamp(int cpu);
static inline unsigned long cpu_util_rt(struct rq *rq)
{
@@ -3411,21 +3411,6 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
}
-/* Is the rq being capped/throttled by uclamp_max? */
-static inline bool uclamp_rq_is_capped(struct rq *rq)
-{
- unsigned long rq_util;
- unsigned long max_util;
-
- if (!static_branch_likely(&sched_uclamp_used))
- return false;
-
- rq_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
- max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
-
- return max_util != SCHED_CAPACITY_SCALE && rq_util >= max_util;
-}
-
/*
* When uclamp is compiled in, the aggregation at rq level is 'turned off'
* by default in the fast path and only gets turned on once userspace performs
@@ -3495,8 +3480,6 @@ uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
return SCHED_CAPACITY_SCALE;
}
-static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }
-
static inline bool uclamp_is_used(void)
{
return false;
--
2.34.1
* [PATCH v2 5/8] sched/uclamp: Remove all uclamp bucket logic
From: Hongyan Xia @ 2025-03-04 14:23 UTC
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel
Also rewrite uclamp_update_active() so that the effective uclamp values
are updated every time we change task group properties, change system
defaults, or a request is issued from userspace.
This also significantly reduces uclamp overhead, because we no longer
need to compute effective uclamp values and manipulate buckets every
time a task is enqueued or dequeued (in uclamp_rq_{inc/dec}()).
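With this patch, the recomputation instead happens only at the points
where the effective values can actually change, i.e. uclamp_fork(),
sched_move_task() and __setscheduler_uclamp() in the hunks below.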
TODO: Rewrite documentation to match the new logic.
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
include/linux/sched.h | 4 -
init/Kconfig | 32 -----
kernel/sched/core.c | 301 ++--------------------------------------
kernel/sched/fair.c | 4 -
kernel/sched/rt.c | 4 -
kernel/sched/sched.h | 95 +------------
kernel/sched/syscalls.c | 2 +
7 files changed, 20 insertions(+), 422 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4bdfa1d6be1..012df2f612d4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -722,9 +722,6 @@ struct sched_dl_entity {
};
#ifdef CONFIG_UCLAMP_TASK
-/* Number of utilization clamp buckets (shorter alias) */
-#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
-
/*
* Utilization clamp for a scheduling entity
* @value: clamp value "assigned" to a se
@@ -750,7 +747,6 @@ struct sched_dl_entity {
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
- unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int active : 1;
unsigned int user_defined : 1;
};
diff --git a/init/Kconfig b/init/Kconfig
index d0d021b3fa3b..6d6d2eaa2963 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -843,38 +843,6 @@ config UCLAMP_TASK
enforce or grant any specific bandwidth for tasks.
If in doubt, say N.
-
-config UCLAMP_BUCKETS_COUNT
- int "Number of supported utilization clamp buckets"
- range 5 20
- default 5
- depends on UCLAMP_TASK
- help
- Defines the number of clamp buckets to use. The range of each bucket
- will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
- number of clamp buckets the finer their granularity and the higher
- the precision of clamping aggregation and tracking at run-time.
-
- For example, with the minimum configuration value we will have 5
- clamp buckets tracking 20% utilization each. A 25% boosted tasks will
- be refcounted in the [20..39]% bucket and will set the bucket clamp
- effective value to 25%.
- If a second 30% boosted task should be co-scheduled on the same CPU,
- that task will be refcounted in the same bucket of the first task and
- it will boost the bucket clamp effective value to 30%.
- The clamp effective value of a bucket is reset to its nominal value
- (20% in the example above) when there are no more tasks refcounted in
- that bucket.
-
- An additional boost/capping margin can be added to some tasks. In the
- example above the 25% task will be boosted to 30% until it exits the
- CPU. If that should be considered not acceptable on certain systems,
- it's always possible to reduce the margin by increasing the number of
- clamp buckets to trade off used memory for run-time tracking
- precision.
-
- If in doubt, use the default value.
-
endmenu
#
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b00f884701a6..85c69ca7abaa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1495,54 +1495,6 @@ static struct uclamp_se uclamp_default[UCLAMP_CNT];
*/
DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);
-static inline unsigned int
-uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- /*
- * Avoid blocked utilization pushing up the frequency when we go
- * idle (which drops the max-clamp) by retaining the last known
- * max-clamp.
- */
- if (clamp_id == UCLAMP_MAX) {
- rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
- return clamp_value;
- }
-
- return uclamp_none(UCLAMP_MIN);
-}
-
-static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- /* Reset max-clamp retention only on idle exit */
- if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
- return;
-
- uclamp_rq_set(rq, clamp_id, clamp_value);
-}
-
-static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
- int bucket_id = UCLAMP_BUCKETS - 1;
-
- /*
- * Since both min and max clamps are max aggregated, find the
- * top most bucket with tasks in.
- */
- for ( ; bucket_id >= 0; bucket_id--) {
- if (!bucket[bucket_id].tasks)
- continue;
- return bucket[bucket_id].value;
- }
-
- /* No tasks -- default clamp values */
- return uclamp_idle_value(rq, clamp_id, clamp_value);
-}
-
static void __uclamp_update_util_min_rt_default(struct task_struct *p)
{
unsigned int default_util_min;
@@ -1598,8 +1550,7 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
}
/*
- * The effective clamp bucket index of a task depends on, by increasing
- * priority:
+ * The effective uclamp value of a task depends on, by increasing priority:
* - the task specific clamp value, when explicitly requested from userspace
* - the task group effective clamp value, for tasks not either in the root
* group or in an autogroup
@@ -1620,202 +1571,23 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
{
- struct uclamp_se uc_eff;
-
- /* Task currently refcounted: use back-annotated (effective) value */
- if (p->uclamp[clamp_id].active)
- return (unsigned long)p->uclamp[clamp_id].value;
-
- uc_eff = uclamp_eff_get(p, clamp_id);
-
- return (unsigned long)uc_eff.value;
-}
-
-/*
- * When a task is enqueued on a rq, the clamp bucket currently defined by the
- * task's uclamp::bucket_id is refcounted on that rq. This also immediately
- * updates the rq's clamp value if required.
- *
- * Tasks can have a task-specific value requested from user-space, track
- * within each bucket the maximum value for tasks refcounted in it.
- * This "local max aggregation" allows to track the exact "requested" value
- * for each bucket when all its RUNNABLE tasks require the same clamp.
- */
-static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
- struct uclamp_se *uc_se = &p->uclamp[clamp_id];
- struct uclamp_bucket *bucket;
-
- lockdep_assert_rq_held(rq);
-
- /* Update task effective clamp */
- p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
-
- bucket = &uc_rq->bucket[uc_se->bucket_id];
- bucket->tasks++;
- uc_se->active = true;
-
- uclamp_idle_reset(rq, clamp_id, uc_se->value);
-
- /*
- * Local max aggregation: rq buckets always track the max
- * "requested" clamp value of its RUNNABLE tasks.
- */
- if (bucket->tasks == 1 || uc_se->value > bucket->value)
- bucket->value = uc_se->value;
-
- if (uc_se->value > uclamp_rq_get(rq, clamp_id))
- uclamp_rq_set(rq, clamp_id, uc_se->value);
-}
-
-/*
- * When a task is dequeued from a rq, the clamp bucket refcounted by the task
- * is released. If this is the last task reference counting the rq's max
- * active clamp value, then the rq's clamp value is updated.
- *
- * Both refcounted tasks and rq's cached clamp values are expected to be
- * always valid. If it's detected they are not, as defensive programming,
- * enforce the expected state and warn.
- */
-static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
- struct uclamp_se *uc_se = &p->uclamp[clamp_id];
- struct uclamp_bucket *bucket;
- unsigned int bkt_clamp;
- unsigned int rq_clamp;
-
- lockdep_assert_rq_held(rq);
-
- /*
- * If sched_uclamp_used was enabled after task @p was enqueued,
- * we could end up with unbalanced call to uclamp_rq_dec_id().
- *
- * In this case the uc_se->active flag should be false since no uclamp
- * accounting was performed at enqueue time and we can just return
- * here.
- *
- * Need to be careful of the following enqueue/dequeue ordering
- * problem too
- *
- * enqueue(taskA)
- * // sched_uclamp_used gets enabled
- * enqueue(taskB)
- * dequeue(taskA)
- * // Must not decrement bucket->tasks here
- * dequeue(taskB)
- *
- * where we could end up with stale data in uc_se and
- * bucket[uc_se->bucket_id].
- *
- * The following check here eliminates the possibility of such race.
- */
- if (unlikely(!uc_se->active))
- return;
-
- bucket = &uc_rq->bucket[uc_se->bucket_id];
-
- SCHED_WARN_ON(!bucket->tasks);
- if (likely(bucket->tasks))
- bucket->tasks--;
-
- uc_se->active = false;
-
- /*
- * Keep "local max aggregation" simple and accept to (possibly)
- * overboost some RUNNABLE tasks in the same bucket.
- * The rq clamp bucket value is reset to its base value whenever
- * there are no more RUNNABLE tasks refcounting it.
- */
- if (likely(bucket->tasks))
- return;
-
- rq_clamp = uclamp_rq_get(rq, clamp_id);
- /*
- * Defensive programming: this should never happen. If it happens,
- * e.g. due to future modification, warn and fix up the expected value.
- */
- SCHED_WARN_ON(bucket->value > rq_clamp);
- if (bucket->value >= rq_clamp) {
- bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
- uclamp_rq_set(rq, clamp_id, bkt_clamp);
- }
-}
-
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
-{
- enum uclamp_id clamp_id;
-
- /*
- * Avoid any overhead until uclamp is actually used by the userspace.
- *
- * The condition is constructed such that a NOP is generated when
- * sched_uclamp_used is disabled.
- */
- if (!static_branch_unlikely(&sched_uclamp_used))
- return;
-
- if (unlikely(!p->sched_class->uclamp_enabled))
- return;
-
- if (p->se.sched_delayed)
- return;
-
- for_each_clamp_id(clamp_id)
- uclamp_rq_inc_id(rq, p, clamp_id);
+ if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
+ return uclamp_none(clamp_id);
- /* Reset clamp idle holding when there is one RUNNABLE task */
- if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
- rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
+ return p->uclamp[clamp_id].value;
}
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+void uclamp_update_active_nolock(struct task_struct *p)
{
enum uclamp_id clamp_id;
- /*
- * Avoid any overhead until uclamp is actually used by the userspace.
- *
- * The condition is constructed such that a NOP is generated when
- * sched_uclamp_used is disabled.
- */
- if (!static_branch_unlikely(&sched_uclamp_used))
- return;
-
- if (unlikely(!p->sched_class->uclamp_enabled))
- return;
-
- if (p->se.sched_delayed)
- return;
-
for_each_clamp_id(clamp_id)
- uclamp_rq_dec_id(rq, p, clamp_id);
-}
-
-static inline void uclamp_rq_reinc_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- if (!p->uclamp[clamp_id].active)
- return;
-
- uclamp_rq_dec_id(rq, p, clamp_id);
- uclamp_rq_inc_id(rq, p, clamp_id);
-
- /*
- * Make sure to clear the idle flag if we've transiently reached 0
- * active tasks on rq.
- */
- if (clamp_id == UCLAMP_MAX && (rq->uclamp_flags & UCLAMP_FLAG_IDLE))
- rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
+ p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
}
static inline void
uclamp_update_active(struct task_struct *p)
{
- enum uclamp_id clamp_id;
struct rq_flags rf;
struct rq *rq;
@@ -1829,14 +1601,7 @@ uclamp_update_active(struct task_struct *p)
*/
rq = task_rq_lock(p, &rf);
- /*
- * Setting the clamp bucket is serialized by task_rq_lock().
- * If the task is not yet RUNNABLE and its task_struct is not
- * affecting a valid clamp bucket, the next time it's enqueued,
- * it will already see the updated clamp bucket value.
- */
- for_each_clamp_id(clamp_id)
- uclamp_rq_reinc_id(rq, p, clamp_id);
+ uclamp_update_active_nolock(p);
task_rq_unlock(rq, p, &rf);
}
@@ -1968,20 +1733,14 @@ static void uclamp_fork(struct task_struct *p)
{
enum uclamp_id clamp_id;
- /*
- * We don't need to hold task_rq_lock() when updating p->uclamp_* here
- * as the task is still at its early fork stages.
- */
- for_each_clamp_id(clamp_id)
- p->uclamp[clamp_id].active = false;
-
- if (likely(!p->sched_reset_on_fork))
- return;
-
- for_each_clamp_id(clamp_id) {
- uclamp_se_set(&p->uclamp_req[clamp_id],
- uclamp_none(clamp_id), false);
+ if (unlikely(p->sched_reset_on_fork)) {
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&p->uclamp_req[clamp_id],
+ uclamp_none(clamp_id), false);
+ }
}
+
+ uclamp_update_active(p);
}
static void uclamp_post_fork(struct task_struct *p)
@@ -1989,28 +1748,10 @@ static void uclamp_post_fork(struct task_struct *p)
uclamp_update_util_min_rt_default(p);
}
-static void __init init_uclamp_rq(struct rq *rq)
-{
- enum uclamp_id clamp_id;
- struct uclamp_rq *uc_rq = rq->uclamp;
-
- for_each_clamp_id(clamp_id) {
- uc_rq[clamp_id] = (struct uclamp_rq) {
- .value = uclamp_none(clamp_id)
- };
- }
-
- rq->uclamp_flags = UCLAMP_FLAG_IDLE;
-}
-
static void __init init_uclamp(void)
{
struct uclamp_se uc_max = {};
enum uclamp_id clamp_id;
- int cpu;
-
- for_each_possible_cpu(cpu)
- init_uclamp_rq(cpu_rq(cpu));
for_each_clamp_id(clamp_id) {
uclamp_se_set(&init_task.uclamp_req[clamp_id],
@@ -2029,8 +1770,6 @@ static void __init init_uclamp(void)
}
#else /* !CONFIG_UCLAMP_TASK */
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_fork(struct task_struct *p) { }
static inline void uclamp_post_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
@@ -2066,11 +1805,6 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
update_rq_clock(rq);
p->sched_class->enqueue_task(rq, p, flags);
- /*
- * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
- * ->sched_delayed.
- */
- uclamp_rq_inc(rq, p);
psi_enqueue(p, flags);
@@ -2097,11 +1831,6 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags);
- /*
- * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
- * and mark the task ->sched_delayed.
- */
- uclamp_rq_dec(rq, p);
return p->sched_class->dequeue_task(rq, p, flags);
}
@@ -9079,6 +8808,7 @@ void sched_move_task(struct task_struct *tsk)
sched_change_group(tsk, group);
scx_move_task(tsk);
+ uclamp_update_active_nolock(tsk);
if (queued)
enqueue_task(rq, tsk, queue_flags);
@@ -9225,7 +8955,6 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
if (eff[clamp_id] == uc_se[clamp_id].value)
continue;
uc_se[clamp_id].value = eff[clamp_id];
- uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
if (!clamps) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3d91dbd19a85..68e7b1ac7a57 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13626,10 +13626,6 @@ DEFINE_SCHED_CLASS(fair) = {
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_fair,
#endif
-
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
};
#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4b8e33c615b1..4cd8d3e06eeb 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2649,10 +2649,6 @@ DEFINE_SCHED_CLASS(rt) = {
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_rt,
#endif
-
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
};
#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b50e3d6e79c4..cfeaefcec8b6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1050,46 +1050,6 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif /* CONFIG_SMP */
#ifdef CONFIG_UCLAMP_TASK
-/*
- * struct uclamp_bucket - Utilization clamp bucket
- * @value: utilization clamp value for tasks on this clamp bucket
- * @tasks: number of RUNNABLE tasks on this clamp bucket
- *
- * Keep track of how many tasks are RUNNABLE for a given utilization
- * clamp value.
- */
-struct uclamp_bucket {
- unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
- unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
-};
-
-/*
- * struct uclamp_rq - rq's utilization clamp
- * @value: currently active clamp values for a rq
- * @bucket: utilization clamp buckets affecting a rq
- *
- * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
- * A clamp value is affecting a rq when there is at least one task RUNNABLE
- * (or actually running) with that value.
- *
- * There are up to UCLAMP_CNT possible different clamp values, currently there
- * are only two: minimum utilization and maximum utilization.
- *
- * All utilization clamping values are MAX aggregated, since:
- * - for util_min: we want to run the CPU at least at the max of the minimum
- * utilization required by its currently RUNNABLE tasks.
- * - for util_max: we want to allow the CPU to run up to the max of the
- * maximum utilization allowed by its currently RUNNABLE tasks.
- *
- * Since on each system we expect only a limited number of different
- * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
- * the metrics required to compute all the per-rq utilization clamp values.
- */
-struct uclamp_rq {
- unsigned int value;
- struct uclamp_bucket bucket[UCLAMP_BUCKETS];
-};
-
DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
#endif /* CONFIG_UCLAMP_TASK */
@@ -1126,10 +1086,6 @@ struct rq {
u64 nr_switches;
#ifdef CONFIG_UCLAMP_TASK
- /* Utilization clamp values based on CPU's RUNNABLE tasks */
- struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
- unsigned int uclamp_flags;
-#define UCLAMP_FLAG_IDLE 0x01
#endif
struct cfs_rq cfs;
@@ -2409,11 +2365,6 @@ struct affinity_context {
extern s64 update_curr_common(struct rq *rq);
struct sched_class {
-
-#ifdef CONFIG_UCLAMP_TASK
- int uclamp_enabled;
-#endif
-
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
bool (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
@@ -3393,23 +3344,7 @@ static inline bool update_other_load_avgs(struct rq *rq) { return false; }
#ifdef CONFIG_UCLAMP_TASK
unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
-
-static inline unsigned long uclamp_rq_get(struct rq *rq,
- enum uclamp_id clamp_id)
-{
- return READ_ONCE(rq->uclamp[clamp_id].value);
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int value)
-{
- WRITE_ONCE(rq->uclamp[clamp_id].value, value);
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
- return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
-}
+void uclamp_update_active_nolock(struct task_struct *p);
/*
* When uclamp is compiled in, the aggregation at rq level is 'turned off'
@@ -3437,19 +3372,10 @@ static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
return SCHED_CAPACITY_SCALE;
}
-/* Integer rounded range for each bucket */
-#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
-
-static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
-{
- return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1);
-}
-
static inline void
uclamp_se_set(struct uclamp_se *uc_se, unsigned int value, bool user_defined)
{
uc_se->value = value;
- uc_se->bucket_id = uclamp_bucket_id(value);
uc_se->user_defined = user_defined;
}
@@ -3480,26 +3406,11 @@ uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
return SCHED_CAPACITY_SCALE;
}
-static inline bool uclamp_is_used(void)
-{
- return false;
-}
-
-static inline unsigned long
-uclamp_rq_get(struct rq *rq, enum uclamp_id clamp_id)
-{
- if (clamp_id == UCLAMP_MIN)
- return 0;
-
- return SCHED_CAPACITY_SCALE;
-}
-
-static inline void
-uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id, unsigned int value)
+static inline void uclamp_update_active_nolock(struct task_struct *p)
{
}
-static inline bool uclamp_rq_is_idle(struct rq *rq)
+static inline bool uclamp_is_used(void)
{
return false;
}
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 9f40348f1dc7..24af915f8d18 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -437,6 +437,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
attr->sched_util_max, true);
}
+
+ uclamp_update_active_nolock(p);
}
#else /* !CONFIG_UCLAMP_TASK: */
--
2.34.1
* [PATCH v2 6/8] sched/uclamp: Simplify uclamp_eff_value()
From: Hongyan Xia @ 2025-03-04 14:23 UTC
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel
The commit
sched: Remove all uclamp bucket logic
removes the uclamp_rq_{inc/dec}() functions, so p->uclamp now contains
the correct values at all times after a uclamp_update_active() call, and
there's no need to toggle the boolean `active` after an update. As a
result, uclamp_eff_value() is fairly simple now and can live as a static
inline function.
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
kernel/sched/core.c | 13 ++++---------
kernel/sched/sched.h | 10 +++++++++-
2 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 85c69ca7abaa..f3825e36ae85 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1569,20 +1569,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
return uc_req;
}
-unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
-{
- if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
- return uclamp_none(clamp_id);
-
- return p->uclamp[clamp_id].value;
-}
-
void uclamp_update_active_nolock(struct task_struct *p)
{
enum uclamp_id clamp_id;
- for_each_clamp_id(clamp_id)
+ for_each_clamp_id(clamp_id) {
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
+ p->uclamp[clamp_id].active = 1;
+ }
}
static inline void
@@ -1737,6 +1731,7 @@ static void uclamp_fork(struct task_struct *p)
for_each_clamp_id(clamp_id) {
uclamp_se_set(&p->uclamp_req[clamp_id],
uclamp_none(clamp_id), false);
+ p->uclamp[clamp_id].active = 0;
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cfeaefcec8b6..f4a82e6cc029 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3343,7 +3343,6 @@ static inline bool update_other_load_avgs(struct rq *rq) { return false; }
#ifdef CONFIG_UCLAMP_TASK
-unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
void uclamp_update_active_nolock(struct task_struct *p);
/*
@@ -3372,6 +3371,15 @@ static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
return SCHED_CAPACITY_SCALE;
}
+static inline unsigned long
+uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
+{
+ if (uclamp_is_used() && p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].value;
+
+ return uclamp_none(clamp_id);
+}
+
static inline void
uclamp_se_set(struct uclamp_se *uc_se, unsigned int value, bool user_defined)
{
--
2.34.1
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH v2 7/8] sched/uclamp: Propagate negative bias
2025-03-04 14:23 [PATCH v2 0/8] uclamp sum aggregation Hongyan Xia
` (5 preceding siblings ...)
2025-03-04 14:23 ` [PATCH v2 6/8] sched/uclamp: Simplify uclamp_eff_value() Hongyan Xia
@ 2025-03-04 14:23 ` Hongyan Xia
2025-03-04 14:23 ` [PATCH v2 8/8] sched/uclamp: Solve under-utilization problem Hongyan Xia
2025-03-06 11:12 ` [PATCH v2 0/8] uclamp sum aggregation Xuewen Yan
8 siblings, 0 replies; 15+ messages in thread
From: Hongyan Xia @ 2025-03-04 14:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel
Negative bias is interesting, because dequeuing such a task will
actually increase rq utilization: the negative bias is removed from the
rq sum immediately, while the dequeued task's util_avg contribution
only decays gradually.
Solve this by applying PELT decay to negative biases as well. This can
in fact be implemented cheaply with a small math trick.
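For illustration, here is a standalone userspace sketch (hypothetical,
not part of the patch; the numbers are made up) showing that decaying
the bias separately and folding it into util_avg once at dequeue time
yield the same signal:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* PELT decay factor y, chosen such that y^32 == 0.5. */
	const double y = pow(0.5, 1.0 / 32.0);
	const double util_old = 600.0, neg_bias = 200.0;
	const int p = 16;	/* periods elapsed */

	double tracked_separately = util_old * pow(y, p) - neg_bias * pow(y, p);
	double folded_at_dequeue = (util_old - neg_bias) * pow(y, p);

	/* Both print 282.842712... */
	printf("%f %f\n", tracked_separately, folded_at_dequeue);
	return 0;
}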
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 4 ++++
2 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68e7b1ac7a57..944953b90297 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4886,6 +4886,48 @@ static inline unsigned long root_cfs_util_uclamp(struct rq *rq)
return max(ret, 0L);
}
+
+/*
+ * Negative biases are tricky. If we remove them right away then dequeuing a
+ * uclamp_max task has the interesting effect that dequeuing results in a higher
+ * rq utilization. Solve this by applying PELT decay to the bias itself.
+ *
+ * Keeping track of a PELT-decayed negative bias is extra overhead. However, we
+ * observe this interesting math property, where y is the decay factor and p is
+ * the number of periods elapsed:
+ *
+ * util_new = util_old * y^p - neg_bias * y^p
+ * = (util_old - neg_bias) * y^p
+ *
+ * Therefore, we simply subtract the negative bias from util_avg the moment we
+ * dequeue, then the PELT signal itself is the total of util_avg and the decayed
+ * negative bias, and we no longer need to track the decayed bias separately.
+ */
+static void propagate_negative_bias(struct task_struct *p)
+{
+ if (task_util_bias(p) < 0 && !task_on_rq_migrating(p)) {
+ unsigned long neg_bias = -task_util_bias(p);
+ struct sched_entity *se = &p->se;
+
+ p->se.avg.util_avg_bias = 0;
+
+ for_each_sched_entity(se) {
+ struct sched_avg *sa = &se->avg;
+ u32 divider = get_pelt_divider(sa);
+
+ sub_positive(&sa->util_avg, neg_bias);
+ sub_positive(&sa->util_sum, neg_bias * divider);
+ sa->util_sum = max_t(u32, sa->util_sum,
+ sa->util_avg * PELT_MIN_DIVIDER);
+ sa = &cfs_rq_of(se)->avg;
+ divider = get_pelt_divider(sa);
+ sub_positive(&sa->util_avg, neg_bias);
+ sub_positive(&sa->util_sum, neg_bias * divider);
+ sa->util_sum = max_t(u32, sa->util_sum,
+ sa->util_avg * PELT_MIN_DIVIDER);
+ }
+ }
+}
#else
static inline long task_util_bias(struct task_struct *p)
{
@@ -7114,8 +7156,10 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
}
sub_nr_running(rq, h_nr_queued);
- if (p)
+ if (p) {
util_bias_dequeue(rq, p);
+ propagate_negative_bias(p);
+ }
if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
dl_server_stop(&rq->fair_server);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f4a82e6cc029..654eede62979 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3431,6 +3431,10 @@ static inline void util_bias_dequeue(struct rq *rq, struct task_struct *p)
{
}
+static inline void propagate_negative_bias(struct task_struct *p)
+{
+}
+
#endif /* !CONFIG_UCLAMP_TASK */
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
--
2.34.1
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH v2 8/8] sched/uclamp: Solve under-utilization problem
2025-03-04 14:23 [PATCH v2 0/8] uclamp sum aggregation Hongyan Xia
` (6 preceding siblings ...)
2025-03-04 14:23 ` [PATCH v2 7/8] sched/uclamp: Propagate negative bias Hongyan Xia
@ 2025-03-04 14:23 ` Hongyan Xia
2025-03-06 11:12 ` [PATCH v2 0/8] uclamp sum aggregation Xuewen Yan
8 siblings, 0 replies; 15+ messages in thread
From: Hongyan Xia @ 2025-03-04 14:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider
Cc: Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel
With sum aggregation, a heavily uclamp_max-throttled task may throttle
the whole rq, resulting in a low OPP.
For example, take two always-running tasks with the same priority. One
task has no uclamp values but the other has a uclamp_max of 1. Under
sum aggregation, the CPU will then run at the OPP for 512 + 1 = 513
utilization, which means the task without uclamp_max only gets 513 / 2
= 256 utilization, even though the CPU could still run faster.
With this patch, we do not throttle a uclamp_max task so hard that it
impacts other tasks. This is done by tracking the highest uclamp_factor
on the rq; no uclamp_max task can throttle the rq beyond what this
factor allows.
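To illustrate the arithmetic (a hypothetical standalone sketch, not
part of the patch, assuming both tasks in the example above run at
nice 0, where the mainline sched_prio_to_weight[]/sched_prio_to_wmult[]
tables give weight == 1024 and wmult == 4194304 ~= 2^32 / weight):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* nice-0 entries of the mainline weight tables */
	const uint64_t weight = 1024, wmult = 4194304;

	/* The unclamped always-running task sits at ~512 utilization. */
	uint64_t max_uclamp_factor = wmult * 512;

	/* Floor enforced on the capped task's uclamp_max of 1. */
	uint64_t floor = (max_uclamp_factor * weight) >> 32;

	/* Prints 512. */
	printf("effective uclamp_max >= %llu\n", (unsigned long long)floor);
	return 0;
}

The capped task's effective uclamp_max is thus floored at 512, so the
rq runs at roughly 1024 utilization and the unclamped task keeps its
full share.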
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
kernel/sched/fair.c | 12 ++++++++++++
kernel/sched/pelt.c | 33 +++++++++++++++++++++++++++++----
kernel/sched/sched.h | 2 ++
3 files changed, 43 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 944953b90297..966ca63da3fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7159,6 +7159,18 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (p) {
util_bias_dequeue(rq, p);
propagate_negative_bias(p);
+ if (p->pid == rq->max_uclamp_factor_pid) {
+ /*
+ * If the task with the highest uclamp_factor gets
+ * dequeued, the correct thing to do is to set pid and
+ * factor to the second highest. However, the overhead
+ * isn't really necessary because the second highest
+ * will set these fields the next time it gets updated
+ * anyway.
+ */
+ rq->max_uclamp_factor_pid = -1;
+ rq->max_uclamp_factor = 0;
+ }
}
if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index f38abe6f0b8b..e96ca045af2e 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -271,8 +271,8 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
static void util_bias_update(struct task_struct *p)
{
unsigned int util, uclamp_min, uclamp_max;
- struct rq *rq;
- int old, new;
+ struct rq *rq = task_rq(p);
+ int old, new, clamped_util, prio = p->prio - MAX_RT_PRIO;
util = READ_ONCE(p->se.avg.util_avg);
uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
@@ -284,12 +284,37 @@ static void util_bias_update(struct task_struct *p)
if (uclamp_max == SCHED_CAPACITY_SCALE)
uclamp_max = UINT_MAX;
old = READ_ONCE(p->se.avg.util_avg_bias);
- new = (int)clamp(util, uclamp_min, uclamp_max) - (int)util;
+ clamped_util = (int)clamp(util, uclamp_min, uclamp_max);
+ if (p->se.on_rq && prio >= 0) {
+ /* We only do this for fair class priorities. */
+ u64 uclamp_factor = sched_prio_to_wmult[prio];
+
+ /* This has to be a 64-bit multiplication. */
+ uclamp_factor *= clamped_util;
+ if (rq->max_uclamp_factor_pid == p->pid) {
+ rq->max_uclamp_factor = uclamp_factor;
+ } else if (uclamp_factor > rq->max_uclamp_factor) {
+ rq->max_uclamp_factor = uclamp_factor;
+ rq->max_uclamp_factor_pid = p->pid;
+ } else {
+ u32 weight = sched_prio_to_weight[prio];
+
+ /*
+ * We cannot throttle too much if some other task is
+ * running at high utilization. We should prioritize
+ * giving that task enough utilization and respect
+ * task priority, before enforcing uclamp_max.
+ */
+ uclamp_max = max(uclamp_max,
+ (rq->max_uclamp_factor * weight) >> 32);
+ clamped_util = (int)clamp(util, uclamp_min, uclamp_max);
+ }
+ }
+ new = clamped_util - (int)util;
WRITE_ONCE(p->se.avg.util_avg_bias, new);
if (!p->se.on_rq)
return;
- rq = task_rq(p);
WRITE_ONCE(rq->cfs.avg.util_avg_bias,
READ_ONCE(rq->cfs.avg.util_avg_bias) + new - old);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 654eede62979..0dc90208ad73 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1086,6 +1086,8 @@ struct rq {
u64 nr_switches;
#ifdef CONFIG_UCLAMP_TASK
+ u64 max_uclamp_factor;
+ pid_t max_uclamp_factor_pid;
#endif
struct cfs_rq cfs;
--
2.34.1
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v2 0/8] uclamp sum aggregation
2025-03-04 14:23 [PATCH v2 0/8] uclamp sum aggregation Hongyan Xia
` (7 preceding siblings ...)
2025-03-04 14:23 ` [PATCH v2 8/8] sched/uclamp: Solve under-utilization problem Hongyan Xia
@ 2025-03-06 11:12 ` Xuewen Yan
2025-03-06 11:32 ` Hongyan Xia
8 siblings, 1 reply; 15+ messages in thread
From: Xuewen Yan @ 2025-03-06 11:12 UTC (permalink / raw)
To: Hongyan Xia
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel, Xuewen Yan
Hi Hongyan,
On Tue, Mar 4, 2025 at 10:26 PM Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> This series gives an alternative implementation that addresses some of
> the problems in uclamp max aggregation. Sum aggregation mostly gives:
>
> 1. Simplicity. Sum aggregation implements uclamp with less than half of
> code than max aggregation.
> 2. Effectiveness. Sum aggregation shows better uclamp effectiveness,
> either in benchmark scores or sometimes in energy efficiency.
>
> The key idea of sum aggregation is fairly simple. Each task has a
> util_avg_bias, which is obtained by:
>
> util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg;
>
> If a CPU has N tasks, p1, p2, p3... pN, then we sum the biases up and
> obtain a rq total bias:
>
> rq_bias = util_avg_bias1 + util_avg_bias2... + util_avg_biasN;
>
> Then we use the biased rq utilization rq_util + rq_bias to select OPP
> and to schedule tasks.
>
> PATCH BREAKDOWN:
>
> Patch 1 reverts a patch that accommodates uclamp_max tasks under max
> aggregation. This patch is not needed and creates other problems for sum
> aggregation. It is discussed elsewhere that this patch will be improved
> and there may not be the need to revert it in the future.
>
> Patch 2, 3 and 4 implement sum aggregation.
>
> Patch 5 and 6 remove max aggregation.
>
> Patch 7 applies PELT decay on negative util_avg_bias. This improves
> energy efficiency and task placement, but is not strictly necessary.
>
> Patch 8 addresses sum aggregation under-utilization problem.
>
> TESTING:
>
> Two notebooks are shared at
>
> https://nbviewer.org/github/honxia02/notebooks/blob/aac12d9becae2b2fe4690cbb672439fd884ede30/whitebox/max.ipynb
> https://nbviewer.org/github/honxia02/notebooks/blob/aac12d9becae2b2fe4690cbb672439fd884ede30/whitebox/sum-offset.ipynb
>
> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are
> little cores with capacity of 383. CPU4-5 are big cores. The rt-app
> profiles used for these experiments are included in the notebooks.
>
> Scenario 1: Scheduling 4 tasks with UCLAMP_MAX at 110.
>
> The scheduling decisions are plotted in Out[11]. Both max and sum
> aggregation understand the UCLAMP_MAX hint and schedule all 4 tasks on
> the little cluster. Max aggregation sometimes schedules 2 tasks on 1 CPU,
> and this is the reason why sum aggregation reverts the 1st commit.
>
> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
> slightly above the capacity of the little CPU.
>
> Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
> on the big core. Both max and sum aggregation handle this correctly.
>
> Scenario 3: Task A is a task with a small utilization pinned to CPU4.
> Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
> at 300. After a while, task A is then pinned to CPU5, joining B.
>
> Results are in Out[23]. Max aggregation sees a frequency spike at
> 873.64s. When zoomed in, one can see square-wave-like utilization values
> because of A periodically going to sleep. When A wakes up, its default
> UCLAMP_MAX of 1024 will uncap B and reach the highest CPU frequency.
> When A sleeps, B's UCLAMP_MAX will be in effect and will reduce rq
> utilization. This happens repeatedly, hence the square wave. In
> contrast, sum aggregation sees a normal increase in utilization when A
> joins B, without any square-wave behavior.
>
> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 110 pinned to the
> little PD (CPU0-3). Another 4 identical tasks pinned to the big PD (CPU4-5).
> After a while, remove the CPU pinning of the 4 tasks on the big PD.
>
> Results are in Out[29]. After unpinning, max aggregation moves all 8
> tasks to the little cluster, but schedules 5 tasks on CPU0 and 1 each on
> CPU1-3. In contrast, sum aggregation schedules 2 on each little CPU
> after unpinning, which is the desired balanced task placement.
>
> EVALUATION:
>
> We backport patches to GKI kernel v6.1 on Pixel 9 and run Android
> benchmarks.
>
> Speedometer:
>
> We run Speedometer 2.1 on Chrome v131 to test ADPF/uclamp effectiveness.
> Because sum aggregation does not circumvent the 25% OPP margin, we
> reduce uclamp values to 80% to be fair.
>
> | | score | % | CPU power % |
> | max | 192.4 | | |
> | sum_0.8 | 230.8 | +19.96 | +31.54 |
> | sum_tuned | 201.8 | +4.89 | -0.41 |
>
> We see a consistently higher score and higher average power
> consumption. Note that a higher score also means a reduction in
> run-time; the total energy increase for sum_0.8 is only 9.65%.
>
> We then reduce uclamp values so that power consumption is roughly
> the same. If we do so, then sum aggregation achieves slightly better
> scores, shown in the sum_tuned row.
>
> UIBench:
>
> | | jank percentage | % | CPU power (mW) | % |
> | max | 0.115% | | 158.1 | |
> | sum_0.8 | 0.129% | +11.96 | 154.9 | -4.19 |
>
> UIBench on Pixel 9 by default already has a low enough jank percentage.
> Moving to sum aggregation gives slightly higher jank percentage and
> lower power consumption.
>
> ---
> Changed in v2:
> - Completely separate uclamp component from PELT and util_est.
> - Separate util_est_uclamp into an individual patch.
> - Address the under-utilization problem.
> - Update Python notebooks to reflect the latest sched/tip.
>
> Hongyan Xia (8):
> Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
> 0"
> sched/uclamp: Track a new util_avg_bias signal
> sched/uclamp: Add util_est_uclamp
> sched/fair: Use util biases for utilization and frequency
> sched/uclamp: Remove all uclamp bucket logic
I’ve recently been looking into the issue with uclamp and
delayed-dequeue, and I found that uclamp_rq_inc should be placed
before enqueue_task, which led to a patch.
Before sending the patch, I came across your series of patches. I
haven’t fully understood your patch yet, but it seems like
uclamp_rq_inc is no longer needed.
Do you think the patch below is still necessary?
--->
Subject: [PATCH] sched/uclamp: Update the rq's uclamp before enqueue task
When a task's uclamp is set, we want the CPU frequency to
increase as quickly as possible when the task is enqueued.
Because the CPU frequency update happens during enqueue_task(),
the rq's uclamp needs to be updated before the task is enqueued.
For sched-delayed tasks, the rq uclamp should only be updated
when they are enqueued upon being awakened.
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
---
kernel/sched/core.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67189907214d..b07e78910221 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1747,7 +1747,7 @@ static inline void uclamp_rq_dec_id(struct rq
*rq, struct task_struct *p,
}
}
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct
*p, int flags)
{
enum uclamp_id clamp_id;
@@ -1763,7 +1763,8 @@ static inline void uclamp_rq_inc(struct rq *rq,
struct task_struct *p)
if (unlikely(!p->sched_class->uclamp_enabled))
return;
- if (p->se.sched_delayed)
+ /* Only inc the delayed task which is being woken up. */
+ if (p->se.sched_delayed && !(flags & ENQUEUE_DELAYED))
return;
for_each_clamp_id(clamp_id)
@@ -2031,7 +2032,7 @@ static void __init init_uclamp(void)
}
#else /* !CONFIG_UCLAMP_TASK */
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
+static inline void uclamp_rq_inc(struct rq *rq, struct task_struct
*p, int flags) { }
static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_fork(struct task_struct *p) { }
static inline void uclamp_post_fork(struct task_struct *p) { }
@@ -2067,12 +2068,9 @@ void enqueue_task(struct rq *rq, struct
task_struct *p, int flags)
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
+ uclamp_rq_inc(rq, p, flags);
+
p->sched_class->enqueue_task(rq, p, flags);
- /*
- * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
- * ->sched_delayed.
- */
- uclamp_rq_inc(rq, p);
psi_enqueue(p, flags);
--
Thanks!
BR
---
> sched/uclamp: Simplify uclamp_eff_value()
> sched/uclamp: Propagate negative bias
> sched/uclamp: Solve under-utilization problem
>
> include/linux/sched.h | 8 +-
> init/Kconfig | 32 ---
> kernel/sched/core.c | 308 ++--------------------
> kernel/sched/cpufreq_schedutil.c | 6 +-
> kernel/sched/debug.c | 2 +-
> kernel/sched/fair.c | 430 ++++++++++++++++---------------
> kernel/sched/pelt.c | 62 +++++
> kernel/sched/rt.c | 4 -
> kernel/sched/sched.h | 132 +++-------
> kernel/sched/syscalls.c | 2 +
> 10 files changed, 341 insertions(+), 645 deletions(-)
>
> --
> 2.34.1
>
>
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v2 0/8] uclamp sum aggregation
2025-03-06 11:12 ` [PATCH v2 0/8] uclamp sum aggregation Xuewen Yan
@ 2025-03-06 11:32 ` Hongyan Xia
2025-03-06 11:38 ` Xuewen Yan
0 siblings, 1 reply; 15+ messages in thread
From: Hongyan Xia @ 2025-03-06 11:32 UTC (permalink / raw)
To: Xuewen Yan
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel, Xuewen Yan
Hi Xuewen,
On 06/03/2025 11:12, Xuewen Yan wrote:
> Hi Hongyan,
>
> On Tue, Mar 4, 2025 at 10:26 PM Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>
>> This series gives an alternative implementation that addresses some of
>> the problems in uclamp max aggregation. Sum aggregation mostly gives:
>>
>> 1. Simplicity. Sum aggregation implements uclamp with less than half of
>> code than max aggregation.
>> 2. Effectiveness. Sum aggregation shows better uclamp effectiveness,
>> either in benchmark scores or sometimes in energy efficiency.
>>
>> The key idea of sum aggregation is fairly simple. Each task has a
>> util_avg_bias, which is obtained by:
>>
>> util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg;
>>
>> If a CPU has N tasks, p1, p2, p3... pN, then we sum the biases up and
>> obtain a rq total bias:
>>
>> rq_bias = util_avg_bias1 + util_avg_bias2... + util_avg_biasN;
>>
>> Then we use the biased rq utilization rq_util + rq_bias to select OPP
>> and to schedule tasks.
>>
>> PATCH BREAKDOWN:
>>
>> Patch 1 reverts a patch that accommodates uclamp_max tasks under max
>> aggregation. This patch is not needed and creates other problems for sum
>> aggregation. It is discussed elsewhere that this patch will be improved
>> and there may not be the need to revert it in the future.
>>
>> Patch 2, 3 and 4 implement sum aggregation.
>>
>> Patch 5 and 6 remove max aggregation.
>>
>> Patch 7 applies PELT decay on negative util_avg_bias. This improves
>> energy efficiency and task placement, but is not strictly necessary.
>>
>> Patch 8 addresses sum aggregation under-utilization problem.
>>
>> TESTING:
>>
>> Two notebooks are shared at
>>
>> https://nbviewer.org/github/honxia02/notebooks/blob/aac12d9becae2b2fe4690cbb672439fd884ede30/whitebox/max.ipynb
>> https://nbviewer.org/github/honxia02/notebooks/blob/aac12d9becae2b2fe4690cbb672439fd884ede30/whitebox/sum-offset.ipynb
>>
>> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are
>> little cores with capacity of 383. CPU4-5 are big cores. The rt-app
>> profiles used for these experiments are included in the notebooks.
>>
>> Scenario 1: Scheduling 4 tasks with UCLAMP_MAX at 110.
>>
>> The scheduling decisions are plotted in Out[11]. Both max and sum
>> aggregation understand the UCLAMP_MAX hint and schedule all 4 tasks on
>> the little cluster. Max aggregation sometimes schedules 2 tasks on 1 CPU,
>> and this is the reason why sum aggregation reverts the 1st commit.
>>
>> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
>> slightly above the capacity of the little CPU.
>>
>> Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
>> on the big core. Both max and sum aggregation handle this correctly.
>>
>> Scenario 3: Task A is a task with a small utilization pinned to CPU4.
>> Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
>> at 300. After a while, task A is then pinned to CPU5, joining B.
>>
>> Results are in Out[23]. Max aggregation sees a frequency spike at
>> 873.64s. When zoomed in, one can see square-wave-like utilization values
>> because of A periodically going to sleep. When A wakes up, its default
>> UCLAMP_MAX of 1024 will uncap B and reach the highest CPU frequency.
>> When A sleeps, B's UCLAMP_MAX will be in effect and will reduce rq
>> utilization. This happens repeatedly, hence the square wave. In
>> contrast, sum aggregation sees a normal increase in utilization when A
>> joins B, without any square-wave behavior.
>>
>> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 110 pinned to the
>> little PD (CPU0-3). Another 4 identical tasks pinned to the big PD (CPU4-5).
>> After a while, remove the CPU pinning of the 4 tasks on the big PD.
>>
>> Results are in Out[29]. After unpinning, max aggregation moves all 8
>> tasks to the little cluster, but schedules 5 tasks on CPU0 and 1 each on
>> CPU1-3. In contrast, sum aggregation schedules 2 on each little CPU
>> after unpinning, which is the desired balanced task placement.
>>
>> EVALUATION:
>>
>> We backport patches to GKI kernel v6.1 on Pixel 9 and run Android
>> benchmarks.
>>
>> Speedometer:
>>
>> We run Speedometer 2.1 on Chrome v131 to test ADPF/uclamp effectiveness.
>> Because sum aggregation does not circumvent the 25% OPP margin, we
>> reduce uclamp values to 80% to be fair.
>>
>> | | score | % | CPU power % |
>> | max | 192.4 | | |
>> | sum_0.8 | 230.8 | +19.96 | +31.54 |
>> | sum_tuned | 201.8 | +4.89 | -0.41 |
>>
>> We see a consistently higher score and higher average power
>> consumption. Note that a higher score also means a reduction in
>> run-time; the total energy increase for sum_0.8 is only 9.65%.
>>
>> We then reduce uclamp values so that power consumption is roughly
>> the same. If we do so, then sum aggregation achieves slightly better
>> scores, shown in the sum_tuned row.
>>
>> UIBench:
>>
>> | | jank percentage | % | CPU power (mW) | % |
>> | max | 0.115% | | 158.1 | |
>> | sum_0.8 | 0.129% | +11.96 | 154.9 | -4.19 |
>>
>> UIBench on Pixel 9 by default already has a low enough jank percentage.
>> Moving to sum aggregation gives slightly higher jank percentage and
>> lower power consumption.
>>
>> ---
>> Changed in v2:
>> - Completely separate uclamp component from PELT and util_est.
>> - Separate util_est_uclamp into an individual patch.
>> - Address the under-utilization problem.
>> - Update Python notebooks to reflect the latest sched/tip.
>>
>> Hongyan Xia (8):
>> Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
>> 0"
>> sched/uclamp: Track a new util_avg_bias signal
>> sched/uclamp: Add util_est_uclamp
>> sched/fair: Use util biases for utilization and frequency
>> sched/uclamp: Remove all uclamp bucket logic
>
> I’ve recently been looking into the issue with uclamp and
> delayed-dequeue, and I found that uclamp_rq_inc should be placed
> before enqueue_task, which led to a patch.
> Before sending the patch, I came across your series of patches. I
> haven’t fully understood your patch yet, but it seems like
> uclamp_rq_inc is no longer needed.
> Do you think the patch below is still necessary?
>
I posted a fix for the issue you mentioned a few days ago here:
https://lore.kernel.org/lkml/84441660bef0a5e67fd09dc3787178d0276dad31.1740664400.git.hongyan.xia2@arm.com/
I think we found the same issue, but our approaches are different. I
fear that more and more complexity like delayed dequeue will go into
each sched_class, so it's better to just let the sched_class handle how
uclamp is enqueued and dequeued within itself rather than leaking the
details into core.c.
Would be nice if you could take a look at my fix.
Your patch is definitely necessary. The thing with this uclamp sum
aggregation series is that there are still debates around it and it
might take a while before everything is settled. So, at the moment we
should view this series and the uclamp enqueue fix as separate things.
> --->
>
> Subject: [PATCH] sched/uclamp: Update the rq's uclamp before enqueue task
>
> When a task's uclamp is set, we want the CPU frequency to
> increase as quickly as possible when the task is enqueued.
> Because the CPU frequency update happens during enqueue_task(),
> the rq's uclamp needs to be updated before the task is enqueued.
> For sched-delayed tasks, the rq uclamp should only be updated
> when they are enqueued upon being awakened.
>
> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
> ---
> kernel/sched/core.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 67189907214d..b07e78910221 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1747,7 +1747,7 @@ static inline void uclamp_rq_dec_id(struct rq
> *rq, struct task_struct *p,
> }
> }
>
> -static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct
> *p, int flags)
> {
> enum uclamp_id clamp_id;
>
> @@ -1763,7 +1763,8 @@ static inline void uclamp_rq_inc(struct rq *rq,
> struct task_struct *p)
> if (unlikely(!p->sched_class->uclamp_enabled))
> return;
>
> - if (p->se.sched_delayed)
> + /* Only inc the delayed task which is being woken up. */
> + if (p->se.sched_delayed && !(flags & ENQUEUE_DELAYED))
> return;
>
> for_each_clamp_id(clamp_id)
> @@ -2031,7 +2032,7 @@ static void __init init_uclamp(void)
> }
>
> #else /* !CONFIG_UCLAMP_TASK */
> -static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct
> *p, int flags) { }
> static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> static inline void uclamp_fork(struct task_struct *p) { }
> static inline void uclamp_post_fork(struct task_struct *p) { }
> @@ -2067,12 +2068,9 @@ void enqueue_task(struct rq *rq, struct
> task_struct *p, int flags)
> if (!(flags & ENQUEUE_NOCLOCK))
> update_rq_clock(rq);
>
> + uclamp_rq_inc(rq, p, flags);
> +
> p->sched_class->enqueue_task(rq, p, flags);
> - /*
> - * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
> - * ->sched_delayed.
> - */
> - uclamp_rq_inc(rq, p);
>
> psi_enqueue(p, flags);
>
> --
>
> Thanks!
>
> BR
> ---
>
>> sched/uclamp: Simplify uclamp_eff_value()
>> sched/uclamp: Propagate negative bias
>> sched/uclamp: Solve under-utilization problem
>>
>> include/linux/sched.h | 8 +-
>> init/Kconfig | 32 ---
>> kernel/sched/core.c | 308 ++--------------------
>> kernel/sched/cpufreq_schedutil.c | 6 +-
>> kernel/sched/debug.c | 2 +-
>> kernel/sched/fair.c | 430 ++++++++++++++++---------------
>> kernel/sched/pelt.c | 62 +++++
>> kernel/sched/rt.c | 4 -
>> kernel/sched/sched.h | 132 +++-------
>> kernel/sched/syscalls.c | 2 +
>> 10 files changed, 341 insertions(+), 645 deletions(-)
>>
>> --
>> 2.34.1
>>
>>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v2 0/8] uclamp sum aggregation
2025-03-06 11:32 ` Hongyan Xia
@ 2025-03-06 11:38 ` Xuewen Yan
2025-03-10 11:34 ` Dietmar Eggemann
0 siblings, 1 reply; 15+ messages in thread
From: Xuewen Yan @ 2025-03-06 11:38 UTC (permalink / raw)
To: Hongyan Xia
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
linux-kernel, Xuewen Yan
On Thu, Mar 6, 2025 at 7:32 PM Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> Hi Xuewen,
>
> On 06/03/2025 11:12, Xuewen Yan wrote:
> > Hi Hongyan,
> >
> > On Tue, Mar 4, 2025 at 10:26 PM Hongyan Xia <hongyan.xia2@arm.com> wrote:
> >>
> >> This series gives an alternative implementation that addresses some of
> >> the problems in uclamp max aggregation. Sum aggregation mostly gives:
> >>
> >> 1. Simplicity. Sum aggregation implements uclamp with less than half of
> >> code than max aggregation.
> >> 2. Effectiveness. Sum aggregation shows better uclamp effectiveness,
> >> either in benchmark scores or sometimes in energy efficiency.
> >>
> >> The key idea of sum aggregation is fairly simple. Each task has a
> >> util_avg_bias, which is obtained by:
> >>
> >> util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg;
> >>
> >> If a CPU has N tasks, p1, p2, p3... pN, then we sum the biases up and
> >> obtain a rq total bias:
> >>
> >> rq_bias = util_avg_bias1 + util_avg_bias2... + util_avg_biasN;
> >>
> >> Then we use the biased rq utilization rq_util + rq_bias to select OPP
> >> and to schedule tasks.
> >>
> >> PATCH BREAKDOWN:
> >>
> >> Patch 1 reverts a patch that accommodates uclamp_max tasks under max
> >> aggregation. This patch is not needed and creates other problems for sum
> >> aggregation. It is discussed elsewhere that this patch will be improved
> >> and there may not be the need to revert it in the future.
> >>
> >> Patch 2, 3 and 4 implement sum aggregation.
> >>
> >> Patch 5 and 6 remove max aggregation.
> >>
> >> Patch 7 applies PELT decay on negative util_avg_bias. This improves
> >> energy efficiency and task placement, but is not strictly necessary.
> >>
> >> Patch 8 addresses sum aggregation under-utilization problem.
> >>
> >> TESTING:
> >>
> >> Two notebooks are shared at
> >>
> >> https://nbviewer.org/github/honxia02/notebooks/blob/aac12d9becae2b2fe4690cbb672439fd884ede30/whitebox/max.ipynb
> >> https://nbviewer.org/github/honxia02/notebooks/blob/aac12d9becae2b2fe4690cbb672439fd884ede30/whitebox/sum-offset.ipynb
> >>
> >> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are
> >> little cores with capacity of 383. CPU4-5 are big cores. The rt-app
> >> profiles used for these experiments are included in the notebooks.
> >>
> >> Scenario 1: Scheduling 4 tasks with UCLAMP_MAX at 110.
> >>
> >> The scheduling decisions are plotted in Out[11]. Both max and sum
> >> aggregation understand the UCLAMP_MAX hint and schedule all 4 tasks on
> >> the little cluster. Max aggregation sometimes schedules 2 tasks on 1 CPU,
> >> and this is the reason why sum aggregation reverts the 1st commit.
> >>
> >> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
> >> slightly above the capacity of the little CPU.
> >>
> >> Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
> >> on the big core. Both max and sum aggregation handle this correctly.
> >>
> >> Scenario 3: Task A is a task with a small utilization pinned to CPU4.
> >> Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
> >> at 300. After a while, task A is then pinned to CPU5, joining B.
> >>
> >> Results are in Out[23]. Max aggregation sees a frequency spike at
> >> 873.64s. When zoomed in, one can see square-wave-like utilization values
> >> because of A periodically going to sleep. When A wakes up, its default
> >> UCLAMP_MAX of 1024 will uncap B and reach the highest CPU frequency.
> >> When A sleeps, B's UCLAMP_MAX will be in effect and will reduce rq
> >> utilization. This happens repeatedly, hence the square wave. In
> >> contrast, sum aggregation sees a normal increase in utilization when A
> >> joins B, without any square-wave behavior.
> >>
> >> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 110 pinned to the
> >> little PD (CPU0-3). Another 4 identical tasks pinned to the big PD (CPU4-5).
> >> After a while, remove the CPU pinning of the 4 tasks on the big PD.
> >>
> >> Results are in Out[29]. After unpinning, max aggregation moves all 8
> >> tasks to the little cluster, but schedules 5 tasks on CPU0 and 1 each on
> >> CPU1-3. In contrast, sum aggregation schedules 2 on each little CPU
> >> after unpinning, which is the desired balanced task placement.
> >>
> >> EVALUATION:
> >>
> >> We backport patches to GKI kernel v6.1 on Pixel 9 and run Android
> >> benchmarks.
> >>
> >> Speedometer:
> >>
> >> We run Speedometer 2.1 on Chrome v131 to test ADPF/uclamp effectiveness.
> >> Because sum aggregation does not circumvent the 25% OPP margin, we
> >> reduce uclamp values to 80% to be fair.
> >>
> >> | | score | % | CPU power % |
> >> | max | 192.4 | | |
> >> | sum_0.8 | 230.8 | +19.96 | +31.54 |
> >> | sum_tuned | 201.8 | +4.89 | -0.41 |
> >>
> >> We see a consistently higher score and higher average power
> >> consumption. Note that a higher score also means a reduction in
> >> run-time; the total energy increase for sum_0.8 is only 9.65%.
> >>
> >> We then reduce uclamp values so that power consumption is roughly
> >> the same. If we do so, then sum aggregation achieves slightly better
> >> scores, shown in the sum_tuned row.
> >>
> >> UIBench:
> >>
> >> | | jank percentage | % | CPU power (mW) | % |
> >> | max | 0.115% | | 158.1 | |
> >> | sum_0.8 | 0.129% | +11.96 | 154.9 | -4.19 |
> >>
> >> UIBench on Pixel 9 by default already has a low enough jank percentage.
> >> Moving to sum aggregation gives slightly higher jank percentage and
> >> lower power consumption.
> >>
> >> ---
> >> Changed in v2:
> >> - Completely separate uclamp component from PELT and util_est.
> >> - Separate util_est_uclamp into an individual patch.
> >> - Address the under-utilization problem.
> >> - Update Python notebooks to reflect the latest sched/tip.
> >>
> >> Hongyan Xia (8):
> >> Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
> >> 0"
> >> sched/uclamp: Track a new util_avg_bias signal
> >> sched/uclamp: Add util_est_uclamp
> >> sched/fair: Use util biases for utilization and frequency
> >> sched/uclamp: Remove all uclamp bucket logic
> >
> > I’ve recently been looking into the issue with uclamp and
> > delayed-dequeue, and I found that uclamp_rq_inc should be placed
> > before enqueue_task, which led to a patch.
> > Before sending the patch, I came across your series of patches. I
> > haven’t fully understood your patch yet, but it seems like
> > uclamp_rq_inc is no longer needed.
> > Do you think the patch below is still necessary?
> >
>
> I posted a fix of the issue you mentioned days ago here
>
> https://lore.kernel.org/lkml/84441660bef0a5e67fd09dc3787178d0276dad31.1740664400.git.hongyan.xia2@arm.com/
>
> I think we found the same issue, but our approaches are different. I
> fear that as more complexity goes into each sched_class like delayed
> dequeue, it's better to just let the sched_class handle how uclamp is
> enqueued and dequeued within itself rather than leaking into core.c.
> Would be nice if you could take a look at my fix.
>
> Your patch is definitely necessary. The thing with this uclamp sum
> aggregation series is that there are still debates around it and it
> might take a while before everything is settled. So, at the moment we
> should view this series and the uclamp enqueue fix as separate things.
Okay, I will look at the patch.
Thanks!
>
> > --->
> >
> > Subject: [PATCH] sched/uclamp: Update the rq's uclamp before enqueue task
> >
> > When a task's uclamp is set, we want the CPU frequency to
> > increase as quickly as possible when the task is enqueued.
> > Because the CPU frequency update happens during enqueue_task(),
> > the rq's uclamp needs to be updated before the task is enqueued.
> > For sched-delayed tasks, the rq uclamp should only be updated
> > when they are enqueued upon being awakened.
> >
> > Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
> > ---
> > kernel/sched/core.c | 14 ++++++--------
> > 1 file changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 67189907214d..b07e78910221 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1747,7 +1747,7 @@ static inline void uclamp_rq_dec_id(struct rq
> > *rq, struct task_struct *p,
> > }
> > }
> >
> > -static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> > +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct
> > *p, int flags)
> > {
> > enum uclamp_id clamp_id;
> >
> > @@ -1763,7 +1763,8 @@ static inline void uclamp_rq_inc(struct rq *rq,
> > struct task_struct *p)
> > if (unlikely(!p->sched_class->uclamp_enabled))
> > return;
> >
> > - if (p->se.sched_delayed)
> > + /* Only inc the delayed task which is being woken up. */
> > + if (p->se.sched_delayed && !(flags & ENQUEUE_DELAYED))
> > return;
> >
> > for_each_clamp_id(clamp_id)
> > @@ -2031,7 +2032,7 @@ static void __init init_uclamp(void)
> > }
> >
> > #else /* !CONFIG_UCLAMP_TASK */
> > -static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> > +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct
> > *p, int flags) { }
> > static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> > static inline void uclamp_fork(struct task_struct *p) { }
> > static inline void uclamp_post_fork(struct task_struct *p) { }
> > @@ -2067,12 +2068,9 @@ void enqueue_task(struct rq *rq, struct
> > task_struct *p, int flags)
> > if (!(flags & ENQUEUE_NOCLOCK))
> > update_rq_clock(rq);
> >
> > + uclamp_rq_inc(rq, p, flags);
> > +
> > p->sched_class->enqueue_task(rq, p, flags);
> > - /*
> > - * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
> > - * ->sched_delayed.
> > - */
> > - uclamp_rq_inc(rq, p);
> >
> > psi_enqueue(p, flags);
> >
> > --
> >
> > Thanks!
> >
> > BR
> > ---
> >
> >> sched/uclamp: Simplify uclamp_eff_value()
> >> sched/uclamp: Propagate negative bias
> >> sched/uclamp: Solve under-utilization problem
> >>
> >> include/linux/sched.h | 8 +-
> >> init/Kconfig | 32 ---
> >> kernel/sched/core.c | 308 ++--------------------
> >> kernel/sched/cpufreq_schedutil.c | 6 +-
> >> kernel/sched/debug.c | 2 +-
> >> kernel/sched/fair.c | 430 ++++++++++++++++---------------
> >> kernel/sched/pelt.c | 62 +++++
> >> kernel/sched/rt.c | 4 -
> >> kernel/sched/sched.h | 132 +++-------
> >> kernel/sched/syscalls.c | 2 +
> >> 10 files changed, 341 insertions(+), 645 deletions(-)
> >>
> >> --
> >> 2.34.1
> >>
> >>
>
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v2 0/8] uclamp sum aggregation
2025-03-06 11:38 ` Xuewen Yan
@ 2025-03-10 11:34 ` Dietmar Eggemann
2025-03-10 12:54 ` Hongyan Xia
0 siblings, 1 reply; 15+ messages in thread
From: Dietmar Eggemann @ 2025-03-10 11:34 UTC (permalink / raw)
To: Xuewen Yan, Hongyan Xia
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Morten Rasmussen,
Lukasz Luba, Christian Loehle, Pierre Gondois, linux-kernel,
Xuewen Yan
On 06/03/2025 12:38, Xuewen Yan wrote:
> On Thu, Mar 6, 2025 at 7:32 PM Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>
>> Hi Xuewen,
>>
>> On 06/03/2025 11:12, Xuewen Yan wrote:
>>> Hi Hongyan,
>>>
>>> On Tue, Mar 4, 2025 at 10:26 PM Hongyan Xia <hongyan.xia2@arm.com> wrote:
[...]
>>> Subject: [PATCH] sched/uclamp: Update the rq's uclamp before enqueue task
>>>
>>> When a task's uclamp is set, we want the CPU frequency to
>>> increase as quickly as possible when the task is enqueued.
>>> Because the CPU frequency update happens during enqueue_task(),
>>> the rq's uclamp needs to be updated before the task is enqueued.
>>> For sched-delayed tasks, the rq uclamp should only be updated
>>> when they are enqueued upon being awakened.
>>>
>>> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
>>> ---
>>> kernel/sched/core.c | 14 ++++++--------
>>> 1 file changed, 6 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 67189907214d..b07e78910221 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -1747,7 +1747,7 @@ static inline void uclamp_rq_dec_id(struct rq
>>> *rq, struct task_struct *p,
>>> }
>>> }
>>>
>>> -static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>>> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct
>>> *p, int flags)
>>> {
>>> enum uclamp_id clamp_id;
>>>
>>> @@ -1763,7 +1763,8 @@ static inline void uclamp_rq_inc(struct rq *rq,
>>> struct task_struct *p)
>>> if (unlikely(!p->sched_class->uclamp_enabled))
>>> return;
>>>
>>> - if (p->se.sched_delayed)
>>> + /* Only inc the delayed task which is being woken up. */
>>> + if (p->se.sched_delayed && !(flags & ENQUEUE_DELAYED))
>>> return;
>>>
>>> for_each_clamp_id(clamp_id)
>>> @@ -2031,7 +2032,7 @@ static void __init init_uclamp(void)
>>> }
>>>
>>> #else /* !CONFIG_UCLAMP_TASK */
>>> -static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
>>> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct
>>> *p, int flags) { }
>>> static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
>>> static inline void uclamp_fork(struct task_struct *p) { }
>>> static inline void uclamp_post_fork(struct task_struct *p) { }
>>> @@ -2067,12 +2068,9 @@ void enqueue_task(struct rq *rq, struct
>>> task_struct *p, int flags)
>>> if (!(flags & ENQUEUE_NOCLOCK))
>>> update_rq_clock(rq);
>>>
>>> + uclamp_rq_inc(rq, p, flags);
>>> +
>>> p->sched_class->enqueue_task(rq, p, flags);
>>> - /*
>>> - * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
>>> - * ->sched_delayed.
>>> - */
>>> - uclamp_rq_inc(rq, p);
>>>
>>> psi_enqueue(p, flags);
Like I mentioned already in the original thread:
https://lkml.kernel.org/r/65365ec7-6a16-4e66-8005-e78788cbedfa@arm.com
I would prefer that uclamp stays in core.c. ENQUEUE_DELAYED among all
the other flags is already used there (ttwu_runnable()).
task_struct contains sched_{,rt_,dl_}entity. We just have to be
careful when switching policies.
--
Could you also incorporate the changes in {en,de}queue_task_fair()
((task_on_rq_migrating(p) || (flags & {RESTORE,DEQUEUE}_SAVE))) vs.
(!p->se.sched_delayed || (flags & ENQUEUE_DELAYED)) and
(!p->se.sched_delayed) so the uclamp-util_est relation is easier to spot?
[...]
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v2 0/8] uclamp sum aggregation
2025-03-10 11:34 ` Dietmar Eggemann
@ 2025-03-10 12:54 ` Hongyan Xia
2025-03-10 15:37 ` Dietmar Eggemann
0 siblings, 1 reply; 15+ messages in thread
From: Hongyan Xia @ 2025-03-10 12:54 UTC (permalink / raw)
To: Dietmar Eggemann, Xuewen Yan
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Morten Rasmussen,
Lukasz Luba, Christian Loehle, Pierre Gondois, linux-kernel,
Xuewen Yan
On 10/03/2025 11:34, Dietmar Eggemann wrote:
> On 06/03/2025 12:38, Xuewen Yan wrote:
>> On Thu, Mar 6, 2025 at 7:32 PM Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>>
>>> Hi Xuewen,
>>>
>>> On 06/03/2025 11:12, Xuewen Yan wrote:
>>>> Hi Hongyan,
>>>>
>>>> On Tue, Mar 4, 2025 at 10:26 PM Hongyan Xia <hongyan.xia2@arm.com> wrote:
>
> [...]
>
>>>> Subject: [PATCH] sched/uclamp: Update the rq's uclamp before enqueue task
>>>>
>>>> When a task's uclamp is set, we want the CPU frequency to
>>>> increase as quickly as possible when the task is enqueued.
>>>> Because the CPU frequency update happens during enqueue_task(),
>>>> the rq's uclamp needs to be updated before the task is enqueued.
>>>> For sched-delayed tasks, the rq uclamp should only be updated
>>>> when they are enqueued upon being awakened.
>>>>
>>>> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
>>>> ---
>>>> kernel/sched/core.c | 14 ++++++--------
>>>> 1 file changed, 6 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 67189907214d..b07e78910221 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -1747,7 +1747,7 @@ static inline void uclamp_rq_dec_id(struct rq
>>>> *rq, struct task_struct *p,
>>>> }
>>>> }
>>>>
>>>> -static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>>>> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct
>>>> *p, int flags)
>>>> {
>>>> enum uclamp_id clamp_id;
>>>>
>>>> @@ -1763,7 +1763,8 @@ static inline void uclamp_rq_inc(struct rq *rq,
>>>> struct task_struct *p)
>>>> if (unlikely(!p->sched_class->uclamp_enabled))
>>>> return;
>>>>
>>>> - if (p->se.sched_delayed)
>>>> + /* Only inc the delayed task which is being woken up. */
>>>> + if (p->se.sched_delayed && !(flags & ENQUEUE_DELAYED))
>>>> return;
>>>>
>>>> for_each_clamp_id(clamp_id)
>>>> @@ -2031,7 +2032,7 @@ static void __init init_uclamp(void)
>>>> }
>>>>
>>>> #else /* !CONFIG_UCLAMP_TASK */
>>>> -static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
>>>> +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct
>>>> *p, int flags) { }
>>>> static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
>>>> static inline void uclamp_fork(struct task_struct *p) { }
>>>> static inline void uclamp_post_fork(struct task_struct *p) { }
>>>> @@ -2067,12 +2068,9 @@ void enqueue_task(struct rq *rq, struct
>>>> task_struct *p, int flags)
>>>> if (!(flags & ENQUEUE_NOCLOCK))
>>>> update_rq_clock(rq);
>>>>
>>>> + uclamp_rq_inc(rq, p, flags);
>>>> +
>>>> p->sched_class->enqueue_task(rq, p, flags);
>>>> - /*
>>>> - * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
>>>> - * ->sched_delayed.
>>>> - */
>>>> - uclamp_rq_inc(rq, p);
>>>>
>>>> psi_enqueue(p, flags);
>
> Like I mentioned already in the original thread:
>
> https://lkml.kernel.org/r/65365ec7-6a16-4e66-8005-e78788cbedfa@arm.com
>
> I would prefer that uclamp stays in core.c. ENQUEUE_DELAYED among all
> the other flags is already used there (ttwu_runnable()).
>
> task_struct contains sched_{,rt_,dl_}entity. We just have to be
> careful when switching policies.
>
> --
>
> Could you also incorporate the changes in {en,de}queue_task_fair()
> ((task_on_rq_migrating(p) || (flags & {RESTORE,DEQUEUE}_SAVE))) vs.
> (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED)) and
> (!p->se.sched_delayed) so the uclamp-util_est relation is easier to spot?
>
> [...]
At the moment we can't do this. Sum aggregation was designed before
delayed dequeue and it syncs with p->se.on_rq. If we sync with
something else and take care of the delayed dequeue cases (like
util_est does), then I would have to rewrite part of the series.
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v2 0/8] uclamp sum aggregation
2025-03-10 12:54 ` Hongyan Xia
@ 2025-03-10 15:37 ` Dietmar Eggemann
0 siblings, 0 replies; 15+ messages in thread
From: Dietmar Eggemann @ 2025-03-10 15:37 UTC (permalink / raw)
To: Hongyan Xia, Xuewen Yan
Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Morten Rasmussen,
Lukasz Luba, Christian Loehle, Pierre Gondois, linux-kernel,
Xuewen Yan
On 10/03/2025 13:54, Hongyan Xia wrote:
> On 10/03/2025 11:34, Dietmar Eggemann wrote:
>> On 06/03/2025 12:38, Xuewen Yan wrote:
>>> On Thu, Mar 6, 2025 at 7:32 PM Hongyan Xia <hongyan.xia2@arm.com> wrote:
>>>>
>>>> Hi Xuewen,
>>>>
>>>> On 06/03/2025 11:12, Xuewen Yan wrote:
>>>>> Hi Hongyan,
>>>>>
>>>>> On Tue, Mar 4, 2025 at 10:26 PM Hongyan Xia <hongyan.xia2@arm.com>
[...]
>> Like I mentioned already in the original thread:
>>
>> https://lkml.kernel.org/r/65365ec7-6a16-4e66-8005-e78788cbedfa@arm.com
>>
>> I would prefer that uclamp stays in core.c. ENQUEUE_DELAYED among all
>> the other flags is already used there (ttwu_runnable()).
>>
>> task_struct contains sched_{,rt_,dl_}entity. We just have to be
>> careful when switching policies.
>>
>> --
>>
>> Could you also incorporate the changes in {en,de}queue_task_fair()
>> ((task_on_rq_migrating(p) || (flags & {RESTORE,DEQUEUE}_SAVE))) vs.
>> (!p->se.sched_delayed || (flags & ENQUEUE_DELAYED)) and
>> (!p->se.sched_delayed) so the uclamp-util_est relation is easier to spot?
>>
>> [...]
>
> At the moment we can't do this. Sum aggregation was designed before
> delayed dequeue and it syncs with p->se.on_rq. If we sync with something
> else and take care of delayed dequeue cases (like util_est) then I have
> to rewrite part of the series.
Ah, OK! But this 'uclamp not in sync with util_est' is already an issue
in today's mainline, so I would like to see it fixed as discreetly as
possible, followed by a separate prep-patch for 'uclamp sum
aggregation'. IMHO, it is then clearer why we would need more rework in
this area.
^ permalink raw reply [flat|nested] 15+ messages in thread