* [PATCH 0/7] uclamp sum aggregation
@ 2024-06-24 10:23 Hongyan Xia
From: Hongyan Xia @ 2024-06-24 10:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann
Cc: Qais Yousef, Morten Rasmussen, Lukasz Luba, Christian Loehle,
Pierre Gondois, Youssef Esmat, linux-kernel
The current uclamp implementation, max aggregation, has several
drawbacks. This series gives an alternative implementation that
addresses those problems and brings other advantages, mainly:
1. Simplicity. Sum aggregation implements uclamp in less than half the
code of max aggregation.
2. Effectiveness. Sum aggregation shows better uclamp effectiveness,
both in benchmark scores and, more importantly, in energy efficiency.
3. Works on its own. No changes in cpufreq or other sub-systems are
needed.
4. Low overhead. No bucket operations, and no need to tweak the number
of buckets to balance overhead against uclamp granularity.
The key idea of sum aggregation is fairly simple. Each task has a
util_avg_bias, which is obtained by:
util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg;
If a CPU has N tasks p1, p2, p3 ... pN, then we sum the biases up and
obtain an rq-wide total bias:
rq_bias = util_avg_bias1 + util_avg_bias2 ... + util_avg_biasN;
Then we use the biased rq utilization, rq_util + rq_bias, to select the
OPP and to place tasks.
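As a rough user-space illustration of the arithmetic (a minimal,
self-contained sketch with made-up helper names and numbers; the
in-kernel implementation is in the patches below):

    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024

    static long clamp_long(long v, long lo, long hi)
    {
            return v < lo ? lo : v > hi ? hi : v;
    }

    /* Per-task bias: distance from util_avg to its clamped value. */
    static long util_avg_bias(long util_avg, long uc_min, long uc_max)
    {
            return clamp_long(util_avg, uc_min, uc_max) - util_avg;
    }

    int main(void)
    {
            /* Two tasks on one rq: p1 capped at 300, p2 boosted to 512. */
            long rq_util = 800 + 100;                /* sum of raw util_avg */
            long rq_bias = util_avg_bias(800, 0, 300)        /* -500 */
                         + util_avg_bias(100, 512, SCHED_CAPACITY_SCALE); /* +412 */

            /* The biased value is what drives OPP selection and placement. */
            printf("biased rq utilization = %ld\n", rq_util + rq_bias); /* 812 */
            return 0;
    }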
PATCH BREAKDOWN:
Patch 1/7 reverts a patch that accommodates uclamp_max tasks under max
aggregation. That patch is not needed by, and creates other problems
for, sum aggregation. It has been discussed elsewhere that the patch
will be improved, in which case the revert may no longer be needed.
Patches 2, 3 and 4 implement sum aggregation.
Patches 5 and 6 remove max aggregation.
Patch 7 applies PELT decay on negative util_avg_bias. This improves
energy efficiency and task placement, but is not strictly necessary.
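Roughly speaking (an illustration, not the exact mechanics of patch 7),
the decay applied to a leftover negative bias is the standard PELT
geometric decay:
    util_avg_bias(t + p) = util_avg_bias(t) * y^p,  where y^32 = 0.5
i.e. the bias fades with the usual ~32 ms PELT half-life instead of
disappearing instantly.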
TESTING:
Two notebooks are shared at
https://nbviewer.org/github/honxia02/notebooks/blob/bb97afd74f49d4b8add8b28ad4378ea337c695a8/whitebox/max.ipynb
https://nbviewer.org/github/honxia02/notebooks/blob/bb97afd74f49d4b8add8b28ad4378ea337c695a8/whitebox/sum-offset.ipynb
The experiments in the notebooks were done on an Arm Juno r2 board.
CPU0-3 are little cores with a capacity of 383. CPU4-5 are big cores.
The rt-app profiles used for these experiments are included in the
notebooks.
Scenario 1: Scheduling 4 tasks with UCLAMP_MAX at 110.
The scheduling decisions are plotted in Out[11]. Both max and sum
aggregation understand the UCLAMP_MAX hint and schedule all 4 tasks on
the little cluster. Max aggregation sometimes schedules 2 tasks on 1
CPU, which is why this series reverts the 1st commit. However, the
reverted patch may be improved, in which case the revert may no longer
be needed.
Scenario 2: Scheduling 2 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
slightly above the capacity of the little CPU.
Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
on the big core. Both max and sum aggregation handle this correctly.
Scenario 3: Task A is a task with a small utilization pinned to CPU4.
Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
at 300. After a while, task A is then pinned to CPU5, joining B.
Results are in Out[23]. Max aggregation sees a frequency spike at
239.75s. When zoomed in, one can see square-wave-like utilization values
because of A periodically going to sleep. When A wakes up, its default
UCLAMP_MAX of 1024 will uncap B and reach the highest CPU frequency.
When A sleeps, B's UCLAMP_MAX will be in effect and will reduce rq
utilization. This happens repeatedly, hence the square wave. In
contrast, sum aggregation sees a normal increase in utilization when A
joins B, without any square-wave behavior.
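To put rough numbers on it: B is always running, so util_avg(B) is
close to 1024 and its bias is clamp(1024, 0, 300) - 1024 = -724, while
the small task A carries a bias of roughly 0. The biased rq utilization
under sum aggregation is therefore approximately
    util(A) + 1024 - 724 = util(A) + 300
whether A is awake or asleep, so B is never uncapped and no square wave
appears.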
Scenario 4: 4 always-running tasks with UCLAMP_MAX of 110 pinned to the
little PD (CPU0-3), and 4 identical tasks pinned to the big PD
(CPU4-5). After a while, the CPU pinning of the 4 tasks on the big PD
is removed.
Results are in Out[29]. After unpinning, max aggregation moves all 8
tasks to the little cluster, but schedules 5 tasks on CPU0 and 1 each on
CPU1-3. In contrast, sum aggregation schedules 2 on each little CPU
after unpinning, which is the desired balanced task placement.
As in Scenario 1, the situation may not be as bad once the improvement
to the reverted patch lands in the future.
Scenario 5: Scheduling 8 tasks with UCLAMP_MAX of 110.
Results are in Out[35] and Out[36]. There's no doubt that sum
aggregation yields substantially better scheduling decisions. This tests
roughly the same thing as Scenario 4.
EVALUATION:
We backport the patches to a v6.1 kernel on Pixel 6 and run Android
benchmarks.
Speedometer:
We run Speedometer 2.0 to test ADPF/uclamp effectiveness. Because sum
aggregation does not circumvent the 20% OPP margin, we reduce the
uclamp values to 80% of their original values to keep the comparison
fair.
------------------------------------------------------
|           | score |   %   | CPU power (mW) |   %   |
| max       | 161.4 |       |         2358.9 |       |
| sum_0.8   | 166.0 | +2.85 |         2485.0 | +5.35 |
| sum_tuned | 162.6 | +0.74 |         2332.0 | -1.14 |
------------------------------------------------------
We see a consistently higher score and higher average power
consumption. Note that a higher score also means a shorter run-time, so
the total energy increase for sum_0.8 is only 1.88%.
We then reduce the uclamp values until the Speedometer score is roughly
the same (sum_tuned). With this tuning, sum aggregation actually gives
lower average power and total energy consumption than max aggregation.
UIBench:
------------------------------------------------------------------
|           | jank percentage |    %   | CPU power (mW) |   %   |
| max       |          0.375% |        |         122.75 |       |
| sum_0.8   |          0.440% | +17.33 |         116.35 | -5.21 |
| sum_tuned |          0.220% | -41.33 |         119.35 | -2.77 |
------------------------------------------------------------------
UIBench on Pixel 6 already has a low jank percentage by default.
Moving to sum aggregation gives a higher jank percentage and lower
power consumption. We then tune the hardcoded uclamp values in the
Android image to take advantage of the freed-up power budget, and can
achieve a more than 41% jank reduction while still consuming less
power than max aggregation.
This result does not suggest that sum aggregation greatly outperforms
max aggregation, since the jank percentage is already very low, but
rather that hardcoded uclamp values in the system (such as in init
scripts) need to be re-tuned to perform well under sum aggregation.
When tuned well, sum aggregation generally shows better effectiveness,
or the same effectiveness at lower power consumption.
---
Changed in v1 since RFCv3:
- Rebase and add comments as suggested.
- Move util_est changes into a separate patch.
- Fix build failures on !CONFIG_SMP detected by Intel kernel bot.
- Avoid duplicate calls to util() and util_uclamp() functions by using
the static key.
Changed in RFCv3:
- Addresses the biggest concern from multiple people, that PELT and
uclamp need to be separate. The new design is significantly simpler
than the previous revision and separates util_avg_uclamp into the
original util_avg (which this series doesn't touch at all) and the
util_avg_bias component.
- Keep the tri-state return value of util_fits_cpu().
- Keep both the unclamped and clamped util_est, so that we use the right
one depending on the caller in frequency or energy calculations.
Hongyan Xia (7):
Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
0"
sched/uclamp: Track a new util_avg_bias signal
sched/uclamp: Add util_est_uclamp
sched/fair: Use util biases for utilization and frequency
sched/uclamp: Remove all uclamp bucket logic
sched/uclamp: Simplify uclamp_eff_value()
Propagate negative bias
include/linux/sched.h | 8 +-
init/Kconfig | 32 ---
kernel/sched/core.c | 294 ++--------------------
kernel/sched/cpufreq_schedutil.c | 12 +-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 419 ++++++++++++++++---------------
kernel/sched/pelt.c | 37 +++
kernel/sched/rt.c | 4 -
kernel/sched/sched.h | 131 +++-------
kernel/sched/syscalls.c | 16 +-
10 files changed, 312 insertions(+), 643 deletions(-)
--
2.34.1
* [PATCH 1/7] Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is 0"
From: Hongyan Xia @ 2024-06-24 10:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider
Cc: Qais Yousef, Morten Rasmussen, Lukasz Luba, Christian Loehle,
Pierre Gondois, Youssef Esmat, linux-kernel, Hongyan Xia
From: Hongyan Xia <Hongyan.Xia2@arm.com>
The reverted commit creates further problems because 0 spare capacity
can either be a real indication that the CPU is maxed out, or mean
that the CPU is UCLAMP_MAX throttled, but we end up giving all such
CPUs a chance, which can result in bogus energy calculations. It also
tends to schedule tasks on the same CPU and requires load-balancing
patches to compensate. Sum aggregation solves these problems and does
not need this patch.
This reverts commit 6b00a40147653c8ea748e8f4396510f252763364.
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
kernel/sched/fair.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..2f8ed6561a9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8028,10 +8028,11 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
for (; pd; pd = pd->next) {
unsigned long util_min = p_util_min, util_max = p_util_max;
unsigned long cpu_cap, cpu_actual_cap, util;
- long prev_spare_cap = -1, max_spare_cap = -1;
+ unsigned long cur_delta, max_spare_cap = 0;
unsigned long rq_util_min, rq_util_max;
- unsigned long cur_delta, base_energy;
+ unsigned long prev_spare_cap = 0;
int max_spare_cap_cpu = -1;
+ unsigned long base_energy;
int fits, max_fits = -1;
cpumask_and(cpus, perf_domain_span(pd), cpu_online_mask);
@@ -8093,7 +8094,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
prev_spare_cap = cpu_cap;
prev_fits = fits;
} else if ((fits > max_fits) ||
- ((fits == max_fits) && ((long)cpu_cap > max_spare_cap))) {
+ ((fits == max_fits) && (cpu_cap > max_spare_cap))) {
/*
* Find the CPU with the maximum spare capacity
* among the remaining CPUs in the performance
@@ -8105,7 +8106,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
}
}
- if (max_spare_cap_cpu < 0 && prev_spare_cap < 0)
+ if (max_spare_cap_cpu < 0 && prev_spare_cap == 0)
continue;
eenv_pd_busy_time(&eenv, cpus, p);
@@ -8113,7 +8114,7 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
base_energy = compute_energy(&eenv, pd, cpus, p, -1);
/* Evaluate the energy impact of using prev_cpu. */
- if (prev_spare_cap > -1) {
+ if (prev_spare_cap > 0) {
prev_delta = compute_energy(&eenv, pd, cpus, p,
prev_cpu);
/* CPU utilization has changed */
--
2.34.1
* [PATCH 2/7] sched/uclamp: Track a new util_avg_bias signal
From: Hongyan Xia @ 2024-06-24 10:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider
Cc: Qais Yousef, Morten Rasmussen, Lukasz Luba, Christian Loehle,
Pierre Gondois, Youssef Esmat, linux-kernel
Add a util_avg_bias signal in sched_avg, which is obtained by:
util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg
The task utilization after considering uclamp is:
util_avg_uclamp = util_avg + util_avg_bias
We then sum up all the biases on the same rq and use the total bias to
bias the rq utilization. This is the core idea of uclamp sum
aggregation. The rq utilization becomes:
rq_util_avg_uclamp = rq_util_avg + total_util_avg_bias
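As an illustrative example (numbers made up): a task with
util_avg = 700 and an effective UCLAMP_MAX of 200 carries
util_avg_bias = 200 - 700 = -500. While it is enqueued, the rq's
biased utilization is 500 lower than its raw util_avg sum; the bias is
added at enqueue and removed at dequeue, in
util_bias_{enqueue,dequeue}() below.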
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
include/linux/sched.h | 3 ++-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
kernel/sched/pelt.c | 37 +++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 24 ++++++++++++++++++++++++
5 files changed, 95 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e..63bcb81b20bb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -476,7 +476,8 @@ struct sched_avg {
u32 period_contrib;
unsigned long load_avg;
unsigned long runnable_avg;
- unsigned long util_avg;
+ unsigned int util_avg;
+ int util_avg_bias;
unsigned int util_est;
} ____cacheline_aligned;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c1eb9a1afd13..d416be6e3a83 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -683,7 +683,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "runnable_avg",
cfs_rq->avg.runnable_avg);
- SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
+ SEQ_printf(m, " .%-30s: %u\n", "util_avg",
cfs_rq->avg.util_avg);
SEQ_printf(m, " .%-30s: %u\n", "util_est",
cfs_rq->avg.util_est);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f8ed6561a9f..23360c666829 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1089,6 +1089,7 @@ void post_init_entity_util_avg(struct task_struct *p)
}
sa->runnable_avg = sa->util_avg;
+ sa->util_avg_bias = 0;
}
#else /* !CONFIG_SMP */
@@ -4844,6 +4845,32 @@ static inline unsigned long task_util_est(struct task_struct *p)
return max(task_util(p), _task_util_est(p));
}
+#ifdef CONFIG_UCLAMP_TASK
+static inline long task_util_bias(struct task_struct *p)
+{
+ return READ_ONCE(p->se.avg.util_avg_bias);
+}
+
+static inline unsigned long task_util_uclamp(struct task_struct *p)
+{
+ long ret = task_util(p);
+
+ ret += task_util_bias(p);
+
+ return max(ret, 0L);
+}
+#else
+static inline long task_util_bias(struct task_struct *p)
+{
+ return 0;
+}
+
+static inline unsigned long task_util_uclamp(struct task_struct *p)
+{
+ return task_util(p);
+}
+#endif
+
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
struct task_struct *p)
{
@@ -6807,6 +6834,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* At this point se is NULL and we are at root level*/
add_nr_running(rq, 1);
+ util_bias_enqueue(rq, p);
+ /* XXX: We should skip the update above and only do it once here. */
+ cpufreq_update_util(rq, 0);
/*
* Since new tasks are assigned an initial util_avg equal to
@@ -6898,6 +6928,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* At this point se is NULL and we are at root level*/
sub_nr_running(rq, 1);
+ util_bias_dequeue(rq, p);
/* balance early to pull high priority tasks */
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index fa52906a4478..11aa845d212c 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -266,6 +266,39 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load)
WRITE_ONCE(sa->util_avg, sa->util_sum / divider);
}
+#ifdef CONFIG_UCLAMP_TASK
+/* avg must belong to the queue this se is on. */
+static void util_bias_update(struct task_struct *p)
+{
+ unsigned int util, uclamp_min, uclamp_max;
+ struct rq *rq;
+ int old, new;
+
+ util = READ_ONCE(p->se.avg.util_avg);
+ uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
+ uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
+ /*
+ * uclamp_max at the max value means there is no uclamp_max, and should
+ * not have any clamping effect at all here.
+ */
+ if (uclamp_max == SCHED_CAPACITY_SCALE)
+ uclamp_max = UINT_MAX;
+ old = READ_ONCE(p->se.avg.util_avg_bias);
+ new = (int)clamp(util, uclamp_min, uclamp_max) - (int)util;
+
+ WRITE_ONCE(p->se.avg.util_avg_bias, new);
+ if (!p->se.on_rq)
+ return;
+ rq = task_rq(p);
+ WRITE_ONCE(rq->cfs.avg.util_avg_bias,
+ READ_ONCE(rq->cfs.avg.util_avg_bias) + new - old);
+}
+#else /* !CONFIG_UCLAMP_TASK */
+static void util_bias_update(struct task_struct *p)
+{
+}
+#endif
+
/*
* sched_entity:
*
@@ -296,6 +329,8 @@ int __update_load_avg_blocked_se(u64 now, struct sched_entity *se)
{
if (___update_load_sum(now, &se->avg, 0, 0, 0)) {
___update_load_avg(&se->avg, se_weight(se));
+ if (entity_is_task(se))
+ util_bias_update(task_of(se));
trace_pelt_se_tp(se);
return 1;
}
@@ -310,6 +345,8 @@ int __update_load_avg_se(u64 now, struct cfs_rq *cfs_rq, struct sched_entity *se
___update_load_avg(&se->avg, se_weight(se));
cfs_se_util_change(&se->avg);
+ if (entity_is_task(se))
+ util_bias_update(task_of(se));
trace_pelt_se_tp(se);
return 1;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62fd8bc6fd08..5f93a6b1b563 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3178,6 +3178,22 @@ uclamp_se_set(struct uclamp_se *uc_se, unsigned int value, bool user_defined)
uc_se->user_defined = user_defined;
}
+static inline void util_bias_enqueue(struct rq *rq, struct task_struct *p)
+{
+ int rq_val = READ_ONCE(rq->cfs.avg.util_avg_bias);
+ int p_val = READ_ONCE(p->se.avg.util_avg_bias);
+
+ WRITE_ONCE(rq->cfs.avg.util_avg_bias, rq_val + p_val);
+}
+
+static inline void util_bias_dequeue(struct rq *rq, struct task_struct *p)
+{
+ int rq_val = READ_ONCE(rq->cfs.avg.util_avg_bias);
+ int p_val = READ_ONCE(p->se.avg.util_avg_bias);
+
+ WRITE_ONCE(rq->cfs.avg.util_avg_bias, rq_val - p_val);
+}
+
#else /* !CONFIG_UCLAMP_TASK: */
static inline unsigned long
@@ -3215,6 +3231,14 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return false;
}
+static inline void util_bias_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void util_bias_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
+
#endif /* !CONFIG_UCLAMP_TASK */
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
--
2.34.1
* [PATCH 3/7] sched/uclamp: Add util_est_uclamp
From: Hongyan Xia @ 2024-06-24 10:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider
Cc: Qais Yousef, Morten Rasmussen, Lukasz Luba, Christian Loehle,
Pierre Gondois, Youssef Esmat, linux-kernel
The new util_est_uclamp is essentially clamp(util_est, min, max) and
follows how util_est operates.
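As an illustrative example (numbers made up): a task whose util_est
settles at 600 under UCLAMP_MAX = 300 contributes a util_est_uclamp of
300 to the rq at enqueue, while a task with util_est 100 and
UCLAMP_MIN = 512 contributes 512.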
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63bcb81b20bb..0160567314ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -479,6 +479,7 @@ struct sched_avg {
unsigned int util_avg;
int util_avg_bias;
unsigned int util_est;
+ unsigned int util_est_uclamp;
} ____cacheline_aligned;
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23360c666829..0fa48466e02e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4859,6 +4859,16 @@ static inline unsigned long task_util_uclamp(struct task_struct *p)
return max(ret, 0L);
}
+
+static inline unsigned long _task_util_est_uclamp(struct task_struct *p)
+{
+ return READ_ONCE(p->se.avg.util_est_uclamp);
+}
+
+static inline unsigned long task_util_est_uclamp(struct task_struct *p)
+{
+ return max(task_util_uclamp(p), _task_util_est_uclamp(p));
+}
#else
static inline long task_util_bias(struct task_struct *p)
{
@@ -4869,6 +4879,16 @@ static inline unsigned long task_util_uclamp(struct task_struct *p)
{
return task_util(p);
}
+
+static inline unsigned long _task_util_est_uclamp(struct task_struct *p)
+{
+ return _task_util_est(p);
+}
+
+static inline unsigned long task_util_est_uclamp(struct task_struct *p)
+{
+ return task_util_est(p);
+}
#endif
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
@@ -4883,6 +4903,9 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
enqueued = cfs_rq->avg.util_est;
enqueued += _task_util_est(p);
WRITE_ONCE(cfs_rq->avg.util_est, enqueued);
+ enqueued = cfs_rq->avg.util_est_uclamp;
+ enqueued += _task_util_est_uclamp(p);
+ WRITE_ONCE(cfs_rq->avg.util_est_uclamp, enqueued);
trace_sched_util_est_cfs_tp(cfs_rq);
}
@@ -4899,6 +4922,9 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
enqueued = cfs_rq->avg.util_est;
enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
WRITE_ONCE(cfs_rq->avg.util_est, enqueued);
+ enqueued = cfs_rq->avg.util_est_uclamp;
+ enqueued -= min_t(unsigned int, enqueued, _task_util_est_uclamp(p));
+ WRITE_ONCE(cfs_rq->avg.util_est_uclamp, enqueued);
trace_sched_util_est_cfs_tp(cfs_rq);
}
@@ -4986,6 +5012,10 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
ewma -= last_ewma_diff;
ewma >>= UTIL_EST_WEIGHT_SHIFT;
done:
+ WRITE_ONCE(p->se.avg.util_est_uclamp,
+ clamp(ewma,
+ (unsigned int)uclamp_eff_value(p, UCLAMP_MIN),
+ (unsigned int)uclamp_eff_value(p, UCLAMP_MAX)));
ewma |= UTIL_AVG_UNCHANGED;
WRITE_ONCE(p->se.avg.util_est, ewma);
--
2.34.1
* [PATCH 4/7] sched/fair: Use util biases for utilization and frequency
From: Hongyan Xia @ 2024-06-24 10:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Rafael J. Wysocki, Viresh Kumar, Juri Lelli, Steven Rostedt,
Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
Valentin Schneider
Cc: Qais Yousef, Morten Rasmussen, Lukasz Luba, Christian Loehle,
Pierre Gondois, Youssef Esmat, linux-kernel, linux-pm
Use the new util_avg_bias for task and runqueue utilization. We also
maintain separate util_est and util_est_uclamp signals.
Now that we have the uclamp sum-aggregated CFS util value, we do not
need to consult uclamp buckets to know how the frequency should be
clamped. We simply look at the aggregated top-level
rq->cfs.avg.util_avg + rq->cfs.avg.util_avg_bias and
rq->cfs.avg.util_est_uclamp to know what frequency to choose and how
to place tasks.
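In pseudo-C, the frequency-side read reduces to roughly the following
(a sketch of root_cfs_util_uclamp() and cpu_util_uclamp() below):

    util = max((long)rq->cfs.avg.util_avg +
               rq->cfs.avg.util_avg_bias, 0L);
    if (sched_feat(UTIL_EST))
            util = max(util, (long)rq->cfs.avg.util_est_uclamp);
    util = min(util, (long)arch_scale_cpu_capacity(cpu));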
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
kernel/sched/cpufreq_schedutil.c | 12 +-
kernel/sched/fair.c | 303 +++++++++++--------------------
kernel/sched/sched.h | 20 +-
kernel/sched/syscalls.c | 14 +-
4 files changed, 117 insertions(+), 232 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index eece6244f9d2..65fdcf4d73d1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -197,7 +197,7 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
{
- unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+ unsigned long min, max, util = cpu_util_cfs_boost_uclamp(sg_cpu->cpu);
util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
util = max(util, boost);
@@ -385,11 +385,8 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
/*
* Do not reduce the frequency if the CPU has not been idle
* recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
*/
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
+ if (sugov_cpu_is_busy(sg_cpu) && next_f < sg_policy->next_freq &&
!sg_policy->need_freq_update) {
next_f = sg_policy->next_freq;
@@ -439,11 +436,8 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
/*
* Do not reduce the target performance level if the CPU has not been
* idle recently, as the reduction is likely to be premature then.
- *
- * Except when the rq is capped by uclamp_max.
*/
- if (!uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)) &&
- sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
+ if (sugov_cpu_is_busy(sg_cpu) && sg_cpu->util < prev_util)
sg_cpu->util = prev_util;
cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fa48466e02e..a25de0044af8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4869,6 +4869,15 @@ static inline unsigned long task_util_est_uclamp(struct task_struct *p)
{
return max(task_util_uclamp(p), _task_util_est_uclamp(p));
}
+
+static inline unsigned long root_cfs_util_uclamp(struct rq *rq)
+{
+ long ret = READ_ONCE(rq->cfs.avg.util_avg);
+
+ ret += READ_ONCE(rq->cfs.avg.util_avg_bias);
+
+ return max(ret, 0L);
+}
#else
static inline long task_util_bias(struct task_struct *p)
{
@@ -4889,6 +4898,11 @@ static inline unsigned long task_util_est_uclamp(struct task_struct *p)
{
return task_util_est(p);
}
+
+static inline unsigned long root_cfs_util_uclamp(struct rq *rq)
+{
+ return READ_ONCE(rq->cfs.avg.util_avg);
+}
#endif
static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
@@ -5032,134 +5046,29 @@ static inline unsigned long get_actual_cpu_capacity(int cpu)
}
static inline int util_fits_cpu(unsigned long util,
- unsigned long uclamp_min,
- unsigned long uclamp_max,
+ unsigned long util_uclamp,
int cpu)
{
unsigned long capacity = capacity_of(cpu);
- unsigned long capacity_orig;
- bool fits, uclamp_max_fits;
-
- /*
- * Check if the real util fits without any uclamp boost/cap applied.
- */
- fits = fits_capacity(util, capacity);
-
- if (!uclamp_is_used())
- return fits;
- /*
- * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
- * uclamp_max. We only care about capacity pressure (by using
- * capacity_of()) for comparing against the real util.
- *
- * If a task is boosted to 1024 for example, we don't want a tiny
- * pressure to skew the check whether it fits a CPU or not.
- *
- * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
- * should fit a little cpu even if there's some pressure.
- *
- * Only exception is for HW or cpufreq pressure since it has a direct impact
- * on available OPP of the system.
- *
- * We honour it for uclamp_min only as a drop in performance level
- * could result in not getting the requested minimum performance level.
- *
- * For uclamp_max, we can tolerate a drop in performance level as the
- * goal is to cap the task. So it's okay if it's getting less.
- */
- capacity_orig = arch_scale_cpu_capacity(cpu);
-
- /*
- * We want to force a task to fit a cpu as implied by uclamp_max.
- * But we do have some corner cases to cater for..
- *
- *
- * C=z
- * | ___
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | |
- * | | | | | | | (util somewhere in this region)
- * | | | | | | |
- * | | | | | | |
- * +----------------------------------------
- * CPU0 CPU1 CPU2
- *
- * In the above example if a task is capped to a specific performance
- * point, y, then when:
- *
- * * util = 80% of x then it does not fit on CPU0 and should migrate
- * to CPU1
- * * util = 80% of y then it is forced to fit on CPU1 to honour
- * uclamp_max request.
- *
- * which is what we're enforcing here. A task always fits if
- * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
- * the normal upmigration rules should withhold still.
- *
- * Only exception is when we are on max capacity, then we need to be
- * careful not to block overutilized state. This is so because:
- *
- * 1. There's no concept of capping at max_capacity! We can't go
- * beyond this performance level anyway.
- * 2. The system is being saturated when we're operating near
- * max capacity, it doesn't make sense to block overutilized.
- */
- uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
- uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
- fits = fits || uclamp_max_fits;
+ if (fits_capacity(util_uclamp, capacity))
+ return 1;
- /*
- *
- * C=z
- * | ___ (region a, capped, util >= uclamp_max)
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
- * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
- * | | | | | | |
- * | | | | | | | (region c, boosted, util < uclamp_min)
- * +----------------------------------------
- * CPU0 CPU1 CPU2
- *
- * a) If util > uclamp_max, then we're capped, we don't care about
- * actual fitness value here. We only care if uclamp_max fits
- * capacity without taking margin/pressure into account.
- * See comment above.
- *
- * b) If uclamp_min <= util <= uclamp_max, then the normal
- * fits_capacity() rules apply. Except we need to ensure that we
- * enforce we remain within uclamp_max, see comment above.
- *
- * c) If util < uclamp_min, then we are boosted. Same as (b) but we
- * need to take into account the boosted value fits the CPU without
- * taking margin/pressure into account.
- *
- * Cases (a) and (b) are handled in the 'fits' variable already. We
- * just need to consider an extra check for case (c) after ensuring we
- * handle the case uclamp_min > uclamp_max.
- */
- uclamp_min = min(uclamp_min, uclamp_max);
- if (fits && (util < uclamp_min) &&
- (uclamp_min > get_actual_cpu_capacity(cpu)))
+ if (fits_capacity(util, capacity))
return -1;
- return fits;
+ return 0;
}
static inline int task_fits_cpu(struct task_struct *p, int cpu)
{
- unsigned long uclamp_min = uclamp_eff_value(p, UCLAMP_MIN);
- unsigned long uclamp_max = uclamp_eff_value(p, UCLAMP_MAX);
unsigned long util = task_util_est(p);
+ unsigned long util_uclamp = task_util_est_uclamp(p);
/*
* Return true only if the cpu fully fits the task requirements, which
* include the utilization but also the performance hints.
*/
- return (util_fits_cpu(util, uclamp_min, uclamp_max, cpu) > 0);
+ return (util_fits_cpu(util, util_uclamp, cpu) > 0);
}
static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
@@ -6737,18 +6646,19 @@ static inline void hrtick_update(struct rq *rq)
#endif
#ifdef CONFIG_SMP
+static unsigned long cpu_util_cfs_uclamp(int cpu);
+
static inline bool cpu_overutilized(int cpu)
{
- unsigned long rq_util_min, rq_util_max;
+ unsigned long util, util_uclamp;
if (!sched_energy_enabled())
return false;
- rq_util_min = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MIN);
- rq_util_max = uclamp_rq_get(cpu_rq(cpu), UCLAMP_MAX);
+ util = cpu_util_cfs(cpu);
+ util_uclamp = uclamp_is_used() ? cpu_util_cfs_uclamp(cpu) : util;
- /* Return true only if the utilization doesn't fit CPU's capacity */
- return !util_fits_cpu(cpu_util_cfs(cpu), rq_util_min, rq_util_max, cpu);
+ return !util_fits_cpu(util, util_uclamp, cpu);
}
/*
@@ -7534,9 +7444,13 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
* maximize capacity.
*/
static int
-select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
+select_idle_capacity(struct task_struct *p,
+ unsigned long task_util,
+ unsigned long task_util_uclamp,
+ struct sched_domain *sd,
+ int target)
{
- unsigned long task_util, util_min, util_max, best_cap = 0;
+ unsigned long best_cap = 0;
int fits, best_fits = 0;
int cpu, best_cpu = -1;
struct cpumask *cpus;
@@ -7544,17 +7458,13 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
- task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);
-
for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
continue;
- fits = util_fits_cpu(task_util, util_min, util_max, cpu);
+ fits = util_fits_cpu(task_util, task_util_uclamp, cpu);
/* This CPU fits with all requirements */
if (fits > 0)
@@ -7582,8 +7492,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
}
static inline bool asym_fits_cpu(unsigned long util,
- unsigned long util_min,
- unsigned long util_max,
+ unsigned long util_uclamp,
int cpu)
{
if (sched_asym_cpucap_active())
@@ -7591,7 +7500,7 @@ static inline bool asym_fits_cpu(unsigned long util,
* Return true only if the cpu fully fits the task requirements
* which include the utilization and the performance hints.
*/
- return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
+ return (util_fits_cpu(util, util_uclamp, cpu) > 0);
return true;
}
@@ -7603,18 +7512,18 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
bool has_idle_core = false;
struct sched_domain *sd;
- unsigned long task_util, util_min, util_max;
+ unsigned long task_util, task_util_uclamp;
+ bool asym = sched_asym_cpucap_active();
int i, recent_used_cpu, prev_aff = -1;
/*
* On asymmetric system, update task utilization because we will check
* that the task fits with CPU's capacity.
*/
- if (sched_asym_cpucap_active()) {
+ if (asym) {
sync_entity_load_avg(&p->se);
task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);
+ task_util_uclamp = task_util_est_uclamp(p);
}
/*
@@ -7623,7 +7532,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
lockdep_assert_irqs_disabled();
if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
- asym_fits_cpu(task_util, util_min, util_max, target))
+ asym_fits_cpu(task_util, task_util_uclamp, target))
return target;
/*
@@ -7631,7 +7540,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (prev != target && cpus_share_cache(prev, target) &&
(available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
- asym_fits_cpu(task_util, util_min, util_max, prev)) {
+ asym_fits_cpu(task_util, task_util_uclamp, prev)) {
if (!static_branch_unlikely(&sched_cluster_active) ||
cpus_share_resources(prev, target))
@@ -7652,7 +7561,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
in_task() &&
prev == smp_processor_id() &&
this_rq()->nr_running <= 1 &&
- asym_fits_cpu(task_util, util_min, util_max, prev)) {
+ asym_fits_cpu(task_util, task_util_uclamp, prev)) {
return prev;
}
@@ -7664,7 +7573,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
cpus_share_cache(recent_used_cpu, target) &&
(available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
- asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
+ asym_fits_cpu(task_util, task_util_uclamp, recent_used_cpu)) {
if (!static_branch_unlikely(&sched_cluster_active) ||
cpus_share_resources(recent_used_cpu, target))
@@ -7678,7 +7587,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* For asymmetric CPU capacity systems, our domain of interest is
* sd_asym_cpucapacity rather than sd_llc.
*/
- if (sched_asym_cpucap_active()) {
+ if (asym) {
sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
/*
* On an asymmetric CPU capacity system where an exclusive
@@ -7689,7 +7598,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* capacity path.
*/
if (sd) {
- i = select_idle_capacity(p, sd, target);
+ i = select_idle_capacity(p, task_util, task_util_uclamp,
+ sd, target);
return ((unsigned)i < nr_cpumask_bits) ? i : target;
}
}
@@ -7832,16 +7742,67 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
return min(util, arch_scale_cpu_capacity(cpu));
}
+/* This is basically a copy-paste from cpu_util(), but instead using uclamp values. */
+static unsigned long
+cpu_util_uclamp(int cpu, struct task_struct *p, int dst_cpu, int boost)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct cfs_rq *cfs_rq = &rq->cfs;
+ unsigned long util = root_cfs_util_uclamp(rq);
+
+ if (boost) {
+ unsigned long runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
+ unsigned long util_raw = READ_ONCE(cfs_rq->avg.util_avg);
+
+ util = max(util, util_raw ? util * runnable / util_raw : 0);
+ }
+
+ if (p) {
+ if (task_cpu(p) == cpu && !p->se.on_rq) {
+ util += task_util_bias(p);
+ if ((long)util < 0)
+ util = 0;
+ }
+ if (task_cpu(p) == cpu && dst_cpu != cpu)
+ lsub_positive(&util, task_util_uclamp(p));
+ else if (task_cpu(p) != cpu && dst_cpu == cpu)
+ util += task_util_uclamp(p);
+ }
+
+ if (sched_feat(UTIL_EST)) {
+ unsigned long util_est = READ_ONCE(cfs_rq->avg.util_est_uclamp);
+
+ if (dst_cpu == cpu)
+ util_est += _task_util_est_uclamp(p);
+ else if (p && unlikely(task_on_rq_queued(p) || current == p))
+ lsub_positive(&util_est, _task_util_est_uclamp(p));
+
+ util = max(util, util_est);
+ }
+
+ return min(util, arch_scale_cpu_capacity(cpu));
+}
+
unsigned long cpu_util_cfs(int cpu)
{
return cpu_util(cpu, NULL, -1, 0);
}
-unsigned long cpu_util_cfs_boost(int cpu)
+static unsigned long cpu_util_cfs_uclamp(int cpu)
+{
+ return cpu_util_uclamp(cpu, NULL, -1, 0);
+}
+
+static unsigned long cpu_util_cfs_boost(int cpu)
{
return cpu_util(cpu, NULL, -1, 1);
}
+unsigned long cpu_util_cfs_boost_uclamp(int cpu)
+{
+ return cpu_util_uclamp(cpu, NULL, -1, 1);
+}
+
/*
* cpu_util_without: compute cpu utilization without any contributions from *p
* @cpu: the CPU which utilization is requested
@@ -7952,33 +7913,15 @@ eenv_pd_max_util(struct energy_env *eenv, struct cpumask *pd_cpus,
int cpu;
for_each_cpu(cpu, pd_cpus) {
- struct task_struct *tsk = (cpu == dst_cpu) ? p : NULL;
- unsigned long util = cpu_util(cpu, p, dst_cpu, 1);
+ unsigned long util = cpu_util_uclamp(cpu, p, dst_cpu, 1);
unsigned long eff_util, min, max;
/*
- * Performance domain frequency: utilization clamping
- * must be considered since it affects the selection
- * of the performance domain frequency.
- * NOTE: in case RT tasks are running, by default the min
- * utilization can be max OPP.
+ * NOTE: in case RT tasks are running, by default the
+ * FREQUENCY_UTIL's utilization can be max OPP.
*/
eff_util = effective_cpu_util(cpu, util, &min, &max);
- /* Task's uclamp can modify min and max value */
- if (tsk && uclamp_is_used()) {
- min = max(min, uclamp_eff_value(p, UCLAMP_MIN));
-
- /*
- * If there is no active max uclamp constraint,
- * directly use task's one, otherwise keep max.
- */
- if (uclamp_rq_is_idle(cpu_rq(cpu)))
- max = uclamp_eff_value(p, UCLAMP_MAX);
- else
- max = max(max, uclamp_eff_value(p, UCLAMP_MAX));
- }
-
eff_util = sugov_effective_cpu_perf(cpu, eff_util, min, max);
max_util = max(max_util, eff_util);
}
@@ -8052,8 +7995,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
- unsigned long p_util_min = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MIN) : 0;
- unsigned long p_util_max = uclamp_is_used() ? uclamp_eff_value(p, UCLAMP_MAX) : 1024;
struct root_domain *rd = this_rq()->rd;
int cpu, best_energy_cpu, target = -1;
int prev_fits = -1, best_fits = -1;
@@ -8081,16 +8022,14 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
target = prev_cpu;
sync_entity_load_avg(&p->se);
- if (!task_util_est(p) && p_util_min == 0)
+ if (!task_util_est_uclamp(p))
goto unlock;
eenv_task_busy_time(&eenv, p, prev_cpu);
for (; pd; pd = pd->next) {
- unsigned long util_min = p_util_min, util_max = p_util_max;
- unsigned long cpu_cap, cpu_actual_cap, util;
+ unsigned long cpu_cap, cpu_actual_cap, util, util_uclamp;
unsigned long cur_delta, max_spare_cap = 0;
- unsigned long rq_util_min, rq_util_max;
unsigned long prev_spare_cap = 0;
int max_spare_cap_cpu = -1;
unsigned long base_energy;
@@ -8109,8 +8048,6 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
eenv.pd_cap = 0;
for_each_cpu(cpu, cpus) {
- struct rq *rq = cpu_rq(cpu);
-
eenv.pd_cap += cpu_actual_cap;
if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
@@ -8120,36 +8057,18 @@ static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
continue;
util = cpu_util(cpu, p, cpu, 0);
+ util_uclamp = uclamp_is_used() ?
+ cpu_util_uclamp(cpu, p, cpu, 0) : util;
cpu_cap = capacity_of(cpu);
- /*
- * Skip CPUs that cannot satisfy the capacity request.
- * IOW, placing the task there would make the CPU
- * overutilized. Take uclamp into account to see how
- * much capacity we can get out of the CPU; this is
- * aligned with sched_cpu_util().
- */
- if (uclamp_is_used() && !uclamp_rq_is_idle(rq)) {
- /*
- * Open code uclamp_rq_util_with() except for
- * the clamp() part. I.e.: apply max aggregation
- * only. util_fits_cpu() logic requires to
- * operate on non clamped util but must use the
- * max-aggregated uclamp_{min, max}.
- */
- rq_util_min = uclamp_rq_get(rq, UCLAMP_MIN);
- rq_util_max = uclamp_rq_get(rq, UCLAMP_MAX);
-
- util_min = max(rq_util_min, p_util_min);
- util_max = max(rq_util_max, p_util_max);
- }
-
- fits = util_fits_cpu(util, util_min, util_max, cpu);
+ fits = util_fits_cpu(util, util_uclamp, cpu);
+ if (fits == 1)
+ lsub_positive(&cpu_cap, util_uclamp);
+ else if (fits == -1)
+ lsub_positive(&cpu_cap, util);
if (!fits)
continue;
- lsub_positive(&cpu_cap, util);
-
if (cpu == prev_cpu) {
/* Always use prev_cpu as a candidate. */
prev_spare_cap = cpu_cap;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5f93a6b1b563..c02ab8a54d66 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3089,9 +3089,8 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
return READ_ONCE(rq->avg_dl.util_avg);
}
-
extern unsigned long cpu_util_cfs(int cpu);
-extern unsigned long cpu_util_cfs_boost(int cpu);
+extern unsigned long cpu_util_cfs_boost_uclamp(int cpu);
static inline unsigned long cpu_util_rt(struct rq *rq)
{
@@ -3121,21 +3120,6 @@ static inline bool uclamp_rq_is_idle(struct rq *rq)
return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
}
-/* Is the rq being capped/throttled by uclamp_max? */
-static inline bool uclamp_rq_is_capped(struct rq *rq)
-{
- unsigned long rq_util;
- unsigned long max_util;
-
- if (!static_branch_likely(&sched_uclamp_used))
- return false;
-
- rq_util = cpu_util_cfs(cpu_of(rq)) + cpu_util_rt(rq);
- max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
-
- return max_util != SCHED_CAPACITY_SCALE && rq_util >= max_util;
-}
-
/*
* When uclamp is compiled in, the aggregation at rq level is 'turned off'
* by default in the fast path and only gets turned on once userspace performs
@@ -3205,8 +3189,6 @@ uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
return SCHED_CAPACITY_SCALE;
}
-static inline bool uclamp_rq_is_capped(struct rq *rq) { return false; }
-
static inline bool uclamp_is_used(void)
{
return false;
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index ae1b42775ef9..d6696d06829d 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -303,13 +303,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
}
if (min) {
- /*
- * The minimum utilization returns the highest level between:
- * - the computed DL bandwidth needed with the IRQ pressure which
- * steals time to the deadline task.
- * - The minimum performance requirement for CFS and/or RT.
- */
- *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
+ *min = irq + cpu_bw_dl(rq);
/*
* When an RT task is runnable and uclamp is not used, we must
@@ -328,12 +322,8 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
util = util_cfs + cpu_util_rt(rq);
util += cpu_util_dl(rq);
- /*
- * The maximum hint is a soft bandwidth requirement, which can be lower
- * than the actual utilization because of uclamp_max requirements.
- */
if (max)
- *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
+ *max = scale;
if (util >= scale)
return scale;
--
2.34.1
* [PATCH 5/7] sched/uclamp: Remove all uclamp bucket logic
From: Hongyan Xia @ 2024-06-24 10:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider
Cc: Qais Yousef, Morten Rasmussen, Lukasz Luba, Christian Loehle,
Pierre Gondois, Youssef Esmat, linux-kernel
Also rewrite uclamp_update_active() so that the effective uclamp
values are updated every time we change task group properties, change
system defaults or userspace issues a request.
This also significantly reduces uclamp overhead because we no longer
need to compute effective uclamp values and manipulate buckets every
time a task is enqueued or dequeued (in uclamp_rq_{inc/dec}()).
TODO: Rewrite documentation to match the new logic.
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
include/linux/sched.h | 4 -
init/Kconfig | 32 -----
kernel/sched/core.c | 287 +++-------------------------------------
kernel/sched/fair.c | 4 -
kernel/sched/rt.c | 4 -
kernel/sched/sched.h | 95 +------------
kernel/sched/syscalls.c | 2 +
7 files changed, 20 insertions(+), 408 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0160567314ae..6ba0454d1174 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -685,9 +685,6 @@ struct sched_dl_entity {
};
#ifdef CONFIG_UCLAMP_TASK
-/* Number of utilization clamp buckets (shorter alias) */
-#define UCLAMP_BUCKETS CONFIG_UCLAMP_BUCKETS_COUNT
-
/*
* Utilization clamp for a scheduling entity
* @value: clamp value "assigned" to a se
@@ -713,7 +710,6 @@ struct sched_dl_entity {
*/
struct uclamp_se {
unsigned int value : bits_per(SCHED_CAPACITY_SCALE);
- unsigned int bucket_id : bits_per(UCLAMP_BUCKETS);
unsigned int active : 1;
unsigned int user_defined : 1;
};
diff --git a/init/Kconfig b/init/Kconfig
index 72404c1f2157..b2dc6a3baa1f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -817,38 +817,6 @@ config UCLAMP_TASK
enforce or grant any specific bandwidth for tasks.
If in doubt, say N.
-
-config UCLAMP_BUCKETS_COUNT
- int "Number of supported utilization clamp buckets"
- range 5 20
- default 5
- depends on UCLAMP_TASK
- help
- Defines the number of clamp buckets to use. The range of each bucket
- will be SCHED_CAPACITY_SCALE/UCLAMP_BUCKETS_COUNT. The higher the
- number of clamp buckets the finer their granularity and the higher
- the precision of clamping aggregation and tracking at run-time.
-
- For example, with the minimum configuration value we will have 5
- clamp buckets tracking 20% utilization each. A 25% boosted tasks will
- be refcounted in the [20..39]% bucket and will set the bucket clamp
- effective value to 25%.
- If a second 30% boosted task should be co-scheduled on the same CPU,
- that task will be refcounted in the same bucket of the first task and
- it will boost the bucket clamp effective value to 30%.
- The clamp effective value of a bucket is reset to its nominal value
- (20% in the example above) when there are no more tasks refcounted in
- that bucket.
-
- An additional boost/capping margin can be added to some tasks. In the
- example above the 25% task will be boosted to 30% until it exits the
- CPU. If that should be considered not acceptable on certain systems,
- it's always possible to reduce the margin by increasing the number of
- clamp buckets to trade off used memory for run-time tracking
- precision.
-
- If in doubt, use the default value.
-
endmenu
#
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..767894fc1562 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1410,54 +1410,6 @@ static struct uclamp_se uclamp_default[UCLAMP_CNT];
*/
DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);
-static inline unsigned int
-uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- /*
- * Avoid blocked utilization pushing up the frequency when we go
- * idle (which drops the max-clamp) by retaining the last known
- * max-clamp.
- */
- if (clamp_id == UCLAMP_MAX) {
- rq->uclamp_flags |= UCLAMP_FLAG_IDLE;
- return clamp_value;
- }
-
- return uclamp_none(UCLAMP_MIN);
-}
-
-static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- /* Reset max-clamp retention only on idle exit */
- if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE))
- return;
-
- uclamp_rq_set(rq, clamp_id, clamp_value);
-}
-
-static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int clamp_value)
-{
- struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
- int bucket_id = UCLAMP_BUCKETS - 1;
-
- /*
- * Since both min and max clamps are max aggregated, find the
- * top most bucket with tasks in.
- */
- for ( ; bucket_id >= 0; bucket_id--) {
- if (!bucket[bucket_id].tasks)
- continue;
- return bucket[bucket_id].value;
- }
-
- /* No tasks -- default clamp values */
- return uclamp_idle_value(rq, clamp_id, clamp_value);
-}
-
static void __uclamp_update_util_min_rt_default(struct task_struct *p)
{
unsigned int default_util_min;
@@ -1513,8 +1465,7 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
}
/*
- * The effective clamp bucket index of a task depends on, by increasing
- * priority:
+ * The effective uclamp value of a task depends on, by increasing priority:
* - the task specific clamp value, when explicitly requested from userspace
* - the task group effective clamp value, for tasks not either in the root
* group or in an autogroup
@@ -1535,196 +1486,23 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
{
- struct uclamp_se uc_eff;
-
- /* Task currently refcounted: use back-annotated (effective) value */
- if (p->uclamp[clamp_id].active)
- return (unsigned long)p->uclamp[clamp_id].value;
-
- uc_eff = uclamp_eff_get(p, clamp_id);
-
- return (unsigned long)uc_eff.value;
-}
-
-/*
- * When a task is enqueued on a rq, the clamp bucket currently defined by the
- * task's uclamp::bucket_id is refcounted on that rq. This also immediately
- * updates the rq's clamp value if required.
- *
- * Tasks can have a task-specific value requested from user-space, track
- * within each bucket the maximum value for tasks refcounted in it.
- * This "local max aggregation" allows to track the exact "requested" value
- * for each bucket when all its RUNNABLE tasks require the same clamp.
- */
-static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
- struct uclamp_se *uc_se = &p->uclamp[clamp_id];
- struct uclamp_bucket *bucket;
-
- lockdep_assert_rq_held(rq);
-
- /* Update task effective clamp */
- p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
-
- bucket = &uc_rq->bucket[uc_se->bucket_id];
- bucket->tasks++;
- uc_se->active = true;
-
- uclamp_idle_reset(rq, clamp_id, uc_se->value);
-
- /*
- * Local max aggregation: rq buckets always track the max
- * "requested" clamp value of its RUNNABLE tasks.
- */
- if (bucket->tasks == 1 || uc_se->value > bucket->value)
- bucket->value = uc_se->value;
-
- if (uc_se->value > uclamp_rq_get(rq, clamp_id))
- uclamp_rq_set(rq, clamp_id, uc_se->value);
-}
-
-/*
- * When a task is dequeued from a rq, the clamp bucket refcounted by the task
- * is released. If this is the last task reference counting the rq's max
- * active clamp value, then the rq's clamp value is updated.
- *
- * Both refcounted tasks and rq's cached clamp values are expected to be
- * always valid. If it's detected they are not, as defensive programming,
- * enforce the expected state and warn.
- */
-static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
- struct uclamp_se *uc_se = &p->uclamp[clamp_id];
- struct uclamp_bucket *bucket;
- unsigned int bkt_clamp;
- unsigned int rq_clamp;
-
- lockdep_assert_rq_held(rq);
-
- /*
- * If sched_uclamp_used was enabled after task @p was enqueued,
- * we could end up with unbalanced call to uclamp_rq_dec_id().
- *
- * In this case the uc_se->active flag should be false since no uclamp
- * accounting was performed at enqueue time and we can just return
- * here.
- *
- * Need to be careful of the following enqueue/dequeue ordering
- * problem too
- *
- * enqueue(taskA)
- * // sched_uclamp_used gets enabled
- * enqueue(taskB)
- * dequeue(taskA)
- * // Must not decrement bucket->tasks here
- * dequeue(taskB)
- *
- * where we could end up with stale data in uc_se and
- * bucket[uc_se->bucket_id].
- *
- * The following check here eliminates the possibility of such race.
- */
- if (unlikely(!uc_se->active))
- return;
-
- bucket = &uc_rq->bucket[uc_se->bucket_id];
-
- SCHED_WARN_ON(!bucket->tasks);
- if (likely(bucket->tasks))
- bucket->tasks--;
-
- uc_se->active = false;
-
- /*
- * Keep "local max aggregation" simple and accept to (possibly)
- * overboost some RUNNABLE tasks in the same bucket.
- * The rq clamp bucket value is reset to its base value whenever
- * there are no more RUNNABLE tasks refcounting it.
- */
- if (likely(bucket->tasks))
- return;
-
- rq_clamp = uclamp_rq_get(rq, clamp_id);
- /*
- * Defensive programming: this should never happen. If it happens,
- * e.g. due to future modification, warn and fix up the expected value.
- */
- SCHED_WARN_ON(bucket->value > rq_clamp);
- if (bucket->value >= rq_clamp) {
- bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
- uclamp_rq_set(rq, clamp_id, bkt_clamp);
- }
-}
-
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
-{
- enum uclamp_id clamp_id;
-
- /*
- * Avoid any overhead until uclamp is actually used by the userspace.
- *
- * The condition is constructed such that a NOP is generated when
- * sched_uclamp_used is disabled.
- */
- if (!static_branch_unlikely(&sched_uclamp_used))
- return;
+ if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
+ return uclamp_none(clamp_id);
- if (unlikely(!p->sched_class->uclamp_enabled))
- return;
-
- for_each_clamp_id(clamp_id)
- uclamp_rq_inc_id(rq, p, clamp_id);
-
- /* Reset clamp idle holding when there is one RUNNABLE task */
- if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
- rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
+ return p->uclamp[clamp_id].value;
}
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
+void uclamp_update_active_nolock(struct task_struct *p)
{
enum uclamp_id clamp_id;
- /*
- * Avoid any overhead until uclamp is actually used by the userspace.
- *
- * The condition is constructed such that a NOP is generated when
- * sched_uclamp_used is disabled.
- */
- if (!static_branch_unlikely(&sched_uclamp_used))
- return;
-
- if (unlikely(!p->sched_class->uclamp_enabled))
- return;
-
for_each_clamp_id(clamp_id)
- uclamp_rq_dec_id(rq, p, clamp_id);
-}
-
-static inline void uclamp_rq_reinc_id(struct rq *rq, struct task_struct *p,
- enum uclamp_id clamp_id)
-{
- if (!p->uclamp[clamp_id].active)
- return;
-
- uclamp_rq_dec_id(rq, p, clamp_id);
- uclamp_rq_inc_id(rq, p, clamp_id);
-
- /*
- * Make sure to clear the idle flag if we've transiently reached 0
- * active tasks on rq.
- */
- if (clamp_id == UCLAMP_MAX && (rq->uclamp_flags & UCLAMP_FLAG_IDLE))
- rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE;
+ p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
}
static inline void
uclamp_update_active(struct task_struct *p)
{
- enum uclamp_id clamp_id;
struct rq_flags rf;
struct rq *rq;
@@ -1738,14 +1516,7 @@ uclamp_update_active(struct task_struct *p)
*/
rq = task_rq_lock(p, &rf);
- /*
- * Setting the clamp bucket is serialized by task_rq_lock().
- * If the task is not yet RUNNABLE and its task_struct is not
- * affecting a valid clamp bucket, the next time it's enqueued,
- * it will already see the updated clamp bucket value.
- */
- for_each_clamp_id(clamp_id)
- uclamp_rq_reinc_id(rq, p, clamp_id);
+ uclamp_update_active_nolock(p);
task_rq_unlock(rq, p, &rf);
}
@@ -1877,20 +1648,14 @@ static void uclamp_fork(struct task_struct *p)
{
enum uclamp_id clamp_id;
- /*
- * We don't need to hold task_rq_lock() when updating p->uclamp_* here
- * as the task is still at its early fork stages.
- */
- for_each_clamp_id(clamp_id)
- p->uclamp[clamp_id].active = false;
-
- if (likely(!p->sched_reset_on_fork))
- return;
-
- for_each_clamp_id(clamp_id) {
- uclamp_se_set(&p->uclamp_req[clamp_id],
- uclamp_none(clamp_id), false);
+ if (unlikely(p->sched_reset_on_fork)) {
+ for_each_clamp_id(clamp_id) {
+ uclamp_se_set(&p->uclamp_req[clamp_id],
+ uclamp_none(clamp_id), false);
+ }
}
+
+ uclamp_update_active(p);
}
static void uclamp_post_fork(struct task_struct *p)
@@ -1898,28 +1663,10 @@ static void uclamp_post_fork(struct task_struct *p)
uclamp_update_util_min_rt_default(p);
}
-static void __init init_uclamp_rq(struct rq *rq)
-{
- enum uclamp_id clamp_id;
- struct uclamp_rq *uc_rq = rq->uclamp;
-
- for_each_clamp_id(clamp_id) {
- uc_rq[clamp_id] = (struct uclamp_rq) {
- .value = uclamp_none(clamp_id)
- };
- }
-
- rq->uclamp_flags = UCLAMP_FLAG_IDLE;
-}
-
static void __init init_uclamp(void)
{
struct uclamp_se uc_max = {};
enum uclamp_id clamp_id;
- int cpu;
-
- for_each_possible_cpu(cpu)
- init_uclamp_rq(cpu_rq(cpu));
for_each_clamp_id(clamp_id) {
uclamp_se_set(&init_task.uclamp_req[clamp_id],
@@ -1938,8 +1685,6 @@ static void __init init_uclamp(void)
}
#else /* !CONFIG_UCLAMP_TASK */
-static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
-static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
static inline void uclamp_fork(struct task_struct *p) { }
static inline void uclamp_post_fork(struct task_struct *p) { }
static inline void init_uclamp(void) { }
@@ -1979,7 +1724,6 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
}
- uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
if (sched_core_enabled(rq))
@@ -1999,7 +1743,6 @@ void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}
- uclamp_rq_dec(rq, p);
p->sched_class->dequeue_task(rq, p, flags);
}
@@ -8749,6 +8492,7 @@ void sched_move_task(struct task_struct *tsk)
put_prev_task(rq, tsk);
sched_change_group(tsk, group);
+ uclamp_update_active_nolock(tsk);
if (queued)
enqueue_task(rq, tsk, queue_flags);
@@ -8881,7 +8625,6 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
if (eff[clamp_id] == uc_se[clamp_id].value)
continue;
uc_se[clamp_id].value = eff[clamp_id];
- uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
if (!clamps) {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a25de0044af8..3bb077df52ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13217,10 +13217,6 @@ DEFINE_SCHED_CLASS(fair) = {
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_fair,
#endif
-
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
};
#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 63e49c8ffc4d..df8bfab0232a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2680,10 +2680,6 @@ DEFINE_SCHED_CLASS(rt) = {
#ifdef CONFIG_SCHED_CORE
.task_is_throttled = task_is_throttled_rt,
#endif
-
-#ifdef CONFIG_UCLAMP_TASK
- .uclamp_enabled = 1,
-#endif
};
#ifdef CONFIG_RT_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c02ab8a54d66..e01e42d2703d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -953,46 +953,6 @@ extern void rto_push_irq_work_func(struct irq_work *work);
#endif /* CONFIG_SMP */
#ifdef CONFIG_UCLAMP_TASK
-/*
- * struct uclamp_bucket - Utilization clamp bucket
- * @value: utilization clamp value for tasks on this clamp bucket
- * @tasks: number of RUNNABLE tasks on this clamp bucket
- *
- * Keep track of how many tasks are RUNNABLE for a given utilization
- * clamp value.
- */
-struct uclamp_bucket {
- unsigned long value : bits_per(SCHED_CAPACITY_SCALE);
- unsigned long tasks : BITS_PER_LONG - bits_per(SCHED_CAPACITY_SCALE);
-};
-
-/*
- * struct uclamp_rq - rq's utilization clamp
- * @value: currently active clamp values for a rq
- * @bucket: utilization clamp buckets affecting a rq
- *
- * Keep track of RUNNABLE tasks on a rq to aggregate their clamp values.
- * A clamp value is affecting a rq when there is at least one task RUNNABLE
- * (or actually running) with that value.
- *
- * There are up to UCLAMP_CNT possible different clamp values, currently there
- * are only two: minimum utilization and maximum utilization.
- *
- * All utilization clamping values are MAX aggregated, since:
- * - for util_min: we want to run the CPU at least at the max of the minimum
- * utilization required by its currently RUNNABLE tasks.
- * - for util_max: we want to allow the CPU to run up to the max of the
- * maximum utilization allowed by its currently RUNNABLE tasks.
- *
- * Since on each system we expect only a limited number of different
- * utilization clamp values (UCLAMP_BUCKETS), use a simple array to track
- * the metrics required to compute all the per-rq utilization clamp values.
- */
-struct uclamp_rq {
- unsigned int value;
- struct uclamp_bucket bucket[UCLAMP_BUCKETS];
-};
-
DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
#endif /* CONFIG_UCLAMP_TASK */
@@ -1034,10 +994,6 @@ struct rq {
u64 nr_switches;
#ifdef CONFIG_UCLAMP_TASK
- /* Utilization clamp values based on CPU's RUNNABLE tasks */
- struct uclamp_rq uclamp[UCLAMP_CNT] ____cacheline_aligned;
- unsigned int uclamp_flags;
-#define UCLAMP_FLAG_IDLE 0x01
#endif
struct cfs_rq cfs;
@@ -2278,11 +2234,6 @@ struct affinity_context {
extern s64 update_curr_common(struct rq *rq);
struct sched_class {
-
-#ifdef CONFIG_UCLAMP_TASK
- int uclamp_enabled;
-#endif
-
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
@@ -3102,23 +3053,7 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
#ifdef CONFIG_UCLAMP_TASK
unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
-
-static inline unsigned long uclamp_rq_get(struct rq *rq,
- enum uclamp_id clamp_id)
-{
- return READ_ONCE(rq->uclamp[clamp_id].value);
-}
-
-static inline void uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id,
- unsigned int value)
-{
- WRITE_ONCE(rq->uclamp[clamp_id].value, value);
-}
-
-static inline bool uclamp_rq_is_idle(struct rq *rq)
-{
- return rq->uclamp_flags & UCLAMP_FLAG_IDLE;
-}
+void uclamp_update_active_nolock(struct task_struct *p);
/*
* When uclamp is compiled in, the aggregation at rq level is 'turned off'
@@ -3146,19 +3081,10 @@ static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
return SCHED_CAPACITY_SCALE;
}
-/* Integer rounded range for each bucket */
-#define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)
-
-static inline unsigned int uclamp_bucket_id(unsigned int clamp_value)
-{
- return min_t(unsigned int, clamp_value / UCLAMP_BUCKET_DELTA, UCLAMP_BUCKETS - 1);
-}
-
static inline void
uclamp_se_set(struct uclamp_se *uc_se, unsigned int value, bool user_defined)
{
uc_se->value = value;
- uc_se->bucket_id = uclamp_bucket_id(value);
uc_se->user_defined = user_defined;
}
@@ -3189,26 +3115,11 @@ uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
return SCHED_CAPACITY_SCALE;
}
-static inline bool uclamp_is_used(void)
-{
- return false;
-}
-
-static inline unsigned long
-uclamp_rq_get(struct rq *rq, enum uclamp_id clamp_id)
-{
- if (clamp_id == UCLAMP_MIN)
- return 0;
-
- return SCHED_CAPACITY_SCALE;
-}
-
-static inline void
-uclamp_rq_set(struct rq *rq, enum uclamp_id clamp_id, unsigned int value)
+static inline void uclamp_update_active_nolock(struct task_struct *p)
{
}
-static inline bool uclamp_rq_is_idle(struct rq *rq)
+static inline bool uclamp_is_used(void)
{
return false;
}
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index d6696d06829d..b3fadd386d71 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -520,6 +520,8 @@ static void __setscheduler_uclamp(struct task_struct *p,
uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
attr->sched_util_max, true);
}
+
+ uclamp_update_active_nolock(p);
}
#else /* !CONFIG_UCLAMP_TASK: */
--
2.34.1
* [PATCH 6/7] sched/uclamp: Simplify uclamp_eff_value()
2024-06-24 10:23 [PATCH 0/7] uclamp sum aggregation Hongyan Xia
` (4 preceding siblings ...)
2024-06-24 10:23 ` [PATCH 5/7] sched/uclamp: Remove all uclamp bucket logic Hongyan Xia
@ 2024-06-24 10:23 ` Hongyan Xia
2024-06-24 10:23 ` [PATCH 7/7] Propagate negative bias Hongyan Xia
6 siblings, 0 replies; 10+ messages in thread
From: Hongyan Xia @ 2024-06-24 10:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider
Cc: Qais Yousef, Morten Rasmussen, Lukasz Luba, Christian Loehle,
Pierre Gondois, Youssef Esmat, linux-kernel
The commit
sched/uclamp: Remove all uclamp bucket logic
removes the uclamp_rq_{inc/dec}() functions, so now p->uclamp contains
the correct values all the time after a uclamp_update_active() call, and
there's no need to toggle the boolean `active` around enqueue/dequeue. As
a result, uclamp_eff_value() is fairly simple now and can live as a
static inline function.
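As an illustration (not part of this patch; task_clamped_util() is a
made-up helper), a caller can combine the two effective values to clamp
a task's utilization in one step:

	/*
	 * Illustrative sketch only: clamp a raw utilization value by the
	 * task's effective uclamp settings via the new static inline.
	 */
	static inline unsigned long task_clamped_util(struct task_struct *p,
						      unsigned long util)
	{
		return clamp(util,
			     uclamp_eff_value(p, UCLAMP_MIN),
			     uclamp_eff_value(p, UCLAMP_MAX));
	}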
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
kernel/sched/core.c | 13 ++++---------
kernel/sched/sched.h | 10 +++++++++-
2 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 767894fc1562..6dc8aa730eb6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1484,20 +1484,14 @@ uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
return uc_req;
}
-unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
-{
- if (!uclamp_is_used() || !p->uclamp[clamp_id].active)
- return uclamp_none(clamp_id);
-
- return p->uclamp[clamp_id].value;
-}
-
void uclamp_update_active_nolock(struct task_struct *p)
{
enum uclamp_id clamp_id;
- for_each_clamp_id(clamp_id)
+ for_each_clamp_id(clamp_id) {
p->uclamp[clamp_id] = uclamp_eff_get(p, clamp_id);
+ p->uclamp[clamp_id].active = 1;
+ }
}
static inline void
@@ -1652,6 +1646,7 @@ static void uclamp_fork(struct task_struct *p)
for_each_clamp_id(clamp_id) {
uclamp_se_set(&p->uclamp_req[clamp_id],
uclamp_none(clamp_id), false);
+ p->uclamp[clamp_id].active = 0;
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e01e42d2703d..59e5ea421a4c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3052,7 +3052,6 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
#ifdef CONFIG_UCLAMP_TASK
-unsigned long uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id);
void uclamp_update_active_nolock(struct task_struct *p);
/*
@@ -3081,6 +3080,15 @@ static inline unsigned int uclamp_none(enum uclamp_id clamp_id)
return SCHED_CAPACITY_SCALE;
}
+static inline unsigned long
+uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
+{
+ if (uclamp_is_used() && p->uclamp[clamp_id].active)
+ return p->uclamp[clamp_id].value;
+
+ return uclamp_none(clamp_id);
+}
+
static inline void
uclamp_se_set(struct uclamp_se *uc_se, unsigned int value, bool user_defined)
{
--
2.34.1
* [PATCH 7/7] Propagate negative bias
2024-06-24 10:23 [PATCH 0/7] uclamp sum aggregation Hongyan Xia
` (5 preceding siblings ...)
2024-06-24 10:23 ` [PATCH 6/7] sched/uclamp: Simplify uclamp_eff_value() Hongyan Xia
@ 2024-06-24 10:23 ` Hongyan Xia
2024-06-25 4:48 ` K Prateek Nayak
6 siblings, 1 reply; 10+ messages in thread
From: Hongyan Xia @ 2024-06-24 10:23 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider
Cc: Qais Yousef, Morten Rasmussen, Lukasz Luba, Christian Loehle,
Pierre Gondois, Youssef Esmat, linux-kernel
Negative bias is interesting, because dequeuing a task with negative
bias will actually increase the rq utilization.
Solve this by applying PELT decay to negative biases as well. In fact,
this can be implemented easily with some math tricks.
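As a worked example of the trick (numbers made up for illustration),
take util_old = 400, neg_bias = 100 and a decay factor y^p = 1/2:

  util_new = 400 * 1/2 - 100 * 1/2 = 150
           = (400 - 100) * 1/2     = 150

so subtracting the bias from util_avg once at dequeue time is equivalent
to tracking and decaying the bias separately.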
Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
---
kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 4 ++++
2 files changed, 44 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3bb077df52ae..d09af6abf464 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4878,6 +4878,45 @@ static inline unsigned long root_cfs_util_uclamp(struct rq *rq)
return max(ret, 0L);
}
+
+/*
+ * Negative biases are tricky. If we remove them right away then dequeuing a
+ * uclamp_max task has the interesting effect that dequeuing results in a higher
+ * rq utilization. Solve this by applying PELT decay to the bias itself.
+ *
+ * Keeping track of a PELT-decayed negative bias is extra overhead. However, we
+ * observe this interesting math property, where y is the decay factor and p is
+ * the number of periods elapsed:
+ *
+ * util_new = util_old * y^p - neg_bias * y^p
+ * = (util_old - neg_bias) * y^p
+ *
+ * Therefore, we simply subtract the negative bias from util_avg the moment we
+ * dequeue, then the PELT signal itself is the total of util_avg and the decayed
+ * negative bias, and we no longer need to track the decayed bias separately.
+ */
+static void propagate_negative_bias(struct task_struct *p)
+{
+ if (task_util_bias(p) < 0 && !task_on_rq_migrating(p)) {
+ unsigned long neg_bias = -task_util_bias(p);
+ struct sched_entity *se = &p->se;
+ struct cfs_rq *cfs_rq;
+
+ p->se.avg.util_avg_bias = 0;
+
+ for_each_sched_entity(se) {
+ u32 divider, neg_sum;
+
+ cfs_rq = cfs_rq_of(se);
+ divider = get_pelt_divider(&cfs_rq->avg);
+ neg_sum = neg_bias * divider;
+ sub_positive(&se->avg.util_avg, neg_bias);
+ sub_positive(&se->avg.util_sum, neg_sum);
+ sub_positive(&cfs_rq->avg.util_avg, neg_bias);
+ sub_positive(&cfs_rq->avg.util_sum, neg_sum);
+ }
+ }
+}
#else
static inline long task_util_bias(struct task_struct *p)
{
@@ -6869,6 +6908,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* At this point se is NULL and we are at root level*/
sub_nr_running(rq, 1);
util_bias_dequeue(rq, p);
+ propagate_negative_bias(p);
/* balance early to pull high priority tasks */
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 59e5ea421a4c..9d14ef9c717e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3140,6 +3140,10 @@ static inline void util_bias_dequeue(struct rq *rq, struct task_struct *p)
{
}
+static inline void propagate_negative_bias(struct task_struct *p)
+{
+}
+
#endif /* !CONFIG_UCLAMP_TASK */
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
--
2.34.1
* Re: [PATCH 7/7] Propagate negative bias
2024-06-24 10:23 ` [PATCH 7/7] Propagate negative bias Hongyan Xia
@ 2024-06-25 4:48 ` K Prateek Nayak
2024-06-25 10:30 ` Hongyan Xia
0 siblings, 1 reply; 10+ messages in thread
From: K Prateek Nayak @ 2024-06-25 4:48 UTC (permalink / raw)
To: Hongyan Xia
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Qais Yousef,
Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
Youssef Esmat, linux-kernel
Hello Hongyan,
On 6/24/2024 3:53 PM, Hongyan Xia wrote:
> Negative bias is interesting, because dequeuing a task with negative
> bias will actually increase the rq utilization.
>
> Solve this by applying PELT decay to negative biases as well. In fact,
> this can be implemented easily with some math tricks.
>
> Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
> ---
> kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 4 ++++
> 2 files changed, 44 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3bb077df52ae..d09af6abf464 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4878,6 +4878,45 @@ static inline unsigned long root_cfs_util_uclamp(struct rq *rq)
>
> return max(ret, 0L);
> }
> +
> +/*
> + * Negative biases are tricky. If we remove them right away then dequeuing a
> + * uclamp_max task has the interesting effect that dequeuing results in a higher
> + * rq utilization. Solve this by applying PELT decay to the bias itself.
> + *
> + * Keeping track of a PELT-decayed negative bias is extra overhead. However, we
> + * observe this interesting math property, where y is the decay factor and p is
> + * the number of periods elapsed:
> + *
> + * util_new = util_old * y^p - neg_bias * y^p
> + * = (util_old - neg_bias) * y^p
> + *
> + * Therefore, we simply subtract the negative bias from util_avg the moment we
> + * dequeue, then the PELT signal itself is the total of util_avg and the decayed
> + * negative bias, and we no longer need to track the decayed bias separately.
> + */
> +static void propagate_negative_bias(struct task_struct *p)
> +{
> + if (task_util_bias(p) < 0 && !task_on_rq_migrating(p)) {
> + unsigned long neg_bias = -task_util_bias(p);
> + struct sched_entity *se = &p->se;
> + struct cfs_rq *cfs_rq;
> +
> + p->se.avg.util_avg_bias = 0;
> +
> + for_each_sched_entity(se) {
> + u32 divider, neg_sum;
> +
> + cfs_rq = cfs_rq_of(se);
> + divider = get_pelt_divider(&cfs_rq->avg);
> + neg_sum = neg_bias * divider;
> + sub_positive(&se->avg.util_avg, neg_bias);
> + sub_positive(&se->avg.util_sum, neg_sum);
In most cases where "get_pelt_divider()" is followed by "add_positive()"
or "sub_positive()" on "util_avg" and "util_sum", I've seen a correction
step that does:
util_sum = max_t(u32, util_sum, util_avg * PELT_MIN_DIVIDER)
There is a comment on its significance in "update_cfs_rq_load_avg()".
Would it also apply in this case?
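For reference, a minimal sketch of what that guard might look like here,
mirroring update_cfs_rq_load_avg() (untested, just to illustrate the
suggestion):

	sub_positive(&se->avg.util_sum, neg_sum);
	/* Keep util_sum from drifting below util_avg's implied minimum. */
	se->avg.util_sum = max_t(u32, se->avg.util_sum,
				 se->avg.util_avg * PELT_MIN_DIVIDER);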
> + sub_positive(&cfs_rq->avg.util_avg, neg_bias);
> + sub_positive(&cfs_rq->avg.util_sum, neg_sum);
> + }
> + }
> +}
> #else
> static inline long task_util_bias(struct task_struct *p)
> {
> @@ -6869,6 +6908,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* At this point se is NULL and we are at root level*/
> sub_nr_running(rq, 1);
> util_bias_dequeue(rq, p);
> + propagate_negative_bias(p);
Perhaps I'm pointing to a premature optimization but since the hierarchy
is traversed above in "dequeue_task_fair()", could the "neg_bias" and
"neg_sum" removal be done along the way above instead of
"propagate_negative_bias()" traversing the hierarchy again? I don't see
a dependency on "util_bias_dequeue()" (which modifies
"rq->cfs.avg.util_avg_bias") for "propagate_negative_bias()" (which
works purely with task_util_bias() or "p->se.avg.util_avg_bias") but if
I'm missing something please do let me know.
Since you mentioned this patch isn't strictly necessary in the cover
letter, I would wait for other folks to chime in before changing this :)
>
> /* balance early to pull high priority tasks */
> if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
> [..snip..]
--
Thanks and Regards,
Prateek
* Re: [PATCH 7/7] Propagate negative bias
2024-06-25 4:48 ` K Prateek Nayak
@ 2024-06-25 10:30 ` Hongyan Xia
0 siblings, 0 replies; 10+ messages in thread
From: Hongyan Xia @ 2024-06-25 10:30 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Dietmar Eggemann,
Juri Lelli, Steven Rostedt, Ben Segall, Mel Gorman,
Daniel Bristot de Oliveira, Valentin Schneider, Qais Yousef,
Morten Rasmussen, Lukasz Luba, Christian Loehle, Pierre Gondois,
Youssef Esmat, linux-kernel
Hi,
Thanks for taking a look!
On 25/06/2024 05:48, K Prateek Nayak wrote:
> Hello Hongyan,
>
> On 6/24/2024 3:53 PM, Hongyan Xia wrote:
>> Negative bias is interesting, because dequeuing a task with negative
>> bias will actually increase the rq utilization.
>>
>> Solve this by applying PELT decay to negative biases as well. In fact,
>> this can be implemented easily with some math tricks.
>>
>> Signed-off-by: Hongyan Xia <hongyan.xia2@arm.com>
>> ---
>> kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++++++
>> kernel/sched/sched.h | 4 ++++
>> 2 files changed, 44 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3bb077df52ae..d09af6abf464 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4878,6 +4878,45 @@ static inline unsigned long root_cfs_util_uclamp(struct rq *rq)
>> return max(ret, 0L);
>> }
>> +
>> +/*
>> + * Negative biases are tricky. If we remove them right away then dequeuing a
>> + * uclamp_max task has the interesting effect that dequeuing results in a higher
>> + * rq utilization. Solve this by applying PELT decay to the bias itself.
>> + *
>> + * Keeping track of a PELT-decayed negative bias is extra overhead. However, we
>> + * observe this interesting math property, where y is the decay factor and p is
>> + * the number of periods elapsed:
>> + *
>> + * util_new = util_old * y^p - neg_bias * y^p
>> + *          = (util_old - neg_bias) * y^p
>> + *
>> + * Therefore, we simply subtract the negative bias from util_avg the moment we
>> + * dequeue, then the PELT signal itself is the total of util_avg and the decayed
>> + * negative bias, and we no longer need to track the decayed bias separately.
>> + */
>> + */
>> +static void propagate_negative_bias(struct task_struct *p)
>> +{
>> + if (task_util_bias(p) < 0 && !task_on_rq_migrating(p)) {
>> + unsigned long neg_bias = -task_util_bias(p);
>> + struct sched_entity *se = &p->se;
>> + struct cfs_rq *cfs_rq;
>> +
>> + p->se.avg.util_avg_bias = 0;
>> +
>> + for_each_sched_entity(se) {
>> + u32 divider, neg_sum;
>> +
>> + cfs_rq = cfs_rq_of(se);
>> + divider = get_pelt_divider(&cfs_rq->avg);
>> + neg_sum = neg_bias * divider;
>> + sub_positive(&se->avg.util_avg, neg_bias);
>> + sub_positive(&se->avg.util_sum, neg_sum);
>
> Most cases where I've seen "get_pelt_divider()" followed by
> "add_positive()" or "sub_positive()" on "util_avg" and "util_sum" I've
> seen a correction step that does:
>
> util_sum = max_t(u32, util_sum, util_avg * PELT_MIN_DIVIDER)
>
> There is a comment on its significance in "update_cfs_rq_load_avg()".
> Would it also apply in this case?
>
That's a good point. The problem guarded against in
update_cfs_rq_load_avg() can also happen here. I can add the guard logic
in the next rev.
But if we change the code in the way suggested below, then this problem
is solved anyway.
>> + sub_positive(&cfs_rq->avg.util_avg, neg_bias);
>> + sub_positive(&cfs_rq->avg.util_sum, neg_sum);
>> + }
>> + }
>> +}
>> #else
>> static inline long task_util_bias(struct task_struct *p)
>> {
>> @@ -6869,6 +6908,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> /* At this point se is NULL and we are at root level*/
>> sub_nr_running(rq, 1);
>> util_bias_dequeue(rq, p);
>> + propagate_negative_bias(p);
>
> Perhaps I'm pointing to a premature optimization but since the hierarchy
> is traversed above in "dequeue_task_fair()", could the "neg_bias" and
> "neg_sum" removal be done along the way above instead of
> "propagate_negative_bias()" traversing the hierarchy again? I don't see
> a dependency on "util_bias_dequeue()" (which modifies
> "rq->cfs.avg.util_avg_bias") for "propagate_negative_bias()" (which
> works purely with task_util_bias() or "p->se.avg.util_avg_bias") but if
> I'm missing something please do let me know.
>
> Since you mentioned this patch isn't strictly necessary in the cover
> letter, I would wait for other folks to chime in before changing this :)
I've been thinking about similar things for both enqueue() and
dequeue(). Currently this series makes util_avg_bias completely separate
from util_avg to ease review, acting more like util_est, but like you
said we do things twice in a couple of places.
enqueue_task_fair():
    for_each_sched_entity()
        enqueue_entity()
            if root_cfs()
                cpufreq_update_util()
    util_bias_enqueue(p)
    cpufreq_update_util() // duplicate cpufreq update

dequeue_task_fair():
    for_each_sched_entity()
        dequeue_entity()
            if root_cfs()
                cpufreq_update_util()
    util_bias_dequeue(p)
    propagate_negative_bias() // duplicate tree traversal
    cpufreq_update_util() // duplicate cpufreq update
But we can integrate the bias closer into the hierarchy, like this:
enqueue_task_fair():
    for_each_sched_entity()
        enqueue_entity()
            if (entity_is_task())
                util_bias_enqueue(p)
            if root_cfs()
                // No duplicate cpufreq updates
                cpufreq_update_util()

dequeue_task_fair():
    for_each_sched_entity()
        dequeue_entity()
            if (entity_is_task())
                util_bias_dequeue(p)
                // No need to traverse twice.
                propagate_negative_bias(p)
            if root_cfs()
                // No duplicate cpufreq updates
                cpufreq_update_util()
This new structure will address both of your concerns.
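A rough C sketch of the dequeue side under that structure (untested;
the names are the ones used in this series, placement details still
open):

	for_each_sched_entity(se) {
		struct cfs_rq *cfs_rq = cfs_rq_of(se);

		dequeue_entity(cfs_rq, se, flags);

		/* Task-level bias is handled on the first iteration. */
		if (entity_is_task(se))
			util_bias_dequeue(rq, p);

		/* Fold the negative bias in at each level as we go up. */
		if (neg_bias) {
			u32 neg_sum = neg_bias * get_pelt_divider(&cfs_rq->avg);

			sub_positive(&se->avg.util_avg, neg_bias);
			sub_positive(&se->avg.util_sum, neg_sum);
			sub_positive(&cfs_rq->avg.util_avg, neg_bias);
			sub_positive(&cfs_rq->avg.util_sum, neg_sum);
		}
	}

where neg_bias is sampled from task_util_bias(p) before the loop, as
propagate_negative_bias() does today.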
>> /* balance early to pull high priority tasks */
>> if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>> [..snip..]
>