From: Hongyan Xia <hongyan.xia2@arm.com>
To: Dietmar Eggemann <dietmar.eggemann@arm.com>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Vincent Guittot <vincent.guittot@linaro.org>,
Juri Lelli <juri.lelli@redhat.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Valentin Schneider <vschneid@redhat.com>,
"Rafael J. Wysocki" <rafael@kernel.org>,
Viresh Kumar <viresh.kumar@linaro.org>
Cc: Qais Yousef <qyousef@layalina.io>,
Morten Rasmussen <morten.rasmussen@arm.com>,
Lukasz Luba <lukasz.luba@arm.com>,
Christian Loehle <christian.loehle@arm.com>,
pierre.gondois@arm.com, linux-kernel@vger.kernel.org,
linux-pm@vger.kernel.org
Subject: Re: [RFC PATCH v3 3/6] sched/fair: Use util biases for utilization and frequency
Date: Tue, 28 May 2024 12:38:11 +0100 [thread overview]
Message-ID: <ede83265-4107-4c21-bc33-f733326daafe@arm.com> (raw)
In-Reply-To: <a1adc985-056f-4fac-9b2d-d140c3ee9e9b@arm.com>
On 26/05/2024 23:52, Dietmar Eggemann wrote:
> On 07/05/2024 14:50, Hongyan Xia wrote:
>> Use the new util_avg_bias for task and runqueue utilization. We also
>> maintain separate util_est and util_est_uclamp signals.
>>
>> Now that we have the sum aggregated CFS util value, we do not need to
>
> There is the word uclamp missing somehow here.
Ack.
>> consult uclamp buckets to know how the frequency should be clamped. We
>> simply look at the aggregated top level rq->cfs.avg.util_avg +
>> rq->cfs.avg.util_avg_bias and rq->cfs.avg.util_est_uclamp to know what
>> frequency to choose and how to place tasks.
>
> [...]
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 571c8de59508..14376f23a8b7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4819,14 +4819,14 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
>>
>> static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf);
>>
>> -static inline unsigned long task_util(struct task_struct *p)
>> +static inline unsigned long task_runnable(struct task_struct *p)
>> {
>> - return READ_ONCE(p->se.avg.util_avg);
>> + return READ_ONCE(p->se.avg.runnable_avg);
>> }
>>
>> -static inline unsigned long task_runnable(struct task_struct *p)
>> +static inline unsigned long task_util(struct task_struct *p)
>> {
>> - return READ_ONCE(p->se.avg.runnable_avg);
>> + return READ_ONCE(p->se.avg.util_avg);
>> }
>>
>> static inline unsigned long _task_util_est(struct task_struct *p)
>> @@ -4839,6 +4839,52 @@ static inline unsigned long task_util_est(struct task_struct *p)
>> return max(task_util(p), _task_util_est(p));
>> }
>
> IMHO, we now have two complete hierarchies of util signals:
>
> (a) (b) (uclamp version of (a))
>
> (1) task_util() task_util_uclamp() (+ task_util_bias())
>
> (2) _task_util_est() _task_util_est_uclamp()
>
> (3) task_util_est() task_util_est_uclamp()
>
> (4) cpu_util() cpu_util_uclamp()
>
> (5) cpu_util_cfs() cpu_util_cfs_uclamp()
>
> (6) cpu_util_cfs_boost() cpu_util_cfs_boost_uclamp()
>
>
> For (4), (5), (6) we now have:
>
> --- cpu_util() cpu_util_uclamp()
>
> eenv_pd_busy_time() x
>
> eenv_pd_max_util() x
>
> find_energy_efficient_cpu() x x <-- (A)
>
> cpu_util_without() x
>
> --- cpu_util_cfs() cpu_util_cfs_uclamp()
>
> cpu_overutilized() x x <-- (B)
>
> update_sg_lb_stats() x
>
> update_numa_stats() x
>
> sched_cpu_util() x
>
> --- cpu_util_cfs_boost() cpu_util_cfs_boost_uclamp()
>
> sched_balance_find_src_rq() x
>
> sugov_get_util() x
>
> The uclamp version falls back to the non-uclamp version for
> !CONFIG_UCLAMP_TASK. So it looks like in this case we calculate the
> same cpu utilization value twice in (A) and (B).
That is indeed the case. I'll see how I can avoid this duplication,
hopefully without too much #ifdef mess.
> [...]
>
>> @@ -4970,134 +5026,29 @@ static inline unsigned long get_actual_cpu_capacity(int cpu)
>> }
>>
>> static inline int util_fits_cpu(unsigned long util,
>> - unsigned long uclamp_min,
>> - unsigned long uclamp_max,
>> + unsigned long util_uclamp,
>> int cpu)
>> {
>> unsigned long capacity = capacity_of(cpu);
>> - unsigned long capacity_orig;
>> - bool fits, uclamp_max_fits;
>> -
>> - /*
>> - * Check if the real util fits without any uclamp boost/cap applied.
>> - */
>> - fits = fits_capacity(util, capacity);
>> -
>> - if (!uclamp_is_used())
>> - return fits;
>> -
>> - /*
>> - * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
>> - * uclamp_max. We only care about capacity pressure (by using
>> - * capacity_of()) for comparing against the real util.
>> - *
>> - * If a task is boosted to 1024 for example, we don't want a tiny
>> - * pressure to skew the check whether it fits a CPU or not.
>> - *
>> - * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
>> - * should fit a little cpu even if there's some pressure.
>> - *
>> - * Only exception is for HW or cpufreq pressure since it has a direct impact
>> - * on available OPP of the system.
>> - *
>> - * We honour it for uclamp_min only as a drop in performance level
>> - * could result in not getting the requested minimum performance level.
>> - *
>> - * For uclamp_max, we can tolerate a drop in performance level as the
>> - * goal is to cap the task. So it's okay if it's getting less.
>> - */
>> - capacity_orig = arch_scale_cpu_capacity(cpu);
>>
>> - /*
>> - * We want to force a task to fit a cpu as implied by uclamp_max.
>> - * But we do have some corner cases to cater for..
>> - *
>> - *
>> - * C=z
>> - * | ___
>> - * | C=y | |
>> - * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
>> - * | C=x | | | |
>> - * | ___ | | | |
>> - * | | | | | | | (util somewhere in this region)
>> - * | | | | | | |
>> - * | | | | | | |
>> - * +----------------------------------------
>> - * CPU0 CPU1 CPU2
>> - *
>> - * In the above example if a task is capped to a specific performance
>> - * point, y, then when:
>> - *
>> - * * util = 80% of x then it does not fit on CPU0 and should migrate
>> - * to CPU1
>> - * * util = 80% of y then it is forced to fit on CPU1 to honour
>> - * uclamp_max request.
>> - *
>> - * which is what we're enforcing here. A task always fits if
>> - * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
>> - * the normal upmigration rules should withhold still.
>> - *
>> - * Only exception is when we are on max capacity, then we need to be
>> - * careful not to block overutilized state. This is so because:
>> - *
>> - * 1. There's no concept of capping at max_capacity! We can't go
>> - * beyond this performance level anyway.
>> - * 2. The system is being saturated when we're operating near
>> - * max capacity, it doesn't make sense to block overutilized.
>> - */
>> - uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
>> - uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
>> - fits = fits || uclamp_max_fits;
>> + if (fits_capacity(util_uclamp, capacity))
>> + return 1;
>>
>> - /*
>> - *
>> - * C=z
>> - * | ___ (region a, capped, util >= uclamp_max)
>> - * | C=y | |
>> - * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
>> - * | C=x | | | |
>> - * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
>> - * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
>> - * | | | | | | |
>> - * | | | | | | | (region c, boosted, util < uclamp_min)
>> - * +----------------------------------------
>> - * CPU0 CPU1 CPU2
>> - *
>> - * a) If util > uclamp_max, then we're capped, we don't care about
>> - * actual fitness value here. We only care if uclamp_max fits
>> - * capacity without taking margin/pressure into account.
>> - * See comment above.
>> - *
>> - * b) If uclamp_min <= util <= uclamp_max, then the normal
>> - * fits_capacity() rules apply. Except we need to ensure that we
>> - * enforce we remain within uclamp_max, see comment above.
>> - *
>> - * c) If util < uclamp_min, then we are boosted. Same as (b) but we
>> - * need to take into account the boosted value fits the CPU without
>> - * taking margin/pressure into account.
>> - *
>> - * Cases (a) and (b) are handled in the 'fits' variable already. We
>> - * just need to consider an extra check for case (c) after ensuring we
>> - * handle the case uclamp_min > uclamp_max.
>> - */
>> - uclamp_min = min(uclamp_min, uclamp_max);
>> - if (fits && (util < uclamp_min) &&
>> - (uclamp_min > get_actual_cpu_capacity(cpu)))
>> + if (fits_capacity(util, capacity))
>> return -1;
>>
>> - return fits;
>> + return 0;
>
> The possible return values stay the same it seems: 1 if uclamp (and so
> util_avg) fits, -1 if only util_avg fits, and 0 if neither uclamp nor
> util_avg fits.
Yes, this is the intended change to address some comments elsewhere and
in previous revisions.
> [...]
>
>> @@ -6914,6 +6861,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> "0 tasks on CFS of CPU %d, but util_avg_bias is %d\n",
>> rq->cpu, rq->cfs.avg.util_avg_bias);
>> WRITE_ONCE(rq->cfs.avg.util_avg_bias, 0);
>> + WARN_ONCE(rq->cfs.avg.util_est_uclamp,
>> + "0 tasks on CFS of CPU %d, but util_est_uclamp is %u\n",
>> + rq->cpu, rq->cfs.avg.util_est_uclamp);
>> + WRITE_ONCE(rq->cfs.avg.util_est_uclamp, 0);
>
> Maybe better:
>
> if (WARN_ONCE(...
> WRITE_ONCE(rq->cfs.avg.util_est_uclamp, 0);
>
> How can this happen, that there are 0 tasks on the rq->cfs hierarchy
> and rq->cfs.avg.util_est_uclamp is !0 ? Is there a specific code path
> where this happens?
Ack.
This should never happen. Triggering the warning here means something
has gone very wrong in the maths and should be reported.
>> }
>> #endif
>> }
>> @@ -7485,7 +7436,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> static int
>> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>> {
>> - unsigned long task_util, util_min, util_max, best_cap = 0;
>> + unsigned long task_util, task_util_uclamp, best_cap = 0;
>> int fits, best_fits = 0;
>> int cpu, best_cpu = -1;
>> struct cpumask *cpus;
>> @@ -7494,8 +7445,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>
>> task_util = task_util_est(p);
>> - util_min = uclamp_eff_value(p, UCLAMP_MIN);
>> - util_max = uclamp_eff_value(p, UCLAMP_MAX);
>> + task_util_uclamp = task_util_est_uclamp(p);
>
> select_idle_sibling() could pass task_util and task_util_uclamp into
> select_idle_capacity(). This way we would save these calls to
> task_util_est() and task_util_est_uclamp() here.
Good idea.
>> for_each_cpu_wrap(cpu, cpus, target) {
>> unsigned long cpu_cap = capacity_of(cpu);
>> @@ -7503,7 +7453,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>> if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
>> continue;
>>
>> - fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>> + fits = util_fits_cpu(task_util, task_util_uclamp, cpu);
>>
>> /* This CPU fits with all requirements */
>> if (fits > 0)
>
> [...]
>