public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Valentin Schneider <valentin.schneider@arm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	dietmar.eggemann@arm.com, patrick.bellasi@matbug.net,
	lenb@kernel.org, linux-kernel@vger.kernel.org,
	ionela.voinescu@arm.com, qperret@google.com,
	viresh.kumar@linaro.org
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt
Date: Fri, 20 Nov 2020 11:45:05 +0000	[thread overview]
Message-ID: <jhjzh3cuu9q.mognet@arm.com> (raw)
In-Reply-To: <20201120075527.GB2414@hirez.programming.kicks-ass.net>


Hi,

On 20/11/20 07:55, Peter Zijlstra wrote:
> Frequency- / Heterogeneous Invariance
> -------------------------------------
>
> Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
> for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
> a big CPU, we allow architectures to scale the time delta with two ratios, one
> DVFS ratio and one microarch ratio.
>
> For simple DVFS architectures (where software is in full control) we trivially
> compute the ratio as:
>
> 	    f_cur
>   r_dvfs := -----
>             f_max
>
> For more dynamic systems where the hardware is in control of DVFS (Intel,
> ARMv8.4-AMU) we use hardware counters to provide us this ratio. In specific,
> for Intel, we use:
>
> 	   APERF
>   f_cur := ----- * P0
> 	   MPERF
>
> 	     4C-turbo;	if available and turbo enabled
>   f_max := { 1C-turbo;	if turbo enabled
> 	     P0;	otherwise
>
>                     f_cur
>   r_dvfs := min( 1, ----- )
>                     f_max
>
> We pick 4C turbo over 1C turbo to make it slightly more sustainable.
>
> r_het is determined as the average performance difference between a big and
> LITTLE core when running at max frequency over 'relevant' benchmarks.

Welcome to our wonderful world where there can be more than just two types
of CPUs! A perhaps safer statement would be:

  r_het is determined as the ratio of highest performance level of the
  current CPU vs the highest performance level of any other CPU in the 
  system.


Also; do we want to further state the obvious?

  r_tot := r_het * r_dvfs

>
> The result is that the above 'running' and 'runnable' metrics become invariant
> of DVFS and Heterogenous state. IOW. we can transfer and compare them between
> CPUs.
>
> For more detail see:
>
>  - kernel/sched/pelt.h:update_rq_clock_pelt()
>  - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
>

Some of that is rephrased in

  Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"

(with added diagrams crafted with love by yours truly); I suppose a cross
reference can't hurt.

>
> UTIL_EST / UTIL_EST_FASTUP
> --------------------------
>
> Because periodic tasks have their averages decayed while they sleep, even
> though when running their expected utilization will be the same, they suffer a
> (DVFS) ramp-up after they become runnable again.
>
> To alleviate this (a default enabled option) UTIL_EST drives an (IIR) EWMA
> with the 'running' value on dequeue -- when it is highest. A further default
> enabled option UTIL_EST_FASTUP modifies the IIR filter to instantly increase
> and only decay on decrease.
>
> A further runqueue wide sum (of runnable tasks) is maintained of:
>
>   util_est := \Sum_t max( t_running, t_util_est_ewma )
>
> For more detail see: kernel/sched/fair.h:util_est_dequeue()
>
>
> UCLAMP
> ------
>
> It is possible to set effective u_min and u_max clamps on each task; the

Nit: effective clamps are the task clamps clamped by task group clamps
(yes, that is 4 times 'clamp' in a single line).

> runqueue keeps an max aggregate of these clamps for all running tasks.
>
> For more detail see: include/uapi/linux/sched/types.h
>
>
> Schedutil / DVFS
> ----------------
>
> Every time the scheduler load tracking is updated (task wakeup, task
> migration, time progression) we call out to schedutil to update the hardware
> DVFS state.
>
> The basis is the CPU runqueue's 'running' metric, which per the above it is
> the frequency invariant utilization estimate of the CPU. From this we compute
> a desired frequency like:
>
>              max( running, util_est );	if UTIL_EST
>   u_cfs := { running;			otherwise
>
>   u_clamp := clamp( u_cfs, u_min, u_max )
>
>   u := u_cfs + u_rt + u_irq + u_dl;	[approx. see source for more detail]
>
>   f_des := min( f_max, 1.25 u * f_max )
>
> XXX IO-wait; when the update is due to a task wakeup from IO-completion we
> boost 'u' above.
>

IIRC the boost is fiddled with during the above, but can be applied at
different subsequent updates (even if the task that triggered the boost is
no longer here).

> This frequency is then used to select a P-state/OPP or directly munged into a
> CPPC style request to the hardware.
>
> XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
> required to satisfy the workload.
>
> Because these callbacks are directly from the scheduler, the DVFS hardware
> interaction should be 'fast' and non-blocking. Schedutil supports
> rate-limiting DVFS requests for when hardware interaction is slow and
> expensive, this reduces effectiveness.
>
> For more information see: kernel/sched/cpufreq_schedutil.c
>
>
> NOTES
> -----
>
>  - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
>    will closely reflect utilization.
>
>  - In saturated scenarios task movement will cause some transient dips,
>    suppose we have a CPU saturated with 4 tasks, then when we migrate a task
>    to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
>    new CPU will gain 0.25. This is inevitable and time progression will
>    correct this. XXX do we still guarantee f_max due to no idle-time?
>
>  - Much of the above is about avoiding DVFS dips, and independent DVFS domains
>    having to re-learn / ramp-up when load shifts.


  parent reply	other threads:[~2020-11-20 11:45 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-20  7:55 [RFC] Documentation/scheduler/schedutil.txt Peter Zijlstra
2020-11-20  8:56 ` Peter Zijlstra
2020-11-20  9:13   ` Quentin Perret
2020-11-20  9:19     ` Viresh Kumar
2020-11-20  9:27       ` Quentin Perret
2020-11-23  9:30   ` Dietmar Eggemann
2020-11-23 10:05     ` Vincent Guittot
2020-11-23 11:27       ` Dietmar Eggemann
2020-11-23 13:42         ` Vincent Guittot
2020-11-23 18:39           ` Dietmar Eggemann
2020-11-20 11:45 ` Valentin Schneider [this message]
2020-11-20 14:37 ` Morten Rasmussen
2020-11-23  9:26 ` Dietmar Eggemann
2020-11-23 14:51   ` Peter Zijlstra
2020-12-02 14:18 ` Mel Gorman
2020-12-02 15:54   ` Peter Zijlstra
2020-12-02 16:45     ` Mel Gorman
2020-12-02 16:58       ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=jhjzh3cuu9q.mognet@arm.com \
    --to=valentin.schneider@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=ionela.voinescu@arm.com \
    --cc=lenb@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=morten.rasmussen@arm.com \
    --cc=patrick.bellasi@matbug.net \
    --cc=peterz@infradead.org \
    --cc=qperret@google.com \
    --cc=rjw@rjwysocki.net \
    --cc=tglx@linutronix.de \
    --cc=vincent.guittot@linaro.org \
    --cc=viresh.kumar@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox