Re: [RFC] Documentation/scheduler/schedutil.txt

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Mel Gorman <mgorman@suse.de>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	dietmar.eggemann@arm.com, patrick.bellasi@matbug.net,
	lenb@kernel.org, linux-kernel@vger.kernel.org,
	valentin.schneider@arm.com, ionela.voinescu@arm.com,
	qperret@google.com, viresh.kumar@linaro.org
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt
Date: Wed, 2 Dec 2020 14:18:35 +0000	[thread overview]
Message-ID: <20201202141835.GY3306@suse.de> (raw)
In-Reply-To: <20201120075527.GB2414@hirez.programming.kicks-ass.net>

For whatever reason, this only arrived in my inbox today.

On Fri, Nov 20, 2020 at 08:55:27AM +0100, Peter Zijlstra wrote:
> PELT (Per Entity Load Tracking)
> -------------------------------
> 
> With PELT we track some metrics across the various entities, from individual

s/entities/scheduler entities/ so that someone will recognise
sched_entity in the code.

> tasks to task-group slices to CPU runqueues. As the basis for this we use an
> EWMA, each period (1024us) is decayed such that y^32 = 0.5. That is, the

Expand -- Exponentially Weighted Moving Average (EWMA). The cc list
should recognise it automatically, but maybe not a first entrant to
kernel/sched. kernel/sched is littered with institutional knowledge.

> most recent 32ms contribute half, while the rest of history contribute the
> other half.
> 

IIRC, this 32ms is tied to the value of LOAD_AVG_PERIOD and the length
of the ewma_sum series below. Might be worth expanding a little further.

> Specifically:
> 
>   ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
> 
>   ewma(u) = ewma_sum(u) / ewma_sum(1)
> 
> Since this is essentially a progression of an infinite geometric series, the
> results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property
> is key, since it gives the ability to recompose the averages when tasks move
> around.
> 
> Note that blocked tasks still contribute to the aggregates (task-group slices
> and CPU runqueues), which reflects their expected contribution when they
> resume running.
> 
> Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
> reflects the time an entity spends on the CPU, while 'runnable' reflects the
> time an entity spends on the runqueue. When there is only a single task these
> two metrics are the same, but once there is contention for the CPU 'running'
> will decrease to reflect the fraction of time each task spends on the CPU
> while 'runnable' will increase to reflect the amount of contention.
> 
> For more detail see: kernel/sched/pelt.c
> 
> 
> Frequency- / Heterogeneous Invariance
> -------------------------------------
> 
> Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
> for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
> a big CPU, we allow architectures to scale the time delta with two ratios, one
> DVFS ratio and one microarch ratio.
> 

Expand -- Dynamic Voltage and Frequency Scaling (DVFS) and assume that
the reader will think of cpufreq.

> For simple DVFS architectures (where software is in full control) we trivially
> compute the ratio as:
> 
> 	    f_cur
>   r_dvfs := -----
>             f_max
> 
> For more dynamic systems where the hardware is in control of DVFS (Intel,
> ARMv8.4-AMU) we use hardware counters to provide us this ratio. In specific,
> for Intel, we use:
> 

s/In specific, for Intel/For Intel specifically,/

> 	   APERF
>   f_cur := ----- * P0
> 	   MPERF
> 
> 	     4C-turbo;	if available and turbo enabled
>   f_max := { 1C-turbo;	if turbo enabled
> 	     P0;	otherwise
> 
>                     f_cur
>   r_dvfs := min( 1, ----- )
>                     f_max
> 
> We pick 4C turbo over 1C turbo to make it slightly more sustainable.
> 
> r_het is determined as the average performance difference between a big and
> LITTLE core when running at max frequency over 'relevant' benchmarks.
> 

r_het is never mentioned again so it's not immediately obvious what it
ties into. I assume het is short for heterogeneous and is simply another
way of looking at current CPU compute power vs potential CPU compute power
(be that due to DVFS or big.LITTLE).

> The result is that the above 'running' and 'runnable' metrics become invariant
> of DVFS and Heterogenous state. IOW. we can transfer and compare them between
> CPUs.
> 
> For more detail see:
> 
>  - kernel/sched/pelt.h:update_rq_clock_pelt()
>  - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
> 

The role and importance of frequency invariance is mentioned but it
could be more explicit. However, looking at update_rq_clock_pelt may be
enough of a clue. 

Either way, decoding this document fully will require someone to spend a
lot of time on the source and then rereading this document. That's
probably a good thing.

> 
> UTIL_EST / UTIL_EST_FASTUP
> --------------------------
> 
> Because periodic tasks have their averages decayed while they sleep, even
> though when running their expected utilization will be the same, they suffer a
> (DVFS) ramp-up after they become runnable again.
> 

s/they become runnable again/they are running again/ ? Maybe refer as
"running" because it's only once they are on the CPU and running that
DVFS comes into play?



> To alleviate this (a default enabled option) UTIL_EST drives an (IIR) EWMA

Expand IIR -- Immediate Impulse Reponse?

> with the 'running' value on dequeue -- when it is highest. A further default
> enabled option UTIL_EST_FASTUP modifies the IIR filter to instantly increase
> and only decay on decrease.
> 
> A further runqueue wide sum (of runnable tasks) is maintained of:
> 
>   util_est := \Sum_t max( t_running, t_util_est_ewma )
> 
> For more detail see: kernel/sched/fair.h:util_est_dequeue()
> 

It's less obvious what the consequence is unless the reader manages to
tie the IO-wait comment in "Schedutil / DVFS" to this section.

> UCLAMP
> ------
> 
> It is possible to set effective u_min and u_max clamps on each task; the
> runqueue keeps an max aggregate of these clamps for all running tasks.
> 
> For more detail see: include/uapi/linux/sched/types.h
> 
> 
> Schedutil / DVFS
> ----------------
> 
> Every time the scheduler load tracking is updated (task wakeup, task
> migration, time progression) we call out to schedutil to update the hardware
> DVFS state.
> 
> The basis is the CPU runqueue's 'running' metric, which per the above it is
> the frequency invariant utilization estimate of the CPU. From this we compute
> a desired frequency like:
> 
>              max( running, util_est );	if UTIL_EST
>   u_cfs := { running;			otherwise
> 
>   u_clamp := clamp( u_cfs, u_min, u_max )
> 
>   u := u_cfs + u_rt + u_irq + u_dl;	[approx. see source for more detail]
> 
>   f_des := min( f_max, 1.25 u * f_max )
> 
> XXX IO-wait; when the update is due to a task wakeup from IO-completion we
> boost 'u' above.
> 
> This frequency is then used to select a P-state/OPP or directly munged into a
> CPPC style request to the hardware.
> 
> XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
> required to satisfy the workload.
> 
> Because these callbacks are directly from the scheduler, the DVFS hardware
> interaction should be 'fast' and non-blocking. Schedutil supports
> rate-limiting DVFS requests for when hardware interaction is slow and
> expensive, this reduces effectiveness.
> 

Is it worth explicitly mentioning that a key advantage over
hardware-based approaches is that schedutil carries utilisation state on
CPU migration? You say that it is tracked but it's less obvious why that
matters as a pure hardware based approach loses utilisation information
about a task once it migrates.

Even moving note 3 below into this section and expanding it with an
example based on HWP would be helpful.

> For more information see: kernel/sched/cpufreq_schedutil.c
> 
> 
> NOTES
> -----
> 
>  - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
>    will closely reflect utilization.
> 
>  - In saturated scenarios task movement will cause some transient dips,
>    suppose we have a CPU saturated with 4 tasks, then when we migrate a task
>    to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
>    new CPU will gain 0.25. This is inevitable and time progression will
>    correct this. XXX do we still guarantee f_max due to no idle-time?
> 
>  - Much of the above is about avoiding DVFS dips, and independent DVFS domains
>    having to re-learn / ramp-up when load shifts.
> 

-- 
Mel Gorman
SUSE Labs

next prev parent reply	other threads:[~2020-12-02 14:19 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-20  7:55 [RFC] Documentation/scheduler/schedutil.txt Peter Zijlstra
2020-11-20  8:56 ` Peter Zijlstra
2020-11-20  9:13   ` Quentin Perret
2020-11-20  9:19     ` Viresh Kumar
2020-11-20  9:27       ` Quentin Perret
2020-11-23  9:30   ` Dietmar Eggemann
2020-11-23 10:05     ` Vincent Guittot
2020-11-23 11:27       ` Dietmar Eggemann
2020-11-23 13:42         ` Vincent Guittot
2020-11-23 18:39           ` Dietmar Eggemann
2020-11-20 11:45 ` Valentin Schneider
2020-11-20 14:37 ` Morten Rasmussen
2020-11-23  9:26 ` Dietmar Eggemann
2020-11-23 14:51   ` Peter Zijlstra
2020-12-02 14:18 ` Mel Gorman [this message]
2020-12-02 15:54   ` Peter Zijlstra
2020-12-02 16:45     ` Mel Gorman
2020-12-02 16:58       ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201202141835.GY3306@suse.de \
    --to=mgorman@suse.de \
    --cc=dietmar.eggemann@arm.com \
    --cc=ionela.voinescu@arm.com \
    --cc=lenb@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=morten.rasmussen@arm.com \
    --cc=patrick.bellasi@matbug.net \
    --cc=peterz@infradead.org \
    --cc=qperret@google.com \
    --cc=rjw@rjwysocki.net \
    --cc=tglx@linutronix.de \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    --cc=viresh.kumar@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox