From: Patrick Bellasi <patrick.bellasi@arm.com>
To: Steve Muckle <steve.muckle@linaro.org>
Cc: Ricky Liang <jcliang@chromium.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>,
	Jonathan Corbet <corbet@lwn.net>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	Viresh Kumar <viresh.kumar@linaro.org>
Subject: Re: [RFC 08/14] sched/tune: add detailed documentation
Date: Fri, 11 Sep 2015 12:09:59 +0100
Message-ID: <20150911110959.GA22876@e105326-lin>
In-Reply-To: <55F0938A.9000607@linaro.org>

On Wed, Sep 09, 2015 at 09:16:10PM +0100, Steve Muckle wrote:
> Hi Patrick,

Hi Steve,
 
> On 09/03/2015 02:18 AM, Patrick Bellasi wrote:
> > In my view, one of the main goals of sched-DVFS is actually that to be
> > a solid and generic replacement of different CPUFreq governors.
> > Being driven by the scheduler, sched-DVFS can exploit information on
> > CPU demand of active tasks in order to select the optimal Operating
> > Performance Point (OPP) using a "proactive" approach instead of the
> > "reactive" approach commonly used by existing governors.
> 
> I'd agree that with knowledge of CPU demand on a per-task basis, rather
> than the aggregate per-CPU demand that cpufreq governors use today, it
> is possible to proactively address changes in CPU demand which result
> from task migrations, task creation and exit, etc.
> 
> That said I believe setting the OPP based on a particular given
> historical profile of task load still relies on a heuristic algorithm of
> some sort where there is no single right answer. I am concerned about
> whether sched-dvfs and SchedTune, as currently proposed, will support
> enough of a range of possible heuristics/policies to effectively replace
> the existing cpufreq governors.
> 
> The two most popular governors for normal operation in the mobile world:
> 
> * ondemand: Samples periodically, CPU usage calculated as simple busy
> fraction of last X ms window of time. Goes straight to fmax when load
> exceeds up_threshold tunable %, otherwise scales frequency
> proportionally with load. Can stay at fmax longer if requested before
> re-evaluating by configuring the sampling_down_factor tunable.
> 
> * interactive: Samples periodically, CPU usage calculated as simple busy
> fraction of last Xms window of time. Goes to an intermediate tunable
> freq (hispeed_freq) when load exceeds a user definable threshold
> (go_hispeed_load). Otherwise strives to maintain the CPU usage set by
> the user in the "target_loads" array. Other knobs that affect behavior
> include min_sample_time (min time to spend at a freq before slowing
> down) and above_hispeed_delay (allows various delays to further raise
> speed above hispeed freq).
> 
> It's also worth noting that mobile vendors typically add all sorts of
> hacks on top of the existing cpufreq governors which further complicate
> policy.
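
To make the comparison concrete, the ondemand policy described above
could be sketched, very roughly, as follows; the function name and the
80% threshold are illustrative assumptions, not the actual cpufreq
governor code:

```c
/* Illustrative sketch of ondemand-style frequency selection; names
 * and the 80% threshold are assumptions, not the real governor code. */
#define UP_THRESHOLD_PCT 80 /* busy % above which we jump straight to fmax */

/* busy_pct: CPU busy fraction over the last sampling window, 0..100 */
static unsigned int ondemand_target_khz(unsigned int busy_pct,
					unsigned int fmin_khz,
					unsigned int fmax_khz)
{
	if (busy_pct > UP_THRESHOLD_PCT)
		return fmax_khz;	/* load spike: go straight to fmax */

	/* otherwise scale frequency proportionally with load */
	unsigned int f_khz = fmax_khz * busy_pct / 100;
	return f_khz < fmin_khz ? fmin_khz : f_khz;
}
```

With fmax = 1.5 GHz, a 50% busy window picks 750 MHz, while anything
above the 80% threshold returns fmax directly.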

Could it be that many of the hacks introduced by vendors are just
there to implement a kind of "scenario based" tuning of the governors?
I mean that, depending on the specific use-case, they refine the
values of the exposed tunables to improve performance,
responsiveness, or power consumption.

If this is the case, it means that the currently available governors
are missing an important bit of information: what are the best
tunable values for a specific (set of) tasks?

> The current proposal:
> 
> * sched-dvfs/schedtune: Event driven, CPU usage calculated using
> exponential moving average. AFAICS tries to maintain some % of idle
> headroom, but if that headroom doesn't exist at task_tick_fair(), goes
> to max frequency. Schedtune provides a way to boost/inflate the demand
> of individual tasks or overall system demand.

That's quite a good description. One small correction: at least in
the implementation presented by this RFC, SchedTune does not boost
individual tasks, only the CPU usage signal.
The link with tasks is just that SchedTune knows how much to boost a
CPU's usage by keeping track of which tasks are runnable on that CPU.
However, the utilization signal of each task is not actually modified
from the scheduler's standpoint.
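
A minimal sketch of this distinction, modelled on the boost-to-margin
conversion of patch 10 in this series (the names and the exact linear
formula here are assumptions): the boost inflates only the per-CPU
usage signal handed to sched-DVFS, while per-task signals stay
untouched.

```c
/* Sketch: boost inflates the CPU usage seen by sched-DVFS, not the
 * per-task utilization signals. The margin formula is an assumption
 * modelled on patch 10 of this series ("boost value into margin"). */
#define SCHED_CAPACITY 1024 /* full capacity of one CPU */

/* boost_pct in [0, 100]: take that fraction of the remaining headroom */
static unsigned long boosted_cpu_usage(unsigned long usage, int boost_pct)
{
	unsigned long margin = (SCHED_CAPACITY - usage) * boost_pct / 100;
	return usage + margin;
}
```

For example, a CPU at half utilization (512/1024) with a 50% boost is
presented to the governor as 768, i.e. half of the remaining headroom
is added as margin.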

> This looks a bit like ondemand to me but without the
> sampling_down_factor functionality and using per-entity load tracking
> instead of a simple window-based aggregate CPU usage.

I agree in principle.
An important difference worth noting is that we use an "event based"
approach: an enqueue/dequeue can trigger an immediate OPP change.
If you consider that ondemand commonly uses a 20ms sample rate, while
an OPP switch quite likely never requires more than 1 or 2 ms, this
means that sched-DVFS can be much more reactive in adapting to
variable loads.
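
As a back-of-the-envelope illustration, using the figures above
(which are assumptions about typical hardware, not measurements):

```c
/* Worst-case reaction latency: a load change can land just after a
 * sample, so a sampling governor waits a full period before noticing
 * it, then pays the OPP switch cost. Figures are the ones quoted above. */
#define ONDEMAND_SAMPLE_MS 20 /* common ondemand sample rate */
#define OPP_SWITCH_MS       2 /* pessimistic OPP transition cost */

static int worst_case_reaction_ms(int sample_period_ms)
{
	return sample_period_ms + OPP_SWITCH_MS;
}
```

So ondemand can take up to ~22 ms to react, while an event-driven
governor (sampling period effectively zero, since enqueue/dequeue
triggers the evaluation) reacts in roughly the OPP switch cost alone.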

> The interactive functionality would require additional knobs. I
> don't think schedtune will allow for tuning the latency of CPU
> frequency changes (min_sample_time, above_hispeed_delay, etc).

Well, there can certainly be some limitations in the current
implementation. Indeed, the goal of this RFC is to trigger the
discussion and verify whether the overall idea makes sense and how we
can improve it.

However, regarding specifically the latency of OPP changes, there are
a couple of extensions we were thinking about:
1. link the SchedTune boost value to the % of idle headroom which
   triggers an OPP increase
2. use the SchedTune boost value to define the frequency to jump to
   when a CPU crosses the % of idle headroom

These tunables allow parameterizing the way the PELT signal for CPU
usage is interpreted by the sched-DVFS governor.
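
Extension (1) could look, very roughly, like the following sketch;
the 20% base headroom and the linear scaling are purely hypothetical
choices made for illustration:

```c
/* Hypothetical sketch of extension (1): a higher boost shrinks the idle
 * headroom, so the up-threshold which triggers an OPP increase fires
 * earlier. All names and constants here are assumptions. */
#define SCHED_CAPACITY    1024
#define BASE_HEADROOM_PCT 20 /* assumed default idle headroom */

/* returns the CPU usage level above which an OPP increase is requested */
static unsigned long opp_up_threshold(int boost_pct)
{
	int headroom_pct = BASE_HEADROOM_PCT * (100 - boost_pct) / 100;
	return SCHED_CAPACITY * (100 - headroom_pct) / 100;
}
```

With boost = 0 the governor keeps the full 20% headroom and ramps up
once usage crosses ~80% of capacity; with boost = 100 the headroom
vanishes and any usage up to full capacity is tolerated before max.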

How such tunables should be exposed and tuned is still to be discussed.
Indeed, one of the main goals of sched-DVFS, and of SchedTune
specifically, is to simplify the tuning of a platform by exposing to
userspace a reduced number of tunables, preferably just one.
 
> A separate but related concern - in the (IMO likely, given the above)
> case that folks want to tinker with that policy, it now means they're
> hacking the scheduler as opposed to a self-contained frequency policy
> plugin.

I do not agree on that point. SchedTune, as well as sched-DVFS, is a
framework quite well separated from the scheduler.
They are "consumers" of signals usually used by the scheduler, but
they do not directly affect scheduler decisions (at least in the
implementation proposed by this RFC).

Side effects are possible, of course. For example, the selection of
one OPP instead of another can affect the residency of a task on a
CPU, thus somewhat biasing some scheduler decisions. However, I think
this kind of side effect can be produced by the current governors as
well.

That said, I agree with you that one can have the impression of
hacking the scheduler, because the main compilation unit of SchedTune
is a file under kernel/sched. If this is a problem, for example from
a maintenance perspective, perhaps we can find a better location for
that code.

> Another issue with policy (but not specific to this proposal) is that
> putting a bunch of it in the CPU frequency selection may derail the
> efforts of the EAS algorithm, which I'm still working on digesting.
> Perhaps a unified sched/cpufreq policy could go there.

We have an internal extension of SchedTune which integrates it with
EAS. We have not included it in this RFC to keep things simple, by
exposing in the first instance only the generic bits which can extend
sched-DVFS features.

However, one of the main goals of this proposal is to respond to a
couple of long-standing demands (e.g. [1,2]) for:
1. a better integration of CPUFreq with the scheduler, which has all
   the required knowledge about workload demands to target both
   performance and energy efficiency
2. a simple approach to configuring a system to care more about
   performance or energy efficiency

SchedTune mainly addresses the second point. Once SchedTune is
integrated with EAS, it will provide support for deciding, in an
energy-efficient way, how much we want to reduce power or boost
performance.

> thanks,
> Steve

Thanks for the interesting feedback; this is exactly the kind of
discussion we would like to have around this initial proposal.

Cheers,
Patrick

[1] https://lkml.org/lkml/2012/5/18/91
[2] http://lwn.net/Articles/552889/

-- 
#include <best/regards.h>

Patrick Bellasi


