public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Vincent Guittot <vincent.guittot@linaro.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org,
	dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com,
	yuyang.du@intel.com, pjt@google.com, bsegall@google.com
Subject: Re: [PATCH v2] sched/fair: update scale invariance of PELT
Date: Tue, 11 Apr 2017 09:52:21 +0200	[thread overview]
Message-ID: <20170411075221.GA30421@linaro.org> (raw)
In-Reply-To: <20170410173802.orygigjbcpefqtdv@hirez.programming.kicks-ass.net>

Le Monday 10 Apr 2017 à 19:38:02 (+0200), Peter Zijlstra a écrit :
> 
> Thanks for the rebase.
> 
> On Mon, Apr 10, 2017 at 11:18:29AM +0200, Vincent Guittot wrote:
> 
> Ok, so let me try and paraphrase what this patch does.
> 
> So consider a task that runs 16 out of our 32ms window:
> 
>    running   idle
>   |---------|---------|
> 
> 
> You're saying that when we scale running with the frequency, suppose we
> were at 50% freq, we'll end up with:
> 
>    run  idle
>   |----|---------|
> 
> 
> Which is obviously a shorter total then before; so what you do is add
> back the lost idle time like:
> 
>    run  lost idle
>   |----|----|---------|
> 
> 
> to arrive at the same total time. Which seems to make sense.

Yes

> 
> Now I have vague memories of Morten having issues with your previous
> patches, so I'll wait for him to chime in as well.

IIRC, Morten's concerns were about the lost idle time which was not taken into account in previous version.

> 
> 
> On to the implementation:
> 
> >  /*
> > + * Scale the time to reflect the effective amount of computation done during
> > + * this delta time.
> > + */
> > +static __always_inline u64
> > +scale_time(u64 delta, int cpu, struct sched_avg *sa,
> > +		unsigned long weight, int running)
> > +{
> > +	if (running) {
> > +		sa->stolen_idle_time += delta;
> > +		/*
> > +		 * scale the elapsed time to reflect the real amount of
> > +		 * computation
> > +		 */
> > +		delta = cap_scale(delta, arch_scale_freq_capacity(NULL, cpu));
> > +		delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, cpu));
> > +
> > +		/*
> > +		 * Track the amount of stolen idle time due to running at
> > +		 * lower capacity
> > +		 */
> > +		sa->stolen_idle_time -= delta;
> 
> OK so far so good, this tracks, in stolen_idle_time, the 'lost' bit from
> above.
> 
> > +	} else if (!weight) {
> > +		if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
> 
> But here I'm completely lost. WTF just happened ;-)
> 
> Firstly, I think we want a comment on why we care about the !weight
> case. Why isn't !running sufficient?

We track the time when the task is "really" idle but not the time that the task spent
to wait for running on the CPU. Running is used to detect when the task is really
running and how much idle time has been lost while weight is used to detect when the
task is back to sleep state and when we have account the lost idle time.

>
> 
> Secondly, what's up with the util_sum < LOAD_AVG_MAX * 1000 thing?

The lost idle time makes sense only if the task can also be "idle" when running at max
capacity. When util_sum reaches the LOAD_AVG_MAX*SCHED_CAPACITY_SCALE value, all tasks
are considered to be the same as we can't make any difference between a task running
400ms or a task running 400sec. It means that these tasks are "always running" tasks
even at max capacity. In this case, there is no lost idle time as they always run and
tracking and adding back the lost idle time because we run at lower capacity doesn't
make sense anymore so we discard it. Then an always running task can have a util_sum
that is less than the max value because of the rounding (util_avg varies between
[1006..1023]), so I use LOAD_AVG_MAX*1000 instead of LOAD_AVG_MAX*1024

> 
> Is that to deal with cpu_capacity?
> 
> 
> > +			/*
> > +			 * Add the idle time stolen by running at lower compute
> > +			 * capacity
> > +			 */
> > +			delta += sa->stolen_idle_time;
> > +		}
> > +		sa->stolen_idle_time = 0;
> > +	}
> > +
> > +	return delta;
> > +}
> 
> 
> Thirdly, I'm thinking this isn't quite right. Imagine a task that's
> running across a decay window, then we'll only add back the stolen_idle
> time in the next window, even though it should've been in this one,
> right?

I don't think so because the PELT should not see more decay window at half capacity
than at max capacity.
In the example below we can see that we cross the absolute time decay window when
running at half capacity but once we scale the running delta time we don't cross
it anymore and the update of PELT is done in the same manner in both case

decay window |-------|-------|-------|--
max capacity ---xxxx------------xxxx----
update         *   *           *   *

half capacity---xxxxxxxx--------xxxxxxxx
accounted    ---xxxx____--------xxxx____
update         *   *           *   *

x running
- sleep
_ lost idle time

  reply	other threads:[~2017-04-11  7:52 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-10  9:18 [PATCH v2] sched/fair: update scale invariance of PELT Vincent Guittot
2017-04-10 17:38 ` Peter Zijlstra
2017-04-11  7:52   ` Vincent Guittot [this message]
2017-04-11  8:53     ` Peter Zijlstra
2017-04-11  9:40       ` Vincent Guittot
2017-04-11 10:41         ` Peter Zijlstra
2017-04-11 10:49           ` Peter Zijlstra
2017-04-11 13:09             ` Vincent Guittot
2017-04-12 11:28               ` Peter Zijlstra
2017-04-12 14:50                 ` Vincent Guittot
2017-04-12 15:44                   ` Peter Zijlstra
2017-04-13  9:42                     ` Vincent Guittot
2017-04-13 13:32                 ` Peter Zijlstra
2017-04-13 14:59                   ` Vincent Guittot
2017-04-13 18:06                     ` Peter Zijlstra
2017-04-14  8:47                       ` Vincent Guittot
2017-04-11 12:08           ` Vincent Guittot
2017-04-11  9:12     ` Peter Zijlstra
2017-04-11  9:46       ` Vincent Guittot
2017-04-13 13:39     ` Peter Zijlstra
2017-04-13 15:16       ` Vincent Guittot
2017-04-13 16:13         ` Peter Zijlstra
2017-04-14  8:49           ` Vincent Guittot
2017-04-19 16:31             ` Vincent Guittot
2017-04-28 15:52 ` Morten Rasmussen
2017-04-28 17:08   ` Dietmar Eggemann
2017-05-03 17:11   ` Vincent Guittot
2017-04-28 22:09 ` Peter Zijlstra
2017-05-01  9:00   ` Peter Zijlstra
2017-05-02 13:38     ` Vincent Guittot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170411075221.GA30421@linaro.org \
    --to=vincent.guittot@linaro.org \
    --cc=Morten.Rasmussen@arm.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=yuyang.du@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox