Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal


From: Arjan van de Ven <arjan@linux.intel.com>
To: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: "mingo@kernel.org" <mingo@kernel.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"vincent.guittot@linaro.org" <vincent.guittot@linaro.org>,
	"preeti@linux.vnet.ibm.com" <preeti@linux.vnet.ibm.com>,
	"alex.shi@intel.com" <alex.shi@intel.com>,
	"efault@gmx.de" <efault@gmx.de>,
	"pjt@google.com" <pjt@google.com>,
	"len.brown@intel.com" <len.brown@intel.com>,
	"corbet@lwn.net" <corbet@lwn.net>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"torvalds@linux-foundation.org" <torvalds@linux-foundation.org>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	Catalin Marinas <Catalin.Marinas@arm.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linaro-kernel@lists.linaro.org" <linaro-kernel@lists.linaro.org>,
	rafael.j.wysocki@intel.com
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal
Date: Fri, 12 Jul 2013 08:35:59 -0700
Message-ID: <51E0225F.7090509@linux.intel.com>
In-Reply-To: <20130712124612.GE20960@e103034-lin>

On 7/12/2013 5:46 AM, Morten Rasmussen wrote:

> I have had a quick look at intel_pstate.c and to me it seems that it can
> be turned into a power driver that uses the proposed interface with a
> few modifications. intel_pstate.c already has max and min P-state as
> well as a current P-state calculated using the aperf/mperf ratio. I

it calculates average frequency... not current p state.
first of all, it's completely and strictly backwards looking
(and in the light of this being used in a load balancing decision,
the past is NOT a predictor for the future since you're about to change the maximum)
and second, in the light of having idle time... you do not get what you
think you get ;-)
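
To make the "average, not current" point concrete, here is a minimal sketch of how an average frequency falls out of APERF/MPERF deltas. The helper name avg_freq_khz and the numbers are assumptions for illustration, not code from intel_pstate.c. Both counters only advance while the CPU is in C0, so idle time inside the sampling window is invisible to the ratio.

```c
#include <stdint.h>

/*
 * Average effective frequency over a sampling window, in kHz.
 * base_khz is the reference frequency that MPERF counts at; the deltas
 * are the differences between two reads of the APERF and MPERF counters.
 * Hypothetical helper for illustration only.
 */
static uint64_t avg_freq_khz(uint64_t base_khz, uint64_t aperf_delta,
                             uint64_t mperf_delta)
{
    if (mperf_delta == 0)
        return 0; /* the CPU never left idle during the window */

    /* Multiply first to keep precision; fine for realistic window sizes. */
    return base_khz * aperf_delta / mperf_delta;
}
```

As a worked case with assumed numbers: on a 2 GHz base part, 10 ms spent at 1 GHz followed by 10 ms at 3 GHz gives an APERF delta of 40,000,000 and an MPERF delta of 40,000,000, so the helper reports 2,000,000 kHz even though the CPU is at 3 GHz when the window ends. The result is the same whether the window also contained no idle time or 80 ms of it.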

>
> In the first case, the power scheduler would not know about turbo mode
> and never request it. Turbo mode could still be used by the power driver
> as a hidden bonus when power scheduler requests max power.

but what do you do when you ask for low power? On Intel.. for various cases,
you also pick a high P state!

(the assumption "low P state == low power" and "high P state == high power"
is just not valid)

>
> In the second approach, the power scheduler may request power (P-state)
> that can only be provided by a turbo P-state. Since we cannot be
> guaranteed to get that, the power driver would return the power
> (P-state) that is guaranteed (or at least very likely)

even non-turbo is very likely to not be achievable in various very
common situations. Two years ago I would have said, sure, but today,
it's just not the case anymore.

> I understand that the difference between highest guaranteed P-state and
> highest potential P-state is likely to increase in the future. Without
> any feedback about what potential P-state we can approximately get, we
> can only pack tasks until we hit the load that can be handled at the
> highest guaranteed P-state.

the only highest guaranteed P state is... the lowest P state. Sorry.
Everything else is subject to thermal management and hardware policies.

> I believe that there already is a power limit notification mechanism on
> Intel that can notify the OS when the firmware chooses a lower P-state
> than the one requested by the OS.

and we turn that off to avoid interrupt floods.....

> You (or Rafael) mentioned in our previous discussion that you are
> working on an improved intel_pstate driver. Will that be fundamentally
> different from the current one?

yes.
the hardware has been changing, and will be changing more (at a faster rate),
and we'll have very different algorithms for the different generations.

For example, for the recently launched client Haswell (think Ultrabook) the
system idle power is going down by about a factor of 20 compared to the previous
generation (e.g. what you'd buy a month ago).
With that change, the rules about when to go fast and when not to are changing dramatically....
since going faster means you'll get to the low power state faster (even on previous generations that
effect is there, but with lower power in idle, this just gets stronger).
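
The arithmetic behind this can be sketched with a simple two-level model: over a fixed window the CPU is either busy at one power level or idle at another. The function name window_energy_uj and every number below are made up for illustration; they are not measurements of any Intel part.

```c
#include <stdint.h>

/*
 * Energy in microjoules spent over a window of window_ms milliseconds when
 * the CPU is busy for busy_ms at active_mw milliwatts and idle for the rest
 * of the window at idle_mw milliwatts (mW * ms = uJ).
 * Hypothetical model for illustration only.
 */
static uint64_t window_energy_uj(uint64_t window_ms, uint64_t busy_ms,
                                 uint64_t active_mw, uint64_t idle_mw)
{
    if (busy_ms > window_ms)
        busy_ms = window_ms; /* cannot be busy longer than the window */

    return busy_ms * active_mw + (window_ms - busy_ms) * idle_mw;
}
```

Take a 100 ms window in which the same work needs either 40 ms at 4000 mW ("slow") or 20 ms at 7000 mW ("fast"). With an assumed idle power of 2000 mW, slow costs 280,000 uJ and fast costs 300,000 uJ, so slow wins. Cut the idle power by a factor of 20 to 100 mW and slow costs 166,000 uJ while fast costs 148,000 uJ, so fast wins. In this model the 20 ms of extra idle time bought by going fast is charged at idle power, which is why the case for going fast strengthens as idle power drops.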

> I agree that packing is not a good idea for cache or memory bound tasks.
> It is not any different on dual cluster ARM setups like big.LITTLE. But,
> we do see a lot of benefit in packing small tasks which are not cache or
> memory bound, or performance critical. Keeping them on as few cpus as
> possible means that the rest can enter deeper C-states for longer.

I totally agree with the idea of *statistically* grouping short running tasks.
But... this can be done VERY simply without such explicit "how many do we need".
All you need to do is to do a statistical "sort left", e.g. if a short running task
wants to run (that by definition has not run for a while, so is cache cold anyway),
make it prefer the lowest numbered idle cpu to wake up on.
Heck, even making it just prefer only cpu 0 when it's idle will by and large already achieve
this.
Remember that you don't have to be perfect; no point trying to move tasks that never run in your
management time window; only the ones that actually want to run need management.
And at the "I want to run" time, you can just sort it left.
(and this is fine for tasks that run short; all the numa/etc logic value kicks in for tasks that do
some serious amounts of work and thus by definition run for longer stretches)
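
A minimal sketch of the "sort left" wakeup choice described above, assuming at most 64 CPUs represented as bitmasks. The names sort_left_pick_cpu and wake_cpu, and the is_short_running flag, are hypothetical; this is not the kernel's select_idle_sibling() or any existing scheduler code.

```c
#include <stdint.h>

/*
 * Return the lowest-numbered CPU that is both idle and allowed for the
 * waking task, or -1 if there is none. Bit n of each mask stands for CPU n.
 */
static int sort_left_pick_cpu(uint64_t idle_mask, uint64_t allowed_mask)
{
    uint64_t candidates = idle_mask & allowed_mask;
    int cpu;

    for (cpu = 0; cpu < 64; cpu++) {
        if (candidates & (UINT64_C(1) << cpu))
            return cpu;
    }
    return -1;
}

/*
 * Wakeup-time placement: only short running tasks are sorted left; longer
 * running tasks keep their previous CPU here, standing in for whatever the
 * regular NUMA/cache-aware logic would decide.
 */
static int wake_cpu(int prev_cpu, int is_short_running,
                    uint64_t idle_mask, uint64_t allowed_mask)
{
    int cpu;

    if (!is_short_running)
        return prev_cpu;

    cpu = sort_left_pick_cpu(idle_mask, allowed_mask);
    return cpu >= 0 ? cpu : prev_cpu;
}
```

Nothing is migrated proactively in this sketch: the decision is made only when a task actually wants to run, which matches the point that tasks that never run in the management window need no management.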

What you don't want to do, is run tasks sequentially that could have run in parallel. That's the best
way to destroy power efficiency in multicore systems ;-(
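
A rough way to see why serializing hurts, under a deliberately simple assumed model: the package draws a shared package_mw whenever at least one core is running, plus core_mw for each running core. The function names and numbers are illustrative only, and energy spent after the work completes is ignored.

```c
#include <stdint.h>

/* Energy (uJ) to run n equal tasks of run_ms each, one after another on one core. */
static uint64_t sequential_energy_uj(uint64_t n, uint64_t run_ms,
                                     uint64_t package_mw, uint64_t core_mw)
{
    return n * run_ms * (package_mw + core_mw);
}

/* Energy (uJ) to run the same n tasks side by side on n cores. */
static uint64_t parallel_energy_uj(uint64_t n, uint64_t run_ms,
                                   uint64_t package_mw, uint64_t core_mw)
{
    return run_ms * (package_mw + n * core_mw);
}
```

With two 10 ms tasks, an assumed 3000 mW of shared package power and 1000 mW per core, running sequentially costs 80,000 uJ while running in parallel costs 50,000 uJ. The difference is exactly the extra time the shared part of the package is kept awake.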

And to be honest, the effect of per logical CPU C states is much smaller on Intel than the effect
of global idle (in Intel terms, "package C states"). The break even points of CPU core states are
extremely short for us, even for the deepest states. The bigger bang for the buck is with system wide
idle, so that memory can go to self refresh (and the memory controllers/etc can be turned off).
The break even point for those kinds of things is longer, and that's where wakeups/etc make a much bigger dent.
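
The break-even idea can be sketched as a selection rule: a state is only worth entering if the expected idle period is at least as long as its break-even time. The struct, the function name pick_idle_state, and any values fed to it are assumptions for illustration; Linux cpuidle expresses a similar notion through the target_residency field of struct cpuidle_state, but this is not that code.

```c
#include <stddef.h>
#include <stdint.h>

/* One idle state with the minimum idle time needed to pay back its entry/exit cost. */
struct idle_state {
    const char *name;
    uint32_t break_even_us;
};

/*
 * states[] is ordered from shallowest to deepest, with non-decreasing
 * break_even_us. Returns the index of the deepest state whose break-even
 * time fits into the predicted idle period, or -1 if none does.
 */
static int pick_idle_state(const struct idle_state *states, size_t count,
                           uint32_t predicted_idle_us)
{
    int chosen = -1;
    size_t i;

    for (i = 0; i < count; i++) {
        if (states[i].break_even_us > predicted_idle_us)
            break; /* this state and all deeper ones would not pay off */
        chosen = (int)i;
    }
    return chosen;
}
```

If the core states have very short break-even times and the system-wide state (memory in self refresh) has a much longer one, almost any idle period qualifies for a deep core state, while a stray wakeup that shortens the predicted idle period is enough to rule out the system-wide state.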

> BTW. Packing one strictly memory bound task and one strictly cpu bound
> task on one socket might work. The only problem is to determine the task
> characteristics ;-)

yeah "NUMA is hard, let's go shopping" for sure.
