linux-kernel.vger.kernel.org archive mirror
From: Arjan van de Ven <arjan@linux.intel.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>,
	mingo@kernel.org, vincent.guittot@linaro.org,
	preeti@linux.vnet.ibm.com, alex.shi@intel.com, efault@gmx.de,
	pjt@google.com, len.brown@intel.com, corbet@lwn.net,
	akpm@linux-foundation.org, torvalds@linux-foundation.org,
	tglx@linutronix.de, catalin.marinas@arm.com,
	linux-kernel@vger.kernel.org, linaro-kernel@lists.linaro.org
Subject: Re: [RFC][PATCH 0/9] sched: Power scheduler design proposal
Date: Sat, 13 Jul 2013 07:40:08 -0700	[thread overview]
Message-ID: <51E166C8.3000902@linux.intel.com> (raw)
In-Reply-To: <20130713064909.GW25631@dyad.programming.kicks-ass.net>

On 7/12/2013 11:49 PM, Peter Zijlstra wrote:
>
> Arjan; from reading your emails you're mostly busy explaining what cannot be
> done. Please explain what _can_ be done and what Intel wants. From what I can
> see you basically promote a max P state max concurrency race to idle FTW.

>
> Since you can't say what the max P state is; and I think I understand the
> reasons for that, and the hardware might not even respect the P state you tell
> it to run at, does it even make sense to talk about Intel P states? When would
> you not program the max P state?

this is where it gets complicated ;-(
whether race-to-idle wins depends on the type of code that is running: if things are memory bound it's outright
not true, but for compute-bound code it often is.

What I would like to see is

1) Move the idle predictor logic into the scheduler, or at least a library
    (I'm not sure the scheduler can do better than the current code, but it might,
     and what menu does today is at least worth putting in some generic library)

2) An interface between scheduler and P state code in the form of (and don't take the names as actual function names ;-)
     void arch_please_go_fastest(void);   /* or maybe int cpunr as argument, but that's harder to implement */
     int arch_can_you_go_faster(void);  /* if the scheduler would like to know this instead of load balancing .. unsure */
     unsigned long arch_instructions_executed(void); /* like tsc, but on instructions, so the scheduler can account actual work done */

   the first one is for the scheduler to call when it sees a situation of "we care deeply about performance now" coming,
   for example near overload, or when a realtime (or otherwise high priority) task gets scheduled.
   the second one I am dubious about, but maybe you have a use for it; some folks think that there is value in
   ramping up the performance rather than load balancing. For load balancing to an idle cpu, I don't see that
   value (in terms of power efficiency), but I do see a case where your 2 cores happen to be busy (some sort of thundering
   herd effect) but imbalanced; in that case, going faster rather than rebalancing... I can certainly see the point.
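To make the shape of interface 2) concrete, here is a minimal userspace sketch. The function names follow the proposal above (which explicitly says they are not final); the stub bodies, the `max_pstate_requested` flag, and the `sched_check_performance()` call site are invented for illustration and are not real kernel code:

```c
#include <stdbool.h>

/* Illustrative state standing in for "the arch driver asked the
 * hardware for its maximum P state". */
static bool max_pstate_requested;

void arch_please_go_fastest(void)
{
	/* arch hook: request the maximum P state right now */
	max_pstate_requested = true;
}

int arch_can_you_go_faster(void)
{
	/* arch hook: is there still headroom to ramp up? (stubbed) */
	return max_pstate_requested ? 0 : 1;
}

/* Hypothetical scheduler-side call site: near overload, or when an
 * RT/high-priority task shows up, ask for full speed. */
void sched_check_performance(int nr_running, int nr_cpus, bool rt_task_queued)
{
	if (rt_task_queued || nr_running >= nr_cpus)
		arch_please_go_fastest();
}
```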

3) an interface from the C state hardware driver to the scheduler to say "oh btw, the LLC got flushed, forget about past
    cache affinity". The C state driver can sometimes know this... Linux today tries to keep affinity anyway,
    while we could get more optimal results by being allowed to balance more freely.

4) this is the most important one, but also the hardest one:
    An interface from the scheduler that says "we are performance sensitive now":
    void arch_sched_performance_sensitive(int duration_ms);

    I've put a duration as argument, rather than an "arch_no_longer_sensitive", to avoid the scheduler having to run some
    periodic timer/whatever to keep this alive; rather it is sort of a "lease" that the scheduler can renew as often as it
    wants, but it auto-expires eventually.

    with this, the hardware and/or hardware drivers can apply a performance bias in their decisions based on what
    is actually the driving force behind both P and C state decisions: performance sensitivity.
    (all the utilization estimation that menu, but also the P state drivers, do is really estimating how sensitive we are to
    performance, and if we're not sensitive, considering sacrificing some performance for power. Even with race-to-halt,
    sometimes sacrificing a little performance gives a power benefit at the top of the range)
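A toy model of the lease semantics described above. The clock is passed in explicitly so the sketch stays testable; a real implementation would read the time itself, and both function names are only illustrative:

```c
/* Lease-style "performance sensitive" hint: renewable, auto-expiring.
 * 0 means "no lease held". */
static long perf_lease_expires_ms;

void arch_sched_performance_sensitive(long now_ms, int duration_ms)
{
	long until = now_ms + duration_ms;

	/* renew/extend the lease; never shorten an existing one */
	if (until > perf_lease_expires_ms)
		perf_lease_expires_ms = until;
}

/* P/C state code consults this instead of waiting to be told "no longer
 * sensitive"; if the scheduler stops renewing, the bias simply lapses. */
int sched_is_performance_sensitive(long now_ms)
{
	return now_ms < perf_lease_expires_ms;
}
```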

>
> IIRC you at one point said there was a time limit below which concurrency
> spread wasn't useful anymore?

there is a time below which waking up a core (as opposed to a hyperthread sibling; waking that is ALWAYS worth it since it's insanely cheap)
is not worth it.
Think in the order of "+/- 50 microseconds".
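As a toy decision function, under the assumption that the ~50 microsecond figure is an order-of-magnitude break-even point rather than a hard constant:

```c
/* Rough break-even for waking a full core, per the +/- 50us figure
 * above; order of magnitude only, not something to hard-code. */
#define CORE_WAKE_BREAKEVEN_US 50

int worth_waking(int is_hyperthread_sibling, int expected_busy_us)
{
	if (is_hyperthread_sibling)
		return 1;	/* insanely cheap: always worth it */

	/* a full core only pays for its wakeup if the work outlasts it */
	return expected_busy_us > CORE_WAKE_BREAKEVEN_US;
}
```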


> Also, most what you say for single socket systems; what does Intel want for
> multi-socket systems?

for multisocket, rule number one is "don't screw up NUMA".
for tasks where NUMA matters, that's the top priority.
beyond that, experiments seem to show that grouping "a little" helps.
Say on a 2x 4 core system, it's worth running the first 2 tasks on the same package,
but after that we need to start considering the 2nd package.
I have to say that we don't have quite enough data yet to figure out where this cutoff is;
most of the microbenchmarks here have been done with fspin, which by design has zero cache
footprint or memory use... and the whole damage side of grouping (and thus the reason for spreading)
is in the sharing of caches and memory bandwidth.
(if you end up thrashing the cache, the power you burn by losing efficiency there is not easy to win back
by placement)
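The grouping cutoff described above, as a toy placement function for a 2-package system. The limit of 2 tasks per package reflects the experimental observation in the text, not a tuned or general constant:

```c
/* Group "a little": pack the first tasks onto package 0, then start
 * spreading to protect shared cache and memory bandwidth.  The cutoff
 * of 2 is only the observation from the 2x 4 core experiments. */
#define PKG_GROUP_LIMIT 2

int pick_package(const int nr_running_per_pkg[2])
{
	/* keep grouping on package 0 while below the cutoff ... */
	if (nr_running_per_pkg[0] < PKG_GROUP_LIMIT)
		return 0;

	/* ... then prefer the emptier package */
	return nr_running_per_pkg[0] <= nr_running_per_pkg[1] ? 0 : 1;
}
```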



