Re: [PATCH] sched: wake-affine throttle

From: Mike Galbraith <efault@gmx.de>
To: Michael Wang <wangyun@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>, Alex Shi <alex.shi@intel.com>,
	Namhyung Kim <namhyung@kernel.org>, Paul Turner <pjt@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Nikunj A. Dadhania" <nikunj@linux.vnet.ibm.com>,
	Ram Pai <linuxram@us.ibm.com>
Subject: Re: [PATCH] sched: wake-affine throttle
Date: Thu, 11 Apr 2013 09:30:47 +0200
Message-ID: <1365665447.19620.102.camel@marge.simpson.net>
In-Reply-To: <516651C8.307@linux.vnet.ibm.com>

On Thu, 2013-04-11 at 14:01 +0800, Michael Wang wrote: 
> On 04/10/2013 05:22 PM, Michael Wang wrote:
> > Hi, Peter
> > 
> > Thanks for your reply :)
> > 
> > On 04/10/2013 04:51 PM, Peter Zijlstra wrote:
> >> On Wed, 2013-04-10 at 11:30 +0800, Michael Wang wrote:
> >>> | 15 GB   |      32 | 35918 |   | 37632 | +4.77% | 47923 | +33.42% | 52241 | +45.45%
> >>
> >> So I don't get this... is wake_affine() once every milisecond _that_
> >> expensive?
> >>
> >> Seeing we get a 45%!! improvement out of once every 100ms that would
> >> mean we're like spending 1/3rd of our time in wake_affine()? that's
> >> preposterous. So what's happening?
> > 
> > Not all the regression was caused by overhead, adopt curr_cpu not
> > prev_cpu for select_idle_sibling() is a more important reason for the
> > regression of pgbench.
> > 
> > In other word, for pgbench, we waste time in wake_affine() and make the
> > wrong decision at most of the time, the previously patch show
> > wake_affine() do pull unrelated tasks together, that's good if current
> > cpu still cached hot data for wakee, but that's not the case of the
> > workload like pgbench.
> 
> Please let me know if I failed to express my thought clearly.
> 
> I know it's hard to figure out why throttle could bring so many benefit,
> since the wake-affine stuff is a black box with too many unmeasurable
> factors, but that's actually the reason why we finally figure out this
> throttle idea, not the approach like wakeup-buddy, although both of them
> help to stop the regression.
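
As a point of reference, the throttle under discussion can be pictured
as a per-task rate limit on the affine pull.  What follows is only a
minimal sketch under that assumption, not the code from the patch: the
class, method and field names are hypothetical, and the interval is a
plain parameter.

```python
class WakeAffineThrottle:
    """Toy model: allow at most one affine pull per task per interval.

    Hypothetical names throughout; this is not the kernel implementation.
    """

    def __init__(self, interval_ms):
        self.interval_ms = interval_ms
        # task id -> earliest time (ms) at which the next pull is allowed
        self.next_allowed_ms = {}

    def may_pull(self, task, now_ms):
        """Return True if a pull is allowed now, and arm the throttle."""
        if now_ms < self.next_allowed_ms.get(task, 0):
            return False  # throttled: leave the wakee where it is
        self.next_allowed_ms[task] = now_ms + self.interval_ms
        return True
```

With a longer interval, fewer wakeups even consider a pull, which is
the knob the benchmark numbers quoted above were varying.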

For that load, as soon as clients+server exceeds socket size, pull is
a guaranteed loser.  There simply is no way to win: some tasks must
drag their data cross node no matter what you do, because there is one
and only one source of data.  You cannot possibly do anything but harm
by pulling or in any other way disturbing task placement, because every
time you migrate someone, you force tasks to re-heat their footprint
with zero benefit to offset the cost.  That is why the closer you get
to completely killing all migration, the better your throughput gets
with this load: you're killing the cost of migration in a situation
where there simply is no gain to be had.
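
To put rough shape on that argument, here is a toy cost model.  It is
an illustration only: the function name, the parameters and all of the
numbers are assumptions, not measurements from this thread.  Each
transaction costs a fixed amount of useful work, and a migration adds
a cache re-heat cost that buys nothing.

```python
def throughput(work_us, reheat_us, migrate_prob):
    """Transactions per second for one task in a toy model.

    work_us      -- useful CPU time per transaction (assumed)
    reheat_us    -- extra time to re-heat the cache after a migration (assumed)
    migrate_prob -- fraction of wakeups that migrate the task, 0.0..1.0
    """
    if not 0.0 <= migrate_prob <= 1.0:
        raise ValueError("migrate_prob must be within [0, 1]")
    avg_cost_us = work_us + migrate_prob * reheat_us
    return 1_000_000 / avg_cost_us


# Throughput only rises as the migration rate falls toward zero.
rates = [throughput(100, 50, p) for p in (1.0, 0.5, 0.1, 0.0)]
```

Because migration has only a cost term and no benefit term in this
model, the best case is always migrate_prob == 0, which matches the
"kill all migration" trend described above.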

That's why that wakeup-buddy thingy is a ~good idea.  It will allow 1:1
buddies that can and do benefit from motion to pair up and jabber in a
shared cache (though that motion needs slowing down too), _and_ detect
the case where wakeup migration is utterly pointless.  Just killing
wakeup migration OTOH should (I'd say very emphatically will) hurt
pgbench just as much, because spreading a smallish set which could
share a cache across several nodes hurts things like pgbench via misses
just as much as any other load.  It's just that once this load (or its
ilk) doesn't fit in a node, you're absolutely screwed as far as misses
go; you will eat them because there simply is no other option.

Any migration is pointless for this thing once it exceeds socket size,
and fairness plays a dominant role.  Fairness is absolutely not
throughput's best friend when any component of a load requires more CPU
than the other components, which very definitely is the case with
pgbench.  Fairness hurts this thing a lot.  That's why pgbench took a
whopping huge hit when I fixed up select_idle_sibling() to not
completely trample fast/light communicating tasks: the fix forced
pgbench to face the consequences of a fair scheduler by cutting off its
escape routes.  Before, we searched for _any_ even ever so briefly idle
spot to place tasks, such that wakeup preemption just didn't happen,
and when we failed to pull, we instead did the very same thing on the
wakee's original socket, thus providing pgbench the fairness escape
mechanism that it needs.

When you wake to idle cores, you do not have a nanosecond resolution
ultra fair scheduler, with the fairness price to be paid: tasks run as
long as they want to run, or at least full ticks, which of course makes
the hard working load components a lot more productive.  Hogs can be
hogs.  For pgbench run in 1:N mode, the hardest working load component
is the mother of all work, the (singular) server.  Any time 'mom' is not
continuously working her little digital a$$ off to keep all those kids
fed, you have a performance problem on your hands: the entire load
stalls, it lives and dies with the one and only 'mom'.
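
A back-of-the-envelope sketch of the 1:N bottleneck, under loudly
stated assumptions: the server is the only limit on throughput, and a
perfectly fair scheduler gives it an equal slice of whatever CPU it
shares.  The function names and numbers are hypothetical.

```python
def server_share(competitors):
    """CPU fraction the server gets when sharing a CPU fairly.

    competitors -- other runnable tasks on the server's CPU (assumed);
                   0 models the server waking to an idle core.
    """
    if competitors < 0:
        raise ValueError("competitors must be >= 0")
    return 1.0 / (competitors + 1)


def load_throughput(server_cost_us, competitors):
    """Transactions per second if only the server limits the load."""
    return server_share(competitors) * 1_000_000 / server_cost_us


alone = load_throughput(50, 0)    # server has a core to itself
shared = load_throughput(50, 1)   # server splits a CPU with one client
```

In this model every client stalls behind the server, so halving the
server's CPU share halves the throughput of the whole load.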

-Mike
