From: Michael Wang <wangyun@linux.vnet.ibm.com>
To: Mike Galbraith <efault@gmx.de>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Namhyung Kim <namhyung@kernel.org>, Alex Shi <alex.shi@intel.com>,
	Paul Turner <pjt@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"Nikunj A. Dadhania" <nikunj@linux.vnet.ibm.com>,
	Ram Pai <linuxram@us.ibm.com>
Subject: Re: [RFC PATCH] sched: wake-affine throttle
Date: Mon, 25 Mar 2013 18:21:07 +0800
Message-ID: <51502513.6010607@linux.vnet.ibm.com>
In-Reply-To: <1364203359.4559.66.camel@marge.simpson.net>

Hi, Mike

Thanks for your reply :)

On 03/25/2013 05:22 PM, Mike Galbraith wrote:
> On Mon, 2013-03-25 at 13:24 +0800, Michael Wang wrote: 
>> Recent testing shows that the wake-affine logic causes a regression on pgbench;
>> the hidden rat has finally been caught.
>>
>> wake-affine always tries to pull the wakee close to the waker. In theory,
>> this benefits us if the waker's cpu has cached hot data for the wakee, or in
>> the extreme ping-pong case.
>>
>> However, the whole thing is done somewhat blindly: there is no examination of
>> the relationship between waker and wakee, and since the logic itself
>> is time-consuming, some workloads suffer; pgbench is just the one that
>> has been found.
>>
>> Thus, throttling wake-affine for such workloads is necessary.
>>
>> This patch introduces a new knob 'sysctl_sched_wake_affine_interval' with a
>> default value of 1ms, which means wake-affine only takes effect once per 1ms,
>> which is usually the minimum balance interval (the idea is that the rate of
>> wake-affine should at least be lower than the rate of load-balance).
>>
>> By turning the new knob, workloads that suffer will have the chance to
>> stop the regression.
> 
> I wouldn't do it quite this way.  Per task, yes (I suggested that too,
> better agree;), but for one, using jiffies in the scheduler when we have
> a spiffy clock sitting there ready to use seems wrong,

Well, I took the approach from the load-balance code; it is an existing way
to handle intervals, and I just tried to stay consistent with it...
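
For readers following along, the interval idea can be sketched in
user-space C as below. This is only a minimal sketch under assumptions,
not the code from the patch: struct task, next_wake_affine,
tick_after_eq() and wake_affine_allowed() are hypothetical names, and a
plain tick counter stands in for jiffies. The wrap-safe comparison follows
the same style the kernel uses for jiffies-based intervals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a jiffies-like tick counter. */
typedef unsigned long ticks_t;

/* Hypothetical per-task state: earliest tick at which a pull is allowed. */
struct task {
	ticks_t next_wake_affine;
};

/* Wrap-safe "a is at or after b", in the style of the kernel's time_after_eq(). */
static bool tick_after_eq(ticks_t a, ticks_t b)
{
	return (long)(a - b) >= 0;
}

/*
 * Return true if an affine pull may happen now, and if so arm the
 * throttle so the next pull is allowed no earlier than now + interval.
 * An interval of 0 never throttles.
 */
static bool wake_affine_allowed(struct task *p, ticks_t now, ticks_t interval)
{
	assert(p != NULL);

	if (!tick_after_eq(now, p->next_wake_affine))
		return false;	/* still inside the throttle window */

	p->next_wake_affine = now + interval;
	return true;
}
```

With an interval of 5 ticks, a call at tick 10 is allowed, a call at tick 12
is throttled, and a call at tick 15 is allowed again.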

> and secondly,
> when you've got a bursty load, not pulling can hurt like hell.  Alex
> encountered that while working on his patch set.
> 
>> Test:
>> 	Tested on a 12-cpu x86 server with tip 3.9.0-rc2.
>>
>> 	The default 1ms interval brings limited performance improvement (<5%)
>> 	for pgbench; significant improvement starts to show when turning the
>> 	knob to 100ms.
> 
> So it seems you'd be better served by an on/off switch for this load.
> 100ms in the future for many tasks is akin to a human todo list entry
> scheduled for Solar radius >= Earth orbital radius day ;-)

Do you mean the 1ms interval is still too big, and you would prefer to have
a 0 option?

> 
>> 			    original	100ms	
>>
>> 	| db_size | clients |  tps  |	|  tps  |
>> 	+---------+---------+-------+   +-------+
>> 	| 21 MB   |       1 | 10572 |   | 10675 |
>> 	| 21 MB   |       2 | 21275 |   | 21228 |
>> 	| 21 MB   |       4 | 41866 |   | 41946 |
>> 	| 21 MB   |       8 | 53931 |   | 55176 |
>> 	| 21 MB   |      12 | 50956 |   | 54457 |	+6.87%
>> 	| 21 MB   |      16 | 49911 |   | 55468 |	+11.11%
>> 	| 21 MB   |      24 | 46046 |   | 56446 |	+22.59%
>> 	| 21 MB   |      32 | 43405 |   | 55177 |	+27.12%
>> 	| 7483 MB |       1 |  7734 |   |  7721 |
>> 	| 7483 MB |       2 | 19375 |   | 19277 |
>> 	| 7483 MB |       4 | 37408 |   | 37685 |
>> 	| 7483 MB |       8 | 49033 |   | 49152 |
>> 	| 7483 MB |      12 | 45525 |   | 49241 |	+8.16%
>> 	| 7483 MB |      16 | 45731 |   | 51425 |	+12.45%
>> 	| 7483 MB |      24 | 41533 |   | 52349 |	+26.04%
>> 	| 7483 MB |      32 | 36370 |   | 51022 |	+40.28%
>> 	| 15 GB   |       1 |  7576 |   |  7422 |
>> 	| 15 GB   |       2 | 19157 |   | 19176 |
>> 	| 15 GB   |       4 | 37285 |   | 36982 |
>> 	| 15 GB   |       8 | 48718 |   | 48413 |
>> 	| 15 GB   |      12 | 45167 |   | 48497 |	+7.37%
>> 	| 15 GB   |      16 | 45270 |   | 51276 |	+13.27%
>> 	| 15 GB   |      24 | 40984 |   | 51628 |	+25.97%
>> 	| 15 GB   |      32 | 35918 |   | 51060 |	+42.16%
> 
> The benefit you get with not pulling is two fold at least, first and
> foremost it keeps the forked off clients the hell away from the mother
> of all work so it can keep the kids fed.  Second, you keep the load
> spread out, which is the only way the full box sized load can possibly
> perform in the first place.  The full box benefit seems clear from the
> numbers.. hard working server can compete best for its share when it's
> competing against the same set of clients, that's likely why you have to
> set the knob to 100ms to get the big win.

Actually, 10ms also gives up to around 27% improvement; I used
100ms since the result looks more significant...
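
For reference, the percentage column in the quoted table appears to be the
relative tps gain of the 100ms run over the original kernel. A small sketch
of that arithmetic in C follows; tps_gain_percent() is a hypothetical helper
name, not something from the patch.

```c
#include <assert.h>

/* Relative change of new_tps over base_tps, in percent. */
static double tps_gain_percent(double base_tps, double new_tps)
{
	assert(base_tps > 0.0);
	return (new_tps - base_tps) / base_tps * 100.0;
}
```

For the 21 MB, 32-client row (43405 tps originally, 55177 tps with 100ms)
this gives roughly 27.12, which matches the table.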

I haven't tried intervals between 1ms and 10ms, but I suppose the benefit
follows some kind of parabola: it changes smoothly rather than suddenly.

> 
> With small burst loads of short running tasks, even things like pgbench
> will benefit from pulling to local llc more frequently than 100ms, iff
> burst does not exceed socket size.  That pulling is not completely evil,
> it automagically consolidates your mostly idle NUMA box to its most
> efficient task placement for both power saving and throughput, so IMHO,
> you can't just let tasks sit cross node over ~extended idle periods
> without doing harm.

I see, and that is actually the reason for this proposal: it tries to
preserve all the possible benefit of wake-affine while providing a way to
control how often it happens.

I think your point here is still that we need a 0 option, is that correct?

> 
> OTOH, if the box is mostly busy, or if there's only one llc, per task
> pick a smallish number is dirt simple, and should be a general case
> improvement over the overly 1:1 buddy centric current behavior.
> 
> Have you tried the patch set by Alex Shi?   In my fiddling with them,
> they put a very big dent in the evil side of select_idle_sibling() and
> affine wakeups in general, and should help pgbench and ilk heaping
> truckloads.

I haven't tried it yet. Either way, a way to throttle wake-affine is
necessary, since we have proof that it hurts some workloads.

OTOH, in theory, wake-affine may pull unrelated tasks together,
and we can't promise this will help the system all the time, can we?

Regards,
Michael Wang

> 
> -Mike

