* RE: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nakajima, Jun @ 2004-03-25 15:31 UTC
To: Andi Kleen, Rick Lindsley
Cc: Ingo Molnar, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi,

Can you be more specific about "it doesn't load balance threads
aggressively enough"? Or what behavior of the base NUMA scheduler is
missing in the sched-domain scheduler, especially for NUMA?

Jun

>-----Original Message-----
>From: Andi Kleen [mailto:ak@suse.de]
>Sent: Thursday, March 25, 2004 3:47 AM
>To: Rick Lindsley
>Cc: Andi Kleen; Ingo Molnar; piggin@cyberone.com.au;
>linux-kernel@vger.kernel.org; akpm@osdl.org; kernel@kolivas.org;
>rusty@rustcorp.com.au; Nakajima, Jun; anton@samba.org;
>lse-tech@lists.sourceforge.net; mbligh@aracnet.com
>Subject: Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
>
>On Thu, Mar 25, 2004 at 03:40:22AM -0800, Rick Lindsley wrote:
>>     The main problem it has is that it performs quite badly on Opteron NUMA
>>     e.g. in the OpenMP STREAM test (much worse than the normal scheduler)
>>
>> Andi, I've got some schedstat code which may help us to understand why.
>> I'll need to port it to Ingo's changes, but if I drop you a patch in a
>> day or two can you try your test on sched-domain/non-sched-domain,
>> collecting the stats?
>
>The OpenMP failure is already pretty well understood - it doesn't load
>balance threads aggressively enough over CPUs after startup.
>
>-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-25 15:40 UTC
To: Nakajima, Jun
Cc: Andi Kleen, Rick Lindsley, Ingo Molnar, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Thu, Mar 25, 2004 at 07:31:37AM -0800, Nakajima, Jun wrote:
> Andi,
>
> Can you be more specific about "it doesn't load balance threads
> aggressively enough"? Or what behavior of the base NUMA scheduler is
> missing in the sched-domain scheduler, especially for NUMA?

It doesn't do load balancing in wake_up_forked_process() and is
relatively non-aggressive in balancing later. This leads to the
multithreaded OpenMP STREAM running its children first on the same node
as the original process and allocating memory there. When the balancing
finally happens they run on a different node, but generate cross traffic
to the old node instead of using the memory bandwidth of their local
nodes.

The difference is very visible: even the 4-thread STREAM only sees the
bandwidth of a single node. With a more aggressive scheduler you get 4
times as much. Admittedly it's a bit of a stupid benchmark, but it seems
to be representative of a lot of HPC codes.

-Andi
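For concreteness, a minimal sketch of the failure mode Andi describes,
assuming a first-touch page placement policy (whichever CPU first writes
a page determines which node backs it). The code below is illustrative
only, not from the STREAM sources:

/*
 * STREAM-like sketch of the first-touch problem. If the worker
 * threads are still sitting on the parent's node when the arrays
 * are initialized, every page lands on that node, and the threads
 * pull their data across the interconnect after being balanced off.
 * Build: gcc -fopenmp first-touch.c -o first-touch
 */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)

int main(void)
{
	double *a = malloc(N * sizeof(double));
	double *b = malloc(N * sizeof(double));
	long i;

	if (!a || !b)
		return 1;

	/* Serial init: the parent touches every page first, so all
	 * memory is allocated on the parent's node. */
	for (i = 0; i < N; i++)
		a[i] = b[i] = 1.0;

	/* The parallel kernel then runs on whatever nodes the
	 * scheduler balanced the threads to - which only helps if
	 * the pages above were placed on those nodes too. */
	#pragma omp parallel for
	for (i = 0; i < N; i++)
		a[i] = 2.0 * b[i] + a[i];

	printf("%f\n", a[0]);
	free(a);
	free(b);
	return 0;
}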
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-25 19:09 UTC
To: Andi Kleen
Cc: Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

> It doesn't do load balancing in wake_up_forked_process() and is
> relatively non-aggressive in balancing later. This leads to the
> multithreaded OpenMP STREAM running its children first on the same node
> as the original process and allocating memory there. [...]

i believe the fix we want is to pre-balance the context at fork() time.
I've implemented this (which is basically just a reuse of
sched_balance_exec() in fork.c, and the related namespace cleanups),
could you give it a go:

	http://redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm2-A5

another solution would be to add SD_BALANCE_FORK.

also, the best place to do fork() balancing is not at
wake_up_forked_process() time, but prior to doing the MM copy. This
patch does it there. At wakeup time we've already copied all the
pagetables and created tons of dirty cachelines.

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-25 15:21 UTC
To: Ingo Molnar
Cc: jun.nakajima, ricklind, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Thu, 25 Mar 2004 20:09:45 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> also, the best place to do fork() balancing is not at
> wake_up_forked_process() time, but prior to doing the MM copy. This
> patch does it there. At wakeup time we've already copied all the
> pagetables and created tons of dirty cachelines.

That won't help for threaded programs that use clone(). OpenMP is such
a case.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-25 19:39 UTC
To: Andi Kleen
Cc: jun.nakajima, ricklind, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

> That won't help for threaded programs that use clone(). OpenMP is such
> a case.

yeah, agreed. Also, exec-balance, if applied to fork(), would migrate
the parent, which is not what we want. We could perhaps migrate the
parent to the target CPU, copy the context, then migrate the parent
back to the original CPU ... but this sounds too complex.

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-25 20:30 UTC
To: Andi Kleen
Cc: jun.nakajima, ricklind, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

> That won't help for threaded programs that use clone(). OpenMP is such
> a case.

this patch:

	redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm3-A4

does balancing at wake_up_forked_process() time.

but it's a hard issue. Especially after fork() we do have a fair amount
of cache context, and migrating at this point can be bad for
performance.

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 8:45 UTC
To: Ingo Molnar
Cc: Andi Kleen, jun.nakajima, ricklind, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Thu, Mar 25, 2004 at 09:30:32PM +0100, Ingo Molnar wrote:
>
> * Andi Kleen <ak@suse.de> wrote:
>
>> That won't help for threaded programs that use clone(). OpenMP is such
>> a case.
>
> this patch:
>
> 	redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm3-A4
>
> does balancing at wake_up_forked_process() time.
>
> but it's a hard issue. Especially after fork() we do have a fair amount
> of cache context, and migrating at this point can be bad for
> performance.

I ported it by hand to the -mm4 scheduler now and tested it. While it
works marginally better than the standard -mm scheduler (you get 1 1/2
times the bandwidth of one CPU instead of one), it's still much worse
than the optimum of nearly 4 CPUs achieved by 2.4 or the standard
scheduler.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Rick Lindsley @ 2004-03-29 10:20 UTC
To: Andi Kleen
Cc: Ingo Molnar, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

I've got a web page up now on my home machine which shows data from
schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under load
from kernbench, SPECjbb, and SPECdet.

    http://eaglet.rain.com/rick/linux/sched-domain/index.html

Two things stand out. One is that sched-domains tends to call
load_balance() less frequently when it is idle and more frequently when
it is busy (as compared to the "standard" scheduler). Another is that
even though it moves fewer tasks on average, the sched-domains code
shows about half of pull_task()'s work coming from
active_load_balance() ... and that seems wrong. Could these be
contributing to what you're seeing?

Rick
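For anyone wanting to reproduce this kind of comparison, a minimal
sketch of a snapshot-diff tool for counters like these. It deliberately
assumes nothing about the schedstat field layout (which varies between
versions of the patch) and just diffs every numeric token positionally:

/*
 * Snapshot /proc/schedstat twice and print per-field deltas,
 * treating every numeric token as an opaque counter. Field
 * meanings depend on the schedstat version, so none are assumed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAXTOK 4096

static int snap(long long *v)
{
	FILE *f = fopen("/proc/schedstat", "r");
	char tok[128];
	int n = 0;

	if (!f)
		return -1;
	while (n < MAXTOK && fscanf(f, "%127s", tok) == 1) {
		char *end;
		long long x = strtoll(tok, &end, 10);
		/* -1 marks non-numeric tokens like "cpu0" */
		v[n++] = (*end == '\0') ? x : -1;
	}
	fclose(f);
	return n;
}

int main(int argc, char **argv)
{
	static long long a[MAXTOK], b[MAXTOK];
	int secs = argc > 1 ? atoi(argv[1]) : 10;
	int i, n = snap(a);

	if (n <= 0) {
		fprintf(stderr, "cannot read /proc/schedstat\n");
		return 1;
	}
	sleep(secs);	/* run the benchmark in this window */
	if (snap(b) != n) {
		fprintf(stderr, "schedstat changed shape\n");
		return 1;
	}
	for (i = 0; i < n; i++)
		if (a[i] >= 0)
			printf("field %d: +%lld\n", i, b[i] - a[i]);
	return 0;
}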
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 5:07 UTC
To: Rick Lindsley
Cc: mingo, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Mon, 29 Mar 2004 02:20:58 -0800 Rick Lindsley <ricklind@us.ibm.com> wrote:

> I've got a web page up now on my home machine which shows data from
> schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
> load from kernbench, SPECjbb, and SPECdet.
>
>     http://eaglet.rain.com/rick/linux/sched-domain/index.html
>
> Two things stand out. One is that sched-domains tends to call
> load_balance() less frequently when it is idle and more frequently when
> it is busy (as compared to the "standard" scheduler). Another is that
> even though it moves fewer tasks on average, the sched-domains code
> shows about half of pull_task()'s work coming from
> active_load_balance() ... and that seems wrong. Could these be
> contributing to what you're seeing?

Sounds quite possible, yes.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-29 11:28 UTC
To: Rick Lindsley
Cc: Andi Kleen, Ingo Molnar, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Rick Lindsley wrote:

> I've got a web page up now on my home machine which shows data from
> schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
> load from kernbench, SPECjbb, and SPECdet.
>
> http://eaglet.rain.com/rick/linux/sched-domain/index.html

I can't see it.

> Two things stand out. One is that sched-domains tends to call
> load_balance() less frequently when it is idle and more frequently when
> it is busy (as compared to the "standard" scheduler).

John Hawkes noticed problems here too. mm5 has a patch to improve this
for NUMA node balancing. No change on non-NUMA though, if that is what
you were testing - we might need to tune this a bit if it is hurting.

> Another is that even though it moves fewer tasks on average, the
> sched-domains code shows about half of pull_task()'s work coming from
> active_load_balance() ...

Yeah, this is wrong and shouldn't be happening. It would have been due
to a bug in the imbalance calculation, which is now fixed.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Rick Lindsley @ 2004-03-29 17:30 UTC
To: Nick Piggin
Cc: Andi Kleen, Ingo Molnar, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

    Rick Lindsley wrote:
    > I've got a web page up now on my home machine which shows data from
    > schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
    > load from kernbench, SPECjbb, and SPECdet.
    >
    > http://eaglet.rain.com/rick/linux/sched-domain/index.html

    I can't see it.

Ack, sorry, wrong path. Hazards of typing at 3am ... should've used cut
'n' paste ...

    http://eaglet.rain.com/rick/linux/results/sched-domain/index.html

Rick
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 0:01 UTC
To: Rick Lindsley
Cc: Andi Kleen, Ingo Molnar, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Rick Lindsley wrote:

> Ack, sorry, wrong path. Hazards of typing at 3am ... should've used
> cut 'n' paste ...
>
>     http://eaglet.rain.com/rick/linux/results/sched-domain/index.html

Hi Rick,

This looks very cool. Very comprehensive. Have you got any plans to
integrate it with sched_domains (so, for example, you can see stats for
each domain)?

I will have to have a look at the code; it should be useful for
testing.

Thanks,
Nick
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Rick Lindsley @ 2004-03-30 1:26 UTC
To: Nick Piggin
Cc: Andi Kleen, Ingo Molnar, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

    This looks very cool. Very comprehensive. Have you got any plans to
    integrate it with sched_domains (so, for example, you can see stats
    for each domain)?

Yes -- ideally we can add some stats to domains too, so we can tell
(for example) how often it is adjusting rebalance intervals, or how
many processes are moved as a result of each domain's policy, etc.

Every time I add another counter I cringe a bit, because we don't want
to impose overhead in the scheduler. But so far, using per-cpu data,
utilizing runqueue locking only when it's already in use, and accepting
the minor inaccuracies that may result from the remaining cases seems
to be yielding a pretty good picture of things without imposing a
measurable load.

If you want to start using it yourself, I'm open to feedback. I have
patches for major releases at

    http://oss.software.ibm.com/linux/patches/?patch_id=730

and a host of smaller releases (like rc2-mm5) at eaglet:

    http://eaglet.rain.com/rick/linux/schedstat/

If you're feeling *really* lucky I have a handful of useful but often
ungeneralized tools I can share, like the ones that made that web page.

Rick
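A sketch of the low-overhead counting pattern Rick describes - a
user-space analogue for illustration, not the actual schedstat code:
each CPU (here, thread) increments its own cacheline-padded slot with
no shared lock, and a reader sums the slots, tolerating small races:

/*
 * Illustrative per-cpu statistics counters (user-space analogue).
 * Each thread owns one padded slot, so increments never bounce a
 * shared cacheline or take a lock; the reader sums all slots and
 * accepts slightly stale totals. Build: gcc -std=gnu99 -pthread
 */
#include <pthread.h>
#include <stdio.h>

#define NCPU 4

static struct {
	unsigned long count;
	char pad[64 - sizeof(unsigned long)];	/* one cacheline each */
} stats[NCPU];

static void *worker(void *arg)
{
	long id = (long)arg;

	for (int i = 0; i < 1000000; i++)
		stats[id].count++;	/* owner-only write: no locking */
	return NULL;
}

int main(void)
{
	pthread_t t[NCPU];
	unsigned long total = 0;

	for (long i = 0; i < NCPU; i++)
		pthread_create(&t[i], NULL, worker, (void *)i);
	for (int i = 0; i < NCPU; i++)
		pthread_join(t[i], NULL);
	for (int i = 0; i < NCPU; i++)
		total += stats[i].count;	/* racy-but-close read */
	printf("total events: %lu\n", total);
	return 0;
}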
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-29 11:20 UTC
To: Andi Kleen
Cc: Ingo Molnar, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:
> On Thu, Mar 25, 2004 at 09:30:32PM +0100, Ingo Molnar wrote:
>
>> but it's a hard issue. Especially after fork() we do have a fair amount
>> of cache context, and migrating at this point can be bad for
>> performance.
>
> I ported it by hand to the -mm4 scheduler now and tested it. While it
> works marginally better than the standard -mm scheduler (you get 1 1/2
> times the bandwidth of one CPU instead of one), it's still much worse
> than the optimum of nearly 4 CPUs achieved by 2.4 or the standard
> scheduler.

OK, there must be some pretty simple reason why this is happening.

I guess being OpenMP it is probably a bit complicated for you to try
your own scheduling in userspace using CPU affinities? Otherwise, could
you trace what gets scheduled where for both good and bad kernels? It
should help us work out what is going on.

I wonder if using one CPU from each quad of the NUMAQ would give at all
comparable behaviour...

If it isn't a big problem, could you test with -mm5 with the generic
sched domain? STREAM doesn't take long, does it? I don't expect much
difference, but the code is in flux while Ingo and I try to sort things
out.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 6:01 UTC
To: Nick Piggin
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Mon, 29 Mar 2004 21:20:12 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>> I ported it by hand to the -mm4 scheduler now and tested it. While it
>> works marginally better than the standard -mm scheduler (you get 1 1/2
>> times the bandwidth of one CPU instead of one), it's still much worse
>> than the optimum of nearly 4 CPUs achieved by 2.4 or the standard
>> scheduler.

Sorry, ignore this report - I just found out I booted the wrong kernel
by mistake. Currently retesting, also with the proposed change to only
use a single scheduling domain.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-29 11:46 UTC
To: Andi Kleen
Cc: Nick Piggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

> Sorry, ignore this report - I just found out I booted the wrong kernel
> by mistake. Currently retesting, also with the proposed change to only
> use a single scheduling domain.

here are the items that are in the works:

	redhat.com/~mingo/scheduler-patches/sched.patch

it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active
balancing a bit.

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 7:03 UTC
To: Ingo Molnar
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Mon, 29 Mar 2004 13:46:35 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> * Andi Kleen <ak@suse.de> wrote:
>
>> Sorry, ignore this report - I just found out I booted the wrong kernel
>> by mistake. Currently retesting, also with the proposed change to only
>> use a single scheduling domain.
>
> here are the items that are in the works:
>
> 	redhat.com/~mingo/scheduler-patches/sched.patch

I'm trying to, but -mm5 doesn't work at all on the 4-way machine. It
goes through the full boot-up sequence, but then never opens a login on
the console, and sshd also doesn't work.

Andrew, maybe that's related to your tty fixes?

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 7:10 UTC
To: Andi Kleen
Cc: mingo, nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Mon, 29 Mar 2004 09:03:01 +0200 Andi Kleen <ak@suse.de> wrote:

> I'm trying to, but -mm5 doesn't work at all on the 4-way machine. It
> goes through the full boot-up sequence, but then never opens a login
> on the console, and sshd also doesn't work.
>
> Andrew, maybe that's related to your tty fixes?

Reverting the two makes login work again.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 20:14 UTC
To: Ingo Molnar
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Mon, 29 Mar 2004 13:46:35 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> * Andi Kleen <ak@suse.de> wrote:
>
>> Sorry, ignore this report - I just found out I booted the wrong kernel
>> by mistake. Currently retesting, also with the proposed change to only
>> use a single scheduling domain.
>
> here are the items that are in the works:
>
> 	redhat.com/~mingo/scheduler-patches/sched.patch
>
> it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active
> balancing a bit.

I applied only this patch and it did slightly better than the normal
-mm*: 1.5-2x CPU bandwidth, but still very short of the 3.7x-4x that
mainline and 2.4 reach.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-29 23:51 UTC
To: Andi Kleen
Cc: Ingo Molnar, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:
> On Mon, 29 Mar 2004 13:46:35 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
>> here are the items that are in the works:
>>
>> 	redhat.com/~mingo/scheduler-patches/sched.patch
>>
>> it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active
>> balancing a bit.
>
> I applied only this patch and it did slightly better than the normal
> -mm*: 1.5-2x CPU bandwidth, but still very short of the 3.7x-4x that
> mainline and 2.4 reach.

So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and
2.6 get?
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-30 6:34 UTC
To: Nick Piggin
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Tue, 30 Mar 2004 09:51:46 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and
> 2.6 get?

Yes (2.6 vanilla and 2.4-aa at that; I haven't tested 2.4-vanilla).

Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7x CPU),
but still much worse than the max of 3.7x-4x CPU bandwidth.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 6:40 UTC
To: Andi Kleen
Cc: Nick Piggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

>> So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and
>> 2.6 get?
>
> Yes (2.6 vanilla and 2.4-aa at that; I haven't tested 2.4-vanilla).
>
> Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7x CPU),
> but still much worse than the max of 3.7x-4x CPU bandwidth.

Andi, could you please try the patch below - this will test whether
this has to do with the rate of balancing between NUMA nodes. The patch
itself is not correct (it way overbalances on NUMA), but it tests the
theory.

	Ingo

--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -627,7 +627,7 @@ struct sched_domain {
 	.parent			= NULL,				\
 	.groups			= NULL,				\
 	.min_interval		= 8,				\
-	.max_interval		= 256*fls(num_online_cpus()),	\
+	.max_interval		= 8,				\
 	.busy_factor		= 8,				\
 	.imbalance_pct		= 125,				\
 	.cache_hot_time		= (10*1000000),		\
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-30 7:07 UTC
To: Ingo Molnar
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Tue, 30 Mar 2004 08:40:15 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> Andi, could you please try the patch below - this will test whether
> this has to do with the rate of balancing between NUMA nodes. The
> patch itself is not correct (it way overbalances on NUMA), but it
> tests the theory.

This works much better, but wildly varying (my tests go from 2.8x CPU
to ~3.8x CPU for 4 CPUs; the 2- and 3-CPU cases are ok). A bit more
consistent results would be better though.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:14 UTC
To: Andi Kleen
Cc: Ingo Molnar, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:
> On Tue, 30 Mar 2004 08:40:15 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
>> Andi, could you please try the patch below - this will test whether
>> this has to do with the rate of balancing between NUMA nodes. The
>> patch itself is not correct (it way overbalances on NUMA), but it
>> tests the theory.
>
> This works much better, but wildly varying (my tests go from 2.8x CPU
> to ~3.8x CPU for 4 CPUs; the 2- and 3-CPU cases are ok). A bit more
> consistent results would be better though.

Oh good, thanks Ingo. Andi, you probably want to lower your minimum
balance time too then, and maybe try with an even lower maximum. Maybe
reduce cache_hot_time a bit too.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 7:45 UTC
To: Nick Piggin
Cc: Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>> This works much better, but wildly varying (my tests go from 2.8x CPU
>> to ~3.8x CPU for 4 CPUs; the 2- and 3-CPU cases are ok). A bit more
>> consistent results would be better though.
>
> Oh good, thanks Ingo. Andi, you probably want to lower your minimum
> balance time too then, and maybe try with an even lower maximum. Maybe
> reduce cache_hot_time a bit too.

i don't think we want to balance with that high a frequency on NUMA
Opteron. These tunings were for testing only.

i'm dusting off the balance-on-clone patch right now; that should be
the correct solution. It is based on a find_idlest_cpu() function which
searches for the least-loaded CPU and checks whether we can do passive
load-balancing to it. Ie. it's yet another balancing point in the
scheduler, _not_ a balancing-logic change.

	Ingo
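A rough sketch of the shape such a search could take - the stub data
structures, the "idle wins immediately" rule, and the ~25% threshold
below are all illustrative assumptions, not the actual patch:

/*
 * Illustrative find_idlest_cpu()-style search with stub runqueues.
 * Prefer an idle CPU outright; otherwise only pick a remote CPU if
 * it is markedly less loaded, so we don't throw away cache context
 * for a marginal win.
 */
#include <stdio.h>

#define NR_CPUS 4

struct stub_rq { unsigned long nr_running; };
static struct stub_rq runqueues[NR_CPUS];

static int find_idlest_cpu(int this_cpu)
{
	unsigned long min_load = runqueues[this_cpu].nr_running;
	int best = this_cpu;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		unsigned long load = runqueues[cpu].nr_running;

		if (load == 0)
			return cpu;	/* idle CPU: take it */
		if (load < min_load) {
			min_load = load;
			best = cpu;
		}
	}
	/* require the target to be at least ~25% less loaded */
	if (best != this_cpu &&
	    min_load * 5 >= runqueues[this_cpu].nr_running * 4)
		return this_cpu;
	return best;
}

int main(void)
{
	runqueues[0].nr_running = 3;
	runqueues[1].nr_running = 2;
	runqueues[2].nr_running = 1;
	runqueues[3].nr_running = 1;
	printf("new task from CPU 0 -> CPU %d\n", find_idlest_cpu(0));
	return 0;
}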
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:58 UTC
To: Ingo Molnar
Cc: Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Ingo Molnar wrote:

> i don't think we want to balance with that high a frequency on NUMA
> Opteron. These tunings were for testing only.

I guess not. Andi says he wants it more like UMA balancing though...

> i'm dusting off the balance-on-clone patch right now; that should be
> the correct solution. It is based on a find_idlest_cpu() function which
> searches for the least-loaded CPU and checks whether we can do passive
> load-balancing to it. Ie. it's yet another balancing point in the
> scheduler, _not_ a balancing-logic change.

Yep, as I said to Martin, I also agree this is probably good if it is
done carefully. I think we'll need to get a horde of thread-benchmarking
people together before turning it on by default, of course.

It seems Andi can now get equivalent results without it, so it isn't a
pressing issue.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 7:15 UTC
To: Andi Kleen
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

>> Andi, could you please try the patch below - this will test whether
>> this has to do with the rate of balancing between NUMA nodes. The
>> patch itself is not correct (it way overbalances on NUMA), but it
>> tests the theory.
>
> This works much better, but wildly varying (my tests go from 2.8x CPU
> to ~3.8x CPU for 4 CPUs; the 2- and 3-CPU cases are ok). A bit more
> consistent results would be better though.

ok, could you try min_interval, max_interval and busy_factor all with a
value of 4, in sched.h's SD_NODE_INIT template? (again, only for
testing purposes.)

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:18 UTC
To: Ingo Molnar
Cc: Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Ingo Molnar wrote:

> ok, could you try min_interval, max_interval and busy_factor all with
> a value of 4, in sched.h's SD_NODE_INIT template? (again, only for
> testing purposes.)

(sorry, forget what I said then, I'll leave it to Ingo)
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-30 7:48 UTC
To: Ingo Molnar
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Tue, 30 Mar 2004 09:15:19 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> ok, could you try min_interval, max_interval and busy_factor all with
> a value of 4, in sched.h's SD_NODE_INIT template? (again, only for
> testing purposes.)

I kept the old patch and made these changes. The results are much more
consistent now: 3+x CPU. I still get variations of ~2GB/s, but I had
this with older kernels too.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 8:18 UTC
To: Andi Kleen
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

>> ok, could you try min_interval, max_interval and busy_factor all with
>> a value of 4, in sched.h's SD_NODE_INIT template? (again, only for
>> testing purposes.)
>
> I kept the old patch and made these changes. The results are much more
> consistent now: 3+x CPU. I still get variations of ~2GB/s, but I had
> this with older kernels too.

great.

now, could you try the following patch, against vanilla -mm5:

	redhat.com/~mingo/scheduler-patches/sched2.patch

this includes 'context balancing' and doesn't touch the NUMA async
balancing tunables. Do you get better performance than with stock -mm5?

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-30 9:36 UTC
To: Ingo Molnar
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Tue, 30 Mar 2004 10:18:40 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> now, could you try the following patch, against vanilla -mm5:
>
> 	redhat.com/~mingo/scheduler-patches/sched2.patch
>
> this includes 'context balancing' and doesn't touch the NUMA async
> balancing tunables. Do you get better performance than with stock -mm5?

I get better performance (roughly 2.1x CPU), but only about half the
optimum.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 7:42 UTC
To: Andi Kleen
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

> This works much better, but wildly varying (my tests go from 2.8x CPU
> to ~3.8x CPU for 4 CPUs; the 2- and 3-CPU cases are ok). A bit more
> consistent results would be better though.

i'm resurrecting the balance-on-clone patch i sent a couple of days
ago. I found at least one bug in it that might explain why it didn't
work back then. (also, the scheduler back then was too aggressive at
migrating tasks back.) Stay tuned.

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:03 UTC
To: Andi Kleen
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:
> On Tue, 30 Mar 2004 09:51:46 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and
>> 2.6 get?
>
> Yes (2.6 vanilla and 2.4-aa at that; I haven't tested 2.4-vanilla).
>
> Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7x CPU),
> but still much worse than the max of 3.7x-4x CPU bandwidth.

So it is very likely to be a case of the threads running too long on
one CPU before being balanced off, and faulting in most of their
working memory from one node, right?

I think it is impossible for the scheduler to correctly identify this
and implement the behaviour that OpenMP wants without causing
regressions on more general workloads (assuming this is the problem).

We are not going to go back to the wild balancing that numasched does
(I have some benchmarks where sched-domains reduces cross-node task
movement by several orders of magnitude). So the other option is to do
balance-on-clone across NUMA nodes, and make it very sensitive to
imbalance. Or, probably better, make it easy to balance off to an idle
CPU, but much more difficult to balance off to a busy CPU.

I suspect this would still be a regression for other tests though,
where thread creation is more frequent, threads share working set more
often, or the number of threads > the number of CPUs.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-30 7:13 UTC
To: Nick Piggin
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Tue, 30 Mar 2004 17:03:42 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> So it is very likely to be a case of the threads running too long on
> one CPU before being balanced off, and faulting in most of their
> working memory from one node, right?

Yes.

> I think it is impossible for the scheduler to correctly identify this
> and implement the behaviour that OpenMP wants without causing
> regressions on more general workloads (assuming this is the problem).

Regression on what workload? The 2.4 kernel, which did the early
balancing, didn't seem to have problems.

I have NUMA API for an application to select memory placement manually,
but it's unrealistic to expect all applications to use it, so the
scheduler has to do at least a reasonable default.

In general on Opteron you want to go as quickly as possible to your
target node. Keeping things on the local node and hoping that threads
won't need to be balanced off is probably a loss. It is quite possible
that other systems have different requirements, but I doubt there is a
"one size fits all" requirement, and doing a custom domain setup or
similar would be fine for me. (or at least, if sched domains cannot be
tuned for Opteron then it would have failed its promise of being a
configurable scheduler)

> I suspect this would still be a regression for other tests though,
> where thread creation is more frequent, threads share working set more
> often, or the number of threads > the number of CPUs.

I can try such tests if they're not too time-consuming to set up. What
did you have in mind?

-Andi
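For reference, the kind of manual placement the NUMA API allows - a
minimal sketch using libnuma (numa_available(), numa_run_on_node() and
numa_alloc_onnode() are the library's calls; the node choice here is
arbitrary, for illustration):

/*
 * Pin the calling thread to a node and allocate its working memory
 * there, instead of relying on the scheduler's default placement.
 * Build: gcc numa-place.c -lnuma
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	size_t sz = 64 << 20;	/* 64 MB working set */
	double *buf;
	int node = 1;		/* arbitrary node for illustration */

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	if (node > numa_max_node())
		node = 0;

	/* run on the chosen node and allocate memory local to it */
	if (numa_run_on_node(node) < 0)
		perror("numa_run_on_node");
	buf = numa_alloc_onnode(sz, node);
	if (!buf)
		return 1;

	for (size_t i = 0; i < sz / sizeof(double); i++)
		buf[i] = 1.0;	/* touches pages on 'node' */

	numa_free(buf, sz);
	return 0;
}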
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:24 UTC
To: Andi Kleen
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:

> Regression on what workload? The 2.4 kernel, which did the early
> balancing, didn't seem to have problems.

No, but hopefully sched-domains balancing will do better than the old
numasched.

> In general on Opteron you want to go as quickly as possible to your
> target node. Keeping things on the local node and hoping that threads
> won't need to be balanced off is probably a loss. It is quite possible
> that other systems have different requirements, but I doubt there is a
> "one size fits all" requirement, and doing a custom domain setup or
> similar would be fine for me.

It is the same situation with all NUMA; obviously Opteron's 1 CPU per
node means it is sensitive to node imbalances.

> (or at least, if sched domains cannot be tuned for Opteron then it
> would have failed its promise of being a configurable scheduler)

Well, it seems like Ingo is onto something. Phew! :)

> I can try such tests if they're not too time-consuming to set up.
> What did you have in mind?

Not really sure. I guess probably most things that use a lot of
threads: maybe Java, or a web server using per-connection threads (if
there is such a thing). On the other hand, maybe it will be a good idea
if it is done carefully...
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Arjan van de Ven @ 2004-03-30 7:38 UTC
To: Andi Kleen
Cc: Nick Piggin, mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

> Regression on what workload? The 2.4 kernel, which did the early
> balancing, didn't seem to have problems.

well, the hard balance is between a program that just splits off one
thread and has those two threads working closely together (in which
case you want the two threads to be together on the same quad in a
quad-like setup), and a program that splits off a thread and has the
two threads working basically entirely independently. Benchmarks are
typically of the latter kind... but real-world applications? The ones I
can think of that use threads are of the former kind.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Martin J. Bligh @ 2004-03-30 7:13 UTC
To: Nick Piggin, Andi Kleen
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

> We are not going to go back to the wild balancing that numasched does
> (I have some benchmarks where sched-domains reduces cross-node task
> movement by several orders of magnitude).

Agreed, I think that'd be a fatal mistake ...

> So the other option is to do balance-on-clone across NUMA nodes, and
> make it very sensitive to imbalance. Or, probably better, make it easy
> to balance off to an idle CPU, but much more difficult to balance off
> to a busy CPU.

I think that's correct, but we need to be careful. We really, really do
want to try to keep threads on the same node *if* we have enough
processes around to keep the machine busy. Because we don't balance on
fork, we make a reasonable job of that today, but we should probably be
more reluctant on rebalance than we are.

It's when we have fewer processes than nodes that we want to spread
things around. That's a difficult balance to strike (and exactly why I
wimped out on it originally ;-))

M.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:31 UTC
To: Martin J. Bligh
Cc: Andi Kleen, mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

Martin J. Bligh wrote:

> I think that's correct, but we need to be careful. We really, really do
> want to try to keep threads on the same node *if* we have enough
> processes around to keep the machine busy. Because we don't balance on
> fork, we make a reasonable job of that today, but we should probably be
> more reluctant on rebalance than we are.
>
> It's when we have fewer processes than nodes that we want to spread
> things around. That's a difficult balance to strike (and exactly why I
> wimped out on it originally ;-))

Well, NUMA balance-on-exec is obviously the right thing to do.

Maybe balance-on-clone would be beneficial if we only balance onto CPUs
which are idle or very, very imbalanced. Basically, if you are very
sure that it is going to be balanced off anyway, it is probably better
to do it at clone.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Martin J. Bligh @ 2004-03-30 7:38 UTC
To: Nick Piggin
Cc: Andi Kleen, mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

> Well, NUMA balance-on-exec is obviously the right thing to do.
>
> Maybe balance-on-clone would be beneficial if we only balance onto CPUs
> which are idle or very, very imbalanced. Basically, if you are very
> sure that it is going to be balanced off anyway, it is probably better
> to do it at clone.

Yup ... sounds utterly sensible. But I think we need to make the
current balancing favour grouping threads together on the same CPU/node
more first, if possible ;-)

M.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 8:05 UTC
To: Nick Piggin
Cc: Martin J. Bligh, Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Maybe balance-on-clone would be beneficial if we only balance onto CPUs
> which are idle or very, very imbalanced. Basically, if you are very
> sure that it is going to be balanced off anyway, it is probably better
> to do it at clone.

balancing threads/processes is not a problem, as long as it happens
within the rules of normal balancing.

ie. 'new context created' (on exec, fork or clone) is just an event
that impacts the load scenario, and which might trigger rebalancing.

_if_ the sharing between various contexts is very high and it's
actually faster to run them all single-threaded, then the application
writer can bind them to one CPU, via the affinity syscalls. But the
scheduler cannot know this in advance.

so the cleanest assumption, from the POV of the scheduler, is that
there's no sharing between contexts. Things become really simple once
this assumption is made.

and frankly, it's much easier to argue with application developers
whose application scales badly and thus the scheduler over-distributes
it, than with application developers whose application scales badly due
to the scheduler.

	Ingo
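A minimal sketch of the userspace binding Ingo refers to, using the
current glibc wrapper for the affinity syscall (the choice of CPU 0 is
arbitrary; children inherit the mask across fork/clone until they
rebind):

/*
 * Bind the calling process to CPU 0 via sched_setaffinity(), the
 * escape hatch for applications whose threads share heavily and
 * should not be spread by the scheduler.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);	/* CPU 0: arbitrary choice */

	if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("bound to CPU 0\n");
	return 0;
}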
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 8:19 UTC
To: Ingo Molnar
Cc: Martin J. Bligh, Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

Ingo Molnar wrote:

> so the cleanest assumption, from the POV of the scheduler, is that
> there's no sharing between contexts. Things become really simple once
> this assumption is made.
>
> and frankly, it's much easier to argue with application developers
> whose application scales badly and thus the scheduler over-distributes
> it, than with application developers whose application scales badly
> due to the scheduler.

You're probably mostly right, but I really don't know if I'd start with
the assumption that threads don't share anything. I think they're very
likely to share memory and cache.

Also, these additional system-wide balance points don't come for free
if you attach them to common operations (as opposed to the slow
periodic balancing). find_best_cpu needs to pull in NR_CPUS remote (and
probably hot-and-dirty) cachelines, which can get expensive, for an
operation that you are very likely to be better off *without* if your
threads do share any memory.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 8:45 UTC
To: Nick Piggin
Cc: Martin J. Bligh, Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> You're probably mostly right, but I really don't know if I'd start with
> the assumption that threads don't share anything. I think they're very
> likely to share memory and cache.

it all depends on the workload i guess, but generally if the
application scales well then the threads only share data in a
read-mostly manner - hence we can balance at creation time.

if the application does not scale well then balancing too early cannot
make the app perform much worse.

things like JVMs tend to want good balancing - they really are
userspace simulations of separate contexts, with little sharing and
good overall scalability of the architecture.

> Also, these additional system-wide balance points don't come for free
> if you attach them to common operations (as opposed to the slow
> periodic balancing).

yes, definitely.

the implementation in sched2.patch does not take this into account yet.
There are a number of things we can do about the 500-CPUs case. Eg.
only do the balance search towards the next N nodes/cpus (tunable via a
domain parameter).

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 8:53 UTC
To: Ingo Molnar
Cc: Martin J. Bligh, Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

Ingo Molnar wrote:

> it all depends on the workload i guess, but generally if the
> application scales well then the threads only share data in a
> read-mostly manner - hence we can balance at creation time.
>
> things like JVMs tend to want good balancing - they really are
> userspace simulations of separate contexts, with little sharing and
> good overall scalability of the architecture.

Well, it will be interesting to see how it goes. Unfortunately I don't
have a single realistic benchmark. In fact the only threaded one I have
is volanomark.

> the implementation in sched2.patch does not take this into account
> yet. There are a number of things we can do about the 500-CPUs case.
> Eg. only do the balance search towards the next N nodes/cpus (tunable
> via a domain parameter).

Yeah, I think we shouldn't worry too much about the 500-CPUs case,
because they will obviously end up using their own domains. But it is
possible this would hurt smaller CPU counts too. Again, it means
testing.

I think we should probably aim to have a usable and decent default
domain for 32, maybe 64 CPUs, and not worry about larger numbers too
much if it would hurt lower-end performance.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 8:53 ` Nick Piggin @ 2004-03-30 15:27 ` Martin J. Bligh 0 siblings, 0 replies; 68+ messages in thread From: Martin J. Bligh @ 2004-03-30 15:27 UTC (permalink / raw) To: Nick Piggin, Ingo Molnar, Erich Focht Cc: Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech > Well, it will be interesting to see how it goes. Unfortunately > I don't have a single realistic benchmark. That's OK, neither does anyone else ;-) OK, for HPC workloads they do, but not for other stuff. The closest I can come conceptually is to run multiple instances of a Java benchmark in parallel. The existing ones all tend to be either 1 process with many threads, or many processes each with one thread. There are no m x n benchmarks around that I've found, and that seems to be a lot more like what the customers I've seen are interested in (throwing a DB, webserver, Java, etc all on one machine). Making balance_on_fork a userspace hintable thing wouldn't hurt us at all though, and would provide a great escape route for the HPC people. Some simple pokeable in /proc would probably be sufficient. balance_on_clone is harder, as whether you want to do it or not depends more on the state of the rest of the system, which is very hard for userspace to know ... M. ^ permalink raw reply [flat|nested] 68+ messages in thread
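The "simple pokeable in /proc" Martin mentions could be an ordinary sysctl; a minimal sketch follows, with the ctl id, name and flag semantics all invented for illustration:

/* kernel/sysctl.c (sketch): a system-wide toggle that do_fork() could
 * consult. 0 = balance at exec() only (today's behaviour), 1 = also
 * balance at fork()/clone(). */
#define KERN_BALANCE_ON_FORK	70	/* hypothetical ctl id */

static int sysctl_balance_on_fork;

static ctl_table balance_table[] = {
	{
		.ctl_name	= KERN_BALANCE_ON_FORK,
		.procname	= "balance_on_fork",
		.data		= &sysctl_balance_on_fork,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},
	{ .ctl_name = 0 }
};

An HPC user (or batch system) would then write 1 to /proc/sys/kernel/balance_on_fork before launching the job and restore 0 afterwards.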
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 19:09 ` Ingo Molnar 2004-03-25 15:21 ` Andi Kleen @ 2004-03-25 19:24 ` Martin J. Bligh 2004-03-25 21:48 ` Ingo Molnar 1 sibling, 1 reply; 68+ messages in thread From: Martin J. Bligh @ 2004-03-25 19:24 UTC (permalink / raw) To: Ingo Molnar, Andi Kleen Cc: Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech >> It doesn't do load balance in wake_up_forked_process() and is >> relatively non aggressive in balancing later. This leads to the >> multithreaded OpenMP STREAM running its childs first on the same node >> as the original process and allocating memory there. [...] > > i believe the fix we want is to pre-balance the context at fork() time. > I've implemented this (which is basically just a reuse of > sched_balance_exec() in fork.c, and the related namespace cleanups), > could you give it a go: > > http://redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm2-A5 > > another solution would be to add SD_BALANCE_FORK. > > also, the best place to do fork() blancing is not at > wake_up_forked_process() time, but prior doing the MM copy. This patch > does it there. At wakeup time we've already copied all the pagetables > and created tons of dirty cachelines. How are you going to decide whether to rebalance at fork time or exec time? Exec time balancing is a *lot* more efficient, it just doesn't work for things that don't exec ... cloned threads would certainly be one case. M. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 19:24 ` Martin J. Bligh @ 2004-03-25 21:48 ` Ingo Molnar 2004-03-25 22:28 ` Martin J. Bligh 0 siblings, 1 reply; 68+ messages in thread From: Ingo Molnar @ 2004-03-25 21:48 UTC (permalink / raw) To: Martin J. Bligh Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech * Martin J. Bligh <mbligh@aracnet.com> wrote: > Exec time balancing is a *lot* more efficient, it just doesn't work > for things that don't exec ... cloned threads would certainly be one > case. yeah - exec-balancing is a clear thing. fork/clone time balancing is a lot less clear. Ingo ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 21:48 ` Ingo Molnar @ 2004-03-25 22:28 ` Martin J. Bligh 2004-03-29 22:30 ` Erich Focht 0 siblings, 1 reply; 68+ messages in thread From: Martin J. Bligh @ 2004-03-25 22:28 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech >> Exec time balancing is a *lot* more efficient, it just doesn't work >> for things that don't exec ... cloned threads would certainly be one >> case. > > yeah - exec-balancing is a clear thing. fork/clone time balancing is > a lot less clear. OK, well it *looks* to me from a quick look at your patch like sched_balance_context will rebalance at both fork *and* exec time. That seems like a bad plan, but maybe I'm misreading it. Can we hold off on changing the fork/exec time balancing until we've come to a plan as to what should actually be done with it? Unless we're giving it some hint from userspace, it's frigging hard to be sure if it's going to exec or not - and the vast majority of things do. There was a really good reason why the code is currently set up that way, it's not some random accident ;-) Clone is a much more interesting case, though at the time, I consciously decided NOT to do that, as we really mostly want threads on the same node. The exception is the case where we have one app with lots of threads, and nothing much else running on the system ... I tend to think of that as an artificial benchmark situation, but maybe that's not fair. We probably need to just do a more conservative version of the cross-node rebalance at fork time. M. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 22:28 ` Martin J. Bligh @ 2004-03-29 22:30 ` Erich Focht 2004-03-30 9:05 ` Nick Piggin 2004-03-30 15:01 ` Martin J. Bligh 0 siblings, 2 replies; 68+ messages in thread From: Erich Focht @ 2004-03-29 22:30 UTC (permalink / raw) To: Martin J. Bligh, Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech On Thursday 25 March 2004 23:28, Martin J. Bligh wrote: > Can we hold off on changing the fork/exec time balancing until we've > come to a plan as to what should actually be done with it? Unless we're > giving it some hint from userspace, it's frigging hard to be sure if > it's going to exec or not - and the vast majority of things do. After more than a year (or two?) of discussions there's no better idea yet than giving a userspace hint. Default should be to balance at exec(), and maybe use a syscall for saying: balance all children a particular process is going to fork/clone at creation time. Everybody reached the insight that we can't foresee what's optimal, so there is only one solution: control the behavior. Give the user a tool to improve the performance. Just a small inheritable variable in the task structure is enough. Whether you give the hint at or before run-time or even at compile-time is not really the point... I don't think it's worth to wait and hope that somebody shows up with a magic algorithm which balances every kind of job optimally. > There was a really good reason why the code is currently set up that > way, it's not some random accident ;-) The current code isn't a result of a big optimization effort, it's the result of stripping stuff down to something which was acceptable at all in the 2.6 feature freeze phase such that we get at least _some_ NUMA scheduler infrastructure. It was clear right from the beginning that it has to be extended to really become useful. > Clone is a much more interesting case, though at the time, I consciously > decided NOT to do that, as we really mostly want threads on the same > node. That is not true in the case of HPC applications. And if someone uses OpenMP he is just doing that kind of stuff. I consider STREAM a good benchmark because it shows exactly the problem of HPC applications: they need a lot of memory bandwidth, they don't run in cache and the tasks live really long. Spreading those tasks across the nodes gives me more bandwidth per task and I accumulate the positive effect because the tasks run for hours or days. It's a simple and clear case where the scheduler should be improved. Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 are not relevant for HPC. In a compute center it actually doesn't matter much whether some shell command returns 10% faster, it just shouldn't disturb my super simulation code for which I bought an expensive NUMA box. Regards, Erich ^ permalink raw reply [flat|nested] 68+ messages in thread
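The "small inheritable variable in the task structure" Erich proposes could be as simple as the sketch below; the field, constants and wiring are hypothetical, shown only to make the proposal concrete (wake_up_forked_thread() is the balancing wakeup from Ingo's patch later in this thread):

/* include/linux/sched.h (sketch) */
#define BALANCE_EXEC	0	/* default: balance at exec() only */
#define BALANCE_FORK	1	/* also spread children at fork()/clone() */

struct task_struct {
	/* ... existing fields ... */
	int balance_policy;	/* copied with the rest of the task struct
				   in copy_process(), hence inherited */
};

/* kernel/fork.c, wakeup path of do_fork() (sketch) */
if (p->balance_policy == BALANCE_FORK)
	wake_up_forked_thread(p);	/* balancing wakeup */
else
	wake_up_forked_process(p);	/* stay local, as today */

A syscall or /proc hint would then only have to set balance_policy once on the parent; every worker the job spawns inherits it.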
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-29 22:30 ` Erich Focht @ 2004-03-30 9:05 ` Nick Piggin 2004-03-30 10:04 ` Erich Focht 2004-03-30 15:01 ` Martin J. Bligh 1 sibling, 1 reply; 68+ messages in thread From: Nick Piggin @ 2004-03-30 9:05 UTC (permalink / raw) To: Erich Focht Cc: Martin J. Bligh, Ingo Molnar, Andi Kleen, Nakajima, Jun, Rick Lindsley, linux-kernel, akpm, kernel, rusty, anton, lse-tech (please use piggin@yahoo.com.au) Erich Focht wrote: >On Thursday 25 March 2004 23:28, Martin J. Bligh wrote: > >>Can we hold off on changing the fork/exec time balancing until we've >>come to a plan as to what should actually be done with it? Unless we're >>giving it some hint from userspace, it's frigging hard to be sure if >>it's going to exec or not - and the vast majority of things do. >> > >After more than a year (or two?) of discussions there's no better idea >yet than giving a userspace hint. Default should be to balance at >exec(), and maybe use a syscall for saying: balance all children a >particular process is going to fork/clone at creation time. Everybody >reached the insight that we can't foresee what's optimal, so there is >only one solution: control the behavior. Give the user a tool to >improve the performance. Just a small inheritable variable in the task >structure is enough. Whether you give the hint at or before run-time >or even at compile-time is not really the point... > >I don't think it's worth to wait and hope that somebody shows up with >a magic algorithm which balances every kind of job optimally. > > I'm with Martin here, we are just about to merge all this sched-domains stuff. So we should at least wait until after that. And of course, *nothing* gets changed without at least one benchmark that shows it improves something. So far nobody has come up to the plate with that. >>There was a really good reason why the code is currently set up that >>way, it's not some random accident ;-) >> > >The current code isn't a result of a big optimization effort, it's the >result of stripping stuff down to something which was acceptable at >all in the 2.6 feature freeze phase such that we get at least _some_ >NUMA scheduler infrastructure. It was clear right from the beginning >that it has to be extended to really become useful. > > >>Clone is a much more interesting case, though at the time, I consciously >>decided NOT to do that, as we really mostly want threads on the same >>node. >> > >That is not true in the case of HPC applications. And if someone uses >OpenMP he is just doing that kind of stuff. I consider STREAM a good >benchmark because it shows exactly the problem of HPC applications: >they need a lot of memory bandwidth, they don't run in cache and the >tasks live really long. Spreading those tasks across the nodes gives >me more bandwidth per task and I accumulate the positive effect >because the tasks run for hours or days. It's a simple and clear case >where the scheduler should be improved. > >Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 >are not relevant for HPC. In a compute center it actually doesn't >matter much whether some shell command returns 10% faster, it just >shouldn't disturb my super simulation code for which I bought an >expensive NUMA box. > > There are other things, like Java, servers, etc. that use threads. The point is that we have never had this before, and nobody (until now) has been asking for it. And there are as yet no convincing benchmarks that even show best case improvements. 
And it could very easily have some bad cases. And finally, HPC applications are the very ones that should be using CPU affinities because they are usually tuned quite tightly to the specific architecture. Let's just make sure we don't change defaults without any reason... ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 9:05 ` Nick Piggin @ 2004-03-30 10:04 ` Erich Focht 2004-03-30 10:58 ` Andi Kleen ` (2 more replies) 0 siblings, 3 replies; 68+ messages in thread From: Erich Focht @ 2004-03-30 10:04 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Ingo Molnar, Andi Kleen, Nakajima, Jun, Rick Lindsley, linux-kernel, akpm, kernel, rusty, anton, lse-tech Hi Nick, On Tuesday 30 March 2004 11:05, Nick Piggin wrote: > >exec(), and maybe use a syscall for saying: balance all children a > >particular process is going to fork/clone at creation time. Everybody > >reached the insight that we can't foresee what's optimal, so there is > >only one solution: control the behavior. Give the user a tool to > >improve the performance. Just a small inheritable variable in the task > >structure is enough. Whether you give the hint at or before run-time > >or even at compile-time is not really the point... > > > >I don't think it's worth to wait and hope that somebody shows up with > >a magic algorithm which balances every kind of job optimally. > > I'm with Martin here, we are just about to merge all this > sched-domains stuff. So we should at least wait until after > that. And of course, *nothing* gets changed without at least > one benchmark that shows it improves something. So far > nobody has come up to the plate with that. I thought you were talking the whole time about STREAM. That is THE benchmark which shows you an impact of balancing at fork. And it is a VERY relevant benchmark. Though you shouldn't run it on historical machines like NUMAQ, no compute center in the western world will buy NUMAQs for high performance... Andi typically runs STREAM on all CPUs of a machine. Try on N/2 and N/4 and so on, you'll see the impact. > >>Clone is a much more interesting case, though at the time, I consciously > >>decided NOT to do that, as we really mostly want threads on the same > >>node. > > > >That is not true in the case of HPC applications. And if someone uses > >OpenMP he is just doing that kind of stuff. I consider STREAM a good > >benchmark because it shows exactly the problem of HPC applications: > >they need a lot of memory bandwidth, they don't run in cache and the > >tasks live really long. Spreading those tasks across the nodes gives > >me more bandwidth per task and I accumulate the positive effect > >because the tasks run for hours or days. It's a simple and clear case > >where the scheduler should be improved. > > > >Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 > >are not relevant for HPC. In a compute center it actually doesn't > >matter much whether some shell command returns 10% faster, it just > >shouldn't disturb my super simulation code for which I bought an > >expensive NUMA box. > > There are other things, like Java, servers, etc. that use threads. I'm just saying that you should have the choice. The default should be as before, balance at exec(). > The point is that we have never had this before, and nobody > (until now) has been asking for it. And there are as yet no ?? Sorry, I've had balance at fork since 2001 in the NEC IA64 NUMA kernels and users use it intensively with OpenMP. Advertised it a lot, asked for it, talked about it at the last OLS. Only IA64 was considered rare big iron. I understand that the issue gets hotter if the problem hurts on AMD64... > convincing benchmarks that even show best case improvements. And > it could very easily have some bad cases. 
Again: I'm talking about having the choice. The user decides. Nothing protects you against user stupidity, but if they just have the choice of poor automatic initial scheduling, it's not enough. And: having the fork/clone initial balancing policy means: you don't need to make your code complicated and unportable by playing with setaffinity (which is just plainly unusable when you share the machine with other users). > And finally, HPC > applications are the very ones that should be using CPU > affinities because they are usually tuned quite tightly to the > specific architecture. There are companies mainly selling NUMA machines for HPC (SGI?), so this is not a niche market. Clusters of big NUMA machines are not unusual, and they're typically not used for databases but for HPC apps. Unfortunately proprietary UNIX is still considered to have better features than Linux for such configurations. > Let's just make sure we don't change defaults without any > reason... No reason? Aaarghh... >;-) Erich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 10:04 ` Erich Focht @ 2004-03-30 10:58 ` Andi Kleen 2004-03-30 16:03 ` [patch] sched-2.6.5-rc3-mm1-A0 Ingo Molnar 0 siblings, 1 reply; 68+ messages in thread From: Andi Kleen @ 2004-03-30 10:58 UTC (permalink / raw) To: Erich Focht Cc: nickpiggin, mbligh, mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech On Tue, 30 Mar 2004 12:04:13 +0200 Erich Focht <efocht@hpce.nec.com> wrote: Hello Erich, > On Tuesday 30 March 2004 11:05, Nick Piggin wrote: > > >exec(), and maybe use a syscall for saying: balance all children a > > >particular process is going to fork/clone at creation time. Everybody > > >reached the insight that we can't foresee what's optimal, so there is > > >only one solution: control the behavior. Give the user a tool to > > >improve the performance. Just a small inheritable variable in the task > > >structure is enough. Whether you give the hint at or before run-time > > >or even at compile-time is not really the point... > > > > > >I don't think it's worth to wait and hope that somebody shows up with > > >a magic algorithm which balances every kind of job optimally. > > > > I'm with Martin here, we are just about to merge all this > > sched-domains stuff. So we should at least wait until after > > that. And of course, *nothing* gets changed without at least > > one benchmark that shows it improves something. So far > > nobody has come up to the plate with that. > > I thought you were talking the whole time about STREAM. That is THE > benchmark which shows you an impact of balancing at fork. And it is a > VERY relevant benchmark. Though you shouldn't run it on historical > machines like NUMAQ, no compute center in the western world will buy > NUMAQs for high performance... Andi typically runs STREAM on all CPUs > of a machine. Try on N/2 and N/4 and so on, you'll see the impact. Actually I run it on 1-4 CPUs (don't have more to try), but didn't always bother to report everything... With the default mm5 scheduler the bandwidth with 1, 2, 3 or 4 threads is consistently that of a single CPU. I agree with you that the "balancing on fork is bad" assumption is dubious at best. For HPC it definitely is wrong, for others it is unproven as well. As I wrote earlier, our own results on HyperThreaded machines running 2.4 were similar. On HT at least early balancing seems to be a win too - it's obvious because there is no cache cost to be paid when you move between two virtual CPUs on the same core. > > There are other things, like Java, servers, etc. that use threads. > > I'm just saying that you should have the choice. The default should be > as before, balance at exec(). Choice is probably not bad, but a good default is important too. I'm not really sure doing it by default would be such a bad idea. A thread allocating some memory on its own is probably not that unusual, even outside the HPC space. And on a NUMA system you want that already on the final node. -Andi ^ permalink raw reply [flat|nested] 68+ messages in thread
* [patch] sched-2.6.5-rc3-mm1-A0 2004-03-30 10:58 ` Andi Kleen @ 2004-03-30 16:03 ` Ingo Molnar 2004-03-31 2:30 ` Nick Piggin 0 siblings, 1 reply; 68+ messages in thread From: Ingo Molnar @ 2004-03-30 16:03 UTC (permalink / raw) To: linux-kernel Cc: Erich Focht, nickpiggin, mbligh, jun.nakajima, ricklind, akpm, kernel, rusty, anton, lse-tech, Andi Kleen

the latest scheduler patch, against 2.6.5-rc3-mm1, can be found at:

	redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc3-mm1-A0

this includes:

 - fork/clone-time balancing. It looks quite good here, but needs more testing for impact.

 - a minor fix for passive balancing. (calculating at a -1 load level was not perfectly precise with a runqueue length of ~4 or longer.)

 - use sync wakeups for parent-wakeup. This makes a single-task strace execute on only one CPU on SMP, which is precisely what we want. It should also be a speedup for a number of workloads where the parent is actively wait4()-ing for the child to exit.

	Ingo

^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [patch] sched-2.6.5-rc3-mm1-A0 2004-03-30 16:03 ` [patch] sched-2.6.5-rc3-mm1-A0 Ingo Molnar @ 2004-03-31 2:30 ` Nick Piggin 0 siblings, 0 replies; 68+ messages in thread From: Nick Piggin @ 2004-03-31 2:30 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Erich Focht, mbligh, jun.nakajima, ricklind, akpm, kernel, rusty, anton, lse-tech, Andi Kleen Ingo Molnar wrote: > - use sync wakeups for parent-wakeup. This makes a single-task strace > execute on only one CPU on SMP, which is precisely what we want. It > should also be a speedup for a number of workloads where the parent > is actively wait4()-ing for the child to exit. Nice ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 10:04 ` Erich Focht 2004-03-30 10:58 ` Andi Kleen @ 2004-03-30 11:02 ` Andrew Morton [not found] ` <20040330161438.GA2257@elte.hu> 2004-03-31 18:59 ` [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Erich Focht 2004-03-31 2:08 ` Nick Piggin 2 siblings, 2 replies; 68+ messages in thread From: Andrew Morton @ 2004-03-30 11:02 UTC (permalink / raw) To: Erich Focht Cc: nickpiggin, mbligh, mingo, ak, jun.nakajima, ricklind, linux-kernel, kernel, rusty, anton, lse-tech Erich Focht <efocht@hpce.nec.com> wrote: > > > And finally, HPC > > applications are the very ones that should be using CPU > > affinities because they are usually tuned quite tightly to the > > specific architecture. > > There are companies mainly selling NUMA machines for HPC (SGI?), so > this is not a niche market. It is niche in terms of number of machines and in terms of affected users. And the people who provide these machines have the resources to patch the scheduler if needs be. Correct me if I'm wrong, but what we have here is a situation where if we design the scheduler around the HPC requirement, it will work poorly in a significant number of other applications. And we don't see a way of fixing this without either a /proc/i-am-doing-hpc, or a config option, or requiring someone to carry an external patch, yes? If so then all of those seem reasonable options to me. We should optimise the scheduler for the common case, and that ain't HPC. If we agree that architecturally sched-domains _can_ satisfy the HPC requirement then I think that's good enough for now. I'd prefer that Ingo and Nick not have to bust a gut trying to get optimum HPC performance before the code is even merged up. Do you agree that sched-domains is architected appropriately? ^ permalink raw reply [flat|nested] 68+ messages in thread
* [patch] new-context balancing, 2.6.5-rc3-mm1 [not found] ` <20040330162514.GA2943@elte.hu> @ 2004-03-30 21:03 ` Ingo Molnar 2004-03-31 2:30 ` Nick Piggin 0 siblings, 1 reply; 68+ messages in thread From: Ingo Molnar @ 2004-03-30 21:03 UTC (permalink / raw) To: Andrew Morton Cc: Erich Focht, nickpiggin, mbligh, ak, jun.nakajima, ricklind, linux-kernel, kernel, rusty, anton, lse-tech

[-- Attachment #1: Type: text/plain, Size: 384 bytes --]

i've attached sched-balance-context.patch, which is the current version of fork()/clone() balancing, against 2.6.5-rc3-mm1. Changes:

 - only balance CLONE_VM threads

 - take ->cpus_allowed into account when balancing.

i've checked kernel recompiles and while they didn't hurt from fork() balancing on an 8-way SMP box, i implemented the thread-only balancing nevertheless.

	Ingo

[-- Attachment #2: sched-balance-context.patch --]
[-- Type: text/plain, Size: 4796 bytes --]

--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -715,12 +715,17 @@ extern void do_timer(struct pt_regs *);
 extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state));
 extern int FASTCALL(wake_up_process(struct task_struct * tsk));
+extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
 #ifdef CONFIG_SMP
 extern void kick_process(struct task_struct *tsk);
+extern void FASTCALL(wake_up_forked_thread(struct task_struct * tsk));
 #else
 static inline void kick_process(struct task_struct *tsk) { }
+static inline void wake_up_forked_thread(struct task_struct * tsk)
+{
+	return wake_up_forked_process(tsk);
+}
 #endif
-extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
 extern void FASTCALL(sched_fork(task_t * p));
 extern void FASTCALL(sched_exit(task_t * p));
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -1139,6 +1137,119 @@ enum idle_type
 };
 
 #ifdef CONFIG_SMP
+
+/*
+ * find_idlest_cpu - find the least busy runqueue.
+ */
+static int find_idlest_cpu(int this_cpu, runqueue_t *this_rq, cpumask_t mask)
+{
+	unsigned long load, min_load, this_load;
+	int i, min_cpu;
+	cpumask_t tmp;
+
+	min_cpu = UINT_MAX;
+	min_load = ULONG_MAX;
+
+	cpus_and(tmp, mask, cpu_online_map);
+	for_each_cpu_mask(i, tmp) {
+		load = cpu_load(i);
+
+		if (load < min_load) {
+			min_cpu = i;
+			min_load = load;
+
+			/* break out early on an idle CPU: */
+			if (!min_load)
+				break;
+		}
+	}
+
+	/* add +1 to account for the new task */
+	this_load = cpu_load(this_cpu) + SCHED_LOAD_SCALE;
+
+	/*
+	 * Would the addition of the new task to the current CPU
+	 * create an imbalance between this CPU and the idlest CPU?
+	 */
+	if (min_load*this_rq->sd->imbalance_pct < 100*this_load)
+		return min_cpu;
+
+	return this_cpu;
+}
+
+/*
+ * wake_up_forked_thread - wake up a freshly forked thread.
+ *
+ * This function will do some initial scheduler statistics housekeeping
+ * that must be done for every newly created context, and it also does
+ * runqueue balancing.
+ */
+void fastcall wake_up_forked_thread(task_t * p)
+{
+	unsigned long flags;
+	int this_cpu = get_cpu(), cpu;
+	runqueue_t *this_rq = cpu_rq(this_cpu), *rq;
+
+	/*
+	 * Migrate the new context to the least busy CPU,
+	 * if that CPU is out of balance.
+	 */
+	cpu = find_idlest_cpu(this_cpu, this_rq, p->cpus_allowed);
+
+	local_irq_save(flags);
+lock_again:
+	rq = cpu_rq(cpu);
+	double_rq_lock(this_rq, rq);
+
+	BUG_ON(p->state != TASK_RUNNING);
+
+	/*
+	 * We did find_idlest_cpu() unlocked, so in theory
+	 * the mask could have changed:
+	 */
+	if (!cpu_isset(cpu, p->cpus_allowed)) {
+		cpu = any_online_cpu(p->cpus_allowed);
+		double_rq_unlock(this_rq, rq);
+		goto lock_again;
+	}
+	/*
+	 * We decrease the sleep average of forking parents
+	 * and children as well, to keep max-interactive tasks
+	 * from forking tasks that are max-interactive.
+	 */
+	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
+		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+
+	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
+		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+
+	p->interactive_credit = 0;
+
+	p->prio = effective_prio(p);
+	set_task_cpu(p, cpu);
+
+	if (cpu == this_cpu) {
+		if (unlikely(!current->array))
+			__activate_task(p, rq);
+		else {
+			p->prio = current->prio;
+			list_add_tail(&p->run_list, &current->run_list);
+			p->array = current->array;
+			p->array->nr_active++;
+			rq->nr_running++;
+		}
+	} else {
+		__activate_task(p, rq);
+		if (TASK_PREEMPTS_CURR(p, rq))
+			resched_task(rq->curr);
+	}
+
+	double_rq_unlock(this_rq, rq);
+	local_irq_restore(flags);
+	put_cpu();
+}
+
 /*
  * If dest_cpu is allowed for this process, migrate the task to it.
  * This is accomplished by forcing the cpu_allowed mask to only
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -1179,9 +1179,23 @@ long do_fork(unsigned long clone_flags,
 		set_tsk_thread_flag(p, TIF_SIGPENDING);
 	}
 
-	if (!(clone_flags & CLONE_STOPPED))
-		wake_up_forked_process(p);	/* do this last */
-	else
+	if (!(clone_flags & CLONE_STOPPED)) {
+		/*
+		 * Do the wakeup last. On SMP we treat fork() and
+		 * CLONE_VM separately, because fork() has already
+		 * created cache footprint on this CPU (due to
+		 * copying the pagetables), hence migration would
+		 * probably be costly. Threads on the other hand
+		 * have less traction to the current CPU, and if
+		 * there's an imbalance then the scheduler can
+		 * migrate this fresh thread now, before it
+		 * accumulates a larger cache footprint:
+		 */
+		if (clone_flags & CLONE_VM)
+			wake_up_forked_thread(p);
+		else
+			wake_up_forked_process(p);
+	} else
 		p->state = TASK_STOPPED;
 	++total_forks;

^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [patch] new-context balancing, 2.6.5-rc3-mm1 2004-03-30 21:03 ` [patch] new-context balancing, 2.6.5-rc3-mm1 Ingo Molnar @ 2004-03-31 2:30 ` Nick Piggin 0 siblings, 0 replies; 68+ messages in thread From: Nick Piggin @ 2004-03-31 2:30 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Erich Focht, mbligh, ak, jun.nakajima, ricklind, linux-kernel, kernel, rusty, anton, lse-tech Ingo Molnar wrote: > i've attached sched-balance-context.patch, which is the current version > of fork()/clone() balancing, against 2.6.5-rc3-mm1. > > Changes: > > - only balance CLONE_VM threads > > - take ->cpus_allowed into account when balancing. > > i've checked kernel recompiles and while they didn't hurt from fork() > balancing on an 8-way SMP box, i implemented the thread-only balancing > nevertheless. You'd probably want to be testing on a NUMA machine to bring out any problems. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 11:02 ` [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Andrew Morton [not found] ` <20040330161438.GA2257@elte.hu> @ 2004-03-31 18:59 ` Erich Focht 1 sibling, 0 replies; 68+ messages in thread From: Erich Focht @ 2004-03-31 18:59 UTC (permalink / raw) To: Andrew Morton Cc: nickpiggin, mbligh, mingo, ak, jun.nakajima, ricklind, linux-kernel, kernel, rusty, anton, lse-tech On Tuesday 30 March 2004 13:02, Andrew Morton wrote: > Erich Focht <efocht@hpce.nec.com> wrote: > > > And finally, HPC > > > applications are the very ones that should be using CPU > > > affinities because they are usually tuned quite tightly to the > > > specific architecture. > > > > There are companies mainly selling NUMA machines for HPC (SGI?), so > > this is not a niche market. > > It is niche in terms of number of machines and in terms of affected users. > And the people who provide these machines have the resources to patch the > scheduler if needs be. Uhm, depends on the CPUs you think of. I bet much more than half of the Opterons and Itanium2 CPUs sold last year went into HPC. Certainly not so many IA64s went into NUMA machines. But almost all Opterons ;-) IBM's NUMA machines with Power CPUs are mainly sold with AIX into the HPC market; I don't recall having seen big HPC installations with HP Superdome under Linux, not yet...? IBM sells x86-NUMA more into the commercial market, the only big visible Linux-NUMA in HPC is SGI's Altix. Most of the other NUMA machines go into HPC with other OSes and we don't care about them (yet?). So you're probably right about the number of Linux-NUMA-HPC users, but this actually shows that Linux-NUMA is currently not the ideal choice. We're working on it, right? > Correct me if I'm wrong, but what we have here is a situation where if we > design the scheduler around the HPC requirement, it will work poorly in a > significant number of other applications. And we don't see a way of fixing > this without either a /proc/i-am-doing-hpc, or a config option, or > requiring someone to carry an external patch, yes? > > If so then all of those seem reasonable options to me. We should optimise > the scheduler for the common case, and that ain't HPC. Yes! A per process flag would be enough to have the choice. > If we agree that architecturally sched-domains _can_ satisfy the HPC > requirement then I think that's good enough for now. I'd prefer that Ingo > and Nick not have to bust a gut trying to get optimum HPC performance > before the code is even merged up. Sure. On the other hand the benchmark brought into discussion by Andi is very easy to understand, much easier than any Java monster. If the scheduler doesn't have a knob for running this optimally, that's disappointing. > Do you agree that sched-domains is architected appropriately? My current impression is: YES. My testing experience with it is still very limited... Regards, Erich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 10:04 ` Erich Focht 2004-03-30 10:58 ` Andi Kleen 2004-03-30 11:02 ` [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Andrew Morton @ 2004-03-31 2:08 ` Nick Piggin 2004-03-31 22:23 ` Erich Focht 2 siblings, 1 reply; 68+ messages in thread From: Nick Piggin @ 2004-03-31 2:08 UTC (permalink / raw) To: Erich Focht Cc: Martin J. Bligh, Ingo Molnar, Andi Kleen, Nakajima, Jun, Rick Lindsley, linux-kernel, akpm, kernel, rusty, anton, lse-tech Erich Focht wrote: > Hi Nick, > Hi Erich, > On Tuesday 30 March 2004 11:05, Nick Piggin wrote: > >>I'm with Martin here, we are just about to merge all this >>sched-domains stuff. So we should at least wait until after >>that. And of course, *nothing* gets changed without at least >>one benchmark that shows it improves something. So far >>nobody has come up to the plate with that. > > > I thought you were talking the whole time about STREAM. That is THE > benchmark which shows you an impact of balancing at fork. And it is a > VERY relevant benchmark. Though you shouldn't run it on historical > machines like NUMAQ, no compute center in the western world will buy > NUMAQs for high performance... Andi typically runs STREAM on all CPUs > of a machine. Try on N/2 and N/4 and so on, you'll see the impact. > Well yeah, but the immediate problem was that sched-domains was *much* worse than 2.6's numasched, neither of which balance on fork/clone. I didn't want to obscure the issue by implementing balance on fork/clone until we worked out exactly the problem. Anyway, once sched-domains goes in, you can basically do whatever you like without impacting anyone else... >> >>There are other things, like Java, servers, etc. that use threads. > > > I'm just saying that you should have the choice. The default should be > as before, balance at exec(). > Yeah well that is a very sane thing to do ;) > >>The point is that we have never had this before, and nobody >>(until now) has been asking for it. And there are as yet no > > > ?? Sorry, I've had balance at fork since 2001 in the NEC IA64 NUMA > kernels and users use it intensively with OpenMP. Advertised it a lot, > asked for it, talked about it at the last OLS. Only IA64 was > considered rare big iron. I understand that the issue gets hotter if > the problem hurts on AMD64... > Sorry I hadn't realised. I guess because you are happy with your own stuff you don't make too much noise about it on the list lately. I apologise. I wonder though, why don't you just teach OpenMP to use affinities as well? Surely that is better than relying on the behaviour of the scheduler, even if it does balance on clone. > >>convincing benchmarks that even show best case improvements. And >>it could very easily have some bad cases. > > > Again: I'm talking about having the choice. The user decides. Nothing > protects you against user stupidity, but if they just have the choice > of poor automatic initial scheduling, it's not enough. And: having the > fork/clone initial balancing policy means: you don't need to make your > code complicated and unportable by playing with setaffinity (which is > just plainly unusable when you share the machine with other users). > If you do it by hand, you know exactly what is going to happen, and you can turn off the balance-on-clone flags and you don't incur the hit of pulling in remote cachelines from every CPU at clone time to do balancing. Surely an HPC application wouldn't mind doing that? 
(I guess they probably don't call clone a lot though). > >>And finally, HPC >>applications are the very ones that should be using CPU >>affinities because they are usually tuned quite tightly to the >>specific architecture. > > > There are companies mainly selling NUMA machines for HPC (SGI?), so > this is not a niche market. Clusters of big NUMA machines are not > unusual, and they're typically not used for databases but for HPC > apps. Unfortunately proprietary UNIX is still considered to have > better features than Linux for such configurations. > Well, SGI should be doing tests soon and tuning the scheduler to their liking. Hopefully others will too, so we'll see what happens. > >>Let's just make sure we don't change defaults without any >>reason... > > > No reason? Aaarghh... >;-) > Sorry I mean evidence. I'm sure with a properly tuned implementation, you could get really good speedups in lots of places... I just want to *see* them. All I have seen so far is Andi getting a bit better performance on something where he can get *much* better performance by making a trivial tweak instead. I really don't have the software or hardware to test this at all so I just have to sit and watch. ^ permalink raw reply [flat|nested] 68+ messages in thread
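For reference, "doing it by hand" from userspace means something like the sketch below: each worker pins itself to one CPU with sched_setaffinity(2). The glibc wrapper's prototype changed more than once in this period, so treat the signature as approximate:

#define _GNU_SOURCE
#include <sched.h>

/* pin the calling thread/process to a single CPU;
 * pid 0 means "the caller" */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	return sched_setaffinity(0, sizeof(mask), &mask);
}

This is exactly the unportability Erich objects to: the program has to know the machine's CPU numbering, and it behaves badly the moment the machine is shared with other users.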
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-31 2:08 ` Nick Piggin @ 2004-03-31 22:23 ` Erich Focht 0 siblings, 0 replies; 68+ messages in thread From: Erich Focht @ 2004-03-31 22:23 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Ingo Molnar, Andi Kleen, Nakajima, Jun, Rick Lindsley, linux-kernel, akpm, kernel, rusty, anton, lse-tech On Wednesday 31 March 2004 04:08, Nick Piggin wrote: > >>I'm with Martin here, we are just about to merge all this > >>sched-domains stuff. So we should at least wait until after > >>that. And of course, *nothing* gets changed without at least > >>one benchmark that shows it improves something. So far > >>nobody has come up to the plate with that. > > > > I thought you were talking the whole time about STREAM. That is THE > > benchmark which shows you an impact of balancing at fork. And it is a > > VERY relevant benchmark. Though you shouldn't run it on historical > > machines like NUMAQ, no compute center in the western world will buy > > NUMAQs for high performance... Andi typically runs STREAM on all CPUs > > of a machine. Try on N/2 and N/4 and so on, you'll see the impact. > > Well yeah, but the immediate problem was that sched-domains was > *much* worse than 2.6's numasched, neither of which balance on > fork/clone. I didn't want to obscure the issue by implementing > balance on fork/clone until we worked out exactly the problem. I had the feeling that solving the performance issue reported by Andi would ease the integration into the baseline... > >>The point is that we have never had this before, and nobody > >>(until now) has been asking for it. And there are as yet no > > > > ?? Sorry, I've had balance at fork since 2001 in the NEC IA64 NUMA > > kernels and users use it intensively with OpenMP. Advertised it a lot, > > asked for it, talked about it at the last OLS. Only IA64 was > > considered rare big iron. I understand that the issue gets hotter if > > the problem hurts on AMD64... > > Sorry I hadn't realised. I guess because you are happy with > your own stuff you don't make too much noise about it on the > list lately. I apologise. The usual excuse: busy with other stuff... > I wonder though, why don't you just teach OpenMP to use > affinities as well? Surely that is better than relying on the > behaviour of the scheduler, even if it does balance on clone. You mean in the compiler? I don't think this is a good idea, that way you lose flexibility in resource overcommitment. And performance when overselling the machine's CPUs. > > Again: I'm talking about having the choice. The user decides. Nothing > > protects you against user stupidity, but if they just have the choice > > of poor automatic initial scheduling, it's not enough. And: having the > > fork/clone initial balancing policy means: you don't need to make your > > code complicated and unportable by playing with setaffinity (which is > > just plainly unusable when you share the machine with other users). > > If you do it by hand, you know exactly what is going to happen, > and you can turn off the balance-on-clone flags and you don't > incur the hit of pulling in remote cachelines from every CPU at > clone time to do balancing. Surely an HPC application wouldn't > mind doing that? (I guess they probably don't call clone a lot > though). OpenMP is implemented with clone. MPI parallel applications just exec, they're fine. 
IMO the static affinity/cpumask handling should be done externally by some resource manager which has a good overview of the long-term load of the machine. It's a different issue, nothing for the scheduler. I wouldn't leave it to the program, too inflexible and unportable across machines and OSes. > > There are companies mainly selling NUMA machines for HPC (SGI?), so > > this is not a niche market. Clusters of big NUMA machines are not > > unusual, and they're typically not used for databases but for HPC > > apps. Unfortunately proprietary UNIX is still considered to have > > better features than Linux for such configurations. > > Well, SGI should be doing tests soon and tuning the scheduler > to their liking. Hopefully others will too, so we'll see what > happens. Maybe they are happy with their stuff, too. They have the cpumemsets and some external affinity control, AFAIK. > >>Let's just make sure we don't change defaults without any > >>reason... > > > > No reason? Aaarghh... >;-) > > Sorry I mean evidence. I'm sure with a properly tuned > implementation, you could get really good speedups in lots > of places... I just want to *see* them. All I have seen so > far is Andi getting a bit better performance on something > where he can get *much* better performance by making a > trivial tweak instead. I get the feeling that Andi's simple OpenMP job is already complex enough to lead to wrong initial scheduling with the current approach. I suppose the reason is the 1-2 helper threads which are started together with the worker threads (depending on the used compiler). On small machines (and 4 cpus is small) they significantly disturb the initial task distribution. For example with the Intel compiler and 4 worker threads you get 6 tasks. The helper tasks are typically runnable when the code starts so you get (in order of creation):

CPU   Task   Role
1     1      worker
2     2      helper
3     3      helper
4     4      worker
1-4   5      worker
1-4   6      worker

So the difficulty is to find out which task will do real work and which task is just spoiling the statistics. I think... Regards, Erich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-29 22:30 ` Erich Focht 2004-03-30 9:05 ` Nick Piggin @ 2004-03-30 15:01 ` Martin J. Bligh 2004-03-31 21:23 ` Erich Focht 1 sibling, 1 reply; 68+ messages in thread From: Martin J. Bligh @ 2004-03-30 15:01 UTC (permalink / raw) To: Erich Focht, Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech --Erich Focht <efocht@hpce.nec.com> wrote (on Tuesday, March 30, 2004 00:30:25 +0200): > On Thursday 25 March 2004 23:28, Martin J. Bligh wrote: >> Can we hold off on changing the fork/exec time balancing until we've >> come to a plan as to what should actually be done with it? Unless we're >> giving it some hint from userspace, it's frigging hard to be sure if >> it's going to exec or not - and the vast majority of things do. > > After more than a year (or two?) of discussions there's no better idea > yet than giving a userspace hint. Default should be to balance at > exec(), and maybe use a syscall for saying: balance all children a > particular process is going to fork/clone at creation time. Everybody > reached the insight that we can't foresee what's optimal, so there is > only one solution: control the behavior. Give the user a tool to > improve the performance. Just a small inheritable variable in the task > structure is enough. Whether you give the hint at or before run-time > or even at compile-time is not really the point... Agreed ... absolutely. > I don't think it's worth to wait and hope that somebody shows up with > a magic algorithm which balances every kind of job optimally. Especially as I don't believe that exists ;-) It's not deterministic. >> Clone is a much more interesting case, though at the time, I consciously >> decided NOT to do that, as we really mostly want threads on the same >> node. > > That is not true in the case of HPC applications. And if someone uses > OpenMP he is just doing that kind of stuff. I consider STREAM a good > benchmark because it shows exactly the problem of HPC applications: > they need a lot of memory bandwidth, they don't run in cache and the > tasks live really long. Spreading those tasks across the nodes gives > me more bandwidth per task and I accumulate the positive effect > because the tasks run for hours or days. It's a simple and clear case > where the scheduler should be improved. > > Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 > are not relevant for HPC. In a compute center it actually doesn't > matter much whether some shell command returns 10% faster, it just > shouldn't disturb my super simulation code for which I bought an > expensive NUMA box. OK, but the scheduler can't know the difference automatically, I don't think ... and whether we should tune the scheduler for "user work" or HPC is going to be a hotly contested point ;-) We need to try to find something that works for both. And suppose you have a 4 node system, with 4 HPC apps running? Surely you want each app to have one node to itself? That's more the case I'm worried about than "user work" vs HPC, to be honest. M. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 15:01 ` Martin J. Bligh @ 2004-03-31 21:23 ` Erich Focht 2004-03-31 21:33 ` Martin J. Bligh 0 siblings, 1 reply; 68+ messages in thread From: Erich Focht @ 2004-03-31 21:23 UTC (permalink / raw) To: Martin J. Bligh, Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech On Tuesday 30 March 2004 17:01, Martin J. Bligh wrote: > > I don't think it's worth to wait and hope that somebody shows up with > > a magic algorithm which balances every kind of job optimally. > > Especially as I don't believe that exists ;-) It's not deterministic. Right, so let's choose the initial balancing policy on a per process basis. > > Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 > > are not relevant for HPC. In a compute center it actually doesn't > > matter much whether some shell command returns 10% faster, it just > > shouldn't disturb my super simulation code for which I bought an > > expensive NUMA box. > > OK, but the scheduler can't know the difference automatically, I don't > think ... and whether we should tune the scheduler for "user work" or > HPC is going to be a hotly contested point ;-) We need to try to find > something that works for both. And suppose you have a 4 node system, > with 4 HPC apps running? Surely you want each app to have one node to > itself? If the machine is 100% full all the time and all apps demand the same amount of bandwidth, yes, I want 1 job per node. If the average load is less than 100% (sometimes only 2-3 jobs are running) then I'd prefer to spread the processes of a job across the machine. The average bandwidth per process will be higher. Modern NUMA machines have big bandwidth to neighboring nodes and not too bad latency penalties for remote accesses. Regards, Erich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-31 21:23 ` Erich Focht @ 2004-03-31 21:33 ` Martin J. Bligh 0 siblings, 0 replies; 68+ messages in thread From: Martin J. Bligh @ 2004-03-31 21:33 UTC (permalink / raw) To: Erich Focht, Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech > On Tuesday 30 March 2004 17:01, Martin J. Bligh wrote: >> > I don't think it's worth to wait and hope that somebody shows up with >> > a magic algorithm which balances every kind of job optimally. >> >> Especially as I don't believe that exists ;-) It's not deterministic. > > Right, so let's choose the initial balancing policy on a per process > basis. Yup, that seems like a reasonable thing to do. That way you can override it for things that fork and never exec, if they're performance critical (like HPC maybe). >> > Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 >> > are not relevant for HPC. In a compute center it actually doesn't >> > matter much whether some shell command returns 10% faster, it just >> > shouldn't disturb my super simulation code for which I bought an >> > expensive NUMA box. >> >> OK, but the scheduler can't know the difference automatically, I don't >> think ... and whether we should tune the scheduler for "user work" or >> HPC is going to be a hotly contested point ;-) We need to try to find >> something that works for both. And suppose you have a 4 node system, >> with 4 HPC apps running? Surely you want each app to have one node to >> itself? > > If the machine is 100% full all the time and all apps demand the same > amount of bandwidth, yes, I want 1 job per node. If the average load is > less than 100% (sometimes only 2-3 jobs are running) then I'd prefer to > spread the processes of a job across the machine. The average bandwidth > per process will be higher. Modern NUMA machines have big bandwidth to > neighboring nodes and not too bad latency penalties for remote accesses. In theory at least, doing the rebalance_on_clone if and only if there are idle procs on another node sounds reasonable. In practice, I'm not sure how well that'll work, since one app may well start wholly before another, but maybe we can figure out something smart to do. M. ^ permalink raw reply [flat|nested] 68+ messages in thread
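Martin's "if and only if there are idle procs" variant would amount to a guard in front of the clone-time balance. A sketch, using the stock idle_cpu() helper; where exactly the check would sit is illustrative:

/* kernel/fork.c (sketch): only pay for clone-time balancing when some
 * online CPU is actually sitting idle. */
static int any_cpu_idle(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		if (idle_cpu(cpu))
			return 1;
	return 0;
}

	/* in the do_fork() wakeup path, replacing the unconditional choice: */
	if ((clone_flags & CLONE_VM) && any_cpu_idle())
		wake_up_forked_thread(p);	/* spread out */
	else
		wake_up_forked_process(p);	/* stay local */

The obvious weakness is that this check is only a momentary snapshot - as Martin notes above, one app may start wholly before another.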
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 15:40 ` Andi Kleen 2004-03-25 19:09 ` Ingo Molnar @ 2004-03-25 21:59 ` Ingo Molnar 2004-03-25 22:26 ` Rick Lindsley 2004-03-25 22:30 ` Andrew Theurer 2004-03-26 3:23 ` Nick Piggin 2 siblings, 2 replies; 68+ messages in thread From: Ingo Molnar @ 2004-03-25 21:59 UTC (permalink / raw) To: Andi Kleen Cc: Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh * Andi Kleen <ak@suse.de> wrote: > It doesn't do load balance in wake_up_forked_process() and is > relatively non aggressive in balancing later. This leads to the > multithreaded OpenMP STREAM running its childs first on the same node > as the original process and allocating memory there. Then later they > run on a different node when the balancing finally happens, but > generate cross traffic to the old node, instead of using the memory > bandwidth of their local nodes. > > The difference is very visible, even the 4 thread STREAM only sees the > bandwidth of a single node. With a more aggressive scheduler you get 4 > times as much. > > Admittedly it's a bit of a stupid benchmark, but seems to > representative for a lot of HPC codes. There's no way the scheduler can figure out the scheduling and memory use patterns of the new tasks in advance. but userspace could give hints - e.g. a syscall that triggers a rebalancing: sys_sched_load_balance(). This way userspace notifies the scheduler that it is on 'zero ground' and that the scheduler can move it to the least loaded cpu/node. a variant of this is already possible, userspace can use setaffinity to load-balance manually - but sched_load_balance() would be automatic. Ingo ^ permalink raw reply [flat|nested] 68+ messages in thread
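Using such a hint from a threaded program would be trivial - a sketch, with the syscall number purely hypothetical, since sys_sched_load_balance() was never merged:

#include <unistd.h>
#include <sys/syscall.h>

#define __NR_sched_load_balance	9999	/* hypothetical: never merged, so
					   this returns -ENOSYS on any
					   real kernel */

static void *worker(void *arg)
{
	/* tell the scheduler this thread is on "zero ground" - no
	 * cache or memory footprint yet - so it may migrate the
	 * thread to the least loaded CPU/node essentially for free */
	syscall(__NR_sched_load_balance);

	/* ... long-running, bandwidth-hungry work follows ... */
	return arg;
}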
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 21:59 ` Ingo Molnar @ 2004-03-25 22:26 ` Rick Lindsley 2004-03-25 22:30 ` Andrew Theurer 1 sibling, 0 replies; 68+ messages in thread From: Rick Lindsley @ 2004-03-25 22:26 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh There's no way the scheduler can figure out the scheduling and memory use patterns of the new tasks in advance. True. Four threads may want to stay on the same node because they are sharing a lot of data and working on something in parallel, or they may want to go to different nodes because the only thing they have in common is a control structure that directs their (largely independent but highly synchronized) efforts. A while ago there was some effort at user-level page replication, which meant you took a hit once but after that you'd effectively migrated a page to your local memory. The longer you stayed put, the more local your RSS got. I seem to recall some bugs or caveats, though. Anybody know the state of that? It might take the burden off the scheduler using a crystal ball and putting it on a 20/20-hindsight VM system instead. Rick ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 21:59 ` Ingo Molnar 2004-03-25 22:26 ` Rick Lindsley @ 2004-03-25 22:30 ` Andrew Theurer 2004-03-25 22:38 ` Martin J. Bligh 2004-03-26 1:29 ` Andi Kleen 1 sibling, 2 replies; 68+ messages in thread From: Andrew Theurer @ 2004-03-25 22:30 UTC (permalink / raw) To: Ingo Molnar, Andi Kleen Cc: Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh On Thursday 25 March 2004 15:59, Ingo Molnar wrote: > * Andi Kleen <ak@suse.de> wrote: > > It doesn't do load balance in wake_up_forked_process() and is > > relatively non aggressive in balancing later. This leads to the > > multithreaded OpenMP STREAM running its childs first on the same node > > as the original process and allocating memory there. Then later they > > run on a different node when the balancing finally happens, but > > generate cross traffic to the old node, instead of using the memory > > bandwidth of their local nodes. > > > > The difference is very visible, even the 4 thread STREAM only sees the > > bandwidth of a single node. With a more aggressive scheduler you get 4 > > times as much. > > > > Admittedly it's a bit of a stupid benchmark, but seems to > > representative for a lot of HPC codes. > > There's no way the scheduler can figure out the scheduling and memory > use patterns of the new tasks in advance. > > but userspace could give hints - e.g. a syscall that triggers a > rebalancing: sys_sched_load_balance(). This way userspace notifies the > scheduler that it is on 'zero ground' and that the scheduler can move it > to the least loaded cpu/node. > > a variant of this is already possible, userspace can use setaffinity to > load-balance manually - but sched_load_balance() would be automatic. For Opteron simply placing all cpus in the same sched domain may solve all of this, since we will have the balancing frequency of the default scheduler. Is there any reason this cannot be done for Opteron? Also, I think Erich Focht had another patch which would allow much more frequent node balancing if nr_cpus_node was 1. ^ permalink raw reply [flat|nested] 68+ messages in thread
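In sched-domains terms Andrew's suggestion is a degenerate topology: give every CPU a single bottom-level domain that spans the whole machine, so the aggressive SMP-level balancing intervals apply across nodes too. A sketch in the style of the per-CPU domain setup of the patches under discussion; the exact initializer and attach names varied between revisions, so take this as illustrative:

static DEFINE_PER_CPU(struct sched_domain, opteron_domains);

static void __init arch_init_sched_domains(void)
{
	int i;

	/* one flat domain per CPU, spanning every node: */
	for_each_cpu(i) {
		struct sched_domain *sd = &per_cpu(opteron_domains, i);

		*sd = SD_CPU_INIT;		/* SMP-level parameters... */
		sd->span = cpu_possible_map;	/* ...across the whole box */
		cpu_attach_domain(sd, i);
	}
}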
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 22:30 ` Andrew Theurer @ 2004-03-25 22:38 ` Martin J. Bligh 0 siblings, 0 replies; 68+ messages in thread From: Martin J. Bligh @ 2004-03-25 22:38 UTC (permalink / raw) To: Andrew Theurer, Ingo Molnar, Andi Kleen Cc: Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech > For Opteron simply placing all cpus in the same sched domain may solve all of > this, since we will have the balancing frequency of the default scheduler. Is > there any reason this cannot be done for Opteron? That seems like a good plan to me - they really don't want that cross-node balancing. It might be cleaner to implement it by just tweaking the cross-balance parameters for that system to have the same effect, but it probably doesn't matter much (I'm thinking of some future case when they decide to do multi-chip on die or SMT, so just keying off 1 cpu per node doesn't really fix it). M. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
  2004-03-25 22:30 ` Andrew Theurer
  2004-03-25 22:38 ` Martin J. Bligh
@ 2004-03-26  1:29 ` Andi Kleen
  1 sibling, 0 replies; 68+ messages in thread
From: Andi Kleen @ 2004-03-26 1:29 UTC (permalink / raw)
To: Andrew Theurer
Cc: mingo, jun.nakajima, ricklind, piggin, linux-kernel, akpm, kernel,
    rusty, anton, lse-tech, mbligh

On Thu, 25 Mar 2004 16:30:16 -0600
Andrew Theurer <habanero@us.ibm.com> wrote:

> For Opteron simply placing all cpus in the same sched domain may solve
> all of this, since we will have balancing frequency of the default
> scheduler. Is there any reason this cannot be done for Opteron?

Yes, that makes sense. I will try that.

-Andi

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
  2004-03-25 15:40 ` Andi Kleen
  2004-03-25 19:09 ` Ingo Molnar
  2004-03-25 21:59 ` Ingo Molnar
@ 2004-03-26  3:23 ` Nick Piggin
  2 siblings, 0 replies; 68+ messages in thread
From: Nick Piggin @ 2004-03-26 3:23 UTC (permalink / raw)
To: Andi Kleen
Cc: Nakajima, Jun, Rick Lindsley, Ingo Molnar, piggin, linux-kernel,
    akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:
> On Thu, Mar 25, 2004 at 07:31:37AM -0800, Nakajima, Jun wrote:
>
>> Andi,
>>
>> Can you be more specific with "it doesn't load balance threads
>> aggressively enough"? Or what behavior of the base NUMA scheduler is
>> missing in the sched-domain scheduler especially for NUMA?
>
> It doesn't do load balance in wake_up_forked_process() and is relatively
> non aggressive in balancing later. This leads to the multithreaded OpenMP
> STREAM running its childs first on the same node as the original process
> and allocating memory there. Then later they run on a different node when
> the balancing finally happens, but generate cross traffic to the old node,
> instead of using the memory bandwidth of their local nodes.
>
> The difference is very visible, even the 4 thread STREAM only sees the
> bandwidth of a single node. With a more aggressive scheduler you get
> 4 times as much.
>
> Admittedly it's a bit of a stupid benchmark, but seems to representative
> for a lot of HPC codes.

Hi Andi,

Sorry I keep telling you I'll work on this but never get around to it;
mostly it's lack of hardware that makes it difficult. I've fixed a few
bugs and some other workloads, so I keep hoping they will fix your
problem :P

Your STREAM performance is really bad, and I hope you don't think I'm
going to ignore it even if the benchmark is a bit stupid. Give me a bit
more time.

Of course, there is nothing fundamentally wrong with sched-domains that
would cause your problem; it can easily do anything the old NUMA
scheduler can do. It must be a bug or some bad tuning somewhere.

Nick

^ permalink raw reply	[flat|nested] 68+ messages in thread
Thread overview: 68+ messages
2004-03-25 15:31 [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Nakajima, Jun
2004-03-25 15:40 ` Andi Kleen
2004-03-25 19:09 ` Ingo Molnar
2004-03-25 15:21 ` Andi Kleen
2004-03-25 19:39 ` Ingo Molnar
2004-03-25 20:30 ` Ingo Molnar
2004-03-29 8:45 ` Andi Kleen
2004-03-29 10:20 ` Rick Lindsley
2004-03-29 5:07 ` Andi Kleen
2004-03-29 11:28 ` Nick Piggin
2004-03-29 17:30 ` Rick Lindsley
2004-03-30 0:01 ` Nick Piggin
2004-03-30 1:26 ` Rick Lindsley
2004-03-29 11:20 ` Nick Piggin
2004-03-29 6:01 ` Andi Kleen
2004-03-29 11:46 ` Ingo Molnar
2004-03-29 7:03 ` Andi Kleen
2004-03-29 7:10 ` Andi Kleen
2004-03-29 20:14 ` Andi Kleen
2004-03-29 23:51 ` Nick Piggin
2004-03-30 6:34 ` Andi Kleen
2004-03-30 6:40 ` Ingo Molnar
2004-03-30 7:07 ` Andi Kleen
2004-03-30 7:14 ` Nick Piggin
2004-03-30 7:45 ` Ingo Molnar
2004-03-30 7:58 ` Nick Piggin
2004-03-30 7:15 ` Ingo Molnar
2004-03-30 7:18 ` Nick Piggin
2004-03-30 7:48 ` Andi Kleen
2004-03-30 8:18 ` Ingo Molnar
2004-03-30 9:36 ` Andi Kleen
2004-03-30 7:42 ` Ingo Molnar
2004-03-30 7:03 ` Nick Piggin
2004-03-30 7:13 ` Andi Kleen
2004-03-30 7:24 ` Nick Piggin
2004-03-30 7:38 ` Arjan van de Ven
2004-03-30 7:13 ` Martin J. Bligh
2004-03-30 7:31 ` Nick Piggin
2004-03-30 7:38 ` Martin J. Bligh
2004-03-30 8:05 ` Ingo Molnar
2004-03-30 8:19 ` Nick Piggin
2004-03-30 8:45 ` Ingo Molnar
2004-03-30 8:53 ` Nick Piggin
2004-03-30 15:27 ` Martin J. Bligh
2004-03-25 19:24 ` Martin J. Bligh
2004-03-25 21:48 ` Ingo Molnar
2004-03-25 22:28 ` Martin J. Bligh
2004-03-29 22:30 ` Erich Focht
2004-03-30 9:05 ` Nick Piggin
2004-03-30 10:04 ` Erich Focht
2004-03-30 10:58 ` Andi Kleen
2004-03-30 16:03 ` [patch] sched-2.6.5-rc3-mm1-A0 Ingo Molnar
2004-03-31 2:30 ` Nick Piggin
2004-03-30 11:02 ` [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Andrew Morton
[not found] ` <20040330161438.GA2257@elte.hu>
[not found] ` <20040330161910.GA2860@elte.hu>
[not found] ` <20040330162514.GA2943@elte.hu>
2004-03-30 21:03 ` [patch] new-context balancing, 2.6.5-rc3-mm1 Ingo Molnar
2004-03-31 2:30 ` Nick Piggin
2004-03-31 18:59 ` [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Erich Focht
2004-03-31 2:08 ` Nick Piggin
2004-03-31 22:23 ` Erich Focht
2004-03-30 15:01 ` Martin J. Bligh
2004-03-31 21:23 ` Erich Focht
2004-03-31 21:33 ` Martin J. Bligh
2004-03-25 21:59 ` Ingo Molnar
2004-03-25 22:26 ` Rick Lindsley
2004-03-25 22:30 ` Andrew Theurer
2004-03-25 22:38 ` Martin J. Bligh
2004-03-26 1:29 ` Andi Kleen
2004-03-26 3:23 ` Nick Piggin