* RE: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nakajima, Jun @ 2004-03-25 15:31 UTC
To: Andi Kleen, Rick Lindsley
Cc: Ingo Molnar, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi,

Can you be more specific about "it doesn't load balance threads
aggressively enough"? Or what behavior of the base NUMA scheduler is
missing in the sched-domain scheduler, especially for NUMA?

Jun

>-----Original Message-----
>From: Andi Kleen [mailto:ak@suse.de]
>Sent: Thursday, March 25, 2004 3:47 AM
>To: Rick Lindsley
>Cc: Andi Kleen; Ingo Molnar; piggin@cyberone.com.au;
>linux-kernel@vger.kernel.org; akpm@osdl.org; kernel@kolivas.org;
>rusty@rustcorp.com.au; Nakajima, Jun; anton@samba.org;
>lse-tech@lists.sourceforge.net; mbligh@aracnet.com
>Subject: Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
>
>On Thu, Mar 25, 2004 at 03:40:22AM -0800, Rick Lindsley wrote:
>>     The main problem it has is that it performs quite badly on Opteron NUMA
>>     e.g. in the OpenMP STREAM test (much worse than the normal scheduler)
>>
>> Andi, I've got some schedstat code which may help us to understand why.
>> I'll need to port it to Ingo's changes, but if I drop you a patch in a
>> day or two can you try your test on sched-domain/non-sched-domain,
>> collecting the stats?
>
>The OpenMP failure is already pretty well understood - it doesn't load
>balance threads aggressively enough over CPUs after startup.
>
>-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-25 15:40 UTC
To: Nakajima, Jun
Cc: Andi Kleen, Rick Lindsley, Ingo Molnar, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Thu, Mar 25, 2004 at 07:31:37AM -0800, Nakajima, Jun wrote:
> Andi,
>
> Can you be more specific about "it doesn't load balance threads
> aggressively enough"? Or what behavior of the base NUMA scheduler is
> missing in the sched-domain scheduler, especially for NUMA?

It doesn't do load balancing in wake_up_forked_process() and is
relatively non-aggressive in balancing later. This leads to the
multithreaded OpenMP STREAM running its children first on the same node
as the original process and allocating memory there. When the balancing
finally happens they run on a different node, but generate cross traffic
to the old node instead of using the memory bandwidth of their local
nodes.

The difference is very visible: even the 4-thread STREAM only sees the
bandwidth of a single node. With a more aggressive scheduler you get 4
times as much. Admittedly it's a bit of a stupid benchmark, but it seems
to be representative of a lot of HPC codes.

-Andi
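For concreteness, a minimal sketch of the failure mode Andi describes,
assuming a first-touch page placement policy (whichever CPU first writes
a page determines which node backs it). The code below is illustrative
only, not from the STREAM sources:

/*
 * STREAM-like sketch of the first-touch problem. If the worker
 * threads are still sitting on the parent's node when the arrays
 * are initialized, every page lands on that node, and the threads
 * pull their data across the interconnect after being balanced off.
 * Build: gcc -fopenmp first-touch.c -o first-touch
 */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)

int main(void)
{
	double *a = malloc(N * sizeof(double));
	double *b = malloc(N * sizeof(double));
	long i;

	if (!a || !b)
		return 1;

	/* Serial init: the parent touches every page first, so all
	 * memory is allocated on the parent's node. */
	for (i = 0; i < N; i++)
		a[i] = b[i] = 1.0;

	/* The parallel kernel then runs on whatever nodes the
	 * scheduler balanced the threads to - which only helps if
	 * the pages above were placed on those nodes too. */
	#pragma omp parallel for
	for (i = 0; i < N; i++)
		a[i] = 2.0 * b[i] + a[i];

	printf("%f\n", a[0]);
	free(a);
	free(b);
	return 0;
}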
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-25 19:09 UTC
To: Andi Kleen
Cc: Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

> It doesn't do load balancing in wake_up_forked_process() and is
> relatively non-aggressive in balancing later. This leads to the
> multithreaded OpenMP STREAM running its children first on the same node
> as the original process and allocating memory there. [...]

i believe the fix we want is to pre-balance the context at fork() time.
I've implemented this (which is basically just a reuse of
sched_balance_exec() in fork.c, and the related namespace cleanups),
could you give it a go:

	http://redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm2-A5

another solution would be to add SD_BALANCE_FORK.

also, the best place to do fork() balancing is not at
wake_up_forked_process() time, but prior to doing the MM copy. This
patch does it there. At wakeup time we've already copied all the
pagetables and created tons of dirty cachelines.

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-25 15:21 UTC
To: Ingo Molnar
Cc: jun.nakajima, ricklind, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Thu, 25 Mar 2004 20:09:45 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> also, the best place to do fork() balancing is not at
> wake_up_forked_process() time, but prior to doing the MM copy. This
> patch does it there. At wakeup time we've already copied all the
> pagetables and created tons of dirty cachelines.

That won't help for threaded programs that use clone(). OpenMP is such
a case.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-25 19:39 UTC
To: Andi Kleen
Cc: jun.nakajima, ricklind, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

> That won't help for threaded programs that use clone(). OpenMP is such
> a case.

yeah, agreed. Also, exec-balance, if applied to fork(), would migrate
the parent, which is not what we want. We could perhaps migrate the
parent to the target CPU, copy the context, then migrate the parent
back to the original CPU ... but this sounds too complex.

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-25 20:30 UTC
To: Andi Kleen
Cc: jun.nakajima, ricklind, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

> That won't help for threaded programs that use clone(). OpenMP is such
> a case.

this patch:

	redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm3-A4

does balancing at wake_up_forked_process() time.

but it's a hard issue. Especially after fork() we do have a fair amount
of cache context, and migrating at this point can be bad for
performance.

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 8:45 UTC
To: Ingo Molnar
Cc: Andi Kleen, jun.nakajima, ricklind, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Thu, Mar 25, 2004 at 09:30:32PM +0100, Ingo Molnar wrote:
>
> * Andi Kleen <ak@suse.de> wrote:
>
>> That won't help for threaded programs that use clone(). OpenMP is such
>> a case.
>
> this patch:
>
> 	redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm3-A4
>
> does balancing at wake_up_forked_process() time.
>
> but it's a hard issue. Especially after fork() we do have a fair amount
> of cache context, and migrating at this point can be bad for
> performance.

I ported it by hand to the -mm4 scheduler now and tested it. While it
works marginally better than the standard -mm scheduler (you get 1 1/2
times the bandwidth of one CPU instead of one), it's still much worse
than the optimum of nearly 4 CPUs achieved by 2.4 or the standard
scheduler.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Rick Lindsley @ 2004-03-29 10:20 UTC
To: Andi Kleen
Cc: Ingo Molnar, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

I've got a web page up now on my home machine which shows data from
schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under load
from kernbench, SPECjbb, and SPECdet.

    http://eaglet.rain.com/rick/linux/sched-domain/index.html

Two things stand out. One is that sched-domains tends to call
load_balance() less frequently when it is idle and more frequently when
it is busy (as compared to the "standard" scheduler). Another is that
even though it moves fewer tasks on average, the sched-domains code
shows about half of pull_task()'s work coming from
active_load_balance() ... and that seems wrong. Could these be
contributing to what you're seeing?

Rick
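For anyone wanting to reproduce this kind of comparison, a minimal
sketch of a snapshot-diff tool for counters like these. It deliberately
assumes nothing about the schedstat field layout (which varies between
versions of the patch) and just diffs every numeric token positionally:

/*
 * Snapshot /proc/schedstat twice and print per-field deltas,
 * treating every numeric token as an opaque counter. Field
 * meanings depend on the schedstat version, so none are assumed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAXTOK 4096

static int snap(long long *v)
{
	FILE *f = fopen("/proc/schedstat", "r");
	char tok[128];
	int n = 0;

	if (!f)
		return -1;
	while (n < MAXTOK && fscanf(f, "%127s", tok) == 1) {
		char *end;
		long long x = strtoll(tok, &end, 10);
		/* -1 marks non-numeric tokens like "cpu0" */
		v[n++] = (*end == '\0') ? x : -1;
	}
	fclose(f);
	return n;
}

int main(int argc, char **argv)
{
	static long long a[MAXTOK], b[MAXTOK];
	int secs = argc > 1 ? atoi(argv[1]) : 10;
	int i, n = snap(a);

	if (n <= 0) {
		fprintf(stderr, "cannot read /proc/schedstat\n");
		return 1;
	}
	sleep(secs);	/* run the benchmark in this window */
	if (snap(b) != n) {
		fprintf(stderr, "schedstat changed shape\n");
		return 1;
	}
	for (i = 0; i < n; i++)
		if (a[i] >= 0)
			printf("field %d: +%lld\n", i, b[i] - a[i]);
	return 0;
}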
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 5:07 UTC
To: Rick Lindsley
Cc: mingo, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Mon, 29 Mar 2004 02:20:58 -0800 Rick Lindsley <ricklind@us.ibm.com> wrote:

> I've got a web page up now on my home machine which shows data from
> schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
> load from kernbench, SPECjbb, and SPECdet.
>
>     http://eaglet.rain.com/rick/linux/sched-domain/index.html
>
> Two things stand out. One is that sched-domains tends to call
> load_balance() less frequently when it is idle and more frequently when
> it is busy (as compared to the "standard" scheduler). Another is that
> even though it moves fewer tasks on average, the sched-domains code
> shows about half of pull_task()'s work coming from
> active_load_balance() ... and that seems wrong. Could these be
> contributing to what you're seeing?

Sounds quite possible, yes.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-29 11:28 UTC
To: Rick Lindsley
Cc: Andi Kleen, Ingo Molnar, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Rick Lindsley wrote:

> I've got a web page up now on my home machine which shows data from
> schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
> load from kernbench, SPECjbb, and SPECdet.
>
> http://eaglet.rain.com/rick/linux/sched-domain/index.html

I can't see it.

> Two things stand out. One is that sched-domains tends to call
> load_balance() less frequently when it is idle and more frequently when
> it is busy (as compared to the "standard" scheduler).

John Hawkes noticed problems here too. mm5 has a patch to improve this
for NUMA node balancing. No change on non-NUMA though, if that is what
you were testing - we might need to tune this a bit if it is hurting.

> Another is that even though it moves fewer tasks on average, the
> sched-domains code shows about half of pull_task()'s work coming from
> active_load_balance() ...

Yeah, this is wrong and shouldn't be happening. It would have been due
to a bug in the imbalance calculation, which is now fixed.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Rick Lindsley @ 2004-03-29 17:30 UTC
To: Nick Piggin
Cc: Andi Kleen, Ingo Molnar, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

    Rick Lindsley wrote:
    > I've got a web page up now on my home machine which shows data from
    > schedstats across the various flavors of 2.6.4 and 2.6.5-rc2 under
    > load from kernbench, SPECjbb, and SPECdet.
    >
    > http://eaglet.rain.com/rick/linux/sched-domain/index.html

    I can't see it.

Ack, sorry, wrong path. Hazards of typing at 3am ... should've used cut
'n' paste ...

    http://eaglet.rain.com/rick/linux/results/sched-domain/index.html

Rick
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 0:01 UTC
To: Rick Lindsley
Cc: Andi Kleen, Ingo Molnar, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Rick Lindsley wrote:

> Ack, sorry, wrong path. Hazards of typing at 3am ... should've used
> cut 'n' paste ...
>
>     http://eaglet.rain.com/rick/linux/results/sched-domain/index.html

Hi Rick,

This looks very cool. Very comprehensive. Have you got any plans to
integrate it with sched_domains (so, for example, you can see stats for
each domain)?

I will have to have a look at the code; it should be useful for
testing.

Thanks,
Nick
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Rick Lindsley @ 2004-03-30 1:26 UTC
To: Nick Piggin
Cc: Andi Kleen, Ingo Molnar, jun.nakajima, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

    This looks very cool. Very comprehensive. Have you got any plans to
    integrate it with sched_domains (so, for example, you can see stats
    for each domain)?

Yes -- ideally we can add some stats to domains too, so we can tell
(for example) how often it is adjusting rebalance intervals, or how
many processes are moved as a result of each domain's policy, etc.

Every time I add another counter I cringe a bit, because we don't want
to impose overhead in the scheduler. But so far, using per-cpu data,
utilizing runqueue locking only when it's already in use, and accepting
the minor inaccuracies that may result from the remaining cases seems
to be yielding a pretty good picture of things without imposing a
measurable load.

If you want to start using it yourself, I'm open to feedback. I have
patches for major releases at

    http://oss.software.ibm.com/linux/patches/?patch_id=730

and a host of smaller releases (like rc2-mm5) at eaglet:

    http://eaglet.rain.com/rick/linux/schedstat/

If you're feeling *really* lucky I have a handful of useful but often
ungeneralized tools I can share, like the ones that made that web page.

Rick
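A sketch of the low-overhead counting pattern Rick describes - a
user-space analogue for illustration, not the actual schedstat code:
each CPU (here, thread) increments its own cacheline-padded slot with
no shared lock, and a reader sums the slots, tolerating small races:

/*
 * Illustrative per-cpu statistics counters (user-space analogue).
 * Each thread owns one padded slot, so increments never bounce a
 * shared cacheline or take a lock; the reader sums all slots and
 * accepts slightly stale totals. Build: gcc -std=gnu99 -pthread
 */
#include <pthread.h>
#include <stdio.h>

#define NCPU 4

static struct {
	unsigned long count;
	char pad[64 - sizeof(unsigned long)];	/* one cacheline each */
} stats[NCPU];

static void *worker(void *arg)
{
	long id = (long)arg;

	for (int i = 0; i < 1000000; i++)
		stats[id].count++;	/* owner-only write: no locking */
	return NULL;
}

int main(void)
{
	pthread_t t[NCPU];
	unsigned long total = 0;

	for (long i = 0; i < NCPU; i++)
		pthread_create(&t[i], NULL, worker, (void *)i);
	for (int i = 0; i < NCPU; i++)
		pthread_join(t[i], NULL);
	for (int i = 0; i < NCPU; i++)
		total += stats[i].count;	/* racy-but-close read */
	printf("total events: %lu\n", total);
	return 0;
}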
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-29 11:20 UTC
To: Andi Kleen
Cc: Ingo Molnar, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:
> On Thu, Mar 25, 2004 at 09:30:32PM +0100, Ingo Molnar wrote:
>
>> but it's a hard issue. Especially after fork() we do have a fair amount
>> of cache context, and migrating at this point can be bad for
>> performance.
>
> I ported it by hand to the -mm4 scheduler now and tested it. While it
> works marginally better than the standard -mm scheduler (you get 1 1/2
> times the bandwidth of one CPU instead of one), it's still much worse
> than the optimum of nearly 4 CPUs achieved by 2.4 or the standard
> scheduler.

OK, there must be some pretty simple reason why this is happening.

I guess being OpenMP it is probably a bit complicated for you to try
your own scheduling in userspace using CPU affinities? Otherwise, could
you trace what gets scheduled where for both good and bad kernels? It
should help us work out what is going on.

I wonder if using one CPU from each quad of the NUMAQ would give at all
comparable behaviour...

If it isn't a big problem, could you test with -mm5 with the generic
sched domain? STREAM doesn't take long, does it? I don't expect much
difference, but the code is in flux while Ingo and I try to sort things
out.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 6:01 UTC
To: Nick Piggin
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Mon, 29 Mar 2004 21:20:12 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>> I ported it by hand to the -mm4 scheduler now and tested it. While it
>> works marginally better than the standard -mm scheduler (you get 1 1/2
>> times the bandwidth of one CPU instead of one), it's still much worse
>> than the optimum of nearly 4 CPUs achieved by 2.4 or the standard
>> scheduler.

Sorry, ignore this report - I just found out I booted the wrong kernel
by mistake. Currently retesting, also with the proposed change to only
use a single scheduling domain.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-29 11:46 UTC
To: Andi Kleen
Cc: Nick Piggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

> Sorry, ignore this report - I just found out I booted the wrong kernel
> by mistake. Currently retesting, also with the proposed change to only
> use a single scheduling domain.

here are the items that are in the works:

	redhat.com/~mingo/scheduler-patches/sched.patch

it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active
balancing a bit.

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 7:03 UTC
To: Ingo Molnar
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Mon, 29 Mar 2004 13:46:35 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> * Andi Kleen <ak@suse.de> wrote:
>
>> Sorry, ignore this report - I just found out I booted the wrong kernel
>> by mistake. Currently retesting, also with the proposed change to only
>> use a single scheduling domain.
>
> here are the items that are in the works:
>
> 	redhat.com/~mingo/scheduler-patches/sched.patch

I'm trying to, but -mm5 doesn't work at all on the 4-way machine. It
goes through the full boot-up sequence, but then never opens a login on
the console, and sshd also doesn't work.

Andrew, maybe that's related to your tty fixes?

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 7:10 UTC
To: Andi Kleen
Cc: mingo, nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Mon, 29 Mar 2004 09:03:01 +0200 Andi Kleen <ak@suse.de> wrote:

> I'm trying to, but -mm5 doesn't work at all on the 4-way machine. It
> goes through the full boot-up sequence, but then never opens a login
> on the console, and sshd also doesn't work.
>
> Andrew, maybe that's related to your tty fixes?

Reverting the two makes login work again.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-29 20:14 UTC
To: Ingo Molnar
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Mon, 29 Mar 2004 13:46:35 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> * Andi Kleen <ak@suse.de> wrote:
>
>> Sorry, ignore this report - I just found out I booted the wrong kernel
>> by mistake. Currently retesting, also with the proposed change to only
>> use a single scheduling domain.
>
> here are the items that are in the works:
>
> 	redhat.com/~mingo/scheduler-patches/sched.patch
>
> it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active
> balancing a bit.

I applied only this patch and it did slightly better than the normal
-mm*: 1.5-2x CPU bandwidth, but still very short of the 3.7x-4x that
mainline and 2.4 reach.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-29 23:51 UTC
To: Andi Kleen
Cc: Ingo Molnar, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:
> On Mon, 29 Mar 2004 13:46:35 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
>> here are the items that are in the works:
>>
>> 	redhat.com/~mingo/scheduler-patches/sched.patch
>>
>> it's against 2.6.5-rc2-mm5. This patch also reduces the rate of active
>> balancing a bit.
>
> I applied only this patch and it did slightly better than the normal
> -mm*: 1.5-2x CPU bandwidth, but still very short of the 3.7x-4x that
> mainline and 2.4 reach.

So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and
2.6 get?
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-30 6:34 UTC
To: Nick Piggin
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Tue, 30 Mar 2004 09:51:46 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and
> 2.6 get?

Yes (2.6 vanilla and 2.4-aa at that; I haven't tested 2.4-vanilla).

Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7x CPU),
but still much worse than the max of 3.7x-4x CPU bandwidth.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 6:40 UTC
To: Andi Kleen
Cc: Nick Piggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

>> So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and
>> 2.6 get?
>
> Yes (2.6 vanilla and 2.4-aa at that; I haven't tested 2.4-vanilla).
>
> Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7x CPU),
> but still much worse than the max of 3.7x-4x CPU bandwidth.

Andi, could you please try the patch below - this will test whether
this has to do with the rate of balancing between NUMA nodes. The patch
itself is not correct (it way overbalances on NUMA), but it tests the
theory.

	Ingo

--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -627,7 +627,7 @@ struct sched_domain {
 	.parent			= NULL,				\
 	.groups			= NULL,				\
 	.min_interval		= 8,				\
-	.max_interval		= 256*fls(num_online_cpus()),	\
+	.max_interval		= 8,				\
 	.busy_factor		= 8,				\
 	.imbalance_pct		= 125,				\
 	.cache_hot_time		= (10*1000000),		\
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-30 7:07 UTC
To: Ingo Molnar
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Tue, 30 Mar 2004 08:40:15 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> Andi, could you please try the patch below - this will test whether
> this has to do with the rate of balancing between NUMA nodes. The
> patch itself is not correct (it way overbalances on NUMA), but it
> tests the theory.

This works much better, but wildly varying (my tests go from 2.8x CPU
to ~3.8x CPU for 4 CPUs; the 2- and 3-CPU cases are ok). A bit more
consistent results would be better though.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:14 UTC
To: Andi Kleen
Cc: Ingo Molnar, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:
> On Tue, 30 Mar 2004 08:40:15 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
>> Andi, could you please try the patch below - this will test whether
>> this has to do with the rate of balancing between NUMA nodes. The
>> patch itself is not correct (it way overbalances on NUMA), but it
>> tests the theory.
>
> This works much better, but wildly varying (my tests go from 2.8x CPU
> to ~3.8x CPU for 4 CPUs; the 2- and 3-CPU cases are ok). A bit more
> consistent results would be better though.

Oh good, thanks Ingo. Andi, you probably want to lower your minimum
balance time too then, and maybe try with an even lower maximum. Maybe
reduce cache_hot_time a bit too.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 7:45 UTC
To: Nick Piggin
Cc: Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>> This works much better, but wildly varying (my tests go from 2.8x CPU
>> to ~3.8x CPU for 4 CPUs; the 2- and 3-CPU cases are ok). A bit more
>> consistent results would be better though.
>
> Oh good, thanks Ingo. Andi, you probably want to lower your minimum
> balance time too then, and maybe try with an even lower maximum. Maybe
> reduce cache_hot_time a bit too.

i don't think we want to balance with that high a frequency on NUMA
Opteron. These tunings were for testing only.

i'm dusting off the balance-on-clone patch right now; that should be
the correct solution. It is based on a find_idlest_cpu() function which
searches for the least-loaded CPU and checks whether we can do passive
load-balancing to it. Ie. it's yet another balancing point in the
scheduler, _not_ a balancing-logic change.

	Ingo
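A rough sketch of the shape such a search could take - the stub data
structures, the "idle wins immediately" rule, and the ~25% threshold
below are all illustrative assumptions, not the actual patch:

/*
 * Illustrative find_idlest_cpu()-style search with stub runqueues.
 * Prefer an idle CPU outright; otherwise only pick a remote CPU if
 * it is markedly less loaded, so we don't throw away cache context
 * for a marginal win.
 */
#include <stdio.h>

#define NR_CPUS 4

struct stub_rq { unsigned long nr_running; };
static struct stub_rq runqueues[NR_CPUS];

static int find_idlest_cpu(int this_cpu)
{
	unsigned long min_load = runqueues[this_cpu].nr_running;
	int best = this_cpu;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		unsigned long load = runqueues[cpu].nr_running;

		if (load == 0)
			return cpu;	/* idle CPU: take it */
		if (load < min_load) {
			min_load = load;
			best = cpu;
		}
	}
	/* require the target to be at least ~25% less loaded */
	if (best != this_cpu &&
	    min_load * 5 >= runqueues[this_cpu].nr_running * 4)
		return this_cpu;
	return best;
}

int main(void)
{
	runqueues[0].nr_running = 3;
	runqueues[1].nr_running = 2;
	runqueues[2].nr_running = 1;
	runqueues[3].nr_running = 1;
	printf("new task from CPU 0 -> CPU %d\n", find_idlest_cpu(0));
	return 0;
}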
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:58 UTC
To: Ingo Molnar
Cc: Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Ingo Molnar wrote:

> i don't think we want to balance with that high a frequency on NUMA
> Opteron. These tunings were for testing only.

I guess not. Andi says he wants it more like UMA balancing though...

> i'm dusting off the balance-on-clone patch right now; that should be
> the correct solution. It is based on a find_idlest_cpu() function which
> searches for the least-loaded CPU and checks whether we can do passive
> load-balancing to it. Ie. it's yet another balancing point in the
> scheduler, _not_ a balancing-logic change.

Yep, as I said to Martin, I also agree this is probably good if it is
done carefully. I think we'll need to get a horde of thread-benchmarking
people together before turning it on by default, of course.

It seems Andi can now get equivalent results without it, so it isn't a
pressing issue.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 7:15 UTC
To: Andi Kleen
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

>> Andi, could you please try the patch below - this will test whether
>> this has to do with the rate of balancing between NUMA nodes. The
>> patch itself is not correct (it way overbalances on NUMA), but it
>> tests the theory.
>
> This works much better, but wildly varying (my tests go from 2.8x CPU
> to ~3.8x CPU for 4 CPUs; the 2- and 3-CPU cases are ok). A bit more
> consistent results would be better though.

ok, could you try min_interval, max_interval and busy_factor all with a
value of 4, in sched.h's SD_NODE_INIT template? (again, only for
testing purposes.)

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:18 UTC
To: Ingo Molnar
Cc: Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Ingo Molnar wrote:

> ok, could you try min_interval, max_interval and busy_factor all with
> a value of 4, in sched.h's SD_NODE_INIT template? (again, only for
> testing purposes.)

(sorry, forget what I said then, I'll leave it to Ingo)
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-30 7:48 UTC
To: Ingo Molnar
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Tue, 30 Mar 2004 09:15:19 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> ok, could you try min_interval, max_interval and busy_factor all with
> a value of 4, in sched.h's SD_NODE_INIT template? (again, only for
> testing purposes.)

I kept the old patch and made these changes. The results are much more
consistent now: 3+x CPU. I still get variations of ~2GB/s, but I had
this with older kernels too.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 8:18 UTC
To: Andi Kleen
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

>> ok, could you try min_interval, max_interval and busy_factor all with
>> a value of 4, in sched.h's SD_NODE_INIT template? (again, only for
>> testing purposes.)
>
> I kept the old patch and made these changes. The results are much more
> consistent now: 3+x CPU. I still get variations of ~2GB/s, but I had
> this with older kernels too.

great.

now, could you try the following patch, against vanilla -mm5:

	redhat.com/~mingo/scheduler-patches/sched2.patch

this includes 'context balancing' and doesn't touch the NUMA async
balancing tunables. Do you get better performance than with stock -mm5?

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-30 9:36 UTC
To: Ingo Molnar
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Tue, 30 Mar 2004 10:18:40 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> now, could you try the following patch, against vanilla -mm5:
>
> 	redhat.com/~mingo/scheduler-patches/sched2.patch
>
> this includes 'context balancing' and doesn't touch the NUMA async
> balancing tunables. Do you get better performance than with stock -mm5?

I get better performance (roughly 2.1x CPU), but only about half the
optimum.

-Andi
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 7:42 UTC
To: Andi Kleen
Cc: nickpiggin, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

* Andi Kleen <ak@suse.de> wrote:

> This works much better, but wildly varying (my tests go from 2.8x CPU
> to ~3.8x CPU for 4 CPUs; the 2- and 3-CPU cases are ok). A bit more
> consistent results would be better though.

i'm resurrecting the balance-on-clone patch i sent a couple of days
ago. I found at least one bug in it that might explain why it didn't
work back then. (also, the scheduler back then was too aggressive at
migrating tasks back.) Stay tuned.

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:03 UTC
To: Andi Kleen
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:
> On Tue, 30 Mar 2004 09:51:46 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> So both -mm5 and Ingo's sched.patch are much worse than what 2.4 and
>> 2.6 get?
>
> Yes (2.6 vanilla and 2.4-aa at that; I haven't tested 2.4-vanilla).
>
> Ingo's sched.patch makes it a bit better (from 1x CPU to 1.5-1.7x CPU),
> but still much worse than the max of 3.7x-4x CPU bandwidth.

So it is very likely to be a case of the threads running too long on
one CPU before being balanced off, and faulting in most of their
working memory from one node, right?

I think it is impossible for the scheduler to correctly identify this
and implement the behaviour that OpenMP wants without causing
regressions on more general workloads (assuming this is the problem).

We are not going to go back to the wild balancing that numasched does
(I have some benchmarks where sched-domains reduces cross-node task
movement by several orders of magnitude). So the other option is to do
balance-on-clone across NUMA nodes, and make it very sensitive to
imbalance. Or, probably better, make it easy to balance off to an idle
CPU, but much more difficult to balance off to a busy CPU.

I suspect this would still be a regression for other tests though,
where thread creation is more frequent, threads share working set more
often, or the number of threads > the number of CPUs.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Andi Kleen @ 2004-03-30 7:13 UTC
To: Nick Piggin
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

On Tue, 30 Mar 2004 17:03:42 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> So it is very likely to be a case of the threads running too long on
> one CPU before being balanced off, and faulting in most of their
> working memory from one node, right?

Yes.

> I think it is impossible for the scheduler to correctly identify this
> and implement the behaviour that OpenMP wants without causing
> regressions on more general workloads (assuming this is the problem).

Regression on what workload? The 2.4 kernel, which did the early
balancing, didn't seem to have problems.

I have NUMA API for an application to select memory placement manually,
but it's unrealistic to expect all applications to use it, so the
scheduler has to do at least a reasonable default.

In general on Opteron you want to go as quickly as possible to your
target node. Keeping things on the local node and hoping that threads
won't need to be balanced off is probably a loss. It is quite possible
that other systems have different requirements, but I doubt there is a
"one size fits all" requirement, and doing a custom domain setup or
similar would be fine for me. (or at least, if sched domains cannot be
tuned for Opteron then it would have failed its promise of being a
configurable scheduler)

> I suspect this would still be a regression for other tests though,
> where thread creation is more frequent, threads share working set more
> often, or the number of threads > the number of CPUs.

I can try such tests if they're not too time-consuming to set up. What
did you have in mind?

-Andi
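For reference, the kind of manual placement the NUMA API allows - a
minimal sketch using libnuma (numa_available(), numa_run_on_node() and
numa_alloc_onnode() are the library's calls; the node choice here is
arbitrary, for illustration):

/*
 * Pin the calling thread to a node and allocate its working memory
 * there, instead of relying on the scheduler's default placement.
 * Build: gcc numa-place.c -lnuma
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
	size_t sz = 64 << 20;	/* 64 MB working set */
	double *buf;
	int node = 1;		/* arbitrary node for illustration */

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	if (node > numa_max_node())
		node = 0;

	/* run on the chosen node and allocate memory local to it */
	if (numa_run_on_node(node) < 0)
		perror("numa_run_on_node");
	buf = numa_alloc_onnode(sz, node);
	if (!buf)
		return 1;

	for (size_t i = 0; i < sz / sizeof(double); i++)
		buf[i] = 1.0;	/* touches pages on 'node' */

	numa_free(buf, sz);
	return 0;
}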
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:24 UTC
To: Andi Kleen
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:

> Regression on what workload? The 2.4 kernel, which did the early
> balancing, didn't seem to have problems.

No, but hopefully sched-domains balancing will do better than the old
numasched.

> In general on Opteron you want to go as quickly as possible to your
> target node. Keeping things on the local node and hoping that threads
> won't need to be balanced off is probably a loss. It is quite possible
> that other systems have different requirements, but I doubt there is a
> "one size fits all" requirement, and doing a custom domain setup or
> similar would be fine for me.

It is the same situation with all NUMA; obviously Opteron's 1 CPU per
node means it is sensitive to node imbalances.

> (or at least, if sched domains cannot be tuned for Opteron then it
> would have failed its promise of being a configurable scheduler)

Well, it seems like Ingo is onto something. Phew! :)

> I can try such tests if they're not too time-consuming to set up.
> What did you have in mind?

Not really sure. I guess probably most things that use a lot of
threads: maybe Java, or a web server using per-connection threads (if
there is such a thing). On the other hand, maybe it will be a good idea
if it is done carefully...
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Arjan van de Ven @ 2004-03-30 7:38 UTC
To: Andi Kleen
Cc: Nick Piggin, mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh

> Regression on what workload? The 2.4 kernel, which did the early
> balancing, didn't seem to have problems.

well, the hard balance is between a program that just splits off one
thread and has those two threads working closely together (in which
case you want the two threads to be together on the same quad in a
quad-like setup), and a program that splits off a thread and has the
two threads working basically entirely independently. Benchmarks are
typically of the latter kind... but real-world applications? The ones I
can think of that use threads are of the former kind.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Martin J. Bligh @ 2004-03-30 7:13 UTC
To: Nick Piggin, Andi Kleen
Cc: mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

> We are not going to go back to the wild balancing that numasched does
> (I have some benchmarks where sched-domains reduces cross-node task
> movement by several orders of magnitude).

Agreed, I think that'd be a fatal mistake ...

> So the other option is to do balance-on-clone across NUMA nodes, and
> make it very sensitive to imbalance. Or, probably better, make it easy
> to balance off to an idle CPU, but much more difficult to balance off
> to a busy CPU.

I think that's correct, but we need to be careful. We really, really do
want to try to keep threads on the same node *if* we have enough
processes around to keep the machine busy. Because we don't balance on
fork, we make a reasonable job of that today, but we should probably be
more reluctant on rebalance than we are.

It's when we have fewer processes than nodes that we want to spread
things around. That's a difficult balance to strike (and exactly why I
wimped out on it originally ;-))

M.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 7:31 UTC
To: Martin J. Bligh
Cc: Andi Kleen, mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

Martin J. Bligh wrote:

> I think that's correct, but we need to be careful. We really, really do
> want to try to keep threads on the same node *if* we have enough
> processes around to keep the machine busy. Because we don't balance on
> fork, we make a reasonable job of that today, but we should probably be
> more reluctant on rebalance than we are.
>
> It's when we have fewer processes than nodes that we want to spread
> things around. That's a difficult balance to strike (and exactly why I
> wimped out on it originally ;-))

Well, NUMA balance-on-exec is obviously the right thing to do.

Maybe balance-on-clone would be beneficial if we only balance onto CPUs
which are idle or very, very imbalanced. Basically, if you are very
sure that it is going to be balanced off anyway, it is probably better
to do it at clone.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Martin J. Bligh @ 2004-03-30 7:38 UTC
To: Nick Piggin
Cc: Andi Kleen, mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

> Well, NUMA balance-on-exec is obviously the right thing to do.
>
> Maybe balance-on-clone would be beneficial if we only balance onto CPUs
> which are idle or very, very imbalanced. Basically, if you are very
> sure that it is going to be balanced off anyway, it is probably better
> to do it at clone.

Yup ... sounds utterly sensible. But I think we need to make the
current balancing favour grouping threads together on the same CPU/node
more first, if possible ;-)

M.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 8:05 UTC
To: Nick Piggin
Cc: Martin J. Bligh, Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Maybe balance-on-clone would be beneficial if we only balance onto CPUs
> which are idle or very, very imbalanced. Basically, if you are very
> sure that it is going to be balanced off anyway, it is probably better
> to do it at clone.

balancing threads/processes is not a problem, as long as it happens
within the rules of normal balancing.

ie. 'new context created' (on exec, fork or clone) is just an event
that impacts the load scenario, and which might trigger rebalancing.

_if_ the sharing between various contexts is very high and it's
actually faster to run them all single-threaded, then the application
writer can bind them to one CPU, via the affinity syscalls. But the
scheduler cannot know this in advance.

so the cleanest assumption, from the POV of the scheduler, is that
there's no sharing between contexts. Things become really simple once
this assumption is made.

and frankly, it's much easier to argue with application developers
whose application scales badly and thus the scheduler over-distributes
it, than with application developers whose application scales badly due
to the scheduler.

	Ingo
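A minimal sketch of the userspace binding Ingo refers to, using the
current glibc wrapper for the affinity syscall (the choice of CPU 0 is
arbitrary; children inherit the mask across fork/clone until they
rebind):

/*
 * Bind the calling process to CPU 0 via sched_setaffinity(), the
 * escape hatch for applications whose threads share heavily and
 * should not be spread by the scheduler.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);	/* CPU 0: arbitrary choice */

	if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("bound to CPU 0\n");
	return 0;
}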
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 8:19 UTC
To: Ingo Molnar
Cc: Martin J. Bligh, Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

Ingo Molnar wrote:

> so the cleanest assumption, from the POV of the scheduler, is that
> there's no sharing between contexts. Things become really simple once
> this assumption is made.
>
> and frankly, it's much easier to argue with application developers
> whose application scales badly and thus the scheduler over-distributes
> it, than with application developers whose application scales badly
> due to the scheduler.

You're probably mostly right, but I really don't know if I'd start with
the assumption that threads don't share anything. I think they're very
likely to share memory and cache.

Also, these additional system-wide balance points don't come for free
if you attach them to common operations (as opposed to the slow
periodic balancing). find_best_cpu needs to pull in NR_CPUS remote (and
probably hot-and-dirty) cachelines, which can get expensive, for an
operation that you are very likely to be better off *without* if your
threads do share any memory.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Ingo Molnar @ 2004-03-30 8:45 UTC
To: Nick Piggin
Cc: Martin J. Bligh, Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> You're probably mostly right, but I really don't know if I'd start with
> the assumption that threads don't share anything. I think they're very
> likely to share memory and cache.

it all depends on the workload i guess, but generally if the
application scales well then the threads only share data in a
read-mostly manner - hence we can balance at creation time.

if the application does not scale well then balancing too early cannot
make the app perform much worse.

things like JVMs tend to want good balancing - they really are
userspace simulations of separate contexts, with little sharing and
good overall scalability of the architecture.

> Also, these additional system-wide balance points don't come for free
> if you attach them to common operations (as opposed to the slow
> periodic balancing).

yes, definitely.

the implementation in sched2.patch does not take this into account yet.
There are a number of things we can do about the 500-CPUs case. Eg.
only do the balance search towards the next N nodes/cpus (tunable via a
domain parameter).

	Ingo
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
From: Nick Piggin @ 2004-03-30 8:53 UTC
To: Ingo Molnar
Cc: Martin J. Bligh, Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech

Ingo Molnar wrote:

> it all depends on the workload i guess, but generally if the
> application scales well then the threads only share data in a
> read-mostly manner - hence we can balance at creation time.
>
> things like JVMs tend to want good balancing - they really are
> userspace simulations of separate contexts, with little sharing and
> good overall scalability of the architecture.

Well, it will be interesting to see how it goes. Unfortunately I don't
have a single realistic benchmark. In fact the only threaded one I have
is volanomark.

> the implementation in sched2.patch does not take this into account
> yet. There are a number of things we can do about the 500-CPUs case.
> Eg. only do the balance search towards the next N nodes/cpus (tunable
> via a domain parameter).

Yeah, I think we shouldn't worry too much about the 500-CPUs case,
because they will obviously end up using their own domains. But it is
possible this would hurt smaller CPU counts too. Again, it means
testing.

I think we should probably aim to have a usable and decent default
domain for 32, maybe 64 CPUs, and not worry about larger numbers too
much if it would hurt lower-end performance.
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 8:53 ` Nick Piggin @ 2004-03-30 15:27 ` Martin J. Bligh 0 siblings, 0 replies; 68+ messages in thread From: Martin J. Bligh @ 2004-03-30 15:27 UTC (permalink / raw) To: Nick Piggin, Ingo Molnar, Erich Focht Cc: Andi Kleen, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech > Well, it will be interesting to see how it goes. Unfortunately > I don't have a single realistic benchmark. That's OK, neither does anyone else ;-) OK, for HPC workloads they do, but not for other stuff. The closest I can come conceptually is to run multiple instances of a Java benchmark in parallel. The existing ones all tend to be either 1 process with many threads, or many processes each with one thread. There are no m x n benchmarks around that I've found, and that seems to be a lot more like what the customers I've seen are interested in (throwing a DB, webserver, Java, etc all on one machine). Making balance_on_fork a userspace hintable thing wouldn't hurt us at all though, and would provide a great escape route for the HPC people. Some simple pokeable in /proc would probably be sufficient. balance_on_clone is harder, as whether you want to do it or not depends more on the state of the rest of the system, which is very hard for userspace to know ... M. ^ permalink raw reply [flat|nested] 68+ messages in thread
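The "simple pokeable in /proc" Martin mentions could be an ordinary sysctl; a minimal sketch follows, with the ctl id, name and flag semantics all invented for illustration:

/* kernel/sysctl.c (sketch): a system-wide toggle that do_fork() could
 * consult. 0 = balance at exec() only (today's behaviour), 1 = also
 * balance at fork()/clone(). */
#define KERN_BALANCE_ON_FORK	70	/* hypothetical ctl id */

static int sysctl_balance_on_fork;

static ctl_table balance_table[] = {
	{
		.ctl_name	= KERN_BALANCE_ON_FORK,
		.procname	= "balance_on_fork",
		.data		= &sysctl_balance_on_fork,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},
	{ .ctl_name = 0 }
};

An HPC user (or batch system) would then write 1 to /proc/sys/kernel/balance_on_fork before launching the job and restore 0 afterwards.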
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 19:09 ` Ingo Molnar 2004-03-25 15:21 ` Andi Kleen @ 2004-03-25 19:24 ` Martin J. Bligh 2004-03-25 21:48 ` Ingo Molnar 1 sibling, 1 reply; 68+ messages in thread From: Martin J. Bligh @ 2004-03-25 19:24 UTC (permalink / raw) To: Ingo Molnar, Andi Kleen Cc: Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech >> It doesn't do load balance in wake_up_forked_process() and is >> relatively non aggressive in balancing later. This leads to the >> multithreaded OpenMP STREAM running its childs first on the same node >> as the original process and allocating memory there. [...] > > i believe the fix we want is to pre-balance the context at fork() time. > I've implemented this (which is basically just a reuse of > sched_balance_exec() in fork.c, and the related namespace cleanups), > could you give it a go: > > http://redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc2-mm2-A5 > > another solution would be to add SD_BALANCE_FORK. > > also, the best place to do fork() blancing is not at > wake_up_forked_process() time, but prior doing the MM copy. This patch > does it there. At wakeup time we've already copied all the pagetables > and created tons of dirty cachelines. How are you going to decide whether to rebalance at fork time or exec time? Exec time balancing is a *lot* more efficient, it just doesn't work for things that don't exec ... cloned threads would certainly be one case. M. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 19:24 ` Martin J. Bligh @ 2004-03-25 21:48 ` Ingo Molnar 2004-03-25 22:28 ` Martin J. Bligh 0 siblings, 1 reply; 68+ messages in thread From: Ingo Molnar @ 2004-03-25 21:48 UTC (permalink / raw) To: Martin J. Bligh Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech * Martin J. Bligh <mbligh@aracnet.com> wrote: > Exec time balancing is a *lot* more efficient, it just doesn't work > for things that don't exec ... cloned threads would certainly be one > case. yeah - exec-balancing is a clear thing. fork/clone time balancing is a lot less clear. Ingo ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 21:48 ` Ingo Molnar @ 2004-03-25 22:28 ` Martin J. Bligh 2004-03-29 22:30 ` Erich Focht 0 siblings, 1 reply; 68+ messages in thread From: Martin J. Bligh @ 2004-03-25 22:28 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech >> Exec time balancing is a *lot* more efficient, it just doesn't work >> for things that don't exec ... cloned threads would certainly be one >> case. > > yeah - exec-balancing is a clear thing. fork/clone time balancing is > a lot less clear. OK, well it *looks* to me from a quick look at your patch like sched_balance_context will rebalance at both fork *and* exec time. That seems like a bad plan, but maybe I'm misreading it. Can we hold off on changing the fork/exec time balancing until we've come to a plan as to what should actually be done with it? Unless we're giving it some hint from userspace, it's frigging hard to be sure if it's going to exec or not - and the vast majority of things do. There was a really good reason why the code is currently set up that way, it's not some random accident ;-) Clone is a much more interesting case, though at the time, I consciously decided NOT to do that, as we really mostly want threads on the same node. The exception is the case where we have one app with lots of threads, and nothing much else running on the system ... I tend to think of that as an artificial benchmark situation, but maybe that's not fair. We probably need to just do a more conservative version of the cross-node rebalance at fork time. M. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 22:28 ` Martin J. Bligh @ 2004-03-29 22:30 ` Erich Focht 2004-03-30 9:05 ` Nick Piggin 2004-03-30 15:01 ` Martin J. Bligh 0 siblings, 2 replies; 68+ messages in thread From: Erich Focht @ 2004-03-29 22:30 UTC (permalink / raw) To: Martin J. Bligh, Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech On Thursday 25 March 2004 23:28, Martin J. Bligh wrote: > Can we hold off on changing the fork/exec time balancing until we've > come to a plan as to what should actually be done with it? Unless we're > giving it some hint from userspace, it's frigging hard to be sure if > it's going to exec or not - and the vast majority of things do. After more than a year (or two?) of discussions there's no better idea yet than giving a userspace hint. Default should be to balance at exec(), and maybe use a syscall for saying: balance all children a particular process is going to fork/clone at creation time. Everybody reached the insight that we can't foresee what's optimal, so there is only one solution: control the behavior. Give the user a tool to improve the performance. Just a small inheritable variable in the task structure is enough. Whether you give the hint at or before run-time or even at compile-time is not really the point... I don't think it's worth to wait and hope that somebody shows up with a magic algorithm which balances every kind of job optimally. > There was a really good reason why the code is currently set up that > way, it's not some random accident ;-) The current code isn't a result of a big optimization effort, it's the result of stripping stuff down to something which was acceptable at all in the 2.6 feature freeze phase such that we get at least _some_ NUMA scheduler infrastructure. It was clear right from the beginning that it has to be extended to really become useful. > Clone is a much more interesting case, though at the time, I consciously > decided NOT to do that, as we really mostly want threads on the same > node. That is not true in the case of HPC applications. And if someone uses OpenMP he is just doing that kind of stuff. I consider STREAM a good benchmark because it shows exactly the problem of HPC applications: they need a lot of memory bandwidth, they don't run in cache and the tasks live really long. Spreading those tasks across the nodes gives me more bandwidth per task and I accumulate the positive effect because the tasks run for hours or days. It's a simple and clear case where the scheduler should be improved. Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 are not relevant for HPC. In a compute center it actually doesn't matter much whether some shell command returns 10% faster, it just shouldn't disturb my super simulation code for which I bought an expensive NUMA box. Regards, Erich ^ permalink raw reply [flat|nested] 68+ messages in thread
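The "small inheritable variable in the task structure" Erich proposes could be as simple as the sketch below; the field, constants and wiring are hypothetical, shown only to make the proposal concrete (wake_up_forked_thread() is the balancing wakeup from Ingo's patch later in this thread):

/* include/linux/sched.h (sketch) */
#define BALANCE_EXEC	0	/* default: balance at exec() only */
#define BALANCE_FORK	1	/* also spread children at fork()/clone() */

struct task_struct {
	/* ... existing fields ... */
	int balance_policy;	/* copied with the rest of the task struct
				   in copy_process(), hence inherited */
};

/* kernel/fork.c, wakeup path of do_fork() (sketch) */
if (p->balance_policy == BALANCE_FORK)
	wake_up_forked_thread(p);	/* balancing wakeup */
else
	wake_up_forked_process(p);	/* stay local, as today */

A syscall or /proc hint would then only have to set balance_policy once on the parent; every worker the job spawns inherits it.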
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-29 22:30 ` Erich Focht @ 2004-03-30 9:05 ` Nick Piggin 2004-03-30 10:04 ` Erich Focht 2004-03-30 15:01 ` Martin J. Bligh 1 sibling, 1 reply; 68+ messages in thread From: Nick Piggin @ 2004-03-30 9:05 UTC (permalink / raw) To: Erich Focht Cc: Martin J. Bligh, Ingo Molnar, Andi Kleen, Nakajima, Jun, Rick Lindsley, linux-kernel, akpm, kernel, rusty, anton, lse-tech (please use piggin@yahoo.com.au) Erich Focht wrote: >On Thursday 25 March 2004 23:28, Martin J. Bligh wrote: > >>Can we hold off on changing the fork/exec time balancing until we've >>come to a plan as to what should actually be done with it? Unless we're >>giving it some hint from userspace, it's frigging hard to be sure if >>it's going to exec or not - and the vast majority of things do. >> > >After more than a year (or two?) of discussions there's no better idea >yet than giving a userspace hint. Default should be to balance at >exec(), and maybe use a syscall for saying: balance all children a >particular process is going to fork/clone at creation time. Everybody >reached the insight that we can't foresee what's optimal, so there is >only one solution: control the behavior. Give the user a tool to >improve the performance. Just a small inheritable variable in the task >structure is enough. Whether you give the hint at or before run-time >or even at compile-time is not really the point... > >I don't think it's worth to wait and hope that somebody shows up with >a magic algorithm which balances every kind of job optimally. > > I'm with Martin here, we are just about to merge all this sched-domains stuff. So we should at least wait until after that. And of course, *nothing* gets changed without at least one benchmark that shows it improves something. So far nobody has come up to the plate with that. >>There was a really good reason why the code is currently set up that >>way, it's not some random accident ;-) >> > >The current code isn't a result of a big optimization effort, it's the >result of stripping stuff down to something which was acceptable at >all in the 2.6 feature freeze phase such that we get at least _some_ >NUMA scheduler infrastructure. It was clear right from the beginning >that it has to be extended to really become useful. > > >>Clone is a much more interesting case, though at the time, I consciously >>decided NOT to do that, as we really mostly want threads on the same >>node. >> > >That is not true in the case of HPC applications. And if someone uses >OpenMP he is just doing that kind of stuff. I consider STREAM a good >benchmark because it shows exactly the problem of HPC applications: >they need a lot of memory bandwidth, they don't run in cache and the >tasks live really long. Spreading those tasks across the nodes gives >me more bandwidth per task and I accumulate the positive effect >because the tasks run for hours or days. It's a simple and clear case >where the scheduler should be improved. > >Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 >are not relevant for HPC. In a compute center it actually doesn't >matter much whether some shell command returns 10% faster, it just >shouldn't disturb my super simulation code for which I bought an >expensive NUMA box. > > There are other things, like Java, servers, etc. that use threads. The point is that we have never had this before, and nobody (until now) has been asking for it. And there are as yet no convincing benchmarks that even show best case improvements. 
And it could very easily have some bad cases. And finally, HPC applications are the very ones that should be using CPU affinities because they are usually tuned quite tightly to the specific architecture. Let's just make sure we don't change defaults without any reason... ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 9:05 ` Nick Piggin @ 2004-03-30 10:04 ` Erich Focht 2004-03-30 10:58 ` Andi Kleen ` (2 more replies) 0 siblings, 3 replies; 68+ messages in thread From: Erich Focht @ 2004-03-30 10:04 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Ingo Molnar, Andi Kleen, Nakajima, Jun, Rick Lindsley, linux-kernel, akpm, kernel, rusty, anton, lse-tech Hi Nick, On Tuesday 30 March 2004 11:05, Nick Piggin wrote: > >exec(), and maybe use a syscall for saying: balance all children a > >particular process is going to fork/clone at creation time. Everybody > >reached the insight that we can't foresee what's optimal, so there is > >only one solution: control the behavior. Give the user a tool to > >improve the performance. Just a small inheritable variable in the task > >structure is enough. Whether you give the hint at or before run-time > >or even at compile-time is not really the point... > > > >I don't think it's worth to wait and hope that somebody shows up with > >a magic algorithm which balances every kind of job optimally. > > I'm with Martin here, we are just about to merge all this > sched-domains stuff. So we should at least wait until after > that. And of course, *nothing* gets changed without at least > one benchmark that shows it improves something. So far > nobody has come up to the plate with that. I thought you were talking the whole time about STREAM. That is THE benchmark which shows you an impact of balancing at fork. And it is a VERY relevant benchmark. Though you shouldn't run it on historical machines like NUMAQ, no compute center in the western world will buy NUMAQs for high performance... Andi typically runs STREAM on all CPUs of a machine. Try on N/2 and N/4 and so on, you'll see the impact. > >>Clone is a much more interesting case, though at the time, I consciously > >>decided NOT to do that, as we really mostly want threads on the same > >>node. > > > >That is not true in the case of HPC applications. And if someone uses > >OpenMP he is just doing that kind of stuff. I consider STREAM a good > >benchmark because it shows exactly the problem of HPC applications: > >they need a lot of memory bandwidth, they don't run in cache and the > >tasks live really long. Spreading those tasks across the nodes gives > >me more bandwidth per task and I accumulate the positive effect > >because the tasks run for hours or days. It's a simple and clear case > >where the scheduler should be improved. > > > >Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 > >are not relevant for HPC. In a compute center it actually doesn't > >matter much whether some shell command returns 10% faster, it just > >shouldn't disturb my super simulation code for which I bought an > >expensive NUMA box. > > There are other things, like Java, servers, etc. that use threads. I'm just saying that you should have the choice. The default should be as before, balance at exec(). > The point is that we have never had this before, and nobody > (until now) has been asking for it. And there are as yet no ?? Sorry, I've had balance at fork since 2001 in the NEC IA64 NUMA kernels and users use it intensively with OpenMP. Advertised it a lot, asked for it, talked about it at the last OLS. Only IA64 was considered rare big iron. I understand that the issue gets hotter if the problem hurts on AMD64... > convincing benchmarks that even show best case improvements. And > it could very easily have some bad cases. 
Again: I'm talking about having the choice. The user decides. Nothing protects you against user stupidity, but if they just have the choice of poor automatic initial scheduling, it's not enough. And: having the fork/clone initial balancing policy means: you don't need to make your code complicated and unportable by playing with setaffinity (which is just plainly unusable when you share the machine with other users). > And finally, HPC > applications are the very ones that should be using CPU > affinities because they are usually tuned quite tightly to the > specific architecture. There are companies mainly selling NUMA machines for HPC (SGI?), so this is not a niche market. Clusters of big NUMA machines are not unusual, and they're typically not used for databases but for HPC apps. Unfortunately proprietary UNIX is still considered to have better features than Linux for such configurations. > Let's just make sure we don't change defaults without any > reason... No reason? Aaarghh... >;-) Erich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 10:04 ` Erich Focht @ 2004-03-30 10:58 ` Andi Kleen 2004-03-30 16:03 ` [patch] sched-2.6.5-rc3-mm1-A0 Ingo Molnar 0 siblings, 1 reply; 68+ messages in thread From: Andi Kleen @ 2004-03-30 10:58 UTC (permalink / raw) To: Erich Focht Cc: nickpiggin, mbligh, mingo, jun.nakajima, ricklind, linux-kernel, akpm, kernel, rusty, anton, lse-tech On Tue, 30 Mar 2004 12:04:13 +0200 Erich Focht <efocht@hpce.nec.com> wrote: Hello Erich, > On Tuesday 30 March 2004 11:05, Nick Piggin wrote: > > >exec(), and maybe use a syscall for saying: balance all children a > > >particular process is going to fork/clone at creation time. Everybody > > >reached the insight that we can't foresee what's optimal, so there is > > >only one solution: control the behavior. Give the user a tool to > > >improve the performance. Just a small inheritable variable in the task > > >structure is enough. Whether you give the hint at or before run-time > > >or even at compile-time is not really the point... > > > > > >I don't think it's worth to wait and hope that somebody shows up with > > >a magic algorithm which balances every kind of job optimally. > > > > I'm with Martin here, we are just about to merge all this > > sched-domains stuff. So we should at least wait until after > > that. And of course, *nothing* gets changed without at least > > one benchmark that shows it improves something. So far > > nobody has come up to the plate with that. > > I thought you were talking the whole time about STREAM. That is THE > benchmark which shows you an impact of balancing at fork. And it is a > VERY relevant benchmark. Though you shouldn't run it on historical > machines like NUMAQ, no compute center in the western world will buy > NUMAQs for high performance... Andi typically runs STREAM on all CPUs > of a machine. Try on N/2 and N/4 and so on, you'll see the impact. Actually I run it on 1-4 CPUs (don't have more to try), but didn't always bother to report everything... With the default mm5 scheduler the bandwidth with 1, 2, 3 or 4 threads is consistently that of a single CPU. I agree with you that the "balancing on fork is bad" assumption is dubious at best. For HPC it definitely is wrong, for others it is unproven as well. As I wrote earlier, our own results on HyperThreaded machines running 2.4 were similar. On HT at least early balancing seems to be a win too - it's obvious because there is no cache cost to be paid when you move between two virtual CPUs on the same core. > > There are other things, like Java, servers, etc. that use threads. > > I'm just saying that you should have the choice. The default should be > as before, balance at exec(). Choice is probably not bad, but a good default is important too. I'm not really sure doing it by default would be such a bad idea. A thread allocating some memory on its own is probably not that unusual, even outside the HPC space. And on a NUMA system you want that already on the final node. -Andi ^ permalink raw reply [flat|nested] 68+ messages in thread
* [patch] sched-2.6.5-rc3-mm1-A0 2004-03-30 10:58 ` Andi Kleen @ 2004-03-30 16:03 ` Ingo Molnar 2004-03-31 2:30 ` Nick Piggin 0 siblings, 1 reply; 68+ messages in thread From: Ingo Molnar @ 2004-03-30 16:03 UTC (permalink / raw) To: linux-kernel Cc: Erich Focht, nickpiggin, mbligh, jun.nakajima, ricklind, akpm, kernel, rusty, anton, lse-tech, Andi Kleen

the latest scheduler patch, against 2.6.5-rc3-mm1, can be found at:

	redhat.com/~mingo/scheduler-patches/sched-2.6.5-rc3-mm1-A0

this includes:

 - fork/clone-time balancing. It looks quite good here, but needs more testing for impact.

 - a minor fix for passive balancing. (calculating at a -1 load level was not perfectly precise with a runqueue length of ~4 or longer.)

 - use sync wakeups for parent-wakeup. This makes a single-task strace execute on only one CPU on SMP, which is precisely what we want. It should also be a speedup for a number of workloads where the parent is actively wait4()-ing for the child to exit.

	Ingo

^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [patch] sched-2.6.5-rc3-mm1-A0 2004-03-30 16:03 ` [patch] sched-2.6.5-rc3-mm1-A0 Ingo Molnar @ 2004-03-31 2:30 ` Nick Piggin 0 siblings, 0 replies; 68+ messages in thread From: Nick Piggin @ 2004-03-31 2:30 UTC (permalink / raw) To: Ingo Molnar Cc: linux-kernel, Erich Focht, mbligh, jun.nakajima, ricklind, akpm, kernel, rusty, anton, lse-tech, Andi Kleen Ingo Molnar wrote: > - use sync wakeups for parent-wakeup. This makes a single-task strace > execute on only one CPU on SMP, which is precisely what we want. It > should also be a speedup for a number of workloads where the parent > is actively wait4()-ing for the child to exit. Nice ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 10:04 ` Erich Focht 2004-03-30 10:58 ` Andi Kleen @ 2004-03-30 11:02 ` Andrew Morton [not found] ` <20040330161438.GA2257@elte.hu> 2004-03-31 18:59 ` [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Erich Focht 2004-03-31 2:08 ` Nick Piggin 2 siblings, 2 replies; 68+ messages in thread From: Andrew Morton @ 2004-03-30 11:02 UTC (permalink / raw) To: Erich Focht Cc: nickpiggin, mbligh, mingo, ak, jun.nakajima, ricklind, linux-kernel, kernel, rusty, anton, lse-tech Erich Focht <efocht@hpce.nec.com> wrote: > > > And finally, HPC > > applications are the very ones that should be using CPU > > affinities because they are usually tuned quite tightly to the > > specific architecture. > > There are companies mainly selling NUMA machines for HPC (SGI?), so > this is not a niche market. It is niche in terms of number of machines and in terms of affected users. And the people who provide these machines have the resources to patch the scheduler if needs be. Correct me if I'm wrong, but what we have here is a situation where if we design the scheduler around the HPC requirement, it will work poorly in a significant number of other applications. And we don't see a way of fixing this without either a /proc/i-am-doing-hpc, or a config option, or requiring someone to carry an external patch, yes? If so then all of those seem reasonable options to me. We should optimise the scheduler for the common case, and that ain't HPC. If we agree that architecturally sched-domains _can_ satisfy the HPC requirement then I think that's good enough for now. I'd prefer that Ingo and Nick not have to bust a gut trying to get optimum HPC performance before the code is even merged up. Do you agree that sched-domains is architected appropriately? ^ permalink raw reply [flat|nested] 68+ messages in thread
* [patch] new-context balancing, 2.6.5-rc3-mm1 [not found] ` <20040330162514.GA2943@elte.hu> @ 2004-03-30 21:03 ` Ingo Molnar 2004-03-31 2:30 ` Nick Piggin 0 siblings, 1 reply; 68+ messages in thread From: Ingo Molnar @ 2004-03-30 21:03 UTC (permalink / raw) To: Andrew Morton Cc: Erich Focht, nickpiggin, mbligh, ak, jun.nakajima, ricklind, linux-kernel, kernel, rusty, anton, lse-tech

[-- Attachment #1: Type: text/plain, Size: 384 bytes --]

i've attached sched-balance-context.patch, which is the current version of fork()/clone() balancing, against 2.6.5-rc3-mm1. Changes:

 - only balance CLONE_VM threads

 - take ->cpus_allowed into account when balancing.

i've checked kernel recompiles and while they didn't hurt from fork() balancing on an 8-way SMP box, i implemented the thread-only balancing nevertheless.

	Ingo

[-- Attachment #2: sched-balance-context.patch --]
[-- Type: text/plain, Size: 4796 bytes --]

--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -715,12 +715,17 @@ extern void do_timer(struct pt_regs *);
 extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state));
 extern int FASTCALL(wake_up_process(struct task_struct * tsk));
+extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
 #ifdef CONFIG_SMP
 extern void kick_process(struct task_struct *tsk);
+extern void FASTCALL(wake_up_forked_thread(struct task_struct * tsk));
 #else
 static inline void kick_process(struct task_struct *tsk) { }
+static inline void wake_up_forked_thread(struct task_struct * tsk)
+{
+	return wake_up_forked_process(tsk);
+}
 #endif
-extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
 extern void FASTCALL(sched_fork(task_t * p));
 extern void FASTCALL(sched_exit(task_t * p));
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -1139,6 +1137,119 @@ enum idle_type
 };
 
 #ifdef CONFIG_SMP
+
+/*
+ * find_idlest_cpu - find the least busy runqueue.
+ */
+static int find_idlest_cpu(int this_cpu, runqueue_t *this_rq, cpumask_t mask)
+{
+	unsigned long load, min_load, this_load;
+	int i, min_cpu;
+	cpumask_t tmp;
+
+	min_cpu = UINT_MAX;
+	min_load = ULONG_MAX;
+
+	cpus_and(tmp, mask, cpu_online_map);
+	for_each_cpu_mask(i, tmp) {
+		load = cpu_load(i);
+
+		if (load < min_load) {
+			min_cpu = i;
+			min_load = load;
+
+			/* break out early on an idle CPU: */
+			if (!min_load)
+				break;
+		}
+	}
+
+	/* add +1 to account for the new task */
+	this_load = cpu_load(this_cpu) + SCHED_LOAD_SCALE;
+
+	/*
+	 * Would the addition of the new task to the current CPU
+	 * create an imbalance between this CPU and the idlest CPU?
+	 */
+	if (min_load*this_rq->sd->imbalance_pct < 100*this_load)
+		return min_cpu;
+
+	return this_cpu;
+}
+
+/*
+ * wake_up_forked_thread - wake up a freshly forked thread.
+ *
+ * This function will do some initial scheduler statistics housekeeping
+ * that must be done for every newly created context, and it also does
+ * runqueue balancing.
+ */
+void fastcall wake_up_forked_thread(task_t * p)
+{
+	unsigned long flags;
+	int this_cpu = get_cpu(), cpu;
+	runqueue_t *this_rq = cpu_rq(this_cpu), *rq;
+
+	/*
+	 * Migrate the new context to the least busy CPU,
+	 * if that CPU is out of balance.
+	 */
+	cpu = find_idlest_cpu(this_cpu, this_rq, p->cpus_allowed);
+
+	local_irq_save(flags);
+lock_again:
+	rq = cpu_rq(cpu);
+	double_rq_lock(this_rq, rq);
+
+	BUG_ON(p->state != TASK_RUNNING);
+
+	/*
+	 * We did find_idlest_cpu() unlocked, so in theory
+	 * the mask could have changed:
+	 */
+	if (!cpu_isset(cpu, p->cpus_allowed)) {
+		cpu = any_online_cpu(p->cpus_allowed);
+		double_rq_unlock(this_rq, rq);
+		goto lock_again;
+	}
+	/*
+	 * We decrease the sleep average of forking parents
+	 * and children as well, to keep max-interactive tasks
+	 * from forking tasks that are max-interactive.
+	 */
+	current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
+		PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+
+	p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
+		CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+
+	p->interactive_credit = 0;
+
+	p->prio = effective_prio(p);
+	set_task_cpu(p, cpu);
+
+	if (cpu == this_cpu) {
+		if (unlikely(!current->array))
+			__activate_task(p, rq);
+		else {
+			p->prio = current->prio;
+			list_add_tail(&p->run_list, &current->run_list);
+			p->array = current->array;
+			p->array->nr_active++;
+			rq->nr_running++;
+		}
+	} else {
+		__activate_task(p, rq);
+		if (TASK_PREEMPTS_CURR(p, rq))
+			resched_task(rq->curr);
+	}
+
+	double_rq_unlock(this_rq, rq);
+	local_irq_restore(flags);
+	put_cpu();
+}
+
 /*
  * If dest_cpu is allowed for this process, migrate the task to it.
  * This is accomplished by forcing the cpu_allowed mask to only
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -1179,9 +1179,23 @@ long do_fork(unsigned long clone_flags,
 		set_tsk_thread_flag(p, TIF_SIGPENDING);
 	}
 
-	if (!(clone_flags & CLONE_STOPPED))
-		wake_up_forked_process(p);	/* do this last */
-	else
+	if (!(clone_flags & CLONE_STOPPED)) {
+		/*
+		 * Do the wakeup last. On SMP we treat fork() and
+		 * CLONE_VM separately, because fork() has already
+		 * created cache footprint on this CPU (due to
+		 * copying the pagetables), hence migration would
+		 * probably be costly. Threads on the other hand
+		 * have less traction to the current CPU, and if
+		 * there's an imbalance then the scheduler can
+		 * migrate this fresh thread now, before it
+		 * accumulates a larger cache footprint:
+		 */
+		if (clone_flags & CLONE_VM)
+			wake_up_forked_thread(p);
+		else
+			wake_up_forked_process(p);
+	} else
 		p->state = TASK_STOPPED;
 	++total_forks;

^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [patch] new-context balancing, 2.6.5-rc3-mm1 2004-03-30 21:03 ` [patch] new-context balancing, 2.6.5-rc3-mm1 Ingo Molnar @ 2004-03-31 2:30 ` Nick Piggin 0 siblings, 0 replies; 68+ messages in thread From: Nick Piggin @ 2004-03-31 2:30 UTC (permalink / raw) To: Ingo Molnar Cc: Andrew Morton, Erich Focht, mbligh, ak, jun.nakajima, ricklind, linux-kernel, kernel, rusty, anton, lse-tech Ingo Molnar wrote: > i've attached sched-balance-context.patch, which is the current version > of fork()/clone() balancing, against 2.6.5-rc3-mm1. > > Changes: > > - only balance CLONE_VM threads > > - take ->cpus_allowed into account when balancing. > > i've checked kernel recompiles and while they didn't hurt from fork() > balancing on an 8-way SMP box, i implemented the thread-only balancing > nevertheless. You'd probably want to be testing on a NUMA machine to bring out any problems. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 11:02 ` [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Andrew Morton [not found] ` <20040330161438.GA2257@elte.hu> @ 2004-03-31 18:59 ` Erich Focht 1 sibling, 0 replies; 68+ messages in thread From: Erich Focht @ 2004-03-31 18:59 UTC (permalink / raw) To: Andrew Morton Cc: nickpiggin, mbligh, mingo, ak, jun.nakajima, ricklind, linux-kernel, kernel, rusty, anton, lse-tech On Tuesday 30 March 2004 13:02, Andrew Morton wrote: > Erich Focht <efocht@hpce.nec.com> wrote: > > > And finally, HPC > > > applications are the very ones that should be using CPU > > > affinities because they are usually tuned quite tightly to the > > > specific architecture. > > > > There are companies mainly selling NUMA machines for HPC (SGI?), so > > this is not a niche market. > > It is niche in terms of number of machines and in terms of affected users. > And the people who provide these machines have the resources to patch the > scheduler if needs be. Uhm, depends on the CPUs you think of. I bet much more than half of the Opterons and Itanium2 CPUs sold last year went into HPC. Certainly not so many IA64s went into NUMA machines. But almost all Opterons ;-) IBM's NUMA machines with Power CPUs are mainly sold with AIX into the HPC market; I don't recall having seen big HPC installations with HP Superdome under Linux, not yet...? IBM sells x86-NUMA more into the commercial market, the only big visible Linux-NUMA in HPC is SGI's Altix. Most of the other NUMA machines go into HPC with other OSes and we don't care about them (yet?). So you're probably right about the number of Linux-NUMA-HPC users, but this actually shows that Linux-NUMA is currently not the ideal choice. We're working on it, right? > Correct me if I'm wrong, but what we have here is a situation where if we > design the scheduler around the HPC requirement, it will work poorly in a > significant number of other applications. And we don't see a way of fixing > this without either a /proc/i-am-doing-hpc, or a config option, or > requiring someone to carry an external patch, yes? > > If so then all of those seem reasonable options to me. We should optimise > the scheduler for the common case, and that ain't HPC. Yes! A per process flag would be enough to have the choice. > If we agree that architecturally sched-domains _can_ satisfy the HPC > requirement then I think that's good enough for now. I'd prefer that Ingo > and Nick not have to bust a gut trying to get optimum HPC performance > before the code is even merged up. Sure. On the other hand the benchmark brought into discussion by Andi is very easy to understand, much easier than any Java monster. If the scheduler doesn't have a knob for running this optimally, that's disappointing. > Do you agree that sched-domains is architected appropriately? My current impression is: YES. My testing experience with it is still very limited... Regards, Erich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 10:04 ` Erich Focht 2004-03-30 10:58 ` Andi Kleen 2004-03-30 11:02 ` [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Andrew Morton @ 2004-03-31 2:08 ` Nick Piggin 2004-03-31 22:23 ` Erich Focht 2 siblings, 1 reply; 68+ messages in thread From: Nick Piggin @ 2004-03-31 2:08 UTC (permalink / raw) To: Erich Focht Cc: Martin J. Bligh, Ingo Molnar, Andi Kleen, Nakajima, Jun, Rick Lindsley, linux-kernel, akpm, kernel, rusty, anton, lse-tech Erich Focht wrote: > Hi Nick, > Hi Erich, > On Tuesday 30 March 2004 11:05, Nick Piggin wrote: > >>I'm with Martin here, we are just about to merge all this >>sched-domains stuff. So we should at least wait until after >>that. And of course, *nothing* gets changed without at least >>one benchmark that shows it improves something. So far >>nobody has come up to the plate with that. > > > I thought you were talking the whole time about STREAM. That is THE > benchmark which shows you an impact of balancing at fork. And it is a > VERY relevant benchmark. Though you shouldn't run it on historical > machines like NUMAQ, no compute center in the western world will buy > NUMAQs for high performance... Andi typically runs STREAM on all CPUs > of a machine. Try on N/2 and N/4 and so on, you'll see the impact. > Well yeah, but the immediate problem was that sched-domains was *much* worse than 2.6's numasched, neither of which balance on fork/clone. I didn't want to obscure the issue by implementing balance on fork/clone until we worked out exactly the problem. Anyway, once sched-domains goes in, you can basically do whatever you like without impacting anyone else... >> >>There are other things, like Java, servers, etc. that use threads. > > > I'm just saying that you should have the choice. The default should be > as before, balance at exec(). > Yeah well that is a very sane thing to do ;) > >>The point is that we have never had this before, and nobody >>(until now) has been asking for it. And there are as yet no > > > ?? Sorry, I've had balance at fork since 2001 in the NEC IA64 NUMA > kernels and users use it intensively with OpenMP. Advertised it a lot, > asked for it, talked about it at the last OLS. Only IA64 was > considered rare big iron. I understand that the issue gets hotter if > the problem hurts on AMD64... > Sorry I hadn't realised. I guess because you are happy with your own stuff you don't make too much noise about it on the list lately. I apologise. I wonder though, why don't you just teach OpenMP to use affinities as well? Surely that is better than relying on the behaviour of the scheduler, even if it does balance on clone. > >>convincing benchmarks that even show best case improvements. And >>it could very easily have some bad cases. > > > Again: I'm talking about having the choice. The user decides. Nothing > protects you against user stupidity, but if they just have the choice > of poor automatic initial scheduling, it's not enough. And: having the > fork/clone initial balancing policy means: you don't need to make your > code complicated and unportable by playing with setaffinity (which is > just plainly unusable when you share the machine with other users). > If you do it by hand, you know exactly what is going to happen, and you can turn off the balance-on-clone flags and you don't incur the hit of pulling in remote cachelines from every CPU at clone time to do balancing. Surely an HPC application wouldn't mind doing that? 
(I guess they probably don't call clone a lot though). > >>And finally, HPC >>applications are the very ones that should be using CPU >>affinities because they are usually tuned quite tightly to the >>specific architecture. > > > There are companies mainly selling NUMA machines for HPC (SGI?), so > this is not a niche market. Clusters of big NUMA machines are not > unusual, and they're typically not used for databases but for HPC > apps. Unfortunately proprietary UNIX is still considered to have > better features than Linux for such configurations. > Well, SGI should be doing tests soon and tuning the scheduler to their liking. Hopefully others will too, so we'll see what happens. > >>Let's just make sure we don't change defaults without any >>reason... > > > No reason? Aaarghh... >;-) > Sorry I mean evidence. I'm sure with a properly tuned implementation, you could get really good speedups in lots of places... I just want to *see* them. All I have seen so far is Andi getting a bit better performance on something where he can get *much* better performance by making a trivial tweak instead. I really don't have the software or hardware to test this at all so I just have to sit and watch. ^ permalink raw reply [flat|nested] 68+ messages in thread
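For reference, "doing it by hand" from userspace means something like the sketch below: each worker pins itself to one CPU with sched_setaffinity(2). The glibc wrapper's prototype changed more than once in this period, so treat the signature as approximate:

#define _GNU_SOURCE
#include <sched.h>

/* pin the calling thread/process to a single CPU;
 * pid 0 means "the caller" */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	return sched_setaffinity(0, sizeof(mask), &mask);
}

This is exactly the unportability Erich objects to: the program has to know the machine's CPU numbering, and it behaves badly the moment the machine is shared with other users.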
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-31 2:08 ` Nick Piggin @ 2004-03-31 22:23 ` Erich Focht 0 siblings, 0 replies; 68+ messages in thread From: Erich Focht @ 2004-03-31 22:23 UTC (permalink / raw) To: Nick Piggin Cc: Martin J. Bligh, Ingo Molnar, Andi Kleen, Nakajima, Jun, Rick Lindsley, linux-kernel, akpm, kernel, rusty, anton, lse-tech On Wednesday 31 March 2004 04:08, Nick Piggin wrote: > >>I'm with Martin here, we are just about to merge all this > >>sched-domains stuff. So we should at least wait until after > >>that. And of course, *nothing* gets changed without at least > >>one benchmark that shows it improves something. So far > >>nobody has come up to the plate with that. > > > > I thought you were talking the whole time about STREAM. That is THE > > benchmark which shows you an impact of balancing at fork. And it is a > > VERY relevant benchmark. Though you shouldn't run it on historical > > machines like NUMAQ, no compute center in the western world will buy > > NUMAQs for high performance... Andi typically runs STREAM on all CPUs > > of a machine. Try on N/2 and N/4 and so on, you'll see the impact. > > Well yeah, but the immediate problem was that sched-domains was > *much* worse than 2.6's numasched, neither of which balance on > fork/clone. I didn't want to obscure the issue by implementing > balance on fork/clone until we worked out exactly the problem. I had the feeling that solving the performance issue reported by Andi would ease the integration into the baseline... > >>The point is that we have never had this before, and nobody > >>(until now) has been asking for it. And there are as yet no > > > > ?? Sorry, I've had balance at fork since 2001 in the NEC IA64 NUMA > > kernels and users use it intensively with OpenMP. Advertised it a lot, > > asked for it, talked about it at the last OLS. Only IA64 was > > considered rare big iron. I understand that the issue gets hotter if > > the problem hurts on AMD64... > > Sorry I hadn't realised. I guess because you are happy with > your own stuff you don't make too much noise about it on the > list lately. I apologise. The usual excuse: busy with other stuff... > I wonder though, why don't you just teach OpenMP to use > affinities as well? Surely that is better than relying on the > behaviour of the scheduler, even if it does balance on clone. You mean in the compiler? I don't think this is a good idea, that way you lose flexibility in resource overcommitment. And performance when overselling the machine's CPUs. > > Again: I'm talking about having the choice. The user decides. Nothing > > protects you against user stupidity, but if they just have the choice > > of poor automatic initial scheduling, it's not enough. And: having the > > fork/clone initial balancing policy means: you don't need to make your > > code complicated and unportable by playing with setaffinity (which is > > just plainly unusable when you share the machine with other users). > > If you do it by hand, you know exactly what is going to happen, > and you can turn off the balance-on-clone flags and you don't > incur the hit of pulling in remote cachelines from every CPU at > clone time to do balancing. Surely an HPC application wouldn't > mind doing that? (I guess they probably don't call clone a lot > though). OpenMP is implemented with clone. MPI parallel applications just exec, they're fine. 
IMO the static affinity/cpumask handling should be done externally by some resource manager which has a good overview of the long-term load of the machine. It's a different issue, nothing for the scheduler. I wouldn't leave it to the program, too inflexible and unportable across machines and OSes. > > There are companies mainly selling NUMA machines for HPC (SGI?), so > > this is not a niche market. Clusters of big NUMA machines are not > > unusual, and they're typically not used for databases but for HPC > > apps. Unfortunately proprietary UNIX is still considered to have > > better features than Linux for such configurations. > > Well, SGI should be doing tests soon and tuning the scheduler > to their liking. Hopefully others will too, so we'll see what > happens. Maybe they are happy with their stuff, too. They have the cpumemsets and some external affinity control, AFAIK. > >>Let's just make sure we don't change defaults without any > >>reason... > > > > No reason? Aaarghh... >;-) > > Sorry I mean evidence. I'm sure with a properly tuned > implementation, you could get really good speedups in lots > of places... I just want to *see* them. All I have seen so > far is Andi getting a bit better performance on something > where he can get *much* better performance by making a > trivial tweak instead. I get the feeling that Andi's simple OpenMP job is already complex enough to lead to wrong initial scheduling with the current approach. I suppose the reason is the 1-2 helper threads which are started together with the worker threads (depending on the used compiler). On small machines (and 4 cpus is small) they significantly disturb the initial task distribution. For example with the Intel compiler and 4 worker threads you get 6 tasks. The helper tasks are typically runnable when the code starts so you get (in order of creation):

CPU   Task   Role
1     1      worker
2     2      helper
3     3      helper
4     4      worker
1-4   5      worker
1-4   6      worker

So the difficulty is to find out which task will do real work and which task is just spoiling the statistics. I think... Regards, Erich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-29 22:30 ` Erich Focht 2004-03-30 9:05 ` Nick Piggin @ 2004-03-30 15:01 ` Martin J. Bligh 2004-03-31 21:23 ` Erich Focht 1 sibling, 1 reply; 68+ messages in thread From: Martin J. Bligh @ 2004-03-30 15:01 UTC (permalink / raw) To: Erich Focht, Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech --Erich Focht <efocht@hpce.nec.com> wrote (on Tuesday, March 30, 2004 00:30:25 +0200): > On Thursday 25 March 2004 23:28, Martin J. Bligh wrote: >> Can we hold off on changing the fork/exec time balancing until we've >> come to a plan as to what should actually be done with it? Unless we're >> giving it some hint from userspace, it's frigging hard to be sure if >> it's going to exec or not - and the vast majority of things do. > > After more than a year (or two?) of discussions there's no better idea > yet than giving a userspace hint. Default should be to balance at > exec(), and maybe use a syscall for saying: balance all children a > particular process is going to fork/clone at creation time. Everybody > reached the insight that we can't foresee what's optimal, so there is > only one solution: control the behavior. Give the user a tool to > improve the performance. Just a small inheritable variable in the task > structure is enough. Whether you give the hint at or before run-time > or even at compile-time is not really the point... Agreed ... absolutely. > I don't think it's worth to wait and hope that somebody shows up with > a magic algorithm which balances every kind of job optimally. Especially as I don't believe that exists ;-) It's not deterministic. >> Clone is a much more interesting case, though at the time, I consciously >> decided NOT to do that, as we really mostly want threads on the same >> node. > > That is not true in the case of HPC applications. And if someone uses > OpenMP he is just doing that kind of stuff. I consider STREAM a good > benchmark because it shows exactly the problem of HPC applications: > they need a lot of memory bandwidth, they don't run in cache and the > tasks live really long. Spreading those tasks across the nodes gives > me more bandwidth per task and I accumulate the positive effect > because the tasks run for hours or days. It's a simple and clear case > where the scheduler should be improved. > > Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 > are not relevant for HPC. In a compute center it actually doesn't > matter much whether some shell command returns 10% faster, it just > shouldn't disturb my super simulation code for which I bought an > expensive NUMA box. OK, but the scheduler can't know the difference automatically, I don't think ... and whether we should tune the scheduler for "user work" or HPC is going to be a hotly contested point ;-) We need to try to find something that works for both. And suppose you have a 4 node system, with 4 HPC apps running? Surely you want each app to have one node to itself? That's more the case I'm worried about than "user work" vs HPC, to be honest. M. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-30 15:01 ` Martin J. Bligh @ 2004-03-31 21:23 ` Erich Focht 2004-03-31 21:33 ` Martin J. Bligh 0 siblings, 1 reply; 68+ messages in thread From: Erich Focht @ 2004-03-31 21:23 UTC (permalink / raw) To: Martin J. Bligh, Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech On Tuesday 30 March 2004 17:01, Martin J. Bligh wrote: > > I don't think it's worth to wait and hope that somebody shows up with > > a magic algorithm which balances every kind of job optimally. > > Especially as I don't believe that exists ;-) It's not deterministic. Right, so let's choose the initial balancing policy on a per process basis. > > Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 > > are not relevant for HPC. In a compute center it actually doesn't > > matter much whether some shell command returns 10% faster, it just > > shouldn't disturb my super simulation code for which I bought an > > expensive NUMA box. > > OK, but the scheduler can't know the difference automatically, I don't > think ... and whether we should tune the scheduler for "user work" or > HPC is going to be a hotly contested point ;-) We need to try to find > something that works for both. And suppose you have a 4 node system, > with 4 HPC apps running? Surely you want each app to have one node to > itself? If the machine is 100% full all the time and all apps demand the same amount of bandwidth, yes, I want 1 job per node. If the average load is less than 100% (sometimes only 2-3 jobs are running) then I'd prefer to spread the processes of a job across the machine. The average bandwidth per process will be higher. Modern NUMA machines have big bandwidth to neighboring nodes and not too bad latency penalties for remote accesses. Regards, Erich ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-31 21:23 ` Erich Focht @ 2004-03-31 21:33 ` Martin J. Bligh 0 siblings, 0 replies; 68+ messages in thread From: Martin J. Bligh @ 2004-03-31 21:33 UTC (permalink / raw) To: Erich Focht, Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech > On Tuesday 30 March 2004 17:01, Martin J. Bligh wrote: >> > I don't think it's worth to wait and hope that somebody shows up with >> > a magic algorithm which balances every kind of job optimally. >> >> Especially as I don't believe that exists ;-) It's not deterministic. > > Right, so let's choose the initial balancing policy on a per process > basis. Yup, that seems like a reasonable thing to do. That way you can override it for things that fork and never exec, if they're performance critical (like HPC maybe). >> > Benchmarks simulating "user work" like SPECsdet, kernel compile, AIM7 >> > are not relevant for HPC. In a compute center it actually doesn't >> > matter much whether some shell command returns 10% faster, it just >> > shouldn't disturb my super simulation code for which I bought an >> > expensive NUMA box. >> >> OK, but the scheduler can't know the difference automatically, I don't >> think ... and whether we should tune the scheduler for "user work" or >> HPC is going to be a hotly contested point ;-) We need to try to find >> something that works for both. And suppose you have a 4 node system, >> with 4 HPC apps running? Surely you want each app to have one node to >> itself? > > If the machine is 100% full all the time and all apps demand the same > amount of bandwidth, yes, I want 1 job per node. If the average load is > less than 100% (sometimes only 2-3 jobs are running) then I'd prefer to > spread the processes of a job across the machine. The average bandwidth > per process will be higher. Modern NUMA machines have big bandwidth to > neighboring nodes and not too bad latency penalties for remote accesses. In theory at least, doing the rebalance_on_clone if and only if there are idle procs on another node sounds reasonable. In practice, I'm not sure how well that'll work, since one app may well start wholly before another, but maybe we can figure out something smart to do. M. ^ permalink raw reply [flat|nested] 68+ messages in thread
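Martin's "if and only if there are idle procs" variant would amount to a guard in front of the clone-time balance. A sketch, using the stock idle_cpu() helper; where exactly the check would sit is illustrative:

/* kernel/fork.c (sketch): only pay for clone-time balancing when some
 * online CPU is actually sitting idle. */
static int any_cpu_idle(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		if (idle_cpu(cpu))
			return 1;
	return 0;
}

	/* in the do_fork() wakeup path, replacing the unconditional choice: */
	if ((clone_flags & CLONE_VM) && any_cpu_idle())
		wake_up_forked_thread(p);	/* spread out */
	else
		wake_up_forked_process(p);	/* stay local */

The obvious weakness is that this check is only a momentary snapshot - as Martin notes above, one app may start wholly before another.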
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 15:40 ` Andi Kleen 2004-03-25 19:09 ` Ingo Molnar @ 2004-03-25 21:59 ` Ingo Molnar 2004-03-25 22:26 ` Rick Lindsley 2004-03-25 22:30 ` Andrew Theurer 2004-03-26 3:23 ` Nick Piggin 2 siblings, 2 replies; 68+ messages in thread From: Ingo Molnar @ 2004-03-25 21:59 UTC (permalink / raw) To: Andi Kleen Cc: Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh * Andi Kleen <ak@suse.de> wrote: > It doesn't do load balance in wake_up_forked_process() and is > relatively non aggressive in balancing later. This leads to the > multithreaded OpenMP STREAM running its childs first on the same node > as the original process and allocating memory there. Then later they > run on a different node when the balancing finally happens, but > generate cross traffic to the old node, instead of using the memory > bandwidth of their local nodes. > > The difference is very visible, even the 4 thread STREAM only sees the > bandwidth of a single node. With a more aggressive scheduler you get 4 > times as much. > > Admittedly it's a bit of a stupid benchmark, but seems to > representative for a lot of HPC codes. There's no way the scheduler can figure out the scheduling and memory use patterns of the new tasks in advance. but userspace could give hints - e.g. a syscall that triggers a rebalancing: sys_sched_load_balance(). This way userspace notifies the scheduler that it is on 'zero ground' and that the scheduler can move it to the least loaded cpu/node. a variant of this is already possible, userspace can use setaffinity to load-balance manually - but sched_load_balance() would be automatic. Ingo ^ permalink raw reply [flat|nested] 68+ messages in thread
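Using such a hint from a threaded program would be trivial - a sketch, with the syscall number purely hypothetical, since sys_sched_load_balance() was never merged:

#include <unistd.h>
#include <sys/syscall.h>

#define __NR_sched_load_balance	9999	/* hypothetical: never merged, so
					   this returns -ENOSYS on any
					   real kernel */

static void *worker(void *arg)
{
	/* tell the scheduler this thread is on "zero ground" - no
	 * cache or memory footprint yet - so it may migrate the
	 * thread to the least loaded CPU/node essentially for free */
	syscall(__NR_sched_load_balance);

	/* ... long-running, bandwidth-hungry work follows ... */
	return arg;
}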
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 21:59 ` Ingo Molnar @ 2004-03-25 22:26 ` Rick Lindsley 2004-03-25 22:30 ` Andrew Theurer 1 sibling, 0 replies; 68+ messages in thread From: Rick Lindsley @ 2004-03-25 22:26 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Nakajima, Jun, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh There's no way the scheduler can figure out the scheduling and memory use patterns of the new tasks in advance. True. Four threads may want to stay on the same node because they are sharing a lot of data and working on something in parallel, or they may want to go to different nodes because the only thing they have in common is a control structure that directs their (largely independent but highly synchronized) efforts. A while ago there was some effort at user-level page replication, which meant you took a hit once but after that you'd effectively migrated a page to your local memory. The longer you stayed put, the more local your RSS got. I seem to recall some bugs or caveats, though. Anybody know the state of that? It might take the burden off the scheduler using a crystal ball and putting it on a 20/20-hindsight VM system instead. Rick ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 21:59 ` Ingo Molnar 2004-03-25 22:26 ` Rick Lindsley @ 2004-03-25 22:30 ` Andrew Theurer 2004-03-25 22:38 ` Martin J. Bligh 2004-03-26 1:29 ` Andi Kleen 1 sibling, 2 replies; 68+ messages in thread From: Andrew Theurer @ 2004-03-25 22:30 UTC (permalink / raw) To: Ingo Molnar, Andi Kleen Cc: Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech, mbligh On Thursday 25 March 2004 15:59, Ingo Molnar wrote: > * Andi Kleen <ak@suse.de> wrote: > > It doesn't do load balance in wake_up_forked_process() and is > > relatively non aggressive in balancing later. This leads to the > > multithreaded OpenMP STREAM running its childs first on the same node > > as the original process and allocating memory there. Then later they > > run on a different node when the balancing finally happens, but > > generate cross traffic to the old node, instead of using the memory > > bandwidth of their local nodes. > > > > The difference is very visible, even the 4 thread STREAM only sees the > > bandwidth of a single node. With a more aggressive scheduler you get 4 > > times as much. > > > > Admittedly it's a bit of a stupid benchmark, but seems to > > representative for a lot of HPC codes. > > There's no way the scheduler can figure out the scheduling and memory > use patterns of the new tasks in advance. > > but userspace could give hints - e.g. a syscall that triggers a > rebalancing: sys_sched_load_balance(). This way userspace notifies the > scheduler that it is on 'zero ground' and that the scheduler can move it > to the least loaded cpu/node. > > a variant of this is already possible, userspace can use setaffinity to > load-balance manually - but sched_load_balance() would be automatic. For Opteron simply placing all cpus in the same sched domain may solve all of this, since we will have the balancing frequency of the default scheduler. Is there any reason this cannot be done for Opteron? Also, I think Erich Focht had another patch which would allow much more frequent node balancing if nr_cpus_node was 1. ^ permalink raw reply [flat|nested] 68+ messages in thread
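In sched-domains terms Andrew's suggestion is a degenerate topology: give every CPU a single bottom-level domain that spans the whole machine, so the aggressive SMP-level balancing intervals apply across nodes too. A sketch in the style of the per-CPU domain setup of the patches under discussion; the exact initializer and attach names varied between revisions, so take this as illustrative:

static DEFINE_PER_CPU(struct sched_domain, opteron_domains);

static void __init arch_init_sched_domains(void)
{
	int i;

	/* one flat domain per CPU, spanning every node: */
	for_each_cpu(i) {
		struct sched_domain *sd = &per_cpu(opteron_domains, i);

		*sd = SD_CPU_INIT;		/* SMP-level parameters... */
		sd->span = cpu_possible_map;	/* ...across the whole box */
		cpu_attach_domain(sd, i);
	}
}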
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 2004-03-25 22:30 ` Andrew Theurer @ 2004-03-25 22:38 ` Martin J. Bligh 0 siblings, 0 replies; 68+ messages in thread From: Martin J. Bligh @ 2004-03-25 22:38 UTC (permalink / raw) To: Andrew Theurer, Ingo Molnar, Andi Kleen Cc: Nakajima, Jun, Rick Lindsley, piggin, linux-kernel, akpm, kernel, rusty, anton, lse-tech > For Opteron simply placing all cpus in the same sched domain may solve all of > this, since we will have the balancing frequency of the default scheduler. Is > there any reason this cannot be done for Opteron? That seems like a good plan to me - they really don't want that cross-node balancing. It might be cleaner to implement it by just tweaking the cross-balance parameters for that system to have the same effect, but it probably doesn't matter much (I'm thinking of some future case when they decide to do multi-chip on die or SMT, so just keying off 1 cpu per node doesn't really fix it). M. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
  2004-03-25 22:30 ` Andrew Theurer
  2004-03-25 22:38 ` Martin J. Bligh
@ 2004-03-26  1:29 ` Andi Kleen
  1 sibling, 0 replies; 68+ messages in thread
From: Andi Kleen @ 2004-03-26 1:29 UTC (permalink / raw)
To: Andrew Theurer
Cc: mingo, jun.nakajima, ricklind, piggin, linux-kernel, akpm, kernel,
    rusty, anton, lse-tech, mbligh

On Thu, 25 Mar 2004 16:30:16 -0600
Andrew Theurer <habanero@us.ibm.com> wrote:

> For Opteron simply placing all cpus in the same sched domain may solve
> all of this, since we will have balancing frequency of the default
> scheduler. Is there any reason this cannot be done for Opteron?

Yes, that makes sense. I will try that.

-Andi

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3
  2004-03-25 15:40 ` Andi Kleen
  2004-03-25 19:09 ` Ingo Molnar
  2004-03-25 21:59 ` Ingo Molnar
@ 2004-03-26  3:23 ` Nick Piggin
  2 siblings, 0 replies; 68+ messages in thread
From: Nick Piggin @ 2004-03-26 3:23 UTC (permalink / raw)
To: Andi Kleen
Cc: Nakajima, Jun, Rick Lindsley, Ingo Molnar, piggin, linux-kernel,
    akpm, kernel, rusty, anton, lse-tech, mbligh

Andi Kleen wrote:
> On Thu, Mar 25, 2004 at 07:31:37AM -0800, Nakajima, Jun wrote:
>
>> Andi,
>>
>> Can you be more specific with "it doesn't load balance threads
>> aggressively enough"? Or what behavior of the base NUMA scheduler is
>> missing in the sched-domain scheduler especially for NUMA?
>
> It doesn't do load balance in wake_up_forked_process() and is relatively
> non aggressive in balancing later. This leads to the multithreaded OpenMP
> STREAM running its childs first on the same node as the original process
> and allocating memory there. Then later they run on a different node when
> the balancing finally happens, but generate cross traffic to the old node,
> instead of using the memory bandwidth of their local nodes.
>
> The difference is very visible, even the 4 thread STREAM only sees the
> bandwidth of a single node. With a more aggressive scheduler you get
> 4 times as much.
>
> Admittedly it's a bit of a stupid benchmark, but seems to representative
> for a lot of HPC codes.

Hi Andi,

Sorry I keep telling you I'll work on this but never get around to it;
mostly it's lack of hardware that makes it difficult. I've fixed a few
bugs and some other workloads, so I keep hoping they will fix your
problem :P

Your STREAM performance is really bad, and I hope you don't think I'm
going to ignore it even if the benchmark is a bit stupid. Give me a bit
more time.

Of course, there is nothing fundamentally wrong with sched-domains that
would cause your problem; it can easily do anything the old NUMA
scheduler can do. It must be a bug or some bad tuning somewhere.

Nick

^ permalink raw reply	[flat|nested] 68+ messages in thread
Thread overview: 68+ messages
2004-03-25 15:31 [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Nakajima, Jun
2004-03-25 15:40 ` Andi Kleen
2004-03-25 19:09 ` Ingo Molnar
2004-03-25 15:21 ` Andi Kleen
2004-03-25 19:39 ` Ingo Molnar
2004-03-25 20:30 ` Ingo Molnar
2004-03-29 8:45 ` Andi Kleen
2004-03-29 10:20 ` Rick Lindsley
2004-03-29 5:07 ` Andi Kleen
2004-03-29 11:28 ` Nick Piggin
2004-03-29 17:30 ` Rick Lindsley
2004-03-30 0:01 ` Nick Piggin
2004-03-30 1:26 ` Rick Lindsley
2004-03-29 11:20 ` Nick Piggin
2004-03-29 6:01 ` Andi Kleen
2004-03-29 11:46 ` Ingo Molnar
2004-03-29 7:03 ` Andi Kleen
2004-03-29 7:10 ` Andi Kleen
2004-03-29 20:14 ` Andi Kleen
2004-03-29 23:51 ` Nick Piggin
2004-03-30 6:34 ` Andi Kleen
2004-03-30 6:40 ` Ingo Molnar
2004-03-30 7:07 ` Andi Kleen
2004-03-30 7:14 ` Nick Piggin
2004-03-30 7:45 ` Ingo Molnar
2004-03-30 7:58 ` Nick Piggin
2004-03-30 7:15 ` Ingo Molnar
2004-03-30 7:18 ` Nick Piggin
2004-03-30 7:48 ` Andi Kleen
2004-03-30 8:18 ` Ingo Molnar
2004-03-30 9:36 ` Andi Kleen
2004-03-30 7:42 ` Ingo Molnar
2004-03-30 7:03 ` Nick Piggin
2004-03-30 7:13 ` Andi Kleen
2004-03-30 7:24 ` Nick Piggin
2004-03-30 7:38 ` Arjan van de Ven
2004-03-30 7:13 ` Martin J. Bligh
2004-03-30 7:31 ` Nick Piggin
2004-03-30 7:38 ` Martin J. Bligh
2004-03-30 8:05 ` Ingo Molnar
2004-03-30 8:19 ` Nick Piggin
2004-03-30 8:45 ` Ingo Molnar
2004-03-30 8:53 ` Nick Piggin
2004-03-30 15:27 ` Martin J. Bligh
2004-03-25 19:24 ` Martin J. Bligh
2004-03-25 21:48 ` Ingo Molnar
2004-03-25 22:28 ` Martin J. Bligh
2004-03-29 22:30 ` Erich Focht
2004-03-30 9:05 ` Nick Piggin
2004-03-30 10:04 ` Erich Focht
2004-03-30 10:58 ` Andi Kleen
2004-03-30 16:03 ` [patch] sched-2.6.5-rc3-mm1-A0 Ingo Molnar
2004-03-31 2:30 ` Nick Piggin
2004-03-30 11:02 ` [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Andrew Morton
[not found] ` <20040330161438.GA2257@elte.hu>
[not found] ` <20040330161910.GA2860@elte.hu>
[not found] ` <20040330162514.GA2943@elte.hu>
2004-03-30 21:03 ` [patch] new-context balancing, 2.6.5-rc3-mm1 Ingo Molnar
2004-03-31 2:30 ` Nick Piggin
2004-03-31 18:59 ` [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3 Erich Focht
2004-03-31 2:08 ` Nick Piggin
2004-03-31 22:23 ` Erich Focht
2004-03-30 15:01 ` Martin J. Bligh
2004-03-31 21:23 ` Erich Focht
2004-03-31 21:33 ` Martin J. Bligh
2004-03-25 21:59 ` Ingo Molnar
2004-03-25 22:26 ` Rick Lindsley
2004-03-25 22:30 ` Andrew Theurer
2004-03-25 22:38 ` Martin J. Bligh
2004-03-26 1:29 ` Andi Kleen
2004-03-26 3:23 ` Nick Piggin