public inbox for linux-kernel@vger.kernel.org
* [Fwd: Re: SMP Panic caused by [PATCH] sched: consolidate sched domains]
@ 2004-09-03 21:21 James Bottomley
  2004-09-03 21:59 ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2004-09-03 21:21 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: William Lee Irwin III, Jesse Barnes, Matthew Dobson, Nick Piggin,
	Linux Kernel

[-- Attachment #1: Type: text/plain, Size: 242 bytes --]

Could we get this in please?  The current screw up in the scheduling
domain patch means that any architecture that actually hotplugs CPUs
will crash in find_busiest_group() ... and I notice this has just bitten
the z Series people...

James


[-- Attachment #2: Forwarded message - Re: SMP Panic caused by [PATCH] sched: consolidate sched domains --]
[-- Type: message/rfc822, Size: 3528 bytes --]

From: William Lee Irwin III <wli@holomorphy.com>
To: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jesse Barnes <jbarnes@engr.sgi.com>, Andrew Morton <akpm@osdl.org>, Linus Torvalds <torvalds@osdl.org>, Matthew Dobson <colpatch@us.ibm.com>, Nick Piggin <nickpiggin@yahoo.com.au>, Linux Kernel <linux-kernel@vger.kernel.org>
Subject: Re: SMP Panic caused by [PATCH] sched: consolidate sched domains
Date: Sun, 29 Aug 2004 10:50:58 -0700
Message-ID: <20040829175058.GP5492@holomorphy.com>

On Sun, Aug 29, 2004 at 10:40:39AM -0700, William Lee Irwin III wrote:
> Okay, if you prefer the #ifdef:

And for the other half of it:


Index: wait-2.6.9-rc1-mm1/kernel/sched.c
===================================================================
--- wait-2.6.9-rc1-mm1.orig/kernel/sched.c	2004-08-28 11:41:47.000000000 -0700
+++ wait-2.6.9-rc1-mm1/kernel/sched.c	2004-08-29 10:46:52.543081208 -0700
@@ -4224,7 +4224,11 @@
 		sd = &per_cpu(phys_domains, i);
 		group = cpu_to_phys_group(i);
 		*sd = SD_CPU_INIT;
+#ifdef CONFIG_NUMA
 		sd->span = nodemask;
+#else
+		sd->span = cpu_possible_map;
+#endif
 		sd->parent = p;
 		sd->groups = &sched_group_phys[group];
 
@@ -4262,6 +4266,7 @@
 						&cpu_to_isolated_group);
 	}
 
+#ifdef CONFIG_NUMA
 	/* Set up physical groups */
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		cpumask_t nodemask = node_to_cpumask(i);
@@ -4273,6 +4278,10 @@
 		init_sched_build_groups(sched_group_phys, nodemask,
 						&cpu_to_phys_group);
 	}
+#else
+	init_sched_build_groups(sched_group_phys, cpu_possible_map,
+							&cpu_to_phys_group);
+#endif
 
 #ifdef CONFIG_NUMA
 	/* Set up node groups */


* Re: [Fwd: Re: SMP Panic caused by [PATCH] sched: consolidate sched domains]
  2004-09-03 21:21 [Fwd: Re: SMP Panic caused by [PATCH] sched: consolidate sched domains] James Bottomley
@ 2004-09-03 21:59 ` Andrew Morton
  2004-09-03 22:13   ` James Bottomley
  2004-09-03 22:22   ` William Lee Irwin III
  0 siblings, 2 replies; 9+ messages in thread
From: Andrew Morton @ 2004-09-03 21:59 UTC (permalink / raw)
  To: James Bottomley
  Cc: torvalds, wli, jbarnes, colpatch, nickpiggin, linux-kernel

James Bottomley <James.Bottomley@SteelEye.com> wrote:
>
> Could we get this in please?  The current screw up in the scheduling
> domain patch means that any architecture that actually hotplugs CPUs
> will crash in find_busiest_group() ... and I notice this has just bitten
> the z Series people...

Have we yet seen anything which looks like a completed and tested patch?


* Re: [Fwd: Re: SMP Panic caused by [PATCH] sched: consolidate sched domains]
  2004-09-03 21:59 ` Andrew Morton
@ 2004-09-03 22:13   ` James Bottomley
  2004-09-03 22:22   ` William Lee Irwin III
  1 sibling, 0 replies; 9+ messages in thread
From: James Bottomley @ 2004-09-03 22:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, wli, jbarnes, colpatch, nickpiggin, Linux Kernel

On Fri, 2004-09-03 at 17:59, Andrew Morton wrote:
> James Bottomley <James.Bottomley@SteelEye.com> wrote:
> >
> > Could we get this in please?  The current screw up in the scheduling
> > domain patch means that any architecture that actually hotplugs CPUs
> > will crash in find_busiest_group() ... and I notice this has just bitten
> > the z Series people...
> 
> Have we yet seen anything which looks like a completed and tested patch?

The attached was what I've tested on parisc.  It looks complete to me;
what's wrong with it?

James




* Re: [Fwd: Re: SMP Panic caused by [PATCH] sched: consolidate sched domains]
  2004-09-03 21:59 ` Andrew Morton
  2004-09-03 22:13   ` James Bottomley
@ 2004-09-03 22:22   ` William Lee Irwin III
       [not found]     ` <20040903153434.15719192.akpm@osdl.org>
  1 sibling, 1 reply; 9+ messages in thread
From: William Lee Irwin III @ 2004-09-03 22:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, torvalds, jbarnes, colpatch, nickpiggin,
	linux-kernel

James Bottomley <James.Bottomley@SteelEye.com> wrote:
>> Could we get this in please?  The current screw up in the scheduling
>> domain patch means that any architecture that actually hotplugs CPUs
>> will crash in find_busiest_group() ... and I notice this has just bitten
>> the z Series people...

On Fri, Sep 03, 2004 at 02:59:25PM -0700, Andrew Morton wrote:
> Have we yet seen anything which looks like a completed and tested patch?

This is the whole thing; the "other half" referred to a new hunk added to
the patch (identical to this one) posted in its entirety.


-- wli


* [sched] fix sched_domains hotplug bootstrap ordering vs. cpu_online_map issue
       [not found]     ` <20040903153434.15719192.akpm@osdl.org>
@ 2004-09-03 22:45       ` William Lee Irwin III
  2004-09-04  1:57         ` Nick Piggin
  2004-09-05 11:46         ` Ingo Molnar
  0 siblings, 2 replies; 9+ messages in thread
From: William Lee Irwin III @ 2004-09-03 22:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Jesse Barnes, Linus Torvalds, Nick Piggin,
	linux-kernel

William Lee Irwin III <wli@holomorphy.com> wrote:
>> This is the whole thing; the "other half" referred to a new hunk added to
>> the patch (identical to this one) posted in its entirety.

On Fri, Sep 03, 2004 at 03:34:34PM -0700, Andrew Morton wrote:
> ho-hum. changelog, please?

cpu_online_map is not set up at the time of sched domain initialization
when hotplug cpu paths are used for SMP booting. At this phase of
bootstrapping, cpu_possible_map can be used by the various
architectures using cpu hotplugging for SMP bootstrap, but the
manipulations of cpu_online_map done on behalf of NUMA architectures,
done indirectly via node_to_cpumask(), can't, because cpu_online_map
starts depopulated and hasn't yet been populated. On true NUMA
architectures this is a distinct cpumask_t from cpu_online_map and so
the unpatched code works on NUMA; on non-NUMA architectures the
definition of node_to_cpumask() this way breaks and would require an
invasive sweeping of users of node_to_cpumask() to change it to e.g.
cpu_possible_map, as cpu_possible_map is not suitable for use at
runtime as a substitute for cpu_online_map.

Signed-off-by: William Irwin <wli@holomorphy.com>


Index: wait-2.6.9-rc1-mm1/kernel/sched.c
===================================================================
--- wait-2.6.9-rc1-mm1.orig/kernel/sched.c	2004-08-28 11:41:47.000000000 -0700
+++ wait-2.6.9-rc1-mm1/kernel/sched.c	2004-08-29 10:46:52.543081208 -0700
@@ -4224,7 +4224,11 @@
 		sd = &per_cpu(phys_domains, i);
 		group = cpu_to_phys_group(i);
 		*sd = SD_CPU_INIT;
+#ifdef CONFIG_NUMA
 		sd->span = nodemask;
+#else
+		sd->span = cpu_possible_map;
+#endif
 		sd->parent = p;
 		sd->groups = &sched_group_phys[group];
 
@@ -4262,6 +4266,7 @@
 						&cpu_to_isolated_group);
 	}
 
+#ifdef CONFIG_NUMA
 	/* Set up physical groups */
 	for (i = 0; i < MAX_NUMNODES; i++) {
 		cpumask_t nodemask = node_to_cpumask(i);
@@ -4273,6 +4278,10 @@
 		init_sched_build_groups(sched_group_phys, nodemask,
 						&cpu_to_phys_group);
 	}
+#else
+	init_sched_build_groups(sched_group_phys, cpu_possible_map,
+							&cpu_to_phys_group);
+#endif
 
 #ifdef CONFIG_NUMA
 	/* Set up node groups */


* Re: [sched] fix sched_domains hotplug bootstrap ordering vs. cpu_online_map issue
  2004-09-03 22:45       ` [sched] fix sched_domains hotplug bootstrap ordering vs. cpu_online_map issue William Lee Irwin III
@ 2004-09-04  1:57         ` Nick Piggin
  2004-09-05 11:46         ` Ingo Molnar
  1 sibling, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2004-09-04  1:57 UTC (permalink / raw)
  To: William Lee Irwin III
  Cc: Andrew Morton, James Bottomley, Jesse Barnes, Linus Torvalds,
	linux-kernel, Martin Schwidefsky

William Lee Irwin III wrote:
> William Lee Irwin III <wli@holomorphy.com> wrote:
> 
>>>This is the whole thing; the "other half" referred to a new hunk added to
>>>the patch (identical to this one) posted in its entirety.
> 
> 
> On Fri, Sep 03, 2004 at 03:34:34PM -0700, Andrew Morton wrote:
> 
>>ho-hum. changelog, please?
> 
> 
> cpu_online_map is not set up at the time of sched domain initialization
> when hotplug cpu paths are used for SMP booting. At this phase of
> bootstrapping, cpu_possible_map can be used by the various
> architectures using cpu hotplugging for SMP bootstrap, but the
> manipulations of cpu_online_map done on behalf of NUMA architectures,
> done indirectly via node_to_cpumask(), can't, because cpu_online_map
> starts depopulated and hasn't yet been populated. On true NUMA
> architectures this is a distinct cpumask_t from cpu_online_map and so
> the unpatched code works on NUMA; on non-NUMA architectures the
> definition of node_to_cpumask() this way breaks and would require an
> invasive sweeping of users of node_to_cpumask() to change it to e.g.
> cpu_possible_map, as cpu_possible_map is not suitable for use at
> runtime as a substitute for cpu_online_map.
> 
> Signed-off-by: William Irwin <wli@holomorphy.com>
> 

Yeah I guess this is probably the best thing to do for now.

Martin, I wonder if this patch fixes your z/VM problem?


* Re: [sched] fix sched_domains hotplug bootstrap ordering vs. cpu_online_map issue
  2004-09-03 22:45       ` [sched] fix sched_domains hotplug bootstrap ordering vs. cpu_online_map issue William Lee Irwin III
  2004-09-04  1:57         ` Nick Piggin
@ 2004-09-05 11:46         ` Ingo Molnar
  2004-09-05 22:35           ` James Bottomley
  1 sibling, 1 reply; 9+ messages in thread
From: Ingo Molnar @ 2004-09-05 11:46 UTC (permalink / raw)
  To: William Lee Irwin III, Andrew Morton, James Bottomley,
	Jesse Barnes, Linus Torvalds, Nick Piggin, linux-kernel


* William Lee Irwin III <wli@holomorphy.com> wrote:

> William Lee Irwin III <wli@holomorphy.com> wrote:
> >> This is the whole thing; the "other half" referred to a new hunk added to
> >> the patch (identical to this one) posted in its entirety.
> 
> On Fri, Sep 03, 2004 at 03:34:34PM -0700, Andrew Morton wrote:
> > ho-hum. changelog, please?
> 
> cpu_online_map is not set up at the time of sched domain
> initialization when hotplug cpu paths are used for SMP booting. At
> this phase of bootstrapping, cpu_possible_map can be used by the
> various architectures using cpu hotplugging for SMP bootstrap, but the
> manipulations of cpu_online_map done on behalf of NUMA architectures,
> done indirectly via node_to_cpumask(), can't, because cpu_online_map
> starts depopulated and hasn't yet been populated. On true NUMA
> architectures this is a distinct cpumask_t from cpu_online_map and so
> the unpatched code works on NUMA; on non-NUMA architectures the
> definition of node_to_cpumask() this way breaks and would require an
> invasive sweeping of users of node_to_cpumask() to change it to e.g.
> cpu_possible_map, as cpu_possible_map is not suitable for use at
> runtime as a substitute for cpu_online_map.
> 
> Signed-off-by: William Irwin <wli@holomorphy.com>

Signed-off-by: Ingo Molnar <mingo@elte.hu>

	Ingo


* Re: [sched] fix sched_domains hotplug bootstrap ordering vs. cpu_online_map issue
  2004-09-05 11:46         ` Ingo Molnar
@ 2004-09-05 22:35           ` James Bottomley
  2004-09-06  2:48             ` Nick Piggin
  0 siblings, 1 reply; 9+ messages in thread
From: James Bottomley @ 2004-09-05 22:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: William Lee Irwin III, Andrew Morton, Jesse Barnes,
	Linus Torvalds, Nick Piggin, Linux Kernel

On Sun, 2004-09-05 at 07:46, Ingo Molnar wrote:
> > cpu_online_map is not set up at the time of sched domain
> > initialization when hotplug cpu paths are used for SMP booting. At
> > this phase of bootstrapping, cpu_possible_map can be used by the
> > various architectures using cpu hotplugging for SMP bootstrap, but the
> > manipulations of cpu_online_map done on behalf of NUMA architectures,
> > done indirectly via node_to_cpumask(), can't, because cpu_online_map
> > starts depopulated and hasn't yet been populated. On true NUMA
> > architectures this is a distinct cpumask_t from cpu_online_map and so
> > the unpatched code works on NUMA; on non-NUMA architectures the
> > definition of node_to_cpumask() this way breaks and would require an
> > invasive sweeping of users of node_to_cpumask() to change it to e.g.
> > cpu_possible_map, as cpu_possible_map is not suitable for use at
> > runtime as a substitute for cpu_online_map.
> > 
> > Signed-off-by: William Irwin <wli@holomorphy.com>
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>

Well this patch got in, which is what I want, since it allows the
non-NUMA machines to work with hotplug CPUs again.  However, is anyone
actually looking to fix this for real?

The fundamental problem is that NUMA or the scheduler (or both) are
broken with regard to hotplug.

The origin of the breakage is the differences between cpu_possible_map
and cpu_online_map.  In hotplug CPU, there are two ways to do
initialisations: you can initialise from cpu_online_map, but then you
*must* have a cpu hotplug notify listener to add data structures for the
extra CPUs as they come on-line, or you can initialise from
cpu_possible_map and not bother with a notifier.  The disadvantage of
the latter is that cpu_possible_map may be vastly larger than
cpu_online_map ever gets to, thus wasting valuable kernel memory.

The scheduler code is schizophrenic in this regard in that it does both:
it initialises static data structures from cpu_possible_map, but it also
has a hotplug cpu listener for starting things like the migration
threads.

I suspect the NUMA people would like us all to go to the former method
(initialise only from cpu_online_map and have a proper hotplug listener)
since their possible maps are pretty huge.  However, which is it to be:
fix NUMA (to have two cpu_to_node() maps for the possible and online
cpus per node) or fix the scheduler to do initialisation correctly?

Perhaps this should be phased: change NUMA first temporarily for phase
one and then fix the scheduler (and everyone else initialising from
cpu_possible_map) in the second.

James




* Re: [sched] fix sched_domains hotplug bootstrap ordering vs. cpu_online_map issue
  2004-09-05 22:35           ` James Bottomley
@ 2004-09-06  2:48             ` Nick Piggin
  0 siblings, 0 replies; 9+ messages in thread
From: Nick Piggin @ 2004-09-06  2:48 UTC (permalink / raw)
  To: James Bottomley
  Cc: Ingo Molnar, William Lee Irwin III, Andrew Morton, Jesse Barnes,
	Linus Torvalds, Linux Kernel

James Bottomley wrote:

>
>Well this patch got in, which is what I want, since it allows the
>non-NUMA machines to work with hotplug CPUs again.  However, is anyone
>actually looking to fix this for real?
>
>

I think someone else (tm) is looking at it :)
Some of the IBM hotplug guys I think.

>The fundamental problem is that NUMA or the scheduler (or both) are
>broken with regard to hotplug.
>
>The origin of the breakage is the differences between cpu_possible_map
>and cpu_online_map.  In hotplug CPU, there are two ways to do
>initialisations: you can initialise from cpu_online_map, but then you
>*must* have a cpu hotplug notify listener to add data structures for the
>extra CPUs as they come on-line, or you can initialise from
>cpu_possible_map and not bother with a notifier.  The disadvantage of
>the latter is that cpu_possible_map may be vastly larger than
>cpu_online_map ever gets to, thus wasting valuable kernel memory.
>
>The scheduler code is schizophrenic in this regard in that it does both:
>it initialises static data structures from cpu_possible_map, but it also
>has a hotplug cpu listener for starting things like the migration
>threads.
>
>I suspect the NUMA people would like us all to go to the former method
>(initialise only from cpu_online_map and have a proper hotplug listener)
>since their possible maps are pretty huge.  However, which is it to be:
>fix NUMA (to have two cpu_to_node() maps for the possible and online
>cpus per node) or fix the scheduler to do initialisation correctly?
>
>Perhaps this should be phased: change NUMA first temporarily for phase
>one and then fix the scheduler (and everyone else initialising from
>cpu_possible_map) in the second.
>
>

The scheduler *should* be able to be fixed nicely by using cpu_online_map
everywhere, and basically undoing and then redoing the domains setup before
and after the hotplug, respectively.

So you'd re-attach the dummy domain to all CPUs, do the hotplug operation,
then setup the domains again and re-attach them.

This whole sequence could be pretty expensive, but I don't think the hotplug
guys care. It would allow us to get rid of cpus_and(... cpu_online_map)
from a lot of places in the scheduler too, which would be nice.

The actual code to do it shouldn't be more than a few lines (but I could be
overlooking something).


