* Hang in sched_balance_self()
From: Jack Steiner @ 2005-06-03 22:55 UTC
  To: nickpiggin; +Cc: linux-kernel

Nick -

The latest 2.6.12-rc5-mm2 tree fails to boot on some of the 64p
SGI systems. The system hangs immediately after printing:

	...
	Inode-cache hash table entries: 8388608 (order: 12, 67108864 bytes)
	Mount-cache hash table entries: 1024
	Boot processor id 0x0/0x0
	Brought up 64 CPUs
	Total of 64 processors activated (118415.36 BogoMIPS).

I have isolated the failure to cpu 0 hanging in sched_balance_self() during
a fork (or clone). The "while" loop at the end of the function never
terminates, i.e. sd is never NULL.

Is this a problem that you have seen before? If not, I'll do some
more digging & isolate the problem.

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.

* Re: Hang in sched_balance_self()
From: Nick Piggin @ 2005-06-04 2:09 UTC
  To: Jack Steiner; +Cc: linux-kernel

Jack Steiner wrote:
> Nick -
>
> The latest 2.6.12-rc5-mm2 tree fails to boot on some of the 64p
> SGI systems. The system hangs immediately after printing:
>
> 	...
> 	Inode-cache hash table entries: 8388608 (order: 12, 67108864 bytes)
> 	Mount-cache hash table entries: 1024
> 	Boot processor id 0x0/0x0
> 	Brought up 64 CPUs
> 	Total of 64 processors activated (118415.36 BogoMIPS).
>
>
> I have isolated the failure to cpu 0 hanging in sched_balance_self() during
> a fork (or clone). The "while" loop at the end of the function never
> terminates, i.e. sd is never NULL.
>
> Is this a problem that you have seen before? If not, I'll do some
> more digging & isolate the problem.
>

Hi Jack,

I have not seen this problem. However, I don't think this code has had
much testing with multilevel NUMA domains.

If you could do some more digging, that would be great. I plan to get
time on a 64-way IA64 next week to look at some scheduler issues, so
I'll keep this in mind if you haven't made any progress.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

* Re: Hang in sched_balance_self()
From: Nick Piggin @ 2005-06-07 14:09 UTC
  To: Jack Steiner; +Cc: linux-kernel, John Hawkes

[-- Attachment #1: Type: text/plain, Size: 979 bytes --]

Jack Steiner wrote:
> Nick -
>
> The latest 2.6.12-rc5-mm2 tree fails to boot on some of the 64p
> SGI systems. The system hangs immediately after printing:
>
> 	...
> 	Inode-cache hash table entries: 8388608 (order: 12, 67108864 bytes)
> 	Mount-cache hash table entries: 1024
> 	Boot processor id 0x0/0x0
> 	Brought up 64 CPUs
> 	Total of 64 processors activated (118415.36 BogoMIPS).
>
>
> I have isolated the failure to cpu 0 hanging in sched_balance_self() during
> a fork (or clone). The "while" loop at the end of the function never
> terminates, i.e. sd is never NULL.
>
> Is this a problem that you have seen before? If not, I'll do some
> more digging & isolate the problem.
>

Hi Jack,

I haven't completely got to the bottom of this yet, but I was able
to reproduce it on a 64-way Altix, and something like the attached patch
seems to 'fix' the problem.

I didn't have time to find what's gone wrong tonight, but I'll get
to that tomorrow.

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: sched-balance-self-fix.patch --]
[-- Type: text/plain, Size: 745 bytes --]

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c	2005-06-08 00:01:53.000000000 +1000
+++ linux-2.6/kernel/sched.c	2005-06-08 00:02:47.000000000 +1000
@@ -1113,6 +1113,7 @@ static int sched_balance_self(int cpu, i
 		cpumask_t span;
 		struct sched_group *group;
 		int new_cpu;
+		int weight;
 
 		span = sd->span;
 		group = find_idlest_group(sd, t, cpu);
@@ -1127,8 +1128,9 @@ static int sched_balance_self(int cpu, i
 		cpu = new_cpu;
 nextlevel:
 		sd = NULL;
+		weight = cpus_weight(span);
 		for_each_domain(cpu, tmp) {
-			if (cpus_subset(span, tmp->span))
+			if (weight <= cpus_weight(tmp->span))
 				break;
 			if (tmp->flags & flag)
 				sd = tmp;
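
As a rough illustration of how the two break conditions in the patch
differ, here is a small user-space sketch. It is not the kernel code:
toy_mask_t, toy_weight(), toy_subset(), toy_next_level() and the CPU 5
topology are all made up for the example, and every domain is assumed to
have the requested flag set. It only shows how a subset test and a weight
comparison can pick different levels; it does not claim to identify the
root cause, which the thread leaves open.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for cpumask_t: one bit per CPU (hypothetical type). */
typedef uint64_t toy_mask_t;

/* Number of CPUs in the mask (models cpus_weight()). */
static int toy_weight(toy_mask_t m)
{
	int n = 0;

	while (m) {
		m &= m - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* True if every CPU in a is also in b (models cpus_subset(a, b)). */
static int toy_subset(toy_mask_t a, toy_mask_t b)
{
	return (a & ~b) == 0;
}

/*
 * Walk one CPU's domain spans from smallest to largest, the way the
 * for_each_domain() loop in the patch does, and return the index of the
 * last span seen before the break condition fires, or -1 if the loop
 * breaks immediately (the sd == NULL case).
 * Assumption: the flag check always passes.
 */
static int toy_next_level(const toy_mask_t *spans, int n,
			  toy_mask_t span, int use_weight)
{
	int i, sd = -1;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		if (use_weight ? toy_weight(span) <= toy_weight(spans[i])
			       : toy_subset(span, spans[i]))
			break;
		sd = i;
	}
	return sd;
}

/* Made-up topology: the domains of CPU 5, smallest first. */
static const toy_mask_t toy_cpu5_spans[] = { 0x30, 0xF0, 0xFF };
```

With span = 0x0F (CPUs 0-3) and the made-up CPU 5 hierarchy, the subset
test settles on 0xF0, which is no smaller than the span it started from,
while the weight test settles on 0x30. Any domain the weight test selects
has strictly fewer CPUs than the previous span, so a repeated descent has
to run out of levels.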

* Re: Hang in sched_balance_self()
From: Jack Steiner @ 2005-06-07 14:54 UTC
  To: Nick Piggin; +Cc: linux-kernel, John Hawkes

On Wed, Jun 08, 2005 at 12:09:09AM +1000, Nick Piggin wrote:
> Jack Steiner wrote:
> >Nick -
> >
> >The latest 2.6.12-rc5-mm2 tree fails to boot on some of the 64p
> >SGI systems. The system hangs immediately after printing:
> >
> >	...
> >	Inode-cache hash table entries: 8388608 (order: 12, 67108864 bytes)
> >	Mount-cache hash table entries: 1024
> >	Boot processor id 0x0/0x0
> >	Brought up 64 CPUs
> >	Total of 64 processors activated (118415.36 BogoMIPS).
> >
> >
> >I have isolated the failure to cpu 0 hanging in sched_balance_self() during
> >a fork (or clone). The "while" loop at the end of the function never
> >terminates, i.e. sd is never NULL.
> >
> >Is this a problem that you have seen before? If not, I'll do some
> >more digging & isolate the problem.
> >
>
> Hi Jack,
> I haven't completely got to the bottom of this yet, but I was able
> to reproduce it on a 64-way Altix, and something like the attached patch
> seems to 'fix' the problem.

The fix works for me.

I'll send you the notes that I collected. They may help explain what is
going wrong. (I learned a lot about scheduler domains...:-)

>
> I didn't have time to find what's gone wrong tonight, but I'll get
> to that tomorrow.
>
> --
> SUSE Labs, Novell Inc.
>
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c	2005-06-08 00:01:53.000000000 +1000
> +++ linux-2.6/kernel/sched.c	2005-06-08 00:02:47.000000000 +1000
> @@ -1113,6 +1113,7 @@ static int sched_balance_self(int cpu, i
>  		cpumask_t span;
>  		struct sched_group *group;
>  		int new_cpu;
> +		int weight;
>
>  		span = sd->span;
>  		group = find_idlest_group(sd, t, cpu);
> @@ -1127,8 +1128,9 @@ static int sched_balance_self(int cpu, i
>  		cpu = new_cpu;
>  nextlevel:
>  		sd = NULL;
> +		weight = cpus_weight(span);
>  		for_each_domain(cpu, tmp) {
> -			if (cpus_subset(span, tmp->span))
> +			if (weight <= cpus_weight(tmp->span))
>  				break;
>  			if (tmp->flags & flag)
>  				sd = tmp;

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.

Thread overview: 4+ messages (end of thread, newest 2005-06-07 15:36 UTC)
2005-06-03 22:55 Hang in sched_balance_self() Jack Steiner
2005-06-04  2:09 ` Nick Piggin
2005-06-07 14:09 ` Nick Piggin
2005-06-07 14:54   ` Jack Steiner