* Hard hang in hypervisor!?
From: Linas Vepstas @ 2007-10-09 20:37 UTC
To: linuxppc-dev
I was futzing with linux-2.6.23-rc8-mm1 in a power6 lpar when,
for whatever reason, a spinlock locked up. The bizarre thing
was that the rest of the system locked up as well: an ssh terminal,
and also an hvc console.
Breaking into the debugger showed 4 cpus, 1 of which was
deadlocked in the spinlock, and the other 3 in
.pseries_dedicated_idle_sleep
This was, ahhh, unexpected. What's up with that? Can
anyone provide any insight?
I should mention:
-- prior to the complete hard lockup, I did see
BUG: soft lockup - CPU#0 stuck for 11s! [ip:4473]
go off, and I did manage to sneak in a few commands
into the console and the ssh session. Then it locked
up hard -- but still not completely -- exactly
360 seconds later, a kernel thread ran for a while,
producing some console output, even though the
keyboard and console were locked up.
--linas
* Re: Hard hang in hypervisor!?
From: Nathan Lynch @ 2007-10-09 21:18 UTC
To: Linas Vepstas; +Cc: linuxppc-dev
Linas Vepstas wrote:
>
> I was futzing with linux-2.6.23-rc8-mm1 in a power6 lpar when,
> for whatever reason, a spinlock locked up. The bizarre thing
> was that the rest of the system locked up as well: an ssh terminal,
> and also an hvc console.
>
> Breaking into the debugger showed 4 cpus, 1 of which was
> deadlocked in the spinlock, and the other 3 in
> .pseries_dedicated_idle_sleep
>
> This was, ahhh, unexpected. What's up with that? Can
> anyone provide any insight?
Sounds consistent with a task trying to double-acquire the lock, or an
interrupt handler attempting to acquire a lock that the current task
holds. Or maybe even an uninitialized spinlock. Do you know which
lock it was?
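For illustration, the interrupt-handler case is the classic pattern
below. This is just a sketch, and all of the names are made up:

#include <linux/interrupt.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_lock);

/* runs in hard-irq context */
static irqreturn_t demo_irq_handler(int irq, void *dev_id)
{
	spin_lock(&demo_lock);	/* spins forever if this CPU already holds it */
	/* ... touch shared state ... */
	spin_unlock(&demo_lock);
	return IRQ_HANDLED;
}

/* runs in task context */
static void demo_update(void)
{
	spin_lock(&demo_lock);	/* bug: should be spin_lock_irqsave() */
	/*
	 * If demo_irq_handler() fires on this CPU right here, it spins
	 * on demo_lock and never returns, so this unlock is never
	 * reached and the CPU is lost for good.
	 */
	spin_unlock(&demo_lock);
}

Lockdep (CONFIG_PROVE_LOCKING) would normally flag that pattern before
it deadlocks for real.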
* Re: Hard hang in hypervisor!?
From: Linas Vepstas @ 2007-10-09 21:28 UTC
To: Nathan Lynch; +Cc: linuxppc-dev
On Tue, Oct 09, 2007 at 04:18:19PM -0500, Nathan Lynch wrote:
> Linas Vepstas wrote:
> >
> > I was futzing with linux-2.6.23-rc8-mm1 in a power6 lpar when,
> > for whatever reason, a spinlock locked up. The bizarre thing
> > was that the rest of the system locked up as well: an ssh terminal,
> > and also an hvc console.
> >
> > Breaking into the debugger showed 4 cpus, 1 of which was
> > deadlocked in the spinlock, and the other 3 in
> > .pseries_dedicated_idle_sleep
> >
> > This was, ahhh, unexpected. What's up with that? Can
> > anyone provide any insight?
>
> Sounds consistent with a task trying to double-acquire the lock, or an
> interrupt handler attempting to acquire a lock that the current task
> holds. Or maybe even an uninitialized spinlock. Do you know which
> lock it was?
Not sure .. trying to find out now. But why would that kill the
ssh session, and the console? Sure, so maybe one cpu is spinning,
but the other three can still take interrupts, right? The ssh session
should have been generating ethernet card interrupts, and the console
should have been generating hvc interrupts.
Err .. it was cpu 0 that was spinlocked. Are interrupts not
distributed?
Perhaps I should IRC this ...
--linas
* never mind .. [was Re: Hard hang in hypervisor!?
From: Linas Vepstas @ 2007-10-09 23:22 UTC
To: Nathan Lynch; +Cc: linuxppc-dev
On Tue, Oct 09, 2007 at 04:28:10PM -0500, Linas Vepstas wrote:
>
> Perhaps I should IRC this ...
yeah. I guess I'd forgotten how funky things can get. So never mind ...
--linas
* Re: Hard hang in hypervisor!?
From: Paul Mackerras @ 2007-10-11 0:04 UTC
To: Linas Vepstas; +Cc: linuxppc-dev, Nathan Lynch
Linas Vepstas writes:
> Err .. it was cpu 0 that was spinlocked. Are interrupts not
> distributed?
We have some bogosities in the xics code that I noticed a couple of
days ago. Basically we only set the xics to distribute interrupts to
all cpus if (a) the affinity mask is equal to CPU_MASK_ALL (which has
ones in every bit position from 0 to NR_CPUS-1) and (b) all present
cpus are online (cpu_online_map == cpu_present_map). Otherwise we
direct interrupts to the first cpu in the affinity map. So you can
easily have the affinity mask containing all the online cpus and still
not get distributed interrupts.
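Roughly, the decision goes like this -- a simplified sketch, not the
literal xics.c source, and pick_irq_server is just an illustrative
name:

#include <linux/cpumask.h>

#define DISTRIB_SERVER	(-1)	/* stand-in for the global distribution server */

static int pick_irq_server(cpumask_t affinity)
{
	cpumask_t tmp;

	/* (a) the affinity mask must be literally CPU_MASK_ALL ... */
	if (!cpus_equal(affinity, CPU_MASK_ALL)) {
		cpus_and(tmp, cpu_online_map, affinity);
		return first_cpu(tmp);		/* one CPU gets everything */
	}

	/* (b) ... and every present CPU must also be online */
	if (cpus_equal(cpu_online_map, cpu_present_map))
		return DISTRIB_SERVER;		/* spread across all CPUs */

	return first_cpu(cpu_online_map);	/* again, a single CPU */
}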
So in your case it's quite possible that all interrupts were directed
to cpu 0.
Paul.
* Re: Hard hang in hypervisor!?
From: Linas Vepstas @ 2007-10-11 20:30 UTC
To: Paul Mackerras; +Cc: linuxppc-dev, Nathan Lynch
On Thu, Oct 11, 2007 at 10:04:40AM +1000, Paul Mackerras wrote:
> Linas Vepstas writes:
>
> > Err .. it was cpu 0 that was spinlocked. Are interrupts not
> > distributed?
>
> We have some bogosities in the xics code that I noticed a couple of
> days ago. Basically we only set the xics to distribute interrupts to
> all cpus if (a) the affinity mask is equal to CPU_MASK_ALL (which has
> ones in every bit position from 0 to NR_CPUS-1) and (b) all present
> cpus are online (cpu_online_map == cpu_present_map). Otherwise we
> direct interrupts to the first cpu in the affinity map. So you can
> easily have the affinity mask containing all the online cpus and still
> not get distributed interrupts.
>
> So in your case it's quite possible that all interrupts were directed
> to cpu 0.
Thanks,
I'll give this a whirl if I don't get distracted by other tasks.
A simple cat /proc/interrupts shows them evenly distributed on
my "usual" box, and all glommed up on cpu 0 on the one thats
giving me fits.
Also, I noticed years ago that "BAD" was non-zero and large.
Vowed to look into it someday ...
--linas
* Re: Hard hang in hypervisor!?
From: Milton Miller @ 2007-10-11 21:35 UTC
To: Paul Mackerras; +Cc: linuxppc-dev
On Thu Oct 11 10:04:40 EST 2007, Paul Mackerras wrote:
> Linas Vepstas writes:
>> Err .. it was cpu 0 that was spinlocked. Are interrupts not
>> distributed?
>
> We have some bogosities in the xics code that I noticed a couple of
> days ago. Basically we only set the xics to distribute interrupts to
> all cpus if (a) the affinity mask is equal to CPU_MASK_ALL (which has
> ones in every bit position from 0 to NR_CPUS-1) and (b) all present
> cpus are online (cpu_online_map == cpu_present_map). Otherwise we
> direct interrupts to the first cpu in the affinity map. So you can
> easily have the affinity mask containing all the online cpus and still
> not get distributed interrupts.
The second condition was just added to try to fix some issues where a
vendor wants to always run the kdump kernel with maxcpus=1 on all
architectures, and the emulated xics on js20 was not working.
For a true xics, this should work because we (1) remove all but 1
cpu from the global server list and (2) raise the priority of the
cpu to disabled, so the hardware will deliver to another cpu in the
partition.
http://ozlabs.org/pipermail/linuxppc-dev/2006-December/028941.html
http://ozlabs.org/pipermail/linuxppc-dev/2007-January/029607.html
http://ozlabs.org/pipermail/linuxppc-dev/2007-March/032621.html
However, my experience the other day on a js21 was that firmware
delivered either to all cpus (if we bound to the global server) or
the first online cpu in the partition, regardless of which cpu we
bound the interrupt to, so I don't know that the change will fix
the original problem.
It does mean that taking a cpu offline without dlpar-removing it from
the kernel makes it impossible to actually distribute interrupts to
all cpus.
I'd be happy to, say, remove the extra check and work with firmware to
properly distribute the interrupts.
milton