public inbox for linux-kernel@vger.kernel.org
* 2.4.31 hangs, no information on console or serial port
@ 2006-02-21 15:23 David Golombek
  2006-02-21 15:29 ` Benjamin LaHaise
  0 siblings, 1 reply; 7+ messages in thread
From: David Golombek @ 2006-02-21 15:23 UTC (permalink / raw)
  To: linux-kernel

I have a box running a modified Debian/woody system and 2.4.31.  It is
intermittently hanging such that:

* All logging to /var/log ceases.
* Machine is still pingable.
* Machine can be telneted to on time port, but no time is echoed.
* After attaching a console+keyboard, console would not unblank.
* Nothing responded when attaching a serial console.
* Machine does not respond to Ctrl-Alt-Del.
* No DMI messages are logged.
* Hang is persistent until physical reboot.

This has happened 4 times, on 2 separate machines (under roughly
similar conditions).  Machines are up variable amounts of time before
crashing, between many weeks and less than 1 day.  Nothing unusual is
logged in /var/log/{daemon.log,kern.log,messages,syslog} prior to the
hang, except that /var/log/messages includes the "TCP: Treason
uncloaked!" warnings that are fixed in 2.4.32.  No users were logged
on at the time of 3 of the 4 crashes, and no local user activity was
present at the time of the 4th.

The machines are Intel P4s with 2GB of memory.

The machine is under relatively high load and has a custom userspace
nfs server running on it (which is potentially to blame, but we've
been unable to determine how).  The custom userspace nfs server and
tomcat4 are the primary applications running.

Any suggestions as to how we might debug this or possible causes would
be greatly appreciated.

Thanks,
Dave


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: 2.4.31 hangs, no information on console or serial port
  2006-02-21 15:23 2.4.31 hangs, no information on console or serial port David Golombek
@ 2006-02-21 15:29 ` Benjamin LaHaise
  2006-02-21 16:04   ` David Golombek
  2006-02-27 16:24   ` David Golombek
  0 siblings, 2 replies; 7+ messages in thread
From: Benjamin LaHaise @ 2006-02-21 15:29 UTC (permalink / raw)
  To: David Golombek; +Cc: linux-kernel

On Tue, Feb 21, 2006 at 10:23:56AM -0500, David Golombek wrote:
> Any suggestions as to how we might debug this or possible causes would
> be greatly appreciated.

Have you tried turning on the NMI watchdog (nmi_watchdog=1)?  It should 
be able to kick the machine out of the locked state, as these symptoms 
would hint at a spinlock deadlock with interrupts disabled.  Also, try 
to reproduce on the latest 2.4.33pre.  That said, for an I/O-intensive 
workload like you're running, 2.6 is much better, especially for systems 
using highmem.
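
A minimal sketch of how one might confirm that the watchdog is actually
ticking after booting with nmi_watchdog=1: sum the NMI row of
/proc/interrupts, wait, and check that the total has grown.  The helper
name nmi_total and the 10-second interval are illustrative assumptions,
not something from this thread.

```shell
#!/bin/sh
# Sum the per-CPU NMI counters from a /proc/interrupts-style file.
# nmi_total is a hypothetical helper; with no argument it reads /proc/interrupts.
nmi_total() {
    awk '$1 == "NMI:" {
             # Add up only the numeric columns (one per CPU); skip any trailing label.
             for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) sum += $i
         }
         END { print sum + 0 }' "${1:-/proc/interrupts}"
}

# Usage (commented out so that sourcing this file has no side effects):
#   before=$(nmi_total); sleep 10; after=$(nmi_total)
#   [ "$after" -gt "$before" ] && echo "NMI count is increasing"
```

A growing count only shows that NMIs are being delivered; it does not
by itself prove the watchdog will fire on a given lockup.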

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

* Re: 2.4.31 hangs, no information on console or serial port
  2006-02-21 15:29 ` Benjamin LaHaise
@ 2006-02-21 16:04   ` David Golombek
  2006-02-21 21:41     ` Willy Tarreau
  2006-02-27 16:24   ` David Golombek
  1 sibling, 1 reply; 7+ messages in thread
From: David Golombek @ 2006-02-21 16:04 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: linux-kernel

Benjamin LaHaise <bcrl@kvack.org> writes:
> On Tue, Feb 21, 2006 at 10:23:56AM -0500, David Golombek wrote:
> > Any suggestions as to how we might debug this or possible causes would
> > be greatly appreciated.
> 
> Have you tried turning on the NMI watchdog (nmi_watchdog=1)?  It
> should be able to kick the machine out of the locked state, as these
> symptoms would hint at a spinlock deadlock with interrupts disabled.
> Also, try to reproduce on the latest 2.4.33pre.  That said, for an
> io intensive workload like you're running, 2.6 is much better,
> especially for systems using highmem.

I'll enable nmi_watchdog as soon as we can bring the machine down,
thanks for the excellent suggestion. I'd entirely forgotten about the
watchdog.  I'll try switching to 2.4.33pre as soon as possible; it
certainly has several fixes we've been waiting for.  2.6 is still a
ways off, lots of qualification work to do.

Thanks,
Dave


* Re: 2.4.31 hangs, no information on console or serial port
  2006-02-21 16:04   ` David Golombek
@ 2006-02-21 21:41     ` Willy Tarreau
  0 siblings, 0 replies; 7+ messages in thread
From: Willy Tarreau @ 2006-02-21 21:41 UTC (permalink / raw)
  To: David Golombek; +Cc: Benjamin LaHaise, linux-kernel

On Tue, Feb 21, 2006 at 11:04:57AM -0500, David Golombek wrote:
> Benjamin LaHaise <bcrl@kvack.org> writes:
> > On Tue, Feb 21, 2006 at 10:23:56AM -0500, David Golombek wrote:
> > > Any suggestions as to how we might debug this or possible causes would
> > > be greatly appreciated.
> > 
> > Have you tried turning on the NMI watchdog (nmi_watchdog=1)?  It
> > should be able to kick the machine out of the locked state, as these
> > symptoms would hint at a spinlock deadlock with interrupts disabled.
> > Also, try to reproduce on the latest 2.4.33pre.  That said, for an
> > io intensive workload like you're running, 2.6 is much better,
> > especially for systems using highmem.
> 
> I'll enable nmi_watchdog as soon as we can bring the machine down,
> thanks for the excellent suggestion. I'd entirely forgotten about the
> watchdog.  I'll try switching to 2.4.33pre as soon as possible; it
> certainly has several fixes we've been waiting for.  2.6 is still a
> ways off, lots of qualification work to do.

BTW, if your console blanks, you can disable blanking with:

   # setterm -blank 0

Maybe you'll notice some "OOM: killing process" messages indicating
that some hungry process is going mad (possibly the NFS server).
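
As a rough sketch of how the logs could be searched for such messages
after a reboot: the exact wording of the OOM killer's output differs
between kernel versions, so the two patterns below are assumptions, and
oom_lines is a hypothetical helper name.

```shell
#!/bin/sh
# Print log lines that look like OOM-killer activity.
# Patterns are assumptions: "Out of Memory" and the whole word "OOM",
# matched case-insensitively; -w keeps words such as "room" from matching.
oom_lines() {
    grep -i -w -E 'out of memory|oom' "$@"
}

# Usage (file names taken from the original report):
#   oom_lines /var/log/kern.log /var/log/messages
```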

> Thanks,
> Dave

Regards,
Willy


* Re: 2.4.31 hangs, no information on console or serial port
  2006-02-21 15:29 ` Benjamin LaHaise
  2006-02-21 16:04   ` David Golombek
@ 2006-02-27 16:24   ` David Golombek
  2006-02-27 16:39     ` Benjamin LaHaise
  1 sibling, 1 reply; 7+ messages in thread
From: David Golombek @ 2006-02-27 16:24 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: linux-kernel

> On Tue, Feb 21, 2006 at 10:23:56AM -0500, David Golombek wrote:
> > I have a box running a modified Debian/woody system and 2.4.31.  It is
> > intermittently hanging such that:
> > 
> > * All logging to /var/log ceases.
> > * Machine is still pingable.
> > * Machine can be telneted to on time port, but no time is echoed.
> > * After attaching a console+keyboard, console would not unblank.
> > * Nothing responded when attaching a serial console.
> > * Machine does not respond to Ctrl-Alt-Del
> > * No DMI messages are logged.
> > * Hang is persistent until physical reboot.
> > 
> > This has happened 4 times, on 2 separate machines (under roughly
> > similar conditions).  Machines are up variable amounts of time before
> > crashing, between many weeks and less than 1 day.  Nothing unusual is
> > logged in /var/log/{daemon.log,kern.log,messages,syslog} prior to the
> > hang, except that /var/log/messages includes the "TCP: Treason
> > uncloaked!" warnings that are fixed in 2.4.32.  No users were logged
> > on at the time of 3 of the 4 crashes, and no local user activity was
> > present at the time of the 4th.
> > 
> > The machines are Intel P4's with 2GB of memory
> > 
> > The machine is under relatively high load and has a custom userspace
> > nfs server running on it (which is potentially to blame, but we've
> > been unable to determine how).  The custom userspace nfs server and
> > tomcat4 are the primary applications running.
> > 
> > Any suggestions as to how we might debug this or possible causes would
> > be greatly appreciated.
>
> Benjamin LaHaise <bcrl@kvack.org> writes:
> Have you tried turning on the NMI watchdog (nmi_watchdog=1)?  It
> should be able to kick the machine out of the locked state, as these
> symptoms would hint at a spinlock deadlock with interrupts disabled.
> Also, try to reproduce on the latest 2.4.33pre.  That said, for an
> io intensive workload like you're running, 2.6 is much better,
> especially for systems using highmem.

After a week of intensive testing, we were finally able to reproduce
this hang.  Sadly, the NMI watchdog did not appear to trigger (I'm
pretty sure it was configured correctly, since I did see NMIs
occurring).  No information appeared on the serial port or console
(although this time they weren't blanked).  We're building a 2.4.33pre
kernel now to see whether we can still reproduce the hang on it.

We're beginning to suspect that a hung loopback NFS mount might be to
blame, although we can't reproduce this trivially.  Is there any way in
which a badly behaving mount could affect the kernel in this manner?

Dave


* Re: 2.4.31 hangs, no information on console or serial port
  2006-02-27 16:24   ` David Golombek
@ 2006-02-27 16:39     ` Benjamin LaHaise
  2006-02-27 17:48       ` David Golombek
  0 siblings, 1 reply; 7+ messages in thread
From: Benjamin LaHaise @ 2006-02-27 16:39 UTC (permalink / raw)
  To: David Golombek; +Cc: linux-kernel

On Mon, Feb 27, 2006 at 11:24:10AM -0500, David Golombek wrote:
> We're beginning to suspect that a hung loopback NFS mount might be to
> blame, although we can't reproduce this trivially.  Is there any way in
> which a mount that was behaving badly could affect the kernel in this
> manner?

Loopback NFS can deadlock in trying to free memory when writing back dirty 
pages.  Use mount --bind instead.
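
A small sketch for spotting loopback NFS mounts on a running box, by
scanning /proc/mounts for NFS entries whose server is the local host.
It only recognises "localhost" and 127.x.x.x as the server name, so a
mount made through the machine's own hostname or external address would
be missed.  The helper name and the example paths are hypothetical.

```shell
#!/bin/sh
# List mount points of NFS filesystems served from this same machine.
# loopback_nfs_mounts is a hypothetical helper; with no argument it reads /proc/mounts.
loopback_nfs_mounts() {
    awk '$3 ~ /^nfs/ {
             # The device field looks like "server:/exported/path".
             split($1, dev, ":")
             if (dev[1] == "localhost" || dev[1] ~ /^127\./) print $2
         }' "${1:-/proc/mounts}"
}

# Where a plain local directory is being re-exported to itself, a bind
# mount avoids the NFS client entirely (paths are placeholders):
#   mount --bind /export/data /mnt/data
```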

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

* Re: 2.4.31 hangs, no information on console or serial port
  2006-02-27 16:39     ` Benjamin LaHaise
@ 2006-02-27 17:48       ` David Golombek
  0 siblings, 0 replies; 7+ messages in thread
From: David Golombek @ 2006-02-27 17:48 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: linux-kernel

Benjamin LaHaise <bcrl@kvack.org> writes:
> On Mon, Feb 27, 2006 at 11:24:10AM -0500, David Golombek wrote:
> > We're beginning to suspect that a hung loopback NFS mount might be to
> > blame, although we can't reproduce this trivially.  Is there any way in
> > which a mount that was behaving badly could affect the kernel in this
> > manner?
> 
> Loopback NFS can deadlock in trying to free memory when writing back
> dirty pages.  Use mount --bind instead.

Unfortunately, --bind is not an option for us.  The custom nfs-server
is actually a protocol adapter, mapping a custom filesystem spread
across a cluster of machines into NFS.  We have the loopback mount in
order to provide CIFS access via samba.  Judging by
http://www.ussg.iu.edu/hypermail/linux/kernel/0407.3/0297.html
it certainly does seem that we're susceptible to this failure, and we
are now looking at memory usage at the time of the crash.
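
One possible way to capture memory usage leading up to a hang is to
record a few /proc/meminfo fields at a fixed interval.  This is only a
sketch: the helper name, the choice of fields, the log path and the
interval are all assumptions, and a local log file may lose its last
entries if the machine locks up before they reach disk.

```shell
#!/bin/sh
# Print one timestamped line with selected fields from a /proc/meminfo-style file.
# meminfo_snapshot is a hypothetical helper; with no argument it reads /proc/meminfo.
meminfo_snapshot() {
    awk -v ts="$(date '+%Y-%m-%d %H:%M:%S')" '
        $1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" || $1 == "SwapFree:" {
            # Values on these lines are reported in kB.
            out = out " " $1 $2 "kB"
        }
        END { print ts out }' "${1:-/proc/meminfo}"
}

# Usage (log path and 10-second interval are placeholders):
#   while :; do meminfo_snapshot >> /var/log/meminfo.log; sleep 10; done
```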

Thanks,
Dave

