From: Brian Gerst <bgerst@didntduck.org>
To: Joel Becker <Joel.Becker@oracle.com>
Cc: lkml <linux-kernel@vger.kernel.org>,
	Alan Cox <alan@lxorguk.ukuu.org.uk>,
	Marcelo Tosatti <marcelo@conectiva.com.br>,
	Wim Coekaerts <Wim.Coekaerts@oracle.com>
Subject: Re: [RFC] hangcheck-timer module
Date: Thu, 21 Nov 2002 15:31:04 -0500
Message-ID: <3DDD4288.70605@didntduck.org>
In-Reply-To: <20021121201711.GG770@nic1-pc.us.oracle.com>

Joel Becker wrote:
> Folks,
> 	Attached is a module, hangcheck-timer.  It is used to detect
> when the system goes out to lunch for a period of time, such as when a
> driver like qla2x00 udelays a bunch.
> 	The module sets a timer.  When the timer goes off, it then uses
> the TSC (warning: portability needed) to determine how much real time
> has passed.
> 	On a normal system, the real elapsed time will be almost
> identical to the expected timer duration.  However, if a device decided
> to udelay for 60 seconds (or some other circumstance), the module takes
> notice.  If the margin of error passes a threshold, the machine is
> rebooted.
> 	The module is currently used in a cluster environment.  After
> some time out to lunch, the rest of the cluster will have given up on a
> machine.  If the machine suddenly comes back and assumes it is still
> "live", bad things can happen.
> 	We can also see use for this in a debugging sense, for kernel
> hangs as well as driver code.  That's why I'm proposing it for general
> inclusion.
> 	Comments?  Thoughts?
> 
> Joel
> 
> Building:
> 	The module should happily build against most 2.4 kernels.  The
> usual module building compile line:
> 	gcc  -I /scratch/jlbec/kernel/linux-2.4.20-rc2/include \
> 		-DMODULE -D__KERNEL__ -DLINUX  -c -o hangcheck-timer.o \
> 		hangcheck-timer.c
> 
> Running:
> 	Load the module with insmod.  There are two options.
> "hangcheck_tick=<seconds>" specifies the timer timeout, and
> "hangcheck_margin=<seconds>" specifies the margin of error.
> 
> Joel
> 

There is already an NMI watchdog that is better than what you propose, 
because it will also catch cases where something gets stuck with 
interrupts disabled.
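
To make the comparison concrete, the check described in the quoted
proposal amounts to roughly the following.  This is only a user-space
sketch, not the module's code: CLOCK_MONOTONIC stands in for the TSC,
the names now_ns(), hangcheck_overrun() and hangcheck_once() are
hypothetical, and reading the threshold as "elapsed > tick + margin" is
an assumption about how the margin is applied.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() and nanosleep() */
#endif
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical helper: monotonic time in nanoseconds.
 * Assumption: stands in for the TSC read mentioned in the proposal. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * NSEC_PER_SEC + (uint64_t)ts.tv_nsec;
}

/* True when the measured interval is longer than tick + margin.
 * Assumption: this is how "margin of error passes a threshold" is meant. */
static bool hangcheck_overrun(uint64_t elapsed_ns,
                              unsigned int tick_s, unsigned int margin_s)
{
    uint64_t limit_ns = ((uint64_t)tick_s + margin_s) * NSEC_PER_SEC;

    return elapsed_ns > limit_ns;
}

/* One pass: wait for the tick, then compare real elapsed time with the
 * expected duration.  The comparison only runs once the wait returns,
 * so a stall is noticed after the fact, not while it is happening. */
static bool hangcheck_once(unsigned int tick_s, unsigned int margin_s)
{
    struct timespec req = { .tv_sec = tick_s, .tv_nsec = 0 };
    uint64_t start = now_ns();

    /* An early return on a signal only shortens the measured interval. */
    nanosleep(&req, NULL);
    return hangcheck_overrun(now_ns() - start, tick_s, margin_s);
}
```

With illustrative values of a 30 second tick and a 5 second margin, a
measured interval of 36 seconds counts as an overrun and one of 35
seconds does not.  Note that the comparison depends on the timer
actually firing, which is the limitation mentioned above.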

--
				Brian Gerst

