* [RFC] hangcheck-timer module
@ 2002-11-21 20:17 Joel Becker
2002-11-21 20:31 ` Brian Gerst
0 siblings, 1 reply; 7+ messages in thread
From: Joel Becker @ 2002-11-21 20:17 UTC (permalink / raw)
To: lkml; +Cc: Alan Cox, Marcelo Tosatti, Wim Coekaerts
Folks,
Attached is a module, hangcheck-timer. It is used to detect
when the system goes out to lunch for a period of time, such as when a
driver like qla2x00 udelays a bunch.
The module sets a timer. When the timer goes off, it then uses
the TSC (warning: portability needed) to determine how much real time
has passed.
On a normal system, the real elapsed time will be almost
identical to the expected timer duration. However, if a device decided
to udelay for 60 seconds (or some other circumstance), the module takes
notice. If the margin of error passes a threshold, the machine is
rebooted.
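As a rough illustration of the check described above, the decision reduces to comparing the measured elapsed time against the timer duration plus the margin. The sketch below is plain user-space C, not the attached module. The function name hang_detected() and its parameters are hypothetical, and times are in whole seconds for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the attached module: decide whether
 * a timer armed for tick_s seconds fired so late that the machine should
 * be restarted.
 */
static bool hang_detected(uint64_t tick_s, uint64_t margin_s, uint64_t elapsed_s)
{
	/* A timer cannot sensibly be armed for zero seconds. */
	assert(tick_s > 0);

	/* On a healthy system elapsed_s stays very close to tick_s. */
	return elapsed_s > tick_s + margin_s;
}
```

With a 180-second tick and a 60-second margin, hang_detected(180, 60, 181) is false, while hang_detected(180, 60, 241) is true.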
The module is currently used in a cluster environment. After
some time out to lunch, the rest of the cluster will have given up on a
machine. If the machine suddenly comes back and assumes it is still
"live", bad things can happen.
We can also see use for this in a debugging sense, for kernel
hangs as well as driver code. That's why I'm proposing it for general
inclusion.
Comments? Thoughts?
Joel
Building:
The module should happily build against most 2.4 kernels. The
usual module building compile line:
gcc -I /scratch/jlbec/kernel/linux-2.4.20-rc2/include \
-DMODULE -D__KERNEL__ -DLINUX -c -o hangcheck-timer.o \
hangcheck-timer.c
Running:
Load the module with insmod. There are two options.
"hangcheck_tick=<seconds>" specifies the timer timeout, and
"hangcheck_margin=<seconds>" specifies the margin of error.
Joel
--
"Friends may come and go, but enemies accumulate."
- Thomas Jones
Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* Re: [RFC] hangcheck-timer module
2002-11-21 20:17 [RFC] hangcheck-timer module Joel Becker
@ 2002-11-21 20:31 ` Brian Gerst
2002-11-21 22:08 ` Joel Becker
0 siblings, 1 reply; 7+ messages in thread
From: Brian Gerst @ 2002-11-21 20:31 UTC (permalink / raw)
To: Joel Becker; +Cc: lkml, Alan Cox, Marcelo Tosatti, Wim Coekaerts
Joel Becker wrote:
> Folks,
> Attached is a module, hangcheck-timer. It is used to detect
> when the system goes out to lunch for a period of time, such as when a
> driver like qla2x00 udelays a bunch.
> The module sets a timer. When the timer goes off, it then uses
> the TSC (warning: portability needed) to determine how much real time
> has passed.
> On a normal system, the real elapsed time will be almost
> identical to the expected timer duration. However, if a device decided
> to udelay for 60 seconds (or some other circumstance), the module takes
> notice. If the margin of error passes a threshold, the machine is
> rebooted.
> The module is currently used in a cluster environment. After
> some time out to lunch, the rest of the cluster will have given up on a
> machine. If the machine suddenly comes back and assumes it is still
> "live", bad things can happen.
> We can also see use for this in a debugging sense, for kernel
> hangs as well as driver code. That's why I'm proposing it for general
> inclusion.
> Comments? Thoughts?
>
> Joel
>
> Building:
> The module should happily build against most 2.4 kernels. The
> usual module building compile line:
> gcc -I /scratch/jlbec/kernel/linux-2.4.20-rc2/include \
> -DMODULE -D__KERNEL__ -DLINUX -c -o hangcheck-timer.o \
> hangcheck-timer.c
>
> Running:
> Load the module with insmod. There are two options.
> "hangcheck_tick=<seconds>" specifies the timer timeout, and
> "hangcheck_margin=<seconds" specifies the margin of error.
>
> Joel
>
There is already an NMI watchdog that is better than what you propose,
because it will also catch cases where something gets stuck with
interrupts disabled.
--
Brian Gerst
* Re: [RFC] hangcheck-timer module
2002-11-21 20:31 ` Brian Gerst
@ 2002-11-21 22:08 ` Joel Becker
0 siblings, 0 replies; 7+ messages in thread
From: Joel Becker @ 2002-11-21 22:08 UTC (permalink / raw)
To: Brian Gerst; +Cc: lkml, Alan Cox, Marcelo Tosatti, Wim Coekaerts
On Thu, Nov 21, 2002 at 03:31:04PM -0500, Brian Gerst wrote:
> Joel Becker wrote:
> > Attached is a module, hangcheck-timer. It is used to detect
> >when the system goes out to lunch for a period of time, such as when a
> >driver like qla2x00 udelays a bunch.
>
> There is already an NMI watchdog that is better than what you propose,
> because it will also catch cases where something gets stuck with
> interrupts disabled.
The issue at hand is not permanent hangs. The issue is hangs
that return. Consider a clustering environment where the other nodes
have given up on the delayed node and cleaned up after it. When the hang
finally ends, the node still thinks it is "alive" and happily scribbles
to places it shouldn't.
udelay will never trigger the NMI watchdog, as it is running
on the processor, so the cpu timer will run happily. But as far as
everything higher up (kernel + userspace) is concerned, the delay goes
unnoticed and bad things can happen.
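To make the "hang that returns" scenario concrete, here is a minimal user-space sketch of the question a resumed node would need to answer: has more time passed since my last heartbeat than my peers are willing to wait? Everything in it is an assumption for illustration. The struct node_lease type, its fields, and may_write_shared_disk() are hypothetical names, and the module in this thread answers the question by rebooting rather than by a check like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node state; none of these names come from the thread. */
struct node_lease {
	uint64_t last_heartbeat_ns;	/* when this node last proved it was alive */
	uint64_t peer_timeout_ns;	/* how long peers wait before evicting it */
};

/* Return true only if the peers cannot yet have given up on this node. */
static bool may_write_shared_disk(const struct node_lease *lease, uint64_t now_ns)
{
	assert(lease != NULL);

	/* A clock that appears to run backwards cannot be trusted. */
	if (now_ns < lease->last_heartbeat_ns)
		return false;

	return now_ns - lease->last_heartbeat_ns < lease->peer_timeout_ns;
}
```

A node whose last heartbeat was at 1000 ns with a 500 ns peer timeout may still write at 1400 ns, but not at 1500 ns. A check like this only narrows the window, since the node can stall again between the check and the write.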
Joel
--
"I'm so tired of being tired,
Sure as night will follow day.
Most things I worry about
Never happen anyway."
Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
* [RFC] hangcheck-timer module
@ 2002-11-21 20:19 Joel Becker
2002-11-22 11:56 ` William Lee Irwin III
2002-11-26 13:35 ` Pavel Machek
0 siblings, 2 replies; 7+ messages in thread
From: Joel Becker @ 2002-11-21 20:19 UTC (permalink / raw)
To: lkml; +Cc: Alan Cox, Marcelo Tosatti, Wim Coekaerts
[-- Attachment #1: Type: text/plain, Size: 1686 bytes --]
[ Feh, forgot to attach the damned file. ]
Folks,
Attached is a module, hangcheck-timer. It is used to detect
when the system goes out to lunch for a period of time, such as when a
driver like qla2x00 udelays a bunch.
The module sets a timer. When the timer goes off, it then uses
the TSC (warning: portability needed) to determine how much real time
has passed.
On a normal system, the real elapsed time will be almost
identical to the expected timer duration. However, if a device decided
to udelay for 60 seconds (or some other circumstance), the module takes
notice. If the margin of error passes a threshold, the machine is
rebooted.
The module is currently used in a cluster environment. After
some time out to lunch, the rest of the cluster will have given up on a
machine. If the machine suddenly comes back and assumes it is still
"live", bad things can happen.
We can also see use for this in a debugging sense, for kernel
hangs as well as driver code. That's why I'm proposing it for general
inclusion.
Comments? Thoughts?
Joel
Building:
The module should happily build against most 2.4 kernels. The
usual module building compile line:
gcc -I /scratch/jlbec/kernel/linux-2.4.20-rc2/include \
-DMODULE -D__KERNEL__ -DLINUX -c -o hangcheck-timer.o \
hangcheck-timer.c
Running:
Load the module with insmod. There are two options.
"hangcheck_tick=<seconds>" specifies the timer timeout, and
"hangcheck_margin=<seconds>" specifies the margin of error.
--
"Friends may come and go, but enemies accumulate."
- Thomas Jones
Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
[-- Attachment #2: hangcheck-timer.c --]
[-- Type: text/x-csrc, Size: 3552 bytes --]
/*
* hangcheck-timer.c
*
* Test driver for a little io fencing timer.
*
* Copyright (C) 2002 Oracle Corporation. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License version 2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have received a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 02111-1307, USA.
*/
#include <linux/module.h>
#include <linux/config.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/reboot.h>
#include <linux/smp_lock.h>
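#include <linux/init.h>
#include <linux/timer.h>	/* struct timer_list, mod_timer(), del_timer() */
#include <linux/sched.h>	/* jiffies */
#include <asm/uaccess.h>
#include <asm/msr.h>		/* rdtscll() */
#include <asm/processor.h>	/* current_cpu_data */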
/* #include "hangcheck-timer.h" */
#define DEFAULT_IOFENCE_MARGIN 60 /* Default fudge factor, in seconds */
#define DEFAULT_IOFENCE_TICK 180 /* Default timer timeout, in seconds */
static int hangcheck_tick = DEFAULT_IOFENCE_TICK;
static int hangcheck_margin = DEFAULT_IOFENCE_MARGIN;
MODULE_PARM(hangcheck_tick,"i");
MODULE_PARM_DESC(hangcheck_tick, "Timer delay.");
MODULE_PARM(hangcheck_margin,"i");
MODULE_PARM_DESC(hangcheck_margin, "If the hangcheck timer has been delayed more than hangcheck_margin seconds, the machine will reboot.");
MODULE_LICENSE("GPL");
static void hangcheck_fire(unsigned long);
/* Last time scheduled */
static unsigned long long hangcheck_tsc, hangcheck_tsc_margin;
static struct timer_list hangcheck_ticktock = {
	function:	hangcheck_fire,
};
static void hangcheck_fire(unsigned long data)
{
	unsigned long long cur_tsc, tsc_diff;
	rdtscll(cur_tsc);
	if (cur_tsc > hangcheck_tsc)
		tsc_diff = cur_tsc - hangcheck_tsc;
	else	/* the TSC wrapped since the last reading */
		tsc_diff = cur_tsc + (~0ULL - hangcheck_tsc) + 1;
#if 0
	printk(KERN_CRIT "tsc_diff = %lu.%lu, predicted diff is %lu.%lu.\n",
	       (unsigned long) ((tsc_diff >> 32) & 0xFFFFFFFFULL),
	       (unsigned long) (tsc_diff & 0xFFFFFFFFULL),
	       (unsigned long) ((hangcheck_tsc_margin >> 32) & 0xFFFFFFFFULL),
	       (unsigned long) (hangcheck_tsc_margin & 0xFFFFFFFFULL));
	printk(KERN_CRIT "hangcheck_margin = %d, HZ = %d, current_cpu_data.loops_per_jiffy = %lu.\n",
	       hangcheck_margin, (int) HZ, current_cpu_data.loops_per_jiffy);
#endif
	/* The timer ran later than tick + margin allows: restart the machine. */
	if (tsc_diff > hangcheck_tsc_margin) {
		printk(KERN_CRIT "hangcheck is restarting the machine.\n");
		machine_restart(NULL);
	}
	/* Re-arm the timer and record the TSC at the moment of scheduling. */
	mod_timer(&hangcheck_ticktock, jiffies + (hangcheck_tick*HZ));
	rdtscll(hangcheck_tsc);
} /* hangcheck_fire() */
static int __init hangcheck_init(void)
{
	/* NOTE: not a stock kernel function; presumably declared in the commented-out hangcheck-timer.h. */
	version_hash_print();
	printk("Starting hangcheck timer (tick is %d seconds, margin is %d seconds).\n",
	       hangcheck_tick, hangcheck_margin);
	/* Allowed TSC ticks between timer runs; loops_per_jiffy stands in for TSC ticks per jiffy. */
	hangcheck_tsc_margin = (unsigned long long)(hangcheck_margin + hangcheck_tick) *
		(unsigned long long)HZ *
		(unsigned long long)current_cpu_data.loops_per_jiffy;
	rdtscll(hangcheck_tsc);
	mod_timer(&hangcheck_ticktock, jiffies + (hangcheck_tick*HZ));
	return 0;
} /* hangcheck_init() */
static void __exit hangcheck_exit(void)
{
	printk("Stopping hangcheck timer.\n");
	lock_kernel();
	/* The handler re-arms itself, so wait for one running on another CPU. */
	del_timer_sync(&hangcheck_ticktock);
	unlock_kernel();
} /* hangcheck_exit() */
module_init(hangcheck_init);
module_exit(hangcheck_exit);
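The two calculations in the module, the cycle budget set up in hangcheck_init() and the elapsed-cycle difference in hangcheck_fire(), can be tried out in user space. This is a sketch under assumptions: the helper names tsc_margin() and tsc_elapsed() are made up here, and the sample figures in the note below (HZ of 100, loops_per_jiffy of 10,000,000) are illustrative rather than measured. It also shows that plain unsigned subtraction gives the correct difference across a counter wrap without a separate branch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Cycle budget for one tick plus the margin, mirroring the formula in
 * hangcheck_init(): seconds * HZ * loops_per_jiffy.
 */
static uint64_t tsc_margin(unsigned int tick_s, unsigned int margin_s,
			   unsigned int hz, unsigned long loops_per_jiffy)
{
	assert(hz > 0 && loops_per_jiffy > 0);

	/* Widen before multiplying so the product cannot overflow 32 bits. */
	return (uint64_t)(tick_s + margin_s) * hz * loops_per_jiffy;
}

/*
 * Cycles elapsed between two counter reads. Unsigned arithmetic is
 * modulo 2^64, so the result is also correct if the counter wrapped once.
 */
static uint64_t tsc_elapsed(uint64_t last, uint64_t cur)
{
	return cur - last;
}
```

With those sample figures, tsc_margin(180, 60, 100, 10000000) is 240,000,000,000 cycles, and tsc_elapsed(UINT64_MAX - 4, 5) is 10.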
* Re: [RFC] hangcheck-timer module
2002-11-21 20:19 Joel Becker
@ 2002-11-22 11:56 ` William Lee Irwin III
2002-11-26 13:35 ` Pavel Machek
1 sibling, 0 replies; 7+ messages in thread
From: William Lee Irwin III @ 2002-11-22 11:56 UTC (permalink / raw)
To: Joel Becker; +Cc: lkml, Alan Cox, Marcelo Tosatti, Wim Coekaerts
On Thu, Nov 21, 2002 at 12:19:31PM -0800, Joel Becker wrote:
> it then uses the TSC (warning: portability needed)
ISTR get_cycles() being around, which should be defined for other arches.
Bill
* Re: [RFC] hangcheck-timer module
2002-11-21 20:19 Joel Becker
2002-11-22 11:56 ` William Lee Irwin III
@ 2002-11-26 13:35 ` Pavel Machek
2002-11-26 22:36 ` Joel Becker
1 sibling, 1 reply; 7+ messages in thread
From: Pavel Machek @ 2002-11-26 13:35 UTC (permalink / raw)
To: Joel Becker; +Cc: lkml, Alan Cox, Marcelo Tosatti, Wim Coekaerts
> [ Feh, forgot to attach the damned file. ]
:-)
> The module is currently used in a cluster environment. After
> some time out to lunch, the rest of the cluster will have given up on a
> machine. If the machine suddenly comes back and assumes it is still
> "live", bad things can happen.
Would it make more sense for the other machines
to "kill" the offending machine (cut power or press reset)?
--
Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...
* Re: [RFC] hangcheck-timer module
2002-11-26 13:35 ` Pavel Machek
@ 2002-11-26 22:36 ` Joel Becker
0 siblings, 0 replies; 7+ messages in thread
From: Joel Becker @ 2002-11-26 22:36 UTC (permalink / raw)
To: Pavel Machek; +Cc: lkml, Alan Cox, Marcelo Tosatti, Wim Coekaerts
On Tue, Nov 26, 2002 at 02:35:47PM +0100, Pavel Machek wrote:
> Would it make it more sense for other machines
> to "kill" offending machine (cut power or press reset)?
There is no solution that is general and inexpensive. STONITH
is as close as it gets, and we don't have support for that. On other
platforms where the shared disk is on FC, the device driver supports
fencing nodes from the switch.
That said, this module isn't exclusively useful to a
cluster+shared disk environment. If it were, I couldn't see generic
inclusion. This code is useful in many other situations.
Joel
--
Life's Little Instruction Book #313
"Never underestimate the power of love."
Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127