From: Joel Becker <Joel.Becker@oracle.com>
To: lkml <linux-kernel@vger.kernel.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>,
Marcelo Tosatti <marcelo@conectiva.com.br>,
Wim Coekaerts <Wim.Coekaerts@oracle.com>
Subject: [RFC] hangcheck-timer module
Date: Thu, 21 Nov 2002 12:19:31 -0800 [thread overview]
Message-ID: <20021121201931.GH770@nic1-pc.us.oracle.com> (raw)
[-- Attachment #1: Type: text/plain, Size: 1686 bytes --]
[ Feh, forgot to attach the damned file. ]
Folks,
Attached is a module, hangcheck-timer. It is used to detect
when the system goes out to lunch for a period of time, such as when a
driver like qla2x00 udelays a bunch.
The module sets a timer. When the timer goes off, it then uses
the TSC (warning: portability needed) to determine how much real time
has passed.
On a normal system, the real elapsed time will be almost
identical to the expected timer duration. However, if a device decided
to udelay for 60 seconds (or some other circumstance), the module takes
notice. If the margin of error passes a threshold, the machine is
rebooted.
The module is currently used in a cluster environment. After
some time out to lunch, the rest of the cluster will have given up on a
machine. If the machine suddenly comes back and assumes it is still
"live", bad things can happen.
We can also see use for this in a debugging sense, for kernel
hangs as well as driver code. That's why I'm proposing it for general
inclusion.
Comments? Thoughts?
Joel
Building:
The module should happily build against most 2.4 kernels. The
usual module building compile line:
gcc -I /scratch/jlbec/kernel/linux-2.4.20-rc2/include \
-DMODULE -D__KERNEL__ -DLINUX -c -o hangcheck-timer.o \
hangcheck-timer.c
Running:
Load the module with insmod. There are two options.
"hangcheck_tick=<seconds>" specifies the timer timeout, and
"hangcheck_margin=<seconds" specifies the margin of error.
--
"Friends may come and go, but enemies accumulate."
- Thomas Jones
Joel Becker
Senior Member of Technical Staff
Oracle Corporation
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
[-- Attachment #2: hangcheck-timer.c --]
[-- Type: text/x-csrc, Size: 3552 bytes --]
/*
* hangcheck-timer.c
*
* Test driver for a little io fencing timer.
*
* Copyright (C) 2002 Oracle Corporation. All rights reserved.
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public
* License version 2 as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*
* You should have recieved a copy of the GNU General Public
* License along with this program; if not, write to the
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
* Boston, MA 021110-1307, USA.
*/
#include <linux/module.h>
#include <linux/config.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/reboot.h>
#include <linux/smp_lock.h>
#include <linux/init.h>
#include <asm/uaccess.h>
/* #include "hangcheck-timer.h" */
#define DEFAULT_IOFENCE_MARGIN 60 /* Default fudge factor, in seconds */
#define DEFAULT_IOFENCE_TICK 180 /* Default timer timeout, in seconds */
static int hangcheck_tick = DEFAULT_IOFENCE_TICK;
static int hangcheck_margin = DEFAULT_IOFENCE_MARGIN;
MODULE_PARM(hangcheck_tick,"i");
MODULE_PARM_DESC(hangcheck_tick, "Timer delay.");
MODULE_PARM(hangcheck_margin,"i");
MODULE_PARM_DESC(hangcheck_margin, "If the hangcheck timer has been delayed more than hangcheck_margin seconds, the machine will reboot.");
MODULE_LICENSE("GPL");
static void hangcheck_fire(unsigned long);
/* Last time scheduled */
static unsigned long long hangcheck_tsc, hangcheck_tsc_margin;
static struct timer_list hangcheck_ticktock = {
function: hangcheck_fire,
};
static void hangcheck_fire(unsigned long data)
{
unsigned long long cur_tsc, tsc_diff;
rdtscll(cur_tsc);
if (cur_tsc > hangcheck_tsc)
tsc_diff = cur_tsc - hangcheck_tsc;
else
tsc_diff = (cur_tsc + (~0ULL - hangcheck_tsc)); /* or something */
#if 0
printk(KERN_CRIT "tsc_diff = %lu.%lu, predicted diff is %lu.%lu.\n",
(unsigned long) ((tsc_diff >> 32) & 0xFFFFFFFFULL),
(unsigned long) (tsc_diff & 0xFFFFFFFFULL),
(unsigned long) ((hangcheck_tsc_margin >> 32) & 0xFFFFFFFFULL),
(unsigned long) (hangcheck_tsc_margin & 0xFFFFFFFFULL));
printk(KERN_CRIT "hangcheck_margin = %lu, HZ = %lu, current_cpu_data.loops_per_jiffy = %lu.\n", hangcheck_margin, HZ, current_cpu_data.loops_per_jiffy);
#endif
if (tsc_diff > hangcheck_tsc_margin) {
printk(KERN_CRIT "hangcheck is restarting the machine.\n");
machine_restart(NULL);
}
mod_timer(&hangcheck_ticktock, jiffies + (hangcheck_tick*HZ));
rdtscll(hangcheck_tsc);
} /* hangcheck_fire() */
static int __init hangcheck_init(void)
{
version_hash_print();
printk("Starting hangcheck timer (tick is %d seconds, margin is %d seconds).\n",
hangcheck_tick, hangcheck_margin);
hangcheck_tsc_margin = (unsigned long long)(hangcheck_margin + hangcheck_tick) * (unsigned long long)HZ * (unsigned long long)current_cpu_data.loops_per_jiffy;
rdtscll(hangcheck_tsc);
mod_timer(&hangcheck_ticktock, jiffies + (hangcheck_tick*HZ));
return 0;
} /* hangcheck_init() */
static void __exit hangcheck_exit(void)
{
printk("Stopping hangcheck timer.\n");
lock_kernel();
del_timer(&hangcheck_ticktock);
unlock_kernel();
} /* hangcheck_exit() */
module_init(hangcheck_init);
module_exit(hangcheck_exit);
next reply other threads:[~2002-11-21 20:12 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2002-11-21 20:19 Joel Becker [this message]
2002-11-22 11:56 ` [RFC] hangcheck-timer module William Lee Irwin III
2002-11-26 13:35 ` Pavel Machek
2002-11-26 22:36 ` Joel Becker
-- strict thread matches above, loose matches on Subject: below --
2002-11-21 20:17 Joel Becker
2002-11-21 20:31 ` Brian Gerst
2002-11-21 22:08 ` Joel Becker
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20021121201931.GH770@nic1-pc.us.oracle.com \
--to=joel.becker@oracle.com \
--cc=Wim.Coekaerts@oracle.com \
--cc=alan@lxorguk.ukuu.org.uk \
--cc=linux-kernel@vger.kernel.org \
--cc=marcelo@conectiva.com.br \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.