From: Kevin D. Kissell <kevink@paralogos.com>
Your description sounds an awful lot like failures I've seen when
interrupts get lost or blocked for some reason (could be hardware, the
kernel, or some interaction between them). Have you looked at
to see if "Spurious" interrupts are occurring, or if the rate of
serviced timer and I/O interrupts decreases or increases as the system
degrades?
No, I haven't checked, but I will. What would I be looking for that would stick out as "spurious"? The type of interrupt, the quantity, or random interrupts appearing and disappearing?
There's a separate counter, and /proc/interrupts report, for spurious
interrupts.
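To watch how the counters move as the system degrades, something along
these lines could be used to pull the numbers out of each line of the
report. This is only a minimal user-space sketch: parse_irq_line() is a
made-up helper name, and it assumes the usual "label: count [count ...]
chip device" layout. On the MIPS kernels I'm aware of, the spurious
counter is printed on a line labelled "ERR:", but treat that label as an
assumption and check it against your own kernel's output.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one text line taken from /proc/interrupts.
 * Copies the label in front of ':' into 'label' and stores the sum of
 * the per-CPU counts that follow it in '*total'.
 * Returns 0 on success, -1 if the line has no "label:" prefix
 * (for example the "CPU0" header line).
 */
static int parse_irq_line(const char *line, char *label, size_t label_len,
                          unsigned long long *total)
{
    const char *p = line;
    const char *colon;
    size_t n;

    while (*p == ' ' || *p == '\t')
        p++;
    colon = strchr(p, ':');
    if (colon == NULL || colon == p || label_len == 0)
        return -1;

    n = (size_t)(colon - p);
    if (n >= label_len)
        n = label_len - 1;          /* truncate over-long labels */
    memcpy(label, p, n);
    label[n] = '\0';

    *total = 0;
    p = colon + 1;
    for (;;) {
        char *end;
        unsigned long long v;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break;                  /* reached the chip/device names */
        v = strtoull(p, &end, 10);
        if (*end != '\0' && !isspace((unsigned char)*end))
            break;                  /* a name that merely starts with a digit */
        *total += v;
        p = end;
    }
    return 0;
}
```

Feed it each line of two snapshots taken a known number of seconds
apart and subtract the totals per label. A timer line whose delta
shrinks over time, or an error/spurious line whose delta is non-zero,
is the kind of thing that would stick out.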
When the system becomes unresponsive, by any chance does it "wake up"
after 10-20 minutes (the time for the Count register to wrap)?
Not that I've noticed. I just see it degrade further and further until it dies over the course of an hour or so.
If other Qube2s don't exhibit this behavior with a given Linux kernel,
but yours does, and yet yours runs NetBSD OK, it suggests that there's a
difference in interrupt setup/handling between the two systems that just
happens to work around a hardware problem on your board.
I'm sure that's a valid possibility, however I do have two of these machines and I have tried both with the same results.
Ah. I had misunderstood your messages to have stated that you had one
Qube2 that exhibited the behavior while others did not. In the actual
case, it definitely sounds like a kernel interrupt management problem,
either at the level of the interrupt controller support code or some
bit of low-level management of the Status.IM interrupt mask. If you
can force the kernel to dump the state of the Status and Cause
registers, as well as that of whatever outboard interrupt controller is
on that thing, that would be good. I used to have a hook in the NMI
handler of my Malta kernels for that, which was useful when I was
debugging the SMTC interrupt support, which was pretty subtle and
nasty. And why this failure mode sounds vaguely familiar. ;o) The
interrupt ack/mask/enable machinery has changed and standardized (for
the better) since the Qube2 was a current product, and the controller
"chip" struct/functions being used may not in fact be entirely correct
for the platform, e.g. you may have non-atomic changes to interrupt
masks being done that screw up in the presence of nested service.
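
To make the non-atomic mask hazard concrete, here is a small user-space
model of it. Nothing in it is taken from the Qube2 code: irq_mask_reg,
mask_irq_racy(), mask_irq_guarded() and the IRQ numbers are all made up
for illustration, and the "nested interrupt" is simulated by calling a
function in the middle of the read-modify-write. In a real kernel the
guarded variant would correspond to wrapping the update in something
like local_irq_save()/local_irq_restore() or a lock.

```c
#include <stdint.h>
#include <stddef.h>

/* Software stand-in for an interrupt mask register: bit set = IRQ enabled. */
static uint32_t irq_mask_reg;

typedef void (*nested_fn)(void);

/* Simulated nested handler: finishes servicing IRQ 3 and unmasks it again. */
static void nested_unmask_irq3(void)
{
    irq_mask_reg |= (1u << 3);
}

/*
 * Racy mask operation: the register is read, a nested interrupt runs in
 * the window, and the stale copy is then written back. Whatever the
 * nested handler changed is silently undone.
 */
static void mask_irq_racy(unsigned irq, nested_fn nested)
{
    uint32_t tmp = irq_mask_reg;        /* read */
    if (nested != NULL)
        nested();                       /* nested service hits the window */
    irq_mask_reg = tmp & ~(1u << irq);  /* write back stale value */
}

/*
 * Guarded mask operation: models the update being done with interrupts
 * disabled, so the nested handler can only run after the write has
 * completed and its change survives.
 */
static void mask_irq_guarded(unsigned irq, nested_fn nested)
{
    uint32_t tmp = irq_mask_reg;        /* read */
    irq_mask_reg = tmp & ~(1u << irq);  /* write, no window in between */
    if (nested != NULL)
        nested();                       /* pending interrupt taken afterwards */
}
```

Starting from irq_mask_reg = 0x20 (IRQ 5 enabled, IRQ 3 masked while it
is being serviced), mask_irq_racy(5, nested_unmask_irq3) leaves the
register at 0x00, so IRQ 3 stays masked even though its handler
re-enabled it. mask_irq_guarded(5, nested_unmask_irq3) leaves 0x08. An
interrupt source that quietly stays masked like this is one way a system
could get slower and slower without anything obviously crashing.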