Glyn Astill wrote:

From: Kevin D. Kissell <kevink@paralogos.com>

Your description sounds an awful lot like failures I've seen when interrupts get lost or blocked for some reason (could be hardware, the kernel, or some interaction between them).  Have you looked at  to see if "Spurious" interrupts are occurring, or if the rate of serviced timer and I/O interrupts decreases or increases as the system degrades?


No, I haven't checked, but I will. What would I be looking for that would stick out as "spurious"? The type of interrupt, the quantity, or random interrupts appearing and disappearing?

There's a separate counter, and /proc/interrupts report, for spurious interrupts.
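
To watch how the interrupt rates change as the box degrades, something along these lines could be used to compare two snapshots of /proc/interrupts. It is only a rough Python sketch: the function names are made up for illustration, and it assumes the usual layout of a label, a colon, then one count per CPU. On MIPS kernels the spurious count is, as far as I know, reported on the "ERR:" line, but check that against your own output.

```python
def parse_interrupts(text):
    """Return {label: total count across CPUs} from /proc/interrupts-style text."""
    counts = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # the CPU0/CPU1 header line has no colon
        total = 0
        for token in rest.split():
            if not token.isdigit():
                break  # counts end where the controller/device names begin
            total += int(token)
        counts[label.strip()] = total
    return counts


def interrupt_deltas(before, after):
    """Per-label increase between two parsed snapshots."""
    return {label: after[label] - before.get(label, 0) for label in after}
```

Take one snapshot, wait a fixed interval, take another, and pass both parsed results to interrupt_deltas(). A timer delta that shrinks from one interval to the next, or an ERR delta that keeps growing, would be the thing to report.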

When the system becomes unresponsive, by any chance does it "wake up" after 10-20 minutes (the time for the Count register to wrap)?


Not that I've noticed. I just see it degrade further and further until it dies over the course of an hour or so.

If other Qube2s don't exhibit this behavior with a given Linux kernel, but yours does, and yet yours runs NetBSD OK, it suggests that there's a difference in interrupt setup/handling between the two systems that just happens to work around a hardware problem on your board.


I'm sure that's a valid possibility, however I do have two of these machines and I have tried both with the same results.

Ah. I had misunderstood your messages to have stated that you had one Qube2 that exhibited the behavior while others did not. In the actual case, it definitely sounds like a kernel interrupt management problem, either at the level of the interrupt controller support code or some bit of low-level management of the Status.IM interrupt mask.

If you can force the kernel to dump the state of the Status and Cause registers, as well as that of whatever outboard interrupt controller is on that thing, that would be good. I used to have a hook in the NMI handler of my Malta kernels for that, which was useful when I was debugging the SMTC interrupt support, which was pretty subtle and nasty. And why this failure mode sounds vaguely familiar. ;o)

The interrupt ack/mask/enable machinery has changed and standardized (for the better) since the Qube2 was a current product, and the controller "chip" struct/functions being used may not in fact be entirely correct for the platform, e.g. you may have non-atomic changes to interrupt masks being done that screw up in the presence of nested service.
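
Once a Status/Cause dump is in hand, a small helper like the following could make it easier to spot an interrupt line that is pending but masked. This is a sketch, not anything from the kernel: the function names are hypothetical, and the only things taken as given are the standard MIPS field positions (IE/EXL/ERL in Status bits 0-2, IM in Status bits 15:8, IP in Cause bits 15:8, ExcCode in Cause bits 6:2).

```python
STATUS_IE = 1 << 0    # global interrupt enable
STATUS_EXL = 1 << 1   # exception level (interrupts blocked while set)
STATUS_ERL = 1 << 2   # error level (interrupts blocked while set)
IM_SHIFT = 8          # Status.IM7..IM0 occupy bits 15..8
IP_SHIFT = 8          # Cause.IP7..IP0 occupy bits 15..8


def decode_status(status):
    """Split a raw Status value into the fields relevant to interrupts."""
    return {
        "IE": bool(status & STATUS_IE),
        "EXL": bool(status & STATUS_EXL),
        "ERL": bool(status & STATUS_ERL),
        "IM": (status >> IM_SHIFT) & 0xFF,
    }


def decode_cause(cause):
    """Split a raw Cause value into pending lines and exception code."""
    return {
        "IP": (cause >> IP_SHIFT) & 0xFF,
        "ExcCode": (cause >> 2) & 0x1F,
    }


def blocked_lines(status, cause):
    """CPU interrupt lines pending in Cause.IP but masked off in Status.IM."""
    masked = decode_cause(cause)["IP"] & ~decode_status(status)["IM"] & 0xFF
    return [line for line in range(8) if masked & (1 << line)]
```

For example, with made-up values Status = 0x00007C01 and Cause = 0x00008400, blocked_lines() reports line 7 as pending but masked. A line that stays in that state across several dumps would point at the mask handling rather than at the hardware.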

I also had a problem back when I tried etch with the 2.6.18 kernel. In that case I saw no degraded performance at all, but after some hours of activity (anywhere between 2 and 24+) it'd just fall on its ass.

That's not a very scientific description of a failure. I mean, did the Qube2 literally jump off the table? ;o)

Regards,

Kevin K.