Qube2 slowly dies

linux-mips.vger.kernel.org archive mirror

* Qube2 slowly dies
@ 2009-06-10 14:04 Glyn Astill
  2009-06-10 14:12 ` Florian Fainelli
  2009-06-11  3:39 ` Kevin D. Kissell
  0 siblings, 2 replies; 7+ messages in thread
From: Glyn Astill @ 2009-06-10 14:04 UTC
  To: linux-mips

Hi people,

I've been directed here from the Debian lists by Martin Michlmayr. I'm running lenny on a Qube2 with 128 MB RAM and a 40 GB disk.

I've tried kernels 2.6.26 and 2.6.30~rc8 and the issue I'm about to describe is present in both. I haven't tried any other kernels, but I will try 2.6.22 when I can.

Essentially the machine gets more and more sluggish until it finally dies. I've had a quick look in /proc/meminfo and I can't see that it's running out of memory, and I'm not sure what else to check.
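
The memory check mentioned above can be made a little more systematic. The following is a minimal sketch, assuming the usual "Name: value kB" layout of /proc/meminfo; the helper names parse_meminfo and reclaimable_fraction are hypothetical, and counting free memory plus buffers plus page cache as "available" is a rough rule of thumb rather than an exact measure.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "Name: value" pair
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def reclaimable_fraction(info):
    """Fraction of RAM that is free or easily reclaimed (free + buffers + page cache)."""
    available = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    return available / info["MemTotal"]
```

On the machine itself this could be fed with the contents of /proc/meminfo, for example `reclaimable_fraction(parse_meminfo(open("/proc/meminfo").read()))`, and logged every few seconds while the slowdown develops. A value that stays well above zero would support the observation that memory is not the problem.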

I find it hard to describe what's going on, but here's a scenario I hope illustrates the problem. The configure script is just an example of doing something; I could just as easily have extracted an archive with tar for the same results:

- I start 2 ssh sessions; in one I start configure for the postgres source, in the other I just start top.

- And for a while all seems fine; configure ticks away and top refreshes every second.

- Then top stops ticking over - but it'll refresh with a keypress. Anyway I exit top and try to run it again... nothing. I hit ctrl-c which brings me back to the prompt and I try again... nothing.

- The configure script is still ticking over slowly.

- I try "ps ax" - it works; so I try it again... nothing.

- I try "ipcs" and "lsof"; they both work and seem to keep working.

- I try "ps ax" again... nothing. I hit ctrl-c and now it doesn't come back to the command prompt for a while, say 5 minutes, and eventually it's back.

- It's still going. Some commands still work, some just do nothing. /proc/meminfo shows it hasn't eaten all the memory.

- If I try to start another ssh session I can log in, I get the motd, but I don't get to the shell.

- Eventually the configure script ends, and all shells come back to the prompt. But it now seems totally braindamaged: I can run "ps ax" but "top" and other commands still do nothing. Here's strace attached to the top process:

deb:~# strace -p 7228
Process 7228 attached - interrupt to quit
_newselect(0, NULL, NULL, NULL, {0, 500013}

- Then after a little while the whole thing becomes unresponsive.
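
The strace output above shows top blocked in a select() call with no file descriptors and a timeout of 0 s plus 500013 microseconds, which is simply a timed sleep. The sketch below reproduces that pattern in Python to show that such a call can only end when the kernel expires the timeout; the function name timed_sleep is hypothetical, and the example assumes a Unix-like system where select() accepts empty descriptor sets.

```python
import select
import time


def timed_sleep(seconds):
    """Sleep via select() with empty descriptor sets and return the elapsed time."""
    start = time.monotonic()
    # Nothing to watch: the call returns only when the kernel's timeout fires.
    select.select([], [], [], seconds)
    return time.monotonic() - start


elapsed = timed_sleep(0.5)
print(f"asked for 0.500 s, woke after {elapsed:.3f} s")
```

On a healthy system the reported time is close to the requested half second. A call that stays blocked far longer, as in the trace above, suggests the timeout is not being delivered, which makes the kernel's timekeeping worth a look alongside memory.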

Can anyone confirm they've seen the same behaviour, or point me at what to look into?

Thanks
Glyn


* Re: Qube2 slowly dies
  2009-06-10 14:04 Qube2 slowly dies Glyn Astill
@ 2009-06-10 14:12 ` Florian Fainelli
  2009-06-11  3:39 ` Kevin D. Kissell
  1 sibling, 0 replies; 7+ messages in thread
From: Florian Fainelli @ 2009-06-10 14:12 UTC
  To: Glyn Astill; +Cc: linux-mips

Hi Glyn,

On Wednesday 10 June 2009 16:04:03 Glyn Astill, you wrote:
> Hi people,
>
> I've been directed here from the Debian lists by Martin Michlmayr. I'm
> running lenny on a qube2 128mb ram / 40gb disk.
>
> I've tried kernels 2.6.26 and 2.6.30~rc8 and the issue I'm about to
> describe is present in both, I haven't tried any other kernels - but I will
> try 2.6.22 when I can.
>
> Essentially the machine gets more and more sluggish until it finally dies.
> I've had a quick look in meminfo and I can't see that it's running out of
> memory, and I'm not sure what else to check?

Determine which process consumes all that memory. Can you describe which
programs you are running on your Qube2?

>
> I find it hard to describe what's going off, but here's a scenario I hope
> illustrates the problem. The configure script is just an example of doing
> something - I could easily have extracted an archive with tar or something
> for the same results;
>
> - I start 2 ssh sessions and in one start configure for the postgres
> source, in the other I just started top.
>
> - And for a while all seems fine; configure ticks away and top refreshes
> every second.
>
> - Then top stops ticking over - but it'll refresh with a keypress. Anyway I
> exit top and try to run it again... nothing. I hit ctrl-c which brings me
> back to the prompt and I try again... nothing.
>
> - The configure script is still ticking over slowly.
>
> - I try "ps ax" - it works; so I try it again... nothing.
>
> - I try "ipcs" and "lsof" they both work and seem to keep working.
>
> - I try "ps ax" again... nothing. I hit ctrl-c and now it doesn't come back
> to the command prompt for a while.. say 5 minutes and eventually it's back.
>
> - It's still going. Some commands still work, some just do nothing.
> proc/meminfo shows it's not eaten all the memory.
>
> - If I try to start another ssh session I can log in, I get the motd, but I
> don't get to the shell.
>
> - Eventually the configure script ends, and all shells come back to the
> prompt. But it now seems totally braindamaged, I can run "ps ax" but "top"
> and other commands still do nothing. Heres strace attached to the top
> process:
>
> deb:~# strace -p 7228
> Process 7228 attached - interrupt to quit
> _newselect(0, NULL, NULL, NULL, {0, 500013}
>
> - Then after a little while the whole thing becomes unresponsive.
>
>
> Can anyone confirm they've seen the same behaviour or direct me what to
> look into?

I have been running linux-mips git builds for about a year and a half now on my
Qube2 without any trouble. The box serves as an NFS/FTP server, works pretty
well and sustains the bandwidth.

My guess is that you have a hardware problem, or the box might not be
cooled as it should be. If you want, you can test the following kernel, which
I have been running on this Qube2 for some months:
http://alphacore.org/~florian/linux-mips/qube/

Hope that helps.
-- 
Best regards, Florian Fainelli
Email : florian@openwrt.org
http://openwrt.org
-------------------------------


* Re: Qube2 slowly dies
  2009-06-10 14:04 Qube2 slowly dies Glyn Astill
  2009-06-10 14:12 ` Florian Fainelli
@ 2009-06-11  3:39 ` Kevin D. Kissell
  1 sibling, 0 replies; 7+ messages in thread
From: Kevin D. Kissell @ 2009-06-11  3:39 UTC
  To: Glyn Astill; +Cc: linux-mips

Your description sounds an awful lot like failures I've seen when 
interrupts get lost or blocked for some reason (could be hardware, the 
kernel, or some interaction between them).  Have you looked at 
/proc/interrupts to see if "Spurious" interrupts are occurring, or if 
the rate of serviced timer and I/O interrupts decreases or increases as 
the system degrades?  When the system becomes unresponsive, by any 
chance does it "wake up" after 10-20 minutes (the time for the Count 
register to wrap)?

If other Qube2s don't exhibit this behavior with a given Linux kernel, 
but yours does, and yet yours runs NetBSD OK, it suggests that there's a 
difference in interrupt setup/handling between the two systems that just 
happens to work around a hardware problem on your board.
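
The check suggested above, watching whether the rate of serviced interrupts changes as the system degrades, can be sketched as follows. This is only an illustration, assuming a single-CPU /proc/interrupts layout with one count column; the function names parse_interrupts, interrupt_rates and watch are hypothetical.

```python
import time


def parse_interrupts(text):
    """Parse single-CPU /proc/interrupts text into {irq_label: (count, description)}."""
    counts = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # the "CPU0" header line has no colon
        fields = rest.split()
        if not fields or not fields[0].isdigit():
            continue
        counts[label.strip()] = (int(fields[0]), " ".join(fields[1:]))
    return counts


def interrupt_rates(before, after, seconds):
    """Interrupts per second for each IRQ between two snapshots `seconds` apart."""
    rates = {}
    for label, (count, desc) in after.items():
        previous = before.get(label, (0, desc))[0]
        rates[label] = ((count - previous) / seconds, desc)
    return rates


def watch(path="/proc/interrupts", seconds=10.0):
    """Take two snapshots of `path` and print the per-IRQ rate in between."""
    with open(path) as f:
        before = parse_interrupts(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = parse_interrupts(f.read())
    for label, (rate, desc) in interrupt_rates(before, after, seconds).items():
        print(f"{label:>4}: {rate:10.1f}/s  {desc}")
```

Running watch() repeatedly from a session that is still responsive would show whether the timer line keeps a steady rate or falls off as the machine slows down. Note that watch() itself sleeps on a kernel timer, so if it never returns, that is informative too.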

          Regards,

          Kevin K.



* Re: Qube2 slowly dies
@ 2009-06-10 14:24 Glyn Astill
  0 siblings, 0 replies; 7+ messages in thread
From: Glyn Astill @ 2009-06-10 14:24 UTC
  To: Florian Fainelli; +Cc: linux-mips

Hi Florian,

> From: Florian Fainelli <florian@openwrt.org>

> Determine which process consumes all that memory. Can you describe which
> programs you are running on your Qube2 ?
> 

That's exactly it - I couldn't see anything using excessive memory.

> 
> I have been running linux-mips git builds for about a year and half now on my
> Qube2 without any troubles, the box serves as NFS/FTP server and works pretty
> well and sustains bandwidth.
> 

That's good to hear anyway.

> My guess is that you are having a hardware problem or the box might not be
> cooled as it should be.

I have 2 Qubes, both cooled properly, and both have run NetBSD as solid as a rock for the past 3 years. My usual yearly upgrade routine is to prepare a fresh Qube and switch them; when I do this I normally have 1 year+ uptime.

I should mention that I've been using this particular Qube with NetBSD without issue for years.

> If you want, you can test the following kernel which
> I have been running on this qube2 for some months:
> http://alphacore.org/~florian/linux-mips/qube/
> 

Thanks, I will have a go with one of those. I'll have to look up my notes on preparing a kernel for Debian though.

If it is not the kernel though (I suspect it is not), any ideas what I should be looking at to catch whatever is causing this? I've checked memory usage, turned off DMA, and there isn't much I/O load. As it's a Qube there's plenty of CPU load.


* Re: Qube2 slowly dies
@ 2009-06-11  8:54 Glyn Astill
  2009-06-12 19:45 ` Kevin D. Kissell
  0 siblings, 1 reply; 7+ messages in thread
From: Glyn Astill @ 2009-06-11  8:54 UTC
  To: Kevin D. Kissell; +Cc: linux-mips


Hi Kevin,

It's nice to see a scientific suggestion as to the nature of the problem.

> From: Kevin D. Kissell <kevink@paralogos.com>

> Your description sounds an awful lot like failures I've seen when
> interrupts get lost or blocked for some reason (could be hardware, the
> kernel, or some interaction between them).  Have you looked at
> /proc/interrupts to see if "Spurious" interrupts are occurring, or if
> the rate of serviced timer and I/O interrupts decreases or increases as
> the system degrades?

No, I haven't checked, but I will. What would I be looking for that would stick out as "spurious"? The type of interrupt, the quantity, or random interrupts appearing and disappearing?
  
> When the system becomes unresponsive, by any
> chance does it "wake up" after 10-20 minutes (the time for the Count
> register to wrap)?

Not that I've noticed. I just see it degrade further and further until it dies, over the course of an hour or so.
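
How long the Count register takes to wrap depends on how fast it ticks, so the wait needed to rule this out varies by machine. The arithmetic can be sketched as below, assuming a free-running 32-bit Count register; the tick rates used are hypothetical examples, not values measured on a Qube2, and the real rate has to be taken from the CPU documentation and clock.

```python
COUNT_BITS = 32  # the MIPS CP0 Count register is 32 bits wide


def count_wrap_seconds(count_hz):
    """Seconds for a free-running 32-bit counter ticking at count_hz to wrap."""
    return 2 ** COUNT_BITS / count_hz


def count_hz_for_wrap(seconds):
    """Tick rate at which the counter would wrap in the given number of seconds."""
    return 2 ** COUNT_BITS / seconds


# Hypothetical tick rates, for illustration only.
for hz in (5_000_000, 50_000_000, 125_000_000):
    print(f"{hz / 1e6:6.1f} MHz -> wraps every {count_wrap_seconds(hz):8.1f} s")
```

If a hung process were waiting on a missed Count/Compare match, it would be expected to recover after roughly one wrap period, so knowing that period tells you how long to wait before concluding that it does not "wake up".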

> If other Qube2s don't exhibit this behavior with a given Linux kernel,
> but yours does, and yet yours runs NetBSD OK, it suggests that there's a
> difference in interrupt setup/handling between the two systems that just
> happens to work around a hardware problem on your board.

I'm sure that's a valid possibility, however I do have two of these machines and I have tried both with the same results.

I also had a problem back when I tried etch with the 2.6.18 kernel. In that case I saw no degraded performance at all, but after some hours of activity (anywhere between 2 and 24+) it'd just fall on its ass.



* Re: Qube2 slowly dies
  2009-06-11  8:54 Glyn Astill
@ 2009-06-12 19:45 ` Kevin D. Kissell
  0 siblings, 0 replies; 7+ messages in thread
From: Kevin D. Kissell @ 2009-06-12 19:45 UTC
  To: Glyn Astill; +Cc: linux-mips


Glyn Astill wrote:
>> From: Kevin D. Kissell <kevink@paralogos.com>
>>
>> Your description sounds an awful lot like failures I've seen when
>> interrupts get lost or blocked for some reason (could be hardware, the
>> kernel, or some interaction between them).  Have you looked at
>> /proc/interrupts to see if "Spurious" interrupts are occurring, or if
>> the rate of serviced timer and I/O interrupts decreases or increases as
>> the system degrades?
>
> No I haven't checked - but I will. What would I be looking for that would stick out as "spurious"? The type of interrupt, qty or random interrupts appearing and disappearing?

There's a separate counter for spurious interrupts, and it is reported in
/proc/interrupts.

>> When the system becomes unresponsive, by any
>> chance does it "wake up" after 10-20 minutes (the time for the Count
>> register to wrap)?
>
> Not that I've noticed, I just see it degrade further and further until it dies over the course of an hour or so.

>> If other Qube2s don't exhibit this behavior with a given Linux kernel,
>> but yours does, and yet yours runs NetBSD OK, it suggests that there's a
>> difference in interrupt setup/handling between the two systems that just
>> happens to work around a hardware problem on your board.
>
> I'm sure that's a valid possibility, however I do have two of these machines and I have tried both with the same results.

Ah.  I had misunderstood your messages to have stated that you had one
Qube2 that exhibited the behavior while others did not.  In the actual
case, it definitely sounds like a kernel interrupt management problem,
either at the level of the interrupt controller support code or in some bit
of low-level management of the Status.IM interrupt mask.  If you can
force the kernel to dump the state of the Status and Cause registers, as
well as that of whatever outboard interrupt controller is on that thing,
that would be good.  I used to have a hook in the NMI handler of my
Malta kernels for that, which was useful when I was debugging the SMTC
interrupt support, which was pretty subtle and nasty, and which is why this
failure mode sounds vaguely familiar.  ;o)  The interrupt
ack/mask/enable machinery has changed and standardized (for the better)
since the Qube2 was a current product, and the controller "chip"
struct/functions being used may not in fact be entirely correct for the
platform, e.g. you may have non-atomic changes to interrupt masks being
done that screw up in the presence of nested service.

> I also had a problem back when I tried etch with the 2.6.18 kernel, however in this case I saw no degraded performance at all, however after some hours of activity (anywhere between 2 and 24+) it'd just fall on its ass.

That's not a very scientific description of a failure.  I mean, did the 
Qube2 literally jump off the table? ;o)
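
The kind of non-atomic interrupt mask update described above can be illustrated with a toy model. This is only a sketch in Python, not kernel code: MaskRegister, the line numbers and the nested handler are all hypothetical, and the "atomic" variant merely stands in for doing the read-modify-write with interrupts held off.

```python
class MaskRegister:
    """Toy stand-in for an interrupt mask register (one bit per interrupt line)."""

    def __init__(self, value):
        self.value = value


def enable_line_non_atomic(reg, line, nested_handler=None):
    """Read-modify-write that can be interrupted between the read and the write."""
    snapshot = reg.value                # 1. read the current mask
    if nested_handler is not None:
        nested_handler(reg)             # 2. a nested interrupt edits the mask here
    reg.value = snapshot | (1 << line)  # 3. write back a value built from the stale read


def enable_line_atomic(reg, line, nested_handler=None):
    """Same update, with the nested handler held off until the write is done."""
    reg.value |= 1 << line              # update completes without interruption
    if nested_handler is not None:
        nested_handler(reg)             # the pending interrupt is serviced afterwards


def handler_unmasks_timer(reg):
    """Hypothetical nested handler that re-enables line 7 (standing in for a timer)."""
    reg.value |= 1 << 7


racy = MaskRegister(0b0000_0001)
enable_line_non_atomic(racy, 2, handler_unmasks_timer)

safe = MaskRegister(0b0000_0001)
enable_line_atomic(safe, 2, handler_unmasks_timer)

print(f"non-atomic: {racy.value:#010b}  (bit 7 set: {bool(racy.value & 0x80)})")
print(f"atomic:     {safe.value:#010b}  (bit 7 set: {bool(safe.value & 0x80)})")
```

In the non-atomic case the write in step 3 discards the bit the nested handler set, so that line stays masked even though its handler believed it had been re-enabled. A timer line lost this way would match the symptoms in this thread, though that remains a hypothesis until the real mask state is dumped.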


          Regards,

          Kevin K.



* Re: Qube2 slowly dies
@ 2009-06-15 21:34 Glyn Astill
  0 siblings, 0 replies; 7+ messages in thread
From: Glyn Astill @ 2009-06-15 21:34 UTC
  To: Kevin D. Kissell; +Cc: linux-mips




--- On Fri, 12/6/09, Kevin D. Kissell <kevink@paralogos.com> wrote:
>>> Your description sounds an awful lot like failures I've seen when
>>> interrupts get lost or blocked for some reason (could be hardware, the
>>> kernel, or some interaction between them).  Have you looked at
>>> /proc/interrupts to see if "Spurious" interrupts are occurring, or if
>>> the rate of serviced timer and I/O interrupts decreases or increases as
>>> the system degrades?
>>
>> No I haven't checked - but I will. What would I be looking for that would
>> stick out as "spurious"? The type of interrupt, qty or random interrupts
>> appearing and disappearing?
>
> There's a separate counter, and /proc/interrupts report, for spurious
> interrupts.

I've just tested it and I see no extra counters appearing, unless the cascade is an issue:

deb:~#  cat /proc/interrupts
           CPU0
  0:          1          XT-PIC  timer
  2:          0          XT-PIC  cascade
  8:          2          XT-PIC  rtc0
  9:          0          XT-PIC  ohci_hcd:usb1, ehci_hcd:usb2, ohci_hcd:usb3
 14:       3166          XT-PIC  ide0
 15:          0          XT-PIC  ide1
 18:          0            MIPS  cascade
 19:       4399            MIPS  eth0
 21:        361            MIPS  serial
 22:          0            MIPS  cascade
 23:     274025            MIPS  timer
 32:          2         GT641xx  gt641xx_timer0


When the machine starts to go, the CPU time column in top sometimes shows nan. Surely that shouldn't happen; it should be either 0 or >0.

Any other ideas, chaps?


      

