linux-mips.vger.kernel.org archive mirror
* Qube2 slowly dies
@ 2009-06-10 14:04 Glyn Astill
  2009-06-10 14:12 ` Florian Fainelli
  2009-06-11  3:39 ` Kevin D. Kissell
  0 siblings, 2 replies; 7+ messages in thread
From: Glyn Astill @ 2009-06-10 14:04 UTC (permalink / raw)
  To: linux-mips


Hi people,

I've been directed here from the Debian lists by Martin Michlmayr. I'm running lenny on a Qube2 with 128 MB RAM and a 40 GB disk.

I've tried kernels 2.6.26 and 2.6.30~rc8 and the issue I'm about to describe is present in both. I haven't tried any other kernels, but I will try 2.6.22 when I can.

Essentially the machine gets more and more sluggish until it finally dies. I've had a quick look in /proc/meminfo and I can't see that it's running out of memory, and I'm not sure what else to check.
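
A minimal sketch of the memory check described above, assuming Python is available on the box. The function names are hypothetical, and treating MemFree + Buffers + Cached as "available" is only a rough approximation.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}.

    Most fields are reported in kB; a few (e.g. HugePages_*) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def roughly_available_kb(info):
    """Rough estimate of memory the kernel could still hand out, in kB."""
    return sum(info.get(key, 0) for key in ("MemFree", "Buffers", "Cached"))


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```

Logging `roughly_available_kb(read_meminfo())` every few seconds while the machine degrades would show whether memory really stays flat.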

I find it hard to describe what's going on, but here's a scenario I hope illustrates the problem. The configure script is just an example of doing something; I could just as easily have extracted an archive with tar for the same results:

- I start 2 ssh sessions; in one I start configure for the postgres source, in the other I just start top.

- And for a while all seems fine; configure ticks away and top refreshes every second.

- Then top stops ticking over - but it'll refresh with a keypress. Anyway I exit top and try to run it again... nothing. I hit ctrl-c which brings me back to the prompt and I try again... nothing.

- The configure script is still ticking over slowly.

- I try "ps ax" - it works; so I try it again... nothing.

- I try "ipcs" and "lsof"; they both work and seem to keep working.

- I try "ps ax" again... nothing. I hit ctrl-c and now it doesn't come back to the command prompt for a while, say 5 minutes, but eventually it's back.

- It's still going. Some commands still work, some just do nothing. /proc/meminfo shows it hasn't eaten all the memory.

- If I try to start another ssh session I can log in and I get the motd, but I don't get to the shell.

- Eventually the configure script ends, and all shells come back to the prompt. But the machine now seems totally braindamaged: I can run "ps ax" but "top" and other commands still do nothing. Here's strace attached to the top process:

deb:~# strace -p 7228
Process 7228 attached - interrupt to quit
_newselect(0, NULL, NULL, NULL, {0, 500013}

- Then after a little while the whole thing becomes unresponsive.
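
The strace output above shows top blocked in a select() call with no file descriptors and a timeout of roughly half a second, i.e. select() used as a sleep. A small sketch of a check for whether such timeouts still expire on time follows; the function name is hypothetical. One caveat: the measurement uses the kernel's own monotonic clock, so if timekeeping itself has stalled the numbers may look fine, and a stopwatch or a second machine is the safer reference.

```python
import select
import time


def timed_select(timeout_s):
    """Sleep in select() with no descriptors and return the elapsed seconds."""
    start = time.monotonic()
    # Empty fd lists plus a timeout: the call can only return by timing out.
    select.select([], [], [], timeout_s)
    return time.monotonic() - start
```

On a healthy system `timed_select(0.5)` should come back in about half a second; a call that takes far longer, or never returns, would point at timer expiry rather than at top itself.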


Can anyone confirm they've seen the same behaviour or direct me what to look into?

Thanks
Glyn


      

* Re: Qube2 slowly dies
@ 2009-06-10 14:24 Glyn Astill
  0 siblings, 0 replies; 7+ messages in thread
From: Glyn Astill @ 2009-06-10 14:24 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: linux-mips


Hi Florian,

> From: Florian Fainelli <florian@openwrt.org>

> Determine which process consumes all that memory. Can you
> describe which 
> programs you are running on your Qube2 ?
> 

That's exactly it - I couldn't see anything using excessive memory.

> 
> I have been running linux-mips git builds for about a year
> and half now on my 
> Qube2 without any troubles, the box serves as NFS/FTP
> server and works pretty 
> well and sustains bandwidth.
> 

That's good to hear anyway.

> My guess is that you are having a hardware problem or the
> box might not be 
> cooled as it should be.

I have 2 Qubes, both cooled properly, and both have run NetBSD solid as a rock for the past 3 years. My usual yearly upgrade routine is to prepare a fresh Qube and switch them; when I do this I normally get 1 year+ of uptime.

I should mention that I've been using this particular Qube with NetBSD without issue for years.

> If you want, you can test the
> following kernel which 
> I have been running on this qube2 for some months: 
> http://alphacore.org/~florian/linux-mips/qube/
> 

Thanks, I will have a go with one of those - I'll have to look up my notes on preparing a kernel for Debian though.

If it is not the kernel though (I suspect it is not), any ideas what I should be looking at to catch whatever is causing this? I've checked memory usage, turned off DMA, and there isn't much IO load. As it's a Qube there's plenty of CPU load.


      

* Re: Qube2 slowly dies
@ 2009-06-11  8:54 Glyn Astill
  2009-06-12 19:45 ` Kevin D. Kissell
  0 siblings, 1 reply; 7+ messages in thread
From: Glyn Astill @ 2009-06-11  8:54 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: linux-mips


Hi Kevin,

It's nice to see a scientific suggestion as to the nature of the problem.

> From: Kevin D. Kissell <kevink@paralogos.com>

> Your description sounds an awful lot
> like failures I've seen when 
> interrupts get lost or blocked for some reason (could be
> hardware, the 
> kernel, or some interaction between them).  Have you
> looked at 
>  to see if "Spurious" interrupts are
> occurring, or if 
> the rate of serviced timer and I/O interrupts decreases or
> increases as 
> the system degrades?

No, I haven't checked, but I will. What would I be looking for that would stick out as "spurious"? The type of interrupt, the quantity, or random interrupts appearing and disappearing?
  
> When the system becomes unresponsive, by any 
> chance does it "wake up" after 10-20 minutes (the time for
> the Count 
> register to wrap)?
> 

Not that I've noticed; I just see it degrade further and further over the course of an hour or so until it dies.
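
For reference on the Count-register question: the MIPS CP0 Count register is 32 bits wide, and the timer interrupt fires when Count matches Compare, so a missed match is only seen again after Count has wrapped all the way round. The wrap period is a simple division. The sketch below takes the counting rate as an input because that rate is implementation-dependent; the figure in the usage line is purely illustrative and not measured on a Qube2.

```python
def count_wrap_seconds(count_hz, bits=32):
    """Seconds for a free-running counter of the given width to wrap once."""
    if count_hz <= 0:
        raise ValueError("count_hz must be positive")
    return (2 ** bits) / count_hz


# Illustrative only: a counter ticking at 100 MHz wraps in about 43 seconds.
wrap_at_100mhz = count_wrap_seconds(100e6)
```

Plugging in the board's real Count rate gives the length of the "wake up" interval to watch for.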

> If other Qube2s don't exhibit this behavior with a given
> Linux kernel, 
> but yours does, and yet yours runs NetBSD OK, it suggests
> that there's a 
> difference in interrupt setup/handling between the two
> systems that just 
> happens to work around a hardware problem on your board.

I'm sure that's a valid possibility, however I do have two of these machines and I have tried both with the same results.

I also had a problem back when I tried etch with the 2.6.18 kernel. In that case I saw no degraded performance at all, but after some hours of activity (anywhere between 2 and 24+) it'd just fall on its ass.

> 
>           Regards,
> 
>           Kevin K.
> 


      

* Re: Qube2 slowly dies
@ 2009-06-15 21:34 Glyn Astill
  0 siblings, 0 replies; 7+ messages in thread
From: Glyn Astill @ 2009-06-15 21:34 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: linux-mips




--- On Fri, 12/6/09, Kevin D. Kissell <kevink@paralogos.com> wrote:
> > > Your description sounds an awful lot
> > > like failures I've seen when
> > > interrupts get lost or blocked for some reason (could be
> > > hardware, the
> > > kernel, or some interaction between them).  Have you
> > > looked at
> > >  to see if "Spurious" interrupts are
> > > occurring, or if
> > > the rate of serviced timer and I/O interrupts decreases or
> > > increases as
> > > the system degrades?
> >
> > No I haven't checked - but I will. What would I be
> > looking for that would stick out as "spurious"?
> > The type of interrupt, qty or random interrupts appearing
> > and disappearing?
>
> There's a separate counter, and /proc/interrupts
> report, for spurious
> interrupts.
>

I've just tested it and I see no extra counters appearing, unless the cascade is an issue:

deb:~#  cat /proc/interrupts
           CPU0
  0:          1          XT-PIC  timer
  2:          0          XT-PIC  cascade
  8:          2          XT-PIC  rtc0
  9:          0          XT-PIC  ohci_hcd:usb1, ehci_hcd:usb2, ohci_hcd:usb3
 14:       3166          XT-PIC  ide0
 15:          0          XT-PIC  ide1
 18:          0            MIPS  cascade
 19:       4399            MIPS  eth0
 21:        361            MIPS  serial
 22:          0            MIPS  cascade
 23:     274025            MIPS  timer
 32:          2         GT641xx  gt641xx_timer0
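
Kevin's suggestion to watch whether the timer and I/O interrupt rates change as the system degrades could be sketched as below: take two snapshots of /proc/interrupts some seconds apart and compare the counts. The function names are hypothetical, and the parser assumes a single CPU column as in the output above.

```python
def parse_interrupts(text):
    """Map IRQ label -> count from single-CPU /proc/interrupts text."""
    counts = {}
    for line in text.splitlines():
        irq, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields or not fields[0].isdigit():
            continue  # header line, or a line in an unexpected format
        counts[irq.strip()] = int(fields[0])
    return counts


def interrupt_rates(before, after, seconds):
    """Interrupts per second for each IRQ present in the later snapshot."""
    return {
        irq: (count - before.get(irq, 0)) / seconds
        for irq, count in after.items()
    }
```

Taking a snapshot every 10 seconds or so from boot until the box stops responding, and comparing the rate on IRQ 23 (the MIPS timer line above) early and late, would show whether timer interrupts are slowing down or stopping.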


When the machine starts to go, the CPU time column in top sometimes shows nan. Surely that shouldn't happen; it should be either 0 or >0.
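
One possible reading of the nan, offered as an assumption rather than anything confirmed in this thread: a percentage is a ratio of two deltas, and if the denominator (elapsed ticks or elapsed time between two refreshes) comes out as zero, floating-point 0/0 yields NaN. The sketch below is a generic model of that mechanism, not top's actual code, and the function name is hypothetical.

```python
import math


def cpu_percent(busy_ticks, elapsed_ticks):
    """Share of the elapsed ticks spent busy, as a percentage."""
    if elapsed_ticks == 0:
        # In C, 0.0 / 0.0 silently produces NaN. Python raises instead,
        # so the NaN case is modelled explicitly here.
        return math.nan
    return 100.0 * busy_ticks / elapsed_ticks
```

If that is what is happening, a nan would mean the clock did not appear to advance between two samples, which would fit the lost-timer-interrupt theory.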
 
Any other ideas, chaps?


      


end of thread, other threads:[~2009-06-15 21:36 UTC | newest]

Thread overview: 7+ messages
2009-06-10 14:04 Qube2 slowly dies Glyn Astill
2009-06-10 14:12 ` Florian Fainelli
2009-06-11  3:39 ` Kevin D. Kissell
  -- strict thread matches above, loose matches on Subject: below --
2009-06-10 14:24 Glyn Astill
2009-06-11  8:54 Glyn Astill
2009-06-12 19:45 ` Kevin D. Kissell
2009-06-15 21:34 Glyn Astill
