process lockups

Linux MIPS Architecture development

* process lockups
@ 2000-10-24  1:22 Karsten Merker
  2000-10-24  2:47 ` Ralf Baechle
  2000-10-24 12:25 ` Florian Lohoff
  0 siblings, 2 replies; 9+ messages in thread
From: Karsten Merker @ 2000-10-24  1:22 UTC (permalink / raw)
  To: linux-mips; +Cc: linux-mips

Hello everyone,

I am running kernel 2.4.0-test9 on a DECstation 5000/150. I am
experiencing strange behaviour under heavy I/O load, such as
running a "tar xvf foobar.tgz" with a large archive. After some time of
activity the process (in this case tar) is stuck in status "D". There is
neither an entry in the syslog nor on the console that would give me a
hint what is happening. Is anyone else experiencing this?
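
[Editorial sketch] State "D" is uninterruptible sleep. One way to spot such processes without going through ps is to read the state field of /proc/<pid>/stat directly. The following is a minimal Python sketch, assuming a mounted /proc; the function names parse_stat and uninterruptible are hypothetical helpers, not part of any tool mentioned in this thread.

```python
import os

def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the LAST ')' rather than on spaces.
    """
    head, _, tail = stat_line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    return int(pid_text), comm, tail.split()[0]

def uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process currently in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling uninterruptible() a few seconds apart and comparing the results would show whether the same PID stays in "D", which is what distinguishes a real hang from ordinary short waits on disk I/O.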

Another thing I see on my 5000/150 (and only there - this is my only
R4K-machine, so I do not know whether this is CPU- or machine-type-bound)
is "top" going weird, eating lots of CPU cycles and spitting messages
"schedule_timeout: wrong timeout value fffbd0b2 from 800900f8; Setting
flush to zero for top". I know Florian also has this on his 5000/150.
Anyone else with the same behaviour or any idea about the cause for this?
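
[Editorial sketch] To my understanding, 2.4-era kernels print "schedule_timeout: wrong timeout value" when schedule_timeout() is handed a negative timeout, and they print the value with an unsigned hex format. Assuming a 32-bit long (an assumption about this particular kernel build), the reported value can be reinterpreted as a signed number like this; as_signed is a hypothetical helper name.

```python
def as_signed(value, bits=32):
    """Reinterpret an unsigned integer as a two's-complement signed value."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

# Value taken from the kernel message quoted above.
timeout = as_signed(0xFFFBD0B2)
print(timeout)  # -274254
```

Under that assumption the caller at 800900f8 asked to sleep for a negative number of jiffies, which would point at the code computing the timeout rather than at schedule_timeout() itself.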

Greetings,
Karsten
-- 
#include <standard_disclaimer>
In accordance with Section 28(3) of the German Federal Data Protection
Act (Bundesdatenschutzgesetz), I object to the use or transfer of my
data for advertising purposes or for market or opinion research.


* Re: process lockups
  2000-10-24  1:22 process lockups Karsten Merker
@ 2000-10-24  2:47 ` Ralf Baechle
  2000-10-24  5:51   ` Houten K.H.C. van (Karel)
  2000-10-24 12:25 ` Florian Lohoff
  1 sibling, 1 reply; 9+ messages in thread
From: Ralf Baechle @ 2000-10-24  2:47 UTC (permalink / raw)
  To: linux-mips, linux-mips

On Tue, Oct 24, 2000 at 03:22:32AM +0200, Karsten Merker wrote:

> I am running Kernel 2.4.0-test9 on a DECstation 5000/150. I am
> experiencing a strange behaviour when having strong I/O-load, such as
> running a "tar xvf foobar.tgz" with a large archive. After some time of
> activity the process (in this case tar) is stuck in status "D". There is
> neither an entry in the syslog nor on the console that would give me a
> hint what is happening. Is anyone else experiencing this?

I observe similar stuck processes on Origins - even without massive I/O
load.  I'm trying to track them down, but with little success aside from
fixing a few unrelated little bugs.  Do you observe those on your R4k box
also?

Another thing I'm observing is that I occasionally can't unmount
a filesystem.  umount then says the fs is still in use.  Sometimes it's
at least possible to remount the fs r/o.  Have you also observed this one?

> Another thing I see on my 5000/150 (and only there - this is my only
> R4K-machine, so I do not know whether this is CPU- or machine-type-bound)
> is "top" going weird, eating lots of CPU cycles and spitting messages
> "schedule_timeout: wrong timeout value fffbd0b2 from 800900f8; Setting
> flush to zero for top". I know Florian also has this on his 5000/150.
> Anyone else with the same behaviour or any idea about the cause for this?

Setting flush to zero for <process name> means that the floating point
approximator is now enabled ;-)

The schedule_timeout thing is unrelated; I've never heard of it before.

  Ralf


* Re: process lockups
  2000-10-24  2:47 ` Ralf Baechle
@ 2000-10-24  5:51   ` Houten K.H.C. van (Karel)
  2000-10-24 11:15     ` Karsten Merker
                       ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Houten K.H.C. van (Karel) @ 2000-10-24  5:51 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: linux-mips, linux-mips, K.H.C.vanHouten


Ralf Baechle wrote:
>
> On Tue, Oct 24, 2000 at 03:22:32AM +0200, Karsten Merker wrote:
> 
> > I am running Kernel 2.4.0-test9 on a DECstation 5000/150. I am
> > experiencing a strange behaviour when having strong I/O-load, such as
> > running a "tar xvf foobar.tgz" with a large archive. After some time of
> > activity the process (in this case tar) is stuck in status "D". There is
> > neither an entry in the syslog nor on the console that would give me a
> > hint what is happening. Is anyone else experiencing this?
> 
> I observe similar stuck processes on Origins - even without massive I/O
> load.  I'm trying to track them but little success aside of fixing a few
> unrelated little bugs.  Do you observe those on your R4k box also?
On my DEC 5000/260 (R4k) I have no stuck processes, but I should mention
that I am running without swap (I have 192 MB of RAM).
 
> Another thing I'm observing is that I occasionally can't unmount
> a filesystem.  umount then says the fs is still in use.  Sometimes it's
> at least possible to remount the fs r/o.  Have you also observed this one?
Yes, but only the root FS. I thought I might have to upgrade to a newer
mount program for the 2.4 kernel, or is the system call returning the error?

> > Another thing I see on my 5000/150 (and only there - this is my only
> > R4K-machine, so I do not know whether this is CPU- or machine-type-bound)
> > is "top" going weird, eating lots of CPU cycles and spitting messages
> > "schedule_timeout: wrong timeout value fffbd0b2 from 800900f8; Setting
> > flush to zero for top". I know Florian also has this on his 5000/150.
> > Anyone else with the same behaviour or any idea about the cause for this?
> 
> Setting flush to zero for <process name> means that the floating point
> approximator is now enabled ;-)
> 
> The schedule_timeout thing is unrelated; I've never heard of it before.

Aside from this I still get 'bug in get_wchan' messages, but everything
seems to run fine. I hope to test my current kernels on a 5000/150 and
a 3100.

Regards,

-- 
Karel van Houten

----------------------------------------------------------
The box said "Requires Windows 95 or better."
I can't understand why it won't work on my Linux computer. 
----------------------------------------------------------


* Re: process lockups
  2000-10-24  5:51   ` Houten K.H.C. van (Karel)
@ 2000-10-24 11:15     ` Karsten Merker
  2000-10-24 14:38     ` Ralf Baechle
  2000-10-24 15:09     ` Ralf Baechle
  2 siblings, 0 replies; 9+ messages in thread
From: Karsten Merker @ 2000-10-24 11:15 UTC (permalink / raw)
  To: linux-mips, linux-mips

On Tue, Oct 24, 2000 at 07:51:42AM +0200, Houten K.H.C. van (Karel) wrote:
> 
> Ralf Baechle wrote:
[hanging processes in status "D"]
> > I observe similar stuck processes on Origins - even without massive I/O
> > load.  I'm trying to track them but little success aside of fixing a few
> > unrelated little bugs.  Do you observe those on your R4k box also?
> On my DEC 5000/260 (R4k) I have no stuck processes, but I should mention
> that I am running without swap (I have 192Mb RAM).

Having swap or not does not seem to influence the behaviour - I also get
hangs with swap disabled. The most likely candidates for hanging are tar
and gcc.

> > Another thing I'm observing is that I occasionally can't unmount
> > a filesystem.  umount then says the fs is still in use.  Sometimes it's
> > at least possible to remount the fs r/o.  Have you also observed this one?
> Yes, but only the root FS. I thought I might have to upgrade to a newer
> mount program for the 2.4 kernel, or is the system call returning the error?

Similar effect here - sometimes unmounting the root fs on shutdown is
successful, sometimes I get "/ is busy" without being able to find a
reason for that. Possibly it is a bug in mount (I am still running
mount-2.9o).

> > > Another thing I see on my 5000/150 (and only there - this is my only
> > > R4K-machine, so I do not know whether this is CPU- or machine-type-bound)
> > > is "top" going weird, eating lots of CPU cycles and spitting messages
> > > "schedule_timeout: wrong timeout value fffbd0b2 from 800900f8; Setting
> > > flush to zero for top". I know Florian also has this on his 5000/150.
> > > Anyone else with the same behaviour or any idea about the cause for this?
> > 
> > Setting flush to zero for <process name> means that the floating point
> > approximator is now enabled ;-)

???

Greetings,
Karsten
-- 
#include <standard_disclaimer>
In accordance with Section 28(3) of the German Federal Data Protection
Act (Bundesdatenschutzgesetz), I object to the use or transfer of my
data for advertising purposes or for market or opinion research.


* Re: process lockups
  2000-10-24  5:51   ` Houten K.H.C. van (Karel)
  2000-10-24 11:15     ` Karsten Merker
@ 2000-10-24 14:38     ` Ralf Baechle
  2000-10-24 17:55       ` Karsten Merker
  2000-10-24 15:09     ` Ralf Baechle
  2 siblings, 1 reply; 9+ messages in thread
From: Ralf Baechle @ 2000-10-24 14:38 UTC (permalink / raw)
  To: K.H.C.vanHouten; +Cc: linux-mips, linux-mips, K.H.C.vanHouten

On Tue, Oct 24, 2000 at 07:51:42AM +0200, Houten K.H.C. van (Karel) wrote:

> > > I am running Kernel 2.4.0-test9 on a DECstation 5000/150. I am
> > > experiencing a strange behaviour when having strong I/O-load, such as
> > > running a "tar xvf foobar.tgz" with a large archive. After some time of
> > > activity the process (in this case tar) is stuck in status "D". There is
> > > neither an entry in the syslog nor on the console that would give me a
> > > hint what is happening. Is anyone else experiencing this?
> > 
> > I observe similar stuck processes on Origins - even without massive I/O
> > load.  I'm trying to track them but little success aside of fixing a few
> > unrelated little bugs.  Do you observe those on your R4k box also?
> On my DEC 5000/260 (R4k) I have no stuck processes, but I should mention
> that I am running without swap (I have 192Mb RAM).

That matches my Origin experience with its 1.5 GB of RAM and no swap.

> > Another thing I'm observing is that I occasionally can't unmount
> > a filesystem.  umount then says the fs is still in use.  Sometimes it's
> > at least possible to remount the fs r/o.  Have you also observed this one?

> Yes, but only the root FS. I thought I might have to upgrade to a newer
> mount program for the 2.4 kernel, or is the system call returning the error?

It also happens for other filesystems; the heavier the usage of the
filesystem has been the more often.  But I've never seen a hanging tar or
gcc process.

> Aside from this I still get 'bug in get_wchan' messages, but everything
> seems to run fine. I hope to test my current kernels on a 5000/150 and
> a 3100.

This message is harmless.  The only effect is that the WCHAN column of
ps axl will have bogus information.

Which is a problem - I need exactly the WCHAN information to debug this
problem.

  Ralf


* Re: process lockups
  2000-10-24 14:38     ` Ralf Baechle
@ 2000-10-24 17:55       ` Karsten Merker
  2000-10-25  1:29         ` Ralf Baechle
  0 siblings, 1 reply; 9+ messages in thread
From: Karsten Merker @ 2000-10-24 17:55 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: linux-mips, linux-mips

On Tue, Oct 24, 2000 at 04:38:43PM +0200, Ralf Baechle wrote:

> Which is a problem - I need exactly the WCHAN information to debug this
> problem.

Here we go...

Two major processes are running: a tar zxvf (PIDs 212 and 213) and a
dpkg-buildpackage. Both together should consume all CPU time available,
but they do not; they just sit idle. Interestingly, this time there is no
process in state "D" as I had before. This seems to be reproducible.

These logs were created from a fresh cvs-checkout (already including your
patch).

root# ps -laww
  F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
100 S     0   189   168  0  60   0 -  1000 pause  ttyp0    00:00:00 screen
100 S     0   212   191  1  60   0 -   579 ?      ttya0    00:00:00 tar
000 S     0   213   212  0  60   0 -   394 pipe_w ttya0    00:00:00 gzip
100 S     0   220   197  0  60   0 -   873 wait4  ttya2    00:00:00 dpkg-buildpacka
100 S     0   272   220  0  60   0 -  1563 wait4  ttya2    00:00:02 dpkg-source
000 S     0   277   272  0  60   0 -   394 pipe_w ttya2    00:00:00 gunzip
100 S     0   278   272  0  60   0 -   536 ?      ttya2    00:00:00 cpio
000 S     0   279   278  0  60   0 -   864 wait4  ttya2    00:00:00 sh
000 S     0   280   279  0  60   0 -   333 pipe_w ttya2    00:00:00 egrep
000 R     0   283   196  0  60   0 -   800 -      ttya1    00:00:00 ps
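
[Editorial sketch] The interesting rows in a listing like the one above are the sleeping processes whose WCHAN shows up as `?', since those are the ones ps could not resolve. A minimal Python sketch for picking them out of `ps -l' style output follows; the column list is copied from the header above, the function names are hypothetical, and the sample is a subset of the rows in this message.

```python
PS_COLUMNS = ["F", "S", "UID", "PID", "PPID", "C", "PRI", "NI",
              "ADDR", "SZ", "WCHAN", "TTY", "TIME", "CMD"]

def parse_ps_long(text):
    """Parse `ps -l` style output into a list of dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        # Split into at most 14 fields so a CMD containing spaces stays whole.
        fields = line.strip().split(None, len(PS_COLUMNS) - 1)
        if len(fields) != len(PS_COLUMNS) or fields[0] == "F":
            continue  # skip the header and anything that is not a data row
        rows.append(dict(zip(PS_COLUMNS, fields)))
    return rows

def unresolved_sleepers(rows):
    """Sleeping processes whose wait channel ps could not resolve ('?')."""
    return [(int(r["PID"]), r["CMD"]) for r in rows
            if r["S"] == "S" and r["WCHAN"] == "?"]

SAMPLE = """\
  F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
100 S     0   212   191  1  60   0 -   579 ?      ttya0    00:00:00 tar
000 S     0   213   212  0  60   0 -   394 pipe_w ttya0    00:00:00 gzip
100 S     0   278   272  0  60   0 -   536 ?      ttya2    00:00:00 cpio
000 R     0   283   196  0  60   0 -   800 -      ttya1    00:00:00 ps
"""

print(unresolved_sleepers(parse_ps_long(SAMPLE)))  # [(212, 'tar'), (278, 'cpio')]
```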

While this happens, top reports:

27 processes: 26 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  10.5% user,   9.5% system,   0.0% nice,  79.9% idle
Mem:  127056K av,  22064K used, 104992K free,      0K shrd,    488K buff
Swap:      0K av,      0K used,      0K free                 10952K cached
 
  PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
  284 root       0   0  1048 1048   684 R       0 12.9  0.8   0:00 top
  190 root       0   0  1204 1204   984 S       0  0.8  0.9   0:02 screen
    1 root       0   0   484  484   408 S       0  0.0  0.3   0:02 init
[...]
All other processes show 0% CPU.


After some time the tar zxvf suddenly starts running and decompresses the
archive in one step.

I hope this description is helpful. If you need further information, just
mail me.

Greetings,
Karsten
-- 
#include <standard_disclaimer>
In accordance with Section 28(3) of the German Federal Data Protection
Act (Bundesdatenschutzgesetz), I object to the use or transfer of my
data for advertising purposes or for market or opinion research.


* Re: process lockups
  2000-10-24 17:55       ` Karsten Merker
@ 2000-10-25  1:29         ` Ralf Baechle
  0 siblings, 0 replies; 9+ messages in thread
From: Ralf Baechle @ 2000-10-25  1:29 UTC (permalink / raw)
  To: linux-mips, linux-mips

On Tue, Oct 24, 2000 at 07:55:55PM +0200, Karsten Merker wrote:

> Two major processes are running: a tar zxvf (PIDs 212 and 213) and a
> dpkg-buildpackage. Both together should consume all CPU time available,
> but they do not, they just sit idle. Interesting is that here there is no
> process in state "D" as I had before. This seems to be reproducible.
> 
> These logs were created from a fresh cvs-checkout (already including your
> patch).

Which was still pretty fishy.  The scheduler has changed significantly
and so it took a little bit more fixing.  Which explains the `?' in the
listing below.  I tried to fix this in the CVS tree.

> root# ps -laww
>   F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
> 100 S     0   189   168  0  60   0 -  1000 pause  ttyp0    00:00:00 screen
> 100 S     0   212   191  1  60   0 -   579 ?      ttya0    00:00:00 tar
> 000 S     0   213   212  0  60   0 -   394 pipe_w ttya0    00:00:00 gzip
> 100 S     0   220   197  0  60   0 -   873 wait4  ttya2    00:00:00 dpkg-buildpacka
> 100 S     0   272   220  0  60   0 -  1563 wait4  ttya2    00:00:02 dpkg-source
> 000 S     0   277   272  0  60   0 -   394 pipe_w ttya2    00:00:00 gunzip
> 100 S     0   278   272  0  60   0 -   536 ?      ttya2    00:00:00 cpio
> 000 S     0   279   278  0  60   0 -   864 wait4  ttya2    00:00:00 sh
> 000 S     0   280   279  0  60   0 -   333 pipe_w ttya2    00:00:00 egrep
> 000 R     0   283   196  0  60   0 -   800 -      ttya1    00:00:00 ps

Ok, so dpkg-buildpackage is waiting for the termination of some other
process.

> While this happens, top tells:
> 
> 27 processes: 26 sleeping, 1 running, 0 zombie, 0 stopped
> CPU states:  10.5% user,   9.5% system,   0.0% nice,  79.9% idle
> Mem:  127056K av,  22064K used, 104992K free,      0K shrd,    488K buff
> Swap:      0K av,      0K used,      0K free                 10952K cached
>  
>   PID USER     PRI  NI  SIZE  RSS SHARE STAT  LIB %CPU %MEM   TIME COMMAND
>   284 root       0   0  1048 1048   684 R       0 12.9  0.8   0:00 top
>   190 root       0   0  1204 1204   984 S       0  0.8  0.9   0:02 screen
>     1 root       0   0   484  484   408 S       0  0.0  0.3   0:02 init
> [...]
> Any further processes have 0% CPU.

Those CPU percentages are meaningless anyway.  They don't indicate anything
about a process' current CPU usage.

> After some time the tar zxvf suddenly starts running and decompresses the
> archive in one step.

The `?' shows that tar is sleeping, but due to the get_wchan bug we don't
see what it is waiting on, so there is little I can do with this
information ...

  Ralf


* Re: process lockups
  2000-10-24  5:51   ` Houten K.H.C. van (Karel)
  2000-10-24 11:15     ` Karsten Merker
  2000-10-24 14:38     ` Ralf Baechle
@ 2000-10-24 15:09     ` Ralf Baechle
  2 siblings, 0 replies; 9+ messages in thread
From: Ralf Baechle @ 2000-10-24 15:09 UTC (permalink / raw)
  To: K.H.C.vanHouten; +Cc: linux-mips, linux-mips

On Tue, Oct 24, 2000 at 07:51:42AM +0200, Houten K.H.C. van (Karel) wrote:

> Aside from this I still get 'bug in get_wchan' messages, but everything
> seems to run fine. I hope to test my current kernels on a 5000/150 and
> a 3100.

Try this untested fix for get_wchan.  The values in the WCHAN column of
ps axl should now be numbers that make sense as addresses.  Unless the `n'
option is also used, ps will try to translate the address back into a
symbol.  Quoting from ps(1):

[...]
       To produce the WCHAN field, ps needs to read the System.map
       file created when the kernel is compiled. The search path is:
              $PS_SYSTEM_MAP
              /boot/System.map-`uname -r`
              /boot/System.map
              /lib/modules/`uname -r`/System.map
              /usr/src/linux/System.map
              /System.map
[...]

If that's working as planned please send me the WCHAN of any stuck process.
I need to know where they're stuck.
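
[Editorial sketch] The lookup described in the ps(1) excerpt can be approximated by hand when ps cannot find a System.map. The Python sketch below builds the candidate paths in the quoted order and maps an address to the nearest preceding symbol. It assumes the usual three-column "address type name" System.map layout; the function names are hypothetical, and the addresses in the sample map are invented for illustration and do not come from any real kernel build.

```python
import bisect

def system_map_candidates(release, env):
    """Candidate System.map paths in the order quoted from ps(1)."""
    paths = []
    if env.get("PS_SYSTEM_MAP"):
        paths.append(env["PS_SYSTEM_MAP"])
    paths += [
        "/boot/System.map-" + release,
        "/boot/System.map",
        "/lib/modules/" + release + "/System.map",
        "/usr/src/linux/System.map",
        "/System.map",
    ]
    return paths

def load_symbols(map_text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in map_text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(address, symbols):
    """Name of the last symbol at or below address, or None if there is none."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    return symbols[index][1] if index >= 0 else None

# Invented addresses, for illustration only.
SAMPLE_MAP = """\
80001000 T schedule_timeout
80001200 T pipe_wait
80001400 T sys_wait4
"""

symbols = load_symbols(SAMPLE_MAP)
print(resolve(0x80001210, symbols))  # pipe_wait
```

In practice release would come from `uname -r' and env from the process environment (os.environ), and the first candidate path that exists would be the file to load.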

  Ralf

--- arch/mips/kernel/process.c	2000/10/05 01:18:43	1.21
+++ arch/mips/kernel/process.c	2000/10/24 14:54:29
@@ -203,18 +203,9 @@
 		return 0;
 
 	pc = thread_saved_pc(&p->thread);
-	if (pc == (unsigned long) interruptible_sleep_on
-	    || pc == (unsigned long) sleep_on) {
-		schedule_frame = ((unsigned long *)p->thread.reg30)[9];
-		return ((unsigned long *)schedule_frame)[15];
-	}
-	if (pc == (unsigned long) interruptible_sleep_on_timeout
-	    || pc == (unsigned long) sleep_on_timeout) {
-		schedule_frame = ((unsigned long *)p->thread.reg30)[9];
-		return ((unsigned long *)schedule_frame)[16];
-	}
 	if (pc >= first_sched && pc < last_sched) {
-		printk(KERN_DEBUG "Bug in %s\n", __FUNCTION__);
+		schedule_frame = ((unsigned long *)p->thread.reg30)[9];
+		return ((unsigned long *)schedule_frame)[11];
 	}
 
 	return pc;


* Re: process lockups
  2000-10-24  1:22 process lockups Karsten Merker
  2000-10-24  2:47 ` Ralf Baechle
@ 2000-10-24 12:25 ` Florian Lohoff
  1 sibling, 0 replies; 9+ messages in thread
From: Florian Lohoff @ 2000-10-24 12:25 UTC (permalink / raw)
  To: linux-mips

On Tue, Oct 24, 2000 at 03:22:32AM +0200, Karsten Merker wrote:
> Hello everyone,
> 
> I am running Kernel 2.4.0-test9 on a DECstation 5000/150. I am
> experiencing a strange behaviour when having strong I/O-load, such as
> running a "tar xvf foobar.tgz" with a large archive. After some time of
> activity the process (in this case tar) is stuck in status "D". There is
> neither an entry in the syslog nor on the console that would give me a
> hint what is happening. Is anyone else experiencing this?

I have not seen this on my /150, although I have not been running -test9. I
have that machine at home right now, so I'll check whether I can reproduce
this.

> Another thing I see on my 5000/150 (and only there - this is my only
> R4K-machine, so I do not know whether this is CPU- or machine-type-bound)
> is "top" going weird, eating lots of CPU cycles and spitting messages
> "schedule_timeout: wrong timeout value fffbd0b2 from 800900f8; Setting
> flush to zero for top". I know Florian also has this on his 5000/150.
> Anyone else with the same behaviour or any idea about the cause for this?

I have seen this too. I guess it is DECstation-specific, as I can't seem
to reproduce it on the I2.

Flo
-- 
Florian Lohoff		flo@rfc822.org		      	+49-5201-669912
      "Write only memory - Oops. Time for my medication again ..."

