All of lore.kernel.org
 help / color / mirror / Atom feed
* [2.6.23] tasks stuck in running state?
@ 2007-10-19 21:39 Jeff Garzik
  2007-10-19 21:53 ` Chuck Ebbert
  2007-10-20  1:08 ` Ray Lee
  0 siblings, 2 replies; 11+ messages in thread
From: Jeff Garzik @ 2007-10-19 21:39 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a 
certain behavior at least once a day.  I'll start a kernel build (make 
-sj5 on this box), and it will "hang" in the following way:

> 31003 ?        S      0:04 sshd: jgarzik@pts/0
> 31004 pts/0    Ss     0:02  \_ -bash
>  8280 pts/0    S+     0:00      \_ make ARCH=i386 -sj4
>  8690 pts/0    Z+     0:00          \_ [rm] <defunct>
>  8691 pts/0    S+     0:00          \_ /bin/sh -c cat include/config/kernel.release 2> /dev/null
>  8692 pts/0    R+     6:12              \_ cat include/config/kernel.release

Specifically, the symptom is a process, often a simple one like cat(1) 
or rm(1) or somewhere in check-headers, will stay in the running state, 
accumulating CPU time.

If I Ctrl-C the build, and start over, the build will normally -not- get 
stuck at the same point, but proceed to chew through one of a bazillion 
allmodconfig builds.

I also see this occasionally on my main workstation (also 
2.6.23/x86-64/Fedora-7), though not as frequently.

This is a new behavior since the new scheduler was merged... I think.

Nothing more concrete to report at this time.  I cannot easily reproduce 
the behavior, as it happens [apparently] randomly sometime during the 
day.  Generally, the files these programs are dealing with are -always- 
in the pagecache, if that makes any difference.

	Jeff



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [2.6.23] tasks stuck in running state?
  2007-10-19 21:39 [2.6.23] tasks stuck in running state? Jeff Garzik
@ 2007-10-19 21:53 ` Chuck Ebbert
  2007-10-19 22:03   ` Jeff Garzik
  2007-10-20  1:08 ` Ray Lee
  1 sibling, 1 reply; 11+ messages in thread
From: Chuck Ebbert @ 2007-10-19 21:53 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Linux Kernel Mailing List

On 10/19/2007 05:39 PM, Jeff Garzik wrote:
> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> certain behavior at least once a day.  I'll start a kernel build (make
> -sj5 on this box), and it will "hang" in the following way:
> 

Can you try to strace the hanging task?


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [2.6.23] tasks stuck in running state?
  2007-10-19 21:53 ` Chuck Ebbert
@ 2007-10-19 22:03   ` Jeff Garzik
  2007-10-19 22:18     ` Chuck Ebbert
  2007-10-21 17:25     ` Dmitry Adamushko
  0 siblings, 2 replies; 11+ messages in thread
From: Jeff Garzik @ 2007-10-19 22:03 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: Linux Kernel Mailing List

Chuck Ebbert wrote:
> On 10/19/2007 05:39 PM, Jeff Garzik wrote:
>> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
>> certain behavior at least once a day.  I'll start a kernel build (make
>> -sj5 on this box), and it will "hang" in the following way:
>>
> 
> Can you try to strace the hanging task?

Well, to the system it's running, so that doesn't do much of anything...

> 
>  8482 pts/0    S+     0:00                          \_ /bin/sh /garz/repo/misc-2.6/scripts/hdrcheck.sh /garz/repo/misc-2.6/usr/include /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h /garz/repo/misc-2.6/usr/include/linux/.check.kernelcapi.h
>  8484 pts/0    R+     3:10                              \_ grep ^[ \t]*#[ \t]*include[ \t]*< /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h
>  8486 pts/0    S+     0:00                              \_ cut -f2 -d<
>  8487 pts/0    S+     0:00                              \_ cut -f1 -d>
>  8488 pts/0    S+     0:00                              \_ egrep ^linux|^asm
> [jgarzik@pretzel misc-2.6]$ strace -p8484
> Process 8484 attached - interrupt to quit
[sits there, chewing up CPU grepping a 47-line header file]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [2.6.23] tasks stuck in running state?
  2007-10-19 22:03   ` Jeff Garzik
@ 2007-10-19 22:18     ` Chuck Ebbert
  2007-10-20  0:01       ` David Miller
  2007-10-21 17:25     ` Dmitry Adamushko
  1 sibling, 1 reply; 11+ messages in thread
From: Chuck Ebbert @ 2007-10-19 22:18 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Linux Kernel Mailing List

On 10/19/2007 06:03 PM, Jeff Garzik wrote:
>> [jgarzik@pretzel misc-2.6]$ strace -p8484
>> Process 8484 attached - interrupt to quit
> [sits there, chewing up CPU grepping a 47-line header file]
> 

And sysrq-p is pretty useless unless you can force the keyboard
interrupt and the spinning process onto the same CPU.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [2.6.23] tasks stuck in running state?
  2007-10-19 22:18     ` Chuck Ebbert
@ 2007-10-20  0:01       ` David Miller
  2007-10-21 15:59         ` Chuck Ebbert
  0 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2007-10-20  0:01 UTC (permalink / raw)
  To: cebbert; +Cc: jeff, linux-kernel

From: Chuck Ebbert <cebbert@redhat.com>
Date: Fri, 19 Oct 2007 18:18:08 -0400

> On 10/19/2007 06:03 PM, Jeff Garzik wrote:
> >> [jgarzik@pretzel misc-2.6]$ strace -p8484
> >> Process 8484 attached - interrupt to quit
> > [sits there, chewing up CPU grepping a 47-line header file]
> > 
> 
> And sysrq-p is pretty useless unless you can force the keyboard
> interrupt and the spinning process onto the same CPU.

Yes, I find this a painful limitation too.

Sparc64 used to dump the registers on all active cpus for show_regs()
via a cross-call, and this was incredibly useful.  But I disabled that
as soon as I started playing with Niagara because at 32 cpus and
larger the output is just too voluminous to be useful.

What might be appropriate is just to get a one-line program counter
dump on every cpu via some new sysrq keystroke.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [2.6.23] tasks stuck in running state?
  2007-10-19 21:39 [2.6.23] tasks stuck in running state? Jeff Garzik
  2007-10-19 21:53 ` Chuck Ebbert
@ 2007-10-20  1:08 ` Ray Lee
  2007-10-21 20:11   ` ext3 deadlock or " Guennadi Liakhovetski
  1 sibling, 1 reply; 11+ messages in thread
From: Ray Lee @ 2007-10-20  1:08 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Linux Kernel Mailing List

On 10/19/07, Jeff Garzik <jeff@garzik.org> wrote:
> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> certain behavior at least once a day.  I'll start a kernel build (make
> -sj5 on this box), and it will "hang" in the following way:
>
> > 31003 ?        S      0:04 sshd: jgarzik@pts/0
> > 31004 pts/0    Ss     0:02  \_ -bash
> >  8280 pts/0    S+     0:00      \_ make ARCH=i386 -sj4
> >  8690 pts/0    Z+     0:00          \_ [rm] <defunct>
> >  8691 pts/0    S+     0:00          \_ /bin/sh -c cat include/config/kernel.release 2> /dev/null
> >  8692 pts/0    R+     6:12              \_ cat include/config/kernel.release
>
> Specifically, the symptom is a process, often a simple one like cat(1)
> or rm(1) or somewhere in check-headers, will stay in the running state,
> accumulating CPU time.
>
> If I Ctrl-C the build, and start over, the build will normally -not- get
> stuck at the same point, but proceed to chew through one of a bazillion
> allmodconfig builds.

I *think* I'm seeing this with firefox under 2.6.23-rc6. I tried a
`killall -SIGSTOP firefox; killall -SIGCONT firefox` and when I looked
back it was back to life again, but that may have been a fluke.
Regardless, try that the next time it happens?

Ray

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [2.6.23] tasks stuck in running state?
  2007-10-20  0:01       ` David Miller
@ 2007-10-21 15:59         ` Chuck Ebbert
  2007-10-21 19:52           ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Ebbert @ 2007-10-21 15:59 UTC (permalink / raw)
  To: David Miller; +Cc: jeff, linux-kernel, Andrew Morton

On 10/19/2007 08:01 PM, David Miller wrote:
> 
> What might be appropriate is just to get a one-line program counter
> dump on every cpu via some new sysrq keystroke.
> 

IIRC -mm had something like this but it was buggy because we were
sending IPIs to each processor asking them to print their state.
Maybe it would work if we had a way of making them dump their
state to a memory location and then collected and printed it from
the CPU that's handling the sysrq.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [2.6.23] tasks stuck in running state?
  2007-10-19 22:03   ` Jeff Garzik
  2007-10-19 22:18     ` Chuck Ebbert
@ 2007-10-21 17:25     ` Dmitry Adamushko
  2007-10-21 17:34       ` Dmitry Adamushko
  1 sibling, 1 reply; 11+ messages in thread
From: Dmitry Adamushko @ 2007-10-21 17:25 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Ingo Molnar, Chuck Ebbert, Linux Kernel Mailing List

On 20/10/2007, Jeff Garzik <jeff@garzik.org> wrote:
> Chuck Ebbert wrote:
> > On 10/19/2007 05:39 PM, Jeff Garzik wrote:
> >> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> >> certain behavior at least once a day.  I'll start a kernel build (make
> >> -sj5 on this box), and it will "hang" in the following way:
> >>
> >
> > Can you try to strace the hanging task?
>
> Well, to the system it's running, so that doesn't do much of anything...
>
> >
> >  8482 pts/0    S+     0:00                          \_ /bin/sh /garz/repo/misc-2.6/scripts/hdrcheck.sh /garz/repo/misc-2.6/usr/include /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h /garz/repo/misc-2.6/usr/include/linux/.check.kernelcapi.h
> >  8484 pts/0    R+     3:10                              \_ grep ^[ \t]*#[ \t]*include[ \t]*< /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h
> >  8486 pts/0    S+     0:00                              \_ cut -f2 -d<
> >  8487 pts/0    S+     0:00                              \_ cut -f1 -d>
> >  8488 pts/0    S+     0:00                              \_ egrep ^linux|^asm
> > [jgarzik@pretzel misc-2.6]$ strace -p8484
> > Process 8484 attached - interrupt to quit
> [sits there, chewing up CPU grepping a 47-line header file]

I think, the following information would be helpful :

(1)  # cat /proc/PID/stat ; sleep 5; cat /proc/PID/stat

as we could know :

- whether utime or/and stime keeps increasing ;

- one of the fields is actually 'eip' ... so we could, hopefully, more
or less precisely localize the area of the code where it's looping.

(do_task_stat() in fs/proc/array.c)


(2) # cat /proc/sched_debug;
and
# cat /proc/PID/sched

(if your kernel has CONFIG_SCHED_DEBUG=y)


(3) is anything else able to run on this CPU (as I understand it's
SMP) or the task does completely locks it up.

again, one of the fields of /proc/PID/stat is the CPU# on which the
task is currently active.

Let's start a buzy-loop task on this cpu and see wheather it's able to
make any progress (TIME counter in 'ps')

# taskset -c THIS_CPU some_busy_looping_prog

TIA,

-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [2.6.23] tasks stuck in running state?
  2007-10-21 17:25     ` Dmitry Adamushko
@ 2007-10-21 17:34       ` Dmitry Adamushko
  0 siblings, 0 replies; 11+ messages in thread
From: Dmitry Adamushko @ 2007-10-21 17:34 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Ingo Molnar, Chuck Ebbert, Linux Kernel Mailing List

On 21/10/2007, Dmitry Adamushko <dmitry.adamushko@gmail.com> wrote:
> On 20/10/2007, Jeff Garzik <jeff@garzik.org> wrote:
> > Chuck Ebbert wrote:
> > > On 10/19/2007 05:39 PM, Jeff Garzik wrote:
> > >> On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> > >> certain behavior at least once a day.  I'll start a kernel build (make
> > >> -sj5 on this box), and it will "hang" in the following way:
> > >>
> > >
> > > Can you try to strace the hanging task?
> >
> > Well, to the system it's running, so that doesn't do much of anything...
> >
> > >
> > >  8482 pts/0    S+     0:00                          \_ /bin/sh /garz/repo/misc-2.6/scripts/hdrcheck.sh /garz/repo/misc-2.6/usr/include /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h /garz/repo/misc-2.6/usr/include/linux/.check.kernelcapi.h
> > >  8484 pts/0    R+     3:10                              \_ grep ^[ \t]*#[ \t]*include[ \t]*< /garz/repo/misc-2.6/usr/include/linux/kernelcapi.h
> > >  8486 pts/0    S+     0:00                              \_ cut -f2 -d<
> > >  8487 pts/0    S+     0:00                              \_ cut -f1 -d>
> > >  8488 pts/0    S+     0:00                              \_ egrep ^linux|^asm
> > > [jgarzik@pretzel misc-2.6]$ strace -p8484
> > > Process 8484 attached - interrupt to quit
> > [sits there, chewing up CPU grepping a 47-line header file]
>
> I think, the following information would be helpful :
>
> (1)  # cat /proc/PID/stat ; sleep 5; cat /proc/PID/stat
>
> as we could know :
>
> - whether utime or/and stime keeps increasing ;
>
> - one of the fields is actually 'eip' ... so we could, hopefully, more
> or less precisely localize the area of the code where it's looping.

sure, this 'eip' will be useless in this case.

[ ... ]

-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [2.6.23] tasks stuck in running state?
  2007-10-21 15:59         ` Chuck Ebbert
@ 2007-10-21 19:52           ` David Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2007-10-21 19:52 UTC (permalink / raw)
  To: cebbert; +Cc: jeff, linux-kernel, akpm

From: Chuck Ebbert <cebbert@redhat.com>
Date: Sun, 21 Oct 2007 11:59:14 -0400

> IIRC -mm had something like this but it was buggy because we were
> sending IPIs to each processor asking them to print their state.

What's buggy about this? :-)  That's exactly how it work(ed)
on sparc64.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* ext3 deadlock or Re: [2.6.23] tasks stuck in running state?
  2007-10-20  1:08 ` Ray Lee
@ 2007-10-21 20:11   ` Guennadi Liakhovetski
  0 siblings, 0 replies; 11+ messages in thread
From: Guennadi Liakhovetski @ 2007-10-21 20:11 UTC (permalink / raw)
  To: Ray Lee; +Cc: Jeff Garzik, Linux Kernel Mailing List

On Fri, 19 Oct 2007, Ray Lee wrote:

> On 10/19/07, Jeff Garzik <jeff@garzik.org> wrote:
> > On my main devel box, vanilla 2.6.23 on x86-64/Fedora-7, I'm seeing a
> > certain behavior at least once a day.  I'll start a kernel build (make
> > -sj5 on this box), and it will "hang" in the following way:
> >
> > > 31003 ?        S      0:04 sshd: jgarzik@pts/0
> > > 31004 pts/0    Ss     0:02  \_ -bash
> > >  8280 pts/0    S+     0:00      \_ make ARCH=i386 -sj4
> > >  8690 pts/0    Z+     0:00          \_ [rm] <defunct>
> > >  8691 pts/0    S+     0:00          \_ /bin/sh -c cat include/config/kernel.release 2> /dev/null
> > >  8692 pts/0    R+     6:12              \_ cat include/config/kernel.release
> >
> > Specifically, the symptom is a process, often a simple one like cat(1)
> > or rm(1) or somewhere in check-headers, will stay in the running state,
> > accumulating CPU time.
> >
> > If I Ctrl-C the build, and start over, the build will normally -not- get
> > stuck at the same point, but proceed to chew through one of a bazillion
> > allmodconfig builds.
> 
> I *think* I'm seeing this with firefox under 2.6.23-rc6. I tried a
> `killall -SIGSTOP firefox; killall -SIGCONT firefox` and when I looked
> back it was back to life again, but that may have been a fluke.
> Regardless, try that the next time it happens?

Don't know if that's the same problem as above, but a few minutes ago my 
mail-server locked down completely. First pine froze, then more processes 
started freezing, then the system became unusable, ssh logins got stuck, 
USB- and ps/2 keyboards. I managed to get a trace with the "w" sysrq:

SysRq : Show Blocked State
  task                PC stack   pid father
syslogd       D c01275a3     0  2818      1
       e8a1fe4c 00000086 e8a1fe4c c01275a3 f7e9a200 00000282 e8a1fe5c 00b5ce60
       c1b05cc0 e8a1fe7c c030c8e7 e8a1ff30 c8d553c4 c0407348 c1b58b18 00b5ce60
       c01277d0 c1b1ba90 c0407040 000000da d5f806c0 e8a1fe84 c030c974 e8a1feb4
Call Trace:
 [<c030c8e7>] schedule_timeout+0x47/0xc0
 [<c030c974>] schedule_timeout_uninterruptible+0x14/0x20
 [<c01b94cb>] journal_stop+0xcb/0x270
 [<c01ba1fd>] journal_force_commit+0x1d/0x30
 [<c01b2265>] ext3_force_commit+0x25/0x30
 [<c01ac7dc>] ext3_write_inode+0x2c/0x40
 [<c0183f8b>] __writeback_single_inode+0x30b/0x3e0
 [<c01849b4>] sync_inode+0x24/0x60
 [<c01a8d42>] ext3_sync_file+0xc2/0xd0
 [<c0187340>] do_fsync+0x60/0xa0
 [<c01873a8>] __do_fsync+0x28/0x40
 [<c01873ed>] sys_fsync+0xd/0x10
 [<c010424e>] sysenter_past_esp+0x5f/0x85
 =======================
pine          D c01275a3     0  6910   6243
       d1f55e4c 00200082 c03c0e40 c01275a3 f7e42c80 00200282 d1f55e5c 00b5ce60
       c1b05cc0 d1f55e7c c030c8e7 d1f55f30 c8d553d8 c1b58b18 c90a9e5c 00b5ce60
       c01277d0 d8ae8a90 c0407040 000000c3 d5f806c0 d1f55e84 c030c974 d1f55eb4
Call Trace:
 [<c030c8e7>] schedule_timeout+0x47/0xc0
 [<c030c974>] schedule_timeout_uninterruptible+0x14/0x20
 [<c01b94cb>] journal_stop+0xcb/0x270
 [<c01ba1fd>] journal_force_commit+0x1d/0x30
 [<c01b2265>] ext3_force_commit+0x25/0x30
 [<c01ac7dc>] ext3_write_inode+0x2c/0x40
 [<c0183f8b>] __writeback_single_inode+0x30b/0x3e0
 [<c01849b4>] sync_inode+0x24/0x60
 [<c01a8d42>] ext3_sync_file+0xc2/0xd0
 [<c0187340>] do_fsync+0x60/0xa0
 [<c01873a8>] __do_fsync+0x28/0x40
 [<c01873ed>] sys_fsync+0xd/0x10
 [<c010424e>] sysenter_past_esp+0x5f/0x85
 =======================
sendmail      D c01275a3     0  7448   7446
       c90a9e4c 00000082 c03c0e40 c01275a3 f7e00580 00000282 c90a9e5c 00b5ce60
       c1b05cc0 c90a9e7c c030c8e7 c90a9f30 c8d553ec d1f55e5c c0407348 00b5ce60
       c01277d0 d8ae8030 c0407040 0000004b d5f806c0 c90a9e84 c030c974 c90a9eb4
Call Trace:
 [<c030c8e7>] schedule_timeout+0x47/0xc0
 [<c030c974>] schedule_timeout_uninterruptible+0x14/0x20
 [<c01b94cb>] journal_stop+0xcb/0x270
 [<c01ba1fd>] journal_force_commit+0x1d/0x30
 [<c01b2265>] ext3_force_commit+0x25/0x30
 [<c01ac7dc>] ext3_write_inode+0x2c/0x40
 [<c0183f8b>] __writeback_single_inode+0x30b/0x3e0
 [<c01849b4>] sync_inode+0x24/0x60
 [<c01a8d42>] ext3_sync_file+0xc2/0xd0
 [<c0187340>] do_fsync+0x60/0xa0
 [<c01873a8>] __do_fsync+0x28/0x40
 [<c01873ed>] sys_fsync+0xd/0x10
 [<c010424e>] sysenter_past_esp+0x5f/0x85
 =======================

now you see why I wrote "ext3 deadlock." It's y VIA C7 system, running 
2.6.23-rc9-g804b3f9a, no problems since 5 October, when the kernel has 
been built. No Oops / warnings in dmesg. Or has this been fixed since 
23-rc9?

Thanks
Guennadi
---
Guennadi Liakhovetski

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2007-10-21 20:11 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-19 21:39 [2.6.23] tasks stuck in running state? Jeff Garzik
2007-10-19 21:53 ` Chuck Ebbert
2007-10-19 22:03   ` Jeff Garzik
2007-10-19 22:18     ` Chuck Ebbert
2007-10-20  0:01       ` David Miller
2007-10-21 15:59         ` Chuck Ebbert
2007-10-21 19:52           ` David Miller
2007-10-21 17:25     ` Dmitry Adamushko
2007-10-21 17:34       ` Dmitry Adamushko
2007-10-20  1:08 ` Ray Lee
2007-10-21 20:11   ` ext3 deadlock or " Guennadi Liakhovetski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.