* Processes stuck in unkillable D state (now seen in 2.6.7-mm6)
@ 2004-07-08 22:15 Rob Mueller
2004-07-09 12:58 ` Chris Mason
0 siblings, 1 reply; 10+ messages in thread
From: Rob Mueller @ 2004-07-08 22:15 UTC (permalink / raw)
To: linux-kernel; +Cc: Chris Mason
This is an update to a thread I started last week about processes getting
stuck in D state.
About 2 days ago, we upgraded to 2.6.7-mm6. Things have generally been
running fine, but today again, some processes got stuck in an unkillable D
state. This time, rather than 1 process getting stuck however, about 20 got
stuck in a relatively short period of time (seems to have been over about
half an hour). All of the processes are cyrus imapd processes.
I've tried to get sysreq-t output, but as this machine is still up and
running, it has about 2500 processes on it, and I can't seem to get
consistent sysreq-t output. I set the kernel log buffer size to 17 (128k)
but that definitely doesn't seem to be enough. I notice that it also seems
to dump to /var/log/messages, and I get more output there, but it still
doesn't seem to be a complete process list, and each time I do a sysreq-t, I
get a different number of procs (though always incomplete) in the output.
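For scale, the "17 (128k)" above is a power-of-two shift (presumably the CONFIG_LOG_BUF_SHIFT setting), so the arithmetic can be sketched as below. This is only a rough estimate: the figure of about 1 KiB of sysrq-t text per task is an assumption made for illustration, not something measured in this thread.

```python
def log_buf_bytes(shift):
    """Size in bytes of a log buffer configured as a power-of-two shift."""
    return 1 << shift


def shift_needed(total_bytes):
    """Smallest shift whose buffer holds at least total_bytes."""
    return max(0, (total_bytes - 1).bit_length())


# Assumption for illustration only: about 1 KiB of sysrq-t text per task.
BYTES_PER_TASK = 1024
TASKS = 2500

print(log_buf_bytes(17))                     # 131072 bytes = 128 KiB
print(shift_needed(TASKS * BYTES_PER_TASK))  # 22, i.e. a 4 MiB buffer
```

Under that assumption a 128k buffer holds on the order of a hundred tasks, which would be consistent with the truncated and varying output described above.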
Anyway, I've done sysreq-t twice, and got the output from dmesg -s 1000000
and /var/log/messages. Since the output is so big, I've put them, and the
kernel config here:
http://robm.fastmail.fm/kernel/t1/
Process IDs that are definitely stuck are:
1013, 13389, 13469, 16056, 17340, 18489, 21341, 22661, 23976, 29138, 29752,
30330, 31106, 31956, 32559, 32575, 3753, 5926, 6052, 8857, 9914
But as mentioned above, you won't find most of these in the sysreq-t output,
I presume because the buffer isn't big enough. Still, hopefully the ones you
can see there will be some useful information. (FYI, searching for imapd\s+D
in the sysreq-t output rather than the individual pids seems to be a quicker
way of finding the problem procs)
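The search described above can be sketched in Python as follows. This is a minimal sketch that assumes the task header layout seen in this thread (command, state, an 8-digit hex address, a number, then the PID); the function name d_state_tasks is made up for the example, and other kernel versions may print the header differently.

```python
import re

# Task header as it appears in this thread's sysrq-t output:
#   comm  D  <hex address>  <number>  <pid>  ...
# Not anchored to line start, so a syslog prefix in front is tolerated.
TASK_RE = re.compile(
    r"(\S+)[ \t]+D[ \t]+[0-9A-Fa-f]{8,16}[ \t]+\d+[ \t]+(\d+)\b"
)


def d_state_tasks(dump, comm=None):
    """Return (comm, pid) for every task header line showing state D."""
    found = [(m.group(1), int(m.group(2))) for m in TASK_RE.finditer(dump)]
    return [t for t in found if comm is None or t[0] == comm]


sample = """\
imapd D F1778660 0 3753 1906 3754 809 (NOTLB)
eb15adb8 00000086 00000020 f1778660 c0310318 c43fc600 08155888 0000002d
imapd D E59812C0 0 22661 1906 23248 22592 (NOTLB)
"""
print(d_state_tasks(sample, comm="imapd"))
# [('imapd', 3753), ('imapd', 22661)]
```

Comparing the returned PIDs against the list of known-stuck PIDs shows quickly which stuck tasks made it into a given dump.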
Having a quick look myself, there are some odd things there though. For
instance, from sysreqmsglog1.txt
imapd D F1778660 0 3753 1906 3754 809 (NOTLB)
eb15adb8 00000086 00000020 f1778660 c0310318 c43fc600 08155888 0000002d
f567d380 f7b97480 c42c3d20 00000000 0001ece6 6051d45f 00007c67 c42c3d20
c03d8180 f1778660 f1778810 f78ad9cc 00000003 f78ad9cc f78ad9cc c025d40c
Call Trace:
[<c0310318>] memcpy_fromiovec+0x38/0x60
[<c025d40c>] generic_unplug_device+0x2c/0x40
[<c037a288>] io_schedule+0x28/0x40
[<c012e17c>] __lock_page+0xbc/0xe0
[<c012deb0>] page_wake_function+0x0/0x50
[<c012deb0>] page_wake_function+0x0/0x50
[<c012f1a1>] filemap_nopage+0x231/0x360
[<c013dd58>] do_no_page+0xb8/0x3a0
[<c013bbbb>] pte_alloc_map+0xdb/0xf0
[<c013e1ee>] handle_mm_fault+0xbe/0x1a0
[<c0112c62>] do_page_fault+0x172/0x5ec
[<c012435b>] do_sigaction+0x19b/0x210
[<c0120dac>] update_process_times+0x2c/0x40
[<c0110230>] smp_apic_timer_interrupt+0x140/0x150
[<c0112af0>] do_page_fault+0x0/0x5ec
[<c0104b19>] error_code+0x2d/0x38
imapd D E59812C0 0 22661 1906 23248 22592 (NOTLB)
d54f5db8 00000086 f7b7de18 e59812c0 d54f5d94 c04b0dc0 00000020 00000000
c42c3060 f71696f0 c42c3d20 00000000 0002cda6 891b682d 00007b15 c42c3d20
f71696f0 e59812c0 e5981470 00000003 c025d3bb f78ad9cc f78ad9cc c025d40c
Call Trace:
[<c025d3bb>] __generic_unplug_device+0x1b/0x40
[<c025d40c>] generic_unplug_device+0x2c/0x40
[<c037a288>] io_schedule+0x28/0x40
[<c012e17c>] __lock_page+0xbc/0xe0
[<c012deb0>] page_wake_function+0x0/0x50
[<c012deb0>] page_wake_function+0x0/0x50
[<c012f1a1>] filemap_nopage+0x231/0x360
[<c013dd58>] do_no_page+0xb8/0x3a0
[<c013bbbb>] pte_alloc_map+0xdb/0xf0
[<c013e1ee>] handle_mm_fault+0xbe/0x1a0
[<c0112af0>] do_page_fault+0x0/0x5ec
[<c0104a5a>] apic_timer_interrupt+0x1a/0x20
[<c0112c62>] do_page_fault+0x172/0x5ec
[<c012435b>] do_sigaction+0x19b/0x210
[<c0124693>] sys_rt_sigaction+0x53/0x90
[<c030c631>] sys_socketcall+0x111/0x200
[<c0112af0>] do_page_fault+0x0/0x5ec
[<c0104b19>] error_code+0x2d/0x38
Those calls into "generic_unplug_device" look really strange to me...
Rob
* Re: Processes stuck in unkillable D state (now seen in 2.6.7-mm6)
@ 2004-07-09 9:52 Rob Mueller
0 siblings, 0 replies; 10+ messages in thread
From: Rob Mueller @ 2004-07-09 9:52 UTC (permalink / raw)
To: linux-kernel; +Cc: Chris Mason
As an update, this machine eventually ended up dying pretty horribly. When
we found it, there were hundreds of procs stuck in D state, and our automated
"ping" script was reporting all sorts of problems. Anyway we killed off as
many processes as possible, and did a sysreq-t before trying to reboot. The
soft reboot failed, and a hard reboot was required.
The newly gathered sysreq output has been placed in the files
sysreqdmesg3.txt and sysreqmsglog3.txt here:
http://robm.fastmail.fm/kernel/t1/
Rob
* Re: Processes stuck in unkillable D state (now seen in 2.6.7-mm6)
2004-07-08 22:15 Processes stuck in unkillable D state (now seen in 2.6.7-mm6) Rob Mueller
@ 2004-07-09 12:58 ` Chris Mason
2004-07-12 19:53 ` Rob Mueller
2004-07-15 16:12 ` Processes stuck in unkillable D state (now seen in 2.6.8-rc1) Rob Mueller
0 siblings, 2 replies; 10+ messages in thread
From: Chris Mason @ 2004-07-09 12:58 UTC (permalink / raw)
To: Rob Mueller; +Cc: linux-kernel
On Thu, 2004-07-08 at 18:15, Rob Mueller wrote:
> This is an update to a thread I started last week about processes getting
> stuck in D state.
>
> About 2 days ago, we upgraded to 2.6.7-mm6. Things have generally been
> running fine, but today again, some processes got stuck in an unkillable D
> state. This time, rather than 1 process getting stuck however, about 20 got
> stuck in a relatively short period of time (seems to have been over about
> half an hour). All of the processes are cyrus imapd processes.
>
> I've tried to get sysreq-t output, but as this machine is still up and
> running, it has about 2500 processes on it, and I can't seem to get
> consistent sysreq-t output. I set the kernel log buffer size to 17 (128k)
> but that definitely doesn't seem to be enough. I notice that it also seems
> to dump to /var/log/messages, and I get more output there, but it still
> doesn't seem to be a complete process list, and each time I do a sysreq-t, I
> get a different number of procs (though always incomplete) in the output.
> Anyway, I've done sysreq-t twice, and got the output from dmesg -s 1000000
> and /var/log/messages. Since the output is so big, I've put them, and the
> kernel config here:
>
Things will be much easier for you if you configure a serial or network
console.
> Having a quick look myself, there are some odd things there though. For
> instance, from sysreqmsglog1.txt
>
> imapd D F1778660 0 3753 1906 3754 809 (NOTLB)
> eb15adb8 00000086 00000020 f1778660 c0310318 c43fc600 08155888 0000002d
> f567d380 f7b97480 c42c3d20 00000000 0001ece6 6051d45f 00007c67 c42c3d20
> c03d8180 f1778660 f1778810 f78ad9cc 00000003 f78ad9cc f78ad9cc c025d40c
> Call Trace:
> [<c0310318>] memcpy_fromiovec+0x38/0x60
> [<c025d40c>] generic_unplug_device+0x2c/0x40
> [<c037a288>] io_schedule+0x28/0x40
> [<c012e17c>] __lock_page+0xbc/0xe0
> [<c012deb0>] page_wake_function+0x0/0x50
> [<c012deb0>] page_wake_function+0x0/0x50
> [<c012f1a1>] filemap_nopage+0x231/0x360
> [<c013dd58>] do_no_page+0xb8/0x3a0
> [<c013bbbb>] pte_alloc_map+0xdb/0xf0
> [<c013e1ee>] handle_mm_fault+0xbe/0x1a0
> [<c0112c62>] do_page_fault+0x172/0x5ec
> [<c012435b>] do_sigaction+0x19b/0x210
> [<c0120dac>] update_process_times+0x2c/0x40
> [<c0110230>] smp_apic_timer_interrupt+0x140/0x150
> [<c0112af0>] do_page_fault+0x0/0x5ec
> [<c0104b19>] error_code+0x2d/0x38
>
> Those calls into "generic_unplug_device" look really strange to me...
It's just crud on the stack, you're really waiting in io_schedule() for
a page to get unlocked. Why isn't the page unlocking? Hard to say for
sure without seeing the whole sysrq-t. If the network/serial console
doesn't work out, I can help you configure lkcd as well.
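To read a trace like this one, it can help to parse the entries and look only from the wait point downward. The sketch below is a reading aid, not a stack unwinder: it assumes the `[<address>] symbol+0xoffset/0xsize` entry format shown in this thread, and the helper names parse_frames and symbols_from are invented for the example.

```python
import re

# One call-trace entry: [<address>] symbol+0xoffset/0xsize
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frames(trace):
    """Return (address, symbol, offset, size) for each call-trace entry."""
    return [
        (int(addr, 16), sym, int(off, 16), int(size, 16))
        for addr, sym, off, size in FRAME_RE.findall(trace)
    ]


def symbols_from(trace, wait_symbol="io_schedule"):
    """Symbols from wait_symbol downward; entries printed above it are skipped."""
    syms = [frame[1] for frame in parse_frames(trace)]
    return syms[syms.index(wait_symbol):] if wait_symbol in syms else syms


trace = """\
[<c0310318>] memcpy_fromiovec+0x38/0x60
[<c025d40c>] generic_unplug_device+0x2c/0x40
[<c037a288>] io_schedule+0x28/0x40
[<c012e17c>] __lock_page+0xbc/0xe0
[<c012deb0>] page_wake_function+0x0/0x50
[<c012f1a1>] filemap_nopage+0x231/0x360
"""
print(symbols_from(trace)[:2])  # ['io_schedule', '__lock_page']
```

Entries with a +0x0 offset, such as page_wake_function here, are very likely stored function pointers rather than return addresses, so they are probably not real frames either.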
-chris
* Re: Processes stuck in unkillable D state (now seen in 2.6.7-mm6)
2004-07-09 12:58 ` Chris Mason
@ 2004-07-12 19:53 ` Rob Mueller
2004-07-12 20:11 ` William Lee Irwin III
2004-07-20 19:51 ` Chris Mason
2004-07-15 16:12 ` Processes stuck in unkillable D state (now seen in 2.6.8-rc1) Rob Mueller
1 sibling, 2 replies; 10+ messages in thread
From: Rob Mueller @ 2004-07-12 19:53 UTC (permalink / raw)
To: Chris Mason; +Cc: linux-kernel
> Things will be much easier for you if you configure a serial or network
> console.
> It's just crud on the stack, you're really waiting in io_schedule() for
> a page to get unlocked. Why isn't the page unlocking? Hard to say for
> sure without seeing the whole sysrq-t. If the network/serial console
> doesn't work out, I can help you configure lkcd as well.
Well, I tried compiling in the network console, but it seems to be way too
buggy. Basically the machine would crash (hard lockup) within about 12-24
hours after booting, nothing on the network console itself or in any log
file. Not much help there.
Anyway, after rebooting back into a non-netconsole enabled kernel, we did
get another stuck process. This time there was only 1, and I was able to
shut down all the other processes, so that there were only about 50 procs
running when I did the sysreq-t command, so I should have been able to
capture all the output this time??? I've put the dumps here:
http://robm.fastmail.fm/kernel/t2/
Here's the relevant stuck proc.
imapd D E17BE6E0 0 3761 1 10291 (NOTLB)
e11c3bc8 00000086 00000020 e17be6e0 c1372d20 00000246 00000220 f7e12380
00000020 c0136667 c42c6da0 00000001 00000d74 bbfe8a6a 0000040d c42c6da0
f7f91140 e17be6e0 e17be890 f78cd9cc 00000003 f78cd9cc f78cd9cc c025d2cc
Call Trace:
[<c0136667>] kmem_cache_alloc+0x57/0x70
[<c025d2cc>] generic_unplug_device+0x2c/0x40
[<c037a148>] io_schedule+0x28/0x40
[<c012e03c>] __lock_page+0xbc/0xe0
[<c012dd70>] page_wake_function+0x0/0x50
[<c012dd70>] page_wake_function+0x0/0x50
[<c012f061>] filemap_nopage+0x231/0x360
[<c013dc18>] do_no_page+0xb8/0x3a0
[<c013ba7b>] pte_alloc_map+0xdb/0xf0
[<c013e0ae>] handle_mm_fault+0xbe/0x1a0
[<c025d292>] __generic_unplug_device+0x32/0x40
[<c0112af2>] do_page_fault+0x172/0x5ec
[<c014cab0>] bh_wake_function+0x0/0x40
[<c014cab0>] bh_wake_function+0x0/0x40
[<c018ec9f>] reiserfs_prepare_file_region_for_write+0x94f/0x9b0
[<c0112980>] do_page_fault+0x0/0x5ec
[<c0104b19>] error_code+0x2d/0x38
[<c018dc0f>] reiserfs_copy_from_user_to_file_region+0x8f/0x100
[<c018f2b1>] reiserfs_file_write+0x5b1/0x750
[<c0186675>] reiserfs_link+0xb5/0x190
[<c0186719>] reiserfs_link+0x159/0x190
[<c016134c>] dput+0x1c/0x1b0
[<c016134c>] dput+0x1c/0x1b0
[<c01581a0>] path_release+0x10/0x40
[<c015a9bc>] sys_link+0xcc/0xe0
[<c014bb9a>] vfs_write+0xaa/0xe0
[<c014b610>] default_llseek+0x0/0x110
[<c014bc4f>] sys_write+0x2f/0x50
[<c010406b>] syscall_call+0x7/0xb
Is that in lock_page again?
Hopefully there's some helpful information there. If the dump there isn't
complete, can you give me an idea why it might not be? I've set the kernel
buffer to 17 (128k), and the proc list was definitely small enough to fit in
the buffer. When I did "dmesg -s 1000000 > foo", the first part of the file
was still the original boot sequence. Any other suggestions on what to do?
Rob
* Re: Processes stuck in unkillable D state (now seen in 2.6.7-mm6)
2004-07-12 19:53 ` Rob Mueller
@ 2004-07-12 20:11 ` William Lee Irwin III
2004-07-12 20:14 ` Rob Mueller
2004-07-20 19:51 ` Chris Mason
1 sibling, 1 reply; 10+ messages in thread
From: William Lee Irwin III @ 2004-07-12 20:11 UTC (permalink / raw)
To: Rob Mueller; +Cc: Chris Mason, linux-kernel
At some point in the past, someone's attribution was removed from:
>> Things will be much easier for you if you configure a serial or network
>> console.
>> It's just crud on the stack, you're really waiting in io_schedule() for
>> a page to get unlocked. Why isn't the page unlocking? Hard to say for
>> sure without seeing the whole sysrq-t. If the network/serial console
>> doesn't work out, I can help you configure lkcd as well.
On Mon, Jul 12, 2004 at 12:53:44PM -0700, Rob Mueller wrote:
> Well, I tried compiling in the network console, but it seems to be way too
> buggy. Basically the machine would crash (hard lockup) within about 12-24
> hours after booting, nothing on the network console itself or in any log
> file. Not much help there.
> Anyway, after rebooting back into a non-netconsole enabled kernel, we did
> get another stuck process. This time there was only 1, and I was able to
> shutdown all the other processes, so that there were only about 50 procs
> running when I did the sysreq-t command, so I should have been able to
> capture all the output this time??? I've put the dumps here:
> http://robm.fastmail.fm/kernel/t2/
> Here's the relevant stuck proc.
I have also experienced no end of aggravation at the hands of hardware
vendors who saw fit to remove serial in the interest of legacy-freeness
with no adequate replacement whatsoever.
On Mon, Jul 12, 2004 at 12:53:44PM -0700, Rob Mueller wrote:
> imapd D E17BE6E0 0 3761 1 10291 (NOTLB)
> e11c3bc8 00000086 00000020 e17be6e0 c1372d20 00000246 00000220 f7e12380
> 00000020 c0136667 c42c6da0 00000001 00000d74 bbfe8a6a 0000040d c42c6da0
> f7f91140 e17be6e0 e17be890 f78cd9cc 00000003 f78cd9cc f78cd9cc c025d2cc
> Call Trace:
> [<c0136667>] kmem_cache_alloc+0x57/0x70
> [<c025d2cc>] generic_unplug_device+0x2c/0x40
> [<c037a148>] io_schedule+0x28/0x40
> [<c012e03c>] __lock_page+0xbc/0xe0
> [<c012dd70>] page_wake_function+0x0/0x50
> [<c012f061>] filemap_nopage+0x231/0x360
> [<c013dc18>] do_no_page+0xb8/0x3a0
> [<c013ba7b>] pte_alloc_map+0xdb/0xf0
> [<c013e0ae>] handle_mm_fault+0xbe/0x1a0
> [<c025d292>] __generic_unplug_device+0x32/0x40
> [<c0112af2>] do_page_fault+0x172/0x5ec
> [<c014cab0>] bh_wake_function+0x0/0x40
> [<c018ec9f>] reiserfs_prepare_file_region_for_write+0x94f/0x9b0
> [<c0112980>] do_page_fault+0x0/0x5ec
> [<c0104b19>] error_code+0x2d/0x38
> [<c018dc0f>] reiserfs_copy_from_user_to_file_region+0x8f/0x100
> [<c018f2b1>] reiserfs_file_write+0x5b1/0x750
> [<c0186719>] reiserfs_link+0x159/0x190
> [<c016134c>] dput+0x1c/0x1b0
> [<c01581a0>] path_release+0x10/0x40
> [<c015a9bc>] sys_link+0xcc/0xe0
> [<c014bb9a>] vfs_write+0xaa/0xe0
> [<c014b610>] default_llseek+0x0/0x110
> [<c014bc4f>] sys_write+0x2f/0x50
> [<c010406b>] syscall_call+0x7/0xb
> Is that in lock_page again?
> Hopefully there's some helpful information there. If the dump there isn't
> complete, can you give me an idea why it might not be? I've set the kernel
> buffer to 17 (128k), and the proc list was definitely small enough to fit
> in the buffer. When I did "dmesg -s 1000000 > foo", the first part of the
> file was still the original boot sequence. Any other suggestions on what to
> do?
Nice, deep stack there; however, this appears to only be one process. It
may be helpful to see the others.
-- wli
* Re: Processes stuck in unkillable D state (now seen in 2.6.7-mm6)
2004-07-12 20:11 ` William Lee Irwin III
@ 2004-07-12 20:14 ` Rob Mueller
2004-07-12 20:25 ` William Lee Irwin III
0 siblings, 1 reply; 10+ messages in thread
From: Rob Mueller @ 2004-07-12 20:14 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Chris Mason, linux-kernel
> Nice, deep stack there; however, this appears to only be one process. It
> may be helpful to see the others.
I've put the dumps here. I did sysreq-t twice, thus the 2 dumps. If you diff
them, you'll see they're very very similar.
http://robm.fastmail.fm/kernel/t2/
Rob
* Re: Processes stuck in unkillable D state (now seen in 2.6.7-mm6)
2004-07-12 20:14 ` Rob Mueller
@ 2004-07-12 20:25 ` William Lee Irwin III
0 siblings, 0 replies; 10+ messages in thread
From: William Lee Irwin III @ 2004-07-12 20:25 UTC (permalink / raw)
To: Rob Mueller; +Cc: Chris Mason, linux-kernel
At some point in the past, my attribution was shamelessly removed from:
>> Nice, deep stack there; however, this appears to only be one process. It
>> may be helpful to see the others.
On Mon, Jul 12, 2004 at 01:14:12PM -0700, Rob Mueller wrote:
> I've put the dumps here. I did sysreq-t twice, thus the 2 dumps. If you
> diff them, you'll see they're very very similar.
> http://robm.fastmail.fm/kernel/t2/
Hmm, I wonder which of the two lock_page()'s in filemap_nopage() this is.
-- wli
* Re: Processes stuck in unkillable D state (now seen in 2.6.8-rc1)
2004-07-09 12:58 ` Chris Mason
2004-07-12 19:53 ` Rob Mueller
@ 2004-07-15 16:12 ` Rob Mueller
1 sibling, 0 replies; 10+ messages in thread
From: Rob Mueller @ 2004-07-15 16:12 UTC (permalink / raw)
To: Chris Mason, William Lee Irwin III; +Cc: linux-kernel
I upgraded the kernel on a couple of the machines to 2.6.8-rc1 (compiled
with debug symbols), but now we've seen two completely different types of
failures:
1. The same as the old one, where a couple of processes (half dozen in this
case) would get stuck in D state, but the machine was otherwise pretty much
fine
2. A new one where over 1000 processes get stuck within a short period of
time and leave the machine in a very fragile state (even attempts to run
'ps -auxw' freeze up)
I've placed all the results here:
http://robm.fastmail.fm/kernel/t3/
sysreqdmesg1-s1.txt - output of sysreq-t for system with a few procs in D
state
sysreqdmesg2-s1.txt - same again, just done a second time
sysreqdmesg1-s2.txt - output of sysreq-t for system with 1000 procs in D
state
vmlinux.gz - kernel image, built with debug symbols
config - config used to compile the kernel
Is there anything else I can provide? This problem is driving us crazy and
I'd like to help in any way possible to try and get it investigated and
resolved.
Rob
* Re: Processes stuck in unkillable D state (now seen in 2.6.7-mm6)
2004-07-12 19:53 ` Rob Mueller
2004-07-12 20:11 ` William Lee Irwin III
@ 2004-07-20 19:51 ` Chris Mason
2004-07-20 21:19 ` Rob Mueller
1 sibling, 1 reply; 10+ messages in thread
From: Chris Mason @ 2004-07-20 19:51 UTC (permalink / raw)
To: Rob Mueller, wli, akpm; +Cc: linux-kernel
On Mon, 2004-07-12 at 15:53, Rob Mueller wrote:
> Here's the relevant stuck proc.
>
> imapd D E17BE6E0 0 3761 1 10291 (NOTLB)
> e11c3bc8 00000086 00000020 e17be6e0 c1372d20 00000246 00000220 f7e12380
> 00000020 c0136667 c42c6da0 00000001 00000d74 bbfe8a6a 0000040d c42c6da0
> f7f91140 e17be6e0 e17be890 f78cd9cc 00000003 f78cd9cc f78cd9cc c025d2cc
> Call Trace:
> [<c0136667>] kmem_cache_alloc+0x57/0x70
> [<c025d2cc>] generic_unplug_device+0x2c/0x40
> [<c037a148>] io_schedule+0x28/0x40
> [<c012e03c>] __lock_page+0xbc/0xe0
> [<c012dd70>] page_wake_function+0x0/0x50
> [<c012dd70>] page_wake_function+0x0/0x50
> [<c012f061>] filemap_nopage+0x231/0x360
> [<c013dc18>] do_no_page+0xb8/0x3a0
> [<c013ba7b>] pte_alloc_map+0xdb/0xf0
> [<c013e0ae>] handle_mm_fault+0xbe/0x1a0
> [<c025d292>] __generic_unplug_device+0x32/0x40
> [<c0112af2>] do_page_fault+0x172/0x5ec
> [<c014cab0>] bh_wake_function+0x0/0x40
> [<c014cab0>] bh_wake_function+0x0/0x40
> [<c018ec9f>] reiserfs_prepare_file_region_for_write+0x94f/0x9b0
> [<c0112980>] do_page_fault+0x0/0x5ec
> [<c0104b19>] error_code+0x2d/0x38
> [<c018dc0f>] reiserfs_copy_from_user_to_file_region+0x8f/0x100
> [<c018f2b1>] reiserfs_file_write+0x5b1/0x750
> [<c0186675>] reiserfs_link+0xb5/0x190
> [<c0186719>] reiserfs_link+0x159/0x190
> [<c016134c>] dput+0x1c/0x1b0
> [<c016134c>] dput+0x1c/0x1b0
> [<c01581a0>] path_release+0x10/0x40
> [<c015a9bc>] sys_link+0xcc/0xe0
> [<c014bb9a>] vfs_write+0xaa/0xe0
> [<c014b610>] default_llseek+0x0/0x110
> [<c014bc4f>] sys_write+0x2f/0x50
> [<c010406b>] syscall_call+0x7/0xb
>
> Is that in lock_page again?
>
> Hopefully there's some helpful information there. If the dump there isn't
> complete, can you give me an idea why it might not be? I've set the kernel
> buffer to 17 (128k), and the proc list was definitely small enough to fit in
> the buffer. When I did "dmesg -s 1000000 > foo", the first part of the file
> was still the original boot sequence. Any other suggestions on what to do?
Ugh, so the call path here is:
reiserfs_file_write -> start a transaction
copy_from_user -> fault in the page
page fault handler -> lock page
This means we're trying to lock a page with a running transaction, and
that's not allowed, since some other process on the box most likely has
that page locked and is trying to start a transaction.
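The shape of this deadlock can be sketched as a wait-for graph with a cycle. This is a deliberately simplified model, not kernel code: it treats the running transaction and the page lock as two plain resources, and the task names "writer" and "other" are hypothetical labels for the stuck imapd and the second process described above.

```python
def find_cycle(waits_for):
    """Return the tasks forming a wait cycle, or None if there is none.

    waits_for maps each blocked task to the task it is waiting on.
    """
    for start in waits_for:
        seen = [start]
        cur = waits_for.get(start)
        while cur is not None and cur not in seen:
            seen.append(cur)
            cur = waits_for.get(cur)
        if cur is not None:
            return seen[seen.index(cur):]
    return None


# Simplified model: each task holds one resource and wants the other.
holds = {"writer": "transaction", "other": "page"}
wants = {"writer": "page", "other": "transaction"}

owner = {res: task for task, res in holds.items()}
waits_for = {task: owner[res] for task, res in wants.items() if res in owner}

print(find_cycle(waits_for))  # ['writer', 'other']
```

Neither task can make progress until the other releases what it holds, which matches the unkillable D state: both sleep uninterruptibly and nothing ever wakes them.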
That makes for 3 different deadlocks in this exact same call path
(dirty_inode, lock_page and kmap), and my patch for it has major
problems. So, I'll talk things over with everyone during OLS and try to
work out a proper fix.
Sorry Rob, this one is non-trivial.
-chris
* Re: Processes stuck in unkillable D state (now seen in 2.6.7-mm6)
2004-07-20 19:51 ` Chris Mason
@ 2004-07-20 21:19 ` Rob Mueller
0 siblings, 0 replies; 10+ messages in thread
From: Rob Mueller @ 2004-07-20 21:19 UTC (permalink / raw)
To: Chris Mason, wli, akpm; +Cc: linux-kernel
> Ugh, so the call path here is:
>
> reiserfs_file_write -> start a transaction
> copy_from_user -> fault in the page
> page fault handler -> lock page
>
> This means we're trying to lock a page with a running transaction, and
> that's not allowed, since some other process on the box most likely has
> that page locked and is trying to start a transaction.
>
> That makes for 3 different deadlocks in this exact same call path
> (dirty_inode, lock_page and kmap), and my patch for it has major
> problems. So, I'll talk things over with everyone during OLS and try to
> work out a proper fix.
>
> Sorry Rob, this one is non-trivial.
Thanks for looking at it Chris. At least it seems that there is now a
diagnosis of what's happening, which can be half the battle!
I'm surprised that this seems so rare, and that no-one else has reported it
as a significant problem before. Do you think there's anything in particular
about our kernel config that would be causing this to happen?
Rob