From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S264542AbUGINCh (ORCPT ); Fri, 9 Jul 2004 09:02:37 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org id S264629AbUGINCh (ORCPT ); Fri, 9 Jul 2004 09:02:37 -0400
Received: from cantor.suse.de ([195.135.220.2]:64906 "EHLO Cantor.suse.de") by vger.kernel.org with ESMTP id S264542AbUGINCe (ORCPT ); Fri, 9 Jul 2004 09:02:34 -0400
Subject: Re: Processes stuck in unkillable D state (now seen in 2.6.7-mm6)
From: Chris Mason
To: Rob Mueller
Cc: linux-kernel@vger.kernel.org
In-Reply-To: <00f601c46539$0bdf47a0$e6afc742@ROBMHP>
References: <00f601c46539$0bdf47a0$e6afc742@ROBMHP>
Content-Type: text/plain
Message-Id: <1089377936.3956.148.camel@watt.suse.com>
Mime-Version: 1.0
X-Mailer: Ximian Evolution 1.4.6
Date: Fri, 09 Jul 2004 08:58:56 -0400
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 2004-07-08 at 18:15, Rob Mueller wrote:
> This is an update to a thread I started last week about processes getting
> stuck in D state.
>
> About 2 days ago, we upgraded to 2.6.7-mm6. Things have generally been
> running fine, but today again, some processes got stuck in an unkillable D
> state. This time, rather than 1 process getting stuck however, about 20 got
> stuck in a relatively short period of time (seems to have been over about
> half an hour). All of processes are cyrus imapd processes.
>
> I've tried to get sysreq-t output, but as this machine is still up and
> running, it has about 2500 processes on it, and I can't seem to get
> consistent sysreq-t output. I set the kernel log buffer size to 17 (128k)
> but that definitely doesn't seem to be enough.
> I notice that it also seems
> to dump to /var/log/messages, and I get more output there, but it still
> doesn't seem to be a complete process list, and each time I do a sysreq-t, I
> get a different number of procs (though always incomplete) in the output.
> Anyway, I've done sysreq-t twice, and got the output from dmesg -s 1000000
> and /var/log/messages. Since the output is so big, I've put them, and the
> kernel config here:
>

Things will be much easier for you if you configure a serial or network
console.

> Having a quick look myself, there are some odd things there though. For
> instance, from sysreqmsglog1.txt
>
> imapd D F1778660 0 3753 1906 3754 809 (NOTLB)
> eb15adb8 00000086 00000020 f1778660 c0310318 c43fc600 08155888 0000002d
> f567d380 f7b97480 c42c3d20 00000000 0001ece6 6051d45f 00007c67
> c42c3d20
> c03d8180 f1778660 f1778810 f78ad9cc 00000003 f78ad9cc f78ad9cc
> c025d40c
> Call Trace:
> [] memcpy_fromiovec+0x38/0x60
> [] generic_unplug_device+0x2c/0x40
> [] io_schedule+0x28/0x40
> [] __lock_page+0xbc/0xe0
> [] page_wake_function+0x0/0x50
> [] page_wake_function+0x0/0x50
> [] filemap_nopage+0x231/0x360
> [] do_no_page+0xb8/0x3a0
> [] pte_alloc_map+0xdb/0xf0
> [] handle_mm_fault+0xbe/0x1a0
> [] do_page_fault+0x172/0x5ec
> [] do_sigaction+0x19b/0x210
> [] update_process_times+0x2c/0x40
> [] smp_apic_timer_interrupt+0x140/0x150
> [] do_page_fault+0x0/0x5ec
> [] error_code+0x2d/0x38
>
> Those calls into "generic_unplug_device" look really strange to me...

It's just crud on the stack, you're really waiting in io_schedule() for
a page to get unlocked.

Why isn't the page unlocking? Hard to say for sure without seeing the
whole sysrq-t. If the network/serial console doesn't work out, I can
help you configure lkcd as well.

-chris
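
Once a complete sysrq-t dump has been captured, picking out the D-state tasks and their call traces can be scripted instead of read by eye. What follows is a minimal Python sketch, not a tested tool. It assumes task header lines shaped like the imapd line quoted above (command, state, PC, free stack, PID, parent PID) and trace lines of the form "[...] symbol+0xoff/0xlen". The reading of field 5 as the PID, and the names TASK_RE, FRAME_RE and blocked_tasks, are assumptions made for illustration only. The script lists every frame the kernel printed, so stale stack entries such as generic_unplug_device still show up and have to be discounted by hand.

```python
import re
import sys

# Task header as seen in the quoted excerpt, e.g.
#   imapd D F1778660 0 3753 1906 3754 809 (NOTLB)
# Assumed field order: comm, state, PC, free stack, pid, parent pid.
TASK_RE = re.compile(
    r"(?P<comm>\S+)\s+(?P<state>[RSDTZ])\s+"
    r"(?:[0-9A-Fa-f]{8}|current)\s+\d+\s+(?P<pid>\d+)\s+(?P<ppid>\d+)"
)

# Trace frame, with or without the address inside the brackets, e.g.
#   [<c013a2bc>] __lock_page+0xbc/0xe0   or   [] __lock_page+0xbc/0xe0
FRAME_RE = re.compile(
    r"\[(?:<[0-9A-Fa-f]+>)?\]\s+(?P<sym>[\w.]+)\+0x[0-9A-Fa-f]+/0x[0-9A-Fa-f]+"
)


def blocked_tasks(lines):
    """Return a list of dicts (comm, pid, trace) for tasks in state D."""
    tasks = []
    current = None  # the D-state task whose trace is being collected
    for line in lines:
        header = TASK_RE.search(line)
        if header:
            current = None  # a new task starts; forget the previous one
            if header.group("state") == "D":
                current = {
                    "comm": header.group("comm"),
                    "pid": int(header.group("pid")),
                    "trace": [],
                }
                tasks.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["trace"].append(frame.group("sym"))
    return tasks


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python dstate.py <captured sysrq-t log>
    with open(sys.argv[1], errors="replace") as fh:
        for task in blocked_tasks(fh):
            print(task["comm"], task["pid"], " <- ".join(task["trace"]))
```

Because both patterns are matched with search() rather than anchored at the start of the line, the same script should cope with lines that carry a syslog prefix from /var/log/messages or a "> " quote marker from a mail.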