From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jan Kara Subject: Re: [2.6.32 ubuntu] I/O hang at start_this_handle Date: Fri, 8 Apr 2011 20:29:16 +0200 Message-ID: <20110408182916.GA16729@quack.suse.cz> References: <20110215151633.GG17313@quack.suse.cz> <201102160652.BDI60469.JOVFSFOHLQOFtM@I-love.SAKURA.ne.jp> <20110216155317.GD5592@quack.suse.cz> <201102170813.p1H8DhOJ083597@www262.sakura.ne.jp> <20110217153847.GE4947@quack.suse.cz> <201104082338.GCG69297.FQVLtSJOFHOMOF@I-love.SAKURA.ne.jp> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: jack@suse.cz, linux-fsdevel@vger.kernel.org To: Tetsuo Handa Return-path: Received: from cantor.suse.de ([195.135.220.2]:55434 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756899Ab1DHS3W (ORCPT ); Fri, 8 Apr 2011 14:29:22 -0400 Content-Disposition: inline In-Reply-To: <201104082338.GCG69297.FQVLtSJOFHOMOF@I-love.SAKURA.ne.jp> Sender: linux-fsdevel-owner@vger.kernel.org List-ID: On Fri 08-04-11 23:38:55, Tetsuo Handa wrote: > Jan Kara wrote: > > > I waited for a few hours but the process did not resume. Thus, I gave up. > > OK, so stuck forever ;). Interesting. So we probably missed a wakeup > > somehow or j_barrier_count got corrupted. I suppose you are not able to > > reproduce the hang, are you? Looking at the code, it looks safe and I have > > no clue how it could happen. So unless you are able to see the issue again > > (so that we can gather some more debug information), I'm not able to help... > > I don't know how to reproduce the hang, but I got below message (I think it was > 2.6.18-238.5.1.el5) and CPU usage became 100% soon after I resumed a VM (which > took a few *minutes* to resume) running on the VMware Workstation. Hmm, but 2.6.18 is really ancient. I'm afraid I cannot help you much with it. There have been numerous bugs fixed since then. > INFO: task kjournald:543 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > kjournald D 00000122 3072 543 9 573 530 (L-TLB) > c193cf40 00000046 c37e5098 00000122 c1a41254 00000001 c193cf18 00000001 > f7c25aa0 c3807a0a 00000122 00022972 00000000 f7c25bac c1806f00 f7096e40 > f7c25aa0 00000000 c195ec50 c1803e00 f7096e40 c195eaa0 c193cf70 ffffffff > Call Trace: > [] journal_commit_transaction+0x137/0xefc [jbd] > [] autoremove_wake_function+0x0/0x2d > [] try_to_del_timer_sync+0x65/0x6c > [] kjournald+0xa1/0x1c2 [jbd] > [] autoremove_wake_function+0x0/0x2d > [] kjournald+0x0/0x1c2 [jbd] > [] kthread+0xc0/0xed > [] kthread+0x0/0xed > [] kernel_thread_helper+0x7/0x10 > ======================= > INFO: task login:2107 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > login D 00000151 2752 2107 1 2108 604 (NOTLB) > c1940f38 00000086 ff6037b8 00000151 00000000 c1940f3c c0484602 0000000a > f7c25000 ffd43070 00000151 0073f8b8 00000000 f7c2510c c1806f00 f49683c0 > f1eff380 00000000 c0475922 f7d6d8c0 f7679cc8 f1eff380 c1940f3c ffffffff > Call Trace: > [] may_open+0x125/0x203 > [] __dentry_open+0xea/0x1ab > [] __mutex_lock_slowpath+0x4d/0x7c > [] .text.lock.mutex+0xf/0x14 > [] inotify_inode_queue_event+0x45/0xc6 > [] generic_file_llseek+0x2a/0xd2 > [] generic_file_llseek+0x0/0xd2 > [] vfs_llseek+0x30/0x34 > [] sys_llseek+0x43/0x84 > [] sys_fcntl64+0x6f/0x80 > [] syscall_call+0x7/0xb > ======================= > INFO: task syslogd:2633 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > syslogd D 00000122 2412 2633 1 2636 2621 (NOTLB) > f7138ed4 00000082 c35f4a61 00000122 0000000e 00000000 f7138f18 00000007 > c195eaa0 c37e5098 00000122 001f0637 00000000 c195ebac c1806f00 f7096e40 > c04975ff 00000000 00000282 c042deff c081ef80 f7138edc c18073bc c042e073 > Call Trace: > [] mpage_writepages+0x2b1/0x310 > [] lock_timer_base+0x15/0x2f > [] __mod_timer+0xda/0xe4 > [] schedule_timeout+0x71/0x8c > [] process_timeout+0x0/0x5 > [] journal_stop+0xd3/0x1ba [jbd] > [] __writeback_single_inode+0x197/0x2a3 > [] do_writepages+0x2b/0x32 > [] __filemap_fdatawrite_range+0x66/0x72 > [] sync_inode+0x19/0x24 > [] ext3_sync_file+0xb1/0xdc [ext3] > [] do_fsync+0x41/0x83 > [] __do_fsync+0x1d/0x2b > [] syscall_call+0x7/0xb > ======================= > INFO: task auditd:2688 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > auditd D 00000122 2396 2688 1 2940 2687 (NOTLB) > f4cc1ed0 00000086 c34cb5c9 00000122 c0459b22 0000000e 00000000 00000009 > f77bcaa0 c35f4a61 00000122 00129498 00000000 f77bcbac c1806f00 f7096200 > c1a41268 00000000 00000000 f4cc1ecc c041ec40 00000000 f4cc1ed8 c1a41250 > Call Trace: > [] find_get_pages_tag+0x30/0x75 > [] __wake_up+0x2a/0x3d > [] prepare_to_wait+0x24/0x46 > [] log_wait_commit+0x80/0xc7 [jbd] > [] autoremove_wake_function+0x0/0x2d > [] journal_stop+0x195/0x1ba [jbd] > [] __writeback_single_inode+0x197/0x2a3 > [] do_writepages+0x2b/0x32 > [] __filemap_fdatawrite_range+0x66/0x72 > [] sync_inode+0x19/0x24 > [] ext3_sync_file+0xb1/0xdc [ext3] > [] do_fsync+0x41/0x83 > [] __do_fsync+0x1d/0x2b > [] syscall_call+0x7/0xb > ======================= > INFO: task nmbd:3186 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > nmbd D 00000177 2572 3186 1 3198 3183 (NOTLB) > f48c8e28 00200086 eb451402 00000177 00000010 c0637d00 00000000 00000007 > f7140000 eb4f644b 00000177 000a5049 00000000 f714010c c1806f00 f77233c0 > ffffffff 00000000 00000000 00000000 00000000 00000000 f48c8e40 ffffffff > Call Trace: > [] start_this_handle+0x21b/0x30d [jbd] > [] autoremove_wake_function+0x0/0x2d > [] journal_start+0xac/0xdb [jbd] > [] ext3_dirty_inode+0x24/0x66 [ext3] > [] __mark_inode_dirty+0x27/0x14f > [] file_update_time+0x30/0xa1 > [] do_wp_page+0x5b7/0x5f1 > [] __handle_mm_fault+0xa33/0xaac > [] fcntl_setlk64+0x1f9/0x21f > [] do_page_fault+0x23a/0x52d > [] do_page_fault+0x0/0x52d > [] error_code+0x39/0x40 > ======================= > INFO: task login:3370 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > login D 00000122 2672 3370 1 3288 (NOTLB) > eff1cdc4 00000086 c9528bc8 00000122 00a41000 c062c080 00000007 00000009 > f776f000 c9702c11 00000122 001da049 00000000 f776f10c c1806f00 f4a85900 > 00000000 00000000 00160057 ffffffff 00000000 ffffffff eff1cddc ffffffff > Call Trace: > [] start_this_handle+0x21b/0x30d [jbd] > [] autoremove_wake_function+0x0/0x2d > [] journal_start+0xac/0xdb [jbd] > [] ext3_dirty_inode+0x24/0x66 [ext3] > [] __mark_inode_dirty+0x27/0x14f > [] file_update_time+0x30/0xa1 > [] __generic_file_aio_write_nolock+0x38d/0x52a > [] avc_has_perm+0x3c/0x46 > [] generic_file_aio_write+0x57/0xaa > [] ext3_file_write+0x19/0x83 [ext3] > [] do_sync_write+0xb6/0xf1 > [] autoremove_wake_function+0x0/0x2d > [] do_sync_write+0x0/0xf1 > [] vfs_write+0xa1/0x143 > [] sys_write+0x3c/0x63 > [] syscall_call+0x7/0xb > ======================= > INFO: task kjournald:543 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > kjournald D 00000122 3072 543 9 573 530 (L-TLB) > c193cf40 00000046 c37e5098 00000122 c1a41254 00000001 c193cf18 00000001 > f7c25aa0 c3807a0a 00000122 00022972 00000000 f7c25bac c1806f00 f7096e40 > f7c25aa0 00000000 c195ec50 c1803e00 f7096e40 c195eaa0 c193cf70 ffffffff > Call Trace: > [] journal_commit_transaction+0x137/0xefc [jbd] > [] autoremove_wake_function+0x0/0x2d > [] try_to_del_timer_sync+0x65/0x6c > [] kjournald+0xa1/0x1c2 [jbd] > [] autoremove_wake_function+0x0/0x2d > [] kjournald+0x0/0x1c2 [jbd] > [] kthread+0xc0/0xed > [] kthread+0x0/0xed > [] kernel_thread_helper+0x7/0x10 > ======================= > INFO: task login:2107 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > login D 00000151 2752 2107 1 2108 604 (NOTLB) > c1940f38 00000086 ff6037b8 00000151 00000000 c1940f3c c0484602 0000000a > f7c25000 ffd43070 00000151 0073f8b8 00000000 f7c2510c c1806f00 f49683c0 > f1eff380 00000000 c0475922 f7d6d8c0 f7679cc8 f1eff380 c1940f3c ffffffff > Call Trace: > [] may_open+0x125/0x203 > [] __dentry_open+0xea/0x1ab > [] __mutex_lock_slowpath+0x4d/0x7c > [] .text.lock.mutex+0xf/0x14 > [] inotify_inode_queue_event+0x45/0xc6 > [] generic_file_llseek+0x2a/0xd2 > [] generic_file_llseek+0x0/0xd2 > [] vfs_llseek+0x30/0x34 > [] sys_llseek+0x43/0x84 > [] sys_fcntl64+0x6f/0x80 > [] syscall_call+0x7/0xb > ======================= > INFO: task syslogd:2633 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > syslogd D 00000122 2412 2633 1 2636 2621 (NOTLB) > f7138ed4 00000082 c35f4a61 00000122 0000000e 00000000 f7138f18 00000007 > c195eaa0 c37e5098 00000122 001f0637 00000000 c195ebac c1806f00 f7096e40 > c04975ff 00000000 00000282 c042deff c081ef80 f7138edc c18073bc c042e073 > Call Trace: > [] mpage_writepages+0x2b1/0x310 > [] lock_timer_base+0x15/0x2f > [] __mod_timer+0xda/0xe4 > [] schedule_timeout+0x71/0x8c > [] process_timeout+0x0/0x5 > [] journal_stop+0xd3/0x1ba [jbd] > [] __writeback_single_inode+0x197/0x2a3 > [] do_writepages+0x2b/0x32 > [] __filemap_fdatawrite_range+0x66/0x72 > [] sync_inode+0x19/0x24 > [] ext3_sync_file+0xb1/0xdc [ext3] > [] do_fsync+0x41/0x83 > [] __do_fsync+0x1d/0x2b > [] syscall_call+0x7/0xb > ======================= > INFO: task auditd:2688 blocked for more than 120 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > auditd D 00000122 2396 2688 1 2940 2687 (NOTLB) > f4cc1ed0 00000086 c34cb5c9 00000122 c0459b22 0000000e 00000000 00000009 > f77bcaa0 c35f4a61 00000122 00129498 00000000 f77bcbac c1806f00 f7096200 > c1a41268 00000000 00000000 f4cc1ecc c041ec40 00000000 f4cc1ed8 c1a41250 > Call Trace: > [] find_get_pages_tag+0x30/0x75 > [] __wake_up+0x2a/0x3d > [] prepare_to_wait+0x24/0x46 > [] log_wait_commit+0x80/0xc7 [jbd] > [] autoremove_wake_function+0x0/0x2d > [] journal_stop+0x195/0x1ba [jbd] > [] __writeback_single_inode+0x197/0x2a3 > [] do_writepages+0x2b/0x32 > [] __filemap_fdatawrite_range+0x66/0x72 > [] sync_inode+0x19/0x24 > [] ext3_sync_file+0xb1/0xdc [ext3] > [] do_fsync+0x41/0x83 > [] __do_fsync+0x1d/0x2b > [] syscall_call+0x7/0xb > ======================= > > If this comes from the same cause, it might be related with clock or timer event. > I couldn't get more information because sysrq didn't work. I see. From the traces it looks as if some processes are waiting in journal_stop() in transaction batching code (which then blocks a transaction commit). So indeed if the timer / wakeup code didn't work quite well, it could cause situations like this. Honza -- Jan Kara SUSE Labs, CR