From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jeremy Fitzhardinge
Subject: Re: I/O related (?) domU crash on xen 4.0 + pv_ops
Date: Thu, 10 Jun 2010 15:47:30 -0700
Message-ID: <4C116B82.5000901@goop.org>
References: <4C0EA505.8050500@buzz-media.com> <4C0ECF72.3050509@goop.org> <4C1131FB.1030601@buzz-media.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <4C1131FB.1030601@buzz-media.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Rick Boone
Cc: "xen-devel@lists.xensource.com"
List-Id: xen-devel@lists.xenproject.org

On 06/10/2010 11:42 AM, Rick Boone wrote:
> I booted into 2.6.32.12 yesterday, but I'm still getting a hung task
> crash. (Initially, I got a soft lockup crash on one of the cpu's,
> after switching to 2.6.32.12, but it seems I was able to resolve that
> by switching the kernel to tickless)
>
> I built the kernel myself (for both dom0 and domU), from the
> xen/stable-2.6.31.x and xen/stable-2.6.32.x branches.
>
> I tried "clocksource=jiffies" earlier, but didn't give it a fair enough
> shake. The time drift ended up becoming pretty bad, even with ntp, so
> I had to switch out of it. I'll give it another try today; I'm sure I
> can probably figure out the drift issue.

Yes, jiffies is a horrible clocksource; it is just a diagnostic test
(not even a workaround), since it seems to have some effect.

I'm really not sure what the root cause of the problem is; it seems that
something is deadlocked in the io scheduler, and everything else ends up
getting hung as a result. If you can get the full output of sysrq-t
while the domain is hung, it may help work out who the culprit is (I've
never managed to get a complete dump in a hung state, but that was with
an older kernel with some bugs in the console code which truncated the
output; 2.6.32 should be OK).
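Once a console log has been captured, a small script can help sift it. The following is only a minimal Python sketch, not something from this thread: it assumes the guest console output has been saved as text, and it only looks for the "INFO: task NAME:PID blocked for more than N seconds" header that appears in the tracebacks quoted below. The names `HUNG_RE`, `find_hung_tasks`, and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Matches the hung-task report header, e.g.
#   INFO: task pdflush:36 blocked for more than 120 seconds.
# Assumption: the task name contains no whitespace.
HUNG_RE = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (task name, pid, seconds) for each hung-task report found."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in HUNG_RE.finditer(log_text)
    ]


def summarize(log_text):
    """Count how many times each task name was reported as blocked."""
    return Counter(name for name, _pid, _secs in find_hung_tasks(log_text))
```

Running `summarize()` over the text of a saved log gives a per-task count of reports, which may make it easier to see which tasks are stuck most often before reading the full sysrq-t dump.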

    J

>
> -- Rick
>
> On 06/08/2010 04:17 PM, Jeremy Fitzhardinge wrote:
>> On 06/08/2010 01:16 PM, Rick Boone wrote:
>>
>>> Hey,
>>>
>>> I'm running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 -
>>> my domU's are continually locking up, under heavy IO load. My issue
>>> seems similar to what these guys are reporting:
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=551552
>>> https://bugzilla.redhat.com/show_bug.cgi?id=526627
>>> https://bugzilla.redhat.com/show_bug.cgi?id=550724
>>>
>>> Any solutions/ideas that haven't been covered in those reports? I've
>>> turned off tickless kernel, and have also set the guest I/O scheduler
>>> to "noop", but machines are still crashing. I'm using LVM-backed block
>>> devices on the dom0. For a while, I didn't have the kernel set to panic
>>> on a hung task, and from digging around while the kernel was still up,
>>> I was able to determine that the device that seems to be causing
>>> issues is one that sees a lot of IO (it's receiving all of the logs on
>>> a heavily used web server).
>>>
>> Does booting the domU with "clocksource=jiffies" help at all?
>>
>> Have you tried a 2.6.32-based domU? I used to see these kinds of
>> lockups with older kernels on my linode domU, but have seen none since
>> moving it to Fedora 12 with a 2.6.32 kernel.
>>
>> Are you using a distro kernel, or one you've built yourself?
>>
>> J
>>
>>
>>> Here's a couple of my tracebacks:
>>> 1)
>>>
>>> INFO: task pdflush:36 blocked for more than 120 seconds.
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0
>>> ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94
>>> 00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec
>>> Call Trace:
>>> [] ? blk_unplug+0x56/0x72
>>> [] io_schedule+0x37/0x59
>>> [] ? block_sync_page+0x5b/0x71
>>> [] sync_page+0x5a/0x72
>>> [] __wait_on_bit_lock+0x55/0xb3
>>> [] ? sync_page+0x0/0x72
>>> [] ? find_get_pages_tag+0xf7/0x144
>>> [] __lock_page+0x71/0x8c
>>> [] ? wake_bit_function+0x0/0x51
>>> [] write_cache_pages+0x201/0x3bf
>>> [] ? __writepage+0x0/0x5a
>>> [] ? xen_force_evtchn_callback+0x20/0x36
>>> [] ? xen_restore_fl_direct_end+0x0/0x1
>>> [] generic_writepages+0x35/0x4f
>>> [] do_writepages+0x3f/0x5e
>>> [] writeback_single_inode+0x161/0x2d7
>>> [] generic_sync_sb_inodes+0x1ef/0x355
>>> [] ? pdflush+0x0/0x286
>>> [] sync_sb_inodes+0x1c/0x32
>>> [] writeback_inodes+0x79/0xdf
>>> [] ? sync_supers+0xb3/0xce
>>> [] wb_kupdate+0xb9/0x13a
>>> [] ? pdflush+0x126/0x286
>>> [] pdflush+0x163/0x286
>>> [] ? wb_kupdate+0x0/0x13a
>>> [] ? pdflush+0x0/0x286
>>> [] kthread+0x9e/0xa8
>>> [] child_rip+0xa/0x20
>>> [] ? int_ret_from_sys_call+0x7/0x1b
>>> [] ? retint_restore_args+0x5/0x6
>>> [] ? child_rip+0x0/0x20
>>> 1 lock held by pdflush/36:
>>> #0: (&type->s_umount_key#23){......}, at: []
>>> writeback_inodes+0x63/0xdf
>>> Kernel panic - not syncing: hung_task: blocked tasks
>>> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4
>>> Call Trace:
>>> [] panic+0xb2/0x168
>>> [] ? print_lock+0x96/0xb1
>>> [] ? lockdep_print_held_locks+0xa5/0xc9
>>> [] ? show_stack+0x2a/0x40
>>> [] ? touch_nmi_watchdog+0x6c/0x87
>>> [] ? __debug_show_held_locks+0x33/0x49
>>> [] watchdog+0x209/0x258
>>> [] ? watchdog+0x75/0x258
>>> [] ? complete+0x52/0x71
>>> [] ? watchdog+0x0/0x258
>>> [] kthread+0x9e/0xa8
>>> [] child_rip+0xa/0x20
>>> [] ? int_ret_from_sys_call+0x7/0x1b
>>> [] ? retint_restore_args+0x5/0x6
>>> [] ? child_rip+0x0/0x20
>>>
>>> ---------------------------------------------
>>>
>>> 2)
>>> INFO: task kjournald:951 blocked for more than 120 seconds.
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00
>>> ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001
>>> 0000000000000001 0000000000000200 0000000000000001 000000000160015f
>>> Call Trace:
>>> [] io_schedule+0x37/0x59
>>> [] ? blk_backing_dev_unplug+0x20/0x36
>>> [] sync_buffer+0x51/0x69
>>> [] __wait_on_bit+0x54/0x9c
>>> [] ? sync_buffer+0x0/0x69
>>> [] ? sync_buffer+0x0/0x69
>>> [] out_of_line_wait_on_bit+0x7d/0x9e
>>> [] ? wake_bit_function+0x0/0x51
>>> [] __wait_on_buffer+0x32/0x48
>>> [] journal_commit_transaction+0x684/0x12f2 [jbd]
>>> [] ? check_events+0x12/0x20
>>> [] ? xen_restore_fl_direct_end+0x0/0x1
>>> [] ? _spin_unlock_irqrestore+0x44/0x5f
>>> [] ? try_to_del_timer_sync+0x65/0x84
>>> [] ? del_timer_sync+0x0/0xa0
>>> [] kjournald+0x161/0x3ae [jbd]
>>> [] ? autoremove_wake_function+0x0/0x5e
>>> [] ? kjournald+0x0/0x3ae [jbd]
>>> [] kthread+0x9e/0xa8
>>> [] child_rip+0xa/0x20
>>> [] ? int_ret_from_sys_call+0x7/0x1b
>>> [] ? retint_restore_args+0x5/0x6
>>> [] ? child_rip+0x0/0x20
>>> no locks held by kjournald/951.
>>> Kernel panic - not syncing: hung_task: blocked tasks
>>> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18
>>> Call Trace:
>>> [] panic+0xb2/0x168
>>> [] ? lockdep_print_held_locks+0x46/0xc9
>>> [] ? show_stack+0x2a/0x40
>>> [] ? touch_nmi_watchdog+0x6c/0x87
>>> [] ? __debug_show_held_locks+0x33/0x49
>>> [] watchdog+0x209/0x258
>>> [] ? watchdog+0x75/0x258
>>> [] ? complete+0x52/0x71
>>> [] ? watchdog+0x0/0x258
>>> [] kthread+0x9e/0xa8
>>> [] child_rip+0xa/0x20
>>> [] ? int_ret_from_sys_call+0x7/0x1b
>>> [] ? retint_restore_args+0x5/0x6
>>> [] ? child_rip+0x0/0x20
>>>
>>> -- Rick Boone
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel
>>>
>>>
>>