From mboxrd@z Thu Jan 1 00:00:00 1970 From: Rick Boone Subject: Re: I/O related (?) domU crash on xen 4.0 + pv_ops Date: Thu, 10 Jun 2010 11:42:03 -0700 Message-ID: <4C1131FB.1030601@buzz-media.com> References: <4C0EA505.8050500@buzz-media.com> <4C0ECF72.3050509@goop.org> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============1811342369==" Return-path: In-Reply-To: <4C0ECF72.3050509@goop.org> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Jeremy Fitzhardinge Cc: "xen-devel@lists.xensource.com" List-Id: xen-devel@lists.xenproject.org --===============1811342369== Content-Type: multipart/alternative; boundary="------------050303080009070406000603" --------------050303080009070406000603 Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Content-Transfer-Encoding: 7bit I booted into 2.6.32.12 yesterday, but I'm still getting a hung task crash. (Initially, I got a soft lockup crash on one of the cpu's, after switching to 2.6.32.12, but it seems I was able to resolve that by switching the kernel to tickless) I built the kernel myself (for both dom0 and domU), from the xen/stable-2.6.31.x and xen/stable-2.6.32.x branches. I tried "clocksource=jiffies" earlier, but didnt give it a fair enough shake. The time drift ended up becoming pretty bad, even with ntp, so I had to switch out of it. I'll give it another try today; I'm sure I can probably figure out the drift issue. -- Rick On 06/08/2010 04:17 PM, Jeremy Fitzhardinge wrote: > On 06/08/2010 01:16 PM, Rick Boone wrote: > >> Hey, >> >> I'm running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 - >> my domU's are continually locking up, under heavy IO load. My issue >> seems similar to what these guys are reportiing: >> >> https://bugzilla.redhat.com/show_bug.cgi?id=551552 >> https://bugzilla.redhat.com/show_bug.cgi?id=526627 >> https://bugzilla.redhat.com/show_bug.cgi?id=550724 >> >> Any solutions/ideas that haven't been covered in those reports? I've >> turned off tickless kernel, and have also set the guest I/O scheduler >> to "noop", but machines are still crashing. I'm using LVM-backed block >> devices on the dom0. For awhile, I didn't have the kernel set to panic >> on a hung task, and from digging around while the kernel was still up, >> I was able to determine that the device that seems to be causing >> issues is one that sees a lot of IO (it's receiving all of the logs on >> a heavily used web server). >> > Does booting the domU with "clocksource=jiffies" help at all? > > Have you tried a 2.6.32-based domU? I used to see these kinds of > lockups with older kernels on my linode domU, but have seen none since > moving it to Fedora 12 with a 2.6.32 kernel. > > Are you using a distro kernel, or one you've built yourself? > > J > > >> Here's a couple of my tracebacks: >> 1) >> >> INFO: task pdflush:36 blocked for more than 120 seconds. >> "echo 0> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0 >> ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94 >> 00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec >> Call Trace: >> [] ? blk_unplug+0x56/0x72 >> [] io_schedule+0x37/0x59 >> [] ? block_sync_page+0x5b/0x71 >> [] sync_page+0x5a/0x72 >> [] __wait_on_bit_lock+0x55/0xb3 >> [] ? sync_page+0x0/0x72 >> [] ? find_get_pages_tag+0xf7/0x144 >> [] __lock_page+0x71/0x8c >> [] ? wake_bit_function+0x0/0x51 >> [] write_cache_pages+0x201/0x3bf >> [] ? __writepage+0x0/0x5a >> [] ? xen_force_evtchn_callback+0x20/0x36 >> [] ? xen_restore_fl_direct_end+0x0/0x1 >> [] generic_writepages+0x35/0x4f >> [] do_writepages+0x3f/0x5e >> [] writeback_single_inode+0x161/0x2d7 >> [] generic_sync_sb_inodes+0x1ef/0x355 >> [] ? pdflush+0x0/0x286 >> [] sync_sb_inodes+0x1c/0x32 >> [] writeback_inodes+0x79/0xdf >> [] ? sync_supers+0xb3/0xce >> [] wb_kupdate+0xb9/0x13a >> [] ? pdflush+0x126/0x286 >> [] pdflush+0x163/0x286 >> [] ? wb_kupdate+0x0/0x13a >> [] ? pdflush+0x0/0x286 >> [] kthread+0x9e/0xa8 >> [] child_rip+0xa/0x20 >> [] ? int_ret_from_sys_call+0x7/0x1b >> [] ? retint_restore_args+0x5/0x6 >> [] ? child_rip+0x0/0x20 >> 1 lock held by pdflush/36: >> #0: (&type->s_umount_key#23){......}, at: [] >> writeback_inodes+0x63/0xdf >> Kernel panic - not syncing: hung_task: blocked tasks >> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4 >> Call Trace: >> [] panic+0xb2/0x168 >> [] ? print_lock+0x96/0xb1 >> [] ? lockdep_print_held_locks+0xa5/0xc9 >> [] ? show_stack+0x2a/0x40 >> [] ? touch_nmi_watchdog+0x6c/0x87 >> [] ? __debug_show_held_locks+0x33/0x49 >> [] watchdog+0x209/0x258 >> [] ? watchdog+0x75/0x258 >> [] ? complete+0x52/0x71 >> [] ? watchdog+0x0/0x258 >> [] kthread+0x9e/0xa8 >> [] child_rip+0xa/0x20 >> [] ? int_ret_from_sys_call+0x7/0x1b >> [] ? retint_restore_args+0x5/0x6 >> [] ? child_rip+0x0/0x20 >> >> --------------------------------------------- >> >> 2) >> INFO: task kjournald:951 blocked for more than 120 seconds. >> "echo 0> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00 >> ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001 >> 0000000000000001 0000000000000200 0000000000000001 000000000160015f >> Call Trace: >> [] io_schedule+0x37/0x59 >> [] ? blk_backing_dev_unplug+0x20/0x36 >> [] sync_buffer+0x51/0x69 >> [] __wait_on_bit+0x54/0x9c >> [] ? sync_buffer+0x0/0x69 >> [] ? sync_buffer+0x0/0x69 >> [] out_of_line_wait_on_bit+0x7d/0x9e >> [] ? wake_bit_function+0x0/0x51 >> [] __wait_on_buffer+0x32/0x48 >> [] journal_commit_transaction+0x684/0x12f2 [jbd] >> [] ? check_events+0x12/0x20 >> [] ? xen_restore_fl_direct_end+0x0/0x1 >> [] ? _spin_unlock_irqrestore+0x44/0x5f >> [] ? try_to_del_timer_sync+0x65/0x84 >> [] ? del_timer_sync+0x0/0xa0 >> [] kjournald+0x161/0x3ae [jbd] >> [] ? autoremove_wake_function+0x0/0x5e >> [] ? kjournald+0x0/0x3ae [jbd] >> [] kthread+0x9e/0xa8 >> [] child_rip+0xa/0x20 >> [] ? int_ret_from_sys_call+0x7/0x1b >> [] ? retint_restore_args+0x5/0x6 >> [] ? child_rip+0x0/0x20 >> no locks held by kjournald/951. >> Kernel panic - not syncing: hung_task: blocked tasks >> Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18 >> Call Trace: >> [] panic+0xb2/0x168 >> [] ? lockdep_print_held_locks+0x46/0xc9 >> [] ? show_stack+0x2a/0x40 >> [] ? touch_nmi_watchdog+0x6c/0x87 >> [] ? __debug_show_held_locks+0x33/0x49 >> [] watchdog+0x209/0x258 >> [] ? watchdog+0x75/0x258 >> [] ? complete+0x52/0x71 >> [] ? watchdog+0x0/0x258 >> [] kthread+0x9e/0xa8 >> [] child_rip+0xa/0x20 >> [] ? int_ret_from_sys_call+0x7/0x1b >> [] ? retint_restore_args+0x5/0x6 >> [] ? child_rip+0x0/0x20 >> >> -- Rick Boone >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xensource.com >> http://lists.xensource.com/xen-devel >> >> > --------------050303080009070406000603 Content-Type: text/html; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit I booted into 2.6.32.12 yesterday, but I'm still getting a hung task crash. (Initially, I got a soft lockup crash on one of the cpu's, after switching to 2.6.32.12, but it seems I was able to resolve that by switching the kernel to tickless)

I built the kernel myself (for both dom0 and domU), from the xen/stable-2.6.31.x and xen/stable-2.6.32.x branches.

I tried "clocksource=jiffies" earlier, but didnt give it a fair enough shake. The time drift ended up becoming pretty bad, even with ntp, so I had to switch out of it. I'll give it another try today; I'm sure I can probably figure out the drift issue.


-- Rick

On 06/08/2010 04:17 PM, Jeremy Fitzhardinge wrote:
On 06/08/2010 01:16 PM, Rick Boone wrote:
  
Hey,

I'm running into an issue with pv_ops kernel (2.6.31.13) and xen 4.0 -
my domU's are continually locking up, under heavy IO load. My issue
seems similar to what these guys are reportiing:

https://bugzilla.redhat.com/show_bug.cgi?id=551552
https://bugzilla.redhat.com/show_bug.cgi?id=526627
https://bugzilla.redhat.com/show_bug.cgi?id=550724

Any solutions/ideas that haven't been covered in those reports? I've
turned off tickless kernel, and have also set the guest I/O scheduler
to "noop", but machines are still crashing. I'm using LVM-backed block
devices on the dom0. For awhile, I didn't have the kernel set to panic
on a hung task, and from digging around while the kernel was still up,
I was able to determine that the device that seems to be causing
issues is one that sees a lot of IO (it's receiving all of the logs on
a heavily used web server).
    
Does booting the domU with "clocksource=jiffies" help at all?

Have you tried a 2.6.32-based domU?  I used to see these kinds of
lockups with older kernels on my linode domU, but have seen none since
moving it to Fedora 12 with a 2.6.32 kernel.

Are you using a distro kernel, or one you've built yourself?

    J

  
Here's a couple of my tracebacks:
1)

INFO: task pdflush:36 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ffff8801e963f9c0 0000000000000282 00000000e4f10f56 ffff8801e963f9d0
 ffff8801eb7a31b0 ffff8801eb67c240 ffff8801eb7a3590 0000000103902b94
 00000000e4f10f56 ffff8801e963fa70 ffff8801e963f9b0 ffffffff811f14ec
Call Trace:
 [<ffffffff811f14ec>] ? blk_unplug+0x56/0x72
 [<ffffffff813f1ee0>] io_schedule+0x37/0x59
 [<ffffffff8112b1a8>] ? block_sync_page+0x5b/0x71
 [<ffffffff810c2e77>] sync_page+0x5a/0x72
 [<ffffffff813f2237>] __wait_on_bit_lock+0x55/0xb3
 [<ffffffff810c2e1d>] ? sync_page+0x0/0x72
 [<ffffffff810c2b0d>] ? find_get_pages_tag+0xf7/0x144
 [<ffffffff810c2dce>] __lock_page+0x71/0x8c
 [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
 [<ffffffff810cafe4>] write_cache_pages+0x201/0x3bf
 [<ffffffff810cac23>] ? __writepage+0x0/0x5a
 [<ffffffff8100ef6c>] ? xen_force_evtchn_callback+0x20/0x36
 [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff810cb1d7>] generic_writepages+0x35/0x4f
 [<ffffffff810cb230>] do_writepages+0x3f/0x5e
 [<ffffffff811261e5>] writeback_single_inode+0x161/0x2d7
 [<ffffffff811267ab>] generic_sync_sb_inodes+0x1ef/0x355
 [<ffffffff810cc726>] ? pdflush+0x0/0x286
 [<ffffffff8112692d>] sync_sb_inodes+0x1c/0x32
 [<ffffffff811269bc>] writeback_inodes+0x79/0xdf
 [<ffffffff81107819>] ? sync_supers+0xb3/0xce
 [<ffffffff810cc1f6>] wb_kupdate+0xb9/0x13a
 [<ffffffff810cc84c>] ? pdflush+0x126/0x286
 [<ffffffff810cc889>] pdflush+0x163/0x286
 [<ffffffff810cc13d>] ? wb_kupdate+0x0/0x13a
 [<ffffffff810cc726>] ? pdflush+0x0/0x286
 [<ffffffff810754ce>] kthread+0x9e/0xa8
 [<ffffffff8101606a>] child_rip+0xa/0x20
 [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
 [<ffffffff81016060>] ? child_rip+0x0/0x20
1 lock held by pdflush/36:
 #0:  (&type->s_umount_key#23){......}, at: [<ffffffff811269a6>]
writeback_inodes+0x63/0xdf
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #4
Call Trace:
 [<ffffffff8105c8f3>] panic+0xb2/0x168
 [<ffffffff81085881>] ? print_lock+0x96/0xb1
 [<ffffffff810861d5>] ? lockdep_print_held_locks+0xa5/0xc9
 [<ffffffff8101907a>] ? show_stack+0x2a/0x40
 [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
 [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
 [<ffffffff810b146c>] watchdog+0x209/0x258
 [<ffffffff810b12d8>] ? watchdog+0x75/0x258
 [<ffffffff8104a45f>] ? complete+0x52/0x71
 [<ffffffff810b1263>] ? watchdog+0x0/0x258
 [<ffffffff810754ce>] kthread+0x9e/0xa8
 [<ffffffff8101606a>] child_rip+0xa/0x20
 [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
 [<ffffffff81016060>] ? child_rip+0x0/0x20

---------------------------------------------

2)
INFO: task kjournald:951 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 ffff8801c8993bf0 0000000000000282 00000000cf63a654 ffff8801c8993c00
 ffff8801ca899090 ffff8801ccae31b0 ffff8801ca899470 0000000000000001
 0000000000000001 0000000000000200 0000000000000001 000000000160015f
Call Trace:
 [<ffffffff813f1ee0>] io_schedule+0x37/0x59
 [<ffffffff811f1528>] ? blk_backing_dev_unplug+0x20/0x36
 [<ffffffff8112cb73>] sync_buffer+0x51/0x69
 [<ffffffff813f2387>] __wait_on_bit+0x54/0x9c
 [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
 [<ffffffff8112cb22>] ? sync_buffer+0x0/0x69
 [<ffffffff813f244c>] out_of_line_wait_on_bit+0x7d/0x9e
 [<ffffffff8107569f>] ? wake_bit_function+0x0/0x51
 [<ffffffff8112ca8f>] __wait_on_buffer+0x32/0x48
 [<ffffffffa005cf62>] journal_commit_transaction+0x684/0x12f2 [jbd]
 [<ffffffff8100fa82>] ? check_events+0x12/0x20
 [<ffffffff8100fa6f>] ? xen_restore_fl_direct_end+0x0/0x1
 [<ffffffff813f4ad7>] ? _spin_unlock_irqrestore+0x44/0x5f
 [<ffffffff81067de4>] ? try_to_del_timer_sync+0x65/0x84
 [<ffffffff81067e03>] ? del_timer_sync+0x0/0xa0
 [<ffffffffa0061dd4>] kjournald+0x161/0x3ae [jbd]
 [<ffffffff81075641>] ? autoremove_wake_function+0x0/0x5e
 [<ffffffffa0061c73>] ? kjournald+0x0/0x3ae [jbd]
 [<ffffffff810754ce>] kthread+0x9e/0xa8
 [<ffffffff8101606a>] child_rip+0xa/0x20
 [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
 [<ffffffff81016060>] ? child_rip+0x0/0x20
no locks held by kjournald/951.
Kernel panic - not syncing: hung_task: blocked tasks
Pid: 34, comm: khungtaskd Not tainted 2.6.31.13-xen-4.0.0 #18
Call Trace:
 [<ffffffff8105c8f3>] panic+0xb2/0x168
 [<ffffffff81086176>] ? lockdep_print_held_locks+0x46/0xc9
 [<ffffffff8101907a>] ? show_stack+0x2a/0x40
 [<ffffffff8102f025>] ? touch_nmi_watchdog+0x6c/0x87
 [<ffffffff810862c7>] ? __debug_show_held_locks+0x33/0x49
 [<ffffffff810b146c>] watchdog+0x209/0x258
 [<ffffffff810b12d8>] ? watchdog+0x75/0x258
 [<ffffffff8104a45f>] ? complete+0x52/0x71
 [<ffffffff810b1263>] ? watchdog+0x0/0x258
 [<ffffffff810754ce>] kthread+0x9e/0xa8
 [<ffffffff8101606a>] child_rip+0xa/0x20
 [<ffffffff810151ac>] ? int_ret_from_sys_call+0x7/0x1b
 [<ffffffff810159e6>] ? retint_restore_args+0x5/0x6
 [<ffffffff81016060>] ? child_rip+0x0/0x20

-- Rick Boone


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
  
    
  
--------------050303080009070406000603-- --===============1811342369== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel --===============1811342369==--