* I/O Performance Tips
@ 2010-12-09  8:10 Sebastian Nickel - Hetzner Online AG
  2010-12-09  8:55 ` RW
  2010-12-09 10:30 ` Stefan Hajnoczi
  0 siblings, 2 replies; 5+ messages in thread

From: Sebastian Nickel - Hetzner Online AG @ 2010-12-09 8:10 UTC (permalink / raw)
To: kvm

Hello,
we have some issues with I/O in our kvm environment. We are using kernel
version 2.6.32 (Ubuntu 10.04 LTS) to virtualise our hosts, and we are using
ksm, too. Recently we noticed that the guest systems (mainly OpenSuse guest
systems) sometimes suddenly end up with a read-only filesystem. After some
inspection we found that the guest generates ATA errors due to timeouts
(mostly in "flush cache" situations). When this happens, the physical host
always logs the same kernel messages:

"""
[1508127.195469] INFO: task kjournald:497 blocked for more than 120 seconds.
[1508127.212828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1508127.246841] kjournald     D 00000000ffffffff     0   497      2 0x00000000
[1508127.246848]  ffff88062128dba0 0000000000000046 0000000000015bc0 0000000000015bc0
[1508127.246855]  ffff880621089ab0 ffff88062128dfd8 0000000000015bc0 ffff8806210896f0
[1508127.246862]  0000000000015bc0 ffff88062128dfd8 0000000000015bc0 ffff880621089ab0
[1508127.246868] Call Trace:
[1508127.246880]  [<ffffffff8116e500>] ? sync_buffer+0x0/0x50
[1508127.246889]  [<ffffffff81557d87>] io_schedule+0x47/0x70
[1508127.246893]  [<ffffffff8116e545>] sync_buffer+0x45/0x50
[1508127.246897]  [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0
[1508127.246901]  [<ffffffff8116e500>] ? sync_buffer+0x0/0x50
[1508127.246905]  [<ffffffff81558338>] out_of_line_wait_on_bit_lock+0x78/0x90
[1508127.246911]  [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
[1508127.246915]  [<ffffffff8116e6c6>] __lock_buffer+0x36/0x40
[1508127.246920]  [<ffffffff81213d11>] journal_submit_data_buffers+0x311/0x320
[1508127.246924]  [<ffffffff81213ff2>] journal_commit_transaction+0x2d2/0xe40
[1508127.246931]  [<ffffffff810397a9>] ? default_spin_lock_flags+0x9/0x10
[1508127.246935]  [<ffffffff81076c7c>] ? lock_timer_base+0x3c/0x70
[1508127.246939]  [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0
[1508127.246943]  [<ffffffff81217f0d>] kjournald+0xed/0x250
[1508127.246947]  [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
[1508127.246951]  [<ffffffff81217e20>] ? kjournald+0x0/0x250
[1508127.246954]  [<ffffffff81084d16>] kthread+0x96/0xa0
[1508127.246959]  [<ffffffff810141ea>] child_rip+0xa/0x20
[1508127.246962]  [<ffffffff81084c80>] ? kthread+0x0/0xa0
[1508127.246966]  [<ffffffff810141e0>] ? child_rip+0x0/0x20
[1508127.246969] INFO: task flush-251:0:505 blocked for more than 120 seconds.
[1508127.264076] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1508127.298343] flush-251:0   D ffffffff810f4370     0   505      2 0x00000000
[1508127.298349]  ffff880621fdba30 0000000000000046 0000000000015bc0 0000000000015bc0
[1508127.298354]  ffff88062108b1a0 ffff880621fdbfd8 0000000000015bc0 ffff88062108ade0
[1508127.298358]  0000000000015bc0 ffff880621fdbfd8 0000000000015bc0 ffff88062108b1a0
[1508127.298362] Call Trace:
[1508127.298370]  [<ffffffff810f4370>] ? sync_page+0x0/0x50
[1508127.298375]  [<ffffffff81557d87>] io_schedule+0x47/0x70
[1508127.298379]  [<ffffffff810f43ad>] sync_page+0x3d/0x50
[1508127.298383]  [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0
[1508127.298391]  [<ffffffff810f4347>] __lock_page+0x67/0x70
[1508127.298395]  [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
[1508127.298402]  [<ffffffff810f4487>] ? unlock_page+0x27/0x30
[1508127.298410]  [<ffffffff810fd9dd>] write_cache_pages+0x3bd/0x4d0
[1508127.298417]  [<ffffffff810fc670>] ? __writepage+0x0/0x40
[1508127.298425]  [<ffffffff810fdb14>] generic_writepages+0x24/0x30
[1508127.298432]  [<ffffffff810fdb55>] do_writepages+0x35/0x40
[1508127.298439]  [<ffffffff811668a6>] writeback_single_inode+0xf6/0x3d0
[1508127.298449]  [<ffffffff812b81d6>] ? rb_erase+0xd6/0x160
[1508127.298455]  [<ffffffff8116750e>] writeback_inodes_wb+0x40e/0x5e0
[1508127.298462]  [<ffffffff811677ea>] wb_writeback+0x10a/0x1d0
[1508127.298469]  [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0
[1508127.298477]  [<ffffffff8155803d>] ? schedule_timeout+0x19d/0x300
[1508127.298485]  [<ffffffff81167b1c>] wb_do_writeback+0x18c/0x1a0
[1508127.298493]  [<ffffffff81167b83>] bdi_writeback_task+0x53/0xe0
[1508127.298503]  [<ffffffff8110f546>] bdi_start_fn+0x86/0x100
[1508127.298510]  [<ffffffff8110f4c0>] ? bdi_start_fn+0x0/0x100
[1508127.298518]  [<ffffffff81084d16>] kthread+0x96/0xa0
[1508127.298526]  [<ffffffff810141ea>] child_rip+0xa/0x20
[1508127.298533]  [<ffffffff81084c80>] ? kthread+0x0/0xa0
[1508127.298541]  [<ffffffff810141e0>] ? child_rip+0x0/0x20
... some more messages like those before ...
"""
Sometimes the message "INFO: task kvm blocked for more than 120 seconds"
appears, too. I suspected the error happens during cache writeback, so I
set "/proc/sys/vm/dirty_background_ratio" to 5 and "/proc/sys/vm/dirty_ratio"
to 40. I expected this to write the cached memory to the HDD continuously
(more often, but in smaller chunks). This did not help; the guest systems
still end up with read-only filesystems. Does anybody have tips for
regulating I/O on Linux systems, or for preventing these read-only
filesystems?

I tried the "cfq" scheduler and used ionice (best-effort class, all kvm
processes in the same class). I thought of cgroups, but I could not find
any I/O-related properties to set.

We are using logical volumes as virtual guest HDDs. The volume group is on
top of an mdraid device.

Thank you for your help...

Best Regards

Sebastian

^ permalink raw reply	[flat|nested] 5+ messages in thread
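For reference, the sysctl and ionice changes Sebastian describes can be
applied like this. This is only a sketch of the commands involved (the
values are the ones tried in the message above, which per this thread did
not solve the problem; the pgrep pattern is an assumption based on the
/usr/bin/kvm path used later in the thread, and everything here needs root):

```shell
# Runtime changes; equivalent to writing to /proc/sys/vm/...
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=40

# Make the settings persistent across reboots
cat >> /etc/sysctl.conf <<'EOF'
vm.dirty_background_ratio = 5
vm.dirty_ratio = 40
EOF

# ionice as described: put a kvm process into the best-effort class (-c2)
ionice -c2 -n4 -p "$(pgrep -f '^/usr/bin/kvm' | head -n1)"
```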
* Re: I/O Performance Tips
  2010-12-09  8:10 I/O Performance Tips Sebastian Nickel - Hetzner Online AG
@ 2010-12-09  8:55 ` RW
  2010-12-09 10:30 ` Stefan Hajnoczi
  1 sibling, 0 replies; 5+ messages in thread

From: RW @ 2010-12-09 8:55 UTC (permalink / raw)
To: kvm

We don't use Ubuntu (we use Gentoo/KVM 0.12.5), but we had a similar
problem with kernels <2.6.32-r11 (this is Gentoo-specific and means
revision -r11, not release candidate 11) and with the first three releases
of 2.6.34, especially when NFS was involved. We're currently running
2.6.32-r11 and 2.6.32-r20 kernels without these problems, and we're using
the deadline scheduler in both host and guests. The VM images are stored on
LVM volumes. Maybe a kernel update will help (if one is available, of
course...).

We're also running all VMs with "cache=none".

You should also send the qemu-kvm options and version you're using to the
list. Maybe someone else can help further.

- Robert

On 12/09/10 09:10, Sebastian Nickel - Hetzner Online AG wrote:
> Hello,
> we have got some issues with I/O in our kvm environment. We are using
> kernel version 2.6.32 (Ubuntu 10.04 LTS) to virtualise our hosts and we
> are using ksm, too. Recently we noticed that sometimes the guest systems
> (mainly OpenSuse guest systems) suddenly have a read only filesystem.
> After some inspection we found out that the guest system generates some
> ata errors due to timeouts (mostly in "flush cache" situations). On the
> physical host there are always the same kernel messages when this
> happens:
>
> """
> [1508127.195469] INFO: task kjournald:497 blocked for more than 120
> seconds.
> [1508127.212828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [1508127.246841] kjournald D 00000000ffffffff 0 497 2 > 0x00000000 > [1508127.246848] ffff88062128dba0 0000000000000046 0000000000015bc0 > 0000000000015bc0 > [1508127.246855] ffff880621089ab0 ffff88062128dfd8 0000000000015bc0 > ffff8806210896f0 > [1508127.246862] 0000000000015bc0 ffff88062128dfd8 0000000000015bc0 > ffff880621089ab0 > [1508127.246868] Call Trace: > [1508127.246880] [<ffffffff8116e500>] ? sync_buffer+0x0/0x50 > [1508127.246889] [<ffffffff81557d87>] io_schedule+0x47/0x70 > [1508127.246893] [<ffffffff8116e545>] sync_buffer+0x45/0x50 > [1508127.246897] [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0 > [1508127.246901] [<ffffffff8116e500>] ? sync_buffer+0x0/0x50 > [1508127.246905] [<ffffffff81558338>] out_of_line_wait_on_bit_lock > +0x78/0x90 > [1508127.246911] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40 > [1508127.246915] [<ffffffff8116e6c6>] __lock_buffer+0x36/0x40 > [1508127.246920] [<ffffffff81213d11>] journal_submit_data_buffers > +0x311/0x320 > [1508127.246924] [<ffffffff81213ff2>] journal_commit_transaction > +0x2d2/0xe40 > [1508127.246931] [<ffffffff810397a9>] ? default_spin_lock_flags > +0x9/0x10 > [1508127.246935] [<ffffffff81076c7c>] ? lock_timer_base+0x3c/0x70 > [1508127.246939] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0 > [1508127.246943] [<ffffffff81217f0d>] kjournald+0xed/0x250 > [1508127.246947] [<ffffffff81085090>] ? autoremove_wake_function > +0x0/0x40 > [1508127.246951] [<ffffffff81217e20>] ? kjournald+0x0/0x250 > [1508127.246954] [<ffffffff81084d16>] kthread+0x96/0xa0 > [1508127.246959] [<ffffffff810141ea>] child_rip+0xa/0x20 > [1508127.246962] [<ffffffff81084c80>] ? kthread+0x0/0xa0 > [1508127.246966] [<ffffffff810141e0>] ? child_rip+0x0/0x20 > [1508127.246969] INFO: task flush-251:0:505 blocked for more than 120 > seconds. > [1508127.264076] "echo 0> /proc/sys/kernel/hung_task_timeout_secs" > disables this message. 
> [1508127.298343] flush-251:0 D ffffffff810f4370 0 505 2 > 0x00000000 > [1508127.298349] ffff880621fdba30 0000000000000046 0000000000015bc0 > 0000000000015bc0 > [1508127.298354] ffff88062108b1a0 ffff880621fdbfd8 0000000000015bc0 > ffff88062108ade0 > [1508127.298358] 0000000000015bc0 ffff880621fdbfd8 0000000000015bc0 > ffff88062108b1a0 > [1508127.298362] Call Trace: > [1508127.298370] [<ffffffff810f4370>] ? sync_page+0x0/0x50 > [1508127.298375] [<ffffffff81557d87>] io_schedule+0x47/0x70 > [1508127.298379] [<ffffffff810f43ad>] sync_page+0x3d/0x50 > [1508127.298383] [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0 > [1508127.298391] [<ffffffff810f4347>] __lock_page+0x67/0x70 > [1508127.298395] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40 > [1508127.298402] [<ffffffff810f4487>] ? unlock_page+0x27/0x30 > [1508127.298410] [<ffffffff810fd9dd>] write_cache_pages+0x3bd/0x4d0 > [1508127.298417] [<ffffffff810fc670>] ? __writepage+0x0/0x40 > [1508127.298425] [<ffffffff810fdb14>] generic_writepages+0x24/0x30 > [1508127.298432] [<ffffffff810fdb55>] do_writepages+0x35/0x40 > [1508127.298439] [<ffffffff811668a6>] writeback_single_inode+0xf6/0x3d0 > [1508127.298449] [<ffffffff812b81d6>] ? rb_erase+0xd6/0x160 > [1508127.298455] [<ffffffff8116750e>] writeback_inodes_wb+0x40e/0x5e0 > [1508127.298462] [<ffffffff811677ea>] wb_writeback+0x10a/0x1d0 > [1508127.298469] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0 > [1508127.298477] [<ffffffff8155803d>] ? schedule_timeout+0x19d/0x300 > [1508127.298485] [<ffffffff81167b1c>] wb_do_writeback+0x18c/0x1a0 > [1508127.298493] [<ffffffff81167b83>] bdi_writeback_task+0x53/0xe0 > [1508127.298503] [<ffffffff8110f546>] bdi_start_fn+0x86/0x100 > [1508127.298510] [<ffffffff8110f4c0>] ? bdi_start_fn+0x0/0x100 > [1508127.298518] [<ffffffff81084d16>] kthread+0x96/0xa0 > [1508127.298526] [<ffffffff810141ea>] child_rip+0xa/0x20 > [1508127.298533] [<ffffffff81084c80>] ? kthread+0x0/0xa0 > [1508127.298541] [<ffffffff810141e0>] ? 
child_rip+0x0/0x20 > ... some more messages like those before... > """ > Sometimes the message "INFO: task kvm blocked for more than 120 seconds" > appears, too. I thought the error happens in cache writeback situations > so I started to adjust "/proc/sys/vm/dirty_background_ratio" to 5 and > "/proc/sys/vm/dirty_ratio" to 40. I thought this will write continously > smaller parts of cached memory to HDD (more often, but smaller chunks). > This did not help. There are still "readonly" filesystems in the guest > systems. Does anybody has some tips to regulate I/O on linux systems or > to stop those "readonly" filesystems? > > I tried the "cfq" scheduler and did some ionice (best efforts method, > all kvm processes in the same class). I thought of cgroups, but I could > not find any I/O related properties to set. > > We are using logical volumes as virtual guest HDDs. The volume group is > on top of a mdraid device. > > Thank you for your help... > > > Best Regards > > Sebastian > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 5+ messages in thread
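Robert's deadline-scheduler suggestion is a per-device setting under sysfs.
A sketch of checking and changing it follows; the parsing runs on a sample
string because real device names are site-specific ("sda" in the comments
is a placeholder), and switching the scheduler requires root:

```shell
# On a real host you would read the current setting with:
#   cat /sys/block/sda/queue/scheduler
# which prints a line like the sample below, with the active
# scheduler shown in square brackets.
sample='noop anticipatory [cfq] deadline'

# Extract the active scheduler name from the bracketed token
active=$(printf '%s\n' "$sample" | grep -o '\[[a-z]*\]' | tr -d '[]')
echo "active scheduler: $active"

# To switch a device to deadline (root required):
#   echo deadline > /sys/block/sda/queue/scheduler
```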
* Re: I/O Performance Tips 2010-12-09 8:10 I/O Performance Tips Sebastian Nickel - Hetzner Online AG 2010-12-09 8:55 ` RW @ 2010-12-09 10:30 ` Stefan Hajnoczi 2010-12-09 12:52 ` Sebastian Nickel - Hetzner Online AG 1 sibling, 1 reply; 5+ messages in thread From: Stefan Hajnoczi @ 2010-12-09 10:30 UTC (permalink / raw) To: sebastian.nickel; +Cc: kvm On Thu, Dec 9, 2010 at 8:10 AM, Sebastian Nickel - Hetzner Online AG <sebastian.nickel@hetzner.de> wrote: > Hello, > we have got some issues with I/O in our kvm environment. We are using > kernel version 2.6.32 (Ubuntu 10.04 LTS) to virtualise our hosts and we > are using ksm, too. Recently we noticed that sometimes the guest systems > (mainly OpenSuse guest systems) suddenly have a read only filesystem. > After some inspection we found out that the guest system generates some > ata errors due to timeouts (mostly in "flush cache" situations). On the > physical host there are always the same kernel messages when this > happens: > > """ > [1508127.195469] INFO: task kjournald:497 blocked for more than 120 > seconds. > [1508127.212828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [1508127.246841] kjournald D 00000000ffffffff 0 497 2 > 0x00000000 > [1508127.246848] ffff88062128dba0 0000000000000046 0000000000015bc0 > 0000000000015bc0 > [1508127.246855] ffff880621089ab0 ffff88062128dfd8 0000000000015bc0 > ffff8806210896f0 > [1508127.246862] 0000000000015bc0 ffff88062128dfd8 0000000000015bc0 > ffff880621089ab0 > [1508127.246868] Call Trace: > [1508127.246880] [<ffffffff8116e500>] ? sync_buffer+0x0/0x50 > [1508127.246889] [<ffffffff81557d87>] io_schedule+0x47/0x70 > [1508127.246893] [<ffffffff8116e545>] sync_buffer+0x45/0x50 > [1508127.246897] [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0 > [1508127.246901] [<ffffffff8116e500>] ? sync_buffer+0x0/0x50 > [1508127.246905] [<ffffffff81558338>] out_of_line_wait_on_bit_lock > +0x78/0x90 > [1508127.246911] [<ffffffff810850d0>] ? 
wake_bit_function+0x0/0x40 > [1508127.246915] [<ffffffff8116e6c6>] __lock_buffer+0x36/0x40 > [1508127.246920] [<ffffffff81213d11>] journal_submit_data_buffers > +0x311/0x320 > [1508127.246924] [<ffffffff81213ff2>] journal_commit_transaction > +0x2d2/0xe40 > [1508127.246931] [<ffffffff810397a9>] ? default_spin_lock_flags > +0x9/0x10 > [1508127.246935] [<ffffffff81076c7c>] ? lock_timer_base+0x3c/0x70 > [1508127.246939] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0 > [1508127.246943] [<ffffffff81217f0d>] kjournald+0xed/0x250 > [1508127.246947] [<ffffffff81085090>] ? autoremove_wake_function > +0x0/0x40 > [1508127.246951] [<ffffffff81217e20>] ? kjournald+0x0/0x250 > [1508127.246954] [<ffffffff81084d16>] kthread+0x96/0xa0 > [1508127.246959] [<ffffffff810141ea>] child_rip+0xa/0x20 > [1508127.246962] [<ffffffff81084c80>] ? kthread+0x0/0xa0 > [1508127.246966] [<ffffffff810141e0>] ? child_rip+0x0/0x20 > [1508127.246969] INFO: task flush-251:0:505 blocked for more than 120 > seconds. > [1508127.264076] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [1508127.298343] flush-251:0 D ffffffff810f4370 0 505 2 > 0x00000000 > [1508127.298349] ffff880621fdba30 0000000000000046 0000000000015bc0 > 0000000000015bc0 > [1508127.298354] ffff88062108b1a0 ffff880621fdbfd8 0000000000015bc0 > ffff88062108ade0 > [1508127.298358] 0000000000015bc0 ffff880621fdbfd8 0000000000015bc0 > ffff88062108b1a0 > [1508127.298362] Call Trace: > [1508127.298370] [<ffffffff810f4370>] ? sync_page+0x0/0x50 > [1508127.298375] [<ffffffff81557d87>] io_schedule+0x47/0x70 > [1508127.298379] [<ffffffff810f43ad>] sync_page+0x3d/0x50 > [1508127.298383] [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0 > [1508127.298391] [<ffffffff810f4347>] __lock_page+0x67/0x70 > [1508127.298395] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40 > [1508127.298402] [<ffffffff810f4487>] ? 
unlock_page+0x27/0x30 > [1508127.298410] [<ffffffff810fd9dd>] write_cache_pages+0x3bd/0x4d0 > [1508127.298417] [<ffffffff810fc670>] ? __writepage+0x0/0x40 > [1508127.298425] [<ffffffff810fdb14>] generic_writepages+0x24/0x30 > [1508127.298432] [<ffffffff810fdb55>] do_writepages+0x35/0x40 > [1508127.298439] [<ffffffff811668a6>] writeback_single_inode+0xf6/0x3d0 > [1508127.298449] [<ffffffff812b81d6>] ? rb_erase+0xd6/0x160 > [1508127.298455] [<ffffffff8116750e>] writeback_inodes_wb+0x40e/0x5e0 > [1508127.298462] [<ffffffff811677ea>] wb_writeback+0x10a/0x1d0 > [1508127.298469] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0 > [1508127.298477] [<ffffffff8155803d>] ? schedule_timeout+0x19d/0x300 > [1508127.298485] [<ffffffff81167b1c>] wb_do_writeback+0x18c/0x1a0 > [1508127.298493] [<ffffffff81167b83>] bdi_writeback_task+0x53/0xe0 > [1508127.298503] [<ffffffff8110f546>] bdi_start_fn+0x86/0x100 > [1508127.298510] [<ffffffff8110f4c0>] ? bdi_start_fn+0x0/0x100 > [1508127.298518] [<ffffffff81084d16>] kthread+0x96/0xa0 > [1508127.298526] [<ffffffff810141ea>] child_rip+0xa/0x20 > [1508127.298533] [<ffffffff81084c80>] ? kthread+0x0/0xa0 > [1508127.298541] [<ffffffff810141e0>] ? child_rip+0x0/0x20 > ... some more messages like those before... > """ > Sometimes the message "INFO: task kvm blocked for more than 120 seconds" > appears, too. I thought the error happens in cache writeback situations > so I started to adjust "/proc/sys/vm/dirty_background_ratio" to 5 and > "/proc/sys/vm/dirty_ratio" to 40. I thought this will write continously > smaller parts of cached memory to HDD (more often, but smaller chunks). > This did not help. There are still "readonly" filesystems in the guest > systems. Does anybody has some tips to regulate I/O on linux systems or > to stop those "readonly" filesystems? [...] > We are using logical volumes as virtual guest HDDs. The volume group is > on top of a mdraid device. 
I agree with Robert, the QEMU command-line used to launch these guests
would be useful (on the host: ps aux | grep kvm). Have you explicitly set
a -drive cache= mode (like "none", "writeback", or "writethrough")?

The first backtrace is odd. You are using logical volumes for the guest,
but the backtrace shows kjournald is blocked. I believe logical volumes
should not directly affect kjournald at all (they don't use journalling).
Perhaps this is a deadlock.

About the "flush-251:0:505" hang, please cat /proc/partitions on the host
to see which block device has major number 251 and minor number 0.

The fact that your host is having problems suggests the issue is not in
qemu-kvm (it's just a userspace process). Are you sure disk I/O works
under load on this machine without KVM?

Stefan

^ permalink raw reply	[flat|nested] 5+ messages in thread
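Stefan's major/minor lookup can be done directly against /proc/partitions
with awk. A sketch follows; the sample table is invented for illustration
(on the real host you would run the filter against /proc/partitions itself,
and a resulting dm-N device can be mapped to an LV name via
/sys/block/dm-N/dm/name or dmsetup):

```shell
# On the host this would be:
#   awk '$1 == 251 && $2 == 0 {print $4}' /proc/partitions
# Here the same filter runs against a made-up sample of the file format.
sample_partitions() {
cat <<'EOF'
major minor  #blocks  name

   8        0  976762584 sda
   8        1  976760832 sda1
 251        0   10485760 dm-0
 251        1   52428800 dm-1
EOF
}

# Print the device name with major 251, minor 0
sample_partitions | awk '$1 == 251 && $2 == 0 {print $4}'
```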
* Re: I/O Performance Tips
  2010-12-09 10:30 ` Stefan Hajnoczi
@ 2010-12-09 12:52 ` Sebastian Nickel - Hetzner Online AG
  2010-12-09 14:51 ` Stefan Hajnoczi
  0 siblings, 1 reply; 5+ messages in thread

From: Sebastian Nickel - Hetzner Online AG @ 2010-12-09 12:52 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: kvm

Hello,

> I agree with Robert, the QEMU command-line used to launch these guests
> would be useful (on host: ps aux | grep kvm). Have you explicitly set
> a -drive cache= mode (like "none", "writeback", or "writethrough")?

Here is the qemu command line we are using (which libvirt generates):

/usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -name vm-933 -uuid 0d737610-e59b-012d-f453-32287f7402ab -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/vm-933.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot nc -drive file=/dev/vg0/934,if=none,id=drive-ide0-0-0,boot=on,format=raw,cache=writeback -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -device rtl8139,vlan=0,id=net0,mac=00:1c:14:01:03:67,bus=pci.0,addr=0x3 -net tap,fd=23,vlan=0,name=hostnet0 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4

We are explicitly using "writeback" in the cache settings.

> The first backtrace is odd. You are using logical volumes for the
> guest but the backtrace shows kjournald is blocked. I believe logical
> volumes should not directly affect kjournald at all (they don't use
> journalling). Perhaps this is a deadlock.

The dmesg output was just an example. Most of the time I see the tasks
"kjournald" and "flush", but I have recently also seen "kvm", "kthreadd",
"rsyslogd" and others in such outputs. I suspect that sometimes
/proc/sys/vm/dirty_ratio gets exceeded, so all processes (kvm processes
too) are blocked from writing to the cache. Could this be the case?
I set "dirty_background_ratio" to 5 to flush the cache to disk constantly,
but this did not help.

> About the "flush-251:0:505" hang, please cat /proc/partitions on the
> host to see which block device has major number 251 and minor number 0
> is.

This is the logical volume holding the "root" partition of the physical
host.

> The fact that your host is having problems suggests the issue is not
> in qemu-kvm (it's just a userspace process). Are you sure disk I/O is
> working under load on this machine without KVM?

I do not think that kvm itself generates this issue (as you said, it is a
normal user-space process). I thought that perhaps somebody here knows how
to handle this situation, because the kvm developers have much more
experience with kvm than I do. Perhaps there are some tuning tips for this,
or somebody knows why only OpenSuse sets the filesystem read-only when
there are disk timeouts in the guest? This behaviour has appeared on almost
all of our hosts (>20), so I can rule out a hardware failure on a single
machine.

Thank you for your help...

Best Regards

Sebastian

On Thursday, 09.12.2010 at 10:30 +0000, Stefan Hajnoczi wrote:
> On Thu, Dec 9, 2010 at 8:10 AM, Sebastian Nickel - Hetzner Online AG
> <sebastian.nickel@hetzner.de> wrote:
> > Hello,
> > we have got some issues with I/O in our kvm environment. We are using
> > kernel version 2.6.32 (Ubuntu 10.04 LTS) to virtualise our hosts and we
> > are using ksm, too. Recently we noticed that sometimes the guest systems
> > (mainly OpenSuse guest systems) suddenly have a read only filesystem.
> > After some inspection we found out that the guest system generates some
> > ata errors due to timeouts (mostly in "flush cache" situations). On the
> > physical host there are always the same kernel messages when this
> > happens:
> >
> > """
> > [1508127.195469] INFO: task kjournald:497 blocked for more than 120
> > seconds.
> > [1508127.212828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [1508127.246841] kjournald D 00000000ffffffff 0 497 2 > > 0x00000000 > > [1508127.246848] ffff88062128dba0 0000000000000046 0000000000015bc0 > > 0000000000015bc0 > > [1508127.246855] ffff880621089ab0 ffff88062128dfd8 0000000000015bc0 > > ffff8806210896f0 > > [1508127.246862] 0000000000015bc0 ffff88062128dfd8 0000000000015bc0 > > ffff880621089ab0 > > [1508127.246868] Call Trace: > > [1508127.246880] [<ffffffff8116e500>] ? sync_buffer+0x0/0x50 > > [1508127.246889] [<ffffffff81557d87>] io_schedule+0x47/0x70 > > [1508127.246893] [<ffffffff8116e545>] sync_buffer+0x45/0x50 > > [1508127.246897] [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0 > > [1508127.246901] [<ffffffff8116e500>] ? sync_buffer+0x0/0x50 > > [1508127.246905] [<ffffffff81558338>] out_of_line_wait_on_bit_lock > > +0x78/0x90 > > [1508127.246911] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40 > > [1508127.246915] [<ffffffff8116e6c6>] __lock_buffer+0x36/0x40 > > [1508127.246920] [<ffffffff81213d11>] journal_submit_data_buffers > > +0x311/0x320 > > [1508127.246924] [<ffffffff81213ff2>] journal_commit_transaction > > +0x2d2/0xe40 > > [1508127.246931] [<ffffffff810397a9>] ? default_spin_lock_flags > > +0x9/0x10 > > [1508127.246935] [<ffffffff81076c7c>] ? lock_timer_base+0x3c/0x70 > > [1508127.246939] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0 > > [1508127.246943] [<ffffffff81217f0d>] kjournald+0xed/0x250 > > [1508127.246947] [<ffffffff81085090>] ? autoremove_wake_function > > +0x0/0x40 > > [1508127.246951] [<ffffffff81217e20>] ? kjournald+0x0/0x250 > > [1508127.246954] [<ffffffff81084d16>] kthread+0x96/0xa0 > > [1508127.246959] [<ffffffff810141ea>] child_rip+0xa/0x20 > > [1508127.246962] [<ffffffff81084c80>] ? kthread+0x0/0xa0 > > [1508127.246966] [<ffffffff810141e0>] ? child_rip+0x0/0x20 > > [1508127.246969] INFO: task flush-251:0:505 blocked for more than 120 > > seconds. > > [1508127.264076] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > > disables this message. 
> > [1508127.298343] flush-251:0 D ffffffff810f4370 0 505 2 > > 0x00000000 > > [1508127.298349] ffff880621fdba30 0000000000000046 0000000000015bc0 > > 0000000000015bc0 > > [1508127.298354] ffff88062108b1a0 ffff880621fdbfd8 0000000000015bc0 > > ffff88062108ade0 > > [1508127.298358] 0000000000015bc0 ffff880621fdbfd8 0000000000015bc0 > > ffff88062108b1a0 > > [1508127.298362] Call Trace: > > [1508127.298370] [<ffffffff810f4370>] ? sync_page+0x0/0x50 > > [1508127.298375] [<ffffffff81557d87>] io_schedule+0x47/0x70 > > [1508127.298379] [<ffffffff810f43ad>] sync_page+0x3d/0x50 > > [1508127.298383] [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0 > > [1508127.298391] [<ffffffff810f4347>] __lock_page+0x67/0x70 > > [1508127.298395] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40 > > [1508127.298402] [<ffffffff810f4487>] ? unlock_page+0x27/0x30 > > [1508127.298410] [<ffffffff810fd9dd>] write_cache_pages+0x3bd/0x4d0 > > [1508127.298417] [<ffffffff810fc670>] ? __writepage+0x0/0x40 > > [1508127.298425] [<ffffffff810fdb14>] generic_writepages+0x24/0x30 > > [1508127.298432] [<ffffffff810fdb55>] do_writepages+0x35/0x40 > > [1508127.298439] [<ffffffff811668a6>] writeback_single_inode+0xf6/0x3d0 > > [1508127.298449] [<ffffffff812b81d6>] ? rb_erase+0xd6/0x160 > > [1508127.298455] [<ffffffff8116750e>] writeback_inodes_wb+0x40e/0x5e0 > > [1508127.298462] [<ffffffff811677ea>] wb_writeback+0x10a/0x1d0 > > [1508127.298469] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0 > > [1508127.298477] [<ffffffff8155803d>] ? schedule_timeout+0x19d/0x300 > > [1508127.298485] [<ffffffff81167b1c>] wb_do_writeback+0x18c/0x1a0 > > [1508127.298493] [<ffffffff81167b83>] bdi_writeback_task+0x53/0xe0 > > [1508127.298503] [<ffffffff8110f546>] bdi_start_fn+0x86/0x100 > > [1508127.298510] [<ffffffff8110f4c0>] ? bdi_start_fn+0x0/0x100 > > [1508127.298518] [<ffffffff81084d16>] kthread+0x96/0xa0 > > [1508127.298526] [<ffffffff810141ea>] child_rip+0xa/0x20 > > [1508127.298533] [<ffffffff81084c80>] ? 
kthread+0x0/0xa0 > > [1508127.298541] [<ffffffff810141e0>] ? child_rip+0x0/0x20 > > ... some more messages like those before... > > """ > > Sometimes the message "INFO: task kvm blocked for more than 120 seconds" > > appears, too. I thought the error happens in cache writeback situations > > so I started to adjust "/proc/sys/vm/dirty_background_ratio" to 5 and > > "/proc/sys/vm/dirty_ratio" to 40. I thought this will write continously > > smaller parts of cached memory to HDD (more often, but smaller chunks). > > This did not help. There are still "readonly" filesystems in the guest > > systems. Does anybody has some tips to regulate I/O on linux systems or > > to stop those "readonly" filesystems? > [...] > > We are using logical volumes as virtual guest HDDs. The volume group is > > on top of a mdraid device. > > I agree with Robert, the QEMU command-line used to launch these guests > would be useful (on host: ps aux | grep kvm). Have you explicitly set > a -drive cache= mode (like "none", "writeback", or "writethrough")? > > The first backtrace is odd. You are using logical volumes for the > guest but the backtrace shows kjournald is blocked. I believe logical > volumes should not directly affect kjournald at all (they don't use > journalling). Perhaps this is a deadlock. > > About the "flush-251:0:505" hang, please cat /proc/partitions on the > host to see which block device has major number 251 and minor number 0 > is. > > The fact that your host is having problems suggests the issue is not > in qemu-kvm (it's just a userspace process). Are you sure disk I/O is > working under load on this machine without KVM? > > Stefan ^ permalink raw reply [flat|nested] 5+ messages in thread
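To try Robert's "cache=none" suggestion, the only part of the command line
Sebastian posted that needs to change is the -drive option. A sketch of the
rewrite as pure string manipulation, so it can be checked without touching
a VM (with libvirt in the picture, the change would normally be made in the
domain definition rather than by editing the generated command line):

```shell
# The -drive option exactly as posted earlier in the thread
drive='file=/dev/vg0/934,if=none,id=drive-ide0-0-0,boot=on,format=raw,cache=writeback'

# Swap the cache mode; valid values include none, writeback, writethrough
new_drive=$(echo "$drive" | sed 's/cache=writeback/cache=none/')
echo "$new_drive"
```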
* Re: I/O Performance Tips 2010-12-09 12:52 ` Sebastian Nickel - Hetzner Online AG @ 2010-12-09 14:51 ` Stefan Hajnoczi 0 siblings, 0 replies; 5+ messages in thread From: Stefan Hajnoczi @ 2010-12-09 14:51 UTC (permalink / raw) To: sebastian.nickel; +Cc: kvm, Christoph Hellwig On Thu, Dec 9, 2010 at 12:52 PM, Sebastian Nickel - Hetzner Online AG <sebastian.nickel@hetzner.de> wrote: > here is the qemu command line we are using (or which libvirt generates): > > /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 512 -smp > 1,sockets=1,cores=1,threads=1 -name vm-933 -uuid > 0d737610-e59b-012d-f453-32287f7402ab -nodefaults -chardev > socket,id=monitor,path=/var/lib/libvirt/qemu/vm-933.monitor,server,nowait -mon chardev=monitor,mode=readline -rtc base=utc -boot nc -drive file=/dev/vg0/934,if=none,id=drive-ide0-0-0,boot=on,format=raw,cache=writeback -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -device rtl8139,vlan=0,id=net0,mac=00:1c:14:01:03:67,bus=pci.0,addr=0x3 -net tap,fd=23,vlan=0,name=hostnet0 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4 > > We are explicit using "writeback" in cache settings. > >> The first backtrace is odd. You are using logical volumes for the >> guest but the backtrace shows kjournald is blocked. I believe logical >> volumes should not directly affect kjournald at all (they don't use >> journalling). Perhaps this is a deadlock. > The dmesg output was just an example. Most of the time I can see the tasks "kjournald" > and "flush". > I recently saw "kvm","kthreadd","rsyslogd" and others in such outputs. I thought > that sometimes /proc/sys/vm/dirty_ratio gets exceeded and so all processes are blocked > for writes to the cache (kvm processes too). Could this be the case? > I set "dirty_background_ratio" to 5 to constantly flush the cache to the disk, but this > did not help. 
> >> About the "flush-251:0:505" hang, please cat /proc/partitions on the >> host to see which block device has major number 251 and minor number 0 >> is. > This is our logical volume "root" partition of the physical host. > >> The fact that your host is having problems suggests the issue is not >> in qemu-kvm (it's just a userspace process). Are you sure disk I/O is >> working under load on this machine without KVM? > I do not think that kvm generates this issue (as you said it is a normal > user space process). I thought that perhaps somebody knows how to handle > this situation, because the kvm developers have much more experience with > kvm than I do. Perhaps there are some tuning tips for this or anybody knows > why only OpenSuse sets the filesystem read only if there are disk timeouts > in the guest? This behavior appeared on almost all hosts (>20) so I can > eliminate a single machine HW failure. Christoph any pointers on how to debug this? The backtraces from the original email are below: > Am Donnerstag, den 09.12.2010, 10:30 +0000 schrieb Stefan Hajnoczi: >> On Thu, Dec 9, 2010 at 8:10 AM, Sebastian Nickel - Hetzner Online AG >> <sebastian.nickel@hetzner.de> wrote: >> > Hello, >> > we have got some issues with I/O in our kvm environment. We are using >> > kernel version 2.6.32 (Ubuntu 10.04 LTS) to virtualise our hosts and we >> > are using ksm, too. Recently we noticed that sometimes the guest systems >> > (mainly OpenSuse guest systems) suddenly have a read only filesystem. >> > After some inspection we found out that the guest system generates some >> > ata errors due to timeouts (mostly in "flush cache" situations). On the >> > physical host there are always the same kernel messages when this >> > happens: >> > >> > """ >> > [1508127.195469] INFO: task kjournald:497 blocked for more than 120 >> > seconds. >> > [1508127.212828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >> > disables this message. 
>> > [1508127.246841] kjournald     D 00000000ffffffff     0   497      2 0x00000000
>> > [1508127.246848] ffff88062128dba0 0000000000000046 0000000000015bc0 0000000000015bc0
>> > [1508127.246855] ffff880621089ab0 ffff88062128dfd8 0000000000015bc0 ffff8806210896f0
>> > [1508127.246862] 0000000000015bc0 ffff88062128dfd8 0000000000015bc0 ffff880621089ab0
>> > [1508127.246868] Call Trace:
>> > [1508127.246880] [<ffffffff8116e500>] ? sync_buffer+0x0/0x50
>> > [1508127.246889] [<ffffffff81557d87>] io_schedule+0x47/0x70
>> > [1508127.246893] [<ffffffff8116e545>] sync_buffer+0x45/0x50
>> > [1508127.246897] [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0
>> > [1508127.246901] [<ffffffff8116e500>] ? sync_buffer+0x0/0x50
>> > [1508127.246905] [<ffffffff81558338>] out_of_line_wait_on_bit_lock+0x78/0x90
>> > [1508127.246911] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
>> > [1508127.246915] [<ffffffff8116e6c6>] __lock_buffer+0x36/0x40
>> > [1508127.246920] [<ffffffff81213d11>] journal_submit_data_buffers+0x311/0x320
>> > [1508127.246924] [<ffffffff81213ff2>] journal_commit_transaction+0x2d2/0xe40
>> > [1508127.246931] [<ffffffff810397a9>] ? default_spin_lock_flags+0x9/0x10
>> > [1508127.246935] [<ffffffff81076c7c>] ? lock_timer_base+0x3c/0x70
>> > [1508127.246939] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0
>> > [1508127.246943] [<ffffffff81217f0d>] kjournald+0xed/0x250
>> > [1508127.246947] [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
>> > [1508127.246951] [<ffffffff81217e20>] ? kjournald+0x0/0x250
>> > [1508127.246954] [<ffffffff81084d16>] kthread+0x96/0xa0
>> > [1508127.246959] [<ffffffff810141ea>] child_rip+0xa/0x20
>> > [1508127.246962] [<ffffffff81084c80>] ? kthread+0x0/0xa0
>> > [1508127.246966] [<ffffffff810141e0>] ? child_rip+0x0/0x20
>> > [1508127.246969] INFO: task flush-251:0:505 blocked for more than 120 seconds.
>> > [1508127.264076] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> > [1508127.298343] flush-251:0   D ffffffff810f4370     0   505      2 0x00000000
>> > [1508127.298349] ffff880621fdba30 0000000000000046 0000000000015bc0 0000000000015bc0
>> > [1508127.298354] ffff88062108b1a0 ffff880621fdbfd8 0000000000015bc0 ffff88062108ade0
>> > [1508127.298358] 0000000000015bc0 ffff880621fdbfd8 0000000000015bc0 ffff88062108b1a0
>> > [1508127.298362] Call Trace:
>> > [1508127.298370] [<ffffffff810f4370>] ? sync_page+0x0/0x50
>> > [1508127.298375] [<ffffffff81557d87>] io_schedule+0x47/0x70
>> > [1508127.298379] [<ffffffff810f43ad>] sync_page+0x3d/0x50
>> > [1508127.298383] [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0
>> > [1508127.298391] [<ffffffff810f4347>] __lock_page+0x67/0x70
>> > [1508127.298395] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
>> > [1508127.298402] [<ffffffff810f4487>] ? unlock_page+0x27/0x30
>> > [1508127.298410] [<ffffffff810fd9dd>] write_cache_pages+0x3bd/0x4d0
>> > [1508127.298417] [<ffffffff810fc670>] ? __writepage+0x0/0x40
>> > [1508127.298425] [<ffffffff810fdb14>] generic_writepages+0x24/0x30
>> > [1508127.298432] [<ffffffff810fdb55>] do_writepages+0x35/0x40
>> > [1508127.298439] [<ffffffff811668a6>] writeback_single_inode+0xf6/0x3d0
>> > [1508127.298449] [<ffffffff812b81d6>] ? rb_erase+0xd6/0x160
>> > [1508127.298455] [<ffffffff8116750e>] writeback_inodes_wb+0x40e/0x5e0
>> > [1508127.298462] [<ffffffff811677ea>] wb_writeback+0x10a/0x1d0
>> > [1508127.298469] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0
>> > [1508127.298477] [<ffffffff8155803d>] ? schedule_timeout+0x19d/0x300
>> > [1508127.298485] [<ffffffff81167b1c>] wb_do_writeback+0x18c/0x1a0
>> > [1508127.298493] [<ffffffff81167b83>] bdi_writeback_task+0x53/0xe0
>> > [1508127.298503] [<ffffffff8110f546>] bdi_start_fn+0x86/0x100
>> > [1508127.298510] [<ffffffff8110f4c0>] ? bdi_start_fn+0x0/0x100
>> > [1508127.298518] [<ffffffff81084d16>] kthread+0x96/0xa0
>> > [1508127.298526] [<ffffffff810141ea>] child_rip+0xa/0x20
>> > [1508127.298533] [<ffffffff81084c80>] ? kthread+0x0/0xa0
>> > [1508127.298541] [<ffffffff810141e0>] ? child_rip+0x0/0x20
>> > ... some more messages like those before...
>> > """
>> > Sometimes the message "INFO: task kvm blocked for more than 120
>> > seconds" appears, too. I thought the error happens in cache writeback
>> > situations, so I started to adjust "/proc/sys/vm/dirty_background_ratio"
>> > to 5 and "/proc/sys/vm/dirty_ratio" to 40. I thought this would
>> > continuously write smaller parts of cached memory to the HDD (more
>> > often, but in smaller chunks). This did not help. There are still
>> > "read-only" filesystems in the guest systems. Does anybody have some
>> > tips to regulate I/O on Linux systems or to stop those "read-only"
>> > filesystems?
>> [...]
>> > We are using logical volumes as virtual guest HDDs. The volume group
>> > is on top of a mdraid device.
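Stefan's /proc/partitions suggestion can be sketched concretely: the numbers in a "flush-<major>:<minor>" task name are the block device's major and minor numbers as listed in /proc/partitions. A self-contained sketch using a sample file, since the real contents differ per host:

```shell
# Map a "flush-251:0" writeback task back to its block device by looking
# up the major:minor pair in /proc/partitions. A sample file stands in
# for the real /proc/partitions here so the sketch is reproducible.
cat <<'EOF' > /tmp/partitions.sample
major minor  #blocks  name

   8        0  976762584 sda
   8        1  975000000 sda1
 251        0   52428800 dm-0
EOF
# flush-251:0 -> print the device with major 251, minor 0:
awk '$1 == 251 && $2 == 0 { print $4 }' /tmp/partitions.sample
```

In this sample the answer is dm-0, a device-mapper node, which matches Sebastian's report that 251:0 is the host's root logical volume; on a real host, run the awk line against /proc/partitions itself.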
Thread overview: 5+ messages
2010-12-09  8:10 I/O Performance Tips  Sebastian Nickel - Hetzner Online AG
2010-12-09  8:55 ` RW
2010-12-09 10:30 ` Stefan Hajnoczi
2010-12-09 12:52   ` Sebastian Nickel - Hetzner Online AG
2010-12-09 14:51     ` Stefan Hajnoczi