From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jes Sorensen <Jes.Sorensen@redhat.com>
Subject: Re: kvm + raid1 showstopper bug
Date: Tue, 21 Feb 2012 13:40:28 +0100
Message-ID: <4F4390BC.5060809@redhat.com>
References: <20120217045733.GC31397@xmission.com>	<CAJSP0QVag9qNnd=ZN+fR9wExX1ZB1=OafROt=YkzNC8rJNpF0A@mail.gmail.com>	<4F3E72B5.6030502@xmission.com> <CAJSP0QUBnnmQepv8AZuXdg+mEXY6ckPMeUTuW_drDD3yLEtouw@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: Pete Ashdown <pashdown@xmission.com>, kvm@vger.kernel.org,
	Aaron Toponce <atoponce@xmission.com>
To: Stefan Hajnoczi <stefanha@gmail.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:39858 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753804Ab2BUMkb (ORCPT <rfc822;kvm@vger.kernel.org>);
	Tue, 21 Feb 2012 07:40:31 -0500
In-Reply-To: <CAJSP0QUBnnmQepv8AZuXdg+mEXY6ckPMeUTuW_drDD3yLEtouw@mail.gmail.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On 02/18/12 14:25, Stefan Hajnoczi wrote:
> On Fri, Feb 17, 2012 at 3:31 PM, Pete Ashdown <pashdown@xmission.com> wrote:
>> > On 02/17/2012 04:30 AM, Stefan Hajnoczi wrote:
>>> >> On Fri, Feb 17, 2012 at 4:57 AM, Pete Ashdown <pashdown@xmission.com> wrote:
>>>> >>> I've been waiting for some response from the Ubuntu team regarding a bug on
>>>> >>> launchpad, but it appears that it isn't being taken seriously:
>>>> >>>
>>>> >>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/745785
>>> >> This looks interesting.  Let me try to summarize, please point out if
>>> >> I get something wrong:
>>> >>
>>> >> You have software RAID1 on the host, your disk images live on this
>>> >> device.  Whenever checkarray runs on the host you find that VMs become
>>> >> unresponsive.  Guests print warnings that a task is blocked for more
>>> >> than 120 seconds.  Guests become unresponsive on the network.
>> > In my case, it is drbd+RAID10, but the bug still applies.  It isn't
>> > whenever checkarray runs, but whenever checkarray decides to do a resync,
>> > it will block all IO somewhere before the end of the resync.  Then yes, it
>> > isn't long before the guests start to fail due to their inability to
>> > read/write.
> I have not attempted to reproduce this yet but have taken a look at
> drivers/md/raid10.c resync code.  md resync uses a similar mechanism
> for RAID1 and RAID10.  While a block is being synced the entire device
> will force regular I/O requests to wait.  There are tunables which let
> you rate-limit resyncing; I think this can solve your problem.
> Perhaps the resync is too aggressive and is impacting regular I/O so
> much that the guest is warning about it.  See Documentation/md.txt for
> sync_speed_max and other sysfs attributes.
> 
> The bug report suggests qemu-kvm itself is operating fine because the
> guest is still executing and VNC/monitor are alive.  After a while the
> guest warns about the stuck I/O.
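
For illustration, the per-array resync limit mentioned above can be read
and changed through sysfs. The following is a minimal Python sketch, not
a tested fix for this bug. It assumes the array is md0 (so the directory
is /sys/block/md0/md), that values are in KiB/s as described in
Documentation/md.txt, and that writes are done as root. The helper names
are hypothetical.

```python
from pathlib import Path


def parse_speed(text):
    """Parse a sysfs value such as '200000 (system)' into (kib_per_s, origin).

    Per Documentation/md.txt, reads show the active value followed by
    '(local)' or '(system)'. 'unknown' is returned if no suffix is present.
    """
    parts = text.split()
    value = int(parts[0])
    origin = parts[1].strip("()") if len(parts) > 1 else "unknown"
    return value, origin


def read_sync_speed_max(md_dir):
    """Return the current resync ceiling for the array under md_dir."""
    return parse_speed(Path(md_dir, "sync_speed_max").read_text())


def set_sync_speed_max(md_dir, kib_per_s):
    """Write a per-array resync ceiling in KiB/s (needs root on real sysfs)."""
    if kib_per_s <= 0:
        raise ValueError("sync_speed_max must be a positive number of KiB/s")
    Path(md_dir, "sync_speed_max").write_text(f"{kib_per_s}\n")
```

Something like set_sync_speed_max("/sys/block/md0/md", 10000) would then
cap resync for that one array. The value 10000 is only an example; a
suitable limit depends on the disks and on the guest I/O load.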

It could be a bug in the raid1/raid10 code that is triggered by the
way qemu keeps files on it (O_DIRECT or something). However, there have
been a *lot* of fixes to the raid code since 2.6.38 (which is the kernel
version I saw referenced in the launchpad link). Please try to
reproduce it with a more up-to-date kernel.

I have no idea what drbd8 is or what relation it has to the kernel or
kvm here.

Cheers,
Jes