From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1KQi2S-0005wQ-4k
	for qemu-devel@nongnu.org; Wed, 06 Aug 2008 08:22:48 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1KQi2Q-0005tw-8v
	for qemu-devel@nongnu.org; Wed, 06 Aug 2008 08:22:47 -0400
Received: from [199.232.76.173] (port=48525 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1KQi2Q-0005tR-0F
	for qemu-devel@nongnu.org; Wed, 06 Aug 2008 08:22:46 -0400
Received: from mail2.shareable.org ([80.68.89.115]:58824)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60) (envelope-from <jamie@shareable.org>) id 1KQi2P-0007kP-JG
	for qemu-devel@nongnu.org; Wed, 06 Aug 2008 08:22:45 -0400
Date: Wed, 6 Aug 2008 13:22:43 +0100
From: Jamie Lokier <jamie@shareable.org>
Subject: Re: [Qemu-devel] [PATCH] report read/write errors to IDE guest driver
	as ECC errors
Message-ID: <20080806122241.GA14937@shareable.org>
References: <20080805115506.GR4478@implementation.uk.xensource.com>
	<48990BC6.1050503@codemonkey.ws> <20080806092822.GC9055@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20080806092822.GC9055@redhat.com>
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Daniel P. Berrange" <berrange@redhat.com>, qemu-devel@nongnu.org

Daniel P. Berrange wrote:
> If you have a journalling filesystem, the worst that'll happen in the
> ENOSPC scenario is that you'll lose data from the open application files
> that aren't flushed to disk - no different to pulling the power plug.
> The filesystem itself will not corrupt itself - it'll happily recover
> the journal & carry on after rebooting.

So the filesystem's ok but the application's files are corrupt -
doesn't sound too good :-)

Journalling filesystems are supposed to be robust against sudden
reboot/power failure (despite this basic expectation, Linux ext3 is
not robust against power failure by default).

Journalled filesystems should also be robust against I/O errors, but
in fact that would require an IOP sequence like WRITE, BARRIER, WRITE
to abort the second WRITE if the first one fails with an I/O error.
Linux does not abort the second WRITE - and therefore an isolated
write I/O error can result in filesystem corruption on all its
journalled filesystems.  When TCQ/NCQ are used, all commands may be in
flight concurrently, so I'm not sure if it's even possible to auto-abort
the second WRITE when the first errors, in any guest.
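
To make the WRITE, BARRIER, WRITE requirement concrete, here is a rough
sketch of the ordering rule I mean.  It is only a toy model: the names
(SketchOp, sketch_run_ordered, inject_error) are made up for illustration
and are not Linux or QEMU interfaces, and it assumes commands complete
strictly in order, which is exactly what TCQ/NCQ do not guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical request kinds for a toy ordered queue. */
typedef enum { OP_WRITE, OP_BARRIER } SketchOpKind;

typedef struct SketchOp {
    SketchOpKind kind;
    int inject_error; /* errno the simulated device returns for a write; 0 = success */
    int result;       /* filled in: 0, -errno, or -ECANCELED if never issued */
} SketchOp;

/*
 * Run the requests in order.  If any write fails, the next barrier and
 * everything queued after it are aborted instead of being issued.
 * Returns the number of writes actually sent to the simulated device.
 */
static size_t sketch_run_ordered(SketchOp *ops, size_t n)
{
    int failed_in_epoch = 0; /* a write has failed since the last barrier */
    int aborting = 0;        /* a barrier was reached after a failure */
    size_t issued = 0;

    assert(ops != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (aborting) {
            ops[i].result = -ECANCELED;
            continue;
        }
        if (ops[i].kind == OP_BARRIER) {
            if (failed_in_epoch) {
                aborting = 1;
                ops[i].result = -ECANCELED;
            } else {
                ops[i].result = 0;
            }
            continue;
        }
        /* Simulated write: the device reports the injected error, if any. */
        issued++;
        ops[i].result = -ops[i].inject_error;
        if (ops[i].inject_error != 0) {
            failed_in_epoch = 1;
        }
    }
    return issued;
}
```

With a failing WRITE, a BARRIER, then a second WRITE, only the first
write reaches the device and the second comes back as -ECANCELED.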

(There are also weaknesses in Linux's handling of I/O errors in the
VM, discussed recently with a "sweep it under the carpet, handling I/O
errors properly in the VM is too hard" conclusion.)

I wouldn't be surprised if other guests have similar weaknesses.

Solaris ZFS may be an exception, as they claim to have thoroughly
tested it with simulated I/O errors.

Therefore, at least, when QEMU reports a write I/O error due to ENOSPC
(and perhaps due to EIO), it should set a sticky flag so that all
subsequent writes error without trying to write.

> Unless someone wants to implement the ENOSPC handling right now, I'd 
> like to see this patch just committed as is, so we at least get
> incremental benefit over current behaviour, which definitely *does*
> corrupt guest filesystems by silently pretending the write succeeded.
> Special ENOSPC handling can be added on top.

I suggest adding a sticky flag:  Once ENOSPC is hit (due to extending
qcow2), all further writes should fail even if they don't need to
extend the file.  This will prevent some kinds of guest journalled
filesystem corruption.
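
Roughly what I have in mind, as a minimal sketch.  SketchDisk and
sketch_disk_write are hypothetical names, not anything in the QEMU block
layer, and the "host_limit" field just stands in for the host running
out of space while the image file is being extended.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a growable image file; not a QEMU structure. */
typedef struct SketchDisk {
    size_t allocated;    /* bytes already backed by host storage */
    size_t host_limit;   /* bytes the host can provide in total */
    int sticky_errno;    /* 0 while healthy, else the first latched errno */
} SketchDisk;

/*
 * Returns 0 on success or a negative errno.  No overflow check on
 * offset + len; this is a sketch only.
 */
static int sketch_disk_write(SketchDisk *d, size_t offset, size_t len)
{
    size_t end = offset + len;

    assert(d != NULL);

    /* Once latched, fail every write, even one that needs no new space. */
    if (d->sticky_errno != 0) {
        return -d->sticky_errno;
    }

    if (end > d->allocated) {
        if (end > d->host_limit) {
            /* Extending the image failed: latch the error. */
            d->sticky_errno = ENOSPC;
            return -ENOSPC;
        }
        d->allocated = end;
    }
    return 0;
}
```

The point is the early return: after the first ENOSPC, a later write to
an already-allocated region also fails, so the guest never sees a write
succeed after an earlier one failed.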

> I agree that pausing the guest is probably best option in that scenario,
> the interesting question being how to inform management tools/API that
> the VM has just paused itself. In libvirt we handle pause/resume by doing
> 'stop'/'cont' in the QEMU monitor, and since we're triggering it ourselves
> we can track the state change from running to paused. If the VM pauses
> itself though we need to figure out a way to detect this state change.
> The monitor doesn't have any asynchronous notification capability as it
> stands.

It does have the log file, I suppose, or it could poll the CPU state
every so often.  Not the prettiest mechanisms.

-- Jamie