From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=34858 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1Otb0U-0002iy-2J
	for qemu-devel@nongnu.org; Thu, 09 Sep 2010 02:53:15 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <avi@redhat.com>) id 1Otb0S-0007Vj-Ur
	for qemu-devel@nongnu.org; Thu, 09 Sep 2010 02:53:14 -0400
Received: from mx1.redhat.com ([209.132.183.28]:56241)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <avi@redhat.com>) id 1Otb0S-0007Ve-NP
	for qemu-devel@nongnu.org; Thu, 09 Sep 2010 02:53:12 -0400
Message-ID: <4C888451.9070306@redhat.com>
Date: Thu, 09 Sep 2010 09:53:05 +0300
From: Avi Kivity <avi@redhat.com>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [RFC] qed: Add QEMU Enhanced Disk format
References: <1283767478-16740-1-git-send-email-stefanha@linux.vnet.ibm.com>	<4C84E738.3020802@codemonkey.ws>	<4C865187.6090508@redhat.com>	<4C865CFE.7010508@codemonkey.ws>	<4C8663C4.1090508@redhat.com>	<4C866773.2030103@codemonkey.ws>	<4C86BC6B.5010809@codemonkey.ws>	<4C874812.9090807@redhat.com>	<395D4377-00F9-4765-94C4-470BDFA1F96E@suse.de>	<4C874F22.6060802@redhat.com>
	<AANLkTik+NHXjVmW5ozSGOOLf_FeQE8DHhoQPN6LOpFW0@mail.gmail.com>
In-Reply-To: <AANLkTik+NHXjVmW5ozSGOOLf_FeQE8DHhoQPN6LOpFW0@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-devel@nongnu.org, Alexander Graf <agraf@suse.de>, Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

On 09/08/2010 02:15 PM, Stefan Hajnoczi wrote:
> 3. Metadata update reaches disk but data does not.  The interesting
> case!  The L2 table now points to a cluster which is beyond the last
> cluster in the image file.  Remember that file size is rounded down by
> cluster size, so partial data writes are discarded and this case
> applies.
>
> Now we're in trouble.  The image cannot be accessed without some
> sanity checking because not only do table entries point to invalid
> clusters, but new allocating writes might make previously invalid
> cluster offsets valid again (then there would be two or more table
> entries pointing to the same cluster)!
>
> Anthony's suggestion is to use a "mounted" or "dirty" bit in the qed
> header to detect a crashed image when opening the image file.  If no
> crash has occurred, then the mounted bit is unset and normal operation
> is safe.  If the mounted bit is set, then a check of the L1/L2 tables
> must be performed and any invalid cluster offsets must be cleared to
> zero.  When an invalid cluster is cleared to zero, we arrive back at
> case 1 above: neither data write nor metadata update reached the disk,
> and we are in a safe state.
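
To make the proposed check concrete, here is a rough sketch of the 
"clear invalid offsets" pass.  It is not qed code: the table is modelled 
as a plain Python list of byte offsets, and the function name and 
parameters are made up for illustration.  An entry counts as invalid 
when its cluster would extend past the file size rounded down to a 
cluster boundary.

```python
def clear_invalid_offsets(table, file_size, cluster_size):
    """Zero every entry whose cluster lies beyond the last full cluster.

    `table` is a hypothetical in-memory list of cluster byte offsets
    (0 means unallocated). Returns the number of entries cleared.
    """
    # File size is rounded down to a whole number of clusters.
    valid_end = (file_size // cluster_size) * cluster_size
    cleared = 0
    for i, offset in enumerate(table):
        if offset != 0 and offset + cluster_size > valid_end:
            table[i] = 0  # back to "neither data nor metadata reached disk"
            cleared += 1
    return cleared


table = [0, 65536, 131072, 262144]
print(clear_invalid_offsets(table, file_size=200000, cluster_size=65536), table)
```

A real check would have to do this for L1 and for every L2 table it 
points to, which is where the I/O cost comes from.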

While fsck has a lovely ext2 retro feel, there's a reason it's shunned - 
it can take quite a while to run.  A fully loaded L1 with 32K entries 
will require 32K random I/Os (one per L2 table), which can take over 
5 minutes on a disk that provides 100 IOPS.  On a large shared disk, 
you'll have a lot more IOPS in total, but likely far fewer IOPS per 
guest, so if you have a power loss, fsck time per guest will likely be 
longer (irrespective of guest size).
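
The arithmetic, as a back-of-the-envelope sketch (the function name is 
made up, and it assumes exactly one random I/O per L2 table and nothing 
else competing for the disk):

```python
def fsck_seconds(l2_tables, iops):
    """Seconds needed to read every L2 table at one random I/O each."""
    return l2_tables / iops


entries = 32 * 1024  # fully loaded L1
seconds = fsck_seconds(entries, 100)
print(f"{seconds:.0f} s = {seconds / 60:.1f} min")  # 328 s = 5.5 min
```

Halve the IOPS available to a guest and the time doubles.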

Preallocation, on the other hand, is amortized, or you can piggy-back 
its fsync on a guest flush.  Note it's equally applicable to qcow2 and qed.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.