From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from [140.186.70.92] (port=58305 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1OoACe-00045Q-Rc
	for qemu-devel@nongnu.org; Wed, 25 Aug 2010 03:15:22 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <avi@redhat.com>) id 1OoACd-0003uN-M4
	for qemu-devel@nongnu.org; Wed, 25 Aug 2010 03:15:20 -0400
Received: from mx1.redhat.com ([209.132.183.28]:11603)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <avi@redhat.com>) id 1OoACd-0003uH-Cn
	for qemu-devel@nongnu.org; Wed, 25 Aug 2010 03:15:19 -0400
Message-ID: <4C74C2F3.9050506@redhat.com>
Date: Wed, 25 Aug 2010 10:14:59 +0300
From: Avi Kivity <avi@redhat.com>
MIME-Version: 1.0
References: <1282646430-5777-1-git-send-email-kwolf@redhat.com>
	<4C73C2BF.8050300@codemonkey.ws> <4C73C622.7080808@redhat.com>
	<4C73C926.3010901@codemonkey.ws> <4C73C9CF.7090800@redhat.com>
	<4C73CAA9.2060104@codemonkey.ws> <4C73CB85.9010306@redhat.com>
	<4C73CBD6.7000900@codemonkey.ws> <4C73CCCB.6050704@redhat.com>
	<4C73CF8D.5060405@codemonkey.ws>
In-Reply-To: <4C73CF8D.5060405@codemonkey.ws>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] Re: [RFC][STABLE 0.13] Revert "qcow2: Use
 bdrv_(p)write_sync for metadata writes"
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Kevin Wolf <kwolf@redhat.com>, stefanha@gmail.com, mjt@tls.msk.ru, qemu-devel@nongnu.org, hch@lst.de

On 08/24/2010 04:56 PM, Anthony Liguori wrote:
>> One doesn't follow from the other (though I'm no fan of internal 
>> snapshots, myself).
>
>
> It does.  Let's consider the failure scenarios:
>
> 1) guest submits write request
> 2) allocate extent
> 3) write data to disk (a)
> 4) write (a) completes
> 5) update reference count table for new extent (b)
> 6) write (b) completes
> 7) write extent table (c)
> 8) write (c) completes
> 9) complete guest write request
>
> If this all happened in order and we lost power, the worst case error 
> is that we leak a block which isn't terrible.
>
> But we're not guaranteed that this happens in order.
>
> If (b) or (c) happen before (a), then the image is not corrupted but 
> data gets lost.  That's okay because it's part of the guest contract.
>
> If (c) happens before (b), then we've created an extent that's 
> attached to a table with a zero reference count.  This is a corrupt 
> image.
>

If the only issue is new block allocation, it can be easily solved.
Instead of allocating exactly the number of blocks needed, allocate a
large extent and hold its blocks in memory.  The next allocation can then
be filled from memory, so the allocation sync is amortized over many
blocks.  A power failure will leak the preallocated blocks, losing some
megabytes of address space, but not real disk space.
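
A minimal sketch of the idea in Python, not qcow2 code: the class name,
the chunk size, and the sync counter are all made up for illustration,
and the on-disk metadata update is reduced to a counter increment.

```python
from collections import deque


class BatchAllocator:
    """Hands out block numbers from an in-memory pool refilled in large chunks.

    Hypothetical illustration only; nothing here is a real QEMU interface.
    """

    def __init__(self, chunk_blocks=1024):
        self.chunk_blocks = chunk_blocks  # blocks reserved per refill (assumed value)
        self.next_unreserved = 0          # first block not yet marked allocated on disk
        self.pool = deque()               # reserved but unused blocks, held only in memory
        self.sync_count = 0               # number of allocation syncs issued so far

    def _reserve_chunk(self):
        # Stand-in for: mark the whole chunk allocated in the on-disk
        # metadata, then sync once for the entire chunk.
        start = self.next_unreserved
        self.next_unreserved += self.chunk_blocks
        self.sync_count += 1
        self.pool.extend(range(start, start + self.chunk_blocks))

    def allocate(self):
        # Most calls are served from memory and need no sync at all.
        if not self.pool:
            self._reserve_chunk()
        return self.pool.popleft()

    def leaked_on_power_fail(self):
        # Blocks that are marked allocated on disk but were never handed out.
        return len(self.pool)


alloc = BatchAllocator(chunk_blocks=4)
blocks = [alloc.allocate() for _ in range(6)]
```

With a chunk of 4 blocks, six allocations cost two syncs instead of six,
and a crash at that point leaks the two blocks still sitting in the pool.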


> Let's consider if we eliminate the reference count table which means 
> eliminating internal snapshots.
>
> 1) guest submits write request
> 2) allocate extent
> 3) write data to disk (a)
> 4) write (a) completes
> 5) write extent table (c)
> 6) write (c) completes
> 7) complete guest write request
>
> If this all happens in order and we lose power, we just leak a block.  
> It means we need a periodic fsck.
>
> If (c) completes before (a), then it means that the image is not 
> corrupted but data gets lost.  This is okay based on the guest contract.
>
> And that's it.  There is no scenario where the disk is corrupted.

_if_ that's the only failure mode.
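
For the allocation path alone, the ordering argument quoted above can be
checked mechanically with a toy model.  This is a sketch under strong
assumptions: it only considers the three writes (a), (b) and (c) from
the lists above, assumes any subset of them may have reached the disk
when power is lost, and uses made-up function names.  It says nothing
about other failure modes.

```python
from itertools import product


def classify(a, b, c, use_refcounts):
    """Classify the on-disk state after a crash.

    a: data write persisted, b: refcount update persisted,
    c: extent table update persisted.
    """
    if use_refcounts and c and not b:
        return "corrupt"    # extent is reachable but its refcount is zero
    if c and not a:
        return "data lost"  # mapping points at data that never hit the disk
    if not c and (b if use_refcounts else a):
        return "leak"       # space consumed but not referenced by any table
    return "ok"


def outcomes(use_refcounts):
    """Collect every outcome reachable over all subsets of persisted writes."""
    seen = set()
    for a, b, c in product([False, True], repeat=3):
        if not use_refcounts and b:
            continue        # write (b) does not exist without a refcount table
        seen.add(classify(a, b, c, use_refcounts))
    return seen


with_refcounts = outcomes(use_refcounts=True)
without_refcounts = outcomes(use_refcounts=False)
```

In this model "corrupt" is reachable only when the refcount table
exists, which matches the quoted argument, but only for the writes the
model knows about.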


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.