Message-ID: <4C74C2F3.9050506@redhat.com>
Date: Wed, 25 Aug 2010 10:14:59 +0300
From: Avi Kivity
In-Reply-To: <4C73CF8D.5060405@codemonkey.ws>
Subject: [Qemu-devel] Re: [RFC][STABLE 0.13] Revert "qcow2: Use bdrv_(p)write_sync for metadata writes"
To: Anthony Liguori
Cc: Kevin Wolf, stefanha@gmail.com, mjt@tls.msk.ru, qemu-devel@nongnu.org, hch@lst.de

On 08/24/2010 04:56 PM, Anthony Liguori wrote:
>> One doesn't follow from the other (though I'm no fan of internal
>> snapshots, myself).
>
> It does.  Let's consider the failure scenarios:
>
> 1) guest submits write request
> 2) allocate extent
> 3) write data to disk (a)
> 4) write (a) completes
> 5) update reference count table for new extent (b)
> 6) write (b) completes
> 7) write extent table (c)
> 8) write (c) completes
> 9) complete guest write request
>
> If this all happened in order and we lost power, the worst case error
> is that we leak a block which isn't terrible.
>
> But we're not guaranteed that this happens in order.
>
> If (b) or (c) happen before (a), then the image is not corrupted but
> data gets lost.  That's okay because it's part of the guest contract.
>
> If (c) happens before (b), then we've created an extent that's
> attached to a table with a zero reference count.  This is a corrupt
> image.

If the only issue is new block allocation, it can be easily solved.
Instead of allocating exactly the needed number of blocks, allocate a
large extent and hold it in memory.  The next allocation can then be
filled from memory, so the allocation sync is amortized over many
blocks.  A power failure will leak the preallocated blocks, losing some
megabytes of address space, but not real disk space.

> Let's consider if we eliminate the reference count table which means
> eliminating internal snapshots.
>
> 1) guest submits write request
> 2) allocate extent
> 3) write data to disk (a)
> 4) write (a) completes
> 5) write extent table (c)
> 6) write (c) completes
> 7) complete guest write request
>
> If this all happens in order and we lose power, we just leak a block.
> It means we need a periodic fsck.
>
> If (c) completes before (a), then it means that the image is not
> corrupted but data gets lost.  This is okay based on the guest contract.
>
> And that's it.  There is no scenario where the disk is corrupted.

_if_ that's the only failure mode.

-- 
I have a truly marvellous patch that fixes the bug which this signature is too
narrow to contain.
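
The preallocation idea in the reply above (reserve a large extent with one
synced metadata update, then serve later allocations from memory) can be
modelled in a few lines.  This is only a toy Python sketch, not QEMU or qcow2
code: the class name, `batch_size`, and the counters are hypothetical, and a
"sync" here simply stands for one refcount-table write plus flush.  It also
ignores clusters that were handed out but whose extent-table update never
reached disk; those would leak as well.

```python
from collections import deque


class PreallocatingAllocator:
    """Toy model of batched cluster allocation (hypothetical, not qcow2)."""

    def __init__(self, batch_size=64):
        self.batch_size = batch_size  # assumed tuning knob
        self.durable_end = 0          # clusters marked allocated on disk
        self.pool = deque()           # reserved on disk, not yet handed out
        self.sync_count = 0           # metadata syncs issued so far

    def _reserve_batch(self):
        # One synced refcount update covers the whole batch.
        start = self.durable_end
        self.durable_end += self.batch_size
        self.sync_count += 1
        self.pool.extend(range(start, self.durable_end))

    def allocate(self):
        # Serve from memory; touch the disk only when the pool is empty.
        if not self.pool:
            self._reserve_batch()
        return self.pool.popleft()

    def leaked_after_crash(self):
        # The in-memory pool is lost on power failure, but the disk still
        # marks those clusters as allocated, so they leak.
        return len(self.pool)
```

With a batch of 64, a hundred allocations cost two syncs instead of a hundred,
and a crash at that point leaks at most the unused remainder of the last batch.
In this model the leak is bounded by `batch_size - 1` clusters per crash.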