[Qemu-devel] Re: [RFC][STABLE 0.13] Revert "qcow2: Use bdrv_(p)write_sync for metadata writes"

From: Avi Kivity <avi@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Kevin Wolf <kwolf@redhat.com>,
	stefanha@gmail.com, mjt@tls.msk.ru, qemu-devel@nongnu.org,
	hch@lst.de
Subject: [Qemu-devel] Re: [RFC][STABLE 0.13] Revert "qcow2: Use bdrv_(p)write_sync for metadata writes"
Date: Wed, 25 Aug 2010 10:14:59 +0300
Message-ID: <4C74C2F3.9050506@redhat.com>
In-Reply-To: <4C73CF8D.5060405@codemonkey.ws>

  On 08/24/2010 04:56 PM, Anthony Liguori wrote:
>> One doesn't follow from the other (though I'm no fan of internal 
>> snapshots, myself).
>
>
> It does.  Let's consider the failure scenarios:
>
> 1) guest submits write request
> 2) allocate extent
> 3) write data to disk (a)
> 4) write (a) completes
> 5) update reference count table for new extent (b)
> 6) write (b) completes
> 7) write extent table (c)
> 8) write (c) completes
> 9) complete guest write request
>
> If this all happened in order and we lost power, the worst case error 
> is that we leak a block which isn't terrible.
>
> But we're not guaranteed that this happens in order.
>
> If (b) or (c) happen before (a), then the image is not corrupted but 
> data gets lost.  That's okay because it's part of the guest contract.
>
> If (c) happens before (b), then we've created an extent that's 
> attached to a table with a zero reference count.  This is a corrupt 
> image.
>
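
The ordering argument quoted above can be checked by brute force. The
sketch below is only an illustration of that reasoning, not qcow2 code:
it assumes the three writes (a) data, (b) reference count and (c) extent
table may reach the disk in any order, and that power can fail after any
of them. The names classify() and outcomes() and the outcome labels are
hypothetical.

```python
from itertools import permutations


def classify(on_disk):
    """Label the image state given the set of writes that reached disk.

    'a' = data, 'b' = reference count update, 'c' = extent table update.
    The labels follow the quoted reasoning and are a simplification.
    """
    if "c" in on_disk and "b" not in on_disk:
        return "corrupt"    # extent is mapped but its refcount is zero
    if "c" in on_disk and "a" not in on_disk:
        return "data lost"  # extent is mapped but holds no guest data
    if "b" in on_disk and "c" not in on_disk:
        return "leak"       # extent is counted but nothing points to it
    return "ok"


def outcomes(order):
    """Every state reachable if power fails at any point in `order`."""
    return {classify(set(order[:n])) for n in range(len(order) + 1)}


if __name__ == "__main__":
    for order in permutations("abc"):
        print("".join(order), sorted(outcomes(order)))
```

In this model an ordering can end in the "corrupt" state only when (c)
reaches the disk before (b), which is the case the sync between those two
writes is meant to exclude.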

If the only issue is new block allocation, it can be easily solved.  
Instead of allocating exactly the number of blocks needed, allocate a 
large extent and hold the spare blocks in memory.  The next allocation 
can then be filled from memory, so the allocation sync is amortized over 
many blocks.  A power failure will leak the preallocated blocks, losing 
some megabytes of address space, but not real disk space.
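
A minimal sketch of the batching idea, assuming a toy model in which
clusters are plain integers and a "sync" is just a counter.  The class
BatchAllocator, its methods and the batch size are hypothetical names
chosen for illustration; none of this is taken from the qcow2 driver.

```python
from collections import deque


class BatchAllocator:
    """Hands out cluster indices from a pool reserved in large batches."""

    def __init__(self, batch_size=64):
        self.batch_size = batch_size  # clusters reserved per sync (assumed)
        self.next_free = 0            # first cluster not yet reserved
        self.pool = deque()           # reserved on disk, not yet handed out
        self.sync_count = 0           # number of simulated metadata syncs

    def _reserve_batch(self):
        # Stand-in for: mark a whole batch as allocated, then sync once.
        start = self.next_free
        self.next_free += self.batch_size
        self.pool.extend(range(start, self.next_free))
        self.sync_count += 1

    def allocate(self):
        # Only an empty pool costs a sync; other requests are served
        # from memory.
        if not self.pool:
            self._reserve_batch()
        return self.pool.popleft()

    def leaked_after_crash(self):
        # Clusters reserved but never handed out are lost on power failure.
        return len(self.pool)


if __name__ == "__main__":
    alloc = BatchAllocator(batch_size=64)
    clusters = [alloc.allocate() for _ in range(100)]
    print(alloc.sync_count, alloc.leaked_after_crash())
```

With a batch of 64, one hundred allocations cost two syncs in this model
instead of one hundred, and a crash at that point would leak the clusters
still sitting in the pool.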


> Let's consider if we eliminate the reference count table which means 
> eliminating internal snapshots.
>
> 1) guest submits write request
> 2) allocate extent
> 3) write data to disk (a)
> 4) write (a) completes
> 5) write extent table (c)
> 6) write (c) completes
> 7) complete guest write request
>
> If this all happens in order and we lose power, we just leak a block.  
> It means we need a periodic fsck.
>
> If (c) completes before (a), then it means that the image is not 
> corrupted but data gets lost.  This is okay based on the guest contract.
>
> And that's it.  There is no scenario where the disk is corrupted.

_if_ that's the only failure mode.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
