public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Juan Quintela <quintela@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: "Daniel P. Berrange" <berrange@redhat.com>,
	Avi Kivity <avi@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
	Lucas Meneghel Rodrigues <lmr@redhat.com>,
	KVM mailing list <kvm@vger.kernel.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"libvir-list\@redhat.com" <libvir-list@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	QEMU devel <qemu-devel@nongnu.org>
Subject: Re: qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions
Date: Tue, 15 Nov 2011 14:20:12 +0100	[thread overview]
Message-ID: <874ny5iktv.fsf@trasno.mitica> (raw)
In-Reply-To: <4EC1238B.2030906@codemonkey.ws> (Anthony Liguori's message of "Mon, 14 Nov 2011 08:19:55 -0600")

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/14/2011 04:16 AM, Daniel P. Berrange wrote:
>> On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
>>> On 11/11/2011 12:15 PM, Kevin Wolf wrote:
>>>> Am 10.11.2011 22:30, schrieb Anthony Liguori:
>>>>> Live migration with qcow2 or any other image format is just not going to work
>>>>> right now even with proper clustered storage.  I think doing a block level flush
>>>>> cache interface and letting block devices decide how to do it is the best approach.
>>>>
>>>> I would really prefer reusing the existing open/close code. It means
>>>> less (duplicated) code, is existing code that is well tested and doesn't
>>>> make migration much of a special case.
>>>>
>>>> If you want to avoid reopening the file on the OS level, we can reopen
>>>> only the topmost layer (i.e. the format, but not the protocol) for now
>>>> and in 1.1 we can use bdrv_reopen().
>>>>
>>>
>>> Intuitively I dislike _reopen style interfaces.  If the second open
>>> yields different results from the first, does it invalidate any
>>> computations in between?
>>>
>>> What's wrong with just delaying the open?
>>
>> If you delay the 'open' until the mgmt app issues 'cont', then you loose
>> the ability to rollback to the source host upon open failure for most
>> deployed versions of libvirt. We only fairly recently switched to a five
>> stage migration handshake to cope with rollback when 'cont' fails.
>
> Delayed open isn't a panacea.  With the series I sent, we should be
> able to migration with a qcow2 file on coherent shared storage.
>
> There are two other cases that we care about: migration with nfs
> cache!=none and direct attached storage with cache!=none
>
> Whether the open is deferred matters less with NFS than if the open
> happens after the close on the source.  To fix NFS cache!=none, we
> would have to do a bdrv_close() before sending the last byte of
> migration data and make sure that we bdrv_open() after receiving the
> last byte of migration data.
>
> The problem with this IMHO is it creates a large window where noone
> has the file open and you're critically vulnerable to losing your VM.

Red Hat NFS guru told that fsync() on source + open() after that on
target is enough.  But anyways, it still depends of nothing else having
the file opened on target.

> I'm much more in favor of a smarter caching policy.  If we can fcntl()
> our way to O_DIRECT on NFS, that would be fairly interesting.  I'm not
> sure if this is supported today but it's something we could look into
> adding in the kernel. That way we could force NFS to O_DIRECT during
> migration which would solve this problem robustly.

We would need O_DIRECT on target during migration, I agree than that
would work.

> Deferred open doesn't help with direct attached storage.  There simple
> is no guarantee that there isn't data in the page cache.

Yeap, I asked the clustered filesystem people how they fixed the
problem, because clustered filesystem have this problem, right.  After
lots of arm twisting, I got the ioctl(BLKFLSBUF,...), but that only
works:
- on linux
- on some block devices

So, we are back to square 1.

> Again, I think defaulting DAS to cache=none|directsync is what makes
> the most sense here.

I think it is the only sane solution.  Otherwise, we need to write the
equivalent of a lock manager, to know _who_ has the storage, and
distributed lock managers are a mess :-(

> We can even add a migration blocker for DAS with cache=on.  If we can
> do dynamic toggling of the cache setting, then that's pretty friendly
> at the end of the day.

That could fix the problem also.  At the moment that we start migration,
we do an fsync() + switch to O_DIRECT for all filesystems.

As you said, time for implementing fcntl(O_DIRECT).

Later, Juan.

  reply	other threads:[~2011-11-15 13:20 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-11-09 16:29 qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions Lucas Meneghel Rodrigues
2011-11-09 16:39 ` Anthony Liguori
2011-11-09 17:02   ` [Qemu-devel] " Avi Kivity
2011-11-09 17:35     ` Anthony Liguori
2011-11-09 19:53       ` Juan Quintela
2011-11-09 20:18       ` Michael S. Tsirkin
2011-11-09 20:22         ` [Qemu-devel] " Anthony Liguori
2011-11-09 21:00           ` Michael S. Tsirkin
2011-11-09 21:01             ` Anthony Liguori
2011-11-10 10:41               ` [Qemu-devel] " Kevin Wolf
2011-11-10 16:50                 ` Juan Quintela
2011-11-10 17:59                   ` Anthony Liguori
2011-11-10 18:00                 ` [Qemu-devel] " Anthony Liguori
2011-11-09 20:57         ` Juan Quintela
2011-11-10  8:55       ` [Qemu-devel] " Avi Kivity
2011-11-10 17:50         ` Juan Quintela
2011-11-10 17:54         ` Anthony Liguori
2011-11-12 10:20           ` Avi Kivity
2011-11-12 13:30             ` Anthony Liguori
2011-11-12 14:36               ` Avi Kivity
2011-11-10 18:27         ` [Qemu-devel] " Anthony Liguori
2011-11-10 18:42           ` Daniel P. Berrange
2011-11-10 19:11             ` [Qemu-devel] " Anthony Liguori
2011-11-10 20:06               ` Daniel P. Berrange
2011-11-10 20:07                 ` Anthony Liguori
2011-11-10 21:30           ` Anthony Liguori
2011-11-11 10:15             ` Kevin Wolf
2011-11-11 14:03               ` [Qemu-devel] " Anthony Liguori
2011-11-11 14:29                 ` Kevin Wolf
2011-11-11 14:35                   ` Anthony Liguori
2011-11-11 14:44                     ` Kevin Wolf
2011-11-11 20:38                       ` Anthony Liguori
2011-11-12 10:27                 ` Avi Kivity
2011-11-12 13:39                   ` Anthony Liguori
2011-11-12 14:43                     ` Avi Kivity
2011-11-12 16:01                       ` Anthony Liguori
2011-11-12 10:25               ` Avi Kivity
2011-11-14  9:58                 ` Kevin Wolf
2011-11-14 10:10                   ` Michael S. Tsirkin
2011-11-15 13:28                   ` [Qemu-devel] " Avi Kivity
2011-11-14 10:16                 ` Daniel P. Berrange
2011-11-14 10:24                   ` Michael S. Tsirkin
2011-11-14 11:08                     ` Daniel P. Berrange
2011-11-14 11:21                       ` Kevin Wolf
2011-11-14 11:29                         ` [Qemu-devel] " Daniel P. Berrange
2011-11-14 11:34                           ` Michael S. Tsirkin
2011-11-14 11:37                             ` Daniel P. Berrange
2011-11-14 11:51                               ` Michael S. Tsirkin
2011-11-14 11:55                                 ` Daniel P. Berrange
2011-11-14 11:56                               ` Michael S. Tsirkin
2011-11-14 11:58                                 ` [Qemu-devel] " Daniel P. Berrange
2011-11-14 12:17                                   ` Michael S. Tsirkin
2011-11-14 11:36                           ` Gleb Natapov
2011-11-14 11:32                       ` Michael S. Tsirkin
2011-11-14 14:19                   ` Anthony Liguori
2011-11-15 13:20                     ` Juan Quintela [this message]
2011-11-15 13:56                       ` Anthony Liguori
2011-11-09 19:25 ` Juan Quintela
2011-11-09 23:33   ` Lucas Meneghel Rodrigues

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874ny5iktv.fsf@trasno.mitica \
    --to=quintela@redhat.com \
    --cc=anthony@codemonkey.ws \
    --cc=avi@redhat.com \
    --cc=berrange@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwolf@redhat.com \
    --cc=libvir-list@redhat.com \
    --cc=lmr@redhat.com \
    --cc=mst@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox