All of lore.kernel.org
 help / color / mirror / Atom feed
From: Juan Quintela <quintela@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: "Daniel P. Berrange" <berrange@redhat.com>,
	Avi Kivity <avi@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
	Lucas Meneghel Rodrigues <lmr@redhat.com>,
	KVM mailing list <kvm@vger.kernel.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"libvir-list\@redhat.com" <libvir-list@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	QEMU devel <qemu-devel@nongnu.org>
Subject: Re: qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions
Date: Tue, 15 Nov 2011 14:20:12 +0100	[thread overview]
Message-ID: <874ny5iktv.fsf@trasno.mitica> (raw)
In-Reply-To: <4EC1238B.2030906@codemonkey.ws> (Anthony Liguori's message of "Mon, 14 Nov 2011 08:19:55 -0600")

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/14/2011 04:16 AM, Daniel P. Berrange wrote:
>> On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
>>> On 11/11/2011 12:15 PM, Kevin Wolf wrote:
>>>> Am 10.11.2011 22:30, schrieb Anthony Liguori:
>>>>> Live migration with qcow2 or any other image format is just not going to work
>>>>> right now even with proper clustered storage.  I think doing a block level flush
>>>>> cache interface and letting block devices decide how to do it is the best approach.
>>>>
>>>> I would really prefer reusing the existing open/close code. It means
>>>> less (duplicated) code, is existing code that is well tested and doesn't
>>>> make migration much of a special case.
>>>>
>>>> If you want to avoid reopening the file on the OS level, we can reopen
>>>> only the topmost layer (i.e. the format, but not the protocol) for now
>>>> and in 1.1 we can use bdrv_reopen().
>>>>
>>>
>>> Intuitively I dislike _reopen style interfaces.  If the second open
>>> yields different results from the first, does it invalidate any
>>> computations in between?
>>>
>>> What's wrong with just delaying the open?
>>
>> If you delay the 'open' until the mgmt app issues 'cont', then you loose
>> the ability to rollback to the source host upon open failure for most
>> deployed versions of libvirt. We only fairly recently switched to a five
>> stage migration handshake to cope with rollback when 'cont' fails.
>
> Delayed open isn't a panacea.  With the series I sent, we should be
> able to migration with a qcow2 file on coherent shared storage.
>
> There are two other cases that we care about: migration with nfs
> cache!=none and direct attached storage with cache!=none
>
> Whether the open is deferred matters less with NFS than if the open
> happens after the close on the source.  To fix NFS cache!=none, we
> would have to do a bdrv_close() before sending the last byte of
> migration data and make sure that we bdrv_open() after receiving the
> last byte of migration data.
>
> The problem with this IMHO is it creates a large window where noone
> has the file open and you're critically vulnerable to losing your VM.

Red Hat NFS guru told that fsync() on source + open() after that on
target is enough.  But anyways, it still depends of nothing else having
the file opened on target.

> I'm much more in favor of a smarter caching policy.  If we can fcntl()
> our way to O_DIRECT on NFS, that would be fairly interesting.  I'm not
> sure if this is supported today but it's something we could look into
> adding in the kernel. That way we could force NFS to O_DIRECT during
> migration which would solve this problem robustly.

We would need O_DIRECT on target during migration, I agree than that
would work.

> Deferred open doesn't help with direct attached storage.  There simple
> is no guarantee that there isn't data in the page cache.

Yeap, I asked the clustered filesystem people how they fixed the
problem, because clustered filesystem have this problem, right.  After
lots of arm twisting, I got the ioctl(BLKFLSBUF,...), but that only
works:
- on linux
- on some block devices

So, we are back to square 1.

> Again, I think defaulting DAS to cache=none|directsync is what makes
> the most sense here.

I think it is the only sane solution.  Otherwise, we need to write the
equivalent of a lock manager, to know _who_ has the storage, and
distributed lock managers are a mess :-(

> We can even add a migration blocker for DAS with cache=on.  If we can
> do dynamic toggling of the cache setting, then that's pretty friendly
> at the end of the day.

That could fix the problem also.  At the moment that we start migration,
we do an fsync() + switch to O_DIRECT for all filesystems.

As you said, time for implementing fcntl(O_DIRECT).

Later, Juan.

WARNING: multiple messages have this Message-ID (diff)
From: Juan Quintela <quintela@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Lucas Meneghel Rodrigues <lmr@redhat.com>,
	KVM mailing list <kvm@vger.kernel.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"libvir-list@redhat.com" <libvir-list@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	QEMU devel <qemu-devel@nongnu.org>, Avi Kivity <avi@redhat.com>
Subject: Re: [Qemu-devel] qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions
Date: Tue, 15 Nov 2011 14:20:12 +0100	[thread overview]
Message-ID: <874ny5iktv.fsf@trasno.mitica> (raw)
In-Reply-To: <4EC1238B.2030906@codemonkey.ws> (Anthony Liguori's message of "Mon, 14 Nov 2011 08:19:55 -0600")

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 11/14/2011 04:16 AM, Daniel P. Berrange wrote:
>> On Sat, Nov 12, 2011 at 12:25:34PM +0200, Avi Kivity wrote:
>>> On 11/11/2011 12:15 PM, Kevin Wolf wrote:
>>>> Am 10.11.2011 22:30, schrieb Anthony Liguori:
>>>>> Live migration with qcow2 or any other image format is just not going to work
>>>>> right now even with proper clustered storage.  I think doing a block level flush
>>>>> cache interface and letting block devices decide how to do it is the best approach.
>>>>
>>>> I would really prefer reusing the existing open/close code. It means
>>>> less (duplicated) code, is existing code that is well tested and doesn't
>>>> make migration much of a special case.
>>>>
>>>> If you want to avoid reopening the file on the OS level, we can reopen
>>>> only the topmost layer (i.e. the format, but not the protocol) for now
>>>> and in 1.1 we can use bdrv_reopen().
>>>>
>>>
>>> Intuitively I dislike _reopen style interfaces.  If the second open
>>> yields different results from the first, does it invalidate any
>>> computations in between?
>>>
>>> What's wrong with just delaying the open?
>>
>> If you delay the 'open' until the mgmt app issues 'cont', then you loose
>> the ability to rollback to the source host upon open failure for most
>> deployed versions of libvirt. We only fairly recently switched to a five
>> stage migration handshake to cope with rollback when 'cont' fails.
>
> Delayed open isn't a panacea.  With the series I sent, we should be
> able to migration with a qcow2 file on coherent shared storage.
>
> There are two other cases that we care about: migration with nfs
> cache!=none and direct attached storage with cache!=none
>
> Whether the open is deferred matters less with NFS than if the open
> happens after the close on the source.  To fix NFS cache!=none, we
> would have to do a bdrv_close() before sending the last byte of
> migration data and make sure that we bdrv_open() after receiving the
> last byte of migration data.
>
> The problem with this IMHO is it creates a large window where noone
> has the file open and you're critically vulnerable to losing your VM.

Red Hat NFS guru told that fsync() on source + open() after that on
target is enough.  But anyways, it still depends of nothing else having
the file opened on target.

> I'm much more in favor of a smarter caching policy.  If we can fcntl()
> our way to O_DIRECT on NFS, that would be fairly interesting.  I'm not
> sure if this is supported today but it's something we could look into
> adding in the kernel. That way we could force NFS to O_DIRECT during
> migration which would solve this problem robustly.

We would need O_DIRECT on target during migration, I agree than that
would work.

> Deferred open doesn't help with direct attached storage.  There simple
> is no guarantee that there isn't data in the page cache.

Yeap, I asked the clustered filesystem people how they fixed the
problem, because clustered filesystem have this problem, right.  After
lots of arm twisting, I got the ioctl(BLKFLSBUF,...), but that only
works:
- on linux
- on some block devices

So, we are back to square 1.

> Again, I think defaulting DAS to cache=none|directsync is what makes
> the most sense here.

I think it is the only sane solution.  Otherwise, we need to write the
equivalent of a lock manager, to know _who_ has the storage, and
distributed lock managers are a mess :-(

> We can even add a migration blocker for DAS with cache=on.  If we can
> do dynamic toggling of the cache setting, then that's pretty friendly
> at the end of the day.

That could fix the problem also.  At the moment that we start migration,
we do an fsync() + switch to O_DIRECT for all filesystems.

As you said, time for implementing fcntl(O_DIRECT).

Later, Juan.

  reply	other threads:[~2011-11-15 13:20 UTC|newest]

Thread overview: 102+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-11-09 16:29 qemu and qemu.git -> Migration + disk stress introduces qcow2 corruptions Lucas Meneghel Rodrigues
2011-11-09 16:29 ` [Qemu-devel] " Lucas Meneghel Rodrigues
2011-11-09 16:39 ` Anthony Liguori
2011-11-09 16:39   ` [Qemu-devel] " Anthony Liguori
2011-11-09 17:02   ` Avi Kivity
2011-11-09 17:02     ` Avi Kivity
2011-11-09 17:35     ` Anthony Liguori
2011-11-09 17:35       ` [Qemu-devel] " Anthony Liguori
2011-11-09 19:53       ` Juan Quintela
2011-11-09 19:53         ` [Qemu-devel] " Juan Quintela
2011-11-09 20:18       ` Michael S. Tsirkin
2011-11-09 20:18         ` [Qemu-devel] " Michael S. Tsirkin
2011-11-09 20:22         ` Anthony Liguori
2011-11-09 21:00           ` Michael S. Tsirkin
2011-11-09 21:00             ` [Qemu-devel] " Michael S. Tsirkin
2011-11-09 21:01             ` Anthony Liguori
2011-11-09 21:01               ` [Qemu-devel] " Anthony Liguori
2011-11-10 10:41               ` Kevin Wolf
2011-11-10 10:41                 ` Kevin Wolf
2011-11-10 16:50                 ` Juan Quintela
2011-11-10 16:50                   ` [Qemu-devel] " Juan Quintela
2011-11-10 17:59                   ` Anthony Liguori
2011-11-10 17:59                     ` [Qemu-devel] " Anthony Liguori
2011-11-10 18:00                 ` Anthony Liguori
2011-11-10 18:00                   ` Anthony Liguori
2011-11-09 20:57         ` Juan Quintela
2011-11-09 20:57           ` [Qemu-devel] " Juan Quintela
2011-11-10  8:55       ` Avi Kivity
2011-11-10 17:50         ` Juan Quintela
2011-11-10 17:50           ` [Qemu-devel] " Juan Quintela
2011-11-10 17:54         ` Anthony Liguori
2011-11-10 17:54           ` [Qemu-devel] " Anthony Liguori
2011-11-12 10:20           ` Avi Kivity
2011-11-12 10:20             ` [Qemu-devel] " Avi Kivity
2011-11-12 13:30             ` Anthony Liguori
2011-11-12 13:30               ` [Qemu-devel] " Anthony Liguori
2011-11-12 14:36               ` Avi Kivity
2011-11-12 14:36                 ` [Qemu-devel] " Avi Kivity
2011-11-10 18:27         ` Anthony Liguori
2011-11-10 18:27           ` Anthony Liguori
2011-11-10 18:42           ` Daniel P. Berrange
2011-11-10 18:42             ` [Qemu-devel] " Daniel P. Berrange
2011-11-10 19:11             ` Anthony Liguori
2011-11-10 20:06               ` Daniel P. Berrange
2011-11-10 20:07                 ` Anthony Liguori
2011-11-10 21:30           ` Anthony Liguori
2011-11-10 21:30             ` Anthony Liguori
2011-11-11 10:15             ` Kevin Wolf
2011-11-11 10:15               ` [Qemu-devel] " Kevin Wolf
2011-11-11 14:03               ` Anthony Liguori
2011-11-11 14:29                 ` Kevin Wolf
2011-11-11 14:35                   ` Anthony Liguori
2011-11-11 14:44                     ` Kevin Wolf
2011-11-11 20:38                       ` Anthony Liguori
2011-11-12 10:27                 ` Avi Kivity
2011-11-12 13:39                   ` Anthony Liguori
2011-11-12 14:43                     ` Avi Kivity
2011-11-12 16:01                       ` Anthony Liguori
2011-11-12 10:25               ` Avi Kivity
2011-11-12 10:25                 ` Avi Kivity
2011-11-14  9:58                 ` Kevin Wolf
2011-11-14  9:58                   ` Kevin Wolf
2011-11-14 10:10                   ` Michael S. Tsirkin
2011-11-14 10:10                     ` [Qemu-devel] " Michael S. Tsirkin
2011-11-15 13:28                   ` Avi Kivity
2011-11-15 13:28                     ` Avi Kivity
2011-11-14 10:16                 ` Daniel P. Berrange
2011-11-14 10:16                   ` Daniel P. Berrange
2011-11-14 10:24                   ` Michael S. Tsirkin
2011-11-14 10:24                     ` Michael S. Tsirkin
2011-11-14 11:08                     ` Daniel P. Berrange
2011-11-14 11:08                       ` Daniel P. Berrange
2011-11-14 11:21                       ` Kevin Wolf
2011-11-14 11:21                         ` [Qemu-devel] " Kevin Wolf
2011-11-14 11:29                         ` Daniel P. Berrange
2011-11-14 11:29                           ` Daniel P. Berrange
2011-11-14 11:34                           ` Michael S. Tsirkin
2011-11-14 11:34                             ` Michael S. Tsirkin
2011-11-14 11:37                             ` Daniel P. Berrange
2011-11-14 11:37                               ` Daniel P. Berrange
2011-11-14 11:51                               ` Michael S. Tsirkin
2011-11-14 11:51                                 ` Michael S. Tsirkin
2011-11-14 11:55                                 ` Daniel P. Berrange
2011-11-14 11:55                                   ` Daniel P. Berrange
2011-11-14 11:56                               ` Michael S. Tsirkin
2011-11-14 11:56                                 ` [Qemu-devel] " Michael S. Tsirkin
2011-11-14 11:58                                 ` Daniel P. Berrange
2011-11-14 11:58                                   ` Daniel P. Berrange
2011-11-14 12:17                                   ` Michael S. Tsirkin
2011-11-14 12:17                                     ` Michael S. Tsirkin
2011-11-14 11:36                           ` Gleb Natapov
2011-11-14 11:32                       ` Michael S. Tsirkin
2011-11-14 11:32                         ` Michael S. Tsirkin
2011-11-14 14:19                   ` Anthony Liguori
2011-11-14 14:19                     ` Anthony Liguori
2011-11-15 13:20                     ` Juan Quintela [this message]
2011-11-15 13:20                       ` Juan Quintela
2011-11-15 13:56                       ` Anthony Liguori
2011-11-09 19:25 ` Juan Quintela
2011-11-09 19:25   ` [Qemu-devel] " Juan Quintela
2011-11-09 23:33   ` Lucas Meneghel Rodrigues
2011-11-09 23:33     ` [Qemu-devel] " Lucas Meneghel Rodrigues

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874ny5iktv.fsf@trasno.mitica \
    --to=quintela@redhat.com \
    --cc=anthony@codemonkey.ws \
    --cc=avi@redhat.com \
    --cc=berrange@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwolf@redhat.com \
    --cc=libvir-list@redhat.com \
    --cc=lmr@redhat.com \
    --cc=mst@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.